2017-04-17 10:23:12 +00:00
web-scraper
===========
A simple script that scrapes websites, extracting text into a CSV file with the format below, and saving images.
| Page | Tag | Text | Link | Image |
|-----------|---------------------------------|--------------|-------------------|------------------------|
| page path | element tag (h{1,6}, a, p, etc) | text content | link url (if any) | image address (if any) |
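A row in this shape can be produced with nothing but the standard library. The sketch below is illustrative only (the `RowExtractor` class and the sample HTML are not part of this repo's code, which may use a different parser): it walks one page's HTML, collects a tuple per heading, paragraph, or link, and writes the rows out with `csv`.

```python
import csv
import io
from html.parser import HTMLParser

WANTED = {"h1", "h2", "h3", "h4", "h5", "h6", "a", "p"}

class RowExtractor(HTMLParser):
    """Collect (page, tag, text, link, image) rows from one page's HTML.

    Simplified sketch: nested wanted tags each get their own row, and an
    <img> is attributed to the innermost open wanted tag.
    """

    def __init__(self, page_path):
        super().__init__()
        self.page = page_path
        self.rows = []
        self.stack = []  # open wanted tags: [tag, text, link, image]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and self.stack:
            self.stack[-1][3] = attrs.get("src", "")
        if tag in WANTED:
            link = attrs.get("href", "") if tag == "a" else ""
            self.stack.append([tag, "", link, ""])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1] += data

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            t, text, link, image = self.stack.pop()
            self.rows.append((self.page, t, text.strip(), link, image))

html = '<h1>Hello</h1><p>Intro <img src="/pic.png"></p><a href="/about">About</a>'
parser = RowExtractor("/index.html")
parser.feed(html)

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Page", "Tag", "Text", "Link", "Image"])
writer.writerows(parser.rows)
```

Feeding the sample HTML above yields one row each for the `h1`, the `p` (carrying the image address), and the `a` (carrying the link).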
## Usage
First, install the dependencies (Python 3):
```
pip install -r requirements
```
Then create a file containing the URLs of the websites you want to scrape, one per line, for example (I'll call this file `test_websites`):
```
https://theread.me
https://theguardian.com
```
Now you are ready to execute the script:
```
python index.py test_websites
# ^ path to your file
```
After the script is done with its job, you can find the results in the `results/<website_hostname>` folder.
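The per-site folder name is presumably derived from each URL's hostname; a minimal sketch of that mapping, assuming the standard-library `urllib.parse` (this is an illustration, not the script's actual code):

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical example: one of the URLs from test_websites.
url = "https://theread.me"
host = urlparse(url).hostname      # extract just the hostname part
results_dir = Path("results") / host  # e.g. results/theread.me
```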
To see available options, try `python index.py -h`.