2017-04-17 10:23:12 +00:00
web-scraper
===========
A simple script that scrapes websites, extracting text into a CSV file with the format below, and saving images.
| Page | Tag | Text | Link | Image |
|-----------|---------------------------------|--------------|-------------------|------------------------|
| page path | element tag (h{1,6}, a, p, etc) | text content | link url (if any) | image address (if any) |
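A row in this shape can be produced with nothing but the standard library. The sketch below is illustrative only (the `RowExtractor` class and the sample HTML are not part of this repo's code, which may use a different parser): it walks one page's HTML, collects a tuple per heading, paragraph, or link, and writes the rows out with `csv`.

```python
import csv
import io
from html.parser import HTMLParser

WANTED = {"h1", "h2", "h3", "h4", "h5", "h6", "a", "p"}

class RowExtractor(HTMLParser):
    """Collect (page, tag, text, link, image) rows from one page's HTML.

    Simplified sketch: nested wanted tags each get their own row, and an
    <img> is attributed to the innermost open wanted tag.
    """

    def __init__(self, page_path):
        super().__init__()
        self.page = page_path
        self.rows = []
        self.stack = []  # open wanted tags: [tag, text, link, image]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and self.stack:
            self.stack[-1][3] = attrs.get("src", "")
        if tag in WANTED:
            link = attrs.get("href", "") if tag == "a" else ""
            self.stack.append([tag, "", link, ""])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1] += data

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            t, text, link, image = self.stack.pop()
            self.rows.append((self.page, t, text.strip(), link, image))

html = '<h1>Hello</h1><p>Intro <img src="/pic.png"></p><a href="/about">About</a>'
parser = RowExtractor("/index.html")
parser.feed(html)

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Page", "Tag", "Text", "Link", "Image"])
writer.writerows(parser.rows)
```

Feeding the sample HTML above yields one row each for the `h1`, the `p` (carrying the image address), and the `a` (carrying the link).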
## Usage
First, install the dependencies (Python 3):
```
pip install -r requirements
```
Then create a file containing the URLs of the websites you want to scrape, one per line, for example (I'll call this file `test_websites`):
```
https://theread.me
https://theguardian.com
```
Now you are ready to execute the script:
```
python index.py test_websites
# ^ path to your file
```
After the script is done with its job, you can find the results in the `results/<website_hostname>` folder.
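The per-site folder name is presumably derived from each URL's hostname; a minimal sketch of that mapping, assuming the standard-library `urllib.parse` (this is an illustration, not the script's actual code):

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical example: one of the URLs from test_websites.
url = "https://theread.me"
host = urlparse(url).hostname      # extract just the hostname part
results_dir = Path("results") / host  # e.g. results/theread.me
```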
To see available options, try `python index.py -h`.