simple web scraping: extract texts and links as CSV, and save images of multiple websites
web-scraper
A simple script that scrapes a website, extracting its text into a CSV file in the format below and saving its images.
Page | Tag | Text | Link | Image |
---|---|---|---|---|
page path | element tag (h{1,6}, a, p, etc) | text content | link url (if any) | image address (if any) |
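
To make the row format concrete, here is a minimal sketch that produces rows of this shape using only the Python standard library. It is not the code in index.py or helpers.py: the names `RowCollector` and `extract_rows`, the set of tags collected, and the choice to give each image its own row are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose text is collected (an assumed subset, not the script's actual list).
TEXT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "a"}


class RowCollector(HTMLParser):
    """Hypothetical collector building [page, tag, text, link, image] rows."""

    def __init__(self, page):
        super().__init__()
        self.page = page
        self.rows = []
        self._stack = []  # currently open elements: [tag, link, text parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Images get a row with only the image address filled in.
            self.rows.append([self.page, "img", "", "", attrs.get("src") or ""])
        elif tag in TEXT_TAGS:
            self._stack.append([tag, attrs.get("href") or "", []])

    def handle_data(self, data):
        # Text is credited to the innermost open element only.
        if self._stack:
            self._stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            tag, link, parts = self._stack.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.rows.append([self.page, tag, text, link, ""])


def extract_rows(page, html):
    """Return the rows found in one HTML document."""
    collector = RowCollector(page)
    collector.feed(html)
    collector.close()
    return collector.rows
```

The resulting rows can be passed to `csv.writer(...).writerows(rows)` after writing the header `Page, Tag, Text, Link, Image`.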
Usage
First, install dependencies (python3):
pip install -r requirements
Then create a file containing the URLs of the websites you want to scrape, one URL per line. For example (I'll call this file test_websites):
https://theread.me
https://theguardian.com
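
For reference, a file in this format can be read with a few lines of Python. This is only a sketch of the idea: `read_urls` is a hypothetical helper, and skipping blank lines and trimming whitespace are assumptions, not confirmed behaviour of index.py.

```python
def read_urls(lines):
    """Return the non-empty, whitespace-trimmed URLs from an iterable of lines."""
    urls = []
    for line in lines:
        url = line.strip()
        if url:  # assumption: blank lines are ignored
            urls.append(url)
    return urls
```

Because it accepts any iterable of lines, it works directly on an open file: `with open("test_websites") as f: urls = read_urls(f)`.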
Now you are ready to execute the script:
python index.py test_websites
# ^ path to your file
After the script has finished its job, you can find the results in the results/<website_hostname> folder.
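
If you want to locate a site's output folder from another script, the path can be derived from the URL. The sketch below assumes that <website_hostname> is the hostname as returned by `urllib.parse.urlparse`; the function name `results_dir` is hypothetical, and the script itself may normalise hostnames differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def results_dir(url, base="results"):
    """Return the assumed output folder for a URL: <base>/<hostname>."""
    host = urlparse(url).hostname
    if host is None:
        # Happens for URLs without a scheme, e.g. "theread.me".
        raise ValueError(f"cannot determine hostname of {url!r}")
    return PurePosixPath(base) / host
```

For example, `results_dir("https://theread.me")` gives the path results/theread.me.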
To see the available options, run python index.py -h.