Mahdi Dibaiee b0273c0cc3 fix: log the page being scraped

feat: delay option

2017-04-19 22:14:54 +04:30

.gitignore

initial commit

2017-04-17 14:47:44 +04:30

helpers.py

initial commit

2017-04-17 14:47:44 +04:30

index.py

2017-04-19 22:14:54 +04:30

README.md

chore: mention results

2017-04-17 14:57:37 +04:30

requirements

initial commit

2017-04-17 14:47:44 +04:30

test_websites

initial commit

2017-04-17 14:47:44 +04:30

web-scraper

A simple script that scrapes a website, extracting its text into a CSV file with the format below and saving its images.

Page	Tag	Text	Link	Image
page path	element tag (h{1,6}, a, p, etc)	text content	link url (if any)	image address (if any)
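To make the column layout concrete, here is a minimal sketch that serializes a few rows into the five columns listed above. The column names come from the table. The comma delimiter, the empty cells for missing links and images, the `rows_to_csv` helper, and the sample rows are assumptions made for illustration, not the script's actual output.

```python
import csv
import io

# Column names taken from the table above; delimiter and quoting are assumptions.
FIELDS = ["Page", "Tag", "Text", "Link", "Image"]


def rows_to_csv(rows):
    """Serialize dict rows into CSV text with the five columns above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Optional values (link, image) that are missing become empty cells.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
    return buffer.getvalue()


# Hypothetical sample rows, not real scrape results.
sample = [
    {"Page": "/", "Tag": "h1", "Text": "Hello"},
    {"Page": "/about", "Tag": "a", "Text": "Contact", "Link": "/contact"},
]
print(rows_to_csv(sample))
```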

Usage

First, install dependencies (python3):

pip install -r requirements

Then create a file containing the URLs of the websites you want to scrape, one per line. For example (I'll call this file test_websites):

https://theread.me
https://theguardian.com
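A file in this one-URL-per-line format could be read roughly as follows. This is only a sketch: `parse_urls` and `read_urls` are hypothetical helpers, and skipping blank lines and trimming whitespace are assumptions about reasonable behaviour, not a description of what index.py does.

```python
def parse_urls(text):
    """Return the non-empty lines of text, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_urls(path):
    """Read a URL list file such as test_websites and return its URLs."""
    with open(path, encoding="utf-8") as handle:
        return parse_urls(handle.read())


print(parse_urls("https://theread.me\nhttps://theguardian.com\n"))
```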

Now you are ready to execute the script:

python index.py test_websites
                # ^ path to your file

After the script is done with its job, you can find the results in the results/<website_hostname> folder.
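To work out which folder belongs to which website, the hostname can be derived from each URL with the standard library. The sketch below assumes the folder name is exactly the URL's hostname (no port, no path). `results_dir` is a hypothetical helper, and the script may treat details such as a `www.` prefix differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def results_dir(url, root="results"):
    """Return the assumed results path for a URL: <root>/<hostname>."""
    hostname = urlparse(url).hostname
    if hostname is None:
        raise ValueError(f"URL has no hostname: {url!r}")
    return PurePosixPath(root) / hostname


for url in ["https://theread.me", "https://theguardian.com"]:
    print(url, "->", results_dir(url))
```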

To see available options, try python index.py -h.