Fails on non-working URLs #1
Dear Mahdi Dibaiee,
Please help me with your program. I really like all of it and would be happy to use it, but I get an error when it tries to process a non-working URL. I guess there is some version problem. Could you please take a look at my error output?
python3 index.py test_websites --depth 0 --no-image
https://notarealsite61681.com
Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 844, in _validate_conn
conn.connect()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 284, in connect
conn = self._new_conn()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "index.py", line 94, in <module>
scrape(main)
File "index.py", line 48, in scrape
html = requests.get(t).text
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 487, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Thank you for your help!
Gergo
Hi @divingdog, what do you expect to happen in case of a non-working URL?
If you want to avoid this error, you can put a try block around this line: https://github.com/mdibaiee/web-scraper/blob/master/index.py#L48
So it becomes something like this:
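A minimal sketch of such a try block, assuming the line in question is the `html = requests.get(t).text` call shown in the traceback. The helper name `fetch_html`, the `timeout` value, and the choice to catch the broad `requests.exceptions.RequestException` are illustrative assumptions, not code from index.py:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html and timeout are hypothetical names used for illustration.
    """
    try:
        # Same call as index.py line 48 in the traceback, plus a timeout.
        return requests.get(url, timeout=timeout).text
    except requests.exceptions.RequestException:
        # Base class of requests' errors, so it also covers the
        # ConnectionError raised for "Name or service not known".
        print('Could not connect', url)
        return None
```

The caller can then check for `None` and skip the URL instead of crashing.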
Dear Mahdi Dibaiee,
Thank you for your fast reply!
I ran into dead URLs when working at depth 2. In that case the whole process stopped with the error above. I think it would be nice to just ignore these errors and continue the scraping process.
I tried adding the new lines, but now it gives a different error.
The code looks like this:
And this is the error I got:
python3 index.py test_websites --depth 0 --no-image
http://obi.hu
https://notarealsite61681.com
Could not connect https://notarealsite61681.com
Traceback (most recent call last):
File "index.py", line 104, in <module>
scrape(main)
File "index.py", line 54, in scrape
soup = BeautifulSoup(html, 'html.parser')
UnboundLocalError: local variable 'html' referenced before assignment
Thank You!
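The UnboundLocalError above is what Python raises when an except block handles the failure but execution then falls through to code that uses a variable assigned only inside the try. A network-free sketch of that failure mode follows; `parse` and `failing_fetch` are hypothetical stand-ins, not functions from index.py:

```python
def failing_fetch():
    # Stand-in for requests.get() failing on a dead URL.
    raise ConnectionError('Name or service not known')


def parse(fetch):
    try:
        html = fetch()
    except ConnectionError:
        print('Could not connect')
        # No return here, so execution continues below.
    # If fetch() raised, `html` was never assigned and this line fails.
    return html.upper()


try:
    parse(failing_fetch)
except UnboundLocalError as exc:
    # The exact message text varies between Python versions.
    print(type(exc).__name__, exc)
```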
@divingdog Try adding a return at the end of the except block:
Now it is awesome!
Thank You!
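The suggested fix, a return at the end of the except block, can be sketched as below. This is an assumption-laden illustration rather than the actual index.py: `scrape_all`, the `visited` dictionary, and storing the page length in place of the BeautifulSoup parsing step are all hypothetical.

```python
import requests


def scrape(url, visited):
    try:
        html = requests.get(url, timeout=10).text
    except requests.exceptions.RequestException:
        print('Could not connect', url)
        return  # leave early so the code below never sees an unbound `html`
    # Stand-in for the real parsing step (BeautifulSoup in index.py).
    visited[url] = len(html)


def scrape_all(urls):
    """Scrape every URL, skipping the ones that cannot be reached."""
    visited = {}
    for url in urls:
        scrape(url, visited)
    return visited
```

With this shape, a dead URL such as https://notarealsite61681.com only prints a message, and the loop moves on to the next URL.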