fails on non working urls #1

Closed
opened 2020-09-08 15:59:47 +00:00 by divingdog · 4 comments
divingdog commented 2020-09-08 15:59:47 +00:00 (Migrated from github.com)

Dear Mahdi Dibaiee,

Please help me with your program. I really like it and would be happy to use it, but I get an error when it tries to process a non-working URL. I guess it is some version problem. Could you please take a look at my error output?

python3 index.py test_websites --depth 0 --no-image
https://notarealsite61681.com
Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 844, in _validate_conn
conn.connect()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 284, in connect
conn = self._new_conn()
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "index.py", line 94, in <module>
scrape(main)
File "index.py", line 48, in scrape
html = requests.get(t).text
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/home/samoa2/.local/lib/python3.6/site-packages/requests/adapters.py", line 487, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='notarealsite61681.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f0751771e48>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Thank you for your help!
Gergo

mdibaiee commented 2020-09-09 10:15:12 +00:00 (Migrated from github.com)

Hi @divingdog, what do you expect to happen in case of a non-working URL?

If you want to avoid this error, you can wrap this line in a try block:

https://github.com/mdibaiee/web-scraper/blob/master/index.py#L48

So it becomes something like this:

try:
  html = requests.get(t).text
except requests.exceptions.ConnectionError:
  print("Could not connect {}".format(t))
divingdog commented 2020-09-09 12:29:13 +00:00 (Migrated from github.com)

Dear Mahdi Dibaiee,

Thank you for your fast reply!
I ran into dead URLs when working at depth 2. In that case the whole process stopped with the error above. I think it would be nice to just ignore these errors and continue scraping.

I tried adding the new lines, but now it gives a different error.

The code looks like this:

 def scrape(url, depth=0):
    data = []
    if args.depth is not None and depth > args.depth: return

    t = url.geturl()

    if t in visited: return

    print(t)

    try:
        html = requests.get(t).text
    except requests.exceptions.ConnectionError:
        print("Could not connect {}".format(t))
    visited.append(t)

    soup = BeautifulSoup(html, 'html.parser')
    elements = soup.find_all(tags)

And this is the error I got:
python3 index.py test_websites --depth 0 --no-image
http://obi.hu
https://notarealsite61681.com
Could not connect https://notarealsite61681.com
Traceback (most recent call last):
File "index.py", line 104, in <module>
scrape(main)
File "index.py", line 54, in scrape
soup = BeautifulSoup(html, 'html.parser')
UnboundLocalError: local variable 'html' referenced before assignment
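
For reference, the same error can be reproduced in isolation. This is a minimal sketch, not the code from index.py: `load` and its `fail` flag are hypothetical names, and the built-in `ConnectionError` stands in for `requests.exceptions.ConnectionError` so that it runs without network access.

```python
def load(fail):
    try:
        if fail:
            # Simulate the dead URL: the assignment below is never reached.
            raise ConnectionError("simulated dead URL")
        html = "<html></html>"
    except ConnectionError:
        print("Could not connect")
        # No return here, so execution continues after the try block.
    return html  # raises UnboundLocalError when the assignment was skipped


try:
    load(fail=True)
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```

The except branch handles the connection failure, but the function then carries on and reads `html`, which was never assigned.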

Thank You!

mdibaiee commented 2020-09-09 15:17:52 +00:00 (Migrated from github.com)

@divingdog Try adding a return at the end of the except block:

try:
  html = requests.get(t).text
except requests.exceptions.ConnectionError:
  print("Could not connect {}".format(t))
  return
visited.append(t)
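
The effect of the early return can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the project's actual code: `fetch` is a hypothetical stand-in for `requests.get(t).text` so the snippet runs without network access, it raises the built-in `ConnectionError` rather than `requests.exceptions.ConnectionError`, and this simplified `scrape` returns the HTML instead of parsing it.

```python
visited = []


def fetch(url):
    """Hypothetical stand-in for requests.get(url).text; one host is treated as dead."""
    if "notarealsite" in url:
        raise ConnectionError("Name or service not known")
    return "<html><body>ok</body></html>"


def scrape(url):
    """Return the page HTML, or None if the URL was already visited or unreachable."""
    if url in visited:
        return None
    print(url)
    try:
        html = fetch(url)
    except ConnectionError:
        print("Could not connect {}".format(url))
        return None  # leave early so `html` is never read while unassigned
    visited.append(url)
    return html


for site in ["http://obi.hu", "https://notarealsite61681.com"]:
    scrape(site)
```

With the return in place, a dead URL is reported and skipped, and it is not added to `visited`, so the loop moves on to the next site.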
divingdog commented 2020-09-10 15:22:34 +00:00 (Migrated from github.com)

Now it is awesome!

Thank You!

Reference: thereadme/web-scraper#1