br tag #2

Open
opened 2020-11-08 17:48:10 +00:00 by Csemid · 2 comments
Csemid commented 2020-11-08 17:48:10 +00:00 (Migrated from github.com)

Dear Mdibaiee!

I have some issues that I can't fix.

My tags currently look like this:

tags = ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'li', 'span', 'a', 'img', 'br']

I have added the br tag to it.
When the scraper runs this way, it finds every br tag that is not inside another tag, for example a p tag.
But when a br tag is inside a p tag, it won't find the text.

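![gitpic](https://user-images.githubusercontent.com/74149514/98472205-d9cf2b80-21f1-11eb-9ff3-73e7a34b2461.jpg)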

In the case shown in the picture, I can't get any of the text inside br.
Is there any chance that you have an easy workaround for this?

Thank You!
Csemid

Csemid commented 2020-11-26 09:43:04 +00:00 (Migrated from github.com)

Dear Mdibaiee!

Do you think you will have time for this issue these days?

Thank You!
Csemid

mdibaiee commented 2020-11-26 12:38:34 +00:00 (Migrated from github.com)

Hi @Csemid.

You don't need to add the br tag to the list as br tags themselves don't contain text.

The necessary change in the code is around this line:
https://github.com/mdibaiee/web-scraper/blob/master/index.py#L62

el.string does not give all of the text inside the p. When a tag has more than one child, as a p containing br tags does, BeautifulSoup returns None for .string.

To get all of the text, we need something like this:

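# Join the text nodes that are direct children of el, skipping HTML comments.
# On Python 3, use str(child) instead of unicode(child).
full_text = ''.join(unicode(child) for child in el.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment))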

You might have to import these:

from bs4 import NavigableString, Comment

Please try it and let me know if it works. If it works for you, please open a pull request.
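For reference, here is a minimal, self-contained sketch of the idea above, written for Python 3 (so `str` instead of `unicode`). The sample markup and the helper name `direct_text` are made up for illustration and are not taken from index.py; treat it as a starting point rather than the exact patch.

```python
from bs4 import BeautifulSoup, NavigableString, Comment

# Hypothetical markup: a p whose text is split up by br tags.
html = "<p>first line<br/>second line<!-- note --><br/>third line</p>"
soup = BeautifulSoup(html, "html.parser")
el = soup.find("p")


def direct_text(tag):
    """Join the text nodes that are direct children of tag, skipping comments."""
    return "".join(
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )


# .string is None here because the p has more than one child.
print(el.string)  # None

full_text = direct_text(el)
print(full_text)  # first linesecond linethird line
```

Joining with an empty string runs the pieces together. If the line breaks matter, joining with `" "` or `"\n"` instead keeps the pieces apart.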

Reference: thereadme/web-scraper#2