bugfix: webCrawl can't handle website well recursively #814

Open
wants to merge 1 commit into base: main

gaord

@gaord gaord commented Aug 22, 2023

When using the Cheerio Web Scraper with Web Crawl selected, it is supposed to crawl web pages recursively. A couple of bugs prevented that from working:
1. URLs extracted from a webpage included things like about:blank and mailto: links.
2. The extraction rules were not well aligned with the specification of the HTML <a> element.
3. It always returned 10 pages of the website.
Furthermore, this commit also adds a few informative logs when debug is enabled.

Signed-off-by: Ben Gao <bengao168@msn.com>
@CLAassistant

CLAassistant commented Aug 22, 2023

CLA assistant check
All committers have signed the CLA.

@chungyau97
Contributor

chungyau97 commented Aug 29, 2023

Hi @gaord

I got a lot of additional, weird-looking links with this PR.
Tested URL: https://www.itsjane.com

https://www.itsjane.com/jobs-1/eatatjane
image
https://www.itsjane.com/team-member/heidi-lee/itsjanesworld
image

Results:
webCrawl_bug.txt
original.txt

@gaord
Author

gaord commented Aug 30, 2023

Hi,
The website is actually not well maintained; quite a few of its links are broken. Taking the example you mentioned:
Open https://www.itsjane.com/jobs-1/ in a browser.
In the bottom-left corner of the page you will see a link to https://www.itsjane.com/jobs-1/eatatjane
Here is a screenshot:
image

Well tested! Thanks.
@chungyau97

@chungyau97
Contributor

Hi @gaord,

My testing:

eatatjane will not be scraped by the current method, as it is neither a URL nor a relative path:
image

image

info@itsjane.com will be scraped by the current method, but it has no host name, so it will not be added to the pages array:
image

image

Conclusions:

What I can agree with you on is adding the code below, which will help improve scraping performance:

        if (
            !linkElement.href ||
            linkElement.href.startsWith('about:blank') ||
            linkElement.href.startsWith('mailto:') ||
            linkElement.href.includes('#')
        )
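
To make the effect of that condition concrete, here is a minimal sketch that wraps it in a predicate and applies it to a few sample hrefs. The helper name `shouldSkipHref` and the sample list are assumptions for illustration only; they are not taken from the PR.

```typescript
// Hypothetical helper that mirrors the condition quoted above.
// Returns true when an href should be ignored by the crawler.
function shouldSkipHref(href: string | undefined | null): boolean {
    if (!href) return true
    return href.startsWith('about:blank') || href.startsWith('mailto:') || href.includes('#')
}

// Illustrative inputs only; the e-mail address and URL come from this thread.
const hrefs: string[] = [
    '',
    'about:blank',
    'mailto:info@itsjane.com',
    '/jobs-1/#top',
    '/jobs-1/',
    'https://www.itsjane.com/jobs-1/'
]

// Keep only the hrefs that the condition does not reject.
const kept = hrefs.filter((href) => !shouldSkipHref(href))
console.log(kept) // [ '/jobs-1/', 'https://www.itsjane.com/jobs-1/' ]
```

Note that `includes('#')` drops any link carrying a fragment, so an href such as `reef#v18-2-0-reef` is skipped entirely rather than reduced to `reef`.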

Logs:
This can be done on your own forked repository.

Other functionality:
I'm not entirely sure what you're trying to achieve, as the current method merges the base URL and relative paths correctly. I cross-referenced the web scraping results against other sources, and they show the same outcomes as our current method.

cc @HenryHengZJ

@gaord
Author

gaord commented Sep 1, 2023

Hi there, could you try https://docs.ceph.com/en/quincy/ please? Let me know what you find with the new code.

@chungyau97

@chungyau97
Contributor

chungyau97 commented Sep 1, 2023

Original Code:
pages: ["https://docs.ceph.com/en/quincy","https://docs.ceph.com/en/latest/releases/general","https://docs.ceph.com/en/latest","https://docs.ceph.com/en/reef","https://docs.ceph.com/en/pacific","https://docs.ceph.com/en/latest/releases","https://docs.ceph.com/en/octopus","https://docs.ceph.com/en/nautilus","https://docs.ceph.com/en/mimic","https://docs.ceph.com//readthedocs.org/projects/ceph","https://docs.ceph.com//readthedocs.org/builds/ceph"], length: 11

Failed hrefs: these fail because the original code does not cater for ../../, ../ and relativePathName/
failHref: ["start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","security/","glossary/","jaegertracing/","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","../../","../../","../../start/intro/","../../install/","../../cephadm/","../../rados/","../../cephfs/","../../rbd/","../../radosgw/","../../mgr/","../../mgr/dashboard/","../../api/","../../architecture/","../../dev/developer_guide/","../../dev/internals/","../../governance/","../../foundation/","../../ceph-volume/","../","../../security/","../../glossary/","../../jaegertracing/","../../","../#ceph-releases-index","../../ceph-volume/zfs/inventory/","../","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","releases/general/","releases/","security/","glossary/","jaegertracing/","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","releases/general/","releases/","security/","glossary/","jaegertracing/","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","genindex/","http-routingtable/","py-modindex/","start/intro/","","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","security/","glossary/","genindex/","genindex/","http-routingtable/","py-modindex/","start/intro/","","../","../","../start/intro/","../install/","../cephadm/","../rados/","../cephfs/","../rbd/","../radosgw/","../mgr/","../mgr/dashboard/","../api/","../architecture/","../dev/developer_guide/","../dev/internals/","../governance/","../foundation/","../ceph-volume/","general/","reef/","quincy/","pacific/","octopus/","nautilus/","mimic/","luminous/","kraken/","jewel/","infernalis/","hammer/","giant/","firefly/","emperor/","dumpling/","cuttlefish/","bobtail/","argonaut/","../security/","../glossary/","../jaegertracing/","../","reef","reef#v18-2-0-reef","quincy","quincy#v17-2-6-quincy","pacific","pacific#v16-2-14-pacific","octopus","octopus#v15-2-17-octopus","nautilus","nautilus#v14-2-22-nautilus","mimic","mimic#v13-2-10-mimic","luminous","luminous#v12-2-13-luminous","kraken","kraken#v11-2-1-kraken","jewel","jewel#v10-2-11-jewel","infernalis","infernalis#v9-2-1-infernalis","hammer","hammer#v0-94-10-hammer","giant","giant#v0-87-2-giant","firefly","firefly#v0-80-11-firefly","emperor","emperor#v0-72-2-emperor","dumpling","dumpling#v0-67-11-dumpling","reef","quincy","reef#v18-2-0-reef","quincy#v17-2-6-quincy","quincy#v17-2-5-quincy","quincy#v17-2-4-quincy","quincy#v17-2-3-quincy","quincy#v17-2-2-quincy","quincy#v17-2-1-quincy","quincy#v17-2-0-quincy","general/","reef/","genindex/","start/intro/","","radosgw","rbd","cephfs","start","architecture","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","releases/general/","releases/","glossary/","genindex/","genindex/","start/intro/","","genindex/","start/intro/","radosgw","rbd","cephfs","start","architecture","start/intro/","start/","install/","start/kube-helm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/","dev/internals/","governance/","ceph-volume/","releases/","glossary/","genindex/","genindex/","start/intro/","genindex/","start/intro/","radosgw","rbd","cephfs","start","architecture","start/intro/","start/","install/","start/kube-helm/","rados/","cephfs/","rbd/","radosgw/","mgr/","api/","architecture/","dev/","ceph-volume/","releases/","glossary/","genindex/","genindex/","start/intro/"]
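
For readers following along, hrefs of the kinds listed above (`../../`, `../`, bare relative names, and protocol-relative `//host/...`) can be resolved with the standard WHATWG `URL` constructor, which takes the href and the URL of the page it was found on. This is only a sketch of one possible approach, not the code in this PR; the helper name `resolveHref` is an assumption.

```typescript
// Hypothetical helper: resolve an href against the page it was found on.
// Returns an absolute URL without its fragment, or null if it cannot be parsed.
function resolveHref(href: string, pageUrl: string): string | null {
    try {
        const resolved = new URL(href, pageUrl)
        resolved.hash = '' // drop "#section" so the same page is not queued twice
        return resolved.href
    } catch {
        return null
    }
}

// Parent-relative path, resolved against a page two levels deep.
console.log(resolveHref('../../install/', 'https://docs.ceph.com/en/latest/releases/general/'))
// https://docs.ceph.com/en/latest/install/

// Protocol-relative href: the host changes, it is not appended to the base path.
console.log(resolveHref('//readthedocs.org/projects/ceph', 'https://docs.ceph.com/en/quincy/'))
// https://readthedocs.org/projects/ceph

// The trailing slash of the base matters for bare relative names.
console.log(resolveHref('start/intro/', 'https://docs.ceph.com/en/quincy/'))
// https://docs.ceph.com/en/quincy/start/intro/
console.log(resolveHref('start/intro/', 'https://docs.ceph.com/en/quincy'))
// https://docs.ceph.com/en/start/intro/
```

The last two calls show why stripping the trailing slash from a stored page URL changes how its relative links resolve.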

This PR:
The website is too big and the crawl takes too long. It was still actively crawling https://docs.ceph.com/docs/master/security/cves/ at pages: 2672.
Testing https://www.wagslane.dev/ produced the same result for both versions, and the crawl did not cross over to other hostnames.
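
The two observations above (a crawl that keeps growing on a large site, and a crawl that stays on one hostname) can be illustrated with a breadth-first traversal that takes a page limit and compares hosts. This is a sketch under stated assumptions, not the PR's implementation: `crawlSameHost`, `GetLinks`, the `limit` parameter, and the in-memory `graph` of example.com pages are all hypothetical, and `getLinks` stands in for "fetch the page and return its already-resolved links".

```typescript
// Hypothetical stand-in for fetching a page and returning its absolute links.
type GetLinks = (url: string) => string[]

// Breadth-first crawl that stays on the start URL's host and stops at `limit` pages.
function crawlSameHost(startUrl: string, getLinks: GetLinks, limit: number): string[] {
    const host = new URL(startUrl).host
    const visited = new Set<string>()
    const queue: string[] = [startUrl]
    while (queue.length > 0 && visited.size < limit) {
        const url = queue.shift() as string
        if (visited.has(url)) continue
        visited.add(url)
        for (const link of getLinks(url)) {
            // Only follow links on the same host that have not been seen yet.
            if (new URL(link).host === host && !visited.has(link)) queue.push(link)
        }
    }
    return Array.from(visited)
}

// Illustrative in-memory site; these URLs are made up for the example.
const graph: Record<string, string[]> = {
    'https://example.com/': ['https://example.com/a', 'https://other.example.org/x'],
    'https://example.com/a': ['https://example.com/', 'https://example.com/b'],
    'https://example.com/b': []
}
const getLinks: GetLinks = (url) => graph[url] || []

console.log(crawlSameHost('https://example.com/', getLinks, 10))
// [ 'https://example.com/', 'https://example.com/a', 'https://example.com/b' ]
console.log(crawlSameHost('https://example.com/', getLinks, 2))
// [ 'https://example.com/', 'https://example.com/a' ]
```

Passing the limit as a parameter rather than hard-coding it lets the caller trade completeness for run time on large sites.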

@HenryHengZJ
Contributor

Hey @gaord, thank you so much for the solution! We are going to put this on hold for now, since we are going to revamp the URL scraping UI to let users see which links are scraped and stop whenever they want. Otherwise this could go on for a long time and leave users in the dark about what has been scraped.

@gaord
Author

gaord commented Sep 5, 2023

No problem. Doing things right comes first, since it saves time. The original code doesn't process web pages properly. This PR is here in case it is useful for others. Thanks anyway for your great work on making AI apps easy.
@HenryHengZJ
