bugfix: webCrawl can't handle website well recursively #814

gaord · 2023-08-22T08:32:48Z

When using the Cheerio Web Scraper with Web Crawl selected, it is supposed to crawl web pages recursively. A few bugs prevented that from working:
1. URLs extracted from a page included things like about:blank and mailto: links.
2. The extraction rules were not well aligned with the specification of the HTML <a> element.
3. It always returned 10 pages of the website.
Furthermore, this commit also adds a few informative logs when debug is enabled.

Signed-off-by: Ben Gao <bengao168@msn.com>
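To make the third point concrete, here is a minimal sketch of a recursive (breadth-first) crawl whose page limit is a parameter instead of a fixed 10. It is not the code from this PR: `crawl`, `LinkExtractor`, and the in-memory `site` map are hypothetical names used only for illustration, and link extraction is injected so the example runs without network access.

```typescript
// Hypothetical helper type: returns the absolute links found on a page.
type LinkExtractor = (pageUrl: string) => string[]

// Breadth-first crawl that stops once `limit` pages have been collected.
// `limit` is a parameter rather than a hard-coded constant.
function crawl(startUrl: string, getLinks: LinkExtractor, limit: number): string[] {
    const seen = new Set<string>([startUrl]) // URLs already queued or visited
    const queue: string[] = [startUrl]
    const pages: string[] = []

    while (queue.length > 0 && pages.length < limit) {
        const current = queue.shift() as string
        pages.push(current)
        for (const link of getLinks(current)) {
            if (!seen.has(link)) {
                seen.add(link)
                queue.push(link)
            }
        }
    }
    return pages
}

// Usage with an in-memory "site" standing in for real HTTP fetches.
const site: Record<string, string[]> = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/', 'https://example.com/c'],
    'https://example.com/b': [],
    'https://example.com/c': []
}
const lookup: LinkExtractor = (url) => site[url] ?? []
console.log(crawl('https://example.com/', lookup, 3))
```

With a limit of 3 the crawl stops after the start page and its two direct links; with a larger limit it would also reach the page that is only linked from `/a`.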

CLAassistant · 2023-08-22T08:32:54Z

All committers have signed the CLA.

chungyau97 · 2023-08-29T10:34:34Z

Hi @gaord

I got a lot of additional weird links from this PR
Tested URL: https://www.itsjane.com

https://www.itsjane.com/jobs-1/eatatjane

https://www.itsjane.com/team-member/heidi-lee/itsjanesworld

Results:
webCrawl_bug.txt
original.txt

gaord · 2023-08-30T02:40:45Z

Hi,
The website is actually not well maintained; quite a few of its links are broken. Take the example you mentioned:
Open https://www.itsjane.com/jobs-1/ in a browser.
In the bottom-left corner of the page you will see a link: https://www.itsjane.com/jobs-1/eatatjane
Here is a screenshot for you:

Well tested! Thanks.
@chungyau97

chungyau97 · 2023-09-01T03:12:08Z

Hi @gaord,

My testing:

eatatjane will not be scraped by the current method, as it is neither a URL nor a relative path:

info@itsjane.com will be scraped by the current method, but it has no hostname, so it will not be added to the pages array:

Conclusions:

What I can agree with you on is adding the code below, which will help improve scraping performance:

        if (
            !linkElement.href ||
            linkElement.href.startsWith('about:blank') ||
            linkElement.href.startsWith('mailto:') ||
            linkElement.href.includes('#')
        )
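For readers who want to try the condition quoted above in isolation, here is a small sketch that wraps it in a predicate. The function name `shouldSkipHref` and the sample hrefs are assumptions for illustration; only the four checks come from the snippet.

```typescript
// Returns true when an href should not be followed by the crawler.
// Mirrors the condition quoted above; the function name is hypothetical.
function shouldSkipHref(href: string | null | undefined): boolean {
    return (
        !href ||
        href.startsWith('about:blank') ||
        href.startsWith('mailto:') ||
        href.includes('#')
    )
}

const hrefs = ['/jobs-1/', 'mailto:info@itsjane.com', 'about:blank', '', 'reef#v18-2-0-reef']
console.log(hrefs.filter((h) => !shouldSkipHref(h))) // [ '/jobs-1/' ]
```

Note that the `includes('#')` check also drops links that point to a different page plus a fragment (the last sample href). Stripping the fragment instead of skipping the link is one possible alternative.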

Logs:
This can be done on your own forked repository.

Other functionality:
I'm not entirely sure what you're trying to achieve, as the current method merges the baseURL and relative path correctly. I cross-referenced the web scraping results against other sources, and they show the same outcomes as our current method.

cc @HenryHengZJ

gaord · 2023-09-01T07:34:41Z

hi there, could you try https://docs.ceph.com/en/quincy/ please? Let me know what you find with the new code.

@chungyau97

chungyau97 · 2023-09-01T10:14:26Z

Original Code:
pages: ["https://docs.ceph.com/en/quincy","https://docs.ceph.com/en/latest/releases/general","https://docs.ceph.com/en/latest","https://docs.ceph.com/en/reef","https://docs.ceph.com/en/pacific","https://docs.ceph.com/en/latest/releases","https://docs.ceph.com/en/octopus","https://docs.ceph.com/en/nautilus","https://docs.ceph.com/en/mimic","https://docs.ceph.com//readthedocs.org/projects/ceph","https://docs.ceph.com//readthedocs.org/builds/ceph"], length: 11

Failed hrefs, because the original code does not cater for ../../, ../ and relativePathName/:
failHref: ["start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","security/","glossary/","jaegertracing/","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","../../","../../","../../start/intro/","../../install/","../../cephadm/","../../rados/","../../cephfs/","../../rbd/","../../radosgw/","../../mgr/","../../mgr/dashboard/","../../api/","../../architecture/","../../dev/developer_guide/","../../dev/internals/","../../governance/","../../foundation/","../../ceph-volume/","../","../../security/","../../glossary/","../../jaegertracing/","../../","../#ceph-releases-index","../../ceph-volume/zfs/inventory/","../","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","releases/general/","releases/","security/","glossary/","jaegertracing/","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","releases/general/","releases/","security/","glossary/","jaegertracing/","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","genindex/","http-routingtable/","py-modindex/","start/intro/","","dev/developer_guide/basic-workflow/#basic-workflow-dev-guide","start/documenting-ceph/#documenting-ceph","radosgw","rbd","cephfs","install","architecture","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","security/","glossary/","genindex/","genindex/","http-routingtable/","py-modindex/","start/intro/","","../","../","../start/intro/","../install/","../cephadm/","../rados/","../cephfs/","../rbd/","../radosgw/","../mgr/","../mgr/dashboard/","../api/","../architecture/","../dev/developer_guide/","../dev/internals/","../governance/","../foundation/","../ceph-volume/","general/","reef/","quincy/","pacific/","octopus/","nautilus/","mimic/","luminous/","kraken/","jewel/","infernalis/","hammer/","giant/","firefly/","emperor/","dumpling/","cuttlefish/","bobtail/","argonaut/","../security/","../glossary/","../jaegertracing/","../","reef","reef#v18-2-0-reef","quincy","quincy#v17-2-6-quincy","pacific","pacific#v16-2-14-pacific","octopus","octopus#v15-2-17-octopus","nautilus","nautilus#v14-2-22-nautilus","mimic","mimic#v13-2-10-mimic","luminous","luminous#v12-2-13-luminous","kraken","kraken#v11-2-1-kraken","jewel","jewel#v10-2-11-jewel","infernalis","infernalis#v9-2-1-infernalis","hammer","hammer#v0-94-10-hammer","giant","giant#v0-87-2-giant","firefly","firefly#v0-80-11-firefly","emperor","emperor#v0-72-2-emperor","dumpling","dumpling#v0-67-11-dumpling","reef","quincy","reef#v18-2-0-reef","quincy#v17-2-6-quincy","quincy#v17-2-5-quincy","quincy#v17-2-4-quincy","quincy#v17-2-3-quincy","quincy#v17-2-2-quincy","quincy#v17-2-1-quincy","quincy#v17-2-0-quincy","general/","reef/","genindex/","start/intro/","","radosgw","rbd","cephfs","start","architecture","start/intro/","install/","cephadm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/developer_guide/","dev/internals/","governance/","foundation/","ceph-volume/","releases/general/","releases/","glossary/","genindex/","genindex/","start/intro/","","genindex/","start/intro/","radosgw","rbd","cephfs","start","architecture","start/intro/","start/","install/","start/kube-helm/","rados/","cephfs/","rbd/","radosgw/","mgr/","mgr/dashboard/","api/","architecture/","dev/","dev/internals/","governance/","ceph-volume/","releases/","glossary/","genindex/","genindex/","start/intro/","genindex/","start/intro/","radosgw","rbd","cephfs","start","architecture","start/intro/","start/","install/","start/kube-helm/","rados/","cephfs/","rbd/","radosgw/","mgr/","api/","architecture/","dev/","ceph-volume/","releases/","glossary/","genindex/","genindex/","start/intro/"]
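As an illustration of how hrefs like the failed ones above can be resolved, here is a sketch using the standard WHATWG `URL` constructor with a base URL. This is not necessarily how either the original code or this PR does it. The helper name `resolveHref` is hypothetical, and the page URLs used as the base are assumptions, since the log does not record which page each href came from.

```typescript
// Resolve an href against the URL of the page it was found on.
// Returns undefined when the href cannot be parsed as a URL.
function resolveHref(href: string, pageUrl: string): string | undefined {
    try {
        const resolved = new URL(href, pageUrl)
        resolved.hash = '' // drop the fragment so page#a and page#b count as one page
        return resolved.href
    } catch {
        return undefined
    }
}

// Assumed source page for the examples below (trailing slash matters).
const page = 'https://docs.ceph.com/en/latest/releases/general/'
console.log(resolveHref('../../start/intro/', page)) // https://docs.ceph.com/en/latest/start/intro/
console.log(resolveHref('../', page)) // https://docs.ceph.com/en/latest/releases/
console.log(resolveHref('//readthedocs.org/projects/ceph', page)) // https://readthedocs.org/projects/ceph
```

Two details are worth keeping in mind. A base without a trailing slash resolves differently: `install/` against `https://docs.ceph.com/en/quincy` yields `https://docs.ceph.com/en/install/`, because the last path segment is treated as a file name. And a protocol-relative href such as `//readthedocs.org/projects/ceph` resolves to another host rather than to a path under docs.ceph.com.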

This PR:
The website is too big and the crawl takes too long. It was still running at: actively crawling https://docs.ceph.com/docs/master/security/cves/ pages: 2672
Testing https://www.wagslane.dev/ produced the same result for both versions, and the crawl does not cross over to other hostnames.
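The "does not cross over to other hostnames" property can be checked with a small predicate like the following sketch. The name `isSameHost` and the `/posts/` path are assumptions for illustration, not taken from the PR.

```typescript
// True when `candidate` is on the same hostname as `start`.
// Strings that do not parse as absolute URLs are treated as off-site.
function isSameHost(candidate: string, start: string): boolean {
    try {
        return new URL(candidate).hostname === new URL(start).hostname
    } catch {
        return false
    }
}

console.log(isSameHost('https://www.wagslane.dev/posts/', 'https://www.wagslane.dev/')) // true
console.log(isSameHost('https://readthedocs.org/projects/ceph', 'https://docs.ceph.com/en/quincy/')) // false
console.log(isSameHost('info@itsjane.com', 'https://www.itsjane.com')) // false
```

This is a strict comparison, so `www.example.com` and `example.com` count as different hosts. Whether subdomains should be followed is a design choice the thread does not settle.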

HenryHengZJ · 2023-09-01T10:41:07Z

Hey @gaord, thank you so much for the solution! We are going to put this on hold for now, since we plan to revamp the URL scraping UI so that users can see which links are being scraped and stop whenever they want. Otherwise, a crawl could run for a long time and leave users in the dark about what has been scraped.

gaord · 2023-09-05T10:06:11Z

No problem. Doing things right comes first, since it saves time. The original code isn't implemented in a way that processes web pages properly. This PR was crafted in case it is useful for others. Thanks anyway for your great work making AI apps easy.
@HenryHengZJ

HenryHengZJ requested a review from chungyau97 August 22, 2023 16:18
