Add support for recursive archiving of entire domains, or across domains to a given depth (using a crawler) #191
I'd say this is currently our 2nd most-requested feature :) It's definitely on the roadmap (#120), but not planned anytime soon, because it's not ArchiveBox's primary use case and it's extremely difficult to do well. For now I recommend using other software to do the crawling and produce a list of URLs for all the pages, then piping that list into archivebox to do the actual archiving. Eventually, I may consider exposing on ArchiveBox flags similar to those available in wget's recursive retrieval options: https://www.gnu.org/software/wget/manual/wget.html#Recursive-Retrieval-Options-1 Together, those flags should cover most of the use cases.
I anticipate it will take a while to get to this point (likely multiple major versions), as we first have to build or integrate a crawler of some sort, and web crawling is an extremely complex process with lots of subtle nuance around configuration and environment (see Scrapy for inspiration). The process will also naturally be additive if multi-snapshot support is added: #179.

Unfortunately, doing mirroring / full-site crawling properly is extremely non-trivial, as it involves building or integrating with an existing crawler/spider. Even just the logic to parse URLs out of a page is deceptively complex, and there are tons of intricacies around mirroring that never need to be considered when doing the kind of single-page archiving ArchiveBox was designed for. Currently this is blocked on setting up our proxy archiver, which supports deduping response data in the WARC files; after that we'll also need to pick a crawler, or integrate with an existing one.

For people landing on this issue and looking for an immediate solution, I recommend this command (which is exactly what ArchiveBox runs right now, but with a few recursive options added):

```shell
wget --server-response \
    --no-verbose \
    --adjust-extension \
    --convert-links \
    --force-directories \
    --backup-converted \
    --compression=auto \
    -e robots=off \
    --restrict-file-names=unix \
    --timeout=60 \
    --page-requisites \
    --no-check-certificate \
    --no-hsts \
    --span-hosts \
    --no-parent \
    --recursive \
    --level=2 \
    --warc-file=$(date +%s) \
    --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" \
    https://example.com
```
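The "do the crawling elsewhere, pipe the URLs into archivebox" approach can be sketched with nothing but the Python standard library. This is a minimal illustration, not ArchiveBox code, and every name in it is made up for the example; a real crawler also has to handle redirects, robots.txt, canonicalization, rate limits, and much more:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url, same_domain_only=True):
    """Return the outgoing links of a page, optionally restricted to its domain."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    if same_domain_only:
        host = urlparse(base_url).netloc
        return [u for u in parser.links if urlparse(u).netloc == host]
    return parser.links
```

A driver script could then fetch each frontier page with `urllib.request`, call `extract_links`, and print the collected URLs on stdout for `archivebox add` to consume.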
I am interested in this feature, particularly in limited depth/level recursion. I would like ArchiveBox to be able to archive a page from a specified URL and then archive any linked content, even content that is not on the same domain. My particular interest is being able to click around on the archived page and land on the local archived versions of linked pages rather than their live URLs (similar to how the Wayback Machine handles it). Keep up the good work and good luck!
I downloaded this app and took this ability for granted; only by accident did I find that it doesn't work yet. When can this be expected to be implemented?
Not for a while, it's a very tricky feature to implement natively; I'd rather integrate an existing crawler and use ArchiveBox just to process the generated stream of URLs. Don't expect this feature anytime soon unless you feel like implementing it yourself. For now you can check out some of the alternative software on the wiki: https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community
@pirate Thank you very much for the reference. I will look into it until the feature is implemented in ArchiveBox.
This got pinned, I see. 👀 Does this mean this feature can expect some level of work sometime soon? I swear, once this gets added I'll be damn happy. It's something I've been dreaming of having: domain-level snapshotting to go back in time for various endeavors. Just imagine browsing playstation.com or the like 20-30 years later, reliving the PS4 and PS5 era in the context of them being retro, and being able to get all those articles back. When I find old screenshots of websites and my OS that are pushing 15 years or more, I get nostalgia tickles; a feature like this would go beyond seeing what I saw back in the day and let me discover things that escaped my eyes entirely, at solely MY discretion. The level of archivist comfort here is immeasurable. This combined with automated re-snapshotting over time... incredible! Bonus question: would this also lay the foundation for use cases where I want to archive not an entire domain, but all items of a certain user's feed?
You can already archive everything on a user's feed by adding the feed URL directly. Beyond that, you can achieve full recursive archiving if you do multiple passes:

```shell
archivebox add --depth=1 https://example.com
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
# ...etc, as many levels deep as you want. It won't duplicate anything,
# so it's safe to re-run until there's nothing new to discover.
```

This is nice for a number of reasons: you can keep an eye on progress and make sure it's not accidentally downloading all of YouTube, the URLs are added in order of depth, and they can be tagged separately during the adding process if you want. It also allows you to rate-limit manually, to avoid being blocked by (or taking down) the site you're archiving.

I really don't want to implement my own full recursive crawler; it's a lot of work and really difficult to maintain. It's also a big support burden, as crawlers constantly break and need fixing, plus extra config options to handle different desired behavior on different kinds of sites. I would much rather people use one of the many great crawlers/spiders that are already available and pipe the URLs into archivebox (e.g. Scrapy): https://scrapy-do.readthedocs.io/en/latest/quick-start.html

As it stands, I'm unlikely to add a crawler directly into ArchiveBox anytime soon because I barely have enough time to maintain ArchiveBox as-is, but I'm not opposed to improving the ergonomics of using it with a crawler via smaller PRs, or to reviewing a proposed design if someone wants to contribute a way to build Scrapy or another existing crawler into ArchiveBox. This issue is pinned because we get a lot of requests for it, and I'd rather make this thread easy to find so people know what the status is.
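A toy model of why the repeated `list | add --depth=1` passes above converge: because the index deduplicates, each pass grows the archive by at most one link-level, and re-running after everything is discovered is a no-op. This simulation is purely illustrative (the graph and function names are invented, and a `set` stands in for ArchiveBox's index):

```python
def recursive_passes(graph, start, passes):
    """Simulate N passes of `archivebox list | archivebox add --depth=1`.

    graph: dict mapping URL -> list of outgoing link URLs (a toy site).
    Returns the set of URLs that would be in the index after `passes` passes.
    """
    seen = {start}  # `archivebox add --depth=1 <start>` seeds the index
    for _ in range(passes):
        # `archivebox list` emits every indexed URL; `add --depth=1`
        # follows each one's outgoing links, skipping duplicates.
        frontier = [u for page in list(seen) for u in graph.get(page, [])]
        seen.update(frontier)
    return seen
```

Running it on a three-page chain shows each pass reaching one level deeper, and extra passes changing nothing once the site is exhausted.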
@pirate I used a mix of the two suggestions you offered and built myself a workaround. That does the trick for a local version that I can navigate offline.

4a. Alternatively, add each line with a single command. The resulting pages can be navigated locally because archivebox is intelligent enough to find all linked offline versions. It is of course not a single-page dump.
Is there a way to exclude URLs not within example.org in this setup? Is there an option for that maybe?
@francwalter I've added a new option for this:

```shell
export URL_ALLOWLIST='^http(s)?:\/\/(.+)?example\.com\/?.*$'

# then run your archivebox commands
archivebox add --depth=1 'https://example.com'
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
# ...

# all URLs that don't match *.example.com will be excluded,
# e.g. a link to youtube.com would not be followed
# (note that all assets required to render each page are still archived;
#  URL_DENYLIST/URL_ALLOWLIST does not apply to inline images, css, video, etc.)
```

I've also documented the new allowlist support here: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#URL_ALLOWLIST It will be out in the next release, v0.6.3, but if you want to use it early you can run from the
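To sanity-check an allowlist pattern before exporting it, you can exercise the same regex in plain Python (illustrative only; ArchiveBox's internal matching may differ in details):

```python
import re

# The allowlist pattern from the comment above.
ALLOWLIST = re.compile(r'^http(s)?:\/\/(.+)?example\.com\/?.*$')

def allowed(url):
    """Return True if the URL would pass this allowlist regex."""
    return bool(ALLOWLIST.match(url))
```

Note the pattern is intentionally loose: `(.+)?example\.com` would also match a host like `notexample.com`, so if that matters, anchor the subdomain part explicitly, e.g. `^https?:\/\/([^\/]+\.)?example\.com`.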
I may be doing something wrong, but it doesn't work for me
@kiwimato I've got the same problem. I've modified the list command to the following and it seems to be working; feedback and corrections welcome. I used the substring option, as I felt it was more flexible. I'm new to archivebox.
@JustGitting I got around this by using browsertrix-crawler, which is awesome and does this out of the box. It also has the ability to create WACZ files directly, which can then be used with web-replay-gen. Thank you for posting a solution btw :)
@kiwimato great to hear you found a workaround. I had a look at browsertrix-crawler, but hoped to avoid needing to run Docker. I like simple commands :-).
I'm stuck using SiteSucker and storing into my own local directory until this is added. There are some sites I want to archive entirely for personal browsing at a later date, because I don't know if the site owner will keep the site up. For example, I have an archive of http://hampa.ch/ which will be useful for a lot of the things I do personally.
I'm working out a way to hopefully archive webcomics. The trouble is most comics store their images on separate but very similar web pages, e.g. domain.org/d/20210617, then domain.org/d/20210618 and domain.org/d/20210619. Maybe rather than trying to cram an everything-crawler into ArchiveBox, we could just improve the connections that could be made to more generalized crawlers? Even a Python slot to type in some bs4 or something :3
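For the sequential-page pattern described above, no crawler is needed at all: the URL list can be generated directly and piped into `archivebox add`. A hypothetical sketch (the `/d/YYYYMMDD` pattern is taken from the comment's example; real comics vary, so treat the URL format as an assumption):

```python
from datetime import date, timedelta

def daily_urls(base, start, end):
    """Yield one URL per day between start and end (inclusive),
    following the hypothetical /d/YYYYMMDD path pattern."""
    d = start
    while d <= end:
        yield f"{base}/d/{d.strftime('%Y%m%d')}"
        d += timedelta(days=1)

# Print the list so it can be piped straight into `archivebox add`:
#   python gen_urls.py | archivebox add
if __name__ == "__main__":
    for url in daily_urls("https://domain.org", date(2021, 6, 17), date(2021, 6, 19)):
        print(url)
```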
I saw @larshaendler said above that:
Is this true? The internal hyperlinks on my snapshots seem to go to the originals, not to their archived offline versions already in ArchiveBox. Even without adding a full-blown crawler, a much simpler enhancement would be: upon adding a new snapshot, rewrite all hyperlinks on existing snapshots that point to the new snapshot's original URL so they point to the offline version, and likewise check whether any hyperlinks on the new snapshot already exist offline in the archive and rewrite those to point at the offline versions too.
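The rewrite-on-add idea suggested above could be prototyped very simply. This is a naive sketch, not ArchiveBox behavior: it only handles exact `href` matches, and the `archive_index` mapping of original URL to local snapshot path is a hypothetical structure invented for the example:

```python
def rewrite_links(html, archive_index):
    """Replace hrefs that point at archived originals with local snapshot paths.

    archive_index: dict mapping original URL -> local path of its snapshot
    (a made-up layout for illustration). Only exact href matches are rewritten;
    a robust version would parse the HTML and normalize URLs first.
    """
    for original, local in archive_index.items():
        html = html.replace(f'href="{original}"', f'href="{local}"')
    return html
```

Run on each snapshot's HTML whenever a new snapshot is added, this would make internal links navigate offline, as the comment describes.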
I've actually experienced this too; I tried it out just the other day, in fact! Even if I archive one path, then archive another page that links to the already-archived path, the links still point back to the original. Doesn't seem to matter what order I archive the sites in.
@Ember-ruby it's about getting data for AI to read and summarize, as part of a use case. That is why the wget method is so important.
Is it in scope to have it be possible to archive: