Webarchivin'

2025-02-10

This post is over a year old. Content may have changed, moved, or may be wrong in its entirety. The views that I express here have probably changed. If you feel this is in error, or there is something I should correct or clarify, please feel free to email me using the link below the post. Thank you.

Sometimes you just want to preserve something, like a page you visited or are visiting. Sometimes the browser is Good Enough™ and you can save the page for offline viewing. Sometimes, however, you want a proper archive. Thankfully, there is a standard for that: WARC, the Web ARChive format!

But how do you actually make one of these so-called .warc files? Well, thanks to the folks at the GNU Project, wget already has it built in! You can just use some options to get an archive:

$ wget --adjust-extension \
       --execute robots=off \
       --convert-links \
       --no-parent \
       --mirror \
       --warc-file=domainname.com \
       --no-warc-keep-log \
       --page-requisites \
       --no-verbose "https://urlhere.com"

Now you’ll get a folder of files that mirror the page or site, plus a .warc file (gzip-compressed by default, so it ends in .warc.gz) that contains all of that in a single digestible format suitable for libraries and search engines. Neat!
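If you want to check what actually ended up in the archive, here is a rough sketch of a helper that lists the captured URLs. The function name warc_urls is my own invention, not part of wget, and it assumes the default gzip-compressed output and that each record carries a WARC-Target-URI header, as the WARC format describes for request and response records.

```shell
# Hypothetical helper (not part of wget): list the unique URLs in a WARC file.
# Assumes the archive is gzip-compressed, which is wget's default.
warc_urls() {
    # gzip -dc decompresses to stdout and handles the per-record gzip members.
    # grep -a treats the mixed text/binary stream as text.
    # tr strips the CR line endings and any <angle brackets> around the URI.
    gzip -dc "$1" \
        | grep -a '^WARC-Target-URI:' \
        | sed 's/^WARC-Target-URI: *//' \
        | tr -d '\r<>' \
        | sort -u
}
```

With the command from this post, that would be something like warc_urls domainname.com.warc.gz, which should print one URL per line.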

Respond via email.