Wednesday, October 18, 2006

Saving the disappearing web with wget

Anyone who has browsed the web for a while knows the frustration of finding an extremely cool site, only to return to it later and find that it's been taken down. The Internet Archive can often help find the lost information, but it's far from perfect.

Thus, it's with great joy that I've discovered wget, a small Linux application that automates the downloading and archiving of websites. Wget runs from the command line and is very flexible; it can do everything from saving a single file ("wget http://mydomain/myfile") to saving lists of files or entire sites.
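
As a quick illustration (the URL here is just a placeholder), grabbing a single file is as simple as:
wget http://mydomain/some_file.pdf
The file gets saved in the current directory under its original name; wget's -O option can be used to save it under a different name instead.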

If you know all the URLs that you want to save, you can simply put them in a file (each URL on a separate line) and then tell wget to download all the URLs in that file ("wget -i file_with_urls.txt"). This can be handy if you have a lot of files with similar names that you want to download (e.g., a bunch of videos titled "video1.mov", "video2.mov", etc.).
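
As a sketch of how this works (the filename and URLs are made up), a file named video_urls.txt might contain:
http://mydomain/video1.mov
http://mydomain/video2.mov
http://mydomain/video3.mov
and a single "wget -i video_urls.txt" then downloads all three files, one after the other.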

The program can also recursively travel through a website's files and download all the content from the site; this is what I've started doing on a few of the sites that have information I'd like to keep for the long term. The command line flag to do this is -r (e.g., "wget -r http://mydomain"), but there are a few options that you might want to consider using along with the -r flag:
  • -l X - limits the depth of the recursion to the number specified (X) after the flag.
  • -k - converts the links in the saved files so that they refer to the files downloaded onto your computer, unless the files they refer to weren't downloaded (e.g., they exceeded the depth you specified), in which case the links are re-coded to refer to the original site.
  • -p - downloads everything required to view the page properly (e.g., CSS stylesheets or files that would otherwise have been rejected).
  • -w X - specifies the number of seconds (X) that wget should wait between each download.
  • --random-wait - randomly varies the number of seconds specified by -w between each file download; this apparently helps prevent sites from detecting automated downloads and blocking them.
  • -H - spans the domains referred to by the pages when recursing (use with care).
  • -D X - limits spanning to the domains contained in the comma-separated list (X) when recursing; used together with -H.
  • -o X - sends all of wget's output to a log file (X) instead of the terminal.
  • -b - starts wget in the background.
The -k option is insanely useful, as it rewrites the pages to make them browsable offline; without that option, any links that were hard-coded to refer to the original site would direct you back to the site, even if you were browsing from the copy you had saved.
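
Putting a few of those flags together, a minimal recursive grab (the domain is made up) might look something like:
wget -r -l 2 -k -p -w 5 --random-wait http://mydomain
That would follow links two levels deep, wait around 5 seconds (randomly varied) between downloads, pull in each page's stylesheets and images, and rewrite the links so the saved copy is browsable offline.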

So, for instance, to download a backup copy of my blog that's browsable offline and contains all the pictures and other content I host on external sites, I use the command:
wget -rkpl 3 -w 5 --random-wait -bo rhosgobel_bkup_log.txt -HD "rhosgobel.blogspot.com,www.flickr.com,home.comcast.net,photos1.blogger.com" http://rhosgobel.blogspot.com
To watch the progress, all I have to do is "tail -f rhosgobel_bkup_log.txt". The only two problems with this scheme are that photos1.blogger.com (where I posted some pictures in 2004) has a robots.txt file that stops wget from downloading files from it, and that I also end up getting more than just my own pictures on Flickr (but by limiting wget to 3 levels of recursion I don't get too many extra pictures).
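
One thing I haven't tried yet for the robots.txt problem: wget's -e switch passes a command as if it were in .wgetrc, and "-e robots=off" is supposed to make wget ignore robots.txt entirely. Adding it to the command above would presumably look like:
wget -e robots=off -rkpl 3 -w 5 --random-wait -bo rhosgobel_bkup_log.txt -HD "rhosgobel.blogspot.com,www.flickr.com,home.comcast.net,photos1.blogger.com" http://rhosgobel.blogspot.com
Whether to ignore a site's robots.txt is, of course, a judgment call.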

I can change the options to suit my needs: if I want to create a backup with the original links intact (i.e., not re-written to be browsable offline), I just remove the "k" option; if I want just the text of my posts, without pictures or content I've posted on other sites, I remove the -HD "..." option. So, as another example, here's the command I use to download a backup copy of my blog (without images) that leaves all links intact (i.e., it's suitable for re-uploading if my blog ever got deleted or erased):
wget -rp -w 5 --random-wait -bo rhosgobel_bkup_log.txt -HD "rhosgobel.blogspot.com,home.comcast.net" http://rhosgobel.blogspot.com
Wget (of course) has an insane number of options that I haven't listed; to see them all, refer to its man page.