Download a Website Using wget

Recently, I needed to clone a website and make a few minor changes to it. I wanted to publish a slightly modified copy of the website. Luckily, it’s easy to do that using wget. Here’s how I did it.

1. Install wget

I’m on a Mac, so I installed wget using Homebrew with the command

brew install wget
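
To confirm the install worked, check that wget runs and prints its version:

wget --version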

2. Download site

I wanted to download this small website. I used this command:

wget -p -r https://events.govexec.com/qualys-cyber-risk-conference/
  • The -p flag means download all page requisites, such as images, stylesheets, etc.
  • The -r flag means recursive: wget follows links on each page and downloads those pages too.
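
If you also want the local copy to be easily browsable offline, wget’s --convert-links (-k) flag rewrites links in the saved pages to point at the downloaded files, and --no-parent (-np) keeps a recursive crawl from climbing above the starting directory. These are optional (I only needed -p and -r here), but a fuller mirror command might look like this:

wget -p -r -k -np https://events.govexec.com/qualys-cyber-risk-conference/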

Note that wget crawls the website to discover pages and their dependencies. If a page is not linked, directly or indirectly, from the URL you pass to wget (an orphan page), it will not be downloaded. One option is to build a list of all URLs (one per line) from the website’s sitemap, assuming the sitemap is complete, and pass that list to wget, e.g.

wget -p -r --input-file=download-list.txt
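
To build download-list.txt, you can pull the URLs out of the sitemap. A rough sketch, using a placeholder sitemap URL and assuming each <loc> entry sits on its own line (adjust for your site):

# Extract every <loc> URL from the sitemap into download-list.txt
curl -s https://example.com/sitemap.xml | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' > download-list.txt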

If the download stops partway, you can resume it with the command

wget -p -r --continue --input-file=download-list.txt

3. Search and replace

Since I downloaded a bunch of HTML files, the easiest way to replace a common element across multiple pages was a search and replace. Using Visual Studio Code, you can easily find all HTML blocks within a particular tag using a multi-line regex. Here are some example regexes:

<footer(.|\n)*?</footer>
<script(.|\n)*?</script>
<a class="popup(.|\n)*?</a>

Note: these regexes only work if the tags don’t have any nested tags with the same name.
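
If you’d rather script the replacement than click through the editor, a Perl one-liner can apply the same kind of multi-line substitution to every downloaded HTML file in place (perl is typically preinstalled on macOS). A sketch, with a placeholder footer as the replacement; back up the files first:

# Replace each <footer>…</footer> block in all .html files, editing them in place
find . -name '*.html' -exec perl -0777 -pi -e 's|<footer.*?</footer>|<footer>NEW FOOTER</footer>|gs' {} +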