Web scraping and site mirroring

Need to mirror an entire website? Use the httrack command, packaged by most Linux distributions. If the site requires authentication, give httrack a cookies.txt file exported from your browser.

A typical httrack invocation:

httrack https://some.site.com -W -O ~/websites/somesite --robots=0 --stay-on-same-address

Under ~/websites/somesite, httrack creates a folder structure of HTML and other files that mirrors the URLs on https://some.site.com. You can open the index.html file locally in your browser and browse the mirrored content.
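To make the cookies.txt step concrete, here is a minimal sketch that wraps the command above in a shell function. The function name mirror_site and the DRY_RUN variable are hypothetical, introduced only for illustration. It also assumes that httrack picks up a Netscape-format cookies.txt from the project (output) directory, so check the httrack documentation for your version before relying on it.

```shell
#!/bin/sh
# Sketch only: mirror_site and DRY_RUN are illustrative names, not part of httrack.

mirror_site() {
    url=$1          # site to mirror
    dest=$2         # local output directory (httrack -O)
    cookies=${3:-}  # optional cookies.txt exported from the browser

    if [ -n "$cookies" ]; then
        # Assumption: httrack loads cookies.txt from the project directory.
        mkdir -p "$dest" && cp "$cookies" "$dest/cookies.txt" || return 1
    fi

    # Same options as the command above. When DRY_RUN is set to a non-empty
    # value, the command is printed instead of executed.
    ${DRY_RUN:+echo} httrack "$url" -W -O "$dest" --robots=0 --stay-on-same-address
}
```

A call such as mirror_site https://some.site.com ~/websites/somesite ~/Downloads/cookies.txt would copy the cookie file and then start the mirror. The path ~/Downloads/cookies.txt is just an example location.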

For mirroring a whole site, httrack works better and faster than the wget command, which is also very powerful but better suited to simpler tasks. The curl command, popular on macOS, is far simpler: it transfers individual URLs and can’t crawl an entire site.
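For comparison, this is roughly what the other two tools look like for the same job. It is a sketch under assumptions: the function names and the DRY_RUN variable are hypothetical, and the wget options shown are one common combination for recursive mirroring, not the only one.

```shell
#!/bin/sh
# Sketch only: wget_mirror, curl_fetch and DRY_RUN are illustrative names.

wget_mirror() {
    # --mirror            recursive download with timestamping
    # --convert-links     rewrite links so the copy can be browsed offline
    # --page-requisites   also fetch images, CSS and other page assets
    # --no-parent         never climb above the starting directory
    # -P                  directory where the files are saved
    ${DRY_RUN:+echo} wget --mirror --convert-links --page-requisites --no-parent -P "$2" "$1"
}

curl_fetch() {
    # curl transfers exactly one URL: -f fails on HTTP errors, -sS hides the
    # progress meter but still shows errors, -L follows redirects, -o names
    # the output file.
    ${DRY_RUN:+echo} curl -fsSL -o "$2" "$1"
}
```

With DRY_RUN=1 set, both functions only print the command they would run, which is a safe way to inspect the options before starting a long crawl.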

If the content you need is media (video or audio) from sites like YouTube, Vimeo, Facebook, Instagram, TikTok, Twitter or other video sites, yt-dlp does the job very well. It is also packaged by most Linux distributions (or installable with a plain pip install yt-dlp in any Python environment). It is an actively maintained fork of the older youtube-dl, whose development has largely stalled.
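As a rough illustration of everyday yt-dlp use, the sketch below wraps two common cases: saving a video, and saving only its audio track. The function names and the DRY_RUN variable are hypothetical, and the audio conversion assumes ffmpeg is installed.

```shell
#!/bin/sh
# Sketch only: fetch_media, fetch_audio and DRY_RUN are illustrative names.

fetch_media() {
    # -P sets the download directory (current directory if omitted here);
    # -o sets the file name template: title, video id, original extension.
    ${DRY_RUN:+echo} yt-dlp -P "${2:-.}" -o '%(title)s [%(id)s].%(ext)s' "$1"
}

fetch_audio() {
    # -x keeps only the audio; --audio-format converts it (requires ffmpeg).
    ${DRY_RUN:+echo} yt-dlp -x --audio-format mp3 -P "${2:-.}" "$1"
}
```

For example, fetch_audio followed by a video URL and ~/Music would leave an MP3 file in ~/Music.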

Downloading content from the Web is the most basic operation on the Internet. It is what web browsers do all the time. Using these tools is like switching to another browser, one that saves the content in a deterministic way instead of just rendering it on the screen.

In the same way, there are even more specialized tools capable of saving content from music services such as Spotify, Deezer and Tidal, sometimes in lossless, studio-grade quality, complete with embedded lyrics, album cover art and high-quality tagging. But that is a subject for another post.

These tools are not recommended if you need more advanced interactions with services on the Internet. For more reliable and precise content exchange, first try the service’s API, if it provides one. Keep web scraping and site crawling as your very last resort.

Also on my LinkedIn.
