This self-hosted web scraper lets me archive my favorite websites

2 weeks ago

The internet is not permanent. That's something most of us forget until a favorite blog goes offline, a forum shuts down or a decade-old article you'd bookmarked for research is no longer available. Or perhaps your favorite piece of writing gets tucked behind a paywall. Over the last few years, I've come to realize how fragile the internet is: pages break, hosts vanish, and you can never assume that your favorite piece of content will be available in perpetuity.

As someone who writes for a living, that makes me uneasy. The showcase of my work lives on the internet and I want my work to outlive decisions made by the changing hands of publishers. I've always wanted to retain a copy of the stories I've written, not just as links, but as fully functional snapshots. More importantly, I want these under my control, on my own server where I can manage and back them up, instead of in the cloud. That's why I've been looking for a self-hosted way to archive web pages. Something I could run on my NAS or home server. Something that didn't require a subscription or a dependency on a third-party service. That search led me to Sosse, a clean and efficient Selenium-based self-hosted web scraper that lets me save or archive any site in full HTML with just a few clicks.

Installing Sosse on a self-hosted setup

Quick to deploy with Docker and ready in minutes

Sosse running on a Synology NAS

With Docker and Portainer already installed on my NAS, they're what I naturally gravitate to when testing apps for writeups or for running them in production. Sosse offers a Docker image and a simple Compose file, which make it extremely easy to get up and running. In fact, it took me less than ten minutes to get it installed and online.

I started off by creating a local folder on my NAS to store archived pages, mounted the correct volumes using Portainer and hit the deploy button. That's it. The container spun up instantly with no fine-tuning or hand-holding needed. That lowers the barrier to entry dramatically for less technically inclined users. Once it was running, Sosse opened up to a clean welcome page on my local network. I've also configured port forwarding for remote access, but that's not essential. The excellent documentation provided by the developer makes it easy to get into more advanced workflows, and is a reminder of how approachable self-hosted software can be when it is built with quality documentation and ease of access in mind.

A minimal interface built for focused archiving

Clean design, smart defaults and just the right features

Sosse is a purpose-built tool designed to do one thing. You give it a URL, and it creates a full HTML snapshot of the page. But it doesn't stop at downloading the source. It can also rewrite internal links to ensure that they still work even when you're offline. It pulls in stylesheets and assets, so everything looks like the original. It even displays each archived page in a minimal layout that strips away most of the page chrome and just shows the content, much like the reader view built into browsers.
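To make the link rewriting idea concrete, here is a minimal Python sketch of how an archiver might point a page's links at local copies. It is not Sosse's actual code: the `archive/` folder layout, the `local_path` and `rewrite_links` helpers and the regex-based matching are all assumptions made for illustration, and a real tool would use a proper HTML parser.

```python
import re
from urllib.parse import urljoin, urlparse


def local_path(url: str) -> str:
    """Map an absolute URL to a hypothetical path inside the archive folder."""
    parts = urlparse(url)
    path = parts.path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    return f"archive/{parts.netloc}/{path}"


def rewrite_links(html: str, page_url: str, archived: set[str]) -> str:
    """Point href/src attributes at local copies when the target was archived."""

    def replace(match: re.Match) -> str:
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, target)
        if absolute in archived:
            target = local_path(absolute)
        return f"{attr}={quote}{target}{quote}"

    # Simplified attribute matching; assumes quoted href/src values.
    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', replace, html)
```

Links to pages that were never archived are left untouched, so they keep pointing at the live site.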

I like the fact that Sosse doesn't try to do too much, like tacking on a browser extension or a complicated database tool. It's much simpler: a minimal layout lets you paste a URL, hit save and get a local archive that works whether or not the original page survives. The interface is clean, responsive and easy to navigate. You can organize entries by tags, effectively creating a folder-like hierarchy. You can even set browser agents, schedules and change detection to keep pages updated. It's great.
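Change detection of this kind can be pictured as comparing a fingerprint of the page between scheduled visits. The sketch below is only one plausible approach, not a description of how Sosse does it; the `fingerprint` and `has_changed` names and the whitespace normalization step are assumptions.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Hash the page after collapsing whitespace, so reformatting alone is ignored."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, html: str) -> bool:
    """Return True when the freshly fetched page differs from the stored snapshot."""
    return previous_fingerprint != fingerprint(html)
```

A scheduler would store the fingerprint alongside each snapshot and save a new copy only when `has_changed` returns True.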

Using Sosse day to day

A reliable archive that fits into your workflow

Sosse document management

Sosse has quickly become a part of my weekly workflow. I use it to save and archive my published articles the day they go live. I also use it to archive technical documentation that might go offline. Every time I hit save, I know I'm getting a working copy that won't disappear or turn into broken links a few months down the line and that gives me a lot of peace of mind.

Elsewhere, the search interface is fast and, using a combination of search and tags, it is trivial to find articles very quickly. Each saved entry loads up instantly and pages feel native. In fact, I haven't come across any broken stylesheets yet, which tends to be a common issue with such tools. Sosse also handles more complicated webpages just fine. Whether it's a minimalist blog or a JavaScript-heavy site, Sosse can capture it nearly perfectly, making it a great choice. And if you need an added backup, you can configure it to save screenshots alongside.
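For readers curious how search and tags narrow things down together, here is a small Python sketch of the idea. The `Entry` record and `find` function are hypothetical and unrelated to Sosse's internals, which presumably rely on a real search index rather than a linear scan.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """A hypothetical archived page with free text and a set of tags."""
    url: str
    title: str
    text: str
    tags: set[str] = field(default_factory=set)


def find(entries, query="", tags=frozenset()):
    """Return entries that carry every requested tag and contain every query word."""
    words = query.lower().split()
    results = []
    for entry in entries:
        haystack = f"{entry.title} {entry.text}".lower()
        if set(tags) <= entry.tags and all(word in haystack for word in words):
            results.append(entry)
    return results
```

Calling `find(entries, "docker nas", tags={"published"})` would return only entries tagged `published` whose title or text mentions both words.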

Why Sosse stands out from other web scrapers

There are dozens of tools that let you save web content, but not all of them offer the combination of a simple interface, quick install and self-hosting. Sosse keeps it simple. It supports multiple users, making it easy to deploy across your home or office, and there are no analytics. If you care about the content you read, write or share, there's value in keeping a copy offline. Sosse makes that not just possible but seamless. For me, it's already become a quiet but essential part of how I use the internet.
