WebArc

WebArc is a local-first web archiving system designed to capture, preserve, and replay HTTP content in an extensible way.

Unlike traditional crawlers that fetch pages in isolation, WebArc focuses on recording real HTTP traffic, storing it in an archive, and making that archive usable through multiple interfaces.

About

WebArc is a toolchain for:

-> Capturing HTTP(S) traffic into a persistent archive
-> Replaying archived content locally as if it were still online
-> Inspecting archived data at different abstraction levels
-> Integrating with existing tools and workflows

WebArc is local-first by design: archives live on your machine, and you control how they are created, served, and accessed.

Core Concepts

At a high level, WebArc is built around three ideas:

Capture: HTTP traffic is intercepted or fetched and written into an archive.
Archive: The archive is the authoritative record of requests, responses, and metadata.
Access: Archived content can be accessed in multiple ways:

-> Served over HTTP

-> Proxied to other tools

-> Mounted as a filesystem

-> Queried or processed programmatically

You don’t need to use every component: WebArc is intentionally modular.
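
To make the capture, archive, and access split concrete, here is a minimal Python sketch of the idea. It is not WebArc's implementation or storage format: the `Record` and `Archive` names, the SQLite schema, and the `capture`/`replay` methods are all hypothetical, chosen only to illustrate how a recorded exchange could be written once and looked up later without touching the network.

```python
import json
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One archived HTTP exchange (hypothetical shape, for illustration only)."""
    method: str
    url: str
    status: int
    headers: dict
    body: bytes


class Archive:
    """Toy archive backed by SQLite; an assumption, not WebArc's actual format."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records ("
            "method TEXT, url TEXT, status INTEGER, headers TEXT, body BLOB)"
        )

    def capture(self, record: Record) -> None:
        # Capture: write the exchange into the archive.
        self.db.execute(
            "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
            (record.method, record.url, record.status,
             json.dumps(record.headers), record.body),
        )
        self.db.commit()

    def replay(self, method: str, url: str) -> Optional[Record]:
        # Access: return the most recently captured response for this request,
        # or None if the request was never archived.
        row = self.db.execute(
            "SELECT method, url, status, headers, body FROM records "
            "WHERE method = ? AND url = ? ORDER BY rowid DESC LIMIT 1",
            (method, url),
        ).fetchone()
        if row is None:
            return None
        return Record(row[0], row[1], row[2], json.loads(row[3]), row[4])


if __name__ == "__main__":
    archive = Archive()
    archive.capture(Record("GET", "https://example.org/", 200,
                           {"Content-Type": "text/html"}, b"<h1>hi</h1>"))
    hit = archive.replay("GET", "https://example.org/")
    print(hit.status if hit else "not archived")
```

In this sketch the archive is the only source of truth: `replay` never contacts the origin server, which mirrors the idea of serving archived content as if it were still online. An HTTP server, a proxy, or a filesystem mount would each be a different front end over the same lookup.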