Archive Format

The archive is WebArc's central storage system; all other functionality is built on top of it.
It allows you to store, replay, and analyze HTTP traffic in a reliable and structured way.

At a high level, an archive consists of three main components:

Metadata database – structured information about every HTTP request and response
HTTP object store – the actual bodies of requests and responses, stored efficiently as blobs
Indexers – optional derived databases that provide extra metadata and analytics

Core

Metadata Database

Stores all information about each HTTP transaction (URL, headers, status codes, timestamps).
Acts as the index for the archive, letting WebArc quickly find the right snapshot or version of a resource.
Decouples metadata from actual content so queries are fast and storage is efficient.
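
As a rough illustration of how a metadata database can act as the index, here is a minimal sketch using SQLite. The table name `transactions`, its columns, and the helper functions are assumptions made for this example, not WebArc's actual schema or API. The point is only that a lookup touches metadata rows and returns a body hash, without reading any body content.

```python
import sqlite3

# Hypothetical schema: one row per HTTP transaction. Bodies are not stored
# here; each row only carries a hash that points into the object store.
SCHEMA = """
CREATE TABLE IF NOT EXISTS transactions (
    id           INTEGER PRIMARY KEY,
    url          TEXT NOT NULL,
    method       TEXT NOT NULL,
    status       INTEGER NOT NULL,
    fetched_at   TEXT NOT NULL,   -- ISO 8601 UTC timestamp, sorts as text
    headers_json TEXT NOT NULL,   -- response headers serialized as JSON
    body_hash    TEXT             -- key into the object store, NULL if no body
);
CREATE INDEX IF NOT EXISTS idx_url_time ON transactions (url, fetched_at);
"""


def open_metadata_db(path=":memory:"):
    """Open (or create) the metadata database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record(conn, url, method, status, fetched_at, headers_json, body_hash):
    """Insert one transaction row."""
    conn.execute(
        "INSERT INTO transactions "
        "(url, method, status, fetched_at, headers_json, body_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (url, method, status, fetched_at, headers_json, body_hash),
    )
    conn.commit()


def latest_snapshot(conn, url, not_after):
    """Return (fetched_at, status, body_hash) for the newest capture of `url`
    taken at or before `not_after`, or None if there is no such capture."""
    return conn.execute(
        "SELECT fetched_at, status, body_hash FROM transactions "
        "WHERE url = ? AND fetched_at <= ? "
        "ORDER BY fetched_at DESC LIMIT 1",
        (url, not_after),
    ).fetchone()
```

With an index on `(url, fetched_at)`, finding the right version of a resource is a single indexed query; the body itself is fetched separately using the returned hash.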

HTTP Object Store

Stores the raw content (request and response bodies) separately from the metadata.
Content is split into chunks for efficiency, deduplication, and easier retrieval.
Each stored body is referenced by a hash in the metadata database.
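
The chunking and hash-referencing described above can be sketched as a small in-memory content-addressed store. Everything here is an assumption for illustration: the `ObjectStore` class, fixed-size chunking, and SHA-256 are not confirmed details of WebArc, which may use a different chunking strategy or hash function.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: bodies are split into fixed-size chunks,
    and identical chunks are stored only once."""

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size
        self.chunks = {}     # chunk hash -> chunk bytes
        self.manifests = {}  # body hash  -> ordered list of chunk hashes

    def put(self, body: bytes) -> str:
        """Store a body and return the hash the metadata database would keep."""
        body_hash = hashlib.sha256(body).hexdigest()
        if body_hash not in self.manifests:
            chunk_hashes = []
            for start in range(0, len(body), self.chunk_size):
                chunk = body[start:start + self.chunk_size]
                chunk_hash = hashlib.sha256(chunk).hexdigest()
                # setdefault keeps the existing copy if this chunk is a duplicate.
                self.chunks.setdefault(chunk_hash, chunk)
                chunk_hashes.append(chunk_hash)
            self.manifests[body_hash] = chunk_hashes
        return body_hash

    def get(self, body_hash: str) -> bytes:
        """Reassemble a body from its chunks, in order."""
        return b"".join(self.chunks[h] for h in self.manifests[body_hash])
```

Storing two bodies that share a prefix shows the deduplication effect: with `chunk_size=4`, the bodies `b"aaaabbbbcccc"` and `b"aaaabbbbdddd"` occupy four stored chunks rather than six, because the `aaaa` and `bbbb` chunks are shared.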

Indexers

Optional databases built on top of the archive to derive additional insights, like full-text search, domain stats, or link graphs.
They never modify the core archive; they only enhance its usability.
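
To make the "derived, never modifying" relationship concrete, here is a hedged sketch of a domain-stats indexer. It assumes a hypothetical core table named `transactions` with a `url` column and writes its results to a separate database; the function name, table names, and the choice of SQLite are illustrative, not WebArc's real indexer interface.

```python
import sqlite3
from collections import Counter
from urllib.parse import urlsplit


def build_domain_stats(core, index):
    """Rebuild a per-domain capture count in `index` from rows in `core`.

    `core` is only ever queried with SELECT; all writes go to `index`,
    so the derived table can be dropped and rebuilt at any time.
    """
    counts = Counter()
    for (url,) in core.execute("SELECT url FROM transactions"):
        host = urlsplit(url).hostname
        if host:  # skip rows whose URL has no parseable host
            counts[host] += 1

    index.execute("DROP TABLE IF EXISTS domain_stats")
    index.execute(
        "CREATE TABLE domain_stats ("
        "domain TEXT PRIMARY KEY, captures INTEGER NOT NULL)"
    )
    index.executemany("INSERT INTO domain_stats VALUES (?, ?)", counts.items())
    index.commit()
    return counts
```

Because the indexer's output lives in its own database, deleting it loses nothing: running the function again reproduces the same table from the core archive.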