Archive Format¶
The archive is WebArc's central storage system; every other feature is built on top of it.
It allows you to store, replay, and analyze HTTP traffic in a reliable and structured way.
At a high level, an archive consists of three main components:
- Metadata database – structured information about every HTTP request and response
- HTTP object store – the actual bodies of requests and responses, stored efficiently as blobs
- Indexers – optional derived databases that provide extra metadata and analytics
Core¶
Metadata Database¶
- Stores the metadata for each HTTP transaction: URL, headers, status code, and timestamps.
- Acts as the index for the archive, letting WebArc quickly find the right snapshot or version of a resource.
- Decouples metadata from actual content so queries are fast and storage is efficient.
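As a rough illustration of the points above, here is a minimal sketch of what a metadata store could look like, using SQLite. The table name, column names, and helper functions (`open_metadata_db`, `record_transaction`, `latest_snapshot`) are assumptions made for this example and are not WebArc's actual schema or API.

```python
import json
import sqlite3

# Hypothetical schema: names are illustrative, not WebArc's real layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS transactions (
    id           INTEGER PRIMARY KEY,
    url          TEXT NOT NULL,
    method       TEXT NOT NULL,
    status       INTEGER NOT NULL,
    fetched_at   TEXT NOT NULL,   -- ISO 8601 UTC timestamp
    headers_json TEXT NOT NULL,   -- headers serialized as JSON
    body_hash    TEXT             -- key into the object store, NULL if no body
);
CREATE INDEX IF NOT EXISTS idx_url_time ON transactions (url, fetched_at);
"""


def open_metadata_db(path=":memory:"):
    """Open (or create) the metadata database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_transaction(conn, url, method, status, fetched_at, headers, body_hash):
    """Insert one transaction row; the body itself is stored elsewhere."""
    conn.execute(
        "INSERT INTO transactions "
        "(url, method, status, fetched_at, headers_json, body_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (url, method, status, fetched_at, json.dumps(headers), body_hash),
    )
    conn.commit()


def latest_snapshot(conn, url, not_after):
    """Return (fetched_at, status, body_hash) for the newest snapshot of `url`
    taken at or before `not_after`, or None if there is none."""
    return conn.execute(
        "SELECT fetched_at, status, body_hash FROM transactions "
        "WHERE url = ? AND fetched_at <= ? "
        "ORDER BY fetched_at DESC LIMIT 1",
        (url, not_after),
    ).fetchone()
```

Because each row carries only a `body_hash` rather than the body, a snapshot lookup like `latest_snapshot` touches a small indexed table and never reads body content.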
HTTP Object Store¶
- Stores the raw content (request and response bodies) separately from the metadata.
- Content is split into chunks for efficiency, deduplication, and easier retrieval.
- Each stored body is referenced by a hash in the metadata database.
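The chunking and hash-reference ideas above can be sketched with a small in-memory, content-addressed store. This is only an illustration under stated assumptions: the `ObjectStore` class, fixed-size chunking, and SHA-256 are choices made for the example, and the real store may chunk and hash differently.

```python
import hashlib


class ObjectStore:
    """Hypothetical in-memory object store with chunk-level deduplication."""

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size
        self.chunks = {}     # chunk hash -> chunk bytes
        self.manifests = {}  # body hash -> ordered list of chunk hashes

    def put(self, body: bytes) -> str:
        """Store a body and return the hash the metadata database would keep."""
        body_hash = hashlib.sha256(body).hexdigest()
        if body_hash in self.manifests:
            return body_hash  # identical body already stored
        chunk_hashes = []
        for start in range(0, len(body), self.chunk_size):
            chunk = body[start:start + self.chunk_size]
            chunk_hash = hashlib.sha256(chunk).hexdigest()
            # A chunk shared between bodies is stored only once.
            self.chunks.setdefault(chunk_hash, chunk)
            chunk_hashes.append(chunk_hash)
        self.manifests[body_hash] = chunk_hashes
        return body_hash

    def get(self, body_hash: str) -> bytes:
        """Reassemble a body from its chunks."""
        return b"".join(self.chunks[h] for h in self.manifests[body_hash])
```

In this sketch, two bodies that share a prefix aligned to the chunk size reuse the same stored chunks, which is where the deduplication benefit comes from.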
Indexers¶
- Optional databases built on top of the archive that derive additional data, such as full-text search indexes, domain statistics, or link graphs.
- They never modify the core archive; they only enhance its usability.
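To make the read-only, derived nature of an indexer concrete, here is a minimal sketch of a domain-statistics indexer. The function name `domain_stats` and its input (an iterable of URLs already read from the metadata database) are assumptions for illustration, not part of WebArc's interface.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_stats(urls):
    """Count archived transactions per host.

    `urls` is any iterable of URL strings read from the metadata database.
    The result is a new, derived structure; the input is never modified.
    """
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # lowercased host, or None if absent
        if host:
            counts[host] += 1
    return counts
```

Since the output is computed entirely from existing metadata, an index like this can be deleted and rebuilt at any time without affecting the core archive.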