Archive Format

The archive is the central storage system behind all WebArc functionality.
It lets you store, replay, and analyze HTTP traffic in a reliable, structured way.

At a high level, an archive consists of three main components:

  1. Metadata database – structured information about every HTTP request and response
  2. HTTP object store – the actual bodies of requests and responses, stored efficiently as blobs
  3. Indexers – optional derived databases that provide extra metadata and analytics

Core

Metadata Database

  • Stores all information about each HTTP transaction (URL, headers, status codes, timestamps).
  • Acts as the index for the archive, letting WebArc quickly find the right snapshot or version of a resource.
  • Keeps metadata separate from body content, so queries stay fast and bodies can be stored efficiently.
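
To make the role of the metadata database concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and the helpers `open_metadata_db`, `record_transaction`, and `find_snapshot` are hypothetical and chosen for illustration only; they are not WebArc's actual schema or API. The sketch shows one way an index over (URL, timestamp) can locate the right snapshot without touching any body data.

```python
import json
import sqlite3

# Hypothetical schema: names are illustrative, not WebArc's real layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS transactions (
    id          INTEGER PRIMARY KEY,
    url         TEXT NOT NULL,
    status      INTEGER NOT NULL,
    fetched_at  INTEGER NOT NULL,   -- Unix timestamp of the capture
    headers     TEXT NOT NULL,      -- headers serialized as JSON
    body_hash   TEXT                -- key into the object store, NULL if no body
);
CREATE INDEX IF NOT EXISTS idx_url_time ON transactions (url, fetched_at);
"""


def open_metadata_db(path=":memory:"):
    """Open (or create) the metadata database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_transaction(conn, url, status, fetched_at, headers, body_hash):
    """Store the metadata of one HTTP transaction; the body itself lives elsewhere."""
    conn.execute(
        "INSERT INTO transactions (url, status, fetched_at, headers, body_hash) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, status, fetched_at, json.dumps(headers), body_hash),
    )
    conn.commit()


def find_snapshot(conn, url, at_or_before):
    """Return the newest capture of `url` taken at or before the given timestamp."""
    row = conn.execute(
        "SELECT status, fetched_at, headers, body_hash FROM transactions "
        "WHERE url = ? AND fetched_at <= ? "
        "ORDER BY fetched_at DESC LIMIT 1",
        (url, at_or_before),
    ).fetchone()
    if row is None:
        return None
    status, fetched_at, headers, body_hash = row
    return {
        "status": status,
        "fetched_at": fetched_at,
        "headers": json.loads(headers),
        "body_hash": body_hash,
    }
```

In this sketch a lookup returns only metadata plus a `body_hash`; fetching the body is a separate step against the object store, which is what keeps metadata queries cheap.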

HTTP Object Store

  • Stores the raw content (request and response bodies) separately from the metadata.
  • Content is split into chunks for efficiency, deduplication, and easier retrieval.
  • Each stored body is referenced by a hash in the metadata database.
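
The chunking and deduplication described above can be sketched as a small in-memory, content-addressed store. This is an illustration under stated assumptions, not WebArc's implementation: the `ObjectStore` class, fixed-size chunking, the 64 KiB default, and the use of SHA-256 are all choices made for the example.

```python
import hashlib


class ObjectStore:
    """Hypothetical in-memory blob store with fixed-size chunks and deduplication."""

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size
        self.chunks = {}     # chunk hash -> chunk bytes (each distinct chunk stored once)
        self.manifests = {}  # body hash  -> ordered list of chunk hashes

    def put(self, body: bytes) -> str:
        """Store a body and return the hash the metadata database would reference."""
        body_hash = hashlib.sha256(body).hexdigest()
        if body_hash in self.manifests:
            return body_hash  # identical body already stored
        chunk_hashes = []
        for offset in range(0, len(body), self.chunk_size):
            chunk = body[offset:offset + self.chunk_size]
            chunk_hash = hashlib.sha256(chunk).hexdigest()
            # setdefault keeps the existing copy if this chunk was seen before.
            self.chunks.setdefault(chunk_hash, chunk)
            chunk_hashes.append(chunk_hash)
        self.manifests[body_hash] = chunk_hashes
        return body_hash

    def get(self, body_hash: str) -> bytes:
        """Reassemble a body from its chunks, in order."""
        return b"".join(self.chunks[h] for h in self.manifests[body_hash])
```

With this layout, two bodies that share leading chunks store those chunks only once, and a repeated `put` of the same body returns the same hash without writing anything new.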

Indexers

  • Optional databases built on top of the archive to derive additional insights, like full-text search, domain stats, or link graphs.
  • They never modify the core archive; they only enhance its usability.
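
As a rough illustration of an indexer, the sketch below derives per-domain statistics from metadata records. The function name `build_domain_stats` and the record shape (dicts with `url` and `status` keys) are assumptions made for this example. The point is that the indexer only reads the records it is given and writes its results into a separate structure.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_domain_stats(transactions):
    """Derive per-domain request counts and status code tallies.

    `transactions` is any iterable of dicts with 'url' and 'status' keys
    (a hypothetical record shape). The input is only read, never modified.
    """
    stats = {}
    for tx in transactions:
        host = urlsplit(tx["url"]).hostname or ""
        entry = stats.setdefault(host, {"requests": 0, "statuses": Counter()})
        entry["requests"] += 1
        entry["statuses"][tx["status"]] += 1
    return stats
```

Because the result is built entirely from existing metadata, a derived index like this can be deleted and rebuilt at any time without affecting the archive itself.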