Configuration

WebArc is configurable via a YAML file. The configuration controls how archives are stored, refreshed, and captured, down to individual domains and URL paths.

Top-Level Options

route_internal: bool

route_internal: true
  • Rewrite links inside archived pages to point back to the local archive.
  • Prevents browsers from accidentally fetching live content.

enable_fetch: bool

enable_fetch: true
  • Enable on-demand fetching of missing resources.
  • Useful for hybrid capture/replay setups.

Requests Section

The requests section controls what gets captured and how:

requests:
  blacklisted_domains: [...]
  global: {...}
  domains: [...]

It defines:

  1. Global defaults for all domains
  2. Domain-specific rules
  3. Path-specific overrides

Blacklisted Domains

blacklisted_domains:
  - "^gitlab"
  - "youtube"
  • Domains matching these regexes are never fetched or archived.
  • Useful for dynamic, private, or otherwise problematic sites.
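How a pattern is applied to a domain is not spelled out here. The sketch below assumes an unanchored regex search against the hostname, which is what the mix of "^gitlab" (anchored) and "youtube" (unanchored) suggests. The function name is_blacklisted is hypothetical and not part of WebArc.

```python
import re

# Patterns copied from the example above.
BLACKLISTED_DOMAINS = ("^gitlab", "youtube")


def is_blacklisted(host: str, patterns=BLACKLISTED_DOMAINS) -> bool:
    """Return True if any pattern matches somewhere in the hostname.

    Assumption: search semantics, i.e. a pattern matches anywhere in the
    string unless it anchors itself with ^ or $.
    """
    return any(re.search(p, host) for p in patterns)


print(is_blacklisted("gitlab.com"))       # "^gitlab" matches at the start
print(is_blacklisted("www.youtube.com"))  # "youtube" matches anywhere
print(is_blacklisted("my.gitlab.io"))     # "^gitlab" does not match mid-string
print(is_blacklisted("example.com"))      # no pattern matches
```

Under that assumption, an unanchored pattern such as "youtube" also catches subdomains, while "^gitlab" only catches hostnames that begin with "gitlab".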

Global Defaults

global:
  outdated: "30month"
  keep_n: 5

Global defaults apply everywhere unless overridden. The available options are listed in the RequestConfigValues section below.

Domain Configuration

domains:
  - domain: "example.com"
    global:
      outdated: "1d"
      keep_n: 5
  • domain — the domain this config applies to
  • global — overrides global defaults for this domain (see RequestConfigValues)
  • path_match — optional path-specific rules

WebArc uses cascading rules:

Path-specific > Domain-global > Global defaults
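One way to read the cascade is as a layered merge in which more specific layers overwrite individual keys. The following is a minimal Python sketch, not WebArc's implementation. It assumes a per-key merge, exact matching of the domain field, and regex search for path rules. The CONFIG dict and the resolve function are illustrative names.

```python
import re

# Hypothetical in-memory form of the YAML config, using values from this page.
CONFIG = {
    "global": {"outdated": "30month", "keep_n": 5},
    "domains": [
        {
            "domain": "example.com",
            "global": {"outdated": "1d"},
            "path_match": [
                {"path": "watch", "apply": {"drop": True}},
            ],
        }
    ],
}


def resolve(config: dict, host: str, path: str) -> dict:
    """Merge settings from least to most specific; later layers win."""
    effective = dict(config.get("global", {}))        # 1. global defaults
    for entry in config.get("domains", []):
        if entry["domain"] != host:                   # assumption: exact match
            continue
        effective.update(entry.get("global", {}))     # 2. domain-global
        for rule in entry.get("path_match", []):
            if re.search(rule["path"], path):         # assumption: regex search
                effective.update(rule["apply"])       # 3. path-specific
    return effective


print(resolve(CONFIG, "other.org", "/"))
print(resolve(CONFIG, "example.com", "/about"))
print(resolve(CONFIG, "example.com", "/watch?v=1"))
```

In this sketch a key that a more specific layer does not set (keep_n for example.com) falls through from the layer above it.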

Path-Specific Overrides

path_match:
  - path: "watch"
    apply:
      drop: true
  • path — regex to match the URL path
  • apply — RequestConfigValues applied to matching requests

This lets you handle dynamic pages, skip irrelevant content, or force re-fetches.

RequestConfigValues

RequestConfigValues is the set of options that can be applied globally, per-domain, or per-path.

Option        Type             Description
outdated      duration string  How long before a cached response is considered stale (e.g., 10d, 30month)
keep_n        integer          Number of snapshots to retain per resource
always_fetch  bool             If true, always fetch fresh content; never return from archive
drop          bool             If true, skip this resource entirely; never fetch or store
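To make the interaction between these options concrete, here is a rough Python sketch of how a single request might be handled. It rests on several assumptions that this page does not confirm: the only duration units are the ones that appear in the examples (s, d, month), a month is approximated as 30 days, drop takes precedence over always_fetch, and a stale or missing snapshot triggers a fetch. The names parse_duration and decide are hypothetical.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed unit table: only "s", "d" and "month" appear on this page.
# "month" is approximated as 30 days; WebArc may define it differently.
UNIT_SECONDS = {"s": 1, "d": 86_400, "month": 30 * 86_400}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "10s", "1d" or "30month"."""
    m = re.fullmatch(r"(\d+)([a-z]+)", text)
    if m is None or m.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=int(m.group(1)) * UNIT_SECONDS[m.group(2)])


def decide(values: dict, age: Optional[timedelta]) -> str:
    """Return "skip", "fetch" or "serve" for one request.

    age is the age of the newest snapshot, or None if nothing is archived.
    """
    if values.get("drop"):
        return "skip"                    # never fetch or store
    if values.get("always_fetch") or age is None:
        return "fetch"                   # bypass the archive, or nothing stored yet
    if age > parse_duration(values["outdated"]):
        return "fetch"                   # newest snapshot is stale
    return "serve"                       # newest snapshot is fresh enough


print(decide({"outdated": "10d"}, timedelta(days=3)))
print(decide({"outdated": "10d"}, timedelta(days=11)))
```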

Examples:

  • Global default:
global:
  outdated: "30month"
  keep_n: 5
  • Domain override:
domains:
  - domain: "example.com"
    global:
      outdated: "1d"
      keep_n: 5
  • Path-specific override:
path_match:
  - path: "watch"
    apply:
      drop: true

Real-World Example: Arch Linux Mirror

- domain: "geo.mirror.pkgbuild.com"
  global:
    outdated: "10d"
    keep_n: 3

  path_match:
    - path: "\\.db(\\.sig)?$"
      apply:
        always_fetch: true
        outdated: "10s"
        keep_n: 3
  • Regular files: considered stale after 10 days, keep 3 snapshots
  • DB files (.db and .db.sig): always fetched, keep 3 snapshots
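The path pattern is easy to misread because of the doubled backslashes. In a double-quoted YAML string, "\\." is an escaped backslash followed by a dot, so the regex that reaches the matcher is \.db(\.sig)?$. The Python check below shows which paths that regex selects. The sample paths are made up for illustration, and regex search against the URL path is an assumption.

```python
import re

# The double-quoted YAML string "\\.db(\\.sig)?$" unescapes to this regex:
# a literal ".db", an optional ".sig", then end of string.
DB_PATTERN = re.compile(r"\.db(\.sig)?$")

# Hypothetical paths on the mirror, for illustration only.
SAMPLE_PATHS = [
    "/core/os/x86_64/core.db",
    "/core/os/x86_64/core.db.sig",
    "/core/os/x86_64/core.db.tar.gz",
    "/core/os/x86_64/linux-6.9.1-1-x86_64.pkg.tar.zst",
]


def is_db_file(path: str) -> bool:
    """True if the path rule from the example would apply to this path."""
    return DB_PATTERN.search(path) is not None


for path in SAMPLE_PATHS:
    label = "path rule" if is_db_file(path) else "domain defaults"
    print(f"{label:15}  {path}")
```

Because of the trailing $, core.db.tar.gz is not treated as a DB file and falls back to the domain-level settings.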