README.md

April 16, 2026

Wayback-Diff

Detect meaningful differences between web pages -- with Wayback Machine artifact cleaning, visual comparison, and significance scoring.



Why Wayback-Diff?

Comparing web pages sounds simple until you deal with Wayback Machine injection artifacts, insignificant whitespace noise, and visual regressions invisible to the DOM. Wayback-Diff is a purpose-built CLI that addresses all three and reports the result through pipeline-friendly exit codes:

  • Wayback Machine cleaning -- automatically strips banners, analytics scripts, playback code, and URL rewrites so you compare actual content.
  • Significance scoring -- every change is tagged High, Medium, or Low so you focus on what matters.
  • Multi-browser visual comparison -- captures screenshots in Chrome, Firefox, Edge, and Opera, then generates pixel-diff images.
  • CI/CD-ready exit codes -- integrate directly into pipelines (0 = no changes, 1 = low/medium, 2 = high).


Quick Start

pip install wayback-diff

# Compare two pages
wayback-diff https://example.com/old https://example.com/new

# Compare a Wayback snapshot with the live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/

# Full report: visual diff + markdown
wayback-diff https://old.example.com https://new.example.com --visual --markdown

Installation

From PyPI

pip install wayback-diff

# With visual comparison support
pip install "wayback-diff[visual]"   # quoted so shells like zsh do not expand the brackets

From source

git clone https://github.com/GeiserX/Wayback-Diff.git
cd Wayback-Diff
python3 -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

For visual comparison support:

pip install -e ".[visual]"

Docker

docker build -t wayback-diff .
docker run --rm wayback-diff https://example.com/a https://example.com/b

Usage

Basic comparison

wayback-diff https://example.com/page1 https://example.com/page2

Wayback Machine support

The tool automatically detects Wayback Machine URLs and cleans injection artifacts before comparing:

# Archive vs. live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/

# Two archive snapshots
wayback-diff \
  https://web.archive.org/web/20230101/https://example.com/ \
  https://web.archive.org/web/20230601/https://example.com/

Output formats

# Save to file
wayback-diff url1 url2 -o diff.txt

# JSON (for programmatic consumption)
wayback-diff url1 url2 --format json

# Unified diff
wayback-diff url1 url2 --format unified

Site-wide traversal

# Crawl and compare across linked pages (depth-limited)
wayback-diff url1 url2 --traverse --depth 2

Advanced options

| Flag | Description |
|------|-------------|
| --no-clean-wayback | Disable Wayback Machine artifact removal |
| --no-ignore-whitespace | Treat whitespace changes as significant |
| --timeout N | Set HTTP timeout in seconds (default: 30) |
| --verbose | Enable detailed logging |

Visual Comparison

Take screenshots in one or more browsers and generate side-by-side difference images:

# Auto-detect all installed browsers
wayback-diff url1 url2 --visual

# Specific browsers
wayback-diff url1 url2 --visual --browsers chrome firefox edge opera

# Custom viewport
wayback-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720

# Non-headless mode (for debugging)
wayback-diff url1 url2 --visual --no-headless

# Custom screenshot output
wayback-diff url1 url2 --visual --screenshot-dir ./my-screenshots

Visual comparison generates:

  • Screenshots of both pages per browser
  • Side-by-side comparison images
  • Pixel-level difference highlighting (red overlay marks changes)
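
The red-overlay idea in the last bullet can be illustrated with a minimal pure-Python sketch. It is not Wayback-Diff's implementation: `diff_overlay`, the `threshold` parameter, and the representation of an image as a grid of RGB tuples are assumptions chosen so the example runs without an imaging library.

```python
RED = (255, 0, 0)

def diff_overlay(old_pixels, new_pixels, threshold=0):
    """Mark every pixel that differs between two equally sized RGB grids.

    Returns (overlay, changed_count). Changed pixels are painted red;
    unchanged pixels keep their value from the new image.
    """
    same_shape = len(old_pixels) == len(new_pixels) and all(
        len(r1) == len(r2) for r1, r2 in zip(old_pixels, new_pixels)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")

    overlay, changed = [], 0
    for old_row, new_row in zip(old_pixels, new_pixels):
        out_row = []
        for old_px, new_px in zip(old_row, new_row):
            # Largest per-channel difference between the two pixels.
            delta = max(abs(a - b) for a, b in zip(old_px, new_px))
            if delta > threshold:
                out_row.append(RED)
                changed += 1
            else:
                out_row.append(new_px)
        overlay.append(out_row)
    return overlay, changed
```

A non-zero `threshold` lets small rendering differences (for example anti-aliasing) pass without being flagged; whether the tool applies such a tolerance is not stated here.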

Markdown Reports

Generate a Markdown report that collects the comparison results in a single reviewable document:

wayback-diff url1 url2 --visual --markdown --report-dir ./reports

Each report contains:

  • Executive summary with change statistics
  • Visual comparison screenshots (when --visual is used)
  • Changes grouped by significance (High / Medium / Low)
  • Site-wide results (when --traverse is used)
  • Actionable recommendations

CI/CD Integration

Wayback-Diff returns meaningful exit codes designed for pipeline gates:

| Exit Code | Meaning |
|-----------|---------|
| 0 | No differences detected |
| 1 | Low or medium significance changes |
| 2 | High significance changes detected |

GitHub Actions example

name: Visual Regression Check
on:
  pull_request:

jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Wayback-Diff
        run: pip install "wayback-diff[visual]"

      - name: Compare staging vs production
        run: |
          wayback-diff \
            https://staging.example.com \
            https://production.example.com \
            --visual --markdown --format json -o diff.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: reports/

Shell script gate

wayback-diff "$OLD_URL" "$NEW_URL" --format json -o result.json
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "BLOCKING: high-significance changes detected"
  exit 1
elif [ $EXIT_CODE -eq 1 ]; then
  echo "WARNING: minor changes detected"
fi
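
The same gate can be sketched in Python for pipelines that already run Python steps. The names `should_block` and `run_gate` are hypothetical, and treating any exit code outside the documented 0-2 range as blocking is an assumption, not documented behavior.

```python
import subprocess

# Exit-code meanings as documented in the CI/CD Integration table.
EXIT_MEANINGS = {
    0: "no differences detected",
    1: "low or medium significance changes",
    2: "high significance changes detected",
}

def should_block(exit_code: int) -> bool:
    """Return True when the pipeline should fail.

    Assumption: codes 0 and 1 pass; code 2 and any undocumented code
    (for example a crash) block the pipeline.
    """
    return exit_code not in (0, 1)

def run_gate(old_url: str, new_url: str) -> int:
    """Run wayback-diff and translate its exit code into 0 (pass) or 1 (block)."""
    result = subprocess.run(["wayback-diff", old_url, new_url])
    meaning = EXIT_MEANINGS.get(result.returncode, "unexpected exit code")
    print(f"wayback-diff exited with {result.returncode}: {meaning}")
    return 1 if should_block(result.returncode) else 0
```

Calling `sys.exit(run_gate(old_url, new_url))` from a small script would then fail the job only on high-significance changes or tool errors.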

How It Works

Wayback Machine cleaning

When a Wayback Machine URL is detected, the tool automatically:

  1. Removes header artifacts -- strips analytics scripts, playback scripts, and banner CSS injected by the Wayback Machine.
  2. Removes footer comments -- strips the archival metadata and copyright notices appended to the page.
  3. Restores URLs -- converts web.archive.org/web/…/ prefixed URLs back to their originals.
  4. Normalizes content -- handles whitespace and formatting differences introduced by archival.
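
Step 3 can be illustrated with a short regex sketch. This is not the tool's actual code: `WAYBACK_PREFIX`, `restore_urls`, and `is_wayback_url` are hypothetical names, and the pattern assumes a 4 to 14 digit timestamp optionally followed by a two-letter modifier such as `im_` or `js_`.

```python
import re

# Matches prefixes such as
#   https://web.archive.org/web/20230101/
#   //web.archive.org/web/20230101000000im_/
# Assumption: 4-14 digit timestamp, optional two-letter modifier ("im_", "js_", ...).
WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?//web\.archive\.org/web/\d{4,14}(?:[a-z]{2}_)?/",
    re.IGNORECASE,
)

def is_wayback_url(url: str) -> bool:
    """Return True if the URL starts with a Wayback Machine snapshot prefix."""
    return WAYBACK_PREFIX.match(url) is not None

def restore_urls(html: str) -> str:
    """Strip Wayback Machine prefixes so rewritten URLs point at their originals."""
    return WAYBACK_PREFIX.sub("", html)
```

A blanket substitution like this would also rewrite links that the original page deliberately pointed at web.archive.org, and it does not handle root-relative `/web/…/` links, so a real cleaner likely needs more context than this sketch uses.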

Significance scoring

Every detected change is categorized:

| Level | Examples |
|-------|----------|
| High | Structural changes, content text, meta tags, scripts, stylesheets |
| Medium | Attribute changes, inline styling, div/span modifications |
| Low | Whitespace, comments, minor formatting |
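
As a rough illustration of how these levels might be assigned, the sketch below classifies a pair of HTML fragments with the standard-library `html.parser`. The rules are a simplification inferred from the table, not the tool's scoring logic; `classify`, `_Outline`, and `HIGH_IMPACT_TAGS` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: attribute changes on these tags count as High, per the table.
HIGH_IMPACT_TAGS = {"meta", "script", "link", "style"}

class _Outline(HTMLParser):
    """Collects tag names, attributes, and text; comments are ignored."""

    def __init__(self):
        super().__init__()
        self.tags, self.attrs, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self.attrs.append(dict(attrs))

    def handle_data(self, data):
        self.text.append(data)

def _outline(fragment: str):
    parser = _Outline()
    parser.feed(fragment)
    parser.close()
    # Collapse whitespace so formatting-only edits do not change the text.
    text = " ".join("".join(parser.text).split())
    return parser.tags, parser.attrs, text

def classify(old: str, new: str) -> str:
    """Return 'high', 'medium', or 'low' for a changed pair of fragments."""
    old_tags, old_attrs, old_text = _outline(old)
    new_tags, new_attrs, new_text = _outline(new)
    if old_tags != new_tags or old_text != new_text:
        return "high"  # structure or content text changed
    for tag, a, b in zip(old_tags, old_attrs, new_attrs):
        if a != b:
            return "high" if tag in HIGH_IMPACT_TAGS else "medium"
    return "low"  # only whitespace, comments, or formatting differ
```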

Intelligent comparison

The diff engine:

  • Focuses on meaningful content changes
  • Ignores noise like timestamps and auto-generated IDs
  • Provides context around each change
  • Groups results by significance for fast review
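
One way to get this behavior is to normalize noisy tokens before diffing. The sketch below uses the standard-library `difflib`; the timestamp and generated-ID patterns are illustrative assumptions, and `normalize` and `meaningful_diff` are hypothetical names rather than part of Wayback-Diff's API.

```python
import difflib
import re

# Assumed noise patterns: ISO-like timestamps and ids ending in 4+ digits.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")
GENERATED_ID = re.compile(r'id="[A-Za-z_-]*\d{4,}"')

def normalize(line: str) -> str:
    """Replace noisy tokens with placeholders and collapse whitespace."""
    line = TIMESTAMP.sub("<TIMESTAMP>", line)
    line = GENERATED_ID.sub('id="<ID>"', line)
    return " ".join(line.split())

def meaningful_diff(old: str, new: str, context: int = 2) -> list:
    """Unified diff of the normalized lines, with `context` lines around each change."""
    a = [normalize(line) for line in old.splitlines()]
    b = [normalize(line) for line in new.splitlines()]
    return list(difflib.unified_diff(a, b, "old", "new", n=context, lineterm=""))
```

Two pages that differ only in a timestamp or a generated ID produce an empty diff, while a real text change still appears with its surrounding context lines.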

Output Formats

Text (default)

Summary statistics, significance breakdown, and detailed changes with context lines.

JSON

Structured output for programmatic processing:

{
  "summary": {
    "total_changes": 15,
    "added": 5,
    "removed": 3,
    "modified": 7,
    "high_significance": 2,
    "medium_significance": 8,
    "low_significance": 5
  },
  "changes": [
    {
      "type": "modified",
      "old_text": "...",
      "new_text": "...",
      "significance": "high"
    }
  ]
}
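
A consumer of this output might look like the following sketch. It relies only on the fields shown in the example above; `summarize` and `high_significance_changes` are hypothetical helper names, and `SAMPLE` is a trimmed copy of that example.

```python
import json

# Trimmed copy of the example output above, used as sample input.
SAMPLE = """
{
  "summary": {"total_changes": 15, "added": 5, "removed": 3, "modified": 7,
              "high_significance": 2, "medium_significance": 8, "low_significance": 5},
  "changes": [
    {"type": "modified", "old_text": "...", "new_text": "...", "significance": "high"}
  ]
}
"""

def summarize(report: dict) -> str:
    """Build a one-line summary from the report's "summary" object."""
    s = report["summary"]
    return (
        f'{s["total_changes"]} changes '
        f'({s["high_significance"]} high, {s["medium_significance"]} medium, '
        f'{s["low_significance"]} low)'
    )

def high_significance_changes(report: dict) -> list:
    """Return only the change entries tagged with high significance."""
    return [c for c in report.get("changes", []) if c.get("significance") == "high"]

report = json.loads(SAMPLE)
print(summarize(report))
```

For a file written with `-o diff.json`, replacing `json.loads(SAMPLE)` with `json.load(open("diff.json"))` should work, assuming the file has the structure shown.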

Unified diff

Standard unified diff format, compatible with patch and code review tools.


Comparison with Similar Tools

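| Feature | Wayback-Diff | htmldiff | diff2html | BackstopJS | Percy |
|---------|--------------|----------|-----------|------------|-------|
| HTML-aware semantic diff | Yes | Yes | No | No | No |
| Wayback Machine artifact cleaning | Yes | No | No | No | No |
| Significance scoring | Yes | No | No | No | No |
| Visual (screenshot) comparison | Yes | No | No | Yes | Yes |
| Multi-browser support | Yes | N/A | N/A | Yes | Yes |
| Site-wide crawl and compare | Yes | No | No | Yes | No |
| Markdown report generation | Yes | No | No | No | No |
| CI/CD exit codes | Yes | No | No | Yes | Yes |
| Self-hosted / no SaaS | Yes | Yes | Yes | Yes | No |
| Free and open source | GPL-3.0 | MIT | MIT | MIT | Freemium |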

Testing

pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=wayback_diff --cov-report=html

Contributing

Contributions are welcome. To get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Add tests for new functionality
  4. Ensure all tests pass: pytest tests/ -v
  5. Submit a Pull Request

| Project | Description |
|---------|-------------|
| Wayback-Archive | Download complete websites from the Wayback Machine with full asset preservation |
| Wayback-Diff | Intelligent web page comparison tool with Wayback Machine support |
| Way-CMS | Simple web CMS for editing HTML/CSS files downloaded from Wayback Archive |
| web-mirror | Mirror any webpage to a local server for offline access |
| media-download | Download all media files from any web page into a folder schema |
| n8n-nodes-way-cmsn | n8n community node for Way-CMS archived web content management |

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.

This software is not intended for commercial use.