README.md
April 16, 2026
Detect meaningful differences between web pages -- with Wayback Machine artifact cleaning, visual comparison, and significance scoring.
Why Wayback-Diff?
Comparing web pages sounds simple until you deal with Wayback Machine injection artifacts, insignificant whitespace noise, and visual regressions invisible to the DOM. Wayback-Diff is a purpose-built CLI that solves all three:
- Wayback Machine cleaning -- automatically strips banners, analytics scripts, playback code, and URL rewrites so you compare actual content.
- Significance scoring -- every change is tagged High, Medium, or Low so you focus on what matters.
- Multi-browser visual comparison -- captures screenshots in Chrome, Firefox, Edge, and Opera, then generates pixel-diff images.
- CI/CD-ready exit codes -- integrate directly into pipelines (0 = no changes, 1 = low/medium, 2 = high).
Table of Contents
- Quick Start
- Installation
- Usage
- Visual Comparison
- Markdown Reports
- CI/CD Integration
- How It Works
- Output Formats
- Comparison with Similar Tools
- Contributing
- License
Quick Start
pip install wayback-diff
# Compare two pages
wayback-diff https://example.com/old https://example.com/new
# Compare a Wayback snapshot with the live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Full report: visual diff + markdown
wayback-diff https://old.example.com https://new.example.com --visual --markdown
Installation
From PyPI
pip install wayback-diff
# With visual comparison support
pip install "wayback-diff[visual]"
From source
git clone https://github.com/GeiserX/Wayback-Diff.git
cd Wayback-Diff
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
For visual comparison support:
pip install -e ".[visual]"
Docker
docker build -t wayback-diff .
docker run --rm wayback-diff https://example.com/a https://example.com/b
Usage
Basic comparison
wayback-diff https://example.com/page1 https://example.com/page2
Wayback Machine support
The tool automatically detects Wayback Machine URLs and cleans injection artifacts before comparing:
# Archive vs. live site
wayback-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Two archive snapshots
wayback-diff \
https://web.archive.org/web/20230101/https://example.com/ \
https://web.archive.org/web/20230601/https://example.com/
Output formats
# Save to file
wayback-diff url1 url2 -o diff.txt
# JSON (for programmatic consumption)
wayback-diff url1 url2 --format json
# Unified diff
wayback-diff url1 url2 --format unified
Site-wide traversal
# Crawl and compare across linked pages (depth-limited)
wayback-diff url1 url2 --traverse --depth 2
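To illustrate what a depth limit means for traversal, here is a minimal sketch of a breadth-first walk over a link graph. It is not the tool's crawler: the function name `pages_within_depth` and the in-memory `links` mapping are hypothetical, and a real crawl would fetch pages and extract links instead.

```python
from collections import deque


def pages_within_depth(start, links, max_depth):
    """Return pages reachable from `start` in at most `max_depth` link hops.

    `links` is a hypothetical mapping of page -> list of linked pages,
    standing in for links extracted from fetched HTML.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


site = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}
print(pages_within_depth("/", site, 2))  # ['/', '/a', '/b', '/c']
```

With a depth of 2, `/d` is skipped because it sits three hops from the start page.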
Advanced options
| Flag | Description |
|---|---|
| --no-clean-wayback | Disable Wayback Machine artifact removal |
| --no-ignore-whitespace | Treat whitespace changes as significant |
| --timeout N | Set HTTP timeout in seconds (default: 30) |
| --verbose | Enable detailed logging |
Visual Comparison
Take screenshots in one or more browsers and generate side-by-side difference images:
# Auto-detect all installed browsers
wayback-diff url1 url2 --visual
# Specific browsers
wayback-diff url1 url2 --visual --browsers chrome firefox edge opera
# Custom viewport
wayback-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720
# Non-headless mode (for debugging)
wayback-diff url1 url2 --visual --no-headless
# Custom screenshot output
wayback-diff url1 url2 --visual --screenshot-dir ./my-screenshots
Visual comparison generates:
- Screenshots of both pages per browser
- Side-by-side comparison images
- Pixel-level difference highlighting (red overlay marks changes)
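The red-overlay idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the tool's implementation: images are modeled as nested lists of RGB tuples, and the name `pixel_diff` is hypothetical. A real pipeline would load screenshots with an imaging library.

```python
RED = (255, 0, 0)


def pixel_diff(before, after):
    """Compare two same-sized RGB grids.

    Returns (overlay, changed_ratio): the overlay copies `after` but paints
    every differing pixel red; the ratio is changed pixels / total pixels.
    """
    if len(before) != len(after) or any(
        len(a) != len(b) for a, b in zip(before, after)
    ):
        raise ValueError("images must have identical dimensions")

    overlay = []
    changed = 0
    for row_before, row_after in zip(before, after):
        out_row = []
        for old_px, new_px in zip(row_before, row_after):
            if old_px != new_px:
                out_row.append(RED)  # mark the change
                changed += 1
            else:
                out_row.append(new_px)
        overlay.append(out_row)

    total = sum(len(row) for row in before)
    return overlay, (changed / total if total else 0.0)
```

Exact equality is the simplest choice here. Screenshots from real browsers often differ by anti-aliasing noise, so a per-channel tolerance may be a better fit in practice.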
Markdown Reports
Generate comprehensive Markdown reports that include everything in a single reviewable document:
wayback-diff url1 url2 --visual --markdown --report-dir ./reports
Each report contains:
- Executive summary with change statistics
- Visual comparison screenshots (when --visual is used)
- Changes grouped by significance (High / Medium / Low)
- Site-wide results (when --traverse is used)
- Actionable recommendations
CI/CD Integration
Wayback-Diff returns meaningful exit codes designed for pipeline gates:
| Exit Code | Meaning |
|---|---|
| 0 | No differences detected |
| 1 | Low or medium significance changes |
| 2 | High significance changes detected |
GitHub Actions example
name: Visual Regression Check
on:
  pull_request:
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Wayback-Diff
        run: |
          pip install -r requirements.txt
          pip install -e ".[visual]"
      - name: Compare staging vs production
        run: |
          wayback-diff \
            https://staging.example.com \
            https://production.example.com \
            --visual --markdown --format json -o diff.json
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diff-report
          path: reports/
Shell script gate
wayback-diff "$OLD_URL" "$NEW_URL" --format json -o result.json
EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 2 ]; then
  echo "BLOCKING: high-significance changes detected"
  exit 1
elif [ "$EXIT_CODE" -eq 1 ]; then
  echo "WARNING: minor changes detected"
fi
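For pipelines that already run a Python step, the same gate could be written as follows. This is a minimal sketch: the helper names `gate_decision` and `run_gate` are hypothetical, and only the exit codes 0/1/2 and the `--format json -o` flags come from this README.

```python
import subprocess

# Exit-code meanings as documented in the exit code table.
DECISIONS = {
    0: "pass",   # no differences detected
    1: "warn",   # low or medium significance changes
    2: "block",  # high significance changes
}


def gate_decision(exit_code: int) -> str:
    """Map a wayback-diff exit code to a pipeline decision.

    Assumption: codes outside the documented 0/1/2 range are treated
    as errors, since the README does not describe them.
    """
    return DECISIONS.get(exit_code, "error")


def run_gate(old_url: str, new_url: str, output: str = "result.json") -> str:
    """Run wayback-diff and return the decision for its exit code."""
    result = subprocess.run(
        ["wayback-diff", old_url, new_url, "--format", "json", "-o", output],
        check=False,  # non-zero exit codes are expected results here
    )
    return gate_decision(result.returncode)
```

Keeping `gate_decision` separate from the subprocess call makes the mapping easy to test without installing the tool.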
How It Works
Wayback Machine cleaning
When a Wayback Machine URL is detected, the tool automatically:
- Removes header artifacts -- strips analytics scripts, playback scripts, and banner CSS injected by the Wayback Machine.
- Removes footer comments -- removes archival metadata and copyright notices.
- Restores URLs -- converts URLs prefixed with web.archive.org/web/…/ back to their originals.
- Normalizes content -- handles whitespace and formatting differences introduced by archival.
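The URL restoration step can be illustrated with a small regular expression. This is a sketch, not the tool's actual code: the pattern assumes the common Wayback form of a numeric timestamp with an optional two-letter modifier such as `im_`, and the names `restore_url` and `restore_urls_in_html` are hypothetical.

```python
import re

# Assumed Wayback prefix shape: optional archive host, "/web/",
# a 4-14 digit timestamp, an optional modifier like "im_" or "js_".
WAYBACK_PREFIX = re.compile(
    r"^(?:(?:https?:)?//web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/(?=https?://)"
)

# Only double-quoted href/src attributes are handled in this sketch.
URL_ATTR = re.compile(r'(href|src)="([^"]*)"')


def restore_url(url: str) -> str:
    """Strip a leading Wayback prefix; leave other URLs unchanged."""
    return WAYBACK_PREFIX.sub("", url, count=1)


def restore_urls_in_html(html: str) -> str:
    """Apply restore_url to every double-quoted href/src value."""
    return URL_ATTR.sub(
        lambda m: f'{m.group(1)}="{restore_url(m.group(2))}"', html
    )


print(restore_url("https://web.archive.org/web/20230101/https://example.com/"))
# https://example.com/
```

A regex over attribute strings is enough to show the idea. An HTML parser would be more robust for single-quoted attributes, `srcset`, or URLs inside inline CSS.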
Significance scoring
Every detected change is categorized:
| Level | Examples |
|---|---|
| High | Structural changes, content text, meta tags, scripts, stylesheets |
| Medium | Attribute changes, inline styling, div/span modifications |
| Low | Whitespace, comments, minor formatting |
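One way to picture the scoring is a lookup from change kind to level, with the process exit code derived from the highest level seen. The sketch below is illustrative only: the kind names (`"structure"`, `"inline_style"`, and so on) and the functions `classify` and `exit_code_for` are hypothetical, chosen to mirror the table and the documented exit codes.

```python
from enum import Enum


class Significance(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical change kinds, grouped as in the significance table.
KIND_TO_LEVEL = {
    "structure": Significance.HIGH,
    "text": Significance.HIGH,
    "meta": Significance.HIGH,
    "script": Significance.HIGH,
    "stylesheet": Significance.HIGH,
    "attribute": Significance.MEDIUM,
    "inline_style": Significance.MEDIUM,
    "container": Significance.MEDIUM,  # div/span modifications
    "whitespace": Significance.LOW,
    "comment": Significance.LOW,
    "formatting": Significance.LOW,
}


def classify(kind: str) -> Significance:
    # Assumption: unknown kinds default to MEDIUM so they stay visible
    # without blocking a pipeline.
    return KIND_TO_LEVEL.get(kind, Significance.MEDIUM)


def exit_code_for(kinds) -> int:
    """Derive the documented exit code (0, 1, or 2) from change kinds."""
    levels = {classify(kind) for kind in kinds}
    if not levels:
        return 0
    return 2 if Significance.HIGH in levels else 1
```

For example, `exit_code_for(["comment", "script"])` yields 2 because a single high-significance change outweighs any number of low ones.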
Intelligent comparison
The diff engine:
- Focuses on meaningful content changes
- Ignores noise like timestamps and auto-generated IDs
- Provides context around each change
- Groups results by significance for fast review
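The "ignore noise, keep context" behavior can be approximated with the standard library. The following is a minimal sketch, assuming whitespace is the only noise being filtered; `normalize` and `meaningful_diff` are hypothetical names, and the tool's own engine is HTML-aware in ways this line-based example is not.

```python
import difflib
import re


def normalize(line: str) -> str:
    """Collapse runs of whitespace so formatting-only edits compare equal."""
    return re.sub(r"\s+", " ", line).strip()


def meaningful_diff(old: str, new: str, context: int = 1) -> list:
    """Return unified-diff lines after whitespace normalization.

    Blank lines are dropped; `context` controls how many unchanged
    lines surround each change.
    """
    old_lines = [n for n in map(normalize, old.splitlines()) if n]
    new_lines = [n for n in map(normalize, new.splitlines()) if n]
    return list(
        difflib.unified_diff(
            old_lines, new_lines, "old", "new", n=context, lineterm=""
        )
    )


# Whitespace-only edits produce no diff at all.
print(meaningful_diff("<p>Hello   world</p>\n\n", "<p>Hello world</p>"))  # []
```

Normalizing before diffing, rather than filtering the diff afterwards, keeps the context lines clean as well.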
Output Formats
Text (default)
Summary statistics, significance breakdown, and detailed changes with context lines.
JSON
Structured output for programmatic processing:
{
  "summary": {
    "total_changes": 15,
    "added": 5,
    "removed": 3,
    "modified": 7,
    "high_significance": 2,
    "medium_significance": 8,
    "low_significance": 5
  },
  "changes": [
    {
      "type": "modified",
      "old_text": "...",
      "new_text": "...",
      "significance": "high"
    }
  ]
}
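A script consuming this output might look like the sketch below. It relies only on the field names shown in the example above. The helper names `changes_at_level` and `format_summary` are hypothetical, and the embedded sample is a trimmed copy of that example rather than real tool output.

```python
import json

# Trimmed copy of the example report; in practice, load the file written by
# `wayback-diff ... --format json -o result.json`.
SAMPLE = """
{
  "summary": {
    "total_changes": 15,
    "high_significance": 2,
    "medium_significance": 8,
    "low_significance": 5
  },
  "changes": [
    {"type": "modified", "old_text": "...", "new_text": "...",
     "significance": "high"}
  ]
}
"""


def changes_at_level(report: dict, level: str) -> list:
    """Return the change entries whose significance equals `level`."""
    return [c for c in report.get("changes", []) if c.get("significance") == level]


def format_summary(report: dict) -> str:
    """Build a one-line summary from the report's summary block."""
    s = report["summary"]
    return (
        f"{s['total_changes']} changes "
        f"({s['high_significance']} high, "
        f"{s['medium_significance']} medium, "
        f"{s['low_significance']} low)"
    )


report = json.loads(SAMPLE)
print(format_summary(report))  # 15 changes (2 high, 8 medium, 5 low)
```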
Unified diff
Standard unified diff format, compatible with patch and code review tools.
Comparison with Similar Tools
| Feature | Wayback-Diff | htmldiff | diff2html | BackstopJS | Percy |
|---|---|---|---|---|---|
| HTML-aware semantic diff | Yes | Yes | No | No | No |
| Wayback Machine artifact cleaning | Yes | No | No | No | No |
| Significance scoring | Yes | No | No | No | No |
| Visual (screenshot) comparison | Yes | No | No | Yes | Yes |
| Multi-browser support | Yes | N/A | N/A | Yes | Yes |
| Site-wide crawl and compare | Yes | No | No | Yes | No |
| Markdown report generation | Yes | No | No | No | No |
| CI/CD exit codes | Yes | No | No | Yes | Yes |
| Self-hosted / no SaaS | Yes | Yes | Yes | Yes | No |
| Free and open source | GPL-3.0 | MIT | MIT | MIT | Freemium |
Testing
pip install -r requirements-dev.txt
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=wayback_diff --cov-report=html
Contributing
Contributions are welcome. To get started:
- Fork the repository
- Create a feature branch (git checkout -b feature/my-feature)
- Add tests for new functionality
- Ensure all tests pass: pytest tests/ -v
- Submit a Pull Request
Related Projects
| Project | Description |
|---|---|
| Wayback-Archive | Download complete websites from the Wayback Machine with full asset preservation |
| Wayback-Diff | Intelligent web page comparison tool with Wayback Machine support |
| Way-CMS | Simple web CMS for editing HTML/CSS files downloaded from Wayback Archive |
| web-mirror | Mirror any webpage to a local server for offline access |
| media-download | Download all media files from any web page into a folder schema |
| n8n-nodes-way-cms | n8n community node for Way-CMS archived web content management |
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.
This software is not intended for commercial use.