README.md

April 16, 2026

Wayback-Archive

Download complete websites from the Wayback Machine for offline viewing.


Wayback-Archive is a Python tool that downloads archived websites from the Wayback Machine and reconstructs them for fully functional offline viewing. It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.

Quick Start

# Install
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
pip install -r config/requirements.txt

# Run
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
python3 -m wayback_archive.cli

# Preview
cd output && python3 -m http.server 8000
# Open http://localhost:8000

Features

Core

  • Full website download -- HTML, CSS, JS, images, fonts, and all linked assets
  • Recursive link discovery -- Automatically follows links in HTML, CSS, and JS files
  • Smart URL rewriting -- Converts all links to relative paths for local serving
  • Timeframe fallback -- Searches nearby Wayback Machine timestamps when a resource returns 404
  • Real-time progress logging -- Displays download status and file processing as it happens
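
The timeframe fallback can be pictured as generating alternative snapshot URLs around the requested timestamp. The sketch below is illustrative only: the function names and the day offsets are assumptions, and the search strategy the tool actually uses is not described here. It relies only on the `YYYYMMDDhhmmss` timestamp format visible in Wayback Machine URLs.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

WAYBACK_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp used in Wayback URLs


def nearby_timestamps(timestamp: str, day_offsets: Iterable[int] = (1, 7, 30)) -> List[str]:
    """Return timestamps before and after `timestamp`, closest offsets first.

    The offsets are hypothetical defaults chosen for illustration.
    """
    base = datetime.strptime(timestamp, WAYBACK_FORMAT)
    candidates = []
    for days in day_offsets:
        for sign in (-1, 1):
            shifted = base + sign * timedelta(days=days)
            candidates.append(shifted.strftime(WAYBACK_FORMAT))
    return candidates


def candidate_urls(timestamp: str, original_url: str,
                   day_offsets: Iterable[int] = (1, 7, 30)) -> List[str]:
    """Build Wayback Machine URLs to retry after a 404 on the original snapshot."""
    return [
        f"https://web.archive.org/web/{ts}/{original_url}"
        for ts in nearby_timestamps(timestamp, day_offsets)
    ]
```

A caller would try each candidate in order and stop at the first one that returns the resource.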

Asset Handling

  • Google Fonts support -- Downloads Google Fonts CSS and font files locally, fixing CORS issues
  • Font corruption detection -- Identifies and removes corrupted font files (HTML error pages served as fonts)
  • CDN fallback -- Automatic fallback to CDN for critical libraries (e.g., jQuery) when Wayback Machine fails
  • Data attribute processing -- Processes data-* attributes containing URLs (videos, images, etc.)
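
One way to detect an HTML error page served as a font is to compare the first bytes of the download against known font signatures. This is a minimal sketch of that idea, not the tool's actual check; the function names are hypothetical, and EOT files are not covered because their signature does not sit at the start of the file.

```python
# Magic numbers that open valid font files.
FONT_SIGNATURES = (
    b"wOFF",              # WOFF
    b"wOF2",              # WOFF2
    b"\x00\x01\x00\x00",  # TrueType
    b"OTTO",              # OpenType with CFF outlines
    b"true",              # legacy Apple TrueType
    b"ttcf",              # TrueType/OpenType collection
)


def looks_like_font(data: bytes) -> bool:
    """Return True if the payload starts with a known font signature."""
    return data[:4] in FONT_SIGNATURES


def looks_like_html(data: bytes) -> bool:
    """Return True if the payload looks like an HTML document, such as an error page."""
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))
```

A downloaded file with a font extension that fails `looks_like_font` and passes `looks_like_html` is a strong candidate for removal.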

Preservation

  • Icon group preservation -- Preserves all links in icon groups (social media, contact icons)
  • Button link preservation -- Maintains styling and functionality of button links
  • Cookie consent preservation -- Keeps cookie consent popups and functionality intact

Optimization

  • HTML minification -- Uses minify-html (Python 3.14+ compatible)
  • JS/CSS minification -- Optional JavaScript and CSS minification via rjsmin and cssmin
  • Image compression -- Optional image optimization with Pillow
  • Tracker/ad removal -- Strips analytics and ads, and can optionally remove external iframes
  • Link cleanup -- Configurable external link removal with anchor preservation options
  • www/non-www normalization -- Normalizes domain variations automatically

Why Wayback-Archive?

| Capability | Wayback-Archive | wget | httrack |
|---|---|---|---|
| Wayback Machine URL rewriting | Yes | No | No |
| Wayback artifact cleanup | Yes | No | No |
| Timeframe fallback for 404s | Yes | No | No |
| Google Fonts localization | Yes | No | No |
| Font corruption detection | Yes | No | No |
| CDN fallback | Yes | No | No |
| HTML/CSS/JS minification | Yes | No | No |
| Tracker and ad removal | Yes | No | No |
| data-* attribute processing | Yes | No | No |

General-purpose tools like wget --mirror or httrack can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.
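
To illustrate what "understanding Wayback Machine URL structures" involves, the sketch below splits a snapshot URL into its timestamp, optional resource modifier (such as `cs_` or `im_`), and original URL. The regular expression and the names `WaybackURL` and `parse_wayback_url` are assumptions made for this example, not the tool's internal API.

```python
import re
from typing import NamedTuple, Optional

# https://web.archive.org/web/<timestamp>[<modifier>]/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{1,14})"      # up to 14 digits: YYYYMMDDhhmmss
    r"(?P<modifier>[a-z]{2}_)?"     # optional modifier, e.g. "cs_", "js_", "im_"
    r"/(?P<original>.+)$"
)


class WaybackURL(NamedTuple):
    timestamp: str
    modifier: Optional[str]
    original: str


def parse_wayback_url(url: str) -> Optional[WaybackURL]:
    """Split a Wayback Machine URL into its parts, or return None if it does not match."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return WaybackURL(match["timestamp"], match["modifier"], match["original"])
```

The `original` field is what gets rewritten into a local path; the timestamp and modifier are archive artifacts that should not survive in the offline copy.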

Installation

Prerequisites

  • Python 3.8 or higher
  • pip

From Source

git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive

# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

pip install -r config/requirements.txt

As a Package

cd Wayback-Archive
pip install -e .
wayback-archive  # Available as a CLI command after installation

Configuration

All options are set via environment variables. You can also use a .env file.
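
Since every option is an environment variable, boolean settings arrive as strings. The helper below sketches one plausible way to read them with a default; `env_bool` and the set of accepted truthy strings are assumptions, and the parsing in the tool's own config.py may differ.

```python
import os
from typing import Mapping

# Strings treated as "true" (an assumption made for this sketch).
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean option from the environment, falling back to `default` when unset or empty."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES
```

For example, `env_bool("REMOVE_CLICKABLE_CONTACTS", True)` returns `False` after `export REMOVE_CLICKABLE_CONTACTS="false"`.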

Required

| Variable | Description |
|---|---|
| WAYBACK_URL | The Wayback Machine URL to download |

Output

| Variable | Default | Description |
|---|---|---|
| OUTPUT_DIR | ./output | Output directory for downloaded files |

Optimization

| Variable | Default | Description |
|---|---|---|
| OPTIMIZE_HTML | true | Minify HTML |
| OPTIMIZE_IMAGES | false | Compress images |
| MINIFY_JS | false | Minify JavaScript |
| MINIFY_CSS | false | Minify CSS |

Content Removal

| Variable | Default | Description |
|---|---|---|
| REMOVE_TRACKERS | true | Remove analytics and trackers |
| REMOVE_ADS | true | Remove advertisements |
| REMOVE_CLICKABLE_CONTACTS | true | Remove tel: and mailto: links |
| REMOVE_EXTERNAL_IFRAMES | false | Remove external iframes |

| Variable | Default | Description |
|---|---|---|
| REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS | true | Remove external links, keep anchor text |
| REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS | false | Remove external links and anchor elements |
| MAKE_INTERNAL_LINKS_RELATIVE | true | Convert internal links to relative paths |

Domain

| Variable | Default | Description |
|---|---|---|
| MAKE_NON_WWW | true | Convert www to non-www |
| MAKE_WWW | false | Convert non-www to www |
| KEEP_REDIRECTIONS | false | Keep redirect pages |
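
As a rough illustration of what MAKE_NON_WWW and MAKE_WWW do to a URL, here is a sketch built on the standard library. `normalize_host` is a hypothetical helper, and letting the non-www option win when both flags are set is an assumption of this example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host(url: str, make_non_www: bool = True, make_www: bool = False) -> str:
    """Add or strip a leading "www." on the host, leaving the rest of the URL unchanged.

    The host is lowercased and any user:password@ prefix is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if make_non_www and host.startswith("www."):
        host = host[len("www."):]
    elif make_www and host and not host.startswith("www."):
        host = "www." + host
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Normalizing both variants to one host means `http://www.example.com/a` and `http://example.com/a` resolve to the same local file.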

Testing

| Variable | Default | Description |
|---|---|---|
| MAX_FILES | unlimited | Limit number of files to download |

Usage

macOS / Linux

export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export OUTPUT_DIR="./my_website"
export REMOVE_CLICKABLE_CONTACTS="false"  # Keep email/phone links

python3 -m wayback_archive.cli

Windows (PowerShell)

$env:WAYBACK_URL = "https://web.archive.org/web/20250417203037/http://example.com/"
$env:OUTPUT_DIR = ".\my_website"
$env:REMOVE_CLICKABLE_CONTACTS = "false"

python -m wayback_archive.cli

Windows (CMD)

set WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/
set OUTPUT_DIR=.\my_website
set REMOVE_CLICKABLE_CONTACTS=false

python -m wayback_archive.cli

Quick Test

Download a limited number of files to verify everything works:

export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export MAX_FILES=5
python3 -m wayback_archive.cli

How It Works

  1. Initial download -- Fetches the main page from the Wayback Machine
  2. Link extraction -- Parses HTML to find all referenced assets (links, images, CSS, JS)
  3. CSS processing -- Extracts font URLs, background images, and @import statements; downloads Google Fonts locally; detects corrupted font files
  4. JS processing -- Extracts dynamically loaded resources from JavaScript
  5. Data attributes -- Scans data-* attributes for additional asset URLs
  6. Iterative crawling -- Continues discovering and downloading resources until the queue is empty
  7. Timeframe fallback -- For 404 responses, searches nearby Wayback Machine timestamps
  8. URL rewriting -- Converts all URLs to relative paths for offline serving
  9. Preservation -- Maintains icon groups, button links, and cookie consent functionality
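
The URL rewriting step (step 8) comes down to computing a path from the current page to each target. The sketch below shows one way to do that for two URLs on the same site. `relative_link` is a hypothetical helper; mapping directory URLs to `index.html` and dropping query strings are assumptions of this example, not documented behavior.

```python
import posixpath
from urllib.parse import urlsplit


def relative_link(page_url: str, target_url: str) -> str:
    """Return a relative path from the page at `page_url` to `target_url`.

    Both URLs are assumed to belong to the same site. Query strings are dropped.
    """
    page_path = urlsplit(page_url).path or "/"
    target = urlsplit(target_url)
    target_path = target.path or "/"
    # Assumption: a directory URL is stored locally as <dir>/index.html.
    if target_path.endswith("/"):
        target_path += "index.html"
    page_dir = posixpath.dirname(page_path)
    rel = posixpath.relpath(target_path, start=page_dir)
    if target.fragment:
        rel += "#" + target.fragment
    return rel
```

For a page at `/blog/post.html`, a stylesheet at `/css/site.css` becomes `../css/site.css`, which works from any local directory or static file server.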

Project Structure

Wayback-Archive/
  wayback_archive/          # Main package
    __init__.py
    __main__.py
    cli.py                  # CLI entry point
    config.py               # Environment variable configuration
    downloader.py           # Core download and processing engine
  config/
    requirements.txt        # Runtime dependencies
    requirements-dev.txt    # Development dependencies
    setup.py                # Package setup
    pytest.ini              # Test configuration
  tests/                    # Test suite
  docs/                     # Documentation
  LICENSE                   # GPL-3.0
  README.md

Testing

pip install -r config/requirements-dev.txt

# Run tests
pytest

# Run tests with coverage
pytest --cov=wayback_archive

Troubleshooting

Port Already in Use

python3 -m http.server 8080  # Use a different port

Font Loading Issues

  • Google Fonts: Downloaded automatically to avoid CORS issues
  • Corrupted fonts: Detected and removed from CSS automatically
  • Missing fonts: Some fonts may not exist in the Wayback Machine archive

See Font Loading Research Notes for details.

  • Icon groups (social media, contacts) are preserved automatically
  • Button links with sppb-btn or btn classes are preserved
  • Set REMOVE_CLICKABLE_CONTACTS=false to keep tel: and mailto: links

jQuery or Libraries Not Loading

The tool includes automatic CDN fallback for critical libraries. If a file fails to download from the Wayback Machine, it will attempt to fetch it from a CDN.
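
As an illustration of how such a fallback might pick a replacement, the sketch below maps a failed jQuery file name to the matching file on code.jquery.com. Which CDN the tool actually uses, and how it matches libraries, is not stated here; `jquery_cdn_url` and its regular expression are assumptions.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches names such as "jquery-3.6.0.min.js" or "jquery.1.12.4.js".
JQUERY_RE = re.compile(r"jquery[-.](\d+\.\d+\.\d+)(\.min)?\.js$", re.IGNORECASE)


def jquery_cdn_url(failed_url: str) -> Optional[str]:
    """Return a code.jquery.com URL for a versioned jQuery file, or None if no version is found."""
    path = urlsplit(failed_url).path  # ignore any query string such as ?ver=5
    match = JQUERY_RE.search(path)
    if match is None:
        return None
    version = match.group(1)
    minified = (match.group(2) or "").lower()
    return f"https://code.jquery.com/jquery-{version}{minified}.js"
```

Files without a recognizable version in their name return `None`, so the caller can skip the fallback rather than guess a version.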

Dependencies

| Package | Purpose |
|---|---|
| requests | HTTP client |
| beautifulsoup4 | HTML parsing |
| lxml | Fast HTML/XML parser |
| minify-html | HTML minification |
| cssmin | CSS minification |
| rjsmin | JS minification |
| Pillow | Image optimization |
| python-dotenv | .env file support |

Contributing

Contributions are welcome. Please feel free to submit a Pull Request.

| Project | Description |
|---|---|
| Wayback-Diff | Intelligent web page comparison tool with Wayback Machine support |
| Website-Diff | Intelligent web page comparison tool with visual regression testing |
| Way-CMS | Simple web CMS for editing HTML/CSS files downloaded from Wayback Archive |
| web-mirror | Mirror any webpage to a local server for offline access |
| media-download | Download all media files from any web page into a folder schema |
| n8n-nodes-way-cms | n8n community node for Way-CMS archived web content management |

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).