curl-based HTTP mirroring script

January 1, 2026

This is a script that creatively uses the curl CLI to download an HTTP resource (colloquially a "file"). It saves time & bandwidth whenever possible, but never at the expense of correctness.

  • Compares ETags to make sure that an unchanged resource is not transferred again, but a changed resource always is.
  • Requests a CE-coded (a.k.a. compressed, e.g. gzipped) representation of the resource, falling back to the "regular" one.
  • Supports continuation using conditional requests. In contrast to using curl's -C - flag on its own, this works with CE-coded responses and falls back to a "full body" request.
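As a rough illustration of the list above, the two kinds of curl invocations that appear in the demo logs further down could be assembled like this. This is only a sketch: `buildCurlArgs`, `paths` and `storedEtag` are hypothetical names chosen for this example, not the script's actual API.

```javascript
// Hypothetical helper: assembles the curl arguments seen in the demo logs.
// `paths` and `storedEtag` are illustrative names, not part of mirror.mjs.
function buildCurlArgs(url, paths, storedEtag = null) {
  const args = [
    url,
    '-f', // exit with an error on HTTP status >= 400
    '-L', // follow redirects
    '-s', '-S', // no progress meter, but still print errors
    '-H', 'Accept-Encoding: gzip', // ask for a CE-coded representation
    '-D', paths.headersPath, // dump the response headers to a file
    '-o', paths.rawDestPath, // write the (possibly compressed) body
  ];
  if (storedEtag === null) {
    // "regular" download: let curl store the ETag for later runs
    args.push('--etag-save', paths.etagPath);
  } else {
    // continuation: resume at the local file's size, but only if the
    // server's ETag still matches (otherwise it sends the full body)
    args.push('-C', '-', '-H', `If-Range: ${storedEtag}`);
  }
  return args;
}
```

The resulting array can be passed to something like `child_process.spawnSync('curl', args)`. Keeping the ETag in the `If-Range` header is what makes the server decide between a partial and a full response.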

People asked me: Why not use wget for this?

  • wget does not store the resource's ETag, so it cannot compare it when re-requesting.
  • The -c (continuation) and -N (timestamping using Last-Modified) flags don't work together.
  • I once witnessed a subtle but significant bug in the -c (continuation) implementation. Unfortunately, I can't remember the details anymore.

server side

For all of the above features to work, you need a server that supports

  • serving pre-compressed sidecar files (i.e. a statically compressed file next to the original) as CE-coded;
  • range requests, both on the "regular" file as well as the pre-compressed one;
  • conditional requests, specifically If-Range with ETags.

For testing purposes, we create a test file:

yes | head -n 50000000 >/var/www/y.txt
gzip -k /var/www/y.txt
ls -lh /var/www
# -rw-r--r--   1 j  staff    95M Aug  8 16:54 y.txt
# -rw-r--r--   1 j  staff    95K Aug  8 16:54 y.txt.gz

Caddy

A recently fixed bug with a wrong ETag aside, Caddy v2 does this with the following Caddyfile:

localhost:8080 {
	root * /var/www
	file_server browse {
		precompressed gzip
	}
}

nginx

I couldn't find much about this topic, but a response on the mailing list reads as if nginx does not support range requests on pre-compressed files, because there are (non-trivial) problems with range requests on dynamically compressed responses. 😔 Quoting the response:

Note also that it's impossible to ungzip a response part if you have not preceding parts from the very start.

This as well applies to many other types of data.

The main problem with Content-Encoding and ranges is that one somehow should be able to reproduce exactly the same entity-body (or at least make sure cache validators would change on entity-body change). This is not something trivial when you compress on the fly with possible different compression options.

I personally think that moving towards using Transfer-Encoding would be a good step for "on the fly" compression. But browser support seems to be not here at all.

TLDR: The following nginx config file enables every aspect except range requests on pre-compressed files:

server {
	listen 80 default_server;
	listen [::]:80 default_server;
	server_name _;

	root /var/www;
	gzip_static on;
	gzip_vary on;

	location / {
		try_files $uri $uri/ =404;
	}
}

usage

The following script

  1. downloads the resource into a temp file (/tmp/mirror-${sha256(url)}) and stores the ETag & response headers next to it (/tmp/mirror-${sha256(url)}.etag & /tmp/mirror-${sha256(url)}-${randomHex()}.headers),
  2. if applicable, decompresses the file (into /tmp/mirror-${sha256(url)}-${randomHex()}.decompressed),
  3. copies the decompressed file to the actual destination path (in order to work atomically).
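The path scheme from the steps above might be derived roughly as follows. This is a sketch under assumptions: `derivePaths` is a hypothetical helper, and the truncation of the hash to 10 hex characters and the 6-character random suffix are read off the demo output below rather than confirmed from the script. It uses ES module syntax, so it is meant to be saved as an `.mjs` file.

```javascript
import { createHash, randomBytes } from 'node:crypto';

// Hypothetical helper, not part of mirror.mjs. The hash length (10 hex chars)
// and the random suffix length (6 hex chars) are assumptions based on the
// demo output.
function derivePaths(url, tmpDir = '/tmp') {
  // the same URL always maps to the same raw download file, so that an
  // aborted download can be found again and continued
  const urlHash = createHash('sha256').update(url).digest('hex').slice(0, 10);
  const base = `${tmpDir}/mirror-${urlHash}`;
  // per-run files get a random suffix
  const randomHex = () => randomBytes(3).toString('hex');
  return {
    rawDestPath: base,
    etagPath: `${base}.etag`,
    headersPath: `${base}-${randomHex()}.headers`,
    decompressedPath: `${base}-${randomHex()}.decompressed`,
  };
}
```

Only the raw download and its ETag need stable names, because they are the state that has to survive between runs.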

demo

To demonstrate that it works as intended, we abort a download midway and then resume it:

export LOG_LEVEL=debug

./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 does not exist
# /tmp/mirror-15c86ece76 does not exist, downloading "regularly" & saving ETag
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 --etag-save /tmp/mirror-15c86ece76.etag { stdio: [ 'ignore', 'inherit', 'inherit' ] }

# we abort the download half way:
^C
ls -lh /tmp/mirror-15c86ece76
# -rw-r--r--   1 j  staff    40M Aug  9 15:50 mirror-15c86ece76

# and then continue it by re-running the script:
./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "rg7r1m6dvsyb" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl exited { status: 0, … }
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-153154.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!

# we check if the file has been downloaded correctly:
shasum /var/www/y.txt y.txt
f1f40059b87621eca87321c4436747d75ecaebbf  /var/www/y.txt
f1f40059b87621eca87321c4436747d75ecaebbf  y.txt

Now that we have downloaded the file, let's emulate the file changing on the server by changing the ETag stored locally:

echo '"foo"' >/tmp/mirror-15c86ece76.etag

./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "foo" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.
# curl exited { status: 33, … }
# file download couldn't be continued, server responded with 200 & full body; starting "regular" download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 --etag-save /tmp/mirror-15c86ece76.etag { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl exited { status: 0, … }
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-7cb151.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!

The script has requested a full "regular" (re-)download because the If-Range condition did not match: the local ETag differs from the server's, so the server responded with 200 and the full body instead of a partial response.
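How the outcome of a continuation attempt maps to the next action, as far as it can be read from the logs in this demo, can be sketched like this. `nextStepAfterContinuation` and the returned action names are made up for illustration; the curl exit statuses 22 (HTTP error with `-f`) and 33 (range error) are the ones visible in the logs.

```javascript
// Hypothetical decision helper, not part of mirror.mjs; the action names
// are invented for this example.
function nextStepAfterContinuation(curlExitStatus, httpStatus) {
  // curl succeeded: the remaining bytes were appended to the local file
  if (curlExitStatus === 0) return 'process-download';
  // curl could not resume: If-Range did not match, so the server answered
  // with 200 & the full body; start over with a "regular" download
  if (curlExitStatus === 33) return 'restart-regular-download';
  // 416 Range Not Satisfiable: possibly nothing left to fetch, so compare
  // the server-reported size with the local file size
  if (curlExitStatus === 22 && httpStatus === 416) return 'verify-size';
  // anything else is treated as a genuine failure
  return 'fail';
}
```

For the run above, `nextStepAfterContinuation(33, 200)` yields `'restart-regular-download'`.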

If we re-run it without changing the ETag again, it will refrain from re-downloading the file:

./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "rg7r1m6dvsyb" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl: (22) The requested URL returned error: 416
# curl exited { status: 22, … }
# server-reported size 32601485
# downloaded size 32601485
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-fd6f2d.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!
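In this last run the server answers with 416, and the log then compares a server-reported size with the downloaded size. One plausible way to implement that check is to read the total size from the `Content-Range: bytes */<size>` header that servers commonly send with a 416 response. The sketch below assumes that this header is present in the dump written by `curl -D`; both function names are hypothetical.

```javascript
// Parses a header dump as written by `curl -D` and returns the status and
// headers of the last response (with -L, the dump may contain one block
// per redirect hop).
function parseLastResponseHeaders(dump) {
  const blocks = dump.split(/\r?\n\r?\n/).filter((b) => b.trim() !== '');
  const last = blocks[blocks.length - 1] ?? '';
  const [statusLine, ...lines] = last.split(/\r?\n/);
  const status = Number(statusLine.split(' ')[1]);
  const headers = {};
  for (const line of lines) {
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    // header names are case-insensitive, so normalize them
    headers[line.slice(0, idx).trim().toLowerCase()] = line.slice(idx + 1).trim();
  }
  return { status, headers };
}

// Hypothetical check: true if the server answered 416 and reported a total
// size equal to the size of the local file.
function isFullyDownloaded(dump, localSize) {
  const { status, headers } = parseLastResponseHeaders(dump);
  if (status !== 416) return false;
  const match = /^bytes \*\/(\d+)$/.exec(headers['content-range'] ?? '');
  return match !== null && Number(match[1]) === localSize;
}
```

If the header is missing or the sizes differ, the function returns `false`, and a caller would have to treat the local file as incomplete or stale.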