Silesia Compression Corpus

September 2, 2018

The Silesia corpus is a set of files with different characteristics, used to test compression algorithms.
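As a rough illustration of how such a corpus might be used, the sketch below measures the compression ratio of every file in a directory with three codecs from the Python standard library. The directory name `silesia`, the function name `compression_ratios`, and the choice of codecs are assumptions made for this example, not part of the corpus itself; replace the codecs with whatever algorithm is being tested.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Standard-library codecs used purely as examples; swap in the algorithm under test.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_ratios(corpus_dir):
    """Return {file name: {codec name: compressed size / original size}}."""
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()  # reads the whole file into memory
        if not data:
            continue  # skip empty files to avoid dividing by zero
        results[path.name] = {
            name: len(compress(data)) / len(data)
            for name, compress in CODECS.items()
        }
    return results


if __name__ == "__main__":
    # "silesia" is an assumed local directory holding the unpacked corpus files.
    if Path("silesia").is_dir():
        for file_name, ratios in compression_ratios("silesia").items():
            print(file_name, {k: round(v, 3) for k, v in ratios.items()})
```

A ratio below 1.0 means the codec made the file smaller. Each file is read fully into memory, which is acceptable for files of this size but would need streaming for much larger inputs.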

It was once available at http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia but that page has recently been inaccessible.

Size (bytes)	File	Description
10,192,446	dickens	English novels, ASCII plain text
51,220,480	mozilla	Program, UNIX executables and others, tar
9,970,564	mr	3-D MRI image, DICOM
33,553,445	nci	Chemical database, text
6,152,192	ooffice	Windows DLL
10,085,684	osdb	Database, synthetic data, binary
6,627,202	reymont	Polish text, uncompressed PDF
21,606,400	samba	Source code and graphics, tar
7,251,944	sao	Database, star catalog, binary
41,458,703	webster	English dictionary, HTML
8,474,240	x-ray	16-bit grayscale, DICOM
5,345,280	xml	XML files, text, tar
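
Because the original download page may be unavailable, copies obtained elsewhere can be sanity-checked against the sizes in the table above. The following is a minimal sketch that assumes the sizes are in bytes and that the files sit unpacked under their listed names in one local directory. The names `EXPECTED_SIZES` and `check_sizes` are hypothetical helpers introduced for this example.

```python
from pathlib import Path

# Sizes copied from the table above, assumed to be in bytes.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}


def check_sizes(corpus_dir, expected=EXPECTED_SIZES):
    """Return {name: (expected size, actual size)} for every mismatch.

    The actual size is None when the file is missing.
    """
    problems = {}
    for name, size in expected.items():
        path = Path(corpus_dir) / name
        actual = path.stat().st_size if path.is_file() else None
        if actual != size:
            problems[name] = (size, actual)
    return problems
```

An empty result from `check_sizes("silesia")` means every file is present with the listed size. A matching size does not prove the contents are identical, so this is only a quick first check.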