Silesia Compression Corpus

September 2, 2018

The Silesia corpus is a set of files with differing characteristics, used to test compression algorithms.

It was once available at http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia, but that page has recently been inaccessible.

| Size (bytes) | File | Description |
|---:|---|---|
| 10,192,446 | dickens | English novels, ASCII plain text |
| 51,220,480 | mozilla | Program, UNIX executables and others, tar |
| 9,970,564 | mr | 3-D MRI image, DICOM |
| 33,553,445 | nci | Chemical database, text |
| 6,152,192 | ooffice | Windows DLL |
| 10,085,684 | osdb | Database, synthetic data, binary |
| 6,627,202 | reymont | Polish text, uncompressed PDF |
| 21,606,400 | samba | Source code and graphics, tar |
| 7,251,944 | sao | Database, star catalog, binary |
| 41,458,703 | webster | English dictionary, HTML |
| 8,474,240 | x-ray | 16-bit grayscale, DICOM |
| 5,345,280 | xml | XML files, text, tar |
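
Since the original download page is unreachable, it can be useful to check a local copy against the sizes listed above before benchmarking with it. The following is a minimal Python sketch, not part of the corpus itself. The names `EXPECTED_SIZES`, `check_sizes`, `compression_ratio`, and the `corpus_dir` argument are hypothetical, and `zlib` from the standard library is only a stand-in for whichever compressor is under test.

```python
import zlib
from pathlib import Path

# Expected sizes in bytes, copied from the table above.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}


def check_sizes(corpus_dir):
    """Return {name: (expected, actual)} for files that are missing or differ in size.

    `actual` is None when the file does not exist in `corpus_dir`.
    """
    problems = {}
    for name, expected in EXPECTED_SIZES.items():
        path = Path(corpus_dir) / name
        actual = path.stat().st_size if path.is_file() else None
        if actual != expected:
            problems[name] = (expected, actual)
    return problems


def compression_ratio(data, level=6):
    """Original size divided by zlib-compressed size (assumed stand-in compressor)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(data) / len(zlib.compress(data, level))
```

An empty result from `check_sizes("path/to/silesia")` suggests that every file is present with the listed size. A matching size does not prove the contents are intact, so a checksum comparison would be a stronger test if trusted checksums are available.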