Silesia Compression Corpus
September 2, 2018
The Silesia corpus is a set of files with different characteristics, used to test compression algorithms.
It was once available at http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia, but that page has recently been inaccessible.
| Size (bytes) | File | Description |
|---|---|---|
| 10,192,446 | dickens | English novels, ASCII plain text |
| 51,220,480 | mozilla | Program, UNIX executables and others, tar |
| 9,970,564 | mr | 3-D MRI image, DICOM |
| 33,553,445 | nci | Chemical database, text |
| 6,152,192 | ooffice | Windows DLL |
| 10,085,684 | osdb | Database, synthetic data, binary |
| 6,627,202 | reymont | Polish text, uncompressed PDF |
| 21,606,400 | samba | Source code and graphics, tar |
| 7,251,944 | sao | Database, star catalog, binary |
| 41,458,703 | webster | English dictionary, HTML |
| 8,474,240 | x-ray | 16 bit grayscale, DICOM |
| 5,345,280 | xml | XML files, text, tar |
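Since the original download page is unreliable, it can be useful to confirm that a local copy of the corpus is complete. The following is a minimal sketch that compares file sizes against the table above. The function name `check_corpus` and the assumption that all twelve files sit uncompressed in a single directory are illustrative, not part of the corpus itself. A size match does not prove the contents are intact; it only catches missing or truncated files.

```python
import os

# Expected sizes in bytes, copied from the table above.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}


def check_corpus(directory, expected=EXPECTED_SIZES):
    """Return {file name: problem} for every missing or wrongly sized file.

    `directory` is assumed to hold the corpus files under their plain names,
    with no extensions. An empty result means every size matched.
    """
    problems = {}
    for name, size in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
            continue
        actual = os.path.getsize(path)
        if actual != size:
            problems[name] = f"expected {size} bytes, found {actual}"
    return problems
```

For example, `check_corpus("silesia")` (where `silesia` is a hypothetical directory name) should return an empty dict for a complete copy, and `sum(EXPECTED_SIZES.values())` gives the total size of the corpus, 211,938,580 bytes.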