MSO-Dumper
January 26, 2026
A comprehensive set of tools for analyzing and dumping Microsoft Office file formats.
Description
MSO-Dumper is a package for analyzing and dumping various Microsoft Office file formats, including binary document formats such as DOC, XLS, and PPT, and graphics formats such as EMF and WMF. It prints a detailed structural analysis of each file and can extract content from it.
Author Information
- Authors: See https://github.com/LibreOffice/mso-dumper/graphs/contributors
- Email: libreoffice@lists.freedesktop.org
- License: Mozilla Public License 2.0
Installation
python setup.py install
Tools and Usage
Document Format Dumpers
ppt-dump.py - PowerPoint File Dumper
Analyzes and dumps PowerPoint (.ppt) binary format files.
./ppt-dump.py [options] [ppt file]
Options:
- --help - displays help message
- --no-struct-output - suppress normal structure analysis output
- --dump-text - extract and print textual content
- --no-raw-dumps - suppress raw hex dumps of uninterpreted areas
- --id-select=id1[,id2 ...] - limit output to selected record IDs
Examples:
./ppt-dump.py presentation.ppt
./ppt-dump.py --dump-text --no-raw-dumps slides.ppt
doc-dump.py - Word Document Dumper
Analyzes and dumps Word (.doc) binary format files.
./doc-dump.py [doc file]
Example:
./doc-dump.py document.doc
xls-dump.py - Excel Spreadsheet Dumper
Analyzes and dumps Excel (.xls) binary format files with extensive options.
./xls-dump.py [options] [xls file]
Options:
- -d, --debug - turn on debug mode
- --show-sector-chain - show sector chain information at start of output
- --show-stream-pos - show position of each record relative to the stream
- --dump-mode MODE - specify dump mode: 'flat' (default), 'xml', or 'canonical-xml'
- --catch - catch exceptions and try to continue
- --utf-8 - output strings as UTF-8
Examples:
./xls-dump.py spreadsheet.xls
./xls-dump.py --dump-mode xml --debug workbook.xls
./xls-dump.py --show-stream-pos --utf-8 data.xls
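When the dumper is driven from another Python script, the command line can be assembled programmatically. The following is a minimal sketch, not part of MSO-Dumper: the helper names `xls_dump_command` and `run_xls_dump` are hypothetical, and the script is assumed to be executable at `./xls-dump.py` in the current directory. Only flags listed above are used.

```python
import subprocess

# Dump modes documented for --dump-mode.
DUMP_MODES = ("flat", "xml", "canonical-xml")


def xls_dump_command(xls_path, dump_mode="flat", debug=False,
                     show_stream_pos=False, utf8=False,
                     script="./xls-dump.py"):
    """Build the argument list for one xls-dump.py run (hypothetical helper)."""
    if dump_mode not in DUMP_MODES:
        raise ValueError("unknown dump mode: %r" % (dump_mode,))
    cmd = [script, "--dump-mode", dump_mode]
    if debug:
        cmd.append("--debug")
    if show_stream_pos:
        cmd.append("--show-stream-pos")
    if utf8:
        cmd.append("--utf-8")
    cmd.append(xls_path)
    return cmd


def run_xls_dump(xls_path, **options):
    """Run the dumper and return whatever it printed to stdout as text."""
    result = subprocess.run(xls_dump_command(xls_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For instance, `xls_dump_command("workbook.xls", dump_mode="xml", debug=True)` yields the same arguments as the second example above.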
vsd-dump.py - Visio Document Dumper
Analyzes and dumps Visio (.vsd) format files.
./vsd-dump.py [vsd file]
Example:
./vsd-dump.py diagram.vsd
Graphics Format Dumpers
emf-dump.py - Enhanced Metafile Dumper
Analyzes and dumps Enhanced Metafile (.emf) format files.
./emf-dump.py [emf file]
Example:
./emf-dump.py image.emf
wmf-dump.py - Windows Metafile Dumper
Analyzes and dumps Windows Metafile (.wmf) format files.
./wmf-dump.py [wmf file]
Example:
./wmf-dump.py graphic.wmf
OLE Format Dumpers
ole1-dump.py - OLE1 Embedded Object Dumper
Dumps OLE1 embedded objects as described in section 2.2.5 of the [MS-OLEDS] specification.
./ole1-dump.py [ole1 file]
Example:
./ole1-dump.py embedded_object.ole1
ole2preview-dump.py - OLE2 Preview Stream Dumper
Dumps OLE2 preview streams as described in section 2.3.4 of the [MS-OLEDS] specification.
./ole2preview-dump.py [ole2 file]
Example:
./ole2preview-dump.py preview_stream.ole2
VBA and Macro Analysis
vbadump.py - VBA Project Dumper
Extracts and analyzes VBA (Visual Basic for Applications) code from Office documents.
./vbadump.py [office file with VBA]
Example:
./vbadump.py macro_document.xls
Special Format Tools
swlaycache-dump.py - StarWriter Layout Cache Dumper
Dumps the StarWriter binary layout cache format.
./swlaycache-dump.py [cache file]
Example:
./swlaycache-dump.py layout.cache
Utility Scripts
compress.py - VBA Stream Compressor
Compresses VBA streams using Microsoft's compression algorithm.
./compress.py [offset]
Reads input from stdin and writes the compressed stream to stdout. The offset argument is optional.
decompress.py - VBA Stream Decompressor
Decompresses VBA streams.
./decompress.py [offset]
Reads compressed input from stdin and writes the decompressed stream to stdout. The offset argument is optional.
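To illustrate what decompressing a VBA stream involves, here is a minimal sketch of the container format described in [MS-OVBA] section 2.4.1. It assumes this is the algorithm the two scripts implement. It is not the code in decompress.py, it does not model the offset argument, and the function name `decompress_vba` is hypothetical.

```python
import struct


def decompress_vba(data):
    """Decompress an [MS-OVBA] 2.4.1 compressed container (sketch only)."""
    if not data or data[0] != 0x01:
        raise ValueError("missing 0x01 container signature byte")
    out = bytearray()
    pos = 1
    while pos + 2 <= len(data):
        # Chunk header: low 12 bits hold (chunk size - 3), top bit is the
        # "compressed" flag.
        (header,) = struct.unpack_from("<H", data, pos)
        chunk_end = min(pos + (header & 0x0FFF) + 3, len(data))
        pos += 2
        if not header & 0x8000:
            # Raw chunk: the payload is copied through unchanged.
            out += data[pos:chunk_end]
            pos = chunk_end
            continue
        chunk_start = len(out)
        while pos < chunk_end:
            flags = data[pos]  # one flag bit per following token, LSB first
            pos += 1
            for bit in range(8):
                if pos >= chunk_end:
                    break
                if not (flags >> bit) & 1:
                    out.append(data[pos])  # literal byte
                    pos += 1
                    continue
                # Copy token: the offset/length split depends on how much of
                # the current chunk has been decompressed so far.
                (token,) = struct.unpack_from("<H", data, pos)
                pos += 2
                difference = len(out) - chunk_start
                bit_count = 4
                while (1 << bit_count) < difference:
                    bit_count += 1
                length = (token & (0xFFFF >> bit_count)) + 3
                offset = (token >> (16 - bit_count)) + 1
                for _ in range(length):
                    out.append(out[-offset])  # byte-wise, may overlap
    return bytes(out)


# A container holding only literal tokens, checked by hand against the
# format rules above.
sample = bytes.fromhex(
    "01 19 B0 00 61 62 63 64 65 66 67 68 00 69 6A 6B 6C 6D 6E 6F 70"
    " 00 71 72 73 74 75 76 2E")
print(decompress_vba(sample))  # b'abcdefghijklmnopqrstuv.'
```

The sketch raises an exception on truncated or malformed input rather than attempting recovery.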
pptx-kill-uuid.py - PowerPoint UUID Replacement Tool
Replaces UUIDs in PowerPoint XML streams with sequential integers for easier analysis.
cat ppt/diagrams/data1.xml | ./pptx-kill-uuid.py
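The idea behind the replacement can be sketched in a few lines of Python. This is an illustration of the technique only, not the code in pptx-kill-uuid.py: the function name `replace_uuids` is hypothetical, and the UUID pattern and the numbering starting at 1 are assumptions, so the real script may match and number differently.

```python
import re

# Bare UUID in the usual 8-4-4-4-12 hex layout; surrounding braces, if any,
# are left in place.
UUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}"
    r"-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}")


def replace_uuids(text):
    """Replace each distinct UUID with a small integer, in order of first use."""
    seen = {}

    def substitute(match):
        key = match.group(0).upper()  # treat UUIDs as case-insensitive
        if key not in seen:
            seen[key] = str(len(seen) + 1)
        return seen[key]

    return UUID_RE.sub(substitute, text)
```

Because the same UUID always maps to the same integer, cross-references between elements stay intact while the output becomes much easier to read and diff.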
convert-enum.py
Utility script for converting enumerations (see source for specific usage).
Output Formats
Most dump tools output XML-formatted analysis data that includes:
- File structure information
- Record-by-record analysis
- Raw hex dumps of binary data
- Extracted text content (where applicable)
- Stream hierarchies for compound document formats
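XML output can be post-processed with standard tooling. The sketch below assumes the captured output is a single well-formed XML document. The helper name `count_elements` is hypothetical, and the element names in the sample are invented for illustration; they are not real tool output.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_text):
    """Count how often each element name occurs in an XML dump."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


# Invented sample; a real dump will use different element names.
sample = "<dump><stream name='Workbook'><record/><record/></stream></dump>"
for tag, count in sorted(count_elements(sample).items()):
    print(tag, count)
```

A tally like this gives a quick overview of a large dump before reading it record by record.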
Development
The core parsing logic is contained in the msodumper/ package with specialized modules for each format:
- docstream.py, docrecord.py - Word document parsing
- xlsstream.py, xlsrecord.py, xlsmodel.py - Excel parsing
- pptstream.py, pptrecord.py - PowerPoint parsing
- emfrecord.py, wmfrecord.py - Graphics format parsing
- ole.py, olestream.py - OLE compound document parsing
- vbahelper.py - VBA macro analysis
- etc.
Submit Patches to LibreOffice Gerrit:
License
This project is licensed under the Mozilla Public License 2.0 - see the license header in each source file for details.