August 5, 2021

Platform for Automatic Analysis of Malicious Applications
Using Artificial Intelligence Algorithms

Description 🖼️
- Objectives 🎯
- Features 🧰
Important Observation ⚠️
Setup 🛠️
- For Private Repositories 🙊
Typical Usage 🔎
Resources 🥣

Description 🖼️

dike (pronounced /ˈdaɪkiː/) is an open-source platform that combines malware analysis with artificial intelligence, more precisely with its machine learning subfield.

Objectives 🎯

At the moment, dike can analyze only files in the Portable Executable (PE) and Object Linking and Embedding (OLE) formats. Within this limitation, it has three main objectives:

Regression of malice
Classification into malware families
Similarity analysis.

Features 🧰

The software enables the creation of analysis pipelines (called models in the context of the platform), which deal with the specific steps of malware analysis and data engineering:

Dataset management, where it uses three main sources of labeled PE and OLE files:
- The open-source dataset DikeDataset
- Accurate analysis results produced by the analysts of the organization in which the platform is set up
- Results of automatic VirusTotal scans
Feature extraction, in which extractors are used to obtain relevant information such as:
- Strings
- Characteristics of the file format
- Opcodes
- Windows API calls
- Macros
Feature preprocessing, where preprocessors are used to transform the features into a format better suited to the machine learning algorithms:
- Transformations
  - Binarization
  - Discretization
  - Counting (including a special approach for categories of opcodes and API calls)
  - Vectorization
  - N-grams
- Scaling
- Dimensionality reduction
Training of machine learning models, with cross-validation and evaluation included (for both regression and classification).
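
The counting, N-gram, and binarization transformations listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not dike's implementation: the function names, the sample opcode sequence, and the opcode-to-category mapping are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical mapping from opcodes to coarse categories (an assumption,
# not the mapping used by dike).
OPCODE_CATEGORIES: Dict[str, str] = {
    "mov": "data_transfer",
    "push": "data_transfer",
    "pop": "data_transfer",
    "add": "arithmetic",
    "sub": "arithmetic",
    "jmp": "control_flow",
    "call": "control_flow",
}


def count_ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the contiguous n-grams of a token sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_categories(tokens: Sequence[str], categories: Dict[str, str]) -> Counter:
    """Count tokens per category; unmapped tokens fall under 'other'."""
    return Counter(categories.get(token, "other") for token in tokens)


def binarize(counts: Counter) -> dict:
    """Reduce counts to presence flags (1 if the feature occurs at all)."""
    return {feature: 1 for feature, count in counts.items() if count > 0}


# Hypothetical opcode sequence extracted from a sample.
opcodes = ["push", "mov", "call", "mov", "call", "xor"]
bigrams = count_ngrams(opcodes, 2)
per_category = count_categories(opcodes, OPCODE_CATEGORIES)
```

Counting per category instead of per opcode keeps the feature vector short and stable across samples, at the cost of losing the distinction between opcodes in the same category.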

Important Observation ⚠️

dike is part of my Bachelor's thesis, which aims to demonstrate that artificial intelligence techniques can improve malware analysis. The document and the presentation (in Romanian 🇷🇴 only) can be found in a separate repository.

At the moment, the thesis document is the only place where the following information can be found:

Software requirements
Architecture (more detailed than the description above)
Testing
Evaluation
Further development.

Setup 🛠️

Download the script manage.sh from the folder infrastructure.
Obtain a VirusTotal API key.
Create and host (on a server which the platform can access) a TGZ archive containing two folders, ghidra (with a Ghidra project) and qiling (with the dynamically linked libraries needed by Qiling).
Run the script and follow the instructions.
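
Any archiver can produce the TGZ archive required above. As one possible sketch, it can be built with Python's standard `tarfile` module. The function name `build_archive` and the paths in the usage note are assumptions; only the two top-level folder names, `ghidra` and `qiling`, come from the steps above.

```python
import tarfile
from pathlib import Path

# Folder names required by the setup steps.
REQUIRED_FOLDERS = ("ghidra", "qiling")


def build_archive(source_dir: Path, archive_path: Path) -> None:
    """Pack the ghidra and qiling folders of source_dir into a TGZ archive."""
    # Fail early, before any file is written, if a folder is missing.
    for folder in REQUIRED_FOLDERS:
        if not (source_dir / folder).is_dir():
            raise FileNotFoundError(f"Missing folder: {source_dir / folder}")

    with tarfile.open(archive_path, "w:gz") as archive:
        for folder in REQUIRED_FOLDERS:
            # arcname keeps each folder at the top level of the archive.
            archive.add(source_dir / folder, arcname=folder)
```

For example, `build_archive(Path("."), Path("dependencies.tgz"))` would pack the two folders from the current directory. The archive name is hypothetical.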

Setup Example

For Private Repositories 🙊

If the repository hosting the platform is private, two steps need to be performed beforehand:

Generate an asymmetric key pair via ssh-keygen -t ed25519 -C "EMAIL_ADDRESS", where EMAIL_ADDRESS is replaced with your email address.
Add the public key to the repository's deploy keys section on GitHub.

Typical Usage 🔎

For Clients 👨‍💼

Malice Prediction

Similarity Analysis

Feature-wise Comparison of Samples

Model Evaluation

Settings

For Administrators 👩‍💻

Administrators can use a powerful command-line interface by running the dike command on a leader server. Some of the available commands are demonstrated in the recordings below.

Connections with Subordinate Servers

Datasets

Training and Management of Models

Predictions with Models

Administrators also edit YAML files manually, following a schema that depends on the context in which the file is used. Some existing files (one per type, for example purposes only) have comments that document these schemas.

For Other Systems 🖥️

Other systems of the organization can use the scan services of the platform by sending HTTP or HTTPS requests (depending on the configuration) to the following API endpoints.

| Route | Action |
| --- | --- |
| `/get_malware_families` | Retrieves the malware families in use. |
| `/get_evaluation/MODEL_NAME` | Retrieves the evaluation of a model. |
| `/get_configuration/MODEL_NAME` | Retrieves the configuration of a model. |
| `/get_features/MODEL_NAME/FILE_HASH` | Retrieves the features of a file from the platform's dataset. |
| `/create_ticket/MODEL_NAME` | Creates a prediction ticket. |
| `/get_ticket/TICKET_NAME` | Retrieves the content of a prediction ticket. |
| `/publish/MODEL_NAME` | Publishes the results of a scan for a specific model. |
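
A client for the routes in the table above might be sketched as follows. The base address `https://dike.example.org`, the helper names `build_url` and `fetch`, and the use of GET requests are all assumptions: the table does not state the HTTP method, the request body, or the response format of any endpoint.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base address of a dike deployment (an assumption).
BASE_URL = "https://dike.example.org"


def build_url(route: str, *arguments: str, base_url: str = BASE_URL) -> str:
    """Join a route from the table above with its path arguments."""
    # Percent-encode each argument so it stays a single path segment.
    parts = [route.strip("/")] + [quote(argument, safe="") for argument in arguments]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Send a GET request and return the raw response body.

    Using GET is an assumption; the documented routes do not specify
    the HTTP method or the response format.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# MODEL_NAME is the placeholder from the table, not a real model name.
evaluation_url = build_url("/get_evaluation", "MODEL_NAME")
```

Keeping URL construction separate from the network call makes the first part easy to check without a running server. `fetch(evaluation_url)` would then perform the actual request against a real deployment.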

Resources 🥣

The most important resources used are listed in the table below.

| Name | Description | Link |
| --- | --- | --- |
| Ghidra | Software reverse engineering framework | repository |
| VirusTotal API | Scanning API that aggregates multiple antivirus engines | website |
| Qiling | Python 3 emulation framework | repository |
| Pandas | Python 3 data analysis and manipulation library | repository |
| scikit-learn | Python 3 machine learning library | repository |
| Python 3 | General-purpose programming language | website |
| Docker | Software product for OS-level virtualization | website |
| Docker Compose | Tool for running multi-container applications on Docker | repository |
| GitHub | Git repository hosting service | website |
| YAML | Data-serialization language | website |