
August 5, 2021

Platform for Automatic Analysis of Malicious Applications
Using Artificial Intelligence Algorithms



Description 🖼️

dike (pronounced /ˈdaɪkiː/) is an open-source platform that combines malware analysis with artificial intelligence, more precisely with its machine learning subfield.

Objectives 🎯

At the moment, dike can analyze only files in the Portable Executable (PE) and Object Linking and Embedding (OLE) formats. Within this limitation, it has three main objectives:

  1. Regression of malice
  2. Classification into malware families
  3. Similarity analysis.
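
To make the third objective more concrete, the sketch below ranks samples by how similar their feature sets are. It is only an illustration: the Jaccard index, the function names, and the sample data are assumptions made for this example, and the similarity measure that dike actually uses is not described in this document.

```python
from typing import Dict, List, Set, Tuple


def jaccard_similarity(first: Set[str], second: Set[str]) -> float:
    """Return |A & B| / |A | B|, or 1.0 when both sets are empty."""
    union = first | second
    if not union:
        return 1.0
    return len(first & second) / len(union)


def most_similar(query: Set[str], dataset: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank the dataset samples by decreasing similarity to the query sample."""
    scores = [(name, jaccard_similarity(query, features)) for name, features in dataset.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical samples, each described by the Windows API calls it uses.
dataset = {
    "sample_a": {"CreateFileW", "WriteFile", "RegSetValueExW"},
    "sample_b": {"CreateFileW", "ReadFile"},
}
query = {"CreateFileW", "WriteFile"}
ranking = most_similar(query, dataset)
```

Here `sample_a` ranks first with a similarity of 2/3, because it shares two of the three API calls present in either sample.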

Features 🧰

The software enables the creation of analysis pipelines (called models in the context of the platform), which deal with the specific steps of malware analysis and data engineering:

  1. Dataset management, where it uses three main sources of labeled PE and OLE files:
    • The open-source dataset DikeDataset
    • Accurate results of analysis made by the analysts of the organization in which the platform is set up
    • Results of automatic VirusTotal scans
  2. Feature extraction, in which extractors are used to obtain relevant information such as:
    • Strings
    • Characteristics of the file format
    • Opcodes
    • Windows API calls
    • Macros
  3. Feature preprocessing, where preprocessors are used to transform the features into a format better suited to the machine learning algorithms:
    • Transformations
      • Binarization
      • Discretization
      • Counting (including a special variant for categories of opcodes and API calls)
      • Vectorization
      • N-grams
    • Scaling
    • Dimensionality reduction
  4. Training of machine learning models, including cross-validation and evaluation (for both regression and classification).
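
The extraction and preprocessing steps above can be illustrated with a minimal, standard-library sketch. It is not dike's implementation: the function names, the sample bytes, and the opcode sequence are hypothetical, and the platform's own extractors and preprocessors may work differently.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Extract runs of printable ASCII characters, like the `strings` utility."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def count_ngrams(tokens: List[str], n: int) -> Counter:
    """Count the contiguous n-grams of a token sequence (for example, opcodes)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def binarize(counts: Dict[NGram, int], vocabulary: List[NGram]) -> List[int]:
    """Map counts to a 0/1 presence vector over a fixed vocabulary."""
    return [1 if counts.get(term, 0) > 0 else 0 for term in vocabulary]


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the [0, 1] interval."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical inputs: a few raw bytes and a disassembled opcode sequence.
sample = b"\x00\x01kernel32.dll\x00\xffab\x00CreateFileW\x00"
strings = extract_strings(sample)

opcodes = ["push", "mov", "call", "push", "mov", "ret"]
bigrams = count_ngrams(opcodes, 2)

vocabulary = [("push", "mov"), ("mov", "ret"), ("xor", "xor")]
presence = binarize(bigrams, vocabulary)
scaled = min_max_scale([bigrams[term] for term in vocabulary])
```

Fixing the vocabulary before binarization or scaling keeps the feature vectors of all samples aligned, which is what a downstream learning algorithm would need.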

Important Observation ⚠️

dike is part of my Bachelor's thesis, which aims to demonstrate that artificial intelligence techniques can improve malware analysis. The document and the presentation (in Romanian 🇷🇴 only) can be found in a separate repository.

At the moment, that repository is the only place where some relevant information can be found:

  • Software requirements
  • Architecture (more detailed than the description above)
  • Testing
  • Evaluation
  • Further development.

Setup 🛠️

  1. Download the script manage.sh from the folder infrastructure.
  2. Obtain a VirusTotal API key.
  3. Create and host (on a server which the platform can access) a TGZ archive containing two folders, ghidra (with a Ghidra project) and qiling (with the dynamically linked libraries needed by Qiling).
  4. Run the script and follow the instructions.
[Figure: Setup Example]
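
Step 3 can be scripted. The sketch below packs the two required folders into a TGZ archive with Python's `tarfile` module. The function name, the archive name `resources.tar.gz`, and the placeholder files are assumptions made for illustration; the expected contents of the `ghidra` and `qiling` folders are not detailed here, and hosting the archive is left out.

```python
import tarfile
import tempfile
from pathlib import Path

REQUIRED_FOLDERS = ("ghidra", "qiling")


def build_archive(source_dir: Path, archive_path: Path) -> None:
    """Pack the ghidra and qiling folders from source_dir into a TGZ archive."""
    # Validate first, so that no partial archive is written.
    for folder in REQUIRED_FOLDERS:
        if not (source_dir / folder).is_dir():
            raise FileNotFoundError(f"Missing required folder: {source_dir / folder}")
    with tarfile.open(archive_path, "w:gz") as archive:
        for folder in REQUIRED_FOLDERS:
            # arcname keeps each folder at the top level of the archive.
            archive.add(source_dir / folder, arcname=folder)


# Demonstration in a temporary directory with hypothetical placeholder files.
with tempfile.TemporaryDirectory() as workspace:
    root = Path(workspace)
    for folder in REQUIRED_FOLDERS:
        (root / folder).mkdir()
    (root / "ghidra" / "project.gpr").write_text("placeholder")
    (root / "qiling" / "kernel32.dll").write_text("placeholder")

    archive_path = root / "resources.tar.gz"
    build_archive(root, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = sorted(archive.getnames())
```

In a real setup, `source_dir` would point to the directory holding the actual Ghidra project and the libraries needed by Qiling.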

For Private Repositories 🙊

If the repository hosting the platform is private, two steps need to be performed beforehand:

  1. Generate an asymmetric key pair via ssh-keygen -t ed25519 -C "EMAIL_ADDRESS", where EMAIL_ADDRESS needs to be replaced with your email address.
  2. Add the public key to the deploy keys section of the GitHub repository.

Typical Usage 🔎

For Clients 👨‍💼

[Figure: Malice Prediction]
[Figure: Similarity Analysis]
[Figure: Feature-wise Comparison of Samples]
[Figure: Model Evaluation]
[Figure: Settings]

For Administrators 👩‍💻

Administrators can use a powerful command-line interface by running the dike command on a leader server. Some of the available commands are demonstrated in the recordings below.

[Recording: Connections with Subordinate Servers]
[Recording: Datasets]
[Recording: Training and Management of Models]
[Recording: Predictions with Models]

Administrators also edit YAML files manually, following a schema that depends on the context in which each file is used. Some existing files (one per type, for example purposes only) contain comments that document these schemas.

For Other Systems 🖥️

Other systems in the organization can use the platform's scan services by sending HTTP or HTTPS (depending on the configuration) requests to the following API endpoints.

| Route | Action |
| --- | --- |
| /get_malware_families | Retrieves the used malware families. |
| /get_evaluation/MODEL_NAME | Retrieves the evaluation of a model. |
| /get_configuration/MODEL_NAME | Retrieves the configuration. |
| /get_features/MODEL_NAME/FILE_HASH | Retrieves the features of a file from the platform's dataset. |
| /create_ticket/MODEL_NAME | Creates a prediction ticket. |
| /get_ticket/TICKET_NAME | Retrieves the content of a prediction ticket. |
| /publish/MODEL_NAME | Publishes for a specific model the results of a scan. |
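
As a rough illustration of how a client system might address these routes, the sketch below only builds the request URLs. The base URL, the helper name, and the model and ticket names are hypothetical. The HTTP methods, request bodies, and response formats are not documented in this section, so the sketch deliberately sends nothing.

```python
from urllib.parse import quote

# Hypothetical address of a dike leader server.
BASE_URL = "https://dike.example.org"


def endpoint_url(base_url: str, route: str, *parameters: str) -> str:
    """Join the base URL, a route name, and its URL-encoded path parameters."""
    segments = [route.strip("/")]
    segments += [quote(parameter, safe="") for parameter in parameters]
    return base_url.rstrip("/") + "/" + "/".join(segments)


families_url = endpoint_url(BASE_URL, "get_malware_families")
evaluation_url = endpoint_url(BASE_URL, "get_evaluation", "my model")
ticket_url = endpoint_url(BASE_URL, "get_ticket", "ticket_1")
```

Encoding each path parameter separately keeps a model name that contains spaces or slashes from changing the shape of the route.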

Resources 🥣

The most important resources used are listed in the table below.

| Name | Description | Link |
| --- | --- | --- |
| Ghidra | Software reverse engineering framework | repository |
| VirusTotal API | Scanning API that aggregates multiple antivirus engines | website |
| Qiling | Python 3 emulation framework | repository |
| Pandas | Python 3 data analysis and manipulation library | repository |
| scikit-learn | Python 3 machine learning library | repository |
| Python 3 | General-purpose programming language | website |
| Docker | Software product for OS-level virtualization | website |
| Docker Compose | Tool for running multi-container applications on Docker | repository |
| GitHub | Git repository hosting service | website |
| YAML | Data-serialization language | website |