📦 GT Malware Classifier

Machine learning-based PE file malware classification with feature extraction and real-time API deployment.


📖 Overview

GT Malware Classifier is a machine learning tool designed to analyze and classify Portable Executable (PE) files as either benign (goodware) or malicious (malware).
It extracts features from each file with the LIEF library and classifies them with Random Forest or Gradient Boosting models.
The project also includes a RESTful API for real-time analysis and classification.

This project is based on ideas presented in the 2021 Machine Learning Security Evasion Competition.

🔗 GitHub Repository: GT Malware Classifier


🛠️ Technologies Used

  • Feature Extraction: Python (LIEF library)
  • Machine Learning Models: Random Forest, Gradient Boosting (Scikit-learn)
  • Backend API: Flask
  • Data Handling: JSONL, CSV
  • Imbalance Handling: SMOTE, Class Weights
  • Utilities: Timeout Handling, Metrics Calculation
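
As a rough illustration of the class-weight approach to imbalance, the sketch below computes per-class weights as n_samples / (n_classes * class_count), the heuristic scikit-learn documents for class_weight="balanced". The function name and the label list are hypothetical, and whether GTModel uses this exact setting is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, imbalanced label list: 0 = goodware, 1 = malware.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so misclassifying it costs more during training.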

🚀 Features

  • ✨ Detailed PE file analysis (imports, exports, sections, strings)
  • ✨ Custom machine learning model (GTModel)
  • ✨ Handles .jsonl and .csv data formats
  • ✨ RESTful API deployment for real-time file classification
  • ✨ Comprehensive performance metrics: accuracy, precision, recall, FPR, FNR
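
The metrics listed above all derive from the four confusion-matrix counts. The following is a minimal sketch, assuming malware is the positive class (label 1); the function names are illustrative and not taken from the project's code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, FPR and FNR as a dict."""
    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),  # goodware flagged as malware
        "fnr": ratio(fn, fn + tp),  # malware missed
    }


# Hypothetical labels and predictions: 1 = malware, 0 = goodware.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
print(metrics)
```

For a malware detector, FPR and FNR are often more informative than accuracy alone, because the two kinds of error carry different costs.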

⚙️ Setup Instructions

  1. Build the Docker image:
     docker build -t gt-malware-classifier .

  2. Run the Docker container:
     docker run -p 5000:5000 gt-malware-classifier

  3. Access the API at http://localhost:5000.
  4. Use the /predict endpoint to classify PE files:
     curl -X POST -F 'file=@path/to/your/file.exe' http://localhost:5000/predict

  5. The API will return a JSON response with the classification result.
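
The same upload can be scripted from Python. The sketch below mirrors the curl command using only the standard library: it builds a multipart/form-data POST with a single form field named file, as in the curl example. The helper names are hypothetical, and the structure of the JSON response is not documented here, so the result is returned as parsed JSON without assuming any keys.

```python
import json
import urllib.request
import uuid
from pathlib import Path

DEFAULT_URL = "http://localhost:5000/predict"


def build_predict_request(file_path, url=DEFAULT_URL):
    """Build a multipart/form-data POST carrying the file in the 'file' field."""
    path = Path(file_path)
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + path.read_bytes() + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def classify(file_path, url=DEFAULT_URL, timeout=30):
    """Send the file to the running API and return the parsed JSON response."""
    request = build_predict_request(file_path, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container from step 2 running, classify("path/to/your/file.exe") returns the parsed response.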

🗺️ Workflow

  1. Feature Extraction:
    Extract attributes such as imports, exports, sections, and strings from PE files using LIEF.

  2. Model Training:
    Train GTModel on labeled datasets, using either a Random Forest or a Gradient Boosting classifier.

  3. Model Testing:
    Evaluate the model's performance using a testing set and output detailed metrics.

  4. API Deployment:
    Serve the trained model through a Flask API that accepts file uploads for real-time classification.

  5. Data Analysis:
    Analyze feature distributions and detect common malware patterns.
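
The feature-extraction step might be sketched as follows. This is an illustration under stated assumptions rather than the project's actual extractor: the function names, the minimum string length of 5, and the shape of the returned dictionary are all hypothetical. Printable strings are collected with the standard library, while imports and sections come from LIEF, which is imported lazily because it is a third-party dependency.

```python
import re

# Runs of at least 5 printable ASCII bytes (the threshold is an assumption).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")


def extract_strings(data, limit=1000):
    """Return up to `limit` printable ASCII strings found in raw bytes."""
    found = [match.group().decode("ascii") for match in PRINTABLE_RUN.finditer(data)]
    return found[:limit]


def summarize_pe(path):
    """Collect imports, sections and strings from a PE file (requires lief)."""
    import lief  # third-party; imported here so extract_strings works without it

    binary = lief.PE.parse(str(path))
    if binary is None:
        raise ValueError(f"not a parseable PE file: {path}")

    # Map each imported library to the function names it provides.
    imports = {
        imp.name: [entry.name for entry in imp.entries]
        for imp in binary.imports
    }
    sections = [
        {"name": sec.name, "size": sec.size, "entropy": sec.entropy}
        for sec in binary.sections
    ]
    with open(path, "rb") as handle:
        strings = extract_strings(handle.read())

    return {"imports": imports, "sections": sections, "strings": strings}


sample = b"\x00\x01kernel32.dll\x00ab\x00LoadLibraryA\xff"
print(extract_strings(sample))
```

A dictionary like this would still have to be turned into a fixed-length numeric vector (for example, counts or hashed features) before a Random Forest or Gradient Boosting model could train on it.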


🀝 Contributing

Open to improvements!
Pull requests and suggestions are welcome at the GT Malware Classifier GitHub repository.


📄 License

Distributed under the MIT License.