Google Summer of Code 2020 - Final Report - OWASP IIDS

Organization:- OWASP

Project:- Intelligent Intrusion Detection System

Mentor:- Sri Harsha

Student:- Ashish Malik

Project Information:-

OWASP’s IIDS is open-source software that leverages Artificial Intelligence to detect intrusions and alert the respective network administrator. It is a fully Django-based application that supports multiple ML and NN models. With the increase in network traffic and the introduction of new applications and attacks, continuous improvement is required for the IDS to detect those threats. User behavior also shows great unpredictability and changes over time, and modeling network traffic is an immensely challenging undertaking because of the complexity and intricacy of human behavior. For these reasons, I chose the KDDCup'99 dataset for training. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type. The simulated attacks fall into one of the following four categories:

  • Denial of Service Attack (DoS)
  • User to Root Attack (U2R)
  • Remote to Local Attack (R2L)
  • Probing Attack
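As a concrete illustration, the raw KDD'99 connection labels can be folded into these four broad categories with a simple lookup. The mapping below lists only a few representative labels per category; the full dataset has many more.

```python
# A few representative KDD'99 attack labels mapped to the four broad categories.
ATTACK_CATEGORIES = {
    "back": "DoS", "neptune": "DoS", "smurf": "DoS",
    "buffer_overflow": "U2R", "rootkit": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L",
    "portsweep": "Probe", "nmap": "Probe",
}

def categorize(label: str) -> str:
    """Map a raw connection label to its broad category ('normal' stays as-is)."""
    label = label.rstrip(".")  # KDD'99 labels often carry a trailing dot
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORIES.get(label, "unknown")
```

Collapsing the dozens of specific labels into these four categories (plus "normal") is a common first step before training, since many specific attack types have only a handful of samples.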

Work Done

Phase-1:-

In this phase, I had to preprocess the KDDCup'99 dataset: essentially cleaning and transforming the data so that it becomes machine-readable. This involved replacing missing values with their respective mean or median and encoding the categorical data. Here’s the notebook where I did most of the pre-processing and model-testing work.
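A minimal sketch of this kind of cleaning step using scikit-learn — the exact imputation strategy and encoders in my notebook may differ:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with the column median and label-encode categoricals."""
    df = df.copy()
    num_cols = df.select_dtypes(include=[np.number]).columns
    cat_cols = df.select_dtypes(exclude=[np.number]).columns
    if len(num_cols):
        imputer = SimpleImputer(strategy="median")  # could also use "mean"
        df[num_cols] = imputer.fit_transform(df[num_cols])
    for col in cat_cols:  # e.g. protocol_type, service, flag in KDD'99
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```

After this step every column is numeric, which is what the downstream classifiers expect.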

Since the KDDCup dataset has 41 features, many of them irrelevant, training a model on all of them and detecting anomalies would be difficult. So next, I started the feature-extraction process, using RFECV (Recursive Feature Elimination with Cross-Validation). Here’s the notebook I used during the feature extraction process.
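The RFECV step looks roughly like this; here a small synthetic matrix stands in for the preprocessed KDD'99 data, and the estimator choice is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the preprocessed feature matrix (KDD'99 has 41 features).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    step=1,                     # drop one feature per elimination round
    cv=StratifiedKFold(3),      # cross-validation to score each feature subset
    scoring="accuracy",
)
selector.fit(X, y)
X_reduced = selector.transform(X)  # keep only the selected features
```

RFECV repeatedly fits the estimator, drops the least important feature, and uses cross-validation to pick the subset size that scores best, so the number of retained features is chosen automatically rather than fixed in advance.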

Link to the Phase-1 PR - #12

I wrote a Blog Post describing the work done in more detail.

Phase-2:-

During this phase, I had to deploy my machine learning models behind a Django REST API. The models consisted of multiple ML and NN classifiers. This was done to avoid rewriting those models repeatedly, which was time-consuming. Productionizing the models means using them in a real-world application instead of just saving/storing them in Jupyter notebooks. My mentor and I were inspired by the use cases of RASA, an open-source machine learning framework for automating text- and voice-based assistants. So, we decided to create an application that calls those models through a Django REST API: the user configures the model to be used and its parameters in a JSON file and then calls the endpoints. By the end of this phase, I had created and configured those models for deployment.
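To give a sense of the workflow, here is a hypothetical JSON config and a sketch of how a model could be instantiated from it. The key names and the registry below are illustrative assumptions, not the project's exact schema:

```python
import json

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative registry mapping config names to classifiers; the real
# project's schema and supported models may differ.
MODEL_REGISTRY = {
    "random_forest": RandomForestClassifier,
    "decision_tree": DecisionTreeClassifier,
}

def build_model(config_json: str):
    """Instantiate the classifier named in the user's JSON config."""
    config = json.loads(config_json)
    cls = MODEL_REGISTRY[config["model"]]
    return cls(**config.get("params", {}))

model = build_model('{"model": "random_forest", "params": {"n_estimators": 50}}')
```

An API view can then call `build_model` on the uploaded config, train the resulting estimator, and serve predictions, so swapping classifiers only requires editing the JSON rather than the code.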

Link to the Phase-2 PR - #19

I again wrote a Blog Post about my work.

Phase-3:-

To make the application more user-friendly, we decided to add CLI support. Adding it was a bit challenging, but I managed to do it. With this feature, the user can pass flag arguments on the CLI specifying the training dataset, the config file, and the model to be used, as shown below.


python manage.py get_data -d data_path -c config_path -m 'ml'
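The flag handling behind a command like this can be sketched with plain argparse; in the project it lives in a Django management command's `add_arguments()`, and the `-m` choices shown here are an assumption:

```python
from argparse import ArgumentParser

def build_parser() -> ArgumentParser:
    """Parser mirroring the get_data command's flags (sketch, not the exact code)."""
    parser = ArgumentParser(prog="get_data")
    parser.add_argument("-d", "--data", required=True, help="path to the training dataset")
    parser.add_argument("-c", "--config", required=True, help="path to the JSON model config")
    parser.add_argument("-m", "--model", choices=["ml", "nn"], help="model family to use")
    return parser

args = build_parser().parse_args(["-d", "data.csv", "-c", "config.json", "-m", "ml"])
```

Wrapping this in a Django management command is what makes it callable as `python manage.py get_data` rather than as a standalone script.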

I also changed the code so that when the user saves a model, it is stored in zip format, which makes transferring the model’s contents much more convenient.
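A sketch of the zip-export idea, assuming joblib serialization; the file and archive names here are illustrative:

```python
import os
import tempfile
import zipfile

import joblib
from sklearn.linear_model import LogisticRegression

def save_model_zip(model, zip_path: str) -> None:
    """Serialize a model with joblib, then bundle it into a single zip archive."""
    with tempfile.TemporaryDirectory() as tmp:
        model_file = os.path.join(tmp, "model.joblib")
        joblib.dump(model, model_file)
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(model_file, arcname="model.joblib")
```

A single compressed archive is easier to move around than a directory of loose serialized files, and the same pattern extends to bundling the config alongside the model.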

After all this, I started writing the documentation for our tool by modifying the README.

Link to the Phase-3 PR - #21

Link to the Documentation PR - #22

Future Work

There’s still more work to be done before releasing v1.0. Since the user currently has to provide their own training dataset every time, I will be creating a feature-extraction script that extracts all the necessary KDDCup'99 features from an input pcap file containing the network-traffic info. I have also created Issue #14 regarding that.
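As a rough sketch of what the planned script would do, simplified packet records can be grouped per connection to derive basic KDD'99-style features; real pcap parsing (e.g. with a library like scapy) is omitted, and the record format here is a stand-in:

```python
from collections import defaultdict

def extract_features(packets):
    """Group (timestamp, src, dst, payload_len) records by connection and
    derive two KDD'99-style per-connection features."""
    flows = defaultdict(list)
    for ts, src, dst, length in packets:
        flows[(src, dst)].append((ts, length))
    features = {}
    for key, pkts in flows.items():
        times = [t for t, _ in pkts]
        features[key] = {
            "duration": max(times) - min(times),   # KDD'99 'duration' feature
            "src_bytes": sum(n for _, n in pkts),  # KDD'99 'src_bytes' feature
        }
    return features
```

The full script would have to cover all the features selected during Phase-1, but the overall shape is the same: parse the pcap, group packets into connections, and compute per-connection statistics.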

After releasing v1.0 of the application, I will keep contributing to this project and helping others.

Closure

The past few months were some of the most productive of my life. I had never been very active in open source before, and from here the curve only goes up. I always wanted to build something for security using AI, and this was the first time I got the chance to work on something truly amazing. I learned a lot more than I expected, and the amount of exposure I got from my mentor is truly inexpressible. I am thankful to the organization’s admin and my mentor for supporting me.

A conclusion is simply the place where you got tired of thinking. -Dan Chaon