
reports of our models, because of the imbalance of our dataset, and shows a quick evaluation
of the performance of each binary classification as a whole.

5.2. Evaluation and results compared to relevant research

In practice, case B is of limited use, because a deployed model rarely encounters traffic data
so close to its training data, especially given how quickly network exploits evolve today. It
was carried out mainly to check for overfitting, to validate the training phase of the models,
and to follow the standard practice of most machine learning projects, which split the
original dataset into training and test subsets.
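
The case B-style split described above can be sketched as follows. This is a minimal illustration, not the code used in this project: the function name `stratified_split`, the 25% test fraction and the toy labels are all assumptions, and the text does not state whether the original split was stratified. Stratification is used here only because the dataset is imbalanced.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class keeps roughly
    the same proportion in both subsets (hypothetical helper)."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels standing in for the binary normal/attack target (made-up data).
labels = ["normal"] * 80 + ["attack"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
```

Because both subsets are drawn from the same pool, a model evaluated this way sees test traffic with the same class balance as its training traffic, which is why such a split mainly serves as an overfitting check.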

The NSL-KDD test set, whose distribution of labels, services, flags and many other features
differs from that of the training set, is closer to real-world conditions. On it, the
performance of all our models comes close to that of some of the latest research, including
much more complex and innovative models such as [35], [36], [12] and [9].
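
One way to quantify the difference in label distribution mentioned above is the total variation distance between the class frequencies of two sets. The sketch below is illustrative only: the helper names are hypothetical, the label counts are invented and are not the actual NSL-KDD frequencies, and this measure was not part of the experiments reported here.

```python
from collections import Counter


def label_distribution(labels):
    """Map each class label to its relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions:
    0 means identical, 1 means completely disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented label counts, NOT the real NSL-KDD frequencies.
train_labels = ["normal"] * 50 + ["dos"] * 40 + ["probe"] * 10
test_labels = ["normal"] * 40 + ["dos"] * 30 + ["probe"] * 10 + ["r2l"] * 20

shift = total_variation_distance(
    label_distribution(train_labels),
    label_distribution(test_labels),
)
```

A random split of a single pool, as in case B, would give a value close to 0, whereas a separately collected test set would typically give a larger one.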

The best results of all are reported in [35], where an LSTM model reaches 89% accuracy, the
highest of the works compared here. Besides that, the authors use a Deep CNN combined with
Denoising and Contractive AEs in different balances and reach 81–85%. Using more classical
approaches similar to ours (kNN, DT, MLP, RF), they reach 74–82% accuracy. The second-best
accuracy is obtained in [12], whose implementation is an AE followed by another sparse AE
network, with an LR classifier as the output layer that provides only binary classification;
with this, they reach 87.2% accuracy. In [36], the input passes through multiple CNNs, a BLSTM
and an attention layer to reach 84.2% accuracy, while their traditional approaches (DT, MLP,
RF) reach 72–78%. Lastly, in [9], the authors developed similar classifiers (DT, DNN) that
reached 76–79%; reducing the features to 6 with PCA made accuracy drop to 71–75%.

[13] and [11] carried out studies very similar to our own, but they treat the whole NSL-KDD
dataset as one and, after the preprocessing phase, split it into training and validation/test
subsets, as in our case B. The 99–99.6% accuracy obtained there by all the classifiers tested
resembles the results of our models, found in Table 8. However, such case B-like experiments
do not reflect the conditions a model faces in real-life use.

Many similar projects can also be found on GitHub, since the NSL-KDD is a very popular dataset
for intrusion detection; some of them use only the KDDTrain+_20Percent and
KDDTest+_20Percent for easier and faster processing. Most such projects either apply minimal
optimisation and focus on analysing the NSL-KDD in depth, or implement a single, more
advanced technique (CNN, Autoencoder) in order to refine that one model.

In the next section, we discuss the problems and limitations of this project, as well as
future work that could improve on it.