2.3. The advantages of the NSL-KDD dataset
To close this section, it is worth explaining why the NSL-KDD dataset was chosen and
where it came from. The NSL-KDD was created in 2009 to overcome some of the limitations
of its ancestors, DARPA (1998) and KDDCup99 (1999). Like KDDCup99 before it, it is a
publicly available dataset of network traffic records, and it contains a selected subset
of the KDDCup99 data [1]. That subset was selected by filtering out the problematic
instances of KDDCup99 while following data mining best practices to build the new dataset.
The main advantages of using this dataset are:
- It does not include redundant records in the training set, so classifiers are not
biased toward the more frequent records.
- There are no duplicate records in the test set, so the measured performance is not
biased toward methods that achieve higher detection rates on the frequent records.
- The number of records selected from each difficulty level is inversely proportional to
the share of that level in the original KDDCup99, so the classification rates of
different machine learning methods spread over a wider range.
- Unlike KDDCup99, which had millions of records, both KDDTrain+ and KDDTest+ have a
reasonable number of records, making it affordable to run experiments on the complete
sets instead of on a small random sample. As a result, evaluation results from
different research groups are consistent and comparable (as is the case with our
models).
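The two filtering ideas in the list above, removing duplicate records and sampling each difficulty level inversely to its share, can be illustrated with a minimal Python sketch. It is not the procedure that was used to build NSL-KDD. The function names (drop_duplicates, sample_by_difficulty), the difficulty labels, and the rule "keep fraction = 1 - share of the level" are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def sample_by_difficulty(records, level_of, seed=0):
    """Thin out each difficulty level inversely to its share of the data.

    Assumption for this sketch: a level that makes up a fraction `share`
    of all records keeps a fraction (1 - share) of its own records, so
    frequent levels are reduced strongly and rare levels barely at all.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for rec in records:
        groups[level_of(rec)].append(rec)

    total = len(records)
    selected = []
    for level in sorted(groups):
        members = groups[level]
        keep_fraction = 1.0 - len(members) / total
        k = round(keep_fraction * len(members))
        selected.extend(rng.sample(members, k))
    return selected


# Toy records: (feature, feature, hypothetical difficulty level).
raw = [("tcp", i % 8, "easy") for i in range(40)]
raw += [("udp", i, "hard") for i in range(2)]

unique = drop_duplicates(raw)
subset = sample_by_difficulty(unique, level_of=lambda rec: rec[-1])
```

In this toy example the frequent "easy" level is reduced much more strongly than the rare "hard" level, which is the effect described in the third item of the list.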
The NSL-KDD is not a perfect dataset: it is quite outdated and it is synthetic. There is,
however, much value in the rare good datasets that are available, even if they are old.
Firstly, they are already labelled, and labelling is very time consuming or sometimes even
impossible; this allows researchers to test supervised learning methods or to validate the
unsupervised models more frequently used today. Benchmark datasets like NSL-KDD are used
to validate and evaluate new approaches to intrusion detection, and to compare different
methods, old and new. They are also the only way to make experiments repeatable over the
years, especially because they are publicly available to all researchers. A feature-rich
dataset like NSL-KDD also allows different approaches to fine-tune different parameters
and to extract features for more lightweight models, or it can simply provide a base on
which new datasets are built.
Network traffic datasets are valuable assets for IDS research. However, none of them can
fully represent real-world traffic, which is constantly evolving, with new attacks always
appearing (or existing but not yet discovered). Apart from the privacy and security
concerns that hinder the collection of real data, realistic simulations are also difficult
to build. Evaluation of IDS datasets is challenged by the difficulty of collecting attack
and victim scripts, by the speed at which attacks evolve and are produced, and by the many
different network services, which not only make traffic more complex but also open new
gaps for exploitation.