6. Discussion and future work
Anomaly detection, and network security in general, face several notable challenges:
- The rapid evolution of modern networks, which brings a steady stream of novel and unknown attacks exploiting new services and vulnerabilities.
- Society's ever-growing reliance on the Internet, where the volume of data generated and handled each year is already at the edge of our processing capabilities.
- The Internet of Things (IoT), which connects low-end devices with far less processing power and fewer security capabilities, exposing us to new security gaps that could affect not only our data but even our health.
- The scarcity of open network traffic datasets, especially recent ones containing newer attacks, due to security concerns and competition among service providers; such datasets could refresh the research domain.
- The limitations that unsupervised learning still exhibits, even though it is well suited to anomaly detection: the performance of models trained on unlabelled data cannot be evaluated properly, while labelling entire datasets is an especially time-consuming and difficult process.
Further improvement on this particular project could take two directions. Firstly, since the NSL-KDD dataset is already labelled, many unsupervised learning mechanisms could be validated against the respective labels, including attention mechanisms, autoencoders, and clustering methods, which we could optimise and compare. Even within supervised learning, a dive into more advanced DNN methodologies would yield better results and more flexible models; techniques worth testing include convolutional and pooling layers, RNNs, and LSTM approaches. Deep neural networks are a central part of machine learning and AI research nowadays, naturally, because of their flexible architecture, robust performance, and the abundance of building blocks available for every part of a model. Using only the normal traffic of the dataset, as is done in [7], could be very useful for unsupervised methods such as autoencoders, which learn the patterns of normal data and recognise anomalies by their deviation from them.
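To make this idea concrete, the following is a minimal sketch of reconstruction-error anomaly detection. It uses a linear autoencoder (which is equivalent to projecting onto principal components) and synthetic toy data rather than NSL-KDD; the dimensions, thresholds, and data generation are illustrative assumptions, not part of the original project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "normal traffic": 8 correlated features with low intrinsic dimension,
# standing in for pre-processed NSL-KDD records (assumption for illustration).
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
normal = latent @ mixing + 0.1 * rng.normal(size=(500, 8))

# A linear autoencoder reduces to PCA: encode onto the top-k principal
# directions of the normal data, then decode back to feature space.
mu = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mu, full_matrices=False)
components = vt[:3]  # keep 3 components, matching the latent dimension

def reconstruction_error(x):
    """Mean squared reconstruction error per record (the anomaly score)."""
    z = (x - mu) @ components.T   # encode
    x_hat = z @ components + mu   # decode
    return np.mean((x - x_hat) ** 2, axis=1)

# Threshold chosen from the normal data only: flag anything that
# reconstructs worse than 99% of the training records.
threshold = np.percentile(reconstruction_error(normal), 99)

# "Attacks": points that do not lie on the normal data's subspace.
attack = 2.0 * rng.normal(size=(50, 8))
flagged = reconstruction_error(attack) > threshold
```

A nonlinear autoencoder (e.g. a small DNN trained to reproduce its input) follows the same recipe: train on normal traffic only, score records by reconstruction error, and flag the outliers.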
The second course of action that could upgrade this project is data-centric. With traffic data captured via Wireshark, thanks to the IT department's cooperation, we could use the packet headers, the only part available to us because of privacy and security concerns, to recreate the NSL-KDD features, or a subset of them, for the connections provided; we saw throughout the pre-processing phase of our project that some variables influence the data more than others (e.g. via correlation). Building a dataset from this recent traffic, a whole process in its own right, would give us a potentially worthwhile sample of records, which would be unlabelled. Thankfully, the University has a very secure network infrastructure, being a large campus network through which sensitive data moves, so if we choose a node from the lower levels of it, where the data has already been filtered by the network's security solutions, we