The value produced by the activation function becomes the output of the hidden-layer node,
and it is fed to the next layer (another hidden layer or the output layer) through the same
process of dot-product calculation and activation function; this process is repeated at every
node of every hidden layer [21], [22].
Once the output layer is reached, the resulting value of the ANN is either used for
backpropagation during the training phase, or it is presented as the result of the prediction
during testing.
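As an illustration of this process, the following minimal NumPy sketch propagates one input vector through a hidden layer and an output layer; the layer sizes, random weights, and choice of sigmoid activation are arbitrary assumptions made here for demonstration, not values from the text:

```python
import numpy as np

def sigmoid(z):
    # One common activation function; any non-linear activation could be used.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Arbitrary example dimensions: 4 input features, 3 hidden nodes, 2 output nodes.
W_hidden = rng.normal(size=(4, 3))   # input-to-hidden connection weights
b_hidden = np.zeros(3)               # hidden-layer biases
W_output = rng.normal(size=(3, 2))   # hidden-to-output connection weights
b_output = np.zeros(2)

x = rng.normal(size=4)               # one input sample

# Each layer repeats the same two steps: dot product, then activation.
hidden = sigmoid(x @ W_hidden + b_hidden)
output = sigmoid(hidden @ W_output + b_output)

# During training, this output would be compared with the target for
# backpropagation; at test time it is reported as the prediction.
print(output)
```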
Multilayer perceptrons are the basis of all ANNs, and they have greatly improved machine
learning algorithms in both regression and classification applications. Their flexibility and the
abundance of available activation and optimisation functions have freed computers from the
XOR limitation of single-layer perceptrons and extended their learning potential to richer and
more complex problems.
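To make the XOR point concrete, the sketch below shows a 2-2-1 MLP with step activations computing XOR, something a single-layer perceptron provably cannot do; the weights are hand-picked here purely for illustration:

```python
import numpy as np

def step(z):
    # Heaviside step activation.
    return (z > 0).astype(int)

# Hand-picked weights: the first hidden node computes OR, the second AND,
# and the output node computes (OR AND NOT AND), i.e. XOR.
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])
W_output = np.array([1, -1])
b_output = -0.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = step(np.array(x) @ W_hidden + b_hidden)
    y = step(h @ W_output + b_output)
    print(x, "->", int(y))
# Prints: [0,0]->0, [0,1]->1, [1,0]->1, [1,1]->0
```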
MLPs can be shallow, when there is only one hidden layer, or Deep Neural Networks (DNNs),
when there are two or more hidden layers. ANNs, and especially DNNs, have been at the
forefront of research in recent years, as they are fundamental to deep learning and AI. One of
their core strengths is that they solve problems stochastically, which allows approximate
solutions to very complex or even otherwise intractable problems. This stochastic approach
means the model makes no assumptions about the underlying probability density or other
relations between the variables of the input data; instead, it captures them through the
weights (the interconnections of the nodes) and the iterative training process. MLPs can
achieve high performance even with limited training data, given a sufficient number of nodes
and layers, and a two-layer backpropagation neural network with enough hidden neurons has
been proven to be a universal approximator [23]. The most important disadvantage of MLPs
compared with other DNN architectures is that they are fully connected: every node in one
layer connects to every node in the next, forming a dense network, so the number of
parameters the model requires becomes very high. This leads to inefficiency and redundancy
in more complex problems [24].
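To see why dense connectivity inflates the parameter count, the short sketch below counts the weights and biases of a fully connected network; the layer sizes are illustrative assumptions only:

```python
def dense_param_count(layer_sizes):
    # Each fully connected layer has (inputs x outputs) weights plus one
    # bias per output node.
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Illustrative example: 100 input features, two hidden layers of 256 nodes,
# and 5 output classes.
print(dense_param_count([100, 256, 256, 5]))  # 92,933 parameters
```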
These are the five models that were used for this project. This brief explanation of how they
work should help clarify why each of them produced the results it did during the experimental
phase. Each model has its own uses and advantages in different cases, so this comparison
among them was considered worthwhile, particularly on the NSL-KDD, which provides a rich
dataset in terms of the number and variety of its features, the variation in the correlations
between them, and the many classes available to study.