The value produced by the activation function becomes the output of the hidden-layer node,
and it is fed to the next layer (another hidden layer or the output layer) through the same
process of dot-product calculation and activation function; this process is repeated at every
node of every hidden layer [21], [22].
Once the output layer is reached, the resulting value of the ANN is either used for
backpropagation during the training phase, or it is presented as the result of the prediction
during testing.
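As an illustration of this process, the following minimal NumPy sketch propagates one input vector through a hidden layer and an output layer; the layer sizes, random weights, and choice of sigmoid activation are arbitrary assumptions made here for demonstration, not values from the text:

```python
import numpy as np

def sigmoid(z):
    # One common activation function; any non-linear activation could be used.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Arbitrary example dimensions: 4 input features, 3 hidden nodes, 2 output nodes.
W_hidden = rng.normal(size=(4, 3))   # input-to-hidden connection weights
b_hidden = np.zeros(3)               # hidden-layer biases
W_output = rng.normal(size=(3, 2))   # hidden-to-output connection weights
b_output = np.zeros(2)

x = rng.normal(size=4)               # one input sample

# Each layer repeats the same two steps: dot product, then activation.
hidden = sigmoid(x @ W_hidden + b_hidden)
output = sigmoid(hidden @ W_output + b_output)

# During training, this output would be compared with the target for
# backpropagation; at test time it is reported as the prediction.
print(output)
```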
Multilayer perceptrons are the basis of all ANNs, and they have greatly improved machine
learning algorithms in both regression and classification applications. Their flexibility and the
abundance of available activation and optimisation functions have freed computers from the
XOR limitation of single-layer perceptrons and extended their learning potential to richer and
more complex problems.
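To make the XOR point concrete, the sketch below shows a 2-2-1 MLP with step activations computing XOR, something a single-layer perceptron provably cannot do; the weights are hand-picked here purely for illustration:

```python
import numpy as np

def step(z):
    # Heaviside step activation.
    return (z > 0).astype(int)

# Hand-picked weights: the first hidden node computes OR, the second AND,
# and the output node computes (OR AND NOT AND), i.e. XOR.
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])
W_output = np.array([1, -1])
b_output = -0.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = step(np.array(x) @ W_hidden + b_hidden)
    y = step(h @ W_output + b_output)
    print(x, "->", int(y))
# Prints: [0,0]->0, [0,1]->1, [1,0]->1, [1,1]->0
```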
MLPs can be shallow, when there is only one hidden layer, or Deep Neural Networks (DNNs),
when there are two or more hidden layers. ANNs, and especially DNNs, have been at the
forefront of research in recent years, as they are fundamental to deep learning and AI. One of
their core strengths is that they solve problems stochastically, which allows approximate
solutions to very complex or even otherwise intractable problems. This stochastic approach
means the model makes no assumptions about the underlying probability density or other
relations between the variables of the input data; instead, it captures them through the
weights (the interconnections of the nodes) and the iterative training process. MLPs can
achieve high performance even with limited training data, given a sufficient number of nodes
and layers, and a two-layer backpropagation neural network with enough hidden neurons has
been proven to be a universal approximator [23]. The most important disadvantage of MLPs
compared with other DNN architectures is that they are fully connected: every node in one
layer connects to every node in the next, forming a dense network, so the number of
parameters the model requires becomes very high. This leads to inefficiency and redundancy
in more complex problems [24].
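To see why dense connectivity inflates the parameter count, the short sketch below counts the weights and biases of a fully connected network; the layer sizes are illustrative assumptions only:

```python
def dense_param_count(layer_sizes):
    # Each fully connected layer has (inputs x outputs) weights plus one
    # bias per output node.
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Illustrative example: 100 input features, two hidden layers of 256 nodes,
# and 5 output classes.
print(dense_param_count([100, 256, 256, 5]))  # 92,933 parameters
```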
These are the five models that were used for this project. This brief explanation of how they
work should help clarify why each of them produced the results it did during the experimental
phase. Each model has its own uses and advantages in different cases, so this comparison
among them was considered worthwhile, particularly on the NSL-KDD, which provides a rich
dataset in terms of the number and variety of its features, the variation in the correlations
between them, and the many classes available to study.