After separating column #42, one more column was deleted from all the dataframes: column #19
(number of outbound commands in an FTP session). This column was all zeros, so it became
NaN during the correlation calculations, and it did not seem to offer any information to
the dataset.
4.3.1. One-hot encoding
After cleaning the data, the categorical variables need to be converted into numerical ones, so
that they can be included in the correlation calculations and then fed into the model. As
described before, there are four categorical features in the dataset: protocols,
services, flags, and attack types (see 4.2.1. Categorical features). While the first three are
common to all the dataframes created, the attack type makes it necessary to create different
encoded representations for each dataframe.
The categorical features were encoded into numerical variables through one-hot
encoding [27]. With one-hot encoding, specifically by creating dummy variables out of the
categorical features, each label is turned into a vector of n dimensions, where n is the number
of different values the categorical variable can take.
For example, column #1 holds the protocol used for each record. In the dataset, there are
three protocol categories: {TCP, UDP, ICMP}, and each record can have only one of these
values. The protocols, like all other categorical features in this dataset, have no
hierarchical relationship with each other, which means that they cannot simply be replaced by
integer values such as {0, 1, 2}, as these would imply an order among the categories. With one-hot
encoding of the protocols, the categorical variables are turned into 3-dimensional vectors:
{[1, 0, 0], [0, 1, 0], [0, 0, 1]}. TCP is represented by [1, 0, 0], similarly UDP by [0, 1, 0] and
ICMP by [0, 0, 1]. As the name one-hot encoding suggests, each record has all
dimensions set to zero, except for the one that represents its category.
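The protocol example above can be sketched with pandas; the column name and sample values below are illustrative stand-ins (the dataset stores the protocols in lower case):

```python
import pandas as pd

# Illustrative protocol column; the real dataset stores these
# values in lower case ("tcp", "udp", "icmp").
protocols = pd.Series(["tcp", "udp", "icmp", "tcp"], name="protocol_type")

# get_dummies creates one 0/1 indicator column per category,
# so each record becomes a 3-dimensional one-hot vector.
one_hot = pd.get_dummies(protocols, dtype=int)
print(one_hot)
#    icmp  tcp  udp
# 0     0    1    0
# 1     0    0    1
# 2     1    0    0
# 3     0    1    0
```

Exactly one indicator is set per row, which is the defining property of a one-hot representation.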
With the pandas get_dummies method, the categorical features are all moved to the end of the
dataframe, and the one-hot encoded vectors are expanded into separate columns. This is shown
in Table 5, where columns 0 and 4–40 keep their previous order, while columns 1
(protocols), 2 (services), and 3 (flags) have moved to the tail of the dataframe, and each
previously single-column feature has expanded to 3, 70 and 11 columns respectively (in
the training set).
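The reordering described above can be demonstrated on a toy dataframe (the column names are only illustrative of the dataset's layout):

```python
import pandas as pd

# Toy dataframe: one numeric column plus two categorical ones,
# a small stand-in for the dataset's 41 feature columns.
df = pd.DataFrame({
    "duration": [0, 12],
    "protocol_type": ["tcp", "udp"],
    "flag": ["SF", "S0"],
})

# get_dummies leaves numeric columns in their original order and
# appends one indicator column per category at the end, prefixed
# with the name of the original categorical column.
encoded = pd.get_dummies(df, dtype=int)
print(list(encoded.columns))
# ['duration', 'protocol_type_tcp', 'protocol_type_udp', 'flag_S0', 'flag_SF']
```

This is why, after encoding, the numeric columns keep their positions while the categorical features reappear expanded at the tail of the dataframe.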
In the case of column 41 (type of traffic), the number of columns after expansion depends on
the classification task: there are 23 traffic labels in the multiclass training set (38 different
labels in the respective test set), 2 labels for the binary classification, and 5 labels for the
four-class classification training and test sets.
One-hot encoding has a minor flaw, which can cause a problem if not considered. As
mentioned, the number of traffic labels differs between the training set and the test set,
resulting in a different number of columns after the encoding. The same thing happens in the