After separating column #42, one more column was deleted from all the dataframes: column #19
(number of outbound commands in an FTP session). This column was all zeros, so it became
NaN during the correlation calculations, and it did not seem to offer any information to
the dataset.
4.3.1. One-hot encoding
After cleaning the data, the categorical variables need to be converted into numerical ones, so
that they can be included in the correlation calculations and then fed into the model. As
described before, there are four categorical features in the dataset: protocols,
services, flags, and attack types (see 4.2.1. Categorical features). While the first three are
common to all the dataframes created, the attack type makes it necessary to create different
encoded representations for each dataframe.
The categorical features were encoded into numerical variables through one-hot
encoding [27]. With one-hot encoding, specifically by creating dummy variables out of the
categorical features, each label is turned into a vector of n dimensions, where n is the number
of different values the categorical variable can take.
For example, column #1 holds the protocol used for each record. In the dataset, there are
three protocol categories: {TCP, UDP, ICMP}, and each record can have only one of these
values. The protocols, like all other categorical features in this dataset, have no
hierarchical relationship with each other, which means that they cannot simply be replaced by
integer values such as {0, 1, 2}, as these would imply an order among the categories. With one-hot
encoding of the protocols, the categorical variables are turned into 3-dimensional vectors:
{[1, 0, 0], [0, 1, 0], [0, 0, 1]}. TCP is represented by [1, 0, 0], similarly UDP by [0, 1, 0] and
ICMP by [0, 0, 1]. As the name one-hot encoding suggests, each record has all
dimensions set to zero, except for the one that represents its category.
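The protocol example above can be sketched with pandas; the column name and sample values below are illustrative stand-ins (the dataset stores the protocols in lower case):

```python
import pandas as pd

# Illustrative protocol column; the real dataset stores these
# values in lower case ("tcp", "udp", "icmp").
protocols = pd.Series(["tcp", "udp", "icmp", "tcp"], name="protocol_type")

# get_dummies creates one 0/1 indicator column per category,
# so each record becomes a 3-dimensional one-hot vector.
one_hot = pd.get_dummies(protocols, dtype=int)
print(one_hot)
#    icmp  tcp  udp
# 0     0    1    0
# 1     0    0    1
# 2     1    0    0
# 3     0    1    0
```

Exactly one indicator is set per row, which is the defining property of a one-hot representation.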
With the pandas get_dummies method, the categorical features are all moved to the end of the
dataframe, and the one-hot encoded vectors are expanded into separate columns. This is shown
in Table 5, where columns 0 and 4–40 keep their previous order, while columns 1
(protocols), 2 (services), and 3 (flags) have moved to the tail of the dataframe, and each
previously single-column feature has expanded to 3, 70 and 11 columns respectively (in
the training set).
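The reordering described above can be demonstrated on a toy dataframe (the column names are only illustrative of the dataset's layout):

```python
import pandas as pd

# Toy dataframe: one numeric column plus two categorical ones,
# a small stand-in for the dataset's 41 feature columns.
df = pd.DataFrame({
    "duration": [0, 12],
    "protocol_type": ["tcp", "udp"],
    "flag": ["SF", "S0"],
})

# get_dummies leaves numeric columns in their original order and
# appends one indicator column per category at the end, prefixed
# with the name of the original categorical column.
encoded = pd.get_dummies(df, dtype=int)
print(list(encoded.columns))
# ['duration', 'protocol_type_tcp', 'protocol_type_udp', 'flag_S0', 'flag_SF']
```

This is why, after encoding, the numeric columns keep their positions while the categorical features reappear expanded at the tail of the dataframe.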
In the case of column 41 (type of traffic), the number of columns after expansion depends on
the classification task: there are 23 traffic labels in the multiclass training set (38 different
labels in the respective test set), 2 labels for the binary classification, and 5 labels for the
four-class classification training and test sets.
One-hot encoding has a minor flaw, which can cause a problem if not considered. As
mentioned, the number of traffic labels differs between the training set and the test set,
resulting in a different number of columns after the encoding. The same thing happens in the