Data Cleaning
■ Confirm that there are no missing, wrongly formatted, out-of-bounds, or
redundant values
– All records are unique and complete (every feature present)
■ Separate last column (difficulty level)
– Carries no information for the model; used only to compare the training and test sets
■ Drop col. 20 (number of outbound commands in an ftp session)
– All 0s, so the column has zero variance and its correlations came out as NaN
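
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code used for the slides: the column names (`duration`, `num_outbound_cmds`, `src_bytes`, `difficulty`) and the four-row frame are made-up assumptions, and the constant column is detected by value rather than dropped by position.

```python
import pandas as pd


def check_quality(df: pd.DataFrame) -> dict:
    """Count missing values and fully duplicated rows."""
    return {
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


def clean(df: pd.DataFrame, difficulty_col: str):
    """Split off the difficulty column and drop constant numeric columns."""
    difficulty = df[difficulty_col]
    features = df.drop(columns=[difficulty_col])
    # A column with a single distinct value has zero variance, so its
    # Pearson correlation with any other column is undefined (NaN).
    numeric = features.select_dtypes("number")
    constant = [c for c in numeric.columns if numeric[c].nunique() == 1]
    return features.drop(columns=constant), difficulty, constant


# Tiny synthetic frame; names and values are illustrative assumptions.
demo = pd.DataFrame({
    "duration": [0, 2, 5, 1],
    "num_outbound_cmds": [0, 0, 0, 0],
    "src_bytes": [181, 239, 235, 219],
    "difficulty": [20, 15, 19, 21],
})

report = check_quality(demo)
features, difficulty, dropped = clean(demo, "difficulty")
```

Here `report` holds the missing-value and duplicate counts, `difficulty` is kept aside for comparing the training and test sets, and `dropped` lists the all-zero column that would otherwise produce NaN entries in the correlation matrix.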