Bi-Weekly Update #4

During the last two weeks, I first implemented the analysis functions for random forests and linear SVMs, but I hit a roadblock because my functions weren't reading the data properly. I eventually realized that I needed to scale my numeric features and encode the string-valued columns in the dataset into binary features.
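This preprocessing step can be sketched roughly as below with scikit-learn's `StandardScaler` and `OneHotEncoder`; the feature values and the protocol column here are hypothetical stand-ins for illustration, not the project's actual dataset:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical feature matrix: two numeric columns and one string column.
X = np.array([[0.5, 120, "tcp"],
              [1.2, 80, "udp"],
              [0.1, 300, "tcp"],
              [2.3, 45, "icmp"]], dtype=object)

# Scale the numeric columns; one-hot encode the string column into binary features.
pre = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])
X_enc = pre.fit_transform(X)
print(X_enc.shape)  # (4, 5): 2 scaled columns + 3 one-hot columns
```

`handle_unknown="ignore"` keeps the encoder from failing if the test set contains a category the training set never saw.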

Although I was now able to run my analysis, the results I saw from linear SVMs were lackluster, and although Random Forests performed marginally better, I wanted to see if I could get better results with a different model. I then researched and implemented Gaussian SVMs, which ended up performing similarly to Random Forests.
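A Gaussian (RBF-kernel) SVM with basic hyperparameter tuning might look like the sketch below; the synthetic data and the parameter grid are illustrative assumptions, not the values used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the real project uses the scaled, encoded dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian (RBF) kernel SVM, tuning C and gamma with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
                    cv=5)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)
print(grid.best_params_, round(acc, 3))
```

`C` trades off margin width against misclassification, while `gamma` controls the width of the Gaussian kernel, so tuning both together matters.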

During the following two weeks I plan to research and implement neural networks, and experiment again with the other models’ hyperparameters to see if they can be further optimized. Then I plan to write my final report and summarize all of my findings.

Current Results

Figure 1 (left): Tuned Linear SVM Confusion Matrix

Figure 2 (right): Tuned Random Forest Confusion Matrix

The Linear SVM had an accuracy of 0.60 on the test set, which suggests the data is far from linearly separable.

The tuned Random Forest had an accuracy of 0.81, which, although much better, is still short of the desired performance, especially since it let 8,762 attacks go undetected (19% of all attacks in the test set; see Figure 2).
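For context, tuning a Random Forest in scikit-learn typically looks something like the following sketch; the parameter grid and the synthetic data are assumptions for illustration, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Randomly sample hyperparameter combinations, scoring each with 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200, 400],
     "max_depth": [None, 10, 20],
     "min_samples_leaf": [1, 2, 4]},
    n_iter=10, cv=5, random_state=0)
search.fit(X_tr, y_tr)
print(search.best_params_, round(search.score(X_te, y_te), 3))
```

Random search is a cheap alternative to an exhaustive grid when the number of hyperparameter combinations grows.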

Figure 3: Tuned Gaussian SVM Confusion Matrix

The tuned Gaussian SVM had the best performance of the three models, though it was only marginally better than the tuned Random Forest, with an accuracy of 0.82. The Gaussian SVM did, however, catch many more attacks, letting only 7,229 through this time (15% of all attacks; see Figure 3).
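The accuracy and missed-attack percentages reported above can be read straight off a confusion matrix. Here is a minimal sketch using a tiny hypothetical label set (1 = attack, 0 = benign), not the project's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten test samples.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
missed_rate = fn / (fn + tp)  # fraction of attacks that slipped through
print(accuracy, missed_rate)  # 0.8 0.2
```

The false-negative count (bottom-left cell for the attack row) is what the report tracks as "attacks let through," which is why accuracy alone understates the cost of a model's mistakes here.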

Overall, the models I have tested so far have been somewhat effective, but I believe a substantially higher accuracy is still achievable.

Project Logbook – Apr 7

Feb 5 – Researched project ideas – 2 hours

Feb 7 – Wrote Project proposal – 3 hours

Feb 17 – Further research of learning algorithms – 1 hour

Feb 19 – Wrote code for the following – 3 hours

  • Reading the dataset
  • Partitioning the data into training and testing sets
  • Using K-fold cross validation

Feb 21 – Wrote biweekly update – 2 hours

Feb 28 – Researched Random Forests and their implementation in scikit-learn – 4 hours

March 3 – Implementation of code for running Random Forests – 3 hours

March 7 – Wrote Biweekly update – 2 hours

March 9 – Reading/Researching SVMs in [1] – 3 hours

March 14 – Researched possible libraries to use for my base models – 2 hours

March 20 – Implemented linear SVMs through scikit-learn – 2 hours

March 21 – Wrote Biweekly update and responded to feedback on my project – 2 hours

March 27 – Implemented Analysis in code – 4 hours

April 3 – Researched and implemented data scaling and pre-processing – 1 hour

April 4 – Implemented Gaussian SVM and analysis [1][2] – 2 hours

April 7 – Wrote and recorded project demo – 2 hours

April 7 – Wrote Biweekly update – 1 hour

References

[1] Machine Learning, Tom Mitchell, url: https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf, accessed April 4, 2025

[2] Support Vector Machines, scikit-learn 1.6.1 documentation, url: https://scikit-learn.org/stable/modules/svm.html, accessed April 4, 2025