Final Project Logbook – April 18

Feb 5 – Researched project ideas – 2 hours

Feb 7 – Wrote project proposal – 3 hours

Feb 17 – Further research of learning algorithms – 1 hour

Feb 19 – Wrote code for the following – 3 hours:

  • Reading the dataset
  • Partitioning the data into training and testing sets
  • Using k-fold cross validation

Feb 21 – Wrote biweekly update – 2 hours

Feb 28 – Researched Random Forests and their implementation through Scikit-learn – 4 hours

March 3 – Implemented code for running Random Forests – 3 hours

March 7 – Wrote biweekly update – 2 hours

March 9 – Read and researched SVMs in [1] – 3 hours

March 14 – Researched possible libraries to use for my base models – 2 hours

March 20 – Implemented linear SVMs through Scikit-learn – 2 hours

March 21 – Wrote biweekly update and responded to feedback on my project – 2 hours

March 27 – Implemented analysis in code – 4 hours

April 3 – Researched and implemented data scaling and pre-processing – 1 hour

April 4 – Implemented Gaussian SVM and analysis [1][2] – 2 hours

April 7 – Wrote and recorded project demo – 2 hours

April 7 – Wrote biweekly update – 1 hour

April 13 – Researched, implemented, and tested the neural network model – 4 hours

April 15 – Started writing my final report – 4 hours

April 18 – Finalized final report – 8 hours

April 18 – Responded to peer questions, updated the website, and wrote the final logbook – 20 minutes

Bi-Weekly Update #4

During the last two weeks, I first implemented the analysis functions for Random Forests and linear SVMs, but I hit a roadblock because my functions weren’t reading the data properly. I eventually realized that I needed to scale my numeric features and encode the string values in the dataset into binary features.
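That preprocessing step can be sketched as follows. This is a minimal illustration with hypothetical column names, not the actual UNSW-NB15 features or the code used in the project:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical stand-in frame: two numeric features and one string feature.
df = pd.DataFrame({
    "duration": [0.1, 2.3, 0.5, 1.7],
    "bytes": [120, 45000, 800, 3100],
    "proto": ["tcp", "udp", "tcp", "icmp"],
})

# Scale numeric columns; one-hot encode the categorical column into binary features.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["duration", "bytes"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["proto"]),
])

X = pre.fit_transform(df)
# Each string value becomes its own 0/1 column: tcp, udp, icmp -> 3 binary columns.
print(X.shape)  # (4, 5): 2 scaled numeric + 3 one-hot columns
```

StandardScaler standardizes each numeric column, while OneHotEncoder turns each distinct string value into its own 0/1 column, which is one common way to binarize categorical features before training SVMs.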

Although I was then able to run my analysis, the results from the linear SVMs were lackluster, and although Random Forests performed marginally better, I wanted to see if I could get better results with a different model. I then researched and implemented Gaussian SVMs, which ended up performing similarly to Random Forests.
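In Scikit-learn, the Gaussian kernel is the `rbf` kernel of `SVC`. A minimal sketch on synthetic stand-in data; the hyperparameter values shown are illustrative assumptions, not the tuned values from the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the intrusion dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for RBF kernels, since gamma acts on feature distances.
scaler = StandardScaler().fit(X_train)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # Gaussian (RBF) kernel SVM
clf.fit(scaler.transform(X_train), y_train)
acc = clf.score(scaler.transform(X_test), y_test)
print(round(acc, 2))
```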

During the following two weeks I plan to research and implement neural networks, and experiment again with the other models’ hyperparameters to see if they can be further optimized. Then I plan to write my final report and summarize all of my findings.

Current Results

Figure 1 (Left): Tuned Linear SVM Confusion Matrix

Figure 2 (Right): Tuned Random Forest Confusion Matrix

The linear SVM had an accuracy of 0.6 on the test set, which suggests that the data is far from linearly separable.

The tuned Random Forest had an accuracy of 0.81, which, although much better, is still short of the desired performance, especially since it let 8,762 attacks go undetected (19% of all attacks in the test set; see Figure 2).

Figure 3: Tuned Gaussian SVM Confusion Matrix

The tuned Gaussian SVM had the best performance of the three models, but with an accuracy of 0.82 it was only marginally better than the tuned Random Forest. The Gaussian SVM did, however, catch many more attacks, letting only 7,229 through (15% of all attacks; see Figure 3).
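The undetected attacks quoted above are the false negatives of each confusion matrix. A small sketch of how that miss rate can be read off a confusion matrix, using tiny hypothetical labels rather than the project's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = attack, 0 = benign.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predictions: cm[1, 0] counts attacks
# that were predicted benign, i.e. the undetected attacks.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

miss_rate = fn / (fn + tp)        # fraction of attacks that went undetected
accuracy = (tp + tn) / cm.sum()
print(fn, round(miss_rate, 2))    # 1 undetected attack, 17% of all attacks
```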

Overall, the models I have tested thus far have been somewhat effective, though I believe a much higher accuracy is still achievable.

Project Logbook – Apr 7

Feb 5 – Researched project ideas – 2 hours

Feb 7 – Wrote project proposal – 3 hours

Feb 17 – Further research of learning algorithms – 1 hour

Feb 19 – Wrote code for the following – 3 hours:

  • Reading the dataset
  • Partitioning the data into training and testing sets
  • Using k-fold cross validation

Feb 21 – Wrote biweekly update – 2 hours

Feb 28 – Researched Random Forests and their implementation through Scikit-learn – 4 hours

March 3 – Implemented code for running Random Forests – 3 hours

March 7 – Wrote biweekly update – 2 hours

March 9 – Read and researched SVMs in [1] – 3 hours

March 14 – Researched possible libraries to use for my base models – 2 hours

March 20 – Implemented linear SVMs through Scikit-learn – 2 hours

March 21 – Wrote biweekly update and responded to feedback on my project – 2 hours

March 27 – Implemented analysis in code – 4 hours

April 3 – Researched and implemented data scaling and pre-processing – 1 hour

April 4 – Implemented Gaussian SVM and analysis [1][2] – 2 hours

April 7 – Wrote and recorded project demo – 2 hours

April 7 – Wrote biweekly update – 1 hour

References

[1] Machine Learning, Tom Mitchell, url: https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf, accessed April 4, 2025

[2] Support Vector Machines, Scikit-learn, url: https://scikit-learn.org/stable/modules/svm.html, accessed April 4, 2025

Bi-Weekly Update #3

During these last two weeks, I focused on implementing SVMs. I again used Tom Mitchell’s “Machine Learning” [1] as the base for my research and understanding of SVMs. After some research I decided a linear kernel would be appropriate for my purposes: the dataset is quite large, so training multiple Gaussian-kernel SVMs, as is necessary for analyzing performance, would be extremely time consuming. I decided to use Scikit-learn’s SGDClassifier [2] as the base for my model, as it was the fastest option I found. I have now implemented the model, though I have yet to finish the analysis portion of my code.
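With SGDClassifier, hinge loss plus stochastic gradient descent yields a linear SVM. A minimal sketch on synthetic stand-in data; the parameter values are illustrative assumptions, not the project's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the intrusion dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SGD is sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)

# loss="hinge" makes this a linear SVM trained by stochastic gradient descent,
# which scales to large datasets far better than a kernelized solver.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
acc = clf.score(scaler.transform(X_test), y_test)
print(round(acc, 2))
```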

During the next two weeks I will focus on completing the analysis portion of my code, then preparing the demo video, which will likely feature an explanation of my code and my reasoning behind choosing my hyperparameter configurations.

Project Logbook – Mar 21

Feb 5 – Researched project ideas – 2 hours

Feb 7 – Wrote project proposal – 3 hours

Feb 17 – Further research of learning algorithms – 1 hour

Feb 19 – Wrote code for the following – 3 hours:

  • Reading the dataset
  • Partitioning the data into training and testing sets
  • Using k-fold cross validation

Feb 21 – Wrote biweekly update – 2 hours

Feb 28 – Researched Random Forests and their implementation through Scikit-learn – 4 hours

March 3 – Implemented code for running Random Forests – 3 hours

March 7 – Wrote biweekly update – 2 hours

March 9 – Read and researched SVMs in [1] – 3 hours

March 14 – Researched possible libraries to use for my base models – 2 hours

March 20 – Implemented SVMs through Scikit-learn – 2 hours

March 21 – Wrote biweekly update and responded to feedback on my project – 2 hours

References

[1] Machine Learning, Tom Mitchell, url: https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf, accessed March 21, 2025

[2] SGDClassifier, Scikit-learn, url: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html, accessed March 21, 2025

Bi-Weekly Update #2

During these last two weeks I focused on researching my first chosen learning algorithm: Random Forests. First I found a textbook [1] to gain a solid background in Random Forests and how they work. Once I had gained a solid understanding, I researched machine learning libraries that I could use to implement Random Forests in my code. I found Scikit-learn’s RandomForestClassifier [2] and spent some time learning the function of each of its parameters. Earlier this week I wrote the function in my code that applies Random Forests to the given dataset.
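A minimal sketch of such a function on synthetic data, with illustrative parameter values (the project's actual hyperparameters are not recorded in this update):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the intrusion dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Key RandomForestClassifier parameters: n_estimators (number of trees),
# max_depth (tree depth cap), and max_features (features tried per split).
clf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)

# 5-fold cross validation, matching the k-fold setup from earlier in the project.
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 2))
```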

Before the next progress report I plan to research and implement SVMs in my code, and add analysis of the performance of the two methods.

Project Logbook – Mar 7

Feb 5 – Researched project ideas – 2 hours

Feb 7 – Wrote project proposal – 3 hours

Feb 17 – Further research of learning algorithms – 1 hour

Feb 19 – Wrote code for the following – 3 hours:

  • Reading the dataset
  • Partitioning the data into training and testing sets
  • Using k-fold cross validation

Feb 21 – Wrote biweekly update – 2 hours

Feb 28 – Researched Random Forests and their implementation through Scikit-learn – 4 hours

March 3 – Implemented code for running Random Forests – 3 hours

March 7 – Wrote biweekly update – 2 hours

References

[1] Machine Learning, Tom Mitchell, url: https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf, accessed Feb 28, 2025

[2] RandomForestClassifier, Scikit Learn, url: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, accessed Feb 28, 2025

Bi-Weekly Update #1

During the last couple of weeks I did some more research for my project. First I re-evaluated the machine learning algorithms I’d be using. I had intended to use a neural network to classify the network intrusion packets because of its current popularity and strong performance, but I have now decided to use a linear support vector machine instead: neural networks are very difficult to interpret, and I would like to be able to investigate which features the learning methods deem most indicative of a network intrusion. Thus I will use a random forest learning algorithm and a support vector machine in my analysis.

I have also decided that instead of building a packet-reading script to produce a CSV, I will use the pre-existing CSVs for my analysis. I have two reasons for this change: first, the packet-reading script would be time consuming to create, and I believe the implementation of the algorithms will be very time consuming by itself; second, the CSVs from the dataset I’ll be using contain features that cannot be derived from the packets alone, and more information will surely lead to better-performing models.

During these last two weeks I have researched the algorithms I’ll be using, and written the code to read the CSV files as well as the code for k-fold cross validation in my analysis.
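A minimal sketch of the CSV-reading and k-fold setup described above, using an inline stand-in for the dataset (the column names and label field are hypothetical, not the actual dataset schema):

```python
from io import StringIO

import pandas as pd
from sklearn.model_selection import KFold

# Inline stand-in for the dataset CSV; in the project this would be a file path.
csv_text = """duration,bytes,label
0.1,120,0
2.3,45000,1
0.5,800,0
1.7,3100,1
0.9,560,0
3.2,70000,1
"""

df = pd.read_csv(StringIO(csv_text))
X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()

# k-fold cross validation: each fold is held out once as the validation set.
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    print(len(train_idx), len(val_idx))  # 4 training rows, 2 validation rows per fold
```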

Project Logbook – Feb 21

Feb 5 – Researched project ideas – 2 hours

Feb 7 – Wrote project proposal – 3 hours

Feb 17 – Further research of learning algorithms – 1 hour

Feb 19 – Wrote code for the following – 3 hours:

  • Reading the dataset
  • Partitioning the data into training and testing sets
  • Using k-fold cross validation

Feb 21 – Wrote biweekly update – 2 hours

Project Proposal – Network Intrusion Detection with Machine Learning

Hayden Dunstan, CSC 466

Overview

As many students in our CSC 466 class discussions have agreed, network security is a top priority: internet use keeps growing, and an increasing amount of personal information is stored on internet-facing servers.

Network intrusion detection is widely accepted as an effective method for dealing with network threats [1], though traditional rule-based intrusion detection systems (like Snort) struggle with new attack patterns. In this project I will investigate the effectiveness of machine learning methods on intrusion detection datasets and attempt to train a model that can effectively detect advanced threats, in order to gain a better understanding of both network attacks and machine learning.

Project Plan

I plan to use the UNSW-NB15 dataset [2] created at UNSW Canberra, which contains 9 types of attacks and 49 features. The set contains over 2 million records and has conveniently been partitioned into a training set and a test set.

For the first part of my project, I will design a program that can read pcap files and output their features. In the second part, I will train two models, using Random Forests and neural networks respectively, then evaluate their performance and present the capabilities of my final model in the final report.

Schedule Dates

  • First Biweekly Update Feb 21
  • Midterm Update Mar 7
  • Third biweekly update Mar 21
  • Final Presentation Apr 4
  • Final Report Apr 11

References

[1] Zhen Yang et al., “A systematic literature review of methods and datasets for anomaly-based network intrusion detection”, Computers & Security, ScienceDirect, accessed Feb 7, 2025

[2] The UNSW-NB15 Dataset, UNSW Canberra, url: https://research.unsw.edu.au/projects/unsw-nb15-dataset, accessed Feb 7, 2025