Practical Feedback on Using Machine Learning in Information Security

29 Apr 2024

Introduction

The ever-evolving landscape of online threats demands constant adaptation from information security teams. We see this in the relentless tactics employed by "bad actors" – from creating fake profiles for malicious purposes to exploiting vulnerabilities in network infrastructure. As the scale and complexity of these attacks grow, traditional defense mechanisms like rate limiting often reach their limits. This article dives into the power of Machine Learning (ML) as a powerful weapon in the information security arsenal, offering practical insights from real-world applications. Here, we'll explore how ML can be harnessed to identify and neutralize these threats, drawing from my experience at Meta, where we safeguard billions of users on the WhatsApp platform. We'll not only delve into the technical aspects of building ML models but also explore crucial pre-ML strategies to fortify your defenses.

Have you ever wondered about how X(formerly known as Twitter) identifies bots that are tweeting spam? or how banks identify fraudulent accounts? or how GitHub identifies faulty servers in the network? Such systems are built by information security teams that monitor and take down such activities at scale using AI/ML systems. In cases where the automation can’t handle or identify, incident response teams take them down. These learnings are captured and then train new ML classifiers to identify outliers.

Bad actors have a wide range of sophistication and their intent varies too. Some bad actors create fake profiles that can be used to carry out many different types of abuse: scraping, spamming, fraud, and phishing, among others. To build robust countermeasures against different types of attacks on our platform, a funnel of defenses is built to detect and take down fake accounts at multiple stages.

Example of how an attacker goes after your system

(1) The attacker needs to first create an account for which they would use PVAcreator or other tools

Step 1: Create more accounts and enable connection with host system

(2) Using the automated accounts, the attacker needs to reach the data by navigating through the network and moving laterally.

Step 2: Identify what data / process you want to control over

(3) Once the attacker has access to the data, the attacker needs to exfiltrate this data out of the network.

Step 3: Start the attack!

Some strategies to start employing before using ML approach

IP-Based Rate Limiting
- Limits the number of requests from a single IP address within a specific timeframe.
- Helps mitigate DDoS attacks and brute-force attempts.
User-Based Rate Limiting
- Sets a cap on the requests a single user can make within a given time window.
- Guards against abuse and unauthorized access attempts.
Token-Based Rate Limiting
- Uses unique tokens or API keys to track and control API requests per token.
- Secures APIs from misuse and potential data leaks.
Using CAPTHA to distinguish between automated bots and legitimate human users
- In addition to the traditional rate limiting techniques mentioned above, organizations can further enhance their cybersecurity by incorporating the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) challenges.

Very soon you will start seeing these rules become quite ineffective as attackers rotate IPs, accounts, and tokens. They will start slowing down requests e.t.c

Bird's-Eye View of Information Security Framework

Funnel of Defense

Building ML/AI-based solution

You need to start collecting logs and turn them into ML features for training.

Here is an example of building features based on the traffic data you see:

Here are some examples of features this paper has developed from logs:

RDP Features: SuccessfulLogonRDPPortCount, UnsuccessfulLogonRDPPortCount, RDPOutboundSuccessfulCount, RDPOutboundFailedCount, RDPInboundCount
SQL Features: UnsuccessfulLogonSQLPortCount, SQLOutboundSuccessfulCount, SQLOutboundFailedCount, SQLInboundCount
Successful Logon Features: SuccessfulLogonTypeInteractiveCount, SuccessfulLogonTypeNetworkCount, SuccessfulLogonTypeUnlockCount, SuccessfulLogonTypeRemoteInteractiveCount, SuccessfulLogonTypeOtherCount
Unsuccessful Logon Features: UnsuccessfulLogonTypeInteractiveCount, UnsuccessfulLogonTypeNetworkCount, UnsuccessfulLogonTypeUnlockCount, UnsuccessfulLogonTypeRemoteInteractiveCount, UnsuccessfulLogonTypeOtherCount
Others: NtlmCount, DistinctSourceIPCount, DistinctDestinationIPCount

With such feature vectors, we can build a matrix and start using outlier detection techniques.

Principal Component Analysis (PCA): We start with a classical way of finding outliers using PCA because it is the most visual, in my opinion. In simple terms, you can think of PCA is a way to compress and decompress data, where the data lost during compression in minimized. Here is an example from fast.ai

2. Auto-encoders: These are in short non-linear version of PCA.

The neural network is constructed such that there is an information bottleneck in the middle. By forcing the network to go through a small number of nodes in the middle, it forces the network to prioritize the most meaningful latent variables, which is like the principal components in PCA. Here is an example using fast.ai

Isolation Forest and other decision forest: Anomalies have two characteristics. They are distanced from normal points and there are only a few of them. Isolation Forest randomly cuts a given sample until a point is isolated. The intuition is that outliers are relatively easy to isolate. Isolation Forest is a tree ensemble method of detecting anomalies was first proposed by Liu, Ting, and Zhou. Here is an example from scikits.

Conclusion

The fight against online threats is a continuous game of cat and mouse. While advanced ML techniques offer unparalleled capabilities, it's crucial to remember that they are just one piece of the puzzle. By combining pre-emptive measures like rate limiting with robust ML models, information security professionals can build a multi-layered defense system. This article serves as a practical roadmap for leveraging ML in your security strategy, empowering you to identify and thwart even the most sophisticated attacks. Remember, information security is a shared responsibility. By actively implementing these techniques and fostering collaboration within the security community, we can create a safer online environment for everyone.