Machine Learning Goes Dark And Deep To Find Zero-Day Exploits Before Day Zero

How do you stop someone from exploiting a vulnerability in your software when you don’t know that the vulnerability exists? That’s the problem faced by cyber security experts who try to stop zero-day exploits. If you’re lucky, a friendly spots the vulnerability and tells you about it so you can fix it before any damage is done. If you’re unlucky, the hackers find it first and you find out after the attacks begin on day zero.

Mega-corporations like Google and Apple are attacking this problem with bounties offered to anyone who can hack their software. A team at Arizona State University offers a different approach with a machine learning system that monitors darknet and deepnet traffic for information about zero-day exploits before the exploits are let loose in the wild.

The stereotype of the lone hacker with bad hygiene infecting the planet with malicious code from a dark basement sizzling with the subaudible hum of high-grade electronics is overdone. The cyber criminals who attack the world’s computers every minute of every day are much more sophisticated. Criminal hacking is a culture where people share ideas and information on forums that exist on Tor networks (the darknet) and websites that are not indexed by search engines (the deepnet). Malware programs and exploit kits are bought and sold on darknet markets. Ransomware-as-a-service is available for as little as $39.

Information about future exploits often appears on discussion forums while the exploit is being designed and tested. Once the exploit is perfected, it will likely be offered for sale on a darknet market before it appears in the wild. The team at Arizona State took advantage of this and built a machine learning system that monitors darknet and deepnet websites for traffic about security exploits. In effect, their machine learning system turns the hackers’ communication and marketing activities into an early-warning system that protects software developers against zero-day exploits. They hacked the hackers.

The system the team built has three components. Crawlers retrieve HTML documents based on exploit-related content. Parsers extract content from the retrieved documents and communicate with the crawlers to find information that changes over time on each webpage. Machine learning classifiers interact with the parsers to filter out information that is unrelated to hacking and exploits.
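The way the three components hand work to one another can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's code: the in-memory `PAGES` dictionary stands in for a real crawler, the `HACKING_TERMS` keyword set stands in for a trained classifier, and every name and URL below is hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def crawl(pages):
    """Stand-in crawler: yields (url, html) pairs from an in-memory dict."""
    for url, html in pages.items():
        yield url, html


def parse(html):
    """Parser: reduce an HTML document to plain text."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)


# Hypothetical keyword list; a real system would use a trained classifier.
HACKING_TERMS = {"exploit", "0day", "malware", "ransomware", "keylogger"}


def is_hacking_related(text):
    """Stand-in classifier: flag text containing any hacking-related term."""
    return bool(set(text.lower().split()) & HACKING_TERMS)


def run_pipeline(pages):
    """Crawl, parse, then keep only the pages the classifier flags."""
    return [url for url, html in crawl(pages) if is_hacking_related(parse(html))]


# Hypothetical listings from an imaginary marketplace.
PAGES = {
    "market.example/item/1": "<h1>Windows exploit kit</h1><p>Fresh 0day for sale</p>",
    "market.example/item/2": "<h1>Leather wallets</h1><p>Handmade and cheap</p>",
}
print(run_pipeline(PAGES))
```

Only the first listing survives the filter. The keyword match is deliberately crude (it splits on whitespace and ignores punctuation); it is there to show where a classifier sits in the flow, not how one should be built.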

The filtering process is critical because darknet markets advertise drugs, pornography, weapons and many other things in addition to security exploits. The team found that only 19% of 162,872 forum posts and 13% of 11,991 darknet ads were concerned with malicious hacking. The machine learning classifier saves a person from having to wade through all this garbage to find the traffic related to security exploits.
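To put those percentages in absolute terms, the short calculation below converts them to approximate counts. Because the published percentages are rounded, the results are estimates rather than the team's exact figures; the function name is an illustrative choice.

```python
def approx_count(total, percent):
    """Approximate item count implied by a rounded whole-number percentage."""
    return round(total * percent / 100)


FORUM_POSTS = 162_872
MARKET_ADS = 11_991

relevant_posts = approx_count(FORUM_POSTS, 19)
relevant_ads = approx_count(MARKET_ADS, 13)
filtered_out = (FORUM_POSTS - relevant_posts) + (MARKET_ADS - relevant_ads)

print(relevant_posts, relevant_ads, filtered_out)
# prints: 30946 1559 142358
```

In other words, roughly 142,000 of the nearly 175,000 items collected are noise that the classifier has to screen out.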

The system works well. Testing showed that the machine learning classifier successfully identified 80% of forum discussions about exploits and 92% of the exploit-related products offered for sale on darknet markets. The system has the added side benefit of identifying the usernames of criminals who operate across more than one forum or marketplace. These are likely to be the more successful cyber criminals whose username is their brand.
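Spotting usernames that recur across sites amounts to a simple grouping step. The sketch below assumes exact, case-insensitive username matching, which is an assumption for illustration; the source does not describe how the team matched identities, and the usernames and site names are made up.

```python
from collections import defaultdict


def cross_site_usernames(sightings):
    """Return usernames seen on more than one site, with the sites they use.

    `sightings` is an iterable of (username, site) pairs.
    """
    sites_by_user = defaultdict(set)
    for username, site in sightings:
        sites_by_user[username.lower()].add(site)
    return {
        user: sorted(sites)
        for user, sites in sites_by_user.items()
        if len(sites) > 1
    }


# Hypothetical sightings collected by the parsers.
SIGHTINGS = [
    ("DarkVendor", "market-a"),
    ("darkvendor", "forum-b"),
    ("one_timer", "market-a"),
    ("DarkVendor", "market-a"),
]
print(cross_site_usernames(SIGHTINGS))
# prints: {'darkvendor': ['forum-b', 'market-a']}
```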

Building and training the team’s machine learning system is labor-intensive. Every darknet and deepnet website that is monitored for security exploits has to have its own specially designed crawler, parser and machine learning classifier. The classifiers undergo semi-supervised learning, which means that human experts are needed to tag 25% of the data retrieved from each marketplace and forum for training purposes.
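The idea of tagging only a quarter of the data and letting the classifier learn from the rest can be illustrated with self-training, one common way of learning from a mix of labeled and unlabeled examples. The source does not say which technique the team used, so treat this as a swapped-in example: the word-count scoring, the threshold and the sample texts are all hypothetical.

```python
from collections import Counter


def train(labeled):
    """Count how often each word appears in relevant vs. irrelevant texts."""
    counts = {True: Counter(), False: Counter()}
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts


def score(counts, text):
    """Positive scores lean 'relevant', negative scores lean 'irrelevant'."""
    return sum(counts[True][w] - counts[False][w] for w in text.lower().split())


def self_train(labeled, unlabeled, threshold=2, max_rounds=5):
    """Repeatedly adopt confident predictions as new training labels."""
    labeled, pending = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        counts = train(labeled)
        confident, still_pending = [], []
        for text in pending:
            s = score(counts, text)
            if abs(s) >= threshold:
                confident.append((text, s > 0))  # trust the model's own guess
            else:
                still_pending.append(text)
        if not confident:
            break  # nothing new learned; stop early
        labeled.extend(confident)
        pending = still_pending
    return train(labeled), pending


# Two expert-tagged items out of eight, i.e. 25% labeled (hypothetical data).
LABELED = [
    ("selling exploit kit for windows", True),
    ("selling leather wallet and shoes", False),
]
UNLABELED = [
    "new exploit kit release",
    "exploit for windows server",
    "leather shoes cheap",
    "linux server release",
    "cheap wallet",
    "weather forecast",
]
model, leftover = self_train(LABELED, UNLABELED)
print(leftover)
# prints: ['weather forecast']
```

Here "linux server release" and "cheap wallet" cannot be classified from the two expert labels alone, but become classifiable in the second round once the model has absorbed its own confident predictions. The item that never reaches the threshold is left for a human to review.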

The time and labor needed to build the system and keep it current are worth it when you consider the payoff. The system is currently producing an average of 305 high-quality cyber-threat security warnings each week while monitoring 27 darknet markets and 21 forums. If only 1% of these warnings result in discovering and patching a potential zero-day exploit before it can affect untold numbers of computers, the time to build and maintain the system will have been well spent.