We’ve covered some of the main points of the Mining Malware project, but haven’t gotten to the real meat of the discussion: what would a search for automation software look like, and would it even be successful? To demonstrate, I’m going to start with a small example and then explain the issues with scaling it up to the amount of malware we currently have in our system. This demonstration is definitely cherry-picked: I’m choosing a set of conditions that shows how a search could work, then discussing the various issues with scaling.
To start, I pulled down the APT1 dataset from the @VXShare torrents. I figured this would be a good place to start, as the set is reasonably familiar to everyone (I even found an easter egg in it). Note that we have to take a step outside of automation at this point: as far as we know, the APT1 dataset didn’t target automation systems specifically; it was designed to gather information from infected systems and transport it to the attackers.
I used a process similar to the one I discussed in a previous post for gathering string data from the VirusShare torrent. I parsed plaintext out of the 293 items in the APT1 dataset and saved the results away for search.
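The string-extraction step can be sketched roughly as follows. This is a minimal illustration, not the actual tooling I used: the directory paths are hypothetical, and the extraction is just a pass for runs of printable ASCII, similar to the Unix `strings` utility.

```python
import os
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Pull runs of printable ASCII out of a binary blob,
    similar in spirit to the Unix `strings` utility."""
    pattern = re.compile(rb"[ -~]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

def dump_dataset_strings(sample_dir: str, out_dir: str):
    """Save one plaintext file per malware sample for later searching.
    Paths here are placeholders, not the real dataset layout."""
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(sample_dir):
        with open(os.path.join(sample_dir, name), "rb") as f:
            strs = extract_strings(f.read())
        with open(os.path.join(out_dir, name + ".txt"), "w") as f:
            f.write("\n".join(strs))
```

Anything that survives this pass — DLL names, file paths, URLs — becomes searchable text.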
So, hypothetical time: let’s assume you’re an automation company concerned that its product is being used by malware authors to do bad things, either through direct usage or simply through the names of its DLLs being used to camouflage malicious software. You come up with a list of DLLs and maybe other unique information used in the product, and use a “hypothetical” service (not software, a subscription service) to watch incoming malware for signs of your product. In this case, I’m going to call this company the “Lotus Notes” company, and I’m going to alert on a small subset of the DLLs “Lotus Notes” uses: lcppn30i.dll and nNotes.dll.
I load up the APT1 plaintext and search for those DLLs, getting a single hit on nNotes.dll in sample VirusShare_94a59ce0fadf84f6efa10fe7d5ee3a03. Now, as a representative of Lotus Notes, I would want to know why my DLL is showing up in a malware sample, and would pull up the 94a59c sample (in this case, in VirusTotal). Looking over the various sections, I can see that this sample is identified as malware by 18/46 antivirus products, and nNotes.dll is shown in the PE imports to the right. User comments flag it as part of an APT1 compromise, and the original filename of the sample is ‘nsfdump.exe’ (apparently a tool to dump out Lotus Notes data).
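The search itself is conceptually simple. A rough sketch, assuming the per-sample string lists from the extraction step are already loaded into a dictionary (the sample names and watch list below are the ones from this example):

```python
def search_plaintext(strings_by_sample: dict, watch_list):
    """Return {indicator: [samples containing it]} for a watch list of
    DLL names or other strings. Matching is case-insensitive, since
    Windows filenames are."""
    hits = {}
    for indicator in watch_list:
        needle = indicator.lower()
        hits[indicator] = [
            sample for sample, strs in strings_by_sample.items()
            if any(needle in s.lower() for s in strs)
        ]
    return hits

# Toy corpus standing in for the extracted APT1 plaintext.
corpus = {
    "VirusShare_94a59ce0fadf84f6efa10fe7d5ee3a03": ["nsfdump.exe", "nNotes.dll"],
    "VirusShare_aaaa0000000000000000000000000000": ["kernel32.dll"],
}
hits = search_plaintext(corpus, ["lcppn30i.dll", "nNotes.dll"])
```

With the toy corpus above, only the 94a59c sample comes back for nNotes.dll, mirroring the single hit described.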
Sounds impressive, right? Simply put all your details into a search set and let an automated malware sampling system do the work of finding instances for you? Not so fast. This is a contrived example: I’m working with an extremely small malware set, and I’ve cherry-picked my DLLs as well. For perspective, the Malware Mining set of blog posts took me about four days to write. In that time, VirusTotal received 445,000 distinct new files, roughly three orders of magnitude more than my example of 293. I currently don’t have access to the VirusTotal Private API, or I would search the PE imports for nNotes.dll, but some google-fu against the indexed portions of VirusTotal turned up 176 unique matches, likely a much smaller number than a real VirusTotal search would show.
So what’s the point I’m trying to make? While it’s likely that someone can search a dataset and retrieve matches for specific strings within malware, the volume of hits, true and false positives alike, will make detailed analysis an ugly experience. What’s needed is specificity when generating the searches, so that as many false positives as possible are left out, along with tools to quickly and easily drill down into already-searched datasets and build more accurate searches.
For example, instead of using DLL names, a good search would attempt to use CLSIDs. CLSIDs are unique, and will be far more accurate when searching the datasets. Hashes of specific files are even more precise, though the intent here is to detect malware making use of files that may not yet be hashed. I wouldn’t mind seeing a capability that could pair a filename with a hash and let you search for every instance of that filename that did NOT have a specific hash; this would rule out legitimate uses, leaving the illegitimate ones.
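That filename-without-matching-hash idea can be sketched in a few lines. This is a hypothetical illustration, not an API any service actually exposes: the known-good set here is a made-up mapping from watched filenames to the SHA-256 hashes of the legitimate shipped files.

```python
import hashlib

def suspicious_matches(samples, known_good):
    """Flag (sample, filename) pairs where a watched filename appears
    but its SHA-256 is NOT in the known-good set: the name matches,
    the hash does not, so the file is likely masquerading."""
    flagged = []
    for sample_name, files in samples.items():   # files: {filename: raw bytes}
        for fname, data in files.items():
            if fname in known_good:
                digest = hashlib.sha256(data).hexdigest()
                if digest not in known_good[fname]:
                    flagged.append((sample_name, fname))
    return flagged

# Hypothetical known-good set: hash the legitimate DLL as shipped.
legit_dll = b"...bytes of the real nNotes.dll..."   # placeholder content
known_good = {"nNotes.dll": {hashlib.sha256(legit_dll).hexdigest()}}
```

A query like this inverts the usual whitelist: instead of asking “is this file known good?”, it asks “is this file pretending to be one of mine?”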
I provided an initial automation search set for a particular product to an interested third party last year. They wanted to use it to search their malware samples, and I recently got feedback on how the search went: millions of hits, most likely false positives due to the ambiguity of my search set, which didn’t include the unique CLSIDs and hashes. The problem here is a big data problem.
Being able to mine malware data effectively is the difference between simply detecting malware and actually analyzing it to look for trends, vulnerabilities, and new attacks. Additionally, malware authors have caught on to static analysis techniques and are branching out in different ways. One of the more interesting methods to counteract this is to run the malware in a sandbox for a short amount of time, an approach recently adopted by VirusTotal and also shown (in far greater detail) on MalWr.com.
title image by josie lynn richards