Text mining for cyber attack forensics

According to last year's data, there are 1.5 million cyber attacks per year. That may not sound like much, but it amounts to about 4,000 attacks every day, or around 170 per hour. Some of them are irrelevant or small enough to go unnoticed; others, however, can create chaos and multimillion-dollar losses for the companies affected, like the cyber attack that took place last Friday and hit big market players.

As reported by the British insurance company Lloyd’s, cyber attacks cost $400 billion in 2015, and according to Juniper’s predictions this figure will quadruple by 2019 if we consider not only the damage itself but also the cost of getting back to normal.

[Image: costs of cybercrime per country]

Source: Statista

We are not experts in cybersecurity, but we are experts in linguistics, which is why we asked ourselves: what could we do to prevent these crimes?

When we went online and started researching, we found that there is not much activity aimed at preventing cybercrime. Social media analytics and text mining for cyber attack forensics have been neglected by the industry, and almost no company puts its technology to work helping businesses protect themselves against hacking.

According to the evidence, cybercriminals tend to exchange knowledge and cyber attack tools online, using bot networks, through dark markets: social media forums and networks. The data is therefore out there; however, there is far too much of it to analyze by hand.

In natural language processing, Latent Dirichlet Allocation (LDA) is an algorithm that automatically discovers topics in a collection of texts and then groups those texts into categories. The model treats any document or piece of text as a mixture of topics, and each topic as a set of words with certain probabilities.
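To make the idea of "a mixture of topics that contain words with certain probabilities" more concrete, here is a minimal sketch of LDA trained with collapsed Gibbs sampling, written in plain Python. It is only an illustration: the function name `lda_gibbs`, the hyperparameter values and the four tiny example "posts" are our own assumptions, made up for this example, and a production system would rely on a tested library and far more data.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists.

    alpha and beta are the Dirichlet smoothing hyperparameters
    (illustrative values, not tuned).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables updated while sampling.
    doc_topic = [[0] * num_topics for _ in docs]                # words per topic in each doc
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of each word per topic
    topic_total = [0] * num_topics                              # total words per topic
    assignments = []                                            # topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of topic k: (how much doc d uses k) * (how much k uses word w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # theta[d][k]: share of topic k in document d.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    # phi[k][w]: probability of word w under topic k.
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    return theta, phi


# Made-up forum snippets, already tokenized (hypothetical data).
docs = [
    "exploit kit sale exploit payload".split(),
    "botnet rental ddos botnet attack".split(),
    "payload exploit kit update".split(),
    "ddos attack botnet rental price".split(),
]
theta, phi = lda_gibbs(docs, num_topics=2)
for k, word_probs in enumerate(phi):
    top_words = sorted(word_probs, key=word_probs.get, reverse=True)[:3]
    print("topic", k, top_words)
```

Each row of `theta` is the topic mixture of one document, and each entry of `phi` is the word distribution of one topic. With only four short texts the topics found are not reliable, which is exactly the data problem this kind of model runs into.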

Latent Dirichlet Allocation requires a huge amount of data to estimate the probability of each word appearing in a text. If you don’t have it, it will be difficult for the model to make accurate predictions and, therefore, to succeed in predicting cybercrime.

This model is also difficult for an average user to apply, and it is based on probabilities that may or may not hold. But if we analyze the data with a linguistic approach, we may get more accurate results. Why?

At Bitext we built in-house a language-independent lexical analyzer and a PDA-based non-deterministic GLR parser to analyze text, which allows us to extract the main concepts.

How is this done? By using syntactic analysis, which identifies noun phrases, verb phrases, adjectival phrases and other constituents. This cannot be done without a focus on linguistics: you have to understand the structure of each sentence to extract valuable information.
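As a rough illustration of what identifying noun phrases means, the sketch below groups tokens that match the pattern "optional determiner, any adjectives, one or more nouns". It is not our GLR parser: it is a flat pattern matcher over a sentence whose part-of-speech tags we assigned by hand, and the function name `noun_phrases`, the tag names and the example sentence are assumptions made up for this example.

```python
def noun_phrases(tagged):
    """Collect phrases of the form DET? ADJ* NOUN+ from (word, tag) pairs."""
    phrases = []
    current = []        # tokens of the candidate phrase being built
    seen_noun = False   # a candidate only counts once it contains a noun

    def flush():
        nonlocal current, seen_noun
        if seen_noun:
            phrases.append(" ".join(current))
        current, seen_noun = [], False

    for word, tag in tagged:
        if tag == "NOUN":
            current.append(word)
            seen_noun = True
        elif tag in ("DET", "ADJ"):
            # A determiner, or an adjective after a noun, starts a new candidate.
            if seen_noun or tag == "DET":
                flush()
            current.append(word)
        else:
            # Any other tag (verb, preposition, ...) ends the candidate.
            flush()
    flush()
    return phrases


# Hand-tagged example sentence (hypothetical data).
sentence = [
    ("The", "DET"), ("seller", "NOUN"), ("offers", "VERB"),
    ("a", "DET"), ("new", "ADJ"), ("exploit", "NOUN"), ("kit", "NOUN"),
    ("on", "ADP"), ("an", "DET"), ("underground", "ADJ"), ("forum", "NOUN"),
]
print(noun_phrases(sentence))
# prints ['The seller', 'a new exploit kit', 'an underground forum']
```

The extracted phrases ("a new exploit kit", "an underground forum") are the kind of concepts an analyst would want surfaced from a forum post. A full parser also handles nested and ambiguous structures, which this flat pattern cannot.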

Another technology helps achieve better results: stemming or lemmatizing. What can we achieve by using it? For example, we can collapse all the forms of a verb (“hacked”, “hacking”, etc.) into its root form, “hack”.
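The “hack” example can be sketched with a toy suffix stripper. This is a deliberately naive illustration, not a real stemming algorithm and not our lemmatizer: the suffix list and the name `toy_stem` are assumptions made up for this example, and irregular forms would need a dictionary-based lemmatizer.

```python
from collections import Counter

# Suffixes to strip, checked in this order (illustrative list only).
SUFFIXES = ("ing", "ers", "ed", "er", "s")


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["hacked", "hacking", "hacks", "hack"]
counts = Counter(toy_stem(token) for token in tokens)
print(counts)
# prints Counter({'hack': 4})
```

Once every form is counted under the same root, a term that is trending in a forum shows up as one strong signal instead of several weak ones.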

In conclusion, linguistics can help you defend your company from cyber attacks if it is used as a tool to discover what is going on. Once we know what may happen or what is trending in those cybercrime communities, it is easier to take measures to defend your business.

Do you want to see how powerful our tool is at extracting any type of context from different text sources? We will be glad to show it to you!


