The last **Report to the Nation published by ACFE** stated that, on average, **fraud** accounts for nearly **5% of companies' revenues**.

Projecting this number onto the whole world GDP, the “**fraud**-country” would produce something like a **GDP 3 times greater than the Canadian GDP**.

No surprise, then, that **companies and scholars** are increasingly **looking for ways to prevent fraud** and to spot it when it occurs.

Particularly flourishing is the **interest** in methods and **mathematical models** useful for **fraud detection**, i.e. fraud analytics models. During my working experience I got interested in a particular family of those models: unsupervised models.

## Autonomous models for fraud detection

As nicely put by **professors Hastie and Tibshirani**, **unsupervised models** are models that try to **figure things out autonomously**. In fraud analytics, unsupervised models are those that do not require **any previous knowledge of the fraud schemes** affecting the population you are looking at.

The reason why I decided to focus my attention on this family of models is simple: unsupervised fraud analytics models can be a **really powerful weapon for companies** facing fraud events.

More specifically, these models can be **applied when no previous fraud analytics** activity has been **done** in the company and no knowledge of ongoing fraud schemes is available. Doesn’t that sound sweet and magic?

Unfortunately, **fraud** is quite a multi-faceted phenomenon, and may take on **different connotations in different situations**. From the analytical point of view, this means that it is **not possible to find a model** respecting the golden rule “**one size fits all**” (not great academic jargon, I know…).

## Two families of models

More formally, we can assume two main types of fraud analytics models:

- **distribution based** models;
- **density based** models.

These models look for different kinds of fraud:

- the first for fraud that results in the **population not following a distribution** that it should otherwise follow;
- the second for fraud that results in **values isolated from the rest** of the population.

Quite nice, but **what kind of fraud schemes** are we talking about? Without pretending to be exhaustive, we can give two examples:

- the *skimming* scheme, consisting in not recording payments from customers, is easily caught by distribution models, because it results in proceeds that do not respect their “natural” distribution;
- *frauds on* **expense reimbursements** are more likely to be detected by density based models, if they are small in number.

That said, **what if we join those complementary families of models** into one single algorithm able to intercept a wider range of fraud events?

There is some **evidence from literature** that this can lead to an **increased accuracy** (see for instance **Hwang 2003** and **Wheeler 2000**), and that led me to develop Afraus.

## Afraus: three models for one score

**Afraus** is developed leveraging the **complementarity concept**, and is built using three different fraud analytics models:

- **Benford’s Law**, an empirical law used as a distribution model;
- the **Control Chart**, a distribution model;
- the **Local Outlier Factor**, a density model.

The joint use of those three models resulted in an **increased precision** over a population of audited data, as shown in the paragraph **Does it really work? Looking at real data**. But before going to the results, let us look more closely at the model.

### The Benford’s Law

This model tests the **data against a theoretical distribution of first digits**, highlighting those records that significantly deviate from that distribution. **I have written** about Benford’s Law in a **previous, quite popular post**, so I am not going to repeat myself here.

It is enough to say that this test has proven to be **effective** for many **different kinds of anomalies**, from **fraud in accounting data** to **suspicious behaviour on social networks**. Generally speaking, Benford’s Law is really good at spotting manipulation of a dataset as a whole.

Afraus knows that, and uses it to understand how ‘clean’ the population of data under consideration is, assigning a **score from 1 to 100** proportionate to the **significant deviation** from the expected frequency of the given digit. A 0 score is assigned to those records with no significant deviation.
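To make the scoring idea a bit more concrete, here is a minimal Python sketch (not the Afraus R code). The expected first-digit frequency log10(1 + 1/d) is Benford's Law itself. The per-digit z-test, the 1.96 threshold and the linear rescaling to 1-100 are my own assumptions for illustration, as are the function names.

```python
import math
from collections import Counter


def first_digit(x):
    """First significant digit of a non-zero number (e.g. 0.0042 -> 4)."""
    return int(f"{abs(x):.15e}"[0])


def benford_scores(values, z_crit=1.96):
    """Return one 0-100 score per record, based on its first digit.

    Assumptions (not taken from Afraus): a per-digit z-test decides what
    "significant" means, and significant deviations are rescaled linearly.
    All values are assumed to be non-zero.
    """
    n = len(values)
    digits = [first_digit(v) for v in values]
    observed = Counter(digits)

    z = {}
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)  # Benford's expected frequency
        std_err = math.sqrt(expected * (1 - expected) / n)
        z[d] = abs(observed[d] / n - expected) / std_err

    z_max = max(z.values())
    scores = []
    for d in digits:
        if z[d] <= z_crit:
            scores.append(0)  # no significant deviation
        else:
            # map (z_crit, z_max] linearly onto 1..100
            scores.append(round(1 + 99 * (z[d] - z_crit) / (z_max - z_crit)))
    return scores


# Toy data (made up): amounts that over-use the leading digit 7
amounts = [120, 135, 180, 210, 260, 310, 450, 520,
           710, 720, 730, 740, 750, 760, 910]
print(list(zip(amounts, benford_scores(amounts))))
```

Note that the score is attached to a digit, so every record sharing an over- or under-represented first digit gets the same value. This fits the idea of Benford's Law as a test on the dataset as a whole.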

**Some cautions** are needed when applying Benford’s Law: as stated by **Durtschi, Hillison and Pacini**, not all kinds of population are suitable for testing with Benford’s Law, and even when they are, it is not obvious that they will comply with the Law, as clearly shown by professor **Goodman**.

### Control Chart

Control charts are mainly known as a reliable tool for statistical process control.

These models derive from the data **a center line, an upper and a lower bound**, highlighting as ‘**out of control**’ **all records outside those limits**.

As shown by **Wheeler**, even if these models assume a normal distribution, they are also applicable to populations with non-normal distributions. This is mainly due to the **robustness of the bounds derivation**.

**Afraus** leverages the robustness of these charts to identify atypical data, calculating **a score from 1 to 100 proportionate to the distance from the upper/lower bound**, and assigning a 0 score to the records falling within the bounds.
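As a rough illustration of this scoring, here is a small Python sketch (again, not the Afraus R code). I am assuming an individuals (XmR) chart, where the limits sit 2.66 average moving ranges away from the mean. The post does not say which type of control chart Afraus uses, and the linear rescaling of the distance to 1-100 is also an assumption.

```python
from statistics import mean


def control_chart_scores(values):
    """Return the (lower, center, upper) limits and a 0-100 score per record.

    Assumptions (not taken from Afraus): XmR chart limits and a linear
    rescaling of the distance beyond the nearest bound.
    """
    center = mean(values)
    # average absolute difference between consecutive records
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = mean(moving_ranges)
    lower = center - 2.66 * mr_bar
    upper = center + 2.66 * mr_bar

    # distance beyond the nearest bound (0 when the record is within the bounds)
    excess = [max(v - upper, lower - v, 0.0) for v in values]
    max_excess = max(excess)
    scores = [0 if e == 0 else round(1 + 99 * e / max_excess) for e in excess]
    return (lower, center, upper), scores


# Toy data (made up): one payment far away from the others
payments = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50, 10, 11]
limits, scores = control_chart_scores(payments)
print(limits)
print(scores)
```

The record furthest outside the bounds always gets 100, so scores are relative to the population being scanned rather than absolute.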

### Local Outlier Factor

Local Outlier Factor is a **k-NN algorithm**, based on the concept of density: the **more isolated a record is**, the **higher** the **likelihood** that it originates from **fraud**. Afraus uses the LOF to intercept isolated values.

As for the two previous models, **Afraus assigns a score from 0 to 100** to each record within the population being analyzed.
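For readers who have never met the LOF, here is a minimal pure-Python sketch of the raw factor, following the usual definition (k-distance, reachability distance, local reachability density). It is not the Afraus R code. The choice of k = 3 and the toy data are assumptions, and the sketch assumes there are no more than k identical records. Values around 1 mean a record is about as dense as its neighbours, while larger values mean it is more isolated.

```python
import math


def lof(points, k=3):
    """Local Outlier Factor of each point (points are equal-length tuples)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]  # distance to the k-th nearest neighbour
        k_dist.append(kd)
        neighbors.append([j for d, j in others if d <= kd])  # ties included

    def reach(i, j):
        # reachability distance of point i from point j
        return max(k_dist[j], dist[i][j])

    # local reachability density: inverse of the mean reachability distance
    lrd = [len(neighbors[i]) / sum(reach(i, j) for j in neighbors[i])
           for i in range(n)]

    # LOF: mean density of the neighbours divided by the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


# Toy data (made up): five similar amounts and one isolated value
amounts = [(0.9,), (1.0,), (1.05,), (1.1,), (1.2,), (10.0,)]
for amount, factor in zip(amounts, lof(amounts)):
    print(amount[0], round(factor, 2))
```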

Given the formulation of the **Local Outlier Score**, a 0 score is given to records with a value less than or equal to 1, and a 1 to 100 score is given to the other records, using the maximum level of LOF to set the 100 score.

## The final score

As seen in the previous paragraphs, **Afraus leverages the concept of complementarity**, joining together different models.

For each model Afraus calculates a “**compliance-score**” **ranging from 0 to 100.** In order to express a **synthetic judgment** on each record, a **final score is calculated.** This score is simply defined as **the weighted average of the three “compliance-scores”**.

## How can I give it a try?

If you are interested in **testing the algorithm**, I have developed a nice **web application**, based on **Shiny**, named Afraus:

#### try the Afraus web app

Using the Afraus web application you can **test the algorithm with your own custom data** and look at the results in detail.

Of course, I have **already done some tests** on real data, and I am going to show you the **results in the next paragraph**.

## Does it really work? Looking at real data

As is always the case in fraud analytics, having developed a nice algorithm is not enough: you need to show that it works and does its job properly.

Therefore, in order to check the validity of the complementarity concept, I have **made the following test**: I have applied the **three models separately** to a population of **labelled data with frauds**, resulting from a fraud audit, and **then** I have applied **Afraus to the same population**.

After that I have **determined** the **confusion matrix** and the **precision score**, measuring how well Afraus performed against the single models it is made of.
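For clarity on the metric, here is a tiny Python sketch of how a confusion matrix and the precision score (true positives over all flagged records) can be computed from audit labels and model flags. The labels below are made up purely for illustration and are not the audited data, and the function names are my own.

```python
from collections import Counter


def confusion_matrix(is_fraud, flagged):
    """Count TP, FP, FN, TN from audit labels and model flags (lists of bools)."""
    counts = Counter(zip(is_fraud, flagged))
    return {
        "TP": counts[(True, True)],    # fraud, flagged
        "FP": counts[(False, True)],   # not fraud, flagged
        "FN": counts[(True, False)],   # fraud, missed
        "TN": counts[(False, False)],  # not fraud, not flagged
    }


def precision(cm):
    """Share of flagged records that are actual frauds."""
    flagged_total = cm["TP"] + cm["FP"]
    return cm["TP"] / flagged_total if flagged_total else 0.0


# Hypothetical labels and flags, for illustration only
audit_labels = [True, True, False, False, False, True]
model_flags = [True, False, True, False, False, True]
cm = confusion_matrix(audit_labels, model_flags)
print(cm, precision(cm))
```

Since false positives sit in the denominator, a model that flags many clean records is penalised by this metric.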

Those were the **results** on the audited population: as you can see, **Afraus brings a noticeable improvement in the precision index**, suggesting that the right path was taken.

Nevertheless, as is often the case, it seems to be **quite a long journey**, and more developments are coming.

**Could it work better? currently working on…**

As shown in the previous paragraph, Afraus has proven to be an interesting unsupervised fraud detection algorithm, able to detect real cases of fraud with no previous hint at all.

But, as said, I got interested in **unsupervised models** because they can do a **good job for companies** even when they have no previous knowledge about fraud and fraud detection.

That is why, after smiling for a while at the graph above, I asked myself: **how can I make that 0.17 grow**, while still not asking the analyst for any specification?

This question led me to define the development path of Afraus: **integrating more fraud detection models**, dynamically chosen by the algorithm itself.

This implies some **conditional instructions** based on **tests of the assumptions** that the different models take for granted. Let’s take an **example**: we could **integrate regression models** and choose the type of regression equation **based on the R squared** parameter.

Afraus is currently developed in the **R language**, and I am committing changes to the code on a **public GitHub repo**; **start watching** it if you would like to keep an eye on the project.

**Conclusion**

**Fraud is a dangerous animal**, threatening companies and the economy as a whole. Fraud analytics can be a great ally in the battle against this public enemy, which is why it is receiving increasing attention from companies and academia. Among fraud analytics models, **unsupervised models can be really precious for the first fraud detection** activities in a company, and a complementary approach can make them even more precious.

Knowing that, I have developed for professional purposes **Afraus**, an **unsupervised fraud detection algorithm** aimed at taking advantage of different models in order to easily scan populations of data looking for fraud, without requiring any previous knowledge of ongoing fraud schemes or events affecting the data.

**Have you ever tried this kind of approach?** What were your results?

Hi Andrea,

Thank you for this nice post and R-Code. I think an unsupervised model can achieve great results in fraud detection.

I just wanted to mention that the advantage of unsupervised models is not that they “…do not require any previous knowledge of fraud schemes affecting the population you are looking at.”

Actually, most supervised machine learning will be able to detect complex patterns without previous knowledge.

The difference between supervised and unsupervised is, that in the first one you got data where frauds have been labeled as 0 or 1 (for example). With unsupervised models you just try to detect anomalies in the data.

The advantage of the latter is that you can start your analysis without a huge handlabeled data base (this would be ideal). The disadvantage is, that detected anomalies could be caused by other things than fraud. Therefore the false-positive rate might be an important issue.

I think the two most common measures in classification are sensitivity and specificity as described here:

https://en.wikipedia.org/wiki/Confusion_matrix

In R package caret there is a nice function confusionMatrix that calculates all common measures.

Best regards,

Samuel

Hi Samuel,

really thank you for taking the time to write down your comment.

You are perfectly right when saying that the main point with unsupervised models is the absence of a labeled dataset.

I have translated this technical concept for, let’s say, popularization purposes: since Afraus is intended to be used in the very first fraud detection activities within a company, I have associated the absence of a labeled dataset with the lack of knowledge of ongoing fraud schemes within the company.

This does make sense to me, since in my professional experience I have noticed this common pattern: no previous fraud detection activity, no labeled data available, no idea where to start.

In this context I have noticed unsupervised methods can be a good tool to start looking into your data for anomalies.

This concept is better stated in Bolton & Hand’s paper “Statistical Fraud Detection: A Review”.

That said, I am quite interested in your statement about supervised machine learning, since as far as I can see, when you train a supervised algorithm you are teaching it to respect certain rules, even if not strict ones, so that you necessarily need some previous knowledge of the frauds occurring.

I would be really grateful if you could point me to some work where machine learning is applied the way you meant.

About the confusion matrix, thank you for pointing this out: I am going to share other performance details (and metrics) in the next posts, and I would love to have your opinion on that too.

Let me know if I didn’t see your point properly.

Regards

What about false positive?

Thank you for writing.

False positives are always an issue when talking about fraud detection, even if we should not consider these algorithms as infallible snipers, but rather as useful tools in the first skimming activities on a population of data.

Going deeper into the false positives topic, as you can see in “Does it really work? Looking at real data”, Afraus resulted in an increased precision score compared to the single models it is composed of.

Now, the precision score is defined as:

precision score = (True Positives)/(True Positives + False Positives)

Therefore, by adopting the precision score for performance evaluation I am actually considering the false positive metric.

Anyway, I am going to share more on the algorithm code and performance in coming posts, and I would be glad to know your opinion on this.

Moreover, you can find the full code on a GitHub repo. Feel free to “watch” it to follow upcoming developments.

Thanks again,

Andrea