In the early 1900s, Frank Benford observed that ‘1’ appeared more often than other digits as the first digit in his own logarithm tables.

More than one hundred years later, we can use this curious finding to look for fraud in populations of data.

Just give the Shiny app a try.

What does ‘Benford’s Law’ stand for?

Around 1938 Frank Benford, a physicist at the General Electric research laboratories, observed that logarithmic tables were more worn on their first pages: was this coincidental, or due to an actual prevalence of numbers near 1 as first digits?

Starting from this intuition, Benford tested the hypothesis against 30 populations of data, finally arriving at what is now known as Benford’s Law:

the expected frequencies of the digits in lists of numbers

[Image: a page from a logarithm table]

That is to say, using Benford’s Law it is possible to predict the expected distribution of a population of data, in terms of how often the numbers 1 to 9 recur as first digits.

The prediction is made in terms of probability, and is calculated using the following formula:

Probability(D_{1}=d_{1}) = \log_{10}\left(1+\frac{1}{d_{1}}\right)

d_{1} \in \{1, 2, 3, \ldots, 9\}

You can also calculate the conditional probability of observing a given second digit, given the first digit (here d_{1}d_{2} denotes the two-digit number formed by the first two digits):

Probability(D_{2}=d_{2} \mid D_{1}=d_{1}) = \frac{\log_{10}\left(1+\frac{1}{d_{1}d_{2}}\right)}{\log_{10}\left(1+\frac{1}{d_{1}}\right)}

d_{1} \in \{1, 2, 3, \ldots, 9\}

d_{2} \in \{0, 1, 2, \ldots, 9\}
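To make the figures concrete, here is a minimal R snippet (not part of the app, just a sanity check) that computes the expected first-digit frequencies straight from the formula above:

# expected Benford frequencies for the first digit
d <- 1:9
p_first <- log10(1 + 1/d)
round(p_first, 3)
# 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
sum(p_first[1:2])   # digits 1 and 2 together account for about 0.477 of the probability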

A typical “Benford distribution” is heavily skewed toward the low digits: the digits 1 and 2 together account for nearly 50% of the probability of being the first digit.

[Figure: expected Benford first-digit distribution]

Nice stuff, but what can I do with Benford’s Law?

For a long time, Benford’s Law was regarded as a bizarre feature of some populations of data.

Something really amusing, but not really useful.

Eventually, mainly thanks to the meritorious work of the mathematician Mark Nigrini, Benford’s Law began to be used for practical purposes, and more precisely for Fraud Analytics.

You can find fraud with it

The idea of using Benford’s Law for Fraud Analytics rests on one of the main assumptions in Fraud Analytics:

“if something is behaving differently from how it should, it could be due to fraud.”

In the case of Benford’s Law, the aim when using it to catch fraud is to verify whether the law is respected within the population and, if not, which elements are not respecting it.

Of course, as is always the case in fraud analytics, the anomaly could be due to human error, at least for some of the anomalies (i.e. false alarms).

Nevertheless, Benford-based fraud analysis has proven to be a very effective tool for detecting fraud, especially when non-compliance is treated as a red flag for data integrity.

Some precautions

As pointed out by Durtschi, Hillison and Pacini, you have to be cautious when using Benford’s Law.

In particular, Benford’s Law is unlikely to be useful in the following cases:

  • Data sets comprised of assigned numbers: check numbers, invoice numbers, zip codes
  • Numbers influenced by human thought: prices set at psychological thresholds ($1.99), ATM withdrawals
  • Accounts with a large number of firm-specific numbers: e.g. an account specifically set up to record $100 refunds
  • Accounts with a built-in minimum or maximum: e.g. a set of assets that must meet a threshold to be recorded
  • Cases where no transaction is recorded: thefts, kickbacks, contract rigging

BenfordeR: another lean shiny application

[Screenshot: the BenfordeR app]
click to try BenfordeR

Because I am getting used to doing it (and also because my readers seem to find it useful), I developed a lean Shiny app in R that lets you play around with Benford’s Law, loading your own population and checking it for Benford’s Law compliance (no worries, a demo dataset is also provided).

BenfordeR’s main features include:

  • custom dataset upload
  • option to choose the number of first digits to test
  • highlighting of suspected records (see ‘detecting suspected records’ below)

Some code specs (if you are interested)

The complete, commented RStudio project is available on GitHub, but I would like to point out some code details in the following lines.

 performing a benford analysis

BenfordeR is based on the benford.analysis package, a well-documented package that lets you perform a Benford analysis with a single function: benford().

benford_obj = reactive({ benford(data(), input$digits) })
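Outside Shiny, the same analysis boils down to a couple of lines. The sketch below is only illustrative: ‘my_payments.csv’ and its ‘amount’ column are placeholders for whatever numeric data you want to test.

library(benford.analysis)

payments <- read.csv("my_payments.csv")                     # placeholder file: any data frame with a numeric column works
bfd      <- benford(payments$amount, number.of.digits = 2)  # run the analysis on the first two digits
bfd                                                         # prints observed vs expected digit frequencies and summary statistics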

plotting results

The benford() function returns a Benford object that can be easily visualised using the plot() function:

plot(benford_obj())
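Within the Shiny app, this call just needs to be wrapped in a renderPlot(); a minimal sketch (the output id benford_plot is my own naming, not necessarily the one used in the actual app):

output$benford_plot = renderPlot({
  plot(benford_obj())   # draws the standard diagnostic panels produced by benford.analysis
})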

[Screenshot: Benford analysis plots in the BenfordeR app]

detecting suspected records

Finally, in order to let the user easily discover which records are causing the anomaly, BenfordeR selects the three digit groups that deviate most from the expected recurrence and subsets the user’s dataset on those digits:

output$benford_suspect = renderDataTable({
  digits          = left(data(), input$digits)    # extract the first n digits from the data() table, using the left() helper
  data_output     = data.frame(data(), digits)    # join the digits with the dataset
  suspects        = suspectsTable(benford_obj())  # digits ranked by deviation from the expected frequencies (this could be improved using benford_table())
  suspects        = suspects[1:3, ]               # keep the three most deviant digit groups
  suspects_digits = as.character(suspects$digits)
  subset(data_output, as.character(data_output[, 2]) %in% suspects_digits)  # filter the data on the three suspected digit groups
})
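As a side note, the benford.analysis package also ships a getSuspects() helper that should achieve much the same result in one call. The snippet below is a rough sketch based on my reading of the package documentation (argument names may differ between versions), reusing the bfd and payments objects from the earlier non-Shiny example:

suspects_data = getSuspects(bfd, payments, how.many = 3)  # rows whose leading digits deviate most from Benford
head(suspects_data)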

 What’s next

I approached Benford’s Law as a part of my Internal Audit job and as a part of my bachelor’s thesis.

More precisely, I am currently developing a fraud-scoring algorithm that takes advantage of different anomaly-detection algorithms.

Walking along this path, I have further developed BenfordeR into a “module” within the fraud-scoring algorithm and the related Shiny app. The result of this development is Afraus, an unsupervised fraud detection algorithm. Find out more about Afraus in the dedicated post.



8 thoughts on “Catching Fraud with Benford’s law (and another Shiny App)”

  1. Without disputing the existence of Benford’s Law as a cool phenomenon, I worry about the claims often made for it as a “fraud detector”. Sure, no harm in taking a second look at data that looks “unusual” somehow. But it is often implied that one could set up a hypothesis test, with a “null hypothesis” that a data set “should” look like Benford’s pattern, and raise red flags if a data set is ‘significantly’ different from Benford’s. Two (of several) problems with that are (1) significance testing requires a measure of expected error. (Such measures are rarely found in this literature; and empirically, I’ve found a lot of noise even when comparing “Benford’s-like” sample sets). And (2) even the most-cited experts acknowledge that there is no clear, “a priori” indicator of when or why a particular data set will or will not conform closely with Benford’s. This confounds attempting a true hypothesis test, because if a particular data set happens to not conform, who’s to say that this is not one of the categories of cases that happen (quite innocently) to not conform to Benford’s? For more about this, you might be interested in: http://www.statlit.org/pdf/2013-Goodman-ASA.pdf

    1. Thank you for your comment.
      I have carefully read the paper you pointed out in your comment (was this paper written by you?) and I am glad to notice that we share substantially the same position about Benford’s Law:

      1. great caution and professional skepticism are needed when using the Law, and particularly when reading the results of its application;
      2. Benford’s law can usefully be used in association with other fraud detection techniques.

      In my experience Benford’s Law has proved to be useful in different cases, mainly but not only in the field of error-checking.
      Nevertheless, I discovered that a good and feasible approach is to leverage the complementarity of different fraud analytics models in order to maximise the probability of intercepting fraud events.

      That is why I am currently working on an unsupervised fraud detection algorithm developed using different models, Benford’s Law included.
      You can find out more on Github:

      https://github.com/AndreaCirilloAC/Afraus

      Once again, thank you for your stimulating comment and paper.

      1. Hi, thanks for your reply. Yes, I wrote that paper, and was pleased that it was picked up by the Statistical Literacy website (statlit.org).
        Your Afraus app looks interesting. (I haven’t tried it yet.) Do you think it can still be applied in cases where dates-of-input would not apply? For example, some articles talk about fraud in data that have been input for science articles. The values recorded may be output measures during a certain window period, and exact times of measurement may not have been recorded or published. Could simple order of input be good enough to use–even if the time intervals were likely not all equal?

      2. The current version of Afraus is designed to be used with bivariate populations of data.
        Nevertheless, all three models joined into Afraus are applicable to monovariate populations of data as well.
        Therefore Afraus could also be developed for those kinds of populations.

        Moreover, I am currently working on a new version of the algorithm, able to select among different models depending on the features of the data loaded by the user.

        One of the features considered could also be the number of variables.
        Thank you for this “suggestion”!

        More on Afraus is coming in the next post; I will be pleased to see your comment on that as well.

  2. Could it be that an intelligent agent that seeks to minimize its costs would display a behavior indistinguishable from that of “fraud”? After all, behaviors associable with “gaming the system” are not expected to appear randomly.

    1. Thank you for commenting, really interesting observation.
      As far as I can see, if we can still consider prices and quantities as random variables, their products (which are the payments the agent is willing to reduce) should respect Benford’s Law.
      As a matter of fact, this should fall under the demonstration given by Jeff Boyle in his paper “An application of Fourier series to the most significant digit problem”. Boyle explains that if a population results from a computation of two random variables, the distribution of first digits is expected to follow Benford’s Law.
      Moreover, within the cited benford.analysis package you can find a real payments dataset (the corporate.payments data frame) that substantially respects Benford’s Law. The dataset represents the 2010 payments data of a division of a West Coast utility company.
      I really hope I am seeing your point. Thanks again.
