Answering Ben (function comparison in R)

Following the post about the %in% operator, I received this tweet:

@AndreaCirilloAC check out the dplyr #rstats package. Easy to use the %in% operator with the filter() function. https://t.co/6oC0YkvUI7

— Ben White (@benwhite21) September 12, 2014

I had a look at the code kindly provided by Ben and then asked myself:
I know dplyr is a really nice package, but which snippet is faster?

To answer the question, I put the two snippets into two functions:

library(dplyr)

#Ben snippet
dplyr_snippet = function(object, column, vector){
  filter(object, object[,column] %in% vector)
}

#AC snippet
Rbase_snippet = function(object, column, vector){
  object[object[,column] %in% vector,]
}

Then, thanks to the great microbenchmark package, I compared the two functions, timing each of them 100,000 times.

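library(microbenchmark)

vec = c("setosa","virginica") # example values (an assumption: the original post does not show how vec was defined)
comparison = microbenchmark(Rbase_snippet(iris,5,vec), dplyr_snippet(iris,5,vec), times = 100000)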

library(ggplot2)

#plot the output
autoplot(comparison) +
  labs(title = "comparison between dplyr_snippet and Rbase_snippet", y = "snippet")
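If you want to try the comparison yourself without waiting for 100,000 runs, here is a scaled-down, self-contained sketch. The contents of vec and the value times = 100 are my assumptions, not the values used for the result reported in this post. It also checks that the two snippets really return the same rows before timing them.

```r
library(dplyr)
library(microbenchmark)

dplyr_snippet = function(object, column, vector){
  filter(object, object[,column] %in% vector)
}
Rbase_snippet = function(object, column, vector){
  object[object[,column] %in% vector,]
}

# Assumed filter values: two of the three species in iris (column 5)
vec = c("setosa", "virginica")

# Sanity check: both snippets should keep the same rows
a = dplyr_snippet(iris, 5, vec)
b = Rbase_snippet(iris, 5, vec)
same_rows = nrow(a) == nrow(b) && all(a$Species == b$Species)

# Fewer repetitions than in the post, just to get a quick feeling
comparison = microbenchmark(Rbase_snippet(iris, 5, vec),
                            dplyr_snippet(iris, 5, vec),
                            times = 100)
timings = summary(comparison)
print(timings)
```

Timings depend heavily on the machine and on the package versions, so treat the numbers you get as indicative only.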

And that was the result:

Base R seems to be the winner, even if only by a handful of microseconds…

Nevertheless, I am really grateful to Ben: it was great fun!

Code snippet: subsetting data frame in R by vector

Problem:

You have to subset a data frame, keeping only the rows whose value in a given column exactly matches one of the elements of a vector.

For instance:

You have a dataset with some attributes, and you have a vector with some values of one of the attributes. You want to filter the dataset based on the values in the vector.

Example: sales records, each record is a deal.

The vector is a list of selected customers you are interested in.

Is it possible to build this kind of filter?

Solution

Of course it is!

You just have to use the %in% operator.

Let’s see how to do it in the short tutorial below.

Tutorial

Suppose you have a sales data frame object like this:

Suppose you want to extract sales to Francesca, Tommaso and Gianna.

First, you have to assign those names to a vector.

`vector = c("Francesca","Tommaso","Gianna")`

Then, you can write the filtering statement, using the %in% operator.

`query_result = sales[sales$customer %in% vector,]`

And that’s it!

The meaning of the %in% operator is exactly what you would guess:

“select only values present IN the specified group”.
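To make the steps above easy to try, here is a minimal, self-contained sketch. The sales data frame is made up for illustration (the customers outside the three named above and all the amounts are hypothetical); only the column name customer and the filtering statement come from the tutorial.

```r
# Hypothetical sales data: each row is a deal
sales = data.frame(
  customer = c("Francesca", "Luca", "Tommaso", "Gianna", "Marta", "Francesca"),
  amount   = c(120, 75, 300, 50, 210, 90)
)

# The customers we are interested in
vector = c("Francesca", "Tommaso", "Gianna")

# %in% returns one TRUE/FALSE per row: is this customer in the vector?
keep = sales$customer %in% vector

# Use that logical vector to select the rows (and keep all the columns)
query_result = sales[keep, ]
print(query_result)
```

Note the comma before the closing bracket: it tells R that the logical vector selects rows and that every column has to be kept.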

Full code is available as an R workbook for quick reference:

filter_code.R

Let me know if you use any other method to obtain the same result.

Finally, if you enjoyed the tutorial, you can find more tutorials on the Tutorial page (quite obvious, isn’t it?).

Saturation with Parallel Computation in R

I have just saturated my whole PC:

the 4 GB of RAM is full

and so is the CPU (i7-4770 @ 3.4 GHz)

Parallel Computation in R

What is my secret?

The doParallel package for R on Mac.

The package lets you run computations in parallel, so that you can use the full potential of your CPU.

As a matter of fact, by default R uses just one of the cores you have on your PC.

With parallel computation, simply put, you can take your job, divide it into smaller jobs, solve them separately and then put the results together in one new R object.
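The divide, solve and recombine idea can be sketched with the parallel package that ships with R. Note that this is not doParallel, just the smallest illustration of the concept I could think of; the job (doubling eight numbers) and the choice of two workers are assumptions made for the example.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical job: double every element of a vector
job = 1:8

# Divide the job into two smaller jobs of four elements each
chunks = split(job, rep(1:2, each = 4))

# Start two worker processes and let each one solve one chunk
cl = makeCluster(2)
partial = parLapply(cl, chunks, function(chunk) chunk * 2)
stopCluster(cl)  # always release the workers when you are done

# Put the partial results back together in one new R object
result = unlist(partial, use.names = FALSE)
print(result)
```

For a job this small the cost of starting the workers is far bigger than the gain, so the approach only pays off when each smaller job takes a meaningful amount of time.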

Tutorial (More or Less)

There are some useful tutorials on the web (try to google them), but let me introduce the stuff in a really basic way so that you can immediately try it out:

install.packages("doParallel")
library(doParallel)

cl = makeCluster(2) # if you want to use all your fire power, put the number of your cores
registerDoParallel(cl)

parallelization = function(x){
  n = number # put here the number of repetitions you need
  # '.combine = rbind' tells R to put the results together with rbind; you can use cbind as well
  foreach(i = 1:n, .combine = rbind) %dopar% {
    x*2
  }
}
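Here is a version of the function above that you can run end to end. It is only a sketch: passing the number of repetitions as an argument n, the input vector and the choice of two workers are my assumptions for the example. It also assumes doParallel is already installed.

```r
library(doParallel)  # also attaches foreach and parallel

# Two workers; parallel::detectCores() reports how many cores your machine has
cl = makeCluster(2)
registerDoParallel(cl)

# Same idea as above, with the number of repetitions passed in as n
parallelization = function(x, n){
  # %dopar% sends each iteration to a worker;
  # .combine = rbind stacks the results row by row
  foreach(i = 1:n, .combine = rbind) %dopar% {
    x * 2
  }
}

out = parallelization(c(1, 2, 3), n = 4)

# Release the workers when the job is finished
stopCluster(cl)

print(dim(out))
```

The result is a matrix with one row per repetition. Replacing %dopar% with %do% runs the same loop sequentially, which is handy for debugging.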

Final Warnings

Just a tip: DON’T MAKE THE RESULT BE AN OBJECT!

Sorry about the capital letters, but I was stuck on this error for quite a long time…

Finally, are you using Windows? Instead of doParallel, you can obtain the same result with the doSNOW package.

Comments are welcome.

How to Visualize Entertainment Expenditures on a Bubble Chart

I was recently asked to analyze some Board entertainment expenditures in order to gain sufficient assurance about their nature and about who was responsible for them.

In response to that request I have developed a little Shiny app with an interesting reactive Bubble chart.

The plot, made with the ggplot2 package, is composed of:
A categorical x value, represented by the clusters identified in the expenditures population
A numerical y value, representing the total amount spent
Points defined by the total amount of expenditure in the given cluster for each company subject.
Moreover, point size is given by the ratio between the amount regularly passed through the Account Receivable Process and the total amount of expenditure for that subject in that cluster.

The advantage I found in developing such a plot is the possibility of getting a lot of information at a glance.

Reactivity is given by two sliders that let you decide the period and the range you want to analyze.

Full code (with Italian comments, sorry about that) is available on GitHub, but let me highlight some details:

Labels are made by pasting together the subject’s name and the amount shown, in order to give you both pieces of information at once.

Colours are automatically assigned by the ggplot function, simply by specifying

col=factor(Soggetto)

Point size is simply obtained by putting
size=rapporto
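To show how these pieces fit together, here is a minimal sketch of such a bubble chart. It is not the code of the Shiny app: the data are invented, and the names spese, cluster, importo and etichetta are hypothetical. Only Soggetto and rapporto come from the snippets above.

```r
library(ggplot2)

# Invented data: one row per subject and cluster
spese = data.frame(
  cluster  = c("Restaurants", "Restaurants", "Hotels", "Hotels"),
  Soggetto = c("Company A", "Company B", "Company A", "Company B"),
  importo  = c(1200, 800, 3000, 450),  # total amount spent
  rapporto = c(0.9, 0.4, 0.7, 1.0)     # share passed through the regular process
)

# Label = subject name pasted together with the amount
spese$etichetta = paste(spese$Soggetto, spese$importo)

bubble = ggplot(spese, aes(x = cluster, y = importo,
                           col = factor(Soggetto), size = rapporto)) +
  geom_point() +
  # fixed text size, so that labels do not scale with rapporto
  geom_text(aes(label = etichetta), size = 3, vjust = -1.5, show.legend = FALSE)

print(bubble)
```

In a Shiny app the same ggplot call would typically sit inside renderPlot(), with the data filtered according to the slider values.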

I hope you will at least find the result colorful.
Speaking of colours: does anyone know how to make the points solid in this case?

Any feedback will be appreciated, but please be kind!

andrea cirillo's blog

analytics & me
