Command Line R

It is no secret that I prefer using R at the command line rather than going through the very popular RStudio interface.  This post is by no means an endorsement of using R through the terminal over any number of GUI wrappers out there.  However, for those who are using R through the command line/terminal, either out of preference or necessity, here are some customization options I’ve learned that may help speed things along.

Finding R_HOME

The directory where R environment and profile settings are stored can be found by calling R.home() inside the R console.  This is useful because many of the files governing R startup live in the ‘R_HOME/etc/’ folder.
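
For example, you can locate that folder without ever leaving R (the paths shown in the comments are just examples; yours will differ by installation):

R.home()                   # e.g. "/usr/lib/R" -- varies by installation
R.home("etc")              # the etc/ folder where Renviron.site and Rprofile.site live
list.files(R.home("etc"))  # confirm the files are actually there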

There are two files that we are mainly concerned with: Renviron.site and Rprofile.site.  Renviron.site contains general environment variables, such as the installation path where your R libraries are kept, and allows for the definition of other environment variables that might be used in deeper levels of integration between R and the operating environment.
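
As a rough illustration (these paths and variable names are placeholders of my own, not a recommended configuration), a few lines in Renviron.site might look like this:

# R_HOME/etc/Renviron.site -- illustrative entries only
# site-wide library path (placeholder path)
R_LIBS_SITE=/opt/R/site-library
# custom variable, readable in R via Sys.getenv("MY_DATA_DIR")
MY_DATA_DIR=/data/projects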

Rprofile

Rprofile allows for the customization of option sets, package loading, and other quality-of-life settings that make R a lot more comfortable to use in the terminal.  For example, two of the options I used to always run at the top of my code are as follows:

options(scipen = 20)               # penalize scientific notation in printed output
options(stringsAsFactors = FALSE)  # keep character columns as characters, not factors

By default, R imports character strings in data.frames as factors.  This becomes exceptionally annoying when your data set comes in multiple files, and combining these data sets into one large data frame requires you to manually convert all factors back into strings.  My personal preference is to always keep strings as characters unless I genuinely need the memory savings or am definitively sure that my column contains the entire universe of permitted values for that data element.

By appending these options to the end of Rprofile, I save myself the silliness of typing or sourcing my option sets each time I start a new project.
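
If you would rather not hunt the file down by hand, the following should open it for editing from within R (assuming you have write permission to R_HOME):

file.edit(file.path(R.home("etc"), "Rprofile.site"))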

Add some color

Alas, this next option is only available for the Unix flavor of R, but the colorout package adds syntax highlighting and a bunch of other goodies to the plain grey R console that we’ve come to love and squint at.  Below is a before-and-after of using colorout when looking at the function ls():

[Image: Bland vs Syntax Highlighting (colorout)]

Awesome, right? As icing on the cake, colorout, or any package for that matter, can be loaded by default by adding the following code to Rprofile:

options(defaultPackages=c(getOption("defaultPackages"), "colorout", "other_packages..."))

Presenting LearnDC

This Saturday, I had the pleasure of presenting LearnDC, a project created by the Office of the State Superintendent of Education (OSSE) as an information portal to empower students, parents, and the greater education community, alongside the amazing Chris Given of Collaborative Communications.  The conference, Assembled Education, was a one-day EdTech panel discussion put on by General Assembly, a newly formed education institution that has put on a phenomenal series of workshops, classes, and boot camps on a wide range of technology and business topics.

[Image: Assembled Education in action.  Photo by Elle Gitlin (@evoque)]

Driving Factors

The development of LearnDC did not happen in a vacuum.  The charter school movement, for better or for worse, created an education environment that gave parents a multitude of choices but a dearth of information during key transition points in a child’s K-12 academic career.  While much of the information currently hosted by LearnDC already existed in the public domain, public domain rarely meant open and accessible.  In order to aggregate much of that information, a parent would have had to muster teams of interns and graduate students to pore over fixed-width federal reports, PDFs submitted at council hearings, and other arcane formats in which the data used to be published.

As OSSE matured as an agency, its involvement in #OpenData and #OpenGov events also increased.  I am grateful to have had the opportunity to join the Code for DC education project and work on a number of projects that were both personally and professionally rewarding.  A quick and very unscientific Twitter poll shows OSSE’s leadership in opening DC’s education data here, here, and here.  As credit should be given where it’s due, the other DC education agencies, DCPS, PCSB, and DME, have all released data sets as part of DC Open Data Day 2014.

[Image by V. Rao Dumpeti (@vishpool)]

However, #OpenData is not achieved through a one-time data release, nor does a government agency build goodwill through single acts of public service.  LearnDC represents a cornerstone in a larger dialogue with the public around public education.

When the data side of the project was first conceived, the first few requirements set in stone were openness, automation, and scale.  The product had to be an honest representation of education data, with methodologies and aggregation rules fully exposed.  While there is currently a technical barrier to understanding the coding behind the data, the code used to generate the data is posted on Github and openly available for public scrutiny.

The product also needed to be automated and scalable.  We did not want to create a site where we updated the data once a year and each data update became its own massive project like EDFacts, filled with its own share of grief and headaches.  The data on the site needed to be fresh, up to date, and easily revised so that when we do make mistakes (and we make plenty of those), they can be easily fixed.

The Product

The product that we have today is a website that acts as the front-end release tool for an array of education information that OSSE publishes.  However, LearnDC is more than a website that hosts information.  LearnDC is a model under which local government can communicate with its constituents and provide leadership in its domain.

As organizations and government agencies shift towards data-driven decision making, it is also important for an education agency to nurture data literacy among constituents.  By presenting DC education data through a series of stunning data visualizations, each with its own exploration tool, LearnDC enables complex data exploration without burying the casual user in a sea of numbers or confusing lingo.  For the technically advanced, OSSE makes the entire set of aggregated data available in machine-readable formats through Github, in addition to plans to have a fully functional API operational by the end of 2014.

The shared experiences of government agencies with #OpenData have not led to the embrace of transparency for transparency’s sake, as admirable as that would be.  Rather, opening troves of publicly paid-for data has led to novel and innovative ways in which the data is used to empower ordinary citizens.  Open sourcing LearnDC’s data creates opportunities for the greater education data community to build products and tools from the data that OSSE would not necessarily have the resources to build itself.

TL;DR: LearnDC is awesome.  Go check it out already, duh!

Download the slides here: LearnDC Presentation (thanks to Chris for putting it together!)

To Apply or Merge?

As my very first post on the 8th reincarnation of StrayDots (hopefully, this one will stick), I wanted to write about a little performance optimization scenario I encountered while TA’ing the Data Science course taught by the awesome Aaron Schumacher at General Assembly.

In hand-coding a K-nearest neighbor algorithm with m training observations and n test observations, we are left in the unenviable position of doing m*n comparisons.  Although there exist algorithms that will run KNN in O(n log m), those methods often depend on complex tree structures and search methods that are quite a bit beyond the scope of the class.

In a lower-level compiled language (read: not R), one might run three for-loops: one along m, one along n, and one along x, the number of features used for KNN.  Since R vectorizes most of its base functions, only two for-loops are required: one along m and one along n.
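
As a purely illustrative aside (not from the original exercise), here is the feature loop a compiled language would force on you, next to the vectorized one-liner R gives you for free:

# explicit loop over features, compiled-language style
dist_loop = function(x, y) {
    total = 0
    for (i in seq_along(x)) total = total + (x[i] - y[i])^2
    sqrt(total)
}
# R's vectorized arithmetic collapses that loop into a single expression
dist_vec = function(x, y) sqrt(sum((x - y)^2))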

A speed-up recommended by Aaron was to use the apply function, so that instead of for-loops we would get two apply functions:

knn_one = function(test_data, training_data) {
    # Euclidean distance from the single test observation to every training row
    min_find = apply(training_data[, 1:4], 1, function(x, y) sqrt(sum((x - y)^2)), test_data)
    # label of the nearest training observation
    return(training_data[which.min(min_find), 'Label'])
}
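
The second apply is the outer one that feeds each observation into knn_one.  The post never shows it, but a minimal sketch of the wrapper used in the timings below might look like this (RunKNN_apply, the 1:4 feature columns, and classifying the data set against itself are my assumptions):

# hypothetical wrapper, not shown in the post: classify every row of the data
# set against the full data set using the apply-based knn_one above
RunKNN_apply = function(normalized_data) {
    apply(normalized_data[, 1:4], 1, knn_one, training_data = normalized_data)
}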

While this method seems both elegant and R-like, we are taxed a small overhead each time apply runs, especially by the second apply function.  Each time the anonymous function within the second apply is called, R repeatedly copies two small vectors into temporary memory to compute the Euclidean distance between them and return that value.

Having had to perform in-database operations at work from time to time due to computing constraints, I was curious to see whether a memory-intensive but CPU-saving workaround would perform any faster.  My ugly prototype is as follows:

knn_one = function(test_data, training_data) {
    # tag each test row with a key so we can group by it after the cross join
    test_data$key = seq_along(test_data[, 1])
    # merging with by = NULL produces the cross product: every test row paired
    # with every training row
    big_df = merge(training_data, test_data, by = NULL, all = TRUE)
    # columns 1:4 hold the training features, columns 6:9 the test features
    big_df$distance = sqrt((big_df[, 1] - big_df[, 6])^2 +
        (big_df[, 2] - big_df[, 7])^2 +
        (big_df[, 3] - big_df[, 8])^2 +
        (big_df[, 4] - big_df[, 9])^2)
    # for each test key, which.min returns the position of the nearest training row,
    # which is then used to look up the winning label
    index = aggregate(distance ~ key, data = big_df, which.min)
    return(big_df$Label[index$distance])
}
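
The RunKNN_merge wrapper in the timings below isn’t shown either; assuming the merge-based knn_one above is the one in scope (the post reuses the name) and the first four columns hold the features, it could be as simple as:

# hypothetical wrapper for the merge-based version, which already scores every
# test row in one pass
RunKNN_merge = function(normalized_data) {
    knn_one(normalized_data[, 1:4], normalized_data)
}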

Granted, those column operations look ugly as hell, but the idea is to merge (a join, for the SQL folks) the test data and training data into one big data frame and perform column operations on it.  This method pays one big CPU overhead in the merge step and avoids all the micro-transactions (from copying) that the double apply incurs.  In addition, merging the training and test data together creates a much bigger memory footprint during the lifecycle of the function call, which may or may not matter depending on scale.
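
To see how quickly that cross join grows, here is a toy example (the variables are mine, not from the post):

x = data.frame(a = 1:3)
y = data.frame(b = 1:4)
nrow(merge(x, y, by = NULL))   # 12 -- every row of x paired with every row of y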

system.time(replicate(200, RunKNN_apply(normalized_data)))
   user  system elapsed 
   6.74    0.00    6.74 
system.time(replicate(200, RunKNN_merge(normalized_data)))
   user  system elapsed 
   5.72    0.00    5.72

The merge method yielded a small but noticeable edge over the double-apply method on a 150-observation data set.  Yet, when applied to a 10,000-row data set, the burden of the initial overhead started showing:

system.time(replicate(5, RunKNN_apply(norm_data_10k)))
   user  system elapsed 
   9.16    0.00    9.16 
system.time(replicate(5, RunKNN_merge(norm_data_10k)))
   user  system elapsed 
  12.80    0.69   13.48

Moral of the story/TL;DR: Tom is dumb. Merging on NULL in R creates a cross product of your data. R isn’t optimized to perform database-style operations.

Ughhhh