Adding points from multiple dataframes to R-leaflet maps

I’ve been getting interested in plotting electoral data onto maps — and as a result have been looking at how combinations of information might explain the results of the 2019 election.

To do so, i have been plotting the booth-by-booth results onto a map, as well as data from other information sources.  This was tricky initially, but once you get the knack it makes a lot of sense (in that Hadley Wickham way).

The key is  moving the data = XYZ bit from inside the leaflet() function to the first argument of the addCircleMarkers() function.

The other trick to note below is that I set the  bins for the colorBin function manually.  This is necessary, as the mid point of the range is not zero.

Building on my earlier single-layer maps example, in this case we have the location of mines and the two-party-preferred election 2019 results by polling booth.

Maps with geo-spatial coordinates in R

I’ve been thinking a lot about the recent Australian federal election … and to visualise the data i’ve needed to plot the vote outcomes by booth.

As is so often the case, this is pretty easing using R.

The fantastic data.table package looks after joining the polling place IDs to the longitude and latitude coordinates, and the leaflet package creates great maps.

Here’s an example

Australian Federal Election 2019: New South Wales

NSW Polling booths (the intensity of the red simply reflects the number of polling places)

Using R’s optimize function

Like many people, I came to R via excel.  And like most of those people, there are still some things for which i still find myself (somewhat guiltily) firing up excel. 

One of those guilty pleasures is the goal seek function.  You know, set the value of this cell equal to something by changing the value of some other cell.  I often found myself thinking: is there an R equivalent to excel’s goal seek … but i didn’t have the time to figure it out.

A bit of spare time and a real-life problem coincided recently.

The problem was to structure an interest rate curve trade such that it wasn’t correlated with USDJPY.  Or to be more precise, to select some amount of 2yr bonds to hold against a short 10yr bond position such that the 2×10 curve trade was uncorrelated to USDJPY.

The basic problem is that US yield tends to increase at the same time as USDJPY appreciates, and in particular that USDJPY and 10yr US yields tend to move together more-so than USDJPY and 2yr yields. So when USDJPY rises the US yield curve also tends to steepen.

The problem is to find the weight on US 2yr bonds such that a US 2×10 steepener trade is uncorrelated to USDJPY.

The first task is to form the problem into a function of a single variable.

That’s done below:

corFun <- function(b, targetCor = 0) {
	spread <- portfolio_diff[, 'us10yr'] - b * portfolio_diff[, 'us2yr']
	abs(cor(spread, portfolio_diff[, 'usdjpy']) - targetCor)
}

The function corFun takes a weight parameter b, and a target correlation targetCor and returns the absolute value of the difference between the empirical correlation and targetCor.

Once the problem is packaged up into a function, you can pass the problem-function into optimize and get the result.

The optimization function works as follows:

optimize(f, lower, upper)

In the present case, f is corFun, lower is 0 (which would mean do not have a 2yr position), and upper has been set to 2; this is arbitrary, but having a 2yr position that’s more than twice the size of the 10yr position would strain my sense of proportion.

It turns out that you want to have ~1.5x the position in US 2yrs.

So if you have one unit of risk in USDJPY, and you want to have a US 2×10 curve steepener that’s uncorrelated to USDJPY, you should sell one unit of US 10yrs (say that’s 100k DV01) and buy 1.5 times that amount in US 2yr bonds (say 150k DV01).

curve usdjpy

The full code is below:

Adding Code to an RcppArmadillo package

[This is the third post in a three part series that demonstrates how to create an R package that includes RcppArmadillo source code. Follow these links for part one and part two]

So we now have an R package that is compiled and delivers a linked C++ function into your R session (when loaded with the usual library( ... ) command).

One thing that’s going to happen when you use getCoef in the real world is that you’re going to have problems with missing values.

If your missing value is in X, you will see:


R> getCoef(c(NA, X[-1]), Y)
Error in getCoef(c(NA, X[-1]), Y) :
  BLAS/LAPACK routine 'DLASCL' gave error code -4

If the missing value is in Y, you will see (the harder to diagnose):


R> getCoef(X, c(NA, Y[-1]))
     [,1]
[1,]   NA
[2,]   NA

There are two possible solutions: changing the code to check to see if the variables that have been passed include any NA values; or adding a function that fills in NA values somehow. The point of this blog is to show what happens when we add a new function — and how to do it right — so we’re going to take the second path.

A short google reveals that this is a solved problem: Romain Francois provided this solution on stackoverflow:


//[[Rcpp::export]]
NumericVector naLocf(NumericVector x) {
    double *p=x.begin(), *end = x.end() ;
    double v = *p ; p++ ;

    while( p < end ){
        while( p<end && !NumericVector::is_na(*p) ) p++ ;
        v = *(p-1) ;
        while( p<end && NumericVector::is_na(*p) ) {
            *p = v ;
            p++ ;
        }
    }

    return x;
}

If you're like me, you'll be thinking: 'great, problem solved'.

Not quite … but let's try it to see how it breaks.

Save the above snippet into file naLOCF.cpp, inside the /PAX/src/ folder, and recompile the package with:

$ CMD build C:\Users\abc\Dropbox\R\packages\PAX

You should see something like:

$ naLOCF.cpp:1:15: fatal error: Rcpp: No such file or directory

Followed by a warning that compilation has been terminated. Mine says the following:

compilation terminated.
make: *** [C:/PROGRA~1/R/R-35~1.1/etc/x64/Makeconf:215: naLOCF.o] Error 1
ERROR: compilation failed for package 'PAX'
* removing 'C:/Users/abc/AppData/Local/Temp/RtmpA7sPSX/Rinst36441e58481d/PAX'
In R CMD INSTALL
-----------------------------------
ERROR: package installation failed

So what went wrong?

We forgot the compileAttributes() and package_native_routine_registration_skeleton steps.

Recall from part two that you need to open an R terminal and do the following:

R> library(RcppArmadillo)
R> setwd("C:\Users\abc\Dropbox\R\packages\PAX")
R> Rcpp::compileAttributes()
R> tools::package_native_routine_registration_skeleton(dir = "path-to-PAX", character_only = FALSE)

NOTE this time you must set character_only = FALSE

Now copy the output to the init.c file, and try again:

$ CMD build C:\Users\abc\Dropbox\R\packages\PAX

Building an R package that includes RcppArmadillo code

[This is the second post in a three part series that demonstrates how to create an R package that includes RcppArmadillo source code. Follow these links for part one and part three]

Last time I showed how you might speed up getting the coefficients from a linear regression.  Comparisons once the code was compiled and loaded were, of course, flattering for the Rcpp solution.

But this misses the fact that compilation takes time — and at this stage we have to wait while Rcpp::sourceCpp compiles the code each session.

On my system I’d have to do about 50mil regressions per session to repay the compilation time. That’s plausible as a once off, but most of the time it would not be worth it.

The solution is to build an R package that includes the C++ code. That way you pay the compilation tax only once. After your package is built and installed, you load the package the regular way with library(PAX) — which is basically instantaneous.

Are you ready?

To build packages, you need to have Rtools installed.

  1.  Install Rtools:
    • Accept all the defaults, particularly the path (probably C:\Rtools).Make sure you keep note of where you install Rtools — we are going to need it soon.
  2. Check your PATH:
        • open cmd and type in path —  $ path
        • Do you see the path to R and Rtools? You should see something like the following: C:\Program Files\R\R-3.5.1\bin\x64 for R and C:\Rtools\bin for Rtools.
        • If you don’t, then you need to edit your path: hit the windows key and type in “edit the system environment variables”. systemProp
        • When you see it pop up in the search pane, hit enter. This should open the “System Properties” box. Select the “Advanced” tab (if it is not already selected) and press the “Environment Variables” button.
        • Now click the “Browse” button and navigate to your R and your Rtools folders. If you are having trouble finding R and can find a working R shortcut (perhaps on your desktop) you can see the path in the properties if you right-click.  Of course you wrote down the path to Rtools two steps back (right?) so that one will be easy.
      • Rprop
        • Now open cmd and check your path with $ path. You should see the paths to R and Rtools (you may have to reset).
        • Now you’re set!

Creating a RcppArmadillo package

  1. Open R and create an RcppArmadillo package skeleton: R> RcppArmadillo.package.skeleton("PAX", path = "~/Dropbox/R/packages")
  2. Using what ever method you prefer (cmd, explorer etc), copy the .cpp files into .../PAX/src/
  3. back in R … R> setwd("C:\Users\abc\Dropbox\R\packages\PAX")
  4. R> Rcpp::compileAttributes()
  5. R> tools::package_native_routine_registration_skeleton(dir = "path-to-PAX", character_only = TRUE).  NOTE: the character_only variable should be = TRUE the first time, and = FALSE if you’re updating the package.
  6. Copy the text that was output in R to \PAX\src\init.c … this tells R about your C++ functions.
  7. Now build your package … open cmd: $ R CMD build C:\Users\abc\Dropbox\R\packages\PAX … NOTE: complete paths always work; relative paths sometimes fail. You should see output similar to the below:
    • * checking for file 'C:/Users/abc/Dropbox/R/packages/PAX/DESCRIPTION' ... OK
    • * preparing 'PAX':
    • * checking DESCRIPTION meta-information ... OK
    • * cleaning src
    • * installing the package to process help pages
    • * saving partial Rd database
    • * cleaning src
    • * checking for LF line-endings in source and make files and shell scripts
    • * checking for empty or unneeded directories
    • * building 'PAX_1.0.tar.gz'
  8. Still in cmd, run $ R CMD INSTALL PAX_1.0.tar.gz … note that you don’t need the full path in this step.  You should see some compilation stuff such as:
    • * installing *source* package 'PAX' ...
    • ** libs
    • c:/Rtools/mingw_64/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-35~1.1/include" -DNDEBUG -I"C:/Users/abc/R/rpax/Rcpp/include" -I"C:/Users/abc/R/rpax/RcppArmadillo/include" -fopenmp -O2 -Wall -mtune=generic -c RcppExports.cpp -o RcppExports.o
  9. This should conclude with * Done(PAX)
  10. Open up R and try R> library(PAX) … now execute R> getCoef .. you should see the function and definition:

R> getCoef
function (X,Y)
{
.Call('_PAX_getCoef', X, Y)
}

    .

Find coefficients from linear regression in R (really fast)

[This is the first part of a three part series that demonstrates how to create an R package that includes RcppArmadillo source code; follow these links for parts two and three]

Every journey needs motivation … so let’s say you want to run a LOT of regressions. Additionally, let’s say you are really only interested in the coefficients.

In that case, lm.fit in R does way too much work and we can speed it up with a few relatively simple tricks (TL;DR RcppArmadillo is fastest!)

First of all, it should be noted that it might not be worth the effort. You can get a long way without linking to C++ code.  In the present slightly contrived case, the lm.fit and .lm.fit functions are much faster than lm.


set.seed(1)

ROWS <- 1e5
Y <- 1:ROWS + runif(ROWS, -1, 1)
X <- 1:ROWS + rnorm(ROWS, 0, 3)

# get coefs
coef_1st <- function() lm(Y ~ X)$coefficients
coef_2nd <- function() lm.fit(cbind(rep(1, length(X)), X), Y)$coefficients
coef_3rd  microbenchmark(coef_1st(), coef_2nd(), coef_3rd(), times = 1e4)

# benchmark
microbenchmark(coef_1st(), coef_2nd(), coef_3rd(), times = 1e4)
Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
 coef_1st() 25.453486 34.005381 49.059493 45.734325 60.332143 1476.2446 10000
 coef_2nd()  3.976818  5.264472  9.967156  7.899375 11.087321  177.8020 10000
 coef_3rd()  2.868874  3.476643  7.131980  5.091950  7.207663  114.3153 10000

Using medians,  lm.fit is ~6x faster than lm, and .lm.fit is a little over 9x faster! These are big improvements for little effort. So if this is your exact problem, you can now stop reading.

HOWEVER if you have REAL work to do, you might still be interested in C++.  And if you want to do regressions, you need linear algebra.  Enter RcppArmadillo!

You could of course use the ​fastLmPure function in the RcppArmadillo package. I have done that below.


require(RcppArmadillo)

coef_4th <- function() {
# note to make XX to get intercept from fastLmPure;
# this behavior is the default for the other functions
    XX <- cbind(rep(1L, length(X)), X)
    fastLmPure(XX, Y)$coefficients
}

Note that we need to prepare the data differently, by adding the column of 1s to the matrix (this is the intercept).  This isn't free so i included that step in the benchmarking function.

Right away you should be thinking that we can do better … we could move the code that adds the column of 1s (creating XX) into C++, and also do only the calculations we need.

This C++ code does just that:


#include

// [[Rcpp::depends(RcppArmadillo)]]

//[[Rcpp::export]]
arma::colvec getCoef(const arma::vec & X, const arma::vec & Y) {
	// this function takes two numeric vectors
	// checks to make sure the dimensions match
	// returns the coefficients from a bivariate regression
	// TODO: add an intercept switch ... noInt

	int Xlen = X.size();
	int Ylen = Y.size();

	if( Xlen != Ylen ) Rcpp::stop("X and Y must have the same length");

	arma::mat XX(Xlen, 2);

	for( int i = 0; i < Xlen; i++) {
		XX(i, 0) = 1; // constant
		XX(i, 1) = X(i);
	}

	// find coefficients
	arma::colvec coef = arma::solve(XX, Y);

	return coef;
}

Now we source the C++ function using Rcpp::sourceCpp, loading getCoef as a function in the global environment.

R> Rcpp::sourceCpp("~/Dropbox/Cpp/R/fastLMCo.cpp")

As you would expect, our custom getCoef function is fastest, besting RcppArmadillo’s fastLmPure by ~25%


coef_5th <- function() getCoef(X, Y)

microbenchmark(coef_1st(), coef_2nd(), coef_3rd(),
               coef_4th(), coef_5th(), times = 1e4)

Unit: milliseconds
expr min lq mean median uq max neval
coef_1st() 24.688233 28.716522 33.826753 32.054957 35.927653 187.17681 10000
coef_2nd() 3.672581 4.328348 7.166927 4.916639 8.739248 87.31223 10000
coef_3rd() 2.508058 3.079421 5.142339 3.409392 4.531942 101.43583 10000
coef_4th() 2.242320 2.614726 3.578321 2.834551 3.240812 106.49090 10000
coef_5th() 1.986783 2.169972 2.467147 2.260406 2.530783 44.72023 10000