To do so, I have been plotting the booth-by-booth results onto a map, along with data from other information sources. This was tricky at first, but once you get the knack it makes a lot of sense (in that Hadley Wickham way).
The key is moving the data = XYZ bit from inside the leaflet() function to the first argument of the addCircleMarkers() function.
The other trick to note below is that I set the bins for the colorBin function manually. This is necessary because the midpoint of the range is not zero.
Building on my earlier single-layer maps example, in this case we have the location of mines and the two-party-preferred election 2019 results by polling booth.
# packages
require(data.table)
require(leaflet)
require(readxl)

twoppReps <- as.data.table(read.csv("https://tallyroom.aec.gov.au/Downloads/HouseTppByPollingPlaceDownload-24310.csv", skip = 1L))
pollingPlaces <- as.data.table(read.csv("https://tallyroom.aec.gov.au/Downloads/GeneralPollingPlacesDownload-24310.csv", skip = 1L))
operatingMines <- as.data.table(read_excel("C:/Users/mcoog/Dropbox/blogs/operating_mines.xls"))

# inner join to add latitude and longitude data
twoppReps[pollingPlaces[, list(PollingPlaceID, Latitude, Longitude)],
          `:=`(lat = i.Latitude, long = i.Longitude),
          on = 'PollingPlaceID']

# bin the data so that 0 = neutral, -100 = red and +100 = blue
swing_bins <- c(-100, -25, -10, -5, -2, -1, 0, 1, 2, 5, 10, 25, 100)
swing_pal <- colorBin(c("#f00505", "#fafafa", "#0520f0"),
                      domain = NULL,  # bins are set manually below
                      bins = swing_bins)

ausMines_booths <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = twoppReps,
                   lat = ~lat, lng = ~long, popup = ~PollingPlace,
                   color = ~swing_pal(Swing),
                   stroke = FALSE, fillOpacity = 0.3, opacity = 0.3) %>%
  addCircleMarkers(data = operatingMines,
                   lat = ~Latitude, lng = ~Longitude, popup = ~Name,
                   color = 'black', radius = 7, stroke = FALSE,
                   opacity = 0.50, fillOpacity = 0.50)

print(ausMines_booths)
As is so often the case, this is pretty easy using R.
The fantastic data.table package looks after joining the polling place IDs to the longitude and latitude coordinates, and the leaflet package creates great maps.
Here’s an example:
One of those guilty pleasures is the goal seek function. You know: set the value of this cell equal to something by changing the value of some other cell. I often found myself thinking: is there an R equivalent to Excel’s goal seek? But I didn’t have the time to figure it out.
A bit of spare time and a real-life problem coincided recently.
The problem was to structure an interest rate curve trade such that it wasn’t correlated with USDJPY. Or to be more precise, to select some amount of 2yr bonds to hold against a short 10yr bond position such that the 2×10 curve trade was uncorrelated to USDJPY.
The basic problem is that US yields tend to increase at the same time as USDJPY appreciates, and in particular that USDJPY and 10yr US yields tend to move together more so than USDJPY and 2yr yields. So when USDJPY rises, the US yield curve also tends to steepen.
The problem is to find the weight on US 2yr bonds such that a US 2×10 steepener trade is uncorrelated to USDJPY.
The first task is to form the problem into a function of a single variable.
That’s done below:
corFun <- function(b, targetCor = 0) {
    spread <- portfolio_diff[, 'us10yr'] - b * portfolio_diff[, 'us2yr']
    abs(cor(spread, portfolio_diff[, 'usdjpy']) - targetCor)
}
The function corFun takes a weight parameter b and a target correlation targetCor, and returns the absolute value of the difference between the empirical correlation and targetCor.
Once the problem is packaged up into a function, you can pass the problem-function into optimize and get the result.
The optimization function works as follows:
optimize(f, lower, upper)
In the present case, f is corFun, lower is 0 (which would mean holding no 2yr position), and upper has been set to 2; this is arbitrary, but having a 2yr position that’s more than twice the size of the 10yr position would strain my sense of proportion.
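To see the goal-seek pattern in isolation, here is a minimal, self-contained sketch (toy numbers; goalFun is my own name, not from the trade example) that finds x such that x^2 = 2:

```r
# minimal goal-seek sketch: find x in [0, 2] such that x^2 = 2,
# by minimising the absolute miss -- the same pattern corFun uses
goalFun <- function(x, target = 2) abs(x^2 - target)
opt <- optimize(goalFun, lower = 0, upper = 2)
opt$minimum  # close to sqrt(2), about 1.414
```

Any "set this cell to a value" problem can be recast the same way: wrap the miss in a one-argument function and minimise it.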
It turns out that you want to have ~1.5x the position in US 2yrs.
So if you have one unit of risk in USDJPY, and you want to have a US 2×10 curve steepener that’s uncorrelated to USDJPY, you should sell one unit of US 10yrs (say that’s 100k DV01) and buy 1.5 times that amount in US 2yr bonds (say 150k DV01).
The full code is below:
library(fredr)
require(xts)

# note you'll need an API key
fredr_set_key("123YourFREDAPIKey321")

# helper function to convert a tibble to xts
xtsT <- function(TIB) xts(TIB$value, TIB$date)

# helper function to remove rows that have NA values
rmNA <- function(X) X[complete.cases(X), ]

# get data
us2yr <- xtsT(fredr("DGS2", observation_start = as.Date('2016-01-01')))
us10yr <- xtsT(fredr("DGS10", observation_start = as.Date('2016-01-01')))
usdjpy <- xtsT(fredr("DEXJPUS", observation_start = as.Date('2016-01-01')))

# create portfolio: % returns for USDJPY and bps changes for the 2x10 curve
portfolio <- rmNA(cbind(usdjpy, us2yr, us10yr))
portfolio_diff <- rmNA(100 * cbind(diff(portfolio[, 1], log = TRUE),
                                   diff(portfolio[, 2:3])))

# embed the problem in a function
corFun <- function(b, targetCor = 0) {
    spread <- portfolio_diff[, 'us10yr'] - b * portfolio_diff[, 'us2yr']
    abs(cor(spread, portfolio_diff[, 'usdjpy']) - targetCor)
}

# optimise the function
optWeight <- optimize(corFun, lower = 0, upper = 2)

# plot the range of results
plot(sapply(seq(0, 2, 0.01), corFun), x = seq(0, 2, 0.01),
     main = "Correlation of weighted US 2x10 curve trade with USDJPY",
     xlab = "weighting on 2yr",
     ylab = "Correlation with USDJPY")
abline(v = 1, col = 8, lty = 2, lwd = 1)
abline(h = 0, col = 2, lty = 2, lwd = 1)
points(optWeight$minimum, 0, pch = 17, col = 2, type = 'p')
So we now have an R package that is compiled and delivers a linked C++ function into your R session (when loaded with the usual library( ... ) command).
One thing that’s going to happen when you use getCoef in the real world is that you’re going to have problems with missing values.
If your missing value is in X, you will see:
R> getCoef(c(NA, X[-1]), Y)
Error in getCoef(c(NA, X[-1]), Y) :
  BLAS/LAPACK routine 'DLASCL' gave error code -4
If the missing value is in Y, you will see (the harder to diagnose):
R> getCoef(X, c(NA, Y[-1]))
     [,1]
[1,]   NA
[2,]   NA
There are two possible solutions: change the code to check whether the inputs include any NA values, or add a function that fills in NA values somehow. The point of this blog is to show what happens when we add a new function (and how to do it right), so we’re going to take the second path.
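For completeness, the first path is easy to do on the R side. Here is a small sketch (getCoefChecked is my own name, not part of the package) of a wrapper that refuses NA inputs before they ever reach the compiled routine:

```r
# sketch of the first option: validate inputs in R before calling the
# compiled getCoef, failing loudly instead of hitting a LAPACK error
getCoefChecked <- function(X, Y) {
  if (anyNA(X) || anyNA(Y))
    stop("X and Y must not contain NA values")
  getCoef(X, Y)
}
```

The downside is that the caller still has to decide what to do with the NA rows, which is why filling them in can be the friendlier option.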
A short Google reveals that this is a solved problem: Romain Francois provided this solution on Stack Overflow:
//[[Rcpp::export]]
NumericVector naLocf(NumericVector x) {
    double *p = x.begin(), *end = x.end();
    double v = *p;
    p++;
    while (p < end) {
        while (p < end && !NumericVector::is_na(*p)) p++;
        v = *(p - 1);
        while (p < end && NumericVector::is_na(*p)) {
            *p = v;
            p++;
        }
    }
    return x;
}
If you're like me, you'll be thinking: 'great, problem solved'.
Not quite … but let's try it to see how it breaks.
Save the above snippet into the file naLOCF.cpp inside the /PAX/src/ folder, and recompile the package with:
$ R CMD build C:\Users\abc\Dropbox\R\packages\PAX
You should see something like:
$ naLOCF.cpp:1:15: fatal error: Rcpp: No such file or directory
Followed by a warning that compilation has been terminated. Mine says the following:
compilation terminated.
make: *** [C:/PROGRA~1/R/R-35~1.1/etc/x64/Makeconf:215: naLOCF.o] Error 1
ERROR: compilation failed for package 'PAX'
* removing 'C:/Users/abc/AppData/Local/Temp/RtmpA7sPSX/Rinst36441e58481d/PAX'
In R CMD INSTALL
-----------------------------------
ERROR: package installation failed
So what went wrong?
We forgot the compileAttributes() and package_native_routine_registration_skeleton steps.
Recall from part two that you need to open an R terminal and do the following:
R> library(RcppArmadillo)
R> setwd("C:/Users/abc/Dropbox/R/packages/PAX")
R> Rcpp::compileAttributes()
R> tools::package_native_routine_registration_skeleton(dir = "path-to-PAX", character_only = FALSE)
NOTE: this time you must set character_only = FALSE.
Now copy the output to the init.c file, and try again:
$ R CMD build C:\Users\abc\Dropbox\R\packages\PAX
Last time I showed how you might speed up getting the coefficients from a linear regression. Comparisons once the code was compiled and loaded were, of course, flattering for the Rcpp solution.
But this misses the fact that compilation takes time, and at this stage we have to wait while Rcpp::sourceCpp compiles the code each session.
On my system I’d have to do about 50 million regressions per session to repay the compilation time. That’s plausible as a one-off, but most of the time it would not be worth it.
The solution is to build an R package that includes the C++ code. That way you pay the compilation tax only once. After your package is built and installed, you load the package the regular way with library(PAX), which is basically instantaneous.
Are you ready?
To build packages, you need to have Rtools installed.

1. Install Rtools (to, say, C:\Rtools). Make sure you keep note of where you install Rtools; we are going to need it soon.
2. Open cmd and type in path ($ path). Your path needs C:\Program Files\R\R-3.5.1\bin\x64 for R and C:\Rtools\bin for Rtools.
3. Reopen cmd and check your path with $ path. You should see the paths to R and Rtools (you may have to reset).

Creating an RcppArmadillo package
1. Create the package skeleton: R> RcppArmadillo.package.skeleton("PAX", path = "~/Dropbox/R/packages")
2. Put your C++ code into .../PAX/src/
3. Set the working directory: R> setwd("C:/Users/abc/Dropbox/R/packages/PAX")
4. R> Rcpp::compileAttributes()
5. R> tools::package_native_routine_registration_skeleton(dir = "path-to-PAX", character_only = TRUE) … NOTE: the character_only variable should be = TRUE the first time, and = FALSE if you’re updating the package.
6. Copy the output to \PAX\src\init.c … this tells R about your C++ functions.
7. Build the package: $ R CMD build C:\Users\abc\Dropbox\R\packages\PAX … NOTE: complete paths always work; relative paths sometimes fail.

You should see output similar to the below:
* checking for file 'C:/Users/abc/Dropbox/R/packages/PAX/DESCRIPTION' ... OK
* preparing 'PAX':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to process help pages
* saving partial Rd database
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'PAX_1.0.tar.gz'
$ R CMD INSTALL PAX_1.0.tar.gz
… note that you don’t need the full path in this step. You should see some compilation stuff such as:
* installing *source* package 'PAX' ...
** libs
c:/Rtools/mingw_64/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-35~1.1/include" -DNDEBUG -I"C:/Users/abc/R/rpax/Rcpp/include" -I"C:/Users/abc/R/rpax/RcppArmadillo/include" -fopenmp -O2 -Wall -mtune=generic -c RcppExports.cpp -o RcppExports.o
* Done(PAX)
Open up R and try R> library(PAX) … now execute R> getCoef … you should see the function definition:
R> getCoef
function (X,Y)
{
.Call('_PAX_getCoef', X, Y)
}
Every journey needs motivation … so let’s say you want to run a LOT of regressions. Additionally, let’s say you are really only interested in the coefficients.
In that case, lm.fit in R does way too much work, and we can speed it up with a few relatively simple tricks (TL;DR: RcppArmadillo is fastest!).
First of all, it should be noted that it might not be worth the effort. You can get a long way without linking to C++ code. In the present slightly contrived case, the lm.fit and .lm.fit functions are much faster than lm.
require(microbenchmark)

set.seed(1)
ROWS <- 1e5
Y <- 1:ROWS + runif(ROWS, -1, 1)
X <- 1:ROWS + rnorm(ROWS, 0, 3)

# get coefs
coef_1st <- function() lm(Y ~ X)$coefficients
coef_2nd <- function() lm.fit(cbind(rep(1, length(X)), X), Y)$coefficients
coef_3rd <- function() .lm.fit(cbind(rep(1, length(X)), X), Y)$coefficients

# benchmark
microbenchmark(coef_1st(), coef_2nd(), coef_3rd(), times = 1e4)

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
 coef_1st() 25.453486 34.005381 49.059493 45.734325 60.332143 1476.2446 10000
 coef_2nd()  3.976818  5.264472  9.967156  7.899375 11.087321  177.8020 10000
 coef_3rd()  2.868874  3.476643  7.131980  5.091950  7.207663  114.3153 10000
Using medians, lm.fit is ~6x faster than lm, and .lm.fit is a little over 9x faster! These are big improvements for little effort. So if this is your exact problem, you can now stop reading.
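As a quick sanity check (a sketch with toy data, not from the benchmark above), you can confirm that .lm.fit with a hand-built intercept column reproduces lm’s coefficients:

```r
# sketch: .lm.fit needs the intercept column added by hand, but then
# returns the same coefficients as lm
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit_lm   <- unname(coef(lm(y ~ x)))
fit_fast <- unname(.lm.fit(cbind(1, x), y)$coefficients)
all.equal(fit_lm, fit_fast)  # TRUE
```

The speed comes from skipping everything lm does besides the QR solve: formula handling, model frames, fitted values, and so on.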
HOWEVER if you have REAL work to do, you might still be interested in C++. And if you want to do regressions, you need linear algebra. Enter RcppArmadillo!
You could of course use the fastLmPure function in the RcppArmadillo package. I have done that below.
require(RcppArmadillo)

coef_4th <- function() {
    # note: need to make XX to get an intercept from fastLmPure;
    # this behaviour is the default for the other functions
    XX <- cbind(rep(1L, length(X)), X)
    fastLmPure(XX, Y)$coefficients
}
Note that we need to prepare the data differently, by adding the column of 1s to the matrix (this is the intercept). This isn’t free, so I included that step in the benchmarked function.
Right away you should be thinking that we can do better … we could move the code that adds the column of 1s (creating XX) into C++, and also do only the calculations we need.
This C++ code does just that:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

//[[Rcpp::export]]
arma::colvec getCoef(const arma::vec & X, const arma::vec & Y) {
    // this function takes two numeric vectors,
    // checks to make sure the dimensions match, and
    // returns the coefficients from a bivariate regression
    // TODO: add an intercept switch ... noInt
    int Xlen = X.size();
    int Ylen = Y.size();
    if (Xlen != Ylen) Rcpp::stop("X and Y must have the same length");
    arma::mat XX(Xlen, 2);
    for (int i = 0; i < Xlen; i++) {
        XX(i, 0) = 1;  // constant
        XX(i, 1) = X(i);
    }
    // find the coefficients
    arma::colvec coef = arma::solve(XX, Y);
    return coef;
}
Now we source the C++ function using Rcpp::sourceCpp, which loads getCoef as a function in the global environment.
R> Rcpp::sourceCpp("~/Dropbox/Cpp/R/fastLMCo.cpp")
As you would expect, our custom getCoef function is fastest, besting RcppArmadillo’s fastLmPure by ~25%.
coef_5th <- function() getCoef(X, Y)

microbenchmark(coef_1st(), coef_2nd(), coef_3rd(), coef_4th(), coef_5th(),
               times = 1e4)

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
 coef_1st() 24.688233 28.716522 33.826753 32.054957 35.927653 187.17681 10000
 coef_2nd()  3.672581  4.328348  7.166927  4.916639  8.739248  87.31223 10000
 coef_3rd()  2.508058  3.079421  5.142339  3.409392  4.531942 101.43583 10000
 coef_4th()  2.242320  2.614726  3.578321  2.834551  3.240812 106.49090 10000
 coef_5th()  1.986783  2.169972  2.467147  2.260406  2.530783  44.72023 10000