Starting distributed computing

Quick Intro

As hospitals, care providers, and private companies collect more data, they build rich databases that can be used to improve patient care (e.g. through precision medicine). However, research institutions often cannot share their data with each other because of privacy concerns and HIPAA compliance. This creates a research bottleneck: it is an unfortunate case where good data security practices stand in the way of inter-institutional collaboration, the kind of collaboration that could help solve problems such as bias in AI.

One way to circumvent this issue is with distributed computation. Through an appropriately configured distributed computing service, it is possible to fit models on data that could, in principle, be stacked vertically into a single table (the sites record the same variables) but that remain electronically separate.

Partitioning data

The underlying premise is that most (all?) computations for modelling require multiple steps. If we take values calculated during an intermediate step of a computation performed at individual sites, then we can aggregate these values in a central location to produce a final model. Since the sites don’t talk to each other, and the central location only receives a summary statistic, none of the underlying information gets shared. Each institution or entity can hold on to its data and maintain its security while still helping the common good.
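
To make the idea concrete, here is a toy sketch (my own illustration, not part of the distcomp package): computing an overall mean where each site reports only a sum and a count, so no raw values ever leave a site.

# toy example: each site shares only a sum and a count, never the raw values
site1.values <- c(5, 7, 9)      # lives only at site 1
site2.values <- c(4, 6, 8, 10)  # lives only at site 2

site1.summary <- list(s = sum(site1.values), n = length(site1.values))
site2.summary <- list(s = sum(site2.values), n = length(site2.values))

# the central location combines the two summaries into the pooled mean
pooled.mean <- (site1.summary$s + site2.summary$s) / (site1.summary$n + site2.summary$n)
pooled.mean
## [1] 7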

That is the vision and purpose of the distcomp R package, which simplifies distributed computation through a series of GUIs that walk the user through the setup. The only hang-up is that there are currently very few computations available that are written for this distributed process.

That is why, in this post, I am going to walk through prototyping a distributed computation before adding it to the distcomp package. I will start small, with a linear regression (which was not included in the initial list of possible distributed computations).

Prototyping Locally

To do a proper distributed computation, you need multiple sites and a master controlling instance, which means configuring servers or VMs. But you don't really need any of that to prototype a computation and check that you can aggregate values from some intermediate step.

Consider fitting a linear regression to some data set by minimizing the residual sum of squares. Given a set of data points \((x_i, y_i), i=1,\dots,n\), and a model \(f(x, \beta)\), each prediction has a residual (error) \(r_i = y_i - f(x_i, \beta)\). If the model is linear, \(f(x, \beta) = \beta_1 x + \beta_0\), then we can fit it by choosing the parameters that minimize the sum of the squared residuals: \(\sum_{i=1}^n r_i^2\).
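
The reason we can split this up is that the objective decomposes over any partition of the observations into sites:

\[\sum_{i=1}^{n} r_i^2 = \sum_{k} \sum_{i \in \text{site } k} r_i^2,\]

so, for a given \(\beta_0\) and \(\beta_1\), each site can compute its own partial sum of squared residuals and the master only needs to add the partial sums together.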

This is the step at which we distribute the work: the master sends out the current parameters (\(\beta_0\) and \(\beta_1\)) to each of the participating sites. I can re-create this scenario locally by simulating two separate data sets.

# general function for simulating a sample data set given parameters:
# mu/sig = mean and sd of x, amt = number of observations, seed = RNG seed,
# mpar = c(slope, intercept) of the true line, nl = noise level (sd)
sim.data <- function(mu, sig, amt, seed, mpar, nl){
  # simulate the predictor values
  set.seed(seed)
  x <- rnorm(n=amt, mean=mu, sd=sig)
  
  # create the "true" equation for the regression
  a.true <- mpar[1]
  b.true <- mpar[2]
  y <- x*a.true + b.true
  
  # add noise to the true values to get the observed response
  noise <- rnorm(n=amt, mean=0, sd=nl)
  d <- cbind(x, y, y + noise)
  colnames(d) <- c("x", "y_true", "y")
  return(as.data.frame(d))
}

sim.data1 <- sim.data(10, 2, 100, 2020, c(2,8), 1)
sim.data2 <- sim.data(10, 2, 100, 2019, c(2,8), 1)
sites <- list(site1 = list(data=sim.data1), site2 = list(data=sim.data2))

head(sim.data1)
##           x   y_true        y
## 1 10.753944 29.50789 27.77910
## 2 10.603097 29.20619 28.21493
## 3  7.803954 23.60791 23.02240
## 4  7.739188 23.47838 23.86190
## 5  4.406931 16.81386 17.56053
## 6 11.441147 30.88229 29.95387
head(sim.data2)
##           x   y_true        y
## 1 11.477045 30.95409 30.10904
## 2  8.970479 25.94096 26.79889
## 3  6.719637 21.43927 20.75567
## 4 11.832074 31.66415 31.65345
## 5  7.465036 22.93007 21.52591
## 6 11.476496 30.95299 32.34477

Here we simulate two data sets, each with 100 observations drawn with a mean of 10 and a standard deviation of 2 (only the random seeds differ). The true values for the linear model are a slope of 2 and an intercept of 8.
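
As a quick sanity check (my addition, not in the original workflow), we can plot both simulated sites together with the true line:

# overlay the two simulated sites and the true regression line y = 2x + 8
plot(sim.data1$x, sim.data1$y, pch = 16, col = "steelblue",
     xlab = "x", ylab = "y", main = "Simulated data from both sites")
points(sim.data2$x, sim.data2$y, pch = 16, col = "tomato")
abline(a = 8, b = 2, lty = 2)  # true intercept and slope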

Each site then calculates the residuals from its own data and sends back a single summary statistic: the sum of its squared residuals.

# residual sum of squares across multiple sites, for use with optim();
# par[1] is the intercept, par[2] is the slope
multi.min.RSS <- function(sites, par){
  rs <- 0
  # each site computes the residuals on its own data...
  for(site in sites){
    tmps <- par[1] + par[2] * site$data$x - site$data$y
    # ...and contributes only its sum of squared residuals to the total
    rs <- rs + sum(tmps^2)
  }
  # return the combined residual sum of squares
  return(rs)
}
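
As a quick check of the objective (my addition), we can evaluate it at the true parameters and at a deliberately poor guess; note that par[1] is the intercept and par[2] the slope:

# RSS at the true parameters (intercept 8, slope 2) -- should be small
multi.min.RSS(sites, par = c(8, 2))
# RSS at a poor guess -- should be much larger
multi.min.RSS(sites, par = c(0, 0))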

All that is left to do is minimize this combined objective over the parameters; in a real deployment, each evaluation would correspond to one round of the master sending parameters to the sites and summing their replies. Locally, we can use base R's optim function to do this.

param.fit <- optim(par=c(0,1),
                   fn = multi.min.RSS,
                   hessian = TRUE,
                   sites=sites)
print("Distributed linear model results:")
## [1] "Distributed linear model results:"
print(paste("Intercept: ", param.fit$par[1], " Slope: ", param.fit$par[2]))
## [1] "Intercept:  7.77183635600249  Slope:  2.00840909246344"

We can compare the results to R's built-in linear model function, lm, by stacking the data from the two “sites” and running a linear model on the full data set.

# stack the data frames vertically for later verification
sim.data3 <- as.data.frame(rbind(sim.data1, sim.data2)) 
print("Base R linear model on the full data set:")
## [1] "Base R linear model on the full data set:"
lm(y~x, data=sim.data3)
## 
## Call:
## lm(formula = y ~ x, data = sim.data3)
## 
## Coefficients:
## (Intercept)            x  
##       7.776        2.008

Pretty similar!
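
To quantify the agreement (a quick check, not in the original post), we can compare the two sets of coefficients directly:

# difference between the pooled lm coefficients and the distributed optim estimates
coef(lm(y ~ x, data = sim.data3)) - param.fit$par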


Eric Cramer
Biomedical Data Analyst, Stanford School of Medicine

My research interests include biomedical data mining, developing tools to improve bio-imaging, and creating novel machine learning algorithms for the biomedical research space.