R

Multiple linear regression in a distributed system

In a previous post I talked about adapting a linear regression algorithm so it can be used in a distributed system. Essentially, a master computer oversees computations run on local data, and the algorithm pauses midway through to send summary statistics to the master. In this way, the master receives enough information to reconstruct the model without seeing the underlying data. For a linear regression model, we can simply have the master iteratively pass candidate \(\beta\) values to the workers, each of which returns its local sum of squared residuals. Adding these up gives the overall sum of squared residuals for that candidate, which the master then tries to minimize.
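To make the master/worker loop concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the code from the post: the names `make_worker` and `master_fit` are hypothetical, the workers are simulated as local functions rather than separate machines, and the master uses a simple compass (pattern) search because the text does not say which optimizer is used. The only thing each worker ever returns is a single number, its local sum of squared residuals.

```python
def make_worker(xs, ys):
    """Simulate a worker that keeps its rows (xs, ys) private."""
    def local_ssr(beta):
        # Return only the local sum of squared residuals for y ~ b0 + b1 * x.
        b0, b1 = beta
        return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return local_ssr


def master_fit(workers, start=(0.0, 0.0), step=1.0, tol=1e-8):
    """Minimize the summed SSR using only the values the workers report.

    Assumption: compass search stands in for whatever optimizer the
    master really uses. It tries moving each coefficient up or down by
    `step`, keeps any move that lowers the total SSR, and halves `step`
    when no move helps.
    """
    def total_ssr(beta):
        # The master sees one number per worker, never the raw data.
        return sum(worker(beta) for worker in workers)

    beta = list(start)
    best = total_ssr(beta)
    while step > tol:
        improved = False
        for i in range(len(beta)):
            for delta in (step, -step):
                candidate = beta.copy()
                candidate[i] += delta
                value = total_ssr(candidate)
                if value < best:
                    beta, best, improved = candidate, value, True
        if not improved:
            step /= 2
    return beta, best


# Toy data generated from y = 2 + 3x, split across two sites.
workers = [
    make_worker([0, 1, 2], [2, 5, 8]),
    make_worker([3, 4], [11, 14]),
]
beta_hat, ssr = master_fit(workers)
print(beta_hat, ssr)
```

On this noise-free toy data the search should recover an intercept near 2 and a slope near 3. A real deployment would swap the local function calls for network requests and would likely use a faster optimizer, but the information flow (candidate coefficients out, one summary number back) stays the same.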

Starting distributed computing

Quick Intro

As hospitals, care providers, and private companies collect more data, they develop rich databases that can be used to improve patient care (e.g. through precision medicine). However, research institutions often cannot share their data with each other because of privacy concerns and HIPAA compliance requirements. This hinders inter-institutional collaboration and creates a research bottleneck. It is an unfortunate case where good data security practices get in the way of work that has the potential to solve problems such as bias in AI.