Partial Least Squares (PLS) methods have been heavily exploited to analyse the association between two blocks of data. These powerful approaches can be applied to data sets where the number of variables is greater than the number of observations and in the presence of high collinearity between variables. Different sparse versions of PLS have been developed to integrate multiple data sets while simultaneously selecting the contributing variables. Sparse modeling is a key factor in obtaining better estimators and identifying associations between multiple data sets. The cornerstone of the sparse PLS methods is the link between the singular value decomposition (SVD) of a matrix (constructed from deflated versions of the original data) and least squares minimization in linear regression. We review four popular PLS methods for two blocks of data. A unified algorithm is proposed to perform all four types of PLS including their regularised versions. We present various approaches to decrease the computation time and show how the whole procedure can be scalable to big data sets. The bigsgPLS R package implements our unified algorithm and is available at https://github.com/matt-sutton/bigsgPLS.
Bibliographical noteVersion archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
- High dimensional data
- Lasso penalties
- Partial Least Squares
- Singular Value Decomposition