Robust PCA: The Power of Data Decomposition

Soya Kim
3 min read · Apr 29, 2024


Traditional PCA lays a solid foundation for dimensionality reduction, but when outliers come into play, it might falter. Enter Robust PCA (RPCA) — a game-changer in the realm of data analysis.

Reference: lecture notes and examples from the High Dimensional Data Analytics course (GT-ISYE8803)

Traditional PCA:

  • Reduces data dimensions by finding a low-rank representation.
  • Represents data as 𝑋 = 𝐿 + 𝐸, separating a low-rank component 𝐿 from dense noise 𝐸.
  • Often relies on truncated Singular Value Decomposition (SVD), as sketched below.
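
For comparison, here is a minimal sketch of that truncated-SVD step (the function name and the rank parameter k are illustrative, not from the course code):

import numpy as np

def truncated_svd_pca(X, k):
    # Keep only the top-k singular triplets to form the best rank-k approximation of X
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]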

Robust PCA:

  • Effectively handles gross outliers, which is pivotal in tasks like face recognition and anomaly detection.
  • Formulated as minimizing rank(𝐿) + λ‖𝑆‖₀ subject to 𝑀 = 𝐿 + 𝑆; since that problem is intractable, it is relaxed to the convex surrogate ‖𝐿‖∗ + λ‖𝑆‖₁ (nuclear norm plus ℓ₁ norm).
  • Despite its computational demand, the Augmented Lagrange Multiplier (ALM) method solves the relaxation efficiently; its objective is shown below.
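
For reference, the standard augmented Lagrangian of the relaxed problem (following the usual RPCA formulation) attaches the constraint 𝑀 = 𝐿 + 𝑆 to the objective, and the algorithm below alternately minimizes it over 𝐿 and 𝑆:

ℒ(𝐿, 𝑆, 𝑌) = ‖𝐿‖∗ + λ‖𝑆‖₁ + ⟨𝑌, 𝑀 − 𝐿 − 𝑆⟩ + (μ/2)‖𝑀 − 𝐿 − 𝑆‖²_F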

The RPCA algorithm, though intricate, follows a structured approach:

Step 1. Initialize parameters: 𝐿, 𝑆, the Lagrange multipliers, and the tuning parameters.

Step 2. Iteratively update 𝐿, 𝑆, and the Lagrange multipliers until convergence (both updates use the soft-thresholding operator sketched below).

Step 3. Halt when the change in reconstruction error falls below a threshold.
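
Both updates in Step 2 apply the same elementwise soft-thresholding (shrinkage) operator: to the matrix entries for 𝑆, and to the singular values for 𝐿. A minimal sketch of the operator used in the implementation below:

import numpy as np

def soft_threshold(X, tau):
    # Shrink each entry toward zero by tau; entries with magnitude below tau vanish
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)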

Example Showcase: Accelerating Robust PCA with Random Sampling

In the ever-evolving landscape of data analysis, handling noisy data is a recurring challenge. RPCA steps in, offering a reliable method to decompose data into discernible components, even amidst outliers.

# Robust PCA via the inexact Augmented Lagrange Multiplier (ALM) method
import time

import numpy as np


def robust_pca(D):
    # Initialize parameters
    m, n = D.shape
    lam = 1e-2                          # weight on the sparse term (a common default is 1/sqrt(max(m, n)))
    mu = 1.25 / np.linalg.norm(D, 2)    # penalty parameter
    mu_bar = mu * 1e7                   # cap on the growth of mu
    rho = 1.5                           # growth factor for mu

    Y = D / (D.max() / lam)             # Lagrange multiplier matrix
    low_rank = np.zeros((m, n))
    sparse = np.zeros_like(low_rank)
    d_norm = np.linalg.norm(D, 'fro')

    tol = 1e-7
    n_iter = 0
    converged = False

    start_time = time.time()

    while not converged:
        n_iter += 1
        X = D - low_rank + Y / mu

        # Soft threshold on the entries: update the sparse component S
        sparse = np.sign(X) * np.maximum(np.abs(X) - lam / mu, 0)

        # Soft threshold on the singular values: update the low-rank component L
        u, s, vh = np.linalg.svd(D - sparse + Y / mu, full_matrices=False)
        svp = (s > 1 / mu).sum()
        low_rank = (u[:, :svp] * (s[:svp] - 1 / mu)) @ vh[:svp]

        # Update the multiplier and grow the penalty parameter (capped at mu_bar)
        Z = D - low_rank - sparse
        Y = Y + mu * Z
        mu = min(mu * rho, mu_bar)

        # Check for convergence on the relative reconstruction error
        if np.linalg.norm(Z, 'fro') / d_norm < tol:
            converged = True

    end_time = time.time()
    print('\nTotal number of iterations:', n_iter)
    print('Elapsed time:', end_time - start_time)

    return low_rank, sparse, end_time - start_time
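
As a quick sanity check of the routine above, one can feed it a synthetic low-rank matrix with sparse corruption (the shapes, seed, and corruption level here are illustrative, not the course dataset):

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 100, 5

# Ground truth: a rank-r matrix plus 5% sparse, large-magnitude corruption
L_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S_true = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05
S_true[mask] = rng.uniform(-10, 10, size=mask.sum())

L_hat, S_hat, elapsed = robust_pca(L_true + S_true)
print('Relative error in L:',
      np.linalg.norm(L_hat - L_true, 'fro') / np.linalg.norm(L_true, 'fro'))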

Part 1. Robust PCA:

Applied to the full dataset, RPCA cleanly separated the low-rank structure from the sparse outliers in each frame.

Figure 1. The result of RPCA for frames 100 and 150.

Part 2. Accelerating Robust PCA:

Randomly sample, without replacement, 2000 rows and 60 columns from the original (21600, 600) dataset.
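
A minimal sketch of that sampling step, assuming the full data sits in a (21600, 600) NumPy array called frames (the variable name is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
row_idx = rng.choice(21600, size=2000, replace=False)
col_idx = rng.choice(600, size=60, replace=False)

# np.ix_ selects the intersection of the sampled rows and columns
D_sub = frames[np.ix_(row_idx, col_idx)]  # frames: hypothetical (21600, 600) array
L_sub, S_sub, elapsed = robust_pca(D_sub)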

Figure 2. The result of RPCA on the randomly sampled dataset.

By sampling rows and columns at random, we accelerated the decomposition more than five-fold without compromising accuracy. This makes random sampling a practical, efficient way to scale Robust PCA to large datasets.

#RobustPCA #DataAnalysis #MachineLearning #DataScience #OutlierDetection #AnomalyDetection #DimensionalityReduction #GT #ISYE8803 #HighDimensionalDataAnalytics
