Traditional PCA lays a solid foundation for dimensionality reduction, but when outliers come into play, it might falter. Enter Robust PCA (RPCA) — a game-changer in the realm of data analysis.
Reference: Lecture Note and Example from the High Dimensional Data Analytics course (GT-ISYE8803)
Traditional PCA:
- Reduces data dimensions by finding a low-rank representation.
- Represents data as 𝑋 = 𝐿 + 𝐸, separating a low-rank component 𝐿 from small noise 𝐸.
- Often relies on truncated Singular Value Decomposition (SVD); a minimal sketch follows this list.
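For reference, rank-𝑘 truncation of the SVD takes only a few lines of NumPy. This is a minimal sketch; the function name and the rank parameter k are illustrative, not from the course code:

import numpy as np

# Keep only the top-k singular values/vectors: by the Eckart-Young theorem,
# this is the best rank-k approximation of X in the Frobenius norm.
def truncated_svd(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]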
Robust PCA:
- Effectively handles outliers, pivotal in tasks like face recognition and anomaly detection.
- Formulated to minimize the rank of 𝐿 plus the sparsity of 𝑆, subject to 𝑀 = 𝐿 + 𝑆; the convex relaxation appears after this list.
- Despite its computational demands, the Augmented Lagrange Multiplier (ALM) method provides an efficient solver.
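Written out, the convex relaxation commonly used for RPCA (Principal Component Pursuit) replaces rank with the nuclear norm and sparsity with the ℓ₁ norm:

minimize ‖𝐿‖_* + λ‖𝑆‖₁ subject to 𝑀 = 𝐿 + 𝑆,

where λ balances the two terms. A common default is 1/√max(m, n), though the code below uses a fixed λ = 1e-2.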
The RPCA algorithm, though intricate, follows a structured approach:
Step 1. Initialize parameters: 𝐿, 𝑆, the Lagrange multiplier, and tuning parameters.
Step 2. Iteratively update 𝐿, 𝑆, and the Lagrange multiplier until convergence (each update has the closed form sketched after these steps).
Step 3. Halt when the relative reconstruction error falls below a threshold.
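Each update in Step 2 reduces to a proximal operator. A minimal sketch of the two operators involved (helper names are illustrative, not from the course code):

import numpy as np

# Element-wise soft thresholding: the proximal operator of the l1 norm,
# used to update the sparse component S.
def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

# Singular value thresholding: the proximal operator of the nuclear norm,
# used to update the low-rank component L.
def sv_threshold(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vt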
Example Showcase: Accelerating Robust PCA with Random Sampling
In the ever-evolving landscape of data analysis, handling noisy data is a recurring challenge. RPCA steps in, offering a reliable method to decompose data into discernible components, even amidst outliers.
import time
import numpy as np

# Robust PCA via the inexact Augmented Lagrange Multiplier (ALM) method
def robust_pca(D):
    # Initialize parameters
    m, n = D.shape
    lam = 1e-2                          # weight on the sparse term
    mu = 1.25 / np.linalg.norm(D, 2)    # penalty parameter
    mu_bar = mu * 1e7                   # cap on the penalty parameter
    rho = 1.5                           # growth rate of mu
    Y = D / (D.max() / lam)             # scaled dual variable
    low_rank = np.zeros((m, n))
    sparse = np.zeros_like(low_rank)
    d_norm = np.linalg.norm(D, 'fro')
    tol = 1e-7
    n_iter = 0
    converged = False
    start_time = time.time()
    while not converged:
        n_iter += 1
        # Soft threshold on entries: update the sparse component
        X = D - low_rank + Y / mu
        sparse = np.sign(X) * np.maximum(np.abs(X) - lam / mu, 0)
        # Soft threshold on singular values: update the low-rank component
        u, s, vh = np.linalg.svd(D - sparse + Y / mu, full_matrices=False)
        svp = int((s > 1 / mu).sum())
        low_rank = (u[:, :svp] * (s[:svp] - 1 / mu)) @ vh[:svp]
        # Update the multiplier and grow the penalty parameter (capped at mu_bar)
        Z = D - low_rank - sparse
        Y = Y + mu * Z
        mu = min(mu * rho, mu_bar)
        # Check for convergence: relative reconstruction error below tol
        if np.linalg.norm(Z, 'fro') / d_norm < tol:
            converged = True
    end_time = time.time()
    print('\nTotal number of iterations:', n_iter)
    print('Elapsed time:', end_time - start_time)
    return low_rank, sparse, end_time - start_time
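To sanity-check the solver, here is a minimal synthetic test; the sizes, rank, and outlier fraction are illustrative, not from the course dataset:

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 100, 5
# Ground truth: a low-rank matrix plus sparse, large-magnitude outliers
L_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S_true = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05
S_true[mask] = 10 * rng.standard_normal(mask.sum())

L_hat, S_hat, elapsed = robust_pca(L_true + S_true)
print('Recovered rank:', np.linalg.matrix_rank(L_hat))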
Part 1. Robust PCA:
We witnessed RPCA’s prowess in untangling complex datasets, showcasing impressive results in separating low-rank structures from sparse outliers.
Part 2. Accelerating Robust PCA:
We randomly sample, without replacement, 2,000 rows and 60 columns from the original 21,600 × 600 dataset; a sketch of this step appears below.
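A sketch of that sampling step, assuming the data live in a NumPy array D of shape (21600, 600):

import numpy as np

rng = np.random.default_rng(0)
row_idx = rng.choice(21600, size=2000, replace=False)
col_idx = rng.choice(600, size=60, replace=False)
D_sub = D[np.ix_(row_idx, col_idx)]  # (2000, 60) submatrix
L_sub, S_sub, t_sub = robust_pca(D_sub)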
With this random sampling, we accelerated the decomposition more than 5 times without compromising accuracy. The resulting representations confirm that subsampling makes Robust PCA a practical and efficient tool for large datasets.
#RobustPCA #DataAnalysis #MachineLearning #DataScience #OutlierDetection #AnomalyDetection #DimensionalityReduction #GT #ISYE8803 #HighDimensionalDataAnalytics