Data poisoning attacks come in two flavours: untargeted and targeted. Detecting an untargeted attack is fairly simple, since the overall accuracy of the model drops. The real problem is a targeted attack: until a query actually hits that specific label/feature, you will never realise you have been compromised! There are existing methods to identify targeted attacks, but let's try to figure out whether these attacks can be identified and corrected using the embeddings' representation, i.e. by utilising Neural Collapse to detect poisoned labels/features.
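To make that concrete, here is a minimal sketch (with purely hypothetical data and helper names) of why a targeted attack can hide behind a healthy overall accuracy and only surface in the per-class breakdown.

```python
# Minimal sketch: overall accuracy vs. per-class accuracy on toy predictions.
# The attacker only corrupts class 7, so the aggregate metric barely moves.
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Return overall accuracy and a per-class accuracy vector."""
    overall = np.mean(y_true == y_pred)
    per_class = np.array([
        np.mean(y_pred[y_true == c] == c) if np.any(y_true == c) else np.nan
        for c in range(num_classes)
    ])
    return overall, per_class

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=1000)
y_pred = y_true.copy()
mask = (y_true == 7) & (rng.random(1000) < 0.8)   # ~80% of class-7 queries misfire
y_pred[mask] = 3                                   # targeted misclassification

overall, per_class = per_class_accuracy(y_true, y_pred, num_classes=10)
print(f"overall accuracy: {overall:.3f}")          # still looks healthy (~0.92)
print(f"class-7 accuracy: {per_class[7]:.3f}")     # collapses (~0.20)
```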
Label flipping can be identified by comparing the pairwise distances between class means against the ETF geometry of Neural Collapse: flipped labels pull the affected class means away from their equiangular positions. Backdoor attacks, on the other hand, show up as sub-clusters in the embedding space around the actual class mean. That's because the adversary plants a trigger pattern in the feature space, which is eventually reflected as a small, dense cluster sitting apart from the true class mean.
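A rough sketch of both checks, assuming we already have penultimate-layer embeddings `Z` (N x d) and labels `y` from the (possibly poisoned) training set. The function names, and the choice of k-means plus silhouette score for sub-cluster detection, are my own illustrative choices, not a fixed recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def class_means(Z, y, num_classes):
    """Global-mean-centred, L2-normalised class means (the NC2 geometry)."""
    global_mean = Z.mean(axis=0)
    M = np.stack([Z[y == c].mean(axis=0) - global_mean for c in range(num_classes)])
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def etf_deviation(M):
    """Pairwise cosine similarities of class means vs. the ideal ETF value -1/(K-1).
    Large positive deviations flag class pairs whose means have been pulled together,
    which is roughly what heavy label flipping between two classes looks like."""
    K = M.shape[0]
    cos = M @ M.T
    ideal = -1.0 / (K - 1)
    off_diag = cos[~np.eye(K, dtype=bool)].reshape(K, K - 1)
    return off_diag - ideal

def subcluster_score(Z, y, c):
    """Split class c's embeddings into 2 clusters; a high silhouette score
    suggests a dense secondary cluster, i.e. a possible backdoor trigger."""
    Zc = Z[y == c]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Zc)
    return silhouette_score(Zc, labels)
```

How large a deviation or silhouette score counts as "suspicious" would have to be calibrated on a known-clean run of the same model.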
To correct these attacks, maybe we can project the new embeddings back onto the ETF representation and thereby minimise the effect of the new outliers.
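One way that projection idea could look, as a speculative sketch: build a simplex ETF frame (the construction itself is standard), project embeddings onto its span so that components living outside the ETF subspace, where trigger patterns and other outliers tend to sit, are dropped, and optionally relabel points by their nearest ETF direction. Whether this actually neutralises the poison is exactly the open question above.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """K unit-norm columns in R^dim with pairwise cosine -1/(K-1) (requires dim >= K)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))   # orthonormal basis
    M = U @ (np.eye(num_classes) - np.ones((num_classes, num_classes)) / num_classes)
    return M * np.sqrt(num_classes / (num_classes - 1))            # (dim, K)

def project_to_etf_span(Z, M):
    """Orthogonally project embeddings Z (N x d) onto span(M) for a d x K ETF frame M."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    U = U[:, s > 1e-8]               # orthonormal basis of the ETF subspace
    return (Z @ U) @ U.T             # components outside the ETF span are discarded

def nearest_etf_class(Z, M):
    """Relabel each embedding by its nearest ETF direction (cosine similarity)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.argmax(Zn @ M, axis=1)
```

In practice the frame would more likely be the (trusted) class means of a clean reference model rather than a random simplex ETF, but the projection step is the same either way.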