Adversarial Concept Erasure in Kernel Space

Shauli Ravfogel; Francisco Vargas; Yoav Goldberg; Ryan Cotterell

2022.

Abstract

The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to control the content of these representations and analyze the workings of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces – subspaces in the representation space that correspond to a given concept. While these subspaces are tractable and interpretable, neural networks do not necessarily represent concepts in linear subspaces. We propose a kernelization of the linear concept-removal objective of [Ravfogel et al. 2022], and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against all nonlinear adversaries at once.

Cite this Paper

BibTeX


@Misc{publications/adversarial-concept-erasure-in-kernel-space,
  title = 	 {Adversarial Concept Erasure in Kernel Space},
  author = 	 {Ravfogel, Shauli and Vargas, Francisco and Goldberg, Yoav and Cotterell, Ryan},
  year = 	 {2022},
  url = 	 {/publications/adversarial-concept-erasure-in-kernel-space.html},
  abstract = 	 {The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to control the content of these representations and analyze the workings of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces -- subspaces in the representation space that correspond to a given concept. While these subspaces are tractable and interpretable, neural networks do not necessarily represent concepts in linear subspaces. We propose a kernelization of the linear concept-removal objective of [Ravfogel et al. 2022], and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against all nonlinear adversaries at once.}
}

Endnote

%0 Generic
%T Adversarial Concept Erasure in Kernel Space
%A Shauli Ravfogel
%A Francisco Vargas
%A Yoav Goldberg
%A Ryan Cotterell
%D 2022	
%F publications/adversarial-concept-erasure-in-kernel-space
%U /publications/adversarial-concept-erasure-in-kernel-space.html
%X The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to control the content of these representations and analyze the workings of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces -- subspaces in the representation space that correspond to a given concept. While these subspaces are tractable and interpretable, neural networks do not necessarily represent concepts in linear subspaces. We propose a kernelization of the linear concept-removal objective of [Ravfogel et al. 2022], and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against all nonlinear adversaries at once.

RIS


TY  - GEN
TI  - Adversarial Concept Erasure in Kernel Space
AU  - Shauli Ravfogel
AU  - Francisco Vargas
AU  - Yoav Goldberg
AU  - Ryan Cotterell
DA  - 2022/01/28	
ID  - publications/adversarial-concept-erasure-in-kernel-space
UR  - /publications/adversarial-concept-erasure-in-kernel-space.html
AB  - The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to control the content of these representations and analyze the workings of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces -- subspaces in the representation space that correspond to a given concept. While these subspaces are tractable and interpretable, neural networks do not necessarily represent concepts in linear subspaces. We propose a kernelization of the linear concept-removal objective of [Ravfogel et al. 2022], and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against all nonlinear adversaries at once.
ER  -

APA


Ravfogel, S., Vargas, F., Goldberg, Y., & Cotterell, R. (2022). Adversarial Concept Erasure in Kernel Space. Available from /publications/adversarial-concept-erasure-in-kernel-space.html.