Preprints, Working Papers — Year: 2022

When adversarial attacks become interpretable counterfactual explanations

Mathieu Serrurier
Franck Mamalet
Thomas Fel
Louis Béthune
Thibaut Boissin

Abstract

We argue that, when learning a 1-Lipschitz neural network with the dual loss of an optimal transportation problem, the gradient of the model is both the direction of the transportation plan and the direction to the closest adversarial attack. Traveling along the gradient to the decision boundary is no longer an adversarial attack but becomes a counterfactual explanation, explicitly transporting from one class to the other. Through extensive experiments on XAI metrics, we find that the simple saliency map method, applied on such networks, becomes a reliable explanation and outperforms the state-of-the-art explanation approaches on unconstrained models. The proposed networks were already known to be certifiably robust, and we prove that they are also explainable with a fast and simple method.
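To illustrate the mechanism the abstract describes, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation; the names `model` and `gradient_counterfactual` are placeholders and `model` is assumed to be a trained 1-Lipschitz binary classifier returning a scalar score whose sign gives the class). It takes the input gradient as the saliency map / transport direction and makes a first-order step toward the decision boundary to obtain the counterfactual.

import torch

def gradient_counterfactual(model, x):
    # Sketch under the assumption that `model` is a 1-Lipschitz scalar classifier.
    x = x.clone().detach().requires_grad_(True)
    f = model(x).squeeze()                      # scalar score f(x)
    f.backward()
    grad = x.grad.detach()                      # saliency map = transport direction
    grad_norm = grad.flatten().norm() + 1e-12
    # First-order step onto the decision boundary f = 0. Because f is 1-Lipschitz,
    # |f(x)| lower-bounds the distance to the boundary and the gradient norm is
    # close to 1, so the step length is approximately |f(x)|.
    step = f.detach() / grad_norm
    x_counterfactual = (x - step * grad / grad_norm).detach()
    return x_counterfactual, grad

Under the paper's setting, this single gradient step plays the role of the counterfactual explanation rather than an adversarial perturbation.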
Main file: hkr_explainability_Arxiv.pdf (28.44 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03693355 , version 1 (10-06-2022)
hal-03693355 , version 2 (20-06-2023)
hal-03693355 , version 3 (02-02-2024)

Identifiers

HAL Id: hal-03693355

Cite

Mathieu Serrurier, Franck Mamalet, Thomas Fel, Louis Béthune, Thibaut Boissin. When adversarial attacks become interpretable counterfactual explanations. 2022. ⟨hal-03693355v1⟩