Preprints, Working Papers — Year: 2022

When adversarial attacks become interpretable counterfactual explanations

Mathieu Serrurier
Franck Mamalet
Thomas Fel
Louis Béthune
Thibaut Boissin

Abstract

We argue that, when learning a 1-Lipschitz neural network with the dual loss of an optimal transportation problem, the gradient of the model is both the direction of the transportation plan and the direction to the closest adversarial attack. Traveling along the gradient to the decision boundary is no longer an adversarial attack but becomes a counterfactual explanation, explicitly transporting from one class to the other. Through extensive experiments on XAI metrics, we find that the simple saliency map method, applied on such networks, becomes a reliable explanation and outperforms the state-of-the-art explanation approaches on unconstrained models. The proposed networks were already known to be certifiably robust, and we prove that they are also explainable with a fast and simple method.
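To illustrate the mechanism the abstract describes, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation; the names `model` and `gradient_counterfactual` are placeholders and `model` is assumed to be a trained 1-Lipschitz binary classifier returning a scalar score whose sign gives the class). It takes the input gradient as the saliency map / transport direction and makes a first-order step toward the decision boundary to obtain the counterfactual.

import torch

def gradient_counterfactual(model, x):
    # Sketch under the assumption that `model` is a 1-Lipschitz scalar classifier.
    x = x.clone().detach().requires_grad_(True)
    f = model(x).squeeze()                      # scalar score f(x)
    f.backward()
    grad = x.grad.detach()                      # saliency map = transport direction
    grad_norm = grad.flatten().norm() + 1e-12
    # First-order step onto the decision boundary f = 0. Because f is 1-Lipschitz,
    # |f(x)| lower-bounds the distance to the boundary and the gradient norm is
    # close to 1, so the step length is approximately |f(x)|.
    step = f.detach() / grad_norm
    x_counterfactual = (x - step * grad / grad_norm).detach()
    return x_counterfactual, grad

Under the paper's setting, this single gradient step plays the role of the counterfactual explanation rather than an adversarial perturbation.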
Main file: hkr_explainability_Arxiv.pdf (28.44 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03693355 , version 1 (10-06-2022)
hal-03693355 , version 2 (20-06-2023)
hal-03693355 , version 3 (02-02-2024)

Identifiers

HAL Id: hal-03693355

Cite

Mathieu Serrurier, Franck Mamalet, Thomas Fel, Louis Béthune, Thibaut Boissin. When adversarial attacks become interpretable counterfactual explanations. 2022. ⟨hal-03693355v1⟩