Attention module is not only a weight: Analyzing transformers with vector norms

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Research output: Contribution to journal › Article › peer-review


Because attention modules are core components of Transformer-based models, which have recently achieved considerable success in natural language processing, the community has a great deal of interest in why attention modules are successful and what kind of linguistic information they capture. In particular, previous studies have mainly analyzed attention weights to see how much information the attention modules gather from each input to produce an output. In this study, we point out that attention weights are only one of the two factors determining the output of self-attention modules, and we propose to incorporate the other factor as well, namely, the transformed input vectors, into the analysis. That is, we measure the norm of the weighted vectors as the contribution of each input to an output. Our analysis of self-attention modules in BERT and a Transformer-based neural machine translation system shows that the attention modules behave very intuitively, contrary to previous findings. Specifically, our analysis reveals that (1) BERT's attention modules do not pay so much attention to special tokens, and (2) the Transformer's attention modules capture word alignment quite well.
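The norm-based measure described in the abstract can be sketched in a few lines of NumPy. The sketch below uses random data and illustrative names (seq_len, d_model, W_v are assumptions, not the authors' code): the contribution of input j to output i is taken as the norm ||alpha[i, j] * f(x_j)|| of the weighted transformed vector, rather than the attention weight alpha[i, j] alone.

```python
import numpy as np

# Minimal sketch of norm-based attention analysis for one self-attention head.
# All shapes and variable names are illustrative assumptions.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

x = rng.normal(size=(seq_len, d_model))      # input vectors x_j
W_v = rng.normal(size=(d_model, d_model))    # value projection
alpha = rng.random(size=(seq_len, seq_len))  # attention weights
alpha /= alpha.sum(axis=1, keepdims=True)    # rows sum to 1, as after softmax

# f(x_j): transformed input vectors (value projection; bias omitted for brevity)
fx = x @ W_v                                 # (seq_len, d_model)

# Weight-based view: contribution of input j to output i is alpha[i, j].
# Norm-based view:   contribution is || alpha[i, j] * f(x_j) ||.
weighted = alpha[:, :, None] * fx[None, :, :]  # (seq_len, seq_len, d_model)
contrib = np.linalg.norm(weighted, axis=-1)    # (seq_len, seq_len)

print(contrib.shape)
```

A token with a large attention weight but a small-norm transformed vector contributes little under the norm-based measure, which is how this analysis can reach different conclusions from weight-only analyses.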

Original language: English
Journal: Unknown Journal
Publication status: Published - 2020 Apr 21

ASJC Scopus subject areas

  • General

