DeBERTa: Decoding-enhanced BERT with Disentangled Attention (reading notes on the paper that topped the GLUE leaderboard)

I. Overview


II. Details

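Abstract
a. Two mechanisms that improve on BERT and RoBERTa
ⅰ. disentangled attention mechanism
ⅱ. enhanced mask decoder
b. Fine-tuning stage
ⅰ. virtual adversarial training -> better generalization
c. Results
ⅰ. Sizable gains on both NLU and NLG downstream tasks
ⅱ. Trained on half of the training data, it already outperforms RoBERTa-large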
ⅲ. The 48-layer DeBERTa topped the SuperGLUE leaderboard as of January 6, 2021
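Introduction
a. Disentangled attention
ⅰ. content embedding + relative position embedding
ⅱ. So the emphasis is on relative position?
b. Enhanced mask decoder
ⅰ. Motivation: when predicting a masked token, the absolute position is sometimes very important too, so absolute positions are introduced to help predict the masked token
ⅱ. DeBERTa introduces absolute word position embeddings right before the softmax layer, where the model decodes the masked words based on the aggregated contextual embeddings of word content and position
c. Adversarial training to improve generalization when fine-tuning on downstream tasks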
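Background
a. transformer
ⅰ. Standard self-attention lacks an effective mechanism for encoding position information
ⅱ. Prior work shows that relative position encodings are more effective than absolute position encodings
ⅲ. MLM: 15% of the tokens are selected and the model is trained to predict them
1. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are left unchanged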
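The 15% selection and the 80/10/10 split described above can be sketched in a few lines. This is a minimal illustration, not the paper's preprocessing code: the function name `mask_for_mlm`, the `seed` parameter, and the word-level `vocab` list are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels[i] is None where no prediction is required.

    Hypothetical helper: the name and signature are illustrative only.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # token not selected for prediction
        labels[i] = tok                       # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the token unchanged
    return corrupted, labels
```

Calling `mask_for_mlm("a new store opened beside the new mall".split(), vocab=["cat", "dog"])` returns the corrupted sequence together with the labels that the loss is computed on; unselected positions carry `None` and contribute nothing to the loss.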
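b. DeBERTa
ⅰ. input
1. The token at position i is represented by the pair {Hi, Pi|j}: its content and its position relative to the token at position j
2. Cross-attention score: computed from the pairs {Hi, Pi|j} and {Hj, Pj|i}, so it expands into content-to-content, content-to-position, position-to-content, and position-to-position terms
3. My impression: content and position are simply kept separate, and the content-to-position information is then combined into the score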
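To make the decomposition concrete, here is a single-head NumPy sketch of the disentangled score as I understand it from the paper: content-to-content, content-to-position, and position-to-content terms are summed (the position-to-position term is dropped), and the sum is scaled by sqrt(3d) before the softmax. The function names and the dictionary keys `qc`, `kc`, `vc`, `qr`, `kr` are hypothetical, and `k` is the assumed maximum relative distance, so the relative-position table `P` has 2k rows.

```python
import numpy as np

def relative_index(n, k):
    """delta[i, j]: relative distance from token i to token j, clipped into [0, 2k)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_attention(H, P, W, k):
    """H: (n, d) content vectors; P: (2k, d) relative-position embeddings;
    W: dict of (d, d) projections keyed 'qc', 'kc', 'vc', 'qr', 'kr' (hypothetical names)."""
    n, d = H.shape
    Qc, Kc, Vc = H @ W["qc"], H @ W["kc"], H @ W["vc"]    # content projections
    Qr, Kr = P @ W["qr"], P @ W["kr"]                      # position projections
    delta = relative_index(n, k)

    c2c = Qc @ Kc.T                                         # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)      # [i, j] = Qc[i] . Kr[delta[i, j]]
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T    # [i, j] = Kc[j] . Qr[delta[j, i]]

    scores = (c2c + c2p + p2c) / np.sqrt(3 * d)             # three terms, hence the 3d scaling
    scores -= scores.max(axis=1, keepdims=True)             # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ Vc, weights
```

Note that the position-to-content term indexes with delta[j, i] rather than delta[i, j], which is why the gathered matrix is transposed. Multi-head splitting, dropout, and attention masks are left out to keep the sketch short.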
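ⅱ. enhanced mask decoder accounts for absolute word positions
1. So absolute position information is considered again after all?
2. Motivation: relative positions alone, with no absolute position information, are not enough
3. How are absolute positions encoded?
  a. BERT uses absolute position encodings at the very beginning, in the input layer
  b. DeBERTa uses absolute position information only after the encoder and before the softmax
  c. In short: the Transformer layers use relative position information, and absolute position information is added as complementary information only when decoding the masked tokens, hence the name enhanced mask decoder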
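The placement of absolute positions can be illustrated with a deliberately simplified sketch. It assumes the decoder is just "add absolute position embeddings, project to the vocabulary, softmax"; the decoder in the paper is richer than this, and the names `enhanced_mask_decoder`, `abs_pos_emb`, and `vocab_proj` are hypothetical. The point is only that absolute positions enter after the encoder rather than at the input.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def enhanced_mask_decoder(hidden, abs_pos_emb, vocab_proj):
    """hidden: (n, d) encoder outputs built with relative positions only;
    abs_pos_emb: (max_len, d) absolute position embeddings;
    vocab_proj: (d, vocab_size) output projection.
    Simplified assumption: absolute positions are merged by plain addition."""
    n = hidden.shape[0]
    decoder_input = hidden + abs_pos_emb[:n]     # absolute positions enter only here
    return softmax(decoder_input @ vocab_proj)   # per-position distribution over the vocabulary
```

With this setup, two masked positions whose encoder outputs happen to be identical still receive different output distributions, because their absolute position embeddings differ.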
Scale-invariant fine-tuning (SiFT)
a. A regularization method for improving generalization
b. Key term: perturbation, a small disturbance applied to the input
c. Method: apply the perturbation to the normalized word embeddings
d. SiFT first normalizes the word embedding vectors into stochastic vectors, and then applies the perturbation to the normalized embedding vectors
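The "normalize first, then perturb" order can be sketched as follows. Two assumptions are made loudly here: layer normalization stands in for the normalization step, and random noise of fixed norm stands in for the adversarially chosen perturbation that virtual adversarial training would compute from gradients. The function names are hypothetical.

```python
import numpy as np

def layer_normalize(E, eps=1e-6):
    """Normalize each embedding vector to zero mean and unit variance (assumed normalization)."""
    mean = E.mean(axis=-1, keepdims=True)
    std = E.std(axis=-1, keepdims=True)
    return (E - mean) / (std + eps)

def perturb_normalized(E, epsilon=1e-2, seed=0):
    """Return (perturbed, normalized). Random noise is a stand-in for the adversarial direction."""
    rng = np.random.default_rng(seed)
    normalized = layer_normalize(E)
    noise = rng.normal(size=E.shape)
    noise *= epsilon / np.linalg.norm(noise, axis=-1, keepdims=True)  # each row has norm epsilon
    return normalized + noise, normalized
```

Because the perturbation is applied after normalization, rescaling the raw embeddings (for example `100 * E`) leaves the normalized vectors essentially unchanged, so a fixed `epsilon` has the same relative effect regardless of the embedding scale.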
Experiments
a. About 1 point higher than RoBERTa-base on NLI, and 2 to 3 points higher on SQuAD