Lite Transformer

For the figures and the source of the ideas, see the paper Lite Transformer with Long-Short Range Attention.

Is the bottleneck structure really effective?

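The attention mechanism is widely used in many fields, including natural language processing, image processing, and video processing. It models both long-range and short-range relationships by computing dot products between all pairs of input elements. Although it is very effective, its heavy computational cost has long been criticized.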

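To reduce the computation, a common approach is to first shrink the number of channels $d$ with a linear projection layer, then apply attention, and finally expand the channels again. This is the bottleneck structure. While it reduces computation, it also weakens the attention layer's ability to extract information. This is worse in natural language processing, because in NLP the attention layer is the main feature-extraction module (in image and video processing that role is played by convolution layers).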
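The bottleneck pattern described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the function name `bottleneck_attention`, the weight names, and the sizes `N`, `d`, `d_b` are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_attention(x, w_down, w_q, w_k, w_v, w_up):
    """Single-head self-attention wrapped in a bottleneck (illustrative).

    x:       (N, d)      input sequence
    w_down:  (d, d_b)    projection that reduces channels, d_b < d
    w_q/k/v: (d_b, d_b)  attention projections in the reduced space
    w_up:    (d_b, d)    projection that restores the channel count
    """
    h = x @ w_down                              # reduce channels: (N, d_b)
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])     # all-pairs dot products: (N, N)
    weights = softmax(scores, axis=-1)
    out = weights @ v                           # attention output: (N, d_b)
    return out @ w_up, weights                  # expand channels: (N, d)

# Hypothetical sizes: 6 tokens, 16 channels, bottleneck of 4 channels.
rng = np.random.default_rng(0)
N, d, d_b = 6, 16, 4
x = rng.standard_normal((N, d))
w_down = rng.standard_normal((d, d_b)) / np.sqrt(d)
w_q, w_k, w_v = (rng.standard_normal((d_b, d_b)) / np.sqrt(d_b) for _ in range(3))
w_up = rng.standard_normal((d_b, d)) / np.sqrt(d_b)

y, attn = bottleneck_attention(x, w_down, w_q, w_k, w_v, w_up)
```

The attention itself only ever sees `d_b` channels, which is where both the savings and the loss of capacity come from.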


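Flattening the Transformer's bottleneck increases the share of the attention layer relative to the feed-forward layer, which benefits later optimization.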

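A typical Transformer block consists of an attention layer followed by a feed-forward layer. The attention layer has computational complexity $\mathcal{O}\left(4 N d^2+N^2 d\right)$, while the feed-forward layer has $\mathcal{O}\left(2 \times 4 N d^2\right)$. For a relatively short sequence length $N$, the feed-forward layer therefore consumes most of the computation, even though it does not itself perform feature extraction. As a result, the bottleneck structure fails: it does not achieve the intended reduction in computation, and it also harms the model's feature-extraction ability.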
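To make the comparison concrete, the two expressions above can be evaluated directly. The sketch below treats them as plain operation counts, exactly as written in the text; the function names and the choice $d = 512$ are assumptions for illustration only.

```python
def attention_cost(n, d):
    """Attention-layer cost from the text: 4*N*d^2 + N^2*d."""
    return 4 * n * d * d + n * n * d

def ffn_cost(n, d):
    """Feed-forward cost from the text: 2 * 4*N*d^2."""
    return 2 * 4 * n * d * d

d = 512  # hypothetical channel count
for n in (16, 64, 512, 2048, 8192):
    ratio = ffn_cost(n, d) / attention_cost(n, d)
    print(f"N={n:5d}  FFN/attention = {ratio:.2f}")
```

Under these expressions the ratio simplifies to $8d / (4d + N)$, so the feed-forward layer costs more than the attention layer whenever $N < 4d$. With $d = 512$ the crossover is at $N = 2048$, far longer than a typical sentence.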

Long-Short Range Attention

With a larger weight $w_{i j}$ (darker color), the $i$ -th word in the source sentence pays more attention to the $j$ -th word in the target sentence. And the attention maps typically have strong patterns: sparse and diagonal. They represent the relationships between some particular words: the sparse for the long-term information, and the diagonal for the correlation in small neighborhoods. We denote the former as “global” relationships and the latter as “local”.