Seq2seq for French to English translation

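Previous installments gave a brief introduction to two broad categories of NLP tasks: classification and sequence labelling. This installment moves on to neural machine translation (NMT), applied here to translating French into English.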

The sequence models that currently lead in NMT are mainly based on RNNs, CNNs, or attention. Each consists of an encoder and a decoder, connected through an attention mechanism.

LSTM seq2seq: Sequence to Sequence Learning with Neural Networks
CNN seq2seq: https://arxiv.org/pdf/1705.03122.pdf
Attention: Attention Is All You Need

This installment implements NMT based on the Attention Is All You Need paper, with part of the code adapted from attention-is-all-you-need-pytorch. The data comes from the Tab-delimited Bilingual Sentence Pairs.
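As a minimal sketch of reading this kind of data: the snippet below assumes each line holds an English sentence and its French translation separated by a tab, and that any further columns can be ignored. The column order and the helper name `parse_pairs` are assumptions for illustration, not taken from the borrowed code.

```python
def parse_pairs(lines, reverse=True):
    """Split tab-delimited lines into (source, target) sentence pairs.

    Assumption: column 0 is English and column 1 is French; extra
    columns (if any) are ignored. With reverse=True the pairs come
    back as (French, English), i.e. French-to-English.
    """
    pairs = []
    for line in lines:
        cols = [c.strip() for c in line.rstrip("\n").split("\t")]
        if len(cols) < 2 or not cols[0] or not cols[1]:
            continue  # skip blank or malformed lines
        eng, fra = cols[0], cols[1]
        pairs.append((fra, eng) if reverse else (eng, fra))
    return pairs


# In-memory sample lines standing in for the real file.
sample = ["Go.\tVa !\n", "Hi.\tSalut !\n", "\n"]
pairs = parse_pairs(sample)
```

For a real file, an open file handle can be passed in place of `sample`, since iterating over it yields one line at a time.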

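Compared with RNN- and CNN-based models, the Attention Is All You Need model (AIAYN below) is simpler: it replaces the CNN or RNN with attention, namely multi-head attention and self-attention. This makes training far more parallelizable and substantially reduces training time.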
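To make the attention building block concrete, here is a small pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as defined in the paper. It is an illustration only: the function names are mine, there are no learned projections, and multi-head attention (several such attentions run on different learned projections and then concatenated) is not shown. Note that each query is processed independently of the others, which is where the parallelism comes from.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-width vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Self-attention: queries, keys and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(x, x, x)
```

Each row of `attn` is a probability distribution over the input positions, and each row of `out` is the corresponding weighted average of the value vectors.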

Model structure:

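The overall structure is shown in this installment's cover image. Like other conventional models, AIAYN has an encoder and a decoder. The encoder maps the input sequence (x1,...,xn) to z = (z1,...,zn) and passes it to the decoder; the decoder then generates one element y per time step, and these together form the output sequence (y1, ..., ym). Both the encoder and the decoder use stacked self-attention and point-wise, fully connected layers.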
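The encode-once, decode-step-by-step flow described above can be sketched as a simple loop. This is a sketch under assumptions: `encode` and `decode_step` are hypothetical callables standing in for a trained model, and picking a single next token per step (greedy decoding) is my simplification; the paper itself uses beam search.

```python
def greedy_decode(encode, decode_step, src_tokens, bos, eos, max_len=20):
    """Encode the source once, then generate target tokens one at a time."""
    z = encode(src_tokens)              # z = (z1, ..., zn), computed once
    ys = [bos]
    for _ in range(max_len):
        # The next element depends on z and on everything generated so far.
        next_token = decode_step(z, ys)
        if next_token == eos:
            break
        ys.append(next_token)
    return ys[1:]                       # drop the start-of-sequence marker


# Toy stand-ins (hypothetical, not a trained model): the "encoder" is the
# identity and the "decoder" copies the source back token by token.
def toy_encode(src):
    return list(src)


def toy_decode_step(z, ys):
    i = len(ys) - 1                     # number of tokens generated so far
    return z[i] if i < len(z) else "<eos>"


result = greedy_decode(
    toy_encode, toy_decode_step, ["je", "suis", "là"], "<bos>", "<eos>"
)
```

With a real model, `decode_step` would run the decoder stack over `ys`, attend to `z`, and return the most probable next token.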

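The Encoder is a stack of N identical layers (N=6 in the paper) connected in series. Each layer contains two sub-layers: multi-head attention and a position-wise feed-forward network. Note that every sub-layer is wrapped in a residual connection (followed by layer normalization in the paper), which eases the difficulty of training the parameters of a deep model.
The Decoder is similar to the Encoder, except that each layer has one additional multi-head attention sub-layer that takes in the Encoder output, i.e. the z = (z1,...,zn) mentioned above. Note that the residual for both multi-head attention sub-layers comes from the decoder side (the input to that sub-layer), not from the Encoder output. This is easy to understand: the output of each sub-layer must keep the length of the target sequence, so only the decoder-side input can be added back.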
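The layer layout described above can be sketched structurally as follows. This is a shape-level illustration under stated assumptions, not the borrowed implementation: layer normalization has no learned gain or bias, and `mean_attn` and `identity_ffn` are hypothetical stand-ins for real attention and feed-forward sub-layers (in particular, the stand-in does no masking).

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(x, sublayer_out):
    """LayerNorm(x + Sublayer(x)), applied position by position."""
    return [
        layer_norm([a + b for a, b in zip(xi, si)])
        for xi, si in zip(x, sublayer_out)
    ]


def encoder_layer(x, self_attn, ffn):
    x = add_and_norm(x, self_attn(x, x, x))   # residual around self-attention
    return add_and_norm(x, ffn(x))            # residual around feed-forward


def decoder_layer(y, z, self_attn, cross_attn, ffn):
    # Self-attention over the target (masked in the paper, not in this sketch).
    y = add_and_norm(y, self_attn(y, y, y))
    # Queries come from the decoder, keys and values from the encoder output z.
    y = add_and_norm(y, cross_attn(y, z, z))
    return add_and_norm(y, ffn(y))            # the residual is always y, never z


# Hypothetical stand-in sub-layers so that the sketch runs end to end.
def mean_attn(q, k, v):
    # Every query position receives the average of the value vectors.
    avg = [sum(col) / len(v) for col in zip(*v)]
    return [avg[:] for _ in q]


def identity_ffn(x):
    return [row[:] for row in x]


src = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
tgt = [[1.0, 0.0, 0.0, 2.0], [0.0, 2.0, 1.0, 0.0]]

z = src
for _ in range(6):                            # N = 6 stacked encoder layers
    z = encoder_layer(z, mean_attn, identity_ffn)
out = decoder_layer(tgt, z, mean_attn, mean_attn, identity_ffn)
```

Here `out` has one row per target position even though `z` has one row per source position, which shows why the residual has to come from the decoder side.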

AIAYN uses Positional Encoding in place of an RNN/CNN to capture the order of the sequence. The formula from the paper is as follows: