Building Autoformer in PyTorch for Long-Sequence Time Series Forecasting

I. Preface

I have already written many articles on time series forecasting:

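A Deep Dive into LSTM Inputs and Outputs in PyTorch (from input to Linear output)
Building an LSTM in PyTorch for Time Series Forecasting (Load Forecasting)
Building a Multi-Layer LSTM with LSTMCell in PyTorch for Time Series Forecasting
Building an LSTM in PyTorch for Multivariate Time Series Forecasting (Load Forecasting)
Building a Bidirectional LSTM in PyTorch for Time Series Forecasting (Load Forecasting)
Building an LSTM in PyTorch for Multivariate Multi-Step Time Series Forecasting (1): Direct Multi-Output
Building an LSTM in PyTorch for Multivariate Multi-Step Time Series Forecasting (2): Single-Step Rolling Forecasting
Building an LSTM in PyTorch for Multivariate Multi-Step Time Series Forecasting (3): Multi-Model Single-Step Forecasting
Building an LSTM in PyTorch for Multivariate Multi-Step Time Series Forecasting (4): Multi-Model Rolling Forecasting
Building an LSTM in PyTorch for Multivariate Multi-Step Time Series Forecasting (5): seq2seq
A Summary of Approaches to Multi-Step LSTM Time Series Forecasting in PyTorch (Load Forecasting)
How to Predict True Future Values in PyTorch LSTM Time Series Forecasting
Building an LSTM in PyTorch for Multivariate-Input Multivariate-Output Time Series Forecasting (Multi-Task Learning)
Building an ANN in PyTorch for Time Series Forecasting (Wind Speed Forecasting)
Building a CNN in PyTorch for Time Series Forecasting (Wind Speed Forecasting)
Building a CNN-LSTM Hybrid Model in PyTorch for Multivariate Multi-Step Time Series Forecasting (Load Forecasting)
Building a Transformer in PyTorch for Multivariate Multi-Step Time Series Forecasting (Load Forecasting)
Summary of the PyTorch Time Series Forecasting Series (How to Use the Code)
Building an LSTM in TensorFlow for Time Series Forecasting (Load Forecasting)
Building an LSTM in TensorFlow for Multivariate Time Series Forecasting (Load Forecasting)
Building a Bidirectional LSTM in TensorFlow for Time Series Forecasting (Load Forecasting)
Building an LSTM in TensorFlow for Multivariate Multi-Step Time Series Forecasting (1): Direct Multi-Output
Building an LSTM in TensorFlow for Multivariate Multi-Step Time Series Forecasting (2): Single-Step Rolling Forecasting
Building an LSTM in TensorFlow for Multivariate Multi-Step Time Series Forecasting (3): Multi-Model Single-Step Forecasting
Building an LSTM in TensorFlow for Multivariate Multi-Step Time Series Forecasting (4): Multi-Model Rolling Forecasting
Building an LSTM in TensorFlow for Multivariate Multi-Step Time Series Forecasting (5): seq2seq
Building an LSTM in TensorFlow for Multivariate-Input Multivariate-Output Time Series Forecasting (Multi-Task Learning)
Building an ANN in TensorFlow for Time Series Forecasting (Wind Speed Forecasting)
Building a CNN in TensorFlow for Time Series Forecasting (Wind Speed Forecasting)
Building a CNN-LSTM Hybrid Model in TensorFlow for Multivariate Multi-Step Time Series Forecasting (Load Forecasting)
Building a Graph Neural Network with PyG for Multivariate-Input Multivariate-Output Time Series Forecasting
Building GNN-LSTM and LSTM-GNN Models in PyTorch for Multivariate-Input Multivariate-Output Time Series Forecasting
Building STGCN with PyG Temporal for Multivariate-Input Multivariate-Output Time Series Forecasting
Is Attention Really Effective in Time Series Forecasting? A Survey of 24 Attention Mechanisms in LSTM/RNN with a Performance Comparison
The Encoder and Decoder Process of Transformer in Time Series Forecasting Explained: A Load Forecasting Example
(PyTorch) Combining TCN with RNN/LSTM/GRU for Time Series Forecasting
Building Informer in PyTorch for Long-Sequence Time Series Forecasting
Building Autoformer in PyTorch for Long-Sequence Time Series Forecasting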

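The previous article covered Informer, the first long-sequence time series forecasting model in this series, which was published at AAAI 2021. This article covers the second one, Autoformer, proposed in the NeurIPS 2021 paper "Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting", roughly half a year after Informer.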

II. Autoformer

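Autoformer targets long-term series forecasting in which the prediction horizon is much longer than the input length, that is, predicting a more distant future from limited information.

Autoformer's innovations are as follows:

The complex temporal patterns in long sequences make it hard for the attention mechanism in Transformer to discover reliable temporal dependencies. Autoformer therefore embeds time series decomposition into the deep learning model itself; before this, decomposition was generally used only as data preprocessing, for example EMD. Decomposing inside the model helps it extract more predictable components from complex series data.
Transformer's high time complexity creates a bottleneck in information utilization. Drawing on stochastic process theory, Autoformer therefore proposes the Auto-Correlation mechanism to replace the point-wise self-attention in Transformer, achieving series-wise connections with $O(L \log L)$ time complexity and breaking the information utilization bottleneck.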
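
To make the series-wise idea a little more concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: it computes a circular autocorrelation naively in O(L²), whereas Autoformer obtains it through FFT in O(L log L), and then picks the lags with the highest correlation. The helper names `autocorrelation` and `top_k_lags`, the mean-centering, the fixed `k`, and the choice to skip lag 0 are all assumptions made for illustration.

```python
import math


def autocorrelation(series):
    """Circular autocorrelation R(tau) = sum_t x[t] * x[(t - tau) mod L], naive O(L^2)."""
    length = len(series)
    mean = sum(series) / length
    centered = [v - mean for v in series]  # assumption: center the series first
    return [
        sum(centered[t] * centered[(t - tau) % length] for t in range(length))
        for tau in range(length)
    ]


def top_k_lags(series, k):
    """Return the k non-zero lags with the largest autocorrelation."""
    corr = autocorrelation(series)
    # Lag 0 is trivially the maximum, so this sketch leaves it out.
    candidates = sorted(range(1, len(series)), key=lambda tau: corr[tau], reverse=True)
    return candidates[:k]


# Toy series with period 12, covering exactly 4 cycles.
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]
lags = top_k_lags(series, k=3)
print(sorted(lags))
```

On this toy series the selected lags are the multiples of the period, which is the kind of period-based dependency that the Auto-Correlation mechanism aggregates over.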

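I will not go further into the theory, since plenty of articles online already cover it. This article focuses on how to use the code, and in particular on how to modify the author's public source code so that it fits most people's own data and readers only need to change a few parameters to switch datasets.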

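III. Code

3.1 Encoder Input

In a traditional Transformer, the first step of the encoding stage is to add positional encoding to the original sequence. In Autoformer, the input consists of two parts, Token Embedding and Temporal Embedding, with no positional encoding.

Assume the input sequence has shape (batch_size, seq_len, enc_in). If all 13 variables of the past 96 time steps are used to predict future values, the input is (batch_size, 96, 13).

3.1.1 Token Embedding

The first part of the Autoformer input encodes the raw input. In essence, a 1D convolution extracts features from the original sequence and maps the feature dimension from enc_in to d_model. The code is as follows: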

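import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        # Compare versions numerically: the string comparison torch.__version__ >= '1.5.0'
        # misorders releases such as '1.10.0' and would then pick the wrong padding.
        major, minor = (int(v) for v in torch.__version__.split('.')[:2])
        padding = 1 if (major, minor) >= (1, 5) else 2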
        self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model,
                                   kernel_size=3, padding=padding, padding_mode='circular')
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x

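The input x has shape (batch_size, seq_len, enc_in). Its last two dimensions are swapped first to suit the 1D convolution, and the data then passes through tokenConv. Because of the circular padding, the seq_len dimension is unchanged, so TokenEmbedding produces an output of shape (batch_size, seq_len, d_model).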
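
As a rough illustration of why the sequence length is preserved, the following pure-Python sketch applies a single-channel kernel of size 3 with circular padding of 1. The function name `circular_conv1d` and the toy kernel are assumptions for illustration; the real layer is `nn.Conv1d` with `padding_mode='circular'` and learned multi-channel weights.

```python
def circular_conv1d(x, kernel):
    """Single-channel 1D cross-correlation with circular padding of (len(kernel) - 1) // 2."""
    pad = (len(kernel) - 1) // 2
    # Circular padding: the end of the sequence wraps to the front and vice versa.
    padded = x[-pad:] + x + x[:pad] if pad > 0 else list(x)
    out_len = len(x) + 2 * pad - len(kernel) + 1
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(out_len)
    ]


x = [1.0, 2.0, 3.0, 4.0]
y = circular_conv1d(x, [1.0, 1.0, 1.0])
print(len(x), len(y), y)
```

With kernel_size=3 and padding=1 the output length is L + 2 - 3 + 1 = L, which is why seq_len survives the embedding.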

3.1.2 Temporal Embedding

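The second part of the Autoformer input encodes the timestamp, i.e. fields such as year, month, day, weekday, hour, minute and second. This part is the same as in Informer and offers two encoding schemes, which we go through in turn. The first scheme, TemporalEmbedding, is shown below: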

class TemporalEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='fixed', freq='h'):
        super(TemporalEmbedding, self).__init__()

        minute_size = 4
        hour_size = 24
        weekday_size = 7
        day_size = 32
        month_size = 13

        Embed = FixedEmbedding if embed_type == 'fixed' else nn.Embedding
        if freq == 't':
            self.minute_embed = Embed(minute_size, d_model)
        self.hour_embed = Embed(hour_size, d_model)
        self.weekday_embed = Embed(weekday_size, d_model)
        self.day_embed = Embed(day_size, d_model)
        self.month_embed = Embed(month_size, d_model)

    def forward(self, x):
        x = x.long()

        minute_x = self.minute_embed(x[:, :, 4]) if hasattr(self, 'minute_embed') else 0.
        hour_x = self.hour_embed(x[:, :, 3])
        weekday_x = self.weekday_embed(x[:, :, 2])
        day_x = self.day_embed(x[:, :, 1])
        month_x = self.month_embed(x[:, :, 0])

        return hour_x + weekday_x + day_x + month_x + minute_x

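TemporalEmbedding expects input of shape (batch_size, seq_len, 5) when freq='t', where the 5 values are the month, day, weekday, hour and quarter-hour index (one quarter-hour is 15 minutes) of each timestamp. For other frequencies such as freq='h', minute_embed is not created and only the first four columns are read. Each field is embedded separately, in one of two ways: FixedEmbedding, which uses positional encoding as the embedding weights and has no trainable parameters, or torch's built-in nn.Embedding, whose parameters are trainable.

More specifically, the author sets the embedding table sizes for month, day, weekday, hour and quarter-hour to 13, 32, 7, 24 and 4. The month index of every timestamp must therefore lie in 0-12, the day in 0-31, the weekday in 0-6, the hour in 0-23 and the quarter-hour in 0-3. Take 2024/04/05 12:13, a Friday, as an example. With the timeenc=0 branch of the time_features function shown later, which uses row.weekday() (Monday is 0) and row.hour, the input is (4, 5, 4, 12, 0): the weekday index of Friday is 4, the hour of 12:13 is 12, and minute 13 falls into quarter-hour 0.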
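
The integer encoding can be checked with the standard library alone. The sketch below mirrors the timeenc=0 branch of time_features for a single timestamp; the helper name `encode_timestamp` is hypothetical, and `table_sizes` simply restates the sizes from TemporalEmbedding.

```python
from datetime import datetime


def encode_timestamp(ts):
    """Return (month, day, weekday, hour, quarter_hour) as integer indices.

    weekday() counts from Monday = 0; the quarter-hour is minute // 15.
    """
    return (ts.month, ts.day, ts.weekday(), ts.hour, ts.minute // 15)


# Embedding table sizes used by TemporalEmbedding, in the same field order.
table_sizes = (13, 32, 7, 24, 4)

stamp = encode_timestamp(datetime(2024, 4, 5, 12, 13))
print(stamp)
# Every index must be a valid row of its embedding table.
in_range = all(0 <= value < size for value, size in zip(stamp, table_sizes))
print(in_range)
```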

The second way to encode timestamps is TimeFeatureEmbedding:

class TimeFeatureEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='timeF', freq='h'):
        super(TimeFeatureEmbedding, self).__init__()

        freq_map = {'h': 4, 't': 5, 's': 6, 'm': 1, 'a': 1, 'w': 2, 'd': 3, 'b': 3}
        d_inp = freq_map[freq]
        self.embed = nn.Linear(d_inp, d_model)

    def forward(self, x):
        return self.embed(x)

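TimeFeatureEmbedding takes input of shape (batch_size, seq_len, d_inp), where d_inp is looked up from freq in freq_map (8 frequency keys, mapping to between 1 and 6 features). Taking the timestamp 2024/04/05 12:13 with freq='h' as an example, d_inp is 4. According to the docstring below, the four features under timeenc=1 are (hour of day, day of week, day of month, day of year), each scaled to the range -0.5 to 0.5. The time features are produced by the following function: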

def time_features(dates, timeenc=1, freq='h'):
    """
    > `time_features` takes in a `dates` dataframe with a 'dates' column and extracts the date down to `freq` where freq can be any of the following if `timeenc` is 0: 
    > * m - [month]
    > * w - [month]
    > * d - [month, day, weekday]
    > * b - [month, day, weekday]
    > * h - [month, day, weekday, hour]
    > * t - [month, day, weekday, hour, *minute]
    > 
    > If `timeenc` is 1, a similar, but different list of `freq` values are supported (all encoded between [-0.5 and 0.5]): 
    > * Q - [month]
    > * M - [month]
    > * W - [Day of month, week of year]
    > * D - [Day of week, day of month, day of year]
    > * B - [Day of week, day of month, day of year]
    > * H - [Hour of day, day of week, day of month, day of year]
    > * T - [Minute of hour*, hour of day, day of week, day of month, day of year]
    > * S - [Second of minute, minute of hour, hour of day, day of week, day of month, day of year]

    *minute returns a number from 0-3 corresponding to the 15 minute period it falls into.
    """
    if timeenc == 0:
        dates['month'] = dates.date.apply(lambda row: row.month, 1)
        dates['day'] = dates.date.apply(lambda row: row.day, 1)
        dates['weekday'] = dates.date.apply(lambda row: row.weekday(), 1)
        dates['hour'] = dates.date.apply(lambda row: row.hour, 1)
        dates['minute'] = dates.date.apply(lambda row: row.minute, 1)
        dates['minute'] = dates.minute.map(lambda x: x // 15)
        freq_map = {
            'y': [], 'm': ['month'], 'w': ['month'], 'd': ['month', 'day', 'weekday'],
            'b': ['month', 'day', 'weekday'], 'h': ['month', 'day', 'weekday', 'hour'],
            't': ['month', 'day', 'weekday', 'hour', 'minute'],
        }
        return dates[freq_map[freq.lower()]].values
    if timeenc == 1:
        dates = pd.to_datetime(dates.date.values)
        return np.vstack([feat(dates) for feat in time_features_from_frequency_str(freq)]).transpose(1, 0)

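With timeenc=0 and freq='t', the function returns ['month', 'day', 'weekday', 'hour', 'minute'] as integers; other frequencies work similarly. With timeenc=1 and freq='h', it returns four values scaled to between -0.5 and 0.5. TimeFeatureEmbedding then uses self.embed = nn.Linear(d_inp, d_model) to map the dimension from 4 to d_model, so the final output also has shape (batch_size, seq_len, d_model).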
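
The article does not show time_features_from_frequency_str, so the scaling below is an assumption: it follows the common convention of dividing each field by its maximum and subtracting 0.5. Treat it as a sketch of how four hourly features could end up in [-0.5, 0.5], not as the library's exact code. The function name `scaled_time_features_hourly` is hypothetical.

```python
from datetime import datetime


def scaled_time_features_hourly(ts):
    """Assumed scaling for freq='h': [hour of day, day of week, day of month, day of year]."""
    day_of_year = ts.timetuple().tm_yday
    return [
        ts.hour / 23.0 - 0.5,              # hour 0..23      -> [-0.5, 0.5]
        ts.weekday() / 6.0 - 0.5,          # weekday 0..6    -> [-0.5, 0.5]
        (ts.day - 1) / 30.0 - 0.5,         # day 1..31       -> [-0.5, 0.5]
        (day_of_year - 1) / 365.0 - 0.5,   # day of year 1..366 -> [-0.5, 0.5]
    ]


feats = scaled_time_features_hourly(datetime(2024, 4, 5, 12, 13))
print([round(f, 4) for f in feats])
```

Each timestamp thus becomes a vector of d_inp=4 floats, which nn.Linear(4, d_model) can project directly.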

Finally, the code combines the two encodings in a DataEmbedding_wo_pos class:

class DataEmbedding_wo_pos(nn.Module):
    def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1):
        super(DataEmbedding_wo_pos, self).__init__()

        self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
        self.position_embedding = PositionalEmbedding(d_model=d_model)
        self.temporal_embedding = TemporalEmbedding(d_model=d_model, embed_type=embed_type,
                                                    freq=freq) if embed_type != 'timeF' else TimeFeatureEmbedding(
            d_model=d_model, embed_type=embed_type, freq=freq)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, x_mark):
        x = self.value_embedding(x) + self.temporal_embedding(x_mark)
        return self.dropout(x)

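3.2 Decoder Input

In Informer, the encoder and decoder inputs are much the same, each consisting of three parts. In Autoformer the two differ.

In the decoder, Token Embedding is applied not to the original time series but to the seasonal part, which is explained in Section 3.3.

3.3 Encoder and Decoder

The complete Autoformer code is as follows: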

class Autoformer(nn.Module):
    """
    Autoformer is the first method to achieve the series-wise connection,
    with inherent O(LlogL) complexity
    """
    def __init__(self, args):
        super(Autoformer, self).__init__()
        self.seq_len = args.seq_len
        self.label_len = args.label_len
        self.pred_len = args.pred_len
        self.output_attention = args.output_attention

        # Decomp
        kernel_size = args.moving_avg
        self.decomp = series_decomp(kernel_size)

        # Embedding
        # The series-wise connection inherently contains the sequential information.
        # Thus, we can discard the position embedding of transformers.
        self.enc_embedding = DataEmbedding_wo_pos(args.enc_in, args.d_model, args.embed, args.freq,
                                                  args.dropout)
        self.dec_embedding = DataEmbedding_wo_pos(args.dec_in, args.d_model, args.embed, args.freq,
                                                  args.dropout)

        # Encoder
        self.encoder = Encoder(
            [
                EncoderLayer(
                    AutoCorrelationLayer(
                        AutoCorrelation(False, args.factor, attention_dropout=args.dropout,
                                        output_attention=args.output_attention),
                        args.d_model, args.n_heads),
                    args.d_model,
                    args.d_ff,
                    moving_avg=args.moving_avg,
                    dropout=args.dropout,
                    activation=args.activation
                ) for l in range(args.e_layers)
            ],
            norm_layer=my_Layernorm(args.d_model)
        )
        # Decoder
        self.decoder = Decoder(
            [
                DecoderLayer(
                    AutoCorrelationLayer(
                        AutoCorrelation(True, args.factor, attention_dropout=args.dropout,
                                        output_attention=False),
                        args.d_model, args.n_heads),
                    AutoCorrelationLayer(
                        AutoCorrelation(False, args.factor, attention_dropout=args.dropout,
                                        output_attention=False),
                        args.d_model, args.n_heads),
                    args.d_model,
                    args.c_out,
                    args.d_ff,
                    moving_avg=args.moving_avg,
                    dropout=args.dropout,
                    activation=args.activation,
                )
                for l in range(args.d_layers)
            ],
            norm_layer=my_Layernorm(args.d_model),
            projection=nn.Linear(args.d_model, args.c_out, bias=True)
        )

    def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec,
                enc_self_mask=None, dec_self_mask=None, dec_enc_mask=None):
        # decomp init
        mean = torch.mean(x_enc, dim=1).unsqueeze(1).repeat(1, self.pred_len, 1)
        zeros = torch.zeros([x_dec.shape[0], self.pred_len, x_dec.shape[2]], device=x_enc.device)
        seasonal_init, trend_init = self.decomp(x_enc)
        # decoder input
        trend_init = torch.cat([trend_init[:, -self.label_len:, :], mean], dim=1)
        seasonal_init = torch.cat([seasonal_init[:, -self.label_len:, :], zeros], dim=1)
        # enc
        enc_out = self.enc_embedding(x_enc, x_mark_enc)
        enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)
        # dec
        dec_out = self.dec_embedding(seasonal_init, x_mark_dec)
        seasonal_part, trend_part = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask,
                                                 trend=trend_init)
        # final
        dec_out = trend_part + seasonal_part

        if self.output_attention:
            return dec_out[:, -self.pred_len:, :], attns
        else:
            return dec_out[:, -self.pred_len:, :]  # [B, L, D]

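Looking at forward, the main inputs are x_enc, x_mark_enc, x_dec and x_mark_dec, described in turn below:

x_enc: the encoder input, of shape (batch_size, seq_len, enc_in). In this article we use all 13 variables of the previous 96 time steps to predict all 13 variables of the next 24 time steps, so x_enc has shape (batch_size, 96, 13).
x_mark_enc: the encoder timestamp input, whose shape depends on the encoding scheme. This article uses TimeFeatureEmbedding with freq='h', so each time step carries 4 time features (with timeenc=1: hour of day, day of week, day of month, day of year) and the shape is (batch_size, 96, 4).
x_dec: the decoder input, of shape (batch_size, label_len+pred_len, dec_in), where dec_in is the number of decoder input variables, also 13. In Informer, to avoid a step-by-step decoding structure, the author concatenates the last label_len time steps of x_enc with the time steps to be predicted to form the decoder input. In this experiment we predict the next 24 time steps, so pred_len=24, and look back 48 time steps, so label_len=48. The decoder input therefore has shape (batch_size, 48+24=72, 13).
x_mark_dec: the decoder timestamp input, of shape (batch_size, 72, 4).

To make the encoder and decoder inputs easier to understand, here is a concrete example. Suppose the encoder input of one sample is all 13 variables at time steps 1-96, so x_enc has shape (96, 13) and x_mark_enc has shape (96, 4), holding the 4 time features of each time step. The decoder input covers the last label_len=48 time steps of the encoder input plus the pred_len=24 time steps to be predicted, i.e. all 13 variables at time steps 49-120, so x_dec has shape (72, 13) and x_mark_dec likewise has shape (72, 4).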
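
The window arithmetic in this example can be reproduced with a few lines of plain Python. The index formulas are the same ones used in the get_data function in Section IV; the helper name `window_indices` is hypothetical.

```python
def window_indices(index, seq_len, label_len, pred_len):
    """Return the half-open, 0-based (begin, end) ranges of the encoder and decoder windows."""
    s_begin = index
    s_end = s_begin + seq_len
    r_begin = s_end - label_len
    r_end = r_begin + label_len + pred_len
    return (s_begin, s_end), (r_begin, r_end)


enc_range, dec_range = window_indices(0, seq_len=96, label_len=48, pred_len=24)

# Convert to the 1-based inclusive time steps used in the text.
enc_steps = (enc_range[0] + 1, enc_range[1])
dec_steps = (dec_range[0] + 1, dec_range[1])
print(enc_steps, dec_steps)
```

The decoder window overlaps the last 48 steps of the encoder window and extends 24 steps beyond it.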

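To prevent data leakage, when predicting time steps 97-120 the decoder input x_dec must not contain the true values of time steps 97-120. In the original code, the author replaces them with 24 zeros: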

dec_inp = torch.zeros_like(batch_y[:, -self.args.pred_len:, :]).float()
dec_inp = torch.cat([batch_y[:, :self.args.label_len, :], dec_inp], dim=1).float().to(self.device)
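
The same masking can be sketched on plain nested lists for a single sample, which makes it easy to see what the decoder is allowed to look at. The helper name `build_decoder_input` and the tiny sizes are assumptions for illustration.

```python
def build_decoder_input(sample_y, label_len, pred_len):
    """Keep the first label_len rows and replace the pred_len future rows with zeros.

    sample_y: list of label_len + pred_len rows, each row a list of variables.
    """
    n_vars = len(sample_y[0])
    known = [list(row) for row in sample_y[:label_len]]
    placeholder = [[0.0] * n_vars for _ in range(pred_len)]
    return known + placeholder


# 5 time steps with 2 variables each: 3 known steps followed by 2 steps to predict.
sample_y = [[float(t), float(-t)] for t in range(1, 6)]
dec_inp = build_decoder_input(sample_y, label_len=3, pred_len=2)
print(dec_inp)
```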

3.3.1 Initialization

The code for this part is as follows:

# decomp init
mean = torch.mean(x_enc, dim=1).unsqueeze(1).repeat(1, self.pred_len, 1)
zeros = torch.zeros([x_dec.shape[0], self.pred_len, x_dec.shape[2]], device=x_enc.device)
seasonal_init, trend_init = self.decomp(x_enc)
# decoder input
trend_init = torch.cat([trend_init[:, -self.label_len:, :], mean], dim=1)
seasonal_init = torch.cat([seasonal_init[:, -self.label_len:, :], zeros], dim=1)

First:

mean = torch.mean(x_enc, dim=1).unsqueeze(1).repeat(1, self.pred_len, 1)

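This line first averages x_enc, of shape (batch_size, 96, 13), along the seq_len dimension and then repeats the result to shape (batch_size, 24, 13), where 24 is pred_len and the 13 values are the per-variable means over the 96 time steps.

Next, an all-zero tensor zeros of shape (batch_size, 24, 13) is created.

Then x_enc is decomposed into two components, a seasonal part and a trend part: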

seasonal_init, trend_init = self.decomp(x_enc)

More specifically, the series_decomp module decomposes x_enc, i.e. the encoder input of shape (batch_size, seq_len=96, enc_in=13). The code of series_decomp is shown below:

class moving_avg(nn.Module):
    """
    Moving average block to highlight the trend of time series
    """
    def __init__(self, kernel_size, stride):
        super(moving_avg, self).__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)

    def forward(self, x):
        # padding on the both ends of time series
        front = x[:, 0:1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        x = torch.cat([front, x, end], dim=1)
        x = self.avg(x.permute(0, 2, 1))
        x = x.permute(0, 2, 1)
        return x
        
class series_decomp(nn.Module):
    """
    Series decomposition block
    """
    def __init__(self, kernel_size):
        super(series_decomp, self).__init__()
        self.moving_avg = moving_avg(kernel_size, stride=1)

    def forward(self, x):
        moving_mean = self.moving_avg(x)
        res = x - moving_mean
        return res, moving_mean

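The input x is x_enc, of shape (batch_size, seq_len=96, enc_in=13). The decomposition first applies nn.AvgPool1d to x_enc, average-pooling along the seq_len dimension, i.e. taking the mean over every window of kernel_size steps. Because both ends are padded by repeating the first and last values, the shape after pooling is unchanged (for an odd kernel_size with stride 1) and remains (batch_size, seq_len, enc_in). This pooled moving average is the initial trend, trend_init. Subtracting trend_init from x_enc gives seasonal_init, which reflects the short-term fluctuations.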
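
A single-variable version of this decomposition fits in a few lines of plain Python. It follows the same steps as moving_avg and series_decomp (replicate the edges, average each window, subtract), with the function names `moving_average` and `decompose` chosen here for illustration.

```python
def moving_average(x, kernel_size):
    """Moving average with the first/last value repeated (kernel_size - 1) // 2 times at each end."""
    pad = (kernel_size - 1) // 2
    padded = [x[0]] * pad + list(x) + [x[-1]] * pad
    return [
        sum(padded[i:i + kernel_size]) / kernel_size
        for i in range(len(padded) - kernel_size + 1)
    ]


def decompose(x, kernel_size):
    """Return (seasonal, trend): trend is the moving average, seasonal is the residual."""
    trend = moving_average(x, kernel_size)
    seasonal = [value - t for value, t in zip(x, trend)]
    return seasonal, trend


x = [1.0, 2.0, 3.0, 4.0, 5.0]
seasonal, trend = decompose(x, kernel_size=3)
print(trend)
print(seasonal)
```

By construction the two parts add back up to the input, and with an odd kernel_size the output has the same length as the input.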

The last step prepares the initial decoder inputs:

# decoder input
trend_init = torch.cat([trend_init[:, -self.label_len:, :], mean], dim=1)
seasonal_init = torch.cat([seasonal_init[:, -self.label_len:, :], zeros], dim=1)

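trend_init has shape (batch_size, 96, 13). We keep only its last label_len=48 time steps and concatenate them with mean, of shape (batch_size, 24, 13), giving shape (batch_size, 48+24=72, 13). The first part of trend_init thus comes from decomposing x_enc, and the second part is the average of x_enc over all time steps. seasonal_init also has shape (batch_size, 96, 13). We again keep only its last label_len time steps and concatenate them with zeros to get (batch_size, 48+24=72, 13), so the first part of seasonal_init comes from decomposing x_enc and the second part is 0.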
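
For a single variable, the two concatenations look like this in plain Python. The helper name `decoder_init` and the toy trend values are assumptions for illustration; in the model the trend comes from series_decomp.

```python
def decoder_init(seasonal, trend, x, label_len, pred_len):
    """Build the decoder's initial seasonal and trend inputs for one variable."""
    mean = sum(x) / len(x)
    # Trend: last label_len steps, then the series mean as a placeholder for the future.
    trend_init = trend[-label_len:] + [mean] * pred_len
    # Seasonal: last label_len steps, then zeros as a placeholder for the future.
    seasonal_init = seasonal[-label_len:] + [0.0] * pred_len
    return seasonal_init, trend_init


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trend = [1.5, 2.0, 3.0, 4.0, 5.0, 5.5]          # toy trend, assumed for this example
seasonal = [v - t for v, t in zip(x, trend)]    # residual of the toy trend

seasonal_init, trend_init = decoder_init(seasonal, trend, x, label_len=3, pred_len=2)
print(trend_init)
print(seasonal_init)
```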

3.3.2 Encoder

The encoder step is as follows:

enc_out = self.enc_embedding(x_enc, x_mark_enc)
enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)

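This part is similar to Informer, covered in the previous article. The main difference is that the attention mechanism is replaced by AutoCorrelation, which is not described in detail here.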

3.3.3 Decoder

The decoder step is as follows:

# dec
dec_out = self.dec_embedding(seasonal_init, x_mark_dec)
seasonal_part, trend_part = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask, trend=trend_init)
# final
dec_out = trend_part + seasonal_part

First, Token Embedding is applied to seasonal_init rather than to x_dec as in Informer. seasonal_init has the same shape as x_dec, namely (batch_size, 48+24=72, 13).

Next comes the decoding step:

seasonal_part, trend_part = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask, trend=trend_init)

The details of self.decoder:

class Decoder(nn.Module):
    """
    Autoformer decoder
    """
    def __init__(self, layers, norm_layer=None, projection=None):
        super(Decoder, self).__init__()
        self.layers = nn.ModuleList(layers)
        self.norm = norm_layer
        self.projection = projection

    def forward(self, x, cross, x_mask=None, cross_mask=None, trend=None):
        for layer in self.layers:
            x, residual_trend = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)
            trend = trend + residual_trend

        if self.norm is not None:
            x = self.norm(x)

        if self.projection is not None:
            x = self.projection(x)
        return x, trend

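Looking directly at forward, the Decoder stacks several DecoderLayers, and the inputs involved are x, cross and trend. x is dec_out, which after embedding has shape (batch_size, 48+24=72, d_model). cross is the encoder output enc_out, of shape (batch_size, seq_len=96, d_model). trend is trend_init, of shape (batch_size, 48+24=72, 13). Each decoder layer applies AutoCorrelation to dec_out and enc_out and returns x together with a residual trend: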

class DecoderLayer(nn.Module):
    """
    Autoformer decoder layer with the progressive decomposition architecture
    """
    def __init__(self, self_attention, cross_attention, d_model, c_out, d_ff=None,
                 moving_avg=25, dropout=0.1, activation="relu"):
        super(DecoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.self_attention = self_attention
        self.cross_attention = cross_attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1, bias=False)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1, bias=False)
        self.decomp1 = series_decomp(moving_avg)
        self.decomp2 = series_decomp(moving_avg)
        self.decomp3 = series_decomp(moving_avg)
        self.dropout = nn.Dropout(dropout)
        self.projection = nn.Conv1d(in_channels=d_model, out_channels=c_out, kernel_size=3, stride=1, padding=1,
                                    padding_mode='circular', bias=False)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        x = x + self.dropout(self.self_attention(
            x, x, x,
            attn_mask=x_mask
        )[0])
        x, trend1 = self.decomp1(x)
        x = x + self.dropout(self.cross_attention(
            x, cross, cross,
            attn_mask=cross_mask
        )[0])
        x, trend2 = self.decomp2(x)
        y = x
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        x, trend3 = self.decomp3(x + y)

        residual_trend = trend1 + trend2 + trend3
        residual_trend = self.projection(residual_trend.permute(0, 2, 1)).transpose(1, 2)
        return x, residual_trend

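The residual trend of each layer is then added to the running trend, which starts from trend_init. After all layers, x and trend are returned.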
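
The accumulation loop can be sketched with a stand-in layer. To be clear about the assumptions: `toy_layer` below is not DecoderLayer, it has no AutoCorrelation, feed-forward block or projection, and only performs one moving-average decomposition per layer. It is meant solely to show how the trend extracted at each layer is summed into the running trend.

```python
def moving_average(x, kernel_size):
    """Moving average with edge replication, as in the decomposition block."""
    pad = (kernel_size - 1) // 2
    padded = [x[0]] * pad + list(x) + [x[-1]] * pad
    return [
        sum(padded[i:i + kernel_size]) / kernel_size
        for i in range(len(padded) - kernel_size + 1)
    ]


def toy_layer(x, kernel_size=3):
    """Stand-in for a decoder layer: returns (seasonal part, residual trend)."""
    residual_trend = moving_average(x, kernel_size)
    seasonal = [value - t for value, t in zip(x, residual_trend)]
    return seasonal, residual_trend


def toy_decoder(x, trend, num_layers):
    """Same loop shape as Decoder.forward: refine x, accumulate the trend."""
    for _ in range(num_layers):
        x, residual_trend = toy_layer(x)
        trend = [a + b for a, b in zip(trend, residual_trend)]
    return x, trend


seasonal_in = [0.5, -1.0, 2.0, -0.5, 1.0]
trend_in = [10.0, 10.5, 11.0, 11.5, 12.0]
seasonal_out, trend_out = toy_decoder(seasonal_in, trend_in, num_layers=2)
print(seasonal_out)
print(trend_out)
```

In this toy setting whatever a layer removes from x is moved into the trend, so the sum of the two outputs equals the sum of the two inputs. The real layers also transform x, so that equality does not hold there.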

Finally, the author adds x and trend to obtain the final decoder output:

dec_out = trend_part + seasonal_part

dec_out still has shape (batch_size, 48+24=72, 13), and its last pred_len=24 time steps are the predictions:

if self.output_attention:
    return dec_out[:, -self.pred_len:, :], attns
else:
    return dec_out[:, -self.pred_len:, :]  # [B, L, D]

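One detail: the values of x_dec, of shape (batch_size, 48+24=72, 13), are never used. forward only reads x_dec.shape to build zeros. The argument was probably kept so that the interface matches Informer's.

IV. Experiments

First comes data processing. The data processing in the original Autoformer code does not fit the pipeline used in my previous 30-plus articles, so I rewrote it here. The code is as follows: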

def get_data(args):
    print('data processing...')
    data = load_data()
    # split
    train = data[:int(len(data) * 0.6)]
    val = data[int(len(data) * 0.6):int(len(data) * 0.8)]
    test = data[int(len(data) * 0.8):len(data)]
    scaler = StandardScaler()

    def process(dataset, flag, step_size, shuffle):
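        # Encode the timestamp column (work on a copy to avoid pandas' SettingWithCopyWarning)
        df_stamp = dataset[['date']].copy()
        df_stamp['date'] = pd.to_datetime(df_stamp['date'])
        data_stamp = time_features(df_stamp, timeenc=1, freq=args.freq)
        data_stamp = torch.FloatTensor(data_stamp)
        # Next, standardize the features
        # First drop the timestamp column (without mutating the caller's slice in place)
        dataset = dataset.drop(columns=['date'])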
        if flag == 'train':
            dataset = scaler.fit_transform(dataset.values)
        else:
            dataset = scaler.transform(dataset.values)

        dataset = torch.FloatTensor(dataset)
        # Build the samples
        samples = []
        for index in range(0, len(dataset) - args.seq_len - args.pred_len + 1, step_size):
            # train_x, x_mark, train_y, y_mark
            s_begin = index
            s_end = s_begin + args.seq_len
            r_begin = s_end - args.label_len
            r_end = r_begin + args.label_len + args.pred_len
            seq_x = dataset[s_begin:s_end]
            seq_y = dataset[r_begin:r_end]
            seq_x_mark = data_stamp[s_begin:s_end]
            seq_y_mark = data_stamp[r_begin:r_end]
            samples.append((seq_x, seq_y, seq_x_mark, seq_y_mark))

        samples = MyDataset(samples)
        samples = DataLoader(dataset=samples, batch_size=args.batch_size, shuffle=shuffle, num_workers=0, drop_last=False)

        return samples

    Dtr = process(train, flag='train', step_size=1, shuffle=True)
    Val = process(val, flag='val', step_size=1, shuffle=True)
    Dte = process(test, flag='test', step_size=args.pred_len, shuffle=False)

    return Dtr, Val, Dte, scaler
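
To see how many samples this produces, the split and the sliding-window loop can be replayed on plain integers. The dataset length of 1000 rows is a made-up figure for illustration, and the helper names `split_bounds` and `window_starts` are hypothetical; the formulas are the ones used in get_data above.

```python
def split_bounds(n):
    """Chronological 60/20/20 split, returned as half-open (begin, end) row ranges."""
    train_end = int(n * 0.6)
    val_end = int(n * 0.8)
    return (0, train_end), (train_end, val_end), (val_end, n)


def window_starts(n, seq_len, pred_len, step_size):
    """Start indices of all windows that fit seq_len inputs plus pred_len targets."""
    return list(range(0, n - seq_len - pred_len + 1, step_size))


train_rng, val_rng, test_rng = split_bounds(1000)
train_len = train_rng[1] - train_rng[0]
test_len = test_rng[1] - test_rng[0]

# Training slides one step at a time; testing jumps by pred_len.
train_starts = window_starts(train_len, seq_len=96, pred_len=24, step_size=1)
test_starts = window_starts(test_len, seq_len=96, pred_len=24, step_size=24)
print(len(train_starts), test_starts)
```

Using step_size=args.pred_len on the test set means consecutive test samples predict back-to-back, non-overlapping 24-step windows, so the predictions can be concatenated into one continuous curve.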