Generative Adversarial Networks

2021-01-02

Generative Adversarial Networks (GAN)

1. Introduction

Overall GAN loss function

$\min_{G}\max_{D} V(G,D) = E_{x\sim P_{data}}[\log D(x)] + E_{z\sim P_z}[\log (1-D(G(z)))]$

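Training alternates: update the Discriminator first, then the Generator, and repeat until the objective converges.

Note that every loss is computed at the output of D (the discriminator). D's output is normally a real/fake decision, so the loss used throughout is binary cross-entropy.

Start with the $\max_D$ part, since training usually updates D first while holding G (the generator) fixed. D's goal is to tell real from fake. If 1/0 stands for real/fake, then in the first expectation the input is sampled from the real data, so we want D(x) close to 1, which makes the first term larger. In the second expectation the input is sampled from G's output, so we want D(G(z)) close to 0, which again makes the second term larger. Training D therefore pushes the whole expression up, which is what $\max_D$ means.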

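The second part holds D fixed and trains G. Now only the second expectation matters. The key point: to fool D we set the label to 1 (we know the sample is fake, which is why this counts as fooling) and hope that D(G(z)) comes out close to 1, i.e. that the second term is as small as possible. This is $\min_G$. D is not easily fooled, of course, so it produces a fairly large error at this step; that error is backpropagated to update G, and G improves. It did not fool D this time, so it tries harder next time. Note that with binary cross-entropy and label 1, the quantity actually minimized is $-\log D(G(z))$ rather than $\log(1-D(G(z)))$; both push D(G(z)) toward 1.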

Loss function of the Discriminator

$\max_D \log [D(x)] + \log [1 - D(G(z))]$

Loss function of the Generator

$\min_G \log[1-D(G(z))]$
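The two objectives above map directly onto binary cross-entropy. The following is a minimal sketch with made-up discriminator outputs (`d_real` and `d_fake` are illustrative values, not the output of a trained model). It checks that `nn.BCELoss` with label 1 on real samples and label 0 on generated samples is the negated discriminator objective, and that flipping the label to 1 for generated samples gives $-\log D(G(z))$.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

# Hypothetical discriminator outputs (probabilities of "real").
d_real = torch.tensor([[0.9], [0.8]])   # D(x) on real samples
d_fake = torch.tensor([[0.3], [0.1]])   # D(G(z)) on generated samples

ones = torch.ones_like(d_real)
zeros = torch.zeros_like(d_fake)

# Discriminator step: BCE with label 1 on real, label 0 on fake.
d_loss = criterion(d_real, ones) + criterion(d_fake, zeros)
# The same quantity written as the value function V(G, D), batch-averaged.
v = torch.log(d_real).mean() + torch.log(1 - d_fake).mean()

# Generator step with the label flipped to 1: this is -log D(G(z)).
g_loss = criterion(d_fake, ones)
g_loss_manual = -torch.log(d_fake).mean()
```

Minimizing `d_loss` is therefore the same as maximizing `v`, which is why a single BCE criterion can serve both networks.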

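Under an (approximately) optimal discriminator, minimizing the generator's loss is equivalent to minimizing the JS divergence between $P_r$ and $P_g$.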
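The step behind this statement can be written out briefly. This is the standard argument from the original GAN paper, using $P_r$ for the real distribution and $P_g$ for the generated one; it assumes D can represent any function.

```latex
% For a fixed G, the pointwise maximizer of V over D is
D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}

% Substituting D^* back into V:
V(G, D^*) = E_{x \sim P_r}\left[\log \frac{P_r(x)}{P_r(x) + P_g(x)}\right]
          + E_{x \sim P_g}\left[\log \frac{P_g(x)}{P_r(x) + P_g(x)}\right]
          = 2\, JS(P_r \,\|\, P_g) - 2 \log 2
```

Since $2\log 2$ is a constant, minimizing over G minimizes the JS divergence.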

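The figure below shows that every loss is produced by the discriminator. Without D, G has no way of knowing how good its samples are, so its weights would never be updated.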

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torchvision import datasets,transforms
from torchvision.utils import save_image

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the dataset
batch_size = 100
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.repeat(3,1,1)),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))])
# download=True fetches MNIST only if it is not already under root
train_dataset = datasets.MNIST(root='./MNIST_data/', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root='./MNIST_data/', train=False, transform=transform, download=True)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

class Generator(nn.Module):
    def __init__(self,input_dims, output_dims):
        super(Generator,self).__init__()
        self.fc1 = nn.Linear(input_dims,256)
        self.fc2 = nn.Linear(self.fc1.out_features, self.fc1.out_features*2)
        self.fc3 = nn.Linear(self.fc2.out_features, self.fc2.out_features*2)
        self.fc4 = nn.Linear(self.fc3.out_features, output_dims)
        
    def forward(self, x):
        x = F.leaky_relu(self.fc1(x),0.2)
        x = F.leaky_relu(self.fc2(x),0.3)
        x = F.leaky_relu(self.fc3(x),0.4)
        return torch.tanh(self.fc4(x))
    
class Discriminator(nn.Module):
    def __init__(self,input_dim):
        super(Discriminator,self).__init__()
        self.fc1 = nn.Linear(input_dim, 1024)
        self.fc2 = nn.Linear(self.fc1.out_features, self.fc1.out_features//2)
        self.fc3 = nn.Linear(self.fc2.out_features, self.fc2.out_features//2)
        self.fc4 = nn.Linear(self.fc3.out_features, 1)
    def forward(self, x):
        x = F.leaky_relu(self.fc1(x), 0.2)
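        # pass training=self.training so dropout is disabled under D.eval()
        x = F.dropout(x, 0.3, training=self.training)
        x = F.leaky_relu(self.fc2(x), 0.2)
        x = F.dropout(x, 0.3, training=self.training)
        x = F.leaky_relu(self.fc3(x), 0.2)
        x = F.dropout(x, 0.3, training=self.training)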
        return torch.sigmoid(self.fc4(x))
    
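# Binary cross-entropy loss
criterion = nn.BCELoss()

# Assumed values: the original snippet never defines z_dim, mnist_dim, G or D
z_dim = 100            # noise dimension (a common choice, assumed here)
mnist_dim = 28 * 28    # one flattened 28x28 MNIST image
G = Generator(input_dims=z_dim, output_dims=mnist_dim).to(device)
D = Discriminator(mnist_dim).to(device)

lr = 0.0002
D_optimizer = optim.Adam(D.parameters(), lr = lr)
G_optimizer = optim.Adam(G.parameters(), lr = lr)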

def G_train(x):
    G.zero_grad()
    z = Variable(torch.randn(batch_size,z_dim).to(device))
    # labels are all 1: G wants D to classify its samples as real
    y = Variable(torch.ones(batch_size,1).to(device))
    
    G_output = G(z)
    D_output = D(G_output)
    
    G_loss = criterion(D_output, y)
    G_loss.backward()
    
    G_optimizer.step()
    
    return G_loss.data.item()

def D_train(x):
    D.zero_grad()
    # x originally has shape [batch_size, 3, 28, 28]
    # all 3 channels are identical, so one channel is enough
    x = x[:,0,:,:]
    # size the labels from the actual batch so a short final batch also works
    x_real, y_real = x.view(-1, mnist_dim), torch.ones(x.size(0), 1)
    x_real, y_real = Variable(x_real.to(device)), Variable(y_real.to(device))
    
    D_output = D(x_real)
    D_real_loss = criterion(D_output, y_real)
    #D_real_score = D_output
    
    z = Variable(torch.randn(batch_size, z_dim).to(device))
    # detach so that no gradient flows back into G during the D step
    x_fake, y_fake = G(z).detach(), Variable(torch.zeros(batch_size, 1).to(device))

    D_output = D(x_fake)
    D_fake_loss = criterion(D_output, y_fake)
    #D_fake_score = D_output

    # gradient backprop & optimize ONLY D's parameters
    D_loss = D_real_loss + D_fake_loss
    D_loss.backward()
    D_optimizer.step()
   
    return  D_loss.data.item()

# Train the model
n_epoch = 200
for epoch in range(n_epoch):
    D_losses, G_losses = [], []
    for index,(x,_) in enumerate(train_loader):
        D_losses.append(D_train(x))
        G_losses.append(G_train(x))
    # report epochs as 1..n_epoch rather than 0..n_epoch-1
    print('[%d/%d]: loss_d: %.3f, loss_g: %.3f' % (
            epoch + 1, n_epoch, torch.mean(torch.FloatTensor(D_losses)), torch.mean(torch.FloatTensor(G_losses))))

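# Generate images with the trained GAN
import os
os.makedirs('./samples', exist_ok=True)  # save_image does not create the directory
with torch.no_grad():
    test_z = Variable(torch.randn(batch_size, z_dim).to(device))
    generated = G(test_z)
    # G ends in tanh, so map [-1, 1] back to [0, 1] before saving
    generated = (generated + 1) / 2
    save_image(generated.view(generated.size(0), 1, 28, 28), './samples/sample_' + '.png')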

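2. Variants of GAN

2.1 DCGAN

Deep convolutional GAN. The generator repeatedly applies transposed convolutions (upsampling) to the input one-dimensional vector until it produces an image. The discriminator passes the input image through several convolution layers and a final sigmoid for binary classification, deciding whether the image comes from the original data or from the generator.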

def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

class Generator(nn.Module):
    """
    input (N, in_dim)
    output (N, 3, 64, 64)
    """
    def __init__(self, in_dim, dim=64):
        super(Generator, self).__init__()
        def dconv_bn_relu(in_dim, out_dim):
            return nn.Sequential(
                nn.ConvTranspose2d(in_dim, out_dim, 5, 2,
                                   padding=2, output_padding=1, bias=False),
                nn.BatchNorm2d(out_dim),
                nn.ReLU())
        self.l1 = nn.Sequential(
            nn.Linear(in_dim, dim * 8 * 4 * 4, bias=False),
            nn.BatchNorm1d(dim * 8 * 4 * 4),
            nn.ReLU())
        self.l2_5 = nn.Sequential(
            dconv_bn_relu(dim * 8, dim * 4),
            dconv_bn_relu(dim * 4, dim * 2),
            dconv_bn_relu(dim * 2, dim),
            nn.ConvTranspose2d(dim, 3, 5, 2, padding=2, output_padding=1),
            nn.Tanh())
        self.apply(weights_init)
    def forward(self, x):
        y = self.l1(x)
        y = y.view(y.size(0), -1, 4, 4)
        y = self.l2_5(y)
        return y

class Discriminator(nn.Module):
    """
    input (N, 3, 64, 64)
    output (N, )
    """
    def __init__(self, in_dim, dim=64):
        super(Discriminator, self).__init__()
        def conv_bn_lrelu(in_dim, out_dim):
            return nn.Sequential(
                nn.Conv2d(in_dim, out_dim, 5, 2, 2),
                nn.BatchNorm2d(out_dim),
                nn.LeakyReLU(0.2))
        self.ls = nn.Sequential(
            nn.Conv2d(in_dim, dim, 5, 2, 2), nn.LeakyReLU(0.2),
            conv_bn_lrelu(dim, dim * 2),
            conv_bn_lrelu(dim * 2, dim * 4),
            conv_bn_lrelu(dim * 4, dim * 8),
            nn.Conv2d(dim * 8, 1, 4),
            nn.Sigmoid())
        self.apply(weights_init)        
    def forward(self, x):
        y = self.ls(x)
        y = y.view(-1)
        return y
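To see why four upsampling blocks take a 4x4 feature map to 64x64, here is a small shape check. It reuses the layer hyperparameters from the code above (kernel 5, stride 2, padding 2, `output_padding=1`); the channel counts and batch size are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# Same hyperparameters as the generator's upsampling blocks.
up = nn.ConvTranspose2d(8, 4, 5, 2, padding=2, output_padding=1, bias=False)
# Same hyperparameters as the discriminator's downsampling blocks.
down = nn.Conv2d(4, 8, 5, 2, 2)

x = torch.randn(2, 8, 4, 4)
y = up(x)        # spatial size doubles: 4 -> 8
z = down(y)      # spatial size halves: 8 -> 4

def out_size_transposed(n, k=5, s=2, p=2, op=1):
    # Output-size formula for ConvTranspose2d with dilation 1.
    return (n - 1) * s - 2 * p + k + op

# Apply the formula four times, as the generator does.
sizes = [4]
for _ in range(4):
    sizes.append(out_size_transposed(sizes[-1]))
```

Each transposed convolution exactly doubles the height and width, and each strided convolution in the discriminator halves them, so the final `Conv2d(dim * 8, 1, 4)` sees a 4x4 map and reduces it to a single score.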

2.2 Conditional GAN

The CGAN objective is much the same as the original one, with a condition $y$ added.

$\min_G \max_D V(D,G) = E_{x\sim p_{data}}[\log D(x|y)] + E_{z\sim p_z}[\log(1 - D(G(z|y)))]$

# G(z)
class Generator(nn.Module):
    # initializers
    def __init__(self):
        super(Generator, self).__init__()
        self.fc1_1 = nn.Linear(100, 256)
        self.fc1_1_bn = nn.BatchNorm1d(256)
        # handles the one-hot label vector
        self.fc1_2 = nn.Linear(10, 256)
        self.fc1_2_bn = nn.BatchNorm1d(256)
        
        self.fc2 = nn.Linear(512, 512)
        self.fc2_bn = nn.BatchNorm1d(512)
        self.fc3 = nn.Linear(512, 1024)
        self.fc3_bn = nn.BatchNorm1d(1024)
        self.fc4 = nn.Linear(1024, 784)

    # weight_init
    def weight_init(self, mean, std):
        for m in self._modules:
            normal_init(self._modules[m], mean, std)

    # forward method
    def forward(self, input, label):
        x = F.relu(self.fc1_1_bn(self.fc1_1(input)))
        y = F.relu(self.fc1_2_bn(self.fc1_2(label)))
        # concatenate the two vectors
        x = torch.cat([x, y], 1)
        x = F.relu(self.fc2_bn(self.fc2(x)))
        x = F.relu(self.fc3_bn(self.fc3(x)))
        x = torch.tanh(self.fc4(x))

        return x

class Discriminator(nn.Module):
    # initializers
    def __init__(self):
        super(Discriminator, self).__init__()
        self.fc1_1 = nn.Linear(784, 1024)
        # handles the one-hot label vector, shape batch_size * 10
        self.fc1_2 = nn.Linear(10, 1024)
        self.fc2 = nn.Linear(2048, 512)
        self.fc2_bn = nn.BatchNorm1d(512)
        self.fc3 = nn.Linear(512, 256)
        self.fc3_bn = nn.BatchNorm1d(256)
        self.fc4 = nn.Linear(256, 1)

    # weight_init
    def weight_init(self, mean, std):
        for m in self._modules:
            normal_init(self._modules[m], mean, std)

    # forward method
    def forward(self, input, label):
        x = F.leaky_relu(self.fc1_1(input), 0.2)
        y = F.leaky_relu(self.fc1_2(label), 0.2)
        
        x = torch.cat([x, y], 1)
        x = F.leaky_relu(self.fc2_bn(self.fc2(x)), 0.2)
        x = F.leaky_relu(self.fc3_bn(self.fc3(x)), 0.2)
        x = torch.sigmoid(self.fc4(x))

        return x

def normal_init(m, mean, std):
    if isinstance(m, nn.Linear):
        m.weight.data.normal_(mean, std)
        m.bias.data.zero_()

Combining the two models above gives a cDCGAN (the Linear layers are simply replaced by ConvTranspose2d or Conv2d layers).
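One way to feed the label into convolutional layers is sketched below. This is an assumption about how the conditioning could be wired, not code from a specific cDCGAN implementation: the layer sizes, the 32x32 image size and the example labels are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10
labels = torch.tensor([3, 7])                     # hypothetical class labels
images = torch.randn(2, 1, 32, 32)                # hypothetical image batch
z = torch.randn(2, 100)                           # noise vectors

onehot = F.one_hot(labels, num_classes).float()   # (N, 10)

# Generator side: append the one-hot label to the noise and treat the result
# as a (N, 110, 1, 1) feature map for the first ConvTranspose2d.
g_in = torch.cat([z, onehot], dim=1).view(2, -1, 1, 1)
g_first = nn.ConvTranspose2d(110, 64, 4, 1, 0)
g_feat = g_first(g_in)                            # (N, 64, 4, 4)

# Discriminator side: broadcast the label to one constant plane per class
# and stack those planes onto the image channels.
label_maps = onehot.view(2, num_classes, 1, 1).expand(-1, -1, 32, 32)
d_in = torch.cat([images, label_maps], dim=1)     # (N, 1 + 10, 32, 32)
d_first = nn.Conv2d(1 + num_classes, 64, 4, 2, 1)
d_feat = d_first(d_in)                            # (N, 64, 16, 16)
```

The idea mirrors the `torch.cat([x, y], 1)` step in the fully connected version: the label is concatenated along the channel dimension instead of the feature dimension.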

2.3 Bidirectional GAN

BiGAN is described in two papers:

Donahue, Jeff, Philipp Krähenbühl, and Trevor Darrell. “Adversarial feature learning.” arXiv preprint arXiv:1605.09782 (2016).

Dumoulin, Vincent, et al. “Adversarially learned inference.” arXiv preprint arXiv:1606.00704 (2016).

Network architecture

Objective: $\min_{G,E}\max_D V(D,E,G)$
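For reference, the value function can be expanded as follows. This is the form given in Donahue et al. for a deterministic encoder $E$ and generator $G$; the discriminator now scores (data, latent) pairs rather than data alone.

```latex
V(D, E, G) = E_{x \sim p_X}\left[\log D(x, E(x))\right]
           + E_{z \sim p_Z}\left[\log\left(1 - D(G(z), z)\right)\right]
```

The first term uses real data paired with its encoding, the second uses generated data paired with the noise that produced it.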

Reference code: https://github.com/fmu2/Wasserstein-BiGAN

2.4 WGAN

Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875. (weight clipping)

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (pp. 5767-5777). (gradient penalty)

Reference: https://zhuanlan.zhihu.com/p/25071913 (article title: 令人拍案叫绝的WGAN).

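The Wasserstein distance is also called the Earth Mover's distance. Its advantage over KL divergence and JS divergence is that it still reflects how far apart two distributions are even when they do not overlap.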
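A small numeric example makes the difference visible. The sketch below uses two point masses on a 1-D grid (a toy setting chosen for illustration) and computes both quantities in plain Python: the JS divergence stays at $\log 2$ however far apart the masses are, while the Wasserstein distance grows with the gap.

```python
import math

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    # W1 on a unit-spaced 1-D grid: sum of |CDF_p - CDF_q|.
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    # All probability on a single grid cell.
    return [1.0 if i == position else 0.0 for i in range(size)]

p = point_mass(0)
near, far = point_mass(2), point_mass(7)

js_near, js_far = js_divergence(p, near), js_divergence(p, far)
w_near, w_far = wasserstein_1d(p, near), wasserstein_1d(p, far)
```

Because the JS value is identical for `near` and `far`, it gives the generator no signal about which direction to move in; the Wasserstein value does.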

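We can build a discriminator network $f_w$ with parameters $w$ whose last layer is not a nonlinear activation and, with $w$ constrained to a bounded range, make

$L = E_{x\sim P_r}[f_w(x)] - E_{x\sim P_g}[f_w(x)]$

as large as possible. $L$ then approximates the Wasserstein distance between the real and generated distributions (up to a constant factor $K$).

Note: the discriminator is updated several times for each single generator update.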
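The procedure just described can be sketched as a training loop. This is a toy version under stated assumptions: the networks are tiny MLPs, the "real" data is a shifted Gaussian standing in for a data loader, and the names `critic`, `n_critic` and `clip_value` as well as all sizes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy setup (all sizes and hyperparameters here are illustrative assumptions).
z_dim, x_dim, n_critic, clip_value = 8, 2, 5, 0.01

G = nn.Sequential(nn.Linear(z_dim, 16), nn.ReLU(), nn.Linear(16, x_dim))
# The critic f_w ends in a plain Linear layer: no sigmoid.
critic = nn.Sequential(nn.Linear(x_dim, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))

opt_c = optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = optim.RMSprop(G.parameters(), lr=5e-5)

def sample_real(n):
    # Stand-in for a real data loader: a shifted 2-D Gaussian.
    return torch.randn(n, x_dim) + 3.0

for step in range(3):
    # Several critic updates per generator update.
    for _ in range(n_critic):
        x_real = sample_real(64)
        x_fake = G(torch.randn(64, z_dim)).detach()
        # Maximize L = E[f_w(x_real)] - E[f_w(x_fake)], i.e. minimize -L.
        loss_c = -(critic(x_real).mean() - critic(x_fake).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # Weight clipping keeps every parameter inside [-c, c].
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # One generator update: raise the critic's score on generated samples.
    loss_g = -critic(G(torch.randn(64, z_dim))).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

Clipping after every critic step is what enforces the bounded range on $w$; the losses contain no logarithm and the critic output is an unbounded score rather than a probability.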

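Changes WGAN makes to the original GAN:

The G and D losses do not take the logarithm.
Do not use momentum-based optimizers (including momentum and Adam); RMSProp is recommended, and SGD also works.
Remove the sigmoid from the last layer of D.
Use weight clipping, or gradient penalty (the improved version).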
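The gradient penalty in the last item can be sketched as follows. This is a minimal illustration on random 2-D data, assuming a small MLP critic; `gradient_penalty` is a helper name introduced here, and the real and generated batches are random stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(2, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))

def gradient_penalty(critic, x_real, x_fake):
    # Random points on the segments between real and generated samples.
    eps = torch.rand(x_real.size(0), 1)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True)
    # Penalize deviation of the per-sample gradient norm from 1.
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

x_real = torch.randn(32, 2) + 3.0      # stand-in for real data
x_fake = torch.randn(32, 2)            # stand-in for G(z).detach()
lambda_gp = 10.0                       # penalty weight (value used in the paper)

gp = gradient_penalty(critic, x_real, x_fake)
# Critic loss: negated Wasserstein estimate plus the weighted penalty.
loss_c = critic(x_fake).mean() - critic(x_real).mean() + lambda_gp * gp
loss_c.backward()
```

The penalty replaces weight clipping as the way to constrain the critic; `create_graph=True` is needed so that the penalty itself can be backpropagated.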

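Problems with the original GAN:

The better the discriminator, the more the generator suffers from vanishing gradients.
Training is unstable and prone to mode collapse.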

2.5 StackGAN: generating high-resolution images from text

Zhang, Han, et al. “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks.” Proceedings of the IEEE international conference on computer vision. 2017.

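2.6 GANomaly: anomaly detection

Network architecture:

As the architecture shows, the model contains two encoders, one decoder (acting as the generator), and one discriminator. It is divided into three parts. The first is an autoencoder made of an encoder ($G_E$) and a decoder ($G_D$); this part is denoted $G$. The second is an encoder, denoted $E$. The third is a discriminator network, denoted $D$. The first two parts together are also called G-Net.

The input image $x$ is encoded by the encoder $G_E$ into a vector $z$; the decoder $G_D$ restores $z$ to an image $\hat x$ of the original size; the other encoder $E$ then encodes $\hat x$ into a vector $\hat z$. Both $x$ and $\hat x$ are fed to the discriminator $D$, which judges whether an image is an original or one produced by the generator.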
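The data flow described above can be traced with a few small networks. This is only a shape-level sketch: the fully connected layers, the sizes `x_dim` and `z_dim`, and the random input are illustrative assumptions, and the final scoring line is one plausible way to compare the two latent codes rather than something stated in the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_dim, z_dim = 784, 32   # illustrative sizes (flattened 28x28 image)

# First part G: encoder G_E followed by decoder G_D.
G_E = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
G_D = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim), nn.Tanh())
# Second part: the extra encoder E.
E = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
# Third part: the discriminator D.
D = nn.Sequential(nn.Linear(x_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

x = torch.rand(4, x_dim) * 2 - 1   # stand-in batch scaled to [-1, 1]

z = G_E(x)          # x -> z
x_hat = G_D(z)      # z -> reconstructed image
z_hat = E(x_hat)    # x_hat -> z_hat

d_real = D(x)       # discriminator on the original image
d_fake = D(x_hat)   # discriminator on the reconstruction

# Assumed scoring rule: distance between the two latent codes.
anomaly_score = (z - z_hat).abs().mean(dim=1)
```

The intuition is that a model trained only on normal data reconstructs normal inputs well, so $z$ and $\hat z$ stay close; for an anomalous input the two codes drift apart.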