A Quick Summary of Image Generation Models | GAN, Autoencoder, Diffusion

sean11

2024. 12. 9. 09:00

1. GAN

1.1. General structure

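A Generative Adversarial Network (GAN) is a generative model that trains a discriminator and a generator at the same time. The discriminator receives both real images and generated images and is trained to tell which ones are generated. The generator takes a latent variable as input and produces images close to the distribution of the training data, and it is trained so that the discriminator cannot tell that its outputs are generated. Because the two networks pursue opposite objectives, the training is called adversarial.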
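
The opposing objectives can be made concrete with a small NumPy sketch. It assumes the discriminator outputs the probability that an image is real and uses binary cross-entropy with the common non-saturating generator loss; the function names and the sample probabilities are made up for illustration, not taken from a specific implementation.

```python
import numpy as np

def bce(probs, targets, eps=1e-12):
    """Binary cross-entropy averaged over a batch."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

def discriminator_loss(d_real, d_fake):
    # The discriminator wants D(real) -> 1 and D(generated) -> 0.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # Non-saturating form: the generator wants D(generated) -> 1.
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs (probability that each image is real).
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])

d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

When the generator fools the discriminator (larger values in `d_fake`), `generator_loss` goes down while `discriminator_loss` goes up, which is the adversarial relationship described above.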

1.2. Key models

Conditional GAN

Conditional GAN injects a condition (for example, a class label) into the GAN training process and then generates images according to that condition.
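
One common way to inject the condition is to concatenate a one-hot class label to the latent vector before it enters the generator. The sketch below shows only that input construction; the batch size, dimensions, and labels are arbitrary assumptions, and real models may inject the condition differently (for example through embeddings).

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot row vectors."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

rng = np.random.default_rng(0)
batch, z_dim, num_classes = 4, 8, 3      # hypothetical sizes

z = rng.standard_normal((batch, z_dim))  # latent variables
labels = np.array([0, 2, 1, 2])          # the condition for each sample

# The generator would receive [z, condition] instead of z alone.
g_input = np.concatenate([z, one_hot(labels, num_classes)], axis=1)
print(g_input.shape)
```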

Papers with Code - Conditional Generative Adversarial Nets


Pix2Pix

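Pix2Pix has the same structure as Conditional GAN, but the condition is itself an image, so the model translates one image into another. Because the condition is an image, training additionally requires paired images, where each input image is matched with its target image.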

Papers with Code - Pix2Pix Explained


CycleGAN

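CycleGAN was introduced to overcome the constraint that Pix2Pix needs image pairs (collecting enough matched pairs makes the dataset expensive to build). It trains without paired images by additionally using a Cycle Consistency Loss, which requires that an image translated to the other domain and back again matches the original.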
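
A minimal sketch of the Cycle Consistency Loss, assuming an L1 distance and two mappings G: X -> Y and F: Y -> X. The toy functions below stand in for the two generators and are purely illustrative.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    # x -> G(x) -> F(G(x)) should come back to x,
    # and y -> F(y) -> G(F(y)) should come back to y.
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return float(forward + backward)

# Toy stand-ins for the two generators (hypothetical).
G = lambda x: 2.0 * x + 1.0          # X -> Y
F_good = lambda y: (y - 1.0) / 2.0   # Y -> X, the exact inverse of G
F_bad = lambda y: y                  # Y -> X, ignores the structure of G

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])

loss_good = cycle_consistency_loss(x, y, G, F_good)
loss_bad = cycle_consistency_loss(x, y, G, F_bad)
print(loss_good, loss_bad)
```

The loss is zero only when the two mappings undo each other, which is what lets training proceed without paired examples.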

Papers with Code - CycleGAN Explained


StarGAN

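Earlier models had to train a separate model for every new generation type (domain). StarGAN was designed so that a single generator can cover multiple domains. It uses three loss functions: the adversarial loss for GAN training, a Cycle Consistency loss, and a domain classification loss that judges which domain an image belongs to.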
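
As a rough summary, the three losses are combined as follows, using the notation of the StarGAN paper. Here $\mathcal{L}_{cls}^{r}$ is the domain classification loss on real images, $\mathcal{L}_{cls}^{f}$ the one on generated images, and $\mathcal{L}_{rec}$ the cycle-consistency (reconstruction) loss. The weights $\lambda_{cls}$ and $\lambda_{rec}$ are hyperparameters (the paper reports $\lambda_{cls} = 1$ and $\lambda_{rec} = 10$).

```latex
\begin{align}
\mathcal{L}_D &= -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{r} \\
\mathcal{L}_G &= \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{rec}\,\mathcal{L}_{rec}
\end{align}
```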

Papers with Code - StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation


Progressive GAN

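Generating high-resolution images requires many steps, so training takes a long time. Progressive GAN starts from a structure that generates low-resolution images and grows it step by step. When a higher-resolution layer is added, the output of the lower-resolution path and the output of the new higher-resolution layer are combined by a weighted sum. This yields a model structure that converges quickly at a lower training cost.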
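
The weighted sum can be sketched as a fade-in between the upsampled low-resolution output and the output of the newly added layer. This assumes nearest-neighbour upsampling and a blending weight `alpha` that rises from 0 to 1 during training; the array values are placeholders.

```python
import numpy as np

def upsample_nearest_2x(img):
    # (H, W) -> (2H, 2W) by repeating each pixel.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def fade_in(low_res_out, high_res_out, alpha):
    # alpha = 0: only the old low-resolution path is used.
    # alpha = 1: only the new high-resolution layer is used.
    return (1.0 - alpha) * upsample_nearest_2x(low_res_out) + alpha * high_res_out

low = np.array([[0.0, 1.0],
                [2.0, 3.0]])   # hypothetical 2x2 output of the existing layers
high = np.ones((4, 4))         # hypothetical 4x4 output of the new layer

blended = fade_in(low, high, alpha=0.25)
print(blended)
```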

Papers with Code - Progressive Growing of GANs for Improved Quality, Stability, and Variation


StyleGAN

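StyleGAN builds on Progressive GAN by injecting style into the generator. A standard GAN samples its latent code from a fixed Gaussian distribution, but the factors of variation in real training data do not necessarily follow that distribution, so feeding the Gaussian code to the generator directly is a poor fit. StyleGAN therefore introduces a mapping function that transforms the latent code into an intermediate representation that can better reflect the characteristics of the real data, and this transformed value is what gets fed into the generator as the style.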
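
The mapping function and the style injection can be sketched as below. The sketch assumes a small MLP as the mapping network (two layers here for brevity) and adaptive instance normalization (AdaIN) as the injection mechanism. All weights are random placeholders rather than trained values, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping_network(z, weights):
    # MLP f: z -> w with leaky ReLU activations.
    h = z
    for W in weights:
        h = h @ W
        h = np.where(h > 0, h, 0.2 * h)
    return h

def adain(x, style_scale, style_bias, eps=1e-8):
    # x: (C, H, W). Normalize each channel, then apply a per-channel style.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return style_scale[:, None, None] * x_norm + style_bias[:, None, None]

z_dim, w_dim, channels = 16, 16, 4   # hypothetical sizes
weights = [rng.standard_normal((z_dim, w_dim)) * 0.1,
           rng.standard_normal((w_dim, w_dim)) * 0.1]
affine = rng.standard_normal((w_dim, 2 * channels)) * 0.1  # placeholder affine map

z = rng.standard_normal(z_dim)       # Gaussian latent code
w = mapping_network(z, weights)      # intermediate latent
style = w @ affine
scale, bias = 1.0 + style[:channels], style[channels:]

features = rng.standard_normal((channels, 8, 8))  # a generator feature map
styled = adain(features, scale, bias)
print(styled.shape)
```

After AdaIN, each channel's mean and spread are set by the style derived from `w` rather than by the incoming features.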

Papers with Code - StyleGAN Explained


2. Autoencoders

Autoencoder (AE)

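An AE consists of an encoder and a decoder. The encoder maps the input image into a low-dimensional latent space, and the decoder takes the latent variable as input and reconstructs the original image. The model is trained on this reconstruction. The point to keep in mind is that the latent variable must have a lower dimension than the input image, since this bottleneck is what forces the model to learn a compressed representation.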

Papers with Code - AutoEncoder Explained


Variational AE (VAE)

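A VAE also consists of an encoder and a decoder, but it is trained under an assumed distribution over the latent space. "Assuming a distribution" here means that the encoder produces a mean and a standard deviation, a latent variable is sampled from the Gaussian distribution they define, and the decoder takes the sampled latent variable and reconstructs it in the original data space.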
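
The sampling step is usually written with the reparameterization trick, together with a KL term that keeps the latent distribution close to a standard Gaussian. The sketch below assumes the encoder outputs a mean and a log-variance (a common parameterization of the standard deviation); the numbers are placeholders for encoder outputs.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow
    # through mu and sigma in an autodiff framework.
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])        # hypothetical encoder output: mean
log_var = np.array([0.0, -2.0])   # hypothetical encoder output: log variance

z = reparameterize(mu, log_var, rng)   # latent variable passed to the decoder
kl = kl_to_standard_normal(mu, log_var)
print(z, kl)
```

The training objective then adds a reconstruction loss between the decoder output and the input to this KL term.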

Papers with Code - VAE Explained


Vector Quantized-VAE (VQ-VAE)

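VQ-VAE assumes a discrete latent space, a codebook with a predefined number K of embedding vectors, and uses it in training: each encoder output is replaced by its nearest codebook embedding. Because the resulting latent representation is discrete, it is a natural fit for data with inherently discrete or symbolic structure, such as text or speech.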
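
The codebook lookup can be sketched as a nearest-neighbour search under squared Euclidean distance. The codebook and encoder outputs below are tiny made-up arrays; in a real model the codebook is learned.

```python
import numpy as np

def quantize(z_e, codebook):
    # z_e: (N, D) encoder outputs, codebook: (K, D) embeddings.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)      # discrete codes
    return indices, codebook[indices]   # codes and their embeddings

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 2.0]])      # K = 3, D = 2 (hypothetical)
z_e = np.array([[0.1, -0.2],
                [0.9, 1.2],
                [-0.8, 1.7]])

indices, z_q = quantize(z_e, codebook)
print(indices)
print(z_q)
```

The decoder receives `z_q`, and the integer `indices` are the discrete latent representation.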

Papers with Code - VQ-VAE Explained


3. Diffusion Models

Denoising Diffusion Probabilistic Model (DDPM)

DDPM maps an input image to a noisy latent through a forward process and generates images by restoring them through a reverse process. The forward process adds Gaussian noise step by step, and the reverse process removes it by estimating the noise that was added.
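
The forward process has a closed form, so a noisy sample at any timestep can be drawn directly from the clean image. The sketch below assumes the linear noise schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps) and the simplified mean-squared-error loss on the noise. The "image" is a random array, and a zero prediction is used where a trained network would normally go.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal retention

def q_sample(x0, t, noise):
    # Forward process in closed form:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def training_loss(predicted_noise, noise):
    # The network is trained to estimate the noise that was added.
    return float(np.mean((predicted_noise - noise) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in for a clean image
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, noise)      # still close to x0
x_late = q_sample(x0, T - 1, noise)    # almost pure Gaussian noise

loss = training_loss(np.zeros_like(noise), noise)  # placeholder prediction
print(alphas_bar[10], alphas_bar[-1], loss)
```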

Papers with Code - Denoising Diffusion Probabilistic Models


Denoising Diffusion Implicit Model (DDIM)

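Unlike DDPM, where sampling is a stochastic process that passes through every step, DDIM applies only a subset of the steps. In other words, generation does not run the reverse process at every timestep but only on a subsequence of timesteps, using an update that can be made deterministic, which makes sampling faster.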
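
A minimal sketch of deterministic DDIM sampling (the eta = 0 case) over a subsequence of timesteps. It assumes the same linear schedule as the DDPM paper and evenly spaced timesteps. Since no trained network is available here, `oracle` is a hypothetical stand-in that returns the true noise, which lets the sketch be checked end to end.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    # Deterministic DDIM update (eta = 0): estimate x0 from the predicted
    # noise, then move to the noise level of the earlier timestep.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def ddim_sample(x_T, eps_model, num_steps):
    # Visit only `num_steps` of the T training timesteps, noisy to clean.
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        abar_prev = alphas_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = ddim_step(x, eps_model(x, t), alphas_bar[t], abar_prev)
    return x

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))   # stand-in for a clean image
eps = rng.standard_normal(x0.shape)
x_T = np.sqrt(alphas_bar[-1]) * x0 + np.sqrt(1.0 - alphas_bar[-1]) * eps

oracle = lambda x, t: eps                  # hypothetical perfect noise predictor
x_rec = ddim_sample(x_T, oracle, num_steps=50)
print(np.abs(x_rec - x0).max())
```

With 50 steps instead of 1000, the loop calls the noise predictor 20 times less often, which is where the speed-up comes from.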

Papers with Code - Denoising Diffusion Implicit Models


Classifier-free Guidance (CFG)

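Classifier Guidance uses the gradient of a separately trained classifier when estimating noise in the reverse process, so that sampling is steered toward a chosen class. To apply Classifier Guidance to DDIM, the class is added as a condition to the score function so that the reverse process moves toward samples with a higher likelihood for that class. This gives more control over image generation and can be used to generate conditional images.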

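However, this approach needs a separate classifier in addition to the diffusion model, and the classifier has to be evaluated at every step, which can hurt efficiency. Classifier-free guidance (CFG) was introduced to solve this problem. CFG defines the guided prediction by combining the model's own conditional and unconditional predictions, so guidance works without an additional model.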
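
The combination of the two predictions can be sketched in one line. The form below is the one commonly used in Stable Diffusion style implementations; the original paper writes it as (1 + w) * eps_cond - w * eps_uncond, which is the same expression with guidance_scale = 1 + w. The noise predictions are placeholder arrays.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    # Start from the unconditional prediction and push it along the
    # direction that the condition adds.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])   # hypothetical prediction without the condition
eps_cond = np.array([1.0, -1.0])    # hypothetical prediction with the condition

guided = cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5)
print(guided)
```

A scale of 1 reproduces the conditional prediction, and larger scales follow the condition more strongly.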

Papers with Code - Classifier-Free Diffusion Guidance


Latent Diffusion Model (LDM)

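LDM trains the diffusion model not on images but on low-dimensional latent variables extracted by an encoder, which reduces the cost of training on high-resolution images, and it combines the input generation condition with the latent through a cross attention operation. Several variants of LDM exist, and the best known and most widely used today is Stable Diffusion.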

Papers with Code - High-Resolution Image Synthesis with Latent Diffusion Models


Stable Diffusion

Stable Diffusion is a Text-to-Image generation model released under an open-source license by Stability AI in August 2022. Its structure consists broadly of an Autoencoder, an Image Information Creator, and a Text Encoder, and each component has the following role.

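Autoencoder : uses the low-dimensional latent variable extracted by the encoder instead of the image itself
Image Information Creator : the noise scheduler takes the latent representation and a noise level and produces the noise; the U-Net takes the resulting noisy latent and a time embedding and predicts the noise. The Text Encoder output (embeddings) is combined through cross attention between the noisy latent (Q) and the token embeddings (K, V)
Text Encoder : uses the CLIP text encoder to convert the input text into embeddings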
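
The cross attention between the noisy latent (Q) and the token embeddings (K, V) can be sketched as a single attention head in NumPy. The sizes and projection matrices are random placeholders, not the dimensions of the actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, W_q, W_k, W_v):
    Q = latent_tokens @ W_q    # queries from the noisy latent, (N_latent, d)
    K = text_tokens @ W_k      # keys from the token embeddings, (N_text, d)
    V = text_tokens @ W_v      # values from the token embeddings, (N_text, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N_latent, N_text)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_latent, c_latent = 16, 8     # e.g. a flattened 4x4 latent feature map
n_text, c_text, d = 5, 12, 8   # 5 text tokens with 12-dim embeddings

latent_tokens = rng.standard_normal((n_latent, c_latent))
text_tokens = rng.standard_normal((n_text, c_text))
W_q = rng.standard_normal((c_latent, d))
W_k = rng.standard_normal((c_text, d))
W_v = rng.standard_normal((c_text, d))

out, attn = cross_attention(latent_tokens, text_tokens, W_q, W_k, W_v)
print(out.shape, attn.shape)
```

Each latent position ends up with a weighted mix of text-token values, which is how the prompt influences every spatial location.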

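Stable Diffusion is trained by taking the noisy latent, the token embeddings, and the time embedding as input, predicting the noise, and minimizing the difference between the actual noise and the predicted noise. At inference time it starts from a latent of pure Gaussian noise, takes the token embeddings as input, and repeats the cycle of predicting the noise with the U-Net and removing the predicted noise. The final latent produced this way is converted into an image by the decoder of the autoencoder.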

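Thanks to its strong performance, Stable Diffusion has been followed by successor models such as Stable Diffusion 2, Stable Diffusion XL (SDXL), and SDXL Turbo. Models that apply it to video have also appeared, and research continues in many directions.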

