
[논문 리뷰] GhostSR: Learning Ghost Features for Efficient Image Super-Resolution

https://arxiv.org/abs/2101.08525

 


 

 

1. Motivation

1) Heavy single image super-resolution (SISR) models

- FLOPs for processing a single 224×224 image

  ×2 EDSR: 2270.9G  /  ResNet50: 4.1G

2) Previous lightweight SISR models still use regular CONV, which suffers from feature redundancy

- Previous lightweight SISR

   - IDN: information distillation network

   - ESRN: neural architecture search (NAS)

   - PAN: pixel attention scheme

- Feature redundancy in deep CNN

   - SISR needs to preserve overall texture and color -> many similar (redundant) features

3) GhostNet is still slow!

- GhostNet: generates ghost features using depth-wise CONV

- latency with a 256×256 input image on a single V100 GPU

  CONV with 64 output channels: 0.15ms

  32-channel CONV + 32-channel depth-wise CONV: 0.19ms
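
A rough sketch of how such a latency comparison could be reproduced (my own benchmark setup, not the authors'; the GhostBlock shapes follow the numbers above, and actual timings depend on hardware and cuDNN):

```python
# Hypothetical micro-benchmark: plain 64-channel CONV vs. GhostNet-style
# 32-channel CONV + 32-channel depth-wise CONV. Assumes PyTorch + CUDA.
import torch
import torch.nn as nn

class GhostBlock(nn.Module):
    """32 intrinsic channels from a regular CONV, 32 'ghost' channels from a
    cheap depth-wise CONV applied on top, concatenated back to 64 channels."""
    def __init__(self):
        super().__init__()
        self.primary = nn.Conv2d(64, 32, 3, padding=1)
        self.cheap = nn.Conv2d(32, 32, 3, padding=1, groups=32)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

@torch.no_grad()
def latency_ms(module, x, iters=100):
    # Warm up, then time with CUDA events so asynchronous GPU execution is measured correctly.
    for _ in range(10):
        module(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        module(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per forward pass

x = torch.randn(1, 64, 256, 256, device="cuda")
plain = nn.Conv2d(64, 64, 3, padding=1).cuda()
ghost = GhostBlock().cuda()
print(f"plain 64-ch CONV      : {latency_ms(plain, x):.3f} ms")
print(f"32-ch CONV + 32-ch DW : {latency_ms(ghost, x):.3f} ms")
```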

 

2. Method

Overview

1) Generating ghost features using shift operation (FLOPs-free)

 

- benefits of the shift operation

   - captures high-frequency / texture information

   - enlarges the receptive field

   - more efficient and faster (FLOPs-free, unlike depth-wise CONV)

 

- learnable shift

   - trainable W

   - Gumbel-Softmax trick

      - relax W with noise N sampled from a Gumbel distribution whose magnitude decays during training
      - apply softmax to get the proxy soft weight W' (softmax normalizes values to 0~1 and makes them sum to 1)
      - feed-forward: use the hard one-hot shift chosen by argmax of W'
      - back-prop: let gradients flow through the soft weight W' (straight-through estimator)
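
A minimal sketch of the learnable-shift idea (my own simplified implementation, not the authors' code; the module name `LearnableShift`, the candidate-offset window, and the `tau` schedule are assumptions). Each ghost channel picks one displacement from a small window; Gumbel-Softmax with a straight-through estimator keeps that discrete choice trainable, and at inference the chosen shift is fixed, so generating the ghost features needs no multiplications:

```python
# Simplified sketch of a Gumbel-Softmax learnable shift for ghost features.
# Not the official GhostSR implementation; names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableShift(nn.Module):
    def __init__(self, channels, max_offset=1):
        super().__init__()
        self.p = max_offset
        # Candidate displacements inside a (2*max_offset+1)^2 window, e.g. 3x3 -> 9 choices.
        offs = range(-max_offset, max_offset + 1)
        self.offsets = [(dy, dx) for dy in offs for dx in offs]
        # Trainable logits W: one categorical shift choice per channel.
        self.W = nn.Parameter(torch.zeros(channels, len(self.offsets)))

    def shift(self, x, dy, dx):
        # Zero-padded spatial shift: pure memory movement, no multiplications.
        x = F.pad(x, (self.p,) * 4)
        h, w = x.shape[-2:]
        return x[..., self.p + dy : h - self.p + dy, self.p + dx : w - self.p + dx]

    def forward(self, x, tau=1.0):
        k = len(self.offsets)
        if self.training:
            # Gumbel noise + softmax relaxation; hard=True uses the one-hot choice in the
            # forward pass while gradients flow through the soft weights (straight-through).
            # tau is typically annealed (decayed) over training.
            one_hot = F.gumbel_softmax(self.W, tau=tau, hard=True)        # [C, K]
        else:
            one_hot = F.one_hot(self.W.argmax(dim=1), k).float()          # fixed shift
        # Stack all candidate shifts and select one per channel via the one-hot weights.
        shifted = torch.stack([self.shift(x, dy, dx) for dy, dx in self.offsets], dim=2)
        return (shifted * one_hot.view(1, -1, k, 1, 1)).sum(dim=2)

# Usage: intrinsic features from a thin CONV, ghost features from the learnable shift.
conv = nn.Conv2d(64, 32, 3, padding=1)
ghost = LearnableShift(32)
x = torch.randn(1, 64, 48, 48)
intrinsic = conv(x)
out = torch.cat([intrinsic, ghost(intrinsic)], dim=1)   # back to 64 channels
print(out.shape)                                        # torch.Size([1, 64, 48, 48])
```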

2) Clustering to find intrinsic features (when a pre-trained model is given)

- vectorize filters from [c_o, c_i, s, s] to [c_o, c_i × s × s]

- apply clustering (k-means)

- select the filter closest to each cluster center as an intrinsic filter (if a cluster has only one member -> take that one) -> hmm..

- when clustering adjacent layers, the indices selected in the previous layer are used to screen out the useful channels of the next layer's filters -> cluster again -> repeat...

- when training from scratch, the index sets c1 and c2 (intrinsic / ghost channels) are simply assigned in order
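
A small sketch of that clustering step (assuming NumPy + scikit-learn for k-means; the function and variable names are mine, not from the paper):

```python
# Rough sketch: pick "intrinsic" filters of a pre-trained CONV layer via k-means.
# Names (select_intrinsic_filters, num_intrinsic) are illustrative, not from the paper.
import numpy as np
from sklearn.cluster import KMeans

def select_intrinsic_filters(weight, num_intrinsic):
    """weight: pre-trained CONV weight of shape [c_o, c_i, s, s].
    Returns indices of the filters kept as intrinsic (one per cluster)."""
    c_o = weight.shape[0]
    flat = weight.reshape(c_o, -1)                       # vectorize: [c_o, c_i * s * s]
    km = KMeans(n_clusters=num_intrinsic, n_init=10).fit(flat)

    intrinsic = []
    for k in range(num_intrinsic):
        members = np.where(km.labels_ == k)[0]
        if len(members) == 1:
            intrinsic.append(int(members[0]))            # singleton cluster -> keep it as-is
        else:
            # Keep the member filter closest to the cluster center.
            d = np.linalg.norm(flat[members] - km.cluster_centers_[k], axis=1)
            intrinsic.append(int(members[np.argmin(d)]))
    return sorted(intrinsic)

# Example: keep half of 64 filters as intrinsic; the other half become ghost features.
w = np.random.randn(64, 64, 3, 3)
print(select_intrinsic_filters(w, num_intrinsic=32))
```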

 

3) Algorithm

 

3. Experiments & Results

- EDSR ×2: Params, FLOPs, and latency roughly halved without performance degradation

- CARN_M: performance even improved

 

- results similar to the original regular-CONV models

- simply reducing the width -> performance degradation

- replacing shift with depth-wise CONV (DW): performance slightly increases but latency increases a lot

 

CARN

Ablation

- w/o learnable shift = ghost features are simply copied

- w/o pre-trained model = trained from scratch

 

Qualitative ablation (copy vs. shift, based on the pre-trained model & clustering)

4. etc.