[ICLR 2023 Paper] Can CNNs Be More Robust Than Transformers?

2024/7/19 11:16:36 · Tags: cnn, transformer, iclr

Table of Contents

  • 0 Abstract
  • 1 Introduction
  • 2 Related Works
  • 3 Settings
    • 3.1 CNN Block Instantiations
    • 3.2 Computational Cost
    • 3.3 Robustness Benchmarks
    • 3.4 Training Recipe
    • 3.5 Baseline Results
  • 4 Component Diagnosis
    • 4.1 Patchify Stem
    • 4.2 Large Kernel Size
    • 4.3 Reducing Activation And Normalization Layers
  • 5 Components Combination
  • 6 Knowledge Distillation
  • 7 Larger Models
  • 8 Conclusion
  • Acknowledgement
  • Reference

Article Reading Record


0 Abstract

  • The prevailing belief: Transformers are inherently more robust than CNNs.
  • We question that belief by closely examining the design of Transformers.
  • The key designs are simple enough to be implemented in several lines of code, namely: a) patchifying input images, b) enlarging the kernel size, and c) reducing activation and normalization layers.

1 Introduction

  • ViT offers a completely different roadmap—by applying the pure self-attention-based architecture to sequences of image patches, ViTs are able to attain competitive performance on a wide range of visual benchmarks compared to CNNs.

Vocabulary: dubbed, vanilla.

2 Related Works

  • Vision Transformers.
  • CNNs striking back.
  • ConvNeXt; this paper shifts the study focus from standard accuracy to robustness.
  • Out-of-distribution robustness.

ResNet Bottleneck block

Vocabulary: counterpart, corruption, rendition, inherently.

  • We show CNNs can in turn outperform Transformers in out-of-distribution robustness.

3 Settings

Vocabulary: thoroughly.

3.1 CNN Block Instantiations

Vocabulary: instantiation.

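One of the instantiations named later in these notes is ResNet-Inverted-DW. A minimal PyTorch sketch of an inverted bottleneck built around a depth-wise convolution is shown below; the 4x expansion ratio, BatchNorm/ReLU choice, and layer ordering are illustrative assumptions, not the paper's exact block definition.

```python
import torch
import torch.nn as nn

class InvertedDWBlock(nn.Module):
    """Illustrative inverted bottleneck with a depth-wise convolution.

    Loosely in the spirit of the paper's ResNet-Inverted-DW instantiation;
    the expansion ratio and norm/activation placement are assumptions.
    """
    def __init__(self, dim: int, kernel_size: int = 3, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.pw1 = nn.Conv2d(dim, hidden, kernel_size=1, bias=False)     # expand
        self.bn1 = nn.BatchNorm2d(hidden)
        self.dw = nn.Conv2d(hidden, hidden, kernel_size,
                            padding=kernel_size // 2, groups=hidden,
                            bias=False)                                   # depth-wise
        self.bn2 = nn.BatchNorm2d(hidden)
        self.pw2 = nn.Conv2d(hidden, dim, kernel_size=1, bias=False)     # project back
        self.bn3 = nn.BatchNorm2d(dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.bn1(self.pw1(x)))
        out = self.act(self.bn2(self.dw(out)))
        out = self.bn3(self.pw2(out))
        return self.act(out + x)  # residual connection

x = torch.randn(1, 64, 56, 56)
print(InvertedDWBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```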

3.2 Computational Cost

Mitigate the increase in computational cost, keeping the compared models roughly comparable in FLOPs.
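To see why depth-wise convolutions leave FLOPs headroom that can be spent elsewhere (e.g., on wider channels or larger kernels), compare the multiply-accumulate counts of a dense and a depth-wise convolution. The numbers below are generic arithmetic for illustration, not figures from the paper.

```python
def conv_flops(k, c_in, c_out, h, w, depthwise=False):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    if depthwise:
        # each channel is convolved independently with its own k x k filter
        return k * k * c_in * h * w
    return k * k * c_in * c_out * h * w

# Example: a 3x3 layer on a 56x56 map with 256 channels.
dense = conv_flops(3, 256, 256, 56, 56)
dw = conv_flops(3, 256, 256, 56, 56, depthwise=True)
print(f"dense: {dense / 1e6:.1f} MFLOPs, depth-wise: {dw / 1e6:.1f} MFLOPs "
      f"(~{dense / dw:.0f}x cheaper)")
```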

3.3 Robustness Benchmarks

Extensively evaluate on several out-of-distribution benchmarks:

  • Stylized-ImageNet — contains synthesized images with shape-texture conflicting cues
  • ImageNet-C — contains common image corruptions
  • ImageNet-R — contains natural renditions of ImageNet object classes with different textures and local image statistics
  • ImageNet-Sketch — contains sketch-style renderings of ImageNet classes

3.4 Training Recipe

Deliberately apply the standard 300-epoch DeiT training recipe to all models for a fair comparison.
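For orientation, the commonly cited DeiT-style hyperparameters look roughly like the dictionary below. These are the standard DeiT defaults from the literature, not values copied from this paper, so treat them as an approximation.

```python
# Commonly cited DeiT-style training hyperparameters (approximate, for orientation;
# not copied from the paper).
deit_recipe = {
    "epochs": 300,
    "optimizer": "AdamW",
    "base_lr": 5e-4,          # typically scaled linearly with batch_size / 512
    "weight_decay": 0.05,
    "warmup_epochs": 5,
    "label_smoothing": 0.1,
    "mixup_alpha": 0.8,
    "cutmix_alpha": 1.0,
    "rand_augment": "rand-m9-mstd0.5",
    "random_erasing": 0.25,
}
```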

3.5 Baseline Results

We use “IN”, “S-IN”, “IN-C”, “IN-R”, and “IN-SK” as abbreviations for “ImageNet”, “Stylized-ImageNet”, “ImageNet-C”, “ImageNet-R”, and “ImageNet-Sketch”, respectively.

4 Component Diagnosis

These designs are as follows: 1) patchifying input images (Sec. 4.1), 2) enlarging the kernel size (Sec. 4.2), and 3) reducing the number of activation and normalization layers (Sec. 4.3).

4.1 Patchify Stem

ViT adopts a much more aggressive down-sampling strategy by partitioning the input image into p×p non-overlapping patches and projecting each patch with a linear layer.

Prior works have investigated the importance of the stem design.

When employing the 8×8 patchify stem, robustness is boosted by at least 0.6%, albeit potentially at the cost of clean accuracy.

The patchify stem plays a vital role in closing the robustness gap between CNNs and Transformers.
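A ViT-style patchify stem is simply a strided convolution whose kernel size equals its stride, so each p×p patch is projected independently. A minimal sketch follows; the embedding width is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

def patchify_stem(in_chans=3, embed_dim=96, patch_size=8):
    """Non-overlapping p x p patch embedding: kernel_size == stride == p."""
    return nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

stem = patchify_stem(patch_size=8)
x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 96, 28, 28]) -> 224 / 8 = 28 patches per side
```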


4.2 Large Kernel Size

One critical property that distinguishes the self-attention operation from the classic convolution operation is its ability to operate on the entire input image or feature map, resulting in a global receptive field.

The importance of capturing long-range dependencies has been demonstrated for CNNs as well.

In this section, we aim to mimic the behavior of the self-attention block by enlarging the convolution kernel size.

The performance gain gradually saturates as the kernel size keeps growing.

An unfair comparison to some extent.
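Enlarging the kernel of a depth-wise convolution is a cheap way to grow the receptive field toward self-attention's more global view. A minimal sketch of swapping a 3×3 depth-wise convolution for larger ones is below; the specific kernel sizes are illustrative, not the paper's exact sweep.

```python
import torch
import torch.nn as nn

def dw_conv(dim, kernel_size):
    """Depth-wise convolution: groups == channels, 'same' padding for odd kernels."""
    return nn.Conv2d(dim, dim, kernel_size,
                     padding=kernel_size // 2, groups=dim)

x = torch.randn(1, 96, 28, 28)
for k in (3, 7, 11):              # illustrative kernel-size sweep
    y = dw_conv(96, k)(x)
    print(k, y.shape)             # spatial size is preserved
```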


4.3 Reducing Activation And Normalization Layers


The optimal position for placing the remaining normalization and activation layers is studied.
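A Transformer block uses far fewer normalization and activation layers than a classic ResNet bottleneck, which has three of each. Below is a minimal sketch of a bottleneck stripped down to a single normalization and a single activation; where they are placed is exactly the "optimal position" question, and the placement shown here is one illustrative choice, not the paper's conclusion.

```python
import torch
import torch.nn as nn

class ReducedNormActBlock(nn.Module):
    """Bottleneck-style block with only one normalization and one activation.

    The placement (norm + activation after the middle conv) is illustrative;
    the paper ablates where the remaining layers should go.
    """
    def __init__(self, dim: int, kernel_size: int = 3, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.Conv2d(hidden, hidden, kernel_size,
                      padding=kernel_size // 2, groups=hidden),
            nn.BatchNorm2d(hidden),          # the single normalization layer
            nn.GELU(),                       # the single activation layer
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return x + self.block(x)             # no activation on the residual sum

print(ReducedNormActBlock(64)(torch.randn(1, 64, 56, 56)).shape)
```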


5 Components Combination

explore the impact of combining all the proposed components on the model’s performance.

along with the corresponding optimal position for placing the normalization and activation layers

An exception here is ResNet-Inverted-DW: we empirically found that using a too-large kernel size is not beneficial for this block.
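Putting the three pieces together, a toy network is just the patchify stem followed by a stack of large-kernel depth-wise blocks that each keep only one normalization and one activation. The widths, depths, and kernel sizes below are placeholders for illustration, not the paper's configurations.

```python
import torch
import torch.nn as nn

# Toy composition of the three components (widths/depths are placeholders,
# not the paper's configuration): a patchify stem, large-kernel depth-wise
# convolutions, and only one normalization + one activation per block.
def toy_block(dim, kernel_size=11):
    return nn.Sequential(
        nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
        nn.BatchNorm2d(dim),   # single normalization layer
        nn.GELU(),             # single activation layer
        nn.Conv2d(dim, dim, 1),
    )

toy_robust_cnn = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=8, stride=8),   # 8x8 patchify stem
    *[toy_block(96) for _ in range(4)],
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(96, 1000),
)
print(toy_robust_cnn(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```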

6 Knowledge Distillation

When the model roles are switched, the student model DeiT-S remarkably outperforms the teacher model ResNet-50 on a range of robustness benchmarks.
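Knowledge distillation trains a student to match a teacher's softened output distribution. Below is a minimal sketch of the standard soft-target loss (temperature-scaled KL divergence plus the usual hard-label term); the temperature and weighting are illustrative values, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Standard soft-target distillation: KL to the teacher + hard-label CE.

    temperature and alpha are illustrative, not the paper's settings.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 8 samples, 1000 classes.
s, t = torch.randn(8, 1000), torch.randn(8, 1000)
y = torch.randint(0, 1000, (8,))
print(distillation_loss(s, t, y))
```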

7 Larger Models

To demonstrate the effectiveness of the proposed models at larger scales.

8 Conclusion

By incorporating these designs into ResNet, we have developed a CNN architecture that can match or even surpass the robustness of a Vision Transformer model of comparable size.

We hope our findings prompt researchers to reevaluate the robustness comparison between Transformers and CNNs, and inspire further investigations into developing more resilient architecture designs.

Acknowledgement

This work is supported by a gift from Open Philanthropy, the TPU Research Cloud (TRC) program, and the Google Cloud Research Credits program.

Reference

https://github.com/UCSC-VLAA/RobustCNN

https://arxiv.org/pdf/2206.03452.pdf


Questions and discussion about the original paper are welcome in the comments.

