Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models: 2 Background (full translation)

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

https://www.researchgate.net/publication/366093127_Diffusion_Art_or_Digital_Forgery_Investigating_Data_Replication_in_Diffusion_Models

This post is a full translation of Section 2, Background, of the paper linked above.

2 Background

2. Background
Below, we review the background and related work in image retrieval, generative models, and memorization literature.

2. Background (translation)
First, we review the background and related work in the literature on image retrieval, generative models, and memorization.

[Image retrieval and copy detection. ]

The process of searching a database for images containing reference features from a source image is known as image retrieval. The related task of inexact copy detection requires high semantic similarity between the source and match [17]. Image retrieval works with image descriptors based on all types of neural networks [3, 53]. High-performance descriptors can be fine-tuned specifically for retrieval after unsupervised training [49, 50] using structure-from-motion (SfM) or contrastive objectives [14, 27]. A natural basis for image retrieval methods are self-supervised models that inherently learn strong feature descriptors, matching similar images to similar representations [11,13,15,28,30]. A particularly relevant SSL method for our purposes is DINO [12], which is shown to perform competitively on instance retrieval tasks.

Recent approaches adopt strong vision transformers as architectural backbones for retrieval [6, 19, 26, 34, 58]. Historical progress in this field is tracked by public image similarity challenges [18]. A recent SOTA approach is SSCD [47], which builds on previous work in self-supervised representation learning and optimizes a descriptor for copy detection using entropic regularization and an array of task-specific data augmentations.

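Searching a database for images that contain reference features from a source image is known as image retrieval. The related task of inexact copy detection requires high semantic similarity between the source image and the matched image [17]. Image retrieval relies on image descriptors computed by all kinds of neural networks [3, 53]. High-performance descriptors can be fine-tuned specifically for retrieval after unsupervised training [49, 50], using structure-from-motion (SfM) (※1) or contrastive objectives [14, 27]. A natural foundation for image retrieval methods is self-supervised models, which inherently learn strong feature descriptors and map similar images to similar representations [11, 13, 15, 28, 30]. The self-supervised learning (SSL) method most relevant to our purposes is DINO [12], which performs competitively on instance retrieval tasks.

Recent approaches adopt strong vision transformers as the architectural backbone for retrieval [6, 19, 26, 34, 58].

Historical progress in this field is tracked by public image similarity challenges [18]. A recent state-of-the-art (SOTA) approach is SSCD [47], which builds on earlier work in self-supervised representation learning and optimizes a descriptor for copy detection using entropic regularization and an array of task-specific data augmentations.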

(※1)

Structure-from-motion (SfM): recovering 3D structure from motion

Overview of structure from motion - MATLAB & Simulink - MathWorks

https://jp.mathworks.com/help/vision/ug/structure-from-motion.html
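
The retrieval step described in this section can be sketched as a nearest-neighbor search over descriptors. The snippet below is a minimal illustration only. It assumes each image has already been mapped to a fixed-length descriptor vector (for example by an SSL model such as DINO or SSCD) and that matches are ranked by cosine similarity. The function names `cosine_similarity` and `retrieve` and the toy three-dimensional descriptors are hypothetical and do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, database, top_k=1):
    """Return (index, similarity) pairs for the top_k most similar descriptors."""
    scored = [(i, cosine_similarity(query, d)) for i, d in enumerate(database)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy descriptors standing in for real network outputs (assumption).
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]

for index, score in retrieve(query, database, top_k=2):
    print(f"database item {index}: similarity {score:.3f}")
```

With these toy vectors the two best matches are items 0 and 2. A real system would typically replace the linear scan with an approximate nearest-neighbor index, since the database here is assumed to be tiny.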

[Memorization in deep learning. ]

While it is widely known and discussed that large models can memorize their data, there is no universally accepted definition of memorization. To ML theorists, memorization is synonymous with overfitting [2, 21, 23]. In the field of membership inference attacks, one seeks to determine whether a chosen image was part of the training set [8, 32, 63, 64]. Indeed, it has been shown that models retain a memory of the contents of their training set, particularly when training samples are repeated [64]. Note that membership inference can be done by reconstructing original training data from the model [63], although this is not the goal of most membership inference methods. The problem of explicitly reconstructing images from the training set of a classifier is known as model inversion, and recent research has been able to do this with both convolutional and transformer models [25, 67]. However, it is crucial to note the relationship of memorization, membership inference, inversion and replication: A generative model that memorizes data might allow for model inversion or only membership inference, yet the same model might never spontaneously generate the training data by accident.

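[Memorization in deep learning (translation)]

It is widely known and discussed that large models can memorize their data, but there is still no universally accepted definition of memorization. To machine learning (ML) theorists, memorization is synonymous with overfitting [2, 21, 23]. In the field of membership inference attacks, the goal is to determine whether a chosen image was part of the training set [8, 32, 63, 64]. Indeed, models have been shown to retain a memory of the contents of their training set, particularly when training samples are repeated [64]. Membership inference can also be carried out by reconstructing the original training data from the model [63], although that is not the goal of most membership inference methods. The problem of explicitly reconstructing images from a classifier's training set is known as model inversion, and recent research has achieved this with both convolutional and transformer models [25, 67].

However, it is crucial to note how memorization, membership inference, inversion, and replication relate to one another: a generative model that memorizes data might allow model inversion, or only membership inference, and yet the same model might never spontaneously generate its training data by accident.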
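
To make the idea of membership inference concrete, here is a minimal sketch of a simple loss-threshold baseline: a sample is guessed to be a training member when the model's loss on it falls below a threshold. This is an assumption chosen for illustration and is not the method of any of the cited works. The loss values, the threshold, and the function names `infer_membership` and `attack_accuracy` are all hypothetical.

```python
def infer_membership(per_sample_loss, threshold):
    """Guess 'member' (True) when the model's loss on a sample is below the threshold."""
    return [loss < threshold for loss in per_sample_loss]


def attack_accuracy(member_losses, nonmember_losses, threshold):
    """Fraction of samples whose membership status is guessed correctly."""
    member_guesses = infer_membership(member_losses, threshold)
    nonmember_guesses = infer_membership(nonmember_losses, threshold)
    correct = sum(member_guesses) + sum(not g for g in nonmember_guesses)
    return correct / (len(member_losses) + len(nonmember_losses))


# Made-up per-sample losses: training members tend to have lower loss (assumption).
member_losses = [0.05, 0.10, 0.20, 0.90]
nonmember_losses = [0.80, 1.10, 1.50, 0.15]

print(attack_accuracy(member_losses, nonmember_losses, threshold=0.5))
```

On these made-up numbers the attack guesses 6 of 8 samples correctly (accuracy 0.75). Note that this only answers "was this sample in the training set?"; it does not reconstruct the sample (inversion) and says nothing about whether the model would ever generate it (replication).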

[Memorization in language. ]

It is well known that generative language models risk replication from their training set [9, 10] and the amount of replicated data is broadly proportional to the size of the model, amount of duplication of the data point in the training set, and the amount of prompting. Interestingly, such replication behavior occurs even for models that are not overfitting to their training data [33,60].

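[Memorization in language (translation)]

Generative language models are well known to risk replicating text from their training set [9, 10], and the amount of replicated data is broadly proportional to the size of the model, the number of times a data point is duplicated in the training set, and the amount of prompting. Interestingly, this replication behavior occurs even in models that are not overfitting to their training data [33, 60].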

[Diffusion models. ]

Diffusion is a process for converting samples from a Gaussian noise distribution into samples from an arbitrary and more complex distribution, such as the distribution of natural images.

We consider several variants of diffusion models. Stable Diffusion is a state-of-the-art text-conditional latent diffusion model [54], trained on the LAION database [57]. The version we analyze in this work (v1.4) was initially trained on over 2B images and then fine-tuned with 600M images from the LAION Aesthetics v2 5+ subset, which is filtered for image quality. We search for matches only in the much smaller 12M LAION Aesthetics v2 6+ split to keep storage costs manageable.

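[Diffusion models (translation)]

In this field, diffusion is a process for converting samples from a Gaussian noise distribution into samples from an arbitrary, more complex distribution, such as the distribution of natural images.

We consider several variants of diffusion models. Stable Diffusion is a state-of-the-art text-conditional latent diffusion model [54], trained on the LAION database [57]. The version analyzed in this work (v1.4) was first trained on over 2 billion (2B) images and then fine-tuned on 600 million (600M) images from the LAION Aesthetics v2 5+ subset, which is filtered for image quality. To keep storage costs manageable, we search for matches only in the much smaller 12M-image LAION Aesthetics v2 6+ split.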
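
The generative direction described above (noise to image) requires a trained network, but the fixed forward noising process that such a model learns to invert can be sketched in a few lines. The snippet below is a minimal sketch, assuming the DDPM-style closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with a linear β schedule. The schedule, its parameter values, and the function names are assumptions for illustration; the section itself does not specify them.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for s <= t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t given x_0 in one shot using the closed-form forward process."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]  # toy stand-in for image pixels or latents

print("signal kept at first step:", math.sqrt(alpha_bars[0]))
print("signal kept at last step: ", math.sqrt(alpha_bars[-1]))
print("fully noised sample:", noise_sample(x0, 999, alpha_bars, rng))
```

Because ᾱ_t shrinks toward zero, the final sample is almost pure Gaussian noise. A diffusion model is trained to reverse these steps, which is what lets it turn noise into samples from the data distribution.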

[Related work.]

Replication behavior in GANs has been studied in a number of works. Meehan et al. [39] describe a hypothesis test that discerns whether generated images are on average closer to the training data than a random sample from a hold-out set. Note that this test is at the population level, and is not designed to flag individual instances of replication. Feng et al. [24] study the conditions that lead GANs to replicate training data. They look for copies in pixel-space and find that such replications are inversely proportional to dataset complexity and dataset size. Webster et al. [63] show on face datasets that GANs can occasionally replicate. Interestingly, these models can produce novel images of known identities from the training data without making verbatim copies. FID scores for ranking GANs favor models that memorize training data [4], leading toward a search for measures of generalization without memorization [29]. This includes “authenticity scores” that detect replication [1], but only in the form of noisy pixel-by-pixel copies of the training data. Similarly, authors of large-scale diffusion models have investigated image replication themselves [40], reducing replication through training data de-duplication, and checking for simple nearest-neighbor matches.

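[Related work (translation)]

Replication behavior in GANs has been studied in many works. Meehan et al. [39] describe a hypothesis test that determines whether generated images are, on average, closer to the training data than a random sample from a hold-out set. Note that this test operates at the population level and is not designed to flag individual instances of replication. Feng et al. [24] study the conditions under which GANs replicate training data. They search for copies in pixel space and find that such replication is inversely proportional to dataset complexity and dataset size. Webster et al. [63] show on face datasets that GANs occasionally replicate. Interestingly, these models can generate new images of identities known from the training data without making verbatim copies. FID scores used for ranking GANs favor models that memorize training data [4], which motivates the search for measures of generalization without memorization [29]. These include "authenticity scores" that detect replication [1], but only when it takes the form of noisy pixel-by-pixel copies of the training data. Similarly, the authors of large-scale diffusion models have investigated image replication themselves [40], reducing it by de-duplicating the training data and checking for simple nearest-neighbor matches.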
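
The population-level comparison described above can be illustrated with a simplified stand-in. The sketch below is not the actual test of Meehan et al. [39]. It assumes plain Euclidean (pixel-space) distance to the nearest training sample and reports a Mann-Whitney-style fraction: how often a generated sample lies closer to the training set than a hold-out sample does. All names and the toy two-dimensional "images" are hypothetical.

```python
import math


def nearest_distance(sample, training_set):
    """Euclidean distance from a sample to its nearest training point."""
    return min(math.dist(sample, t) for t in training_set)


def closer_fraction(generated, holdout, training_set):
    """Fraction of (generated, hold-out) pairs in which the generated sample
    is closer to the training set; ties count as half.

    Values near 0.5 suggest no difference at the population level, while
    values near 1.0 suggest generated samples hug the training data.
    """
    gen_d = [nearest_distance(g, training_set) for g in generated]
    hold_d = [nearest_distance(h, training_set) for h in holdout]
    wins = 0.0
    for g in gen_d:
        for h in hold_d:
            if g < h:
                wins += 1.0
            elif g == h:
                wins += 0.5
    return wins / (len(gen_d) * len(hold_d))


# Toy 2-D points standing in for images (assumption).
training = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
copied = [(0.01, 0.0), (1.0, 1.02), (2.01, 0.01)]   # near-duplicates of training
holdout = [(0.5, 0.4), (1.5, 0.6), (0.2, 0.9)]      # fresh samples

print(closer_fraction(copied, holdout, training))
```

Here every near-duplicate is closer to the training set than every hold-out point, so the fraction is 1.0. As the text notes, a statistic of this kind describes the generated population as a whole and does not flag which individual image is a copy.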

References
(Works cited in this section)

[2] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017.

[3] Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural codes for image retrieval. In European conference on computer vision, pages 584–599. Springer, 2014.

[6] Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos, and Matthijs Douze. Multigrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509, 2019.

[8] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership Inference Attacks From First Principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914, May 2022.

[9] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying Memorization Across Neural Language Models. arXiv:2202.07646 [cs], Feb. 2022.

[10] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Language Models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2

[11] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.

[12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

[14] Wei Chen, Yu Liu, Weiping Wang, Erwin M Bakker, Theodoros Georgiou, Paul Fieguth, Li Liu, and Michael S Lew. Deep learning for instance retrieval: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[15] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.

[17] Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg, and Cordelia Schmid. Evaluation of gist descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 1–8, 2009.

[18] Matthijs Douze, Giorgos Tolias, Ed Pizzi, Zoë Papakipos, Lowik Chanussot, Filip Radenovic, Tomas Jenicek, Maxim Maximov, Laura Leal-Taixé, Ismail Elezi, et al. The 2021 image similarity dataset and challenge. arXiv preprint arXiv:2106.09672, 2021.

[19] Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, and Hervé Jégou. Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644, 2021.

[21] Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.

[23] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020.

[24] Qianli Feng, Chenqi Guo, Fabian Benitez-Quiroz, and Aleix M Martinez. When do gans replicate? on the choice of dataset size. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6701–6710, 2021.

[26] Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision, pages 392–407. Springer, 2014.

[27] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European conference on computer vision, pages 241–257. Springer, 2016.

[28] Jean-Bastien Grill, Florian Strub, Florent Altche, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.

[30] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

[32] Hailong Hu and Jun Pang. Membership Inference Attacks against GANs by Leveraging Over-representation Regions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, pages 2387–2389, New York, NY, USA, Nov. 2021. Association for Computing Machinery.

[33] Matthew Jagielski, Om Thakkar, Florian Tramer, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring Forgetting of Memorized Training Examples. arXiv:2207.00099 [cs], June 2022.

[34] Young Kyun Jang and Nam Ik Cho. Self-supervised product quantization for deep unsupervised image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12085–12094, 2021.

[39] Casey Meehan, Kamalika Chaudhuri, and Sanjoy Dasgupta. A non-parametric test to detect data-copying in generative models. In International Conference on Artificial Intelligence and Statistics, 202

[40] Alex Nichol, Aditya Ramesh, Pamela Mishkin, Prafulla Dhariwal, Joanne Jang, and Mark Chen. DALL·E 2 Pre-Training Mitigations, June 2022.

[47] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022.

[49] Filip Radenovic, Giorgos Tolias, and Ondřej Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European conference on computer vision, pages 3–20. Springer, 2016.

[50] Filip Radenovic, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.

[53] Ali S Razavian, Josephine Sullivan, Stefan Carlsson, and Atsuto Maki. Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications, 4(3):251–258, 2016.

[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[57] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.

[58] Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, and Yannis Avrithis. Boosting vision transformers for image retrieval. arXiv preprint arXiv:2210.11909, 2022.

[60] Kushal Tirumala, Aram H Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. arXiv preprint arXiv:2205.10770, 2022.

[63] Ryan Webster, Julien Rabin, Loic Simon, and Frederic Jurie. This Person (Probably) Exists. Identity Membership Attacks Against GAN Generated Faces. arXiv:2107.06018 [cs], July 2021.

[64] Yuxin Wen, Arpit Bansal, Hamid Kazemi, Eitan Borgnia, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Canary in a Coalmine: Better Membership Inference with Ensembled Adversarial Queries. arXiv:2210.10750 [cs], Oct. 2022.
