Article
Peer-Review Record

On Isotropy of Multimodal Embeddings

Information 2023, 14(7), 392; https://doi.org/10.3390/info14070392
by Kirill Tyshchuk 1,*,†, Polina Karpikova 1,†, Andrew Spiridonov 1,†, Anastasiia Prutianova 1, Anton Razzhigaev 1,2 and Alexander Panchenko 1,2,*
Submission received: 21 April 2023 / Revised: 18 June 2023 / Accepted: 22 June 2023 / Published: 10 July 2023
(This article belongs to the Special Issue Information Visualization Theory and Applications)

Round 1

Reviewer 1 Report

 

The paper describes a number of experiments related to the isotropy of embeddings from OpenAI's CLIP and other methods. In their experiments, the authors mostly use Procrustes, least squares, and whitening, but other transformations such as PCA/SVD are also used.

 

Unfortunately, it's a bit difficult to see a clear goal and coherent structure in the paper. In the current form, it is a mixture of experiments somehow linked to isotropy, but it's not clear how.

 

Also, I do not feel that the paper solves any real, important problem or demonstrates a novel method. I tend to agree with the statement in line 54 that "importance of isotropy in transformer embeddings is disputable." Unfortunately, this undermines the relevance of this paper. Moreover, the proposed transformations are controversial, as they degrade the results in terms of loss (section 4.3) and zero-shot accuracy (section 4.4). There might be some benefit in cosine-related tasks, though it is not very clearly demonstrated in section 4.2. The benefit from dimensionality reduction and distribution alignment is marginal.

 

Section 4 is a somewhat chaotic selection of various experiments. Sections 4.1 and 4.2 fit into an overall (but broad) goal of exploring the isotropy of CLIP embeddings, but I do not see much point behind sections 4.5-4.8. Why should we care about PCA transformation results or multilingual distances, for example? What is the motivation?

 

The least squares problem (LSTSQ) is briefly introduced at line 104, but not enough explanation is provided. Specifically, we do not know 1) how many parameters are used in this model, 2) what type of model it is (linear?), and 3) what problem is solved by this model and what the independent and dependent variables are.

 

"Transformer-XL" embeddings appear in a number of lines in the paper but no citation or explanation is provided, so it's not possible to know what are these. What is the reason of their use?

 

For a journal paper candidate, unusually many references are just preprints without peer review (e.g., arXiv). I would expect most of the bibliography to be either conference or journal papers.

Author Response

Dear reviewer,

First of all, we would like to thank you for your thorough analysis of our work and your suggestions, and to express our sincere gratitude for the valuable time and thoughtful assessment you devoted to our paper. In this revision, we have tried our best to address the remarks. In addition, we have proofread the text.

In the following, each remark is reproduced together with our corresponding comment for easier navigation through the changes made.

Best regards,
Kirill, Polina, Andrew, Anastasiia, Anton, and Alexander 


>>> The paper describes a number of experiments related to the isotropy of embeddings from OpenAI's CLIP and other methods. In their experiments, the authors mostly use Procrustes, least squares, and whitening, but other transformations such as PCA/SVD are also used.

Unfortunately, it's a bit difficult to see a clear goal and coherent structure in the paper. In the current form, it is a mixture of experiments somehow linked to isotropy, but it's not clear how. 

>>> Also I do not feel that the paper solves any real, important problem or demonstrates a novel method. I tend to agree with the statement in line 54 that "importance of isotropy in transformer embeddings is disputable." [...] Section 4 is a somewhat chaotic selection of various experiments. Sections 4.1 and 4.2 fit into an overall (but broad) goal of exploring the isotropy of CLIP embeddings, but I do not see much point behind sections 4.5-4.8. Why should we care about PCA transformation results or multilingual distances, for example? What is the motivation?

Our paper is aimed at the visualisation and analysis of the multimodal embedding space and ways to improve it.

Sections 4.5-4.8 also contribute to this purpose. We decided to make our study multifaceted, as there are many sensible ways to shed light on our topic. We moved the results of the section "PCA for classification" into Section 4.4, "CIFAR-100 zero-shot accuracy", for clarity. The section on zero-shot classification shows that although our transforms change the embeddings significantly, they do not break CLIP's capabilities and in some cases improve them. The section on linear probes makes the same point through the lens of representation learning: the reduction of dimensionality preserves the information necessary for classification via logistic regression.
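For illustration, the following is a minimal sketch of such a linear-probe evaluation. The random arrays, dimensions, and variable names are placeholders standing in for the actual CLIP embeddings and CIFAR-100 labels, not the exact setup used in the paper.

```python
# Illustrative linear-probe sketch: reduce embedding dimensionality with PCA,
# then classify with logistic regression. Random data stands in for CLIP
# embeddings and class labels; all sizes and names are hypothetical.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 512))        # stand-in for 512-d CLIP image embeddings
y = rng.integers(0, 100, size=5000)     # stand-in for 100 class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pca = PCA(n_components=128).fit(X_train)              # keep 128 of 512 dimensions
probe = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)

# If the reduced embeddings preserve class-relevant information, the probe's
# accuracy stays close to that of a probe trained on the full embeddings.
print("linear-probe accuracy:", probe.score(pca.transform(X_test), y_test))
```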

We believe that the lack of consensus on the importance of isotropy only makes it more important to investigate this issue comprehensively. Although our distribution alignment methods improve zero-shot accuracy only in limited cases, we see this as a proof of concept for novel ideas that may inspire more effective methods in future work. We also consider the observation of anisotropy at initialization to be a valuable insight into the properties of modern architectures.


>>> The least squares problem (LSTSQ) is briefly introduced at line 104, but not enough explanation is provided. Specifically, we do not know 1) how many parameters are used in this model, 2) what type of model it is (linear?), and 3) what problem is solved by this model and what the independent and dependent variables are.

We agree that the name "least squares problem" may be too vague. We have added a definition, noting that we are indeed referring to linear transforms, and we provide links to the documentation of these problems in the linear algebra packages that we used.
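For illustration, here is a minimal sketch of the kind of linear least squares fit we mean: a linear map from one embedding space to its paired counterpart. The arrays, dimensions, and the image-to-text direction are placeholders, not the exact configuration used in the paper.

```python
# Minimal LSTSQ sketch: fit a linear map W that sends image embeddings to their
# paired text embeddings by minimizing ||X_img @ W - X_txt||^2.
# The arrays below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 512
X_img = rng.normal(size=(n, d))   # independent variables: image embeddings
X_txt = rng.normal(size=(n, d))   # dependent variables: paired text embeddings

# W has d * d parameters; lstsq solves the linear model X_img @ W ≈ X_txt.
W, residuals, rank, _ = np.linalg.lstsq(X_img, X_txt, rcond=None)

mapped = X_img @ W                # image embeddings mapped into the text space
print(W.shape, float(np.linalg.norm(mapped - X_txt)))
```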

>>> "Transformer-XL" embeddings appear in a number of lines in the paper but no citation or explanation is provided, so it's not possible to know what are these. What is the reason of their use?

We used this model only in Table 2; its metrics originate from the corresponding table in Wang et al., "Improving Neural Language Generation with Spectrum Control". Upon review, we decided to exclude it altogether for clarity, as we use BERT as our example of a transformer model in the experiments.

>>> For a journal paper candidate, unusually many references are just preprints without peer review (e.g., arXiv). I would expect most of the bibliography to be either conference or journal papers.

Where possible, we have replaced arXiv citations with peer-reviewed sources.

 

Reviewer 2 Report

There are several issues that are not adequately addressed in the paper. Here are some of the queries:

1. The embedding space demonstrates strong anisotropy, such that most of the vectors fall within a narrow cone; linear/non-linear transformations of these contextual embeddings are still an unexplored area. Your work merely seems to revalidate the same idea. It would be good if you extended the work to find the relationship between isolated clusters and low-dimensional manifolds.

2. In the multilingual embedding space, once again the major finding is closeness in the space. It is still a challenge to know how outlier dimensions and the degenerated space correlate with a specific language space.

3. How is the degenerated dimension of the anisotropic distribution related to a specific language and domain?

4. There should be a proper justification for using the CLIP loss and zero-shot accuracy in this paper. The domain adaptation study is also not very convincing. There should be some discussion of, and insight from, the results obtained.

5. There should be a discussion of the limitations of this work.

The language and presentation of the manuscript require a major revision as well.

Author Response

Dear reviewer,

First of all, we would like to thank you for your thorough analysis of our work and your suggestions, and to express our sincere gratitude for the valuable time and thoughtful assessment you devoted to our paper. In this revision, we have tried our best to address the remarks. In addition, we have proofread the text.

In the following, each remark is reproduced together with our corresponding comment for easier navigation through the changes made.

Best regards,
Kirill, Polina, Andrew, Anastasiia, Anton, and Alexander 


>>> There are several issues that are not adequately addressed in the paper. Here are some of the queries:

1. The embedding space demonstrates strong anisotropy, such that most of the vectors fall within a narrow cone; linear/non-linear transformations of these contextual embeddings are still an unexplored area. Your work merely seems to revalidate the same idea. It would be good if you extended the work to find the relationship between isolated clusters and low-dimensional manifolds.

2. In the multilingual embedding space, once again the major finding is closeness in the space. It is still a challenge to know how outlier dimensions and the degenerated space correlate with a specific language space.

3. How is the degenerated dimension of the anisotropic distribution related to a specific language and domain?

Our paper is aimed at the visualisation and analysis of the CLIP embedding space and ways to improve it. Apart from revalidating the embedding anisotropy, we also provide an analysis of methods for its mitigation and the insight that it is already present at initialization. Your feedback and insights are greatly appreciated. While we regret that time constraints prevent us from conducting these additional experiments at this stage, we acknowledge the value they could bring to future work.

>>> 4. There should be a proper justification for using the CLIP loss and zero-shot accuracy in this paper. The domain adaptation study is also not very convincing. There should be some discussion of, and insight from, the results obtained.

5. There should be a discussion of the limitations of this work.

We have added details justifying the chosen metrics. Although our distribution alignment methods improve zero-shot accuracy only in limited cases, we see this as a proof of concept for novel ideas that may inspire more effective methods in future work. We have also added more details about the limitations, insights, and the need for future investigation to the conclusion:

"Our study empirically establishes that CLIP embeddings, exhibiting a noticeable anisotropy, reside within a conical structure. We demonstrated that such a formation emerges at initialization and remains largely unchanged due to the absence of any regularization concerning the absolute location of embeddings in CLIP's objective function. To fully understand and address this phenomenon, further exploration of the CLIP architecture is needed.

We found that the isotropy of embeddings can be restored through a simple linear transformation, such as whitening. Furthermore, we identified a method for conducting a learnable linear transformation that, in some cases, can improve performance without incurring substantial computational costs. Although the current scope of these methods is somewhat limited, they could be potentially utilized during or after training to shape an embedding space with desired properties.

The anisotropic characteristic of embeddings extends to the multilingual context as well. In addition, we used the metric properties of the multilingual embeddings to confirm their strong correspondence with the original embeddings. This underscores the consistency of the anisotropic property across diverse linguistic scenarios within the CLIP model."
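As a side note on terminology, by whitening we mean the standard linear transform that centers the embeddings and rescales them along their principal axes so that the empirical covariance becomes the identity. Below is a minimal sketch, using a random anisotropic placeholder array rather than the actual CLIP embeddings.

```python
# Minimal PCA-whitening sketch: center the data, then rescale along the
# principal axes so the empirical covariance becomes (close to) the identity.
# The input is a random anisotropic placeholder, not actual CLIP embeddings.
import numpy as np

def whiten(X, eps=1e-8):
    X_centered = X - X.mean(axis=0, keepdims=True)
    cov = X_centered.T @ X_centered / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)   # columns scaled by 1/sqrt(eigenvalue)
    return X_centered @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512)) @ rng.normal(size=(512, 512))  # anisotropic toy data
X_white = whiten(X)

# Maximum deviation of the whitened covariance from the identity matrix
print(np.abs(np.cov(X_white, rowvar=False) - np.eye(512)).max())
```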

>>> The language and presentation of the manuscript require a major revision as well.

We have made corrections to the language throughout the manuscript with the help of automated tools and colleagues fluent in English.

 

Round 2

Reviewer 1 Report

Thank you for addressing my concerns.

Author Response

Dear reviewer, 

Our team would like to express our gratitude for your work.

In the current version, we have tried to further improve the writing and have added more information on motivation and related work.

Reviewer 2 Report

I am happy to see that many of my suggestions are now part of the paper and that the manuscript is much improved. I am afraid that there are still some places where the insights from the experiments need more explanation, such as the linear probe evaluation and the multilingual CLIP evaluation.

I strongly recommend improving the language of the paper and seeking help from a native English speaker.


Author Response

Dear reviewer, 

We are happy that our edits in the previous round improved the manuscript. Our team would like to thank you for your most insightful recommendations and comments.

In this version, we took into account the language issues and your suggestions regarding the missing explanations. In addition, some motivation and related papers were added to provide the reader with more context for the conducted study.
