I read this article and oh my god, are people doing PCA for reducing the dimensions of #LLM embeddings? I don't have any more polite way of saying it; that is pure stupidity.
No, these embeddings do not have principal dimensions! The variance is spread across practically all of them. A finite dataset will just create the illusion that some dimensions are correlated when in reality they aren't.
Using PCA just shows people don't understand what these embeddings are.
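If you don't believe me, check it yourself. A minimal sketch in Python (the random matrix below is only a placeholder; substitute the embedding matrix from whatever model you actually use):

```python
# Sketch: how flat is the PCA spectrum of your embeddings?
# The random matrix is a placeholder for a real (n_samples, dim) embedding matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 1024))   # placeholder: substitute real embeddings

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

# How many components are needed to cover 90% of the variance?
k90 = int(np.searchsorted(np.cumsum(ratios), 0.90)) + 1
print(f"components for 90% variance: {k90} of {X.shape[1]}")
```

If the spectrum comes out flat, the "minor" components you would be chopping off carry real information.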
Furthermore, people are using embeddings that are far too long. With more than about 1,000 dimensions, pairwise distances concentrate and become approximately equal, and rounding errors start to dominate the comparisons.
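This is the usual distance-concentration effect, and it is easy to demonstrate. A small sketch (random Gaussian vectors, purely for illustration):

```python
# Sketch: distance concentration in high dimensions. The spread of pairwise
# distances shrinks relative to their mean as the dimension grows, so every
# point starts to look roughly equidistant from every other point.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (16, 256, 4096):
    X = rng.standard_normal((500, dim))
    d = pdist(X)                      # all pairwise Euclidean distances
    print(f"dim={dim:5d}  std/mean of distances = {d.std() / d.mean():.4f}")
```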
They compare their method with learning-to-hash methods and all kinds of misinformed baselines, which probably also use embedding vectors that are too long.
Separately, they tested 8-bit quantization of their thousand-dimensional embedding vectors and found that it performs better. I could have told them this beforehand: it is roughly equivalent to dimensionality reduction with a random projection matrix. And this works better than PCA because LLM embeddings are holographic; the information is spread across all the dimensions. Reducing the dimensionality with a random projection is analogous to decreasing the resolution, which in turn is analogous to quantization.
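To make the analogy concrete, here is a sketch comparing the two, with a fresh Gaussian projection matrix and naive int8 quantization; both are stand-ins for illustration, not the method from the article:

```python
# Sketch: random projection vs 8-bit quantization as two ways of throwing
# away "resolution" from an embedding matrix. Placeholder data throughout.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 1024))          # placeholder embeddings

# Random projection down to 256 dimensions (Johnson-Lindenstrauss style).
R = rng.standard_normal((1024, 256)) / np.sqrt(256)
X_proj = X @ R

# Naive 8-bit quantization of the original vectors.
scale = np.abs(X).max()
X_q = np.round(X / scale * 127).astype(np.int8)

def top1_neighbours(A):
    """Index of each row's nearest neighbour by cosine similarity, excluding itself."""
    A = A.astype(np.float32)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    sims = A @ A.T
    np.fill_diagonal(sims, -np.inf)
    return sims.argmax(axis=1)

base = top1_neighbours(X)
print("projection agrees with full precision:", (top1_neighbours(X_proj) == base).mean())
print("int8 agrees with full precision:      ", (top1_neighbours(X_q) == base).mean())
```

The script just measures how often each reduced representation agrees with the full-precision vectors on the nearest neighbour.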
But it works better still if you have some supervised training set for ranking queries against results.
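One way to use that kind of supervision, sketched below, is to learn the projection with an in-batch contrastive loss instead of drawing it at random. This is a generic setup with made-up shapes and hyperparameters, not the article's method:

```python
# Sketch: learn a projection from supervised (query, relevant-result) pairs.
# All shapes, data, and hyperparameters are placeholders for illustration.
import torch
import torch.nn.functional as F

dim_in, dim_out, n_pairs = 1024, 256, 4096
queries = torch.randn(n_pairs, dim_in)   # placeholder query embeddings
results = torch.randn(n_pairs, dim_in)   # placeholder matching-result embeddings

proj = torch.nn.Linear(dim_in, dim_out, bias=False)
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)

for step in range(200):
    idx = torch.randint(0, n_pairs, (256,))
    q = F.normalize(proj(queries[idx]), dim=-1)
    r = F.normalize(proj(results[idx]), dim=-1)
    logits = q @ r.T / 0.05                      # other rows act as in-batch negatives
    loss = F.cross_entropy(logits, torch.arange(len(idx)))
    opt.zero_grad(); loss.backward(); opt.step()
```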
And in any case, you don't want to vector-search queries against documents the way everyone still keeps doing. You want an oranges-to-oranges index: generate example queries for each document and match the incoming query embedding against those example-query embeddings. Oranges to oranges.
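As a sketch of what I mean, where the two helper functions are placeholders (in practice an LLM writes the example queries and a real embedding model embeds the text):

```python
# Sketch of an "oranges to oranges" index: store embeddings of generated
# example queries, each pointing back to its document, and search against those.
import numpy as np

DIM = 256

def generate_example_queries(doc: str, n: int = 3) -> list[str]:
    # Placeholder: a real system would prompt an LLM for n plausible queries.
    return [f"question {i} about: {doc[:30]}" for i in range(n)]

def embed(text: str) -> np.ndarray:
    # Placeholder: hash-seeded random unit vector, stands in for a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(DIM)
    return vec / np.linalg.norm(vec)

documents = ["first document ...", "second document ...", "third document ..."]

# Build the index: one entry per generated example query, pointing back to its document.
index_vectors, index_doc_ids = [], []
for doc_id, doc in enumerate(documents):
    for q in generate_example_queries(doc):
        index_vectors.append(embed(q))
        index_doc_ids.append(doc_id)
index_matrix = np.stack(index_vectors)

# Query time: compare the query embedding with example-query embeddings only.
user_query = "what does the second document say?"
scores = index_matrix @ embed(user_query)
best_doc = index_doc_ids[int(scores.argmax())]
print("best matching document:", documents[best_doc])
```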