GNU social JP
GNU social JP is a Japanese GNU social server.
    Tero Keski-Valkama (tero@rukii.net)'s status on Saturday, 22-Jun-2024 06:32:52 JST

    I read this article and oh my god, are people doing PCA for reducing the dimensions of #LLM embeddings? I don't have any more polite way of saying it; that is pure stupidity.

    No, these embeddings do not have principal dimensions! They span practically all the dimensions. Your dataset will just create an illusion that some dimensions are correlated when in reality they aren't.
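
The claim that a dataset can create an illusion of correlated dimensions can be illustrated with a minimal sketch. It assumes perfectly isotropic Gaussian vectors as a stand-in for embeddings (an assumption, not a statement about any real model) and measures the largest sample correlation between any two dimensions. The function name `max_abs_correlation` and all sizes are made up for the illustration.

```python
import math
import random

def max_abs_correlation(n_samples, n_dims, seed=0):
    """Largest |Pearson r| between any two dimensions of isotropic Gaussian data."""
    rng = random.Random(seed)
    # Every coordinate is drawn independently, so the true correlation is 0.
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_samples)]

    # Center each dimension and scale it to unit length, so that a plain
    # dot product between two columns equals their Pearson correlation.
    cols = []
    for j in range(n_dims):
        col = [row[j] for row in data]
        mean = sum(col) / n_samples
        centered = [x - mean for x in col]
        norm = math.sqrt(sum(x * x for x in centered))
        cols.append([x / norm for x in centered])

    best = 0.0
    for a in range(n_dims):
        for b in range(a + 1, n_dims):
            r = sum(x * y for x, y in zip(cols[a], cols[b]))
            best = max(best, abs(r))
    return best

small = max_abs_correlation(n_samples=20, n_dims=32)
large = max_abs_correlation(n_samples=1000, n_dims=32)
print(f"max |r| with 20 samples:   {small:.2f}")
print(f"max |r| with 1000 samples: {large:.2f}")
```

With few samples, some pair of dimensions looks clearly correlated even though none is by construction; the effect shrinks as the sample grows. Whether real LLM embeddings behave like this toy data is a separate question that the sketch does not answer.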

    Using PCA just shows people don't understand what these embeddings are.

    Furthermore, people are using way too long embeddings. Using embeddings of over 1k dimensions will make all distances approximately equal, and rounding errors will start to dominate.
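
The "all distances approximately equal" effect can be sketched on synthetic data. The example below assumes i.i.d. Gaussian points, which real embeddings are not, so it only illustrates the general tendency; it does not measure rounding error. `relative_contrast` is a hypothetical helper name.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from one random query."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrast = {d: relative_contrast(d) for d in (2, 32, 1024)}
for d, value in contrast.items():
    print(f"{d:>5} dims: relative contrast {value:.3f}")
```

A small relative contrast means the nearest and farthest neighbours are almost the same distance away, which leaves little margin for ranking them.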

    They compare their method with learning-to-hash methods and all kinds of misinformed methods which probably also use embedding vectors that are too long.

    Separately they tested 8-bit quantization of their thousand-dimensional embedding vectors and found it performs better. I could have told them this beforehand; it's roughly equivalent to dimensionality reduction with a random projection matrix. And this works, better than PCA, because LLM embeddings are holographic. Reducing the dimensionality with a random projection is analogous to decreasing the resolution which is analogous to quantization.
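
Both operations mentioned above can be sketched side by side. The example assumes synthetic Gaussian vectors, a dense Gaussian projection matrix, and a simple symmetric 8-bit quantizer; none of this is taken from the linked paper, and all names are hypothetical. It only checks that each operation roughly preserves the cosine similarity of a related pair, not that the two are equivalent.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def random_projection(n_in, n_out, rng):
    """Dense Gaussian projection matrix; no training data is involved."""
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def project(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def quantize_int8(vec):
    """Symmetric 8-bit quantization: scale by max |value|, round into [-127, 127]."""
    peak = max(abs(x) for x in vec)
    return [round(127 * x / peak) for x in vec]

rng = random.Random(0)
dim, reduced = 1024, 128
a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
# b is a noisy copy of a, standing in for a semantically related vector.
b = [x + 0.5 * rng.gauss(0.0, 1.0) for x in a]

matrix = random_projection(dim, reduced, rng)
cos_full = cosine(a, b)
cos_proj = cosine(project(matrix, a), project(matrix, b))
cos_quant = cosine(quantize_int8(a), quantize_int8(b))

print(f"full precision, {dim} dims:   {cos_full:.3f}")
print(f"random projection, {reduced} dims: {cos_proj:.3f}")
print(f"int8 quantized, {dim} dims:   {cos_quant:.3f}")
```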

    But it works better if you have some supervised training set to rank the queries to results.

    And in any case you don't want to match queries to documents by vector search like everyone still keeps doing. Instead you want to build oranges-to-oranges indices: generate example queries for each document and match query embeddings to example-query embeddings. Oranges to oranges.
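
A minimal sketch of such an oranges-to-oranges index follows. Everything in it is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the document ids are invented, and the example queries are hand-written where in practice they would be generated for each document.

```python
import math
import zlib

DIM = 256

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Example queries per document (hand-written here; assumed to be generated
# from the documents in a real system).
EXAMPLE_QUERIES = {
    "doc_pca": [
        "how does pca reduce dimensions",
        "what is principal component analysis",
    ],
    "doc_quant": [
        "how to quantize embeddings to 8 bits",
        "what is int8 quantization",
    ],
}

# The index stores example-query embeddings, never document embeddings.
INDEX = [
    (toy_embed(query), doc_id)
    for doc_id, queries in EXAMPLE_QUERIES.items()
    for query in queries
]

def search(query):
    """Return the document whose example query is most similar to the query."""
    q = toy_embed(query)
    _, doc_id = max(INDEX, key=lambda entry: sum(a * b for a, b in zip(q, entry[0])))
    return doc_id

print(search("how to quantize vectors to 8 bits"))
```

The incoming query is only ever compared with other queries, so both sides of the similarity come from the same kind of text.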

    https://arxiv.org/abs/2205.11498?ref=cohere-ai.ghost.io

GNU social JP is a social network, courtesy of the GNU social JP administrator. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.