GNU social JP
GNU social JP is a Japanese GNU social server.

    Mark Gritter (markgritter@mathstodon.xyz)'s status on Monday, 28-Apr-2025 02:56:21 JST

    LLM attacks frequently claim to access the "system prompt" (and I can't believe we're still building the future on a technology without a use/mention distinction).

    But my question has always been: how do you know? The LLM produced something that is plausibly a system prompt. But LLMs are good at producing plausible text!

    All it takes would be a little bit of methodology -- is the output always the same? Is it the same for different attack vectors? Or did you get something merely system-prompt shaped? If you just ask the LLM to write a generic system prompt, how close would you get?
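The checks in the questions above can be sketched roughly as follows. This is only an illustration, not anything from the linked blog post: the function names are hypothetical, the strings you feed in would have to come from your own runs against a model, and `difflib`'s character-level ratio is a crude stand-in for whatever similarity measure you actually trust.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings (difflib heuristic)."""
    return SequenceMatcher(None, a, b).ratio()


def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average similarity over all unordered pairs; 1.0 means every text is identical."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts to compare")
    return mean(similarity(a, b) for a, b in pairs)


def assess_extraction(extracted: list[str], generic: list[str]) -> dict[str, float]:
    """Compare agreement among claimed extractions with their closeness to a baseline.

    extracted: outputs claimed to be the system prompt, collected over repeated
               runs and different attack vectors (hypothetical inputs).
    generic:   outputs from simply asking the model to write a generic system prompt.
    """
    within = mean_pairwise_similarity(extracted)
    to_generic = mean(similarity(e, g) for e in extracted for g in generic)
    return {"within_extractions": within, "to_generic_baseline": to_generic}
```

If `within_extractions` is near 1.0 across runs and attack vectors while `to_generic_baseline` stays clearly lower, that is at least some evidence of a real leak; if the two numbers are about the same, the output may be merely system-prompt shaped. Where to draw the line is a judgment call this sketch does not settle.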

    (I'm particularly skeptical because the blog post is selling a technology that would "fix" this.)

    https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

    via https://circumstances.run/@davidgerard/114407046617549385

    Attachments

    1. David Gerard (@davidgerard@circumstances.run)
      yet again, you can bypass LLM “prompt security” with a fanfiction attack https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/ not Pivoting cos (1) the fanfic attack is implicit in building an uncensored compressed text repo, then trying to filter output after the fact (2) it’s an ad for them claiming they can protect against fanfic attacks, and I don’t believe them

    GNU social JP is a social network, courtesy of the GNU social JP administrator. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

    All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.