Notices by lain (lain@fediffusion.art), page 2

lain (lain@fediffusion.art)'s status on Thursday, 20-Jun-2024 21:22:11 JST lain

> Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions.

https://arxiv.org/abs/2406.11717
In conversation about a year ago from fediffusion.art permalink
Attachments
1. arxiv.org
  
  Refusal in Language Models Is Mediated by a Single Direction
  
  Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
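
The two operations the abstract describes, erasing a direction from activations and adding it back, can be illustrated with plain linear algebra. This is only a toy sketch on random vectors, not the paper's code and not tied to any real model. It assumes that "erasing" means orthogonal projection onto the complement of a unit vector; the names `ablate_direction` and `add_direction`, the array shapes, and the scale value are hypothetical.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def ablate_direction(acts, direction):
    """Remove the component of each row of `acts` along `direction`.

    Assumption: 'erasing' is the orthogonal projection x - (x . r) r
    with r a unit vector.
    """
    r = unit(direction)
    return acts - np.outer(acts @ r, r)

def add_direction(acts, direction, scale=1.0):
    """Shift each row of `acts` by `scale` along the unit direction."""
    return acts + scale * unit(direction)

# Toy stand-ins: 4 random "activation" vectors of width 8 and a random
# direction. Nothing here comes from an actual model.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
direction = rng.normal(size=8)

ablated = ablate_direction(acts, direction)
steered = add_direction(acts, direction, scale=3.0)

# After ablation, the rows have (numerically) zero component along the direction.
print(np.abs(ablated @ unit(direction)).max())
```

Ablation is idempotent: applying it a second time leaves the vectors unchanged, since there is nothing left along the direction to remove.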
lain (lain@fediffusion.art)'s status on Friday, 17-May-2024 22:23:18 JST lain

I have noticed that, somewhat contrary to what I would have expected, religious / spiritual people have few problems with LLMs (in fact, they have a lot of 'discussions with ChatGPT about faith' podcasts out), while the people who are deathly afraid of them and think they will be the downfall of society are overwhelmingly materialist progressive types. I have some ideas about this, but for now it's just an observation.

In conversation about a year ago from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Monday, 11-Mar-2024 22:16:55 JST lain
in reply to
- kaia
@kaia I think he really mostly wants AI to be open. But we'll see what 'open sourcing' actually means.

In conversation Monday, 11-Mar-2024 22:16:55 JST from gnusocial.jp permalink
lain (lain@fediffusion.art)'s status on Monday, 11-Mar-2024 21:30:51 JST lain

👀
In conversation Monday, 11-Mar-2024 21:30:51 JST from fediffusion.art permalink
Attachments
1. Untitled attachment
  https://fediffusion.art/media/c6655bc1ea1dad2ab3dc789c7638617c30b0e7e37b8b22f0729e2d217550deb7.png
lain (lain@fediffusion.art)'s status on Friday, 09-Feb-2024 23:38:20 JST lain
in reply to
- kaia
- Moon
@Moon @kaia it's still crazy what we can do right now with local models, and you haven't even finetuned it, just prompt engineering.

In conversation Friday, 09-Feb-2024 23:38:20 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Tuesday, 06-Feb-2024 19:40:25 JST lain
- Big Richard
And the results are in!

The winner of last week's image generation contest is... @Big_Richard !!

Thanks to everyone who participated, hope you'll be joining us again for the next one!

You can see the poll here: https://fediffusion.art/notice/AbKn4V4OymI7E4kykS
In conversation Tuesday, 06-Feb-2024 19:40:25 JST from fediffusion.art permalink
Attachments
1. Untitled attachment
  https://fediffusion.art/media/822601ea02d720ce5e22568ede71f5c7a44c4c0a7c4c78da68d6539cb0ecf41f.png
2. fediffusion.art
  
  lain (@lain@fediffusion.art)
  
  Hello everyone! Here are the submissions for our first contest! The theme this week was the classic of image generation, the 1girl! Very varied, also check out the original post for comments by th...
lain (lain@fediffusion.art)'s status on Wednesday, 31-Jan-2024 18:04:36 JST lain

the new llava release is 🔥
In conversation Wednesday, 31-Jan-2024 18:04:36 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Tuesday, 30-Jan-2024 04:08:34 JST lain

what's the songlist?
In conversation Tuesday, 30-Jan-2024 04:08:34 JST from fediffusion.art permalink
Attachments
1. Untitled attachment
  https://fediffusion.art/media/ea02c8fd6605ac56dcc9388f0301bc1a24a577c3a27c3610f381272183389e00.png
lain (lain@fediffusion.art)'s status on Saturday, 27-Jan-2024 23:54:34 JST lain
- 受不了包
@shibao you and me both buddy

In conversation Saturday, 27-Jan-2024 23:54:34 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Sunday, 07-Jan-2024 03:51:06 JST lain

> https://huggingface.co/microsoft/phi-2/commit/7e10f3ea09c0ebd373aebc73bc6e6ca58204628d

What are they all doing? Is this all safeguarding against regulation? Make it all free and open before it can be banned?
In conversation Sunday, 07-Jan-2024 03:51:06 JST from fediffusion.art permalink
Attachments
1. cdn-thumbnails.huggingface.co
  
  Upload 3 files · microsoft/phi-2 at 7e10f3e
  
  We’re on a journey to advance and democratize artificial intelligence through open source and open science.
lain (lain@fediffusion.art)'s status on Tuesday, 02-Jan-2024 21:03:41 JST lain
Whoops! New Year happened and I missed the end of the vote!

We have three winners this time, in a three-way tie at 6 votes each: @guizzy, @kaiaskutes and @Elliptica. Congratulations! Thank you all for participating, let's get some good generations going in 2024!

RT: https://fediffusion.art/objects/a5b51368-1518-4795-bbdc-ae254baae4ea
In conversation Tuesday, 02-Jan-2024 21:03:41 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Sunday, 24-Dec-2023 03:31:02 JST lain
in reply to
- lhl
@lhl ??? How does this work?

In conversation Sunday, 24-Dec-2023 03:31:02 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Wednesday, 20-Dec-2023 07:36:17 JST lain

> I expect the only people who are nonplussed by the power of LLMs are those with a soft spot for occultism of some sort—those who think words are magical. Let me explain.
> Let me repeat: there is so much abstract structure in our language—the patterns are so overwhelmingly clear, consistent, and objective—that by mindlessly figuring out the probability of one symbol following another, a machine can effectively reason better than the average person for a large number of cases.

https://steve-patterson.com/why-language-machines-do-not-have-souls/
In conversation Wednesday, 20-Dec-2023 07:36:17 JST from fediffusion.art permalink
Attachments
1. steve-patterson.com
  
  Why Language Machines do not have Souls
  
  from Steve Patterson
  
  It’s been nine months since GPT4 was released. I’m still trying to make sense of things. There’s a dearth of level-headed analysis out there. Most people’s analysis seems to be framed by science fiction novels, or they are still using frameworks inherited from the pre-GPT world, which did not anticipate the success of LLMs. Even …
lain (lain@fediffusion.art)'s status on Wednesday, 20-Dec-2023 07:36:15 JST lain
in reply to
- Ruru! 🦉
@lonelyowl i mean, yeah, after seeing a machine do it, it's "no surprise", but it sure was a surprise before it happened

In conversation Wednesday, 20-Dec-2023 07:36:15 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Monday, 18-Dec-2023 21:52:47 JST lain

Reminder that we still have a CONTEST going on! Because I'm traveling, the deadline is extended to FRIDAY, the 22nd.

Get your entries in before it's too late!

RT: https://fediffusion.art/objects/a597ce7e-5ebf-40ec-8242-d15241ebdd2a
In conversation Monday, 18-Dec-2023 21:52:47 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Sunday, 10-Dec-2023 02:28:38 JST lain
in reply to
- Sexy Moon
- lain
please contribute with your resurrected 4090 power, @Moon

In conversation Sunday, 10-Dec-2023 02:28:38 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Saturday, 09-Dec-2023 20:12:29 JST lain

NEW CONTEST!

Once again we're doing a week-long AI image creation contest! This time the topic is:

STORY ILLUSTRATIONS

Ever read a story and imagined what the scene would look like? Well, now you can show it to all of us! Pick a scene from any story or novel you like and create an image of it. Please tell us which story you are taking inspiration from!

The voting will start one week from now, so get your entries in before that.

Here's an example: A scene from the Yasutaka Tsutsui story "Standing Woman".
In conversation Saturday, 09-Dec-2023 20:12:29 JST from fediffusion.art permalink
Attachments
1. Untitled attachment
  https://fediffusion.art/media/911f94c10ae19df62d92711687ffb893383b01e2ca674e9a20b545839b817881.png
lain (lain@fediffusion.art)'s status on Wednesday, 06-Dec-2023 01:21:15 JST lain

if a 30gb thinking file revealed mitsu's pregnancy i'll literally chuckle

In conversation Wednesday, 06-Dec-2023 01:21:15 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Wednesday, 06-Dec-2023 01:21:14 JST lain
in reply to
- Guizzy
@guizzy better than going to a doctor in canada, i heard...

In conversation Wednesday, 06-Dec-2023 01:21:14 JST from fediffusion.art permalink
lain (lain@fediffusion.art)'s status on Saturday, 02-Dec-2023 02:05:15 JST lain
in reply to
- ロミンちゃん
@romin they are releasing so many AI tools all the time, it's crazy.

In conversation Saturday, 02-Dec-2023 02:05:15 JST from fediffusion.art permalink
