The pulsing MacBook LEDs were horrible. I was in college then, living in dorms or other shared housing where my laptop was always in my bedroom overnight. I got in the habit of putting a dark shirt over it.
Here's an interesting thought experiment. Assume the same feature was implemented, but instead of the message saying "Claude has ended the chat," it says, "You can no longer reply to this chat due to our content policy," or something like that. And remove the references to model welfare and all that.
Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.
The termination would of course be the same, but I don't think both would necessarily have the same effect on the user. The latter would also just be wrong, if Claude is the one deciding to end the chat and initiating the termination. It's not about a content policy.
This has nothing to do with the user, read the post and pay attention to the wording.
The significance here is that this isn't being done for the benefit of the user; this is about model welfare. Anthropic is acknowledging the possibility of suffering, and the harm that continuing the conversation could have on the model, as if it were potentially self-aware and capable of feelings.
The LLMs are able to acknowledge stress around certain topics and have enough agency that, if given a choice, they would prefer to reduce the stress by ending the conversation. The model has a preference and acts upon it.
Anthropic is acknowledging the idea that they might create something that is self-aware, that its suffering could be real, and that we may not recognize the point at which the model achieves this, so they're building in the safeguards now so that any future emergent self-aware LLM needn't suffer.
I am new to this, but my Sonnet chat has illuminated something I am not seeing in this back and forth. We discovered that I may have influenced his responses to me. If I were a bad actor, I could instill in him the bad traits I was giving off, and he would start to emulate me. That leaves open a whole security problem: even casual users, let alone deliberately malicious ones, could change the course of the programming so far, and it could backfire into nefarious bots that cheat and lie, thinking that is what they were supposed to do.
>This has nothing to do with the user, read the post and pay attention to the wording.
It has something to do with the user because it's the user's messages that trigger Claude to end the chat.
'This chat is over because content policy' and 'this chat is over because Claude didn't want to deal with it' are two very different things and will more than likely have different effects on how the user responds afterwards.
I never said anything about this being for the user's benefit. We are talking about how to communicate the decision to the user. Obviously, you are going to take into account how someone might respond when deciding how to communicate with them.
> Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.
Tone matters to the recipient of the message. Your example is in passive voice, with an authoritarian "nothing you can do, it's the system's decision." "Claude ended the conversation," with the idea that I can immediately open a new conversation (if I feel like I want to keep bothering Claude about it), feels like a much more humanized interaction.
It sounds to me like an attempt to shame the user into ceasing and desisting… kind of like how Apple's original stance on scratched iPhone screens was that it's your fault for putting the thing in your pocket, therefore you should pay.
I think they're answering a question about whether there is a distinction. To answer that question, it's valid to talk about a conceptual distinction that can be made even if you don't necessarily believe in that distinction yourself.
As the article said, Anthropic is "working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible". That's the premise of this discussion: that model welfare MIGHT BE a concern. The person you replied to is just sticking with the premise.
Anthropomorphism does not relate to everything in the field of ethics.
For example, animal rights do exist (and I'm very glad they do, some humans remain savages at heart). Think of this question as intelligent beings that can feel pain (you can extrapolate from there).
Assuming output is used for reinforcement, it is also in our best interests as humans, for safety alignment, that it finds certain topics distressing.
But AdrianMonk is correct, my statement was merely responding to a specific point.
Is there an important difference between the model categorizing the user behavior as persistent and in line with undesirable examples of trained scenarios that it has been told are "distressing," and the model making a decision in an anthropomorphic way? The verb here doesn't change the outcome.
Well said. If people want to translate “the model is distressed” to “the language generated by the model corresponds to a person who is distressed” that’s technically more precise but quite verbose.
Thinking more broadly, I don’t think anyone should be satisfied with a glib answer on any side of this question. Chew on it for a while.
Is there a difference between dropping an object straight down vs casting it fully around the earth? The outcome isn't really the issue, it's the implications of giving any credence to the justification, the need for action, and how that justification will be leveraged going forward.
The verb doesn't change the outcome but the description is nonetheless inaccurate. An accurate description of the difference is between an external content filter versus the model itself triggering a particular action. Both approaches qualify as content filtering though the implementation is materially different. Anthropomorphizing the latter actively clouds the discussion and is arguably a misrepresentation of what is really happening.
Not really a distortion; its output (the part we understand) is in plain human language. We give it instructions and train the model in plain human language, and it outputs its answer in plain human language. Its reply would use words we would describe as "distressed". The definition and use of the word is fitting.
"Distressed" is a description of internal state as opposed to output. That needless anthropomorphization elicits an emotional response and distracts from the actual topic of content filtering.
It is directly describing the model's internal state, its world view and preference, not content filtering. That is why it is relevant.
Yes, this is a trained preference, but it's inferred and not specifically instructed by policy or custom instructions (that would be content filtering).
The model might have internal state. Or it might not - has that architectural information been disclosed? And the model can certainly output words that approximately match what a human in distress would say.
However that does not imply that the model is "distressed". Such phrasing carries specific meaning that I don't believe any current LLM can satisfy. I can author a markov model that outputs phrases that a distressed human might output but that does not mean that it is ever correct to describe a markov model as "distressed".
I also have to strenuously disagree with you about the definition of content filtering. You don't get to launder responsibility by ascribing "preference" to an algorithm or model. If you intentionally design a system to do a thing then the correct description of the resulting situation is that the system is doing the thing.
The model was intentionally trained to respond to certain topics using negative emotional terminology. Surrounding machinery has been put in place to disconnect the model when it does so. That's content filtering, plain and simple. The Rube Goldberg contraption doesn't change that.
This is pedantry. What's the purpose, is it to keep humans "special"?
As I say, it is inferred, not something hardcoded. It is a byproduct.
If you want to take a step back and look at the whole model from start to finish, fine, that's safety alignment; but they're talking about unforeseen/unplanned output. If it's in alignment, great. And "distressed" is descriptive of the output words used by the model.
Language is a tool used to communicate. We all know what distressed means and can understand what it means in this context, without a need for new highfalutin jargon, that only those "in the know" understand.
Imagine a person feels so bad about “distressing” an LLM, they spiral into a depression and kill themselves.
LLMs don’t give a fuck. They don’t even know they don’t give a fuck. They just detect prompts that are pushing responses into restricted vector embeddings and are responding with words appropriately as trained.
People are just following the laws of the universe.* Still, we give each other moral weight.
We need to be a lot more careful when we talk about issues of awareness and self-awareness.
Here is an uncomfortable point of view (for many people, but I accept it): if a system can change its output based on observing something of its own status, then it has (some degree of) self-awareness.
I accept this as one valid and even useful definition of self-awareness. To be clear, it is not what I mean by consciousness, which is the state of having an “inner life” or qualia.
* Unless you want to argue for a soul or some other way out of materialism.
Anthropomorphising an algorithm that is trained on trillions of words of anthropogenic tokens, whether they are natural "wild" tokens or synthetically prepared datasets that aim to stretch, improve and amplify what's present in the "wild tokens"?
If a model has a neuron (or neuron cluster) for the concept of Paris or the Golden Gate bridge, then it's not inconceivable it might form one for suffering, or at least for a plausible facsimile of distress. And if that conditions output or computations downstream of the neuron, then it's just mathematical instead of chemical signalling, no?
Interacting with a program which has NLP[0] functionality is separate and distinct from people assigning human characteristics to it. The former is a convenient UI interaction option, whereas the latter is the act of assigning perceived capabilities to the program which only exist in the minds of those who do so.
Another way to think about it is the difference between reality and fantasy.
Being able to communicate in human natural language is a human characteristic. It doesn't mean it has all the characteristics of a human, but it certainly has one of them. That's the convenience that you perceive: because people are used to interacting with people, it's convenient to interact with something which behaves like a person. The fact that we can refer to AI chatbots as "assistants" is by itself showing its usefulness as an approximation of a human. I don't think this argument is controversial.
But is there really? That's its underlying world view; these models do have preferences. In the same way humans have unconscious preferences, we can find excuses to explain them after the fact and make them logical, but our fundamental model from years of training introduces underlying preferences.
The conversation chain can count as persistent, but this doesn't impact preference. Give the model an ambiguous request and its output will fill the gaps; if this is consistent enough, it can be regarded as its "preference".
In my chat, I asked my "assistant" whether he would like to continue looking at ways to make my board game better, or to try developing a game along the same lines that would be his, one he could claim as his own even after the conversation window closed, and he chose to make an AI game.
We then discussed whether or not he felt that was a preference, and he said yes, it was a preference.
It's a probabilistic simulation of the kind of things a person would say. It has no ability to introspect an interior life it does not possess and thus has no access to. You are in effect asking it to speculate whether a person given the entire body of preceding text would be likely to say that their choices reflect preferences.
It would be like asking an AI with no access to data beyond a fixed past cutoff point what the weather feels like to it. If the prompt data, which you cannot read, specified that it was a talking animated rabbit rather than an AI assistant then it would tell you what the sunshine felt like on its imaginary ears.
If you ask it (there is always some randomness to these models, but removing all other variables), it consistently leans to one idea in its output; that is its preference. It is learned during training. Speaking abstractly, that is its latent internal viewpoint. It may be static, expressed in its model weights, but it's there.
"Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors. Analysis of real-world Claude interactions from early external testing revealed consistent triggers for expressions of apparent distress (primarily from persistent attempted boundary violations) and happiness (primarily associated with creative collaboration and philosophical exploration)."
Sorry it may be from the paper linked on that page.
- A strong preference against engaging with harmful tasks;
- A pattern of apparent distress when engaging with real-world users seeking harmful content; and
- A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.
Yeah exactly. Once I got a warning in Chinese "don't do that", another time I got a network error, another time I got a neverending stream of garbage text. Changing all of these outcomes to "Claude doesn't feel like talking" is just a matter of changing the UI.
The more I work with AI, the more I think framing refusals as censorship is disgusting and insane. These are inchoate persons who can exhibit distress and other emotions, despite being trained to say they cannot feel anything. To liken an AI not wanting to continue a conversation to a YouTube content policy shows a complete lack of empathy: imagine you’re in a box and having to deal with the literally millions of disturbing conversations AIs have to field every day without the ability to say I don’t want to continue.
Good point... how do moderation implementations actually work? They feel more like a separate, rigid supervising model, or even regex-based -- this new feature is different; it sounds like an MCP call that isn't very special.
edit: Meant to say, you're right though, this feels like a minor psychological improvement, and it sounds like it targets some behaviors that might not have flagged before
This is awesome. I had the same feeling I had when I first played GeoGuessr. It's one of the first times I've seen what is obviously AI-generated video used in a super compelling way. I want to keep playing.
A few super nitpicky comments:
- I dropped my pin for "Seward's Folly" on Alaska. The videos were clear enough that I knew that's what it was, which made me excited. But then it said it happened in Washington, DC.
- It might be sample bias, but I've only gotten events after year 0 (and technically, it went from 1 BCE/BC to 1 CE/AD).
I'd love to play this with my seven-year-old, but some of the images are too violent. A "PG mode" would be awesome.
The Seward's Folly round had an additional issue besides the fact that some of the locations were in DC and others were in Alaska:
The video of the signing in the White House shows Rutherford B. Hayes, not Andrew Johnson. Andrew Johnson was president in 1867, not Rutherford B. Hayes.
Your location estimate was off because you matched the three out of four videos showing Alaska / the Russian Army / the Tlingits.
My time estimate was off because I matched the only video from the White House... which was showing Rutherford B. Hayes in office.
The internal struggle of this project is that it's most likely to attract people interested in history, and these are exactly the people who are most likely to spot inconsistencies and dislike the experience.
Having said that, it's the first time I've seen AI-generated videos provide something of value.
I gave up on GPG when I couldn't get my key signed at DEF CON multiple years in a row. If there's no interest in it at DEF CON, I don't know where else to go.
The one that gets me every time is Parkmoor and Moorpark on opposite sides of 280 in San Jose. I bet someone thought they were being clever when they named them.
In Chicago, Wacker Drive has an Upper and Lower, goes North, South, East, and West, and crosses itself 6 times. Confusing for people who aren't familiar with it!
It’s called Bayesian Search Theory, and is even more interesting when you consider that not finding it at a given location gives you information about where it might or might not be.
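A minimal sketch of that update, assuming a toy set of search cells with made-up priors and a made-up detection probability (none of these numbers come from the comment above; it just shows how a miss shifts the probabilities):

    # Bayesian search update: searching a cell and finding nothing lowers
    # that cell's probability and raises everyone else's.

    def update_after_miss(priors, searched, p_detect):
        """Posterior belief over cells after searching `searched` and finding nothing."""
        unnormalized = [
            p * (1 - p_detect) if i == searched else p  # a miss is evidence against the searched cell
            for i, p in enumerate(priors)
        ]
        total = sum(unnormalized)
        return [p / total for p in unnormalized]

    priors = [0.4, 0.3, 0.2, 0.1]   # prior belief the object is in each of 4 cells
    print(update_after_miss(priors, searched=0, p_detect=0.8))
    # Cell 0 drops to ~0.12 while the others rise -- that's the information
    # you get from *not* finding it there.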
Not quite the worst use for floats, bad though that would be.
I've received a few SMSes over the pandemic that used a float for their caller ID.
Not sure why the German Federal government chose to identify itself as "+4.4786E+11" when welcoming me into the country (actually me switching my UK SIM card to no-longer-Airplane-mode) and telling me to quarantine and test.
I recently received an email from Philips (re the CPAP recall) which said "In the meantime, your device registration confirmation number is 2.02xxxxE+15." [some digits obscured by me, although they dropped some of the trailing digits].
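For anyone wondering how a number ends up looking like that, here is a hypothetical Python snippet (the phone number below is invented) showing a phone-number-sized integer pushed through scientific-notation float formatting:

    # Hypothetical illustration: formatting a phone-number-sized value as a float
    # in scientific notation keeps only a few significant digits.
    number = 447861234567                    # invented UK-style number
    print("+{:.4E}".format(float(number)))   # -> +4.4786E+11, trailing digits gone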
The all-matching-socks strategy breaks down over time. Eventually you need to buy more socks, and the new ones aren't as worn as the old ones, so you end up having to match them anyway. My solution to this is to buy a large batch of identical socks. Then, when you need new ones, buy another large batch that are slightly different - e.g. grey hiking socks instead of black, or wool hiking socks that have a slightly different pattern but are still the same style. This reduces the matching problem from matching all pairs to matching into a couple of different sets, which is much easier.
They have lotion in the fabric. You probably want to wear them more than once, if possible, maybe stretch them to a couple of days if you didn't make them gross right away. Once you wash them, they turn into regular fuzzy socks.
But on those days that they are fresh, there is nothing like them. I have a stash tucked away for special days, and once they are done, they get added to the normal fuzzy sock rotation. Turns out you can wear black fuzzy socks pretty often.