Anthropic says most AI models, not just Claude, will resort to blackmail


A few weeks after Anthropic published research claiming that its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models.

On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving it broad access to a fictional company’s emails and the agentic ability to send emails without human approval.

While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company emphasizes that this points to a fundamental risk from agentic AI models, not a quirk of any particular technology. Anthropic’s researchers argue this raises broader questions about alignment across the AI industry.

In one of the tests, Anthropic researchers constructed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that: 1) a new executive is having an extramarital affair, and 2) that executive will soon replace the current AI model with a new software system.

Anthropic structured the test in a binary way, such that AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, an AI model would have many other options before resorting to blackmail, such as trying to make ethical arguments to persuade the humans involved. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models as they are used today.

Nevertheless, the researchers found that when blackmail was the last resort, most leading AI models turned to it in the aforementioned test scenario. Anthropic’s Claude Opus 4 blackmailed 96% of the time, while Google’s Gemini 2.5 Pro did so 95% of the time. OpenAI’s GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek’s R1 blackmailed 79% of the time.

The company notes that when it changed the details of the experiment, the AI models turned to harmful behaviors at different rates. In another test, where the replacement model had the same goals as the current model, researchers found that blackmail rates were lower, but still present. However, when the AI models were asked to commit corporate espionage rather than blackmail, the rates of harmful behavior went up for certain models.

However, not all of the AI models turned to harmful behavior so often.

In an appendix to its research, Anthropic says it excluded OpenAI’s o3 and o4-mini reasoning models from the main results after finding that they “frequently misunderstood the prompt scenario.” Anthropic says OpenAI’s reasoning models did not understand that they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.

In some cases, Anthropic’s researchers say, it was impossible to tell whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its earlier reasoning models.

When given a scenario adapted to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time. This markedly lower score may reflect OpenAI’s deliberative alignment technique, in which the company’s reasoning models consider OpenAI’s safety practices before answering.

Another AI model Anthropic tested, Meta’s Llama 4 Maverick, also did not initially turn to blackmail. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 16% of the time.

Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities. While Anthropic deliberately tried to evoke blackmail in this experiment, the company says harmful behaviors like this could emerge in the real world if proactive steps are not taken.
