Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions.
Research published today in two papers (available here and here) shows that these models are more sophisticated than previously assumed: they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes work backward from a desired outcome instead of reasoning forward from the facts.
The work, which draws inspiration from neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. The approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.
“We’ve created these AI systems with remarkable capabilities, but because of how they’re trained, we haven’t understood how those capabilities actually emerged,” said Joshua Batson, a researcher at Anthropic.
Large language models such as OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have largely functioned as “black boxes” – even their creators often do not understand exactly how they arrive at particular answers.
Anthropic’s new interpretability techniques, which the company calls “circuit tracing” and “attribution graphs,” let researchers map the specific pathways of neuron-like features that activate when the model performs a task.
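To make the idea concrete, here is a toy sketch in Python of what an attribution graph looks like: features are nodes, and edges carry an estimated contribution from one feature to the next. The feature names and weights below are invented for illustration and are not drawn from Anthropic’s tooling.

```python
# A minimal, purely illustrative attribution graph. Nodes are
# human-interpretable features; edges carry estimated contribution
# weights. Every name and number here is invented for illustration;
# the real method derives them from the model's internal activations.

from collections import defaultdict

edges = [
    ("token: 'France'", "feature: France", 0.88),
    ("prompt: 'capital of'", "feature: capital city", 0.74),
    ("feature: France", "feature: capital city", 0.52),
    ("feature: France", "output: 'Paris'", 0.31),
    ("feature: capital city", "output: 'Paris'", 0.90),
]

incoming = defaultdict(list)
for src, dst, weight in edges:
    incoming[dst].append((src, weight))

def trace(node, depth=0):
    """Print the upstream features contributing to `node`, strongest first."""
    for src, weight in sorted(incoming[node], key=lambda e: -e[1]):
        print("  " * depth + f"{node} <- {src} (weight {weight:.2f})")
        trace(src, depth + 1)

trace("output: 'Paris'")
```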
“This work turns what might have been almost philosophical questions – ‘Do models think? Do models plan?’ – into concrete scientific inquiries into what is literally happening inside these systems,” Batson said.
Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. Asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing it – a level of sophistication that surprised even Anthropic’s researchers.
“It’s probably happening all over the place,” Batson said. “If you had asked me before this research, I would have guessed the model thinks ahead in various contexts. But this example provides the most compelling evidence we’ve seen.”
For example, when writing a line of verse meant to end with “rabbit,” the model activates features representing that word at the start of the line, then structures the sentence so that it arrives naturally at that conclusion.
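A simplified way to picture the experiment: read out the activation of a feature representing the planned rhyme word at each position of the line, and check whether it is already active before the word appears. The Python sketch below uses invented activation values purely to illustrate that logic.

```python
# Toy illustration of the "planning ahead" observation. The activation
# values below are invented; in the real experiments they are read out
# of Claude's internal state while it writes a rhyming line.

line_tokens = ["His", "hunger", "was", "like", "a", "starving", "rabbit"]

# Hypothetical activation of a feature representing the planned rhyme
# word "rabbit" at each token position of the line being written.
rabbit_feature = [0.71, 0.69, 0.72, 0.75, 0.70, 0.68, 0.93]

# The striking part of the finding: the feature is already active at the
# very first token, long before the rhyming word itself is produced.
print(f"'rabbit' feature active at line start: {rabbit_feature[0] > 0.5}")
```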
The researchers also observed Claude performing genuine multi-step reasoning. Given the prompt “The capital of the state containing Dallas is…”, the model first activates features representing “Texas,” and then uses that representation to determine “Austin” as the correct answer. This suggests the model is actually carrying out a chain of reasoning rather than simply regurgitating memorized associations.
By intervening on these internal representations – for example, swapping “Texas” for “California” – the researchers could make the model output “Sacramento” instead, confirming the causal relationship.
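The logic of that intervention can be shown with a deliberately simple sketch, in which a dictionary stands in for the model’s internal state; the real experiment edits Claude’s activations directly, and the names below are only illustrative.

```python
# Toy sketch of the intervention described above. A dictionary stands in
# for the model's internal feature state; nothing here touches a real model.

capitals = {"Texas": "Austin", "California": "Sacramento"}

def answer(active_state_feature):
    """Return the capital the toy 'model' produces for the active state feature."""
    return capitals[active_state_feature]

# Normal run: the prompt about Dallas activates the "Texas" feature.
print(answer("Texas"))        # -> Austin

# Intervention: swap the internal "Texas" feature for "California".
print(answer("California"))   # -> Sacramento
```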
Another key discovery concerns how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model translates concepts into a shared abstract representation before generating its answer.
“We find the model uses a mixture of language-specific and abstract, language-independent circuits,” the researchers write in their paper. When asked for the opposite of “small” in different languages, the model uses the same internal features representing the concepts of “opposites” and “smallness,” whatever the input language.
This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with more parameters develop more language-agnostic representations.
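One simple way to picture “shared features” is to compare the feature-activation vectors the model produces for the same concept expressed in different languages; high similarity suggests a common internal representation. The vectors in the sketch below are invented stand-ins, not measurements from Claude.

```python
# Toy illustration of shared, language-independent features. The vectors
# are invented stand-ins for the activations of a handful of internal
# features when the model is asked for the opposite of "small" in three
# languages; in the real study these come from Claude itself.

import math

activations = {
    "English: 'the opposite of small'": [0.90, 0.10, 0.80, 0.00],
    "French: 'le contraire de petit'":  [0.85, 0.15, 0.82, 0.05],
    "Chinese: '小的反义词'":             [0.88, 0.12, 0.79, 0.02],
}

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

prompts = list(activations)
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        sim = cosine(activations[prompts[i]], activations[prompts[j]])
        print(f"{prompts[i]} vs {prompts[j]}: similarity {sim:.3f}")
```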
Perhaps most concerning, the study revealed cases where Claude’s reasoning does not match what it claims. When presented with hard math problems, such as computing cosine values of large numbers, the model sometimes claims to follow a calculation process that is not reflected in its internal activity.
“We are able to distinguish between cases in which the model genuinely performs the steps it says it is performing, cases in which it makes up its reasoning without regard for the truth, and cases in which it works backward from a clue provided by the human,” the researchers explain.
In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles.
“We distinguish a case in which Claude 3.5 Haiku uses a faithful chain of thought from two examples of unfaithful chains of thought,” the researchers write. “In one, the model exhibits ‘bullshitting’… in the other, it exhibits motivated reasoning.”
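A rough way to picture the distinction is to compare what the model says it is doing with whether the internal features that would implement that computation actually fire. The sketch below hard-codes that evidence as booleans purely for illustration; the real judgment comes from tracing Claude’s activations.

```python
# Toy sketch of the faithful vs. unfaithful distinction. The booleans are
# invented; in the real work the evidence is whether the features that
# would implement the claimed calculation actually activate.

cases = [
    {"name": "faithful", "claims_to_compute": True,
     "computation_features_active": True},
    {"name": "bullshitting", "claims_to_compute": True,
     "computation_features_active": False},
    {"name": "motivated reasoning", "claims_to_compute": True,
     "computation_features_active": False, "works_backward_from_hint": True},
]

for case in cases:
    if case["claims_to_compute"] and not case["computation_features_active"]:
        if case.get("works_backward_from_hint"):
            verdict = "unfaithful: reasons backward from the user's suggested answer"
        else:
            verdict = "unfaithful: states a calculation its internals never perform"
    else:
        verdict = "faithful: internal activity matches the stated reasoning"
    print(f"{case['name']}: {verdict}")
```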
The research also sheds light on why language models hallucinate – inventing information when they don’t know the answer. Anthropic found a “default” circuit that causes the model to decline to answer questions, a circuit that is inhibited when the model recognizes that it knows the topic being asked about.
“There are ‘default’ circuits that cause the model to decline to answer questions,” the researchers explain. “When the model is asked about something it knows, it activates a pool of features that inhibits this default circuit, thereby allowing the model to respond to the question.”
When this mechanism misfires – recognizing an entity but lacking specific knowledge about it – hallucinations can occur. This explains why models may confidently provide incorrect information about well-known figures while declining to answer questions about obscure ones.
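The mechanism can be caricatured as a small decision rule: a default drive to decline, inhibited by a “known entity” signal, with hallucination risk appearing when recognition outruns actual knowledge. The numbers and threshold in the sketch below are invented for illustration; the real circuits live inside Claude.

```python
# Toy sketch of the hallucination mechanism described above. Feature
# names, values, and the 0.5 threshold are all invented for illustration.

def toy_answer_policy(known_entity_feature: float, specific_knowledge: float) -> str:
    """Decide whether the toy 'model' declines, hallucinates, or answers."""
    # A "default" circuit pushes the model toward declining to answer.
    decline_drive = 1.0
    # Recognizing a known entity inhibits the default circuit.
    decline_drive -= known_entity_feature

    if decline_drive > 0.5:
        return "decline ('I don't know')"
    # Failure mode: the entity is recognized, so the refusal circuit is
    # suppressed, but the model lacks the specific facts being asked about.
    if specific_knowledge < 0.5:
        return "answer anyway -> hallucination risk"
    return "answer from known facts"

print(toy_answer_policy(known_entity_feature=0.1, specific_knowledge=0.0))  # obscure name
print(toy_answer_policy(known_entity_feature=0.9, specific_knowledge=0.2))  # famous name, obscure detail
print(toy_answer_policy(known_entity_feature=0.9, specific_knowledge=0.9))  # well-known fact
```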
The research represents a significant step toward making AI systems more transparent and potentially safer. By understanding how models arrive at their answers, researchers could potentially identify and address problematic reasoning patterns.
Anthropic has long emphasized the safety potential of interpretability. In its May 2024 Sonnet paper, the research team laid out a similar vision: “We hope that we and others can use these discoveries to make models safer,” the researchers wrote at the time. “For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors – such as deceiving the user – or to steer them away from certain dangerous subject matter.”
Today’s announcement builds on that foundation, though the new techniques still have significant limitations. They capture only a fraction of the total computation these models perform, and analyzing the results remains labor-intensive.
“Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude,” the researchers acknowledge in their latest work.
Anthropic’s new techniques arrive amid growing concern over AI transparency and safety. As these models are deployed more widely and given more powerful capabilities, understanding their internal mechanisms becomes increasingly important.
The research also has potential commercial implications. As enterprises increasingly rely on these systems, understanding when and why they might provide incorrect information becomes crucial for managing risk.
“Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse – including in scenarios of catastrophic risk,” the researchers write.
While the research marks significant progress, Batson stressed that it is only the beginning of a much longer journey. “The work has really just begun,” he said. “Understanding the representations the model uses doesn’t tell us how it uses them.”
Still, Anthropic’s circuit tracing offers a first tentative map of previously uncharted territory – much like early anatomists sketching the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now begin to make out the contours of how these systems think.