Study accuses LM Arena of helping top AI labs game its benchmark


A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular Chatbot Arena benchmark, of helping a select group of AI companies achieve better leaderboard scores at the expense of competitors.

According to the authors, LM Arena allowed a handful of industry leaders, including Meta, OpenAI, Google, and Amazon, to privately test several variants of their AI models, and then to withhold the scores of the lowest performers. This made it easier for those companies to reach the top of the platform's leaderboard, though the opportunity wasn't extended to every firm, the authors say.

“Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others,” Cohere’s VP of AI research and study co-author, Sara Hooker, told TechCrunch in an interview. “This is gamification.”

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by pitting answers from two different AI models against each other in a “battle” and asking users to choose the best one. It's not uncommon for unreleased models to compete in the arena under a pseudonym.

Votes accumulate into a model's score over time and, ultimately, determine its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is impartial.
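Chatbot Arena's exact scoring method isn't spelled out here, but pairwise leaderboards of this kind are typically built on Elo-style ratings. The sketch below is a minimal, hypothetical illustration of the mechanic: each user vote nudges two models' ratings, and the sorted ratings form the leaderboard. The function, constants, and model names are assumptions, not LM Arena's actual code.

```python
from collections import defaultdict

K = 32  # hypothetical update step size
ratings = defaultdict(lambda: 1000.0)  # model name -> rating

def record_battle(winner: str, loser: str) -> None:
    """Fold one user vote into Elo-style ratings for two models."""
    # Probability the winner was expected to win, given current ratings.
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected)  # upset wins move ratings more
    ratings[loser] -= K * (1.0 - expected)

record_battle("model-a", "model-b")
record_battle("model-a", "model-c")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # leaderboard order
```

Under a scheme like this, more battles simply mean more rating updates, which is why the sampling rates at issue in the study matter so much.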

However, that's not what the paper's authors say they found.

One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March, ahead of the tech giant's Llama 4 launch, the authors found. At launch, Meta publicly revealed the score of only a single model, one that happened to rank near the top of the Chatbot Arena leaderboard.

A chart from the study. (Credit: Singh et al.)

LM Arena co-founder and UC Berkeley professor Ion Stoica told TechCrunch that the study is full of “inaccuracies” and “questionable analysis.”

“We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,” LM Arena said in a statement. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.”

Allegedly favored labs

The paper's authors began their investigation in November 2024, after learning that some AI companies may have been given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.

The authors say they found evidence that LM Arena allowed certain companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of battles. This increased sampling rate gave those companies an unfair advantage, the authors claim.

Using the additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%, the paper says. But LM Arena said in a post on X that Arena Hard performance does not directly correlate with Chatbot Arena performance.

Hooker said it's unclear how certain AI companies came to receive priority access, but that the onus is on LM Arena to increase its transparency regardless.

In a post on X, LM Arena said that a number of the claims in the paper don't reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on “self-identification” to determine which AI models were in private testing on Chatbot Arena. The authors prompted models several times about their company of origin and relied on the models' answers to classify them, a method that isn't foolproof.

However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn't dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon, all of which were mentioned in the study, for comment. None immediately responded.

LM Arena in hot water

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more “fair.” For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose the scores from those tests.

In a post on X, LM Arena rejected these suggestions, saying it has published information on pre-release testing since March 2024. The benchmarking organization also said it “makes no sense to show scores for pre-release models which are not publicly available,” because the AI community cannot test the models for themselves.

The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been publicly receptive to this recommendation, and said it will create a new sampling algorithm.
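LM Arena hasn't described what its new algorithm will look like. As a rough sketch of the idea the researchers propose, a sampler could always pair the models that have appeared in the fewest battles so far, keeping exposure roughly uniform. Everything below, including the names, is a hypothetical illustration, not LM Arena's announced design.

```python
import itertools
import random
from collections import Counter

battle_counts: Counter = Counter()  # model name -> battles played

def next_battle(models: list[str]) -> tuple[str, str]:
    """Pick the pairing whose two models have fought the least so far."""
    pairs = list(itertools.combinations(models, 2))
    random.shuffle(pairs)  # break ties randomly
    a, b = min(pairs, key=lambda p: battle_counts[p[0]] + battle_counts[p[1]])
    battle_counts[a] += 1
    battle_counts[b] += 1
    return a, b

models = ["model-a", "model-b", "model-c", "model-d"]
for _ in range(6):
    print(next_battle(models))  # appearances stay roughly even
```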

The paper comes weeks after Meta was caught gaming Chatbot Arena around the launch of its above-mentioned Llama 4 models. Meta optimized one of the Llama 4 models for “conversationality,” which helped it achieve an impressive score on the arena's leaderboard. But the company never released the optimized model, and the vanilla version ended up performing much worse on Chatbot Arena.

At the time, LM Arena said that Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study adds to scrutiny of private benchmark organizations, and of whether they can be trusted to assess AI models without corporate influence clouding the process.

Update 4/30/25 at 9:35 p.m. PT: An earlier version of this story included comments from a Google DeepMind engineer who said part of Cohere's study was inaccurate. The researcher said that Google sent 10 models to LM Arena for pre-release testing from January to March, and that the company's open source team sent only one.


