Crowdsourced AI benchmarks have serious flaws, some experts say


AI labs are increasingly relying on crowdsourced benchmarking platforms such as Chatbot Arena to probe the strengths and weaknesses of their latest models. But some experts say there are serious problems with this approach, from both an ethical and an academic point of view.

Over the last few years, labs including OpenAI, Google, and Meta have turned to these platforms, recruiting users to help assess the capabilities of upcoming models. When a model scores favorably, the lab behind it will often point to that score as evidence of a meaningful improvement.

It's a flawed approach, however, according to Emily Bender, a University of Washington linguistics professor and co-author of the book "The AI Con." Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and choosing the response they prefer.

"To be valid, a benchmark needs to measure something specific, and it needs to have construct validity; that is, there has to be evidence that the construct of interest is well defined and that the measurements actually relate to that construct," she said. "Chatbot Arena hasn't shown that voting for one output over another actually corresponds to preferences, however they may be defined."

Asmelash Hadgu, the co-founder of an AI company, said he thinks benchmarks like Chatbot Arena are being used by labs to "encourage exaggerated claims." Hadgu pointed to a recent controversy involving Meta's Llama 4 Maverick: Meta tuned a version of Maverick to score well on Chatbot Arena, only to hold that version back in favor of releasing a worse-performing one.

Benchmarks should be dynamic rather than static data sets, Hadgu argued, and tailored to the people who actually use these "[models] for work."

Hadgu and Kristine Gloria, who leads an initiative at the Aspen Institute, also said that model evaluators should be compensated for their work. Gloria said AI labs should learn from the mistakes of the data labeling industry, which is notorious for its exploitative practices. (Some labs have been accused of the same.)

"In general, the crowdsourced benchmarking process is valuable and reminds me of citizen science initiatives," Gloria said. "Ideally, it helps bring in additional perspectives to provide some depth in both the evaluation and fine-tuning of models. But it should never be the only metric for evaluation."

Matt Frederickson, the CEO of Gray Swan AI, which runs crowdsourced red-teaming exercises on models, said that volunteers join Gray Swan's platform for a range of reasons, including to "learn new skills and practice." (Gray Swan also awards cash prizes for some tests.)

"[D]evelopers should also rely on internal benchmarks, algorithmic red teams, and contracted red teamers who can take a more open-ended approach or bring specific domain expertise," he said. "It's important for both model developers and benchmark creators to communicate results clearly and to be responsive when those results are called into question."

Alex Atallah, the CEO of model marketplace OpenRouter, which recently partnered with OpenAI to give users early access to OpenAI's GPT-4.1 models, said that openly testing and comparing models alone is "not enough." So did Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of Chatbot Arena.

"Of course, we support the use of other tests," Chiang said. "Our goal is to create a trustworthy, open space that measures our community's preferences across different AI models."

Chiang said that incidents like the Maverick benchmark discrepancy are not the result of a flaw in Chatbot Arena's design. LM Arena has taken steps to prevent future discrepancies, Chiang said, including updating its policies to "strengthen our commitment to fair, reproducible evaluations."

"Our community isn't here as volunteers or model testers," Chiang said. "People use LM Arena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects the community's voice, we welcome it being shared."


