Meta cheated on an AI benchmark, and it's hilarious. According to reporting by Kylie Robison, doubts began percolating after Meta released two new AI models based on its Llama 4 large language model over the weekend. The new models are Scout, a smaller model designed for quick queries, and Maverick, pitched as a hyper-efficient rival to OpenAI's GPT-4o (the harbinger of the Miyazaki apocalypse).
Meta announced the models in a blog post, doing what every AI company does with a big release these days: throwing out a pile of dense technical detail to brag that its AI is smarter and more efficient than the competition from Google, OpenAI, and Anthropic. These release notes are catnip for researchers and AI obsessives but useless for everyone else, thick with deep technical information and benchmarks. Meta's announcement was no different.
However, many of those AI obsessives immediately noticed a shocking benchmark result in Meta's release: Maverick had racked up an Elo score of 1417 on LMArena, an open collaborative platform where users vote on which model's responses perform best. A higher score is better, and Maverick's 1417 put it at number 2 on LMArena's leaderboard, above GPT-4o and just below Gemini 2.5 Pro. The whole AI ecosystem buzzed with surprise at the result.
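For readers unfamiliar with Elo, here is a minimal sketch of how a single head-to-head vote nudges a score up or down. This is the textbook Elo update rule that chat leaderboards like LMArena are modeled on; the K-factor and the specific ratings below are illustrative assumptions, not LMArena's actual parameters.

```python
# A minimal sketch of the standard Elo update rule. The K-factor and
# starting ratings are illustrative assumptions, not LMArena's parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one head-to-head user vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# A 1417-rated model gains little from beating a near-peer,
# but loses a lot if it drops a vote to a much weaker model.
print(update(1417, 1400, a_won=True))   # small gain
print(update(1417, 1200, a_won=False))  # large loss
```

The practical upshot: a score like 1417 only emerges from winning a lot of individual user votes, which is exactly why a model tuned to charm voters can move the needle.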
Then people started digging, and Meta quickly admitted that the Maverick model crushing it on LMArena was not the same model available to users. The company had programmed this experimental version to be chattier than usual. Effectively, it charmed the benchmark.
LMArena was not happy about being charmed. "Meta's interpretation of our policy did not match what we expect from model providers," it said in a statement on X, adding that Meta should have made it clearer that "Llama-4-Maverick-03-26-Experimental" was a customized model optimized for human preference.
I love LMArena's optimism here, because gaming a benchmark feels like a rite of passage in consumer technology, and I suspect this trend will continue. I covered consumer technology for years and once ran one of the bigger benchmarking labs in the industry, and I watched plenty of phone and laptop makers juice their scores. They fiddled with screen brightness to post better battery life and shipped bloatware-free versions of laptops to earn better performance numbers.
Now AI models are juicing their scores, too. The reason I don't consider this shocking is that it's nearly hopeless to tell these companies' large language models apart. If every model can help you write a shitty English paper five minutes before class, you need some other reason to pick one. "My model uses less energy and performs the task 2.46% faster" may not sound like the biggest brag to everyone, but it matters: 2.46% faster is still faster than anyone else.
As these AIs continue to grow up into actual consumer products, we'll start to see more benchmark bragging. I expect we'll see other things, too. User interfaces will start to change, and goofy storefronts like the GPT section of the ChatGPT app will become more common. These companies need to prove their models are the best, and benchmarks won't do it. Not when a little chattiness can game the system so easily.