Debates over AI benchmarking have reached Pokémon


Not even Pokémon is safe from the AI benchmarking debate.

Last week, a viral post on X claimed that Google's latest Gemini model had surpassed Anthropic's flagship Claude model at the original Pokémon video game trilogy. Gemini had reportedly reached Lavender Town in one developer's Twitch stream; Claude was stuck at Mount Moon as of late February.

What the post failed to mention, however, is that Gemini had a built-in advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.
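To see why that kind of scaffolding matters, here is a minimal sketch of what such a setup might look like. Everything in it (the tile codes, the grid format, the describe_minimap helper) is a hypothetical reconstruction for illustration, not the stream developer's actual tooling: instead of asking the model to infer the map from raw pixels, the harness hands it a pre-labeled tile grid.

```python
# Hypothetical sketch of a minimap "scaffold": rather than sending raw
# screenshots, the harness pre-labels each tile and sends the model a
# compact text grid. Tile codes and layout here are illustrative only.
TILE_LABELS = {
    0: ".",   # walkable ground
    1: "#",   # wall / impassable
    2: "T",   # cuttable tree (requires HM01 Cut)
    3: "D",   # door / warp
    4: "P",   # player position
}

def describe_minimap(tile_grid: list[list[int]]) -> str:
    """Render a 2D grid of tile IDs as a text minimap for the model's prompt."""
    legend = ", ".join(
        f"{symbol} = {name}"
        for name, symbol in [("ground", "."), ("wall", "#"),
                             ("cuttable tree", "T"), ("door", "D"),
                             ("player", "P")]
    )
    rows = ["".join(TILE_LABELS.get(tile, "?") for tile in row)
            for row in tile_grid]
    return f"Legend: {legend}\n" + "\n".join(rows)

# Example: a small patch of map around the player.
patch = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 2, 1],
    [1, 0, 4, 0, 3],
    [1, 1, 1, 1, 1],
]
print(describe_minimap(patch))
```

A model reading this grid does no visual recognition at all; the perception problem has been solved for it before the prompt arrives, which is precisely the sort of implementation difference that can swing a benchmark result.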

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive example of how different implementations of the same benchmark can affect the results.

For example, Anthropic reported two scores for its latest Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is, it seems unlikely that comparing models will get any easier as new ones are released.


