Anthropic aims to fund a new, more comprehensive generation of AI benchmarks
Anthropic is launching a new initiative to fund the creation of benchmarks designed to assess the performance and impact of AI models, including generative models like its own Claude. Announced on Monday, the program will provide financial support to third-party organizations that can, as the company put it in its blog post, “effectively measure advanced capabilities in AI models.” Interested parties can submit applications for evaluation on an ongoing basis.
Anthropic emphasized that this investment aims to enhance AI safety across the board, offering valuable tools to benefit the entire AI ecosystem. The company acknowledged the challenges in developing high-quality, safety-relevant evaluations and noted that the demand for such evaluations currently exceeds the supply.
The program addresses a persistent problem in AI: benchmarking. Existing benchmarks often fail to capture how average users actually interact with AI systems, and many older benchmarks, especially those predating modern generative AI, may no longer measure what they claim to measure.
The initiative is part of Anthropic’s broader effort to ensure AI safety and reliability, and a step toward more effective, relevant evaluation tools for advanced AI capabilities.