Researchers from Stanford, Princeton, and Cornell have developed a new benchmark to better evaluate the coding abilities of large language models (LLMs). Called CodeClash, the new benchmark pits LLMs ...
Covers mathematics, general intelligence, general awareness, and general science. The exam includes a CBT, a physical efficiency test, document verification, and a medical examination. 100 questions for 100 ...