Researchers propose a novel benchmark called N to evaluate large language models (LLMs) for long contexts by eliminating literal matches between search contexts and relevant information. The study assessed 12 popular LLMs and found that while they perform well with short contexts, their accuracy significantly decreases as

3m read time, from portkey.ai