r/MachineLearning · 45d ago · 1 min read

[D] We audited LoCoMo: 6.4% of the answer key...

[Projects are still submitting new scores on LoCoMo as of March 2026.](https://github.com/snap-research/locomo/issues/34) We audited it and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context window test than a memory test. Here's what we found.

## LoCoMo

LoCoMo ([Maharana et al., ACL 2024](https://
