Why comparing average scores is the wrong way to evaluate LLM prompts (and what to do instead) | txtfeed