I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.
I assumed chunking was a solved problem. Pick a text splitter, set 512 tokens, add some overlap, move on. After running structured experiments across three different data types, that assumption collapsed. The best chunker for markdown documentation actively hurt performance on code. The winner changed completely depending on what I was chunking.

TL;DR

| Data type | Winner | Headline metric |
| --- | --- | --- |
| Markdown docs | HeadingAwareChunker | MRR 0.755 vs SlidingWindow 0.687 |
| PDFs | RecursiveChar (512 tok) | Context Recall 0. |
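For reference, the "set 512 tokens, add some overlap" default from the opening can be sketched in a few lines. This is a minimal illustration, not the code used in the experiments: the function name `sliding_window_chunks`, the whitespace tokenization, and the overlap of 64 are all assumptions made for the example. A real pipeline would count tokens with the embedding model's tokenizer.

```python
def sliding_window_chunks(tokens, size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Hypothetical helper for illustration; `size` and `overlap` are in tokens.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input,
        # so no trailing chunk made purely of overlap is emitted.
        if start + size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace splitting stands in for a real tokenizer.
    text = "chunking looks simple until the data type changes " * 200
    tokens = text.split()
    chunks = sliding_window_chunks(tokens, size=512, overlap=64)
    print(len(tokens), "tokens ->", len(chunks), "chunks")
```

Note that the window ignores structure entirely: it will cut through a heading, a table row, or a function body wherever the token count lands, which is why a fixed window can behave so differently across data types.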