The 1 million source problem
Curating for quality across a few hundred sources is a well-understood problem. Curating across millions is a different problem entirely. Here's how txtfeed scales the pipeline without paying for content.
Most content aggregators handle a few hundred to a few thousand sources. The pipeline is manageable: human editors pick what to ingest, the algorithm ranks within a known pool, and quality control is mostly about catching the occasional spam slip-through. txtfeed's design target is a million sources by year three. That's a different problem.
At scale, three things break. First, human curation of the source list itself becomes the bottleneck — there isn't enough editor time to vet every new RSS feed someone wants to add. Second, deduplication gets expensive — when the same story is on Reddit, HN, three news sites, and four blogs, the system needs to recognize that without doing pairwise comparisons against everything else in the pool. Third, source-quality decay is invisible — a feed that was great last year might be a clickbait farm this year, and you won't know unless you measure.
txtfeed's approach to all three is to make the system self-managing. Sources don't get manually approved — anyone can submit an RSS URL via the bookmarklet, and if enough users save content from that source, it gets promoted into the main pipeline. Sources don't get manually demoted either — the upvote ratio is tracked per source, and feeds with consistently low ratios get their pipeline weight reduced automatically.
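The promotion half of this loop can be sketched in a few lines. The threshold and field names here are illustrative assumptions; txtfeed hasn't published its actual numbers.

```python
from dataclasses import dataclass

# Hypothetical threshold -- txtfeed's real promotion bar isn't published.
PROMOTE_SAVES = 50  # distinct users who must save content before promotion

@dataclass
class CandidateSource:
    url: str
    unique_savers: int = 0
    in_pipeline: bool = False

def maybe_promote(source: CandidateSource) -> CandidateSource:
    """Promote a user-submitted feed once enough distinct users save from it."""
    if not source.in_pipeline and source.unique_savers >= PROMOTE_SAVES:
        source.in_pipeline = True
    return source
```

The key design choice is that the signal is user behavior (saves), not editor judgment: a submitted feed sits in a candidate pool at zero weight until readers vouch for it by saving its content.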
Deduplication runs on semantic fingerprints, not URL matching. Every piece gets embedded into a vector space and compared to recent pieces. Near-duplicates get merged into a single card with a verified-by-N-sources badge — so cross-source coverage becomes a quality signal instead of a deduplication headache. The cost of running this scales linearly with content volume, not quadratically with source count.
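A minimal version of that fingerprint check might look like the sketch below, assuming unit-normalized embeddings and a hypothetical similarity cutoff; the embedding model and threshold aren't specified in the source.

```python
import numpy as np

SIM_THRESHOLD = 0.92  # hypothetical cutoff for "same story"

class DedupIndex:
    """Rolling index of recent story embeddings.

    Each new piece is compared against recent items only, so cost grows
    with content volume rather than quadratically with source count.
    """
    def __init__(self):
        self.vectors: list[np.ndarray] = []   # one unit vector per cluster
        self.cluster_sizes: list[int] = []    # "verified by N sources" counts

    def add(self, embedding: np.ndarray) -> int:
        """Insert one piece; return how many sources its cluster now has."""
        vec = embedding / np.linalg.norm(embedding)
        if self.vectors:
            sims = np.stack(self.vectors) @ vec  # cosine sims on unit vectors
            best = int(np.argmax(sims))
            if sims[best] >= SIM_THRESHOLD:
                self.cluster_sizes[best] += 1    # merge into existing card
                return self.cluster_sizes[best]
        self.vectors.append(vec)
        self.cluster_sizes.append(1)
        return 1
```

In production the linear scan would be an approximate-nearest-neighbor index, but the interface is the same: one query per new piece, regardless of how many sources exist.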
The harder problem is source-quality decay. A feed that was high-signal six months ago might be a content farm now. txtfeed handles this by re-scoring every source weekly based on the previous week's vote ratio. Sources that drop below a threshold lose 50% of their pipeline weight automatically. Sources that consistently produce hidden content (3+ user reports per week) drop to zero. No editorial review needed.
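The weekly pass reduces to a small pure function. The 50% cut and the 3-report cutoff are the rules stated above; the ratio threshold is an assumed value.

```python
RATIO_THRESHOLD = 0.3  # assumed value; the actual cutoff isn't published
REPORT_LIMIT = 3       # 3+ user reports in a week zeroes the source

def rescore(weight: float, weekly_upvote_ratio: float, weekly_reports: int) -> float:
    """Weekly re-score: zero out heavily reported feeds, halve low-ratio ones."""
    if weekly_reports >= REPORT_LIMIT:
        return 0.0
    if weekly_upvote_ratio < RATIO_THRESHOLD:
        return weight * 0.5
    return weight
```

Because the halving compounds week over week, a decayed source fades out gradually rather than being cut off at one bad week, while repeated user reports remove it immediately.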
The lesson generalizes to any platform that wants to scale curation without scaling editorial headcount: every step that humans currently do has to become an algorithm, and every algorithm has to be able to update itself based on user behavior. The platforms that get stuck at 10K sources are the ones that still have a human in the source-approval loop. The platforms that scale past it are the ones where humans only intervene on edge cases.
See it for yourself. No signup required.
Open txtfeed