War Story: We Migrated from Hugging Face Inference API to Self-Hosted LLMs and Cut Latency by 60% — TxtFeed