VentureBeat

IndexCache, a new sparse attention optimizer...

Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length.

The technique applies to models using the DeepSeek Sparse Attention architecture, including t
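The article does not spell out the mechanism, but the general idea behind caching index computations in sparse attention can be sketched as follows: a lightweight "indexer" scores every cached key against the current query and keeps only the top-k keys for full attention, and caching each key's indexer projection means that work is done once per key rather than once per decoding step. This is a toy illustration under that assumption, not the paper's implementation; all names (`ToyIndexCache`, `append_key`, `topk_indices`) are hypothetical.

```python
import numpy as np

class ToyIndexCache:
    """Toy sketch: cache per-key indexer projections so each decoding
    step only projects the one new key instead of the whole prefix."""

    def __init__(self, dim, k, proj_dim=8, seed=0):
        self.k = k                       # how many keys full attention will see
        self.key_proj = []               # cached low-dim indexer projections
        self.projections_done = 0        # counts projection work actually done
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((dim, proj_dim))  # cheap indexer projection

    def append_key(self, key):
        # Project the new key once and cache it; old keys are never reprojected.
        self.key_proj.append(key @ self.W)
        self.projections_done += 1

    def topk_indices(self, query):
        # Score the query against all cached projections, keep the top-k.
        scores = np.stack(self.key_proj) @ (query @ self.W)
        k = min(self.k, len(scores))
        return np.argsort(scores)[-k:]

# Simulated decoding: 100 steps, one new key per step. With the cache,
# total projection work is 100 (linear), not 100*101/2 (quadratic).
cache = ToyIndexCache(dim=16, k=4)
rng = np.random.default_rng(1)
for _ in range(100):
    cache.append_key(rng.standard_normal(16))
    selected = cache.topk_indices(rng.standard_normal(16))
```

The full attention kernel would then attend only over the `selected` indices, which is where the reported speedups would come from at long context lengths.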
Read original on venturebeat.com
