Hacker News

Furiosa: 3.5x efficiency over H100s (https://furiosa.ai)

211 points by written-beyond 4 days ago | 157 comments

roughly 4 days ago |

I am of the opinion that Nvidia has hit the wall with its current architecture the same way Intel historically has with its various architectures - the current generation's power and cooling requirements are forcing the construction of entirely new datacenters with different designs, which is going to blow out the economics of inference (GPU + datacenter + power plant + nuclear fusion research division + lobbying for datacenter land + water rights + ...).

The story with Intel around those times was usually that AMD or Cyrix or ARM or Apple or someone else would come around with a new architecture that was a clear generational jump past Intel's, and most importantly seemed to break the thermal and power ceilings of the Intel generation (at which point Intel typically fired their chip design group, hired everyone from AMD or whoever, and came out with Core or whatever). Nvidia effectively has no competition, or hasn't had any - nobody has actually broken the CUDA moat, so neither Intel nor AMD nor anyone else is really competing for the datacenter space, and Nvidia hasn't faced any real competitive pressure against things like power draws in the multi-kilowatt range for the Blackwells.

The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI and also seem to be hitting a scaling wall, and the only way to make the economics of, e.g., a Blackwell-powered datacenter make sense is to assume that the entire economy is going to be running on it, as opposed to some useful tools and some improved interfaces. Otherwise the investment numbers just don't add up: the gap between what we see on the ground - how LLMs are actually used, and the real but limited value they add - and the full cost of providing that service with a brand-new single-purpose "AI datacenter" is just too great.

So this is a press release, but any time I see something that looks like an actual new hardware architecture for inference - especially one that doesn't require building a new building or solving nuclear fusion - I'll take it as a good sign. I like LLMs, and I've gotten a lot of value out of them, but nothing about the industry's finances adds up right now.

zmmmmm 4 days ago |

What can it actually run? The fact that their benchmark plot refers to Llama 3.1 8B signals to me that it's hand-implemented for that model and likely can't run newer / larger models. Why else would you benchmark such an outdated model? Show me a benchmark for gpt-oss-120b or something similar.

Barathkanna 4 days ago |

For those wondering how this differs from Nvidia GPUs:

Nvidia = flexible, general-purpose GPUs that excel at training and mixed workloads. Furiosa = purpose-built inference ASICs that trade flexibility for much better cost, power efficiency, and predictable latency at scale.

KronisLV 4 days ago |

I think it's actually really cool to focus on efficiency over just raw performance! The page for the cards themselves goes into more detail and has a pretty nice graph: https://furiosa.ai/rngd

You can see them admit that RNGD will be slower than a setup with H100 SXM cards, but at the same time the tokens per second per watt is way better!

Actually, I wonder how different that is from Cerebras chips, since they're very much optimized for speed and one would think that'd also affect the efficiency a whole bunch: https://www.cerebras.ai/

torginus 4 days ago |

These things never pan out.

The reason this almost never works is usually one of the following:

- They assume they can move hardware complexity (scheduling, access patterns, etc.) into software. The magic compiler/runtime never arrives.

- They assume their hard-to-program but faster architecture will get figured out by devs. It won't.

- They assume a certain workload. The workload changes, and their arch is no longer optimal or possibly even workable.

- But most importantly, they don't understand the fundamental bottleneck, which is usually memory bandwidth. Even if you improve the paper specs - total FLOPS, FLOPS/W, etc. - you're usually limited by how much you can read from memory, which is exactly as much as your competitors (rough numbers at the end of this comment). The way you overcome this is with cleverness and complexity (cache lines, smarter algorithms, acceleration structures, etc.), but all of that requires a complex computer to run it, with coherent cache hierarchies, branching, synchronization logic and so on. Which is why folks like NVIDIA keep going despite facing this constant barrage of would-be disruptors.

In fact this continues to become more and more true - memory bandwidth relies on transceivers at the chip edge, so if the size of the chip doesn't increase, bandwidth doesn't automatically increase on newer process nodes, and latency doesn't improve at all. But you do get more transistors to play with, which you can use to run your workload more cleverly.

I don't even rule out the possibility of CPU-based massively parallel compute making a comeback.
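
For a sense of scale, here's a back-of-the-envelope sketch of that bandwidth ceiling for single-stream decode; the model size, precision, and bandwidth figures are illustrative assumptions, not anyone's measured numbers:

    # Upper bound on decode speed if every weight has to be streamed from
    # memory once per generated token (ignores KV cache, batching, overlap).
    def bandwidth_bound_tokens_per_s(params_billion, bytes_per_param, mem_bw_gb_s):
        bytes_per_token = params_billion * 1e9 * bytes_per_param
        return mem_bw_gb_s * 1e9 / bytes_per_token

    # Example: 8B params in FP16 against ~3.35 TB/s of HBM (H100 SXM class)
    # -> roughly 210 tokens/s per stream, no matter how many FLOPS you have.
    print(bandwidth_bound_tokens_per_s(8, 2, 3350))

Batching changes the picture, but the point stands: two chips with the same memory bandwidth land in the same ballpark for this regime, whatever their FLOPS/W spec sheets say.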

darknoon 4 days ago |

Really weird graph: they're comparing to 3x H100 PCIe, which is a config I don't think anyone is actually using.

Are they trying to compare at iso-power? I just want to see their box vs a box of 8 H100s, because that's what people would buy instead; they can divide tokens by watts if efficiency is the pitch.
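
The normalization itself is trivial; a minimal sketch with made-up numbers, just to show what "divide tokens by watts" would look like for whole boxes:

    # Compare whole servers on tokens/s per watt. All figures below are
    # placeholders, not measurements from Furiosa or NVIDIA.
    def tokens_per_watt(tokens_per_s, watts):
        return tokens_per_s / watts

    rngd_box = tokens_per_watt(12_000, 3_000)    # hypothetical RNGD server
    h100_box = tokens_per_watt(30_000, 10_200)   # hypothetical 8x H100 SXM box

    print(f"RNGD: {rngd_box:.2f} tok/s/W, H100: {h100_box:.2f} tok/s/W, "
          f"ratio: {rngd_box / h100_box:.2f}x")

One number like that per box would settle the iso-power question.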

whimsicalism 4 days ago |

Got excited, then I saw it was for inference. yawns

Seems like it would obviously be in TSMC's interest to give preferential tape-out capacity to Nvidia's competitors; they benefit from having a less consolidated customer base bidding up their prices.

jszymborski 4 days ago |

Is it reasonable for me not to be able to read a single word of a text-based blog post because I don't have WebGL enabled?

kuil009 4 days ago |

The positioning makes sense, but I’m still somewhat skeptical.

Targeting power, cooling, and TCO limits for inference is real, especially in air-cooled data centers.

But the benchmarks shown are narrow, and it’s unclear how well this generalizes across models and mixed production workloads. GPUs are inefficient here, but their flexibility still matters.

nycdatasci 4 days ago |

Is this from 2024? It mentions "With global data center demand at 60 GW in 2024"

Also, there is no mention of the latest-gen NVDA chips: 5 RNGD servers generate tokens at 3.5x the rate of a single H100 SXM at 15 kW. This is reduced to 1.5x if you instead use 3 H100 PCIe servers as the benchmark.
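
Back of the envelope on why the baseline matters so much: the same RNGD measurement produces very different headlines depending on what you divide it by (the absolute throughputs below are placeholders chosen only to reproduce the quoted ratios):

    # Same hypothetical RNGD throughput, two different baselines.
    rngd_5_servers   = 35_000   # placeholder tokens/s for 5 RNGD servers
    h100_sxm_single  = 10_000   # placeholder tokens/s for 1 H100 SXM server
    h100_pcie_triple = 23_000   # placeholder tokens/s for 3 H100 PCIe servers

    print(f"vs 1x H100 SXM : {rngd_5_servers / h100_sxm_single:.1f}x")   # 3.5x
    print(f"vs 3x H100 PCIe: {rngd_5_servers / h100_pcie_triple:.1f}x")  # ~1.5x

Against the latest-gen parts the multiplier would presumably shrink further.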

zvqcMMV6Zcr 4 days ago |

It's missing the most important information: price and how quickly they can ship. If they can actually deliver and take a slice of market share from Nvidia, that would make me happy.

pama 4 days ago |

The title sounds interesting, but I get errors and no content on my iPhone 15 because it is unable to initialize WebGL. Why do people still gate content behind such capabilities? Where has simple HTML/CSS gone these days?

Edit: from the comments and the one page that does load, this is still the 5nm tech they announced in 2024, hence the H100 comparison, which feels dated given the availability of GB300.

grosswait 4 days ago |

How usable is this in practice for the average non-AI organization? Are you locked into a niche ecosystem that limits which models you can serve?

richwater 4 days ago |

This is from September 2025; what's new?

bicepjai 2 days ago |

Are all these improvements over custom kernel efficiency code? Can we bring them to consumer RTX and Pro cards?

After reading the article :) The improvements in FuriosaAI's NXT RNGD Server are primarily driven by hardware innovations, not software or code changes.

vfclists 4 days ago |

Why is their website demanding WebGL?

LTL_FTC 4 days ago |

The server seems cool, but the networking looks insufficient for data centers.

galaxyLogic 4 days ago |

How is this possible - doing AI with "dual AMD EPYC processors"? I thought you needed GPUs or something like that to do the matrix multiplications needed to train LLMs. Is that conventional wisdom wrong?

nl 4 days ago |

So inference only and slower than B200s?

Maybe they are cheap.

kalmyk 4 days ago |

that's a nice rack