Hacker news

Launch HN: Expanse (YC P26) – Unlock Wasted GPU Capacity

101 points by ismaeel_bashir 5 days ago | 27 comments

ray__ 5 days ago |

This is a cool idea. I know from snooping on submit scripts and node utilization on the HPC cluster I use at my institution that most submissions leave some compute on the table (and many of them are egregiously bad). I'd probably vote in favor of sending every submitted sbatch script through an LLM (at least for everyone else; I would prefer tuning my own usage myself :) ).

Presumably the underlying model here is also an LLM? To what degree is it "fine-tuned", or is it just given a set of tools to build a good picture of cluster usage?

flounder3 5 days ago |

One traditional enterprise goal of 40% utilization was to cover DR/failovers, so one region could take on 100% of traffic from another, with 20% headroom.
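As a rough check of that arithmetic, here is a minimal sketch. It assumes equally sized regions and that a failed region's traffic is spread evenly across the survivors; the function name and the two-region setup are illustrative, not something stated in the thread.

```python
def utilization_after_failover(per_region_utilization: float, regions: int = 2) -> float:
    """Utilization of each surviving region after one region fails.

    Assumption (hypothetical): all regions have equal capacity and the
    failed region's traffic is spread evenly across the survivors.
    """
    if regions < 2:
        raise ValueError("need at least two regions to fail over")
    total_load = per_region_utilization * regions
    return total_load / (regions - 1)


after = utilization_after_failover(0.40, regions=2)
headroom = 1.0 - after
print(f"after failover: {after:.0%}, headroom: {headroom:.0%}")
```

With two regions at 40% each, the survivor ends up at 80%, which leaves the 20% headroom mentioned above. With more regions the same headroom allows a higher steady-state target, since the failed region's load is split more ways.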

I'm curious about the granularity of contracts around granting/selling excess capacity. Are they short term? Can the owner evict those workloads (with a penalty)?

boringperson 5 days ago |

> Datacenters run at roughly 30% to 40% effective utilisation

I wonder what is stopping datacenters from passing this benefit on to customers by launching better-tuned plans, for example the T-series EC2 instances on AWS.

FattiMei 4 days ago |

HPC student here, still learning. If I understood your problem statement, users of the cluster reserve far more resources than their computation needs. They fear that if the allocated resources are not enough, their program will crash and lose partial results.

Can you give an example of typical execution on the cluster? Is it a problem of number of hours allocated or number of compute cores?

If I'm running a PDE simulation and I allocate n machines, I want to use all of them, so there is no risk of idle machines. It's not trivial to estimate a priori how long my simulation will take, so I overestimate. But when the simulation completes (even before the deadline), the resources are freed and can be used right away for another job.

Maybe the problem is when many users are greedy. Also, MPI simulations are difficult (if not impossible, correct me if I'm wrong) to resize dynamically: once a simulation has started with a given number of ranks, I can't add new ranks at will even if resources become available.
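One way to separate the two cases in the question above (hours versus cores) is to look at them as different quantities. This is a minimal sketch with hypothetical field names and made-up numbers; it does not reflect any particular scheduler's accounting format.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # All fields are hypothetical names chosen for this illustration.
    cores_allocated: int
    time_limit_hours: float  # wall time requested
    elapsed_hours: float     # wall time actually used
    cpu_hours_used: float    # CPU time summed over all allocated cores


def unused_time_limit_hours(job: JobRecord) -> float:
    """Requested wall time that was never consumed (freed when the job ends)."""
    return job.time_limit_hours - job.elapsed_hours


def idle_core_hours(job: JobRecord) -> float:
    """Core-hours held by the job while it ran but not spent computing."""
    return job.cores_allocated * job.elapsed_hours - job.cpu_hours_used


def cpu_efficiency(job: JobRecord) -> float:
    """Fraction of the held core-hours that did useful CPU work."""
    return job.cpu_hours_used / (job.cores_allocated * job.elapsed_hours)


# Made-up example: 64 cores requested for 24 h, finished in 10 h,
# but only about 16 cores' worth of CPU time was actually used.
job = JobRecord(cores_allocated=64, time_limit_hours=24,
                elapsed_hours=10, cpu_hours_used=160)
print(unused_time_limit_hours(job), idle_core_hours(job), cpu_efficiency(job))
```

In this made-up case the 14 unused hours go back to the cluster as soon as the job exits, while the 480 idle core-hours were held for the whole run and could not be used by anyone else.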

Thank you for your patience, to everyone who answers.

iroddis 5 days ago |

This is really cool, and definitely needed.

Do you do any tracking of resource consumption over the runtime of a job? We have many jobs that use the requested memory only for a portion of the runtime, and are otherwise compute bound. It would be nice to be able to learn the profiles through time of jobs and layer them to get better resource utilization.
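To illustrate what layering profiles could buy, here is a minimal sketch comparing a static reservation (each job holds its peak for its whole runtime) with the combined peak when usage is tracked per time step. The sample profiles and function names are hypothetical, and the profiles are assumed to be sampled at the same time steps.

```python
from typing import Sequence


def sum_of_peaks(profiles: Sequence[Sequence[float]]) -> float:
    """Memory reserved if every job holds its peak request for its whole runtime."""
    return sum(max(profile) for profile in profiles)


def peak_of_sum(profiles: Sequence[Sequence[float]]) -> float:
    """Highest combined usage when co-scheduled jobs are summed per time step.

    Assumes all profiles have the same length and aligned time steps;
    zip() would silently truncate to the shortest profile otherwise.
    """
    return max(sum(step) for step in zip(*profiles))


# Hypothetical memory usage in GB at four aligned time steps.
job_a = [60, 60, 10, 10]  # memory-heavy early, compute-bound later
job_b = [10, 10, 60, 60]  # the opposite shape

static_reservation = sum_of_peaks([job_a, job_b])
layered_peak = peak_of_sum([job_a, job_b])
print(static_reservation, layered_peak)
```

For these two made-up jobs the static reservation is 120 GB while the layered peak is 70 GB. Real profiles would not line up this neatly, and a scheduler would still need a safety margin for jobs that deviate from their learned profile.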

rjpruitt16 5 days ago |

I have been working on an open source traffic shaper for agents. I think it may help you with prediction if requests don't stampede you.

https://www.linkedin.com/posts/rahmi-pruitt-a1bb4a127_agentn...

mike_d 5 days ago |

From a security perspective this is a non-starter. If you leave your MongoDB instance open and I steal the telemetry you are collecting, I can reverse engineer the data into meaningful insights into cluster workloads. So all your potential national security customers or IP sensitive customers (finance, biotech, etc) are immediately out.

Any competent enterprise risk team is going to give a hard no to a SaaS application being in the critical path for on-prem, business-critical workloads. So there goes the Fortune 100 too.

If you are successful and schedule workloads better, you are just deferring upgrades and expansions. The customer's Dell/HPE/etc. sales rep is going to freak out, some vice presidents are going to go golfing together, and all the remaining high-value customers don't renew.

What you are really left with is the "small and medium business" clusters that are purpose specific. They are running 100% on a handful of tasks that can probably be hand tuned.

This sounds like really cool technology; I just don't see the business. Hopefully you'll consider open sourcing it soon.

syngrog66 5 days ago |

I'm writing a book on perf optimization and would love to ask you questions sometime. Email me (in my bio here) if interested. Thanks!

mike_d 5 days ago |

Your "OS Wastage Scanner" is grammatically incorrect. It's "waste."
