Nvidia RTX 6000 Pro power efficiency testing

in #technology, yesterday (edited)

Number Six is my newer AI server, built about a month ago. I have done a lot of testing and tweaking in that time, but I finally got around to doing a more thorough power efficiency test.

The server revolves around two Nvidia RTX 6000 Pro 600W cards. My initial testing showed only around a 4% loss in performance after setting a power limit of 300W per card. This effectively cut the cards' power ceiling by 50% and yielded around 43% power savings. A trade I was more than happy to make.

After some discussion on Twitter, I decided to spend some time on more thorough testing, now that I have had time to tweak performance and get my desired model running well.

My daily driver is GLM Air 4.5 FP8, until they get around to releasing the 4.6 they promised months ago. I typically see around 95 tokens/sec when just asking a simple question, and as much as 195 tokens/sec when doing more complex and agentic tasks.

My testing covers per-card power limits of 250W, 300W, 360W, and 600W (stock).
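Power limits like these can be set per card with `nvidia-smi`. A minimal sketch, assuming two cards at indices 0 and 1 and sufficient permissions (usually root) to change limits:

```python
# Sketch: cap each card's power draw with nvidia-smi.
# Assumes GPUs at indices 0 and 1 and permission to set limits.
import subprocess

def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the `nvidia-smi -i <idx> -pl <watts>` command for one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def apply_limits(watts: int, gpu_indices=(0, 1)) -> None:
    """Apply the same per-card power limit to every listed GPU."""
    for idx in gpu_indices:
        subprocess.run(power_limit_cmd(idx, watts), check=True)

# e.g. apply_limits(300) caps both cards at 300 W
```

Note the limit resets on reboot unless reapplied (for example from a startup service).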

250W

Input token throughput (tok/s): 1071.01
Output token throughput (tok/s): 525.69
Total token throughput (tok/s): 1596.71

300W

Input token throughput (tok/s): 1216.33
Output token throughput (tok/s): 597.02
Total token throughput (tok/s): 1813.35

360W

Request throughput (req/s): 2.46
Input token throughput (tok/s): 1263.23
Output token throughput (tok/s): 620.04
Total token throughput (tok/s): 1883.27

600W

Input token throughput (tok/s): 1274.46
Output token throughput (tok/s): 625.55
Total token throughput (tok/s): 1900.02

These tokens/sec numbers seem high, but the benchmark simulates a multi-user workload, which achieves considerably higher aggregate throughput than a single user making one request at a time.

Peak performance is of course at 600W with 625.55 output tokens/second, and the lowest is at 250W with 525.69 tokens/second. Taking everything into account, though, 300W is the clear winner at 597.02 tokens/second.

Things get really interesting, though, when you look at the actual power draw.

The 250W run actually uses more energy overall: the tests take longer, and it even spikes higher than the 300W run. If you look closely at the graph, the 250W test hits as high as 862W, whereas the 300W test peaked at 821W. The average wattage is fairly similar between these two tests.

Performance & Efficiency Comparison

| Per-card limit | System power (measured) | Total tok/s | % of max throughput | Output tok/s | Median TTFT | Median ITL | Tokens per Watt | Efficiency vs 600 W |
|---|---|---|---|---|---|---|---|---|
| 250 W | 814 W | 1,597 | 84.0% | 526 | 229.8 s | 20.68 ms | 1.963 | +27% |
| 300 W | 816 W | 1,813 | 95.4% | 597 | 201.9 s | 17.79 ms | 2.223 | +44% |
| 360 W | 990 W | 1,883 | 99.1% | 620 | 195.5 s | 17.33 ms | 1.902 | +23% |
| 600 W (max) | 1,229 W | 1,900 | 100% | 626 | 196.6 s | 17.27 ms | 1.546 | baseline |
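The tokens-per-watt column follows directly from the measured numbers. A quick sanity check (inputs are rounded, so the last digit can differ slightly from the table):

```python
# Recompute tokens per watt and the efficiency gain vs the 600 W baseline
# from the measured system power and total throughput above.
def tokens_per_watt(total_tps: float, system_watts: float) -> float:
    return total_tps / system_watts

def gain_vs_baseline(tpw: float, baseline_tpw: float) -> float:
    """Percent efficiency gain relative to the stock 600 W run."""
    return (tpw / baseline_tpw - 1) * 100

baseline = tokens_per_watt(1900.02, 1229)   # ~1.546 tok/W at stock
for limit, watts, tps in [(250, 814, 1596.71),
                          (300, 816, 1813.35),
                          (360, 990, 1883.27)]:
    tpw = tokens_per_watt(tps, watts)
    print(f"{limit} W: {tpw:.3f} tok/W, "
          f"{gain_vs_baseline(tpw, baseline):+.0f}% vs 600 W")
```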

Summary – vs full 600 W mode

| Per-card limit | System power | Power saved vs max | Throughput loss vs max | Efficiency vs 600 W |
|---|---|---|---|---|
| 300 W per card | 816 W | −34% | −4.6% | +44% |
| 360 W per card | 990 W | −19% | −0.9% | +23% |
| 250 W per card | 814 W | −34% | −16% | +27% |
| 600 W per card (max) | 1,229 W | 0% | 0% | baseline |
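The summary percentages reduce to one simple ratio against the 600 W run, applied to both power and throughput:

```python
# Percent change in power draw and throughput relative to the stock 600 W run.
def pct_change(value: float, baseline: float) -> float:
    return (value / baseline - 1) * 100

BASE_WATTS, BASE_TPS = 1229, 1900.02
for limit, watts, tps in [(300, 816, 1813.35),
                          (360, 990, 1883.27),
                          (250, 814, 1596.71)]:
    print(f"{limit} W: {pct_change(watts, BASE_WATTS):.0f}% power, "
          f"{pct_change(tps, BASE_TPS):.1f}% throughput")
```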

In reality, the numbers are even more in favor of 300W, since I deliberately used peak wattage for the comparison. It is interesting that 360W is where you get almost no loss in performance, at 99.1% throughput, but the power savings there are minimal.



This post has been shared on Reddit by @themarkymark through the HivePosh initiative.

Interesting stuff. I hope it ends up getting you the information that you are looking for.


You could start a business renting out space for processing lol

Not at my electricity prices.

Is solar power an option?

I wish, but it generally isn't practical.
Takes 20 years to break even, by then it's almost dead. Loses 1-2% efficiency a year, and in snowy areas you got to pay hundreds if not thousands to have roof cleaned once and a while if there is build up.

I've been looking into it, but it kind of sucks.

Works well here on the other side of the world, I made back about 15% of my solar plus battery system in the first year.

I have a lot of trees so there is only partial sun, I do have a quote from Tesla though I've been looking at.


Can you automate testing of individual watt steps between 300 and 360 to see if there's more of an edge somewhere in that continuum?

I'm happy at 300, I think 310 is likely the sweet spot, but it's so minor I don't care.


I know you had/have a Strix Halo that you were testing out, but now I see you have some RTX Pro 6000 Workstation editions, and it got me thinking...

Have you gotten your hands on a DGX Spark (or 2), or maybe an AMD Instinct?

I know a Spark is not built for fast TPS; it was built for capacity, mainly for light development on the go before loading those same workloads onto proper DGX servers. So while an RTX Pro will blow it out of the water with something like 8 times the bandwidth, it also has double the TDP (not counting that you have 2 RTX Pros). So I am curious whether you have done a true deep-dive comparison across multiple GPUs, like comparing 2 Sparks to 1 RTX Pro cost-wise, or something like that.

I only get to play with DGX H200, B200, and B300s at work (still waiting on my Spark to arrive eventually), and there is no way to compare apples to apples between a DGX B300 and an RTX Pro or Strix Halo lol

The DGX Spark and the Strix are very similar, and the Strix is a lot cheaper, almost half the price.

I hated the Strix. It was fun to tinker with, but even though I got "good" speeds with gpt oss 120 (50 tokens a second), it was still painfully slow for anything.

Just chatting was OK: say hi and it responds back. But with anything agentic, or even real work, you can see the problems with slow prompt processing.

For example, I wrote a CLI tool called please. So I can type "please stop process on port 8000" and it asks the LLM for the command to do it, letting you execute it by just pressing 1.

Very small context, 1000 tokens at best. When pointing to a cloud API using the same model, it takes a second or so to get a response. Using the Strix Halo, it would take around 10 seconds.
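Roughly, the flow looks like this (a hypothetical sketch; the prompt wording and function names are my illustration, not the actual please implementation):

```python
# Hypothetical sketch of a "please"-style CLI: ask an LLM for a single
# shell command, show it, and only run it after explicit confirmation.
import subprocess

def build_prompt(request: str) -> str:
    """Wrap the natural-language request in a prompt asking for one command."""
    return ("Reply with a single shell command, and nothing else, "
            f"that does the following: {request}")

def run_with_confirmation(command: str) -> bool:
    """Show the proposed command; execute it only if the user presses 1."""
    print(f"Proposed: {command}")
    if input("Press 1 to run: ").strip() == "1":
        subprocess.run(command, shell=True)
        return True
    return False
```

The round trip is one tiny prompt and a one-line completion, which is why slow prompt processing dominates the latency here.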

The amount of context needed to answer this simple query should be tiny, yet it was still 10x slower than using the cloud, and this is an incredibly simple task. Image generation, coding, and any other agentic task were so slow they were unusable. I thought about using it just to handle a small model for reasoning or other tasks, and it just becomes a bottleneck.

I was hoping that adding a 3090 via eGPU would help the prompt processing part, but it was just too slow when they were forced to work together.


Honestly, I don't know shit about running servers; I just have a high-end PC that I built myself.
What is the operating temperature of your cards? Heat is also lost energy.

You can see it in the final screenshot, taken while the cards were running full tilt. Bottom right corner.


The cards are $9000 each, I wouldn't sell them for $5000 dumbass.
