Ok, this is pretty deep in the weeds of geekville, so if you don't know what a q8 quant is, you probably want to skip this one.
My current local AI server is an AMD Strix Halo 395+ with 128GB of RAM. This is relatively new tech with a large pool of shared memory, similar to a Mac, that can be used to run very large AI models at moderately decent speeds.
The main model I run is OpenAI's open-weight model, GPT-OSS-120B, which is a very large model that takes around 75GB of VRAM at q8 and around 50GB at q4. Most GPUs have 8-16GB of VRAM, and even the top-of-the-line Nvidia 5090 only has 32GB.
My old GPU (an Nvidia 3090) has around 936GB/s of memory bandwidth, whereas the AMD 395+ (Strix Halo) has around 253GB/s. While there is no way to add an internal GPU to the Strix Halo, you can use the 2nd M.2 NVMe slot to connect a GPU via Oculink. The M.2 slot offers four PCIe lanes (x4), quite a bit slower than the full x16 slot you would normally use for a GPU, but for inference (asking AI questions, not building new AI models) this is not a major concern. The reason is that once you load the model into the GPU, it doesn't talk to the CPU very much. If you were fine-tuning (making new models), this would be a major problem.
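Some quick napkin math shows why the skinny link barely matters once the model is loaded. The numbers below are illustrative assumptions (link speed, model size, hidden-state size), not measurements from my setup:

```python
# Back-of-the-envelope: one-time model load vs per-token traffic over the Oculink/x4 link.
# All figures below are assumptions for illustration, not measured values.

pcie_x4_gbps = 7.0        # GB/s, roughly usable PCIe 4.0 x4 bandwidth (x16 would be ~4x this)
model_gb     = 50.0       # GB, ballpark q4 weights, transferred across the link exactly once
hidden_bytes = 4096 * 2   # assumed hidden-state size in fp16 values crossing the link per token

load_seconds = model_gb / pcie_x4_gbps
per_token_ms = hidden_bytes / (pcie_x4_gbps * 1e9) * 1e3

print(f"one-time model load over the x4 link: ~{load_seconds:.0f} s")
print(f"per-token traffic over the link:      ~{per_token_ms:.4f} ms")
```

The one-time load takes a few extra seconds, but the per-token traffic is thousandths of a millisecond against generation times of ~20 ms per token, so the x4 link is not the bottleneck for inference.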
This means hooking up an external GPU via Oculink is a very viable option. I've been looking around and talking to a lot of people with Strix Halos, and while I have seen a handful of users with eGPU setups, no one has done sufficient testing on how it performs with AI.
Since I already had an Nvidia 3090 on a shelf, I thought I'd give it a try. The 3090 is a very good card for AI: it has 24GB of VRAM, it is fairly quick, and it is cheaper than current-gen GPUs. I've seen some systems using 8-12 3090s to run large models.
I did expect to lose some performance due to the x4 bus and the Oculink protocol.
After some benchmarking, the results are less than I wanted, but roughly what I expected.
The stock AMD 395+ scores with a small prompt:
prompt eval time = 1034.63 ms / 277 tokens ( 3.74 ms per token, 267.73 tokens per second)
eval time = 2328.85 ms / 97 tokens ( 24.01 ms per token, 41.65 tokens per second)
total time = 3363.48 ms / 374 tokens
With both the AMD 395+ and the 3090:
prompt eval time = 864.31 ms / 342 tokens ( 2.53 ms per token, 395.69 tokens per second)
eval time = 994.16 ms / 55 tokens ( 18.08 ms per token, 55.32 tokens per second)
total time = 1858.47 ms / 397 tokens
That's 47.8% faster prompt processing (a known AMD 395+ weakness) and 32.8% faster token generation. Not bad really; getting 55 tokens/sec is pretty impressive considering how large this model is.
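For anyone who wants to check my math, the percentages fall straight out of the tokens-per-second figures in those two runs:

```python
# Speedup math taken directly from the llama.cpp-style numbers above.
baseline = {"prompt_tps": 267.73, "gen_tps": 41.65}   # AMD 395+ alone
egpu     = {"prompt_tps": 395.69, "gen_tps": 55.32}   # AMD 395+ plus 3090 over Oculink

for key, label in [("prompt_tps", "prompt processing"), ("gen_tps", "token generation")]:
    gain = (egpu[key] / baseline[key] - 1) * 100
    print(f"{label}: {baseline[key]:.2f} -> {egpu[key]:.2f} t/s (+{gain:.1f}%)")

# prompt processing: 267.73 -> 395.69 t/s (+47.8%)
# token generation:  41.65 -> 55.32 t/s  (+32.8%)
```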
I'm not sure if I can squeeze more out of this setup. I don't actually use it for much, as I prefer even faster speeds for day-to-day stuff, and I am planning a much bigger system once I get my stock system fully functional.
Since the EVO-X2 does not have an Oculink port, I had to use this M.2 to Oculink 4i cable from Amazon ($24.73).
I used this Minisforum DEG1 eGPU dock to mount the power supply and GPU, also picked up on Amazon for $100 shipped.
There is a slightly nicer but much more expensive eGPU dock that includes a power supply for $260, also from Amazon.
I have a lot of high-end power supplies lying around, and I wasn't even sure this would work.
My next project is to try my 5090 and see what sort of gains I can get from that; it is not only twice as fast, it also has an additional 8GB of VRAM, which allows many more layers to be loaded.
I might even try putting two 3090s in the machine, using up both M.2 slots and booting off a flash drive. This would be really slow to load the model, but once it is loaded, it really doesn't need any drive access. Although this is just getting silly; putting in a 5090 would be a much better option.
That's a ton of VRAM. Are you mainly going to earn by selling time or use it so you don't have to buy compute through memberships?
I use it for some of my systems so I don't have to use API credits, but also for privacy, so my data isn't being used to train models.
Nice. It's so easy to blow through credits on these services.
Impressive setup! The performance gains with the 3090 must be satisfying :)
Wow, you really can handle high speeds, and I see you're looking for more. Good for you... Nothing better than investing in our work tools. Happy start to the week! ✌️
This is indeed far too profound.
By the way, six years ago I built a huge desktop machine at home.
It was capable of supporting multiple graphics cards, but I only installed one of the cheapest ones just for basic display output.
About three years ago, I even removed that cheap GPU, because as a network server, I could connect to it via SSH — no need for a monitor at all.
Around eight months ago, I deployed a home AI assistant on it (ollama + open-webui + DeepSeek R1 14B).
It barely works, but I honestly couldn’t imagine what to use it for — after all, ChatGPT is already great and convenient enough for me.
I was looking around yesterday and came across this and got some ideas.
I run headless whenever possible.
If you are into self-hosting, take a look at KaraKeep. It's basically Pocket, but open source and bring-your-own-AI. I use this 100 times a day.
Check out LM Studio if you haven't, so much better than Ollama.
I don't use local AI much; I mostly use providers, so it is mostly for fun since I need the speed more. But I am working on a project where I do not want my data in the cloud, and once I can prove it is profitable, I will be getting some RTX 6000 Pros and Epyc CPUs to run Kimi K2 locally.
I think it's impressive.
Using 2 GPUs and booting from flash is a bit overkill, but would make me giggle.
It would be another 24GB of VRAM, allowing me to keep GPT-OSS-120B almost fully in VRAM, probably speeding it up 200%.
How long would it take to load the GGUF from flash though? I guess it only has to do it once.
Probably 10-20 minutes instead of 10-20 seconds, and I only do it once every 3 months or so. I do want to try the 5090 on this setup; that would be a massive boost.
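Rough napkin math behind that estimate (the throughput and file-size numbers are assumptions, not benchmarks from my hardware):

```python
# Assumed sequential-read throughputs and GGUF size, for illustration only.
nvme_gbs  = 5.0    # GB/s, typical Gen4 NVMe drive
flash_gbs = 0.08   # GB/s, roughly 80 MB/s for a decent USB flash drive
model_gb  = 60.0   # GB, ballpark size of the GGUF on disk

print(f"NVMe load : ~{model_gb / nvme_gbs:.0f} s")          # on the order of 10-20 seconds
print(f"Flash load: ~{model_gb / flash_gbs / 60:.0f} min")  # on the order of 10-20 minutes
```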
Pretty cool. I had looked at going the external video card route in the past for other uses, but I was never able to talk myself into it. It sounds like you have some lofty plans for the future!
I'm just goofing off, I am working on a much bigger server. This one is just for hobby stuff.
I think you just need a Nvidia Spark next 😉
But this is neat, love the nerdiness
It's roughly the same machine at twice the price
Well, similar RAM and memory bandwidth, but on the compute side the Spark is king (1 petaFLOP). But that is coming from someone who installs Nvidia DGX systems for a living, so I am slightly biased haha
You know how much the DGX Station is going to run?
The "Station's" are the workstation type servers that take normal power and sit under a desk. To my knowledge Nvidia stopped producing them internally after the Station A100, but other partners may produce them still.
The full-fledged DGX servers like the B200 or H200 (latest versions) are in the $300,000 - $400,000 price range just for a single DGX :)
When I set them up I always want to stress test them by mining crypto for like an hour haha, but I've never gotten the approval to do it.
Also, I had a customer with an old Station A100 they said they didn't want and that I could have, but that was 2 years ago and I was an idiot; I was offered it about 20 minutes before I had to leave to fly home :( I regret it every day. Used ones are anywhere from $10,000 - $20,000 now on the secondary market.
https://docs.nvidia.com/dgx/dgx-station-a100-user-guide/intro-to-station-a100.html
Talking about this one, been waiting for the announcement.
Ah I didn't know they were bringing back the Nvidia branded Stations... My Nvidia rep didn't mention it so now I gotta go give him some shit haha
Well, I do know the Sparks were a priority; those were delayed some and should finally be shipping out in Q4. But when I head to Nvidia GTC later this month, I will track down any info I can :)
It's also twice the price, and rumors say it will be about 10-15% faster.
My PC has a 3060 Ti and I was thinking of running one of OpenAI's smaller models, but seeing your setup, I'm having second thoughts 😂
Off topic, but I can't ask on Discord unless I reload it; I took it off for a transition to a new computer, should I finally make up my mind which one to buy. Anyway, I got a text today that seems to imply that someone asked for a withdrawal code from Coinbase. It said, "Your Coinbase withdrawal code is XXXXXX. Please do not share this code with anyone. If you have not requested this, please call 888-986-0543" and gave a ref # to reference, I guess, what transaction it was about. It's not unusual to receive emails that I just ignore, but this appears to be someone who attempted a withdrawal and needs a code verification via text. I haven't used Coinbase outside of signing up for it years ago. That's what most of the emails were about: Coinbase wanting confirmation of dormant accounts. I would have asked you if that was legit before responding too, except I figured if they were going to delete the account after a certain amount of time, I never used it so I didn't care; therefore, if it was a scam email, I also wouldn't care. This one though is more concerning, not that I have any real worth invested here; I just wouldn't want to start over with not even a penny's worth of upvote or resource credits. It may not even be Hive related; Bastyon would be more worthwhile to try to take someone's change.
There's a 90% chance it's completely fake and has nothing to do with the official Coinbase. Even if they did trigger that, you likely wouldn't be at risk unless you handed them the code.
Blocktrades, can you please explain why you keep downvoting my original content?
You are even downvoting my comments
What is the problem 🤔