If your team is building internal AI workflows and trying to decide between hosted models and private deployment, this is the question worth answering first. We’ve been testing what it takes to deploy an open-source LLM on our own infrastructure for internal AI products and agentic workflows (RAG, tool calling, and multi-step flows), and the biggest lesson is this: self-hosting can be a strong option, but it is not a shortcut.
This post is for technology leaders and product teams who need to make a practical decision about control, cost, security, and operational overhead before they commit to a direction.
Why hosting your own model is appealing
What we liked:
Predictable costs: once deployed, cost is largely infrastructure-based instead of scaling directly with every API call.
Security: data stays inside your infrastructure, which can simplify risk conversations for internal use cases.
Control: the model you deploy won’t suddenly change or be deprecated by a vendor.
What we didn’t like:
Model capability gap: open-source models can still be less capable than leading commercial models, depending on the task. That can translate into lower answer quality, more human QA, and slower adoption if the workflow depends on strong reasoning.
You own the ops: infrastructure, upgrades, reliability, and debugging are on you. That means the real cost is not just hardware. It is the ongoing operational burden.
Tooling and cluster maturity
A big takeaway: cluster solutions are not fully production-ready yet unless your team is comfortable operating in a fast-moving ecosystem.
Documentation often covers the happy path.
Edge cases are yours to solve, which increases delivery risk if you are trying to support a production internal tool.
There are multiple clustering approaches:
Exo felt user-friendly, but not stable enough for us yet. Llama.cpp with RPC is more server-oriented, but still in preview mode.
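For the llama.cpp route, the shape of a deployment looks roughly like this (a sketch based on llama.cpp's RPC backend; the ports and worker IPs are placeholders, not our actual configuration):

```shell
# On each worker node: build llama.cpp with the RPC backend enabled
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# Start an RPC worker on each node (port is illustrative)
./build/bin/rpc-server -p 50052

# On the head node: serve a GGUF model, offloading work to the RPC workers.
# The IPs below are placeholders for the worker machines.
./build/bin/llama-server -m model.gguf \
  --rpc 192.168.1.11:50052,192.168.1.12:50052 \
  -ngl 99
```

This is exactly the part that is still marked experimental upstream, which is why we describe it as preview-grade rather than production-ready.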
Model performance varies significantly:
Larger models typically run at lower throughput than smaller ones. “Thinking” style models can also consume far more tokens per answer, which can amplify latency and performance bottlenecks in real workflows.
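To make the token amplification concrete, here is a back-of-the-envelope sketch. The throughput and token counts are illustrative assumptions, not our benchmark numbers:

```python
def answer_latency_s(tokens_per_answer: int, tokens_per_second: float) -> float:
    """Rough end-to-end generation time, ignoring prompt processing."""
    return tokens_per_answer / tokens_per_second

# Illustrative numbers: a concise instruct model vs. a "thinking" model
# that emits a long reasoning trace before the final answer.
concise = answer_latency_s(tokens_per_answer=300, tokens_per_second=12.0)
thinking = answer_latency_s(tokens_per_answer=2400, tokens_per_second=12.0)

print(f"concise: {concise:.0f}s, thinking: {thinking:.0f}s")  # 25s vs 200s
```

At the same throughput, an 8x increase in tokens per answer is an 8x increase in wait time, which is the difference between an interactive tool and one people stop using.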
If you need advanced reasoning, RAM requirements can get very large (on the order of hundreds of GB), which can quickly turn into a hardware procurement and infrastructure cost issue.
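A rough sizing formula shows why: weight memory scales with parameter count times bits per weight. This sketch is a floor estimate only, since it ignores KV cache and runtime overhead:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough floor for model weight memory; excludes KV cache and runtime overhead."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# An 8B model at 4-bit quantization fits comfortably on a 16 GB node...
print(round(weight_memory_gb(8, 4), 1))    # 4.0
# ...while a 405B model at 8-bit needs ~405 GB for weights alone.
print(round(weight_memory_gb(405, 8), 1))  # 405.0
```

That second number is what turns "we want stronger reasoning" into a hardware procurement conversation.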
What we actually tested
Setup: A small cluster of 3 Mac mini M4 machines (16GB RAM each)
We chose this setup because it is relatively affordable to buy and run. Our goal was to learn about tooling and stability before investing in stronger hardware or deploying in the cloud.
Models:
We tested models from mlx-community on Hugging Face that fit our memory constraints.
We also verified that we could load swiss-ai/Apertus-8B-Instruct-2509 directly from Hugging Face.
What this test proved for us:
We can get open-source models running locally and clustered for early experimentation.
We can validate basic model loading and start testing workflows relevant to internal AI products.
The biggest constraint was not the model. It was cluster stability and operational maturity.
Issues we hit (so you don’t have to)
With exo, we ran into a failure mode where the master loses connection and crashes. On restart, it replays cluster messages and takes a very long time. Restarting the cluster clears the event stream, but then the deployed model is lost.
Business impact: this kind of instability creates downtime risk for internal copilots and workflow tools, which hurts trust with users fast.
Code changes move fast. Pulling updates broke builds for us (for example, new dependencies surfaced that needed resolution).
Business impact: this is a maintenance burden, not a one-time setup project. Teams need to plan for ongoing support.
What this means for enterprise teams
Self-hosting can be the right move, but it is rarely the right first move unless you already have the platform and infrastructure muscle to support it.
The biggest risk is not usually “Can we run a model?” It is whether the system is reliable enough to support the workflow your team depends on. In practice, infrastructure maturity, operational ownership, and model fit all matter as much as the model itself.
What are our next steps?
We'll validate cluster stability end-to-end (we may need to wrap the cluster with a watchdog). Then we'll pick a single open-source instruct model and verify that we can run tool calling and agentic flows reliably and repeatably.
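The watchdog itself can be quite simple. Here is a minimal sketch of the idea; the health URL and restart command are assumptions for an exo-style cluster head, not settings from our deployment:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:52415/"          # assumed cluster-head endpoint
RESTART_CMD = ["systemctl", "restart", "exo"]   # placeholder restart mechanism
MAX_FAILURES = 3

def is_healthy(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the cluster head answers an HTTP probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False

def should_restart(consecutive_failures: int, max_failures: int = MAX_FAILURES) -> bool:
    """Restart only after several consecutive failed probes, to avoid flapping."""
    return consecutive_failures >= max_failures

def watch(poll_s: float = 30.0) -> None:
    """Poll the head node and restart the cluster after repeated failures."""
    failures = 0
    while True:
        failures = 0 if is_healthy(HEALTH_URL) else failures + 1
        if should_restart(failures):
            subprocess.run(RESTART_CMD, check=False)
            failures = 0
        time.sleep(poll_s)

# In a real deployment this would run as a service, e.g. calling watch()
# from a small entrypoint script supervised by launchd or systemd.
```

Note the consecutive-failure threshold: restarting on a single failed probe would amplify exactly the flakiness we are trying to contain.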
A practical path to get started for your team
If you're planning to self-host an LLM for your team:
- First, validate the workflow (what the AI needs to do in the business)
- Next, test model quality against real use cases
- Then, harden the deployment path that matches your security and cost constraints
If you're still deciding between hosted and self-hosted AI for internal workflows, or aren't sure where to start, Gritmind can help you evaluate the tradeoffs, validate model fit, and run a production-minded pilot before you overinvest.
