
RamaLama Project Brings Containers and AI Together


The RamaLama project stands at the intersection of AI and Linux containers and is designed to make it easier to develop and test AI models on developer desktops.

With the recent launch of RamaLama’s website and public invitation to contribute, I decided to catch up with two of the founders of the project, Eric Curtin and Dan Walsh. Dan and Eric previously worked together on the Podman container management tool, recently accepted as a Cloud Native Computing Foundation (CNCF) project.

How RamaLama Got Started

Scott McCarty: How did you get involved with RamaLama?

Eric Curtin, software engineer at Red Hat: RamaLama was a side project I was hacking on. We started playing around with LLaMA.cpp, making it easy to use with cloud native concepts. I’m also a LLaMA.cpp maintainer these days. I have a varied background in software.

Dan Walsh, senior distinguished engineer at Red Hat: I now work for the Red Hat AI team. For the last 15 years, I have been working on container technologies including the creation of Podman, now a CNCF project. For the last year or so, I’ve worked on bootable containers and this led to working on Red Hat Enterprise Linux AI (RHEL AI), which used bootable containers for AI tools. I also worked on the AI Lab Recipes, which used containers for running AI workloads. I worked with Eric a couple of years ago on a separate project, so we have kept in touch.

Scott: How and when did the RamaLama project get started?

Dan: Eric wrote up some scripts and was demonstrating his tools last summer, when I noticed the effort. I was concerned that the open source AI world was ignoring containers and was going to trap AI developers into specific laptop hardware and operating systems and, more importantly, exclude Linux and Kubernetes.

Eric: The initial goal of RamaLama was to make AI boring (easy to use) and use cloud native concepts. It was called podman-llm at the time. We had two main features planned back then: pull the AI accelerator runtime as a container and support multiple transport protocols (OCI, Hugging Face, Ollama). The diagram in the README.md today hasn't really changed since then.

Dan: I started suggesting changes like moving it to Python to make it easier for contributors and line up with most AI software. We renamed the project “RamaLama.” I also suggested we move the tools to the containers org on GitHub, where we had our first pull request merged on July 24, 2024.

Scott: Where did the name come from?

Eric: (laughs) I’ll leave that to Dan.

Dan: A lot of open AI content is using some form of Llama, spearheaded by Meta’s Llama2 AI model. We based some of the technology in RamaLama on Ollama, and the primary engine we use inside of the containers is LLaMA.cpp. So we wanted to somehow have a “llama” name. A silly song I recalled from when I was young was “Rama Lama Ding Dong,” so we picked the name RamaLama.

How RamaLama Works

Scott: What’s the advantage of using container images for AI models on the desktop?

Eric: We already use Open Container Image (OCI) as a distribution mechanism for things like application containers, bootc and AI runtimes. OCI registries are designed to transfer large data, and it’s a mature transport mechanism that’s already available in a lot of places.

Dan: Enterprises want to be able to store their AI content on their infrastructure. Many enterprises will not allow their software to pull directly from the internet. They will want to control the AI models used. They will want their models signed, versioned and with supply chain security data. They will want them to be orchestrated using tools like Kubernetes. Therefore, being able to store AI models and AI content as OCI images and artifacts makes total sense.

Scott: How does RamaLama work?

Eric: RamaLama attempts to autodetect the primary accelerator in a system; it will pull an AI runtime based on this. Then it will use or pull a model based on the model name specified — for example, ramalama run granite3-moe — and serve that model. That's the most basic usage; there's functionality for Kubernetes, Quadlet and many other features.
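In practice, that flow maps to a handful of commands. The sketch below is illustrative only: ramalama run granite3-moe is quoted above, while the pull and serve subcommands are assumptions based on Eric's description; see the project's README for the exact interface.

    # Pull a model by name; the transport (Ollama, Hugging Face, OCI) is resolved from the name (assumed subcommand)
    ramalama pull granite3-moe

    # Chat with the model locally; the matching accelerator runtime image is pulled automatically
    ramalama run granite3-moe

    # Serve the model over HTTP for local applications to call (assumed subcommand)
    ramalama serve granite3-moe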

Dan: Another goal for RamaLama is to help developers get their AI applications into production. RamaLama makes it easy to convert an AI model from any transport into OCI content and then push the model to an OCI registry such as Docker Hub, Quay.io or Artifactory. RamaLama not only serves models locally but also generates Quadlets and Kubernetes deployments to make it easy to run the AI models in production.
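As a rough illustration of that workflow, the commands below are a sketch: the convert and push subcommands and the --generate option are assumptions based on Dan's description, and quay.io/example/granite3-moe is a hypothetical registry path.

    # Convert a model from another transport into an OCI artifact and push it to a registry (assumed subcommands)
    ramalama convert granite3-moe quay.io/example/granite3-moe:latest
    ramalama push quay.io/example/granite3-moe:latest

    # Emit deployment files instead of serving directly (assumed flag)
    ramalama serve --generate kube granite3-moe
    ramalama serve --generate quadlet granite3-moe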

Scott: Why is RamaLama important?

Dan: We make it easy for users to install RamaLama and, with a single command, get an AI model up and running as a chatbot or serve it as an AI-based service, as opposed to having to download, install and in some cases build the AI tools before pulling a model to the system. One of the key ideas of RamaLama is to run the model within a container to keep the model, or the software running it, from affecting the host machine. Users running random models is a security concern.

Eric: It has given the community an accessible project for AI inferencing using cloud native concepts. We are also less opinionated about things like inferencing runtimes, transport mechanisms, backend compatibility and hardware compatibility, letting developers use and build on AI on their chosen systems.

RamaLama’s Support for Hardware and Other Tools

Scott: Are you able to support alternative hardware?

Eric: This is one area where RamaLama differs. Many projects have limited hardware support, covering just one or two types of hardware, like Nvidia or AMD. We will work with the community to enable alternative hardware on a best-effort basis.

Dan: RamaLama is written in Python and can probably run anywhere that Python is supported and the Podman or Docker container engines are available. As far as accelerators, we currently have images to support CPU-only, as well as Vulkan, CUDA, ROCm, Asahi and Intel GPU. A lot of these were contributed by the community, so if someone wants to contribute a Containerfile (Dockerfile) to build the support for a new GPU or other accelerator, we will add it to the project.

Scott: What other tools does RamaLama integrate with?

Eric: RamaLama stands on the shoulders of giants and uses a lot of pre-existing technologies. From the containers perspective, we integrate with existing tooling like Podman, Docker, Kubernetes and Kubernetes-based tools. From the inferencing perspective, we integrate with LLaMA.cpp and vLLM, so we are compatible with tooling that can integrate with those APIs. There are probably ways it's being used that we're unaware of.
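One consequence of building on LLaMA.cpp and vLLM is that both expose OpenAI-compatible HTTP endpoints, so a model served by RamaLama can be queried with any OpenAI-style client. A minimal sketch, assuming a model is already being served locally and that the port and endpoint path follow the common defaults (neither is confirmed in the interview):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "granite3-moe", "messages": [{"role": "user", "content": "What is an OCI image?"}]}'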

Scott: Does RamaLama work with the new DeepSeek AI model?

Eric: Yes, we were compatible with DeepSeek on the day the model was released. It’s one of the more impressive models; it’s interesting how it shows its thought process.

Dan: We have found very few GGUF (GPT-Generated Unified Format) models that it does not work with. When we have, we have worked with the LLaMA.cpp project to get them fixed, usually within a few days. We plan on supporting other models for use with vLLM as well.

What’s Ahead for AI?

Scott: Any other thoughts on RamaLama or the future of AI?

Dan: I see our AI adventure as a series of steps. First, we play with and serve AI models. RamaLama does this now. We want to enhance this by adding other ways of using AI models, like Whisper. Next, we are actively working on helping users convert their static documents into retrieval-augmented generation (RAG) databases using open source tools like Docling and Llama Stack. After that, we add support for running and serving models along with RAG data to improve the ability of AI models to give good responses. All of this will be done with a focus on containerizing the AI data as well.

The next step after that is support for AI agents. These agents allow AI models to interact with random APIs and databases all over the internet. We are seeing a ton of work going on in this field in the open source world. We want to make it easy for developers to take advantage of these tools and to eventually put them into production.

Eric: We welcome the community to get involved. I still see RamaLama as being in its infancy. We've only barely touched on things like RAG, AI agents, speech recognition and Stable Diffusion. I'm looking forward to seeing how the community will use it. Podman at the start was used for things like servers; now we see more creative uses of it, like Podman Desktop, toolbox and bootc. I'm looking forward to seeing how RamaLama evolves for unprecedented use cases.

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon Europe in London from April 1-4.

