Deploy and Use Open Source GPT Models for RAG

• You work as a network engineer at a renowned system integrator company. You are tasked with configuring a broad range of networking devices—from enterprise-level Cisco Catalyst switches to Cisco Nexus 9000 devices in data centers. Keeping track of configuration details across these platforms is a challenging task. Although you are adept at reading and writing technical documentation, it still takes you a considerable amount of time. You have a subscription to a cloud AI provider that could streamline this research, but company policy prohibits you from uploading any confidential information to it. Instead, you consider deploying a comparable AI solution on-premises yourself.

  • While searching for an appropriate on-premises solution, you come across various open-source GPT models and chatbot applications capable of managing general IT tasks. A standout discovery is the open-source Open WebUI application, which incorporates an Ollama inference server and offers a user-friendly chat interface. This interface is equipped with advanced features such as Retrieval Augmented Generation (RAG), allowing you to upload files and use them as reference data for the GPT. Remarkably, deploying this application is straightforward, requiring just a simple Docker command. You decide to try Open WebUI.

  • To proceed, you need a computer—either physical or virtual—with a GPU, which considerably enhances processing speeds. Fortunately, the IT Ops department has provided you with a Linux VM that is equipped with 8 GB of GPU RAM and Docker already configured to use the GPU resources.

• To deploy the Open WebUI application, you run the following Docker command that you found in the official documentation:

docker run -d -p 3000:8080 -e WEBUI_AUTH=False --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

  • This Docker command starts the container in detached mode by using -d, which allows it to operate in the background without occupying the terminal. The command also maps port 3000 on your VM to port 8080 on the container via -p 3000:8080, enabling you to access the application through your VM IP address on port 3000. Authentication is disabled with -e WEBUI_AUTH=False for easier access during initial tests. The --gpus=all option allocates all available GPU resources to the Docker container to help ensure optimal performance.

• Further, the command mounts volumes with -v ollama:/root/.ollama and -v open-webui:/app/backend/data for storing the downloaded models and application data, respectively. The container is named open-webui for straightforward management and is set to restart automatically with --restart always should it stop unexpectedly, for example after a host reboot. Finally, ghcr.io/open-webui/open-webui:ollama specifies the Docker image, which is sourced from the GitHub Container Registry.
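If you want to confirm the port mapping programmatically, the following is a minimal sketch (not taken from the official documentation) that polls the published port from the VM itself using Python and the third-party requests library; replace localhost with the VM IP address if you run it from another host.

```python
import time

import requests  # third-party HTTP client: pip install requests

# Poll the port published by "-p 3000:8080" until Open WebUI answers.
# Run this on the VM itself, or replace "localhost" with the VM IP address.
URL = "http://localhost:3000"

for attempt in range(30):
    try:
        response = requests.get(URL, timeout=5)
        if response.ok:
            print(f"Open WebUI is up (HTTP {response.status_code})")
            break
    except requests.exceptions.ConnectionError:
        pass  # the container may still be starting
    time.sleep(2)
else:
    print("Open WebUI did not respond within about a minute")
```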

  • When you press the Enter key, you feel relieved that you have avoided a lengthy and tedious installation and configuration process. After a few seconds, the Open WebUI application is up and running on your VM, providing a robust, in-house solution for your documentation and configuration search needs while remaining compliant with your company's data security policies.

Get Started with Open WebUI and RAG

• With the Open WebUI application up and running, you begin searching the web to understand how Retrieval Augmented Generation (RAG) functions. You learn that RAG relies on a technique called semantic search, which identifies relevant context within the files you upload to the RAG system. Unlike traditional search methods that look for exact keyword matches, semantic search aims to grasp the intent and contextual meaning behind the words in a query.

  • A key component that enables semantic search is the embedding process that transforms the text from your queries and potential information sources into numerical vectors. These vectors are essentially lists of numbers that represent text in a high-dimensional space. You can think of these vectors as coordinates on a map where texts with similar meanings have vectors that are closer together. For example, the words "apple" and "banana" will be placed close to each other because both are types of fruit, while the word "keyboard" will be further from the fruits, as it relates to a different context. The same principle applies to phrases or even entire paragraphs.
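To make the idea of vector proximity concrete, here is a minimal Python sketch that requests embeddings from an Ollama server and compares them with cosine similarity. It assumes an Ollama instance is reachable on its default port 11434 with the mxbai-embed-large model already pulled, and it uses Ollama's /api/embeddings endpoint; field names may differ in newer API versions.

```python
import math

import requests  # pip install requests

# Assumes an Ollama server is reachable here with mxbai-embed-large pulled.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"


def embed(text: str) -> list[float]:
    """Turn a piece of text into an embedding vector via the Ollama API."""
    response = requests.post(
        OLLAMA_EMBED_URL,
        json={"model": "mxbai-embed-large", "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Close to 1.0 means similar meaning; close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


apple, banana, keyboard = (embed(word) for word in ("apple", "banana", "keyboard"))
print("apple vs banana:  ", round(cosine_similarity(apple, banana), 3))
print("apple vs keyboard:", round(cosine_similarity(apple, keyboard), 3))
```

You should see a noticeably higher similarity score for the two fruits than for the fruit/keyboard pair, which is exactly the "closer together on the map" intuition described above.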

• The files you add to the RAG system undergo several processing steps. First, the text is extracted from the files. This text is then divided into smaller sections called chunks. Chunking is necessary because GPT models have a limit on the amount of text, measured in tokens, that they can process at once. This limit is known as the context size of the model and varies between different GPT models. The way text is chunked significantly impacts the quality of the answers, because each chunk should ideally keep semantically related content together—but you will explore that in more detail later.
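As an illustration only, the sketch below splits text into fixed-size, slightly overlapping chunks. It measures chunks in characters for simplicity (production pipelines, including Open WebUI's, typically count tokens and use smarter, sentence-aware splitting), and the file name is a hypothetical example.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap slightly, so content cut
    at a boundary still appears intact in the neighbouring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


# Hypothetical example: split an extracted configuration guide before embedding.
with open("catalyst_config_guide.txt", encoding="utf-8") as handle:
    document = handle.read()

for index, chunk in enumerate(chunk_text(document)):
    print(f"chunk {index}: {len(chunk)} characters")
```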

• After chunking, the sections are transformed into vectors using an embedding model, which is a specialized Large Language Model (LLM). These vectors are then stored in a vector database that is optimized for efficiently handling high-dimensional data. You can think of the vector database as a lookup table, where each vector serves as a key linked to its corresponding raw text. The entire process is illustrated in the next figure.
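Continuing the sketch, the chunks can be embedded and kept in a toy in-memory "vector database": a list that pairs each vector with the raw text it came from. This reuses the hypothetical embed() and chunk_text() helpers from the previous sketches and is meant only to illustrate the lookup-table idea; Open WebUI uses a real vector database internally.

```python
# Toy in-memory vector store: each entry pairs an embedding vector (the "key")
# with the raw chunk of text it was computed from (the "value").
# embed() and chunk_text() are the illustrative helpers sketched above.

vector_store: list[tuple[list[float], str]] = []

with open("catalyst_config_guide.txt", encoding="utf-8") as handle:
    document = handle.read()

for chunk in chunk_text(document):
    vector_store.append((embed(chunk), chunk))

print(f"Stored {len(vector_store)} chunks in the vector store")
```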

• When you ask a question using the RAG system, it first uses the embedding LLM to convert your question into a vector. It then uses this vector to search the vector database for the information that is most relevant to your query. The text from the best matches is combined with your original question to create a new, expanded context. This expanded context, along with your initial query, is sent to the inference LLM in raw text format. The inference LLM uses this information to provide an accurate answer, enhancing its built-in knowledge with the most relevant data extracted from the database. You see the entire process shown in the next figure. Note that the same embedding LLM is used both for creating the database from the uploaded files and for the semantic search over your queries.
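The query side of the pipeline can be sketched the same way: embed the question, rank the stored chunks by cosine similarity, prepend the best matches to the question, and send the expanded prompt to the inference model through Ollama's /api/chat endpoint. This reuses the embed(), cosine_similarity(), and vector_store pieces from the earlier sketches and is only a simplified illustration of what Open WebUI does for you automatically.

```python
import requests  # pip install requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed Ollama endpoint


def answer_with_rag(question: str, top_k: int = 3) -> str:
    """Retrieve the most relevant chunks and ask the inference LLM to answer."""
    # 1. Embed the question with the same embedding model used for the files.
    question_vector = embed(question)

    # 2. Rank the stored chunks by cosine similarity and keep the best matches.
    ranked = sorted(
        vector_store,
        key=lambda entry: cosine_similarity(question_vector, entry[0]),
        reverse=True,
    )
    context = "\n\n".join(text for _, text in ranked[:top_k])

    # 3. Combine the retrieved context with the original question.
    prompt = (
        "Use the following reference material to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

    # 4. Send the expanded prompt to the inference model.
    response = requests.post(
        OLLAMA_CHAT_URL,
        json={
            "model": "llama3.1",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]


print(answer_with_rag("How do I configure a VLAN on a Catalyst switch?"))
```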

  • Satisfied with your high-level understanding of the RAG pipeline, you begin to explore the Open WebUI application.

Arena Model

• The phi3 and llama3.1 models are categorized as instruct models. Instruct models are specifically trained to excel at tasks that require following user instructions. These models are particularly well suited for chatbots because they can interpret a wide range of user queries and commands and respond to them effectively. The mxbai-embed-large model is trained specifically for embedding tasks, and it is the model used for semantic search in RAG applications.

• The models featured in this selection are classified as small LLMs because they have far fewer parameters than their larger counterparts. Models with a greater number of parameters typically demonstrate enhanced capabilities, handling more complex tasks and offering more nuanced responses. Each model here is designed to operate on a consumer-grade GPU with 8 GB of RAM, making them accessible for smaller-scale applications.

• To estimate the RAM consumption of a GPT model, you can use the formula M = P x (Q/8) x 1.2, where M is the estimated memory in gigabytes, P is the number of parameters in billions, Q is the quantization in bits, and the factor 1.2 accounts for additional overhead related to inference optimization tasks. Quantization defines how many bits are used for storing each parameter in memory. It is important to note that while quantization reduces memory requirements by compressing the model, it can also lead to a decrease in the quality of the model's output.

• For instance, the phi3 model, which has 3.8 billion parameters and uses 4-bit quantization, has an estimated GPU memory consumption of 3.8 x (4/8) x 1.2 = 2.28 GB of RAM. Similarly, the llama3.1 model, with 8 billion parameters and 4-bit quantization, requires approximately 4.8 GB of RAM. This approach ensures that the models are both efficient and effective within the constraints of typical consumer hardware. To put things into perspective, GPT-3 has 175 billion parameters and uses 16-bit quantization, which consumes around 420 GB of RAM.

• The RAM consumption for RAG applications can be calculated by considering both the inference and the embedding models. For example, the mxbai-embed-large embedding model, with 334 million (0.334 billion) parameters and 16-bit quantization, requires approximately 0.8 GB of RAM. When combined with the llama3.1 model, the entire RAG system has a baseline RAM requirement of at least 5.6 GB. Additionally, optimization processes such as data caching, activation map storage, and parallel processing overhead are actively running on the GPU, all of which demand extra RAM. These processes are essential for efficient data retrieval, quick access to frequently used data, and managing the computational load across multiple GPU cores. Therefore, allocating 8 GB of GPU RAM for the system provides sufficient headroom to accommodate these demands and ensures smooth operation under varying workloads.
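The arithmetic above can be reproduced with a few lines of Python; this is just the back-of-the-envelope formula M = P x (Q/8) x 1.2, not a precise measurement of real GPU usage.

```python
def estimate_gpu_ram_gb(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    """M = P x (Q / 8) x 1.2: parameters in billions, times bytes per parameter,
    times roughly 20 percent overhead for inference optimization."""
    return params_billions * (quant_bits / 8) * overhead


phi3 = estimate_gpu_ram_gb(3.8, 4)        # ~2.28 GB
llama31 = estimate_gpu_ram_gb(8, 4)       # ~4.8 GB
mxbai = estimate_gpu_ram_gb(0.334, 16)    # ~0.8 GB
gpt3 = estimate_gpu_ram_gb(175, 16)       # ~420 GB, for comparison

print(f"phi3:              {phi3:.2f} GB")
print(f"llama3.1:          {llama31:.2f} GB")
print(f"mxbai-embed-large: {mxbai:.2f} GB")
print(f"RAG baseline (llama3.1 + mxbai): {llama31 + mxbai:.1f} GB")
print(f"GPT-3:             {gpt3:.0f} GB")
```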

• Note: The models were downloaded at the time of writing this guide. The latest tag means that the models were at their latest version when the lab guide was created. These open-source models are updated quite frequently, so keep in mind that the latest versions might behave differently. You can download additional models from the Hugging Face or Ollama portals, or even use cloud services such as ChatGPT from OpenAI, in the administration settings of the application.
