SaladCloud Blog

Your own ChatGPT for just $0.04/hr – with Ollama, ChatUI and Salad

Deploy your own ChatGPT with Ollama, Huggingface Chat UI and Salad

Deploy your own LLM with Ollama & Huggingface Chat UI on Salad

How much does it cost to build and deploy a ChatGPT-like product today? Anywhere from thousands to millions of dollars, depending on the model, infrastructure and use case; even the same task can cost anywhere from $1,000 to $100,000. But with the advancement of open-source models and open infrastructure, there is tremendous interest in building cost-efficient, ChatGPT-like tools for real-life applications. In this article, we explore how tools like Ollama and Huggingface Chat UI can simplify this process, particularly when deployed on Salad's distributed cloud infrastructure.

The challenges in hosting & implementing LLMs

In today's digital ecosystem, Large Language Models (LLMs) have revolutionized sectors including technology, healthcare, education, and customer service. Their ability to understand and generate human-like text has made them immensely popular, driving innovations in chatbots, content creation, and more. With their vast knowledge bases and sophisticated algorithms, these models can converse, comprehend complex topics, write code, and even compose poetry, making them highly versatile tools for many enterprise and everyday use cases. However, hosting and implementing these LLMs poses significant challenges. Despite those challenges, the integration of LLMs into platforms continues to grow, driven by their vast potential and continuous advances in the field. As solutions like Hugging Face's Chat UI and SaladCloud offer more accessible and efficient ways to deploy these models, we are likely to see even greater adoption and innovation across industries.

What is Ollama?

Ollama is a tool that enables the local execution of open-source large language models like Llama 2 and Mistral 7B on various operating systems, including macOS, Linux, and soon Windows.
It simplifies the process of running LLMs by letting users execute models with a simple terminal command or an API call. Ollama optimizes setup and configuration, specifically tailoring GPU usage for efficient performance. It supports a variety of models and variants, all accessible through the Ollama model library, making it a versatile and user-friendly solution for running powerful language models locally. Here is a list of supported models:

Model                Parameters  Size    Download
Llama 2              7B          3.8GB   ollama run llama2
Mistral              7B          4.1GB   ollama run mistral
Dolphin Phi          2.7B        1.6GB   ollama run dolphin-phi
Phi-2                2.7B        1.7GB   ollama run phi
Neural Chat          7B          4.1GB   ollama run neural-chat
Starling             7B          4.1GB   ollama run starling-lm
Code Llama           7B          3.8GB   ollama run codellama
Llama 2 Uncensored   7B          3.8GB   ollama run llama2-uncensored
Llama 2 13B          13B         7.3GB   ollama run llama2:13b
Llama 2 70B          70B         39GB    ollama run llama2:70b
Orca Mini            3B          1.9GB   ollama run orca-mini
Vicuna               7B          3.8GB   ollama run vicuna
LLaVA                7B          4.5GB   ollama run llava

What is Huggingface Chat UI?

Huggingface Chat UI is a powerful tool for practitioners in the Large Language Model (LLM) space looking to deploy a ChatGPT-like conversational interface. It enables interaction with models hosted on Huggingface, leveraging its text generation inference or any custom API powered by an LLM. Chat UI offers capabilities such as conversation history, memory, authentication, and theming, making it an ideal choice for those looking to create a more engaging and robust conversational agent.

Integrating Ollama and Huggingface Chat UI for deploying on Salad

The main goal of our project is to integrate Ollama with Huggingface Chat UI and deploy them to Salad. The final version of the code can be found here: GitHub – SaladTechnologies/ollama-chatui. To achieve this, we did the following:

1. Clone the Ollama repository

We start by cloning the Ollama repository from the Ollama Git repo.
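Cloning the repository and trying Ollama locally might look like the following sketch (the repository URL and model name are illustrative examples, not commands quoted from the project):

```shell
# Clone the Ollama repository (upstream path assumed)
git clone https://github.com/ollama/ollama.git

# Run a model interactively from the terminal...
ollama run llama2

# ...or call the REST API that the Ollama server exposes on port 11434
curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'
```

The same two entry points (terminal command and HTTP API) are what we rely on later when wiring Ollama to Chat UI inside a container.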
This repository serves as the base of the project. Ollama is a user-friendly tool that can be operated via the terminal or as a REST API. In this project, the intention is to run Ollama in a Docker container and connect it to Chat UI. The Dockerfile from the Ollama repository shows that it runs on host and port 11434. However, since Ollama is accessed through the UI rather than directly, this configuration will be modified later.

2. Set up Huggingface Chat UI

Chat UI Git repo: GitHub – huggingface/chat-ui: Open source codebase powering the HuggingChat app. The Chat UI Readme walks through the few steps needed to make it work in our custom solution; note how the model configuration specifies the path to Ollama.

3. Connect Ollama and Chat UI

We now need to connect Ollama and Chat UI. This means ensuring that Chat UI can communicate with the Ollama instance, typically by setting the appropriate port and host in the UI configuration to match the Ollama Docker deployment. First, we clone the Chat UI repo in our Dockerfile and replace the host that Ollama binds to. Next, we expose port 3000, which Chat UI uses, and replace the entrypoint with our custom shell script. With this script, we establish the necessary .env.local file and populate it with configurations. Next, we initiate the Ollama server in a separate tmux session to download the desired model. Chat UI is then activated on port 3000. For any adjustments in model settings, refer to the models_config/model.local file. We also converted the MongoDB URL, Huggingface token, and model name into environment variables to facilitate seamless alterations during deployment to Salad. Additionally, a DOWNLOAD_TIME variable is defined. Since Ollama runs in a tmux session, subsequent commands can execute even if the server isn't fully operational. To ensure that Ollama is fully active before initiating Chat UI, we incorporate a sleep duration.
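A minimal sketch of such an entrypoint script is shown below. The variable names (MONGODB_URL, HF_TOKEN, MODEL_NAME) and the exact commands are assumptions for illustration, not lines copied from the repository:

```shell
#!/bin/sh
# Write Chat UI's .env.local from environment variables supplied at deploy time
# (MONGODB_URL, HF_TOKEN and MODEL_NAME are assumed names for this sketch)
cat > /chat-ui/.env.local <<EOF
MONGODB_URL=${MONGODB_URL}
HF_TOKEN=${HF_TOKEN}
MODELS=$(cat /models_config/model.local)
EOF

# Start the Ollama server in a detached tmux session, then pull the model
tmux new-session -d -s ollama "ollama serve"
tmux new-window  -t ollama "ollama pull ${MODEL_NAME}"

# Give Ollama time to finish the model download before Chat UI starts
sleep "${DOWNLOAD_TIME:-480}"

# Launch Chat UI on port 3000
cd /chat-ui && npm run dev -- --host 0.0.0.0 --port 3000
```

The tmux session is what lets the script keep going while the server downloads the model in the background; the sleep is the simple guard that Chat UI does not start before Ollama is reachable.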
This duration is model-dependent; for instance, downloading llama2 might take around 8 minutes.

4. Deploy to Salad

After setting up and connecting Ollama and Chat UI, the complete system is ready for deployment to Salad's cloud infrastructure. Detailed deployment instructions and necessary files are accessible through the Salad Technologies Ollama Chat UI GitHub repository, or by pulling the image from the Salad Docker registry: saladtechnologies/ollama-chatui-salad:1.0.0. To deploy the solution, follow the SaladCloud instructions for deploying a Container Group.
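To try the published image locally before deploying, something like the following should work (the environment variable names are assumptions based on the description above, and the MongoDB URL is a placeholder):

```shell
# Pull the prebuilt image from the Salad Docker registry
docker pull saladtechnologies/ollama-chatui-salad:1.0.0

# Run it locally, supplying the values the container reads as env vars
docker run --gpus all -p 3000:3000 \
  -e MONGODB_URL="mongodb://host.docker.internal:27017" \
  -e HF_TOKEN="<your Huggingface token>" \
  -e MODEL_NAME="llama2" \
  saladtechnologies/ollama-chatui-salad:1.0.0
```

When deploying as a Salad Container Group, the same environment variables are set in the container configuration instead of on the command line.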

LLM Comparison Through TGI Benchmark Using SaladCloud

LLM comparison benchmark with text generation inference on Salad GPU cloud

In the field of Artificial Intelligence (AI), Text Generation Inference (TGI) has become a vital toolkit for deploying and serving Large Language Models (LLMs). TGI enables efficient and scalable text generation with popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and Mistral. This SaladCloud benchmark dives deep into this technology, with an LLM comparison focused on the performance of popular language models.

TGI and Large Language Models

TGI is essential for leveraging the capabilities of Large Language Models, which are key to many AI applications today. These models, known for generating text that closely resembles human writing, are crucial for applications ranging from automated customer service to creative content generation. You can easily deploy TGI on Salad using the following instructions: Run TGI (Text Generation Interface) by Hugging Face.

Experiment design: Benchmarking on SaladCloud

Our benchmark study on SaladCloud aims to evaluate and compare select LLMs deployed through TGI. This provides insight into model performance under varying loads and into SaladCloud's efficacy in supporting advanced AI tasks.

Models for comparison

We selected four models for this benchmark, each with its unique capabilities; they are described in the sections below.

Test parameters

Batch sizes: The models were tested with batch sizes of 1, 4, 8, 16, 32, 64, and 128.
Hardware configuration: A uniform setup across tests with 8 vCPUs, 28GB of RAM, and a 24GB GPU card.
Benchmarking tool: We used the Text Generation Benchmark Tool, part of TGI, designed to measure the performance of these models.
Model parameters: We used the default sequence length of 10 and decode length of 8.
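As a concrete sketch, serving a model with TGI and running its bundled benchmark tool with the parameters above could look like this (the image tag and flag names follow TGI's public documentation at the time of writing; treat them as assumptions rather than the exact commands used for this benchmark):

```shell
# Serve a model with TGI (the benchmark tool ships inside this image)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:1.3 \
  --model-id tiiuae/falcon-7b

# Inside the running container, benchmark with this article's parameters:
# sequence length 10, decode length 8, one of the tested batch sizes
text-generation-benchmark \
  --tokenizer-name tiiuae/falcon-7b \
  --sequence-length 10 \
  --decode-length 8 \
  --batch-size 32
```

The benchmark tool drives the already-loaded model directly, which is why it must run inside (or alongside) the TGI server container.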
Performance metrics

The TGI benchmark reports a set of latency and throughput metrics for each batch size we tested.

bigcode/santacoder

bigcode/santacoder is part of the SantaCoder series, featuring 1.1 billion parameters and trained on subsets of Python, Java, and JavaScript from The Stack (v1.1). The model, known for its Multi Query Attention and 2048-token context window, uses advanced training techniques such as near-deduplication and comment-to-code-ratio filtering. The SantaCoder series also includes variations in architecture and objectives, providing diverse capabilities in code generation and analysis. This is the smallest model in our benchmark.

Key observations

Cost-effectiveness on SaladCloud: bigcode/santacoder

A key part of our analysis focused on the cost-effectiveness of running TGI models on SaladCloud. For a batch size of 32, with a compute cost of $0.35 per hour, we calculated the cost per million tokens based on throughput: approximately $0.03047 (about 3.047 cents) per million output tokens and $0.07572 per million input tokens.

tiiuae/falcon-7b

Falcon-7B is a decoder-only model with 7 billion parameters, built by TII and trained on an extensive 1,500B-token dataset from RefinedWeb, enhanced with curated corpora. It is available under the Apache 2.0 license, making it a significant model for large-scale text generation tasks.

Key findings

Cost-effectiveness on SaladCloud: tiiuae/falcon-7b

For the tiiuae/falcon-7b model on SaladCloud, with a batch size of 32 and a compute cost of $0.35 per hour, the calculated cost at a throughput of 744 tokens per second is approximately $0.13095 (about 13.095 cents) per million output tokens and $0.28345 per million input tokens. The average total decode latency for batch size 32 is 300.82 milliseconds.
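All the per-token prices in this benchmark follow from one conversion: divide the hourly price by the number of tokens generated per hour. A quick sanity check using falcon-7b's quoted numbers (note the published 744 tokens/s is itself rounded, so this lands slightly below the article's $0.13095):

```shell
# cost per 1M tokens = hourly price / (tokens per second * 3600 s/hr) * 1e6
awk 'BEGIN {
  price = 0.35   # $/hr for the 8 vCPU / 28GB RAM / 24GB GPU configuration
  tps   = 744    # quoted output tokens per second at batch size 32
  printf "$%.5f per 1M output tokens\n", price / (tps * 3600) * 1e6
}'
```

Plugging in any other model's measured throughput at the same $0.35/hr price reproduces the remaining output-token figures the same way.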
While this latency is slightly higher than that of smaller models, it still falls within a reasonable range for many applications, especially considering the model's size of 7 billion parameters. The cost-effectiveness, combined with the model's capabilities, makes it a viable option for extensive text generation tasks on SaladCloud.

Code Llama

Code Llama is a collection of generative text models, with the base model boasting 7 billion parameters. It is part of a series ranging up to 34 billion parameters, specifically tailored for code-related tasks. This benchmark focuses on the base 7B version in the Hugging Face Transformers format, designed to handle a wide range of coding applications. The cost of processing one million tokens with Code Llama on SaladCloud, at a batch size of 32 and a compute cost of $0.35 per hour, is approximately $0.11826 per million output tokens and $0.28679 per million input tokens. This highlights the economic feasibility of using SaladCloud for large-scale text generation with sophisticated models like Code Llama.

Mistral-7B-Instruct-v0.1

Mistral-7B-Instruct-v0.1 is an instruct fine-tuned version of the Mistral-7B-v0.1 generative text model. It leverages a variety of publicly available conversation datasets to enhance its ability to understand and generate human-like, conversational text. Its fine-tuning makes it particularly adept at handling instruction-based queries, setting it apart in the realm of LLMs.

Key insights

Implications and cost analysis

The performance of the Mistral-7B-Instruct-v0.1 model on SaladCloud shows promising potential for use in AI-driven conversational systems. Its ability to process a high number of tokens per second at manageable latency makes it a strong contender for applications requiring nuanced language understanding and generation.
At a price of $0.35 per hour, we achieve a cost of approximately $0.12153 per million output tokens and $0.27778 per million input tokens.

Conclusion – LLM comparison benchmark results

Our comprehensive LLM comparison benchmark of various Text Generation Inference (TGI) models on SaladCloud reveals an insightful trend: despite the diversity in the models' capabilities and complexities, cost-effectiveness is remarkably consistent when using the same compute configuration.

Consistent performance and cost-effectiveness

Customizable compute options

Final thoughts

In conclusion, SaladCloud emerges as a compelling choice for deploying and running TGI models. Its ability to provide uniform compute efficiency across a range of models, combined with the option to customize and optimize compute resources, offers both consistency in performance and flexibility in cost management. Whether for large-scale commercial deployments or smaller, more targeted AI tasks, SaladCloud's platform is well-equipped to meet diverse text generation requirements with an optimal balance of efficiency and cost-effectiveness.