...

SaladCloud Blog

Whisper Large V3 Speech Recognition Benchmark: 1 Million hours of audio transcription for just $5110

Whisper Large V3 - automatic speech recognition GPU benchmark

Save over 99.8% on audio transcription using Whisper Large V3 and consumer GPUs

A 99.8% cost saving for automatic speech recognition sounds unreal. But with the right choice of GPUs and models, it is very much possible, and it highlights the needless overspending on managed transcription services today. In this deep dive, we benchmark the latest Whisper Large V3 model from OpenAI for inference against the extensive English CommonVoice and Spoken Wikipedia Corpus English (Part 1, Part 2) datasets, delving into how we accomplished an exceptional 99.8% cost reduction compared to other public cloud providers. Building upon the inference benchmark of Whisper Large V2, and with our continued effort to enhance the system architecture and implementation for batch jobs, we have achieved substantial reductions in both audio transcription cost and time while maintaining the same level of accuracy as managed transcription services.

Behind The Scenes: Advanced System Architecture for Batch Jobs

Our batch-processing framework comprises several components, and we aimed to keep them fully managed and serverless to closely simulate the experience of using managed transcription services. A decoupled architecture provides the flexibility to choose the best and most cost-effective solution for each component from the industry. Within each node in the GPU resource pool on SaladCloud, two processes are used, following best practices: one dedicated to GPU inference and another focused on I/O- and CPU-bound tasks, such as downloading/uploading, preprocessing, and post-processing.

1) Inference Process

The inference process operates on a single thread. It begins by loading the Whisper Large V3 model and warming up the GPU, and then listens on a TCP port by running a Python/FastAPI app in a Uvicorn server. Upon receiving a request, it calls the transcription inference and returns the generated assets. The chunking algorithm is configured for batch processing: long audio files are segmented into 30-second clips, and these clips are fed into the model simultaneously. Batch inference significantly enhances performance by effectively leveraging the GPU cache and parallel processing capabilities.

2) Benchmark Worker Process

The benchmark worker process primarily handles various I/O tasks, as well as pre- and post-processing. Multiple threads perform different tasks concurrently: one thread pulls jobs and downloads audio clips; another calls the inference; and the remaining threads manage tasks such as uploading generated assets, reporting job results, and cleaning the environment. Several queues are created to facilitate information exchange among these threads. Running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks, and fetching the next audio clip while the current one is still being transcribed, allows us to eliminate any waiting period. As soon as one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

Deployment on SaladCloud

We created a container group with 100 replicas (2 vCPU and 12 GB RAM with 20 different GPU types) on SaladCloud and ran it for approximately 10 hours. In this period, we successfully transcribed over 2 million audio files, totalling nearly 8,000 hours in length. The test incurred around $100 in SaladCloud costs and less than $10 on both AWS and Cloudflare.
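As an illustration of the inference process described above, here is a minimal sketch using the Hugging Face transformers ASR pipeline behind FastAPI. The endpoint path, request fields, and batch size are illustrative assumptions rather than the exact benchmark code.

# Minimal sketch of the inference process: Whisper Large V3 loaded once,
# long audio chunked into 30-second clips and transcribed in batches.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model at startup and keep it on the GPU.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    chunk_length_s=30,   # long audio is split into 30-second clips
    batch_size=16,       # clips are fed to the model in batches
)

class TranscribeRequest(BaseModel):
    audio_path: str      # local file prepared by the benchmark worker process

@app.post("/transcribe")
def transcribe(req: TranscribeRequest):
    result = asr(req.audio_path)
    return {"text": result["text"]}

The app would then be served with Uvicorn (for example, uvicorn main:app --port 8000), while the separate worker process downloads audio clips and posts requests to this port.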
Results from the Whisper Large V3 benchmark

Among the 20 GPU types, based on the current datasets, the RTX 3060 stands out as the most cost-effective GPU type for long audio files exceeding 30 seconds. Priced at $0.10 per hour on SaladCloud, it can transcribe nearly 200 hours of audio per dollar. For short audio files lasting less than 30 seconds, several GPU types exhibit similar performance, transcribing approximately 47 hours of audio per dollar.

On the other hand, the RTX 4080 outperforms the others as the best-performing GPU type for long audio files exceeding 30 seconds, boasting an average real-time factor of 40. This means the system can transcribe 40 seconds of audio per second. For short audio files lasting less than 30 seconds, the best average real-time factor is approximately 8, achieved by a couple of GPU types, indicating the ability to transcribe 8 seconds of audio in just 1 second.

Analysis of the benchmark results

Unlike results obtained in local tests with several machines on a LAN, all these numbers were achieved in a global, distributed cloud environment that provides transcription at large scale, and they cover the entire process from receiving requests to transcribing and sending the responses. There are various ways to optimize the results, whether the goal is reduced cost, improved performance, or both, and different approaches may yield distinct outcomes. The Whisper models come in five configurations of varying sizes: tiny, base, small, medium, and large (v1/v2/v3). The large versions are multilingual and offer better accuracy, but they demand more powerful GPUs and run relatively slowly. The smaller versions support only English with slightly lower accuracy, but they require less powerful GPUs and run very fast.

Choosing more cost-effective GPU types in the resource pool will result in additional cost savings. If performance is the priority, selecting higher-performing GPU types is advisable, while still remaining significantly less expensive than managed transcription services. Additionally, audio length plays a crucial role in both performance and cost, and it is essential to optimize the resource configuration based on your specific use cases and business goals.

Discover our open-source code for a deeper dive:
- Implementation of Inference and Benchmark Worker
- Docker Images
- Data Exploration Tool

Performance Comparison across Different Clouds

The results indicate that AI transcription companies are massively overpaying for cloud today. With the open-source automatic speech recognition model Whisper Large V3 and the advanced batch-processing architecture leveraging hundreds of consumer GPUs on SaladCloud, we can deliver transcription services at massive scale and at an exceptionally low cost, while maintaining the same level of accuracy as managed transcription services. With the most cost-effective GPU type for Whisper Large V3 inference on SaladCloud, $1 can transcribe 11,736 minutes of audio (nearly 200 hours), showcasing a 500-fold cost advantage over managed transcription services.
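As a closing sanity check on the headline numbers, the relationship between hourly GPU price, real-time factor, and audio hours per dollar can be worked out directly. The sketch below only reproduces the arithmetic implied by the figures reported above; any rounding is ours.

# Rough arithmetic relating hourly price, real-time factor (RTF), and audio per dollar.
price_per_hour = 0.10          # RTX 3060 on SaladCloud, $/hour
minutes_per_dollar = 11_736    # reported audio minutes transcribed per dollar

hours_of_audio_per_dollar = minutes_per_dollar / 60                      # ~195.6 hours
audio_hours_per_gpu_hour = hours_of_audio_per_dollar * price_per_hour    # ~19.6
print(f"{hours_of_audio_per_dollar:.1f} audio hours per dollar")
print(f"implied average real-time factor on the RTX 3060: ~{audio_hours_per_gpu_hour:.1f}x")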

Tutorial: How to run your own GPU-accelerated JupyterLab on SaladCloud

Jupyterlab deployment tutorial on SaladCloud

In recent times, JupyterLab has gained popularity among data scientists and students because of its ease of use, flexibility, and extensibility. But access to resources and cost remain a hindrance. In this blog, we provide a walkthrough on creating and running your own GPU-accelerated JupyterLab, taking advantage of low GPU prices on SaladCloud.

The challenge in data science learning & research

Many college students and professionals in the AI and Data Science industry face common challenges when dealing with GPU-capable development environments for learning, testing, or research. The laptops they use daily often lack a dedicated GPU, or the built-in GPUs are incompatible with popular frameworks like TensorFlow and PyTorch. Investing in a second computer with an NVIDIA GPU for machine learning not only costs thousands of dollars but also results in low utilization and inconvenience.

In addition, building development environments on NVIDIA GPUs can be tedious work. One needs to be familiar with Windows, Linux, or both; understand the version compatibility among different pieces of software; and know how to install Python and its IDE, TensorFlow/PyTorch, a C/C++ compiler, cuDNN, CUDA, the GPU driver, and so on. The process can be frustrating and time-consuming. Many individuals spend several days reading instructions and seeking help online, hindering research and learning progress.

While public cloud providers offer GPU-capable compute instances or managed services, these solutions work well for enterprise customers training and deploying large AI models in production environments. However, they are too expensive and overkill for personal learning or testing, with prices ranging from $0.50 to tens of dollars per hour. Moreover, the services from these public cloud providers are becoming more and more complicated, and many services are intertwined and built on top of others. To start working on AI and Data Science using these public clouds, you likely need several weeks just to gain a basic understanding of how these services work together.

The JupyterLab solution

This is where a tool like JupyterLab is becoming increasingly popular as the standard for learning and research in data science. JupyterLab is a web-based interactive development environment for notebooks, code, and data. It is designed to provide a flexible and powerful platform for data science, scientific computing, and computational workflows. JupyterLab is the next generation of Jupyter Notebook, one of the most popular IDEs for data science, and it offers more features, flexibility, and integration than the classic Jupyter Notebook. But accessing and running JupyterLab on public clouds still requires significant time and financial commitment.

Easy, affordable access to JupyterLab on SaladCloud

Salad is the world's largest community-powered cloud, connecting unused compute resources with GPU-hungry businesses. By running JupyterLab on a distributed cloud infrastructure like Salad, you can now learn data science at a more affordable cost. With more than a million individuals sharing compute and 10,000+ GPUs available at any time, SaladCloud offers consumer-grade GPUs at the lowest prices of any cloud on the market, starting from $0.02/hour. You can view the complete list of GPU prices here. SaladCloud is straightforward and easy to use: with pre-built container images, you can launch publicly accessible, elastic, GPU-accelerated container applications within a few minutes.
By building and running JupyterLab container images with popular AI/ML frameworks, we can transform SaladCloud into an ideal platform for college students and professionals.

Cost analysis of running JupyterLab on Salad

Here are the typical use cases for running JupyterLab on SaladCloud and a cost analysis for each:

Resource Type | Use Cases | Public Cloud Providers | SaladCloud
2 vCPU, 4 GB RAM, GPU with 4 GB VRAM | Learning programming with Shell, C/C++, CUDA, PyTorch/TensorFlow, and Hugging Face | N/A | $0.032 per hour
4 vCPU, 16 GB RAM, GPU with 16 GB VRAM | Most NLP and CV tasks, including testing, training, and inference | $0.5+ per hour, additional charge on network traffic | $0.31 per hour, 40% saving
8 vCPU, 24 GB RAM, GPU with 24 GB VRAM | Testing, fine-tuning, and inference for the latest LLMs, Stable Diffusion, etc. | $1.2+ per hour, additional charge on network traffic | $0.36 per hour, 70% saving

Cost comparison of Salad and public cloud providers for different JupyterLab use cases.

Several JupyterLab container images have been built to meet general AI/ML requirements. The corresponding Dockerfiles are also available on the GitHub repository, allowing SaladCloud users to tailor these images to specific needs.

Resources: How to deploy JupyterLab on SaladCloud

SaladCloud is designed to execute stateless container workloads. To preserve code and data while using JupyterLab, it is imperative to set up cloud-based storage and integrate it with the JupyterLab containers. We have already integrated the major public cloud platforms, including AWS, Azure, and GCP, into the pre-built container images, and there are detailed instructions on how to provision storage services on these platforms. With these integrations, JupyterLab instances running on SaladCloud support data persistence, ensuring that changes to code and data are automatically saved to the cloud. For more information on how these images are built and integrated with public cloud providers, please refer to the user guide.

Deploy the JupyterLab instance

Let's run a JupyterLab container instance on SaladCloud to see what it looks like. In this instance, we use AWS S3 as the backend storage. The AWS S3 bucket/folder has already been provisioned, and the access key ID and secret access key have been generated. This step can be omitted if data persistence in the container is not necessary.

Log in to the SaladCloud Console and deploy the JupyterLab instance by selecting 'Deploy a Container Group' with the following parameters:

Parameter | Value
Container Group Name | jupyterlab001
Image Source | saladtechnologies/jupyterlab:1.0.0-pytorch-tensorflow-cpu-aws-azure-gcp
Replica Count | 1
vCPU | 2
Memory | 4 GB
GPU | GTX 1650 (4 GB), RTX 2080 (8 GB), RTX 4070 (12 GB). We can choose multiple GPU types simultaneously, and SaladCloud will then select a node that matches one of the selected types.
Networking | Enabled, Port: 8000, Use Authentication: No
Environment Variables | JUPYTERLAB_PW, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_S3_BUCKET_FOLDER

Set up the environment variables

The default password for JupyterLab will be 'data' if we don't provide the environment variable JUPYTERLAB_PW, and the other three AWS-related environment variables can be omitted if data persistence is not required.
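As a quick way to confirm that the storage integration is working, a notebook cell inside the running JupyterLab instance could list the configured bucket with boto3. The snippet below is a minimal sketch that assumes the environment variables listed above are set in the container and that the bucket/folder value follows a "bucket/prefix" layout.

# Minimal check (run inside JupyterLab) that the AWS credentials and bucket/folder
# passed as environment variables are usable. Whether boto3 is preinstalled in the
# image is an assumption; install it with `pip install boto3` if it is missing.
import os
import boto3

bucket_and_folder = os.environ["AWS_S3_BUCKET_FOLDER"]   # e.g. "my-bucket/jupyter-data" (assumed format)
bucket, _, prefix = bucket_and_folder.partition("/")

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# List a few objects under the folder to confirm read access.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])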

Your own ChatGPT for just $0.04/hr – with Ollama, ChatUI and Salad

Deploy your own ChatGPT with Ollama, Huggingface Chat UI and Salad

Deploy your own LLM with Ollama & Huggingface Chat UI on Salad

How much does it cost to build and deploy a ChatGPT-like product today? The cost could be anywhere from thousands to millions of dollars, depending on the model, infrastructure, and use case. Even the same task could cost anywhere from $1,000 to $100,000. But with the advancement of open-source models and open infrastructure, there has been tremendous interest in building cost-efficient ChatGPT-like tools for various real-life applications. In this article, we explore how tools like Ollama and Huggingface Chat UI can simplify this process, particularly when deployed on Salad's distributed cloud infrastructure.

The challenges in hosting & implementing LLMs

In today's digital ecosystem, Large Language Models (LLMs) have revolutionized various sectors, including technology, healthcare, education, and customer service. Their ability to understand and generate human-like text has made them immensely popular, driving innovations in chatbots, content creation, and more. These models, with their vast knowledge bases and sophisticated algorithms, can converse, comprehend complex topics, write code, and even compose poetry. This makes them highly versatile tools for many enterprise and everyday use cases. However, hosting and implementing these LLMs poses significant challenges. Despite these challenges, the integration of LLMs into platforms continues to grow, driven by their vast potential and the continuous advancements in the field. As solutions like Hugging Face's Chat UI and SaladCloud offer more accessible and efficient ways to deploy these models, we are likely to see even greater adoption and innovation across industries.

What is Ollama?

Ollama is a tool that enables the local execution of open-source large language models like Llama 2 and Mistral 7B on various operating systems, including macOS, Linux, and soon Windows. It simplifies the process of running LLMs by allowing users to execute models with a simple terminal command or an API call. Ollama optimizes setup and configuration, specifically tailoring GPU usage for efficient performance. It supports a variety of models and variants, all accessible through the Ollama model library, making it a versatile and user-friendly solution for running powerful language models locally. Here is a list of supported models:

Model | Parameters | Size | Download
Llama2 | 7B | 3.8GB | ollama run llama2
Mistral | 7B | 4.1GB | ollama run mistral
Dolphin Phi | 2.7B | 1.6GB | ollama run dolphin-phi
Phi-2 | 2.7B | 1.7GB | ollama run phi
Neural Chat | 7B | 4.1GB | ollama run neural-chat
Starling | 7B | 4.1GB | ollama run starling-lm
Code Llama | 7B | 3.8GB | ollama run codellama
Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored
Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b
Llama 2 70B | 70B | 39GB | ollama run llama2:70b
Orca Mini | 3B | 1.9GB | ollama run orca-mini
Vicuna | 7B | 3.8GB | ollama run vicuna
LLaVA | 7B | 4.5GB | ollama run llava

What is Huggingface Chat UI?

Huggingface Chat UI is a powerful tool for practitioners in the Large Language Model (LLM) space looking to deploy a ChatGPT-like conversational interface. It enables interaction with models hosted on Huggingface, leveraging its text generation inference or any custom API powered by an LLM. Chat UI offers capabilities such as conversation history, memory, authentication, and theming. Huggingface Chat UI is an ideal choice for those looking to create a more engaging and robust conversational agent.
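Before wiring Ollama into Chat UI, it can help to see how Ollama's REST API behaves on its own. The sketch below assumes a locally running Ollama server on the default port 11434 with the llama2 model already pulled.

# Minimal sketch: call a locally running Ollama server via its REST API.
# Assumes `ollama run llama2` (or `ollama pull llama2`) has been executed first.
import json
import urllib.request

payload = {
    "model": "llama2",
    "prompt": "Explain in one sentence what Ollama does.",
    "stream": False,          # return a single JSON response instead of a stream
}

request = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())
    print(body["response"])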
Integrating Ollama and Huggingface Chat UI for deploying on Salad

The main goal of our project is to integrate Ollama with Huggingface Chat UI and deploy them to Salad. The final version of the code can be found here: GitHub – SaladTechnologies/ollama-chatui. In order to achieve our goal, we did the following:

1. Clone the Ollama Repository

We start by cloning the Ollama repository from the Ollama Git repo. This repository serves as the base of the project. Ollama is a user-friendly tool and can be operated via the terminal or as a REST API. In this project, the intention is to run Ollama in a Docker container and connect it to Chat UI. The Dockerfile from the Ollama repository shows that it runs on host 0.0.0.0 and port 11434. However, since Ollama only needs to be reached through the UI rather than directly, this configuration will be modified later.

2. Setting Up Huggingface Chat UI

Chat UI Git repo: GitHub – huggingface/chat-ui: Open source codebase powering the HuggingChat app. From the Chat UI readme, we can see that we need to follow a few steps to make it work in our custom solution. Notice that the path to Ollama is specified as http://127.0.0.1:11434.

3. Connecting Ollama and Chat UI

We now need to connect Ollama and Chat UI. This involves ensuring that Chat UI can communicate with the Ollama instance, typically by setting the appropriate port and host settings in the UI configuration to match the Ollama Docker deployment. First, we clone the Chat UI repo in our Dockerfile and replace the host that Ollama uses with 127.0.0.1. Next, we expose port 3000, which is used by Chat UI. We also replace the entrypoint with our custom shell script.

With this script, we establish the necessary .env.local file and populate it with configurations. Next, we initiate the Ollama server in a separate tmux session to download the desired model. Chat UI is then activated on port 3000. For any adjustments in model settings, refer to the models_config/model.local file. We have also converted the MongoDB URL, Huggingface token, and model name into environment variables to facilitate seamless alterations during deployment to Salad. Additionally, a DOWNLOAD_TIME variable is defined. Since Ollama runs in a tmux session, subsequent commands can execute even if the server isn't fully operational. To ensure that Ollama is fully active before initiating Chat UI, we incorporate a sleep duration. This duration is model-dependent; for instance, downloading llama2 might take around 8 minutes.

4. Deploying to Salad

After setting up and connecting Ollama and Chat UI, the complete system is ready for deployment to Salad's cloud infrastructure. The integrated solution will be hosted on Salad's robust cloud platform. Detailed deployment instructions and necessary files are accessible through the Salad Technologies Ollama Chat UI GitHub repository or by pulling the image from the Salad Docker registry: saladtechnologies/ollama-chatui-salad:1.0.0. To deploy our solution we need to follow the instructions: Deploy a Container Group with

LLM Comparison Through TGI Benchmark Using SaladCloud

LLM comparison benchmark with text generation inference on Salad GPU cloud

In the field of Artificial Intelligence (AI), Text Generation Inference (TGI) has become a vital toolkit for deploying and serving Large Language Models (LLMs). TGI enables efficient and scalable text generation with popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and Mistral. This SaladCloud benchmark dives deep into this technology, with an LLM comparison focused on the performance of popular language models.

TGI and Large Language Models

TGI is essential for leveraging the capabilities of Large Language Models, which are key to many AI applications today. These models, known for generating text that closely resembles human writing, are crucial for applications ranging from automated customer service to creative content generation. You can easily deploy TGI on Salad using the following instructions: Run TGI (Text Generation Inference) by Hugging Face.

Experiment design: Benchmarking on SaladCloud

Our benchmark study on SaladCloud aims to evaluate and compare select LLMs deployed through TGI. This will provide insights into model performance under varying loads and the efficacy of SaladCloud in supporting advanced AI tasks.

Models for comparison

We selected the following models for our benchmark, each with its unique capabilities.

Test parameters

Batch sizes: The models were tested with batch sizes of 1, 4, 8, 16, 32, 64, and 128.
Hardware configuration: A uniform hardware setup across tests, with 8 vCPUs, 28 GB of RAM, and a 24 GB GPU card.
Benchmarking tool: To conduct this benchmark, we utilized the Text Generation Benchmark Tool, which is part of TGI and is designed to effectively measure the performance of these models.
Model parameters: We used the default sequence length of 10 and a decode length of 8.

Performance metrics

The TGI benchmark provides us with a set of metrics for each batch size we tested.

Bigcode/santacoder

bigcode/santacoder is part of the SantaCoder series, featuring 1.1 billion parameters and trained on subsets of Python, Java, and JavaScript from The Stack (v1.1). This model, known for its Multi Query Attention and a 2048-token context window, utilizes advanced training techniques like near-deduplication and comment-to-code ratio filtering. The SantaCoder series also includes variations in architecture and objectives, providing diverse capabilities in code generation and analysis. This is the smallest model in our benchmark.

Key observations

Cost-effectiveness on SaladCloud: bigcode/santacoder

A key part of our analysis focused on the cost-effectiveness of running TGI models on SaladCloud. For a batch size of 32, with a compute cost of $0.35 per hour, we calculated the cost per million tokens based on throughput. Considering the throughput and compute price, the cost is approximately $0.03047 (about 3.047 cents) per million output tokens and $0.07572 per million input tokens.

Tiiuae/falcon-7b

Falcon-7B is a decoder-only model with 7 billion parameters, built by TII and trained on an extensive 1,500B-token dataset from RefinedWeb, enhanced with curated corpora. It is available under the Apache 2.0 license, making it a significant model for large-scale text generation tasks.
Key findings

Cost-effectiveness on SaladCloud: tiiuae/falcon-7b

For the tiiuae/falcon-7b model on SaladCloud, with a batch size of 32 and a compute cost of $0.35 per hour, the calculated cost per million tokens at a throughput of 744 tokens per second is approximately $0.13095 (about 13.095 cents) per million output tokens and $0.28345 per million input tokens. The average total decode latency for batch size 32 is 300.82 milliseconds. While this latency might be slightly higher compared to smaller models, it still falls within a reasonable range for many applications, especially considering the model's large size of 7 billion parameters. The cost-effectiveness, combined with the model's capabilities, makes it a viable option for extensive text generation tasks on SaladCloud.

Code Llama

Code Llama is a collection of generative text models, with the base model boasting 7 billion parameters. It is part of a series ranging up to 34 billion parameters, specifically tailored for code-related tasks. This benchmark focuses on the base 7B version in Hugging Face Transformers format, designed to handle a wide range of coding applications. The cost of processing one million tokens using the Code Llama model on SaladCloud, with a batch size of 32 and a compute cost of $0.35 per hour, is approximately $0.11826 per million output tokens and $0.28679 per million input tokens. This figure highlights the economic feasibility of utilizing SaladCloud for large-scale text generation tasks with sophisticated models like Code Llama.

Mistral-7B-Instruct-v0.1

Mistral-7B-Instruct-v0.1 is an instruct fine-tuned version of the Mistral-7B-v0.1 generative text model. This model leverages a variety of publicly available conversation datasets to enhance its capability in understanding and generating human-like, conversational text. Its fine-tuning makes it particularly adept at handling instruction-based queries, setting it apart in the realm of LLMs.

Key insights

Implications and cost analysis

The performance of the Mistral-7B-Instruct-v0.1 model on SaladCloud shows promising potential for its use in various AI-driven conversational systems. Its ability to process a high number of tokens per second at a manageable latency makes it a strong contender for applications requiring nuanced language understanding and generation. At a price of $0.35 per hour, we achieve a cost of approximately $0.12153 per million output tokens and $0.27778 per million input tokens.

Conclusion – LLM comparison benchmark results

Our comprehensive LLM comparison benchmark of various Text Generation Inference (TGI) models on SaladCloud reveals an insightful trend: despite the diversity in the models' capabilities and complexities, there is a remarkable consistency in cost-effectiveness when using the same compute configuration.

Consistent performance and cost-effectiveness

Customizable compute options

Final thoughts

In conclusion, SaladCloud emerges as a compelling choice for deploying and running TGI models. Its ability to provide uniform compute efficiency across a range of models, combined with the option to customize and optimize compute resources, offers both consistency in performance and flexibility in cost management. Whether it's for large-scale commercial deployments or smaller, more targeted AI tasks, SaladCloud's platform is well-equipped to meet diverse text generation requirements with an optimal balance of efficiency and cost-effectiveness.
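For reference, all of the cost-per-million-token figures quoted in this comparison follow the same arithmetic from throughput and the hourly compute price. The sketch below shows that calculation, using the 744 tokens-per-second output throughput reported for falcon-7b above as an example input.

# Cost per million tokens from throughput (tokens/second) and hourly compute price.
# Example values come from this benchmark: $0.35/hour compute and the 744 tok/s
# output throughput reported for tiiuae/falcon-7b at batch size 32.

def cost_per_million_tokens(throughput_tokens_per_sec: float, price_per_hour: float) -> float:
    tokens_per_hour = throughput_tokens_per_sec * 3600
    return price_per_hour / (tokens_per_hour / 1_000_000)

print(cost_per_million_tokens(744, 0.35))   # ~0.13 dollars per million output tokens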

Data Pipeline Processing with GPUs: Why, How, and Where

Data pipeline processing on GPUs

You can't train foundational AI models without good data, and lots of it. Data pipeline processing is a crucial task for any team that is building, or even fine-tuning, their own models. It involves loading, transforming, and analyzing large amounts of data from various sources, such as images, text, audio, video, logs, sensors, and more. Data pipeline processing can be used for tasks such as data cleaning, noise reduction, feature extraction, data augmentation, data validation, and dataset restructuring. However, data pipeline processing can also be very challenging, especially when dealing with massive volumes of data and complex computations. If not done properly, the result is a slow, expensive, and inefficient process. This is where GPU clouds come in handy.

Why data pipeline processing should be done on GPUs

GPUs can perform many operations simultaneously, which makes them more efficient than CPUs for certain types of tasks. GPUs are especially good at handling data-intensive and compute-intensive tasks, such as image processing, video processing, and machine learning. The benefits of using GPUs for this task are many:

– GPUs speed up data pipeline processing by orders of magnitude compared to CPUs. For example, Google Cloud reported that using GPUs to accelerate data pipeline processing with Dataflow resulted in an order-of-magnitude reduction in CPU and memory usage.

– GPUs reduce the cost of data pipeline processing by using fewer resources and less power compared to CPUs. For example, NVIDIA reported up to 50x faster performance and up to 90% lower cost when accelerating genomic workflows on GPUs compared to CPUs.

– GPUs simplify data pipeline processing by enabling users to perform data transformations and machine learning tasks in the same pipeline, without switching between different platforms or tools. For example, Cloud to Street, a company that uses satellites and AI to track floods, reported that using GPUs to perform image processing and machine learning tasks in Dataflow pipelines reduced the complexity and latency of their workflows.

Data processing in times of GPU shortage & high prices

Despite the advantages of using GPUs for data pipeline processing, there are also some challenges and limitations that users may face. One of the main challenges is the GPU shortage. The AI rush for GPUs and the resulting high cost on public clouds affect the availability and affordability of GPUs. The GPU shortage has led to high prices for renting GPUs, particularly enterprise-grade chips on major cloud providers. This makes it harder for companies to access and afford GPUs. It also affects the profitability and competitiveness of businesses that rely on GPUs for their data pipeline processing applications.

How consumer GPUs are the solution to this problem

One solution to the GPU shortage and high prices is to use consumer GPUs for data pipeline processing. There are an estimated 400 million GPUs in people's homes, many of which are suitable for multiple use cases like AI inference and data processing. Consumer GPUs are always connected to the internet, yet they are typically used for gaming only sporadically, leaving them underutilized for most of the day; most consumer GPUs lie unused for almost 20-22 hours a day. Consumer GPUs are more cost-effective and more widely available than enterprise-grade GPUs, and they still offer high performance and quality for data pipeline processing.
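To make the GPU-acceleration idea concrete, here is a minimal sketch of a feature-extraction step in a data pipeline using PyTorch and torchvision. The model choice, batch size, and folder layout are illustrative assumptions, not a specific SaladCloud workload.

# Minimal sketch of a GPU-accelerated feature-extraction step in a data pipeline.
import torch
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained backbone with the classification head removed, used as a feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

dataset = ImageFolder("data/raw_images", transform=preprocess)   # hypothetical path
loader = DataLoader(dataset, batch_size=64, num_workers=4)

features = []
with torch.no_grad():
    for images, _ in loader:
        # Each batch is processed in parallel on the GPU, which is where the speedup comes from.
        features.append(backbone(images.to(device)).cpu())

torch.save(torch.cat(features), "data/image_features.pt")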
However, using consumer GPUs for data pipeline processing also poses some challenges and limitations, such as the compatibility, scalability, security, and reliability of consumer GPUs. To overcome these challenges, companies need a platform or service that enables them to use consumer GPUs in an easy, efficient, and secure way. In this blog, we highlight how choosing the right GPU for your use case, between high-end AI-focused GPUs and lower-end consumer GPUs, is the crucial factor in overcoming the GPU shortage and high cloud costs.

Distributed clouds: The perfect recipe for data pipeline processing

Enter distributed clouds. Salad is a distributed cloud of consumer GPUs that is perfect for data pipeline processing. We do this by connecting companies who need GPUs with gamers who have idle GPUs that they can share or rent. Salad unlocks the following benefits for data pipeline processing:

– Access to a large and diverse pool of consumer GPUs, with over 10,000 GPUs available, starting from $0.02 per hour. Companies can choose from different types, models, and quantities of consumer GPUs, depending on their needs and preferences.

– The ability to effortlessly run common frameworks, such as TensorFlow, PyTorch, Keras, Scikit-learn, and more, on public datasets such as ImageNet, MNIST, and CIFAR-10.

– The ability to source video, audio, image, or text data from the public web, to be processed at massive scale using open-source models such as whisper-large or wav2vec.

– The ability to scale up and down at massive scale, powering data pipelines in batch-job processing, without having to deal with the scalability or reliability of consumer GPUs. Companies can submit their work to Salad as batch jobs, and Salad will automatically allocate and manage the consumer GPUs for these jobs. Teams can also monitor and control their jobs through either a web interface or an API.

– With isolated containers on every machine, Salad offers a secure and private way to run workloads, without having to worry about the nuances of running on consumer GPUs. All container images are fully encrypted in transit and at rest, and are decrypted only at runtime, during which a proprietary runtime security and node reputation system keeps workloads private and secure. Once a worker is done with a job, the entire VM is destroyed along with all its data.

Try SaladCloud today

Data processing is currently the bottleneck for the AI industry, but this can be tackled with millions of consumer GPUs. Obtaining quality datasets is a mission-critical task for any company building foundational AI models, yet it is a challenging one, especially when dealing with large and complex data and computations. Leveraging massive clusters of consumer GPUs is the solution. Companies can use

Analyzing the Stunning Realism of GTA6 with YOLOv8 and SaladCloud

YOLOv8 object detection tutorial - analyzing the GTA6 trailer

Running YOLOv8 on the GTA6 trailer with Salad

The gaming community was recently electrified by the release of a new trailer for "Grand Theft Auto VI" (GTA6), a title known for its immersive gameplay and hyper-realistic graphics. To gauge the level of detail and realism in the game's graphics, we conducted an interesting experiment: we ran the trailer through the YOLOv8 model, a cutting-edge object detection AI, hosted on SaladCloud. The results were nothing short of fascinating, providing a glimpse into the intricate world that GTA6 promises to offer.

YOLO (You Only Look Once) models are renowned for their efficiency and accuracy in detecting objects in images and videos. We chose YOLOv8 for its latest advancements in machine learning and its ability to discern objects with high precision. We picked a medium pretrained model provided by Ultralytics: "yolov8m.pt". To facilitate this experiment, we utilized SaladCloud, the most affordable GPU compute platform available today. We created an API that, upon receiving the URL and storage information, processes the video through the YOLOv8 model and saves all detections to our storage account. Additionally, it generates a summary detailing how long each object was present in the video. For those interested in the technical details or in replicating this experiment, we have prepared a comprehensive YOLOv8 tutorial available in Salad's documentation: YOLOv8 Deployment Tutorial. Our computational setup for this experiment included 8 vCPUs, 8 GB of memory, and an RTX 3090 GPU with 24 GB of VRAM. Remarkably, this configuration is priced at only $0.29 per hour on SaladCloud.

Results from the object detection experiment

The entire process of running the video through the model and saving the results took approximately 90 seconds, which translates to a very cost-effective operation. To calculate the exact cost:

Total time: 90 seconds (1.5 minutes)
Hourly rate: $0.29
Cost for 1.5 minutes: (1.5 × 0.29) / 60 ≈ $0.0073

The cost of running the video through the model on SaladCloud for 1.5 minutes came to approximately $0.0073, or about 0.73 cents. This exceptionally low cost demonstrates the efficiency and affordability of using SaladCloud for high-end GPU compute tasks.

Let's check our results now. The model easily detected and tracked the main characters, especially when they were the central figures in a scene. What is more impressive is how the model performed at detecting NPCs in the bustling scenes set on the beaches of Vice City, even amidst massive crowds. This level of accuracy is crucial for understanding the dynamics of densely populated game environments, a staple in the GTA series. Another area where YOLOv8 excelled was in identifying the various modes of transportation that are central to the GTA experience, such as motorcycles, cars, and boats. Accuracy in this domain is essential given the franchise's emphasis on vehicular exploration and interaction.

However, the model wasn't flawless. In some instances, it confused birds with kites, likely due to their similar appearance in motion. A gator from one of the scenes was mistaken for a dog, probably because gators are not among the labels in the pretrained model. The model's performance in analyzing aerial shots and bird's-eye views of the city was also noteworthy. Capturing details from such perspectives can be challenging due to changes in scale and perspective, yet YOLOv8 managed to do a commendable job.
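For readers who want a feel for what the API does under the hood, here is a minimal sketch using the Ultralytics package to track objects in a local copy of the trailer and tally how long each class stays on screen. The file path and frame-rate handling are illustrative; this mirrors the idea of the API described above rather than reproducing the exact tutorial code.

# Minimal sketch: track objects in a video with YOLOv8 and summarize per-class screen time.
# Assumes `pip install ultralytics` and a local video file.
from collections import Counter
import cv2
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                       # medium pretrained model used in this experiment
video_path = "gta6_trailer.mp4"                  # hypothetical local path

fps = cv2.VideoCapture(video_path).get(cv2.CAP_PROP_FPS) or 30
frames_per_class = Counter()

# stream=True yields one result per frame without loading the whole video into memory.
for result in model.track(source=video_path, stream=True, verbose=False):
    labels = {model.names[int(c)] for c in result.boxes.cls}
    for label in labels:
        frames_per_class[label] += 1             # count frames in which each class appears

for label, frames in frames_per_class.most_common():
    print(f"{label}: on screen for ~{frames / fps:.1f} seconds")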
Perhaps one of the most striking demonstrations of the model's capabilities was its detection of small details, such as bottles on the shelves in a shop scene. Here is a count of all the unique objects our solution detected in the trailer:

Object in GTA6 trailer | Count
Person | 133
Car | 65
Bird | 23
Bottle | 16
Kite | 15
Truck | 9
Motorcycle | 9
Boat | 8
Chair | 7
Umbrella | 4
Bus | 3
Airplane | 2
Sports ball, cat, dog, traffic light | 1 each

Overall, the results from running the GTA6 trailer through YOLOv8 on SaladCloud illustrate the remarkable advancements in both video game graphics and AI technology. As we move forward, such synergies between AI and gaming are likely to enhance our virtual experiences, blurring the lines between the digital and real world even further. GTA6, with its stunning graphics validated by computer vision, is poised to be more than just a game; it's a glimpse into the future of immersive virtual experiences.

What also stands out is the cost-effectiveness of SaladCloud. Running this sophisticated AI analysis cost us merely 0.73 cents, a testament to the affordability of high-end GPU capabilities for object detection. SaladCloud's role in enabling the analysis of GTA6's stunning graphics with YOLOv8's precision at such a low cost highlights the growing accessibility of advanced technology like computer vision in gaming and AI. This synergy is not just pushing the boundaries of virtual experiences but also making them more attainable, heralding a future where such advancements are within reach of a wider audience.

Training a custom YOLOv8 model on Salad for just $0.25

YOLOv8 training & deployment tutorial on SaladCloud

Training a Custom YOLOv8 Model for Logo Detection

In the dynamic world of AI and machine learning, the ability to customize is immensely powerful. Our previous exploration delved into deploying a pre-trained YOLOv8 model using Salad's cloud infrastructure, revealing 73% cost savings in real-time object tracking and analysis. Advancing this journey, we're now focusing on training a customized YOLO (You Only Look Once) model using SaladCloud's distributed infrastructure. In this training, we focused on processing times, cost efficiency, and model accuracy, the things that matter in real-world use cases.

Training custom models is notably more resource-intensive than running pre-trained ones. It demands substantial GPU power and time, translating into higher costs. This is especially true for deep learning models used in object detection, where numerous parameters are fine-tuned over extensive datasets. The process involves repeatedly processing large amounts of data, making heavy use of GPU resources for extended periods. Here are some of our considerations for this training:

Dataset and Preparation

For our testing, we decided to create a custom model that can detect popular logos.

Training Approach

Salad's Role in Streamlining Training

Training Results Overview: Cost-Effectiveness and Performance of YOLOv8 Models

As we delve into the world of custom model training, it's crucial to evaluate both the financial and performance aspects of the models we train. Here, we provide a concise comparison of the YOLOv8 Nano, Small, and Medium models, highlighting their training duration and associated costs when trained on SaladCloud, a platform celebrated for its efficiency and cost-effectiveness. First, let's check the performance differences based on validation results: it seems that each successive model is slightly better than the previous one. Next, let's check how long it took to train each model and how much we spent using SaladCloud. Each model brings unique strengths to the table, with the Nano model offering speed and cost savings, while the Medium model showcases the best performance for more intensive applications. It is remarkable that we got a performant custom detection model for only 25 cents.

Bringing Custom Models to Life: Tracking Coca-Cola Labels

With our custom-trained YOLO model in hand, we now want to test it in real life. We will run a logo-tracking experiment on the iconic Coca-Cola Christmas commercial. This real-world application illustrates the practical utility of our model in dynamic, visually rich scenarios. For those eager to replicate this process or deploy their own models for similar tasks, detailed instructions are available in our previous article, which walks you through the steps of running inference on Salad's cloud platform. Let's now see the performance of our YOLO model in action, and witness how it keeps up with the holiday spirit, frame by frame. As a result, we can see that our custom-trained model works not only on images but also on videos, adding tracking capabilities.

Conclusion: An Extremely Affordable Path to Custom Model Training

By harnessing the power of SaladCloud, we managed to train three distinct YOLO models, each tailored to the same dataset and unified by consistent hyperparameters. The training took under an hour, at the economical sum of one dollar. The culmination of this process is a robust model fine-tuned for real-world applications, remarkably realized at the modest expense of a quarter.
This endeavor not only highlights the feasibility of developing custom AI solutions on a budget but also showcases the potential for such models to be rapidly deployed and iteratively improved in commercial and research settings.
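For readers who want to reproduce a training run like the ones described above, here is a minimal sketch using the Ultralytics API. The dataset YAML, epoch count, and image size are illustrative placeholders rather than the exact settings we used.

# Minimal sketch of training and validating a custom YOLOv8 model with Ultralytics.
# "logos.yaml" (dataset paths and class names) and the hyperparameters below are placeholders.
from ultralytics import YOLO

# Start from a pretrained checkpoint; swap in yolov8s.pt or yolov8m.pt for the larger variants.
model = YOLO("yolov8n.pt")

model.train(
    data="logos.yaml",   # dataset definition: train/val image paths and class names
    epochs=50,
    imgsz=640,
    batch=16,
)

metrics = model.val()     # run validation and collect detection metrics
print(metrics.box.map50)  # mAP@0.5 on the validation set

# The trained weights can then be used for tracking on a video, as in the commercial experiment above.
results = model.track(source="coca_cola_commercial.mp4", save=True)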

YOLOv8 Benchmark: Object Detection on Salad’s GPUs (73% Cheaper Than Azure)

YOLOv8 object detection on GPUs - blog cover

What is YOLOv8?

In the fast-evolving world of AI, object detection has made remarkable strides, epitomized by YOLOv8. YOLO (You Only Look Once) is an object detection and image segmentation model launched in 2015, and YOLOv8 is the latest version, developed by Ultralytics. The algorithm is not just about recognizing objects; it's about doing so in real time with unparalleled precision and speed. From monitoring fast-paced sports events to overseeing production lines, YOLOv8 is transforming how we see and interact with moving images. With features like spatial attention, feature fusion, and context aggregation modules, YOLOv8 is being used extensively in agriculture, healthcare, and manufacturing, among other industries. In this YOLOv8 benchmark, we compare the cost of running YOLO on Salad and Azure.

Running object detection on SaladCloud's GPUs: A fantastic combination

YOLOv8 can be run on GPUs, as long as they have enough memory and support CUDA. But with the GPU shortage and high costs, you need GPUs rented at affordable prices to make the economics work. SaladCloud's network of 10,000+ NVIDIA consumer GPUs has the lowest prices in the market and is a perfect fit for YOLOv8. Deploying YOLOv8 on SaladCloud democratizes high-end object detection, offering it on a scalable, cost-effective cloud platform for mainstream use. With GPUs starting at $0.02/hour, Salad offers businesses and developers an affordable, scalable solution for sophisticated object detection at scale.

A deep dive into live stream video analysis with YOLOv8

This benchmark harnesses YOLOv8 to analyze not only pre-recorded but also live video streams. The process begins by capturing a live stream link, followed by real-time object detection and tracking. Using GPUs on SaladCloud, we can process each video frame in less than 10 milliseconds, which is 10 times faster than using a CPU. Each frame's data is meticulously compiled, yielding a detailed dataset that provides timestamps, classifications, and other critical metadata. As a result, we get a nice summary of all the objects present in our video.

How to run YOLOv8 on SaladCloud's GPUs

We introduced a FastAPI application with a dual role: it processes video streams in real time and offers interactive documentation via Swagger UI. You can process live streams from YouTube, RTSP, RTMP, and TCP, as well as regular videos. All the results will be saved in an Azure storage account you specify. All you need to do is send an API call with the video link, a flag indicating whether the video is a live stream or not, the storage account information, and timeframes for how often you want to save the results. We also integrated multithreading capabilities, allowing multiple video streams to be processed simultaneously.

Deploying on SaladCloud

In our step-by-step guide, you can go through the full deployment journey on SaladCloud. We configured container groups, set up efficient networking, and ensured secure access. Deploying the FastAPI application on Salad proved to be not just technically feasible but also cost-effective, highlighting the platform's efficiency.

Price comparison: Processing live streams and videos on Azure and Salad

When it comes to deploying object detection models, especially for tasks like processing live streams and videos, understanding the cost implications of different cloud services is crucial.
Let's do some price comparison for our live stream object detection project.

Context and Considerations

Live Stream Processing: Live streams are unique in that they can only be processed as the data is received. Even with the best GPUs, processing is limited to the current feed rate.

Azure's Real-Time Endpoint: We assume the use of an ML Studio real-time endpoint in Azure for a fair comparison. This setup aligns with a synchronous process that doesn't require a fully dedicated VM.

Azure Pricing Overview

We will now compare the compute prices in Azure and Salad. Note that in Azure you cannot pick RAM, vCPU, and GPU memory separately; you can only pick preconfigured compute sizes. With Salad, you can pick exactly what you need.

Lowest GPU Compute in Azure: For our price comparison, we'll start by looking at Azure's lowest GPU compute price, keeping in mind that the closest model to our solution is YOLOv5.

1. Processing a Live Stream

Service | Configuration | Cost per hour | Remarks
Azure | 4 cores, 16 GB RAM (no GPU) | $0.19 | General-purpose compute, no dedicated GPU
Salad | 4 vCores, 16 GB RAM | $0.032 | Equivalent to Azure's general compute

Percentage cost difference for general compute: Salad is approximately 83% cheaper than Azure for general compute configurations.

2. Processing with GPU Support

This is the GPU Azure recommends for YOLOv5.

Service | Configuration | Cost per hour | Remarks
Azure | NC16as_T4_v3 (16 vCPU, 110 GB RAM, 1 GPU) | $1.20 | Recommended for YOLOv5
Salad | Equivalent GPU configuration | $0.326 | Salad's equivalent GPU offering

Percentage cost difference for GPU compute: Salad is approximately 73% cheaper than Azure for similar GPU configurations.

YOLOv8 deployment on GPUs in just a few clicks

You can deploy YOLOv8 in production on SaladCloud's GPUs in just a few clicks. Simply download the code from our GitHub repository or pull our ready-to-deploy Docker container from the Salad Portal. It's as straightforward as it sounds: download, deploy, and you're on your way to exploring the capabilities of YOLOv8 in real-world scenarios. Check out the SaladCloud documentation for quick guides on how to start using our batch or synchronous solutions.

Check out our step-by-step guide

To get a comprehensive, step-by-step guide on how to deploy YOLOv8 on SaladCloud, check out our step-by-step guide here. The guide walks through the full deployment process and is fully customizable to your needs. Follow along, make modifications, and experiment to your heart's content. It is designed to be flexible, allowing you to adjust and enhance the deployment of YOLOv8 according to your project requirements or curiosity. We are excited about the potential enhancements and extensions of this project. Future considerations include broadening cloud integrations, delving into custom model training, and exploring batch processing capabilities.
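To give a concrete sense of how the FastAPI service described earlier is used, here is a hypothetical client call. The endpoint path and field names are illustrative assumptions; the actual schema is defined in the step-by-step guide.

# Hypothetical client call to the live-stream object detection API described above.
import requests

API_URL = "https://your-container-group.example.salad.cloud"   # placeholder deployment URL

payload = {
    "video_url": "https://www.youtube.com/watch?v=EXAMPLE",     # live stream or regular video
    "is_live_stream": True,
    "save_interval_seconds": 30,             # how often results are written to storage
    "azure_storage_account": "mystorageacct",
    "azure_container": "yolo-results",
    "azure_connection_string": "<secret>",   # keep secrets out of source control
}

response = requests.post(f"{API_URL}/process_video", json=payload, timeout=30)
response.raise_for_status()
print(response.json())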

A New Price-Performance Standard for BERT Transformers.

New price-performance standard for BERT transformers

Engineers from Numenta used Salad Container Engine (SCE) to benchmark a first-of-its-kind intelligent computing platform that optimizes BERT transformer networks. Learn how Numenta attained 10x more inferences per dollar on SCE.

Challenge: Optimizing AI Systems

Deploying practical artificial intelligence applications at scale requires the distribution of large data sets to complex networks of specialized hardware. Though deep neural networks have facilitated significant advancements, their fundamental reliance on highly available processing resources and their tendency toward rapid expansion make it costly and inefficient to run transformers in the public cloud.

Price-Performance Comparison

Solution: Optimizing AI Systems

Leveraging insights from 20 years of neuroscience research, Numenta has developed breakthrough advances in AI that deliver dramatic performance improvements across broad use cases. Grounded in the sensorimotor framework of intelligence elaborated by co-founder Jeff Hawkins in A Thousand Brains, Numenta's innovative technology turns the principles of human learning into new architectures, data structures, and algorithms that deliver disruptive performance improvements.

Case Study: 10x Price Performance

In a side-by-side comparison, Numenta's optimized BERT technologies improved the throughput of a standard transformer network by up to 6.5x. When deployed on SCE, Numenta attained 10x more inferences per dollar than possible with on-demand offerings from AWS, and managed to beat the cost efficiency of the nearest spot-basis instance by 2.39x.

About Numenta

Numenta has developed new artificial intelligence technologies that deliver breakthrough performance in AI/ML applications such as natural language processing and computer vision. Backed by two decades of neuroscience research, Numenta's novel architectures, data structures, and algorithms deliver disruptive performance improvements. Numenta is currently engaged in a private beta with several Global 100 companies and startups to apply its platform technology across the full spectrum of AI, from model development to deployment, and ultimately enable novel hardware architectures and whole new categories of applications.

SaladCast Episode 11: Jared Carpenter on Salad’s Go-to-Market Strategy

SaladCast interview with Jared Carpenter

Welcome to SaladCast! In this podcast series, we introduce you to Salad Chefs from all corners of the Infinite Kitchen. We hope you'll join us as we get to know members of our community, indie developers, and teammates from our very own Salad staff.

In this episode: Bob continues his journey to open source the day-to-day efforts of Salad's lean, "non-fat" team. Join our intrepid CEO and Director of Channel Partners Jared Carpenter as they peel back the layers on Salad's guerilla marketing rollout, the history of the Salad Chefs Discord, and our burgeoning creator partnerships.

Episode Highlights

Highlights content has been edited and slightly reordered for clarity.

How did you come to work at Salad?

One day I'm going to write an article called How Answering a Reddit Post Changed My Life, because it really was as simple as that. Y'all picked me up in September 2018 when I replied to Salad's post on r/HireaWriter looking for game writers. At the time, all I wanted to do was get involved in game writing and build my portfolio. I was up late every night freelance writing for different game review sites or doing tutorials. I saw your post pop up a week too late, but something told me to fire off a direct message anyway.

I've heard you say in the past it was Salad's logo that gave you the sense we were legit.

That's correct. My experience with other places had been dodgy. I was fine with contracting, and I had no expectations with the companies I was contracting with. If they were paying me, I was happy, and y'all were paying me, so I was good. But I remember seeing the logo and thinking, "That's a schnazzy logo. I can tell there are people involved in this." And then I saw your faces, we got on a call pretty early into my contract, and I thought, "Okay, real people exist at Salad. It's not just Skynet messaging me, telling me to write this crypto crap and tie it into gaming for some nefarious purpose."

How did you convince users to give it a go with Salad's alpha?

BOB: You just touched on a common misconception. Around 2017, a lot of people believed that crypto was a virus, or somehow dodgy, not to mention the sentiment among gamers that GPU mining could damage their computers. (Editor's note: clean yer fans, ya casuals) Along comes this company with a completely new value proposition: share your computer for rewards. How did you convince users to give it a go with Salad's alpha?

I bothered people, over and over. In the beginning, even I was a Salad doubter. I figured I'd work here for a year, then move on from this crypto scam stuff, and, of course, I quickly began to learn once I became part of the team. The toughest part was educating myself about the pain points and concerns from users. Will this hurt my hardware? Is this profitable, or efficient? Do I have to worry about privacy? These are unfounded criticisms when you know how we operate, but they're all valid questions because the general zeitgeist says there are bad actors in the space. In a post-truth world, those rumors are taken as true.

How did you confront that?

I would describe it as "swimming against the current." The go-to-market strategy was all about getting the proper information put in place, under our brand and in our voice. Some of these articles now have hundreds of thousands of views, but at the time we really had no plan for pay-per-click or influencer campaigns. We just needed people to use the app and help test the dang thing out. So we took to Discord servers and basically spammed our invite link in lobbies.
We'd get banned immediately, but sometimes people would notice and take interest, like, "What's this? I want free money from my PC."

What was your strategy for engaging those users?

Then we'd go through the whole list. Being upfront about electricity use and profitability helped us convince people of Salad's potential. Our addressable market was much smaller then, because we were mostly talking to younger gamers living with their parents or in college dorms. You also didn't make nearly as much as you do today, but it was a great deal for that key demo of people who didn't have access to credit cards or any other traditional financial resources. That was our first unlock: solving a huge pain point for the people who would be willing to educate themselves, rise above the FUD, and get that five bucks. All you need to do is turn on your PC.

How did personal intervention become a scalable model?

BOB: You're talking two to three hours of one-on-one education. That was impactful in generating our first few hundred users. How did that become a scalable model?

Conventional ads were cost-prohibitive at that stage, so we came to rely on core power users and moderators like Tasha to help us get the word out and build up the Salad Chefs Discord server. To scale our acquisition strategy, we took advantage of Discord's unofficial ad ecosystem, where people trade server pings for exposure. We partnered with about sixty big servers, with some pretty trash ones among them, and cross-posted invite links. That was useful, but it only generated a trickle of ten or fifteen users per day. We eventually held a Nitro giveaway with Gamer's Garage, an LFG server, and that was the secret sauce. That brought a few hundred people to the server, and our first hundred users on the network.

That speaks to the power of social proof!

Right. When we started focusing on growing the Discord itself, we saw how meaningful it was for new users to interact with our community moderators. Getting that social proof from someone who volunteered their support means a lot more than when a community manager like me says, "Try my freakin' app!"

If phase one was person-to-person education, and phase two was community interaction, what's phase three of Salad's acquisition strategy?
