Understanding Large Model Parameters | pyVideoTrans-Open Source Video Translation Tool -pyvideotrans.com github.com/jianchang512/pyvideotrans

Wanting to deploy large models locally to save money and protect data privacy is a great idea!

But once you dive into the world of models, the range of parameter sizes can be confusing: 7B, 14B, 32B, 70B... Even the same model is released in several sizes, so which one should you choose?

And how capable is my computer, and which of these models can it actually run?

Don't panic! This article will help you sort it out. It explains, as simply as possible, how to choose hardware for running large models locally, so you won't be confused after reading it!

There is a Hardware Configuration and Model Size Reference Table at the bottom of this article.

Understanding Large Model Parameters: What do 7B, 14B, 32B Represent?

The Meaning of Parameters: The numbers 7B, 14B, 32B represent the number of parameters in a large language model (LLM), where "B" is an abbreviation for Billion. Parameters can be thought of as the "weights" that the model learns during training, and they store the model's understanding of language, knowledge, and patterns.
Number of Parameters and Model Capability: Generally speaking, the more parameters a model has, the more complex it is, and in theory, it can learn and store richer information, thereby capturing more complex language patterns, and performing more powerfully in understanding and generating text.
Resource Consumption and Model Size: A model with more parameters also needs more computing resources (GPU computing power) and more memory (VRAM and system RAM) to run, and more data and compute to train.
Small Models vs. Large Models:
- Large Models (such as 32B, 65B or larger): Capable of handling more complex tasks, generating more coherent and nuanced text, and may perform better in knowledge Q&A, creative writing, etc. However, they have high hardware requirements and run relatively slowly.
- Small Models (such as 7B, 13B): Consume fewer resources, run faster, and are more suitable for running on devices with limited resources, or in application scenarios that are sensitive to latency. Small models can also perform well on some simple tasks.
The Trade-off of Choice: Choosing a model size requires a trade-off between the model's capabilities and hardware resources. More parameters are not necessarily "better". The most suitable model needs to be selected according to the actual application scenario and hardware conditions.
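
As a rough illustration of why parameter count drives hardware requirements, the sketch below estimates the memory taken by the weights alone: parameters multiplied by bits per weight, divided by 8. The function name is hypothetical, and the estimate deliberately ignores runtime overhead such as the context cache, so real usage will be somewhat higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same precision and
    nothing else (context cache, activations, runtime buffers) is counted.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare common model sizes at 16-bit, 8-bit and 4-bit precision.
    for size in (7, 14, 32, 70):
        row = ", ".join(
            f"{bits}-bit: {weight_memory_gb(size, bits):.1f} GB"
            for bits in (16, 8, 4)
        )
        print(f"{size}B -> {row}")
```

For example, a 7B model needs about 14 GB for its weights at 16-bit precision but only about 3.5 GB at 4-bit, which is why quantized small models fit on consumer hardware.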

What Kind of Hardware Do I Need to Run a Local Model?

Core Requirement: Video Memory (VRAM)
- Importance of VRAM: When running a large model, the model's parameters and intermediate calculation results need to be loaded into video memory. The amount of video memory is therefore the most critical hardware indicator for running large models locally. With insufficient video memory, the model may fail to load, you may be limited to very small models, or inference may slow down severely.
- The Bigger the Better: Ideally, it is best to have a GPU with as much video memory as possible, so that you can run models with larger parameters and get better performance.
Second Most Important: System Memory (RAM)
- The Role of RAM: System RAM is used to load the operating system, run programs, and serve as a supplement to video memory. When video memory is insufficient, system RAM can be used as "overflow" space, but this is much slower (system RAM has far lower bandwidth than VRAM, and the data must cross the PCIe bus to reach the GPU), so the model runs significantly less efficiently.
- Sufficient RAM is also Important: At least 16GB of system RAM is recommended, and 32GB or more is better. This matters most when your GPU's video memory is limited, because more RAM helps relieve video memory pressure.
Processor (CPU)
- The Role of the CPU: The CPU is mainly responsible for data preprocessing, model loading, and some model calculation tasks (especially in the case of CPU offloading). A CPU with better performance can improve the model loading speed and assist the GPU in calculations to a certain extent.
- NPU (Neural Processing Unit): Some laptops are equipped with an NPU, which is dedicated hardware for accelerating AI calculations. The NPU can accelerate specific types of AI operations, including the inference process of some large models, thereby improving efficiency and reducing power consumption. An NPU is a plus if your laptop has one, but the GPU is still the core of running large models locally. How well the NPU is supported, and how much it helps, depends on the specific model and software.
Storage (Hard Disk/SSD)
- The Role of Storage: You need enough disk space to store model files, which are usually very large. For example, a 4-bit quantized 7B model may require 4-5GB of space, and larger models require tens or even hundreds of GB.
- SSD is Better than HDD: Using a solid-state drive (SSD) instead of a mechanical hard drive (HDD) can significantly speed up model loading.
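
The interplay between VRAM and system RAM described above can be sketched as a simple placement check. This is a minimal illustration, not how any particular runtime decides: the function name, the 1.5 GB runtime overhead, and the 4 GB kept free for the operating system are all assumptions chosen for the example.

```python
def placement(model_gb: float, vram_gb: float, ram_gb: float,
              runtime_overhead_gb: float = 1.5,
              os_reserve_gb: float = 4.0) -> str:
    """Classify where a model of a given size could live.

    Assumptions (illustrative only): the runtime needs a fixed overhead
    on top of the model file, and part of the system RAM stays reserved
    for the operating system and other programs.
    """
    needed = model_gb + runtime_overhead_gb
    if needed <= vram_gb:
        return "fits in VRAM"
    usable_ram = max(ram_gb - os_reserve_gb, 0.0)
    if needed <= vram_gb + usable_ram:
        return "needs system RAM overflow (slower)"
    return "does not fit"
```

With these assumptions, `placement(4.5, 8, 16)` reports that a 4.5 GB model fits in 8 GB of VRAM, while `placement(20, 8, 32)` reports that a 20 GB model would have to spill into system RAM.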

Hardware Priority

Video Memory (VRAM) (Most Important)
System Memory (RAM) (Important)
GPU Performance (Computing Power) (Important)
CPU Performance (Auxiliary Role)
Storage Speed (SSD is Better than HDD)
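
To see where a machine stands on the items above, a short script can report VRAM, system RAM, CPU cores, and free disk space. The sketch below uses only the Python standard library. It assumes an NVIDIA GPU with the `nvidia-smi` tool on the PATH for the VRAM query, and it reads system RAM through `os.sysconf`, which is not available on Windows; in either case the function falls back to an empty result instead of failing. The function names are hypothetical.

```python
import os
import shutil
import subprocess
from typing import List, Optional


def nvidia_vram_mib() -> List[int]:
    """Total VRAM per NVIDIA GPU in MiB; empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


def system_ram_gib() -> Optional[float]:
    """Physical RAM in GiB, or None where os.sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 2**30


def free_disk_gib(path: str = ".") -> float:
    """Free space in GiB on the drive that holds the given path."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    print("VRAM per GPU (MiB):", nvidia_vram_mib() or "no NVIDIA GPU detected")
    print("System RAM (GiB):", system_ram_gib())
    print("CPU cores:", os.cpu_count())
    print(f"Free disk (GiB): {free_disk_gib():.1f}")
```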

What if I Don't Have a Dedicated GPU?

Integrated Graphics and CPU Operation: If you don't have a dedicated GPU, you can still use integrated graphics (such as Intel Iris Xe) or rely entirely on the CPU to run the model. However, the performance will be greatly limited. It is recommended to focus on running 7B or even smaller, highly optimized models and use technologies such as quantization to reduce resource requirements.
Cloud Services: If you need to run large models but the local hardware is insufficient, you can consider using cloud GPU services, such as Google Colab, AWS SageMaker, RunPod, etc.
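
One way to get a feel for why CPU-only operation is so limited: when generating text, the hardware has to read roughly all of the model's weights for every token, so memory bandwidth sets a ceiling on speed. The sketch below applies that rule of thumb. It is an approximation that gives an upper bound, not a benchmark, and the function name is hypothetical. The bandwidth figures in the usage note are theoretical peaks used only for illustration.

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on generation speed.

    Assumption: each generated token requires reading the full set of
    weights once, so speed cannot exceed bandwidth / model size.
    Real throughput is lower.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb
```

For a 4 GB quantized model, dual-channel DDR4-3200 (about 51.2 GB/s peak) gives a ceiling of roughly 13 tokens per second, while an RTX 4090 (about 1008 GB/s peak) gives a ceiling of roughly 250.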

How to Run a Local Model?

For beginners, it is recommended to use some user-friendly tools that simplify the process of running local models:

Ollama: Operated through the command line, but the installation and use are very simple, focusing on quickly running the model.
LM Studio: The interface is simple and intuitive, supports model download, model management, and one-click operation.
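
Once Ollama is installed and a model has been downloaded, it can also be called from code through the local HTTP API it serves, by default on port 11434. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the model name used in the usage note is just a placeholder for whichever model you have pulled.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Requires the Ollama service to be running and the model to be present.
    """
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Calling `generate("llama3.2", "Explain VRAM in one sentence.")` would return the model's answer as a string, assuming a model with that name has been pulled.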

Hardware Configuration and Model Size Reference Table

Hardware Type	Memory	Quantization Precision	Recommended LLM Parameter Range (After Quantization)	Notes

X86 Laptop
Integrated Graphics Laptop (such as Intel Iris Xe)	Shared System Memory (8GB+ RAM)	8-bit, even 4-bit Quantization	≤ 7B (Extremely Quantized)	* Very basic local running experience, suitable for learning and light experience. * Limited performance, slow inference speed. * It is recommended to use 4-bit or lower precision quantized models to minimize video memory usage. * Suitable for running small models, such as TinyLlama, etc.
Entry-level Gaming Laptop/Thin and Light Dedicated Graphics Laptop (such as RTX 3050/4050)	4-8 GB VRAM + 16GB+ RAM	4-bit - 8-bit Quantization	7B - 13B (Quantized)	* Can run 7B models relatively smoothly, and some 13B models can also run through quantization and optimization. * Suitable for experiencing some mainstream small and medium-sized models. * Note that VRAM is still limited, and it will be difficult to run large models.
Mid-to-High-end Gaming Laptop/Mobile Workstation (such as RTX 3060/3070/4060)	8-16 GB VRAM + 16GB+ RAM	4-bit - 16-bit (Flexible Choice)	7B - 30B (Quantized)	* Able to run 7B and 13B models more comfortably, and has the potential to try models around 30B (requires better quantization and optimization). * You can choose different quantization precisions according to your needs to achieve a balance between performance and model quality. * Suitable for exploring more types of medium and large models.

ARM (Raspberry Pi, Apple M Series)
Raspberry Pi 4/5	4-8 GB RAM	4-bit Quantization (or Lower)	≤ 7B (Extremely Quantized)	* Limited by memory and computing power, it is mainly used to run extremely small models or as an experimental platform. * Suitable for researching model quantization and optimization technologies.
Apple M1/M2/M3 (Unified Memory)	8GB - 64GB Unified Memory	4-bit - 16-bit (Flexible Choice)	7B - 30B+ (Quantized)	* The unified memory architecture makes memory utilization more efficient, so even an M-series Mac with 8GB of memory can run small models. * Versions with more memory (16GB+) can run larger models, and can even try models above 30B. * Apple chips have an advantage in energy efficiency.

Nvidia GPU Computer
Entry-level Dedicated Graphics Card (such as RTX 4060/4060Ti)	8-16 GB VRAM	4-bit - 16-bit (Flexible Choice)	7B - 30B (Quantized)	* Performance is similar to mid-to-high-end gaming laptops, but desktop computers have better heat dissipation and can run stably for a long time. * Cost-effective, suitable for entry-level local LLM players.
Mid-range Dedicated Graphics Card (such as RTX 4070/4070Ti/4080)	12-16 GB VRAM	4-bit - 16-bit (Flexible Choice)	7B - 30B+ (Quantized)	* Able to run medium and large models more smoothly, and has the potential to try models with larger parameters. * Suitable for users who have high requirements for local LLM experience.
High-end Dedicated Graphics Card (such as RTX 3090/4090, RTX 6000 Ada)	24-48 GB VRAM	4-bit - 32-bit	7B - 70B+ (Quantized/Native)	* Able to run most open source LLMs, including large models (such as 65B, 70B) when quantized. * You can run smaller models at higher precision (such as 16-bit or 32-bit) to obtain the best model quality, or use quantization to fit larger models. * Suitable for professional developers, researchers, and heavy LLM users.
Server-grade GPU (such as A100, H100, A800, H800)	40GB - 80GB+ VRAM	16-bit - 32-bit (Native Precision)	30B - 175B+ (Native/Quantized)	* Designed for AI computing, with ultra-large video memory and extremely strong computing power. * Can run ultra-large models, and can even perform model training and fine-tuning. * Suitable for enterprise-level applications, large-scale model deployment, and research institutions.
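
The parameter ranges in the table can be sanity-checked with a quick calculation: take the available memory, set some of it aside for the runtime, and divide the rest by the bytes each weight occupies. The sketch below does this. The function name is hypothetical and the 20% reserve is an assumption picked for illustration, so the result is a rough ceiling rather than a recommendation.

```python
def max_params_billion(memory_gb: float, bits_per_weight: float,
                       reserve_fraction: float = 0.2) -> float:
    """Largest model (in billions of parameters) whose weights fit.

    Assumption: reserve_fraction of the memory is kept free for the
    context cache and runtime buffers; the rest holds the weights.
    """
    usable_bytes = memory_gb * 1e9 * (1 - reserve_fraction)
    return usable_bytes * 8 / bits_per_weight / 1e9
```

Under these assumptions, 8 GB of VRAM holds roughly a 12.8B model at 4-bit but only about 3.2B at 16-bit, and 24 GB holds roughly 38B at 4-bit, which is broadly in line with the ranges listed above.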

Table Supplement

Quantization: Refers to reducing the numerical precision of model parameters, such as from 16-bit floating-point numbers (float16) to 8-bit integers (int8) or 4-bit integers (int4). Quantization can significantly reduce model size and video memory usage, and accelerate inference speed, but may slightly reduce model accuracy.
Extreme Quantization: Refers to quantization at very low bit precision, such as 3-bit or 2-bit. It reduces resource requirements further, but the loss of model quality may be more noticeable.
Native: Refers to running the model at its original precision, such as float16 or bfloat16. This gives the best model quality, but has the highest resource requirements.
Parameter Range After Quantization: "Recommended LLM Parameter Range (After Quantization)" in the table refers to the model parameter range that the hardware can run smoothly under the premise of reasonable quantization. The actual model size and performance that can be run also depend on the specific model architecture, quantization level, software optimization, and other factors. The parameter range given here is for reference only.
Unified Memory: A feature of Apple Silicon chips, where the CPU and GPU share the same physical memory, and data exchange efficiency is higher.
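
To make the idea of quantization concrete, the sketch below maps a list of floating-point weights to 8-bit integers using one shared scale factor, then converts them back. This is the simplest symmetric scheme, shown only to illustrate the principle; quantization formats used for real LLMs typically work on small blocks of weights and are more elaborate. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization with a single scale for all weights."""
    if not weights:
        return [], 1.0
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.8, -1.0, 0.03, 0.0, 0.41]
    q, scale = quantize_int8(original)
    restored = dequantize(q, scale)
    for w, r in zip(original, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.5f})")
```

Each restored value differs from the original by at most half the scale step. That small rounding error is the accuracy cost of quantization, paid in exchange for storing one byte per weight instead of two or four.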
