Want to deploy large models locally to save money and protect data privacy? That's a great idea!
But once you dive into the world of models, you'll quickly get confused by all the parameter counts and variants: 7B, 14B, 32B, 70B... So many sizes of the same model: which one should you choose?
What's the capability of my computer, and which models can it run smoothly?
Don't panic! This article will help you sort things out, telling you in the simplest way how to choose hardware for local deployment of large models! Guaranteed to clear up your confusion!
There's a Hardware Configuration and Model Size Reference Table at the bottom of this article.
Understanding Large Model Parameters: What Do 7B, 14B, 32B Mean?
- Meaning of Parameters: Numbers like 7B, 14B, 32B represent the number of parameters in a Large Language Model (LLM), where "B" stands for Billion. Parameters can be thought of as the "weights" the model learns during training, storing its understanding of language, knowledge, and patterns.
- Parameter Count vs. Model Capability: Generally, more parameters mean a more complex model that can theoretically learn and store richer information, capture more intricate language patterns, and perform better at understanding and generating text.
- Resource Consumption vs. Model Size: Models with more parameters also require more computational resources (GPU power) and more memory (VRAM and system RAM) to run, and more data to train (a rough memory estimate is sketched after this list).
- Small Models vs. Large Models:
- Large Models (e.g., 32B, 65B and larger): Can handle more complex tasks, generate more coherent and nuanced text, and may excel at knowledge Q&A, creative writing, etc. However, they have higher hardware requirements and run more slowly.
- Small Models (e.g., 7B, 13B): Consume fewer resources, run faster, and are more suitable for running on resource-limited devices or in latency-sensitive scenarios. Small models can also perform well on simple tasks.
- Trade-offs in Choice: Choosing model size involves balancing model capability against hardware resources. More parameters aren't always "better"; you need to select the most suitable model based on your actual use case and hardware conditions.
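To make the parameter-count vs. resource trade-off concrete, here is a minimal back-of-the-envelope sketch in plain Python. It only estimates the memory needed to hold the weights themselves (parameters × bytes per parameter); real usage is higher once you add the KV cache and runtime overhead, so treat the numbers as rough lower bounds.

```python
# Rough rule of thumb: memory for the weights alone is
#   parameter_count * bytes_per_parameter
# Actual usage is higher (KV cache, activations, runtime overhead).

BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate GB needed just to store the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1024**3

for size in (7, 13, 32, 70):
    estimates = ", ".join(
        f"{prec}: ~{weight_memory_gb(size, prec):.1f} GB" for prec in BYTES_PER_PARAM
    )
    print(f"{size}B -> {estimates}")
```

For example, a 7B model at float16 comes out to roughly 13 GB of weights, while the same model quantized to 4-bit drops to around 3.5 GB, which is why quantization matters so much on consumer hardware.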
What Kind of Hardware Do I Need to Run Local Models?
Core Requirement: VRAM
- Importance of VRAM: When running large models, the model's parameters and intermediate calculation results need to be loaded into VRAM. Therefore, VRAM size is the most critical hardware metric for running local large models. Insufficient VRAM can prevent the model from loading, limit you to very small models, or severely degrade performance.
- Bigger is Better: Ideally, get a GPU with as much VRAM as possible, so you can run larger-parameter models and achieve better performance.
Secondary Importance: System RAM
- Role of RAM: System RAM is used to load the operating system, run programs, and supplement VRAM. When VRAM is insufficient, system RAM can act as "overflow" space, but because RAM is much slower than VRAM, this significantly reduces model running efficiency (see the layer-offloading sketch after this list).
- Sufficient RAM is Also Important: It's recommended to have at least 16GB or even 32GB+ of system RAM, especially if your GPU VRAM is limited, as more RAM can help alleviate VRAM pressure.
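As a rough illustration of splitting a model between VRAM and system RAM, here is a minimal sketch using the llama-cpp-python package, which lets you choose how many layers are kept on the GPU while the rest stay in RAM. The model path is a placeholder, and the n_gpu_layers value is something you would tune to your own VRAM.

```python
# Sketch of VRAM/RAM splitting with llama-cpp-python (pip install llama-cpp-python).
# Assumes a quantized GGUF model file on disk; the path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=20,   # layers kept in VRAM; the rest stay in system RAM (slower)
    n_ctx=2048,        # context window size
)

result = llm("Explain VRAM vs. system RAM in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```

Setting n_gpu_layers higher fills more VRAM but runs faster; setting it to 0 keeps everything in system RAM and runs entirely on the CPU.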
Processor (CPU)
- Role of CPU: The CPU primarily handles data preprocessing, model loading, and some model computation tasks (especially with CPU offloading). A better-performing CPU can speed up model loading and assist the GPU in computation to some extent.
- NPU (Neural Processing Unit): Some laptops come with an NPU, hardware specifically designed to accelerate AI computations. NPUs can accelerate certain types of AI operations, including inference for some large models, improving efficiency and reducing power consumption. If your laptop has an NPU, it's a bonus, but the GPU remains the core for running local large models. NPU support and effectiveness depend on the specific model and software.
Storage (HDD/SSD)
- Role of Storage: You need enough disk space to store model files. Large model files are typically huge; for example, a quantized 7B model might still take 4-5GB, while larger models can require tens or even hundreds of GB (a quick free-space check is sketched after this list).
- SSD Over HDD: Using a Solid State Drive (SSD) instead of a Hard Disk Drive (HDD) can significantly speed up model loading times.
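Before downloading a model, it's worth confirming you actually have room for it. Below is a minimal sketch using only the Python standard library; the ~5 GB figure is just the quantized 7B example from above, not a fixed rule.

```python
# Quick check that there is enough free disk space before downloading a model.
import shutil

MODEL_SIZE_GB = 5          # assumed size of a 4-bit quantized 7B model file
total, used, free = shutil.disk_usage(".")
free_gb = free / 1024**3

if free_gb < MODEL_SIZE_GB * 1.2:   # keep ~20% headroom
    print(f"Only {free_gb:.1f} GB free - not enough for a ~{MODEL_SIZE_GB} GB model.")
else:
    print(f"{free_gb:.1f} GB free - enough room to download the model.")
```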
Hardware Priority
- VRAM (Most Important)
- System RAM (Important)
- GPU Performance (Computing Power) (Important)
- CPU Performance (Supporting Role)
- Storage Speed (SSD Over HDD)
What If I Don't Have a Dedicated GPU?
- Integrated Graphics and CPU Running: If you don't have a dedicated GPU, you can still use integrated graphics (like Intel Iris Xe) or rely entirely on the CPU to run models, but performance will be significantly limited. It's recommended to focus on 7B or even smaller, highly optimized models and to use techniques like quantization to reduce resource demands (a quick hardware check is sketched after this list).
- Cloud Services: If you need to run large models but lack local hardware, consider using cloud GPU services like Google Colab, AWS SageMaker, RunPod, etc.
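If you're not sure which row of the reference table below applies to your machine, a quick hardware check helps. This sketch assumes PyTorch and psutil are installed; on machines with only integrated graphics (or on Apple Silicon), torch.cuda will simply report that no CUDA GPU is present.

```python
# Minimal "what can this machine run?" check (assumes: pip install torch psutil).
import psutil
import torch

ram_gb = psutil.virtual_memory().total / 1024**3
print(f"System RAM: {ram_gb:.1f} GB")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
else:
    print("No CUDA GPU detected - plan on small, heavily quantized models or a cloud GPU.")
```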
How to Run Local Models?
For beginners, it's recommended to use some user-friendly tools that simplify the process of running local models:
- Ollama: Operated via the command line, but installation and use are very simple; it focuses on getting models running quickly (a minimal usage example follows this list).
- LM Studio: Features a clean, intuitive interface, supports model downloading, model management, and one-click running.
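To give a taste of how simple this can be, here is a minimal sketch that talks to a locally running Ollama server through its HTTP API (default port 11434) using the requests package. It assumes you have already installed Ollama and pulled a model, e.g. with `ollama pull llama3`.

```python
# Minimal sketch of querying a local Ollama server (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                     # any model you have pulled locally
        "prompt": "Explain what 7B means in one sentence.",
        "stream": False,                       # return the full answer in one JSON object
    },
    timeout=120,
)
print(resp.json()["response"])
```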
Hardware Configuration and Model Size Reference Table
X86 Laptops

| Hardware Configuration | VRAM / RAM | Recommended Quantization | Recommended LLM Parameter Range (After Quantization) | Notes |
|---|---|---|---|---|
| Integrated Graphics Laptop (e.g., Intel Iris Xe) | Shared system RAM (8 GB+ RAM) | 8-bit, or even 4-bit quantization | ≤ 7B (Heavily Quantized) | Very basic local running experience, suitable for learning and light experimentation. Performance is limited and inference is slow. Recommended to use 4-bit or lower-precision quantized models to minimize VRAM usage. Suitable for running small models like TinyLlama. |
| Entry-level Gaming / Thin & Light Laptop (e.g., RTX 3050/4050) | 4-8 GB VRAM + 16 GB+ RAM | 4-bit - 8-bit quantization | 7B - 13B (Quantized) | Can run 7B models relatively smoothly; some 13B models can also run with quantization and optimization. Suitable for trying out mainstream small to medium-sized models. VRAM is still limited, so running large models will be challenging. |
| Mid-to-High-End Gaming / Mobile Workstation (e.g., RTX 3060/3070/4060) | 8-16 GB VRAM + 16 GB+ RAM | 4-bit - 16-bit (Flexible Choice) | 7B - 30B (Quantized) | Can run 7B and 13B models comfortably and has the potential to try models around 30B (requires good quantization and optimization). Quantization precision can be chosen to balance performance and model quality. Suitable for exploring a wider range of medium to large models. |

ARM Devices (Raspberry Pi / Apple Silicon)

| Hardware Configuration | VRAM / RAM | Recommended Quantization | Recommended LLM Parameter Range (After Quantization) | Notes |
|---|---|---|---|---|
| Raspberry Pi 4/5 | 4-8 GB RAM | 4-bit quantization (or lower) | ≤ 7B (Heavily Quantized) | Limited by memory and computing power; mainly useful for running very small models or as an experimental platform. Suitable for researching model quantization and optimization techniques. |
| Apple M1/M2/M3 (Unified Memory) | 8-64 GB Unified Memory | 4-bit - 16-bit (Flexible Choice) | 7B - 30B+ (Quantized) | The Unified Memory architecture uses memory more efficiently; even 8 GB M-series Macs can run models of a certain size. Higher-memory configurations (16 GB+) can run larger models, even attempting models above 30B. Apple chips also have an advantage in energy efficiency. |

NVIDIA GPU Computers

| Hardware Configuration | VRAM / RAM | Recommended Quantization | Recommended LLM Parameter Range (After Quantization) | Notes |
|---|---|---|---|---|
| Entry-level Dedicated GPU (e.g., RTX 4060/4060 Ti) | 8-16 GB VRAM | 4-bit - 16-bit (Flexible Choice) | 7B - 30B (Quantized) | Performance is similar to mid-to-high-end gaming laptops, but desktops cool better and can run stably for long periods. Good cost-performance; suitable for entry-level local LLM enthusiasts. |
| Mid-range Dedicated GPU (e.g., RTX 4070/4070 Ti/4080) | 12-16 GB VRAM | 4-bit - 16-bit (Flexible Choice) | 7B - 30B+ (Quantized) | Can run medium to large models more smoothly and has the potential to try larger-parameter models. Suitable for users with higher expectations for the local LLM experience. |
| High-end Dedicated GPU (e.g., RTX 3090/4090, RTX 6000 Ada) | 24-48 GB VRAM | 8-bit - 32-bit (or higher) | 7B - 70B+ (Quantized/Native) | Can run the vast majority of open-source LLMs, including large models (e.g., 65B, 70B). Can use higher precisions (e.g., 16-bit, 32-bit) for the best model quality, or quantization to run even larger models. Suitable for professional developers, researchers, and heavy LLM users. |
| Server-grade GPU (e.g., A100, H100, A800, H800) | 40-80 GB+ VRAM | 16-bit - 32-bit (Native Precision) | 30B - 175B+ (Native/Quantized) | Designed specifically for AI computing, with very large VRAM and extremely powerful compute. Can run very large models and even perform model training and fine-tuning. Suitable for enterprise applications, large-scale model deployment, and research institutions. |
Table Additional Notes
- Quantization: Refers to reducing the numerical precision of model parameters, e.g., from 16-bit floating-point (float16) to 8-bit integer (int8) or 4-bit integer (int4). Quantization can significantly reduce model size and VRAM usage and accelerate inference speed, but it may slightly reduce model accuracy.
- Heavy Quantization: Refers to using very low bit-precision quantization, e.g., 3-bit or 2-bit. Can further reduce resource demands, but model quality degradation may be more noticeable.
- Native: Refers to running the model at its original precision, e.g., float16 or bfloat16. Yields the best model quality but has the highest resource requirements.
- Parameter Range After Quantization: The "Recommended LLM Parameter Range (After Quantization)" in the table refers to the approximate range of model parameters that the hardware can run smoothly under reasonable quantization. The actual runnable model size and performance also depend on specific model architecture, degree of quantization, software optimization, etc. The parameter ranges given here are for reference only.
- Unified Memory: A characteristic of Apple Silicon chips where the CPU and GPU share the same physical memory, resulting in higher data exchange efficiency.
