Want to deploy large models locally to save money and protect data privacy? That's a great idea!

But diving into the world of models, you might get overwhelmed by the variety of parameter counts and versions: 7B, 14B, 32B, 70B... With so many sizes of the same model, which one should you choose?

What level is my computer at, and which model can it handle?

Don't panic! This article will help you sort things out and explain, in the simplest possible terms, how to choose hardware for running large models locally. By the end, your confusion should be gone!

There is a Hardware Configuration and Model Size Reference Table at the bottom of this article.

Understanding Large Model Parameters: What Do 7B, 14B, 32B Mean?

  • Meaning of Parameters: Numbers like 7B, 14B, and 32B represent the number of parameters in a large language model (LLM), where "B" stands for Billion. Parameters can be thought of as the "weights" the model learns during training, storing its understanding of language, knowledge, and patterns.
  • Parameter Count and Model Capability: Generally, the more parameters a model has, the more complex it is, theoretically enabling it to learn and store richer information, capture more complex language patterns, and perform better in understanding and generating text.
  • Resource Consumption and Model Size: Models with more parameters also require more computational resources (GPU power) and more memory (VRAM and system RAM) to run, and more data to train. A rough memory estimate is sketched right after this list.
  • Small Models vs. Large Models:
    • Large Models (e.g., 32B, 65B, or larger): Can handle more complex tasks, generate more coherent and nuanced text, and may excel in knowledge Q&A, creative writing, etc. But they have much higher hardware requirements and run more slowly.
    • Small Models (e.g., 7B, 13B): Consume fewer resources, run faster, and are more suitable for running on devices with limited resources or in latency-sensitive applications. Small models can also perform well on simpler tasks.
  • Trade-offs in Selection: Choosing model size involves a trade-off between model capability and hardware resources. More parameters don't always mean "better"; you need to select the most suitable model based on actual application scenarios and hardware conditions.
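
To make the resource trade-off concrete: memory needed is roughly the parameter count times the bytes per parameter, plus overhead for activations and the KV cache. Below is a minimal Python sketch of that back-of-the-envelope estimate; the 20% overhead factor is an illustrative assumption, and real usage varies with context length and runtime.

```python
def estimate_memory_gib(params_billion: float, bits_per_param: int, overhead: float = 0.2) -> float:
    """Rough memory estimate: parameter count x bytes per parameter,
    plus an assumed ~20% overhead for activations and the KV cache."""
    bytes_per_param = bits_per_param / 8
    weights_gib = params_billion * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gib * (1 + overhead)

# Weights alone: a 7B model needs ~13 GiB at 16-bit but only ~3.3 GiB at 4-bit,
# which is why quantization often decides whether a model fits on your GPU.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{estimate_memory_gib(7, bits):.1f} GiB (incl. overhead)")
```

At 16-bit, a 7B model already needs around 13 GiB for the weights alone, which is why quantization matters so much on consumer hardware.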

What Hardware Do I Need to Run Local Models?

  • Core Requirement: VRAM

    • Importance of VRAM: When running large models, the model parameters and intermediate calculation results need to be loaded into VRAM. Therefore, VRAM size is the most critical hardware metric for running local large models. Insufficient VRAM may prevent the model from loading, limit you to very small models, or severely reduce running speed.
    • Bigger is Better: Ideally, choose a GPU with as much VRAM as possible, so you can run larger models and get better performance.
  • Secondary Importance: System RAM

    • Role of RAM: System RAM is used to load the operating system, run programs, and supplement VRAM. When VRAM is insufficient, system RAM can act as "overflow" space, but RAM is much slower than VRAM, so this significantly reduces model running efficiency.
    • Sufficient RAM is Also Important: It's recommended to have at least 16GB, or better 32GB+, of system RAM, especially if your GPU VRAM is limited, as more RAM can help alleviate VRAM pressure. (The sketch after this list shows how to check your own RAM and VRAM.)
  • Processor (CPU)

    • Role of CPU: The CPU is mainly responsible for data preprocessing, model loading, and some model computation tasks (especially with CPU offloading). A better-performing CPU can improve model loading speed and assist the GPU in computation to some extent.
    • NPU (Neural Processing Unit): NPUs in some laptops are specialized hardware for accelerating AI computations. NPUs can speed up specific types of AI operations, including inference for certain large models, improving efficiency and reducing power consumption. If your laptop has an NPU, it's a plus, but the GPU remains the core for running local large models. NPU support and effectiveness depend on the specific model and software.
  • Storage (HDD/SSD)

    • Role of Storage: You need enough disk space to store model files, which are typically large; for example, a quantized 7B model might take 4-5GB, while larger models can require tens or even hundreds of GB.
    • SSD Over HDD: Using a solid-state drive (SSD) instead of a hard disk drive (HDD) can significantly speed up model loading.
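
If you are unsure what your own machine offers, you can query it directly. The sketch below assumes PyTorch (with CUDA) and psutil are installed; on a machine without an NVIDIA GPU it simply reports that no dedicated VRAM was found.

```python
import psutil  # assumed installed: pip install psutil
import torch   # assumed installed with CUDA support for the VRAM check

# System RAM matters when VRAM is limited and layers spill over into RAM.
ram_gib = psutil.virtual_memory().total / (1024 ** 3)
print(f"System RAM: {ram_gib:.1f} GiB")

# Dedicated VRAM, if an NVIDIA GPU is visible to PyTorch.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / (1024 ** 3):.1f} GiB")
else:
    print("No CUDA GPU detected; plan for CPU or integrated-graphics inference.")
```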

Hardware Priority

  1. VRAM (Most Important)
  2. System RAM (Important)
  3. GPU Performance (Computing Power) (Important)
  4. CPU Performance (Supporting Role)
  5. Storage Speed (SSD Over HDD)

What If I Don't Have a Dedicated GPU?

  • Integrated Graphics and CPU Running: If you don't have a dedicated GPU, you can still use integrated graphics (e.g., Intel Iris Xe) or rely entirely on the CPU to run models. However, performance will be greatly limited; it's recommended to focus on running 7B or smaller, highly optimized models and use techniques like quantization to reduce resource demands. A minimal CPU-only example follows this list.
  • Cloud Services: If you need to run large models but lack local hardware, consider using cloud GPU services like Google Colab, AWS SageMaker, RunPod, etc.
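
For a purely CPU-based setup, llama.cpp-based runtimes are a common choice. Here is a minimal sketch using the llama-cpp-python bindings with a 4-bit quantized GGUF file; the model path and thread count are placeholders to adapt to your own download and CPU.

```python
from llama_cpp import Llama  # assumed installed: pip install llama-cpp-python

# Load a 4-bit quantized GGUF model entirely on the CPU.
# The file path below is a placeholder; point it at the GGUF file you downloaded.
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,      # context window; larger values need more RAM
    n_threads=8,     # set to your number of physical CPU cores
    n_gpu_layers=0,  # 0 = no layers offloaded to a GPU
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```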

How to Run Local Models?

For beginners, it's recommended to use some user-friendly tools that simplify the process of running local models:

  • Ollama: Operated via the command line, but installation and use are very simple; it focuses on getting models running quickly (a sketch after this list shows calling its local API from Python).
  • LM Studio: Features a clean, intuitive interface, supports model downloading, model management, and one-click running.
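
For example, once Ollama is running it exposes a local HTTP API (port 11434 by default) that you can call from any language. The minimal Python sketch below assumes the requests package is installed; "llama3" is just an example model name that you would first download with ollama pull.

```python
import requests  # assumed installed: pip install requests

# Ollama serves a local HTTP API on port 11434 by default.
# "llama3" is an example model name; download it first with: ollama pull llama3
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "In one sentence, what does 7B mean for a language model?",
        "stream": False,  # return the whole answer in a single JSON response
    },
    timeout=120,
)
print(response.json()["response"])
```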

Hardware Configuration and Model Size Reference Table

x86 Laptops

  • Integrated Graphics Laptop (e.g., Intel Iris Xe)
    • VRAM/Memory: Shared system RAM (8GB+ RAM)
    • Quantization/Precision: 8-bit, even 4-bit quantization
    • Recommended LLM Parameter Range (After Quantization): ≤ 7B (Heavily Quantized)
    • Notes: Very basic local running experience, suitable for learning and light experimentation. Limited performance, slower inference speed. Recommended to use 4-bit or lower precision quantized models to minimize memory usage as much as possible. Suitable for running small models like TinyLlama.
  • Entry-Level Gaming/Thin & Light Laptop (e.g., RTX 3050/4050)
    • VRAM/Memory: 4-8 GB VRAM + 16GB+ RAM
    • Quantization/Precision: 4-bit - 8-bit quantization
    • Recommended LLM Parameter Range (After Quantization): 7B - 13B (Quantized)
    • Notes: Can run 7B models relatively smoothly; some 13B models can also run with quantization and optimization. Suitable for experiencing some mainstream small to medium-sized models. Note that VRAM is still limited; running large models will be challenging.
  • Mid-to-High-End Gaming/Mobile Workstation (e.g., RTX 3060/3070/4060)
    • VRAM/Memory: 8-16 GB VRAM + 16GB+ RAM
    • Quantization/Precision: 4-bit - 16-bit (Flexible Choice)
    • Recommended LLM Parameter Range (After Quantization): 7B - 30B (Quantized)
    • Notes: Can run 7B and 13B models more comfortably and has the potential to try models around 30B (requires good quantization and optimization). Can choose different quantization precisions based on needs to balance performance and model quality. Suitable for exploring more varieties of medium to large models.

ARM Devices (Raspberry Pi / Apple M Series)

  • Raspberry Pi 4/5
    • VRAM/Memory: 4-8 GB RAM
    • Quantization/Precision: 4-bit quantization (or lower)
    • Recommended LLM Parameter Range (After Quantization): ≤ 7B (Heavily Quantized)
    • Notes: Limited by memory and computing power, mainly used for running very small models or as an experimental platform. Suitable for researching model quantization and optimization techniques.
  • Apple M1/M2/M3 (Unified Memory)
    • VRAM/Memory: 8GB - 64GB Unified Memory
    • Quantization/Precision: 4-bit - 16-bit (Flexible Choice)
    • Recommended LLM Parameter Range (After Quantization): 7B - 30B+ (Quantized)
    • Notes: The Unified Memory architecture makes memory usage more efficient; even M-series Macs with 8GB RAM can run models of a certain size. Higher memory versions (16GB+) can run larger models, even attempting models above 30B. Apple chips also have advantages in energy efficiency.

NVIDIA GPU Computers

  • Entry-Level Dedicated GPU (e.g., RTX 4060/4060 Ti)
    • VRAM/Memory: 8-16 GB VRAM
    • Quantization/Precision: 4-bit - 16-bit (Flexible Choice)
    • Recommended LLM Parameter Range (After Quantization): 7B - 30B (Quantized)
    • Notes: Performance similar to mid-to-high-end gaming laptops, but desktops have better cooling, allowing stable long-term operation. Good value for money, suitable for entry-level local LLM enthusiasts.
  • Mid-Range Dedicated GPU (e.g., RTX 4070/4070 Ti/4080)
    • VRAM/Memory: 12-16 GB VRAM
    • Quantization/Precision: 4-bit - 16-bit (Flexible Choice)
    • Recommended LLM Parameter Range (After Quantization): 7B - 30B+ (Quantized)
    • Notes: Can run medium to large models more smoothly and has the potential to try larger parameter models. Suitable for users with higher requirements for the local LLM experience.
  • High-End Dedicated GPU (e.g., RTX 3090/4090, RTX 6000 Ada)
    • VRAM/Memory: 24-48 GB VRAM
    • Quantization/Precision: 8-bit - 32-bit (or higher)
    • Recommended LLM Parameter Range (After Quantization): 7B - 70B+ (Quantized/Native)
    • Notes: Can run most open-source LLMs, including large models (e.g., 65B, 70B). Can try higher bit precisions (e.g., 16-bit, 32-bit) for the best model quality, or use quantization to run even larger models. Suitable for professional developers, researchers, and heavy LLM users.
  • Server-Grade GPU (e.g., A100, H100, A800, H800)
    • VRAM/Memory: 40GB - 80GB+ VRAM
    • Quantization/Precision: 16-bit - 32-bit (Native Precision)
    • Recommended LLM Parameter Range (After Quantization): 30B - 175B+ (Native/Quantized)
    • Notes: Designed specifically for AI computing, featuring very large VRAM and extremely strong computing power. Can run very large models and even perform model training and fine-tuning. Suitable for enterprise applications, large-scale model deployment, and research institutions.

Additional Notes on the Table

  • Quantization: Refers to reducing the numerical precision of model parameters, e.g., from 16-bit floating-point (float16) to 8-bit integer (int8) or 4-bit integer (int4). Quantization can significantly reduce model size and VRAM usage and speed up inference, but it may slightly reduce model accuracy. (Approximate sizes at each precision are sketched after these notes.)
  • Heavy Quantization: Refers to using very low bit precision quantization, e.g., 3-bit or 2-bit. Can further reduce resource demands, but model quality degradation may be more noticeable.
  • Native: Refers to running the model at its original precision, e.g., float16 or bfloat16. Provides the best model quality but has the highest resource requirements.
  • Quantized Parameter Range: The "Recommended LLM Parameter Range (After Quantization)" in the table refers to the approximate model parameter range that the hardware can run smoothly under reasonable quantization. The actual model size and performance that can be run also depend on specific model architecture, degree of quantization, software optimization, etc. The parameter ranges given here are for reference only.
  • Unified Memory: A feature of Apple Silicon chips where the CPU and GPU share the same physical memory, resulting in higher data exchange efficiency.
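
To see roughly where the parameter ranges in the table come from, the sketch below prints approximate weight sizes (weights only, ignoring activation and KV-cache overhead) for a few common model sizes at different precisions.

```python
# Approximate size of the weights alone, in GiB, at a given precision.
def weights_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * (bits / 8) / (1024 ** 3)

print(f"{'Model':>6} {'16-bit':>9} {'8-bit':>9} {'4-bit':>9}")
for size_b in (7, 13, 32, 70):
    cells = " ".join(f"{weights_gib(size_b, bits):8.1f}G" for bits in (16, 8, 4))
    print(f"{size_b:>5}B {cells}")
```

Even at 4-bit, a 70B model needs roughly 33 GiB for weights alone, which is why it only appears in the 24-48 GB VRAM tier and above.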