llama-cpp

Runs LLM inference on CPUs, Apple Silicon, and consumer GPUs without requiring NVIDIA hardware. Use it for edge deployment, Apple Silicon Macs (M1/M2/M3/M4), AMD or Intel GPUs, or whenever CUDA is unavailable. GGUF quantization (1.5 to 8 bits per weight) reduces memory use, with a claimed 4-10× speedup over PyTorch on CPU.

Best for Edge deployment · Works with GitHub · Low risk

#inference #cpu #apple-silicon #edge #gguf #quantization #non-nvidia #amd #intel #embedded

⌘source

author: @NousResearch
repo: NousResearch/hermes-agent
language: Python

✦overview.md

Key Features

·Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware
·Supports GGUF quantization (1.5-8 bit) for reduced memory
·Delivers 4-10× speedup vs PyTorch on CPU
·Pure C/C++ with minimal dependencies
·Optimized for edge deployment on systems like Raspberry Pi
·Simple deployment without requiring Docker or Python
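
As a rough illustration of why lower-bit quantization reduces memory, weight storage scales approximately with parameter count times bits per weight. The sketch below uses a hypothetical helper, `estimate_weight_gib`, and treats bits per weight as a free parameter. Real GGUF files also carry metadata and mix precisions across tensors, and the KV cache adds runtime memory, so actual sizes will differ.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes, then bytes -> GiB.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("n_params and bits_per_weight must be positive")
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # assumed parameter count for a "7B" model
    # 16 bits stands in for unquantized half precision; the others span
    # the 1.5-8 bit range quoted above.
    for bits in (16, 8, 4, 1.5):
        print(f"{bits:>4} bits/weight: ~{estimate_weight_gib(n_params, bits):.1f} GiB")
```

The estimate is linear in bits per weight, so halving the bit width roughly halves the weight footprint.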

Use Cases

→Deploying models on Apple Silicon Macs (M1/M2/M3/M4) for local development
→Running inference on CPU-only servers or machines without NVIDIA GPUs
→Embedding LLMs in edge devices or IoT systems like Raspberry Pi
→Utilizing AMD or Intel GPUs for acceleration when CUDA is unavailable

Best for

✓Edge deployment
✓Non-NVIDIA hardware
✓CPU-only environments

Not ideal for

!NVIDIA GPU datacenters needing maximum throughput
!Python-first API workflows

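FAQs

When should I use llama.cpp instead of TensorRT-LLM or vLLM?

Use llama.cpp for CPU-only machines, Apple Silicon, AMD or Intel GPUs, and edge devices. On NVIDIA GPUs, TensorRT-LLM targets maximum datacenter throughput, and vLLM offers a Python-first API with PagedAttention.

What hardware does llama.cpp support?

CPUs, Apple Silicon (M1/M2/M3/M4), AMD and Intel GPUs, and edge devices such as Raspberry Pi. A CUDA build for NVIDIA GPUs is also available.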

skills/mlops/inference/llama-cpp/SKILL.md

name: llama-cpp
description:
license: MIT
version: 1.0.0
author: Orchestra Research
dependencies: ["llama-cpp-python"]
metadata: {"hermes":{"tags":["Inference Serving","Llama.cpp","CPU Inference","Apple Silicon","Edge Deployment","GGUF","Quantization","Non-NVIDIA","AMD GPUs","Intel GPUs","Embedded"]}}

llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp

Use llama.cpp when you are:

·Running on CPU-only machines
·Deploying on Apple Silicon (M1/M2/M3/M4)
·Using AMD or Intel GPUs (no CUDA)
·Deploying to the edge (Raspberry Pi, embedded systems)
·Looking for simple deployment without Docker or Python

Use TensorRT-LLM instead when you:

·Have NVIDIA GPUs (A100/H100)
·Need maximum throughput (100K+ tok/s)
·Run in a datacenter with CUDA

Use vLLM instead when you:

·Have NVIDIA GPUs
·Need a Python-first API
·Want PagedAttention
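
The selection criteria above can be read as a small decision rule. The following is a minimal sketch, assuming a hypothetical `Deployment` record and `suggest_engine` function. It only encodes the criteria listed in this section and is not an official recommendation tool.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a target environment."""
    has_nvidia_gpu: bool
    needs_max_throughput: bool = False
    wants_python_api: bool = False


def suggest_engine(d: Deployment) -> str:
    """Map the criteria from this section onto an engine name."""
    if not d.has_nvidia_gpu:
        # CPU-only, Apple Silicon, AMD/Intel GPUs, and edge devices.
        return "llama.cpp"
    if d.needs_max_throughput:
        return "TensorRT-LLM"
    if d.wants_python_api:
        return "vLLM"
    # llama.cpp also has a CUDA build, so it remains an option here.
    return "llama.cpp"


if __name__ == "__main__":
    print(suggest_engine(Deployment(has_nvidia_gpu=False)))
    print(suggest_engine(Deployment(has_nvidia_gpu=True, needs_max_throughput=True)))
```

The ordering of the checks is a design choice in this sketch: throughput is tested before the Python-first preference, so a deployment that sets both flags resolves to TensorRT-LLM.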

Quick start

Installation

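# macOS/Linux (Homebrew)
brew install llama.cpp

# Or build from source
# Current checkouts build with CMake; older ones used `make` with LLAMA_* flags.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# With Metal (Apple Silicon)
# Metal is enabled by default when building on macOS; no extra flag is needed.

# With CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# With ROCm (AMD); a GPU target may also need to be set for your card
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release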
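
After installing, it can help to confirm that the command-line tools are reachable. This is a minimal sketch, assuming the binaries are named `llama-cli` and `llama-server` as in recent llama.cpp releases (older builds used different names). `find_llama_binaries` is a hypothetical helper, not part of llama.cpp.

```python
import shutil


def find_llama_binaries(names=("llama-cli", "llama-server")) -> dict:
    """Return {binary name: absolute path or None} by searching PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_llama_binaries().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

For a source build, the binaries typically sit under the build directory rather than on PATH, so a "not found" result does not necessarily mean the build failed.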

Download model

# Download from HuggingFace (GGUF format)
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
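
A quick sanity check on a downloaded or converted file is to inspect its header. GGUF files begin with the ASCII magic `GGUF` followed by a 32-bit version number. The sketch below assumes a little-endian file and uses a hypothetical helper, `read_gguf_version`. It does not validate the rest of the file.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version number, or None if the magic bytes are absent.

    Assumes little-endian encoding; only the first 8 bytes are read.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py models/llama-2-7b-chat.Q4_K_M.gguf
    for model_path in sys.argv[1:]:
        version = read_gguf_version(model_path)
        if version is None:
            print(f"{model_path}: not a GGUF file")
        else:
            print(f"{model_path}: GGUF version {version}")
```

A `None` result usually points to an interrupted download or a file in a different format, such as raw safetensors weights that still need conversion.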

...

$install

npx skills add NousResearch/hermes-agent --skill llama-cpp

Safety assessment

★Clarity score: 5/5 (excellent)

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

Very clear and well structured, with almost no room for misunderstanding.

◎Actionability score: 5/5 (very high)

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

Highly actionable with clear, concrete steps that an agent can follow directly.

~community cookbook

April 18, 2026
