vllm

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Best for Production LLM servingWorks with GitHubLow risk

#vllm #inference #serving #llm #production #openai #quantization #tensor-parallelism #pagedattention #continuous-batching #high-throughput

⌘source

author: @NousResearch
repo: NousResearch/hermes-agent
language: Python

✦overview.md

Key Features

·Serves LLMs with high throughput via PagedAttention and continuous batching
·Supports OpenAI-compatible API endpoints
·Enables quantization (GPTQ, AWQ, FP8) for memory efficiency
·Uses tensor parallelism for multi-GPU scaling
·Optimizes inference latency and throughput
·Handles models with limited GPU memory

Use Cases

→Deploying production LLM APIs for high-volume traffic
→Optimizing inference latency and throughput for cost efficiency
→Serving large models on hardware with constrained GPU memory
→Providing OpenAI-compatible endpoints for existing client applications

Best for

✓Production LLM serving
✓High-throughput inference workloads
✓GPU-constrained environments

skills/mlops/inference/vllm/SKILL.md

name

serving-llms-vllm

description

license

MIT

version:1.0.0

author:Orchestra Research

dependencies:["vllm","torch","transformers"]

metadata:{"hermes":{"tags":["vLLM","Inference Serving","PagedAttention","Continuous Batching","High Throughput","Production","OpenAI API","Quantization","Tensor Parallelism"]}}

vLLM - High-Performance LLM Serving

Quick start

vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).

Installation:

pip install vllm

Basic offline inference:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)

OpenAI-compatible server:

vllm serve meta-llama/Llama-3-8B-Instruct

# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"

Common workflows

Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring

...

$install

1-click copy

npx skills add NousResearch/hermes-agent --skill vllm

Safety assessment

★

Clarity score

How clear and easy to understand the SKILL.md instructions are, rated from 1 to 5.

5/ 5

excellent

Very clear and well structured, with almost no room for misunderstanding.

◎

Actionability score

How directly an agent can act on the SKILL.md instructions, rated from 1 to 5.

5/ 5

very high

Highly actionable with clear, concrete steps that an agent can follow directly.

~community cookbook

April 18, 2026

◧ Compare

vllm

Best for Production LLM servingWorks with GitHubLow risk

vLLM - High-Performance LLM Serving

Quick start

vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).

Installation:

pip install vllm

Basic offline inference:

from vllm import LLM, SamplingParams llm = LLM(model="meta-llama/Llama-3-8B-Instruct") sampling = SamplingParams(temperature=0.7, max_tokens=256) outputs = llm.generate(["Explain quantum computing"], sampling) print(outputs[0].outputs[0].text)

OpenAI-compatible server:

vllm serve meta-llama/Llama-3-8B-Instruct # Query with OpenAI SDK python -c " from openai import OpenAI client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY') print(client.chat.completions.create( model='meta-llama/Llama-3-8B-Instruct', messages=[{'role': 'user', 'content': 'Hello!'}] ).choices[0].message.content) "

Common workflows

Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress: - [ ] Step 1: Configure server settings - [ ] Step 2: Test with limited traffic - [ ] Step 3: Enable monitoring

vllm

Key Features

Use Cases

Best for

vLLM - High-Performance LLM Serving

Quick start

Common workflows

Workflow 1: Production API deployment

Safety assessment

Clarity score

Actionability score

~community cookbook

~you might also like

threejs-geometry

threejs-materials

sveltekit

sql-pro

threejs-lighting

slo-implementation

AI Skill Finder

vllm

Key Features

Use Cases

Best for

vLLM - High-Performance LLM Serving

Quick start

Common workflows

Workflow 1: Production API deployment

Safety assessment

Clarity score

Actionability score

~community cookbook

~you might also like

threejs-geometry

threejs-materials

sveltekit

sql-pro

threejs-lighting

slo-implementation