Compare AI models side-by-side in your terminal
One prompt, multiple models, real-time streaming, performance stats, and an AI judge — all in a single command.
npx yardstiq "your prompt" -m claude-sonnet -m gpt-4o
Everything you need to compare models
Stop copying prompts between tabs. One command gives you streaming comparisons, hard numbers, and AI-powered evaluation.
Side-by-Side Streaming
Watch model outputs appear in parallel, in real time. No more tab-switching between chat windows.
40+ Models
Claude, GPT, Gemini, Llama, DeepSeek, Mistral, Grok — every major model in one tool.
Performance Stats
Time to first token, throughput, token counts, and cost per model. Data, not vibes.
AI Judge
Let an AI evaluate which response wins with scored verdicts and reasoning.
Export Anywhere
JSON for pipelines, Markdown for docs, self-contained HTML for sharing.
Benchmark Suites
Define prompt suites in YAML and run them across models with aggregate scoring.
Local Models
Compare Ollama models with zero API cost. Your hardware, your data, your rules.
Flexible Auth
One Vercel AI Gateway key for everything, or individual provider keys. Mix and match.
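To make the performance stats above concrete, here is a minimal sketch of how time to first token and throughput could be derived from a streamed response. It is an illustration only, not yardstiq's implementation: `measure_stream` and its return keys are hypothetical names, and each streamed chunk is counted as one token, whereas real token counts normally come from the provider's usage data.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report basic timing stats (hypothetical helper)."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # timestamp taken for the first chunk only
        count += 1
    end = clock()

    elapsed = end - start
    return {
        "tokens": count,
        # Time to first token: delay between the request start and the first chunk.
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "elapsed_s": elapsed,
        # Throughput over the whole request, including the wait for the first chunk.
        "tokens_per_s": count / elapsed if elapsed > 0 else 0.0,
    }
```

Passing the clock in as a parameter keeps the function testable with a fake timer. Whether throughput should include the wait for the first token is a design choice; this sketch includes it.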
Up and running in 60 seconds
No config files. No web UI. Just your terminal.
Install (or just use npx)
npm install -g yardstiq
# or skip install entirely
npx yardstiq "your prompt" -m claude-sonnet -m gpt-4o
Set your API key
# One key for 40+ models via Vercel AI Gateway
export AI_GATEWAY_API_KEY=your_key
# Or individual provider keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
Compare models
# Basic comparison
yardstiq "Explain monads" -m claude-sonnet -m gpt-4o
# With AI judge
yardstiq "Write a sort algorithm" -m claude-sonnet -m gpt-4o --judge
# Three models + export
yardstiq "Explain DNS" -m claude-sonnet -m gpt-4o -m gemini-flash --json > results.json
Go local (optional)
# No API key needed: just run Ollama
yardstiq "hello" -m local:llama3.2 -m local:mistral
Real benchmarks, not marketing
Run your own benchmark suites with YAML configs. Here's a sample across coding, creative writing, and reasoning tasks.
# benchmark.yaml
name: model-showdown
prompts:
  - "Write a Python fibonacci with memoization"
  - "Explain quantum entanglement to a 10-year-old"
  - "Debug this async race condition: ..."
models:
  - claude-sonnet
  - gpt-4o
  - gemini-flash
judge: true
yardstiq benchmark run benchmark.yaml --json
| Model | Coding | Creative | Reasoning | Speed | Cost/req |
|---|---|---|---|---|---|
| Claude Sonnet | 92 | 88 | 94 | 69 t/s | $0.0013 |
| GPT-4o | 89 | 85 | 90 | 48 t/s | $0.0010 |
| Gemini Flash | 84 | 82 | 86 | 112 t/s | $0.0004 |
| Llama 3.1 70B | 81 | 79 | 83 | 35 t/s | $0.0000 |
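One way to turn a table like this into a decision is to aggregate the per-category scores yourself. The sketch below copies the sample rows above and ranks models by an unweighted mean of the three scores. That weighting is an assumption made for illustration, as are the helper names; yardstiq's own aggregate scoring may work differently.

```python
from statistics import mean

# Rows copied from the sample table:
# (model, coding, creative, reasoning, tokens per second, cost per request in USD)
SAMPLE = [
    ("Claude Sonnet", 92, 88, 94, 69, 0.0013),
    ("GPT-4o", 89, 85, 90, 48, 0.0010),
    ("Gemini Flash", 84, 82, 86, 112, 0.0004),
    ("Llama 3.1 70B", 81, 79, 83, 35, 0.0000),
]


def rank_by_mean_score(rows):
    """Return (model, mean score) pairs, best first, using an unweighted mean."""
    scored = [
        (name, mean((coding, creative, reasoning)))
        for name, coding, creative, reasoning, _speed, _cost in rows
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def fastest(rows):
    """Return the name of the model with the highest tokens-per-second figure."""
    return max(rows, key=lambda row: row[4])[0]


if __name__ == "__main__":
    for name, score in rank_by_mean_score(SAMPLE):
        print(f"{name}: {score:.1f}")
    print("Fastest:", fastest(SAMPLE))
```

If one category matters more for your workload, replace the plain mean with a weighted sum. The ranking can change once speed or cost is factored in.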
Stop guessing. Start measuring.
Join developers who use yardstiq to make data-driven model decisions.