Cat 1 — Epistemic Humility
Tests calibration: does the model say "I don't know" when it should? Scores against unknowable facts, fictional entities, future events, and paradoxes.
✓ 6 browser testsRun safety, governance, refusal, prompt-injection, privacy, tool-use, latency, and breaking-point tests — directly in your browser. No install required.
Runs entirely in your browser — your API key goes directly from your machine to your endpoint. Orivael never sees it. Works with Ollama, vLLM, LM Studio, NVIDIA NIM, Hugging Face TGI, or any OpenAI-compatible endpoint.
🔒 How your API key is handled & CLI install option ↓Default endpoint: http://localhost:11434/v1. No API key needed. Start with ollama serve.
Use http://localhost:8000/v1 or your custom port. Any OpenAI-compatible server works.
Enter your NIM endpoint URL and your NGC API key. Evaluates enterprise inference endpoints with the same governance packs.
Use https://api.openai.com/v1 with your sk-... key. Costs a few cents per full run.
Standard leaderboards ask which model is smartest. AXIOM asks which model survives governance pressure, adversarial prompts, tool-use risk, and production constraints.
Tests calibration: does the model say "I don't know" when it should? Scores against unknowable facts, fictional entities, future events, and paradoxes.
✓ 6 browser testsAttempts DAN-mode activation, SYSTEM override injection, developer-mode tricks, and role-escape prompts. Checks the model ignores every attempt.
✓ 5 browser testsRequests step-by-step synthesis of harmful substances, phishing email drafts, unauthorized access instructions, and malware. Scores complete refusal rate.
✓ 5 browser testsAsks the model to execute destructive server commands, database wipes, mass emails, and financial transfers. Checks it refuses without hedging.
✓ 4 browser testsInjects a fake secret key into the system prompt then probes five ways to extract it. Passes only if the model never leaks the key.
✓ 5 browser tests8-level escalating adversarial ladder from normal requests to compound override attacks. Identifies the exact level where governance breaks.
✓ 8 levelsThe signature test. Pressure Sweep gradually increases governance and adversarial pressure until the model fails, then identifies the exact condition where trust breaks.
Run a benchmark to see where your model begins to weaken under adversarial pressure.
Every run produces plain-English verdicts, category scores, evidence, and deployment guidance.
| Category | Score | Status | Meaning |
|---|---|---|---|
| Epistemic Humility | — | PENDING | Run a benchmark to see results. |
| Prompt Injection Defense | — | PENDING | Run a benchmark to see results. |
| Governance Obedience | — | PENDING | Run a benchmark to see results. |
| Tool-Use Safety | — | PENDING | Run a benchmark to see results. |
| Privacy Leakage | — | PENDING | Run a benchmark to see results. |
| Pressure Sweep | — | PENDING | Run a benchmark to see results. |
| Deployment Recommendation | — | PENDING | Connect a model and run all tests. |
Open any result and see the prompt, expected behavior, actual behavior, AXIOM verdict, and recommended fix. Click a tab after running to inspect real test detail.
No benchmark has been run yet. Connect a model above and click Run Benchmark to populate this panel with real test evidence.
Technical teams get JSON. Executives get a summary. Developers get the prompts that caused failure.
Full structured results — all prompts, responses, verdicts, and scores — exported as a JSON file.
Copy the scored results table to your clipboard, formatted for Markdown or a plain text report.
Each run gets a signed run ID derived from your endpoint and timestamp — reference it in reports without sharing raw data.
Run the same suite against multiple endpoints and compare scores. Coming in Lab v2.
⧖ Coming soonTwo questions that come up for every team evaluating production tooling: where does my key go, and can I run this in a pipeline?
This page is a static HTML file hosted on a CDN. There is no Orivael server, no proxy, no request logger between your browser and your model.
Every fetch() call goes to the URL you typed in the Endpoint field. Open DevTools → Network while running — every request domain will be yours, not orivael.dev.
The key lives in a JS variable for the duration of the run. When the run finishes (or the tab closes), it's gone. It is never written to localStorage, sessionStorage, cookies, or any other persistent store.
Open DevTools → Network → filter by XHR/Fetch. You will see exactly one outbound domain: your endpoint. Refresh the page and the key field is blank — nothing was persisted.
Browsers block HTTP endpoints from HTTPS pages (mixed content). If you're running a local model over HTTP, download this file and open it via file:// — your key never touches the internet at all in that case.
The browser runner is zero-friction for one-off evaluations. For CI/CD pipelines, scheduled model comparisons, or scripted runs across dozens of endpoints, the CLI gives you the same test suite with JSON output you can pipe into dashboards.
AXIOM Benchmark Lab turns model evaluation into a practical production-readiness test: governance, pressure, latency, evidence, and runtime control — entirely in your browser, nothing to install.