Verification Report: Responses API web_search vs Exa Search/Answer APIs
11/14/2025
Summary
This verification compares OpenAI Responses API's web_search tool and Exa.ai's Search/Answer APIs across evidence-backed strengths and limitations. Findings: both are viable for web-grounded RAG but differ in control, citation determinism, engineering effort, and risk surface. OpenAI web_search is model-centric and offers live retrieval with minimal plumbing; Exa is retrieval-centric, returns parsed content and explicit citations as first-class outputs, and provides more control over index/filtering. Important caveats: hallucinations and citation errors occur with both approaches; medical/legal/high-stakes usage requires strict evaluation and guardrails.
Affirmed strengths — Exa
- Exa exposes search, contents, answer, and research endpoints that return parsed page content, highlights, and structured citations, making it straightforward to ground LLMs without building your own crawl/index pipeline; see the sketch after this list (source: Exa API pages and docs: https://exa.ai/exa-api; https://docs.exa.ai/).
- Exa supports filters (domain, date, category), websets and crawling options for tailored indexes, which helps domain coverage and governance (source: Exa docs/examples websets).
- Case studies and third‑party writeups show Exa deployed for real business workflows (investment banking LP sourcing, research use cases), indicating production usage beyond prototypes.
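A minimal sketch of this "search plus parsed content" pattern, assuming the exa_py Python SDK; the method and parameter names shown here are taken from the SDK's documented surface but may differ slightly across versions, so verify against https://docs.exa.ai/ before relying on them:

```python
# Minimal sketch: ground an LLM on Exa search results (assumes the exa_py SDK;
# verify method/parameter names against https://docs.exa.ai/).
from exa_py import Exa

exa = Exa(api_key="YOUR_EXA_API_KEY")  # placeholder key

# Search plus parsed page contents in one call, restricted by domain and date.
results = exa.search_and_contents(
    "retrieval-augmented generation evaluation best practices",
    num_results=5,
    include_domains=["arxiv.org"],
    start_published_date="2025-01-01",
    text=True,
)

for r in results.results:
    print(r.title, r.url)
    print(r.text[:300], "...")  # parsed content to pass to the LLM as grounding
```

The point of the sketch is that the retrieval layer already hands back URLs and parsed text, so citation bookkeeping can be done deterministically in your own code rather than delegated to the model.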
Critiques & limitations — Exa
- Public independent benchmarks of Exa's citation accuracy and performance are limited; vendor claims (e.g., for Fast mode) exist but need buyer validation under realistic QPS and query complexity. Exa publishes marketing material and case studies, but neutral third-party performance tests are scarce.
- LLM‑based search systems (including Exa when used to feed LLMs) are still vulnerable to hallucinations and unsupported claims — empirical studies in medical domains show many unsupported statements across tools; mitigating this requires human evaluation and system-level guardrails.
Affirmed strengths — OpenAI Responses web_search
- The Responses API includes a web_search tool that lets the model fetch live web results during response generation, allowing up-to-date retrieval without managing a crawl or vector DB; a minimal usage sketch follows this list (source: OpenAI docs: https://platform.openai.com/docs/guides/tools-web-search).
- It supports multiple retrieval modes (agentic search, deep research) and can synthesize answers with citations when prompted correctly, which is convenient for minimal‑infra RAG.
- The web_search tool integrates naturally into model reasoning (tools pattern), making multi-step retrieval workflows possible within a single Responses call.
Critiques & limitations — OpenAI Responses web_search
- Citation fidelity issues: community reports and tests show the Responses/web_search tool can generate fabricated or outdated links and occasionally return incorrect citations; outputs must be validated and linked content verified before they are trusted in production.
- Limited control over the web index: you cannot tune crawling or indexing (unlike Exa). This reduces control over domain coverage, freshness, and filtering; for private corpora you still need embeddings + vector DB or file_search workflows.
- Latency and cost: model-invoked retrieval may increase response latency and token/compute costs; empirical reports note variability in embedding and retrieval latencies in production, so benchmark against your SLA targets.
Where each approach fits best
- Use Exa when: you need deterministic, citation-friendly web retrieval from a controlled crawl (news monitoring, enterprise websets, research assistants), and you want filtering and parsed content out of the box.
- Use Responses web_search when: you want minimal infra for live web grounding inside the LLM, need up-to-the-minute web data, and accept model-driven citation formatting with additional validation.
Recommended tests (POC plan)
- Citation fidelity test (100 queries): compare the top 5 sources returned by each system; human raters verify whether claims are supported and links resolve (a generic harness sketch follows this list).
- Latency & SLA test: run representative QPS and measure end-to-end p50/p95/p99 for both systems under concurrency.
- Cost simulation: run expected monthly query volume through both pricing models and compare total cost and cost per validated answer.
- Hallucination/factuality audit: create known-answer queries (especially in high-risk domains) and measure unsupported claim rates.
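A generic, vendor-agnostic harness sketch for the first two tests. `query_system` is a hypothetical callable you would wire to either Exa or the Responses API; it is assumed to return the answer text plus a list of cited URLs. It checks that cited links resolve and reports latency percentiles; claim-support judgments still require the human raters described above, and a full SLA test would add concurrency.

```python
# POC harness sketch (vendor-agnostic). `query_system` is a hypothetical
# callable returning (answer_text, list_of_cited_urls) for a given query.
import statistics
import time
import requests

def link_resolves(url: str, timeout: float = 5.0) -> bool:
    """Cheap check that a cited URL actually resolves (HEAD, follow redirects)."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def run_poc(query_system, queries):
    latencies, resolved, total_links = [], 0, 0
    for q in queries:
        start = time.perf_counter()
        _answer, urls = query_system(q)
        latencies.append(time.perf_counter() - start)
        for url in urls[:5]:              # top-5 sources per the test plan
            total_links += 1
            resolved += link_resolves(url)
    latencies.sort()
    pct = lambda p: latencies[min(len(latencies) - 1, int(p * len(latencies)))]
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": pct(0.95),
        "p99_s": pct(0.99),
        "link_resolution_rate": resolved / max(total_links, 1),
    }
```

Running the same harness against both systems with an identical query set gives directly comparable latency and link-resolution numbers, which feed into the cost-per-validated-answer comparison above.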
Sources
- Exa API & docs: https://exa.ai/exa-api ; https://docs.exa.ai/
- Exa blog & case studies: https://exa.ai/blog
- OpenAI Responses web_search docs: https://platform.openai.com/docs/guides/tools-web-search
- Community reports on citation issues: OpenAI community thread and independent articles
- Research on LLM citation/factuality and data poisoning vulnerabilities (Nature, PMC), showing risks in high‑stakes domains.