How we measure AI search citations across 4 LLMs daily

Every night, our AI Visibility tracker runs roughly 180,000 prompts across four LLM search surfaces – ChatGPT search, Google Gemini, Perplexity and Claude – and stores the citations each one returns. It sounds simple until you look at the response formats. Each model exposes citations differently, and one of them barely exposes them at all. This is how we built the pipeline, what we normalise, and what we still cannot measure.

We are publishing this partly because customers ask, and partly because there are enough teams building their own LLM tracking now that a shared vocabulary would help everyone. If you are debating whether to build or buy, this should tell you which parts are cheap and which parts eat months.

The pipeline in one paragraph

Prompt queue → per-provider worker pool → raw response store → citation extractor → normalised citation record → dedup and merge → daily rollup. Every stage is retryable and idempotent. If any part of the day breaks, we can replay from the raw store without hitting a provider API again. That last property is non-negotiable – provider costs and rate limits make "just re-run it" prohibitively expensive at our volume.
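
The replay property can be sketched roughly as follows. This is an illustration under assumptions, not our production code: `RawStore`, `put_once` and `replay` are hypothetical names, a local directory stands in for the object store, and the key layout simply mirrors the date/provider/prompt shape of the `raw_response_ref` field in the schema further down.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


class RawStore:
    """Write-once store for untouched provider payloads (local-disk stand-in)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def put_once(self, key: str, payload: dict) -> bool:
        """Store a payload unless the key already exists, so retries are no-ops."""
        path = self.root / key
        if path.exists():
            return False
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload), encoding="utf-8")
        return True

    def keys(self) -> list[str]:
        """List stored payload keys in a stable order."""
        return sorted(
            p.relative_to(self.root).as_posix() for p in self.root.rglob("*.json")
        )

    def get(self, key: str) -> dict:
        return json.loads((self.root / key).read_text(encoding="utf-8"))


def replay(store: RawStore, extract: Callable[[dict], Iterable[dict]]) -> list[dict]:
    """Re-run extraction over every stored payload without calling a provider."""
    records: list[dict] = []
    for key in store.keys():
        records.extend(extract(store.get(key)))
    return records
```

Because `put_once` refuses to overwrite, a retried worker cannot change what was captured, and `replay` only ever reads from disk.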

The prompt queue

A tracked "brand" in our system is a set of prompts – typically 40-120 prompts per brand, generated once and then curated. Prompts are grouped into three intent buckets:

Category prompts – "best SEO platforms for agencies" – where we test whether the brand appears at all in the answer.
Comparison prompts – "Ahrefs vs SEMOptimiser vs Semrush" – where we test citation order and framing.
Branded prompts – "is SEMOptimiser good for enterprise SEO" – where we test what the model says about the brand unprompted.

Every prompt runs once per day on each of the four providers, which is where the roughly 180k nightly figure comes from. We spread the run across a 6-hour window overnight to stay under rate limits and to sample any time-of-day variance the models introduce.
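
One simple way to spread a run over the window is to space start times evenly and rotate the order each night. The sketch below is an assumption about how that could look, not a description of our scheduler: `spread_schedule` and `day_index` are hypothetical names, and the window start is whatever the caller passes in.

```python
from datetime import datetime, timedelta, timezone


def spread_schedule(prompt_ids, window_start, window=timedelta(hours=6), day_index=0):
    """Assign each prompt an evenly spaced start time inside the nightly window.

    Rotating the order by day_index moves each prompt to a different slot
    each night, which is one simple way to sample time-of-day variance.
    """
    n = len(prompt_ids)
    if n == 0:
        return []
    shift = day_index % n
    rotated = prompt_ids[shift:] + prompt_ids[:shift]
    step = window / n  # timedelta divided by int gives the gap between starts
    return [(pid, window_start + i * step) for i, pid in enumerate(rotated)]


# Example: four prompts across a 6-hour window are 90 minutes apart.
start = datetime(2026, 7, 4, 0, 0, tzinfo=timezone.utc)
plan = spread_schedule(["p1", "p2", "p3", "p4"], start)
```

A real scheduler would also need per-provider pacing, which this sketch leaves out.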

The four providers, and how each one hands back citations

Here is where it stops being simple. Every provider returns citations in a different shape. Two of them return structured citation objects. One puts footnote-style markers in the answer text that you have to match against a parallel sources array. One returns almost nothing, and you have to infer citations from the answer text itself.

Provider	Citation format	Structured?	What breaks
ChatGPT (search mode)	Inline `【source】` markers plus a structured `citations` array in the response metadata	Yes	Browsing sometimes disabled server-side; falls back to no citations.
Google Gemini	Structured `groundingMetadata.citations` array with URL, title and snippet	Yes	Snippet offsets sometimes off by a few characters against the answer text.
Perplexity	Footnote-style `[1] [2] [3]` in answer + parallel `sources[]` array with URL and title	Yes (via API)	Web UI response format drifts from API; we only use API.
Claude (with tools)	Free-text URLs embedded in the answer; no dedicated citation field	No	You have to regex URLs out and infer citation intent from surrounding text.

The Claude case is the annoying one. When Claude uses its web search tool, it embeds URLs in the answer as it writes, but there is no separate citations field on the response. We extract them with a URL regex, then run a lightweight classifier over the surrounding sentence to decide whether the URL is being cited as a source or merely mentioned as an example. Those are different intents and lumping them together would inflate Claude's citation count relative to the others.
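
The extraction step can be sketched like this. The URL regex is the simple part; the classifier here is only a keyword heuristic standing in for the lightweight classifier described above, and the cue phrases, `extract_citations` and the default-to-mention rule are all assumptions made for illustration.

```python
import re

# Match http(s) URLs up to whitespace or a bracket/paren/angle delimiter.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")
# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")
# Hypothetical cue phrases suggesting the URL is cited as a source.
SOURCE_CUES = ("according to", "source:", "as reported by", "as documented at")


def extract_citations(answer: str) -> list[tuple[str, str]]:
    """Return (url, kind) pairs, where kind is 'source' or 'mention'."""
    results = []
    for sentence in SENTENCE_SPLIT_RE.split(answer):
        lowered = sentence.lower()
        # Default to 'mention' so ambiguous URLs do not inflate citation counts.
        kind = "source" if any(cue in lowered for cue in SOURCE_CUES) else "mention"
        for match in URL_RE.finditer(sentence):
            url = match.group(0).rstrip(".,;:!?")  # drop trailing punctuation
            results.append((url, kind))
    return results
```

Classifying per sentence keeps the context window small; a trained classifier would replace the cue list but could keep the same input and output shape.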

Normalising to one schema

Every raw response gets extracted into the same shape. This is the schema we settled on after three rewrites – it is deliberately flat, deliberately verbose, and every field earns its place:

{
  "id": "cit_01HZY7...",
  "run_id": "run_2026-07-04_perplexity",
  "provider": "perplexity",
  "prompt_id": "pmp_best-seo-platforms",
  "prompt_intent": "category",
  "fetched_at": "2026-07-04T03:14:00Z",
  "citation_position": 2,
  "citation_kind": "source",
  "url_raw": "https://semoptimiser.com/blog/ai-visibility-2026?utm_source=perplexity",
  "url_normalised": "https://semoptimiser.com/blog/ai-visibility-2026",
  "domain": "semoptimiser.com",
  "brand_id": "brnd_semoptimiser",
  "answer_mentions_brand": true,
  "answer_sentiment": "positive",
  "raw_response_ref": "s3://ai-viz-raw/2026-07-04/perplexity/pmp_best-seo-platforms.json"
}

A few decisions worth calling out. `url_raw` vs. `url_normalised` – we keep both because providers sometimes append their own tracking params, and we want to be able to prove what came back without losing the ability to dedup. `citation_kind` is `source`, `mention` or `quote` and drives whether the row counts toward citation share. `raw_response_ref` points at the untouched provider payload – we never mutate that store, per our SEO data integrity rule.
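
For readers who want the schema as a type, here is a minimal sketch in Python. The field names come from the JSON above; the class name, the `from_dict` helper, the validation rules and the `comparison` and `branded` intent labels are assumptions (only `category` appears in the example record).

```python
from dataclasses import dataclass
from datetime import date, datetime

CITATION_KINDS = {"source", "mention", "quote"}
# "category" appears in the example record; the other two labels are assumed.
PROMPT_INTENTS = {"category", "comparison", "branded"}


@dataclass(frozen=True)
class CitationRecord:
    id: str
    run_id: str
    provider: str
    prompt_id: str
    prompt_intent: str
    fetched_at: datetime
    citation_position: int
    citation_kind: str
    url_raw: str
    url_normalised: str
    domain: str
    brand_id: str
    answer_mentions_brand: bool
    answer_sentiment: str
    raw_response_ref: str

    @property
    def fetched_date(self) -> date:
        """UTC calendar date of the fetch, usable as a dedup key component."""
        return self.fetched_at.date()

    @classmethod
    def from_dict(cls, row: dict) -> "CitationRecord":
        """Validate enumerated fields and parse the UTC timestamp."""
        if row["citation_kind"] not in CITATION_KINDS:
            raise ValueError(f"unknown citation_kind: {row['citation_kind']!r}")
        if row["prompt_intent"] not in PROMPT_INTENTS:
            raise ValueError(f"unknown prompt_intent: {row['prompt_intent']!r}")
        # Older Python versions do not parse a trailing "Z", so normalise it.
        fetched_at = datetime.fromisoformat(row["fetched_at"].replace("Z", "+00:00"))
        return cls(**{**row, "fetched_at": fetched_at})
```

Because the record is flat, unexpected or missing keys fail loudly in the constructor instead of being silently dropped.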

Deduping across providers

A brand that shows up cited on the same prompt across all four providers should be counted four times for cross-provider share, but only once for "did we appear on this prompt today." Deduping keys off `(brand_id, prompt_id, fetched_date)` for the appearance count, and off `(brand_id, prompt_id, provider, fetched_date)` for the citation count. The distinction seems pedantic until you build the dashboard – reporting one when the customer wants the other makes the whole tracker look broken.
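
The two counts differ only in whether `provider` is part of the key. A minimal sketch, assuming each row is a dict that already carries a `fetched_date` derived from `fetched_at` (the function names are hypothetical):

```python
def appearance_count(rows):
    """Distinct (brand, prompt, day) combinations: 'did we appear today?'"""
    return len({(r["brand_id"], r["prompt_id"], r["fetched_date"]) for r in rows})


def citation_count(rows):
    """Distinct (brand, prompt, provider, day) combinations for cross-provider share."""
    return len(
        {(r["brand_id"], r["prompt_id"], r["provider"], r["fetched_date"]) for r in rows}
    )


# One brand cited on one prompt by all four providers on the same day.
rows = [
    {"brand_id": "brnd_semoptimiser", "prompt_id": "pmp_best-seo-platforms",
     "provider": provider, "fetched_date": "2026-07-04"}
    for provider in ("chatgpt", "gemini", "perplexity", "claude")
]
```

With these rows, `appearance_count(rows)` is 1 and `citation_count(rows)` is 4, which is exactly the distinction the dashboard has to keep straight.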

URL-level dedup is trickier. The same source can appear as `https://semoptimiser.com/blog/x`, `https://www.semoptimiser.com/blog/x/` and `https://semoptimiser.com/blog/x?utm_source=chatgpt`. Our normaliser lowercases the host, strips `www.`, strips trailing slashes, and drops a known list of tracking params. Beyond that we leave the path and query string alone: a `?ref=` parameter we do not recognise stays, because it might genuinely point to a different page.
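
Those rules map fairly directly onto Python's standard `urllib.parse`. The sketch below is an approximation under assumptions: the `TRACKING_PARAMS` set is a placeholder rather than our real list, and `normalise_url` is a hypothetical name. Note that re-encoding the query can change how escaped characters are written, and any userinfo in the URL is dropped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Placeholder list of tracking params to drop; the real list is not shown here.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def normalise_url(url: str) -> str:
    """Lowercase host, strip 'www.', strip trailing slashes, drop tracking params."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path.rstrip("/")
    # Keep every query parameter that is not on the known tracking list.
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, netloc, path, urlencode(kept), parts.fragment))
```

All three example variants above collapse to `https://semoptimiser.com/blog/x`, while an unrecognised `?ref=abc` survives untouched.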

The hard parts

Building the happy path is a weekend. The last 20% is what takes months.

Prompt drift. Providers silently update their models. A prompt that returned three citations yesterday can return zero today because the model got a bit terser. We now track per-provider citation-density baselines and alert when a run comes back more than 2σ below baseline, which usually means the provider changed something.
Rate limiting. Perplexity and ChatGPT have low per-minute quotas relative to our volume. We use a token-bucket scheduler per provider and adaptively slow down when we see 429s.
Response caching. OpenAI in particular sometimes returns cached answers for identical prompts within a short window. We inject a low-entropy nonce into prompts where the phrasing tolerates it, to force a fresh generation.
Cost. Running 180k prompts a day is not free. We cache raw responses for 24 hours so dashboard rebuilds and re-parses do not re-hit the provider, and we run the cheaper models for category prompts, where model choice matters less to the outcome.
Timezone confusion. "Daily" means different things to different customers. We stamp everything in UTC and let the dashboard render in the workspace timezone, but this took two support tickets to figure out.
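
Two of the guards above, the 2σ baseline alert and the token bucket that slows down on 429s, can be sketched in a few lines. This is an illustration under assumptions: the names, the halving factor and the rate floor are hypothetical, and the baseline here is a plain mean and sample standard deviation over recent runs.

```python
import statistics
import time


def is_density_drop(history, today, sigmas=2.0):
    """True when today's citation density is more than `sigmas` below baseline."""
    if len(history) < 2:
        return False  # not enough runs to estimate a spread
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)
    return today < mean - sigmas * spread


class TokenBucket:
    """Per-provider limiter: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; the caller waits and retries otherwise."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def on_429(self, factor=0.5, floor=0.1):
        """Adaptive slow-down: cut the refill rate after a rate-limit response."""
        self.rate = max(floor, self.rate * factor)
```

Passing the clock in as a parameter keeps the bucket testable without sleeping; a production version would also need a way to recover the rate once 429s stop.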

What we cannot measure

Three honest limitations that we surface to customers on the product page rather than burying:

ChatGPT sessions with browsing disabled. If the user has turned off web search, ChatGPT answers from parametric memory only and returns no citations. We sample only in browsing-enabled mode, so our numbers reflect the citation surface, not the total answer surface.
Logged-in personalisation. Provider answers can vary based on account history and location. Our runs use fresh sessions from a fixed geography (currently Sydney and Virginia), so we measure a repeatable baseline, not the answer any given user sees.
Gemini in Google Search. Answers rendered inside the Google search box (AI Overviews) use a different citation surface than the standalone Gemini app. We measure both, but report them separately – mixing them makes trend lines meaningless.

Where the numbers actually surface

Once normalised, the citation rows feed three things: a daily rollup per brand and provider, a weekly delta report emailed to workspace owners, and the live AI Visibility dashboard. The dashboard is the single most-viewed surface in the product this year – customers care about their citation share the way they used to care about their organic rank in 2015. The pipeline exists so that number is defensible.

What to do next

If you are building your own tracker, start with Perplexity and Gemini via API – clean citation structures, cheap prototypes, meaningful data within a day. Add ChatGPT next once you can stomach the rate limits. Leave Claude for last, because you will need to build the source-vs-mention classifier before the data is worth trusting. If you would rather skip the build entirely, our AI Visibility tracker does all four out of the box, with the raw responses retained so you can audit any citation back to its source payload.
