How to Measure AI Visibility Across ChatGPT & Perplexity

Table of Contents
- Why measurement comes first
- What AI visibility measurement actually tracks
- Step 1: Build your prompt library
- Step 2: Run prompts and capture outputs
- Step 3: Score what you find
- Step 4: Calculate share of voice
- What good AI visibility data looks like
- Common mistakes that break the measurement
- How FNA measures AI visibility for clients
The short version: You can't improve AI visibility without measuring it first. This framework covers the four steps — prompt library, output capture, citation scoring, and share of voice calculation — that turn vague AI mentions into a trackable metric. Most brands skip measurement entirely and wonder why their AI presence doesn't improve.
Most brands have strong opinions about whether they "show up" in AI. Few have data. The ones who do are ahead of competitors who are still guessing.
Why measurement comes first
The instinct when you discover AI visibility is to start publishing content. More blog posts. Better structured pages. More FAQ sections.
That's not wrong. But it's backwards if you haven't measured first.
Without a baseline, you can't tell whether the content you're creating is moving anything. You can't see which topics you're invisible on versus which you're already included in. And you can't prioritise — so you write for the topics that feel important rather than the ones where you're losing ground.
Measurement solves all three problems. It tells you where you stand, where the gap is widest, and which changes are actually working.
What AI visibility measurement actually tracks
AI visibility measurement has three components. They're different from each other, and each one tells you something the others don't.
Inclusion rate — What percentage of responses to a given prompt mention your brand at all? This is the base metric. If ChatGPT mentions you in 2 out of 10 runs of the same prompt, your inclusion rate for that prompt is 20%.
Characterisation — When you are mentioned, how? Are you described as a leading option, a niche alternative, or dismissed with a caveat? Inclusion without favourable characterisation is weak signal. A mention that says "some users consider Brand X, though it has limited integrations" is worse than not being mentioned at all in some competitive contexts.
Share of voice (SoV) — Across all prompts in a topic cluster, what percentage of total brand mentions are yours versus competitors? If your category generates 200 brand mentions across a prompt set and your brand accounts for 18 of them, your SoV is 9%. That number is what you improve against week over week.
| Metric | What it measures | How to use it |
|---|---|---|
| Inclusion rate | Whether you appear in responses | Identify invisible topics |
| Characterisation | How you're described when mentioned | Flag negative or weak mentions |
| Share of voice | Your mentions vs. competitors total | Track competitive position over time |
Step 1: Build your prompt library
The prompt library is the foundation. Every other measurement depends on it being well-constructed.
What makes a good prompt
The best prompts are phrased the way real buyers ask AI tools — not the way marketers think about keywords.
"AI visibility tools" is a keyword. "What tools can I use to track whether my brand appears in AI answers?" is a prompt. The second one generates more useful data because it's closer to actual user behaviour.
Use these sources to build your initial library:
- People Also Ask — Google's PAA boxes show real questions in your category
- Perplexity's "Related" suggestions — what users ask after an initial query
- Reddit and Quora threads — how people phrase questions to other people
- Your sales call recordings — how buyers describe their problems before they know your solution
How many prompts you need
20 prompts is enough to start. 40 gives you enough data to segment by topic and intent. More than 60 and you're adding diminishing returns unless you have a very complex competitive set.
Organise prompts into topic clusters:
Topic: Pricing and ROI
- "How much does [solution category] typically cost?"
- "Is [solution category] worth the investment for small businesses?"
- "What's the ROI on [solution category]?"
Topic: Comparison and alternatives
- "What are the best [solution category] tools?"
- "[Competitor] alternatives"
- "[Competitor] vs [Competitor] — which is better?"
Topic: Use cases
- "[Solution category] for [specific industry]"
- "How do companies use [solution category] to [specific outcome]?"
What to avoid in prompt construction
Don't name your brand in the prompt. This biases the response — you're testing discoverability, not direct recall.
Don't use prompts so broad they produce encyclopaedic answers rather than brand recommendations. "What is AI?" will not generate useful citation data.
Don't rotate prompts between measurement periods. Fix the library for at least 6 weeks. Changing prompts mid-measurement is like changing the questions between surveys — you lose comparability.
Step 2: Run prompts and capture outputs
Which engines to test
At minimum: ChatGPT (GPT-4o) and Perplexity. They behave differently and weight sources differently, so a brand can be strong in one and invisible in the other.
If your buyers use Google's AI Overviews heavily, add that. For enterprise clients in some regions, Gemini is worth including. In practice, most B2B measurement programmes track two to three engines.
How to run at scale
Manual testing works for an initial audit of 20 prompts. For anything ongoing, you need API access.
ChatGPT: OpenAI API with GPT-4o. Set temperature to 0.3 to 0.5 — low enough to get consistent responses, high enough to surface the natural variation the model would show real users. Run each prompt 3 times and record all three responses.
Perplexity: Perplexity API with the sonar-pro model. Run each prompt twice — Perplexity's live web access means responses vary more than ChatGPT's, so the second run helps catch variance.
What to record for each response
For each prompt run, capture:
- Which brands were mentioned (exact names as the model stated them)
- Whether your brand was mentioned (yes/no)
- How your brand was characterised (positive, neutral, negative, with caveat)
- Which URLs were cited as sources (if shown)
- The position of your first mention (first brand named, second, third, not mentioned)
Don't paraphrase. Capture exact text around your brand mention. "X is a solid option for mid-market teams" and "X is sometimes used by mid-market teams" are different characterisations.
Step 3: Score what you find
Raw data is noisy. Scoring gives you something to act on.
Citation scoring model
Assign each mention a score:
| Mention type | Score |
|---|---|
| Named as primary recommendation | 3 |
| Named alongside top competitors | 2 |
| Named with a caveat or limitation | 1 |
| Not mentioned | 0 |
| Mentioned negatively | -1 |
Run this for every prompt, every run, every engine. The resulting matrix tells you:
- Which topics you're scoring well on
- Which topics need content work
- Which topics have caveat problems (you're mentioned but being damped)
The caveat category is underrated. A lot of brands are technically being mentioned in AI answers but with language that reduces purchase intent. "Brand X is popular but some users find the onboarding complex" is a mention. It's not a good one.
Consistency score
Because AI models are probabilistic, run each prompt multiple times and measure how consistently you appear. If you show up in 1 out of 3 runs, that's low consistency — meaning you're on the edge of the model's consideration set for that topic.
Consistency = (runs where brand was mentioned) / (total runs)
High SoV but low consistency means something. It means you appear in some responses for that topic but not reliably — which often means your on-site content is partially structured correctly but missing elements that lock in consistent inclusion.
Step 4: Calculate share of voice
SoV is the number you report on. Everything else feeds into it.
The calculation
For a given topic cluster over a measurement period:
SoV = (your brand mentions across all prompt runs) / (total brand mentions across all prompt runs) × 100
If 5 competitors generated 240 total brand mentions across your prompt set and your brand generated 22, your SoV is 9.2%.
Calculate SoV per topic cluster and per AI engine. Your SoV on "pricing" questions might be 18% on Perplexity but 4% on ChatGPT. That gap tells you exactly where to focus: ChatGPT draws on different sources for pricing content, and your current content isn't in them.
What to compare against
Three benchmarks matter:
- Your own historical baseline — is your SoV moving week over week?
- Primary competitors — are you gaining or losing ground relative to the 2 or 3 brands you most often compete against in deals?
- Topic-level spread — which of your topic clusters has the lowest SoV? That's where lost deals are hiding.
The topic-level spread is where most brands find the biggest gap between their self-perception and reality. A company might feel strong on "security" positioning and discover their AI SoV on security-related prompts is 3%, because the sources AI uses for security content don't include them at all.
What good AI visibility data looks like
A well-structured measurement output gives you a table like this per measurement period:
| Topic cluster | Your SoV | Top competitor SoV | Your inclusion rate | Avg. characterisation score |
|---|---|---|---|---|
| Pricing & ROI | 12% | 31% | 40% | 1.8 |
| Use cases | 24% | 28% | 70% | 2.4 |
| Comparisons | 5% | 44% | 20% | 1.2 |
| Technical integration | 31% | 19% | 80% | 2.7 |
That table tells a clear story: strong on technical integration, nearly invisible on comparison prompts, a characterisation problem on pricing. The actions follow directly.
Common mistakes that break the measurement
Rotating prompts. If you change the prompt set between measurement periods, your data is not comparable. Treat the library as fixed infrastructure, not a variable.
Running prompts only once. AI models have natural variance. A single run per prompt tells you what the model said once, not what it typically says. Run each prompt at least 3 times.
Counting mentions without scoring characterisation. A mention is not a good mention by default. Track how you're being described, not just whether you appear.
Measuring too many engines at once without enough prompts. If you have 10 prompts split across 5 AI engines, you have 2 data points per engine. That's not enough signal. Go deep on 2 engines before expanding.
Not separating topic clusters. A single aggregate SoV number hides which topics you're invisible on. Always break down by cluster.
How FNA measures AI visibility for clients
At FNA Technology, AI visibility measurement is part of our AI development services — built into the weekly reporting cycle, not treated as a one-off audit.
The setup takes about two weeks: we build a 30 to 40 prompt library based on your category, competitive set, and buyer questions, then run the first baseline measurement across ChatGPT and Perplexity. From there, measurement runs weekly. The cost is under $3 per week in API usage for a typical client.
The output is a prioritised action list: which topic clusters need content created, which existing pages need structural updates, and where your off-site presence is missing from the sources AI draws on. The score tells you where you are. The action list tells you what to change to move it.
Want to see where your brand stands in AI answers right now? Run a visibility audit with our team and we'll show you your baseline AI share of voice across your category before we do anything else.
Frequently Asked Questions
Written by FNA Team
We are a team of developers, designers, and innovators passionate about building the future of technology. Specializing in AI, automation, and scalable software solutions.
Work with us