MentionFox

How GEOFixer Measures and Moves AI Visibility

Most "AI visibility" tools are black boxes. This page is the opposite. Below is exactly how we generate queries, which models we ask, what counts as a win, and the allocation that drives your score. Read it. Argue with it. Use it to compare us with anyone.

The seven-LLM panel

When a real buyer asks an AI "what is the best [thing in your category]?" they could be asking ChatGPT, Gemini, Claude, Perplexity, or one of the open-weight engines that now power dozens of vertical tools. We do not get to pick which model your buyer uses. So we ask all of them.

The seven-engine panel is: GPT-4o-mini (via API, the workhorse for free-ChatGPT and most third-party rebrands), ChatGPT-5 (via API, gpt-5 with gpt-4o fallback — the paid ChatGPT flagship), Gemini Flash (via API, Google's AI surface), Claude Haiku 4.5 (via Anthropic API), Perplexity sonar-pro (via API, live-web with citations), Mistral (via API, EU + open-weight reach), and DeepSeek (via API, vertical-tool reach). Some panels add Grok when its API quota is open.

Why so many? Because each engine has a different worldview. Gemini Flash leans on Google's index plus Vertex memory. GPT-4o-mini leans on OpenAI's training distribution plus Bing-like grounding. Perplexity leans hard on the live web. DeepSeek leans on a different set of crawls altogether. A brand that wins on one is often invisible on another. Measuring only one engine gives you a lottery ticket, not a score.

What we do not measure (yet): Google AI Mode in the SERP itself, voice assistants (Alexa, Siri, the Google Assistant), and the in-product AI surfaces inside Notion, Slack, Linear, etc. Those are coming as their APIs or scrape paths stabilize. We will not pretend we measure them today.

How we generate queries

A bad query measures nothing. "Tell me about MentionFox" tells you the model knows the brand exists. That is not a useful measurement — it tells the model the answer.

A good query is one a real buyer would type, blind, when they are looking for something your product solves. Three rules govern our query generator:

RULE 1

Persona-aware

Queries are written from the point of view of the buyer, not the brand. For a social listening tool, the buyer is a marketing manager, an agency lead, a founder, a brand defender. Each persona asks different questions. The query generator pulls personas from your co-profile (the brand profile we build on day one) and writes a balanced spread.

RULE 2

Category-aware

Your category determines the question shape. "Best tool for X" is one shape. "Cheapest tool for X" is another. "X for small teams" is a third. "X versus Y" is a fourth (and the most revealing, because it forces a head-to-head). The generator writes a spread of shapes, not just the easy "best" question.

RULE 3

Derived from your co-profile

The query generator pulls from your co-profile: your differentiators, your top competitors, your verticals, your integrations, your pricing posture. This is why the queries we run for a $99/mo SaaS look different from the queries we run for a $50k/yr enterprise platform. A boilerplate query set would punish niche brands by asking generic questions they have no shot at winning.
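To make the three rules concrete, here is a minimal sketch of how a persona-aware, shape-aware spread could be built from a co-profile. The field names (brand, category, personas, competitors) and the hard-coded shapes are illustrative assumptions for this page, not our actual schema or generator.

```typescript
// A minimal sketch of a persona × shape query spread. Field names and the
// literal shapes below are illustrative assumptions, not the production schema.
interface CoProfile {
  brand: string;          // e.g. "MentionFox"
  category: string;       // e.g. "social listening tool"
  personas: string[];     // e.g. ["marketing manager", "agency lead", "founder"]
  competitors: string[];  // e.g. ["Brandwatch", "Meltwater"]
}

function buildQuerySpread(p: CoProfile): string[] {
  const queries: string[] = [];
  for (const persona of p.personas) {
    // Buyer-voiced questions, one per shape, blind to the brand.
    queries.push(`best ${p.category} for a ${persona}`);
    queries.push(`cheapest ${p.category} for a ${persona}`);
    queries.push(`${p.category} for small teams`);
  }
  // The most revealing shape: a forced head-to-head against each named competitor.
  for (const rival of p.competitors) {
    queries.push(`${p.brand} vs ${rival}: which is better?`);
  }
  return queries;
}
```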

What "win" actually means

The most common cheat in AI visibility tooling is to count any mention as a hit. If ChatGPT says "MentionFox is one of many tools in this space, alongside Brandwatch, Meltwater, Sprinklr, and Sprout Social," that counts as a mention — but it is not a recommendation. Buyers do not act on a mention buried in a list of nine. They act on the top recommendation, the one the model bolds, the one the model justifies.

So we do not score mentions. We score wins. A win means one of three things, in this order:

1. Recommended: the model names you as its pick and justifies the choice.
2. Listed with justification: the model puts you on a shortlist and explains why you belong there.
3. Conditional match: the model recommends you for a specific buyer situation ("if you need X, use MentionFox").

Anything else (passing mentions, list-stuffing, "also worth considering" tail mentions) does not count. We log them, because they are useful for trend analysis, but they do not move your score.

The classifier that decides whether a turn was a win is itself an LLM call, with a structured prompt that returns one of five labels: recommended, listed-with-justification, conditional-match, mentioned-no-recommend, absent. The first three are wins. The last two are not.
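For concreteness, here is a minimal sketch of the label-to-win mapping just described. The five label strings come from this page; the types and function below are illustrative only, and the actual decision is made by the structured LLM prompt, not this code.

```typescript
// A minimal sketch of the win/no-win mapping. The five labels come from the
// methodology above; the real classifier is a structured LLM call.
type TurnLabel =
  | "recommended"
  | "listed-with-justification"
  | "conditional-match"
  | "mentioned-no-recommend"
  | "absent";

const WIN_LABELS = new Set<TurnLabel>([
  "recommended",
  "listed-with-justification",
  "conditional-match",
]);

function isWin(label: TurnLabel): boolean {
  // Passing mentions and absences are logged for trend analysis
  // but never move the score.
  return WIN_LABELS.has(label);
}
```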

The seven-LLM allocation that drives your score

All seven engines on the panel now contribute to your score, weighted. We used to compute the score from four engines and keep Perplexity, Claude, and the OpenAI flagship aside as evidence-only. Buyer behavior shifted — Perplexity now drives a meaningful share of citation-rich queries, ChatGPT-5 sits behind the most-used consumer AI app on the planet, and Claude has stabilized enough that excluding it makes the score less truthful, not more. So we put all seven in.

The allocation reflects what real buyers actually use. Heavier weight goes to the engines that drive the most buyer reach. Smaller, deliberate weights for the newer entrants so a single noisy day from one engine cannot swing your headline number.

Engine | Weight | Why this weight
Gemini Flash | 25% | Powers Google's AI surfaces, the largest single source of AI search traffic.
GPT-4o-mini | 20% | Powers free ChatGPT and most third-party apps that rebrand OpenAI. Highest raw buyer reach via the cheaper API.
DeepSeek | 20% | Powers an exploding share of vertical AI tools (research, code, customer support). Different worldview from the OpenAI/Google axis — catches gaps the others miss.
Mistral | 15% | Powers many EU-side and open-weight deployments. Cheaper to query at scale, useful as a tiebreaker.
Perplexity (sonar-pro) | 10% | Live-web grounded answers with citations. Increasingly the buyer's first stop for "best X for Y" queries that demand sources.
ChatGPT-5 (gpt-5 / gpt-4o) | 5% | Powers the paid ChatGPT product. Smaller weight than gpt-4o-mini because reach is narrower, but the answer quality reflects what the most engaged buyers see.
Claude Haiku 4.5 | 5% | Anthropic representation. Small weight on purpose — Claude is more hedged than the others, so a heavy weight would punish brands unfairly. Five points is enough to surface a real Claude gap if one exists.

Total weight: 100%. Every engine is queried on every panel run. The weighted average is the headline GEO Score; the per-engine breakdown is one click away on the dashboard so you can see which engines you win on and which you do not.
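As a worked example, here is how the weighted average could be computed from per-engine win rates. The weights are the ones in the table above; the win rates and variable names are invented purely to show the arithmetic, not real data.

```typescript
// Weights from the table above (they sum to 1.0). The win rates are
// made-up numbers used only to illustrate the calculation.
const WEIGHTS: Record<string, number> = {
  "gemini-flash": 0.25,
  "gpt-4o-mini": 0.2,
  deepseek: 0.2,
  mistral: 0.15,
  "perplexity-sonar-pro": 0.1,
  "chatgpt-5": 0.05,
  "claude-haiku-4.5": 0.05,
};

const winRates: Record<string, number> = {
  "gemini-flash": 42,
  "gpt-4o-mini": 38,
  deepseek: 17,
  mistral: 25,
  "perplexity-sonar-pro": 31,
  "chatgpt-5": 40,
  "claude-haiku-4.5": 12,
};

// Headline GEO Score = weighted average of per-engine win rates.
const geoScore = Object.entries(WEIGHTS).reduce(
  (sum, [engine, weight]) => sum + weight * (winRates[engine] ?? 0),
  0
);

console.log(geoScore.toFixed(1)); // about 31 for these made-up win rates
```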

Why Claude is now in the score (was excluded — May 2026)

Earlier versions of this methodology excluded Claude from the score on the grounds that Anthropic's models are more hedged than the rest of the panel and consistently returned win rates 2 to 5 points lower. That was honest at the time. It became dishonest over the course of 2026.

Claude has stabilized. The 2-5 point penalty is now closer to 1-2 points, and Anthropic's product reach has grown enough that excluding Claude from the score skews your number optimistic relative to what your actual buyers see. So we put Claude back in — specifically Claude Haiku 4.5 — at a deliberately small 5% weight. That is enough to surface a real Claude gap (if your brand is invisible there, you will see it) without letting Claude's structural hedging drag your headline score below your real-world recommendation rate.

The "all major LLMs" claim on our marketing pages is now true. Excluding Anthropic was the one weak point in that claim and it is fixed.

AI crawler analytics — what we track and what we surface

Measuring how your brand is recommended inside an LLM is one half of the GEO problem. The other half is measuring whether the LLM is actually crawling your site to begin with. If GPTBot has not visited your blog in three months, no amount of content optimization will move your score until that crawl pattern changes.

Every request that hits mentionfox.com (and every client domain we host) passes through Vercel edge middleware. The middleware inspects the User-Agent header and, if it matches one of the AI crawler patterns below, fires a fire-and-forget log to our ai_crawler_visits table. Logging never delays the user's request.

Crawlers tracked: GPTBot, ChatGPT-User, OAI-SearchBot, GPTBot-User, PerplexityBot, PerplexityBot-User, ClaudeBot, Claude-Web, Claude-SearchBot, Anthropic-AI, Google-Extended, Applebot-Extended, Bytespider, AI2Bot, FacebookBot, CCBot, Amazonbot.
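A minimal sketch of what that edge middleware could look like in a Next.js project on Vercel follows. The logging endpoint URL, the payload shape, and the abbreviated pattern list are illustrative assumptions; only the User-Agent matching and the fire-and-forget pattern mirror the description above.

```typescript
// middleware.ts — a minimal sketch of AI-crawler logging at the edge.
// The /api/log-crawler endpoint and payload shape are assumptions for
// illustration; the authoritative crawler list is the one documented above.
import { NextRequest, NextResponse, NextFetchEvent } from "next/server";

const AI_CRAWLERS = [
  /GPTBot/i, /ChatGPT-User/i, /OAI-SearchBot/i,
  /PerplexityBot/i, /ClaudeBot/i, /Claude-Web/i, /Anthropic-AI/i,
  /Google-Extended/i, /Applebot-Extended/i, /Bytespider/i,
  /AI2Bot/i, /FacebookBot/i, /CCBot/i, /Amazonbot/i,
];

export function middleware(req: NextRequest, event: NextFetchEvent) {
  const ua = req.headers.get("user-agent") ?? "";
  const hit = AI_CRAWLERS.find((pattern) => pattern.test(ua));

  if (hit) {
    // Fire-and-forget: waitUntil lets the log call finish after the response
    // has been returned, so the crawler's request is never delayed.
    event.waitUntil(
      fetch("https://mentionfox.example/api/log-crawler", {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({
          userAgent: ua,
          path: req.nextUrl.pathname,
          at: new Date().toISOString(),
        }),
      }).catch(() => {}) // never let a logging failure surface to the visitor
    );
  }

  return NextResponse.next();
}
```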

What you see on the dashboard at /dashboard/geofixer/crawler-analytics:

The same dashboard lives at /clients/:id/geo/crawler-analytics for agency users, scoped to that client's domain. Agency white-label reports can include the crawler timeseries on request.

Autopilot writes content. Trackers do not.

Here is the structural difference between GEOFixer and every "AI visibility tracker" on the market.

A tracker shows you a dashboard. The dashboard says "you have a 12% win rate on Gemini Flash for queries about social listening." The tracker is correct. The dashboard is accurate. Now what?

The honest answer most trackers will not give you: now you go write content, hire an agency, or open a Notion doc and try to figure out what the tracker is telling you to do. The tracker has no opinion on what to write, no draft, no publish flow, no measurement of whether your content moved the score.

GEOFixer Autopilot writes the content. When the system finds a query category where you lose to a specific competitor, it generates a content brief with a target query, an outline, a competitive read, and a draft. You approve it, edit it, or reject it. Approved content gets published to a slug we own (your shadow site) so AI crawlers find it on day one without you touching your CMS. Then the next measurement cycle catches whether the content moved your score on that query category.
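For a sense of what Autopilot hands you for review, here is a hedged sketch of the shape a content brief might take. The field names are illustrative, not our actual schema.

```typescript
// A hedged sketch of a content brief record; field names are illustrative.
interface ContentBrief {
  targetQuery: string;      // the query category where the brand is losing
  losingTo: string;         // the competitor currently winning that category
  outline: string[];        // proposed section-by-section outline
  competitiveRead: string;  // why that competitor wins today
  draft: string;            // generated draft, ready for review
  status: "pending" | "approved" | "edited" | "rejected";
  publishedSlug?: string;   // shadow-site slug once approved and published
}
```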

The flywheel matters because content is the only durable lever. Active conversation training creates a signal. Shadow site serving makes you legible to crawlers. But content is what actually shifts the model's training-time and retrieval-time understanding of your brand. Trackers leave that work on your desk. Autopilot does it.

The flywheel

Once you turn Autopilot on, the system runs a loop. You are not in the loop most days — you are reviewing its output.

01
Measure. Every night, the system runs the seven-LLM panel against the persona-aware query set. Wins, conditional matches, and absences are all logged. Fresh data lands by morning.
02
Mine evidence. Each conversation surfaces facts about your brand, your competitors, and the query. The system extracts those facts and stores them as structured records (we call this layer "promoter facts"). Over time you get a database of what every model thinks about you.
03
Retrieve at write time. When the system drafts new content or new conversation prompts, it pulls from that fact database. So the next conversation you run is more informed than yesterday's. The next content brief reflects what models actually need to hear, not generic GEO advice.
04
Compound. Wins from yesterday inform queries today. Today's evidence informs tomorrow's content. Three months in, the system knows your category better than most agencies. Six months in, it knows it better than most analysts. That compounding is the moat.
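A compact, fully stubbed sketch of that loop is below. Apart from the term "promoter fact," which comes from the steps above, every name and signature here is illustrative, not our internals.

```typescript
// A compact, stubbed sketch of the nightly flywheel. Only the term
// "promoter fact" comes from the methodology; everything else is illustrative.
interface PromoterFact { engine: string; query: string; claim: string; }

async function runPanel(queries: string[]): Promise<string[]> {
  return []; // 01 Measure: seven-LLM panel transcripts land here
}

function mineFacts(transcripts: string[]): PromoterFact[] {
  return []; // 02 Mine evidence: extract structured facts from each conversation
}

function draftFromFacts(facts: PromoterFact[]): string[] {
  return []; // 03 Retrieve at write time: briefs reflect what models actually said
}

async function nightlyCycle(queries: string[], factDb: PromoterFact[]) {
  const transcripts = await runPanel(queries);
  factDb.push(...mineFacts(transcripts)); // 04 Compound: today's facts inform tomorrow's drafts
  return draftFromFacts(factDb);
}
```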

What we are honest about not measuring

The temptation in this space is to claim measurement of everything. Resist that temptation when reading our materials and anyone else's. Here is what we explicitly do not measure today, and why:

Google AI Mode in the SERP itself.
Voice assistants (Alexa, Siri, the Google Assistant).
In-product AI surfaces inside Notion, Slack, Linear, and similar tools.

All three are absent for the same reason: their APIs or scrape paths are not yet stable enough to measure honestly. They will be added as that changes, and we will not pretend we measure them before then.

Try the methodology on your brand

Five-day free trial. Nothing to install. The seven-LLM panel runs against your domain on day one and the dashboard shows the same numbers you read about above.

Run my brand through the panel