Most teams discover their AI visibility problem by accident. A prospect mentions that ChatGPT recommended a competitor. A founder vanity-checks Perplexity and finds the brand missing from its own category. Someone notices organic traffic sliding while rankings hold steady.

An accidental discovery is a bad baseline. It's one query, on one engine, on one day, filtered through whoever happened to be looking. Decisions get made on it anyway.

The fix costs an afternoon. Before buying any monitoring tool (including this one), a team can build a defensible first read of where the brand stands in AI answers, using nothing but the engines themselves and a spreadsheet. Here's the method.

Key takeaways

A defensible AI visibility baseline needs three things: the real questions buyers ask, the engines that matter for the category, and a consistent record of what came back.
Twenty buyer questions across five engines (one hundred answers) is enough to see the pattern: where the brand appears, where competitors own the answer, and which sources the engines trust.
Record presence, position, sentiment, and citations for every answer. Those four fields turn anecdotes into a number that can be re-measured.
A manual baseline decays fast because answers change continuously. Its job is to size the problem and justify (or kill) the case for continuous monitoring.

Step 1: Write down the questions buyers actually ask

The baseline is only as good as the prompts. Skip brand-name queries at first ("what is [brand]") and start where deals are won: category and comparison questions.

Three sources produce a solid list quickly. Sales calls, because the questions prospects ask a rep are the questions they ask an engine first. Existing search data, because high-intent keywords translate directly into conversational prompts ("best CRM for mid-market B2B teams"). And the category's natural comparisons, because shortlist queries ("X vs Y", "alternatives to X") are where recommendation behavior is most visible.

Twenty questions is the practical floor. Fewer than that and one odd answer skews the read. Aim for a mix: roughly half category-level ("best [category] for [segment]"), a quarter comparisons, a quarter problem-framed ("how do I [job the product does]").
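
For teams that keep the question list in a script rather than a sheet, the mix is easy to sanity-check. The sketch below is one possible way to do it, not a prescribed tool: the kind labels ("category", "comparison", "problem"), the function name, and the ten-point tolerance are all assumptions made for illustration.

```python
from collections import Counter

# Target shares taken from the text: half category, a quarter comparison,
# a quarter problem-framed. The label names are illustrative assumptions.
TARGET_MIX = {"category": 0.50, "comparison": 0.25, "problem": 0.25}
MIN_QUESTIONS = 20


def check_mix(questions, tolerance=0.10):
    """Return warnings for a list of (kind, prompt) pairs; empty list means OK."""
    warnings = []
    if len(questions) < MIN_QUESTIONS:
        warnings.append(f"only {len(questions)} questions; floor is {MIN_QUESTIONS}")
    counts = Counter(kind for kind, _ in questions)
    total = len(questions) or 1  # avoid dividing by zero on an empty list
    for kind, target in TARGET_MIX.items():
        share = counts.get(kind, 0) / total
        if abs(share - target) > tolerance:
            warnings.append(f"{kind}: {share:.0%} of list, target about {target:.0%}")
    return warnings
```

An empty return value means the list clears the twenty-question floor and sits close to the half, quarter, quarter split.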

Step 2: Pick the engines by buyer, not by buzz

Five engines cover the ground for most B2B categories: ChatGPT, Gemini, Perplexity, Claude, and Copilot, with Google's AI Overviews as a sixth read since it sits on top of existing search behavior. ChatGPT carries the most weight in the aggregate (Graphite.io puts it at 89% of global AI sessions), but aggregate share is not category share. Copilot matters more in Microsoft-heavy enterprises; Perplexity over-indexes among technical evaluators; each engine ranks brands its own way.

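Run every question on every engine, logged out or in a fresh session where possible. Personalization contaminates a baseline: an engine that can draw on account history or earlier turns in a conversation may tilt its answer toward brands the tester has already looked at.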
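
The run list itself is just every question paired with every engine. A minimal sketch of generating it as a blank sheet follows. The function name and the two-column layout are assumptions; any spreadsheet produces the same grid by hand.

```python
import csv
import io
from itertools import product

# The five engines named in the text; add "AI Overviews" for the sixth read.
ENGINES = ["ChatGPT", "Gemini", "Perplexity", "Claude", "Copilot"]


def build_run_sheet(questions, engines=ENGINES):
    """Return CSV text with one row per question-engine pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["question", "engine"])
    for question, engine in product(questions, engines):
        writer.writerow([question, engine])
    return buffer.getvalue()
```

Twenty questions against the default list yields one hundred rows plus the header. Passing a sixth engine name adds twenty more.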

Step 3: Record four fields per answer

For each question-engine pair, capture the same four fields:

Presence. Is the brand named at all? (Yes / no. The most important column in the sheet.)
Position. If named, where: the lead recommendation, mid-list, or a trailing mention.
Sentiment. How the answer characterizes the brand, in one word: recommended, neutral, caveated, or negative.
Citations. Which sources the engine linked or named. This column becomes the action list later, because those sources are where visibility is actually earned.
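
Consistency is the whole point of the four fields, so it helps to pin down the allowed values before the first answer is logged. The sketch below shows one way to encode a row, assuming Python 3.9 or later. The class name, the short labels for position ("lead", "mid-list", "trailing"), and the rule that position and sentiment stay empty when the brand is absent are illustrative choices, not part of the method itself.

```python
from dataclasses import dataclass, field
from typing import Optional

# Allowed values mirror the four fields described above; labels are assumed.
POSITIONS = {"lead", "mid-list", "trailing"}
SENTIMENTS = {"recommended", "neutral", "caveated", "negative"}


@dataclass
class AnswerRecord:
    question: str
    engine: str
    presence: bool                      # is the brand named at all?
    position: Optional[str] = None      # only meaningful when presence is True
    sentiment: Optional[str] = None     # only meaningful when presence is True
    citations: list[str] = field(default_factory=list)

    def __post_init__(self):
        if self.presence:
            if self.position not in POSITIONS:
                raise ValueError(f"position must be one of {sorted(POSITIONS)}")
            if self.sentiment not in SENTIMENTS:
                raise ValueError(f"sentiment must be one of {sorted(SENTIMENTS)}")
        elif self.position is not None or self.sentiment is not None:
            raise ValueError("position and sentiment apply only when the brand is named")
```

Rejecting free-text values at entry time keeps the sheet countable later: "top pick" and "lead" would otherwise land in separate buckets.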

A hundred rows of that is a real dataset. Presence rate alone (for example, "named in 23% of category answers, versus 71% for the leading competitor") is a board-legible number. The competitor half of that comparison comes from the same answers, provided each row also notes which rivals were named. The citations column usually contains the entire content strategy: the same handful of roundups, comparison pages, and community threads appearing again and again.
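
Presence rate is simple arithmetic: answers that name the brand divided by answers collected, overall and per engine. A minimal sketch, assuming each row is a dictionary with an "engine" string and a "presence" boolean (the key names and function name are illustrative):

```python
from collections import defaultdict


def presence_rates(rows):
    """Return (overall_rate, {engine: rate}) for rows with 'engine' and 'presence'."""
    named = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["engine"]] += 1
        if row["presence"]:
            named[row["engine"]] += 1
    per_engine = {engine: named[engine] / total[engine] for engine in total}
    overall = sum(named.values()) / sum(total.values()) if total else 0.0
    return overall, per_engine
```

Running the same function on a later batch of rows gives the re-measured number, which is what makes the first one a baseline rather than an anecdote.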

Step 4: Read the gaps before the totals

The totals make the headline, but the gaps make the plan. Three patterns show up in almost every first baseline:

An engine blind spot. Strong presence on one engine, absence on another, usually because the engines trust different sources. Treating "AI" as one channel hides this completely.
A competitor-owned query. One rival is the default answer for a specific high-intent question. That's not a general visibility problem; it's a specific, fixable one.
A description drift. The brand appears, but the engines describe it for the wrong segment or against the wrong use case. Presence without accurate framing converts no one, which is why the score alone was never the point.
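
The first two patterns can be flagged mechanically; description drift still needs a human reading the answers. The sketch below is one hedged way to do the flagging. The thresholds (a 30-point gap, three engines), the "lead_brand" column recording which brand led each answer, and both function names are assumptions introduced here, not fields from the four-column sheet.

```python
from collections import Counter, defaultdict


def engine_blind_spots(per_engine_rates, gap=0.30):
    """Engines whose presence rate trails the best engine by at least `gap`."""
    if not per_engine_rates:
        return []
    best = max(per_engine_rates.values())
    return sorted(e for e, rate in per_engine_rates.items() if best - rate >= gap)


def competitor_owned(rows, brand, min_engines=3):
    """Map question -> rival that leads the answer on at least `min_engines` engines.

    Assumes each row carries a hypothetical 'lead_brand' value alongside 'question'.
    """
    leads = defaultdict(Counter)
    for row in rows:
        lead = row.get("lead_brand")
        if lead and lead != brand:
            leads[row["question"]][lead] += 1
    owned = {}
    for question, counter in leads.items():
        rival, count = counter.most_common(1)[0]
        if count >= min_engines:
            owned[question] = rival
    return owned
```

Both thresholds are judgment calls. Loosen them for a first pass, then tighten once the obvious gaps are on the list.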

The honest limitation of a manual baseline

A manual baseline is a photograph of a moving object. Engine answers shift with model updates, fresh content, and new citations; the picture starts aging the day it's taken. Re-running a hundred prompts by hand every month is exactly the kind of work that quietly stops happening by month three.

That's the real decision the baseline informs. If the afternoon's work shows the brand solidly present everywhere it matters, continuous monitoring can wait. If it shows gaps in the engines buyers use most, the question becomes how fast those gaps are moving, and that requires measuring daily, across every engine, with the receipts kept.

Either way, the baseline pays for itself: it replaces the accidental anecdote with a number, and a number can be argued with, budgeted against, and improved.

For the fastest version of step zero, the free AEO readiness check reads how well a site is structured for AI citation in about a minute. And when the manual spreadsheet gets old, a demo shows the continuous version running on a real category.