Engagemii Research Brief · May 2026

What makes brands visible to AI bots: a three-stage analysis of 1.19M sites

Gregory Pellitteri · Engagemii Research · 29 May 2026 · Snapshot dataset

We analysed every site in the Engagemii directory (1,187,128 root domains) to answer two questions in order: which brands get crawled by AI bots at all, and among those that are crawled, which get crawled the most. Both questions matter for different reasons, and the answers come from different parts of the data. This brief reports what we found, the methods we used, and the parts of the answer we still cannot explain. It is written for technically-minded readers who want to evaluate the methodology rather than read marketing copy.

A note on terminology: each row in our dataset is one root domain, which we refer to throughout as a “brand.” The terms are interchangeable.

Published by
Engagemii is an Answer Engine Optimization (AEO) platform. We score websites for AI-citation readiness, deploy structured-data fixes for brands that want the work done, and operate the public brand directory of 1,187,128 sites that produced the dataset behind this brief. Full background in Section 7.

Key terms at a glance
AEO
Answer Engine Optimization. Structuring a site so AI engines (ChatGPT, Claude, Perplexity, Gemini) can read, understand, and cite it. SEO is for search results; AEO is for AI answers.
AI bots
Automated agents that AI companies use to crawl the web. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Gemini), Applebot-Extended (Apple Intelligence), CCBot (Common Crawl).
AEO Score
Engagemii's 0-10 measurement of a site's AI-visibility readiness, across six categories. Public on every brand directory page.
Structured data
Machine-readable metadata embedded in a page (typically JSON-LD format) that describes what the page is about. The primary signal AI engines use to understand a brand.
LightGBM, SHAP, R²
The ML toolkit and reporting metrics used in this brief. Full definitions in Section 6 (Glossary) at the end.

Three findings. First, sites with a full slate of structured-data signals are roughly 2.09× as likely to be crawled by AI bots at all as sites with none (57.0% vs 27.2%). Without basic structured data, most sites are invisible. Second, among sites that are crawled, brands at AEO score 8 average 4.8× the crawls of brands at score 4 (6.3 vs 1.3 crawls per brand, raw means from 395,022 crawled sites). Third, the top 395 most-crawled brand pages in our directory are ~75× as likely to have an AEO score of 8 or higher as the bottom 90%. And they are not famous brands. They are dentists, regional ecommerce stores, mortgage brokers, and other small and mid-size businesses that have done the AEO work.

1. Dataset

The training data is a single snapshot of the Engagemii brand directory taken on 2026-05-29. Every row is one brand. Every brand has 22 features (mostly binary flags indicating presence or absence of structured-data signals like FAQ schema, Organization schema, llms.txt, multi-page JSON-LD blocks, contact email visibility, US-state location, identity-mismatch flag, etc.), an Engagemii AEO score (0-10), and an observed crawl count drawn from our bot-tracking pipeline.

1,187,128 brands in directory
395,022 crawled at least once
581,296 observed AI bot events

AI bots tracked include GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, Anthropic AI, Google-Extended, Applebot, Amazonbot, Meta-ExternalAgent, Bytespider, CCBot, and approximately ten others. Bots are identified by matching each request's User-Agent against the bot lists published by each operator. Visits are aggregated daily and stored in MongoDB.
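A minimal sketch of User-Agent matching of the kind described above. The token table is a hypothetical subset chosen for illustration, not Engagemii's actual list, and the function names are assumptions; a real pipeline would load each operator's published identifiers.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical subset of User-Agent tokens, for illustration only.
AI_BOT_TOKENS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "CCBot": "Common Crawl",
}


def identify_ai_bot(user_agent: str) -> Optional[str]:
    """Return the first known bot token found in the User-Agent, else None."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None


def count_bot_visits(user_agents: Iterable[str]) -> Counter:
    """Aggregate a batch of request User-Agents into per-bot visit counts."""
    hits = (identify_ai_bot(ua) for ua in user_agents)
    return Counter(bot for bot in hits if bot is not None)


daily_counts = count_bot_visits([
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
])
```

User-Agent strings are self-reported by the client, so substring matching like this counts anything that claims to be a given bot.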

2. Method

A naive approach would train one model to predict crawl count directly from structured-data features. That approach gives misleading results because the AEO score we already publish for every brand is a function of those same features. Putting the score in the model alongside its constituent signals causes severe multicollinearity: the score absorbs the credit and the individual signals look small. So we built the score-to-crawl part of the analysis in two stages to mirror the underlying causal structure. A third model, the Stage A visibility classifier reported in Section 3.1, separately addresses whether a site is crawled at all.

Stage 1: Features → AEO Score

Stage 1 is a LightGBM gradient-boosted regressor trained to predict each brand's AEO score from its 17 binary structured-data feature flags. It is trained on 80% of the data and evaluated on a held-out 20%. SHAP (SHapley Additive exPlanations) values, computed on a 50,000-row sample, provide per-feature attribution: for each feature, how many score points its presence typically adds or subtracts for a given brand.
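To make the attribution idea concrete, here is a sketch that computes exact Shapley values by brute force for a toy scoring function. Everything in it is an assumption for illustration: `toy_score`, its weights, and the three feature names are hypothetical and are not Engagemii's scorer. It also uses "feature absent" as the baseline, whereas the SHAP library averages over background data and uses a tree-specific algorithm for models such as LightGBM.

```python
from itertools import combinations
from math import factorial


def toy_score(present: frozenset) -> float:
    """Hypothetical scorer: base score plus per-feature points and one interaction."""
    score = 3.0
    if "faq_schema" in present:
        score += 1.0
    if "org_schema" in present:
        score += 1.5
    if "llms_txt" in present:
        score += 0.5
    # Interaction: FAQ and Organization schema together earn a bonus point.
    if {"faq_schema", "org_schema"} <= present:
        score += 1.0
    return score


def shapley_values(features, value_fn):
    """Exact Shapley values, enumerating every coalition of the other features."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k in the Shapley average.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        result[f] = total
    return result


FEATURES = ["faq_schema", "org_schema", "llms_txt"]
phi = shapley_values(FEATURES, toy_score)
```

In this toy case the interaction bonus is split evenly between the two schema features, and the three values sum to the gap between the all-features score and the no-features score. Brute-force enumeration grows exponentially with the number of features, so it is only practical for small examples like this one.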

Stage 2: AEO Score → Crawl Count

Stage 2 is a second LightGBM regressor trained to predict the natural log of (crawl_count + 1) from the AEO score alone. We deliberately exclude business category and PageRank-style domain popularity from this stage. Including them as controls is statistically tempting but conceptually wrong: a brand cannot change its business category or domain popularity in response to a fix, so the unconditional score-to-crawls relationship is what matters for the product claim. The Stage 2 coefficient is therefore interpretable as the typical multiplicative effect of a one-point score lift on crawl count.
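The link between a log-transformed target and a multiplicative effect can be sketched with a plain least-squares line. This is a stand-in chosen for brevity, not the LightGBM model the brief uses, and the data below are hypothetical values constructed so that (crawls + 1) grows by exactly 4% per score point.

```python
import math


def fit_log_linear(scores, crawl_counts):
    """Least-squares fit of log(crawl_count + 1) = intercept + slope * score."""
    y = [math.log1p(c) for c in crawl_counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(scores, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def per_point_multiplier(slope):
    """A slope in log space is a multiplicative effect on (crawl_count + 1)."""
    return math.exp(slope)


# Hypothetical data: (crawls + 1) = 10 * 1.04 ** score.
scores = [2, 3, 4, 5, 6, 7, 8]
crawls = [10 * 1.04 ** s - 1 for s in scores]
slope, intercept = fit_log_linear(scores, crawls)
multiplier = per_point_multiplier(slope)
```

Because the target is log(crawl_count + 1), the multiplier strictly applies to crawl_count + 1; for brands with only one or two crawls the difference from a multiplier on the raw count is noticeable.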

Chaining

We chain the two stages to translate a feature's Stage-1 score lift into a Stage-2 crawl impact. For any feature whose presence raises a brand's expected score by Δ points, the expected crawl multiplier is approximately (1 + 0.040)^Δ, using the +4.0% per-point effect reported in Section 3.4. The chained value is the number we use in the per-brand admin tool and the only one we publish in the per-feature ranking; raw per-feature SHAP values are kept private.

Goodness of fit (20% holdouts):
  • Stage A (visibility classifier) AUC = 0.612. The classifier separates crawled from uncrawled sites better than chance (AUC 0.5), though the margin is modest.
  • Stage 1 (features → AEO score) R² = 0.633. Structured-data features explain about 63% of AEO score variance, which is what lets us attribute a feature's effect on the score back to specific signals.
  • Stage 2 (AEO score → crawl count; called Stage B in Section 3) R² = 0.123. AEO score (unconditional) explains about 12% of crawl-count variance. The remainder lives in business category, domain popularity, brand-name recognition, and brand-specific noise. The R² is small because most crawl variation comes from structural factors a brand cannot change. The AEO score is one of the few it can.
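For readers who want to see what the two metrics above measure, here is a small sketch of both, written from their standard definitions. The function names are illustrative, and the pairwise AUC loop is quadratic in the number of rows, so it suits small examples rather than a million-row holdout.

```python
def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Predicting the mean for every row gives R² = 0; a perfect fit gives 1.
baseline_r2 = r_squared([1, 2, 3], [2, 2, 2])
perfect_r2 = r_squared([1, 2, 3], [1, 2, 3])

# Four sites, two crawled (label 1); three of the four positive/negative pairs are ordered correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation of crawled from uncrawled sites.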

3. Findings

3.1 Sites with structured data are roughly twice as likely to be crawled at all

Across the full 1,187,128-site directory, 33.3% of sites have been observed being crawled by an AI bot. The other two thirds are invisible to AI engines. The single strongest predictor of whether a site is in the visible group is whether it has the structured-data signals AI engines look for.

The Stage A model holds all other variables aside (no AEO score, no demographic controls) and toggles the 16 positive structured-data signals on and off. Sites with all signals present are observed to be crawled at 57.0%; sites with none are crawled at 27.2%. The multiplier is 2.09×.

The Stage A finding is the strongest single claim in this brief because the binary visibility question avoids the noise that affects the volume question. Whether a site is in the AI-crawled population is much cleaner to model than how often it is crawled once it is in that population.

3.2 Top-crawled brands look dramatically different from the rest

Restricted to the 395,022 sites that are crawled, the distribution of crawl counts is extremely right-skewed: a small head of sites accounts for a disproportionate share of the AI bot traffic. We cohort the crawled population by crawl rank and compare AEO score and structured-data adoption across cohorts.

Cohort | Sites | Mean crawls | Mean AEO score | % with score 8+ | % with 8+ signals
Top 0.1% | 395 | 38.8 | 6.26 | 29.9% | 41.8%
Top 1% | 3,950 | 6.6 | 5.43 | 3.6% | 12.1%
Top 10% | 39,502 | 3.1 | 5.20 | 0.7% | 12.1%
Bottom 90% | 355,520 | 1.3 | 4.66 | 0.4% | 5.9%

The top 0.1% of crawled sites (the 395-site head of the distribution) average 39 crawls each, against 1.3 for the bottom 90%. Their AEO score-8+ rate is 29.9% vs 0.4%, a 75× gap. The structured-data gap is similar in shape. Looking at the actual top-crawled brands confirms they are not Fortune 500 companies. They are small and mid-size businesses (local dentists, niche ecommerce, professional services firms) that have invested in their AEO signals.
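The cohort comparison can be sketched as a rank-and-slice over the crawled population. The rows below are hypothetical and sized so the arithmetic is easy to follow; the field names `crawls` and `score` and both function names are assumptions, not the brief's actual schema.

```python
def split_by_rank(brands, top_fraction):
    """Rank brands by crawl count (descending) and split off the top fraction."""
    ranked = sorted(brands, key=lambda b: b["crawls"], reverse=True)
    size = max(1, round(len(ranked) * top_fraction))
    return ranked[:size], ranked[size:]


def summarise(cohort):
    """Cohort size, mean crawls, mean AEO score, and share with score 8 or higher."""
    n = len(cohort)
    return {
        "sites": n,
        "mean_crawls": sum(b["crawls"] for b in cohort) / n,
        "mean_score": sum(b["score"] for b in cohort) / n,
        "share_score_8_plus": sum(1 for b in cohort if b["score"] >= 8) / n,
    }


# Hypothetical population of 1,000 crawled brands.
brands = (
    [{"crawls": 40, "score": 8}] * 3
    + [{"crawls": 30, "score": 6}] * 7
    + [{"crawls": 1, "score": 8}] * 4
    + [{"crawls": 1, "score": 4}] * 986
)

top, rest = split_by_rank(brands, 0.01)
top_stats = summarise(top)
rest_stats = summarise(rest)
gap = top_stats["share_score_8_plus"] / rest_stats["share_score_8_plus"]
```

Here `gap` plays the role of the 75× figure in the text: the ratio of score-8+ rates between the head of the distribution and everyone else.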

3.3 Among crawled sites, mean crawl count climbs sharply at AEO score 8

The same data, cut by integer AEO score, shows the same pattern from a different angle. Mean crawls per brand stay flat in the 1.3-1.9 range across scores 3-7, then jump sharply at score 8.

AEO score | Brands | Median crawls | Mean crawls
3 | 59,286 | 1.0 | 1.4
4 | 117,439 | 1.0 | 1.3
5 | 111,215 | 1.0 | 1.3
6 | 88,316 | 2.0 | 1.9
7 | 16,310 | 2.0 | 1.8
8 | 1,522 | 1.0 | 6.3
9 | 161 | 1.0 | 9.8
10 | 7 | 13.0 | 9.3
[Figure 1: chart of mean crawls per brand (y-axis) by AEO score (x-axis).]

Figure 1. Mean AI bot crawls per brand, by AEO score band. The sharp jump at score 8+ is the headline visual.

[Figure 2: chart of crawled brands (y-axis, log scale) by AEO score (x-axis).]

Figure 2. Distribution of crawled brands across AEO score bands (log scale). Sites at score 8+ are rare; the directory's mass sits at 4-6.
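The per-score cut in this section reports both a median and a mean for each score band. A sketch of that grouping follows, with hypothetical rows chosen to show how one heavily crawled brand can lift a band's mean well above its median, which is one plausible reading of the score-8 row given the right-skew described in Section 3.2.

```python
from collections import defaultdict
from statistics import mean, median


def crawls_by_score(rows):
    """Group (aeo_score, crawl_count) pairs by score; report count, median, mean."""
    groups = defaultdict(list)
    for score, crawls in rows:
        groups[score].append(crawls)
    return {
        score: {"brands": len(v), "median": median(v), "mean": mean(v)}
        for score, v in sorted(groups.items())
    }


# Hypothetical rows: four score-8 brands crawled once, one crawled 21 times.
rows = [(4, 1), (4, 1), (4, 2), (8, 1), (8, 1), (8, 1), (8, 1), (8, 21)]
table = crawls_by_score(rows)
```

In this toy data the score-8 band has a median of 1 crawl but a mean of 5, the same shape as a band whose typical member is crawled once while a few members are crawled heavily.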

3.4 Per-point model effect compounds across multiple score points

The unconditional Stage B model translates the score-to-crawl pattern into a continuous coefficient: a one-point lift in AEO score is associated with a +4.0% lift in expected crawl rate. Because the model is multiplicative, this compounds:

Score change | Expected crawl lift
+1 point | +4.0%
+2 points | +8.2%
+4 points | +17.0%
+6 points | +26.5%
+8 points (score 2 → 10) | +36.9%
[Figure 3: chart of expected crawl lift (y-axis) by AEO score points added (x-axis).]

Figure 3. Expected crawl lift as a function of AEO score points added, compounded at +4% per point. Note: this is the unconditional model average; the descriptive cuts in 3.2 and 3.3 capture the same effect with less smoothing.
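The compounding in the table above is a one-line calculation. The sketch below reproduces it from the +4.0% per-point figure stated in this section; the function name and default argument are illustrative.

```python
def expected_crawl_lift(points_added: float, per_point_lift: float = 0.04) -> float:
    """Compound a per-point lift over a score change; returns the fractional lift."""
    return (1.0 + per_point_lift) ** points_added - 1.0


for delta in (1, 2, 4, 6, 8):
    print(f"+{delta} points: {expected_crawl_lift(delta):+.1%}")
# +1 points: +4.0%
# +2 points: +8.2%
# +4 points: +17.0%
# +6 points: +26.5%
# +8 points: +36.9%
```

The score change does not have to be a whole number, so the same function covers the chaining step in Section 2, where a feature's Stage-1 score lift Δ is usually fractional.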

3.5 Why investing in AEO is hard to replicate without dedicated tooling

A natural follow-up question: if the structured-data signals are public knowledge, why doesn't every site simply deploy them? In practice, two things make this harder than it looks. First, correctly implementing JSON-LD schema across an entire site (Organization, Product, FAQPage, LocalBusiness, BreadcrumbList) is fiddly: errors silently invalidate the markup, and most CMS plugins emit only a subset. Second, the score moves over time as AI bot operators update their crawling and citation preferences. The combination of correct implementation and ongoing monitoring is the work, not the disclosure of which signals matter. Dedicated AEO platforms exist because the gap between “I know what to do” and “my markup is valid, complete, and current six months from now” is wider than most teams budget for.
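As an illustration of both the markup and the "errors silently invalidate" problem, here is a sketch that builds a Schema.org FAQPage block and runs a very shallow sanity check on it. The helper names and the sample question are hypothetical, and the check only catches broken JSON and a missing `@context` or `@type`; it recognizes only the plain-string form of `@context` and is no substitute for a full structured-data validator.

```python
import json


def faq_jsonld(pairs):
    """Build a Schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def basic_problems(raw: str):
    """Return obvious problems in a JSON-LD string; an empty list is not proof of validity."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level is not an object"]
    problems = []
    context = str(doc.get("@context", "")).rstrip("/")
    if context not in ("https://schema.org", "http://schema.org"):
        problems.append("missing or unexpected @context")
    if "@type" not in doc:
        problems.append("missing @type")
    return problems


markup = json.dumps(
    faq_jsonld([("Do you accept new patients?", "Yes, call the front desk to book.")]),
    indent=2,
)
good = basic_problems(markup)
bad = basic_problems('{"@type": "FAQPage",}')  # trailing comma breaks the JSON
```

The resulting string would be embedded in a page inside a `<script type="application/ld+json">` element. The trailing-comma case is typical of the failure mode described above: the page still renders normally for human visitors while the block no longer parses as JSON.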

4. Caveats

Stage 2 R² = 0.123. The AEO score is one input among many to a brand's crawl rate. Business category, domain popularity, brand-name recognition, backlink graph density, and brand-specific factors we do not model collectively account for the remaining ~88% of crawl-count variance. Our score moves the needle but does not, on its own, predict the absolute crawl count for any individual brand.
This is observational, not experimental. The model is fit on cross-sectional data: brands at high scores tend to be crawled more than brands at low scores. We have not run a randomized controlled trial in which structured-data fixes are deployed on a treatment group and withheld from a matched control. The causal interpretation (fix the signals → score rises → crawls rise) is consistent with how AI bots are documented to work, but it is supported by correlation plus plausible mechanism rather than by direct experimental evidence.
Bot sample is opportunistic. Engagemii sees crawls that reach our directory pages, not crawls on the brand's own site. The directory is large and frequently crawled, which makes it a usable proxy, but a brand-side measurement instrument (we are deploying one) will be required to claim that our directory crawl rate predicts crawl rate on the brand's own domain.

We expect to extend this work in three directions: (1) per-bot models so we can attribute crawl behavior to specific operators (GPTBot vs. ClaudeBot vs. PerplexityBot), (2) a before-after study on a cohort of Engagemii Fix-It customers once 30-60 days of on-brand crawl tracking has accumulated, and (3) extending the chain to citations as outcome rather than crawls. The citation outcome is the variable customers most care about.

5. Tools and methods used

For interested readers, the model was trained with the following standard tools. All are open-source and free.

The model is retrained periodically as the underlying directory and crawl tracking accumulate new observations. The numbers reported here reflect the snapshot dated 2026-05-29.

6. Glossary

Plain-language definitions for the terms used in this brief. Included so readers from outside the AEO and machine-learning communities can evaluate the work on its merits.

AEO: Answer Engine Optimization
The discipline of structuring a website so that AI engines (ChatGPT, Claude, Perplexity, Gemini, Google AI Overviews, etc.) can read, understand, and cite it when answering user questions. Distinct from traditional SEO (Search Engine Optimization), which optimizes for keyword rankings rather than AI citations.
AI bot / AI crawler
Automated software operated by an AI company to fetch web content for two purposes: building a training corpus, and retrieving live content to use in generated answers. Each bot is identified by its User-Agent string in HTTP request headers. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Google Gemini and AI Overviews), Applebot-Extended (Apple Intelligence), Amazonbot (Alexa AI), CCBot (Common Crawl, used by many training pipelines).
Structured data / JSON-LD
Machine-readable metadata embedded in a web page that describes the page's content in terms a software agent can parse. JSON-LD is the format AI engines prefer. Common schema types include Organization (who the brand is), Product (what it sells), FAQPage (a list of question-answer pairs), and LocalBusiness (physical location and contact data).
Schema.org
The open vocabulary that defines the types and properties used in structured data, maintained by a community founded by Google, Microsoft, Yahoo, and Yandex. The types referenced above (Organization, Product, FAQPage, etc.) are all Schema.org types.
llms.txt
An emerging convention: a plain-text file at the root of a domain (parallel to robots.txt) that provides AI crawlers with a structured summary of the site, its key URLs, and the questions the brand's content answers. Adoption is growing rapidly among AI-native publishers.
E-E-A-T
Experience, Expertise, Authoritativeness, and Trust. A framework originally documented in Google's Search Quality Evaluator Guidelines and now widely treated as a heuristic for whether a piece of content is credible enough to be cited by AI engines. Concrete signals include named authors with verifiable credentials, customer reviews with structured data, third-party press mentions, and contact information.
AEO Score
Engagemii's composite measurement of a website's AI visibility readiness on a 0-10 scale, evaluated across six categories: Structured Data, Content Structure, Entity Clarity, E-E-A-T Signals, Technical AEO (crawler access, llms.txt, etc.), and AI Discoverability. The category definitions are public (see methodology); the relative weighting and specific sub-signals are proprietary.
LightGBM
An open-source gradient-boosting library developed at Microsoft Research, open-sourced in 2016 and described in a 2017 paper. Gradient boosting is a machine-learning technique that builds a strong predictor by training many small decision trees in sequence, where each tree corrects errors made by the previous ones. LightGBM is one of the dominant tools in tabular machine learning, widely used in finance, advertising, and competition data science.
SHAP: SHapley Additive exPlanations
A method for explaining individual predictions made by complex machine-learning models. Based on Shapley values, a concept from cooperative game theory that fairly distributes credit for an outcome among contributing players. In this context, SHAP tells us how much each feature (FAQ schema, llms.txt, etc.) contributed to a brand's predicted AEO score, on a per-brand basis.
R² (coefficient of determination)
A statistical measure, typically between 0 and 1, indicating the proportion of variance in the outcome that the model's inputs explain. An R² of 1.0 means perfect prediction; an R² of 0 means the model is no better than guessing the average, and on held-out data it can fall below 0 if the model does worse than that. R² values are reported on a held-out 20% sample of the data that the model has not seen during training.
MAE (mean absolute error)
The average size of the model's prediction errors on the held-out sample, ignoring direction. We report it in log-units because the crawl-count outcome is log-transformed (counts span four orders of magnitude in the data).
Causal mediation
A statistical framework for studying chains of cause and effect. In this brief: structured-data features → AEO score → crawl count. Mediation analysis lets us attribute each feature's downstream effect on crawls through the intermediate score, rather than mis-attributing credit to the score variable itself when it is in fact a summary of the features.
Common Crawl
A public, non-profit project that maintains an open archive of web crawl data, used as a primary or supplementary training input by most large language models. Their crawler (CCBot) is one of the most common AI-related bots in our data.

7. About Engagemii

Engagemii is an Answer Engine Optimization (AEO) platform. The company exists to help brands become readable, understandable, and citable by the AI engines that increasingly answer customers' questions in place of traditional search results.

What Engagemii AEO does, concretely:

Current pricing is published at engagemii.com/aeo/pricing.

The findings in this brief should be read as the product of the same observation infrastructure Engagemii operates commercially. The model is not a separate research project; it is a public window onto the directory we already maintain.


How to cite this brief

Pellitteri, G., & Engagemii Research. (2026). What makes brands visible to AI bots: a three-stage analysis of 1.19M sites. Engagemii Research Brief, May 2026. https://engagemii.com/research/aeo-crawl-drivers

License: brand directory and scores are published under CC BY 4.0. The trained model, scoring weights, and training dataset are proprietary.

Contact: engagemii.com/aeo