What makes brands visible to AI bots: a three-stage analysis of 1.19M sites
We analysed every site in the Engagemii directory (1,187,128 root domains) to answer two questions in order: which brands get crawled by AI bots at all, and among those that are crawled, which get crawled the most. Both questions matter for different reasons, and the answers come from different parts of the data. This brief reports what we found, the methods we used, and the parts of the answer we still cannot explain. It is written for technically minded readers who want to evaluate the methodology rather than read marketing copy.
A note on terminology: each row in our dataset is one root domain. We refer to a root domain as a “brand” or a “site” throughout; the terms are interchangeable.
1. Dataset
The training data is a single snapshot of the Engagemii brand directory taken on 2026-05-29. Every row is one brand. Every brand has 22 features (mostly binary flags indicating presence or absence of structured-data signals like FAQ schema, Organization schema, llms.txt, multi-page JSON-LD blocks, contact email visibility, US-state location, identity-mismatch flag, etc.), an Engagemii AEO score (0-10), and an observed crawl count drawn from our bot-tracking pipeline.
AI bots tracked include GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, anthropic-ai, Google-Extended, Applebot, Amazonbot, Meta-ExternalAgent, Bytespider, CCBot, and approximately ten others. Bots are identified by matching the User-Agent header against the bot lists each operator publishes. Visits are aggregated daily and stored in MongoDB.
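The matching and daily aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the token list is a hypothetical subset, the `(day, domain, user_agent)` row shape is an assumption, and the MongoDB storage step is left out.

```python
from collections import Counter
from typing import Optional

# Hypothetical subset of bot tokens; a real list would come from each
# operator's published documentation.
AI_BOT_TOKENS = (
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "PerplexityBot", "CCBot",
)

def identify_bot(user_agent: str) -> Optional[str]:
    """Return the first known bot token found in the User-Agent, else None."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None

def daily_counts(log_rows):
    """Aggregate (day, domain, user_agent) rows into per-day, per-domain, per-bot counts."""
    counts = Counter()
    for day, domain, user_agent in log_rows:
        bot = identify_bot(user_agent)
        if bot is not None:
            counts[(day, domain, bot)] += 1
    return counts
```

Because a User-Agent header is supplied by the client, matching of this kind cannot by itself rule out spoofed requests.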
2. Method
A naive approach would train one model to predict crawl count directly from structured-data features. That approach gives misleading results because the AEO score we already publish for every brand is a function of those same features. Putting the score in the model alongside its constituent signals causes severe multicollinearity: the score absorbs the credit and the individual signals look small. So we built the model in two stages to mirror the underlying causal structure.
Stage 1: Features → AEO Score
Stage 1 is a LightGBM gradient-boosted regressor trained to predict each brand's AEO score from its 17 binary structured-data feature flags. It is trained on 80% of the data and evaluated on a held-out 20%. SHAP (SHapley Additive exPlanations) values, computed on a 50,000-row sample, provide per-feature attribution: for each feature, how many score points its presence typically adds or subtracts for a given brand.
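To make the idea of per-feature attribution concrete, the sketch below computes exact Shapley values for a toy scoring rule by averaging each feature's marginal contribution over every ordering. Everything here is hypothetical: the three flag names and the `toy_score` rule are invented for illustration and are not the Engagemii score. It also uses an all-absent baseline, whereas the SHAP library attributes relative to the model's average output over a background sample.

```python
from itertools import permutations

FEATURES = ("faq_schema", "org_schema", "llms_txt")  # hypothetical flag names

def toy_score(present: frozenset) -> float:
    """Hypothetical scoring rule used only for this illustration."""
    score = 3.0
    if "faq_schema" in present:
        score += 1.0
    if "org_schema" in present:
        score += 2.0
    # Interaction: llms.txt only adds a point when Organization schema is present.
    if "llms_txt" in present and "org_schema" in present:
        score += 1.0
    return score

def shapley_values(brand_flags: dict) -> dict:
    """Exact Shapley attribution for one brand, relative to an all-absent baseline."""
    on = [f for f in FEATURES if brand_flags.get(f)]
    totals = {f: 0.0 for f in FEATURES}
    orders = list(permutations(on))
    for order in orders:
        present = frozenset()
        for feature in order:
            with_feature = present | {feature}
            # Marginal contribution of adding this feature at this point in the order.
            totals[feature] += toy_score(with_feature) - toy_score(present)
            present = with_feature
    return {f: totals[f] / len(orders) for f in FEATURES}
```

For a brand with all three flags set, the attributions sum to the gap between the full score and the baseline score, and the interaction point is split evenly between `org_schema` and `llms_txt`. Exact enumeration is only feasible for a handful of features; tree-specific algorithms are what make this tractable at 17 features.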
Stage 2: AEO Score → Crawl Count
Stage 2 is a second LightGBM regressor trained to predict the natural log of (crawl_count + 1) from the AEO score alone. We deliberately exclude business category and PageRank-style domain popularity from this stage. Including them as controls is statistically tempting but conceptually wrong: a brand cannot change its business category or domain popularity in response to a fix, so the unconditional score-to-crawls relationship is what matters for the product claim. Because the target is on a log scale, the Stage 2 coefficient is interpretable as the typical multiplicative effect of a one-point score lift on crawl count.
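The link between a log-transformed target and a multiplicative effect can be illustrated with a simple stand-in. The sketch below swaps in an ordinary least-squares line for the LightGBM regressor, purely to show the arithmetic: a slope b on log(crawl_count + 1) corresponds to multiplying (crawl_count + 1) by exp(b) per score point. The function names are hypothetical.

```python
import math

def fit_log_slope(scores, crawl_counts):
    """Least-squares slope of log(crawl_count + 1) on score (a linear stand-in)."""
    ys = [math.log1p(c) for c in crawl_counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    return sxy / sxx

def per_point_multiplier(slope: float) -> float:
    """Multiplicative change in (crawl_count + 1) for a one-point score lift."""
    return math.exp(slope)
```

Note that the multiplier applies to crawl_count + 1 rather than to the raw count; the two are close when counts are large and differ more when counts are near zero.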
Chaining
We chain the two stages to translate a feature's Stage-1 score lift into a Stage-2 crawl impact. For any feature whose presence raises a brand's expected score by Δ points, the expected crawl multiplier is approximately (1 + 0.047)^Δ. The chained value is the number we use in the per-brand admin tool and the only one we publish in the per-feature ranking; raw per-feature SHAP values are kept private.
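As a minimal sketch of the chaining rule, the function below raises one plus the per-point lift to the power of the score lift. The function and parameter names are hypothetical; the default of 0.047 is the per-point value quoted in the chaining formula above, and it is kept as a parameter so a different estimate can be substituted.

```python
def crawl_multiplier(score_lift: float, per_point_lift: float = 0.047) -> float:
    """Expected crawl multiplier for a feature that raises the score by `score_lift` points."""
    return (1.0 + per_point_lift) ** score_lift

# Example: a hypothetical feature whose presence adds 1.5 score points.
example = crawl_multiplier(1.5)
```

A score lift of zero returns a multiplier of exactly 1, and a negative lift (a feature that lowers the score) returns a multiplier below 1.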
Model fit
- Stage A (visibility classifier: crawled vs. not crawled) AUC = 0.612. The classifier is better than chance (AUC 0.5), though far from a perfect separator of crawled and uncrawled sites.
- Stage 1 (features → AEO score) R² = 0.633. Structured-data features explain about 63% of AEO score variance, which is what lets us attribute a feature's effect on the score back to specific signals.
- Stage B (AEO score → crawl count, described as Stage 2 above) R² = 0.123. The unconditional AEO score explains about 12% of crawl-count variance. The remainder lives in business category, domain popularity, brand-name recognition, and brand-specific noise. The R² is small because most crawl variation comes from structural factors a brand cannot change. The AEO score is one of the few factors it can change.
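For readers who want to see what the two fit statistics measure, here is a dependency-free sketch of both. In practice scikit-learn's `roc_auc_score` and `r2_score` would be the usual choice; the pairwise AUC loop below is quadratic in the number of rows and is only meant for small illustrative inputs.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variance to variance around the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

An AUC of 0.5 corresponds to random ranking. An R² of 0 corresponds to always predicting the mean, and the value can fall below 0 on held-out data when predictions are worse than that.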
3. Findings
3.1 Sites with structured data are roughly twice as likely to be crawled at all
Across the full 1,187,128-site directory, 33.3% of sites have been observed being crawled by an AI bot. The other two thirds have no observed AI bot crawl in our tracking data and are, for the purposes of this brief, invisible to AI engines. The single strongest predictor of whether a site is in the visible group is whether it has the structured-data signals AI engines look for.
The Stage A model sets all other variables aside (no AEO score, no demographic controls) and toggles the 16 positive structured-data signals on and off. Sites with all signals present are crawled at a rate of 57.0%; sites with none are crawled at 27.2%. The multiplier is 2.09×.
3.2 Top-crawled brands look dramatically different from the rest
Restricted to the 395,022 sites that are crawled, the distribution of crawl counts is extremely right-skewed: a small head of sites accounts for a disproportionate share of the AI bot traffic. We cohort the crawled population by crawl rank and compare AEO score and structured-data adoption across cohorts.
| Cohort | Sites | Mean crawls | Mean AEO score | % with score 8+ | % with 8+ signals |
|---|---|---|---|---|---|
| Top 0.1% | 395 | 38.8 | 6.26 | 29.9% | 41.8% |
| Top 1% | 3,950 | 6.6 | 5.43 | 3.6% | 12.1% |
| Top 10% | 39,502 | 3.1 | 5.20 | 0.7% | 12.1% |
| Bottom 90% | 355,520 | 1.3 | 4.66 | 0.4% | 5.9% |
The top 0.1% of crawled sites (the 395-site head of the distribution) average 39 crawls each, against 1.3 for the bottom 90%. Their AEO score-8+ rate is 29.9% vs 0.4%, a 75× gap. The structured-data gap is similar in shape. Looking at the actual top-crawled brands confirms they are not Fortune 500 companies. They are small and mid-size businesses (local dentists, niche ecommerce, professional services firms) that have invested in their AEO signals.
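The cohorting step can be sketched as below. Two assumptions are baked in and should be read as such: cohorts are nested (the Top 1% includes the Top 0.1%), and cohort sizes are obtained by integer division of the crawled population, which reproduces the 395 / 3,950 / 39,502 / 355,520 split for 395,022 sites. The input needs at least two sites.

```python
def cohort_means(crawl_counts):
    """Return {cohort label: (number of sites, mean crawls)} for rank-based cohorts."""
    ranked = sorted(crawl_counts, reverse=True)
    n = len(ranked)

    def mean(values):
        return sum(values) / len(values)

    cohorts = {}
    # Nested head cohorts: each is the top n // divisor sites by crawl count.
    for label, divisor in (("Top 0.1%", 1000), ("Top 1%", 100), ("Top 10%", 10)):
        k = max(1, n // divisor)
        cohorts[label] = (k, mean(ranked[:k]))

    # Everything below the top-10% cutoff.
    cutoff = max(1, n // 10)
    bottom = ranked[cutoff:]
    cohorts["Bottom 90%"] = (len(bottom), mean(bottom))
    return cohorts
```

The same function could be applied to AEO scores or signal counts aligned to the ranked order to fill in the remaining columns of the table.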
3.3 Among crawled sites, mean crawl count climbs sharply at AEO score 8
The same data, cut by integer AEO score, shows the same pattern from a different angle. Mean crawls per brand stay flat in the 1.3-1.9 range across scores 3-7, then jump sharply at score 8.
| AEO score | Brands | Median crawls | Mean crawls |
|---|---|---|---|
| 3 | 59,286 | 1.0 | 1.4 |
| 4 | 117,439 | 1.0 | 1.3 |
| 5 | 111,215 | 1.0 | 1.3 |
| 6 | 88,316 | 2.0 | 1.9 |
| 7 | 16,310 | 2.0 | 1.8 |
| 8 | 1,522 | 1.0 | 6.3 |
| 9 | 161 | 1.0 | 9.8 |
| 10 | 7 | 13.0 | 9.3 |
Figure 1. Mean AI bot crawls per brand, by AEO score band. The sharp jump at score 8+ is the headline visual.
Figure 2. Distribution of crawled brands across AEO score bands (log scale). Sites at score 8+ are rare; the directory's mass sits at 4-6.
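A cut like the one in the table above can be sketched with the standard library. The row shape `(aeo_score, crawl_count)` is an assumption, as is the choice to band scores by truncating to an integer; the brief does not state how fractional scores are assigned to bands.

```python
from collections import defaultdict
from statistics import mean, median

def crawls_by_score(rows):
    """Group (aeo_score, crawl_count) rows by integer score band and summarise each band."""
    groups = defaultdict(list)
    for score, crawls in rows:
        groups[int(score)].append(crawls)  # assumption: band = truncated score
    return {
        band: {"brands": len(values), "median": median(values), "mean": mean(values)}
        for band, values in sorted(groups.items())
    }
```

Reporting the median next to the mean is useful with data this skewed: a band can have a median of 1 and a much higher mean when a few brands in it are crawled heavily, which is the shape the score-8 and score-9 rows show.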
3.4 Per-point model effect compounds across multiple score points
The unconditional Stage B model translates the score-to-crawl pattern into a continuous coefficient: a one-point lift in AEO score is associated with a +4.0% lift in expected crawl rate. Because the model is multiplicative, this compounds:
| Score change | Expected crawl lift |
|---|---|
| +1 point | +4.0% |
| +2 points | +8.2% |
| +4 points | +17.0% |
| +6 points | +26.5% |
| +8 points (score 2 → 10) | +36.9% |
Figure 3. Expected crawl lift as a function of AEO score points added, compounded at +4% per point. Note: this is the unconditional model average; the descriptive cuts in 3.2 and 3.3 capture the same effect with less smoothing.
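The compounding in the table follows directly from raising 1.04 to the number of points added. A short sketch, with a hypothetical function name, that reproduces the table rows:

```python
def compounded_lift_pct(points: int, per_point: float = 0.04) -> float:
    """Percentage lift in expected crawls after `points` score points, compounded."""
    return ((1.0 + per_point) ** points - 1.0) * 100.0

for points in (1, 2, 4, 6, 8):
    print(f"+{points} points: +{compounded_lift_pct(points):.1f}%")
# +1 points: +4.0%
# +2 points: +8.2%
# +4 points: +17.0%
# +6 points: +26.5%
# +8 points: +36.9%
```

Compounding matters more as the gap widens: eight points at a simple additive 4% would give +32%, while the multiplicative model gives +36.9%.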
3.5 Why investing in AEO is hard to replicate without dedicated tooling
A natural follow-up question: if the structured-data signals are public knowledge, why doesn't every site simply deploy them? In practice, two things make this harder than it looks. First, correctly implementing JSON-LD schema across an entire site (Organization, Product, FAQPage, LocalBusiness, BreadcrumbList) is fiddly: errors silently invalidate the markup, and most CMS plugins emit only a subset. Second, the score moves over time as AI bot operators update their crawling and citation preferences. The combination of correct implementation and ongoing monitoring is the work, not the disclosure of which signals matter. Dedicated AEO platforms exist because the gap between “I know what to do” and “my markup is valid, complete, and current six months from now” is wider than most teams budget for.
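The "errors silently invalidate the markup" point is easy to demonstrate. The sketch below builds a minimal Organization block and runs a deliberately small check on it. The brand name and URL are placeholders, `check_jsonld` is a hypothetical helper, and the check covers only JSON syntax and two required keys; it is not a Schema.org validator and does not handle top-level arrays or `@graph` documents.

```python
import json

REQUIRED_KEYS = ("@context", "@type")  # minimal assumption, not a full validator

def check_jsonld(raw: str) -> list:
    """Return a list of problems found in a JSON-LD string (empty list = none found)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    return [f"missing {key}" for key in REQUIRED_KEYS if key not in data]

# Placeholder Organization block; values are illustrative only.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Dental",
    "url": "https://www.example.com",
}
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(organization)
    + "</script>"
)
```

A single trailing comma, as in `{"@type": "Organization",}`, is enough to make the whole block unparseable, and nothing on the rendered page would reveal the problem.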
4. Caveats and future work
We expect to extend this work in three directions: (1) per-bot models so we can attribute crawl behavior to specific operators (GPTBot vs. ClaudeBot vs. PerplexityBot), (2) a before-after study on a cohort of Engagemii Fix-It customers once 30-60 days of on-brand crawl tracking has accumulated, and (3) extending the chain to citations as outcome rather than crawls. The citation outcome is the variable customers most care about.
5. Tools and methods used
For interested readers, the model was trained with the following standard tools. All are open-source and free.
- LightGBM. Ke et al. (2017), Microsoft Research. Gradient-boosted decision tree library, widely used in tabular ML competitions and production systems.
- SHAP (SHapley Additive exPlanations). Lundberg and Lee (2017). Game-theoretic feature attribution method, the de facto standard for explaining tree-based predictions.
- scikit-learn. Train/test split, R² and MAE metrics.
- pandas and NumPy. Data ingestion and feature preparation.
The model is retrained periodically as the underlying directory and crawl tracking accumulate new observations. The numbers reported here reflect the snapshot dated 2026-05-29.
6. Glossary
Plain-language definitions for the terms used in this brief. Included so readers from outside the AEO and machine-learning communities can evaluate the work on its merits.
- AEO (Answer Engine Optimization). The discipline of structuring a website so that AI engines (ChatGPT, Claude, Perplexity, Gemini, Google AI Overviews, etc.) can read, understand, and cite it when answering user questions. Distinct from traditional SEO (Search Engine Optimization), which optimizes for keyword rankings rather than AI citations.
- AI bot / AI crawler. Automated software operated by an AI company to fetch web content for two purposes: building a training corpus, and retrieving live content to use in generated answers. Each bot is identified by its User-Agent string in HTTP request headers. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Google Gemini and AI Overviews), Applebot-Extended (Apple Intelligence), Amazonbot (Alexa AI), CCBot (Common Crawl, used by many training pipelines).
- Structured data / JSON-LD. Machine-readable metadata embedded in a web page that describes the page's content in terms a software agent can parse. JSON-LD is the format AI engines prefer. Common schema types include Organization (who the brand is), Product (what it sells), FAQPage (a list of question-answer pairs), and LocalBusiness (physical location and contact data).
- Schema.org. The open vocabulary that defines the types and properties used in structured data. It was founded by Google, Microsoft, Yahoo, and Yandex and is maintained through an open community process. The types referenced above (Organization, Product, FAQPage, etc.) are all Schema.org types.
- llms.txt. An emerging convention: a plain-text file at the root of a domain (parallel to robots.txt) that provides AI crawlers with a structured summary of the site, its key URLs, and the questions the brand's content answers. Adoption is growing rapidly among AI-native publishers.
- E-E-A-T. Experience, Expertise, Authoritativeness, and Trust. A framework originally documented in Google's Search Quality Evaluator Guidelines and now widely treated as a heuristic for whether a piece of content is credible enough to be cited by AI engines. Concrete signals include named authors with verifiable credentials, customer reviews with structured data, third-party press mentions, and contact information.
- AEO Score. Engagemii's composite measurement of a website's AI visibility readiness on a 0-10 scale, evaluated across six categories: Structured Data, Content Structure, Entity Clarity, E-E-A-T Signals, Technical AEO (crawler access, llms.txt, etc.), and AI Discoverability. The category definitions are public (see methodology); the relative weighting and specific sub-signals are proprietary.
- LightGBM. An open-source gradient-boosting library from Microsoft Research (Ke et al., 2017). Gradient boosting is a machine-learning technique that builds a strong predictor by training many small decision trees in sequence, where each tree corrects errors made by the previous ones. LightGBM is one of the dominant tools in tabular machine learning, widely used in finance, advertising, and competition data science.
- SHAP (SHapley Additive exPlanations). A method for explaining individual predictions made by complex machine-learning models. Based on Shapley values, a concept from cooperative game theory that fairly distributes credit for an outcome among contributing players. In this context, SHAP tells us how much each feature (FAQ schema, llms.txt, etc.) contributed to a brand's predicted AEO score, on a per-brand basis.
- R² (coefficient of determination). A statistical measure of the proportion of variance in the outcome that the model's inputs explain. An R² of 1.0 means perfect prediction; an R² of 0 means the model is no better than guessing the average. On held-out data the value can fall below 0 when a model does worse than guessing the average. R² values are reported on a held-out 20% sample of the data that the model has not seen during training.
- MAE (mean absolute error). The average size of the model's prediction errors on the held-out sample, ignoring direction. We report it in log-units because the crawl-count outcome is log-transformed (counts span four orders of magnitude in the data).
- Causal mediation. A statistical framework for studying chains of cause and effect. In this brief: structured-data features → AEO score → crawl count. Mediation analysis lets us attribute each feature's downstream effect on crawls through the intermediate score, rather than mis-attributing credit to the score variable itself when it is in fact a summary of the features.
- Common Crawl. A public, non-profit project that maintains an open archive of web crawl data, used as a primary or supplementary training input by most large language models. Its crawler (CCBot) is one of the most common AI-related bots in our data.
7. About Engagemii
Engagemii is an Answer Engine Optimization (AEO) platform. The company exists to help brands become readable, understandable, and citable by the AI engines that increasingly answer customers' questions in place of traditional search results.
What Engagemii AEO does, concretely:
- Scoring. We audit any website on demand and produce an AEO Score (0-10), broken out across six categories: structured data, content structure, entity clarity, E-E-A-T signals, technical AEO, and AI discoverability. The free score is instant; a deeper PDF audit is available.
- Fix-It deployment. For brands that want the work done rather than a list, we generate and deploy the actual fix files: JSON-LD schema blocks (Organization, Product, FAQPage, LocalBusiness), an llms.txt manifest, robots.txt patches that explicitly allow AI crawlers, and an AI Verified badge for visible attestation. Ongoing monitoring tracks citation pickup over time.
- The brand directory. We operate a public directory of 1,187,128 brand pages at engagemii.com/aeo/brands, each scored and indexed for discovery by AI engines. The directory itself is one of the most-crawled AEO information sources on the open web. The 581,296 crawl events that underlie this brief were observed against our directory pages.
- Research and infrastructure. We monitor AI bot traffic across the directory in real time, publish methodology and findings (this brief is one), maintain a tracked-citations API for paying customers, and operate the open data set behind engagemii.com/aeo/methodology under CC BY 4.0.
Current pricing is published at engagemii.com/aeo/pricing.
The findings in this brief should be read as the product of the same observation infrastructure Engagemii operates commercially. The model is not a separate research project; it is a public window onto the directory we already maintain.
How to cite this brief
License: brand directory and scores are published under CC BY 4.0. The trained model, scoring weights, and training dataset are proprietary.
Contact: [email protected] · engagemii.com/aeo