How I optimized my blog for AI search engines

People don’t just find websites through classic search results anymore. Discovery is increasingly happening through answer engines, AI summaries, and chat-based tools.

Because of this shift, I wanted to make my blog easier for both traditional crawlers and modern AI systems to parse. Don’t get me wrong—there isn’t some secret “AI SEO” trick that guarantees your site will get cited. It’s really just about making things easier to crawl, understand, and summarize.

The priorities

When tackling this, I split the work into two distinct buckets:

  • Proven baseline improvements: Things like structured data, canonical URLs, Open Graph metadata, solid sitemap coverage, and clear robots.txt rules.
  • Optional machine-readable indexes: Files like llms.txt and llms-full.txt. While useful, they aren’t standardized in quite the same way yet.

That distinction is actually pretty important. I definitely wouldn’t recommend skipping the basics just to jump straight into adding an llms.txt file.

Structured data and metadata

Honestly, the most impactful improvements were the boring ones.

I went ahead and added site-level and article-level structured data. This way, machines can answer simple questions about the site without having to guess:

  • what the site is about
  • who writes it
  • what a specific page represents
  • how a single blog post fits into the broader site structure

This effort resulted in four main schema types:

  • WebSite
  • Person
  • BlogPosting
  • BreadcrumbList
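As an illustration, a BlogPosting entry can be quite small. This is a sketch, not my exact markup—the date and author fields here are placeholders, and which properties you include will depend on your site:

```json
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "How I migrated my blog from Gatsby to Astro",
  "description": "A complete guide to moving a personal blog from Gatsby to Astro.",
  "author": {
    "@type": "Person",
    "name": "Your Name"
  },
  "datePublished": "2025-01-01",
  "mainEntityOfPage": "https://theodoroskokosioulis.com/blog/gatsby-to-astro-migration"
}
```

Embedded in a `<script type="application/ld+json">` tag, that alone answers most of the questions in the list above.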

Down at the HTML level, I also double-checked that every single page includes:

  • a canonical URL
  • a genuinely useful meta description
  • Open Graph tags for rich previews
  • Twitter card tags
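Concretely, the `<head>` of a post ends up with a cluster of tags along these lines (URLs and titles here are placeholders):

```html
<link rel="canonical" href="https://example.com/blog/my-post" />
<meta name="description" content="A one-sentence summary of the post." />
<meta property="og:title" content="My Post" />
<meta property="og:description" content="A one-sentence summary of the post." />
<meta property="og:image" content="https://example.com/og/my-post.png" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="My Post" />
```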

These tweaks don’t just help classic search engines—they make the entire site much easier for other tools to interpret correctly.

robots.txt for explicit crawler access

If you want AI crawlers to actually access your content, you need to be explicit about it.

My robots.txt still keeps the broad allow rule, but I’ve added named entries specifically for the bots I want to permit:

public/robots.txt:

```txt
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /
```

Adding these isn’t a promise that every AI system will suddenly crawl, cite, or rank the site. It just makes your intent crystal clear and removes an easily avoidable blocker.

llms.txt as an optional index

I also decided to add /llms.txt and /llms-full.txt.

I don’t treat these files as guaranteed ranking factors by any means. Instead, I view them as optional, low-effort indexes that make the site incredibly easy to inspect programmatically.

  • llms.txt serves as the quick, short version.
  • llms-full.txt packs in fuller summaries and richer metadata.

A minimal llms.txt entry can look as simple as this:

public/llms.txt:

```txt
## Technical Articles

### Web Development

- [How I migrated my blog from Gatsby to Astro](https://theodoroskokosioulis.com/blog/gatsby-to-astro-migration): A complete guide to moving a personal blog from Gatsby to Astro.
```

If a tool happens to use those files, it can understand the site a lot faster. If it doesn’t, no harm done—the baseline improvements still stand perfectly well on their own.

security.txt and trust signals

While I was at it, I added a security contact file at /.well-known/security.txt (the standard RFC 9116 location). To be clear, this isn’t some weird growth hack. It’s just a clean, standardized way to expose a security contact, which happens to make the site look a bit more complete and intentional.
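RFC 9116 only requires two fields, `Contact` and a future `Expires` date, so the whole file can be a handful of lines (the address and date below are placeholders):

```txt
Contact: mailto:security@example.com
Expires: 2026-01-01T00:00:00.000Z
Preferred-Languages: en
```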

Keeping the indexes updated

The most obvious downside of having an llms.txt file is the maintenance. Every time you publish a new post, you’ve got another place where metadata can quickly go stale.

I solved that headache with a GitHub Action that uses Cursor Agent. The workflow scopes itself to the posts changed in the current run and refreshes only the relevant index entries automatically, which keeps the index aligned with the actual site content without manual edits.
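The core idea behind that automation is simple enough to sketch without the agent. A minimal Python version—assuming a hypothetical posts directory where each Markdown file carries `title:` and `description:` frontmatter lines, which may not match your setup—could regenerate the entries like this:

```python
from pathlib import Path


def build_llms_entries(posts_dir: str, base_url: str) -> str:
    """Rebuild llms.txt link entries from each post's frontmatter."""
    entries = []
    for post in sorted(Path(posts_dir).glob("*.md")):
        title = description = None
        for line in post.read_text().splitlines():
            if line.startswith("title:"):
                title = line.split(":", 1)[1].strip().strip('"')
            elif line.startswith("description:"):
                description = line.split(":", 1)[1].strip().strip('"')
        if title and description:
            # Derive the public URL from the file name (the slug).
            entries.append(f"- [{title}]({base_url}/blog/{post.stem}): {description}")
    return "\n".join(entries)
```

Wiring something like this into CI means the index can never drift more than one deploy behind the content.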

Checklist

If you’re looking to do this yourself, here are the changes actually worth making:

| Item | Why I added it |
| --- | --- |
| Canonical URLs | One clear URL per page |
| Open Graph and Twitter tags | Better metadata for previews and parsers |
| WebSite, Person, BlogPosting, BreadcrumbList | Machine-readable structure |
| robots.txt rules | Explicit crawler permissions |
| Sitemap and RSS feed | Baseline discovery signals |
| /llms.txt and /llms-full.txt | Optional machine-readable site indexes |
| /.well-known/security.txt | Standard security contact and trust signal |

How I verified it

I checked my implementation using a few straightforward methods:

  1. Running pages through Google’s Rich Results Test to catch any glaring structured data issues.
  2. Validating the JSON-LD using the Schema.org validator.
  3. Visiting /robots.txt, /llms.txt, and /.well-known/security.txt directly in the browser just to confirm they’re public and up to date.
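For quick spot checks beyond the hosted validators, a few lines of Python can at least confirm that a page's JSON-LD parses. This is a rough sketch—a real HTML parser would be more robust than a regex, and script tags with extra attributes would slip past it:

```python
import json
import re


def extract_jsonld(html: str) -> list:
    """Find and parse every JSON-LD block embedded in a page."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    return [json.loads(block) for block in re.findall(pattern, html, re.DOTALL)]


# Example: feed in fetched page HTML and inspect the parsed schema objects.
page = '<script type="application/ld+json">{"@type": "BlogPosting"}</script>'
schemas = extract_jsonld(page)
```

If `json.loads` raises, you have a malformed block that the validators will reject too.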

What is actually worth doing

If you only have time to do three things, start here:

  1. Clean up your metadata and canonical URLs.
  2. Add structured data that accurately matches the page content.
  3. Keep your overall content organized and easy to crawl.

Once that’s done, you can think about adding an llms.txt file—but only if you’re willing to maintain it and actually see value in publishing a machine-readable index of your site.

My general rule for AI-facing SEO is pretty simple: nail the durable basics first, and then you can start experimenting with the optional extras.