Serve Markdown to Agents with Content Negotiation

Introduction

I wanted a simple thing:

Serve the same blog URL to both humans and AI agents - but give them what they actually need.

Humans → HTML (nice, styled, readable)
Agents → Markdown (clean, lightweight, no nonsense)

No duplicate routes. No overengineering. Just one URL doing the right thing.

The Problem

AI agents are everywhere now. But here’s the issue:

Most websites are still built only for humans.

HTML is:

noisy
bloated
full of layout junk

Agents don’t care about your navbar, animations, or CSS.

They just want:

clean, structured, token-efficient content

Markdown fits that perfectly.

Why even bother?

Short answer: tokens + clarity

Long answer:

HTML wastes tokens (a lot)
Markdown is predictable and easy to parse
Agents perform better with less noise

For example, take this blog post. Tokenized, the HTML version comes to 95k+ tokens (because of all the tags, attributes, and layout code), while the Markdown version is only around 4k+ tokens. That is roughly 24x more tokens spent on HTML formatting alone, a huge difference for agents that have token limits or want to process content efficiently. Serving Markdown lets agents skip the unnecessary HTML and get straight to the content they care about, which can mean faster processing and lower costs. (This is a real-world example. Exact counts vary with the HTML structure and Markdown formatting, but the general point holds: Markdown can be significantly more token-efficient than HTML.)

To visualize this, copy both HTML and Markdown versions of the content and paste them into AI Tokenizer to see the token counts for yourself.

So instead of building separate endpoints like:

/blog/html
/blog/md

Why not just:

same URL → different response based on Accept header

The Approach

At some point it hit me:

I already write in Markdown… why am I converting it away just to serve it?

So I stopped doing that.

Agents → raw Markdown
Humans → rendered HTML

Implementation (Next.js)

This is handled in proxy.ts (edge layer) - not inside page components.
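
The snippets in this section rely on a `chosen` value that is derived from the request's Accept header. As a rough sketch of how that negotiation might look: `chooseRepresentation` is a hypothetical helper (not a Next.js API), and the rules it applies (HTML by default, Markdown only when requested explicitly, Markdown wins a tie) are assumptions rather than the exact logic used on this site.

```typescript
type Representation = 'text/html' | 'text/markdown';

// Hypothetical helper (not a Next.js API): pick a representation from Accept.
function chooseRepresentation(accept: string | null): Representation {
  if (!accept) return 'text/html';

  // Map each listed media type to its q-value (defaults to 1 when absent).
  const weights = new Map<string, number>();
  for (const part of accept.split(',')) {
    const [type, ...params] = part.split(';').map((s) => s.trim().toLowerCase());
    if (!type) continue;
    const qParam = params.find((p) => p.startsWith('q='));
    const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
    weights.set(type, Number.isNaN(q) ? 1 : q);
  }

  // Markdown must be requested explicitly; wildcards only count toward HTML.
  const markdown = weights.get('text/markdown') ?? 0;
  const html =
    weights.get('text/html') ?? weights.get('text/*') ?? weights.get('*/*') ?? 0;

  return markdown > 0 && markdown >= html ? 'text/markdown' : 'text/html';
}

// In proxy.ts this might be used as:
//   const chosen = chooseRepresentation(req.headers.get('accept'));
```

Treating wildcards as a vote for HTML keeps browsers and a plain `curl` (which sends `Accept: */*`) on the HTML path.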

Markdown for `/blogs`

if (pathname === '/blogs' && chosen === 'text/markdown') {
  const res = NextResponse.rewrite(new URL(serverEnvConfig.BLOGS_INDEX_MD_URL));

  res.headers.set('Content-Type', 'text/markdown; charset=utf-8');
  res.headers.set('Vary', 'Accept');

  return res;
}

Markdown for individual blogs

const blogIdMatch = pathname.match(/^\/blogs\/i\/(\d+)$/);

if (blogIdMatch && chosen === 'text/markdown') {
  const id = Number(blogIdMatch[1]);

  const data = await fetch(serverEnvConfig.BLOGS_INDEX_JSON_URL).then((res) =>
    res.json()
  );
  const blogEntry = data.find((entry) => entry.id === id);

  if (!blogEntry?.mdUrl) {
    return NextResponse.rewrite(new URL('/404', req.url), { status: 404 });
  }

  const res = NextResponse.rewrite(new URL(blogEntry.mdUrl));

  res.headers.set('Content-Type', 'text/markdown; charset=utf-8');
  res.headers.set('Vary', 'Accept');
  res.headers.set('Cache-Control', 's-maxage=60, stale-while-revalidate=86400');
  res.headers.set('Content-Signal', 'ai-train=no, search=yes, ai-input=no');

  return res;
}

NOTE: NextResponse.rewrite exposes the original Markdown file URL in the x-middleware-rewrite response header. In the example above the Markdown files are served from a public bucket, so this is not an issue. If your Markdown files live behind a private API or in protected storage, strip or hide this header so that internal endpoints are not exposed to the public.

To strip the x-middleware-rewrite header, I added the code below to my nginx config:

location / {
  # Hide the rewrite target that Next.js adds to proxied responses
  proxy_hide_header x-middleware-rewrite;
}

The related Next.js GitHub discussion: Can I remove x-middleware-rewrite http header from the response headers when using NextResponse.rewrite?

In the snippet above, the Cache-Control header lets Markdown responses be cached at the edge for 60 seconds and then served stale while revalidating for up to 24 hours. This balances freshness with performance for agents.
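
To make the two windows concrete, here is a simplified model of how a shared cache might classify a stored response by its age. `cacheState` is a hypothetical helper written for illustration; real CDNs differ in details, so treat this as a sketch of the directive semantics only.

```typescript
type CacheState = 'fresh' | 'stale-while-revalidate' | 'expired';

// Classify a cached response by its age, given the s-maxage and
// stale-while-revalidate windows (all values in seconds).
function cacheState(
  ageSeconds: number,
  sMaxAge = 60,
  staleWhileRevalidate = 86400
): CacheState {
  // Within s-maxage: serve from cache without contacting the origin.
  if (ageSeconds < sMaxAge) return 'fresh';
  // Past s-maxage but inside the extra window: serve the stale copy
  // immediately and refresh it in the background.
  if (ageSeconds < sMaxAge + staleWhileRevalidate) return 'stale-while-revalidate';
  // Beyond both windows: the cache has to fetch from the origin first.
  return 'expired';
}

console.log(cacheState(30)); // fresh
console.log(cacheState(3600)); // stale-while-revalidate
console.log(cacheState(90000)); // expired
```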

The Content-Signal header explicitly communicates content usage permissions to agents: here, the content may be used for search indexing but not for AI training or AI input. The header was introduced by Cloudflare's Content Signals initiative, which gives website operators a standardized way to express content usage permissions to automated agents. See the Content Signals section below for more details.
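
On the consuming side, an agent could read that header value roughly as follows. `parseContentSignal` is a hypothetical helper, not part of any library; it assumes the comma-separated `name=yes|no` form shown above and leaves unknown or missing signals undefined, since an absent signal neither grants nor restricts permission.

```typescript
type ContentSignals = Record<string, boolean>;

// Hypothetical agent-side helper: turn "ai-train=no, search=yes" into booleans.
// Entries whose value is neither "yes" nor "no" are skipped.
function parseContentSignal(value: string): ContentSignals {
  const result: ContentSignals = {};
  for (const pair of value.split(',')) {
    const [name, setting] = pair.split('=').map((s) => s.trim().toLowerCase());
    if (!name) continue;
    if (setting === 'yes') result[name] = true;
    else if (setting === 'no') result[name] = false;
  }
  return result;
}

const parsed = parseContentSignal('ai-train=no, search=yes, ai-input=no');
console.log(parsed);
```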

Markdown Index (`/blogs`)

This is where things get interesting.

/blogs itself also supports Markdown.

So instead of scraping pages, agents can fetch everything in one request.

Generation

export function jsonToFlatMarkdown(blogs: BlogPost[]): string {
  const header = `# Sivothayan's Blogs

This is a collection of all the blogs that I have written. You can find the latest blogs at:

👉 [Read as Markdown (Accept: text/markdown)](https://sivothayan.com/blogs)
👉 [Read as HTML (Accept: text/html)](https://sivothayan.com/blogs)

This resource supports HTTP content negotiation.

- For Markdown: send header \`Accept: text/markdown\`
- For HTML: send header \`Accept: text/html\`

Example:
\`curl -H "Accept: text/markdown" https://sivothayan.com/blogs\`

---
`;

  const content = blogs
    .filter((b) => b.isPublished)
    .map((blog) => {
      const tags = blog.tags.map((t) => `\`${t}\``).join(', ');

      return `## ${blog.title}

- **Date:** ${blog.date}
- **Read Time:** ${blog.readTime} min
- **Language:** ${blog.language}
- **Tags:** ${tags}

${blog.description}

👉 [Read as Markdown (Accept: text/markdown)](https://sivothayan.com/blogs/i/${blog.id})
👉 [Read as HTML (Accept: text/html)](https://sivothayan.com/blogs/i/${blog.id})`;
    })
    .join('\n\n---\n\n');

  return `${header}\n${content}\n`;
}
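
The function above takes a `BlogPost[]`, whose definition is not shown. A plausible shape, inferred only from the fields the function reads, might look like the sketch below; the exact types (for example, `date` as a preformatted string) and the sample entries are assumptions.

```typescript
// Assumed shape, inferred from the fields jsonToFlatMarkdown reads.
interface BlogPost {
  id: number;
  title: string;
  description: string;
  date: string; // assumed to be a preformatted date string
  readTime: number; // minutes
  language: string;
  tags: string[];
  isPublished: boolean;
}

// Made-up sample entries, for illustration only.
const sample: BlogPost[] = [
  {
    id: 1,
    title: 'Published post',
    description: 'Shows up in the Markdown index.',
    date: '2025-01-01',
    readTime: 5,
    language: 'en',
    tags: ['http', 'nextjs'],
    isPublished: true
  },
  {
    id: 2,
    title: 'Draft post',
    description: 'Filtered out of the index.',
    date: '2025-01-02',
    readTime: 3,
    language: 'en',
    tags: [],
    isPublished: false
  }
];

// The same two steps the generator applies: keep published posts,
// then render each tag as inline code.
const published = sample.filter((b) => b.isPublished);
const tagLine = published[0].tags.map((t) => `\`${t}\``).join(', ');
console.log(published.length, tagLine);
```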

Metadata (Next.js)

You also need to advertise Markdown support.

For the blog's layout page:

import type { Metadata } from 'next';

export const metadata: Metadata = {
  alternates: {
    canonical: `${site}/blogs/i/${id}`,
    types: {
      'text/markdown': `${site}/blogs/i/${id}`,
      'application/rss+xml': `${site}/rss.xml`,
      'application/atom+xml': `${site}/feed.xml`
    }
  }
};

For the blogs index page:

import type { Metadata } from 'next';

export const metadata: Metadata = {
  alternates: {
    canonical: `${site}/blogs`,
    types: {
      'text/markdown': `${site}/blogs`,
      'application/rss+xml': `${site}/rss.xml`,
      'application/atom+xml': `${site}/feed.xml`
    }
  }
};

Without this, agents won’t know Markdown is available.

Testing

If you don’t test this, you’re just guessing.

Markdown

curl -H "Accept: text/markdown" https://sivothayan.com/blogs/i/8

HTML

curl https://sivothayan.com/blogs/i/8

Headers

curl -I -H "Accept: text/markdown" https://sivothayan.com/blogs/i/8

You should see:

Content-Type: text/markdown; charset=utf-8
Vary: Accept
Cache-Control: s-maxage=60, stale-while-revalidate=86400
Content-Signal: ai-train=no, search=yes, ai-input=no
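
The same check can be automated. The sketch below compares a response's headers with the values listed above; `checkMarkdownHeaders` is a hypothetical helper, and it assumes a runtime with the standard `Headers` class (Node 18+ or any edge runtime).

```typescript
// Headers that should match exactly, taken from the list above.
const EXPECTED: Record<string, string> = {
  'content-type': 'text/markdown; charset=utf-8',
  'cache-control': 's-maxage=60, stale-while-revalidate=86400',
  'content-signal': 'ai-train=no, search=yes, ai-input=no'
};

// Hypothetical helper: return one message per header that is missing or wrong.
function checkMarkdownHeaders(headers: Headers): string[] {
  const problems: string[] = [];

  for (const [name, expected] of Object.entries(EXPECTED)) {
    const actual = headers.get(name);
    if (actual === null) problems.push(`${name}: missing`);
    else if (actual.toLowerCase() !== expected.toLowerCase()) {
      problems.push(`${name}: got "${actual}"`);
    }
  }

  // Vary may list several headers, so only require that Accept is among them.
  const vary = (headers.get('vary') ?? '')
    .split(',')
    .map((v) => v.trim().toLowerCase());
  if (!vary.includes('accept')) problems.push('vary: does not include Accept');

  return problems;
}

// Usage against a live response (requires network access):
//   const res = await fetch('https://sivothayan.com/blogs/i/8', {
//     headers: { Accept: 'text/markdown' }
//   });
//   console.log(checkMarkdownHeaders(res.headers));
```

An empty array means every expected header is present.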

Content Signals

Since this setup is explicitly agent-friendly, it also makes sense to be explicit about what is allowed and what is not.

Instead of relying on assumptions, I expose content usage rules directly.

NOTE: Currently, the content signals are communicated via the Content-Signal header. Next.js doesn't have built-in support for adding them to robots.txt through robots.ts or the metadata export, so you need to serve a static robots.txt file from your public directory with the content signals included. That way, agents that respect robots.txt can read and follow the content usage permissions you have set. See the Next.js GitHub discussion: Add support for robots.txt Content-Signals.

Add the below robots.txt file to the root of your website to communicate content usage permissions to agents via standardized content signals:

# As a condition of accessing this website, you agree to abide
# by the following content signals:

# (a)  If a content-signal = yes, you may collect content for
# the corresponding use.
# (b)  If a content-signal = no, you may not collect content for
# the corresponding use.
# (c)  If the website operator does not include a content signal
# for a corresponding use, the website operator neither grants
# nor restricts permission via content signal with respect to
# the corresponding use.

# The content signals and their meanings are:

# search: building a search index and providing search results
# (e.g., returning hyperlinks and short excerpts from your
# website's contents). Search does not include providing
# AI-generated search summaries.
# ai-input: inputting content into one or more AI models (e.g.,
# retrieval augmented generation, grounding, or other real-time
# use for generative AI answers).
# ai-train: training or fine-tuning AI models.

# ANY RESTRICTIONS EXPRESSED VIA CONTENT SIGNALS ARE EXPRESS
# RESERVATIONS OF RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION
# DIRECTIVE 2019/790 ON COPYRIGHT AND RELATED RIGHTS IN THE
# DIGITAL SINGLE MARKET.

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /

`x-markdown-tokens` header

To make it even easier for agents, I also include a custom header x-markdown-tokens that estimates the number of tokens in the Markdown response. This allows agents to quickly assess the content size and make informed decisions about processing it, without having to parse the entire Markdown content first. This is especially useful for agents with token limits or those that want to prioritize certain content based on size.

How to calculate?

I precalculated the token count for each Markdown file using the ai-tokenizer npm package and stored the results in a JSON file (token-map.json). Then I set the header on the response of each Markdown request.

To generate token-map.json:

import { readdir, readFile, writeFile, stat } from 'fs/promises';
import path from 'path';

import Tokenizer, { ModelName, models } from 'ai-tokenizer';
import * as encoding from 'ai-tokenizer/encoding';

const PUBLIC_DIR = path.join(process.cwd(), 'public');
const TOKEN_MAP_JSON_PATH = path.join(PUBLIC_DIR, 'token-map.json');

const MODEL_NAMES: ModelName[] = [
  'openai/gpt-4o',
  'openai/gpt-4o-mini',

  'anthropic/claude-3.5-sonnet',
  'anthropic/claude-3-haiku',

  'google/gemini-2.5-pro',
  'google/gemini-2.5-flash'
];

const tokenizers = Object.fromEntries(
  MODEL_NAMES.map((m) => {
    const model = models[m];
    return [m, new Tokenizer(encoding[model.encoding])];
  })
);

async function walk(dir: string): Promise<string[]> {
  const entries = await readdir(dir);
  const results: string[] = [];
  for (const entry of entries) {
    const fullPath = path.join(dir, entry);
    const s = await stat(fullPath);
    if (s.isDirectory()) {
      results.push(...(await walk(fullPath)));
    } else if (entry.endsWith('.md')) {
      results.push(fullPath);
    }
  }
  return results;
}

async function main() {
  const files = await walk(PUBLIC_DIR);
  const result: Record<string, Record<string, number>> = {};
  for (const file of files) {
    const content = await readFile(file, 'utf-8');
    // normalize path (relative to /public)
    const relativePath = path.relative(PUBLIC_DIR, file).replace(/\\/g, '/');
    result[relativePath] = {};
    for (const [modelName, tokenizer] of Object.entries(tokenizers)) {
      const tokens = tokenizer.encode(content).length;
      result[relativePath][modelName] = tokens;
    }
  }
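  await writeFile(TOKEN_MAP_JSON_PATH, JSON.stringify(result, null, 2), 'utf-8');
  console.log('✅ token-map.json generated');
}

// Exit with a non-zero code so a failed generation does not go unnoticed.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});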

In the above code, we recursively walk through the public directory to find all Markdown files, calculate their token counts for various models using ai-tokenizer, and save the results in a token-map.json file. This lets us look up token counts when serving Markdown content to agents.

Then, in the Cloudflare Worker, we can set the x-markdown-tokens header like this:

const contentType = headers.get('Content-Type') || '';
if (contentType.startsWith('text/markdown')) {
  const tokenMap = await getTokenMap(env, request);
  const key = pathname.replace(/^\/+/, '');
  const tokens = tokenMap[key];
  const tokenCount = tokens ? Math.max(...Object.values(tokens)) : 0;
  headers.set('x-markdown-tokens', tokenCount.toString());
}
return new Response(assetResponse.body, {
  status: assetResponse.status,
  headers
});

In the above code snippet, we check whether the response content type is Markdown. If it is, we look up the token counts in the pre-generated token-map.json and set the x-markdown-tokens header before returning the response to the agent. I use the maximum count across models as a conservative estimate: different models tokenize differently, so the upper bound lets any agent budget for the content without hitting unexpected token limits.
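
The lookup itself can be isolated into a small function. This is a sketch under assumptions: `conservativeTokenCount` is a hypothetical name, the path and numbers in `exampleMap` are made up, and it returns `undefined` for unknown files (so a caller could skip the header) rather than the `0` used in the snippet above.

```typescript
// Same shape as token-map.json: file path -> model name -> token count.
type TokenMap = Record<string, Record<string, number>>;

// Return the largest per-model count for a file, or undefined when the
// file is not in the map or has no counts recorded.
function conservativeTokenCount(tokenMap: TokenMap, key: string): number | undefined {
  const perModel = tokenMap[key];
  if (!perModel) return undefined;
  const counts = Object.values(perModel);
  return counts.length > 0 ? Math.max(...counts) : undefined;
}

// Illustrative path and numbers only, not real measurements.
const exampleMap: TokenMap = {
  'blogs/8.md': {
    'openai/gpt-4o': 4100,
    'anthropic/claude-3.5-sonnet': 4350
  }
};

console.log(conservativeTokenCount(exampleMap, 'blogs/8.md'));
console.log(conservativeTokenCount(exampleMap, 'blogs/missing.md'));
```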

What actually improved?

Agents skip HTML parsing entirely
Less data → fewer tokens → cheaper + faster
/blogs acts like a content index
Same URL works for everything

At this point, /blogs is basically a read-only content API - without building one.

Conclusion

This isn’t some big architecture.

It’s just using the web properly.

Same resource. Different representations.

We’ve had this for years.

We just ignored it.

What I learned

Content negotiation is a powerful, underutilized tool for serving different formats from the same URL.
Markdown is a fantastic format for agents - clean, efficient, and easy to parse.
Explicit content signals can help communicate usage permissions to agents, fostering responsible AI interactions.
Pre-calculating token counts and exposing them via headers can help agents make informed decisions about processing content.
Building for agents doesn’t have to mean building separate APIs - sometimes it just means serving the right format to the right consumer.
Testing is crucial to ensure that content negotiation works as intended and that agents receive the correct format.
Proper metadata and headers are essential for advertising capabilities and guiding agents on how to interact with your content.