I wanted a simple thing:
Serve the same blog URL to both humans and AI agents - but give them what they actually need.
No duplicate routes. No overengineering. Just one URL doing the right thing.
AI agents are everywhere now. But here’s the issue:
Most websites are still built only for humans.
HTML is:
Agents don’t care about your navbar, animations, or CSS.
They just want:
clean, structured, token-efficient content
Markdown fits that perfectly.
Short answer: tokens + clarity
Long answer:
For example, take this blog post. Tokenized, the HTML version comes to 95k+ tokens (because of all the tags, attributes, and layout code), while the Markdown version is only around 4k+ tokens. That is roughly 24x more tokens spent on HTML formatting alone, which is a huge difference for agents that have token limits or want to process content efficiently. Serving Markdown lets agents skip the markup and get straight to the content they care about, which can mean faster processing and lower costs. (These are real numbers from this post. Exact counts vary with the HTML structure and Markdown formatting used, but the general point holds.)
To visualize this, copy both HTML and Markdown versions of the content and paste them into AI Tokenizer to see the token counts for yourself.
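A quick sanity check on those numbers, as a minimal sketch. The counts are the rounded figures quoted above, and `tokenSavings` is a hypothetical helper, not part of the site's code:

```typescript
// Hypothetical helper: compares two token counts for the same content.
function tokenSavings(htmlTokens: number, markdownTokens: number) {
  return {
    // How many times larger the HTML representation is.
    ratio: htmlTokens / markdownTokens,
    // Share of tokens an agent avoids by reading Markdown instead.
    savedPercent: (1 - markdownTokens / htmlTokens) * 100
  };
}

// Rounded figures from this post: ~95k tokens as HTML, ~4k as Markdown.
const savings = tokenSavings(95_000, 4_000);
console.log(savings.ratio.toFixed(2)); // 23.75
console.log(savings.savedPercent.toFixed(1)); // 95.8
```

In other words, with these figures an agent reading the Markdown version skips about 96% of the tokens.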
So instead of building separate endpoints like:
/blog/html
/blog/md
Why not just:
same URL → different response based on the Accept header
At some point it hit me:
I already write in Markdown… why am I converting it away just to serve it?
So I stopped doing that.
This is handled in proxy.ts (edge layer) - not inside page components.
/blogs
// Blog index: rewrite to the pre-generated Markdown index when Markdown was negotiated.
if (pathname === '/blogs' && chosen === 'text/markdown') {
  const res = NextResponse.rewrite(new URL(serverEnvConfig.BLOGS_INDEX_MD_URL));
  res.headers.set('Content-Type', 'text/markdown; charset=utf-8');
  res.headers.set('Vary', 'Accept');
  return res;
}
// Single post: /blogs/i/<numeric id>
const blogIdMatch = pathname.match(/^\/blogs\/i\/(\d+)$/);
if (blogIdMatch && chosen === 'text/markdown') {
  const id = Number(blogIdMatch[1]);
  // Look up the post's Markdown URL in the JSON index.
  const data = await fetch(serverEnvConfig.BLOGS_INDEX_JSON_URL).then((res) =>
    res.json()
  );
  const blogEntry = data.find((entry: { id: number; mdUrl?: string }) => entry.id === id);
  // Unknown id, or no Markdown version for this post: answer with the 404 page.
  if (!blogEntry?.mdUrl) {
    return NextResponse.rewrite(new URL('/404', req.url), { status: 404 });
  }
  const res = NextResponse.rewrite(new URL(blogEntry.mdUrl));
  res.headers.set('Content-Type', 'text/markdown; charset=utf-8');
  res.headers.set('Vary', 'Accept');
  res.headers.set('Cache-Control', 's-maxage=60, stale-while-revalidate=86400');
  res.headers.set('Content-Signal', 'ai-train=no, search=yes, ai-input=no');
  return res;
}
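The snippet above relies on a `chosen` value whose computation isn't shown. One way it could be derived from the Accept header is sketched below. This is an assumption, not the site's actual code: `parseAccept`, `specificity`, `qualityFor`, and `negotiate` are hypothetical helpers, and the sketch falls back to HTML when nothing matches instead of answering 406.

```typescript
// Hypothetical helpers, not the post's actual implementation.
type MediaRange = { type: string; q: number };

// Parse "text/html;q=0.9, */*;q=0.8" into media ranges with quality values.
function parseAccept(header: string): MediaRange[] {
  return header
    .split(',')
    .map((part) => {
      const pieces = part.split(';').map((s) => s.trim());
      const type = (pieces[0] ?? '').toLowerCase();
      const qParam = pieces.slice(1).find((p) => p.toLowerCase().startsWith('q='));
      const q = qParam ? Number(qParam.slice(2)) : 1;
      return { type, q: Number.isNaN(q) ? 0 : q };
    })
    .filter((range) => range.type !== '');
}

// 2 = exact match, 1 = "type/*" match, 0 = "*/*" match, -1 = no match.
function specificity(range: string, candidate: string): number {
  if (range === candidate) return 2;
  if (range === '*/*') return 0;
  if (range.endsWith('/*') && candidate.startsWith(range.slice(0, -1))) return 1;
  return -1;
}

// Quality the client assigns to `candidate`; the most specific range wins.
function qualityFor(candidate: string, ranges: MediaRange[]): number {
  let best = -1;
  let q = 0;
  for (const range of ranges) {
    const s = specificity(range.type, candidate);
    if (s > best) {
      best = s;
      q = range.q;
    }
  }
  return q;
}

function negotiate(accept: string | null, supported: string[], fallback: string): string {
  if (!accept) return fallback;
  const ranges = parseAccept(accept);
  let chosen = fallback;
  let chosenQ = 0;
  for (const candidate of supported) {
    const q = qualityFor(candidate, ranges);
    // Strictly greater: on ties, the earlier entry in `supported` wins.
    if (q > chosenQ) {
      chosen = candidate;
      chosenQ = q;
    }
  }
  return chosen;
}

const SUPPORTED = ['text/html', 'text/markdown'];
console.log(negotiate('text/markdown', SUPPORTED, 'text/html')); // text/markdown
console.log(negotiate('*/*', SUPPORTED, 'text/html')); // text/html
```

Listing `text/html` first means browsers and a plain `curl` (which sends `Accept: */*`) keep getting HTML, and only clients that explicitly prefer `text/markdown` get Markdown.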
NOTE: NextResponse.rewrite exposes the original Markdown file URL in the x-middleware-rewrite response header. In the example above the Markdown files are served from a public bucket, so this is not an issue. If your Markdown files live behind a private API or in protected storage, strip this header from the response so you do not expose internal endpoints to the public.
To strip the x-middleware-rewrite header, I added the code below to my nginx config:
location / {
  proxy_hide_header x-middleware-rewrite;
}
Related thread in Next.js GitHub Discussions: Can I remove x-middleware-rewrite http header from the response headers when using NextResponse.rewrite?
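If you don't run nginx, the same idea can be applied in any layer that sits in front of Next.js, such as a Worker or another reverse proxy. A minimal sketch using the standard Fetch API `Headers` class follows. `INTERNAL_HEADERS` and `stripInternalHeaders` are hypothetical names, and the internal URL is a placeholder:

```typescript
// Assumption: these are the header names you consider internal.
const INTERNAL_HEADERS = ['x-middleware-rewrite'];

// Returns a copy of the headers without the internal ones.
function stripInternalHeaders(original: Headers): Headers {
  const cleaned = new Headers(original);
  for (const name of INTERNAL_HEADERS) {
    cleaned.delete(name);
  }
  return cleaned;
}

// Example with a hand-built header set (the URL is a placeholder).
const upstream = new Headers({
  'Content-Type': 'text/markdown; charset=utf-8',
  'x-middleware-rewrite': 'https://internal.example.com/posts/8.md'
});
const publicHeaders = stripInternalHeaders(upstream);
console.log(publicHeaders.has('x-middleware-rewrite')); // false
console.log(publicHeaders.get('content-type')); // text/markdown; charset=utf-8
```

The fronting layer would then build its outgoing response from the upstream body and status plus the cleaned headers.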
In the code above, the Cache-Control header lets the edge cache Markdown responses for 60 seconds and serve them stale while revalidating for up to 24 hours (86400 seconds). This balances freshness with performance for agents.
The Content-Signal header explicitly communicates content usage permissions to agents: here, the content may be used for search indexing, but not for AI training or AI input. The header was introduced by Cloudflare's Content Signals initiative, which gives website operators a standardized way to express content usage permissions to automated agents. See the Content Signals section below for more details.
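On the agent side, the header value is easy to turn into something usable. Here is a minimal sketch, assuming only the three signals that appear in this post (`ai-train`, `search`, `ai-input`). `parseContentSignal` and `formatContentSignal` are hypothetical helpers, not part of any official library:

```typescript
type SignalName = 'ai-train' | 'search' | 'ai-input';
// A missing key means the operator neither grants nor restricts that use.
type ContentSignals = Partial<Record<SignalName, boolean>>;

const SIGNAL_NAMES: SignalName[] = ['ai-train', 'search', 'ai-input'];

function parseContentSignal(value: string): ContentSignals {
  const signals: ContentSignals = {};
  for (const pair of value.split(',')) {
    const [rawName = '', rawValue = ''] = pair.split('=').map((s) => s.trim().toLowerCase());
    const name = SIGNAL_NAMES.find((n) => n === rawName);
    if (!name) continue; // ignore signals this sketch does not know
    if (rawValue === 'yes') signals[name] = true;
    if (rawValue === 'no') signals[name] = false;
  }
  return signals;
}

function formatContentSignal(signals: ContentSignals): string {
  return SIGNAL_NAMES.filter((n) => signals[n] !== undefined)
    .map((n) => `${n}=${signals[n] ? 'yes' : 'no'}`)
    .join(', ');
}

const parsed = parseContentSignal('ai-train=no, search=yes, ai-input=no');
console.log(parsed); // { 'ai-train': false, search: true, 'ai-input': false }
console.log(formatContentSignal(parsed)); // ai-train=no, search=yes, ai-input=no
```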
This is where things get interesting.
/blogs itself also supports Markdown.
So instead of scraping pages, agents can fetch everything in one request.
export function jsonToFlatMarkdown(blogs: BlogPost[]): string {
  const header = `# Sivothayan's Blogs
This is a collection of all the blogs that I have written. You can find the latest blogs at:
👉 [Read as Markdown (Accept: text/markdown)](https://sivothayan.com/blogs)
👉 [Read as HTML (Accept: text/html)](https://sivothayan.com/blogs)
This resource supports HTTP content negotiation.
- For Markdown: send header \`Accept: text/markdown\`
- For HTML: send header \`Accept: text/html\`
Example:
\`curl -H "Accept: text/markdown" https://sivothayan.com/blogs\`
---
`;
  const content = blogs
    .filter((b) => b.isPublished)
    .map((blog) => {
      const tags = blog.tags.map((t) => `\`${t}\``).join(', ');
      return `## ${blog.title}
- **Date:** ${blog.date}
- **Read Time:** ${blog.readTime} min
- **Language:** ${blog.language}
- **Tags:** ${tags}
${blog.description}
👉 [Read as Markdown (Accept: text/markdown)](https://sivothayan.com/blogs/i/${blog.id})
👉 [Read as HTML (Accept: text/html)](https://sivothayan.com/blogs/i/${blog.id})`;
    })
    .join('\n\n---\n\n');
  return `${header}\n${content}\n`;
}
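To show why a flat index like this is handy for agents, here is a rough sketch of the consuming side. It assumes the layout produced above (a `## ` title per post, sections separated by `---` lines). `parseBlogIndex` and `IndexEntry` are hypothetical names, and the sample text uses placeholder URLs:

```typescript
type IndexEntry = { title: string; url: string };

// Splits the index on '---' separator lines and picks the title and the
// first link from each section. Sections without a '## ' title are skipped.
function parseBlogIndex(markdown: string): IndexEntry[] {
  const entries: IndexEntry[] = [];
  for (const section of markdown.split(/\n---\n/)) {
    const title = section.match(/^## (.+)$/m)?.[1];
    const url = section.match(/\]\((https?:\/\/[^)\s]+)\)/)?.[1];
    if (title && url) {
      entries.push({ title: title.trim(), url });
    }
  }
  return entries;
}

const sampleIndex = [
  '# Example Blogs',
  'Intro text.',
  '---',
  '',
  '## First post',
  '- **Date:** 2025-01-01',
  '👉 [Read as Markdown (Accept: text/markdown)](https://example.com/blogs/i/1)',
  '',
  '---',
  '',
  '## Second post',
  '👉 [Read as Markdown (Accept: text/markdown)](https://example.com/blogs/i/2)',
  ''
].join('\n');

console.log(parseBlogIndex(sampleIndex).map((e) => e.title)); // [ 'First post', 'Second post' ]
```

One request to the index gives the agent every title and URL, and each URL can then be fetched with `Accept: text/markdown`.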
You also need to advertise Markdown support.
For the blog post layout:
import type { Metadata } from 'next';
// `site` and `id` come from elsewhere in the layout (not shown here). With a dynamic
// route, `id` would come from the route params via generateMetadata.
export const metadata: Metadata = {
  alternates: {
    canonical: `${site}/blogs/i/${id}`,
    types: {
      'text/markdown': `${site}/blogs/i/${id}`,
      'application/rss+xml': `${site}/rss.xml`,
      'application/atom+xml': `${site}/feed.xml`
    }
  }
};
For the blogs index page:
import type { Metadata } from 'next';
export const metadata: Metadata = {
  alternates: {
    canonical: `${site}/blogs`,
    types: {
      'text/markdown': `${site}/blogs`,
      'application/rss+xml': `${site}/rss.xml`,
      'application/atom+xml': `${site}/feed.xml`
    }
  }
};
Without this, agents won’t know Markdown is available.
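In the rendered HTML, these alternates show up as `<link rel="alternate" type="..." href="...">` tags in the page head. A rough sketch of how an agent might discover the Markdown version from that HTML follows. `findMarkdownAlternate` is a hypothetical helper, and the regex-based parsing is a simplification that a real agent would likely replace with a proper HTML parser:

```typescript
// Returns the href of the first <link rel="alternate" type="text/markdown">, or null.
function findMarkdownAlternate(html: string): string | null {
  const linkTags = html.match(/<link\b[^>]*>/gi) ?? [];
  for (const tag of linkTags) {
    // Reads a quoted attribute value from the tag (simplified: quoted values only).
    const attr = (name: string): string | undefined =>
      tag.match(new RegExp(`\\b${name}\\s*=\\s*["']([^"']*)["']`, 'i'))?.[1];
    if (
      attr('rel')?.toLowerCase() === 'alternate' &&
      attr('type')?.toLowerCase() === 'text/markdown'
    ) {
      return attr('href') ?? null;
    }
  }
  return null;
}

const sampleHead =
  '<head><link rel="canonical" href="https://example.com/blogs/i/8"/>' +
  '<link rel="alternate" type="text/markdown" href="https://example.com/blogs/i/8"/></head>';
console.log(findMarkdownAlternate(sampleHead)); // https://example.com/blogs/i/8
```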
If you don’t test this, you’re just guessing.
curl -H "Accept: text/markdown" https://sivothayan.com/blogs/i/8
curl https://sivothayan.com/blogs/i/8
curl -I -H "Accept: text/markdown" https://sivothayan.com/blogs/i/8
You should see:
Content-Type: text/markdown; charset=utf-8
Vary: Accept
Cache-Control: s-maxage=60, stale-while-revalidate=86400
Content-Signal: ai-train=no, search=yes, ai-input=no
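If you'd rather script this check than eyeball curl output, the two headers that matter most for negotiation can be verified with a small function. This is a sketch under my own assumptions: `checkMarkdownResponse` is a hypothetical helper, and it only looks at Content-Type and Vary:

```typescript
// Returns a list of problems; an empty list means the headers look right.
function checkMarkdownResponse(headers: Headers): string[] {
  const problems: string[] = [];

  const contentType = headers.get('content-type') ?? '';
  if (!contentType.toLowerCase().startsWith('text/markdown')) {
    problems.push(`unexpected Content-Type: ${contentType || '(missing)'}`);
  }

  // Vary must mention Accept, otherwise caches may mix up HTML and Markdown.
  const vary = (headers.get('vary') ?? '')
    .toLowerCase()
    .split(',')
    .map((v) => v.trim());
  if (!vary.includes('accept') && !vary.includes('*')) {
    problems.push('Vary does not include Accept');
  }

  return problems;
}

// Hand-built headers that mirror the expected output above.
const expectedHeaders = new Headers({
  'Content-Type': 'text/markdown; charset=utf-8',
  Vary: 'Accept'
});
console.log(checkMarkdownResponse(expectedHeaders)); // []
```

Against a live site, the `Headers` object would come from the `headers` property of a `fetch` response made with `Accept: text/markdown`.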
Since this setup is explicitly agent-friendly, it also makes sense to be explicit about what is allowed and what is not.
Instead of relying on assumptions, I expose content usage rules directly.
NOTE: Currently, the content signals are communicated via the Content-Signal header. Next.js doesn't have built-in support for adding them to robots.txt via robots.ts or the metadata export, so you need to serve a static robots.txt file from your public directory with the content signals included. Agents that respect robots.txt can then read and adhere to the content usage permissions you have set. See Next.js GitHub Discussions - Add support for robots.txt Content-Signals.
Add the below robots.txt file to the root of your website to communicate content usage permissions to agents via standardized content signals:
# As a condition of accessing this website, you agree to abide
# by the following content signals:
# (a) If a content-signal = yes, you may collect content for
# the corresponding use.
# (b) If a content-signal = no, you may not collect content for
# the corresponding use.
# (c) If the website operator does not include a content signal
# for a corresponding use, the website operator neither grants
# nor restricts permission via content signal with respect to
# the corresponding use.
# The content signals and their meanings are:
# search: building a search index and providing search results
# (e.g., returning hyperlinks and short excerpts from your
# website's contents). Search does not include providing
# AI-generated search summaries.
# ai-input: inputting content into one or more AI models (e.g.,
# retrieval augmented generation, grounding, or other real-time
# use for generative AI answers).
# ai-train: training or fine-tuning AI models.
# ANY RESTRICTIONS EXPRESSED VIA CONTENT SIGNALS ARE EXPRESS
# RESERVATIONS OF RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION
# DIRECTIVE 2019/790 ON COPYRIGHT AND RELATED RIGHTS IN THE
# DIGITAL SINGLE MARKET.
User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
x-markdown-tokens header
To make things even easier for agents, I also include a custom x-markdown-tokens header that estimates the number of tokens in the Markdown response. Agents can use it to assess the content size and decide how to process it before reading the body. This is especially useful for agents with token limits or those that prioritize content by size.
I precalculated the token count for each Markdown file using the ai-tokenizer npm package and stored it in a JSON index. Then I set the header on the response to each Markdown request.
To generate token-map.json:
import { readdir, readFile, writeFile, stat } from 'fs/promises';
import path from 'path';
import Tokenizer, { ModelName, models } from 'ai-tokenizer';
import * as encoding from 'ai-tokenizer/encoding';

const PUBLIC_DIR = path.join(process.cwd(), 'public');
const TOKEN_MAP_JSON_PATH = path.join(PUBLIC_DIR, 'token-map.json');

// Models to count tokens for; each one maps to a tokenizer encoding.
const MODEL_NAMES: ModelName[] = [
  'openai/gpt-4o',
  'openai/gpt-4o-mini',
  'anthropic/claude-3.5-sonnet',
  'anthropic/claude-3-haiku',
  'google/gemini-2.5-pro',
  'google/gemini-2.5-flash'
];

// One tokenizer instance per model, keyed by model name.
const tokenizers = Object.fromEntries(
  MODEL_NAMES.map((m) => {
    const model = models[m];
    return [m, new Tokenizer(encoding[model.encoding])];
  })
);

// Recursively collect every .md file under `dir`.
async function walk(dir: string): Promise<string[]> {
  const entries = await readdir(dir);
  const results: string[] = [];
  for (const entry of entries) {
    const fullPath = path.join(dir, entry);
    const s = await stat(fullPath);
    if (s.isDirectory()) {
      results.push(...(await walk(fullPath)));
    } else if (entry.endsWith('.md')) {
      results.push(fullPath);
    }
  }
  return results;
}

async function main() {
  const files = await walk(PUBLIC_DIR);
  // Shape: { "<path relative to /public>": { "<model name>": <token count> } }
  const result: Record<string, Record<string, number>> = {};
  for (const file of files) {
    const content = await readFile(file, 'utf-8');
    // normalize path (relative to /public)
    const relativePath = path.relative(PUBLIC_DIR, file).replace(/\\/g, '/');
    result[relativePath] = {};
    for (const [modelName, tokenizer] of Object.entries(tokenizers)) {
      const tokens = tokenizer.encode(content).length;
      result[relativePath][modelName] = tokens;
    }
  }
  await writeFile(TOKEN_MAP_JSON_PATH, JSON.stringify(result, null, 2), 'utf-8');
  console.log('✅ token-map.json generated');
}

// Fail the build step loudly if generation breaks.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});
The code above recursively walks the public directory to find all Markdown files, calculates their token counts for several models using ai-tokenizer, and saves the results to token-map.json. This makes it easy to look up token counts when serving Markdown content to agents.
Then, in the Cloudflare Worker, we can set the x-markdown-tokens header like this:
const contentType = headers.get('Content-Type') || '';
if (contentType.startsWith('text/markdown')) {
  const tokenMap = await getTokenMap(env, request);
  const key = pathname.replace(/^\/+/, '');
  const tokens = tokenMap[key];
  const tokenCount = tokens ? Math.max(...Object.values(tokens)) : 0;
  headers.set('x-markdown-tokens', tokenCount.toString());
}
return new Response(assetResponse.body, {
  status: assetResponse.status,
  headers
});
The snippet above checks whether the response content type is Markdown. If it is, it looks up the token count in the pre-generated token-map.json and sets it in the x-markdown-tokens header before returning the response to the agent. I chose the maximum token count across models as a conservative estimate: different models tokenize differently, and an upper bound lets any agent plan for the content without hitting unexpected token limits.
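On the consuming side, an agent could use the header roughly like this. It is a sketch, not a standard: `decideByTokenHeader` is a hypothetical helper. Because the Worker code above sends 0 when a file is missing from the token map, the sketch treats 0 as "unknown" and not as "empty":

```typescript
type FetchDecision = 'fetch' | 'skip' | 'unknown';

// Decide whether a Markdown document fits into the remaining token budget,
// based only on the x-markdown-tokens header value.
function decideByTokenHeader(headerValue: string | null, remainingBudget: number): FetchDecision {
  if (headerValue === null) return 'unknown';
  const tokens = Number.parseInt(headerValue, 10);
  // Missing, malformed, or zero counts carry no usable information.
  if (!Number.isFinite(tokens) || tokens <= 0) return 'unknown';
  return tokens <= remainingBudget ? 'fetch' : 'skip';
}

console.log(decideByTokenHeader('4200', 8000)); // fetch
console.log(decideByTokenHeader('4200', 1000)); // skip
console.log(decideByTokenHeader('0', 1000)); // unknown
```

An agent could read the header from a HEAD request first, then download the body only when the decision is `fetch`.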
/blogs acts like a content index
At this point, /blogs is basically a read-only content API - without building one.
This isn’t some big architecture.
It’s just using the web properly.
Same resource. Different representations.
We’ve had this for years.
We just ignored it.
Markdown for Agents - Cloudflare Blog
Fundamentals & Reference of Markdown for Agents - Cloudflare Developers
Serve Markdown to agents with content negotiation - acceptmarkdown.com
Next.js GitHub Discussions - Add support for robots.txt Content-Signals