A user sends your Discord bot a screenshot and asks “what’s wrong with this error?” Your bot reads the text, ignores the image, and gives a generic answer. The image had the actual error message. The fix is shorter than you’d expect.
The Problem
Discord messages can include image attachments. The discord.js message object exposes them, but most bot implementations only process message.content (the text). The image attachment is metadata sitting right there, unused.
Claude’s API supports vision — you can send images as content blocks alongside text. But the message format is different from a plain string. You need to build an array of content blocks instead.
Step 1: Extract Attachments From Discord
The Discord message object has an attachments collection. Pull out the metadata you need:
// In your Discord message handler
let attachments = [];
if (message.attachments.size > 0) {
  attachments = [...message.attachments.values()].map(a => ({
    url: a.url,
    filename: a.name,
    contentType: a.contentType,
    width: a.width,
    height: a.height,
  }));
}
Pass these through your transport layer alongside the message text. Your command handler needs both.
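As a rough sketch of what "passing these through" might look like, the extraction can be wrapped in a helper that returns plain, JSON-serializable objects. The helper name `extractAttachments`, the `payload` shape, and the fake message below are assumptions for illustration, not part of discord.js. The fake message uses a plain `Map` because the discord.js attachments collection is a `Map` subclass.

```javascript
// Hypothetical helper: reduce a discord.js attachments collection to plain objects
// that survive JSON serialization across a queue, socket, or HTTP hop.
function extractAttachments(message) {
  // message.attachments is a Collection (a Map subclass), so .values() works here.
  return [...message.attachments.values()].map(a => ({
    url: a.url,
    filename: a.name,
    contentType: a.contentType ?? null,
    width: a.width ?? null,
    height: a.height ?? null,
  }));
}

// Stand-in for a real discord.js Message, for illustration only.
const fakeMessage = {
  content: "what's wrong with this error?",
  attachments: new Map([
    ['1', {
      url: 'https://cdn.example.invalid/attachments/error.png',
      name: 'error.png',
      contentType: 'image/png',
      width: 800,
      height: 600,
    }],
  ]),
};

// Assumed payload shape: text and attachments travel together.
const payload = {
  text: fakeMessage.content,
  attachments: extractAttachments(fakeMessage),
};
console.log(JSON.stringify(payload, null, 2));
```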
Step 2: Build Vision Content Blocks
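Filter for image formats Claude accepts (JPEG, PNG, GIF, WebP), build Claude-compatible content blocks, and append the text:
// In your command handler, before calling Claude
// Media types Claude's vision input accepts; anything else (SVG, BMP, ...) is skipped
const SUPPORTED_IMAGE_TYPES = ['image/jpeg', 'image/png', 'image/gif', 'image/webp'];
let userContent = text;
const imageAttachments = (attachments || []).filter(a =>
  SUPPORTED_IMAGE_TYPES.includes(a.contentType?.split(';')[0]) ||
  /\.(png|jpg|jpeg|gif|webp)$/i.test(a.filename || '')
);
if (imageAttachments.length > 0) {
  const contentBlocks = [];
  // Images first, so the model sees them before the question
  for (const img of imageAttachments) {
    contentBlocks.push({
      type: 'image',
      source: { type: 'url', url: img.url },
    });
  }
  // Skip the text block when the user sent only images
  if (text?.trim()) {
    contentBlocks.push({ type: 'text', text });
  }
  userContent = contentBlocks;
}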
That’s it. userContent is either a plain string (no images) or an array of content blocks (images + text). Pass it to Claude as the user message.
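Because `userContent` can take two shapes, code further down the pipeline (logging, metrics, history trimming) may find it easier to work with one. The sketch below normalizes either shape to an array of blocks. The helper names `toBlocks` and `countImages` are hypothetical, introduced only for this example.

```javascript
// Hypothetical helper: view string-or-array content as an array of blocks.
function toBlocks(content) {
  if (typeof content === 'string') {
    // An empty or whitespace-only string carries no text block.
    return content.trim() ? [{ type: 'text', text: content }] : [];
  }
  return content;
}

// Hypothetical helper: count image blocks regardless of which shape was stored.
function countImages(content) {
  return toBlocks(content).filter(block => block.type === 'image').length;
}

const plain = 'What does this stack trace mean?';
const mixed = [
  { type: 'image', source: { type: 'url', url: 'https://cdn.example.invalid/a.png' } },
  { type: 'text', text: 'What error is this showing?' },
];

console.log(countImages(plain), countImages(mixed));
```

The API itself accepts both shapes, so this normalization is only a convenience for your own code.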
Step 3: Send to Claude
The Claude API accepts either a string or an array of content blocks for the content field:
const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  system: systemPrompt,
  messages: [
    ...conversationHistory,
    { role: 'user', content: userContent },
  ],
});
No special flag or parameter needed. If content is an array with image blocks, Claude processes them as vision input.
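One way to keep this step testable is to separate building the request from sending it, and to pull the reply text out of the response's content blocks in one place. The helper names `buildRequest` and `extractText` are assumptions for this sketch. The response shape it relies on is an array of content blocks in which text blocks carry a `text` field.

```javascript
// Hypothetical helper: assemble the request object without touching the network.
function buildRequest({ systemPrompt, conversationHistory, userContent }) {
  return {
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4096,
    system: systemPrompt,
    messages: [
      ...conversationHistory,
      // userContent may be a plain string or an array of content blocks.
      { role: 'user', content: userContent },
    ],
  };
}

// Hypothetical helper: join the text blocks of a Messages API response.
function extractText(response) {
  return response.content
    .filter(block => block.type === 'text')
    .map(block => block.text)
    .join('');
}

const request = buildRequest({
  systemPrompt: 'You are a helpful Discord bot.',
  conversationHistory: [],
  userContent: [
    { type: 'image', source: { type: 'url', url: 'https://cdn.example.invalid/error.png' } },
    { type: 'text', text: 'What error is this showing?' },
  ],
});

// With a real SDK client this would be used roughly as:
//   const response = await client.messages.create(request);
//   const reply = extractText(response);
console.log(request.messages.length);
```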
Why Images Go Before Text
Notice the code pushes image blocks first, then the text block. This matters.
Anthropic's vision guidance recommends placing images before the text that refers to them. When a user sends a screenshot and asks “what’s this error?”, Claude then has the image in context before it reaches the question. This mirrors how a person would handle it: look at the image, then read the question about it.
// ✅ Good: image → text
[
  { type: 'image', source: { type: 'url', url: '...' } },
  { type: 'text', text: 'What error is this showing?' }
]
// ⚠️ Works but worse: text → image
[
  { type: 'text', text: 'What error is this showing?' },
  { type: 'image', source: { type: 'url', url: '...' } }
]
Both work. Image-first ordering tends to produce better responses because the model has the visual context before it encounters the question.
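If blocks can reach your handler in arbitrary order (for example, assembled by another part of the system), a small reordering step can enforce the image-first layout. This is a sketch under that assumption; `imagesFirst` is a hypothetical helper name.

```javascript
// Hypothetical helper: move image blocks ahead of everything else,
// keeping the relative order within each group.
function imagesFirst(blocks) {
  const images = blocks.filter(block => block.type === 'image');
  const rest = blocks.filter(block => block.type !== 'image');
  return [...images, ...rest];
}

const unordered = [
  { type: 'text', text: 'What error is this showing?' },
  { type: 'image', source: { type: 'url', url: 'https://cdn.example.invalid/one.png' } },
  { type: 'image', source: { type: 'url', url: 'https://cdn.example.invalid/two.png' } },
];

const ordered = imagesFirst(unordered);
console.log(ordered.map(block => block.type).join(' -> '));
```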
No Download Needed
Discord CDN URLs (cdn.discordapp.com) are publicly accessible without authentication. Claude’s API fetches the image directly from the URL when you use source.type: 'url'. No need to:
- Download the image to your server
- Base64-encode it
- Store it temporarily
- Manage cleanup
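Just pass the URL. This is the simplest path, and it works because the attachment URL on a freshly received message can be fetched as-is, with no bot token or auth header. One caveat: Discord now signs attachment URLs with expiry parameters, so a given URL is only valid for a limited time. That is fine for answering the current message, but it matters if you store URLs and reuse them later.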
If you’re in an environment where the image URLs are behind authentication (not Discord, but other platforms), you’d need to download the image, base64-encode it, and use source.type: 'base64' with media_type and data fields instead.
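For that authenticated case, a minimal sketch of the base64 path might look like the following. It assumes Node 18+ (global `fetch` and `Buffer`). The helper names `toBase64ImageBlock` and `fetchAsBase64Block` are hypothetical, as is the idea of passing auth headers in. The block shape (`type: 'base64'`, `media_type`, `data`) is the one described above.

```javascript
// Media types accepted for image input.
const BASE64_IMAGE_TYPES = ['image/jpeg', 'image/png', 'image/gif', 'image/webp'];

// Hypothetical helper: wrap raw image bytes in a base64 image content block.
function toBase64ImageBlock(bytes, mediaType) {
  if (!BASE64_IMAGE_TYPES.includes(mediaType)) {
    throw new Error(`Unsupported image type: ${mediaType}`);
  }
  return {
    type: 'image',
    source: {
      type: 'base64',
      media_type: mediaType,
      data: Buffer.from(bytes).toString('base64'),
    },
  };
}

// Hypothetical helper: download an image that needs auth headers, then wrap it.
// fetchImpl is injectable so the function can be exercised without a network.
async function fetchAsBase64Block(url, headers = {}, fetchImpl = fetch) {
  const res = await fetchImpl(url, { headers });
  if (!res.ok) {
    throw new Error(`Image fetch failed with status ${res.status}`);
  }
  // Drop any parameters such as "; charset=..." from the header value.
  const mediaType = (res.headers.get('content-type') || '').split(';')[0].trim();
  const bytes = Buffer.from(await res.arrayBuffer());
  return toBase64ImageBlock(bytes, mediaType);
}

// Tiny demonstration with stand-in bytes rather than a real image.
const demoBlock = toBase64ImageBlock(Buffer.from('hi'), 'image/png');
console.log(demoBlock.source.media_type, demoBlock.source.data.length);
```

The image bytes live only in memory here, so there is still no temp file to clean up. The trade-off is a larger request body.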
Conversation History
If your agent maintains conversation memory, the mixed content blocks go into history as-is:
// Memory stores whatever content was sent
history.push({
  role: 'user',
  content: userContent, // string or array of blocks
  timestamp: new Date().toISOString(),
});
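On the next turn, the API receives the full history including image blocks from previous messages, so Claude can reference images from earlier in the conversation: “that screenshot you showed me earlier” works correctly. Two things to watch. Send only role and content to the API, since bookkeeping fields such as timestamp are not part of the message format. And because URL-based image blocks point at Discord attachment URLs that expire, very old turns in long-lived history may stop resolving.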
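Since the stored entries carry a `timestamp` that the Messages API does not define, it is probably safest to strip history down to `role` and `content` before sending. The sketch below assumes the API may reject unknown fields. `toApiMessages` is a hypothetical helper name.

```javascript
// Hypothetical helper: keep only the fields the Messages API defines for a message.
function toApiMessages(history) {
  return history.map(({ role, content }) => ({ role, content }));
}

const storedHistory = [
  {
    role: 'user',
    content: [
      { type: 'image', source: { type: 'url', url: 'https://cdn.example.invalid/error.png' } },
      { type: 'text', text: 'What error is this showing?' },
    ],
    timestamp: '2025-01-01T00:00:00.000Z',
  },
  {
    role: 'assistant',
    content: 'It looks like a missing module error.',
    timestamp: '2025-01-01T00:00:05.000Z',
  },
];

const conversationHistory = toApiMessages(storedHistory);
console.log(Object.keys(conversationHistory[0]).join(','));
```

The stored history keeps its timestamps. Only the copy sent to the API is trimmed.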
Multiple Images
The pattern handles multiple attachments naturally. Each image gets its own content block, all before the text:
// User sends 3 screenshots + "compare these"
[
  { type: 'image', source: { type: 'url', url: 'cdn.../screenshot1.png' } },
  { type: 'image', source: { type: 'url', url: 'cdn.../screenshot2.png' } },
  { type: 'image', source: { type: 'url', url: 'cdn.../screenshot3.png' } },
  { type: 'text', text: 'Compare these three — which layout is better?' }
]
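Claude sees all three images and can compare them. The API does cap the number of images per request: Anthropic's documentation has listed 20 for claude.ai and 100 for API requests, so confirm the current limit before relying on a specific number.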
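To stay under whatever per-request limit applies, the image list can be trimmed before the blocks are built. This is a sketch only: `capImages` is a hypothetical helper, and its default of 20 is a conservative placeholder rather than a constant exported by any SDK.

```javascript
// Hypothetical helper: keep at most maxImages attachments and report how many were dropped.
function capImages(imageAttachments, maxImages = 20) {
  const kept = imageAttachments.slice(0, maxImages);
  return { kept, dropped: imageAttachments.length - kept.length };
}

// 25 stand-in attachments, for illustration only.
const manyImages = Array.from({ length: 25 }, (_, i) => ({
  url: `https://cdn.example.invalid/screenshot${i + 1}.png`,
  filename: `screenshot${i + 1}.png`,
  contentType: 'image/png',
}));

const { kept, dropped } = capImages(manyImages);
if (dropped > 0) {
  // A bot might mention this in its reply so the user knows some images were skipped.
  console.log(`Skipped ${dropped} image(s) over the limit`);
}
```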
Key Takeaway
Adding vision to a Discord bot is a content format change, not an architecture change. Extract attachment URLs from the message, filter for images, build content blocks with images before text, and pass the array instead of a string. A few lines of block assembly and your bot can see what users send it. The Discord CDN URLs work directly with Claude’s URL-based image source, so there is no download step, no encoding, and no temp files.