When you build a modern AI chat interface, you inevitably face the question: should responses stream in real time like ChatGPT's, or arrive all at once? This post chronicles our journey from implementing fake typing effects to understanding the fundamental architectural challenges of true streaming responses with AWS Lambda and Bedrock.
Our chat application initially used a simple approach - receive the complete AI response from AWS Bedrock, then simulate typing it out character by character:
// The "fake" streaming simulation we initially implemented
const simulateTyping = async (fullResponse: string) => {
  for (let i = 0; i < fullResponse.length; i++) {
    const char = fullResponse[i];
    // Pause longest after sentence-ending punctuation, briefly after spaces
    const delay = char === ' ' ? 50 : char.match(/[.!?]/) ? 200 : 30;
    await new Promise(resolve => setTimeout(resolve, delay));
    setDisplayText(fullResponse.substring(0, i + 1));
  }
};
This worked reasonably well for UX - users saw a "typing" effect that made the response feel more natural and gave visual feedback during processing. But it wasn't real streaming.
The real issue emerged when we tried to support multiple modes. Our system had three response paths: plain static responses, the fake typing simulation, and real streaming via InvokeModelWithResponseStreamCommand. This created a confusing user experience where the "Stream" toggle sometimes worked, sometimes fell back to fake typing, and sometimes just returned static responses. Users couldn't predict what they'd get.
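To see why the behavior felt arbitrary, here is a hypothetical reconstruction of the kind of mode-selection branching we had accumulated (the function and flag names are illustrative, not our actual code):

```typescript
// Sketch of the branching that made the UX unpredictable: three inputs,
// and the user only controls one of them (the toggle).
type ResponseMode = 'real-stream' | 'fake-typing' | 'static';

const pickMode = (
  streamToggleOn: boolean,          // what the user sees and controls
  streamingEndpointAvailable: boolean, // depends on deployment state
  typingEffectEnabled: boolean      // a separate feature flag
): ResponseMode => {
  if (streamToggleOn && streamingEndpointAvailable) return 'real-stream';
  if (typingEffectEnabled) return 'fake-typing';
  return 'static';
};
```

With the toggle on, the user could still land in any of the three modes depending on state they couldn't see.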
To implement true streaming, we dove deep into AWS documentation and discovered the architectural requirements:
AWS Bedrock supports real streaming through InvokeModelWithResponseStreamCommand:
// Inside an async generator, so tokens can be forwarded as they arrive
async function* streamBedrockTokens(params: InvokeModelWithResponseStreamCommandInput) {
  const command = new InvokeModelWithResponseStreamCommand(params);
  const response = await bedrockClient.send(command);

  // Real streaming - process chunks as they arrive
  for await (const chunk of response.body) {
    if (chunk.chunk?.bytes) {
      const chunkData = JSON.parse(new TextDecoder().decode(chunk.chunk.bytes));
      if (chunkData.type === 'content_block_delta' && chunkData.delta?.text) {
        // Send this token immediately to the client
        yield chunkData.delta.text;
      }
    }
  }
}
This actually works - Bedrock will send response tokens as they're generated by the model.
However, Lambda functions have specific requirements for streaming:
// Required configuration for Lambda streaming
backend.bedrockChatStream.resources.lambda.addFunctionUrl({
  authType: FunctionUrlAuthType.NONE,
  invokeMode: InvokeMode.RESPONSE_STREAM, // Critical!
});
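The function URL configuration is only half of it; the handler itself must be written against Lambda's response streaming API. In Node.js Lambda runtimes with RESPONSE_STREAM enabled, AWS injects a global awslambda.streamifyResponse that wraps a handler which writes output incrementally. A minimal sketch, assuming that runtime (the ResponseStream shape is simplified, and bedrockTokens stands in for the Bedrock token iterator):

```typescript
// Pure helper: format one token as a Server-Sent Events frame.
const toSSE = (token: string): string =>
  `data: ${JSON.stringify({ type: 'token', token })}\n\n`;

// Minimal shape of the stream Lambda hands to a streaming handler.
interface ResponseStream {
  write: (chunk: string) => void;
  end: () => void;
}

// Provided as a global by Node.js Lambda runtimes when response
// streaming is enabled; declared here so the sketch type-checks.
declare const awslambda: {
  streamifyResponse: (
    fn: (event: unknown, responseStream: ResponseStream) => Promise<void>
  ) => unknown;
};

// Sketch of a true streaming handler: each frame is flushed to the
// client as it is written, instead of being buffered into one payload.
export const handler =
  typeof awslambda !== 'undefined'
    ? awslambda.streamifyResponse(async (_event, responseStream) => {
        // for await (const token of bedrockTokens) {
        //   responseStream.write(toSSE(token));
        // }
        responseStream.end();
      })
    : undefined;
```

The contrast with the traditional handler below is the key point: here the runtime flushes each write, so nothing accumulates in memory.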
Our biggest discovery was that traditional Lambda functions are request-response only:
// Traditional Lambda - this CANNOT stream
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  let sseResponse = '';

  // This builds the ENTIRE response in memory
  for await (const chunk of bedrockResponse.body) {
    sseResponse += `data: ${JSON.stringify({ type: 'token', token: chunk })}\n\n`;
  }

  // Returns everything at once - NOT streaming!
  return {
    statusCode: 200,
    body: sseResponse
  };
};
This approach waits for the complete Bedrock response, formats it as Server-Sent Events (SSE), then returns everything in one payload. It's fake streaming disguised as real streaming.
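The client side has the mirror-image problem: consuming a genuinely streamed response means reading the body incrementally and parsing SSE frames as they arrive, rather than awaiting one JSON payload. A sketch, assuming frames in the data: {"type":"token",...} shape shown above (parseSSE and consumeStream are illustrative helpers, not library APIs):

```typescript
// Split complete "data: ..." frames out of an accumulating buffer,
// returning the parsed tokens plus the unconsumed remainder.
const parseSSE = (buffer: string): { tokens: string[]; rest: string } => {
  const frames = buffer.split('\n\n');
  const rest = frames.pop() ?? ''; // last piece may be an incomplete frame
  const tokens: string[] = [];
  for (const frame of frames) {
    const line = frame.split('\n').find(l => l.startsWith('data: '));
    if (!line) continue;
    const payload = JSON.parse(line.slice('data: '.length));
    if (payload.type === 'token') tokens.push(payload.token);
  }
  return { tokens, rest };
};

// Read the response body chunk by chunk, surfacing tokens as they land.
async function consumeStream(response: Response, onToken: (t: string) => void) {
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { tokens, rest } = parseSSE(buffer);
    buffer = rest;
    tokens.forEach(onToken);
  }
}
```

If the backend buffers everything (as the handler above does), this client code still runs, but every token arrives in a single burst, which is how "fake streaming disguised as real streaming" goes undetected.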
Our investigation revealed fundamental incompatibilities between true Lambda response streaming and our Amplify-based setup. After understanding these complexities, we made an architectural decision: remove streaming entirely and optimize for fast, reliable static responses.
// Clean, simple response handling
const handleResponse = async (response: Response) => {
  if (response.ok) {
    const data = await response.json();
    const assistantMessage: Message = {
      id: (Date.now() + 1).toString(),
      content: data.message, // Complete response, displayed immediately
      role: 'assistant',
      timestamp: new Date()
    };
    setMessages(prev => [...prev, assistantMessage]);
  }
};
We simplified the interface to be crystal clear:
// Before: Confusing streaming toggle
<button className={`streaming-toggle ${useRealStreaming ? 'active' : ''}`}>
  {useRealStreaming ? '🔄 Live' : '⌨️ Typed'}
</button>

// After: Simple loading states
{isLoading && (
  <div className="loading-spinner">
    <span></span><span></span><span></span>
  </div>
)}
Our testing revealed an interesting performance characteristic: for most chat interactions, static responses delivered a better experience than either streaming mode.
Every additional mode (static, fake streaming, real streaming) multiplied our testing surface and potential failure points.
Users care about predictable, fast responses, not the underlying streaming technology.
Choosing Amplify Functions over raw Lambda limited our streaming options, but provided other benefits like easier deployment and GraphQL integration.
AWS docs show how to implement streaming, but don't clearly explain the architectural constraints and trade-offs.
Our production system now uses:
// Single, optimized response path
const response = await client.queries.chat({
  message: userMessage.content,
  conversationId: conversationId || null,
  conversationHistory: JSON.stringify(completeHistory),
  clientIp: clientIp,
  userAgent: userAgent,
  stream: false // Always static
});

// Immediate display of complete response
const assistantMessage: Message = {
  id: (Date.now() + 1).toString(),
  content: response.data.reply,
  role: 'assistant',
  timestamp: new Date()
};
setMessages(prev => [...prev, assistantMessage]);
Sometimes the best engineering decision is to remove complexity, not add it. Our journey from fake typing to real streaming taught us that user experience consistency often matters more than technical sophistication.
The chat interface is now faster, more reliable, and easier to maintain. Users get their answers quickly without the uncertainty of which "streaming" mode they're experiencing.
In the world of AI interfaces, sometimes the most advanced approach is the simplest one that actually works.
This post documents our actual implementation experience with AWS Bedrock, Lambda, and Next.js. Your mileage may vary depending on your specific architecture and requirements.