2026-01-28 · 2 min read

OpenAI API in Production: Patterns That Actually Work

AI · OpenAI · Node.js · Architecture


Everyone is integrating OpenAI. Few are doing it in a way that holds up at scale. Here's what I've learned after shipping AI features into real production systems.

The Core Problem

LLMs are non-deterministic, slow, and expensive. Your architecture needs to account for all three.
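
The slow-and-flaky part can be blunted generically with bounded retries and backoff. The official Node SDK already retries some failures for you; the `withRetry` helper below is a hand-rolled sketch of the idea, not an SDK API:

```typescript
// Sketch: retry a flaky async call with exponential backoff.
// `attempts` and `baseMs` are names of our own choosing.
async function withRetry<T>(
  fn: () => Promise<T>,
  { attempts = 3, baseMs = 500 }: { attempts?: number; baseMs?: number } = {},
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt: baseMs, 2x, 4x, ...
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrap the actual completion call in this, and pair it with a request timeout so a hung call fails fast instead of eating your latency budget.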

Pattern 1: Always Stream

Never make users wait for the full completion; after a few seconds of staring at nothing, they assume the request is broken and bail.

// Stream from your Next.js API route
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  const encoder = new TextEncoder();
  
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content ?? '';
          if (text) controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      },
    }),
    { headers: { 'Content-Type': 'text/event-stream' } }
  );
}
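
On the client, that stream arrives as `data:` frames separated by blank lines. A sketch of consuming it incrementally (the `/api/completion` path and the helper names here are assumptions, not part of any framework):

```typescript
// Extract the text payload from one SSE frame, or null for [DONE]/non-data lines.
function parseSSELine(line: string): string | null {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice(6);
  if (payload === '[DONE]') return null;
  try {
    return (JSON.parse(payload) as { text: string }).text;
  } catch {
    return null;
  }
}

// Read the response body chunk by chunk and hand each text delta to the UI.
async function streamCompletion(prompt: string, onText: (t: string) => void) {
  const res = await fetch('/api/completion', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const frames = buffer.split('\n\n');
    buffer = frames.pop() ?? ''; // keep any incomplete frame for the next chunk
    for (const frame of frames) {
      const text = parseSSELine(frame);
      if (text) onText(text);
    }
  }
}
```

The buffering matters: network chunks don't align with SSE frame boundaries, so always hold back the trailing partial frame.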

Pattern 2: Cache Aggressively

AI calls are expensive. Cache everything you can.

import { createHash } from 'crypto';
import { Redis } from '@upstash/redis';

const redis = new Redis({ url: process.env.UPSTASH_URL!, token: process.env.UPSTASH_TOKEN! });

async function cachedCompletion(prompt: string): Promise<string> {
  // Hash the full prompt: a truncated base64 prefix collides on long prompts
  // that share the same first bytes.
  const key = `ai:${createHash('sha256').update(prompt).digest('hex')}`;
  
  const cached = await redis.get<string>(key);
  if (cached) return cached;
  
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
  });
  
  const result = response.choices[0].message.content ?? '';
  await redis.setex(key, 3600, result); // cache for 1 hour
  return result;
}

Pattern 3: Rate Limit Per User

async function checkRateLimit(userId: string): Promise<boolean> {
  const key = `ratelimit:ai:${userId}`;
  const requests = await redis.incr(key);
  
  if (requests === 1) {
    // First request in the window: start the TTL. If the process dies between
    // INCR and EXPIRE the key never expires; use a Lua script or
    // @upstash/ratelimit if that edge case matters to you.
    await redis.expire(key, 3600); // reset every hour
  }
  
  return requests <= 50; // 50 AI requests per hour
}
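
Wire the check in before you spend any tokens, and tell the client when to come back. A sketch using a wrapper (`withRateLimit` and the `x-user-id` header are our own conventions, not a framework API):

```typescript
type LimitCheck = (userId: string) => Promise<boolean>;

// Wrap a route handler: reject over-limit users with 429 before calling OpenAI.
function withRateLimit(
  handler: (req: Request) => Promise<Response>,
  checkLimit: LimitCheck,
): (req: Request) => Promise<Response> {
  return async (req) => {
    const userId = req.headers.get('x-user-id') ?? 'anonymous';
    if (!(await checkLimit(userId))) {
      return new Response(JSON.stringify({ error: 'rate_limited' }), {
        status: 429,
        // Hint the client when the hourly window resets.
        headers: { 'Retry-After': '3600', 'Content-Type': 'application/json' },
      });
    }
    return handler(req);
  };
}
```

In the route file you'd export something like `POST = withRateLimit(streamingHandler, checkRateLimit)`, reusing `checkRateLimit` from above. Injecting the checker also makes the wrapper trivially testable.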

Pattern 4: Structured Outputs

Stop parsing free-form text. Use JSON mode.

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  response_format: { type: 'json_object' },
  messages: [{
    role: 'system',
    content: 'Always respond with valid JSON matching the schema: { score: number, reason: string, tags: string[] }',
  }, {
    role: 'user',
    content: userInput,
  }],
});

const parsed = JSON.parse(response.choices[0].message.content!);
// Careful: JSON.parse returns `any`, so TypeScript does NOT actually know
// the shape here. Validate at runtime before trusting it.
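
JSON mode guarantees syntactically valid JSON, not that it matches your schema. A runtime guard closes the gap; here's a minimal hand-rolled sketch (a library like zod does the same job, and `Review`/`isReview` are names of our own invention):

```typescript
interface Review {
  score: number;
  reason: string;
  tags: string[];
}

// Narrow `unknown` to Review only when every field checks out at runtime.
function isReview(value: unknown): value is Review {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.score === 'number' &&
    typeof v.reason === 'string' &&
    Array.isArray(v.tags) &&
    v.tags.every((t) => typeof t === 'string')
  );
}

// Usage (a literal stands in for the model's response here):
const candidate: unknown = JSON.parse('{"score": 4, "reason": "solid", "tags": ["ai"]}');
if (!isReview(candidate)) throw new Error('Model output did not match schema');
// candidate is typed as Review from here on
```

Treat a failed guard like any other upstream error: retry once with the validation error in the prompt, or fall back.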

The Bottom Line

AI features are a multiplier on your product — but only if the plumbing is solid. Get streaming, caching, and rate limiting right before you ship anything to users.