ChatGPT Images 2.0 (gpt-image-2) API Tutorial

I have opinions about image generation APIs, and almost none of them are flattering. Every release until now has been a slightly-better pixel machine wrapped in the same three-preset UI. Pick 1024×1024, 1536×1024, or 1024×1536. Cross your fingers on the text. Regenerate if anything looks like soup.

[Image: tutorial lede, NotHans Blue to Teal Cyan typography on dark charcoal with a grid of thumbnail icons]
Tutorial lede, generated with gpt-image-2 itself at 1536×1024.

OpenAI shipped gpt-image-2 on April 21, 2026, and it is the first image model that actually belongs in a production pipeline. Not because the pictures are prettier. Because the API finally does the things I kept wanting the old one to do.

What actually changed

Three things, and you can ignore the rest of the announcement.

It reads and writes legible text. OpenAI claims ~99% accuracy on typography, including CJK and right-to-left scripts. That is a big deal if you have ever tried to generate a product label or a slide deck header and gotten cursed runes back. The old model was a pixel painter. The new one is a pixel painter that can spell.

It thinks before it draws. There is a reasoning pass baked into the model now, a “think about the scene, then render” step. You do not have to configure it. You do not pay a thinking-mode surcharge on the standard API call. It just converges faster. Prompts I used to iterate on three or four times now land on the first or second try.

It edits images. Real editing, not “here’s a new image that vaguely resembles your old one.” You pass in a picture and a description of what you want changed, and the rest stays put. This is the capability that makes it worth wiring into a pipeline.

The minimum viable call

If you have an OpenAI API key and a terminal, this is the whole thing:

curl https://api.openai.com/v1/images/generations \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-image-2",
    "prompt": "A cinematic 4K landscape of an AI data center at twilight",
    "size": "1536x1024",
    "quality": "high"
  }'

You get back base64 image data in data[0].b64_json. Write it to a file. Done.

One thing that tripped me up: do not send response_format. The docs say it is not supported, and they mean it. The API rejects the request with HTTP 400. All responses come back base64 only. If you want URLs, host them yourself.

The edit endpoint is the real unlock

Here is the image I generated first, a photorealistic 4K data center at twilight:

[Image: photorealistic AI data center at twilight with rows of glowing blue server racks receding to a vanishing point, original generation from gpt-image-2]

Now here is the same image after a single edit call: “replace the twilight clouds above the servers with a dramatic aurora borealis, ribbons of emerald green, magenta, and electric teal. Keep everything else unchanged.”

[Image: the same data center scene after a gpt-image-2 edit call added a green and magenta aurora borealis to the sky, everything else preserved]
The same scene after one gpt-image-2 edit call. No mask. Only the sky changed.

No mask. No Photoshop. The server racks are in the same positions. The orange horizon is preserved. The blue light trails between the racks still flow toward the vanishing point. Only the sky changed.

Try doing that with gpt-image-1. You cannot. The /v1/images/edits endpoint existed before, but the results it gave you were not the kind of thing you shipped to production. This is the feature I was waiting for.

The endpoint accepts multiple reference images, which you address inside the prompt as “image 1” and “image 2” for compositing. Style transfer, product placement, character relocation, all one API call.

Three creative moves the marketing post does not tell you about

Aspect ratios no one else gives you. The preset list is short, but size accepts custom values. Both dimensions must be multiples of 16, max edge 3840, aspect ratio up to 3:1, and total pixels between 655,360 and 8,294,400. Snap to the nearest multiple of 16 and that range covers Twitter cards (1200×628 becomes 1200×624), Instagram stories (1080×1920 becomes 1088×1920), blog heroes (1920×1080 becomes 1920×1088), and full 4K landscapes at 3840×2160 exactly. No cropping, no upscaling, no extra tooling.

One caveat the docs hide: anything above 2560×1440 is officially “experimental.” It works. I generated a 4K image for this post. But OpenAI is not promising an SLA on it yet, so budget for occasional failures in production.
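Those constraints are cheap to validate client-side before a request rather than burning a round trip on a 400. A minimal checker, using exactly the limits quoted in this section:

```javascript
// Pre-flight check for custom sizes, per the limits quoted above:
// multiples of 16, max edge 3840, aspect up to 3:1, and a total
// pixel budget of 655,360 to 8,294,400. Verify against the docs.
function isValidSize(w, h) {
  const pixels = w * h;
  const aspect = Math.max(w, h) / Math.min(w, h);
  return (
    w % 16 === 0 && h % 16 === 0 &&           // both dimensions multiples of 16
    Math.max(w, h) <= 3840 &&                 // max edge
    aspect <= 3 &&                            // aspect ratio up to 3:1
    pixels >= 655_360 && pixels <= 8_294_400  // total pixel budget
  );
}

// isValidSize(1536, 1024) → true
// isValidSize(3840, 2160) → true (exactly the upper pixel bound)
```

Note that `isValidSize(1200, 628)` fails because 628 is not a multiple of 16; social-card sizes like that need snapping to the nearest multiple of 16 first.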

Batch consistency. The n parameter goes up to 8, and the model keeps characters and objects consistent across the set. For a product shot or a children’s book page, one call gives you eight variations that actually share visual DNA. Eight variations at medium quality costs about thirty cents. That is a lot cheaper than eight separate prompt-engineering sessions.

Reasoning as a debugging tool. Because the model thinks before drawing, iteration feels different. Vague prompts still produce vague images, but specific prompts land harder. I stopped writing six-paragraph mega-prompts and started writing three-sentence scene descriptions with hex colors and composition direction. The output got better.

Gotchas

Things I wish the docs had told me louder:

  • No transparent backgrounds. If you need a PNG with alpha for icon work, you still reach for gpt-image-1. Route by use case.
  • Masks are prompt-guided, not pixel-exact. If you are coming from Stable Diffusion, this will feel wrong. The mask tells the model which region to focus on. The model decides how to blend.
  • C2PA watermarks are on by default. Every image ships with provenance metadata. Useful for trust, relevant if you were hoping to redistribute without attribution.
  • Streaming partials cost extra. Each partial_images frame adds 100 output tokens. Fine for prototyping a UI. Expensive at scale.
  • Pricing is per token. $8 per million input tokens, $30 per million output tokens, with the usual caching discount. A medium 1024×1024 lands around four cents. A high-quality 4K lands near eighty cents. The calculator on the docs page will save you some math.
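Per-token pricing makes cost estimation a one-liner. A sketch using only the two rates quoted above; the token count in the usage comment is back-calculated from the four-cent figure in this post, not an official number:

```javascript
// Back-of-envelope cost from the quoted rates: $8 per million input
// tokens, $30 per million output tokens (ignores the caching discount).
function imageCostUSD({ inputTokens = 0, outputTokens = 0 }) {
  return (inputTokens * 8 + outputTokens * 30) / 1_000_000;
}

// Roughly 1,333 output tokens would match the four-cent medium
// 1024x1024 figure (illustrative, not official):
// imageCostUSD({ outputTokens: 1333 }) → ≈ $0.04
```

The same arithmetic explains the streaming gotcha: at 100 output tokens per frame, each `partial_images` frame adds 100 × 30 / 1,000,000 = $0.003, which is noise in a prototype and a real line item at scale.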

Where to take it

If you are already using image generation in an app, switching is low-risk. Change the model string, drop response_format if you were sending it, and audit for transparent-background assumptions. Your latency will drop. Your text will be legible.

If you are not using image generation in an app, the editing endpoint is the reason to start. Every product that has ever wanted “make this photo match our brand” can now do it with three lines of code.

I rebuilt my cartoon pipeline around the new editing flow in about an afternoon. The first draft of this post was going to be a benchmark comparison. Then I looked at the aurora edit and realized there was nothing to benchmark. Either your tool can do it or it cannot.

gpt-image-2 can.
