OpenAI SDK

The gateway is a drop-in replacement for the OpenAI API. Point the official openai Python SDK at the gateway, and every request is proxied to the underlying model provider with memory handled automatically.

Before you begin

Complete the OpenAI Developer quickstart first. It covers SDK installation, API keys, and the base project setup.

Configure the client

Create an OpenAI client with your gateway API key and base URL.

gateway.py
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://gateway-api.mastra.ai/v1",
)

All subsequent examples use this client instance.

Chat completions

Send a standard chat completion request.

chat.py
completion = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "What is 2+2? Reply with just the number."}
    ],
    max_tokens=20,
)

print(completion.choices[0].message.content)
# "4"

System messages

Set the model's behavior with a system message in the messages list.

completion = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "system", "content": "You are a calculator. Only respond with numbers, no words."},
        {"role": "user", "content": "What is 10 * 5?"},
    ],
    max_tokens=100,
)

print(completion.choices[0].message.content)
# "50"

Multi-turn conversations

Pass the full conversation history in the messages list so the model retains context across turns.

completion = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "Remember this word: banana"},
        {"role": "assistant", "content": "Got it, I will remember it."},
        {"role": "user", "content": "What word did I ask you to remember? Reply with just the word."},
    ],
    max_tokens=100,
)

print(completion.choices[0].message.content)
# "banana"
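Because the API is stateless, each turn must resend the growing history. That bookkeeping can be folded into a small helper; the ask function and the module-level history list below are illustrative, not part of the SDK:

```python
history: list[dict] = []

def ask(client, content: str) -> str:
    """Send one user turn and record both sides in the shared history."""
    history.append({"role": "user", "content": content})
    completion = client.chat.completions.create(
        model="google/gemini-2.5-flash",
        messages=history,
        max_tokens=100,
    )
    reply = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

Calling ask(client, ...) repeatedly then carries the full conversation context forward on every request.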

Streaming

Pass stream=True to receive chunks incrementally.

stream.py
stream = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "Count from 1 to 5, separated by commas."}
    ],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
# "1, 2, 3, 4, 5"
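When the full text is needed as a single string, the delta-handling loop above can be wrapped in a small helper (illustrative, not part of the SDK):

```python
def collect_stream(stream) -> str:
    """Join the content deltas from a chat-completions stream."""
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)
```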

Memory with thread and resource IDs

Pass x-thread-id and x-resource-id via extra_headers to enable observational memory. The gateway stores observations per thread and injects them as context on subsequent requests.

memory.py
# First request: introduce yourself
client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "My name is Alex and I prefer concise answers."}
    ],
    extra_headers={
        "x-thread-id": "my-thread-1",
        "x-resource-id": "user-42",
    },
)

# Second request: the gateway remembers
response = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "What is my name?"}
    ],
    extra_headers={
        "x-thread-id": "my-thread-1",
        "x-resource-id": "user-42",
    },
)

print(response.choices[0].message.content)
# "Alex"
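Both headers must accompany every request in the thread, so a tiny helper (hypothetical, not part of the SDK) keeps them consistent across calls:

```python
def memory_headers(thread_id: str, resource_id: str) -> dict:
    """Build the gateway's observational-memory headers."""
    return {"x-thread-id": thread_id, "x-resource-id": resource_id}
```

Then pass extra_headers=memory_headers("my-thread-1", "user-42") on each request.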

Streaming with memory

Combine streaming with memory headers to receive incremental responses that reference prior observations.

stream = client.chat.completions.create(
    model="google/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "Summarize what you know about me."}
    ],
    stream=True,
    extra_headers={
        "x-thread-id": "my-thread-1",
        "x-resource-id": "user-42",
    },
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Next steps

  • Features: Observational memory, streaming, BYOK, and gateway tools
  • Models: Supported providers and model routing
  • API reference: Complete endpoint documentation