Optimizing Token Usage in OpenAI's JSON Mode with Stop Sequences
In our last blog post, you may have noticed we used stop in our API request.
response = await client.chat.completions.create(
response_format={"type": "json_object"}, # Enforce JSON response
model="gpt-4o-2024-08-06", # Set model name
max_tokens=20, # Set an upper limit on token usage
stop=["rue", "alse"], # Early stopping conditions
messages=[
{
"role": "system",
"content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
},
{"role": "user", "content": "SEZC means Special Economic Zone Company."},
],
temperature=0.0, # Use deterministic sampling
seed=31337, # Random seed for reproducibility
)
Script with stop defined.
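For reference, the snippets in this post assume the async OpenAI client from the previous post is already set up, roughly as follows. This is a minimal sketch, assuming the official openai Python package and an OPENAI_API_KEY environment variable.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # picks up OPENAI_API_KEY from the environment by default
# The await calls below assume an async context, e.g. a notebook with top-level
# await or a coroutine executed with asyncio.run(...).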
A good question to ask here is: why are we doing this? To understand, let's look at the token usage when we don't use stop=["rue", "alse"].
response = await client.chat.completions.create(
response_format={"type": "json_object"}, # Enforce JSON response
model="gpt-4o-2024-08-06", # Set model name
max_tokens=20, # Set an upper limit on token usage
messages=[
{
"role": "system",
"content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
},
{"role": "user", "content": "SEZC means Special Economic Zone Company."},
],
temperature=0.0, # Use deterministic sampling
seed=31337, # Random seed for reproducibility
)
print(response.usage.model_dump_json(indent=2)) # Print out the token usage
print(repr(response.choices[0].message.content)) # Print out the response
Script without stop defined.
{
"completion_tokens": 5,
"prompt_tokens": 38,
"total_tokens": 43,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
Token usage without stop.
{"fact": true}
Response output without stop.
Now, let's try adding back stop to our request and see what changes.
{
"completion_tokens": 4,
"prompt_tokens": 38,
"total_tokens": 42,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
Token usage with stop.
{"fact": t
Response output with stop.
Nice, we reduced our total output token usage by 20%, and we can still tell what the correct answer is! This might not seem like much for a simple example, but once you start scaling up, it adds up quickly. For example, if your framework uses 1 billion output tokens per month with gpt-4o-2024-08-06, your bill would go from $10,000 to $8,000!
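If you want to turn that truncated output back into a boolean programmatically, a tiny helper is enough. Here's a sketch (the helper name is ours; it assumes the {"fact": boolean} schema and the stop=["rue", "alse"] trick above):
def parse_truncated_fact(content: str) -> bool:
    # With stop=["rue", "alse"], the answer comes back as '{"fact": t' or '{"fact": f'.
    stripped = content.rstrip()
    if stripped.endswith("t"):
        return True
    if stripped.endswith("f"):
        return False
    raise ValueError(f"Unexpected response content: {content!r}")

print(parse_truncated_fact(response.choices[0].message.content))  # True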
As a bonus, we can actually recover the full last token, simply by enabling logprobs.
response = await client.chat.completions.create(
response_format={"type": "json_object"}, # Enforce JSON response
model="gpt-4o-2024-08-06", # Set model name
max_tokens=20, # Set an upper limit on token usage
stop=["true", "false"], # Early stopping conditions
messages=[
{
"role": "system",
"content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
},
{"role": "user", "content": "SEZC means Special Economic Zone Company."},
],
temperature=0.0, # Use deterministic sampling
seed=31337, # Random seed for reproducibility
logprobs=True, # Enable probability logging
)
Script with stop and logprobs defined.
{"fact": true
Response output with stop and logprobs defined.
Awesome! However, this is a bug: OpenAI's documentation is clear that the stop sequence should not be included in the response, and enabling logprobs should not alter the content. Oops.
That said, even after OpenAI fixes this bug, we can script our way out of it by using response.choices[0].logprobs.content instead of response.choices[0].message.content.
reconstructed_message_content = ""  # the reconstructed message content
# iterate through the content logprobs of the response
for content_logprob in response.choices[0].logprobs.content:
    reconstructed_message_content += content_logprob.token  # append the token to the reconstructed message content
print(repr(reconstructed_message_content))
Script to reconstruct the message content using logprobs.
{"fact": true
Response output with stop and logprobs defined, reconstructed using logprobs.
Success! We recommend using the reconstruction method, as it should continue to work indefinitely.
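As a final touch, if you'd rather work with a parsed Python object than the raw string, note that generation stopped before the closing brace was produced, so you can simply close the JSON yourself before parsing. A quick sketch, assuming the {"fact": boolean} schema:
import json

# reconstructed_message_content is the string built above, e.g. '{"fact": true';
# generation stopped before the closing brace, so append it before parsing.
parsed = json.loads(reconstructed_message_content + "}")
print(parsed["fact"])  # True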