Optimizing Token Usage in OpenAI's JSON Mode with Stop Sequences
By strategically setting stop sequences, you can cut out unnecessary output tokens, saving 20% on output in our example. Learn how to implement this technique and even recover the full final token with logprobs.

In our last blog post, you may have noticed we used stop in our API request.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # Client setup; assumes OPENAI_API_KEY is set in your environment

response = await client.chat.completions.create(
    response_format={"type": "json_object"},  # Enforce JSON response
    model="gpt-4o-2024-08-06",  # Set model name
    max_tokens=20,  # Set an upper limit on token usage
    stop=["rue", "alse"],  # Early stopping conditions
    messages=[
        {
            "role": "system",
            "content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
        },
        {"role": "user", "content": "SEZC means Special Economic Zone Company."},
    ],
    temperature=0.0,  # Use deterministic sampling
    seed=31337,  # Random seed for reproducibility
)
Script with stop defined.
A good question to ask here is: why are we doing this? To understand, let's look at the token usage when we don't use stop=["rue", "alse"].
response = await client.chat.completions.create(
    response_format={"type": "json_object"},  # Enforce JSON response
    model="gpt-4o-2024-08-06",  # Set model name
    max_tokens=20,  # Set an upper limit on token usage
    messages=[
        {
            "role": "system",
            "content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
        },
        {"role": "user", "content": "SEZC means Special Economic Zone Company."},
    ],
    temperature=0.0,  # Use deterministic sampling
    seed=31337,  # Random seed for reproducibility
)

print(response.usage.model_dump_json(indent=2))  # Print out the token usage
print(repr(response.choices[0].message.content))  # Print out the response
Script without stop defined.
{
  "completion_tokens": 5,
  "prompt_tokens": 38,
  "total_tokens": 43,
  "completion_tokens_details": null,
  "prompt_tokens_details": null
}
Token usage without stop.
{"fact": true}
Response output without stop.
Now, let's try adding back stop to our request and see what changes.
{
  "completion_tokens": 4,
  "prompt_tokens": 38,
  "total_tokens": 42,
  "completion_tokens_details": null,
  "prompt_tokens_details": null
}
Token usage with stop.
{"fact": t
Response output with stop.
Nice, we reduced our output token usage by 20%, and we can still tell what the correct answer is! This might not seem like a lot for a simple example, but once you start scaling up, it adds up quickly. For example, if your framework uses 1 billion output tokens with gpt-4o-2024-08-06 per month, your output-token bill would go from $10,000 to $8,000!
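As a quick back-of-the-envelope check on that figure (assuming the $10 per 1M output tokens for gpt-4o-2024-08-06 that those numbers imply):

price_per_million_output = 10.00  # USD; assumed output-token price for gpt-4o-2024-08-06
monthly_output_tokens = 1_000_000_000  # 1 billion output tokens per month

baseline_cost = monthly_output_tokens / 1_000_000 * price_per_million_output
with_stop_cost = baseline_cost * (4 / 5)  # 4 completion tokens instead of 5, a 20% reduction

print(f"${baseline_cost:,.0f} -> ${with_stop_cost:,.0f}")  # $10,000 -> $8,000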
As a bonus, we can actually recover the full last token, simply by enabling logprobs.
response = await client.chat.completions.create(
    response_format={"type": "json_object"},  # Enforce JSON response
    model="gpt-4o-2024-08-06",  # Set model name
    max_tokens=20,  # Set an upper limit on token usage
    stop=["rue", "alse"],  # Early stopping conditions
    messages=[
        {
            "role": "system",
            "content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
        },
        {"role": "user", "content": "SEZC means Special Economic Zone Company."},
    ],
    temperature=0.0,  # Use deterministic sampling
    seed=31337,  # Random seed for reproducibility
    logprobs=True,  # Enable probability logging
)
Script with stop and logprobs defined.
{"fact": true
Response output with stop and logprobs defined.
Awesome! However, this is actually a bug: OpenAI's documentation is clear that the stop sequence should not be included in the response, and enabling logprobs should not alter the content. Oops.
That said, even after OpenAI fixes this bug, we can script our way around it by using response.choices[0].logprobs.content instead of response.choices[0].message.content.
reconstructed_message_content = ""  # the reconstructed message content

# iterate through the content logprobs of the response
for content_logprob in response.choices[0].logprobs.content:
    reconstructed_message_content += content_logprob.token  # append the token to the reconstructed message content

print(repr(reconstructed_message_content))
Script to reconstruct the message content using logprobs.
{"fact": true
Response output with stop and logprobs defined, reconstructed using logprobs.
Success! We recommend using the reconstruction method, as it should continue to work even after OpenAI fixes the bug.
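If you want this behind a single call, one option (a hypothetical helper, not part of the script above) is to prefer the logprobs reconstruction when it is present and fall back to message.content otherwise:

def full_message_content(response) -> str:
    """Return the message content, rebuilt from logprobs when they are available."""
    choice = response.choices[0]
    if choice.logprobs and choice.logprobs.content:
        # Join the raw tokens so the final (possibly truncated) token is recovered
        return "".join(token_logprob.token for token_logprob in choice.logprobs.content)
    return choice.message.content or ""


print(repr(full_message_content(response)))  # should print '{"fact": true' for the request above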