Optimizing Token Usage in OpenAI's JSON Mode with Stop Sequences
In our last blog post, you may have noticed we used stop in our API request.
response = await client.chat.completions.create(
response_format={"type": "json_object"}, # Enforce JSON response
model="gpt-4o-2024-08-06", # Set model name
max_tokens=20, # Set an upper limit on token usage
stop=["rue", "alse"], # Early stopping conditions
messages=[
{
"role": "system",
"content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
},
{"role": "user", "content": "SEZC means Special Economic Zone Company."},
],
temperature=0.0, # Use deterministic sampling
seed=31337, # Random seed for reproducibility
)
Script with stop defined.
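For reference, the snippets in this post assume the async OpenAI client from the previous post is already set up, roughly as follows. This is a minimal sketch, assuming the official openai Python package and an OPENAI_API_KEY environment variable.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # picks up OPENAI_API_KEY from the environment by default
# The await calls below assume an async context, e.g. a notebook with top-level
# await or a coroutine executed with asyncio.run(...).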
A good question to ask here is: why are we doing this? To understand, let's look at the token usage when we don't use stop=["rue", "alse"].
response = await client.chat.completions.create(
response_format={"type": "json_object"}, # Enforce JSON response
model="gpt-4o-2024-08-06", # Set model name
max_tokens=20, # Set an upper limit on token usage
messages=[
{
"role": "system",
"content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
},
{"role": "user", "content": "SEZC means Special Economic Zone Company."},
],
temperature=0.0, # Use deterministic sampling
seed=31337, # Random seed for reproducibility
)
print(response.usage.model_dump_json(indent=2)) # Print out the token usage
print(repr(response.choices[0].message.content)) # Print out the response
Script without stop defined.
{
"completion_tokens": 5,
"prompt_tokens": 38,
"total_tokens": 43,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
Token usage without stop.
{"fact": true}
Response output without stop.
Now, let's try adding back stop to our request and see what changes.
{
"completion_tokens": 4,
"prompt_tokens": 38,
"total_tokens": 42,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
Token usage with stop.
{"fact": t
Response output with stop.
Nice, we reduced our total output token usage by 20%, and we can still tell what the correct answer is! This might not seem like much for a simple example, but once you start scaling up, it adds up quickly. For example, if your framework uses 1 billion output tokens per month with gpt-4o-2024-08-06, your bill would go from $10,000 to $8,000!
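If you want to turn that truncated output back into a boolean programmatically, a tiny helper is enough. Here's a sketch (the helper name is ours; it assumes the {"fact": boolean} schema and the stop=["rue", "alse"] trick above):
def parse_truncated_fact(content: str) -> bool:
    # With stop=["rue", "alse"], the answer comes back as '{"fact": t' or '{"fact": f'.
    stripped = content.rstrip()
    if stripped.endswith("t"):
        return True
    if stripped.endswith("f"):
        return False
    raise ValueError(f"Unexpected response content: {content!r}")

print(parse_truncated_fact(response.choices[0].message.content))  # True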
As a bonus, we can actually recover the full last token, simply by enabling logprobs.
response = await client.chat.completions.create(
response_format={"type": "json_object"}, # Enforce JSON response
model="gpt-4o-2024-08-06", # Set model name
max_tokens=20, # Set an upper limit on token usage
stop=["true", "false"], # Early stopping conditions
messages=[
{
"role": "system",
"content": 'Fact check the user\'s message and answer in a JSON object that follows {"fact": boolean}.',
},
{"role": "user", "content": "SEZC means Special Economic Zone Company."},
],
temperature=0.0, # Use deterministic sampling
seed=31337, # Random seed for reproducibility
logprobs=True, # Enable probability logging
)
Script with stop and logprobs defined.
{"fact": true
Response output with stop and logprobs defined.
Awesome! However, this is a bug: OpenAI's documentation is clear that the stop sequence should not be included in the response, and enabling logprobs should not alter the content. Oops.
That said, even after OpenAI fixes this bug, we can script our way out of it by using response.choices[0].logprobs.content instead of response.choices[0].message.content.
reconstructed_message_content = ""  # the reconstructed message content
# iterate through the content logprobs of the response
for content_logprob in response.choices[0].logprobs.content:
    reconstructed_message_content += content_logprob.token  # append the token to the reconstructed message content
print(repr(reconstructed_message_content))
Script to reconstruct the message content using logprobs.
{"fact": true
Response output with stop and logprobs defined, reconstructed using logprobs.
Success! We recommend using the reconstruction method, as it should continue to work indefinitely.
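As a final touch, if you'd rather work with a parsed Python object than the raw string, note that generation stopped before the closing brace was produced, so you can simply close the JSON yourself before parsing. A quick sketch, assuming the {"fact": boolean} schema:
import json

# reconstructed_message_content is the string built above, e.g. '{"fact": true';
# generation stopped before the closing brace, so append it before parsing.
parsed = json.loads(reconstructed_message_content + "}")
print(parsed["fact"])  # True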