Optimizing Token Usage in OpenAI's JSON Mode with Stop Sequences
By strategically setting stop sequences, you can cut out unnecessary tokens, saving 20% on output tokens in our example. Learn how to implement this technique and even recover the full final token with logprobs.
In our last blog post, you may have noticed that we used the `stop` parameter in our API request.
A good question to ask here is: why are we doing this? To understand, let's look at the token usage when we don't pass `stop=["rue", "alse"]`.
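Here's a minimal sketch of such a baseline request; the true/false prompt, question, and JSON shape are placeholders of our own, not the exact request from the previous post:

```python
from openai import OpenAI

client = OpenAI()

# Baseline: JSON mode, no stop sequences.
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": 'Reply with a JSON object of the form {"answer": true} or {"answer": false}.',
        },
        {"role": "user", "content": "Is the sky blue?"},
    ],
)

print(response.choices[0].message.content)  # e.g. {"answer": true}
print(response.usage.completion_tokens)     # full output token count
```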
Now, let's try adding `stop` back to our request and see what changes.
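Continuing the same sketch, the only change is the `stop` parameter:

```python
# Same request, now with stop sequences that cut off "true"/"false".
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": 'Reply with a JSON object of the form {"answer": true} or {"answer": false}.',
        },
        {"role": "user", "content": "Is the sky blue?"},
    ],
    stop=["rue", "alse"],
)

# Generation halts as soon as "rue" or "alse" would appear, and the stop
# sequence itself is not returned, so the content ends at "t" or "f".
print(response.choices[0].message.content)  # e.g. {"answer": t
print(response.usage.completion_tokens)     # fewer tokens than before
```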
Nice, we reduced our total output token usage by 20%, and we can still tell what the correct answer is! This might not seem like a lot for a simple example, but once you start scaling up, it adds up quickly. For example, if your framework uses 1 billion output tokens per month with gpt-4o-2024-08-06 (priced at $10 per 1M output tokens), your monthly bill would drop from $10,000 to $8,000!
As a bonus, we can actually recover the full last token simply by enabling `logprobs`.
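Continuing the sketch, we add `logprobs=True` to the request; the exact token string shown in the comment is illustrative, since tokenization can differ:

```python
# Same request as before, plus logprobs.
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    response_format={"type": "json_object"},
    logprobs=True,
    messages=[
        {
            "role": "system",
            "content": 'Reply with a JSON object of the form {"answer": true} or {"answer": false}.',
        },
        {"role": "user", "content": "Is the sky blue?"},
    ],
    stop=["rue", "alse"],
)

# message.content is truncated before the stop sequence...
print(response.choices[0].message.content)  # {"answer": t
# ...but the logprobs entry for the final position still holds the whole token.
print(response.choices[0].logprobs.content[-1].token)  # " true"
```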
That said, even after OpenAI fixes this bug, we can script our way around it by reconstructing the output from `response.choices[0].logprobs.content` instead of reading `response.choices[0].message.content`.
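A minimal reconstruction sketch, assuming the response was created with `logprobs=True` as above (`reconstruct_content` is a hypothetical helper name):

```python
def reconstruct_content(response) -> str:
    """Rebuild the output text from the logprobs tokens, recovering the
    full final token that message.content truncates at the stop sequence."""
    return "".join(t.token for t in response.choices[0].logprobs.content)

print(reconstruct_content(response))  # e.g. {"answer": true
```

Note that the closing brace was never generated (the stop sequence ended generation first), so the reconstructed string is still truncated JSON; it simply recovers the full final token.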
Success! We recommend using the reconstruction method, as it should continue to work indefinitely.