Chaining OpenAI Models: Better and Faster
Overview
The goal is to fact-check a list of statements using a powerful model and then convert the verbose responses into a simple JSON format using a smaller, cheaper model.
The Need for Chaining Models
Large language models like gpt-4o offer advanced reasoning and comprehension abilities, making them ideal for tasks like fact-checking. However, they can be costly and slow for simple tasks like data formatting or conversion. By chaining the output of a large model into a smaller one, we can:
- Optimize Costs: Use the expensive model only when necessary.
- Improve Efficiency: Delegate simpler tasks to faster models.
- Maintain Accuracy: Ensure complex reasoning tasks are handled by capable models.
While we will use only OpenAI models in this blog post for simplicity, we especially recommend this approach for models that do not have a strict JSON mode, like Anthropic Claude 3.5 Sonnet. (We also recommend using LiteLLM to expose Anthropic's API as an OpenAI-compatible endpoint.)
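As a rough sketch of that setup (this assumes you already have a LiteLLM proxy running locally with an Anthropic key configured; the address and model name below are placeholders for your own deployment):
import os
from openai import AsyncOpenAI

# Point the same OpenAI client at the LiteLLM proxy instead of api.openai.com.
# LiteLLM translates OpenAI-style chat completion requests into Anthropic API calls.
claude_client = AsyncOpenAI(
    api_key=os.getenv("LITELLM_API_KEY", "placeholder"),
    base_url="http://localhost:4000/v1",  # adjust to wherever your proxy listens
)
# The chat.completions.create() calls in the rest of this post then work unchanged;
# only the model parameter changes, e.g. model="claude-3-5-sonnet-20240620".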
Step-by-Step Process
Let's break down the script and understand how each part contributes to the overall process.
1. Import Necessary Libraries
import asyncio
import logging
import httpx
from openai import AsyncOpenAI
import random
import os
import json
import numpy as np
2. Configure Logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
c_handler = logging.StreamHandler()
logger.addHandler(c_handler)
3. Retrieve API Credentials
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_API_BASE = os.getenv("OPENAI_API_BASE") or "https://api.openai.com/v1"
We obtain the OpenAI API key and base URL from environment variables. Make sure these are set before running the script!
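If you want the script to fail fast with a clearer error when the key is missing, you can add an optional guard like the following (this check is not part of the original script):
# Optional sanity check: stop immediately if no API key was provided.
if not OPENAI_API_KEY:
    raise RuntimeError("Please set the OPENAI_API_KEY environment variable before running.")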
4. Initialize the OpenAI Client
client = AsyncOpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_API_BASE,
    http_client=httpx.AsyncClient(
        http2=True,
        limits=httpx.Limits(
            max_connections=None,
            max_keepalive_connections=None,
            keepalive_expiry=None,
        ),
    ),
    max_retries=0,
    timeout=3600,
)
We create an asynchronous OpenAI client with customized HTTP settings. (Note that http2=True relies on httpx's optional HTTP/2 support, which you can install with pip install "httpx[http2]".)
5. Define the Main Asynchronous Function
async def main():
    statements_to_check = [
        "SEZC means Special Economic Zone Company.",
        "SEZC means Special Economic Zone Corporation.",
    ]
The main function contains the core logic. We define a list of statements that we want to fact-check.
6. Fact-Checking with the Large Model
for statement_to_check in statements_to_check:
    logger.debug(f"--------")
    logger.debug(f"Checking statement: {statement_to_check}")
    response_text = await client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        max_tokens=8192,
        seed=random.randint(1, 0x7FFFFFFF),
        messages=[
            {
                "role": "system",
                "content": "Fact check the user's message.",
            },
            {
                "role": "user",
                "content": statement_to_check,
            },
        ],
        temperature=0.0,
    )
We iterate over each statement and use the gpt-4o-2024-08-06 model to perform fact-checking. The model provides a detailed response about the accuracy of the statement.
7. Extract and Log the Response
text_answer = response_text.choices[0].message.content
logger.debug(text_answer)
We extract the assistant's response from the API reply and log it for review.
8. Converting the Response to JSON with the Smaller Model
response_json = await client.chat.completions.create(
    response_format={"type": "json_object"},
    model="gpt-4o-mini-2024-07-18",
    max_tokens=20,
    logprobs=True,
    stop=[
        "rue",
        "alse",
        "}",
    ],
    seed=random.randint(1, 0x7FFFFFFF),
    messages=[
        {
            "role": "user",
            "content": f"Fact check this statement:\n\n```{statement_to_check}```",
        },
        {
            "role": "assistant",
            "content": text_answer,
        },
        {
            "role": "user",
            "content": 'Convert your answer into a JSON object that follows {"fact": boolean}',
        },
    ],
    temperature=0.0,
)
We pass the assistant's response to a smaller model, gpt-4o-mini-2024-07-18, asking it to convert the answer into a JSON object. This model is faster and suitable for simple tasks like formatting.
We use the stop parameter to save a bit on costs as well, which we covered in a previous blog post.
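To make the trick concrete: the smaller model starts writing {"fact": true} or {"fact": false}, but generation halts as soon as one of the stop sequences ("rue", "alse", or "}") would appear, and the stop string itself is not returned. So instead of a complete JSON object, we receive something like this (illustrative, not captured from a real run):
{"fact": t
{"fact": f
That truncated tail is all we need, which is also why step 10 below only looks at a single letter of the last token.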
9. Handling the Response and Extracting Probabilities
if response_json.choices[0].logprobs is None:
    raise ValueError("No logprobs in response")
last_logprob_content = response_json.choices[0].logprobs.content[-1]
last_logprob_token = last_logprob_content.token
last_logprob_float = last_logprob_content.logprob
last_logprob_percentage = np.exp(last_logprob_float)
We check that the response includes log probabilities, which help in assessing the confidence of the model's output (covered in a previous blog post of ours). We then extract the log probability of the last token, as that token is the one holding our true or false answer.
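As a quick sanity check of that conversion (the numbers below are made up for illustration, not taken from an actual API response):
import numpy as np

# A log probability close to 0 means the model was nearly certain of the token.
print(np.exp(-0.02))  # ~0.980, i.e. roughly 98% confidence
print(np.exp(-2.3))   # ~0.100, i.e. roughly 10% confidence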
10. Determining the Fact Status
if "t" in last_logprob_token.lower():
is_fact = True
elif "f" in last_logprob_token.lower():
is_fact = False
else:
raise ValueError("Invalid token")
Based on the token (expected to be true or false), we determine whether the statement is a fact. We check only a single letter because, with our stop sequences, the last token may be just one character rather than a full word.
11. Creating and Logging the Result Object
result_obj = {
    "fact": is_fact,
    "logprob": last_logprob_float,
    "percentage": last_logprob_percentage,
}
logger.debug(json.dumps(result_obj, indent=2))
We compile the results into a JSON object and log it in a readable format.
Perfect! These are exactly the results we were hoping for.
12. Minifying the JSON String
thin_json_str = json.dumps(result_obj, separators=(",", ":"))
logger.debug(f"Thin JSON: {repr(thin_json_str)}")
For compactness, we create a minified JSON string without extra whitespace and log it. This step can be omitted entirely; we include it here for documentation purposes, since a minified string is handy if you plan to pass this output along to another LLM.
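As a tiny standalone illustration of what separators=(",", ":") does (the values here are made up):
import json

obj = {"fact": True, "logprob": -0.02}
print(json.dumps(obj, indent=2))               # pretty-printed over multiple lines
print(json.dumps(obj, separators=(",", ":")))  # {"fact":true,"logprob":-0.02}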
13. Running the Main Function
if __name__ == "__main__":
    asyncio.run(main())
We check if the script is run directly and execute the main function using asyncio's event loop.
Full Code with Comments
Below is the complete script with comments explaining each step:
#!/usr/bin/env python3.11
# -*- coding: utf-8 -*-
# Author: David Manouchehri
import asyncio # For asynchronous operations
import logging # For logging messages
import httpx # For making HTTP requests asynchronously
from openai import AsyncOpenAI # OpenAI client
import random # For generating random numbers
import os # For accessing environment variables
import json # For handling JSON data
import numpy as np # For numerical operations
# Set up logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
c_handler = logging.StreamHandler()
logger.addHandler(c_handler)
# Get OpenAI API credentials
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_API_BASE = os.getenv("OPENAI_API_BASE") or "https://api.openai.com/v1"
# Initialize OpenAI client
client = AsyncOpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_API_BASE,
    http_client=httpx.AsyncClient(
        http2=True,
        limits=httpx.Limits(
            max_connections=None,
            max_keepalive_connections=None,
            keepalive_expiry=None,
        ),
    ),
    max_retries=0,
    timeout=3600,
)
# Main asynchronous function
async def main():
    # List of statements to fact-check
    statements_to_check = [
        "SEZC means Special Economic Zone Company.",
        "SEZC means Special Economic Zone Corporation.",
    ]
    # Iterate over each statement
    for statement_to_check in statements_to_check:
        logger.debug(f"--------")
        logger.debug(f"Checking statement: {statement_to_check}")
        # Fact-check using the large model
        response_text = await client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            max_tokens=8192,
            seed=random.randint(1, 0x7FFFFFFF),
            messages=[
                {
                    "role": "system",
                    "content": "Fact check the user's message.",
                },
                {
                    "role": "user",
                    "content": statement_to_check,
                },
            ],
            temperature=0.0,
        )
        # Extract the assistant's response
        text_answer = response_text.choices[0].message.content
        logger.debug(text_answer)
        # Convert the response to JSON using the smaller model
        response_json = await client.chat.completions.create(
            response_format={"type": "json_object"},
            model="gpt-4o-mini-2024-07-18",
            max_tokens=20,
            logprobs=True,
            stop=[
                "rue",
                "alse",
                "}",
            ],
            seed=random.randint(1, 0x7FFFFFFF),
            messages=[
                {
                    "role": "user",
                    "content": f"Fact check this statement:\n\n```{statement_to_check}```",
                },
                {
                    "role": "assistant",
                    "content": text_answer,
                },
                {
                    "role": "user",
                    "content": 'Convert your answer into a JSON object that follows {"fact": boolean}',
                },
            ],
            temperature=0.0,
        )
        # Ensure response includes log probabilities
        if response_json.choices[0].logprobs is None:
            raise ValueError("No logprobs in response")
        # Extract the last token's log probability
        last_logprob_content = response_json.choices[0].logprobs.content[-1]
        last_logprob_token = last_logprob_content.token
        last_logprob_float = last_logprob_content.logprob
        last_logprob_percentage = np.exp(last_logprob_float)
        # Determine if the statement is a fact
        if "t" in last_logprob_token.lower():
            is_fact = True
        elif "f" in last_logprob_token.lower():
            is_fact = False
        else:
            raise ValueError("Invalid token")
        # Create result object
        result_obj = {
            "fact": is_fact,
            "logprob": last_logprob_float,
            "percentage": last_logprob_percentage,
        }
        # Log the result
        logger.debug(json.dumps(result_obj, indent=2))
        # Minify the JSON string
        thin_json_str = json.dumps(result_obj, separators=(",", ":"))
        logger.debug(f"Thin JSON: {repr(thin_json_str)}")
# Execute the main function
if __name__ == "__main__":
    asyncio.run(main())
Conclusion
By chaining the responses from a larger model into a smaller one, we can optimize both the cost and efficiency of our applications. The larger model handles complex tasks like fact-checking, while the smaller model takes care of formatting and conversion tasks.
Key Takeaways:
- Resource Optimization: Use powerful models only when necessary.
- Cost Efficiency: Leverage smaller models for simpler tasks.
- Structured Outputs: Convert complex responses into structured data formats like JSON.