Optimizing Claude MCP Server Usage: Leveraging Prompt Caching

MCP servers are great, but they can get expensive on input tokens when Claude has to make multiple tool calls: each call is billed as if you had sent another API request, with everything in the conversation so far counted as input again.

Let's start with this example request:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "List all the Linux kernel arguments."
        }
      ]
    }
  ],
  "mcp_servers": [
    {
      "type": "url",
      "url": "https://mcp.deepwiki.com/sse",
      "name": "DeepWiki"
    }
  ]
}

Request body without caching enabled
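
If you want to reproduce this yourself, here's a minimal sketch of how that body could be sent with Python and the requests library. Note that the MCP connector sits behind a beta header; at the time of writing that header is anthropic-beta: mcp-client-2025-04-04, but treat it as an assumption and check the docs if the request is rejected.

# Minimal sketch: POST the request body above to the Messages API.
# The MCP connector is gated behind a beta header (mcp-client-2025-04-04
# at the time of writing -- check the docs if this has changed).
import os
import requests

request_body = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1000,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "List all the Linux kernel arguments."}
            ],
        }
    ],
    "mcp_servers": [
        {"type": "url", "url": "https://mcp.deepwiki.com/sse", "name": "DeepWiki"}
    ],
}

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "mcp-client-2025-04-04",
        "content-type": "application/json",
    },
    json=request_body,
)
print(response.json()["usage"])

Sending the example request with Python (sketch)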

In the response, we can see there are four tool calls happening.

{
  "id": "msg_013MQCyvkhCFrx9Cn5CriahH",
  "type": "message",
  "role": "assistant",
  "model": "claude-sonnet-4-20250514",
  "content": [
    {
      "type": "text",
      "text": "I'll help you... <REMOVED_FOR_BLOG_LENGTH>"
    },
    {
      "type": "mcp_tool_use",
      "id": "mcptoolu_0198sXxscecG6ECQ367Bbftp",
      "name": "ask_question",
      "input": {
        "repoName": "torvalds/linux",
        "question": "What are all... <REMOVED_FOR_BLOG_LENGTH>"
      },
      "server_name": "DeepWiki"
    },
    {
      "type": "mcp_tool_result",
      "tool_use_id": "mcptoolu_0198sXxscecG6ECQ367Bbftp",
      "is_error": false,
      "content": [
        {
          "type": "text",
          "text": "It appears you... <REMOVED_FOR_BLOG_LENGTH>"
        }
      ]
    },
    {
      "type": "text",
      "text": "Let me try a more... <REMOVED_FOR_BLOG_LENGTH>"
    },
    {
      "type": "mcp_tool_use",
      "id": "mcptoolu_01TvGoXPMgRCapDFSReNvcRo",
      "name": "ask_question",
      "input": {
        "repoName": "torvalds/linux",
        "question": "Where is the documentation... <REMOVED_FOR_BLOG_LENGTH>"
      },
      "server_name": "DeepWiki"
    },
    {
      "type": "mcp_tool_result",
      "tool_use_id": "mcptoolu_01TvGoXPMgRCapDFSReNvcRo",
      "is_error": false,
      "content": [
        {
          "type": "text",
          "text": "The documentation... <REMOVED_FOR_BLOG_LENGTH>"
        }
      ]
    },
    {
      "type": "text",
      "text": "Now let me get... <REMOVED_FOR_BLOG_LENGTH>"
    },
    {
      "type": "mcp_tool_use",
      "id": "mcptoolu_011KQQt3Xa2EYHYPveBx9roW",
      "name": "ask_question",
      "input": {
        "repoName": "torvalds/linux",
        "question": "Show... <REMOVED_FOR_BLOG_LENGTH>"
      },
      "server_name": "DeepWiki"
    },
    {
      "type": "mcp_tool_result",
      "tool_use_id": "mcptoolu_011KQQt3Xa2EYHYPveBx9roW",
      "is_error": false,
      "content": [
        {
          "type": "text",
          "text": "I cannot provide... <REMOVED_FOR_BLOG_LENGTH>"
        }
      ]
    },
    {
      "type": "text",
      "text": "Let me try... <REMOVED_FOR_BLOG_LENGTH>"
    },
    {
      "type": "mcp_tool_use",
      "id": "mcptoolu_014RqVvjACG3BDxRzfgMjFY7",
      "name": "ask_question",
      "input": {
        "repoName": "torvalds/linux",
        "question": "What are the main... <REMOVED_FOR_BLOG_LENGTH>"
      },
      "server_name": "DeepWiki"
    },
    {
      "type": "mcp_tool_result",
      "tool_use_id": "mcptoolu_014RqVvjACG3BDxRzfgMjFY7",
      "is_error": false,
      "content": [
        {
          "type": "text",
          "text": "The user is asking about... <REMOVED_FOR_BLOG_LENGTH>"
        }
      ]
    },
    {
      "type": "text",
      "text": "Based on my search of the Linux kernel repository... <REMOVED_FOR_BLOG_LENGTH>"
    }
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 7356,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0,
    "output_tokens": 1436
  }
}

Response body without caching enabled

We don't really care about the actual answer here; let's focus on the usage instead:

{
  "input_tokens": 7356,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 0,
  "output_tokens": 1436
}

Response usage without caching enabled

Ouch, we end up being billed for over 7,000 input tokens at full price, because we aren't caching anything.

While not explicitly documented, MCP servers are really just tools behind the scenes (even for Anthropic models). This means we can use prompt caching with them just like we can with "normal" tools!

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "List all the Linux kernel arguments.",
          "cache_control": {
            "type": "ephemeral"
          }
        }
      ]
    }
  ],
  "mcp_servers": [
    {
      "type": "url",
      "url": "https://mcp.deepwiki.com/sse",
      "name": "DeepWiki"
    }
  ]
}

Request body with caching enabled

How's our usage now?

{
  "input_tokens": 1744,
  "cache_creation_input_tokens": 3218,
  "cache_read_input_tokens": 7489,
  "output_tokens": 1331
}

Response usage with caching enabled

Much better. On just one prompt, we went from paying about $0.0221 in input-token costs without caching down to about $0.0195 with caching!

7356*(3/1000000) + 0*(3.75/1000000) + 0*(0.30/1000000) ≈ $0.0221

Cost without prompt caching

1744*(3/1000000) + 3218*(3.75/1000000) + 7489*(0.30/1000000) ≈ $0.0195

Cost with prompt caching
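
If you want to sanity-check these numbers yourself, here's a small sketch that plugs each usage block into the list prices we used above ($3/MTok input, $3.75/MTok cache writes, $0.30/MTok cache reads for Claude Sonnet 4); treat the prices as assumptions based on the pricing page at the time of writing.

# Sketch: compute input-token cost from a usage block, using the Claude
# Sonnet 4 list prices assumed above (USD per million tokens).
INPUT_PRICE = 3.00        # regular input tokens
CACHE_WRITE_PRICE = 3.75  # cache_creation_input_tokens (cache writes)
CACHE_READ_PRICE = 0.30   # cache_read_input_tokens (cache reads)

def input_cost(usage: dict) -> float:
    return (
        usage["input_tokens"] * INPUT_PRICE
        + usage["cache_creation_input_tokens"] * CACHE_WRITE_PRICE
        + usage["cache_read_input_tokens"] * CACHE_READ_PRICE
    ) / 1_000_000

uncached = {"input_tokens": 7356, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}
cached = {"input_tokens": 1744, "cache_creation_input_tokens": 3218, "cache_read_input_tokens": 7489}

print(f"without caching: ${input_cost(uncached):.4f}")  # ~$0.0221
print(f"with caching:    ${input_cost(cached):.4f}")    # ~$0.0195

Recomputing the input-token costs in Python (sketch)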

In most situations, you should see a significant cost saving if you remember to add prompt caching whenever you're using remote MCP servers, even if you are only sending one API request.

We did notice that enabling caching like this can sometimes result in a high cache_creation_input_tokens value; we'll look into this in a future blog post.