Identifying Anthropic Claude Errors on Vertex AI

So you've switched to Claude 3.5 Sonnet on Google Vertex AI. Congratulations! A good next question is: how reliable is it? You can set up Langfuse or Cloudflare AI Gateway (we recommend both), but Google Cloud has a much simpler built-in solution:

Audit Logs and Logs Explorer!

First, you'll need to enable Audit Logs for the Vertex AI API (in the Cloud console, under IAM & Admin > Audit Logs). I recommend enabling Admin Read, Data Read, and Data Write to be safe.

That's it. Now just wait until you've sent some requests, then hop over to the Logs Explorer. A simple query like this one lists every request to the model:

protoPayload.serviceName="aiplatform.googleapis.com"
protoPayload.resourceName=~"projects/.*/locations/us-east5/publishers/anthropic/models/claude-3-5-sonnet@20240620"

We can see that out of our 10,236 requests this week, 167 errored out. That doesn't mean Vertex AI was at fault, though, only that the request failed.

Using the protoPayload.status.message field, we can filter out requests that failed because of our own mistakes, such as sending malformed JSON.

-protoPayload.status.message="{\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"message\":\"The request body is not valid JSON.\"}}"

We can also filter by status code (3 is INVALID_ARGUMENT), which is more robust in case Vertex AI changes the message string.

-protoPayload.status.code="3"
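If you export these log entries and want to sort the errors in code, a small lookup on the status code is enough. The sketch below is only an illustration: the numeric values are the standard gRPC status codes, but the split into "client" and "service" buckets and the function name classify_status_code are my own assumptions, not something Vertex AI defines.

```python
# Standard gRPC status codes (a subset). The grouping into "our fault" vs.
# "service side" is an assumption made for illustration; adjust it to taste.
CLIENT_CODES = {
    3: "INVALID_ARGUMENT",
    5: "NOT_FOUND",
    7: "PERMISSION_DENIED",
    16: "UNAUTHENTICATED",
}
SERVICE_CODES = {
    4: "DEADLINE_EXCEEDED",
    8: "RESOURCE_EXHAUSTED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}


def classify_status_code(code: int) -> str:
    """Return 'client', 'service', or 'unknown' for a numeric status code."""
    if code in CLIENT_CODES:
        return "client"
    if code in SERVICE_CODES:
        return "service"
    return "unknown"
```

Whether RESOURCE_EXHAUSTED counts as the service's problem or yours depends on your quota, so treat that bucket as a judgment call.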

Putting this all together, and adding severity=ERROR so that only failed requests are shown, we get the following query:

protoPayload.serviceName="aiplatform.googleapis.com"
protoPayload.resourceName=~"projects/.*/locations/us-east5/publishers/anthropic/models/claude-3-5-sonnet@20240620"
severity=ERROR
-protoPayload.status.code="3"
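If you run this check for several regions or models, it can help to generate the filter instead of editing it by hand. Here is a minimal sketch; build_error_filter and its parameters are hypothetical names of mine, and the output is just the query above as a string.

```python
from typing import Iterable


def build_error_filter(
    location: str,
    model: str,
    exclude_codes: Iterable[int] = (3,),
) -> str:
    """Build a Logs Explorer filter for failed Claude requests on Vertex AI.

    exclude_codes lists status codes to drop, e.g. 3 (INVALID_ARGUMENT)
    for requests that failed because of our own mistakes.
    """
    lines = [
        'protoPayload.serviceName="aiplatform.googleapis.com"',
        'protoPayload.resourceName=~"projects/.*/locations/'
        f'{location}/publishers/anthropic/models/{model}"',
        "severity=ERROR",
    ]
    # One exclusion line per status code, matching the style used above.
    lines += [f'-protoPayload.status.code="{code}"' for code in exclude_codes]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_error_filter("us-east5", "claude-3-5-sonnet@20240620"))
```

The resulting string can be pasted into the Logs Explorer as-is, or passed as the filter when reading entries through a Cloud Logging client library.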

For us, 167 errors out of 10,236 requests is roughly a 1.6% error rate, and whatever remains after the filter above is on GCP's end, so be sure to have retry logic, or ideally failover logic, in your application!
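What that logic looks like depends on your stack, but here is a rough sketch of retry plus failover. Everything in it is hypothetical: TransientError stands in for whatever exception your client raises on a retryable failure, and each provider is just a function that takes a prompt and returns text (for example, one wrapping Vertex AI and one wrapping a backup).

```python
import random
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. UNAVAILABLE)."""


def call_with_failover(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying transient failures with backoff."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries_per_provider + 1):
            try:
                return provider(prompt)
            except TransientError as exc:
                last_error = exc
                if attempt < retries_per_provider:
                    # Exponential backoff with jitter before the next retry.
                    sleep(base_delay * (2 ** attempt) * (1 + random.random()))
        # Retries exhausted for this provider: fall through to the next one.
    raise RuntimeError("all providers failed") from last_error
```

Only errors you consider transient should be retried; a request that failed because of our own mistake (like the invalid JSON above) will fail the same way every time.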