
# Understanding Bedrock Response Format

The AWS Bedrock streaming response format consists of multiple Server-Sent Events (SSE) chunks. Each chunk follows this structure:

```
:event-type chunk
:content-type application/json
:message-type event
{"bytes":"base64_encoded_data","p":"signature"}
```
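As a minimal sketch of the decoding step, the helper below takes the JSON line of one chunk and returns the decoded model message. The name `decodeChunkPayload` and the `ChunkEnvelope` type are hypothetical and not taken from the project; the sketch assumes the `bytes` field holds base64-encoded UTF-8 JSON and that `atob` and `TextDecoder` are available globally (browsers and recent Node.js versions).

```typescript
// Hypothetical envelope type for the JSON line shown above.
interface ChunkEnvelope {
  bytes: string; // base64-encoded JSON emitted by the model
  p?: string;
}

// Hypothetical helper: parse the envelope, then decode and parse its payload.
function decodeChunkPayload(rawJson: string): unknown {
  const envelope = JSON.parse(rawJson) as ChunkEnvelope;
  // atob returns a binary string; convert it to bytes so UTF-8 text decodes correctly.
  const binary = atob(envelope.bytes);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}
```

The result is typed as `unknown` on purpose, since its shape depends on the model, as described in the sections that follow.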

## Model-Specific Response Formats

### Claude 3 Format

When using Claude 3 models (e.g., claude-3-haiku-20240307), the decoded messages include:

1. **message_start**
```json
{
  "type": "message_start",
  "message": {
    "id": "msg_bdrk_01A6sahWac4XVTR9sX3rgvsZ",
    "type": "message",
    "role": "assistant",
    "model": "claude-3-haiku-20240307",
    "content": [],
    "stop_reason": null,
    "stop_sequence": null,
    "usage": {
      "input_tokens": 8,
      "output_tokens": 1
    }
  }
}
```

2. **content_block_start**
```json
{
  "type": "content_block_start",
  "index": 0,
  "content_block": {
    "type": "text",
    "text": ""
  }
}
```

3. **content_block_delta**
```json
{
  "type": "content_block_delta",
  "index": 0,
  "delta": {
    "type": "text_delta",
    "text": "Hello"
  }
}
```
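To illustrate how these three events combine, here is a small sketch that accumulates the streamed text from a list of decoded Claude events. The `ClaudeEvent` type and `collectClaudeText` function are illustrative names, not part of the project, and the type covers only the fields used here.

```typescript
// Illustrative subset of the decoded Claude 3 events shown above.
type ClaudeEvent =
  | { type: "message_start"; message: { id: string; model: string } }
  | { type: "content_block_start"; index: number; content_block: { type: string; text?: string } }
  | { type: "content_block_delta"; index: number; delta: { type: string; text?: string } };

// Concatenate the text carried by text_delta events, ignoring other event types.
function collectClaudeText(events: ClaudeEvent[]): string {
  let text = "";
  for (const event of events) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      text += event.delta.text ?? "";
    }
  }
  return text;
}
```

Only `content_block_delta` events contribute text in this sketch; `message_start` and `content_block_start` carry metadata and an empty initial block.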

### Mistral Format

When using Mistral models (e.g., mistral-large-2407), the decoded messages have a different structure:

```json
{
  "id": "b0098812-0ad9-42da-9f17-a5e2f554eb6b",
  "object": "chat.completion.chunk",
  "created": 1732582566,
  "model": "mistral-large-2407",
  "choices": [{
    "index": 0,
    "logprobs": null,
    "context_logits": null,
    "generation_logits": null,
    "message": {
      "role": null,
      "content": "Hello",
      "tool_calls": null,
      "index": null,
      "tool_call_id": null
    },
    "stop_reason": null
  }],
  "usage": null,
  "p": null
}
```
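Reading the text from a chunk like this could look as follows. This is a sketch only: `MistralChunk` and `mistralText` are hypothetical names, and the type lists just the fields needed to reach `choices[0].message.content`.

```typescript
// Illustrative subset of the decoded Mistral chunk shown above.
interface MistralChunk {
  choices?: Array<{
    message?: { content?: string | null } | null;
    stop_reason?: string | null;
  }>;
}

// Return the text piece of a chunk, or an empty string when none is present.
function mistralText(chunk: MistralChunk): string {
  return chunk.choices?.[0]?.message?.content ?? "";
}
```

Optional chaining keeps the helper safe for chunks that omit `choices` or carry a null `content`.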

### Llama Format

When using Llama models (3.1 or 3.2), the decoded messages use a simpler structure focused on generation tokens:

```json
{
  "generation": "Hello",
  "prompt_token_count": null,
  "generation_token_count": 2,
  "stop_reason": null
}
```

Each chunk contains:
- generation: The generated text piece
- prompt_token_count: Token count of the input (non-null only in the first chunk)
- generation_token_count: Running count of generated tokens
- stop_reason: Indicates completion (null until final chunk)

First chunk example (includes prompt_token_count):
```json
{
  "generation": "\n\n",
  "prompt_token_count": 10,
  "generation_token_count": 1,
  "stop_reason": null
}
```
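The field semantics above can be sketched as a small reducer over the decoded chunks. `LlamaChunk`, `LlamaSummary`, and `summarizeLlamaStream` are illustrative names, and the sketch assumes every chunk carries all four fields.

```typescript
// Shape of one decoded Llama chunk, as shown above.
interface LlamaChunk {
  generation: string;
  prompt_token_count: number | null;
  generation_token_count: number;
  stop_reason: string | null;
}

// Hypothetical summary of a whole stream.
interface LlamaSummary {
  text: string;
  promptTokens: number | null;
  generatedTokens: number;
  stopReason: string | null;
}

function summarizeLlamaStream(chunks: LlamaChunk[]): LlamaSummary {
  const summary: LlamaSummary = { text: "", promptTokens: null, generatedTokens: 0, stopReason: null };
  for (const chunk of chunks) {
    summary.text += chunk.generation;
    // The prompt count is only set on the first chunk, so keep the non-null value.
    if (chunk.prompt_token_count !== null) summary.promptTokens = chunk.prompt_token_count;
    // generation_token_count is a running total, so the latest value wins.
    summary.generatedTokens = chunk.generation_token_count;
    if (chunk.stop_reason !== null) summary.stopReason = chunk.stop_reason;
  }
  return summary;
}
```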

### Titan Text Format

When using Amazon's Titan models (text or TG1), the response comes as a single chunk with complete text and metrics:

```json
{
  "outputText": "\nBot: Hello! How can I help you today?",
  "index": 0,
  "totalOutputTextTokenCount": 13,
  "completionReason": "FINISH",
  "inputTextTokenCount": 3,
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 3,
    "outputTokenCount": 13,
    "invocationLatency": 833,
    "firstByteLatency": 833
  }
}
```

Both Titan text and Titan TG1 use the same response format, with only minor differences in token counts and latency values. For example, here's a TG1 response:

```json
{
  "outputText": "\nBot: Hello! How can I help you?",
  "index": 0,
  "totalOutputTextTokenCount": 12,
  "completionReason": "FINISH",
  "inputTextTokenCount": 3,
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 3,
    "outputTokenCount": 12,
    "invocationLatency": 845,
    "firstByteLatency": 845
  }
}
```

Key fields:
- outputText: The complete generated response
- totalOutputTextTokenCount: Total tokens in the response
- completionReason: Reason for completion (e.g., "FINISH")
- inputTextTokenCount: Number of input tokens
- amazon-bedrock-invocationMetrics: Detailed performance metrics
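Because Titan returns everything in one chunk, reading it is a direct field mapping. The sketch below uses hypothetical names (`TitanChunk`, `readTitanChunk`) and assumes that a `completionReason` of "FINISH" marks a normal completion, as in the examples above.

```typescript
// Illustrative subset of the Titan response shown above.
interface TitanChunk {
  outputText: string;
  totalOutputTextTokenCount: number;
  completionReason: string | null;
  inputTextTokenCount: number;
}

// Map the Titan fields onto a small, model-neutral result object.
function readTitanChunk(chunk: TitanChunk) {
  return {
    text: chunk.outputText,
    inputTokens: chunk.inputTextTokenCount,
    outputTokens: chunk.totalOutputTextTokenCount,
    finished: chunk.completionReason === "FINISH",
  };
}
```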

## Model-Specific Completion Metrics

### Mistral
The final chunk includes token usage and invocation metrics:
```json
{
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 29,
    "completion_tokens": 24
  },
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 5,
    "outputTokenCount": 24,
    "invocationLatency": 719,
    "firstByteLatency": 148
  }
}
```

### Claude 3
Metrics are included in the message_delta event, alongside stop_reason.

### Llama
Metrics are included in the final chunk, where stop_reason is "stop":
```json
{
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 10,
    "outputTokenCount": 11,
    "invocationLatency": 873,
    "firstByteLatency": 550
  }
}
```

### Titan
Both Titan text and TG1 include metrics in the single response chunk:
```json
{
  "amazon-bedrock-invocationMetrics": {
    "inputTokenCount": 3,
    "outputTokenCount": 12,
    "invocationLatency": 845,
    "firstByteLatency": 845
  }
}
```
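Since the Mistral, Llama, and Titan examples above share the `amazon-bedrock-invocationMetrics` key, one helper can read it for all of them. This is a sketch under that assumption; `InvocationMetrics` and `readInvocationMetrics` are hypothetical names, and the helper does not validate the individual fields.

```typescript
// Shape of the metrics object shown in the examples above.
interface InvocationMetrics {
  inputTokenCount: number;
  outputTokenCount: number;
  invocationLatency: number;
  firstByteLatency: number;
}

// Return the metrics object if the chunk carries one, otherwise null.
function readInvocationMetrics(chunk: Record<string, unknown>): InvocationMetrics | null {
  const metrics = chunk["amazon-bedrock-invocationMetrics"];
  if (typeof metrics !== "object" || metrics === null) return null;
  return metrics as InvocationMetrics;
}
```

Returning `null` for chunks without metrics lets a caller invoke the helper on every chunk and act only when the final one arrives.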

## How the Response is Processed

1. The raw response is first split into chunks based on SSE format
2. For each chunk:
   - The base64 encoded data is decoded
   - The JSON is parsed to extract the message content
   - Based on the model type and message type, different processing is applied:

### Claude 3 Processing
- message_start: Initializes a new message with model info and usage stats
- content_block_start: Starts a new content block (text, tool use, etc.)
- content_block_delta: Adds incremental content to the current block
- message_delta: Updates message metadata, such as stop_reason

### Mistral Processing
- Each chunk contains a complete message object with choices array
- The content is streamed through the message.content field
- Final chunk includes token usage and invocation metrics

### Llama Processing
- Each chunk contains a generation field with the text piece
- First chunk includes prompt_token_count
- Tracks generation progress through generation_token_count
- Simple streaming format focused on text generation
- Final chunk includes complete metrics

### Titan Processing
- Single chunk response with complete text
- No streaming - returns full response at once
- Includes comprehensive metrics in the same chunk

## Handling in Code

The response is processed by the `transformBedrockStream` function in `app/utils/aws.ts`, which:

1. Reads the stream chunks
2. Parses each chunk using `parseEventData`
3. Handles model-specific formats:
   - For Claude: Processes message_start, content_block_start, content_block_delta
   - For Mistral: Extracts content from choices[0].message.content
   - For Llama: Uses the generation field directly
   - For Titan: Uses the outputText field from the single response
4. Transforms the parsed data into a consistent format for the client
5. Yields the transformed data as SSE events

This allows for real-time streaming of the model's response while maintaining a consistent format for the client application, regardless of which model is being used.
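The model-specific handling in step 3 can be sketched as a single function that picks the text field based on the shape of the decoded chunk. This is an illustration, not the actual body of `transformBedrockStream`: the `DecodedChunk` type and `extractText` function are hypothetical, and the real code may dispatch on the model ID rather than on the chunk shape.

```typescript
// Hypothetical union of the fields used by the four formats described above.
interface DecodedChunk {
  type?: string;
  delta?: { text?: string };
  choices?: Array<{ message?: { content?: string | null } | null }>;
  generation?: string;
  outputText?: string;
}

// Return the text carried by a decoded chunk, whichever model produced it.
function extractText(chunk: DecodedChunk): string {
  // Claude: only content_block_delta events carry text.
  if (chunk.type === "content_block_delta") return chunk.delta?.text ?? "";
  // Mistral: text lives in choices[0].message.content.
  if (Array.isArray(chunk.choices)) return chunk.choices[0]?.message?.content ?? "";
  // Llama: the generation field is the text piece.
  if (typeof chunk.generation === "string") return chunk.generation;
  // Titan: the single chunk holds the complete text.
  if (typeof chunk.outputText === "string") return chunk.outputText;
  return "";
}
```

Chunks that carry no text, such as Claude's `message_start`, yield an empty string, so a caller can append the result unconditionally.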