Create model responses for chat conversations using an OpenAI-compatible API.

Base URL

https://api.pinference.ai/api/v1
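
Because the API is OpenAI-compatible, you can point the official OpenAI Python SDK (openai>=1.0) at this base URL. A minimal setup sketch, assuming your key is stored in the API_KEY environment variable used by the curl examples; the later Python snippets reuse this client:

import os

from openai import OpenAI

# Reuse the OpenAI SDK against the Pinference base URL.
# API_KEY is assumed to be exported in your shell, as in the curl examples.
client = OpenAI(
    base_url="https://api.pinference.ai/api/v1",
    api_key=os.environ["API_KEY"],
)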

Create Chat Completion

Generate a response from a language model given a conversation history.

Request

curl -X POST https://api.pinference.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-70b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
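
The equivalent request with the Python client configured above (a sketch; the model and prompt mirror the curl example):

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
print(response.choices[0].message.content)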

Parameters

Parameter     Type             Required  Description
model         string           Yes       Model ID to use for completion
messages      array            Yes       Conversation messages
max_tokens    integer          No        Maximum tokens to generate
temperature   number           No        Sampling temperature (0-2)
stream        boolean          No        Enable streaming responses
stop          string or array  No        Stop sequences
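
A sketch combining the optional parameters; the specific values are illustrative, not recommendations:

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "List three French cities."}],
    max_tokens=100,    # cap the length of the reply
    temperature=0.7,   # moderate sampling randomness (range 0-2)
    stop=["\n\n"],     # stop generating at the first blank line
)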

Response

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1693721698,
  "model": "meta-llama/llama-3.1-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
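
With the Python client, the same fields are exposed as attributes on the returned object (a sketch using the earlier response):

print(response.choices[0].message.content)   # "The capital of France is Paris."
print(response.choices[0].finish_reason)     # "stop"
print(response.usage.total_tokens)           # 20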

Streaming

Enable real-time response streaming by setting stream: true. Each chunk carries an incremental delta of the assistant's message:

stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
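
To reassemble the complete reply, collect the deltas as they arrive (a sketch building on the loop above):

chunks = []
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        chunks.append(delta)
story = "".join(chunks)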

Advanced Parameters

Temperature Control

# Low temperature: focused, near-deterministic output
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Solve: 2+2=?"}],
    temperature=0.1
)

# Higher temperature: more varied, creative output
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.9
)

System Messages

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "Explain calculus"}
    ]
)
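
To continue a conversation, append the assistant's reply and the next user turn to the messages list before calling the endpoint again (a sketch; the follow-up prompt is illustrative):

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Explain calculus"},
]
first = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=messages,
)

# Carry the assistant's answer forward so the model sees the full history.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Can you give a concrete example?"})

followup = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=messages,
)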

Error Handling

Rate Limit (429)

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Invalid Model (400)

{
  "error": {
    "message": "Invalid model specified",
    "type": "invalid_request_error",
    "code": "invalid_model"
  }
}

Context Length Exceeded (400)

{
  "error": {
    "message": "Context length exceeded",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
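
With the openai Python SDK, these responses surface as exceptions: the 429 above raises openai.RateLimitError and the 400 responses raise openai.BadRequestError. A minimal retry sketch, assuming the client configured earlier (the backoff values are illustrative):

import time

import openai

def create_with_retry(messages, retries=3):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/llama-3.1-70b-instruct",
                messages=messages,
            )
        except openai.RateLimitError:
            # 429: back off and retry.
            time.sleep(2 ** attempt)
        except openai.BadRequestError:
            # 400 (invalid model, context length exceeded): fix the request instead of retrying.
            raise
    raise RuntimeError("rate limit retries exhausted")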