Create model responses for chat conversations using an OpenAI-compatible API.

Base URL

https://api.pinference.ai/api/v1
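
Because the API is OpenAI-compatible, you can point the official OpenAI Python SDK (openai>=1.0) at this base URL. A minimal setup sketch, assuming your key is stored in the API_KEY environment variable used by the curl examples; the later Python snippets reuse this client:

import os

from openai import OpenAI

# Reuse the OpenAI SDK against the Pinference base URL.
# API_KEY is assumed to be exported in your shell, as in the curl examples.
client = OpenAI(
    base_url="https://api.pinference.ai/api/v1",
    api_key=os.environ["API_KEY"],
)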

Create Chat Completion

Generate a response from a language model given a conversation history.

Request

curl -X POST https://api.pinference.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-70b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
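
The equivalent request with the Python client configured above (a sketch; the model and prompt mirror the curl example):

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
print(response.choices[0].message.content)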

Parameters

Parameter     Type             Required  Description
model         string           Yes       Model ID to use for completion
messages      array            Yes       Conversation messages
max_tokens    integer          No        Maximum tokens to generate
temperature   number           No        Sampling temperature (0-2)
stream        boolean          No        Enable streaming responses
stop          string or array  No        Stop sequences
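
A sketch combining the optional parameters; the specific values are illustrative, not recommendations:

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "List three French cities."}],
    max_tokens=100,    # cap the length of the reply
    temperature=0.7,   # moderate sampling randomness (range 0-2)
    stop=["\n\n"],     # stop generating at the first blank line
)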

Response

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1693721698,
  "model": "meta-llama/llama-3.1-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
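
With the Python client, the same fields are exposed as attributes on the returned object (a sketch using the earlier response):

print(response.choices[0].message.content)   # "The capital of France is Paris."
print(response.choices[0].finish_reason)     # "stop"
print(response.usage.total_tokens)           # 20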

Streaming

Enable real-time response streaming by setting stream: true. Each chunk carries an incremental delta of the assistant's message:

stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
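
To reassemble the complete reply, collect the deltas as they arrive (a sketch building on the loop above):

chunks = []
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        chunks.append(delta)
story = "".join(chunks)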

Advanced Parameters

Temperature Control

# Low temperature: focused, near-deterministic output
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Solve: 2+2=?"}],
    temperature=0.1
)

# Higher temperature: more varied, creative output
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.9
)

System Messages

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "Explain calculus"}
    ]
)
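
To continue a conversation, append the assistant's reply and the next user turn to the messages list before calling the endpoint again (a sketch; the follow-up prompt is illustrative):

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Explain calculus"},
]
first = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=messages,
)

# Carry the assistant's answer forward so the model sees the full history.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Can you give a concrete example?"})

followup = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=messages,
)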

Error Handling

Rate Limit (429)

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Invalid Model (400)

{
  "error": {
    "message": "Invalid model specified",
    "type": "invalid_request_error",
    "code": "invalid_model"
  }
}

Context Length Exceeded (400)

{
  "error": {
    "message": "Context length exceeded",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
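
With the openai Python SDK, these responses surface as exceptions: the 429 above raises openai.RateLimitError and the 400 responses raise openai.BadRequestError. A minimal retry sketch, assuming the client configured earlier (the backoff values are illustrative):

import time

import openai

def create_with_retry(messages, retries=3):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/llama-3.1-70b-instruct",
                messages=messages,
            )
        except openai.RateLimitError:
            # 429: back off and retry.
            time.sleep(2 ** attempt)
        except openai.BadRequestError:
            # 400 (invalid model, context length exceeded): fix the request instead of retrying.
            raise
    raise RuntimeError("rate limit retries exhausted")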