Create model responses for chat conversations using an OpenAI-compatible API.
Base URL
https://api.pinference.ai/api/v1
Authentication
All requests require a Bearer token in the Authorization header:
Authorization: Bearer your_api_key
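Because the API is OpenAI-compatible, you can also call it from the OpenAI Python SDK by pointing the client at the base URL above. A minimal setup sketch (the environment variable name is only an illustration; the Python examples below reuse this client):
import os
from openai import OpenAI

# Point the standard OpenAI client at the OpenAI-compatible base URL.
# The API key is sent as the Bearer token in the Authorization header.
client = OpenAI(
    base_url="https://api.pinference.ai/api/v1",
    api_key=os.environ["PINFERENCE_API_KEY"],  # illustrative variable name
)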
Team Account Usage
When using a team account, you must include the X-Prime-Team-ID header. Without this header, requests default to your personal account instead of your team account.
X-Prime-Team-ID: your-team-id-here
Find your Team ID on your Team’s Profile page.
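If you use the OpenAI Python SDK as sketched above, one way to attach the team header to every request is the client's default_headers option (the team ID below is a placeholder):
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.pinference.ai/api/v1",
    api_key=os.environ["PINFERENCE_API_KEY"],
    default_headers={"X-Prime-Team-ID": "your-team-id-here"},  # placeholder team ID
)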
Create Chat Completion
Generate a response from a language model given a conversation history.
Request
curl -X POST https://api.pinference.ai/api/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/llama-3.1-70b-instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
# With team account (add X-Prime-Team-ID header)
curl -X POST https://api.pinference.ai/api/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "X-Prime-Team-ID: your-team-id-here" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/llama-3.1-70b-instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
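The same request through the OpenAI Python SDK, assuming the client configured in the Authentication section, looks roughly like this:
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
print(response.choices[0].message.content)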
Parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID to use for completion |
| messages | array | Yes | Conversation messages |
| max_tokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0-2) |
| stream | boolean | No | Enable streaming responses |
| stop | string/array | No | Stop sequences |
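As an illustration, the optional parameters can be combined in a single request (the specific values here are arbitrary examples, not recommendations):
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "List three French cities"}],
    max_tokens=64,        # cap on generated tokens
    temperature=0.7,      # sampling temperature between 0 and 2
    stop=["\n\n"],        # stop generation at a blank line
)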
Response
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1693721698,
"model": "meta-llama/llama-3.1-70b-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 8,
"total_tokens": 20
}
}
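With the Python SDK, the same fields are exposed as attributes on the returned object. A small sketch, assuming the response object from the request above:
answer = response.choices[0].message.content
reason = response.choices[0].finish_reason  # e.g. "stop" or "length"
print(answer)
print(
    f"tokens: {response.usage.total_tokens} total "
    f"({response.usage.prompt_tokens} prompt, {response.usage.completion_tokens} completion)"
)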
Streaming
Enable real-time response streaming by setting stream: true:
stream = client.chat.completions.create(
model="meta-llama/llama-3.1-70b-instruct",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
Advanced Parameters
Temperature Control
# Near-deterministic output (low temperature)
response = client.chat.completions.create(
model="meta-llama/llama-3.1-70b-instruct",
messages=[{"role": "user", "content": "Solve: 2+2=?"}],
temperature=0.1
)
# Creative output
response = client.chat.completions.create(
model="meta-llama/llama-3.1-70b-instruct",
messages=[{"role": "user", "content": "Write a poem"}],
temperature=0.9
)
System Messages
response = client.chat.completions.create(
model="meta-llama/llama-3.1-70b-instruct",
messages=[
{"role": "system", "content": "You are a helpful math tutor."},
{"role": "user", "content": "Explain calculus"}
]
)
Error Handling
Rate Limit (429)
{
"error": {
"message": "Rate limit exceeded",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
Invalid Model (400)
{
"error": {
"message": "Invalid model specified",
"type": "invalid_request_error",
"code": "invalid_model"
}
}
Context Length Exceeded (400)
{
"error": {
"message": "Context length exceeded",
"type": "invalid_request_error",
"code": "context_length_exceeded"
}
}
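When calling the API through the OpenAI Python SDK, these HTTP errors surface as typed exceptions (RateLimitError for 429, BadRequestError for 400). A hedged retry sketch, assuming the client configured in the Authentication section; the retry count and backoff are arbitrary examples:
import time
import openai

def ask(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="meta-llama/llama-3.1-70b-instruct",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except openai.RateLimitError:
            # 429 rate_limit_exceeded: back off and retry
            time.sleep(2 ** attempt)
        except openai.BadRequestError as exc:
            # 400 invalid_model / context_length_exceeded: not worth retrying
            raise RuntimeError(f"Request rejected: {exc}") from exc
    raise RuntimeError("Rate limited on every attempt")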