Scaling
Production-ready agent deployment and scaling strategies
Agentic applications demand near-instant agent availability to maintain user engagement and responsiveness. Pipecat Cloud manages the complexity of scaling these agent deployments in production environments, providing granular controls for compute resources and cost optimization.
Core concepts
Instances
An instance represents a single unit of compute that runs your agent.
Instance costs are determined by:
- Active session runtime duration
- Warm instance maintenance time
- The compute profile specified in your deployment
Pipecat Cloud automatically provisions and manages instances to handle active sessions, ensuring that your deployment can scale to meet demand within the limits of your deployment configuration.
Instance pool
Making a deployment to Pipecat Cloud creates a managed pool of instances that:
- Routes requests to available instances
- Scales based on demand within configured limits
- Maintains optimal performance through auto-scaling
Developers can configure the upper and lower limits of a deployment’s instance pool, providing a cost-effective way to handle varying loads.
Minimum instances
min-instances
- Maintains specified number of warm instances to serve incoming requests
- Immediately ready to become active, reducing cold starts
- Defaults to 0 if unspecified
Minimum number of instances maintained in a pool at all times.
Developers specify a min-instances configuration to determine the number of instances that should be kept warm in their deployment pool. A warm instance is kept running and can immediately be used to serve an active session.
Maintaining a minimum number of instances is important to keep agent start times fast and reduce cold starts.
Maximum instances
max-instances
- Sets hard limit on concurrent sessions
- Acts as a cost control / load mechanism
- Returns HTTP 429 when pool capacity is reached
Maximum instances is the hard limit on the number of instances in your pool.
During beta, each deployment made to Pipecat Cloud has a maximum allowed pool size of 10. Please contact us at help@daily.co or via Discord if you require more capacity.
Deployments can optionally be made with a max-instances configuration that limits the number of instances that your pool can contain.
This exists as a cost control measure, allowing developers to limit the total number of active sessions that can run at any one time.
The maximum instance count is a hard limit, meaning requests made to a pool that is at capacity will receive a 429 response. See starting sessions for more information on how to handle this in your application code.
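One way an application might react to a 429 from a pool at capacity is to retry with exponential backoff. The sketch below is illustrative only and is not Pipecat Cloud's client API: `start_session` is a hypothetical callable standing in for whatever request your application makes to start a session, and it is assumed to return the HTTP status code. The attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable


def start_with_backoff(
    start_session: Callable[[], int],  # hypothetical: sends the start request, returns the HTTP status
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry a start request while the pool reports it is at capacity (HTTP 429)."""
    status = 429
    for attempt in range(max_attempts):
        status = start_session()
        if status != 429:
            return status  # success, or an error unrelated to capacity: do not retry
        if attempt < max_attempts - 1:
            # Double the wait after each failed attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** attempt)
    return status  # still 429 after every attempt; surface this to the user
```

If every attempt returns 429, the caller gets the 429 back and can show a "try again later" message rather than retrying indefinitely.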
Agent lifecycle
Pool Initialization
- Provisions the minimum number of warm instances based on the min-instances configuration (defaults to 0)
- Listens for session requests to route to available instances
Session Assignment
- ✅ If a warm instance is available, the session will be assigned to that instance
- ⏳ If no warm instances are available, and your pool is not at capacity, a new instance will be provisioned to handle the request (i.e. a cold start)
- ❌ If your pool is at capacity, your application will receive a 429 response from the start request
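The three session-assignment outcomes above can be sketched as a small decision function. This is a toy model for reasoning about pool behaviour, not Pipecat Cloud's implementation: the `Pool` class and its fields are hypothetical, and it assumes one session per instance.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Toy model of an instance pool (hypothetical, one session per instance)."""
    max_instances: int
    warm: int = 0    # instances running but idle
    active: int = 0  # instances serving a session

    def assign(self) -> str:
        if self.warm > 0:
            # A warm instance is available: the session starts immediately.
            self.warm -= 1
            self.active += 1
            return "warm"
        if self.warm + self.active < self.max_instances:
            # No warm instance, but the pool has headroom: provision one (cold start).
            self.active += 1
            return "cold-start"
        # The pool is at capacity: the start request receives a 429.
        return "429"
```

With `Pool(max_instances=2, warm=1)`, three consecutive calls to `assign()` walk through all three branches in order.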
Auto-scaling
- The Pipecat Cloud auto-scaler determines if additional warm instances should be created to support further requests.
- Once a session concludes, instances are either returned to the pool to serve another session or discarded
You are billed for warm instances, even if they are not handling active sessions. Developers should consider their deployment strategy when cost optimizing, adjusting the minimum and maximum instance count accordingly. See current pricing for details.
Cold-starts
A cold start may occur when an active session request is made and no warm instances are available in the pool to handle it. In this case, Pipecat Cloud will provision a new instance to handle the request.
Cold starts require additional time to provision the instance and load the agent, which may result in a delay for the user. To minimize cold starts, you can configure your pool to maintain a minimum number of warm instances at all times.
Pipecat Cloud aims to mitigate cold starts as much as possible through auto-scaling.
Scale-to-zero
For some deployments, using a minimum instance count of 0 is preferable (e.g. while in development). Since you are only charged for warm instances and active sessions, this can be a cost-effective way to manage deployments where fast start times are not required.
When the minimum instance count is set to 0, the pool will scale down to 0 instances when there are no active sessions. Idle instances are maintained for 5 minutes before being terminated. This timeout is not currently configurable but will be in the future.
Scale-to-zero is not recommended for production deployments where immediate response is required.
Auto-scaling
Pipecat Cloud performs auto-scaling by default on all deployments. Auto-scaling is accomplished through the following mechanisms:
- Scaling up based on request velocity
- Maintaining efficiency within max-instances limit
- Scaling down to min-instances (or zero) during low usage
- Supporting burst workloads automatically
Concurrency
You can specify a concurrency configuration that dictates how many sessions can run on a single instance.
This can help further cost optimize your deployment by reducing the number of instances required to handle active sessions.
Developers must ensure their instance types specify enough compute resources to handle the number of concurrent sessions they require.
Each agent is securely fenced from other agents running on the same instance, ensuring that they cannot access each other’s data.
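To see how a concurrency setting affects pool size, the arithmetic can be sketched as follows. This is plain ceiling division under the assumption that sessions pack evenly onto instances; the function name is illustrative, and the result ignores any extra warm instances the auto-scaler may keep on hand.

```python
import math


def instances_needed(active_sessions: int, concurrency: int) -> int:
    """Smallest number of instances that can host the given number of sessions."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # Each instance hosts up to `concurrency` sessions, so round up.
    return math.ceil(active_sessions / concurrency)
```

For instance, 10 concurrent sessions need 10 instances at a concurrency of 1 but only 3 at a concurrency of 4, provided the instance type has enough compute for 4 sessions at once.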
Updating scaling configuration
You can update your deployment’s configuration at any time via the CLI or Pipecat Cloud Dashboard.
Please note that changing your scaling parameters will not disrupt any active sessions. If you reduce your max instance count below the number of currently active sessions, you will still be billed for the duration of those sessions.
Usage summary
Billing is based on:
- Running instances (warm or active)
- Duration of additional instance uptime
For example, if you specify min-instances as 2, you will be billed for the time those instances are kept warm or running active sessions.
If this same deployment receives 2 session requests, and they run concurrently, the auto-scaler may provision further warm instances to serve any subsequent sessions (within the maximum instance limit).
For example, if you specify min-instances as 2 and max-instances as 5, and you have 2 active sessions, the auto-scaler may provision a further warm instance to support the next incoming request.
In this example, you would be charged for the 2 minimum instances plus the time the third instance is running, whether it is warm or serving a third session.
Once the third session has concluded, the auto-scaler disposes of the third instance, meaning you’ll continue to be charged for the 2 minimum warm instances your configuration specifies.
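The billing example can be expressed as simple arithmetic over instance uptime. This is a rough sketch for estimating usage, not the actual billing formula: the function and parameter names are hypothetical, it counts instance-minutes rather than currency, and real charges also depend on the compute profile and current pricing.

```python
def billable_instance_minutes(
    min_instances: int,
    window_minutes: float,
    extra_instance_uptimes: list[float],
) -> float:
    """Instance-minutes in a window: the always-on minimum plus each extra instance's uptime."""
    # Minimum instances are kept warm (or active) for the whole window.
    baseline = min_instances * window_minutes
    # Instances added by the auto-scaler are counted only while they exist.
    return baseline + sum(extra_instance_uptimes)
```

With min-instances set to 2 over a 60-minute window and a third instance that runs for 15 minutes, `billable_instance_minutes(2, 60, [15])` gives 135 instance-minutes.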