1. Key lifecycle
Treat API keys as production credentials. Recommended rules:
- One key per environment
- Prefer one key per integration when practical
- Store only in secret managers or protected runtime configuration
- Rotate on a schedule and after any suspected exposure
- Revoke old keys after rollout confirmation
2. Configuration discipline
Keep these values explicit in environment configuration:
- Base URL
- API key
- Tenant ID
- Site ID
- Timeout settings
- Retry settings
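One way to keep these values explicit is to read them in a single place and fail at startup when any is missing. The sketch below is only an illustration: the environment variable names (`CHAT_API_BASE_URL` and so on) and the `ApiConfig` type are hypothetical, not names defined by the API.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiConfig:
    base_url: str
    api_key: str
    tenant_id: str
    site_id: str
    connect_timeout_s: float
    read_timeout_s: float
    max_retries: int


# Hypothetical variable names; use whatever convention your deployment follows.
REQUIRED_VARS = (
    "CHAT_API_BASE_URL",
    "CHAT_API_KEY",
    "CHAT_API_TENANT_ID",
    "CHAT_API_SITE_ID",
    "CHAT_API_CONNECT_TIMEOUT_S",
    "CHAT_API_READ_TIMEOUT_S",
    "CHAT_API_MAX_RETRIES",
)


def load_config(env=os.environ) -> ApiConfig:
    """Read every setting explicitly and fail fast if one is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    return ApiConfig(
        base_url=env["CHAT_API_BASE_URL"].rstrip("/"),
        api_key=env["CHAT_API_KEY"],
        tenant_id=env["CHAT_API_TENANT_ID"],
        site_id=env["CHAT_API_SITE_ID"],
        connect_timeout_s=float(env["CHAT_API_CONNECT_TIMEOUT_S"]),
        read_timeout_s=float(env["CHAT_API_READ_TIMEOUT_S"]),
        max_retries=int(env["CHAT_API_MAX_RETRIES"]),
    )
```

Requiring the timeout and retry values, rather than defaulting them silently, means a misconfigured environment is caught at startup instead of under load.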
3. HTTP client baseline
Every client should define:
- Connection timeout
- Read timeout
- Sensible total request timeout
- Retry policy limited to retryable failures
- Structured error logging
- Common headers
- Error parsing
- Non-2xx response handling
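A minimal sketch of such a baseline, using only the Python standard library, might look like the following. The error body fields (`message`, `error`) and the `ApiError` type are assumptions made for illustration; the real error format should be taken from the API reference. Authentication headers are passed in by the caller because the header scheme is not specified here. Note that `urllib` accepts a single socket timeout, so separate connection and read timeouts would need a different HTTP library.

```python
import json
import urllib.error
import urllib.request


class ApiError(Exception):
    """Raised for any non-2xx response, with the parsed message attached."""

    def __init__(self, status, message, body):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message
        self.body = body


def parse_error(status: int, raw_body: bytes) -> ApiError:
    """Turn an error response into ApiError, tolerating non-JSON bodies."""
    text = raw_body.decode("utf-8", errors="replace")
    message = text[:200]
    try:
        payload = json.loads(text)
        if isinstance(payload, dict):
            # "message" and "error" are assumed field names.
            message = str(payload.get("message") or payload.get("error") or message)
    except ValueError:
        pass  # not JSON; keep the truncated raw text
    return ApiError(status, message, text)


def request_json(method, url, auth_headers, payload=None, timeout_s=10.0):
    """Send one request with common headers and a timeout; raise ApiError on 4xx/5xx."""
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    headers.update(auth_headers)
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(url, data=data, method=method, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            body = resp.read()
            return json.loads(body) if body else {}
    except urllib.error.HTTPError as exc:
        raise parse_error(exc.code, exc.read()) from exc
```

Centralizing headers and error parsing in one function keeps every call site on the same timeout and error-handling behavior.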
4. Retry policy
Do not use one retry policy for every endpoint and error type. Recommended handling:
- 400, 401, 403, 404: fix the request, config, or access problem; do not blind-retry
- 429: retry with backoff and jitter
- 5xx: bounded retries with backoff
- Network timeout or connect failure: bounded retries if the operation is idempotent or safely repeatable
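The handling rules above can be sketched as a small decision function plus a jittered backoff. The function names, the default of three retries, and the 0.5 s base and 30 s cap are illustrative assumptions, not values prescribed by the API.

```python
import random


def should_retry(status, idempotent, retries_done, max_retries=3):
    """Decide whether to retry.

    status is the HTTP status code, or None for a network timeout or
    connect failure where no response was received.
    """
    if retries_done >= max_retries:
        return False  # retries are always bounded
    if status is None:
        # No response: only safe to repeat idempotent operations.
        return idempotent
    if status == 429 or 500 <= status <= 599:
        return True
    # 400, 401, 403, 404 and other 4xx: fix the cause instead of retrying.
    return False


def backoff_delay(retries_done, base_s=0.5, cap_s=30.0, rng=random.random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**n))."""
    return rng() * min(cap_s, base_s * (2 ** retries_done))
```

A caller would sleep for `backoff_delay(n)` seconds before retry number `n + 1`, and stop as soon as `should_retry` returns `False`.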
5. Session design
Define your session_id strategy before the first production deployment.
Good strategies are:
- One session per visitor
- One session per authenticated user
- One session per business thread or ticket
- One session per messaging channel identity
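Whichever strategy you choose, the same identity should always map to the same session_id. The sketch below derives a stable ID by hashing the identity, which also avoids embedding raw emails or phone numbers in the ID. The strategy names, the `make_session_id` helper, and the resulting ID format are hypothetical; check the API reference for any length or character constraints on session_id.

```python
import hashlib

# Hypothetical labels for the four strategies listed above.
STRATEGIES = {"visitor", "user", "ticket", "channel"}


def make_session_id(strategy: str, identity: str) -> str:
    """Derive a stable session ID from a strategy label and an identity value."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown session strategy: {strategy}")
    if not identity:
        raise ValueError("identity must be a non-empty string")
    digest = hashlib.sha256(f"{strategy}:{identity}".encode("utf-8")).hexdigest()
    return f"{strategy}-{digest[:32]}"
```

For example, `make_session_id("ticket", "SUP-1042")` returns the same value on every call, so all messages for that ticket land in one session.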
6. Content rollout strategy
Do not bulk-load large amounts of content before you have validated answer quality. Recommended rollout:
- Start with a small approved document set
- Cover the highest-frequency support or sales questions first
- Add Q&A entries for direct repeated questions
- Expand only after chat quality is acceptable
Before adding each batch, review the content:
- Check that wording is current
- Remove conflicting guidance
- Separate public guidance from team-only notes
- Verify that sensitive customer data is not present
7. Sync vs async usage
Use sync chat when the caller can wait and the user experience needs an immediate answer. Use async chat when:
- You already have a queue or job architecture
- The caller should not block on completion
- You need stronger control over polling or orchestration
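With async chat, the polling loop should always have a deadline so that a job that never completes cannot block the caller forever. A minimal sketch follows; `fetch_status` stands in for whatever call retrieves the job state, and the status values `completed` and `failed` are assumptions, not the API's documented values.

```python
import time


class PollTimeout(Exception):
    """Raised when the job does not finish within the allowed window."""


def poll_until_done(fetch_status, timeout_s=60.0, interval_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        result = fetch_status()
        # Terminal status names are assumed for illustration.
        if result.get("status") in ("completed", "failed"):
            return result
        if clock() + interval_s > deadline:
            raise PollTimeout(f"job not finished after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.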
8. Feedback collection
Feedback is one of the most valuable quality signals in production. Good practice:
- Submit feedback only for completed responses
- Keep the feedback decision close to the user interaction point
- Store request_id with the surrounding business event so the rating can be tied back to the exact answer
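As a rough illustration of these practices, the record below keeps request_id next to a business key and refuses feedback for responses that have not completed. The `AnswerEvent` type, the `ticket_id` field, the `completed` status, and the `up`/`down` rating values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnswerEvent:
    request_id: str   # ID returned with the chat response
    ticket_id: str    # hypothetical business key for the surrounding event
    status: str       # e.g. "completed" (assumed status name)
    rating: Optional[str] = None


def record_feedback(event: AnswerEvent, rating: str) -> dict:
    """Attach a rating to a completed answer and return the feedback payload."""
    if event.status != "completed":
        raise ValueError("feedback is only accepted for completed responses")
    if rating not in ("up", "down"):
        raise ValueError("rating must be 'up' or 'down'")
    event.rating = rating
    return {"request_id": event.request_id, "rating": rating}
```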
9. Monitoring and observability
Track at minimum:
- Request volume by endpoint
- Success and failure rate by endpoint
- P50, P95, and timeout trends
- 429 frequency
- 5xx frequency
- Async completion latency
- Feedback trend
Break these metrics down by:
- Environment
- Integration name
- Tenant ID
- Site ID
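In production these numbers would normally come from a metrics system, but the calculations are simple enough to sketch in memory. The `EndpointMetrics` class below is a hypothetical helper that computes per-endpoint failure rate and nearest-rank latency percentiles; dimensions such as environment or tenant ID are left out for brevity.

```python
import math
from collections import Counter, defaultdict


class EndpointMetrics:
    """In-memory counters for status codes, timeouts, and latencies per endpoint."""

    def __init__(self):
        self.status_counts = defaultdict(Counter)
        self.latencies_ms = defaultdict(list)
        self.timeouts = Counter()

    def record(self, endpoint, status, latency_ms):
        """Record one completed request with its integer HTTP status."""
        self.status_counts[endpoint][status] += 1
        self.latencies_ms[endpoint].append(latency_ms)

    def record_timeout(self, endpoint):
        """Record a request that timed out without a response."""
        self.timeouts[endpoint] += 1

    def percentile(self, endpoint, pct):
        """Nearest-rank percentile of recorded latencies, or None if empty."""
        values = sorted(self.latencies_ms.get(endpoint, []))
        if not values:
            return None
        rank = max(1, math.ceil(pct * len(values) / 100))
        return values[rank - 1]

    def failure_rate(self, endpoint):
        """Share of requests that timed out or returned a 4xx/5xx status."""
        counts = self.status_counts.get(endpoint, Counter())
        timeouts = self.timeouts[endpoint]
        total = sum(counts.values()) + timeouts
        if total == 0:
            return 0.0
        failures = timeouts + sum(n for s, n in counts.items() if s >= 400)
        return failures / total
```

The 429 and 5xx frequencies can be read directly from `status_counts[endpoint]`.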
10. Logging rules
Logs should help debug production issues without leaking credentials or unnecessary personal data. Recommended logging:
- Request path
- Tenant ID and site ID
- Session ID
- Request ID
- Response status
- Latency
- Retry count
Do not log:
- API keys
- Full secret configuration
- Raw document contents unless explicitly required for a controlled debug environment
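An allow-list is one way to keep credentials and document contents out of logs: only named fields are ever written, so a key passed in by mistake is dropped rather than leaked. The field names and helper functions below are assumptions chosen for illustration.

```python
import json

# Only these fields are ever written to the log (names are illustrative).
ALLOWED_FIELDS = (
    "path", "tenant_id", "site_id", "session_id",
    "request_id", "status", "latency_ms", "retry_count",
)


def build_log_record(**fields):
    """Keep allow-listed fields only; anything else is silently dropped."""
    return {name: fields[name] for name in ALLOWED_FIELDS if name in fields}


def log_line(**fields):
    """Serialize the filtered record as one structured JSON log line."""
    return json.dumps(build_log_record(**fields), sort_keys=True)
```

An allow-list fails safe: a new sensitive field added to the client later stays out of the logs until someone deliberately adds it to `ALLOWED_FIELDS`.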
11. Change management
Before promoting integration changes to production:
- Run a staging smoke flow
- Validate auth
- Validate at least one document operation
- Validate sync chat
- Validate async chat if used
- Validate feedback submission if used
- Validate statistics read access
Changes to track:
- Key rotations
- Environment configuration changes
- New document rollout batches
- Endpoint behavior changes in your client
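The staging smoke flow can be kept as a short list of named checks that run in order and report every failure. The runner below is a sketch under that assumption; each check is a hypothetical callable that raises an exception when its validation fails.

```python
def run_smoke_flow(checks):
    """Run each (name, check) pair and collect (name, passed, detail) results.

    A check passes if it returns without raising. All checks run even if
    an earlier one fails, so the report shows every problem at once.
    """
    results = []
    for name, check in checks:
        try:
            check()
            results.append((name, True, ""))
        except Exception as exc:
            results.append((name, False, str(exc)))
    return results


def all_passed(results):
    """True only when every check in the report passed."""
    return all(passed for _, passed, _ in results)
```

A deployment script could call `run_smoke_flow` with checks for auth, one document operation, sync chat, and any optional features in use, and block promotion unless `all_passed` returns `True`.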
12. Incident handling
Prepare for these common incident classes:
- Auth failures after rotation
- Wrong tenant or site targeting
- Empty or poor answer quality after document changes
- Async polling never completing within the expected window
- High 429 or 5xx rate
Document in advance:
- Where credentials are stored
- Who can rotate or revoke keys
- How to run a connectivity test quickly
- How to disable or pause upstream traffic if needed
- How to confirm recovery
13. Production checklist
- Correct key for the correct environment
- Correct tenant ID and site ID
- Stable session ID strategy
- Small validated document set
- Retries only on retryable failures
- Async polling timeout defined
- Feedback wiring reviewed
- Endpoint metrics and alerts in place
- Staging smoke test passed before production rollout