
Telemetry Guidelines (OpenTelemetry)

Viben observability implementation guide based on OpenTelemetry standards

Overview

Viben uses OpenTelemetry to implement the three pillars of observability:

| Pillar | Implementation Status | Storage Format |
| --- | --- | --- |
| Traces | Complete | JSONL files (one file per traceId, in date directories) |
| Metrics | Basic | JSONL files (by date) |
| Logs | Complete | Pino + JSONL files (by date) |

Core Architecture

Data Storage Structure

~/.viben/telemetry/
├── traces/
│   └── YYYY-MM-DD/
│       └── {traceId}.jsonl   # One file per trace
├── metrics/
│   └── YYYY-MM-DD.jsonl      # Aggregated by date
└── logs/
    └── YYYY-MM-DD.jsonl      # Aggregated by date
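
The layout above can be sketched as a small path builder. This is a minimal illustration, not the actual Viben implementation: `telemetryPath` and `dateKey` are hypothetical names, and bucketing files by UTC date is an assumption.

```typescript
// Hypothetical helpers mirroring the directory layout; names are illustrative only.
type TelemetryKind = "traces" | "metrics" | "logs";

// Format a Date as YYYY-MM-DD (assumption: files are bucketed by UTC date).
function dateKey(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Build the path relative to ~/.viben/telemetry/ for a given record.
function telemetryPath(kind: TelemetryKind, d: Date, traceId?: string): string {
  const day = dateKey(d);
  if (kind === "traces") {
    // Traces get one file per trace inside a per-date directory.
    if (!traceId) throw new Error("traceId is required for trace files");
    return `traces/${day}/${traceId}.jsonl`;
  }
  // Metrics and logs are aggregated into one file per date.
  return `${kind}/${day}.jsonl`;
}
```

Because every record's location depends only on its kind, date, and trace ID, a retention job can delete whole date buckets without reading file contents.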

Dependencies

{
  "@opentelemetry/api": "^1.9.0",
  "@fastify/otel": "^0.18.1",
  "@opentelemetry/instrumentation-http": "^0.212.0",
  "@opentelemetry/sdk-metrics": "^2.5.1",
  "@opentelemetry/sdk-node": "^0.212.0",
  "@opentelemetry/sdk-trace-base": "^2.5.1",
  "@opentelemetry/semantic-conventions": "^1.39.0"
}

Tracing Guide

1. Get Tracer

import { trace } from "@viben/core/telemetry";

// Create named tracer (module level, at top of file)
const tracer = trace.getTracer("viben-gateway", "1.0.0");

Naming Conventions:

| Tracer Name | Purpose |
| --- | --- |
| viben-gateway | Gateway API routes |
| viben-gateway-ws | WebSocket routes |
| viben-cron | Cron service |

2. Create Span

import { trace, SpanStatusCode, context } from "@viben/core/telemetry";
import { getSpanName } from "@viben/core/telemetry/route-names";

// Create top-level span
const span = tracer.startSpan(getSpanName("agent.run"), {
  attributes: {
    "agent.name": agentName,
    "agent.model": model,
  },
});

// Create child span (using parent context)
const parentContext = trace.setSpan(context.active(), parentSpan);
const childSpan = tracer.startSpan(
  getSpanName("agent.run.stream"),
  { attributes: { "stream.session_id": sessionId } },
  parentContext
);

3. Span Naming Conventions

Use getSpanName() to get display names:

import { getSpanName } from "@viben/core/telemetry/route-names";

// Get display names
getSpanName("agent.run"); // -> "Execute Agent"
getSpanName("tool.Read"); // -> "Tool: Read"
getSpanName("cron.execute"); // -> "Execute Cron Job"

Defined Span Names:

| Original Name | Display Name |
| --- | --- |
| agent.run | Execute Agent |
| agent.run.session_create | Create Session |
| agent.run.sdk_init | Initialize SDK |
| agent.run.stream | Stream Response |
| tool_use | Tool Use |
| tool_result | Tool Result |
| cron.execute | Execute Cron Job |
| ws.session | WebSocket Session |
| ws.message.receive | Receive WebSocket Message |
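
To make the mapping concrete, here is a minimal sketch of how a lookup like `getSpanName()` could behave. It is not the real implementation in `@viben/core/telemetry/route-names`: the function name `getSpanNameSketch`, the `tool.` prefix rule, and the fallback to the raw name are assumptions inferred from the examples above.

```typescript
// Display names copied from the table above.
const SPAN_DISPLAY_NAMES: Record<string, string> = {
  "agent.run": "Execute Agent",
  "agent.run.session_create": "Create Session",
  "agent.run.sdk_init": "Initialize SDK",
  "agent.run.stream": "Stream Response",
  "tool_use": "Tool Use",
  "tool_result": "Tool Result",
  "cron.execute": "Execute Cron Job",
  "ws.session": "WebSocket Session",
  "ws.message.receive": "Receive WebSocket Message",
};

// Hypothetical stand-in for getSpanName(); the real function may differ.
function getSpanNameSketch(name: string): string {
  const known = SPAN_DISPLAY_NAMES[name];
  if (known !== undefined) return known;
  // Assumption: dynamic tool spans follow the "tool.<ToolName>" pattern.
  if (name.startsWith("tool.")) return `Tool: ${name.slice("tool.".length)}`;
  // Assumption: unknown names fall back to the raw identifier.
  return name;
}
```

Keeping the table in one place means a renamed display string changes in a single file rather than at every `startSpan` call site.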

4. Add Attributes and Events

// Set attributes
span.setAttributes({
  "agent.status": "completed",
  "agent.message_count": messageCount,
  "http.request.body": JSON.stringify(requestBody),
});

// Add events
span.addEvent("sse.text", {
  "sse.message_index": index,
  "sse.type": "text",
  "sse.payload": content.slice(0, 4000), // Truncate large payloads
});

5. Set Status and End

// Success
span.setStatus({ code: SpanStatusCode.OK });

// Error
span.setStatus({
  code: SpanStatusCode.ERROR,
  message: error.message,
});
span.recordException(error);

// Must call end()
span.end();
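
The status and end steps can be bundled into one wrapper so that no call site forgets `end()`. The sketch below is an assumption, not part of `@viben/core/telemetry`: `withSpan` and `SpanLike` are hypothetical names, and `SpanLike` is a hand-written subset intended to be structurally compatible with the OpenTelemetry `Span` interface. The numeric codes match `SpanStatusCode` in `@opentelemetry/api` (UNSET = 0, OK = 1, ERROR = 2).

```typescript
// Minimal structural subset of a span, so the sketch needs no imports.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(error: Error): unknown;
  end(): void;
}

// Same values as SpanStatusCode.OK and SpanStatusCode.ERROR.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Run fn, set the span status from its outcome, and always end the span.
async function withSpan<T>(span: SpanLike, fn: () => Promise<T>): Promise<T> {
  try {
    const result = await fn();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (e) {
    // Caught values are `unknown`; normalize before reading .message.
    const err = e instanceof Error ? e : new Error(String(e));
    span.setStatus({ code: STATUS_ERROR, message: err.message });
    span.recordException(err);
    throw e; // Rethrow so callers still see the failure.
  } finally {
    span.end(); // Runs on both the success and the error path.
  }
}
```

A call site would then read `await withSpan(tracer.startSpan(getSpanName("agent.run")), () => runAgent())`, where `runAgent` stands for whatever work the span covers.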

6. Tool Call Span Pattern

Tool calls require pairing tool_use and tool_result:

// On tool_use: create the span and remember it by tool ID
const toolSpan = tracer.startSpan(
  getSpanName(`tool.${toolName}`),
  {
    attributes: {
      "tool.id": toolId,
      "tool.name": toolName,
      "tool.input": JSON.stringify(input),
    },
  },
  streamContext
);
pendingToolSpans.set(toolId, toolSpan);

// On tool_result (a separate handler): look up the span created for the
// matching tool_use and end it
const pendingSpan = pendingToolSpans.get(toolUseId);
if (pendingSpan) {
  pendingSpan.setAttributes({
    "tool_result.is_error": isError,
    "tool_result.output": output.slice(0, 2000),
  });
  pendingSpan.setStatus({
    code: isError ? SpanStatusCode.ERROR : SpanStatusCode.OK,
  });
  pendingSpan.end();
  pendingToolSpans.delete(toolUseId);
}
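
One gap in the pairing pattern is a `tool_use` that never receives a `tool_result`, for example when the stream aborts. The sketch below shows one way to close such orphans; `PendingToolSpans`, `ToolSpanLike`, and `closeOrphans` are hypothetical names, and marking orphans as errors is an assumption rather than documented Viben behavior.

```typescript
// Minimal structural subset of a span, so the sketch needs no imports.
interface ToolSpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

class PendingToolSpans {
  private readonly spans = new Map<string, ToolSpanLike>();

  // Called on tool_use.
  open(toolId: string, span: ToolSpanLike): void {
    this.spans.set(toolId, span);
  }

  // Called on tool_result; returns false if no matching tool_use was seen.
  close(toolUseId: string, isError: boolean): boolean {
    const span = this.spans.get(toolUseId);
    if (!span) return false;
    span.setStatus({ code: isError ? 2 : 1 }); // 2 = ERROR, 1 = OK
    span.end();
    this.spans.delete(toolUseId);
    return true;
  }

  // Called when the stream finishes or aborts; ends every span still open.
  closeOrphans(reason: string): number {
    let closed = 0;
    this.spans.forEach((span) => {
      span.setStatus({ code: 2, message: reason });
      span.end();
      closed++;
    });
    this.spans.clear();
    return closed;
  }
}
```

Calling `closeOrphans("stream ended without tool_result")` from the stream's `finally` block keeps unfinished tool spans from silently disappearing from the trace.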

Metrics Guide

Implemented Metrics

Agent Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| viben_agent_requests_total | Counter | agent_name, status, error_category | Total agent requests |
| viben_agent_duration_seconds | Histogram | agent_name, status | Agent execution duration |
| viben_agent_tool_calls_total | Counter | agent_name, tool_name, status | Tool call count |
| viben_agent_text_chars_total | Counter | agent_name | Text response character count |
| viben_agent_messages_total | Counter | agent_name, message_type | SSE message count |
| viben_agent_active_sessions | Gauge | - | Current active sessions |

Cron Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| viben_cron_executions_total | Counter | job_id, job_name, job_type, status, trigger | Cron execution count |
| viben_cron_duration_seconds | Histogram | job_id, job_name, job_type | Cron execution duration |
| viben_cron_jobs_total | Gauge | enabled, job_type | Total cron jobs |

WebSocket Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| viben_ws_connections_total | Counter | - | Total WebSocket connections |
| viben_ws_disconnects_total | Counter | reason | Total WebSocket disconnections |
| viben_ws_messages_total | Counter | direction, message_type | WebSocket message count |
| viben_ws_active_connections | Gauge | - | Current active connections |

Using Helper Functions

Prefer the predefined helper functions when recording metrics:

import { recordAgentRequest, recordAgentToolCall, recordCronExecution } from "@viben/core/telemetry";

// Record agent request completion
recordAgentRequest({
  agentName: "my-agent",
  status: "success", // "success" | "error" | "cancelled"
  durationMs: 5000,
  toolUseCount: 3,
  toolResultCount: 3,
  textLength: 1500,
  messageCount: 10,
});

// Record tool call
recordAgentToolCall({
  agentName: "my-agent",
  toolName: "Read",
  status: "success", // "success" | "error"
});

// Record cron execution
recordCronExecution({
  jobId: "job-123",
  jobName: "Daily Report",
  jobType: "agent", // "agent" | "script"
  status: "success",
  trigger: "schedule", // "schedule" | "manual"
  durationMs: 3000,
});
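
Note that the helpers accept `durationMs` while the histograms are named `*_duration_seconds`, so a unit conversion presumably happens inside the helper. The sketch below illustrates that idea with in-memory stand-ins; `FakeCounter`, `FakeHistogram`, and `recordAgentRequestSketch` are hypothetical, and the real helper's internals are not shown in this guide.

```typescript
type Labels = Record<string, string>;

// In-memory stand-ins for OpenTelemetry Counter and Histogram instruments.
class FakeCounter {
  readonly points: Array<{ value: number; labels: Labels }> = [];
  add(value: number, labels: Labels = {}): void {
    this.points.push({ value, labels });
  }
}

class FakeHistogram {
  readonly points: Array<{ value: number; labels: Labels }> = [];
  record(value: number, labels: Labels = {}): void {
    this.points.push({ value, labels });
  }
}

interface AgentRequestRecord {
  agentName: string;
  status: "success" | "error" | "cancelled";
  durationMs: number;
  textLength: number;
}

const requestsTotal = new FakeCounter();
const durationSeconds = new FakeHistogram();
const textCharsTotal = new FakeCounter();

// Assumed shape of the helper: fan one record out to several instruments.
function recordAgentRequestSketch(r: AgentRequestRecord): void {
  const base: Labels = { agent_name: r.agentName, status: r.status };
  requestsTotal.add(1, base);
  // Convert milliseconds to seconds to match the metric name's unit.
  durationSeconds.record(r.durationMs / 1000, base);
  textCharsTotal.add(r.textLength, { agent_name: r.agentName });
}
```

Centralizing the conversion and label names in one helper is the main reason to prefer it over calling the instruments directly.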

Direct Metrics API

For more flexible control, you can use the exported metrics objects directly:

import { agentRequestsTotal, agentDurationSeconds } from "@viben/core/telemetry";

// Manually increment counter
agentRequestsTotal.add(1, { agent_name: "my-agent", status: "success" });

// Record histogram value
agentDurationSeconds.record(5.0, { agent_name: "my-agent", status: "success" });

Observable Gauge Callbacks

Observable Gauges report current values (such as active connection counts) and are automatically registered during Gateway initialization:

// Automatically called during Gateway initialization (gateway/index.ts)
import { registerGaugeCallbacks } from "@viben/core/telemetry";
import { agentService } from "../services/agent";
import { getActiveWsConnectionCount } from "./routes/ws";

registerGaugeCallbacks({
  getActiveAgentSessions: () => agentService.getActiveSessionCount(),
  getActiveWsConnections: () => getActiveWsConnectionCount(),
  getCronJobCounts: () => state.cron.getJobStats(),
});

Data Sources:

| Gauge | Data Source | Method |
| --- | --- | --- |
| viben_agent_active_sessions | AgentService | agentService.getActiveSessionCount() |
| viben_ws_active_connections | ws.ts | getActiveWsConnectionCount() |
| viben_cron_jobs_total | CronService | state.cron.getJobStats() |

Logging Guide

Using Pino Logger

import { createLogger, createDualLogger } from "@viben/core/telemetry";

// Create logger
const logger = process.env.NODE_ENV === "production"
  ? createLogger(config)
  : createDualLogger(config); // Output to both file and console

// Usage
logger.info({ userId, action }, "User performed action");
logger.error({ error: err.message, stack: err.stack }, "Operation failed");

Log Levels

| Level | Purpose |
| --- | --- |
| trace | Very detailed debug information |
| debug | Debug information (development) |
| info | General information (production default) |
| warn | Warning messages |
| error | Error messages |
| fatal | Fatal errors |
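
The levels are ordered by severity: a logger configured at a given level emits that level and everything above it. The sketch below models this with Pino's default numeric level values; `isLevelEnabled` is a hypothetical helper written for illustration, not a Pino or Viben function.

```typescript
// Pino's default numeric values for its built-in levels.
const LEVEL_VALUES = {
  trace: 10,
  debug: 20,
  info: 30,
  warn: 40,
  error: 50,
  fatal: 60,
} as const;

type LogLevel = keyof typeof LEVEL_VALUES;

// A message is emitted when its level is at or above the configured level.
function isLevelEnabled(configured: LogLevel, candidate: LogLevel): boolean {
  return LEVEL_VALUES[candidate] >= LEVEL_VALUES[configured];
}
```

With the production default of `info`, `debug` and `trace` calls are dropped while `warn`, `error`, and `fatal` still reach the log files.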

Initializing Telemetry

Initialization in Gateway

import { initTelemetry, getDefaultTelemetryDir } from "@viben/core/telemetry";

const telemetry = initTelemetry({
  serviceName: "viben-gateway",
  serviceVersion: "1.0.0",
  baseDir: getDefaultTelemetryDir(),
  enabled: true,
  trace: {
    batchSize: 100,
    flushDelayMs: 5000,
  },
  metrics: {
    exportIntervalMs: 60000,
  },
  retentionDays: 7,
});

// Cleanup on shutdown
process.on("SIGTERM", async () => {
  await telemetry.shutdown();
});
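
Because files are bucketed by date, `retentionDays` can be applied by comparing date keys against a cutoff. The sketch below shows one possible interpretation; `expiredDateKeys` is a hypothetical function, and both the UTC day boundary and the "strictly older than N days" rule are assumptions about how retention is evaluated.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the YYYY-MM-DD keys that fall outside the retention window.
function expiredDateKeys(dateKeys: string[], now: Date, retentionDays: number): string[] {
  // Midnight UTC of the current day.
  const todayUtc = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  const cutoff = todayUtc - retentionDays * MS_PER_DAY;
  return dateKeys.filter((key) => {
    const t = Date.parse(`${key}T00:00:00Z`);
    // Skip names that are not valid dates instead of deleting them.
    return !Number.isNaN(t) && t < cutoff;
  });
}
```

Ignoring unparseable names is a deliberately conservative choice: a cleanup job should never remove a file it cannot positively identify as expired.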

HTTP Auto Instrumentation

The SDK automatically instruments:

  • HTTP inbound/outbound requests
  • Fastify routes

Ignore Rules:

  • /health and /api/health health check endpoints
  • WebSocket upgrade requests (to avoid conflicts with @fastify/websocket)
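
The ignore rules above amount to a predicate over the incoming request. The sketch below is an illustration only: `shouldIgnoreRequest` and `IncomingRequestLike` are hypothetical names, and how Viben actually wires its rules is not shown in this guide. A function of this shape could plausibly be passed as the `ignoreIncomingRequestHook` option of `@opentelemetry/instrumentation-http`.

```typescript
// Minimal subset of Node's IncomingMessage, so the sketch needs no imports.
interface IncomingRequestLike {
  url?: string;
  headers: Record<string, string | string[] | undefined>;
}

const IGNORED_PATHS = new Set(["/health", "/api/health"]);

function shouldIgnoreRequest(req: IncomingRequestLike): boolean {
  // Compare the path only, without any query string.
  const path = (req.url ?? "").split("?")[0] ?? "";
  if (IGNORED_PATHS.has(path)) return true;

  // WebSocket handshakes carry an "Upgrade: websocket" header.
  const upgrade = req.headers["upgrade"];
  const value = Array.isArray(upgrade) ? upgrade.join(",") : (upgrade ?? "");
  return value.toLowerCase().includes("websocket");
}
```

Skipping health checks keeps high-frequency probe traffic out of the trace files, which matters when every trace becomes its own JSONL file.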

Desktop Integration

The Desktop application consumes telemetry data through the Gateway API:

  1. No need to install OpenTelemetry packages in the desktop app
  2. Use /api/telemetry/* endpoints to fetch data
  3. Display trace visualization in the chat-monitor.tsx page

Common Patterns

Pattern 1: Route-Level Tracing

fastify.post("/api/resource", async (request, reply) => {
  const span = tracer.startSpan(getSpanName("resource.create"), {
    attributes: {
      "http.request.body": JSON.stringify(request.body),
    },
  });

  try {
    const result = await createResource(request.body);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // `error` is `unknown` in TypeScript, so normalize it before reading .message
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    span.recordException(err);
    throw error;
  } finally {
    span.end();
  }
});

Pattern 2: Nested Spans

const parentSpan = tracer.startSpan("parent.operation");
const parentContext = trace.setSpan(context.active(), parentSpan);

try {
  // Child operation 1: end the child span even if step1() throws
  const childSpan1 = tracer.startSpan("child.step1", {}, parentContext);
  try {
    await step1();
  } finally {
    childSpan1.end();
  }

  // Child operation 2: same pattern
  const childSpan2 = tracer.startSpan("child.step2", {}, parentContext);
  try {
    await step2();
  } finally {
    childSpan2.end();
  }

  parentSpan.setStatus({ code: SpanStatusCode.OK });
} finally {
  parentSpan.end();
}

Pattern 3: SSE/Streaming Tracing

// Create stream span
const streamSpan = tracer.startSpan("stream", {}, parentContext);
// Parent context for spans created while streaming (e.g. tool call spans)
const streamContext = trace.setSpan(context.active(), streamSpan);

let count = 0;
try {
  for await (const message of stream) {
    count++;
    // Record each event
    streamSpan.addEvent(`sse.${message.type}`, {
      "sse.payload": JSON.stringify(message).slice(0, 4000),
    });
  }
} finally {
  // Record the count and end the span even if the stream throws
  streamSpan.setAttributes({
    "stream.message_count": count,
  });
  streamSpan.end();
}

Common Mistakes

Don't: Forget to End Span

// Bad - span never ends
const span = tracer.startSpan("operation");
await doSomething();
// Forgot to call span.end()

// Good - use try/finally to ensure ending
const span = tracer.startSpan("operation");
try {
  await doSomething();
} finally {
  span.end();
}

Don't: Store Oversized Attributes

// Bad - storing full response
span.setAttribute("response", JSON.stringify(largeResponse));

// Good - truncate large data
span.setAttribute(
  "response",
  JSON.stringify(largeResponse).slice(0, 2000) + "...[truncated]"
);
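
The inline truncation above always appends the marker, even when nothing was cut. A small helper can add it only when needed. This is a sketch under assumptions: `truncateAttr` is a hypothetical name, and the 2000-character default simply mirrors the limit used in the example.

```typescript
// Serialize a value for use as a span attribute, capping its length.
function truncateAttr(value: unknown, maxLen = 2000): string {
  // JSON.stringify returns undefined for inputs such as `undefined` or functions,
  // so fall back to String() in that case.
  const text = typeof value === "string"
    ? value
    : (JSON.stringify(value) ?? String(value));
  if (text.length <= maxLen) return text;
  // Mark the cut so readers of the trace know data is missing.
  return text.slice(0, maxLen) + "...[truncated]";
}
```

Usage would look like `span.setAttribute("response", truncateAttr(largeResponse))`, which keeps the limit and the marker consistent across call sites.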

Don't: Create Tracer in Hot Path

// Bad - creating tracer on every request
fastify.get("/api/data", async () => {
  const tracer = trace.getTracer("my-tracer"); // Repeated creation
  // ...
});

// Good - create at module level
const tracer = trace.getTracer("my-tracer", "1.0.0");
fastify.get("/api/data", async () => {
  // Use existing tracer
});

API Endpoints

Telemetry data can be queried via REST API:

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/telemetry/dates | GET | Get list of available dates |
| /api/telemetry/traces | GET | Get traces for specified date |
| /api/telemetry/trace/:id | GET | Get trace details (tree structure) |
| /api/telemetry/trace/:id/spans | GET | Get raw spans |
| /api/telemetry/clean | DELETE | Clean old files |
| /api/telemetry/stats | GET | Get statistics |
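
The difference between the raw spans endpoint and the tree-structured trace details can be illustrated by assembling a tree from flat span records. The record shape below (`spanId`, `parentSpanId`, `name`) is an assumption made for this sketch; the actual response schema of these endpoints is not documented here, and `buildSpanTree` is a hypothetical function.

```typescript
// Assumed minimal shape of a raw span record.
interface SpanRecord {
  spanId: string;
  parentSpanId?: string;
  name: string;
}

interface SpanNode extends SpanRecord {
  children: SpanNode[];
}

// Link each span to its parent; spans without a known parent become roots.
function buildSpanTree(spans: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const s of spans) {
    nodes.set(s.spanId, { ...s, children: [] });
  }

  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // Either a true root or an orphan whose parent was not exported.
      roots.push(node);
    }
  });
  return roots;
}
```

Treating orphans as roots keeps partially exported traces visible in a viewer instead of dropping spans whose parent is missing.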

Pending Features

High Priority

  1. More Route Instrumentation
    • Audit all routes to ensure manual spans are in place
    • Add database operation spans

Medium Priority

  1. Distributed Tracing Context Propagation

    • Pass trace context to external services
    • Support W3C Trace Context standard
  2. SLI/SLO Metrics

    • Error rate
    • Latency percentiles
    • Availability

Low Priority

  1. Frontend Telemetry
    • Browser-side OpenTelemetry
    • Frontend-to-backend trace correlation