Production Readiness

Soklet is designed to be a small HTTP/1.1 application server that you run behind production edge infrastructure. It owns request routing, response writing, streaming, SSE, MCP transport, lifecycle hooks, and metrics. It intentionally does not own every platform concern.

Deployment Boundary

Run Soklet behind a load balancer, ingress, or reverse proxy that handles internet-facing transport policy:

  • Terminate TLS at the edge
  • Speak HTTP/2 or HTTP/3 to clients if you need those protocols
  • Forward HTTP/1.1 to Soklet
  • Enforce coarse connection, IP, WAF, and rate-limit policy before requests reach the JVM
  • Preserve Host and trusted X-Forwarded-* headers if your application uses effective-origin resolution

Soklet does not provide in-process TLS termination, HTTP/2, HTTP/3, or WebSockets. For server push, use Server-Sent Events. For MCP clients, use Soklet's MCP POST, GET, and DELETE transport support.

For browser-facing SSE, prefer an edge that speaks HTTP/2 or HTTP/3 to clients even though it forwards HTTP/1.1 to Soklet. Browsers enforce low per-origin connection limits for HTTP/1.x, and long-lived SSE streams can consume those slots. HTTP/2 and HTTP/3 multiplex many browser-side streams over fewer client-to-edge connections, which avoids SSE starving ordinary page/API traffic while keeping Soklet's backend protocol simple.

If Soklet is directly reachable by untrusted clients, configure origin and forwarded-header handling defensively. Use TrustPolicy.TRUST_NONE unless the forwarding proxy is under your control, and use CORS allowlists rather than permissive origin reflection.
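The allowlist idea above can be sketched in plain Java. This is a minimal illustration, not Soklet's CORS API: the OriginAllowlist class and its method are hypothetical names, and the example origins are placeholders.

```java
import java.util.Set;

// Hypothetical helper for illustration; not part of Soklet.
final class OriginAllowlist {
    private final Set<String> allowedOrigins;

    OriginAllowlist(Set<String> allowedOrigins) {
        this.allowedOrigins = Set.copyOf(allowedOrigins);
    }

    /** Returns the origin to send in Access-Control-Allow-Origin, or null to omit the header. */
    String resolveAllowedOrigin(String requestOrigin) {
        if (requestOrigin == null) {
            return null;
        }
        // Exact string match only: no suffix matching, no wildcards, no reflection of unknown origins.
        return allowedOrigins.contains(requestOrigin) ? requestOrigin : null;
    }

    public static void main(String[] args) {
        OriginAllowlist allowlist = new OriginAllowlist(Set.of("https://app.example.com"));
        System.out.println(allowlist.resolveAllowedOrigin("https://app.example.com"));
        System.out.println(allowlist.resolveAllowedOrigin("https://evil.example.net"));
    }
}
```

The key design choice is that an unknown origin yields no header at all, rather than echoing whatever the client sent.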

Timeouts And Backpressure

Production configs should set explicit limits instead of relying only on defaults:

  • requestHeaderTimeout is a transport-layer read bound for the request line and headers. Lower values strengthen slow-client and slow-loris protection.
  • requestBodyTimeout is a transport-layer read bound for the complete request body after headers have been received. It is a total body-phase timeout, not an idle-progress timeout. It applies to standard HTTP and MCP; SSE handshakes do not accept request bodies.
  • requestHandlerTimeout bounds application handler execution. For HTTP it covers the resource method and response marshaling. For SSE it covers the handshake handler. For MCP it covers JSON-RPC handler execution, including McpEndpoint::initialize. This is the knob to raise for long-running handlers.
  • requestHandlerConcurrency and requestHandlerQueueCapacity bound active and queued handler work.
  • maximumRequestSizeInBytes rejects oversized request lines, headers, framing, and bodies before application code sees them.
  • maximumHeaderCount and maximumRequestTargetLengthInBytes reject request-shape attacks that can otherwise stay under a total byte-size ceiling.
  • writeTimeout bounds SSE and MCP stream writes; use a nonzero value when slow or stalled clients should be disconnected.
  • streamingResponseTimeout and streamingResponseIdleTimeout bound HTTP streaming producers.
  • shutdownTimeout controls how long Soklet waits for server executors to drain before interrupting stragglers.

Treat queue capacity as a memory and latency budget, not just a throughput knob. A large queue can absorb bursts, but it can also hide overload and increase tail latency. In most production systems, a bounded queue plus fast 503 Service Unavailable is preferable to unbounded request accumulation.
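To make the bounded-queue behavior concrete, here is a sketch using only the JDK's ThreadPoolExecutor. It is an analogy for how a bounded handler pool rejects excess work, not Soklet's internal implementation; the class and method names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedQueueDemo {
    /** Submits taskCount blocked tasks and returns how many were rejected immediately. */
    static int countRejections(int concurrency, int queueCapacity, int taskCount) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                concurrency, concurrency, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),  // bounded: a memory and latency budget
                new ThreadPoolExecutor.AbortPolicy());    // reject instead of accumulating
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < taskCount; i++) {
            try {
                executor.execute(() -> awaitQuietly(release)); // simulates a slow handler
            } catch (RejectedExecutionException e) {
                rejected++; // a server would answer 503 Service Unavailable here
            }
        }
        release.countDown(); // let the accepted tasks finish
        executor.shutdown();
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 2 running + 3 queued are accepted; the remaining 5 are rejected.
        System.out.println(countRejections(2, 3, 10)); // prints 5
    }
}
```

Rejection happens at submission time, so the overloaded caller learns about it immediately instead of waiting in a queue that may never drain.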

These settings are configured on the transport builders before the transport is added to SokletConfig. The setting names above match builder method names where the setting applies.

See Server Configuration for complete examples showing where these builder methods are set.

The following table gives starting points for three common archetypes. Tune from here based on observed queue depth, p99 latency, and saturation behavior.

| Setting | JSON API | MCP server | Streaming-heavy |
|---|---|---|---|
| requestHeaderTimeout | 5s | 5s | 5s (handshake only) |
| requestBodyTimeout | 30s | 30s | not applicable |
| requestHandlerTimeout | 10s | 120s | 5s (handshake) |
| requestHandlerConcurrency | 2× CPU | 4× CPU | 1× CPU |
| requestHandlerQueueCapacity | 200 | 200 | 100 |
| maximumRequestSizeInBytes | 1 MB | 10 MB | 1 MB |
| maximumHeaderCount | 100 | 100 | 100 |
| maximumRequestTargetLengthInBytes | 8 KB | 8 KB | 8 KB |
| shutdownTimeout | 30s | 30s | 30s |
| writeTimeout | not applicable | 30s | 30s |
| streamingResponseTimeout | not applicable | not applicable | unset (or large) |
| streamingResponseIdleTimeout | not applicable | not applicable | 60s |

Notes:

  • Concurrency multipliers assume virtual-thread executors. For platform-thread pools, start lower (1× CPU) and rely on the queue.
  • shutdownTimeout should match the deployment platform's grace period. See the Kubernetes recipe in Deployment Recipes for the terminationGracePeriodSeconds relationship.
  • Long-running MCP tool calls (LLM, RAG retrieval, external tool execution) drive requestHandlerTimeout higher than typical web APIs; size it to the worst-case tool latency you actually want to accept. Leave requestHeaderTimeout tight to preserve slow-client protection. Raising read timeouts does not help long handler execution because handler work is governed by requestHandlerTimeout.
  • Large or slow uploads may need a larger requestBodyTimeout because it bounds total body-read time even when bytes continue arriving.
  • streamingResponseTimeout is intentionally left unset for streaming-heavy workloads; rely on streamingResponseIdleTimeout to disconnect stalled clients without bounding total stream lifetime.
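The sizing guidance above reduces to simple arithmetic. The sketch below shows one way to derive a starting concurrency from the CPU count and to estimate how long the last request in a full queue might wait. HandlerSizing and its methods are hypothetical helpers, and the wait estimate assumes uniform handler latency, which real workloads rarely have.

```java
final class HandlerSizing {
    /** Starting concurrency: multiplier x CPU for virtual threads, 1 x CPU for platform-thread pools. */
    static int startingConcurrency(int cpuCount, int multiplier, boolean virtualThreads) {
        if (cpuCount < 1 || multiplier < 1) {
            throw new IllegalArgumentException("cpuCount and multiplier must be >= 1");
        }
        return virtualThreads ? cpuCount * multiplier : cpuCount;
    }

    /** Rough wait for the last request in a full queue, assuming uniform handler latency. */
    static long worstCaseQueueWaitMillis(int queueCapacity, int concurrency, long avgHandlerMillis) {
        long rounds = (queueCapacity + concurrency - 1L) / concurrency; // ceiling division
        return rounds * avgHandlerMillis;
    }

    public static void main(String[] args) {
        int concurrency = startingConcurrency(4, 2, true);                  // 4 CPUs, JSON API column
        System.out.println(concurrency);                                    // prints 8
        System.out.println(worstCaseQueueWaitMillis(200, concurrency, 50)); // prints 1250
    }
}
```

With a 200-entry queue, 8 concurrent handlers, and 50 ms handlers, a request at the back of a full queue waits on the order of 1.25 seconds before it even starts, which is why queue capacity is a latency budget.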

Threading Model

Soklet uses transport threads for network mechanics and separate executors for application work:

  • HTTP resource methods and response marshaling run on the HTTP request-handler executor.
  • HTTP StreamingResponseBody producers run on the HTTP streaming executor, so long-lived streams do not occupy request-handler threads.
  • SSE handshakes run on the SSE request-handler executor.
  • Established SSE connections are processed on a separate connection executor; SseBroadcaster::broadcastEvent enqueues work and returns without performing socket writes inline.
  • MCP JSON-RPC handling runs on the MCP request-handler executor, including framework-managed McpEndpoint::initialize.
  • MCP GET SSE streams run on the MCP connection executor.
  • Lifecycle observers and metrics collectors are called on the thread performing the observed operation.

Default HTTP and MCP executors use virtual threads when the runtime supports them and fall back to bounded platform-thread pools otherwise. SSE specifically requires virtual-thread support: the SSE server refuses to start on a runtime without virtual threads. You can supply custom executor services through the server builders when you need to integrate with an application-specific executor policy.

Lifecycle observers and metrics collectors are hot-path callbacks. Keep them thread-safe, non-blocking, and failure-contained. Do not perform network I/O, blocking exports, or heavyweight logging directly inside those callbacks.
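One way to keep a callback non-blocking is to hand events to a bounded buffer and export them from a background thread. The sketch below uses only JDK classes. AsyncEventBuffer is a hypothetical name, events are plain strings for brevity, and wiring it into a LifecycleObserver or MetricsCollector is left out because that depends on the callback you implement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class AsyncEventBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    AsyncEventBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called from the hot path: never blocks; drops and counts the event when the buffer is full. */
    void record(String event) {
        if (!queue.offer(event)) {
            dropped.incrementAndGet();
        }
    }

    /** Called from a background thread: removes up to maxBatch events for export. */
    List<String> drainBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Dropping under pressure is a deliberate trade-off here: losing a telemetry event is usually cheaper than stalling the request pipeline, and the drop counter makes the loss visible.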

Process Configuration

JVM Flags

A reasonable production baseline for Java 21+:

-Xmx2g
-XX:+UseZGC -XX:+ZGenerational
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heap-dump.hprof
-XX:+ExitOnOutOfMemoryError

Notes:

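  • Generational ZGC suits Soklet's typical workload mix of many short HTTP requests plus long-lived SSE/MCP streams. Sub-millisecond pauses keep streaming connections from being starved during collection. Generational mode is the ZGC default from JDK 23, and JDK 24 removed the non-generational mode, so -XX:+ZGenerational is only needed on Java 21 and 22.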
  • Set -Xmx explicitly. The JVM autodetects container memory limits, but explicit sizing keeps heap behavior deterministic across base-image and runtime updates.
  • -XX:+ExitOnOutOfMemoryError is preferable for orchestrated deployments. The orchestrator restarts the pod cleanly rather than leaving a degraded JVM running with intermittent failures.
  • For CPU-constrained environments, also set -XX:ActiveProcessorCount=N to match the container's CPU limit. The JVM otherwise sizes its internal pools to the host's CPU count, not the container's.
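To confirm that the flags above took effect, it can help to log what the JVM actually sees at startup. This sketch uses only standard Runtime calls; the RuntimeLimits class name and output format are illustrative.

```java
final class RuntimeLimits {
    /** Summarizes the CPU count and maximum heap the running JVM has settled on. */
    static String describe() {
        Runtime runtime = Runtime.getRuntime();
        long maxHeapMiB = runtime.maxMemory() / (1024L * 1024L);
        return "availableProcessors=" + runtime.availableProcessors() + " maxHeapMiB=" + maxHeapMiB;
    }

    public static void main(String[] args) {
        // Run inside the container with the production flags and compare against the intended limits.
        System.out.println(describe());
    }
}
```

If availableProcessors reports the host's core count rather than the container's CPU limit, that is the signal to set -XX:ActiveProcessorCount explicitly.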

File Descriptor Limits

Long-lived SSE and MCP streams hold one file descriptor per connection. Default container limits (often 1024) saturate at modest concurrency. Verify and raise:

ulimit -n 65536

Container images that run the JVM as PID 1 often inherit unhelpful defaults. Verify the effective limit inside the container with cat /proc/1/limits (the JVM, when it runs as PID 1) or cat /proc/$$/limits (the current shell).

Kubernetes does not expose RLIMIT_NOFILE as a portable per-pod field — securityContext.sysctls configures kernel /proc/sys parameters, not per-process resource limits. Practical options:

  • Bake the limit into your container image entrypoint (e.g., a wrapper script that calls ulimit -n 65536 before exec java ...).
  • Configure your container runtime's default ulimits at the node level (containerd default_ulimits, Docker --default-ulimit).
  • Use a platform-specific extension if your distribution offers one.
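The effective limit can also be checked from inside the JVM, which is useful for a startup log line or a gauge. This sketch relies on com.sun.management.UnixOperatingSystemMXBean, a JDK-specific interface available on OpenJDK-based runtimes on Unix-like systems; FileDescriptorCheck is a hypothetical helper name.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class FileDescriptorCheck {
    /** Returns {open, max} file descriptor counts, or null when the platform does not expose them. */
    static long[] openAndMax() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean unix) {
            return new long[] { unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount() };
        }
        return null; // for example on Windows, or on a JVM without this MXBean
    }

    public static void main(String[] args) {
        long[] counts = openAndMax();
        System.out.println(counts == null
                ? "file descriptor counts unavailable"
                : "open=" + counts[0] + " max=" + counts[1]);
    }
}
```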

Graceful Shutdown

Stopping a Soklet instance first stops accepting new connections. Active SSE and MCP streams are terminated with StreamTerminationReason.SERVER_STOPPING. Server executors are then asked to shut down gracefully so already-queued request, handshake, and stream-write work can complete within shutdownTimeout. If work is still running after the deadline, Soklet interrupts the remaining executor tasks.

Streaming and producer code should cooperate with shutdown and client disconnects by checking for interruption and exiting promptly rather than swallowing it.
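The shutdown sequence and a cooperative producer can be sketched with JDK primitives. This mirrors the stop, drain, then interrupt order described above, but it is not Soklet's implementation. CooperativeShutdown, produce, and drain are hypothetical names, and the IntConsumer stands in for whatever writes a chunk to the client.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

final class CooperativeShutdown {
    /** Emits chunks until limit is reached or the thread is interrupted; returns the chunks emitted. */
    static int produce(int limit, IntConsumer sink) {
        int produced = 0;
        // Checking the interrupt flag each iteration lets shutdown stop the loop promptly.
        while (produced < limit && !Thread.currentThread().isInterrupted()) {
            sink.accept(produced);
            produced++;
        }
        return produced;
    }

    /** Drains gracefully within the timeout, then interrupts stragglers; returns true if fully drained. */
    static boolean drain(ExecutorService executor, long timeoutMillis) {
        executor.shutdown(); // stop accepting new work; queued work keeps running
        try {
            if (executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        executor.shutdownNow(); // interrupt whatever is still running
        return false;
    }
}
```

A producer that catches InterruptedException and keeps looping defeats the final step, so restore the interrupt flag or return when you see it.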

Observability

Use LifecycleObserver for detailed event hooks and request/stream tracing. Use MetricsCollector for counters, gauges, and histograms. The default collector can expose Prometheus/OpenMetrics-compatible text. If your platform standardizes on OpenTelemetry, use soklet-otel for both metrics and spans.

Soklet parses inbound W3C traceparent and tracestate headers into Request::getTraceContext. Core Soklet does not create spans, make sampling decisions, generate child span IDs, or install ambient tracing scope. Application code can read the parsed context directly for logs and outbound clients. OpenTelemetryLifecycleObserver uses that parsed context as the remote parent for emitted OpenTelemetry spans.

Keep trace IDs out of metric labels. Trace IDs belong in spans and logs; metric labels should stay low-cardinality. If you need metrics-to-trace drill-down, use OpenTelemetry exemplars instead of adding trace IDs to counters or histograms.

Exposing Metrics For Scraping

For self-hosted scraping with the default in-memory collector, expose a resource method:

@GET("/metrics")
public MarshaledResponse metrics(@NonNull MetricsCollector metricsCollector) {
  String body = metricsCollector.snapshotText(
    SnapshotTextOptions.fromMetricsFormat(MetricsFormat.PROMETHEUS)
  ).orElse(null);

  if (body == null)
    return MarshaledResponse.fromStatusCode(204);

  return MarshaledResponse.withStatusCode(200)
    .headers(Map.of("Content-Type", Set.of("text/plain; charset=UTF-8")))
    .body(body.getBytes(StandardCharsets.UTF_8))
    .build();
}

See Metrics Collection for filter/format options, OpenMetrics output, and OpenTelemetry-backed collectors. OpenTelemetryMetricsCollector exports through OTel's pipeline and does not require a /metrics endpoint.

At minimum, production dashboards should cover:

  • Accepted and rejected connections by server type
  • Request read failures and request-handler rejections
  • HTTP response counts and durations by route
  • Streaming, SSE, and MCP stream terminations by reason
  • Active SSE clients, active MCP sessions, and active MCP SSE streams
  • MCP JSON-RPC outcomes and request durations by endpoint and method

Multi-Node Deployments

SSE broadcasters are node-local. A broadcast on one Soklet node only reaches clients connected to that node. For clustered SSE, publish domain events to a shared queue or pub/sub system and let each node rebroadcast locally. If you support Last-Event-ID catch-up, store replay data in a shared durable log.
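The publish-then-rebroadcast pattern can be illustrated with an in-memory stand-in. Everything here is hypothetical: SharedBus stands in for a real shared queue or pub/sub system, and Node.broadcastLocally stands in for a call to the node's own SSE broadcaster.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

final class ClusterFanout {
    /** In-memory stand-in for a shared pub/sub system reachable by every node. */
    static final class SharedBus {
        private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

        void subscribe(Consumer<String> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(String event) {
            subscribers.forEach(subscriber -> subscriber.accept(event));
        }
    }

    /** Stand-in for one application node with node-local SSE clients. */
    static final class Node {
        final List<String> deliveredToLocalClients = new CopyOnWriteArrayList<>();

        Node(SharedBus bus) {
            bus.subscribe(this::broadcastLocally); // every node listens to the shared bus
        }

        void broadcastLocally(String event) {
            // A real node would hand the event to its local SSE broadcaster here.
            deliveredToLocalClients.add(event);
        }
    }
}
```

Application code publishes domain events to the bus and never calls a node's broadcaster directly, so clients on every node see the same events regardless of which node handled the originating request.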

MCP session metadata can use a custom shared McpSessionStore, but live MCP GET streams are still node-local. Route all requests for a given MCP-Session-Id to the node that owns that stream, commonly through consistent hashing or explicit affinity at the edge.
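Edge proxies usually provide consistent hashing as a configuration option, but the mechanism is easy to see in code. The sketch below maps an MCP-Session-Id value to a node using a hash ring with virtual nodes. SessionRing is a hypothetical illustration, not something Soklet or any particular ingress ships.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SessionRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    SessionRing(List<String> nodes, int virtualNodesPerNode) {
        for (String node : nodes) {
            for (int i = 0; i < virtualNodesPerNode; i++) {
                ring.put(hash(node + "#" + i), node); // several points per node smooth the distribution
            }
        }
    }

    /** Returns the node owning the first ring point at or after the session's hash, wrapping around. */
    String nodeFor(String sessionId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes configured");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(sessionId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(key.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (digest[i] & 0xFF); // first 8 digest bytes as a 64-bit position
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

The property that matters for MCP is stability: when a node leaves the ring, only the sessions it owned are remapped, so streams on the surviving nodes keep their affinity.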

Health Checks

Expose an application-level readiness endpoint that verifies the dependencies your resource methods require. For simple transport-level health, Soklet also supports OPTIONS *, which lets a load balancer query server-wide capabilities without targeting an application route.

Keep readiness and liveness separate. A process can be alive while temporarily not ready to receive traffic because a dependency is unavailable, migrations are running, or a graceful shutdown has started.
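The separation can be expressed as two independent status computations. HealthState below is a hypothetical helper that returns HTTP status codes; exposing them through resource methods is left to your application.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class HealthState {
    private final AtomicBoolean dependenciesAvailable = new AtomicBoolean(false);
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    void setDependenciesAvailable(boolean available) {
        dependenciesAvailable.set(available);
    }

    void markShuttingDown() {
        shuttingDown.set(true);
    }

    /** Liveness: the process is up and able to answer, regardless of dependencies. */
    int livenessStatus() {
        return 200;
    }

    /** Readiness: only accept traffic when dependencies are reachable and shutdown has not begun. */
    int readinessStatus() {
        return dependenciesAvailable.get() && !shuttingDown.get() ? 200 : 503;
    }
}
```

Calling markShuttingDown at the start of graceful shutdown makes readiness fail first, which gives the load balancer time to route around the instance while liveness stays green.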

Static Files

For bundled web assets, prefer StaticFiles over hand-rolled path joins. Keep the configured root read-only during normal operation and publish changes atomically.

For high-volume public assets, many production systems should serve from object storage plus a CDN, such as S3 plus CloudFront or an equivalent setup, rather than sending every asset request through Soklet. Use Soklet for assets that are naturally app-owned, bundled with the deployment, or need application-local routing and policy.

When Soklet does serve files, use edge or CDN caching where appropriate. Configure Soklet cache policy deliberately: long-lived immutable caching for fingerprinted assets, and revalidation or no-cache for HTML and other frequently changing entrypoints.
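A cache policy along these lines might look like the following sketch. It assumes fingerprinted assets embed a hex hash of at least eight characters before the extension (for example app.3f9a1c2b.js). That naming convention, and the CachePolicy class itself, are assumptions for illustration rather than Soklet behavior.

```java
import java.util.regex.Pattern;

final class CachePolicy {
    // Assumption: fingerprinted file names look like name.<8+ hex chars>.ext
    private static final Pattern FINGERPRINTED = Pattern.compile(".*\\.[0-9a-f]{8,}\\.[A-Za-z0-9]+$");

    /** Chooses a Cache-Control value based on whether the path looks content-addressed. */
    static String cacheControlFor(String path) {
        if (FINGERPRINTED.matcher(path).matches()) {
            // The content never changes under this name, so caches may keep it for a year.
            return "public, max-age=31536000, immutable";
        }
        // HTML and other entrypoints: caches must revalidate before reuse.
        return "no-cache";
    }

    public static void main(String[] args) {
        System.out.println(cacheControlFor("/assets/app.3f9a1c2b.js"));
        System.out.println(cacheControlFor("/index.html"));
    }
}
```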

The default weak metadata ETag is cheap and deterministic across nodes serving the same filesystem metadata. EntityTagResolver::fromContentHash provides strong ETags, but it reads the full file on the request-handling thread, including for HEAD. Use manifest-backed ETags for large files or HEAD-heavy traffic.
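The cost difference between the two approaches is visible in a small sketch. The tag formats below are illustrative and are not the formats Soklet emits. The point is that a metadata tag needs only size and modification time, while a content tag must read every byte.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class EntityTags {
    /** Weak validator from metadata only: cheap, and identical on nodes with identical metadata. */
    static String weakMetadataTag(long sizeInBytes, long lastModifiedMillis) {
        return "W/\"" + Long.toHexString(sizeInBytes) + "-" + Long.toHexString(lastModifiedMillis) + "\"";
    }

    /** Strong validator from content: requires the whole file to be read and hashed. */
    static String strongContentTag(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            return "\"" + HexFormat.of().formatHex(digest) + "\"";
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

A manifest-backed resolver moves the hashing to build time: the deploy step records each file's hash once, and request handling becomes a map lookup.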

Do not follow symlinks unless the deployment requires it. Add X-Content-Type-Options: nosniff through headersResolver when serving browser-facing assets, and use accessResolver for path- or attribute-based hide/deny policy. Request-aware authorization should happen before calling StaticFiles.

See Static Files for routing examples, resolver configuration, MIME defaults, cache validators, and range behavior.

Deployment Recipes

These are starting points. Tune sizes, timings, and flags to your workload and platform.

Dockerfile

A minimal production-shaped image. The exec-form ENTRYPOINT, combined with exec java inside the shell command, makes the JVM replace the shell so it receives SIGTERM directly and graceful shutdown engages.

FROM amazoncorretto:25

# JDK 24 and later run ZGC in generational mode only, so -XX:+ZGenerational is omitted.
ENV JAVA_OPTS="-Xmx2g \
  -XX:+UseZGC \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp \
  -XX:+ExitOnOutOfMemoryError"

# Application artifact
COPY build/libs/myapp.jar /app/myapp.jar

EXPOSE 8080
USER 1000

ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -jar /app/myapp.jar"]

This recipe assumes myapp.jar is already built. If your container image builds Java source, make sure your build invokes Soklet's annotation processor, for example through Maven or Gradle annotation processing, or direct javac -processor com.soklet.SokletProcessor.

For a more complete buildable example, see the barebones-app Docker recipe.

Kubernetes Deployment

A starting-point manifest. Pay particular attention to the terminationGracePeriodSeconds / shutdownTimeout relationship and the preStop hook.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Must exceed the preStop sleep plus Soklet's shutdownTimeout; the grace-period
      # countdown includes preStop. K8s sends SIGKILL after this elapses.
      terminationGracePeriodSeconds: 40
      containers:
        - name: myapp
          image: myorg/myapp:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              # Keep headroom above -Xmx2g for metaspace, thread stacks, and direct buffers.
              memory: "3Gi"
          lifecycle:
            preStop:
              # Let the load balancer remove this pod from rotation
              # before the JVM begins shutting down.
              exec:
                command: ["sh", "-c", "sleep 5"]
          readinessProbe:
            httpGet:
              path: /readiness
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 30
            failureThreshold: 3

Key invariants:

  • terminationGracePeriodSeconds must be greater than the preStop delay plus Soklet's shutdownTimeout, with a few seconds of buffer. The grace period starts counting when termination begins, so time spent in the preStop hook is deducted from it. K8s sends SIGKILL after the grace period; if Soklet is still draining, in-flight requests get cut.
  • The preStop sleep lets the service mesh or ingress controller observe the pod's Terminating state and remove it from rotation before the JVM starts shutting down. Without it, a small number of requests will hit a draining pod and receive 503 responses.
  • Readiness and liveness should target distinct paths and have different failure semantics. Readiness can fail temporarily without restarting the pod (dependency unavailable, migrations running, graceful shutdown started); liveness failures restart the container.
  • For MCP-heavy workloads, configure ingress for session affinity by MCP-Session-Id header. Live MCP GET SSE streams are node-local; non-sticky load balancing breaks session continuity. See Multi-Node Deployments.
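The grace-period relationship is worth checking as arithmetic, because Kubernetes counts preStop hook time against terminationGracePeriodSeconds. The sketch below is a hypothetical helper you might run in a deployment test. The 5-second preStop and 30-second shutdownTimeout values come from the examples on this page, and the 5-second buffer is the suggested minimum.

```java
final class TerminationBudget {
    /** Smallest grace period that covers the preStop delay, Soklet's drain window, and a buffer. */
    static int minimumGracePeriodSeconds(int preStopSeconds, int shutdownTimeoutSeconds, int bufferSeconds) {
        return preStopSeconds + shutdownTimeoutSeconds + bufferSeconds;
    }

    /** True when the configured grace period leaves the full drain window plus buffer. */
    static boolean isSafe(int gracePeriodSeconds, int preStopSeconds, int shutdownTimeoutSeconds, int bufferSeconds) {
        return gracePeriodSeconds >= minimumGracePeriodSeconds(preStopSeconds, shutdownTimeoutSeconds, bufferSeconds);
    }

    public static void main(String[] args) {
        System.out.println(minimumGracePeriodSeconds(5, 30, 5)); // prints 40
        System.out.println(isSafe(35, 5, 30, 5));                // prints false
    }
}
```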

Service And Ingress

Use a standard Service to expose the Deployment, plus your ingress of choice (nginx, ALB, Envoy, Cloudflare). The ingress is responsible for TLS termination, HTTP/2 or HTTP/3 negotiation with clients, WAF, rate limiting, and forwarded-header propagation. See Deployment Boundary.

Common Pitfalls

  • terminationGracePeriodSeconds shorter than shutdownTimeout. Kubernetes sends SIGKILL before Soklet finishes draining, and pods drop in-flight requests. Verify the grace period exceeds the preStop delay plus shutdownTimeout with at least 5 seconds of buffer.
  • Default ulimit -n in containers. Many SSE/MCP connections saturate the per-process file-descriptor limit at modest concurrency. Raise to 65536 or higher.
  • Blocking I/O inside lifecycle observers or metrics collectors. Hot-path callbacks. Blocking exports, network I/O, or heavyweight logging here back-pressure the request pipeline. Buffer and flush asynchronously.
  • Trace IDs as metric labels. Cardinality explosion. Use OpenTelemetry exemplars for metrics-to-trace drill-down instead.
  • Reflecting Origin instead of allowlisting. Permissive CORS in production. See CORS for safe defaults.
  • MCP GET SSE streams without sticky routing. Streams are node-local; non-sticky load balancing breaks session continuity. Route by MCP-Session-Id header at the edge.
  • Request-time hashing of large static files. EntityTagResolver.fromContentHash() reads the whole file on the request-handling thread, including for HEAD. Use metadata ETags or a manifest-backed resolver for large files.
  • Unbounded requestHandlerQueueCapacity. Hides overload as growing tail latency rather than surfacing it as fast 503 Service Unavailable. Prefer a small queue.
  • Missing -Xmx in containers. The JVM autodetects container memory but explicit sizing keeps heap behavior deterministic across base-image and runtime updates.
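  • Shell-form ENTRYPOINT or CMD in Dockerfiles. The shell becomes PID 1 and does not forward SIGTERM to the JVM, so graceful shutdown never engages. Use the exec form, and if you need a shell for variable expansion, start the JVM with sh -c "exec java ..." so it replaces the shell.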