OpenTelemetry

This feature is available in v37 and later.

OpenTelemetry allows you to monitor the usage and health of your Hyperscience instance alongside that of other applications in your IT infrastructure. The data stream includes metrics for submission volume and throughput, time to completion for blocks and submissions, response times, error rates, and connectivity issues, among others.

The Hyperscience application emits only traces and metrics telemetry and is instrumented with the OpenTelemetry Python SDK. For more details, see the SDK’s GitHub repository page. Logs are not supported in OpenTelemetry format at this time.

Enabling OpenTelemetry in Hyperscience

Follow the steps below to expose an OpenTelemetry data stream to your application-performance monitoring tool.

Kubernetes deployments

1. Enable OpenTelemetry metrics and traces.

To enable the Hyperscience Helm chart to export OpenTelemetry metrics and traces to an OpenTelemetry collector, add the following to the values.yaml file:

opentelemetry:
  enabled: true
  metrics:
    endpoint: 'http://<collector-fqdn>:<port>/v1/metrics'
  traces:
    endpoint: 'http://<collector-fqdn>:<port>/v1/traces'
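
For example, if your collector runs as a Kubernetes Service named otel-collector in the monitoring namespace and listens on the standard OTLP/HTTP port 4318 (the Service name and namespace here are illustrative assumptions), the values might look like this:

opentelemetry:
  enabled: true
  metrics:
    # Assumed collector address; replace with your collector's FQDN and port.
    endpoint: 'http://otel-collector.monitoring.svc.cluster.local:4318/v1/metrics'
  traces:
    endpoint: 'http://otel-collector.monitoring.svc.cluster.local:4318/v1/traces'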

By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.

NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC: in a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the Forking with PeriodicExportingMetricReader results in ValueError issue in GitHub.
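
On the receiving side, a minimal sketch of an OpenTelemetry Collector configuration that accepts this stream over OTLP/HTTP is shown below. The port and the debug exporter are illustrative; substitute the exporters used by your monitoring backend.

# Sketch of an OpenTelemetry Collector configuration (not part of the Hyperscience Helm chart).
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # standard OTLP/HTTP port

exporters:
  debug: {}   # replace with the exporter(s) for your backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
    traces:
      receivers: [otlp]
      exporters: [debug]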

2. Secure an OTLP connection

You can enable TLS for the telemetry stream emitted by the Hyperscience application. However, only TLS is supported in OpenTelemetry Python; mTLS is not currently supported (see the Support mtls for otlp exporter issue in GitHub for details).

Note that, if you’re exporting OTLP data with TLS to an OpenTelemetry collector, then the collector must be configured to receive encrypted data.
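
For example, the collector's OTLP/HTTP receiver can be given a server certificate and key so that it accepts TLS connections (a sketch; the file paths below are assumptions):

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /certs/certificate.pem       # assumed path to the collector's server certificate
          key_file: /certs/certificate-key.pem    # assumed path to the matching private key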

To enable TLS in the Hyperscience application:

a. Add the server certificate to your cluster as a Kubernetes Secret object. Then, add the following snippet to the values.yaml file for the Helm chart (the values shown below are the chart’s defaults, which are used if a given property is omitted):

opentelemetry:
  tls:
    certSecretName: ''
    certName: 'certificate.pem'

  • opentelemetry.tls.certSecretName should be the name of the created Kubernetes Secret holding the server certificate.
  • opentelemetry.tls.certName should be the name of the item inside the Secret data where the certificate is stored (defaults to certificate.pem). An example is shown below.
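
For example, assuming the server certificate is stored locally as certificate.pem, it could be added to the cluster as a Secret named otel-server-cert (both names are illustrative) and referenced in values.yaml as follows:

# Create the Secret that holds the certificate (the Secret name is an assumption):
kubectl create secret generic otel-server-cert --from-file=certificate.pem=./certificate.pem

# Reference it in values.yaml:
opentelemetry:
  tls:
    certSecretName: 'otel-server-cert'
    certName: 'certificate.pem'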

b. Change the scheme of the endpoints to “https://”. Additionally, change the port to the TLS receiving port of the OpenTelemetry collector (see Secure an OTLP connection under "Docker Compose deployments" in this article for more information).

opentelemetry:
  enabled: true
  metrics:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/metrics'
  traces:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/traces'

NOTE: The value of the OTEL_EXPORTER_OTLP_CERTIFICATE “.env” file variable is automatically set by the Hyperscience Helm chart; it shouldn’t be configured manually!

3. Add extra configuration variables as needed.

OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Hyperscience-specific environment variables are also available; they are described at the end of this article.

Environment variables should be added to both the app and trainer sections in values.yaml. For example, if you would like metrics to be exported every 10 seconds (instead of the default 30 seconds):

app:
  dotenv:
    OTEL_METRIC_EXPORT_INTERVAL=10000

trainer:
  env:
    OTEL_METRIC_EXPORT_INTERVAL=10000

Docker Compose deployments

1.  Enable OpenTelemetry metrics and traces.

To enable the Hyperscience application to export OpenTelemetry metrics and traces to an OpenTelemetry collector, set the following environment variables in the “.env” file:

OTEL_SDK_DISABLED=false
OTEL_PYTHON_LOG_CORRELATION=true

OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://<collector-fqdn>:<port>/v1/metrics
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<collector-fqdn>:<port>/v1/traces

Setting OTEL_SDK_DISABLED to false enables metrics and traces instrumentation in the application, and setting OTEL_PYTHON_LOG_CORRELATION to true adds span and trace IDs to the application logs.

Our application also sets default values for the following environment variables:

OTEL_SERVICE_NAME=hyperscience

OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp

OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'

OTEL_METRIC_EXPORT_INTERVAL=30000

By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.

NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC: in a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the Forking with PeriodicExportingMetricReader results in ValueError issue in GitHub.

If you do not add these variables to your “.env” file, the default values above are used once you set OTEL_SDK_DISABLED=false in your “.env” file and run one of the commands described in Editing the “.env” file and running the application.

Many more options are available for configuring OpenTelemetry exporters and protocols in the Hyperscience application. Refer to OpenTelemetry’s official documentation for details.

2.  Secure an OTLP connection.

For OTLP specifically, if you want to use a secure connection, the scheme must be changed to “https://”, and extra environment variables must be set (e.g., the path to the certificate). The certificate must also be available inside the Hyperscience application. The steps below give an example configuration and instructions for adding it.

  1. Create the directory that will hold the certificate, if it does not already exist. Assuming “/mnt/hs/” is used as the HS_PATH, run the following command to create the directory:
    mkdir -p /mnt/hs/certs
  2. Copy the certificate into this directory, and then set its ownership by running the following command:
    chown -R 1000:1000 /mnt/hs/certs
  3. If SELinux is enabled, execute:
    chcon -t container_file_t -R /mnt/hs/certs/
    Note that if SELinux is enabled, each time a file is added to the certs directory above for any reason, you will need to execute the chcon command again.
  4. Add the following to the “.env” file:
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/metrics
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/traces
    OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/nginx/certs/<cert_name>

More information is provided in OpenTelemetry’s documentation.

3.  Add extra configuration variables as needed.

OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Hyperscience-specific environment variables are also available; they are described at the end of this article. As usual, any environment variables should be configured in the “.env” file.
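
For example, to export metrics every 10 seconds instead of the default 30 seconds (mirroring the Kubernetes example above), add the following line to the “.env” file:

OTEL_METRIC_EXPORT_INTERVAL=10000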

OpenTelemetry traces

Hyperscience creates and emits a trace for every created submission and every API request received. Traces are useful when troubleshooting issues.

Application logs contain trace_id and span_id data, as shown below. These IDs are added to the logs automatically when the OTEL_PYTHON_LOG_CORRELATION “.env” file variable is set to true.

[Image: application log entries showing trace_id and span_id values]

When troubleshooting a problem (e.g., latency, errors), you can copy the trace_id from the corresponding logs and search for it in your tracing backend, if one is configured. An example with Grafana and Tempo is shown below.

[Image: an example trace viewed in Grafana with the Tempo backend]

Note that when viewing traces of submissions, gaps in time are possible in the trace. They occur when the submission is waiting for manual tasks to be performed by keyers (e.g., Transcription Supervision).

Available Hyperscience metrics

Group Metric What do we use it for Notes
Hyperflow
hyperflow_task_count Number of tasks; set when task changes status. attributes: name, ref_name, status
hyperflow_task_backlog_duration Duration of task in the backlog; set when task becomes IN_PROGRESS. attributes: name, ref_name
hyperflow_task_run_duration Duration of task run time; set when task terminates. attributes: name, ref_name
hyperflow_task_poll_duration Duration of fetching a limited count of pending tasks. attributes: name, count
hyperflow_workflow_count Number of flows; set when flow changes status. attributes: name, top_level, version, status. The 'top_level=True' tag is useful for tracking the number of submissions.
hyperflow_workflow_run_duration Duration of flow run; set when flow terminates. attributes: name, top_level, version
hyperflow_payload_offload_count Number of WFE payloads offloaded to the object store. attributes: bucket_10k (size in 10s of KiB)
hyperflow_payload_store_duration Duration of storing WFE payload into the object store. attributes: bucket_10k (size in 10s of KiB)
hyperflow_payload_fetch_duration Duration of fetching WFE payload from the object store.  
hyperflow_engine_poll_workflows_duration Duration of fetching workflows ready to be advanced.  
hyperflow_engine_advance_workflow_duration Duration of advancing a single workflow instance.  
Job queue
jobs_count Number of jobs; set when job changes state. attributes: type, state
jobs_backlog_duration Duration jobs wait in the queue. attributes: type
jobs_exec_duration Duration of the job run time. attributes: type
jobs_cpu_time CPU time for running the job. attributes: type
jobs_system_time System time for running the job. attributes: type
jobs_query_duration Total duration of the DB queries executed in job. attributes: type
jobs_worker_duration Total duration of job execution (queries + everything else). attributes: type
jobs_worker_query_count Number of DB queries executed in job. attributes: type
Transcription
hs_task_time_taken_ms Milliseconds taken by a human or machine to transcribe a single field. attributes: task_type (transcription), entry_type (machine/human); optional attributes: user_is_staff, username
hs_machine_field_transcriptions Milliseconds taken by a machine to transcribe a single field with non-zero confidence. attributes: confidence_rd5, ml_model
hs_completed_human_entries_count Number of completed fields in a Transcription SV task. attributes: worker (username), task_type, status (DONE)
hs_finished_qa_records Number of completed Transcription QA supervision tasks. attributes: (multiple)
Submission pages
hs_submission_page_count Number of created submission pages.  
hs_submission_page_completed_count Number of completed submission pages. attributes: (optional) error_type
TDM (fka KDM)
hs_kdm_table_loaded_count Number of loaded (shown) training documents for Tables through the TDM API. example dashboard 
hs_kdm_table_saved_count Number of updated training documents for Tables through the TDM API.  
Table Layouts
hs_live_layout_tables_count Number of tables in live layouts (sent once daily). attributes: n_items
hs_live_layout_columns_count Number of table columns in live layouts (sent once daily). attributes: n_items
hs_working_layout_tables_count Number of tables in draft layouts (sent once daily). attributes: n_items
hs_working_layout_columns_count Number of table columns in draft layouts (sent once daily). attributes: n_items
Table SV
hs_copycat_time_taken_ms Milliseconds of copy-cat algorithm runtime (part of an API call).
hs_table_id_qa_tasks_until_consensus Number of times consensus was reached in Table ID QA, tagged with number of QA tasks used. attributes: num_qa_tasks
SV
task_response_time_taken_ms Time spent on a manual supervision task. attributes: worker (username), task_type, status (DONE)
task_response_submit_success Number of successfully completed manual supervision tasks. attributes: worker (username), task_type, status (DONE)
task_response_submit_fail Number of invalid responses for manual supervision tasks. attributes: worker (username), task_type, status (DONE)
crowd_user_activity Number of times a user starts/stops working on a manual supervision task. attributes: worker (username), activity (take a break/start working)
crowd_query_next_task Misc. timing for several small DB queries when fetching next manual tasks.  
DB deadlocks
hs_retry_db_transaction_count Number of times a DB transaction is retried (e.g., due to a DB deadlock). attributes: deadlock_retrial_count (optional; 1 when set), retry_transaction_exception_count (optional; 1 when set)

Hyperscience-specific environment variables

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=[0,100,250,500,750,1000,2500,5000,7500,10sec,15sec,20sec,30sec,45sec,1min,2min,3min,4min,5min,10min,30min,60min,3h,12h,24h]

This variable configures the bucket boundaries for all Histogram metrics that have a name ending in "duration" or "time" (e.g., http.client.duration or jobs_cpu_time). The bucket boundaries are in milliseconds; for display purposes in this document, we have used abbreviations (e.g. 60min). If you wish to change the above default values, all time representations MUST be substituted with their millisecond equivalents (e.g., 60min would become 3600000).
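
Written entirely in milliseconds, as it must appear in the “.env” file, the default above becomes:

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=[0,100,250,500,750,1000,2500,5000,7500,10000,15000,20000,30000,45000,60000,120000,180000,240000,300000,600000,1800000,3600000,10800000,43200000,86400000]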

The Hyperscience application uses Explicit Bucket Histograms. Explicit buckets are stated in terms of their upper boundary. Buckets are exclusive of their lower boundary and inclusive of their upper boundary, except at positive infinity. Each measurement belongs to the bucket with the smallest upper boundary that is greater than or equal to the measurement; for example, with the default boundaries above, a measurement of 750 ms falls into the (500, 750] bucket, while one of 751 ms falls into the (750, 1000] bucket. For more information, see OpenTelemetry’s Metrics SDK documentation in GitHub.

Too few buckets lead to less accurate metrics, while too many can cause high RAM usage, high disk usage, and slower performance.

We do not recommend changing the boundaries once the Hyperscience application is deployed and running, as doing so would create incompatible bucket ranges. These ranges would be problematic, for example, when calculating quantiles over periods of time that span both the old and the new bucket layouts. Also, excessive buckets lead to high cardinality, which can cause high RAM usage, high disk usage, and slower performance.

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=[0,1000,5000,15sec,30sec,1min,5min,15min,30min,60min]

This variable configures the bucket boundaries for Histogram metrics that have a name ending in "_duration_tasks" (e.g., hyperflow_task_backlog_duration_tasks or hyperflow_task_run_duration_tasks).
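
As with the variable above, the abbreviated time values are shown for readability only; written entirely in milliseconds, the default is:

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=[0,1000,5000,15000,30000,60000,300000,900000,1800000,3600000]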

By default, these metrics have fewer buckets, because they have higher cardinality than the rest. Otherwise, they would generate higher load on the observability infrastructure, leading to higher RAM usage, higher disk usage, and slower query performance.

Also, see the details on buckets given in OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION.
