OpenTelemetry allows you to monitor the usage and health of your Hyperscience instance alongside that of other applications in your IT infrastructure. The data stream includes metrics for submission volume and throughput, time to completion for blocks and submissions, response times, error rates, and connectivity issues, among others.
The Hyperscience application emits only traces and metrics telemetry and is instrumented with the OpenTelemetry Python SDK. For more details, see the SDK’s GitHub repository page. Logs are not supported in OpenTelemetry format at this time.
Enabling OpenTelemetry in Hyperscience
Follow the steps below to expose an OpenTelemetry data stream to your application-performance monitoring tool.
Kubernetes deployments
1. Enable OpenTelemetry metrics and traces
To enable OpenTelemetry from the Hyperscience Helm chart to export metrics and traces to an OpenTelemetry collector, add the following to the values.yaml file:
opentelemetry:
  enabled: true
  metrics:
    endpoint: 'http://<collector-fqdn>:<port>/v1/metrics'
  traces:
    endpoint: 'http://<collector-fqdn>:<port>/v1/traces'
By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.
NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC. In a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the Forking with PeriodicExportingMetricReader results in ValueError issue in GitHub.
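For reference, the endpoints configured in step 1 point at an OTLP receiver on the collector side. The following is a minimal collector configuration sketch; the listening port (4318, the conventional OTLP/HTTP port) and the debug exporter are illustrative assumptions, not values required by Hyperscience.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  # Prints received telemetry for verification; replace with your APM backend's exporter.
  debug: {}
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
    traces:
      receivers: [otlp]
      exporters: [debug]
With a collector configured this way, <port> in the endpoints above would be 4318.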
2. Secure an OTLP connection
It’s possible to enable TLS for the telemetry stream emitted from the Hyperscience application. However, only TLS is supported in OpenTelemetry Python; mTLS is not currently supported (see details of the Support mtls for otlp exporter issue in GitHub).
To enable TLS in the Hyperscience application:
a. The server certificate has to be added as a Kubernetes Secret object to your cluster. Then, add the following snippet to the values.yaml file for the Helm chart (the values shown below are the chart defaults, which are used when a given property is omitted):
opentelemetry:
  tls:
    certSecretName: ''
    certName: 'certificate.pem'
- opentelemetry.tls.certSecretName should be the name of the created Kubernetes Secret holding the server certificate (see the example command after this list).
- opentelemetry.tls.certName should be the name of the item inside the Secret data where the certificate is stored (defaults to certificate.pem).
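For example, assuming the server certificate is in a local file named certificate.pem, a matching Secret could be created with the command below; the Secret name otel-collector-cert is a placeholder, not a required value.
kubectl create secret generic otel-collector-cert --from-file=certificate.pem=./certificate.pem
With this Secret, opentelemetry.tls.certSecretName would be set to 'otel-collector-cert', and opentelemetry.tls.certName could keep its default of 'certificate.pem'.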
b. The scheme of the endpoints must be changed to “https://”. Additionally, the port must be changed to the TLS receiving port of the OpenTelemetry collector (see Secure an OTLP connection under "Docker Compose deployments" in this article for more information).
opentelemetry:
  enabled: true
  metrics:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/metrics'
  traces:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/traces'
NOTE: The value of the OTEL_EXPORTER_OTLP_CERTIFICATE “.env” file variable is automatically set by the Hyperscience Helm chart; it shouldn’t be configured manually!
3. Add extra configuration variables as needed.
OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Additionally, Hyperscience-specific environment variables are also available and described at the end of this document.
Environment variables should be added to both the app and trainer sections in values.yaml. For example, if you would like metrics to be exported every 10 seconds (instead of the default 30 seconds):
app:
  dotenv:
    OTEL_METRIC_EXPORT_INTERVAL=10000
…
trainer:
  env:
    OTEL_METRIC_EXPORT_INTERVAL=10000
Docker Compose deployments
1. Enable OpenTelemetry metrics and traces.
To enable OpenTelemetry in the Hyperscience application to export metrics and traces to an OpenTelemetry collector, the following environment variables must be set in the “.env” file:
OTEL_SDK_DISABLED=false
OTEL_PYTHON_LOG_CORRELATION=true
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://<collector-fqdn>:<port>/v1/metrics
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<collector-fqdn>:<port>/v1/traces
Setting OTEL_SDK_DISABLED to false instruments the application with metrics and traces, and OTEL_PYTHON_LOG_CORRELATION enables the addition of span and trace IDs to the application logs.
Our application also sets default values for the following environment variables:
OTEL_SERVICE_NAME=hyperscience
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'
OTEL_METRIC_EXPORT_INTERVAL=30000
By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.
NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC. In a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the Forking with PeriodicExportingMetricReader results in ValueError issue in GitHub.
If you do not add these variables to your “.env” file, the default values above will be used once you set OTEL_SDK_DISABLED=false in your “.env” file and run one of the commands described in Editing the “.env” file and running the application.
Many more options are available for configuring OpenTelemetry exporters and protocols in the Hyperscience application. Refer to OpenTelemetry’s official documentation for details:
- OpenTelemetry SDK - exporter selection
- OTLP Exporter Spec
- OTEL Exporter Configuration
- OpenTelemetry Python SDK - environment variables
2. Secure an OTLP connection.
For OTLP specifically, if you want to use a secure connection, the scheme must be changed to “https://”, and extra environment variables should be set (e.g., pointing to the certificate path, key, etc.). The certificate must be available to the Hyperscience application. The steps below give an example configuration and instructions for adding it.
- Create the directory that will hold the certificate, if it does not already exist. Assuming “/mnt/hs/” is used as the HS_PATH, run the following command to create the directory:
mkdir -p /mnt/hs/certs
- Copy the certificate to this directory. Then, make it accessible to the application by setting the directory’s ownership with the following command:
chown -R 1000:1000 /mnt/hs/certs
- If SELinux is enabled, execute the following command:
chcon -t container_file_t -R /mnt/hs/certs/
If SELinux is enabled, each time a file is added to the certs directory above for any reason, you will need to execute the chcon command again.
- Add the following to the “.env” file:
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/metrics
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/traces
OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/nginx/certs/<cert_name>
More information is provided in OpenTelemetry’s documentation.
3. Add extra configuration variables as needed.
OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Additionally, Hyperscience-specific environment variables are also available and described at the end of this document. As usual, any environment variables should be configured in the “.env” file.
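For example, to mirror the Kubernetes example above and export metrics every 10 seconds instead of the default 30 seconds, add the following line to the “.env” file:
OTEL_METRIC_EXPORT_INTERVAL=10000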
OpenTelemetry traces
Hyperscience creates and emits a trace for every created submission and every API request received. Traces are useful when troubleshooting issues.
Application logs contain trace_id and span_id data, as shown below. These IDs are added to the logs automatically when the OTEL_PYTHON_LOG_CORRELATION “.env” file variable is set to true.
When troubleshooting a problem (e.g., latency, errors), you can copy the trace_id from the corresponding logs and search for it in your tracing backend, if configured. An example with Grafana and Tempo is shown below.
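As an illustration, you could locate a trace ID in the application logs and then retrieve the full trace from a Tempo instance through its HTTP query API. The log file, trace ID, and Tempo host below are placeholders, and 3200 is Tempo’s default query port; adjust them for your environment.
grep '<trace_id>' <application-log-file>
curl http://<tempo-host>:3200/api/traces/<trace_id>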
Available Hyperscience metrics
Group | Metric | What do we use it for | Notes
---|---|---|---
Hyperflow | hyperflow_task_count | Number of tasks; set when task changes status. | attributes: name, ref_name, status
 | hyperflow_task_backlog_duration | Duration of task in the backlog; set when task becomes IN_PROGRESS. | attributes: name, ref_name
 | hyperflow_task_run_duration | Duration of task run time; set when task terminates. | attributes: name, ref_name
 | hyperflow_task_poll_duration | Duration of fetching a limited count of pending tasks. | attributes: name, count
 | hyperflow_workflow_count | Number of flows; set when flow changes status. | attributes: name, top_level, version, status. The 'top_level=True' tag is useful for tracking the number of submissions.
 | hyperflow_workflow_run_duration | Duration of flow run; set when flow terminates. | attributes: name, top_level, version
 | hyperflow_payload_offload_count | Number of WFE payloads offloaded to the object store. | attributes: bucket_10k (size in 10s of KiB)
 | hyperflow_payload_store_duration | Duration of storing WFE payload into the object store. | attributes: bucket_10k (size in 10s of KiB)
 | hyperflow_payload_fetch_duration | Duration of fetching WFE payload from the object store. |
 | hyperflow_engine_poll_workflows_duration | Duration of fetching workflows ready to be advanced. |
 | hyperflow_engine_advance_workflow_duration | Duration of advancing a single workflow instance. |
Job queue | jobs_count | Number of jobs; set when job changes state. | attributes: type, state
 | jobs_backlog_duration | Duration jobs wait in the queue. | attributes: type
 | jobs_exec_duration | Duration of the job run time. | attributes: type
 | jobs_cpu_time | CPU time for running the job. | attributes: type
 | jobs_system_time | System time for running the job. | attributes: type
 | jobs_query_duration | Total duration of the DB queries executed in a job. | attributes: type
 | jobs_worker_duration | Total duration of job execution (queries + everything else). | attributes: type
 | jobs_worker_query_count | Number of DB queries executed in a job. | attributes: type
Transcription | hs_task_time_taken_ms | Milliseconds taken by a human or machine to transcribe a single field. | attributes: task_type (transcription), entry_type (machine/human); optional attributes: user_is_staff, username
 | hs_machine_field_transcriptions | Milliseconds taken by a machine to transcribe a single field with non-zero confidence. | attributes: confidence_rd5, ml_model
 | hs_completed_human_entries_count | Number of completed fields in a Transcription SV task. | attributes: worker (username), task_type, status (DONE)
 | hs_finished_qa_records | Number of completed Transcription QA supervision tasks. | attributes: (multiple)
Submission pages | hs_submission_page_count | Number of created submission pages. |
 | hs_submission_page_completed_count | Number of completed submission pages. | attributes: (optional) error_type
TDM (fka KDM) | hs_kdm_table_loaded_count | Number of loaded (shown) training documents for Tables through the TDM API. | example dashboard
 | hs_kdm_table_saved_count | Number of updated training documents for Tables through the TDM API. |
Table Layouts | hs_live_layout_tables_count | Number of tables in live layouts (sent once daily). | attributes: n_items
 | hs_live_layout_columns_count | Number of table columns in live layouts (sent once daily). | attributes: n_items
 | hs_working_layout_tables_count | Number of tables in draft layouts (sent once daily). | attributes: n_items
 | hs_working_layout_columns_count | Number of table columns in draft layouts (sent once daily). | attributes: n_items
Table SV | hs_copycat_time_taken_ms | Milliseconds for copy-cat algorithm runtime (part of API call). |
 | hs_table_id_qa_tasks_until_consensus | Number of times consensus was reached in Table ID QA, tagged with the number of QA tasks used. | attributes: num_qa_tasks
SV | task_response_time_taken_ms | Time spent on a manual supervision task. | attributes: worker (username), task_type, status (DONE)
 | task_response_submit_success | Number of successfully completed manual supervision tasks. | attributes: worker (username), task_type, status (DONE)
 | task_response_submit_fail | Number of invalid responses for manual supervision tasks. | attributes: worker (username), task_type, status (DONE)
 | crowd_user_activity | Number of times a user starts/stops working on a manual supervision task. | attributes: worker (username), activity (take a break/start working)
 | crowd_query_next_task | Misc. timing for several small DB queries when fetching next manual tasks. |
DB deadlocks | hs_retry_db_transaction_count | Number of times a DB transaction is retried, e.g., due to DB deadlock. | attributes: deadlock_retrial_count (optional, when set is 1), retry_transaction_exception_count (optional, when set is 1)
Hyperscience-specific environment variables
OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION
OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=[0,100,250,500,750,1000,2500,5000,7500,10sec,15sec,20sec,30sec,45sec,1min,2min,3min,4min,5min,10min,30min,60min,3h,12h,24h]
This variable configures the bucket boundaries for all Histogram metrics that have a name ending in "duration" or "time" (e.g., http.client.duration or jobs_cpu_time). The bucket boundaries are in milliseconds; for display purposes in this document, we have used abbreviations (e.g. 60min). If you wish to change the above default values, all time representations MUST be substituted with their millisecond equivalents (e.g., 60min would become 3600000).
The Hyperscience application uses Explicit Bucket Histograms. Explicit buckets are stated in terms of their upper boundary. Buckets are exclusive of their lower boundary and inclusive of their upper boundary, except at positive infinity. Each measurement belongs to the bucket with the smallest upper boundary that is greater than or equal to the measurement; for example, with the default boundaries above, a measurement of 120 ms falls into the bucket with upper boundary 250. For more information, see OpenTelemetry’s Metrics SDK documentation in GitHub.
Having too few buckets leads to less accurate metrics, while having too many could cause high RAM usage, high disk usage, and slower performance.
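For reference, the default boundaries above, written entirely in milliseconds (the form in which the variable must actually be set), are:
OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=[0,100,250,500,750,1000,2500,5000,7500,10000,15000,20000,30000,45000,60000,120000,180000,240000,300000,600000,1800000,3600000,10800000,43200000,86400000]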
OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS
OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=[0,1000,5000,15sec,30sec,1min,5min,15min,30min,60min]
This variable configures the bucket boundaries for Histogram metrics that have a name ending in "_duration_tasks" (e.g., hyperflow_task_backlog_duration_tasks or hyperflow_task_run_duration_tasks).
By default, these metrics have fewer buckets, because they have higher cardinality than the rest. Otherwise, they would generate higher load on the observability infrastructure, leading to higher RAM usage, higher disk usage, and slower query performance.
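For reference, the default above expressed entirely in milliseconds corresponds to:
OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=[0,1000,5000,15000,30000,60000,300000,900000,1800000,3600000]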
Also, see the details on buckets given in OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION.