E-Stop and Health


dev_host monitors the health of all devices on the network and the health of the host itself.

Device E-Stop

If a fault condition is present on a device, the following occurs:

  • The device enters a safe error state. All actuation is disabled.
  • The device returns an e-stop to dev_host.
  • dev_host propagates the e-stop to all devices on the JCS network. All devices enter a safe state.
  • dev_host exits cyclic mode.

dev_host will attempt to query the faulted device and return any useful information. If the network itself is in error, dev_host will still attempt to get data, but this will most likely fail.

Device e-stop recovery

Once the error source has been cleared, the JCS network can be started again by:

  • Calling reset() to reset all devices. This will reset all devices in an error state and move them to the stopped state.
  • Calling start() to start all devices and processes again.

Not all error conditions can be cleared. For example, a network error is most likely unrecoverable. If an error cannot be cleared, the JCS system must be shut down and restarted.
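
The fault and recovery sequence above can be pictured as a small state machine. The following Python sketch is illustrative only: ToyNetwork, State and device_fault() are hypothetical names, and only reset() and start() mirror calls named in this document. It assumes that every device is held in an error state after an e-stop and that start() is refused until reset() has been called; the text does not confirm either detail.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    ERROR = "error"


class ToyNetwork:
    """Hypothetical stand-in for a JCS network; not the real dev_host API."""

    def __init__(self, device_names):
        self.devices = {name: State.STOPPED for name in device_names}
        self.cyclic = False

    def start(self):
        # Assumption: starting is refused while any device is still in error.
        if any(s is State.ERROR for s in self.devices.values()):
            raise RuntimeError("call reset() before start()")
        for name in self.devices:
            self.devices[name] = State.RUNNING
        self.cyclic = True

    def device_fault(self, name):
        # One device reports a fault; the host propagates the e-stop to
        # every device and leaves cyclic mode.
        if name not in self.devices:
            raise KeyError(name)
        for other in self.devices:
            self.devices[other] = State.ERROR
        self.cyclic = False

    def reset(self):
        # Move every device that is in an error state to the stopped state.
        for name, state in self.devices.items():
            if state is State.ERROR:
                self.devices[name] = State.STOPPED


net = ToyNetwork(["joint_1", "joint_2"])
net.start()
net.device_fault("joint_1")   # e-stop: all devices safe, cyclic mode exited
net.reset()                   # error source cleared, devices back to stopped
net.start()                   # devices and processes running again
```

The model has no unrecoverable case. In a real system, a fault that cannot be cleared would still require a full shutdown and restart.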


dev_host Health

dev_host monitors its own internal health and performance. The following is monitored:

  • Pending signal overruns.
  • Transport health.
  • Thread offset variance (jitter).

dev_host health is configured in dev_host.yaml under the health_config tag.

Pending signal overruns

A pending signal overrun occurs when a signal has not reached its destination by the time its next sync comes around.

Some communication errors are inevitable during the operation of a real-time robotic system. JCS does not e-stop immediately on a pending signal overrun. Instead, dev_host allows some signals to be dropped and attempts to continue. If signal overruns occur at a fast enough rate to reach a configured limit, dev_host will e-stop and attempt to shut down safely.

Pending signal overruns are monitored by an overrun detector, in which each signal overrun increments a counter. If the counter reaches a configured percentage of the base rate, the system will e-stop. The detector decrements the counter once every configured "leak" time.

Configure the detector with the following attributes:

  • pending_sequence_overrun_percent_limit : Percentage of base rate at which the detector counter will expire. Default: 1%.
  • pending_sequence_overrun_leak_time_s : Interval in seconds at which the detector leaks one count. Default: 0.25 (250 ms).
  • pending_sequence_overrun_start_snooze_s : Time in seconds to snooze the detector after a call to start(). Default: 1.0 (1 s).
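
The detector described above behaves like a leaky counter. The following Python sketch is one possible reading of that behaviour, not the dev_host implementation: the class name, the step() method and the choice to leak before counting within a tick are assumptions. The limit is taken as percent_limit percent of base_freq_hz, and one count is leaked per leak interval.

```python
class OverrunDetector:
    """Hypothetical leaky-counter model of an overrun detector."""

    def __init__(self, base_freq_hz, percent_limit, leak_time_s, start_snooze_s=0.0):
        # Counter value at which the detector expires (e-stop).
        self.limit = base_freq_hz * percent_limit / 100.0
        # Work in whole base-rate ticks to avoid accumulating float error.
        self.leak_ticks = max(1, round(leak_time_s * base_freq_hz))
        self.snooze_ticks = round(start_snooze_s * base_freq_hz)
        self.count = 0
        self.tick = 0

    def step(self, overrun):
        """Call once per base-rate tick; returns True when the limit is reached."""
        self.tick += 1
        # Leak one count each time a full leak interval has elapsed.
        if self.tick % self.leak_ticks == 0 and self.count > 0:
            self.count -= 1
        # Overruns are ignored while the detector is snoozed after start().
        if overrun and self.tick > self.snooze_ticks:
            self.count += 1
        return self.count >= self.limit


# Defaults from the list above at base_freq_hz: 1000 -> limit of 10 counts.
detector = OverrunDetector(1000, 1.0, 0.25, start_snooze_s=1.0)
tripped_at = None
for _ in range(5000):
    if detector.step(overrun=True):   # worst case: an overrun on every tick
        tripped_at = detector.tick
        break
print(tripped_at)  # 1010: ten counted overruns after the 1000-tick snooze
```

Isolated overruns never trip this model: as long as they arrive more slowly than one per leak interval, the counter drains back to zero between them.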

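Pending signal overruns are typically caused by:

  • A base frequency that is set too high. See Base frequency based errors.
  • Process execution time errors. See Process execution time.
  • Signal CRC errors. See Transport health.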

Example

Consider a system running with base_freq_hz: 1000 with the following detector configuration:

health_config:
    pending_sequence_overrun_percent_limit: 25.0
    pending_sequence_overrun_leak_time_s:   0.1

This system will e-stop if the counter reaches 250 (25% of the 1000 Hz base rate) as pending signal errors accumulate. The detector leaks one pending signal error every 100 ms.
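
The arithmetic behind this example can be checked with a few lines of Python. This is a back-of-envelope sketch with hypothetical function names, not part of JCS. It treats overruns and leaks as steady average rates, so the time to trip is an approximation.

```python
def trip_threshold(base_freq_hz, percent_limit):
    """Counter value at which the detector expires."""
    return base_freq_hz * percent_limit / 100.0


def leak_rate_per_s(leak_time_s):
    """Counts removed per second when one count leaks every leak_time_s."""
    return 1.0 / leak_time_s


def seconds_to_trip(base_freq_hz, percent_limit, leak_time_s, overruns_per_s):
    """Approximate time for a steady overrun rate to reach the limit.

    Returns None when the leak keeps up with the overruns, in which case
    the counter never accumulates.
    """
    net_per_s = overruns_per_s - leak_rate_per_s(leak_time_s)
    if net_per_s <= 0:
        return None
    return trip_threshold(base_freq_hz, percent_limit) / net_per_s


print(trip_threshold(1000, 25.0))             # 250.0
print(leak_rate_per_s(0.1))                   # 10.0
print(seconds_to_trip(1000, 25.0, 0.1, 60))   # 5.0
print(seconds_to_trip(1000, 25.0, 0.1, 10))   # None
```

With this configuration, a sustained rate of 10 overruns per second or fewer is absorbed indefinitely. At 60 overruns per second the counter gains about 50 counts per second and reaches 250 after roughly 5 seconds.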

Base frequency based errors

It takes a finite amount of time for data to travel around the network and for a JCS system to do all its work each tick. Driving the base frequency too high can lead to stability issues. As such, there is a trade-off between system complexity and base frequency.

For complex systems, base frequencies of 1kHz are reasonable. For less complex systems the base frequency can be pushed higher.

Base frequency based error troubleshooting

Some indications of a base frequency set too high are:

  • Pending signal overruns that occur immediately as the system starts.
  • Data corruption. Parameter exchanges result in data corruption or calls to parameter functions fail once the system enters cyclic mode.
  • Inability to start the system. The system is stuck at "waiting for cyclic warmup".

Some solutions to these symptoms are:

  • Lower the base system rate until the system becomes stable.
  • Decrease the number of fast rate signals being moved around the system.

Process execution time

Some indications of process execution time errors are:

  • Pending signal overruns that occur immediately as the system starts. Inspect the process timing.
  • Pending signal overruns that occur regularly as the system operates, but not enough to cause the system to shut down. Inspect the process timing, particularly for slow rate processes.
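
When inspecting process timing, a useful first check is whether the measured execution times fit inside one tick period (1 / base_freq_hz). The Python sketch below is a rough aid, not a JCS tool. The function name and the process names are hypothetical, and it assumes the worst case in which every listed process runs within the same tick.

```python
def tick_budget_report(base_freq_hz, process_times_s):
    """Compare summed process execution time with the tick period.

    process_times_s maps a process name to its measured execution time in
    seconds. Assumption: all processes run in the same tick (worst case).
    """
    period_s = 1.0 / base_freq_hz
    used_s = sum(process_times_s.values())
    return {
        "period_s": period_s,
        "used_s": used_s,
        "utilisation": used_s / period_s,
        "fits": used_s < period_s,
    }


# Hypothetical measurements, in seconds.
timings = {"control_loop": 0.0004, "logger": 0.0003}

for freq_hz in (1000, 2000):
    report = tick_budget_report(freq_hz, timings)
    print(freq_hz, report["fits"], round(report["utilisation"], 2))
```

A utilisation close to or above 1.0 suggests lowering the base frequency or shortening the slowest process. The sketch ignores network transfer time, which also consumes part of each tick.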

Transport health

Transport errors are typically a result of signal or parameter CRC errors, wiring issues, or internal transport (EtherCAT) errors.

CRC errors

Signal CRC errors will result in pending signal overruns. Parameter CRC errors will result in a parameter return status of jcs::RET_ERROR.

Transport errors

Internal transport errors are a result of an error in the EtherCAT subsystem. Most of the time JCS is able to recover from these errors.

dev_host attempts to continue on from internal transport errors by monitoring the errors via an overrun detector. The transport overrun detector operates the same as the pending signal overrun detector. See pending signal overruns for details.

Configure the transport overrun detector with the following attributes:

  • transport_overrun_percent_limit : Percentage of base rate at which the detector counter will expire. Default: 0.5%.
  • transport_overrun_leak_time_s : Interval in seconds at which the detector leaks one count. Default: 0.25 (250 ms).

Example

Consider a system running with base_freq_hz: 500 with the following detector configuration:

health_config:
    transport_overrun_percent_limit: 25.0
    transport_overrun_leak_time_s:   0.1

This system will e-stop if the counter reaches 125 (25% of the 500 Hz base rate) as transport errors accumulate. The detector leaks one transport error every 100 ms.

Thread offset variance

Any real-time system will occasionally suffer from thread jitter. dev_host monitors the performance of the real-time thread to ensure it stays within configurable bounds.

As with the previous error handling strategies, JCS will not e-stop immediately on a thread jitter excursion. Instead, dev_host allows some jitter to occur and attempts to continue. If the thread offset variance exceeds the configured bounds frequently enough, dev_host will e-stop and attempt to shut down safely.

An overrun detector is used to monitor the thread health and can be configured with the following attributes:

  • thread_offset_variance_percent : Variance percentage limit. Any thread offset variance excursion beyond this limit will cause a count to be added to the thread variance overrun detector. Default: 1%.
  • thread_offset_variance_overrun_percent_limit : Percentage of base rate at which the detector counter will expire. Default: 5%.
  • thread_offset_variance_overrun_leak_time_s : Interval in seconds at which the detector leaks one count. Default: 0.25 (250 ms).
  • thread_offset_variance_start_snooze_s : Time in seconds to snooze the detector after a call to start(). Default: 2.0 (2 s).

Example

Consider a system running with base_freq_hz: 1000 with the following detector configuration:

health_config:
    thread_offset_variance_percent:               5.0
    thread_offset_variance_overrun_percent_limit: 25.0
    thread_offset_variance_overrun_leak_time_s:   0.1

If the thread offset variance goes beyond 50 µs (5% of the 1 ms tick period), a count is added to the overrun detector. This system will e-stop if the overrun detector counter reaches 250. The detector leaks one thread overrun error every 100 ms.
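
The excursion limit in this example follows from the tick period. The Python sketch below shows the calculation and how a series of measured offsets might be turned into detector counts. The function names and sample offsets are hypothetical. It assumes the variance percentage is taken relative to the tick period (1 / base_freq_hz) and that excursions in either direction count, which matches the figures in this example but is not confirmed as the dev_host definition.

```python
def variance_limit_s(base_freq_hz, variance_percent):
    """Allowed thread offset excursion in seconds (assumed: percent of tick period)."""
    period_s = 1.0 / base_freq_hz
    return period_s * variance_percent / 100.0


def count_excursions(offsets_s, base_freq_hz, variance_percent):
    """Number of samples that would each add one count to the detector."""
    limit_s = variance_limit_s(base_freq_hz, variance_percent)
    return sum(1 for offset in offsets_s if abs(offset) > limit_s)


# Hypothetical wake-up offsets from nominal, in seconds.
samples_s = [10e-6, -60e-6, 49e-6, 75e-6]

limit_us = variance_limit_s(1000, 5.0) * 1e6
print(round(limit_us, 3))                       # 50.0
print(count_excursions(samples_s, 1000, 5.0))   # 2
```

Each counted excursion would then feed a leaky counter of the kind used for pending signal overruns, so occasional jitter spikes drain away while sustained jitter leads to an e-stop.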


Health and Timing Statistics

dev_host maintains health and timing statistics that are updated at each call to step_rt().

See jcs_host.h and jcs_host_types.h.