E-Stop and Health

dev_host monitors the health of all devices on the network and the health of the host itself.

Host and device estop

An estop may be generated by jcs_host itself, by any network device, or by the user via jcs_host::trigger_estop(). How the system reacts to an estop depends on both the user's requirements and the severity of the estop.

Some estops are considered unrecoverable and will be propagated to all devices regardless of user settings. These are typically transport errors and synchronous loop errors that result in a loss of communications with devices on the network.

The user may configure the behaviour of the system for device generated and externally generated estops.

Why do this?

Estop behaviour is somewhat configurable to allow the user to assert higher level control over the devices on the network in the event of a device estop. Some examples:

If a device generates an estop, the user wishes for the whole system to shut down and is happy to allow the devices themselves to manage this.
If a device generates an estop, the user wishes to transition to a higher level controller that places the machine into a kinematic configuration that enables safe shut down. The remaining active devices must keep running to achieve this.

Device estop

Almost all devices can generate an estop, but not all devices need to take action in the event of one. For example:

A motor controller may generate an estop and it may need to shut down or take some action in the event of an estop.
An analog interface may generate an estop, for example, if a limit is reached. However it does not need to shut down or enter a safe operating mode.

Thus, all devices can receive a network estop, but how they react to it is device dependent. Where required, the action a device takes in the event of an estop is defined by its network_estop_config parameter.

Propagation of device estop to rest of network

Device estop propagation is controlled by the dev_host behaviour parameter propagate_estop.

If propagate_estop is True and a device enters estop, then the following occurs:

The device generates an estop. Device specific estop behaviour occurs.
The device returns an estop to dev_host.
dev_host asserts has_estop().
dev_host propagates estop to all devices on the JCS network. All devices execute their device specific estop behaviour.
dev_host remains in cyclic mode.

If propagate_estop is False and a device enters estop, then the following occurs:

The device enters a safe error state. Device specific estop behaviour occurs.
The device returns an estop to dev_host.
dev_host asserts has_estop().
dev_host remains in cyclic mode.

dev_host will attempt to query the device that is in error and return any useful information. If the network itself is in error, dev_host will still attempt to get data, but this will most likely fail.

Configure estop propagation in dev_HOST.yaml under the behaviour tag:

behaviour:
    # We wish to keep other devices running in the event that a device estops
    propagate_estop: false

behaviour:
    # If one device estops, we want all other devices to estop too.
    propagate_estop: true

Propagation of external user estop to rest of network

If dev_host behaviour parameter propagate_estop is set to False then an estop generated by network devices or by the external user command trigger_estop() will not be propagated to the devices on the network.

In some cases the user may wish to manage the device behaviour in the event of an estop, but still make use of trigger_estop() to estop all devices on the network. This may be enabled by setting dev_host behaviour parameter propagate_external_estop to true.

Configure external estop propagation in dev_HOST.yaml under the behaviour tag:

behaviour:
    # We wish to manage other devices in the case of an estopped device
    propagate_estop: false
    # However we want to be able to propagate an external estop from trigger_estop() to all devices on the network
    propagate_external_estop: true
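
The propagation rules described above can be summarised as a small decision function. This is a minimal sketch rather than the JCS implementation: EstopSource and should_propagate are hypothetical names used only for illustration. It also assumes that an external estop is propagated whenever propagate_estop is true, which the text above implies but does not state outright.

```python
from enum import Enum


class EstopSource(Enum):
    """Where an estop came from (hypothetical names, not part of the JCS API)."""
    UNRECOVERABLE = "unrecoverable"  # e.g. transport or synchronous loop error
    DEVICE = "device"                # generated by a network device
    EXTERNAL = "external"            # generated by the user via trigger_estop()


def should_propagate(source, propagate_estop, propagate_external_estop):
    """Return True if the estop would be sent to all devices on the network."""
    if source is EstopSource.UNRECOVERABLE:
        # Unrecoverable estops are propagated regardless of user settings.
        return True
    if source is EstopSource.DEVICE:
        return propagate_estop
    # External estop. Assumption: it follows propagate_estop, and
    # propagate_external_estop enables it when propagate_estop is false.
    return propagate_estop or propagate_external_estop
```

With the configuration shown above (propagate_estop: false, propagate_external_estop: true), a device estop stays local while a call to trigger_estop() still reaches every device.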

Device estop recovery

Once the error source has been cleared, the JCS network can be started again by:

Calling reset() to reset all devices. This will reset all devices in an error state and move them to the stopped state.
Calling start() to start all devices and processes again.

Not all error conditions can be cleared. For example, a network error is most likely unrecoverable. In the case of an error that cannot be cleared, the JCS system must be shut down and restarted.

If a device has an error condition when a call to start() is made the following may happen:

If there is no list_start host command list or the device is not affected by the host command list, then dev_host will start. The device will generate an estop.
If the device is affected by a list_start host command list, then the command list will fail upon execution attempt. A stop command will be propagated to all devices on the network and dev_host will remain in cyclic mode.

dev_host Health

dev_host monitors its own internal health and performance. The following is monitored:

Pending signals overruns.
Transport health.
Thread offset variance (jitter).

dev_host health is configured in dev_host.yaml under the health_config tag.

Pending signals overruns

A pending signal overrun occurs when a signal has not reached its destination by the time its next sync comes around.

It is inevitable that some communication errors will occur during the operation of a real time robotic system. JCS will not estop immediately on a pending signal overrun. Instead, dev_host allows some signals to be dropped and attempts to continue. If signal overruns accumulate to a configured limit at a fast enough rate, then dev_host will estop and attempt to safely shut down.

Pending signal overruns are monitored by an overrun detector, whereby each signal overrun increments a counter. If the counter reaches a percentage limit of the defined rate the system will estop. The detector will decrement the counter at a configured "leak" time.

Configure the detector with the following attributes:

pending_sequence_overrun_percent_limit : Percentage of base rate at which the detector counter will expire. Default: 1%.
pending_sequence_overrun_leak_time_s : Time in seconds at which the detector will leak counts. Default: 250ms.
pending_sequence_overrun_start_snooze_s : Time in seconds to snooze the detector after a call to start(). Default: 1s.
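
The detector described above behaves like a leaky counter. The following is a minimal sketch of that idea, not the dev_host implementation. The class name OverrunDetector and its step() method are hypothetical. The sketch assumes the counter leaks one count per leak interval, that overruns during the snooze period are ignored, and that the limit is the given percentage of the base frequency.

```python
class OverrunDetector:
    """Leaky counter stepped once per base-rate tick (illustrative sketch)."""

    def __init__(self, base_freq_hz, percent_limit, leak_time_s, start_snooze_s=0.0):
        # Count at which the detector expires, as a percentage of the base rate.
        self.limit = base_freq_hz * percent_limit / 100.0
        # Leak interval and snooze period expressed in base-rate ticks.
        self.leak_ticks = max(1, round(leak_time_s * base_freq_hz))
        self.snooze_ticks = round(start_snooze_s * base_freq_hz)
        self.count = 0
        self.tick = 0

    def step(self, overrun):
        """Advance one tick. Return True if the detector has expired (estop)."""
        self.tick += 1
        # Leak one count per leak interval, never going below zero.
        if self.tick % self.leak_ticks == 0 and self.count > 0:
            self.count -= 1
        # Assumption: overruns seen while snoozing after start() are ignored.
        if overrun and self.tick > self.snooze_ticks:
            self.count += 1
        return self.count >= self.limit
```

With the default values listed above at a 1 kHz base rate, OverrunDetector(1000, 1.0, 0.25, 1.0) expires once 10 counts accumulate. Isolated overruns that arrive no faster than one per leak interval never accumulate, so the system keeps running.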

Pending signal overruns are typically caused by:

Base frequency too high.
Too many signals (base rate and full rate) for selected frequencies.
dev_host real time thread jitter.
Network errors.
Processes that take too long to execute.

Example

Consider a system running with base_freq_hz: 1000 with the following detector configuration:

health_config:
    pending_sequence_overrun_percent_limit: 25.0
    pending_sequence_overrun_leak_time_s:   0.1

This system will estop if the counter reaches 250 (25% of the 1000 Hz base rate) as pending signal errors occur. The system will leak one pending signal error every 100 ms.
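
The arithmetic behind this example can be checked with a short calculation. The helper names below are hypothetical, and the time-to-trip estimate assumes a steady, evenly spaced overrun rate and a leak of one count per leak interval. Bursty errors will behave differently.

```python
def overrun_limit(base_freq_hz, percent_limit):
    """Counter value at which the detector expires."""
    return base_freq_hz * percent_limit / 100.0


def seconds_to_trip(base_freq_hz, percent_limit, leak_time_s, overruns_per_s):
    """Approximate time until estop for a steady overrun rate.

    Returns None when the leak keeps pace with the overruns, in which
    case the counter does not build up under this simplified model.
    """
    leak_per_s = 1.0 / leak_time_s
    net_per_s = overruns_per_s - leak_per_s
    if net_per_s <= 0:
        return None
    return overrun_limit(base_freq_hz, percent_limit) / net_per_s


limit = overrun_limit(1000, 25.0)                 # 250.0 counts
steady = seconds_to_trip(1000, 25.0, 0.1, 60)     # 250 / (60 - 10) = 5.0 s
tolerated = seconds_to_trip(1000, 25.0, 0.1, 10)  # None: the leak keeps pace
```

Under these assumptions the configuration tolerates roughly 10 overruns per second indefinitely, and estops after about 5 seconds at a sustained 60 overruns per second.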

Base frequency based errors

It takes a finite amount of time for data to travel about the network and for a JCS system to do all its work each tick. Driving the base frequency too high can lead to stability issues. As such, there is a trade off between system complexity and base frequency.

For complex systems, base frequencies of 1kHz are reasonable. For less complex systems the base frequency can be pushed higher.

Base frequency based error troubleshooting

Some indications of a base frequency set too high are:

Pending signal overruns that occur immediately as the system starts.
Data corruption. Parameter exchanges result in data corruption or calls to parameter functions fail once the system enters cyclic mode.
Inability to start the system. Stuck at "waiting for cyclic warmup".

Some solutions to these symptoms are:

Lower the base system rate until the system becomes stable.
Decrease the number of fast rate signals being moved around the system.

Process execution time

Some indications of process execution time errors are:

Pending signal overruns that occur immediately as the system starts. Inspect the process timing.
Pending signal overruns that occur regularly as the system operates, but not enough to cause the system to shut down. Inspect the process timing, particularly for slow rate processes.

Transport health

Transport errors are typically a result of signal or parameter CRC errors, wiring issues or internal transport (EtherCAT) errors.

CRC errors

Signal CRC errors will result in pending signal overruns. Parameter CRC errors will result in a parameter return status jcs::RET_ERROR.

Transport errors

Internal transport errors are a result of an error in the EtherCAT subsystem. Most of the time JCS is able to recover from these errors.

dev_host attempts to continue on from internal transport errors by monitoring the errors via an overrun detector. The transport overrun detector operates the same as the pending signal overrun detector. See pending signal overruns for details.

Configure the transport overrun detector with the following attributes:

transport_overrun_percent_limit : Percentage of base rate at which the detector counter will expire. Default: 0.5%.
transport_overrun_leak_time_s : Time in seconds at which the detector will leak counts. Default: 250ms.

Example

Consider a system running with base_freq_hz: 500 with the following detector configuration:

health_config:
    transport_overrun_percent_limit: 25.0
    transport_overrun_leak_time_s:   0.1

This system will estop if the counter reaches 125 (25% of the 500 Hz base rate) as transport errors occur. The system will leak one transport error every 100 ms.

Thread offset variance

Any real time system will occasionally suffer from thread jitter. dev_host monitors the performance of the real time thread to ensure it stays within configurable bounds.

Like the previous error handling strategies, JCS will not estop immediately on a thread jitter excursion. Instead, dev_host allows some jitter to occur and attempts to continue. If the thread offset variance exceeds the configured bounds frequently enough, then dev_host will estop and attempt to safely shut down.

An overrun detector is used to monitor the thread health and can be configured with the following attributes:

thread_offset_variance_percent : Variance percentage limit. Any thread offset variance excursion beyond this limit will cause a count to be added to the thread variance overrun detector. Default: 1%.
thread_offset_variance_overrun_percent_limit : Percentage of base rate at which the detector counter will expire. Default: 5%.
thread_offset_variance_overrun_leak_time_s : Time in seconds at which the detector will leak counts. Default: 250ms.
thread_offset_variance_start_snooze_s : Time in seconds to snooze the detector after a call to start(). Default: 2s.

Example

Consider a system running with base_freq_hz: 1000 with the following detector configuration:

health_config:
    thread_offset_variance_percent:               5.0
    thread_offset_variance_overrun_percent_limit: 25.0
    thread_offset_variance_overrun_leak_time_s:   0.1

If the thread variance goes beyond 50 µs (5% of the 1 ms base period), a count will be added to the overrun detector. This system will estop if the overrun detector counter reaches 250. The system will leak one thread overrun error every 100 ms.
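
The excursion threshold in this example follows from the base period. The sketch below shows the conversion. It assumes that thread_offset_variance_percent is a percentage of the base period, which is what the example figures imply, and that excursions are judged on the magnitude of the offset. The function names are hypothetical.

```python
def variance_limit_us(base_freq_hz, variance_percent):
    """Thread offset variance limit in microseconds."""
    period_us = 1_000_000.0 / base_freq_hz
    return period_us * variance_percent / 100.0


def is_excursion(offset_us, base_freq_hz, variance_percent):
    """True if an offset would add a count to the overrun detector.

    Assumption: early and late offsets are treated alike.
    """
    return abs(offset_us) > variance_limit_us(base_freq_hz, variance_percent)


limit_us = variance_limit_us(1000, 5.0)    # 5% of a 1000 us period = 50.0 us
default_us = variance_limit_us(1000, 1.0)  # default 1% limit = 10.0 us
```

At a 1 kHz base rate, the default 1% limit allows only 10 µs of offset before a count is added, so the limit may need to be loosened on hosts with higher scheduling jitter.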

Health and Timing Statistics

dev_host maintains health and timing statistics that are updated at each call to step_rt().

See jcs_host.h and jcs_host_types.h.