Achieving an auto-restart mechanism for NiFi with Docker

Kun-Hung Tsai
Mar 21, 2022

Background

Recently, I was building a NiFi system in our on-premise environment. Everything went well until the JVM process unexpectedly ran out of memory (OOM) and took the whole NiFi service down.

To resolve this problem, besides setting a proper JVM heap size for NiFi (I was using the default value of 512 MB at first), I also needed a way to automatically recover NiFi from a failed state.

You might wonder: if I am using the Docker runtime to run the container, won’t it detect the failure and restart the container automatically?

Yes indeed, but only for the process with pid 1. The official NiFi Dockerfile uses the start.sh script as its entrypoint, so Docker only detects a failure when this process exits. An OOM in the JVM (pid 96 in my container) goes undetected, and the NiFi service hangs forever.

Implementation

After doing some research on Google, I decided to use the willfarrell/autoheal image together with the Docker healthcheck mechanism to achieve auto recovery.

Docker runs the command set in the healthcheck configuration and marks the container as healthy or unhealthy based on its exit code. The autoheal container then monitors for unhealthy Docker containers and restarts them.

Docker healthcheck configuration

I added a healthcheck configuration to the original docker-compose file. The test checks the exit code of curl https://localhost:8443/nifi/, which is non-zero when curl cannot connect or cannot complete the TLS handshake.

The -k option bypasses the certificate error caused by the self-signed certificate in NiFi.

# Healthcheck configuration for NiFi container
healthcheck:
  test: "${DOCKER_HEALTHCHECK_TEST:-curl -k https://localhost:8443/nifi/}"
  interval: "60s"
  timeout: "3s"
  start_period: "5s"
  retries: 5
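
One caveat with the check above: without -f, curl exits with 0 whenever the server returns any HTTP response, including 4xx and 5xx statuses. The check therefore only fails when the connection or TLS handshake fails, or when the 3s timeout is hit. If HTTP error statuses should also count as unhealthy, a variant along these lines may help. It is an untested sketch, not the configuration used in this post; -f, -s, -S and -o are standard curl options, and the list form with CMD-SHELL is standard Compose syntax.

```yaml
# Sketch only, not the configuration used in this post.
healthcheck:
  # -k:  skip certificate verification (self-signed certificate)
  # -f:  exit non-zero when the HTTP status is 400 or higher
  # -sS: hide the progress meter but still print errors
  test: ["CMD-SHELL", "curl -kfsS -o /dev/null https://localhost:8443/nifi/ || exit 1"]
  interval: "60s"
  timeout: "3s"
  start_period: "5s"
  retries: 5
```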

Autoheal container

I added another service, autoheal, to control the auto-restart of the NiFi container. The autoheal=true label is important here: it tells the autoheal container which containers it should monitor.

# NiFi container
labels:
  - "autoheal=true"
# Autoheal container
autoheal:
  image: willfarrell/autoheal:1.2.0
  tty: true
  container_name: autoheal
  restart: always
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  environment:
    AUTOHEAL_INTERVAL: 60
    AUTOHEAL_START_PERIOD: 300
    AUTOHEAL_DEFAULT_STOP_TIMEOUT: 120
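
Labelling is not the only way to select containers. As far as I understand the autoheal README, the AUTOHEAL_CONTAINER_LABEL variable controls which label is matched, and the special value all watches every running container. A minimal sketch, assuming you want all containers with a healthcheck watched instead of only labelled ones (check the README of the image version you run before relying on it):

```yaml
# Sketch only: watch all containers instead of only those labelled autoheal=true.
autoheal:
  image: willfarrell/autoheal:1.2.0
  restart: always
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  environment:
    # Default is "autoheal", which matches containers labelled autoheal=true.
    AUTOHEAL_CONTAINER_LABEL: all
```

With this setting the labels entry on the NiFi service is no longer needed, but any other unhealthy container on the same Docker host will be restarted as well.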

Complete docker-compose file

The complete docker-compose.yaml combines the two fragments above. You are free to adjust the following autoheal parameters to suit your needs.

AUTOHEAL_INTERVAL: 60 # check every 60 seconds
AUTOHEAL_START_PERIOD: 300 # wait 300 seconds before the first health check
AUTOHEAL_DEFAULT_STOP_TIMEOUT: 120 # during a restart, wait at most 120 seconds (the Docker default is 10) for a container to stop before killing it
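
Putting the pieces together, the file could look roughly like the sketch below. The healthcheck, label and autoheal settings are the ones shown in this post. The apache/nifi image tag and the port mapping are assumptions for illustration, and the container name nifi is taken from the autoheal log in the test below, so adjust them to your own setup.

```yaml
# Sketch only: assembled from the fragments in this post.
services:
  nifi:
    image: apache/nifi:latest      # assumption: use the tag you actually run
    container_name: nifi
    ports:
      - "8443:8443"                # assumption: HTTPS port from the healthcheck URL
    labels:
      - "autoheal=true"            # lets autoheal pick up this container
    healthcheck:
      test: "${DOCKER_HEALTHCHECK_TEST:-curl -k https://localhost:8443/nifi/}"
      interval: "60s"
      timeout: "3s"
      start_period: "5s"
      retries: 5

  autoheal:
    image: willfarrell/autoheal:1.2.0
    tty: true
    container_name: autoheal
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      AUTOHEAL_INTERVAL: 60
      AUTOHEAL_START_PERIOD: 300
      AUTOHEAL_DEFAULT_STOP_TIMEOUT: 120
```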

Test result

For testing, I removed the -k option from the healthcheck command. curl then fails with curl: (60) SSL certificate problem: self signed certificate, so autoheal detects the unhealthy state and restarts the NiFi container automatically.

$ docker logs autoheal -f
Monitoring containers for unhealthy status in 300 second(s)
21-03-2022 08:04:35 Container /nifi (***) found to be unhealthy - Restarting container now with 120s timeout

Reference

https://sdr-enthusiasts.gitbook.io/ads-b/useful-extras/auto-restart-unhealthy-containers
https://wshs0713.github.io/posts/b8226bad/ (in Chinese)

I think that’s it. Thanks for reading, and feel free to discuss with me in the comments.
