Monitoring Data Pipelines - Study Notes from Data Engineering with Python Ch 9

You built a data pipeline. It is idempotent, uses atomic transactions, and is under version control. It is production ready. But can you tell when it breaks?

Chapter 9 of Data Engineering with Python by Paul Crickard is about the thing people forget until it is too late: monitoring. Your pipeline will fail. The network will drop. Elasticsearch will go down. A bad record will sneak through. The question is whether you find out in five minutes or five days.

This chapter covers three ways to monitor NiFi pipelines: using the GUI, using processors, and using Python with the REST API.

Monitoring NiFi with the GUI

The simplest way to monitor NiFi is to just look at it. The GUI gives you several built-in tools.

The Status Bar

At the top of the NiFi window, right below the component toolbar, there is a status bar. It shows you everything at a glance:

  • Active threads tell you how many tasks are running and give you a sense of load
  • Total queued data shows how many flowfiles are waiting and how much disk space they take
  • Remote process groups show whether NiFi instances on different machines are communicating
  • Component states tell you how many processors are running, stopped, invalid, or disabled
  • Version info shows which processor groups are up to date, locally modified, or stale
  • Last refresh tells you when the status bar data was last updated (usually every five minutes)

Each processor group and each individual processor also has its own mini status bar. You can see the same metrics scoped to just that component. The “In” and “Out” numbers on a processor show how much data passed through it in the last five minutes.

Bulletins

Bulletins are NiFi’s built-in error notifications. When something goes wrong inside a processor, a red square icon appears on the component. Hover over it and you see the error message.

For example, if Elasticsearch is down, the processor sending data to it will show a connection timeout bulletin. You can see these at the processor level or at the processor group level. If you want the full picture, open the Bulletin Board from the waffle menu in the top-right corner. It shows all active bulletins across your entire NiFi instance.

You can control how noisy bulletins are. Each processor has a Bulletin Level setting under its configuration. Set it higher to only see critical errors. Set it lower to see warnings and info messages too.

Counters

Counters are a simple way to track how many flowfiles pass through a specific point in your pipeline. You add an UpdateCounter processor between two other processors, give the counter a name, and set a delta (how much to increment by).

Here is how it works. Say you want to know how many records get sent to Elasticsearch. Drop an UpdateCounter processor right before the Elasticsearch processor. Every flowfile that passes through increments the count by one (or whatever delta you set).

To see the counts, go to the waffle menu and select Counters. You will see the count for each named counter and an aggregate total across all UpdateCounter processors that share the same name.

It is basic. But knowing that 162 records went through one part of your pipeline while 1,173 went through another tells you a lot about where data is flowing and where it might be getting stuck.

Reporting Tasks

Reporting tasks are background jobs that monitor NiFi itself and post results to the bulletin board. Think of them as processors that watch the system instead of processing data.

You set them up from Controller Settings in the waffle menu. NiFi ships with several built-in reporting tasks. The book walks through MonitorDiskUsage as an example.

The setup is straightforward. You specify a directory to watch and a threshold. Set the threshold to 80% and you get a bulletin when disk usage passes that mark. Set it to 1% (like the book does for demo purposes) and you get a bulletin immediately because almost any drive is more than 1% full.

This is useful for catching problems before they become emergencies. If your NiFi logs directory is filling up, you want to know before it runs out of space and everything crashes.

Monitoring NiFi with Processors

The GUI is great when you are sitting in front of NiFi. But here is the problem: you cannot watch NiFi all day. A better approach is to have NiFi notify you when something goes wrong.

Sending Slack Messages on Failure

The book demonstrates this with the PutSlack processor. The idea is simple. When a processor fails, route the failed flowfile to PutSlack, which sends a message to your Slack channel (or directly to you).

Setting it up requires a few steps on the Slack side first. You need to create a Slack app in your workspace, enable Incoming Webhooks, pick a channel, and copy the webhook URL.

Back in NiFi, you add a PutSlack processor and connect it to the failure relationship of whatever processor you want to monitor. In the book’s example, the ElasticSCF processor (which sends data to Elasticsearch) has its failure output connected to PutSlack.

The PutSlack configuration takes:

  • Webhook URL (the Slack webhook you copied)
  • Username (the display name for the bot in Slack)
  • Webhook Text (the message to send)

The clever part is the Webhook Text. You can use NiFi’s expression language to include flowfile attributes in the message. The book uses ${id:append(': Record failed Upsert Elasticsearch')} which takes the record ID and appends a failure description. So instead of a generic “something failed” message, you get “7781316: Record failed Upsert Elasticsearch” in your Slack DM.

Now you can be away from NiFi and still know the moment something breaks. And you know exactly which record caused the problem.

PutSlack is just one option. NiFi has processors for email, file writing, HTTP calls, and many other output channels. The pattern is the same: connect to the failure relationship and send a notification wherever your team looks.
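Under the hood, a Slack Incoming Webhook is just an HTTP POST with a JSON body, which is roughly what PutSlack does for you. Here is a minimal sketch of the same notification in plain Python, using only the standard library; the webhook URL is whatever you copied from Slack, and the message text mirrors the book's expression `${id:append(': Record failed Upsert Elasticsearch')}`:

```python
import json
import urllib.request

def build_payload(record_id: str) -> dict:
    """Build the message body, mirroring the book's expression:
    ${id:append(': Record failed Upsert Elasticsearch')}"""
    return {"text": f"{record_id}: Record failed Upsert Elasticsearch"}

def notify_slack(webhook_url: str, record_id: str) -> None:
    """POST the payload to a Slack Incoming Webhook, as PutSlack does."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_payload(record_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack responds with "ok" on success
```

Calling `notify_slack(url, "7781316")` would land the same "7781316: Record failed Upsert Elasticsearch" message in your channel. In NiFi itself you let PutSlack do this; the sketch is just to show there is no magic in the pattern.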

Using Python with the NiFi REST API

The GUI and processors cover most monitoring needs. But if you want full control, NiFi has a REST API that lets you build your own monitoring tools in Python.

System Diagnostics

The system diagnostics endpoint gives you resource usage information. You can check heap size, thread count, and heap utilization with a single GET request to the /nifi-api/system-diagnostics endpoint.

The response contains an aggregate snapshot with fields like maxHeap, totalThreads, and heapUtilization. In the book’s example, the NiFi instance was using 81% of its 512 MB heap across 108 threads. That is the kind of data you would feed into a monitoring dashboard.
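A minimal sketch of that request, assuming an unsecured NiFi instance on localhost:8080 (adjust the host and add authentication for a real deployment). The sample dict below hard-codes the numbers quoted from the book just to show the response shape:

```python
import json
import urllib.request

NIFI = "http://localhost:8080"  # assumed local, unsecured NiFi instance

def get_diagnostics() -> dict:
    """Single GET to the system diagnostics endpoint."""
    with urllib.request.urlopen(f"{NIFI}/nifi-api/system-diagnostics") as resp:
        return json.load(resp)

def heap_summary(diag: dict) -> str:
    """Pull the headline numbers out of the aggregate snapshot."""
    snap = diag["systemDiagnostics"]["aggregateSnapshot"]
    return (f"{snap['heapUtilization']} of {snap['maxHeap']} "
            f"across {snap['totalThreads']} threads")

# Example response shape, using the figures quoted from the book:
sample = {"systemDiagnostics": {"aggregateSnapshot": {
    "maxHeap": "512 MB", "totalThreads": 108, "heapUtilization": "81.0%"}}}
print(heap_summary(sample))  # 81.0% of 512 MB across 108 threads
```

In practice you would call `heap_summary(get_diagnostics())` on a schedule and ship the result wherever your dashboards live.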

Process Group and Processor Info

You can query any processor group by its ID using the /nifi-api/process-groups/{id} endpoint. The response includes the group name, version state, flowfile counts, queue sizes, bytes read and written, and active thread count.

Individual processors work the same way. Hit /nifi-api/processors/{id} and you get the processor name, status, and all the metrics you would see in the GUI status bar.
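Querying a process group follows the same one-GET pattern. The sketch below assumes a local instance and uses field names from the process group status DTO (`queuedCount`, `bytesRead`, `activeThreadCount`, and so on); the `sample` dict is illustrative, not a real response:

```python
import json
import urllib.request

NIFI = "http://localhost:8080"  # assumed local, unsecured NiFi instance

def group_status(group_id: str) -> dict:
    """GET /nifi-api/process-groups/{id} and return the status snapshot."""
    with urllib.request.urlopen(f"{NIFI}/nifi-api/process-groups/{group_id}") as r:
        return json.load(r)["status"]["aggregateSnapshot"]

def summarize(snapshot: dict) -> str:
    """Condense a status snapshot into one line for logging or alerting."""
    return (f"queued={snapshot['queuedCount']} ({snapshot['queued']}), "
            f"read={snapshot['bytesRead']}B written={snapshot['bytesWritten']}B, "
            f"threads={snapshot['activeThreadCount']}")

# Illustrative snapshot shape:
sample = {"queuedCount": "0", "queued": "0 bytes", "bytesRead": 0,
          "bytesWritten": 0, "activeThreadCount": 0}
print(summarize(sample))
```

Swap the path for `/nifi-api/processors/{id}` and the same pattern gives you per-processor metrics.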

Reading Flowfile Queues

This one is interesting. You can actually read the contents of flowfiles sitting in a queue. It is a three-step process:

  1. Make a POST request to create a listing request for the queue
  2. GET the status of that listing request to get individual flowfile IDs
  3. GET the content of a specific flowfile by its ID

The book demonstrates this by pulling a full JSON record from the SeeClickFix pipeline queue. You get the complete data object with all its fields. This is extremely useful for debugging. Instead of clicking through the NiFi GUI to inspect queue contents, you can script it.

You can also clear queues programmatically by making a POST to the drop-requests endpoint. Useful when you need to flush bad data out of a pipeline.
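The three listing steps and the drop request can be sketched like this. The connection ID is whatever queue you want to inspect (find it in the GUI or via the process group endpoint), and the host is again an assumed local instance:

```python
import json
import urllib.request

NIFI = "http://localhost:8080"  # assumed local, unsecured NiFi instance

def queue_url(conn_id: str, suffix: str) -> str:
    """Build a flowfile-queue endpoint URL for a given connection ID."""
    return f"{NIFI}/nifi-api/flowfile-queues/{conn_id}/{suffix}"

def _json(method: str, url: str) -> dict:
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def list_queue(conn_id: str) -> list:
    # Step 1: POST creates a listing request for the queue
    listing = _json("POST", queue_url(conn_id, "listing-requests"))
    req_id = listing["listingRequest"]["id"]
    # Step 2: GET the listing request's status for flowfile summaries
    status = _json("GET", queue_url(conn_id, f"listing-requests/{req_id}"))
    return status["listingRequest"]["flowFileSummaries"]

def flowfile_content(conn_id: str, uuid: str) -> bytes:
    # Step 3: GET the raw content of one flowfile by its UUID
    with urllib.request.urlopen(
            queue_url(conn_id, f"flowfiles/{uuid}/content")) as resp:
        return resp.read()

def drop_queue(conn_id: str) -> dict:
    # Flush the queue: POST a drop request
    return _json("POST", queue_url(conn_id, "drop-requests"))
```

A listing request may take a moment to complete on a busy queue, so production code would poll step 2 until the request reports finished rather than reading it back immediately.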

Bulletins, Counters, and Reporting Tasks

The API exposes the same information you see in the GUI:

  • /nifi-api/flow/bulletin-board returns all active bulletin messages
  • /nifi-api/counters returns all counter values and their aggregates
  • /nifi-api/reporting-tasks/{id} returns the configuration and state of reporting tasks

With these endpoints, you could build a Python script that polls NiFi every few minutes, checks for errors in the bulletin, verifies counter values are within expected ranges, and sends alerts through whatever channel you prefer.
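A skeleton of that polling script might look like the following. The counter name `SCFtoElasticsearch` and the threshold of 100 are hypothetical, the host is an assumed local instance, and the counter field names follow the counters response shape (where `valueCount` can be a formatted string like "1,173"):

```python
import json
import urllib.request

NIFI = "http://localhost:8080"  # assumed local, unsecured NiFi instance

def fetch(path: str) -> dict:
    with urllib.request.urlopen(f"{NIFI}{path}") as resp:
        return json.load(resp)

def counter_below(counters: dict, name: str, minimum: int) -> bool:
    """True if the named counter's aggregate value is under an expected floor."""
    for c in counters["counters"]["aggregateSnapshot"]["counters"]:
        if c["name"] == name:
            # valueCount may arrive formatted, e.g. "1,173"
            return int(str(c["valueCount"]).replace(",", "")) < minimum
    return True  # a missing counter is itself worth an alert

def poll_once() -> None:
    bulletins = fetch("/nifi-api/flow/bulletin-board")["bulletinBoard"]["bulletins"]
    counters = fetch("/nifi-api/counters")
    if bulletins or counter_below(counters, "SCFtoElasticsearch", 100):
        print("alert: check NiFi")  # swap in Slack, email, or a pager here
```

Run `poll_once()` from cron or a scheduler every few minutes and you have the bones of a custom monitor.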

Key Takeaways

  • Use the GUI for quick checks. The status bar, bulletins, and counters give you a fast overview without any extra setup.
  • Use processors for automated alerts. PutSlack (or PutEmail, or any output processor) on the failure relationship means you hear about problems without watching the screen.
  • Use the REST API for custom monitoring. If you need dashboards, automated responses, or integration with other systems, the API gives you everything the GUI shows and more.
  • Counters are underrated. A simple count of records passing through key points in your pipeline can reveal bottlenecks and data loss fast.
  • Reporting tasks catch infrastructure problems. Disk usage, memory pressure, and other system-level issues can kill your pipeline before any data-level error does.

My Take

This is a short chapter. Crickard covers a lot of ground without going too deep into any one area. That works for a book that is trying to give you the full data engineering picture from start to finish.

The PutSlack example is practical and immediately useful. If you are running NiFi in any real environment, hooking up Slack (or email, or PagerDuty) notifications on failure relationships should be one of the first things you do. Finding out about a pipeline failure from an end user is a bad day.

The REST API section is the most valuable part for engineers who want to go beyond the basics. Once you can query NiFi programmatically, you can build monitoring into whatever observability stack your team already uses. Feed the data into Prometheus, Grafana, Datadog, or a simple Python script that sends you a text at 2 AM. The API gives you the raw material.

One thing the chapter does not cover much is what to monitor. It shows you how to monitor, which is important. But knowing that you should track things like processing latency, queue depth trends over time, and error rates by processor type takes experience. That is the kind of knowledge you build by running pipelines in production and learning what breaks.

Still, this chapter gives you the tools. And having the tools is the first step toward building a monitoring practice that keeps your pipelines healthy.

