Real-Time Edge Data with MiNiFi and Spark - Study Notes from Data Engineering with Python Ch 15
You have NiFi running. Kafka is streaming. Spark is processing. But what about the data source? What happens when your data comes from a tiny sensor or a Raspberry Pi that can barely run a web browser?
Chapter 15 of Data Engineering with Python by Paul Crickard tackles the last mile of the data pipeline: getting data off small devices at the edge using Apache MiNiFi. This is the final project chapter of the book, and it brings together the tools you have been learning throughout.
What Is MiNiFi?
MiNiFi is NiFi’s little sibling. It is a stripped-down, lightweight version of Apache NiFi designed for devices that do not have the resources to run a full NiFi installation.
Think IoT sensors. Raspberry Pi boards. Small servers sitting in remote locations. These devices generate data, but they cannot run a heavyweight data platform. They need something small that can collect data and send it somewhere bigger for processing.
Here is how it works. You design your data pipeline in NiFi (with the full GUI), export it as a template, convert it to a config file, and deploy that config file to MiNiFi on the edge device. MiNiFi runs the pipeline and streams data back to your main NiFi instance.
No GUI on the edge device. No heavy Java processes eating all the memory. Just a small agent doing one job: collecting data and sending it upstream.
Java vs C++ Version
MiNiFi comes in two flavors: Java and C++.
The C++ version has the smallest footprint. If your device has very limited memory and CPU, this is the one to use.
The Java version supports more processors and is easier to extend. If your device has a bit more power and you need a wider set of capabilities, go with Java.
You can also copy NiFi processor NAR files into MiNiFi’s lib directory to add processors that are not included by default. Some processors also need their controller service NAR files copied over.
The book uses the Java version for its examples.
Setting Up MiNiFi
The setup is straightforward. Download the MiNiFi binary and the MiNiFi toolkit. Extract both. Move them to your home directory.
Then set the environment variable so your shell knows where MiNiFi lives:
export MINIFI_HOME=/home/youruser/minifi
export PATH=$MINIFI_HOME/bin:$PATH
Add those lines to your .bashrc file so they persist across sessions.
The toolkit stays on the machine where NiFi runs. MiNiFi goes on the edge device. For development, you can run both on the same machine, which is what the book does.
Building the MiNiFi Pipeline
This is the interesting part. You do not build MiNiFi pipelines on the edge device. You build them in NiFi, then deploy them.
Step 1: Set Up Site-to-Site on NiFi
MiNiFi talks to NiFi through a feature called Site-to-Site. You need to make sure the port is configured. Open the nifi.properties file and set:
nifi.remote.input.socket.port=1026
Then create an input port on the NiFi canvas and name it something like “minifi”. This is the door through which MiNiFi data enters NiFi.
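The book only changes the socket port, but it helps to see it in context. The surrounding Site-to-Site entries in nifi.properties look roughly like this (the HTTP entries shown are NiFi defaults, included here for orientation; your port may differ):

```properties
# Raw socket transport for Site-to-Site; the book uses port 1026
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.port=1026
# HTTP transport rides on the regular NiFi web port
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
```

Restart NiFi after changing this file so the port takes effect.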
Step 2: Build the Receiving Pipeline in NiFi
On the NiFi canvas (outside any process group), wire up processors to handle incoming MiNiFi data. The book builds a simple pipeline:
- Input port (named “minifi”) receives data from MiNiFi
- EvaluateJsonPath extracts a filename from the JSON payload
- UpdateAttribute sets the flowfile filename attribute
- PutFile writes the file to disk on the NiFi machine
This is the server side. It receives data and does something with it.
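Stripped of the NiFi machinery, the receiving flow's logic is: parse the JSON payload, use fname as the filename, write the body to disk. A minimal Python sketch of that behavior, where handle_flowfile is a hypothetical stand-in for the processor chain, not a NiFi API:

```python
import json
import pathlib
import tempfile

# Hypothetical stand-in for the NiFi receiving flow (not a NiFi API):
# EvaluateJsonPath pulls "fname" out of the JSON payload, UpdateAttribute
# makes it the flowfile's filename, and PutFile writes the body to disk.
def handle_flowfile(payload: str, out_dir: str) -> str:
    record = json.loads(payload)
    path = pathlib.Path(out_dir) / record["fname"]
    path.write_text(record["body"])
    return str(path)

# Same payload shape the MiNiFi GenerateFlowFile processor will send.
out_dir = tempfile.mkdtemp()
written = handle_flowfile('{"fname":"minifi.txt","body":"Some text"}', out_dir)
print(written)
```

In the real pipeline, EvaluateJsonPath is configured with a property such as fname set to the JsonPath $.fname, and UpdateAttribute sets filename to ${fname}.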
Step 3: Build the MiNiFi Task
Now you build what will actually run on the edge device. Create a process group in NiFi and name it something like “minifitask”.
Inside that group, add two things:
- GenerateFlowFile processor, scheduled to run every 30 seconds, with its custom text set to a JSON payload like {"fname":"minifi.txt","body":"Some text"}
- Remote Process Group pointing to your NiFi instance URL, with HTTP as the transport protocol
Connect the GenerateFlowFile processor to the Remote Process Group. When you make the connection, NiFi will ask which input port to send data to. Pick the “minifi” port you created earlier.
Right-click the Remote Process Group and enable transmission. You should see a blue circle icon, which means it is ready to send.
Step 4: Export and Convert the Template
Here is where MiNiFi’s workflow gets a bit unusual. You need to:
- Exit the process group
- Right-click the group and save it as a template
- Download the template as an XML file from the NiFi Templates menu
- Use the MiNiFi toolkit to convert that XML into a YAML config file
The conversion command looks like:
./bin/config.sh transform /path/to/template.xml /path/to/config.yml
If everything works, you get a success message and a config.yml file.
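The generated file follows the MiNiFi YAML config schema. A rough sketch of its shape (names, host, and values below are illustrative, not output from an actual run):

```yaml
MiNiFi Config Version: 3
Flow Controller:
  name: minifitask
Processors:
  - name: GenerateFlowFile
    class: org.apache.nifi.processors.standard.GenerateFlowFile
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 30 sec
    Properties:
      Custom Text: '{"fname":"minifi.txt","body":"Some text"}'
Remote Process Groups:
  - url: http://nifi-host:8080/nifi
    transport protocol: HTTP
    Input Ports:
      - name: minifi
```

Skimming the YAML before deploying is a quick sanity check that the scheduling period, custom text, and remote URL came through the conversion correctly.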
Step 5: Deploy and Run
Copy the generated config.yml to the $MINIFI_HOME/conf directory on the edge device (replacing the default config that ships with MiNiFi).
Start MiNiFi:
./minifi.sh start
MiNiFi reads the config file, starts the pipeline, and begins streaming data to your NiFi instance. You can check the logs at $MINIFI_HOME/logs/minifi-app.log.
Back in NiFi, you should see data flowing in through the input port. The process group you used to create the template will be stopped, because its job is done. The actual data is coming from MiNiFi on the (simulated) edge device.
Why This Matters
The power of MiNiFi is that once data hits NiFi, you have the full toolkit available. You can:
- Route it to a Kafka topic and make it available to any consumer
- Process it with Spark for real-time analytics
- Store it in Elasticsearch for searching
- Write it to a data warehouse for long-term analysis
MiNiFi is the bridge between tiny devices and your full data platform. The edge device only needs to do one thing: send data. NiFi, Kafka, and Spark handle everything else.
Bonus: NiFi Clustering (Appendix)
The book’s appendix covers NiFi clustering, which ties nicely into the edge data story. If you are receiving streams from hundreds of MiNiFi devices, a single NiFi instance might not be enough.
NiFi uses Zero-Master Clustering. There is no permanent master node. Every node can perform the same work. ZooKeeper elects two special roles:
- Cluster Coordinator: handles new node connections and distributes updated flows
- Primary Node: runs isolated processors (ones that should only run on a single node to avoid race conditions, like reading from a single file or database)
Changes made on any node get replicated to all other nodes. You build your pipeline once and it runs everywhere.
The practical concern is processors like PutFile or GetFile. If you have three nodes and all three try to read the same file, you get a race condition. The fix is to schedule those processors to run on the Primary Node only.
NiFi also handles node failure gracefully. If a node disconnects, its flowfiles get redistributed to the remaining nodes. When it reconnects, the load rebalances automatically.
Setting up a cluster means editing several config files: zookeeper.properties, nifi.properties, and the hosts file. It is not hard, but it is detailed. The book walks through a two-node cluster setup step by step.
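As a rough sketch of the nifi.properties side, the per-node cluster settings look like this (hostnames and ports are placeholders; each node also needs its ZooKeeper setup, whether embedded or an external ensemble):

```properties
# Run this instance as a cluster node
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node-1
nifi.cluster.node.protocol.port=9991
# Where the nodes find ZooKeeper for coordination and elections
nifi.zookeeper.connect.string=nifi-node-1:2181,nifi-node-2:2181
nifi.web.http.host=nifi-node-1
```

Each node gets the same connect string but its own address, which is why the hosts file (or DNS) has to resolve every node name consistently across the cluster.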
Key Takeaways
- MiNiFi is NiFi for small devices. It runs on Raspberry Pi, IoT sensors, and edge servers without needing the full NiFi installation.
- You design MiNiFi pipelines in NiFi. Build the pipeline in the GUI, export it as a template, convert it to YAML, and deploy.
- Site-to-Site is the connection method. MiNiFi uses NiFi’s Site-to-Site protocol to stream data back to the main instance.
- Once data hits NiFi, everything is available. Kafka, Spark, Elasticsearch, databases. The edge device just needs to send.
- NiFi clustering scales the receiving end. Zero-Master Clustering distributes the load and handles node failures.
My Take
This is a short chapter, and it feels more like a demo than a deep technical guide. But it serves an important purpose: it shows you the complete picture. Data engineering is not just about what happens in your server room. It is about getting data from where it is generated to where it needs to be processed.
MiNiFi fills a real gap. In any IoT or edge computing scenario, you need something lightweight running on the device. You do not want to SSH into a hundred Raspberry Pis to set up data pipelines manually. The workflow of designing in NiFi and deploying via config files makes it manageable at scale.
The version compatibility issues the book mentions (needing older versions of NiFi and Java for MiNiFi toolkit 0.5.0) are a sign of a project that was still maturing at the time of writing. The Apache NiFi ecosystem has evolved since, but the core concepts remain the same.
As the final project chapter, it wraps up the book’s progression nicely: you started with flat files and databases, moved through NiFi and Airflow, added Kafka for streaming, Spark for processing, and now MiNiFi for edge collection. That is a complete data engineering stack from end to end.
The appendix on clustering is a solid bonus. In production, you will almost certainly want NiFi running on more than one machine. Understanding Zero-Master Clustering, the role of the Primary Node, and how flowfiles get redistributed is essential knowledge for running NiFi at scale.