The Data Consumption Layer - Querying with Trino

You’ve built your ingestion, you’ve processed your data with Spark, and it’s all sitting neatly in your S3 “Gold” bucket. Now what? You can’t ask every business analyst to learn PySpark just to see last month’s sales.

This is where the Data Consumption Layer comes in. In Chapter 9, Neylson Crepalde introduces a tool that feels like magic: Trino.

SQL directly on your Data Lake

Trino (formerly PrestoSQL) is a distributed SQL query engine. The key word here is engine, not database. Trino doesn’t store data itself. Instead, it reaches out to where your data lives (like S3) and queries it in place using standard SQL.

Why is this a big deal?

  • No more ETL to a Warehouse: For many workloads you don’t need to load data into a separate Snowflake or Redshift instance. You query the files where they already sit.
  • Speed: Trino is designed for interactive analytics. It uses a coordinator-worker architecture: the coordinator plans the query and splits it into tasks, and the worker pods run those tasks in parallel.
  • Cost-Effective: Storage and compute are separate. Your data stays in low-cost S3, and you pay for Trino compute only while the cluster is up, so you can scale workers down when nobody is querying.
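
To make the coordinator-worker idea concrete, here is a toy sketch in plain Python. It is not Trino code and says nothing about Trino’s internals; the file names, the sample numbers, and the worker_scan and coordinator_sum functions are all invented for illustration. It only shows the general pattern: hand out pieces of the work, compute partial results in parallel, then merge them.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "files" in a data lake: each path maps to a list of sale amounts.
FILES = {
    "sales/part-0.parquet": [10.0, 20.0],
    "sales/part-1.parquet": [5.0, 5.0, 5.0],
    "sales/part-2.parquet": [40.0],
}


def worker_scan(path):
    """A 'worker' reads one file and returns a partial (sum, count)."""
    values = FILES[path]
    return sum(values), len(values)


def coordinator_sum(paths, workers=3):
    """The 'coordinator' hands out files, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_scan, paths))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total, count


if __name__ == "__main__":
    total, count = coordinator_sum(list(FILES))
    print(f"SUM(amount) = {total}, COUNT(*) = {count}")
```

The merge step is cheap because each worker returns only a small partial result, not the raw rows. That is the property that lets this kind of design scale out by adding workers.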

Setting up Trino on Kubernetes

We use Helm to deploy Trino to our EKS cluster. The configuration is surprisingly simple. You just need to tell Trino how many workers you want and how to find your data.
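
As a rough sketch of what “how many workers” and “how to find your data” can look like, the Python snippet below builds a dictionary shaped like a Helm values file and prints it as JSON. Treat the details as assumptions and not as the book’s configuration: the key names server.workers and catalogs should be checked against the values.yaml of the chart version you install (older chart versions used a different key for catalogs), and the catalog name hive, the region, and the build_trino_values helper are placeholders.

```python
import json


def build_trino_values(workers, glue_region):
    """Return a dict shaped like a Helm values file for Trino.

    Key names are assumptions; compare them with your chart's values.yaml.
    """
    # Hive connector properties that point Trino at the AWS Glue Data Catalog.
    catalog_properties = "\n".join([
        "connector.name=hive",
        "hive.metastore=glue",
        f"hive.metastore.glue.region={glue_region}",
    ])
    return {
        "server": {"workers": workers},
        # "hive" is a placeholder catalog name; analysts would see it in SQL.
        "catalogs": {"hive": catalog_properties},
    }


if __name__ == "__main__":
    values = build_trino_values(workers=3, glue_region="us-east-1")
    print(json.dumps(values, indent=2))
```

The point of the sketch is how little there is to configure: one number for the workers and a handful of connector properties for the catalog.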

A crucial part of this setup is the AWS Glue Data Catalog. This acts as the “phone book” for Trino, telling it exactly which S3 folders represent which tables and what the columns are.
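
To see what the “phone book” actually holds, the sketch below pulls the S3 location and the column list out of a Glue table description. The sample dictionary is hand-written in the shape that the Glue get_table call returns; the bucket name, the columns, and the describe_table helper are made up for illustration.

```python
def describe_table(response):
    """Extract name, S3 location, and columns from a Glue get_table response."""
    table = response["Table"]
    storage = table["StorageDescriptor"]
    columns = [(col["Name"], col["Type"]) for col in storage["Columns"]]
    return {
        "name": table["Name"],
        "location": storage["Location"],
        "columns": columns,
    }


# Hand-written sample in the shape of a Glue response (all values are made up).
SAMPLE_RESPONSE = {
    "Table": {
        "Name": "titanic",
        "StorageDescriptor": {
            "Location": "s3://example-gold-bucket/titanic/",
            "Columns": [
                {"Name": "passenger_id", "Type": "bigint"},
                {"Name": "survived", "Type": "int"},
            ],
        },
    }
}

if __name__ == "__main__":
    info = describe_table(SAMPLE_RESPONSE)
    print(info["name"], "->", info["location"])
    for name, col_type in info["columns"]:
        print(f"  {name}: {col_type}")
```

With AWS credentials in place, the same function should work on a live response from boto3.client("glue").get_table(DatabaseName=..., Name="titanic"), where the database name is whatever you registered in Glue.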

The Analyst Experience

Once Trino is running, you can connect almost any SQL client that has a Trino driver. The book uses DBeaver, but you could use Tableau, Power BI, or even Excel.

To the analyst, it looks and feels like a regular Postgres or MySQL database. They write SELECT * FROM titanic, and Trino handles the heavy work of reading those files from S3, parsing them, and returning the result, often within seconds for interactive queries.
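
The same experience is available from code. The sketch below assumes the open-source trino Python client (pip install trino), which follows the standard Python DB-API; the host, port, user, catalog, and schema are placeholders, and run_query and connect_to_trino are helper names invented here. So that the example runs without a cluster, the demo at the bottom uses an in-memory SQLite table as a stand-in for Trino.

```python
import sqlite3


def run_query(connection, sql):
    """Run a query on any DB-API 2.0 connection; return (column names, rows)."""
    cursor = connection.cursor()
    cursor.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    return columns, cursor.fetchall()


def connect_to_trino():
    """Open a Trino connection. Every argument value below is a placeholder."""
    from trino.dbapi import connect  # third-party: pip install trino

    return connect(
        host="trino.example.internal",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )


if __name__ == "__main__":
    # Stand-in for a real cluster: an in-memory SQLite table named titanic.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE titanic (name TEXT, survived INTEGER)")
    conn.execute("INSERT INTO titanic VALUES ('Example Passenger', 1)")
    print(run_query(conn, "SELECT * FROM titanic"))
```

Against a real cluster you would pass connect_to_trino() to run_query; the SQL itself stays the same.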

Trino is a great fit for analytical queries over data at rest, but what if you need to search through millions of logs in real time? For that, we need a different set of tools. In the next post, we’ll look at Elasticsearch and Kibana.

Next: Real-Time Visualization with Elasticsearch and Kibana
Previous: Deploying the Big Data Stack on Kubernetes - Part 2

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
