This is part 2 of a series on data engineering in a big data environment. It reflects my personal journey of lessons learnt and culminates in the open source tool Flowman, which I created to remove the burden of reimplementing the same boilerplate code over and over again across projects.
- Part 1: Big Data Engineering — Best Practices
- Part 2: Big Data Engineering — Apache Spark
- Part 3: Big Data Engineering — Declarative Data Flows
- Part 4: Big Data Engineering — Flowman up and running
What to expect
This series is about building data pipelines with Apache Spark for batch processing. But some aspects are also valid for other frameworks or for stream processing. Eventually I will introduce Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing.
Introduction
This second part highlights why Apache Spark is so well suited as a framework for implementing data processing pipelines. There are many alternatives, especially in the domain of stream processing. But from my point of view, when working in a batch world (and there are good reasons to do so, especially if many non-trivial transformations are involved that require a large amount of history, like grouped aggregations and huge joins), Apache Spark is an almost unrivaled framework that excels specifically in batch processing.
This article tries to shed some light on the capabilities Spark offers that provide a solid foundation for batch processing.
Technical requirements for data engineering
In the first part I already commented on the typical steps of a data processing pipeline. Let's just repeat them:
- Extraction. Read data from some source system (be it a shared filesystem like HDFS, an object store like S3, or a database like MySQL or MongoDB).
- Transformation. Apply some transformations like data extraction, filtering, joining or even aggregation.
- Loading. Store the results back again into some target system. Again this can be a shared filesystem, object store or some database.
We can now deduce some requirements of the framework or tool to be used for data engineering by mapping each of these steps to a desired capability — with some additional requirements added to the end.
- Broad range of connectors. We need a framework that is able to read from a broad range of data sources like files in a distributed file system, records from a relational database, a column store or even a key-value store.
- Broad and extensible range of transformations. In order to “apply transformations” the framework should clearly support and implement transformations. Typical transformations are simple column-wise transformations like string operations, filtering, joins, grouped aggregations — all the stuff that is offered by traditional SQL. On top of that the framework should offer a clean and simple API to extend the set of transformations, specifically the column-wise transformations. This is important for implementing custom logic that cannot be implemented with the core functionality.
- Broad range of connectors. Again we need a broad range of connectors for writing the results back into the desired target storage system.
- Extensibility. I already mentioned this in the second requirement above, but I feel this aspect is important enough for an explicit point. Extensibility may not only be limited to the kind of transformations, but it should also include extension points for new input/output formats and new connectors.
- Scalability. Whatever solution is chosen, it should be able to handle an ever growing amount of data. First, in many scenarios you should be prepared to handle more data than would fit into RAM, so you don't get completely stuck by the sheer volume. Second, you might want to distribute the workload onto multiple machines if the amount of data slows down processing too much.
What is Apache Spark
Apache Spark provides good answers to all the requirements above. Spark itself is a collection of libraries, a framework for developing custom data processing pipelines. This means that Apache Spark is not a full-blown application, but requires you to write programs which contain the transformation logic, while Spark takes care of executing that logic efficiently, distributed across multiple machines in a cluster.
Spark was initially started at UC Berkeley's AMPLab in 2009 and open sourced in 2010. In 2013 the project was donated to the Apache Software Foundation. It soon gained traction, especially among people who had worked with Hadoop MapReduce before. Initially Spark offered its core API around so-called RDDs (Resilient Distributed Datasets), which provide a much higher level of abstraction in comparison to Hadoop MapReduce and thereby help developers to work much more efficiently.
Later on, the newer and preferred DataFrame API was added, which implements a relational algebra with an expressiveness comparable to SQL. This API provides concepts very similar to tables in a database, with named and strongly typed columns.
While Apache Spark itself is developed in Scala (a mixed functional and object oriented programming language running on the JVM), it provides APIs to write applications using Scala, Java, Python or R. When looking at the official examples, you quickly realize that the API is really expressive and simple.
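To give an impression of the DataFrame API, here is a minimal sketch of a complete batch job written in Scala. It is not one of the official examples; the file paths, schema and column names are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SimpleBatchJob {
  def main(args: Array[String]): Unit = {
    // Create a Spark session; "local[*]" runs without any cluster infrastructure
    val spark = SparkSession.builder()
      .appName("simple-batch-job")
      .master("local[*]")
      .getOrCreate()

    // Extraction: read raw CSV files (hypothetical location and layout)
    val transactions = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/raw/transactions")

    // Transformation: filter invalid records and aggregate per customer
    val totals = transactions
      .filter(col("amount") > 0)
      .groupBy(col("customer_id"))
      .agg(sum(col("amount")).as("total_amount"))

    // Loading: write the result back as Parquet
    totals.write
      .mode("overwrite")
      .parquet("hdfs:///data/processed/customer_totals")

    spark.stop()
  }
}
```

These few lines already cover the extraction, transformation and loading steps discussed above; switching the source or target to a database or an object store is mostly a matter of changing the read and write calls.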
- Connectors. Since Apache Spark is only a processing framework with no built-in persistence layer, it has always relied on connectivity to storage systems like HDFS, S3 or relational databases via JDBC. This means that a clean connector design was built in from the beginning, specifically with the advent of DataFrames. Nowadays almost every storage or database technology simply needs to provide an adaptor for Apache Spark to be considered a possible choice in many environments.
- Transformations. The original core library provides the RDD abstraction with many common transformations like filtering, joining and grouped aggregations. But nowadays the newer DataFrame API is to be preferred and provides a huge set of transformations mimicking SQL. This should be enough for most needs.
- Extensibility. New transformations can easily be implemented with so-called user defined functions (UDFs), where you only need to provide a small snippet of code working on an individual record or column, and Spark wraps it up such that the function can be executed in parallel and distributed across a cluster of computers (a small sketch follows right after this list).
Since Spark has a very high code quality, you can even go down one or two layers and implement new functionality using the internal developer APIs. This might be a little more difficult, but can be very rewarding for those rare cases which cannot be implemented using UDFs.
- Scalability. Spark was designed to be a Big Data tool from the very beginning, and as such it can scale to many hundreds of nodes within different types of clusters (Hadoop YARN, Mesos and lately, of course, Kubernetes). It can process data much bigger than what would fit into RAM. One very nice aspect is that Spark applications can also run very efficiently on a single node without any cluster infrastructure, which is nice from a developer's point of view for testing, but which also enables you to use Spark for not-so-huge amounts of data and still benefit from Spark's features and flexibility.
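As a rough illustration of the extensibility point above, the following sketch shows how a custom column-wise transformation could be wrapped as a UDF. The data and the normalization logic are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A plain Scala function implementing some custom logic...
    val normalizePhone: String => String =
      phone => Option(phone).map(_.replaceAll("[^0-9+]", "")).orNull

    // ...wrapped as a UDF so Spark can apply it to a whole column,
    // in parallel and distributed across a cluster
    val normalizePhoneUdf = udf(normalizePhone)

    // Small in-memory example data set
    val customers = Seq(
      ("Alice", "+49 (0)30 12 34 56"),
      ("Bob", "030/654321")
    ).toDF("name", "phone")

    customers
      .withColumn("phone_normalized", normalizePhoneUdf(col("phone")))
      .show()

    spark.stop()
  }
}
```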
In all four of these aspects Apache Spark is very well suited to typical data transformation tasks formerly done with dedicated and expensive ETL software from vendors like Talend or Informatica. By using Spark instead, you get all the benefits of a vibrant open source community and the freedom to tailor applications precisely to your needs.
Although Spark was created with huge amounts of data in mind, I would always consider it even for smaller amounts of data simply because of its flexibility and the option to seamlessly grow with the amount of data.
Alternatives
Of course Apache Spark isn't the only option for implementing data processing pipelines. Software vendors like Informatica and Talend also provide very solid products for people who prefer to buy into complete ecosystems (with all the pros and cons).
But even in the Big Data open source world, there are some projects which could seem to be alternatives at first glance.
First, we still have Hadoop around. But Hadoop actually consists of three components, which have been split up cleanly: First there is the distributed file system HDFS, which is capable of storing really huge amounts of data (petabytes, say). Next there is the cluster scheduler YARN for running distributed applications. Finally there is the MapReduce framework for developing a very specific type of distributed data processing application. While the first two components, HDFS and YARN, are still widely used and deployed (although they feel the pressure from cloud storage and Kubernetes as possible replacements), the MapReduce framework nowadays simply shouldn't be used by any project any more. Its programming model is much too complicated, and writing non-trivial transformations can become really hard. So, yes, HDFS and YARN are fine as infrastructure services (storage and compute), and Spark is well integrated with both.
Other alternatives could be SQL execution engines (without an integrated persistence layer) like Hive, Presto, Impala, etc. While these tools often also provide broad connectivity to different data sources, they are all limited to SQL. For one, SQL queries themselves can become quite tricky for long chains of transformations with many common table expressions (CTEs). Second, it is often more difficult to extend SQL with new features. I wouldn't say that Spark is better than these tools in general, but I would say that Spark is better for data processing pipelines. These tools really shine for querying existing data, but I would not want to use them for creating data; that was never their primary scope. Conversely, while you can use Spark via the Spark Thrift Server to execute SQL for serving data, it wasn't really created for that scenario either.
Development
One question I often hear is which programming language should be used for accessing the power of Spark. As I wrote above, Spark provides bindings for Scala, Java, Python and R out of the box, so the question really makes sense.
My advice is to use either Scala or Python (maybe R; I don't have experience with that), depending on the task. Never use Java (it really feels much more complicated than the clean Scala API); invest some time to learn some basic Scala instead.
Now that leaves us with the question “Python or Scala”.
- If you are doing data engineering (read, transform, store), then I strongly advise you to use Scala. First, since Scala is a statically typed language, it is actually simpler to write correct programs than with Python (a small sketch follows after this list). Second, whenever you need to implement new functionality not found in Spark, you are better off with Spark's native language. Although Spark supports UDFs in Python, you will pay a performance penalty, and you cannot dive any deeper. Implementing new connectors or file formats in Python will be very difficult, maybe even impossible.
- If you are doing Data Science (which is not the scope of this article series), then Python is the much better option, with all those Python packages like Pandas, SciPy, scikit-learn, TensorFlow etc.
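To illustrate the static typing argument from above, here is a small hypothetical sketch using Spark's typed Dataset API in Scala. The case class and the path are made up; the point is that the compiler already rejects a misspelled field or a wrong comparison type, errors which a PySpark job would only surface at runtime:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type describing the input data
case class Order(orderId: Long, customerId: Long, amount: Double)

object TypedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read into a typed Dataset[Order] instead of an untyped DataFrame
    val orders: Dataset[Order] = spark.read
      .parquet("hdfs:///data/orders")
      .as[Order]

    // The compiler verifies field names and types:
    // `_.amont` or `_.amount > "10"` would not even compile
    val bigOrders = orders.filter(_.amount > 100.0)

    bigOrders.show()
    spark.stop()
  }
}
```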
Beyond the different libraries in those two scenarios, the typical development workflow also differs greatly: applications developed by data engineers often run in production every day or even every hour, while data scientists often work interactively with data, and some insight is the final deliverable. So production readiness is much more critical for data engineers than for data scientists. And even though many people will disagree, "production readiness" is much harder to achieve with Python or any other dynamically typed language.
Shortcomings of a Framework
Now since Apache Spark is such a nice framework for complex data transformations, we can simply start implementing our pipelines. Within a few lines of code, we can instruct Spark to perform all the magic to process our multi-terabyte data sets into something more accessible.
Wait, not so fast! I did that multiple times in the past for different companies, and after some time I found that many aspects have to be implemented over and over again. While Spark excels at data processing itself, I argued in the first part of this series that robust data engineering is about more than just processing. Logging, monitoring, scheduling and schema management all come to mind, and all these aspects need to be addressed for every serious project.
Those non-functional aspects often require non-trivial code to be written, some of which can be very low level and technical. Therefore Spark alone is not enough to implement a production-quality data pipeline. Since those issues arise independently of the specific project and company, I propose splitting the application into two layers: a top layer containing the business logic encoded as data transformations together with the specifications of the data sources and data targets, and a lower layer that takes care of executing the whole data flow, providing relevant logging and monitoring metrics, and handling schema management.
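As a purely hypothetical sketch (this is not Flowman, which I will introduce in the next parts), such a separation could look as follows: the top layer only declares sources, the target and the transformation, while a generic runner owns the technical concerns. All names and the Parquet-only assumption are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top layer: only sources, target and business logic
trait BatchJob {
  def sources: Map[String, String]                        // logical name -> location
  def transform(inputs: Map[String, DataFrame]): DataFrame
  def target: String
}

// Lower layer: generic execution, logging and metrics for every job
object JobRunner {
  def run(job: BatchJob)(implicit spark: SparkSession): Unit = {
    val start = System.currentTimeMillis()

    // Read all declared sources (assuming Parquet here for simplicity)
    val inputs = job.sources.map { case (name, path) =>
      name -> spark.read.parquet(path)
    }

    // Execute the business logic and persist the result
    val result = job.transform(inputs)
    result.write.mode("overwrite").parquet(job.target)

    // Emit a simple metric; a real runner would push this to a monitoring system
    println(s"Job finished, ${result.count()} records written " +
      s"in ${System.currentTimeMillis() - start} ms")
  }
}
```

With such a split, each new project only needs to implement the `BatchJob` part, while logging, metrics and execution details are written once and reused everywhere.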
Final Words
This was the second part of a series about building robust data pipelines with Apache Spark. We had a strong focus on why Apache Spark is very well suited for replacing traditional ETL tools. Next time I will discuss why another layer of abstraction will help you to focus on business logic instead of technical details.