This is part 1 of a series on data engineering in a big data environment. It reflects my personal journey of lessons learnt, and it culminates in Flowman, an open source tool I created to remove the burden of reimplementing all the boilerplate code over and over again across projects.
Part 1: Big Data Engineering — Best Practices
Part 2: Big Data Engineering — Apache Spark
Part 3: Big Data Engineering — Declarative Data Flows
Part 4: Big Data Engineering — Flowman up and running
What to expect
This series is about building data pipelines with Apache Spark for batch processing. But some aspects are also valid for other frameworks or for stream processing. Eventually I will introduce Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing.
An ever growing number of companies and projects build their data processing pipelines using Apache Spark as the central data processing framework. And there are very good reasons to do that (more on those in another part of this series).
Apache Spark, being a framework, per se does not provide much guidance on following best practices when designing data pipelines. Nor does it take care of many details that are not directly part of the data transformation itself but are important from a broader point of view, like schema management and the ability to reprocess data in case of failures or logical errors.
Typical Data Pipeline
Before discussing various aspects to keep in mind while developing data processing pipelines (with whatever technology, although I prefer Apache Spark), let us first have a look at what a typical data pipeline actually does.
I would say that most data pipelines essentially contain three steps:
- Extraction. Read data from some source system (be it a shared filesystem like HDFS, an object store like S3, or a database like MySQL or MongoDB).
- Transformation. Apply transformations such as extracting fields, filtering, joining or even aggregation.
- Loading. Store the results back into some target system. Again, this can be a shared filesystem, an object store or some database.
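The three steps above can be sketched in a few lines. This is a minimal illustration in plain Python with in-memory stand-ins for the source and target systems; a real pipeline would use Spark's DataFrame API instead, and all data and column names here are made up:

```python
import csv
import io

# Extraction: read raw records from a source system
# (here: an in-memory CSV string standing in for HDFS/S3/a database)
raw = "customer_id,amount\n1,10.5\n2,99.0\n2,1.5\n"
records = list(csv.DictReader(io.StringIO(raw)))

# Transformation: aggregate the transaction amounts per customer
totals = {}
for r in records:
    totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + float(r["amount"])

# Loading: write the results to a target system (here: another CSV string)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["customer_id", "total_amount"])
for cid, total in sorted(totals.items()):
    writer.writerow([cid, total])

print(out.getvalue())
```

With Spark, each step maps onto a one-liner (`spark.read`, DataFrame transformations, `df.write`), which is exactly why the framework became so popular for this kind of work.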
Apache Spark supports all these steps well, and you can implement a simple data pipeline with a couple of lines of code. Once it is in production, some questions will arise over time, leading to non-functional requirements and best practices. These are not directly implemented by Apache Spark; you have to take care of them yourself. This series is about supporting you with building rock solid data pipelines that also take care of many of those non-functional requirements, which are nevertheless really important in production.
This first part of the series is about best practices, i.e. specific ways of doing things that enable stable operations. The techniques to implement many of them are not revolutionary, but they are not discussed very often (at least that is my impression).
1. Logging
The first requirement is to provide some form of logging. This is not very exciting, and many developers already do it. But often the question is how verbose the logging should be: should every detail be logged to the console, or only problems?
This question is hard to answer, but I prefer to log all important decisions my application makes and to log all important variables and state information that has some influence on the application. You might want to ask yourself: What kind of information would be helpful for most incidents where your application does not work as expected?
Typical examples of what I am logging:
- What are the custom settings for the current run? For example what date range is being processed? What customer is being processed?
- What data is read? Where is the data stored? How many files? This information helps to validate that the application tries to read from the correct place.
- Where does the application try to write its results to? Again, this information helps to validate that all settings for the target system are correct.
- You should also log some sort of success message at the end. This is invaluable for setting up automated alerts when the message is missing for some longer time.
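The items above can be sketched with the Python standard library's logging module. All settings, paths and messages below are hypothetical examples; in a real Spark application you would use the same idea via log4j or a Python logger:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

# Hypothetical run settings: the custom parameters of this specific run
settings = {"date_range": "2020-09-12", "customer": "acme"}
input_path = "hdfs:///data/transactions/2020-09-12"
output_path = "hdfs:///warehouse/transactions/2020-09-12"

# Log all important decisions and state that influence the run
log.info("Starting run with settings: %s", settings)
log.info("Reading input data from %s", input_path)
log.info("Writing results to %s", output_path)

# ... actual processing would happen here ...

# An explicit success marker, so an alert can fire when it stays absent
log.info("SUCCESS: run finished for %s", settings["date_range"])
```

The exact format matters less than consistency: a log aggregator can only alert on the missing success message if that message always looks the same.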
As mentioned in the last item, alerting is also something that you should think of. Most of the time it is desirable to store all logs in a central log aggregator like Graylog, where you can easily search for specific issues and set up alerts.
Related to logging is the topic of metrics. With this term I do not refer only to the internal technical metrics of Apache Spark, but also to metrics with more business relevance. For example, it might be interesting to provide metrics about the number of records read and written; both are not directly available in Apache Spark, at least not per data source and data sink.
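One way to obtain per-source and per-sink record counts is to wrap reading and writing with counters. The sketch below uses plain Python generators and hypothetical names; in Spark you would achieve something similar with accumulators or listener hooks:

```python
from collections import Counter

# Business-level metrics, keyed per data source and data sink
metrics = Counter()

def read_source(name, records):
    """Yield records while counting how many were read from this source."""
    for r in records:
        metrics[f"records_read.{name}"] += 1
        yield r

def write_sink(name, records):
    """Consume records while counting how many were written to this sink."""
    out = []
    for r in records:
        metrics[f"records_written.{name}"] += 1
        out.append(r)
    return out

# Hypothetical pipeline: read three records, keep two, write them out
data = read_source("transactions", [{"id": 1}, {"id": 2}, {"id": 3}])
filtered = (r for r in data if r["id"] > 1)
result = write_sink("report", filtered)

print(dict(metrics))  # records_read.transactions=3, records_written.report=2
```

Exporting these counters to the same system that collects your logs (or to a metrics store) lets you alert on anomalies such as a sudden drop in records written.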
2. Reruns
A very important feature that you should plan in from the very beginning are so called reruns. In many batch processing applications, input data is provided in time slices (daily or hourly) and you only want to process the new data.
But what happens if something goes wrong? For example, what should be done if the input data is incomplete or corrupted? Or if your application contains a logic error and produces wrong results? The very simple answer is reruns, i.e. the capability of your application to reprocess old data in case of any issue. This capability doesn't come automatically; you have to carefully think about your data management strategy and how to organize both your input and your output data to support reruns.
There are a couple of aspects required for reruns, which are discussed separately in the next topics.
3. Single Writer per Partition
In order to support reruns, you have to think about the data organization. In case of any error, ideally you can simply remove the output of a specific batch run and replace it with the results of a new batch run.
This is easily possible if you use a simple partitioning mechanism. With partitioning I refer to using subfolders (for file based outputs) or Hive partitions (when you are using Hive) to organize data which logically belongs to the same output (like a single Hive table). The basic idea is that every batch run writes into a separate partition.
For example, if your application processes new data coming in every hour, simply create partitions using the date and hour as the identifier. In a file based workflow, a partition is a directory, which might look as follows for some imaginary data warehouse containing customer and transaction data:
Inside the directory, all files from the specific job run for 2020-09-12 08:00 are stored. In case of any error in that batch run, you can simply remove the whole directory and restart the job.
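The single-writer-per-partition idea can be sketched as follows. This is a minimal local-filesystem example with a hypothetical Hive-style `date=.../hour=...` layout; a real pipeline would build the same kind of paths on HDFS or S3:

```python
import shutil
import tempfile
from pathlib import Path

def partition_dir(base, date, hour):
    """Build the output directory of one batch run, e.g. .../date=2020-09-12/hour=08."""
    return Path(base) / f"date={date}" / f"hour={hour:02d}"

# Hypothetical warehouse root (a temporary directory for this sketch)
base = Path(tempfile.mkdtemp()) / "warehouse" / "transactions"

# The batch run for 2020-09-12 08:00 writes only into its own partition
target = partition_dir(base, "2020-09-12", 8)
target.mkdir(parents=True, exist_ok=True)
(target / "part-0000.parquet").touch()

# Rerun: remove the whole partition and let the job write it again;
# all other partitions remain untouched
shutil.rmtree(target)
target.mkdir(parents=True)
(target / "part-0000.parquet").touch()
```

Because each run owns exactly one directory, a rerun is an idempotent delete-and-rewrite and never interferes with the output of other time slices.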
If you are using Hive (and I strongly recommend doing so if you are using Spark on top of some shared file system or object store), partitions are a core feature of Hive. Unfortunately Spark doesn't support writing into a specific partition very well, but that limitation can be worked around.
4. Schema Management
Schema management is a very important topic. The term refers to all project and development tasks concerning the input and output data formats.
When you read in the data of some source system, you expect a specific format of the data. This includes both the technical file format (like CSV, JSON or Parquet) and the set of columns and data types used to store the data. I highly recommend two things:
- Make your expectations about the input schema explicit. That is, do not simply let your application infer the types.
- Prepare for change. Too often I was told that "the input schema will never change", and that assumption always turned out to be wrong.
The first advice improves the robustness of your application: silent schema changes are detected earlier, because the application will report an error if the input data no longer matches the expected schema (some relaxations are allowed and even advisable, more on that below). Many companies even have a lightweight organizational process for negotiating schemas. Rightfully so, in my opinion, since the schema of a data export that is picked up by another application is a technical contract, and both parties (the delivering and the consuming side) should be aware of that by explicitly using the schema for writing and reading.
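Making the expected schema explicit can be as simple as declaring it up front and failing fast on violations. The sketch below uses a hypothetical dictionary-based schema in plain Python; with Spark you would instead pass an explicit `StructType` to the reader rather than relying on schema inference:

```python
# Hypothetical expected input schema: column name -> expected Python type
EXPECTED_SCHEMA = {"customer_id": int, "amount": float, "currency": str}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Fail fast if a record does not match the expected schema."""
    missing = set(schema) - set(record)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise TypeError(f"column {column!r}: expected {expected_type.__name__}")
    return record

# A conforming record passes through unchanged
validate_record({"customer_id": 42, "amount": 9.99, "currency": "EUR"})

# A record missing a column is rejected immediately, instead of silently
# producing nulls or wrongly inferred types downstream
try:
    validate_record({"customer_id": 42, "amount": 9.99})
except ValueError as e:
    print("schema violation:", e)
```

The point is that the contract lives in the code (or better, in a versioned schema file), so any deviation surfaces as a loud error at read time instead of as corrupted results at the end of the pipeline.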
The second advice is the more difficult one, especially combined with a rerun capability. Changing the data processing to reflect a new version of the input schema can break the ability to reprocess old data, since old data is probably stored using an older version of the schema. This means that you need to think about how your data pipeline can work with different schema versions. A simple solution could be to use an older version of your data pipeline to process older data, but too often this is not a good option, because the older version of the application is missing important features which should also be applied to the old data.
So my advice for reruns with different schema versions is to create some sort of super schema that is compatible with all versions of the input data, at least as an internal intermediate representation in your application before any business logic is applied. I also advise trying to negotiate with the source system to only allow compatible changes in new schema versions. Unfortunately the term compatible highly depends on the technology being used; for example, Spring has different restrictions than Spark for changing types in JSON.
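Building such a super schema amounts to taking the union of all columns and widening conflicting types. The sketch below uses a hypothetical, heavily simplified schema representation and widening table; Spark's `StructType` and its type hierarchy would be the real-world counterpart:

```python
# Hypothetical widening rules for conflicting column types
WIDENING = {
    frozenset(["int", "float"]): "float",
    frozenset(["int", "string"]): "string",
    frozenset(["float", "string"]): "string",
}

def super_schema(*versions):
    """Merge several schema versions into one schema compatible with all of them."""
    merged = {}
    for schema in versions:
        for column, dtype in schema.items():
            if column not in merged or merged[column] == dtype:
                merged[column] = dtype  # new column, or same type as before
            else:
                # conflicting types: widen to a common compatible type
                merged[column] = WIDENING[frozenset([merged[column], dtype])]
    return merged

v1 = {"customer_id": "int", "amount": "float"}
v2 = {"customer_id": "string", "amount": "float", "currency": "string"}  # id became a string

print(super_schema(v1, v2))
# {'customer_id': 'string', 'amount': 'float', 'currency': 'string'}
```

Reading every input version into this single internal representation keeps the business logic itself free of any version-specific branches.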
5. Data Ownership
Finally, as an even broader concept, some sort of "data ownership" should be established in every project. The term actually refers to two different aspects, for which responsibilities should be clear in case of any question or incident that needs attention:
- Physical Data Ownership. All data is eventually stored on some system, be it in the cloud, in the on-premise data center or even on the server below my desk (not recommended). There needs to be a team which is responsible for operating this system (yes, even the cloud needs some operations), including backups, updates etc. This role is typically owned by the IT operations department.
- Business Data Ownership. In addition to the physical ownership, each data source also requires a business owner. That person or team is responsible for the kind of data stored, for the data schema and for the interfaces with other systems. This role has to be owned by the team who defines the data that is stored. I always recommend a simple rule: the team who writes the data also owns the data. That also implies that no other team is allowed to write into another team's data, except for some interface area for data exchange.
This was part 1 of a series about building robust data pipelines with Apache Spark. You might feel a little bit cheated, since it didn't contain much actual code. Nevertheless I think it is important to discuss these concepts first. The next part will focus more on Apache Spark.