Apache Spark for Scalable Data Transformations

Apache Spark and PySpark hold a special place in the history of dimajix, as these frameworks formed the basis of several large big data projects. The technology has proven itself to this day and is at the core of well-known products such as Azure Data Factory.

With Flowman, dimajix has also developed a powerful open source tool based on Apache Spark, in cooperation with several companies from the financial and online advertising industries, which greatly simplifies the creation of robust data pipelines through a declarative approach.

Apache Spark


Data Pipelines

Apache Spark and PySpark have proven to be an extremely flexible and highly scalable technology for implementing data pipelines and ETL jobs. With a wide range of connectors for object storage such as S3 and ADLS, as well as for NoSQL databases and classic relational SQL databases, Apache Spark remains a very good choice for complex data transformation and integration tasks. Thanks to its architecture, Apache Spark can distribute and process data volumes in parallel that far exceed the total amount of available main memory.
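As an illustration, a minimal PySpark sketch of such a pipeline could look like the following; the bucket, paths, table name and JDBC connection details are purely hypothetical placeholders.

```python
from pyspark.sql import SparkSession, functions as F

# Create a Spark session; in a real deployment the master and the required
# connector packages (e.g. hadoop-aws, a JDBC driver) are usually supplied
# via spark-submit or the cluster environment.
spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Read raw events from object storage (hypothetical bucket and path)
events = spark.read.parquet("s3a://example-bucket/raw/events/")

# A simple transformation: daily revenue per customer
daily_revenue = (
    events
    .withColumn("day", F.to_date("event_time"))
    .groupBy("customer_id", "day")
    .agg(F.sum("amount").alias("revenue"))
)

# Write the result to a relational database via JDBC
# (hypothetical connection settings)
daily_revenue.write.mode("overwrite").jdbc(
    url="jdbc:postgresql://db.example.com:5432/dwh",
    table="daily_revenue",
    properties={"user": "etl", "password": "secret"},
)
```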

Challenges

Although PySpark in particular offers a comparatively simple API for application development, optimizing a job quickly requires a deep understanding of how Apache Spark works internally in order to make the best possible use of the available resources (CPU and memory).
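The following sketch shows a few typical tuning levers; the values are placeholders, and which settings actually help depends entirely on the job and the cluster.

```python
from pyspark.sql import SparkSession

# Example resource and shuffle settings (placeholder values); the right
# numbers depend on the cluster size and the characteristics of the data.
spark = (
    SparkSession.builder
    .appName("tuning-example")
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "400")  # partitions after shuffles
    .getOrCreate()
)

# Hypothetical input path
df = spark.read.parquet("s3a://example-bucket/raw/events/")

# Repartition by a frequently used key to reduce shuffling in later joins,
# and cache a dataset that is reused several times.
df = df.repartition(400, "customer_id").cache()

# Inspect the physical plan to verify that the intended optimizations
# (e.g. partition pruning, broadcast joins) are actually applied.
df.groupBy("customer_id").count().explain()
```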

On-premises, IaaS, PaaS or SaaS

There are now many ways to operate applications based on Apache Spark and PySpark: as a local installation, on virtual infrastructure in the cloud, as a managed service or even as a ready-to-use SaaS offering in the cloud. We’ll help you find the option that fits your strategy and your business.
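In many cases, the same PySpark application can be pointed at these different environments largely through configuration. The sketch below illustrates this with placeholder addresses and image names; it is not a complete deployment recipe.

```python
from pyspark.sql import SparkSession

# Local development on a laptop: run Spark inside the driver process.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("my-job")
    .getOrCreate()
)

# On a Hadoop cluster, the same job is typically submitted to YARN:
#   spark-submit --master yarn --deploy-mode cluster my_job.py
#
# On Kubernetes, the master points at the API server and a container image
# is configured (placeholder host and image):
#   spark-submit --master k8s://https://k8s.example.com:6443 \
#       --conf spark.kubernetes.container.image=example/spark-py:3.5.0 \
#       my_job.py
```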

How dimajix helps your business

As long-standing experts in the field of big data with a focus on Hadoop and Spark, we have made it our mission to support companies in implementing these technologies successfully. Our knowledge and experience will help your project succeed.

Competencies

  • Hadoop ecosystem including HDFS, Hive, Spark, etc.
  • Deployment to YARN or Kubernetes
  • On-premises and in the cloud

Technology

  • All common Hadoop tools and components
  • Cloudera Manager
  • Hive Warehouses on HDFS, ADLS and S3
  • DevOps tools such as Docker, Kubernetes, Terraform, Ansible, etc.
  • Cloud (AWS, Azure, GCP)
  • Development in Java, Scala and Python

Experience

  • Research and development
  • Financial sector
  • Marketing & Online Advertising