Running Jupyter with Spark in Docker

By | PySpark, Python, Spark | No Comments

Most attendees of dimajix Spark workshops seem to like the hands-on approach I offer, which uses Jupyter notebooks with Spark clusters running in the AWS cloud. But when the workshop finishes, the natural question for many attendees is “How can I continue?”. On the one hand, setting up a Spark cluster is not too difficult; on the other hand, it is probably out of scope for most people. Moreover, you still need to get Jupyter Notebook running with PySpark, which again is not too difficult, but is also too much for a starting point.

Docker to the Rescue

So I built a Docker image containing Spark 2.2.0 and Anaconda Python 3.5, which can be run locally on Linux, Windows and probably Mac (I haven’t tested it on Apple so far). You only need to have Docker installed on your machine; everything else is contained in the single image. The image can be downloaded with the Docker CLI as follows:
docker pull dimajix/jupyter-spark:latest
When the image is downloaded (which is required only once), you can run a Jupyter notebook via
docker run --rm -p 8888:8888 dimajix/jupyter-spark:latest
Then point your favorite browser to http://localhost:8888, which will show the Jupyter notebook start page. Since Spark runs in “local” mode, it does not require any cluster resources. It will still use as many CPUs as it can find in your Docker environment.
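To make the “local” mode a bit more concrete, here is a minimal sketch of how a local-mode session could be created from a notebook cell. The helper names `local_master` and `make_session` and the application name are made up for this illustration, and I am assuming that PySpark is importable in the notebook kernel; the image may well configure a session for you already.

```python
import os


def local_master(cores=None):
    """Return a Spark master URL for local mode.

    None means "use every core Spark can see", which Spark writes as local[*].
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return "local[{}]".format(cores)


def make_session(cores=None, app_name="local-demo"):
    """Create (or reuse) a SparkSession running in local mode.

    Hypothetical helper: the app name is an arbitrary placeholder.
    """
    # Imported lazily so local_master() also works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(cores))
        .appName(app_name)
        .getOrCreate()
    )


# CPUs visible to this Python process (inside Docker: the container's view).
print("visible CPUs:", os.cpu_count())
print("default master:", local_master())
```

Calling `make_session(2)` in a cell would ask for two local cores instead of all of them. Note that `getOrCreate()` reuses an already running session, so the master setting only takes effect if no session exists yet.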

Accessing S3

In order to access training data in S3, you also need to have some AWS credentials and specify them as environment variables as follows:
docker run --rm -p 8888:8888 -e AWS_ACCESS_KEY_ID=<your-access-key-id> -e AWS_SECRET_ACCESS_KEY=<your-secret-access-key> dimajix/jupyter-spark:latest
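If S3 access fails later on, a quick first check is whether the two variables actually arrived inside the container. The following is only a small sketch for a notebook cell; the helper name `missing_aws_credentials` is made up for this example, and the check only tests for presence, not for validity of the credentials.

```python
import os

# The two variable names passed via "docker run -e ..." above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print("Missing AWS settings:", ", ".join(missing))
else:
    print("AWS credentials are present in the environment.")
```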
Note that for technical reasons you need to use the URL scheme “s3a” instead of “s3” when accessing data in S3, e.g. “s3a://dimajix-training/data/alice/”.
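If you have a list of “s3” locations lying around, a tiny helper can rewrite them before they are handed to Spark. This is just a sketch using the Python standard library; the function name `to_s3a` is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def to_s3a(url):
    """Rewrite an s3:// URL to s3a://; leave every other URL untouched."""
    parts = urlsplit(url)
    if parts.scheme == "s3":
        parts = parts._replace(scheme="s3a")
    return urlunsplit(parts)


print(to_s3a("s3://dimajix-training/data/alice/"))
# prints: s3a://dimajix-training/data/alice/
```

In a notebook with a SparkSession available as `spark` (an assumption about your setup), the result could then be passed on, for example to `spark.read.text(to_s3a("s3://dimajix-training/data/alice/"))`.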

More on GitHub

The Docker image also supports a Spark standalone cluster and has some more options to tweak (for example, a proxy for accessing S3, for all those sitting behind a firewall and proxy). You can find all the details on GitHub.

Running Spark and Hadoop with S3

By | Big Data, Hadoop, Spark | No Comments

Traditionally, HDFS was the primary storage for Hadoop (and therefore also for Apache Spark). Naturally, this implies that you permanently need to run Hadoop servers for hosting the data. While this works perfectly well for many projects running a Hadoop cluster that is either big enough to store all the data or only contains hot data (which is accessed very frequently), it may be worth giving some thought to alternatives.

One downside of HDFS is simply the associated cost, especially if you are running inside a cloud (like AWS, for example). Renting cloud servers becomes expensive pretty fast, and that will hurt you even more if you only need them to store lots of cold data. Moreover, while it is technically possible to dynamically scale an existing Hadoop cluster up and down in order to increase the computing power for infrequent workloads or ad hoc analysis, this is also a questionable approach, since it changes the core infrastructure (Hadoop) containing all your valuable business data.

Running PySpark on Anaconda in PyCharm

By | PySpark, Spark | 6 Comments

Working with PySpark

Currently, Apache Spark with its bindings PySpark and SparkR is the processing tool of choice in the Hadoop environment. Initially only Scala and Java bindings were available for Spark, since it is implemented in Scala itself and runs on the JVM. But people who are deeply into data analytics often feel more comfortable with simpler languages like Python or dedicated statistical languages like R. Fortunately, Spark comes out of the box with high-quality interfaces to both Python and R, so you can use Spark to process really huge datasets inside the programming language of your choice (that is, if Python or R is the programming language of your choice, of course).

The integration of Python with Spark allows me to mix Spark code for processing huge amounts of data with other powerful Python frameworks like NumPy, Pandas and of course Matplotlib. I also know of software development teams that already know Python very well and try to avoid learning Scala, so that they can focus on data processing and not on learning a new language. These teams also build complex data processing chains with PySpark.

When you are working with Python, you have two different options for development. The first is to use Python’s REPL (read-eval-print loop) for interactive development. You could do that on the command line, but Jupyter notebooks offer a much better experience. The other option is a more traditional software development workflow, which uses an IDE to create a complete program that is then run. This is what I want to write about in this article.


