Jupyter Notebooks with PySpark in AWS


Amazon Elastic MapReduce (EMR) is something wonderful if you need compute capacity on demand. I love it for deploying the technical environments for my trainings, so every attendee gets their own small Spark cluster in AWS. It comes with Hadoop, Spark, Hive, HBase, Presto and Pig as workhorses and Hue and Zeppelin as convenient frontends, which support workshops and interactive trainings extremely well. But unfortunately Zeppelin still lags behind Jupyter notebooks, especially if you are using Python with PySpark instead of Scala. So if you are into PySpark and EMR, you really want to use Jupyter with PySpark running on top of EMR.

Technically this requires downloading and installing an appropriate Python distribution (like Anaconda, for example) and configuring a Jupyter kernel which uses PySpark instead of plain Python. Moreover, the Python distribution is required on all participating nodes, so Spark can start Python processes with the same packages on any node in the cluster. Things start to get complicated, especially if you want to launch multiple EMR clusters – for example to provide a separate cluster to every attendee of a training.
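
To give an idea of what such a kernel looks like, here is a minimal sketch of a script that registers a PySpark kernel with Jupyter. The paths (Anaconda under /opt/anaconda, Spark under /usr/lib/spark as on EMR) and the py4j zip name are assumptions and may need to be adapted to your installation.

import json, os

# Assumed locations – adjust to your cluster (these match a typical EMR + Anaconda setup)
anaconda_python = "/opt/anaconda/bin/python"
spark_home = "/usr/lib/spark"

kernel_dir = os.path.expanduser("~/.local/share/jupyter/kernels/pyspark")
os.makedirs(kernel_dir, exist_ok=True)

kernel_spec = {
    "display_name": "PySpark",
    "language": "python",
    "argv": [anaconda_python, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": spark_home,
        "PYSPARK_PYTHON": anaconda_python,
        # The py4j zip name differs between Spark versions
        "PYTHONPATH": spark_home + "/python:" + spark_home + "/python/lib/py4j-0.10.4-src.zip",
        "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell",
    },
}

with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(kernel_spec, f, indent=2)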

Obviously this situation calls for some sort of automation. And fortunately a good solution is provided by Terraform from HashiCorp – the perfect tool for deploying multiple clusters for trainings. And by adding a bootstrap action, it is also possible to automatically deploy Anaconda and the Jupyter notebook server on the master node of the EMR cluster.
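
Terraform itself is configured in HCL, but to illustrate where a bootstrap action hooks into the cluster definition, here is a rough Python sketch using boto3 instead of Terraform; the cluster name, release label, instance types and the S3 path of the bootstrap script are made up for this example.

import boto3

emr = boto3.client("emr", region_name="eu-central-1")  # region is only an example

# Minimal cluster definition – all values are illustrative
response = emr.run_job_flow(
    Name="pyspark-training",
    ReleaseLabel="emr-5.12.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            # Runs on every node before the applications start – a good place
            # to install Anaconda and, on the master, the Jupyter notebook server
            "Name": "install-anaconda-jupyter",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install-jupyter.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print(response["JobFlowId"])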


Running Spark and Hadoop with S3


Traditionally HDFS has been the primary storage for Hadoop (and therefore also for Apache Spark). Naturally this implies that you permanently need to run Hadoop servers for hosting the data. While this works perfectly well for many projects running a Hadoop cluster which is either big enough to store all the data or only contains hot data (which is accessed very frequently), it may be worth thinking about alternatives.

One downside of HDFS is simply the associated cost, especially if you are running inside a cloud (like AWS, for example). Renting cloud servers becomes expensive pretty fast, and that will hurt you even more if you only need them to store lots of cold data. Moreover, while it is technically possible to dynamically scale an existing Hadoop cluster up and down in order to increase the computing power for infrequent workloads or ad-hoc analyses, this is also a questionable approach, since it changes the core infrastructure (Hadoop) containing all your valuable business data.
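
As a small illustration of what the alternative looks like in practice, here is a minimal PySpark sketch that reads data directly from S3 instead of HDFS. The bucket name, path and column are placeholders, and the s3a:// scheme assumes the hadoop-aws connector is available (on EMR you would typically use s3:// with EMRFS instead).

from pyspark.sql import SparkSession

# Assumes AWS credentials are available via the usual provider chain
# (instance profile, environment variables or ~/.aws/credentials)
spark = SparkSession.builder \
    .appName("s3-instead-of-hdfs") \
    .getOrCreate()

# Placeholder bucket and prefix – replace with your own data
df = spark.read.json("s3a://my-bucket/events/2017/*.json")

# "event_type" is just an example column
df.groupBy("event_type").count().show()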

Running PySpark on Anaconda in PyCharm


Working with PySpark

Currently Apache Spark with its bindings PySpark and SparkR is the processing tool of choice in the Hadoop environment. Initially only Scala and Java bindings were available for Spark, since it is implemented in Scala itself and runs on the JVM. But on the other hand, people who are deeply into data analytics often feel more comfortable with simpler languages like Python or dedicated statistical languages like R. Fortunately Spark comes out of the box with high-quality interfaces to both Python and R, so you can use Spark to process really huge datasets inside the programming language of your choice (that is, if Python or R is the programming language of your choice, of course).

The integration of Python with Spark allows me to mix Spark code for processing huge amounts of data with other powerful Python frameworks like NumPy, Pandas and of course Matplotlib. I also know of software development teams which already know Python very well and try to avoid learning Scala in order to focus on data processing and not on learning a new language. These teams also build complex data processing chains with PySpark.
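
As a small sketch of what this mix can look like (the input path and columns are made up for illustration), a common pattern is to aggregate the large data in Spark and only bring the small result back into Pandas for plotting:

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-pandas-demo").getOrCreate()

# Hypothetical input – a large dataset of orders stored as Parquet
orders = spark.read.parquet("/data/orders")

# Heavy lifting happens distributed in Spark ...
daily = orders.groupBy("order_date").sum("amount")

# ... and only the small aggregate is converted to a Pandas DataFrame
pdf = daily.toPandas().sort_values("order_date")

pdf.plot(x="order_date", y="sum(amount)", kind="line")
plt.show()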

When you are working with Python, you have two different options for development: one is to use Python's REPL (Read-Eval-Print Loop) for interactive development. You could do that on the command line, but Jupyter notebooks offer a much better experience. The other option is a more traditional (for software development) workflow, which uses an IDE to create a complete program, which is then run. This is actually what I want to write about in this article.
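
To make the IDE workflow concrete, here is a minimal sketch of a self-contained PySpark program as you might run it from PyCharm; the local[*] master and the tiny word-count job are just an example, not taken from the full article.

from pyspark.sql import SparkSession

def main():
    # Running with a local master so the script works directly inside the IDE
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("pycharm-pyspark-example") \
        .getOrCreate()

    # Tiny example job: count words in an in-memory dataset
    lines = spark.sparkContext.parallelize(["spark with python", "python in pycharm"])
    counts = lines.flatMap(lambda l: l.split()) \
                  .map(lambda w: (w, 1)) \
                  .reduceByKey(lambda a, b: a + b)

    print(counts.collect())
    spark.stop()

if __name__ == "__main__":
    main()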

Building Druid for Cloudera 5.4.x


So the other day I wanted to investigate using Druid as a reporting backend database. But unfortunately Druid doesn’t work out of the box with Cloudera 5.4. I always get an error when running the Hadoop indexer, either via the CLI or via the indexing service. The exceptions in Hadoop always look like this:

2015-11-30 11:42:37,653 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.VerifyError: class com.fasterxml.jackson.datatype.guava.deser.HostAndPortDeserializer overrides final method deserialize.(Lcom/fasterxml/jackson/core/JsonParser;Lcom/fasterxml/jackson/databind/DeserializationContext;)Ljava/lang/Object;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    ...

So the problem seems to be a classic version mismatch between Cloudera Hadoop and Druid. Specifically, both projects use incompatible versions of the Jackson libraries (Cloudera still uses 2.2.3 while Druid uses 2.4.6). After some trials with different Jackson versions I got it to work by modifying the dependencies of Druid itself and building it myself. Since I suspect that others may run into similar problems, here is what I did to get Druid up and running:

git clone https://github.com/druid-io/druid.git
cd druid
git checkout 0.8.2
# Downgrade Jackson to a version that works with the Cloudera classpath
sed -i "s#jackson.version>2.4.6<#jackson.version>2.3.5<#" pom.xml
mvn package -DskipTests

After that you will find a packaged version of Druid at

distribution/target/druid-0.8.3-SNAPSHOT-bin.tar.gz

which should work with Cloudera 5.4.
