Amazon Elastic MapReduce (EMR) is a wonderful service if you need compute capacity on demand. I love it for deploying the technical environments for my trainings, so every attendee gets their own small Spark cluster in AWS. It comes with Hadoop, Spark, Hive, HBase, Presto and Pig as workhorses and Hue and Zeppelin as convenient frontends, which support workshops and interactive trainings extremely well. Unfortunately, Zeppelin is still lagging behind Jupyter notebooks, especially if you are using Python with PySpark instead of Scala. So if you are into PySpark and EMR, you really want to use Jupyter with PySpark running on top of EMR.
Technically, this requires downloading and installing an appropriate Python distribution (Anaconda, for example) and configuring a Jupyter kernel which uses PySpark instead of plain Python. Moreover, the Python distribution is required on all participating nodes, so Spark can start Python processes with the same packages on any node in the cluster. Things start to get complicated, especially if you want to launch multiple EMR clusters, for example to provide a separate cluster to every attendee of a training.
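To make the kernel part a bit more concrete, here is a minimal sketch of what such a PySpark kernel definition could look like, generated from Python. The function name `build_pyspark_kernel_spec` is made up for illustration, and all paths (Spark home, Anaconda location, the py4j archive name) as well as the `--master yarn` setting are assumptions that have to be adapted to the actual cluster.

```python
import json


def build_pyspark_kernel_spec(spark_home, python_bin, py4j_zip):
    """Return the content of a Jupyter kernel.json for a PySpark kernel.

    All arguments are paths on the node running Jupyter; the values used
    below are illustrative assumptions, not fixed EMR locations.
    """
    return {
        "display_name": "PySpark",
        "language": "python",
        # Start a regular IPython kernel; Jupyter substitutes the connection file.
        "argv": [python_bin, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            # Interpreter Spark starts for its Python workers. It has to
            # exist under the same path on every node of the cluster.
            "PYSPARK_PYTHON": python_bin,
            # Makes "import pyspark" work inside the notebook kernel.
            "PYTHONPATH": f"{spark_home}/python:{py4j_zip}",
            # Assumption: the cluster runs Spark on YARN.
            "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell",
        },
    }


spec = build_pyspark_kernel_spec(
    spark_home="/usr/lib/spark",                 # assumed Spark location
    python_bin="/opt/anaconda3/bin/python",      # hypothetical Anaconda install
    # Placeholder name: the real archive carries a version number.
    py4j_zip="/usr/lib/spark/python/lib/py4j-src.zip",
)
print(json.dumps(spec, indent=2))
```

The printed JSON would be stored as `kernel.json` in a directory where Jupyter looks for kernels. Setting `PYSPARK_PYTHON` to the same interpreter that runs the kernel is the reason why the Python distribution has to be installed on every node.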
Obviously this situation requires some sort of automation. Fortunately, a good solution is provided by Terraform from HashiCorp, the perfect tool for deploying multiple clusters for trainings. And by adding a bootstrap action, it is also possible to automatically deploy Anaconda and the Jupyter notebook server on the master node of the EMR cluster.
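A bootstrap action runs on every node of the cluster, so the script has to find out whether it is running on the master node. The following is only a rough sketch of that decision logic in Python, not the actual bootstrap script: it relies on the instance metadata file `/mnt/var/lib/info/instance.json` that EMR places on each node, while the function names and the step names are hypothetical placeholders for the real installation commands.

```python
import json

# Instance metadata file written by EMR on each node.
INSTANCE_INFO = "/mnt/var/lib/info/instance.json"


def is_master_node(info_path=INSTANCE_INFO):
    """Return True if the instance metadata marks this node as the master."""
    with open(info_path) as f:
        info = json.load(f)
    return bool(info.get("isMaster", False))


def plan_bootstrap(is_master):
    """Return the (hypothetical) bootstrap steps for this node.

    Anaconda goes onto every node, so Spark finds the same Python
    packages everywhere; Jupyter is only needed on the master node.
    """
    steps = ["install_anaconda"]
    if is_master:
        steps += ["install_pyspark_kernel", "start_jupyter"]
    return steps
```

On a cluster node, `plan_bootstrap(is_master_node())` would then yield the list of steps to execute, with each step name standing in for the corresponding download and installation commands.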