{"id":627,"date":"2017-05-22T16:10:53","date_gmt":"2017-05-22T14:10:53","guid":{"rendered":"http:\/\/dimajix.de\/?p=627"},"modified":"2023-06-05T13:08:35","modified_gmt":"2023-06-05T11:08:35","slug":"jupyter-notebooks-with-pyspark-in-aws","status":"publish","type":"post","link":"https:\/\/dimajix.de\/en\/jupyter-notebooks-with-pyspark-in-aws\/","title":{"rendered":"Jupyter Notebooks with PySpark in AWS"},"content":{"rendered":"<p>Amazon Elastic MapReduce (EMR) is something wonderful if you need compute capacity on demand. I love it for deploying the technical environments for my <a href=\"http:\/\/dimajix.de\/schulungen\">trainings<\/a>, so every attendee gets their own small Spark cluster in AWS. It comes with Hadoop, Spark, Hive, HBase, Presto and Pig as workhorses and Hue and Zeppelin as convenient frontends, which support workshops and interactive trainings extremely well. Unfortunately, Zeppelin still lags behind Jupyter notebooks, especially if you are using Python with PySpark instead of Scala. So if you are into PySpark and EMR, you really want to use Jupyter with PySpark running on top of EMR.<\/p>\n<p>Technically this requires downloading and installing an appropriate Python distribution (like <a href=\"https:\/\/www.continuum.io\/downloads\">Anaconda<\/a>, for example) and configuring an appropriate Jupyter kernel which uses PySpark instead of plain Python. Moreover, the Python distribution is required on all participating nodes, so Spark can start Python processes with the same packages on any node in the cluster. Things start to get complicated, especially if you want to launch multiple EMR clusters &#8211; for example for providing a separate cluster to every attendee of a training.<\/p>\n<p>Obviously this situation requires some sort of automation. Fortunately, a good solution is provided by <a href=\"https:\/\/www.terraform.io\/\">Terraform<\/a> from HashiCorp &#8211; the perfect tool for deploying multiple clusters for trainings. 
By adding a bootstrap action, it is also possible to automatically deploy <a href=\"https:\/\/www.continuum.io\/downloads\">Anaconda<\/a> and the Jupyter notebook server on the master node of the EMR cluster.<\/p>\n<p><!--more--><\/p>\n<h2>Deploying Spark + Jupyter in AWS with Terraform<\/h2>\n<p>In order to perform an automatic deployment you can simply use the <a href=\"https:\/\/github.com\/dimajix\/terraform-emr-training\">dimajix Terraform scripts available on GitHub<\/a>. Proceed as follows in order to create a secure deployment:<\/p>\n<ol>\n<li>Clone the GitHub repository into some directory, for example via\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">git clone https:\/\/github.com\/dimajix\/terraform-emr-training.git<\/pre>\n<\/li>\n<li>Create a new public\/private SSH key pair, for example via\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">ssh-keygen -b 2048 -t rsa -C &quot;EMR Access Key&quot; -f deployer-key<\/pre>\n<\/li>\n<li>Copy <b><tt>aws-config.tf.template<\/tt><\/b> to <b><tt>aws-config.tf<\/tt><\/b>, insert your AWS credentials and adjust the availability zone.<\/li>\n<li>Edit <b><tt>main.tf<\/tt><\/b> to suit your requirements (additional EMR components, number and size of clusters etc.)<\/li>\n<li>Start the cluster via\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">terraform get\r\nterraform apply<\/pre>\n<\/li>\n<li>Create a dynamic SSH tunnel to the cluster via\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">ssh -i deployer-key -ND 8157 hadoop@public-master-ip-address<\/pre>\n<\/li>\n<li>Install the <a href=\"https:\/\/getfoxyproxy.org\/\">FoxyProxy Standard plugin<\/a> for Firefox or Chrome and use the provided config file <b><tt>foxy-proxy.xml<\/tt><\/b><\/li>\n<li>Access the Jupyter Notebook via <b><tt>http:\/\/public-master-ip-address:8888<\/tt><\/b><\/li>\n<li>Perform your magic in Jupyter<\/li>\n<li>Destroy the cluster via\n<pre 
class=\"brush: plain; title: ; notranslate\" title=\"\">terraform destroy<\/pre>\n<\/li>\n<\/ol>\n<h2>What&#8217;s inside<\/h2>\n<p>The Terraform script will create a new VPC and subnets and start new clusters with Spark, Hive, Pig, Presto, Hue, Zeppelin and Jupyter. You can modify <b><tt>main.tf<\/tt><\/b>, where the number of clusters, their common configuration (EC2 instance types) and EMR components are configured.<\/p>\n<p>If you are using FoxyProxy, all services are available at<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nYARN - http:\/\/public-master-ip-address:8088\r\nHDFS - http:\/\/public-master-ip-address:50070\r\nHue - http:\/\/public-master-ip-address:8888\r\nZeppelin - http:\/\/public-master-ip-address:8890\r\nSpark History - http:\/\/public-master-ip-address:18080\r\nJupyter Notebook - http:\/\/public-master-ip-address:8888\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Amazon Elastic MapReduce (EMR) is something wonderful if you need compute capacity on demand. 
I love it for deploying the technical environments for my trainings, so every attendee gets their&#8230;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[63,58,61,62,21],"class_list":{"0":"post-627","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"tag-anaconda","7":"tag-aws","8":"tag-emr","9":"tag-jupyter","10":"tag-pyspark"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Jupyter Notebooks with PySpark in AWS - Dimajix<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Jupyter Notebooks with PySpark in AWS - Dimajix\" \/>\n<meta property=\"og:description\" content=\"Amazon Elastic MapReduce (EMR) is something wonderful if you need compute capacity on demand. 
I love it for deploying the technical environments for my trainings, so every attendee gets their...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en\" \/>\n<meta property=\"og:site_name\" content=\"Dimajix\" \/>\n<meta property=\"article:published_time\" content=\"2017-05-22T14:10:53+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-05T11:08:35+00:00\" \/>\n<meta name=\"author\" content=\"KupferschmidtAdmin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@KupferschmidtK\" \/>\n<meta name=\"twitter:site\" content=\"@KupferschmidtK\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"KupferschmidtAdmin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en\"},\"author\":{\"name\":\"KupferschmidtAdmin\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/person\\\/e39fb24c7d4ccbbbfff045e25e3eeb81\"},\"headline\":\"Jupyter Notebooks with PySpark in 
AWS\",\"datePublished\":\"2017-05-22T14:10:53+00:00\",\"dateModified\":\"2023-06-05T11:08:35+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en\"},\"wordCount\":513,\"publisher\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#organization\"},\"keywords\":[\"Anaconda\",\"aws\",\"EMR\",\"Jupyter\",\"PySpark\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en\",\"url\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en\",\"name\":\"Jupyter Notebooks with PySpark in AWS - Dimajix\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#website\"},\"datePublished\":\"2017-05-22T14:10:53+00:00\",\"dateModified\":\"2023-06-05T11:08:35+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/jupyter-notebooks-with-pyspark-in-aws\\\/?lang=en#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Startseite\",\"item\":\"https:\\\/\\\/dimajix.de\\\/en\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Jupyter Notebooks with PySpark in AWS\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#website\",\"url\":\"https:\\\/\\\/dimajix.de\\\/\",\"name\":\"Dimajix\",\"description\":\"Data. Analytics. 
Intelligence.\",\"publisher\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/dimajix.de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#organization\",\"name\":\"dimajix\",\"url\":\"https:\\\/\\\/dimajix.de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/dimajix.de\\\/wp-content\\\/uploads\\\/2020\\\/06\\\/fav.png\",\"contentUrl\":\"https:\\\/\\\/dimajix.de\\\/wp-content\\\/uploads\\\/2020\\\/06\\\/fav.png\",\"width\":347,\"height\":346,\"caption\":\"dimajix\"},\"image\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/KupferschmidtK\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/kaya-kupferschmidt\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/person\\\/e39fb24c7d4ccbbbfff045e25e3eeb81\",\"name\":\"KupferschmidtAdmin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g\",\"caption\":\"KupferschmidtAdmin\"},\"sameAs\":[\"https:\\\/\\\/www.dimajix.de\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Jupyter Notebooks with PySpark in AWS - Dimajix","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en","og_locale":"en_US","og_type":"article","og_title":"Jupyter Notebooks with PySpark in AWS - Dimajix","og_description":"Amazon Elastic MapReduce (EMR) is something wonderful if you need compute capacity on demand. I love it for deploying the technical environments for my trainings, so every attendee gets their...","og_url":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en","og_site_name":"Dimajix","article_published_time":"2017-05-22T14:10:53+00:00","article_modified_time":"2023-06-05T11:08:35+00:00","author":"KupferschmidtAdmin","twitter_card":"summary_large_image","twitter_creator":"@KupferschmidtK","twitter_site":"@KupferschmidtK","twitter_misc":{"Written by":"KupferschmidtAdmin","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en#article","isPartOf":{"@id":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en"},"author":{"name":"KupferschmidtAdmin","@id":"https:\/\/dimajix.de\/#\/schema\/person\/e39fb24c7d4ccbbbfff045e25e3eeb81"},"headline":"Jupyter Notebooks with PySpark in AWS","datePublished":"2017-05-22T14:10:53+00:00","dateModified":"2023-06-05T11:08:35+00:00","mainEntityOfPage":{"@id":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en"},"wordCount":513,"publisher":{"@id":"https:\/\/dimajix.de\/#organization"},"keywords":["Anaconda","aws","EMR","Jupyter","PySpark"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en","url":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en","name":"Jupyter Notebooks with PySpark in AWS - Dimajix","isPartOf":{"@id":"https:\/\/dimajix.de\/#website"},"datePublished":"2017-05-22T14:10:53+00:00","dateModified":"2023-06-05T11:08:35+00:00","breadcrumb":{"@id":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/dimajix.de\/jupyter-notebooks-with-pyspark-in-aws\/?lang=en#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Startseite","item":"https:\/\/dimajix.de\/en\/home\/"},{"@type":"ListItem","position":2,"name":"Jupyter Notebooks with PySpark in AWS"}]},{"@type":"WebSite","@id":"https:\/\/dimajix.de\/#website","url":"https:\/\/dimajix.de\/","name":"Dimajix","description":"Data. Analytics. 
Intelligence.","publisher":{"@id":"https:\/\/dimajix.de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/dimajix.de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/dimajix.de\/#organization","name":"dimajix","url":"https:\/\/dimajix.de\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/dimajix.de\/#\/schema\/logo\/image\/","url":"https:\/\/dimajix.de\/wp-content\/uploads\/2020\/06\/fav.png","contentUrl":"https:\/\/dimajix.de\/wp-content\/uploads\/2020\/06\/fav.png","width":347,"height":346,"caption":"dimajix"},"image":{"@id":"https:\/\/dimajix.de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/KupferschmidtK","https:\/\/www.linkedin.com\/in\/kaya-kupferschmidt\/"]},{"@type":"Person","@id":"https:\/\/dimajix.de\/#\/schema\/person\/e39fb24c7d4ccbbbfff045e25e3eeb81","name":"KupferschmidtAdmin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g","caption":"KupferschmidtAdmin"},"sameAs":["https:\/\/www.dimajix.de"]}]}},"_links":{"self":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts\/627","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/comments?post=62
7"}],"version-history":[{"count":21,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts\/627\/revisions"}],"predecessor-version":[{"id":646,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts\/627\/revisions\/646"}],"wp:attachment":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/media?parent=627"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/categories?post=627"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/tags?post=627"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}