{"id":728,"date":"2017-10-02T14:54:54","date_gmt":"2017-10-02T12:54:54","guid":{"rendered":"http:\/\/dimajix.de\/?p=728"},"modified":"2023-06-05T13:08:56","modified_gmt":"2023-06-05T11:08:56","slug":"running-jupyter-with-spark-in-docker","status":"publish","type":"post","link":"https:\/\/dimajix.de\/en\/running-jupyter-with-spark-in-docker\/","title":{"rendered":"Running Jupyter with Spark in Docker"},"content":{"rendered":"<p>Most attendees of <a href=\"http:\/\/dimajix.de\/schulungen\/apache-spark\/\">dimajix Spark workshops<\/a> seem to like the hands-on approach I offer, using Jupyter notebooks with Spark clusters running in the AWS cloud. But then, when the workshop finishes, the natural question for many attendees is &#8220;how can I continue?&#8221;. On the one hand, setting up a Spark cluster is not too difficult, but on the other hand, this is probably out of scope for most people. Moreover, you still need to get Jupyter notebook running with PySpark, which is again not too difficult, but also out of scope for a starting point.<\/p>\n<h2>Docker to the Rescue<\/h2>\n<p>So I built a Docker image containing Spark 2.2.0 and Anaconda Python 3.5 that can be run locally on Linux, Windows and probably Mac (I haven&#8217;t tested on Apple so far). You only need to have Docker installed on your machine; everything else is contained in the single image. The image can be downloaded with the Docker CLI as follows:<br \/>\n<code>docker pull dimajix\/jupyter-spark:latest<\/code><br \/>\nWhen the image is downloaded (which is required only once), you can run a Jupyter notebook via<br \/>\n<code>docker run --rm -p 8888:8888 dimajix\/jupyter-spark:latest<\/code><br \/>\nThen point your favorite browser to http:\/\/localhost:8888, which will show the Jupyter notebook start page. Since Spark will run in &#8220;local&#8221; mode, it does not require any cluster resources. 
Still, it will use as many CPUs as it can find in your Docker environment.<\/p>\n<h3>Accessing S3<\/h3>\n<p>In order to access training data in S3, you also need AWS credentials, which you specify as environment variables as follows:<br \/>\n<code>docker run --rm -p 8888:8888 -e AWS_ACCESS_KEY_ID= -e AWS_SECRET_ACCESS_KEY= dimajix\/jupyter-spark:latest<\/code><br \/>\nNote that for accessing data in S3, for technical reasons, you need to use the URI scheme &#8220;s3a&#8221; instead of &#8220;s3&#8221;, for example &#8220;s3a:\/\/dimajix-training\/data\/alice\/&#8221;.<\/p>\n<h2>More on GitHub<\/h2>\n<p>The Docker image also supports a Spark standalone cluster and has some more options to tweak (for example, a proxy for accessing S3 for all those sitting behind a firewall and proxy). You can find all the details on GitHub at <a href=\"https:\/\/github.com\/dimajix\/docker-jupyter-spark\">https:\/\/github.com\/dimajix\/docker-jupyter-spark<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>most attendees of dimajix Spark workshops seem to like the hands-on approach I am offering to them using Jupyter notebooks with Spark clusters running in the AWS cloud. 
But then,&#8230;<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[62,21,17,20],"class_list":{"0":"post-728","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"tag-jupyter","7":"tag-pyspark","8":"tag-python","9":"tag-spark"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Running Jupyter with Spark in Docker - Dimajix<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Running Jupyter with Spark in Docker - Dimajix\" \/>\n<meta property=\"og:description\" content=\"most attendees of dimajix Spark workshops seem to like the hands-on approach I am offering to them using Jupyter notebooks with Spark clusters running in the AWS cloud. 
But then,...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en\" \/>\n<meta property=\"og:site_name\" content=\"Dimajix\" \/>\n<meta property=\"article:published_time\" content=\"2017-10-02T12:54:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-05T11:08:56+00:00\" \/>\n<meta name=\"author\" content=\"KupferschmidtAdmin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@KupferschmidtK\" \/>\n<meta name=\"twitter:site\" content=\"@KupferschmidtK\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"KupferschmidtAdmin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en\"},\"author\":{\"name\":\"KupferschmidtAdmin\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/person\\\/e39fb24c7d4ccbbbfff045e25e3eeb81\"},\"headline\":\"Running Jupyter with Spark in Docker\",\"datePublished\":\"2017-10-02T12:54:54+00:00\",\"dateModified\":\"2023-06-05T11:08:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en\"},\"wordCount\":335,\"publisher\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#organization\"},\"keywords\":[\"Jupyter\",\"PySpark\",\"Python\",\"Spark\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en\",\"url\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en\",\"name\":\"Running 
Jupyter with Spark in Docker - Dimajix\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#website\"},\"datePublished\":\"2017-10-02T12:54:54+00:00\",\"dateModified\":\"2023-06-05T11:08:56+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/running-jupyter-with-spark-in-docker\\\/?lang=en#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Startseite\",\"item\":\"https:\\\/\\\/dimajix.de\\\/en\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Running Jupyter with Spark in Docker\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#website\",\"url\":\"https:\\\/\\\/dimajix.de\\\/\",\"name\":\"Dimajix\",\"description\":\"Data. Analytics. 
Intelligence.\",\"publisher\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/dimajix.de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#organization\",\"name\":\"dimajix\",\"url\":\"https:\\\/\\\/dimajix.de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/dimajix.de\\\/wp-content\\\/uploads\\\/2020\\\/06\\\/fav.png\",\"contentUrl\":\"https:\\\/\\\/dimajix.de\\\/wp-content\\\/uploads\\\/2020\\\/06\\\/fav.png\",\"width\":347,\"height\":346,\"caption\":\"dimajix\"},\"image\":{\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/KupferschmidtK\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/kaya-kupferschmidt\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/dimajix.de\\\/#\\\/schema\\\/person\\\/e39fb24c7d4ccbbbfff045e25e3eeb81\",\"name\":\"KupferschmidtAdmin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g\",\"caption\":\"KupferschmidtAdmin\"},\"sameAs\":[\"https:\\\/\\\/www.dimajix.de\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Running Jupyter with Spark in Docker - Dimajix","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en","og_locale":"en_US","og_type":"article","og_title":"Running Jupyter with Spark in Docker - Dimajix","og_description":"most attendees of dimajix Spark workshops seem to like the hands-on approach I am offering to them using Jupyter notebooks with Spark clusters running in the AWS cloud. But then,...","og_url":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en","og_site_name":"Dimajix","article_published_time":"2017-10-02T12:54:54+00:00","article_modified_time":"2023-06-05T11:08:56+00:00","author":"KupferschmidtAdmin","twitter_card":"summary_large_image","twitter_creator":"@KupferschmidtK","twitter_site":"@KupferschmidtK","twitter_misc":{"Written by":"KupferschmidtAdmin","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en#article","isPartOf":{"@id":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en"},"author":{"name":"KupferschmidtAdmin","@id":"https:\/\/dimajix.de\/#\/schema\/person\/e39fb24c7d4ccbbbfff045e25e3eeb81"},"headline":"Running Jupyter with Spark in Docker","datePublished":"2017-10-02T12:54:54+00:00","dateModified":"2023-06-05T11:08:56+00:00","mainEntityOfPage":{"@id":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en"},"wordCount":335,"publisher":{"@id":"https:\/\/dimajix.de\/#organization"},"keywords":["Jupyter","PySpark","Python","Spark"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en","url":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en","name":"Running Jupyter with Spark in Docker - Dimajix","isPartOf":{"@id":"https:\/\/dimajix.de\/#website"},"datePublished":"2017-10-02T12:54:54+00:00","dateModified":"2023-06-05T11:08:56+00:00","breadcrumb":{"@id":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/dimajix.de\/running-jupyter-with-spark-in-docker\/?lang=en#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Startseite","item":"https:\/\/dimajix.de\/en\/home\/"},{"@type":"ListItem","position":2,"name":"Running Jupyter with Spark in Docker"}]},{"@type":"WebSite","@id":"https:\/\/dimajix.de\/#website","url":"https:\/\/dimajix.de\/","name":"Dimajix","description":"Data. Analytics. 
Intelligence.","publisher":{"@id":"https:\/\/dimajix.de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/dimajix.de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/dimajix.de\/#organization","name":"dimajix","url":"https:\/\/dimajix.de\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/dimajix.de\/#\/schema\/logo\/image\/","url":"https:\/\/dimajix.de\/wp-content\/uploads\/2020\/06\/fav.png","contentUrl":"https:\/\/dimajix.de\/wp-content\/uploads\/2020\/06\/fav.png","width":347,"height":346,"caption":"dimajix"},"image":{"@id":"https:\/\/dimajix.de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/KupferschmidtK","https:\/\/www.linkedin.com\/in\/kaya-kupferschmidt\/"]},{"@type":"Person","@id":"https:\/\/dimajix.de\/#\/schema\/person\/e39fb24c7d4ccbbbfff045e25e3eeb81","name":"KupferschmidtAdmin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/24ad52e4805b09ef0620defe4c353db34b100cf6c727121646d5666c9fd58cbc?s=96&r=g","caption":"KupferschmidtAdmin"},"sameAs":["https:\/\/www.dimajix.de"]}]}},"_links":{"self":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts\/728","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/comments?post=72
8"}],"version-history":[{"count":10,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts\/728\/revisions"}],"predecessor-version":[{"id":739,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/posts\/728\/revisions\/739"}],"wp:attachment":[{"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/media?parent=728"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/categories?post=728"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dimajix.de\/en\/wp-json\/wp\/v2\/tags?post=728"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}