Running Spark and Hadoop with S3


Traditionally, HDFS has been the primary storage layer for Hadoop (and therefore also for Apache Spark). Naturally this implies that you need to run Hadoop servers permanently to host the data. While this works perfectly well for many projects running a Hadoop cluster that is either big enough to store all the data or only contains hot data (which is accessed very frequently), it may be worth giving some thought to alternatives.

One downside of HDFS is simply the associated cost, especially if you are running in a cloud (like AWS, for example). Renting cloud servers becomes expensive pretty fast, and that will hurt you even more if you only need them to store lots of cold data. Moreover, while it is technically possible to dynamically scale an existing Hadoop cluster up and down in order to increase the computing power for infrequent workloads or ad hoc analysis, this is also a questionable approach, since it changes the core infrastructure (Hadoop) that contains all your valuable business data.
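Decoupling storage from compute along these lines usually means pointing Spark at S3 through Hadoop's S3A connector instead of HDFS. As a minimal sketch (the credentials, endpoint, and version number below are placeholders, not values from this article), the relevant `spark-defaults.conf` entries could look like this:

```
# Pull in the Hadoop AWS module that provides the s3a:// filesystem
spark.jars.packages              org.apache.hadoop:hadoop-aws:3.3.4

# Credentials and endpoint for the S3A connector (placeholder values)
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
spark.hadoop.fs.s3a.endpoint     s3.eu-central-1.amazonaws.com
```

With such a configuration in place, S3 data can be addressed like any other path, e.g. `spark.read.parquet("s3a://my-bucket/events/")` (bucket and path are hypothetical), so the cold data no longer requires permanently running HDFS nodes.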


