Traditionally, HDFS was the primary storage for Hadoop (and therefore also for Apache Spark). This implies that you permanently need to run Hadoop servers to host the data. While this works perfectly well for many projects running a Hadoop cluster that is either big enough to store all the data or only contains hot data (which is accessed very frequently), it is worth thinking about alternatives.
One downside of HDFS is simply the associated cost, especially if you are running in a cloud (like AWS, for example). Renting cloud servers becomes expensive pretty fast, and that hurts even more if you only need them to store lots of cold data. Moreover, while it is technically possible to dynamically scale an existing Hadoop cluster up and down in order to increase the computing power for infrequent workloads or ad hoc analysis, this is a questionable approach, since it changes the core infrastructure (Hadoop) containing all your valuable business data.
Separating Storage and CPU again
A more agile approach for infrequent workloads is to decouple computing from storage again (as was the case in data centers during the last twenty years or so). This allows you to use servers specifically designed for storage, which are less expensive per GB than typical Hadoop worker nodes, which combine storage, compute power and RAM. Even better, inside the cloud you can use storage provided by the cloud platform itself, which is often much cheaper than running your own servers.
Now that you have solved your storage problem, you also need compute power for performing data analytics tasks. This can be provided by spinning up small or large clusters on demand, performing your tasks and then destroying the clusters again. In cloud environments you only pay for the time a cluster is running, which can save you a lot of money. Moreover, you can dynamically create clusters of different sizes matching the specific requirements of each task. You can even run multiple clusters in parallel which don't interfere with each other, giving your data scientists a new level of flexibility.
Use cases for HDFS
Of course more flexibility and lower costs sound great, but there are still reasons why you might want to stick with HDFS, or at least why S3 isn't the perfect fit for every scenario either. The biggest downside of S3 is speed, or the lack of it. Compared to HDFS, especially when HDFS is deployed with many fast disks or even SSDs, S3 is a whole lot slower, since there is no data locality at all: everything has to be transferred from the S3 servers over the network. This can become an issue if your workloads are I/O bound. Moreover, some MPP query tools like Impala do not support S3 very well, and the whole idea of MPP query engines does not fit well with network-attached storage.
S3 + Alluxio
But thanks to projects like Alluxio, it is possible to use S3 as a cheap persistent deep storage even in use cases with low latency requirements. Alluxio creates an overlay file system which caches data from S3 in memory, where it is then accessed as a local resource by worker processes. Alluxio supports Hadoop and Spark as batch tools and Presto as a SQL query frontend. Of course this requires some beefy servers with lots of RAM again, but they don't need much storage directly attached to them. This setup provides a fast database for analytical workloads, which can be sized by considering only the amount of hot data that needs to be in memory.
In the field
More and more projects (especially those running in the cloud) are going down this route. Specifically, S3 has been adopted as a de facto standard for storage, as it is cheap and widely available (not only in the Amazon cloud, but also from many different vendors and on premises).
These companies spin up clusters on demand for their data scientists, who then directly access data in S3 without first copying it to a local directory or into HDFS. Hadoop and Spark come out of the box with first-class support for S3 as another file system in addition to HDFS. You can even create Hive tables for data stored in S3, which further simplifies accessing the data.
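For example, a Hive table backed by S3 is declared with a plain `CREATE EXTERNAL TABLE` statement whose `LOCATION` points at a bucket. The sketch below is illustrative only: the table name, columns and bucket are made up, and it assumes a SparkSession built with Hive support.

```python
# Hypothetical sketch: a Hive table whose data lives in S3.
# Table name, columns and bucket path are invented for illustration.
ddl = """CREATE EXTERNAL TABLE IF NOT EXISTS events (
    user_id    STRING,
    event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3a://my-bucket/my-data/events/'"""

# With a Hive-enabled SparkSession this could be executed as:
# spark.sql(ddl)
# spark.sql("SELECT COUNT(*) FROM events").show()
```

Since the table is `EXTERNAL`, dropping it removes only the metadata while the files stay in S3, which is usually what you want for shared data.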
You can access files in S3 with either Spark or Hadoop simply by using an S3 URI with the appropriate scheme. There are actually three different schemes for accessing S3:
- s3://my-bucket/my-data – An old scheme which used to split files into multiple chunks internally. Better not use this one, as it creates files which are not usable outside of Hadoop / Spark.
- s3n://my-bucket/my-data – The native S3 file system using an older library for accessing S3.
- s3a://my-bucket/my-data – The native S3 file system using the AWS SDK. You should always use this one!
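To make the anatomy of such a URI concrete, here is a small plain-Python sketch (the bucket and file names are made up) showing how an s3a URI decomposes into the scheme, the bucket and the object key — the scheme part is what tells Hadoop which file system implementation to use:

```python
from urllib.parse import urlparse

# Illustrative example; bucket and object names are invented.
uri = "s3a://my-bucket/my-data/part-00000.parquet"
parsed = urlparse(uri)

scheme = parsed.scheme         # "s3a" -> selects the S3A file system implementation
bucket = parsed.netloc         # "my-bucket"
key = parsed.path.lstrip("/")  # "my-data/part-00000.parquet"

print(scheme, bucket, key)
```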
In order to access the data, you also need to provide your credentials in the configuration. Depending on the tool, this is done using different configuration keys:
S3 Settings for Spark
The two most important settings are spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key, but you can also configure a proxy (if required) and some more settings.
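A minimal sketch of how these settings could be collected and handed to a SparkSession — all values below are placeholders, and the proxy entries are an assumption about an environment that sits behind one:

```python
# Sketch: S3A settings for Spark, as they could be passed via
# SparkSession.builder.config(...) or spark-submit --conf.
# All values are placeholders, not working credentials.
s3_settings = {
    "spark.hadoop.fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "spark.hadoop.fs.s3a.secret.key": "YOUR_SECRET_KEY",
    # Optional proxy settings, only needed behind a corporate proxy:
    "spark.hadoop.fs.s3a.proxy.host": "proxy.example.com",
    "spark.hadoop.fs.s3a.proxy.port": "8080",
}

# With pyspark installed, this could be applied like:
# builder = SparkSession.builder
# for key, value in s3_settings.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```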
S3 Settings for Hadoop
The two most important settings are fs.s3a.access.key and fs.s3a.secret.key, but you can also configure a proxy (if required) and some more settings.
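For Hadoop these keys typically go into `core-site.xml`. A minimal fragment might look like this (the values are placeholders, not real credentials):

```xml
<!-- core-site.xml fragment; replace the placeholder values with your credentials -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```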