Loading Parquet files from HDFS is slower than loading them from S3. What could be the reasons?
I have hundreds of Parquet files in HDFS, and the same files in AWS S3. On an EMR cluster, I run a machine learning model that can read its training data from either HDFS or S3. Loading the data from HDFS takes longer than loading it from S3. Shouldn't it be the opposite? What could be the reason(s)? The hardware (the machines in the EMR cluster) is the same in both cases.

Tags: hadoop, amazon-s3

asked Nov 14 '18 at 16:52 by Benjamin Dupuis
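To make the HDFS-versus-S3 comparison reproducible, the two load paths could be timed side by side. The following is only a minimal sketch: the helper names `time_loader` and `compare_loaders` are hypothetical, and each loader is assumed to be a zero-argument callable that fully forces the read (lazy engines otherwise return before any data is fetched).

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_loader(load: Callable[[], object], repeats: int = 3) -> List[float]:
    """Return the wall-clock seconds taken by each of `repeats` calls to `load`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load()  # assumption: this call forces the full read, not a lazy plan
        durations.append(time.perf_counter() - start)
    return durations


def compare_loaders(
    loaders: Dict[str, Callable[[], object]], repeats: int = 3
) -> Dict[str, float]:
    """Map each loader name to its median load time in seconds."""
    return {name: median(time_loader(fn, repeats)) for name, fn in loaders.items()}
```

With PySpark, for instance, the loaders might be `lambda: spark.read.parquet("hdfs:///path/to/data").count()` and `lambda: spark.read.parquet("s3://some-bucket/path/to/data").count()`; both paths are placeholders, not the actual locations. Using the median over several runs reduces the influence of a cold first read or of caching.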