I'm trying to upload a 32GB file to a S3 bucket using the s3cmd CLI. It's doing a multipart upload and often fails. I'm doing this from a server which has a 1000mbps bandwidth to play with. But the upload still is VERY slow. Is there something I can do to speed this up?
On the other hand, the file is on the HDFS on the server I mentioned. Is there a way to reference the Amazon Elastic Map Reduce job to pick it up from this HDFS? It's still an upload but the job is getting executed as well. So the overall process is much quicker.
First I'll admit that I've never used the Multipart feature of s3cmd, so I can't speak to that. However, I have used boto in the past to upload large (10-15GB files) to S3 with a good deal of success. In fact, it became such a common task for me that I wrote a little utility to make it easier.
As for your HDFS question, you can always reference an HDFS path with a fully qualified URI, e.g., hdfs://{namenode}:{port}/path/to/files. This assumes your EMR cluster can access this external HDFS cluster (might have to play with security group settings)
Related
What I am trying to achieve?
I am trying to enable the S3A magic committer for my Spark3.3.0 application running on a Yarn (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application is writing about 21GBs of data with 30 tasks in the corresponding Spark stage (see below image).
My setup
I have a server which has the Spark client. The Spark client submits the application on Yarn cluster via the client-mode with PySpark.
What I tried
I am using the following config (setting via PySpark Spark-conf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar to the jars/ directory of the Spark-Home on the Nodemanagers and my Spark-client servers.
Changes that I see after applying the aforementioned configs:
I see PRE __magic/ directory if I run aws s3 ls <write-path> when the job is running.
I don't see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe. anymore.
A _SUCCESS file gets created with (JSON) content. One of the key-value that I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to show a performance boost (e.g. this article claims 57-77% time reduction). Hence, I expect to see significant reduction (from 39s) in the "duration" column of my "paruqet" stage, when I use the above shared configs.
Some other point that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3gaurd, as S3 now provides strong consistency.
correct. you don't need s3guard
the com.hortonworks binding was for the wip committer work. the binding classes for wiring up spark/parquet are all in spark-hadoop-cloud and have org.spark prefixes. you seem to be ok there
the simple test for what committer is live is to print the JSON _SUCCESS file. If that is a 0 byte file, you are still using the old committer. it does sound like you are.
grab the latest spark+hadoop build you can get, there's always ongoing improvements, with hadoop 3.3.5 doing a big enhancement there.
you should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). it is also correct, which the v1 algorithm doesn't offer on s3 (and which v2 doesn't offer anywhere
As I understand, that Hadoop HDFS can't increase the network speed, but I was in a discussion with a few folks trying to brainstorm how we can significantly speed up our uploads, and someone said that they were able to significantly improve the upload speed using HDFS.
If a user is on a LAN (100 MBPS), is there someway Hadoop HDFS can help increase the upload speeds when the user uploads a large file >100GB using their browser?
The webbrowser and webserver will then become the bottleneck in itself. They must buffer the file on that server, and then upload to HDFS, as compared to a direct datanode writer of hadoop fs -copyFromLocal
HUE (which uses WebHDFS) operates in this fashion, and I don't think there is an easy way to stream that large of a file via HTTP to exist on HDFS unless you can do chunked uploads, and once you do, you'd then have multiple smaller files on HDFS rather than the original 100+ GB one (assuming you're not trying to append to the same file reference on HDFS)
Recently I am setting up my Hadoop cluster over Object Store with S3, all data file are store in S3 instead of HDFS, and I successfully run spark and MP over S3, so I wonder if my namenode is still necessary, if so, what does my namenode do while I am running hadoop application over S3? Thanks.
No, provided you have a means to deal with the fact that S3 lacks the consistency needed by the shipping work committers. Every so often, if S3's listings are inconsistent enough, your results will be invalid and you won't even notice.
Different suppliers of Spark on AWS solve this in their own way. If you are using ASF spark, there is nothing bundled which can do this.
https://www.youtube.com/watch?v=BgHrff5yAQo
I'm working with EMR (Elastic MapReduce) on AWS infrastructure and the default way to provide input files (large datasets) for programs is to upload them to an S3 bucket and reference those buckets from within EMR.
Usually I download the datasets to my local,development machine and then upload them to S3, but this is getting harder to do with larger files, as upload speeds are generally much lower than download speeds.
My question is is there a way to download files from the internet (given their URL) directly into S3 so I don't have to download them to my local machine and then manually upload them?
No. You need an intermediary- typically, an EC2 instance is used, rather than your local machine, for speed.
I am developing an API for using hdfs as a distributed file storage. I have made a REST api for allowing a server to mkdir, ls, create and delete a file in the HDFS cluster using Webhdfs. But since Webhdfs does not support downloading a file, are there any solutions for achieving this. I mean I have a server who runs my REST api and communicates with the cluster. I know the OPEN operation just supports reading a text file content, but suppose I have a file which is 300 MB in size, how can I download it from the hdfs cluster. Do you guys have any possible solutions.? I was thinking of directly pinging the datanodes for a file, but this solution is flawed as if the file is 300 MB in size, it will put a huge load on my proxy server, so is there a streaming API to achieve this.
As an alternative you could make use of streamFile provided by DataNode API.
wget http://$datanode:50075/streamFile/demofile.txt
It'll not read the file as a whole, so the burden will be low, IMHO. I have tried it, but on a pseudo setup and it works fine. You can just give it a try on your fully distributed setup and see if it helps.
One way which comes to my mind, is to use a proxy worker, which reads the file using hadoop file system API, and creates a local normal file.And the provide download link to this file. Downside being
Scalablity of Proxy server
Files may be theoretically too large to fit into disk of a single proxy server.