I'm trying to copy terabytes of data from HDFS to the local file system using hadoop fs -get, but it takes hours to complete. Is there a more efficient way to get data from HDFS to local?
How fast you copy to a local filesystem is dependent on many factors including:
Whether you are copying in parallel or serially.
Whether the file is splittable (can a mapper potentially deal with a block of data rather than a whole file; this is usually a problem if you have certain kinds of compressed files on HDFS).
Network bandwidth, of course, because you will likely be pulling from many DataNodes.
Option 1: DistCp
In any case, since you state your files are on HDFS, we know each Hadoop slave node can see the data. You can try the DistCp command (distributed copy), which turns your copy operation into a parallel MapReduce job for you, WITH ONE MAJOR CAVEAT!
MAJOR CAVEAT: This will be a distributed copy process so the destination you specify on the command line needs to be a place visible to all nodes. To do this you can mount a network share on all nodes and specify a directory in that network share (NFS, Samba, Other) as the destination for your files. This may take getting a system admin involved, but the result may be a faster file copy operation so the cost-benefit is up to you.
DistCp documentation is here: http://hadoop.apache.org/docs/r0.19.0/distcp.html
DistCp example: YourShell> hadoop distcp -i -update /path/on/hdfs/to/directoryOrFileToCopy file:///LocalpathToCopyTo
Option 2: Multi-threaded Java Application with HDFS API
As you found, hadoop fs -get is a sequential operation. If your Java skills are up to the task, you can write your own multi-threaded copy program using the Hadoop file system API calls.
Option 3: Multi-threaded Program in any language with HDFS REST API
If you know a language other than Java, you can similarly write a multi-threaded program that accesses HDFS through the HDFS REST API (WebHDFS) or through an NFS mount.
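As a rough illustration of Option 3, here is a minimal Python sketch that pulls several files concurrently through the WebHDFS OPEN operation. The NameNode address (nn:9870 is the Hadoop 3 default HTTP port, older releases use 50070), the user name, and the helper names webhdfs_open_url, fetch_one and parallel_get are all assumptions made for this example. It also assumes an unsecured cluster; a Kerberized one needs SPNEGO authentication instead of user.name.

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import quote


def webhdfs_open_url(namenode: str, hdfs_path: str, user: str) -> str:
    """Build the WebHDFS URL that opens one HDFS file for reading."""
    return (f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}"
            f"?op=OPEN&user.name={quote(user)}")


def fetch_one(url: str, dest: Path) -> Path:
    """Stream one file to local disk. urlopen follows the NameNode's
    redirect to the DataNode that actually serves the bytes."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out, length=1024 * 1024)
    return dest


def parallel_get(namenode, user, hdfs_paths, local_dir, workers=8,
                 fetch=fetch_one):
    """Download many HDFS files at once, one thread per in-flight file.
    `fetch` is injectable so the logic can be tried without a cluster."""
    local_dir = Path(local_dir)
    jobs = [(webhdfs_open_url(namenode, p, user), local_dir / p.lstrip("/"))
            for p in hdfs_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url, dest) for url, dest in jobs]
        # result() re-raises any download error in the calling thread.
        return [f.result() for f in futures]
```

Threads are enough here because the work is network-bound. The worker count is worth tuning against the bandwidth of the receiving machine, which remains the bottleneck when everything lands on one host.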
What is the recommended DefaultFS (file system) for Hadoop on Dataproc? Are there any benchmarks or considerations available around using GCS vs HDFS as the default file system?
I was also trying to test things out and discovered that when I set the DefaultFS to a gs:// path, the Hive scratch files are getting created - both on HDFS as well as the GCS paths. Is this happening synchronously and adding to latency or does the write to GCS happen after the fact?
Would appreciate any guidance, reference around this.
Thank you
PS: These are ephemeral Dataproc clusters that are going to be using GCS for all persistent data.
HDFS is faster. There should already be public benchmarks for that, or you can just take it as a given, because GCS is networked storage whereas HDFS is directly mounted in the Dataproc VMs.
"Recommended" would be persistent storage, though, so GCS, but maybe only after finalizing the data in the applications. For example, you might not want Hive scratch files in GCS since they'll never be used outside of the current query session, but you would want Spark checkpoints there if you're running periodic batch jobs that scale down the HDFS cluster in between executions.
I would say the default (HDFS) is the recommended option. Typically, the input and output data of Dataproc jobs are persisted outside of the cluster in GCS or BigQuery, and the cluster is used for compute and intermediate data. This intermediate data is stored on local disks directly or through HDFS, which eventually also goes to local disks. After the job is done, you can safely delete the cluster and pay only for the storage of input and output data, which saves cost.
Also HDFS usually has lower latency for intermediate data, especially for lots of small files and metadata operations, e.g. dir rename. GCS is better at throughput for large files.
But when using HDFS, you need to provision sufficient disk space (at least 1 TB per node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.
I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands in the cluster. I need to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive? Is that possible?
2) S3-distcp (running it on AWS)? If so, what configuration would be needed?
Any suggestions?
If the hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some hive query which uses hdfs:// as input and writes out to s3. You'll need to deal with kerberos auth though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems. distcp is best when it schedules the uploads on the hosts which actually have the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files. This is what incremental backup tools tend to use.
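To make the audit-log idea concrete, here is a minimal Python sketch that extracts the paths touched by write operations. It assumes the common audit line layout of tab-separated key=value fields after "FSNamesystem.audit: " (allowed, ugi, ip, cmd, src, dst, perm); check it against your own logs, since the layout can differ between versions. The WRITE_CMDS set and the function names are choices made for this example, not part of any Hadoop API.

```python
# Commands treated as "the file changed". This selection is an assumption;
# extend it (e.g. with setPermission) to suit your backup policy.
WRITE_CMDS = {"create", "append", "rename", "delete"}


def parse_audit_line(line):
    """Split one audit log line into a dict of its key=value fields.
    Lines without the 'audit: ' marker yield an empty dict."""
    _, _, payload = line.rstrip("\n").partition("audit: ")
    fields = {}
    for part in payload.split("\t"):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def changed_paths(lines):
    """Return the sorted set of paths touched by allowed write commands."""
    changed = set()
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("allowed") != "true":
            continue
        if fields.get("cmd") not in WRITE_CMDS:
            continue
        if fields.get("src"):
            changed.add(fields["src"])
        # A rename also changes the destination path.
        if fields["cmd"] == "rename" and fields.get("dst") not in (None, "null"):
            changed.add(fields["dst"])
    return sorted(changed)
```

The resulting list could then be fed to distcp as the set of files to copy, instead of rescanning the whole namespace on every run.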
I understand that distcp is used for inter/intra cluster transfer of data. Is it possible to use distcp to ingest data from the local file system into HDFS? I understand that you can use file:///.... to point to a local file outside of HDFS, but how reliable and fast is that compared to the inter/intra cluster transfer?
Distcp is a MapReduce job that is executed inside the Hadoop cluster. From the cluster's perspective, your local machine's disk is not a local file system: a file:/// path is resolved on whichever node runs the map task. So you can't use your local file system with distcp. An alternative could be to configure an FTP server on your machine that the Hadoop cluster can read from. The performance depends on the network and the protocol used (FTP with Hadoop performs very badly).
Using the hdfs dfs -put command could be better for small amounts of data, but it doesn't work in parallel like distcp.
Hadoop writes the intermediate results to the local disk and the results of the reducer to HDFS. What does HDFS mean? What does it physically translate to?
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
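To put numbers on the distribution and replication described above, here is a small Python sketch that estimates how many blocks a file occupies and how much raw disk it uses across the cluster. The defaults of a 128 MB block size and a replication factor of 3 match the stock dfs.blocksize and dfs.replication settings in Hadoop 2 and later, but your cluster may be configured differently. The function name hdfs_footprint is made up for this example.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Estimate block count and raw storage for one file in HDFS.

    The last block only takes up as much disk as it holds, so raw usage
    is the file size times the replication factor, not blocks * block size.
    """
    blocks = -(-file_bytes // block_bytes)  # integer ceiling division
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_bytes * replication,
    }


one_gib = 1024**3
print(hdfs_footprint(one_gib))
# {'blocks': 8, 'block_replicas': 24, 'raw_bytes': 3221225472}
```

So a 1 GiB file becomes 8 blocks, stored as 24 block replicas spread over the DataNodes, which is why losing a single node does not make any part of the file unreadable.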
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video on how HDFS and the Map/Reduce framework work; it will serve you as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/
I'm able to run a local mapper and reducer built using ruby with an input file.
I'm unclear about the behavior of the distributed system though.
For the production system, I have an HDFS set up across two machines. I know that if I store a large file on the HDFS, it will have some blocks on both machines to allow for parallelization. Do I also need to store the actual mapper and reducer files (my Ruby files in this case) on the HDFS as well?
Also, how would I then go about actually running the streaming job so that it runs in a parallel manner on both systems?
If you use mappers/reducers written in Ruby (or anything other than Java), you have to use hadoop-streaming. Hadoop streaming has an option to package your mapper/reducer files when sending your job to the cluster, so you don't need to put them on HDFS yourself. The following link should have what you are looking for.
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
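As a sketch of what such a job submission looks like, the Python helper below assembles a hadoop-streaming command line that ships the mapper and reducer scripts with the job via the -file option described in the streaming documentation. The jar location, the HDFS paths and the script names are placeholders assumed for this example, and newer Hadoop releases prefer the generic -files option over -file.

```python
import os
import shlex


def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop streaming job.

    Each -file entry packages a local script with the job; on the task
    nodes it appears in the working directory, so -mapper and -reducer
    refer to the scripts by base name only.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
        "-file", mapper,
        "-file", reducer,
    ]


cmd = streaming_command(
    "/path/to/hadoop-streaming.jar",  # placeholder: depends on your install
    "/user/me/input",                 # hypothetical HDFS input directory
    "/user/me/output",                # hypothetical HDFS output directory
    "scripts/mapper.rb",
    "scripts/reducer.rb",
)
print(shlex.join(cmd))
```

Running the printed command from a machine with the Hadoop client configured submits the job. The framework then schedules map tasks on the nodes holding the input blocks, which is what gives you parallel execution across both machines.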