Calling distcp from Spark - hadoop

Can anyone tell me the most robust way to copy files from HDFS to S3 in PySpark?
I am looking at 2 options:
I. Call distcp directly as in the following:
import subprocess

distcp_arglist = ['/usr/lib/hadoop/bin/hadoop', 'distcp',
                  # ... other distcp options ...
                  '-overwrite',
                  src_path, dest_path]
subprocess.check_call(distcp_arglist)
II. Use s3-distcp, which seems a bit more involved:
https://gist.github.com/okomestudio/699edbb8e095f07bafcc
Any suggestions are welcome. Thanks.

I'm going to point you at a little bit of my code, cloudcp
This is a basic proof of concept of implementing distCp in Spark. It:
* schedules individual files via the Spark scheduler; not ideal for 0-byte files, but it stops the job being held up by one large file on one node
* does locality via a special RDD which works out the location of every row (i.e. file) separately (this has to be in the org.apache.spark package for scoped access)
* shows how to do FS operations within a Spark map
* shuffles the input for a bit of randomness
* collects the results within an RDD
Doesn't do:
* incremental writes (you can't compare checksums between HDFS and S3 anyway, but it could do a check for fs.exists(path) before the copy)
* permissions. S3 doesn't have them
* throttling
* scheduling of the big files first. You ought to.
* recovery from job failure (since there's no incremental copy, see above)
Like I said, it's a PoC to say "we'd be more agile by using Spark for the heavy lifting".
Anyway, take it and play with it; you can rework it to operate within an existing Spark context with ease, as long as you don't mind a bit of Scala coding.
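For flavour, here is a minimal PySpark sketch of the same general idea (this is not the cloudcp code; the file list, the destination bucket, and shelling out to the hadoop CLI on each executor are all assumptions):
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-copy-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical source files and destination; in practice you would list these from HDFS.
files = ["hdfs:///data/part-00000", "hdfs:///data/part-00001"]
dest = "s3a://my-bucket/data/"

def copy_one(path):
    # FS operation inside a Spark task: shell out to the Hadoop CLI on the executor.
    rc = subprocess.call(["hadoop", "fs", "-cp", path, dest])
    return (path, rc)

# Spread the copies across executors, then collect the per-file results on the driver.
results = sc.parallelize(files, len(files)).map(copy_one).collect()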

Distcp would probably be the way to go, as it is a well-proven solution for transferring data between clusters. I guess any possible alternative would do something similar: create MapReduce jobs for transferring the data. The important point here is how to tune this process for your particular data, as it can really depend on many factors like networking or MapReduce settings. I recommend reading the Hortonworks article about how you can tune this process.
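For example, the number of copy mappers and the per-mapper bandwidth are the usual starting knobs. Here is a sketch in the same arglist style as the question (the paths are placeholders; -m and -bandwidth are standard distcp options):
distcp_arglist = ['/usr/lib/hadoop/bin/hadoop', 'distcp',
                  '-m', '64',           # maximum number of simultaneous copy mappers
                  '-bandwidth', '100',  # MB/s allowed per mapper
                  '-overwrite',
                  'hdfs:///src/path', 's3a://bucket/dest/path']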

Related

use spark to copy data across hadoop cluster

I have a situation where I have to copy data/files from PROD to UAT (Hadoop clusters). For that I am using 'distcp' now, but it is taking forever. As distcp uses MapReduce under the hood, is there any way to use Spark to make the process faster? Just as we can set the Hive execution engine to 'TEZ' (to replace MapReduce), can we set the execution engine to Spark for distcp? Or is there any other 'Spark' way to copy data across clusters which may not even involve distcp?
And here comes my second question (assuming we can set the distcp execution engine to Spark instead of MapReduce; please don't bother to answer this one otherwise):
As far as I know, Spark is faster than MapReduce mainly because it keeps data it may need to process on several occasions in memory, so that it does not have to load it from disk each time. Here we are copying data across clusters, so there is no need to process one file (or block or split) more than once: each file goes into memory, is sent over the network, and gets written to the destination cluster's disk, end of story for that file. So how does Spark make the process faster if its main feature is not used?
Your bottlenecks on bulk cross-cluster IO are usually:
* bandwidth between the clusters
* read bandwidth off the source cluster
* write bandwidth to the destination cluster (and with 3x replication, writes do take up disk and switch bandwidth)
* allocated space for work (i.e. number of executors, tasks)
Generally, on long-distance uploads it's your long-haul network that is the bottleneck: you don't need that many workers to flood the network.
There's a famous tale of a distcp operation between two Yahoo! clusters which managed to do exactly that to part of the backbone: the Hadoop ops team were happy that the distcp was going so fast, while the network ops team were panicking that their core services were somehow suffering due to the traffic between the two sites. I believe this incident is the reason that distcp now has a -bandwidth option :)
Where there may be limitations in distcp, they're probably in task setup and execution: the decision of which files to copy is made in advance and there's not much (any?) intelligence in rescheduling work if some files copy fast but others are outstanding.
Distcp just builds up the list in advance and hands it off to the special distcp mappers, each of which reads its list of files and copies it over.
Someone could try doing a Spark version of distcp; it could be an interesting project for anyone who wanted to work on better scheduling, relying on the fact that Spark is very efficient at pushing out new work to existing executors: a Spark version could push out work dynamically rather than listing everything in advance. Indeed, it could start the copy operation while still enumerating the files to copy, for a faster startup time. Even so, cross-cluster bandwidth will usually be the choke point.
Spark is not really intended for data movement between Hadoop clusters. You may want to look into additional mappers for your distcp job using the "-m" option.
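For example (the namenode addresses and paths here are placeholders; -m sets the maximum number of simultaneous copy mappers):
hadoop distcp -m 40 hdfs://prod-nn:8020/data hdfs://uat-nn:8020/data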

SequenceFile alternative/extension that allows in-place updates

I like the convenience of a database where you can update a row in-place. But Hadoop relies on sequence files that are capable of being consumed in parallel.
I like the idea of HBase, where I can rewrite just one row and also use the table as input to a MapReduce job. But HBase is not something a newbie should mess with, right? What is a good tool/method for this?
I don't think it's very difficult to learn and use HBase.
Coming to your original question: the reason we use HBase is the same as the reason behind using any other DB, i.e. random, real-time read/write access, which HDFS lacks. And this is true for any filesystem, not just HDFS; you could take the ext4 & MySQL pairing as an example.
And when you say re-write in HBase, it is not actually an update. You either put a new version of a cell, or delete a cell and put new data at the same location.
And you can't say that Hadoop relies on sequence files to provide the parallelism. Parallelism is something Hadoop provides by virtue of its nature, i.e. being a distributed platform. You can handle almost any kind of file using Hadoop with almost equal parallelism. The only advantage of sequence files is that they are more suitable for MapReduce processing, as they are already in key/value pairs.
You have to take this with a pinch of salt, but frankly speaking, Hadoop doesn't understand updates. If you could elaborate on your use case a bit more, maybe I could suggest something better.
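To make the "new version or delete-and-put" point concrete, here is a minimal sketch using the happybase Python client (it assumes an HBase Thrift server is running; the host, table, and column names are made up):
import happybase

connection = happybase.Connection('hbase-host')
table = connection.table('documents')

# "Updating" a row is really just writing a new version of the cell...
table.put(b'doc-42', {b'content:words': b'newly parsed words'})

# ...or deleting the old cell and putting fresh data at the same location.
table.delete(b'doc-42', columns=[b'content:words'])
table.put(b'doc-42', {b'content:words': b'replacement words'})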

Data movement HDFS Vs Parallel file system Vs MPI

I'm currently working on an implementation of machine learning algorithms on MR-MPI (MapReduce on MPI). I'm also trying to understand other MapReduce frameworks, especially Hadoop, so the following is my basic question (I'm new to MapReduce frameworks; I apologize if my question doesn't make sense).
Question: MapReduce can be implemented on top of many things, such as a parallel file system (GPFS), HDFS, MPI, etc. After the map step there is a collate operation, followed by a reduce operation. For the collate operation, some data movement needs to happen across the nodes. In this regard, I would like to know what the difference is in the data movement mechanisms (between nodes) in HDFS vs GPFS vs MPI.
I would appreciate a good explanation and some good references on each of these so I can get into further details.
Thanks.
MapReduce as a paradigm can be implemented on many storage systems. Indeed, Hadoop has a so-called DFS (distributed file system) abstraction which enables integration of different storage systems and lets MapReduce run over them. For example, there are Amazon S3, local file system, OpenStack Swift and other integrations.
At the same time, the HDFS integration has one special property: it reports to the MR engine (the JobTracker, to be more specific) where the data resides, and this enables smart scheduling of mapping so that the data to be processed by each mapper is usually co-located with that mapper.
As a result, during the mapping phase data is not moved over the network when MR runs over HDFS.
More generally, it can be stated that the idea of Hadoop MR is to move code to data and not the opposite, and that should be an important criterion when evaluating any scalable MR implementation: does this system care that mappers process local data?
The OP has mixed a couple of things, messaging and file systems, so there are multiple answers.
Hadoop/MAPI is a WIP and you can find more details here.
Hadoop/GPFS is still open.
Hadoop/HDFS comes out of the box from Apache Hadoop. For data transfer between the mappers and reducers, HTTP is used; I'm not sure why.

Synchronize data to HBase/HDFS and use it as input to MapReduce job

I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.
This example might explain more:
Let's say I have an input stream of documents which contain a bunch of words, and these words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded to (or deleted from) the filesystem.
How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.
Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?
I would first pick the best tool for the job, or do some research to make a reasonable choice. You're asking the question, which is the most important step. Given the amount of data you're planning to process, Hadoop is probably just one option. If this is the first step towards bigger and better things, then that would narrow the field.
I would then start off with the simplest approach that I expect to work, which typically means using the tools I already know. Write code flexibly to make it easier to replace original choices with better ones as you learn more or run into roadblocks. Given what you've stated in your question, I'd start off by using HDFS, using the Hadoop command-line tools to push the data to an HDFS folder (hadoop fs -put ...). Then, I'd write an MR job or jobs to do the processing, running them manually. Once that was working, I'd probably use cron to handle scheduling of the jobs.
That's a place to start. As you build the process, if you reach a point where HBase seems like a natural fit for what you want to store, then switch over to that. Solve one problem at a time, and that will give you clarity on which tools are the right choice each step of the way. For example, you might get to the scheduling step and know by that time that cron won't do what you need - perhaps your organization has requirements for job scheduling that cron won't fulfil. So, you pick a different tool.
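For concreteness, the "simplest approach" above might end up as little more than a crontab entry; everything here (paths, jar, class name) is a made-up example:
0 2 * * * hadoop fs -put /incoming/docs /user/me/input && hadoop jar wordcount.jar WordCount /user/me/input /user/me/output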

Experience with Hadoop?

Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a shared-nothing architecture? Would that make sense?
I'm also interested in any performance results you have...
Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.
This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.
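For example, the stock WordCount example can be pointed at the local filesystem directly (the examples jar name varies by Hadoop version; the paths are placeholders):
hadoop jar hadoop-mapreduce-examples.jar wordcount file:///home/me/input file:///home/me/output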
Hadoop MapReduce can run on top of any number of file systems, or even more abstract data sources such as databases. In fact, there are a couple of built-in classes for non-HDFS filesystem support, such as S3 and FTP. You could easily build your own input format as well by extending the base InputFormat class.
Using HDFS brings certain advantages, however. The most potent advantage is that the MapReduce job scheduler will attempt to execute maps and reduces on the physical machines that are storing the records in need of processing. This brings a performance boost as data can be loaded straight from the local disk instead of transferred over the network, which depending on the connection may be orders of magnitude slower.
As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where the data is stored. Using HDFS has two main benefits IMHO: 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication) and 2) the cluster as a whole is more resistant to failure due to data unavailability.
If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.
The best way to wrap your head around Hadoop is to download it and start exploring the included examples. Use a Linux box/VM and your setup will be much easier than on Mac or Windows. Once you feel comfortable with the samples and concepts, then start to see how your problem space might map into the framework.
A couple resources you might find useful for more info on Hadoop:
Hadoop Summit Videos and Presentations
Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even now (the book is ~40% complete).
Parallel/ Distributed computing = SPEED << Hadoop makes this really really easy and cheap since you can just use a bunch of commodity machines!!!
Over the years disk storage capacities have increased massively but the speeds at which you read the data have not kept up. The more data you have on one disk, the slower the seeks.
Hadoop is a clever variant of the divide and conquer approach to problem solving.
You essentially break the problem into smaller chunks and assign the chunks to several different computers to perform processing in parallel to speed things up rather than overloading one machine. Each machine processes its own subset of data and the result is combined in the end. Hadoop on a single node isn't going to give you the speed that matters.
To see the benefit of Hadoop, you should have a cluster with at least 4 - 8 commodity machines (depending on the size of your data) on the same rack.
You no longer need to be a super genius parallel systems engineer to take advantage of distributed computing. Just know Hadoop with Hive and you're good to go.
Yes, Hadoop can very well be used without HDFS. HDFS is just the default storage for Hadoop. You can replace HDFS with other storage, such as databases. HadoopDB is an augmentation over Hadoop that uses databases instead of HDFS as a data source. Google it and you will find it easily.
If you're just getting your feet wet, start out by downloading CDH4 & running it. You can easily install it into a local virtual machine and run it in "pseudo-distributed mode", which closely mimics how it would run on a real cluster.
Yes, you can use the local file system by using file:// when specifying the input file, etc., and this also works with small data sets. But the actual power of Hadoop is based on its distributed, shared mechanism. Hadoop is used for processing huge amounts of data, which cannot be processed by a single local machine, or which would take a lot of time to finish. Since your input file is in a shared location (HDFS), multiple mappers can read it simultaneously, which reduces the time to finish the job. In a nutshell, you can use it with the local file system, but to meet business requirements you should use it with a shared file system.
Great theoretical answers above.
To change your Hadoop file system to local, you can change it in the "core-site.xml" configuration file as below, for Hadoop versions 2.x.x:
<property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
</property>
For Hadoop versions 1.x.x:
<property>
    <name>fs.default.name</name>
    <value>file:///</value>
</property>
