use spark to copy data across hadoop cluster - hadoop

I have a situation where I have to copy data/files from PROD to UAT (hadoop clusters). For that I am using 'distcp' now. but it is taking forever. As distcp uses map-reduce under the hood, is there any way to use spark to make the process any faster? Like we can set hive execution engine to 'TEZ' (to replace map-reduce), can we set execution engine to spark for distcp? Or is there any other 'spark' way to copy data across clusters which may not even bother about distcp?
And here comes my second question (assuming we can set distcp execution engine to spark instead of map-reduce, please don't bother to answer this one otherwise):-
As per my knowledge Spark is faster than map-reduce mainly because it stores data in the memory which it might need to process in several occasions so that it does not have to load the data all the way from disk. Here we are copying data across clusters, so there is no need to process one file (or block or split) more than once as each file will go up into the memory then will be sent over the network, gets copied to the destination cluster disk, end of the story for that file. Then how come Spark makes the process faster if the main feature is not used?

Your bottlenecks on bulk cross-cluster IO are usually
bandwidth between clusters
read bandwidth off the source cluster
write bandwidth to the destination cluster (and with 3x replication, writes do take up disk and switch bandwidth)
allocated space for work (i.e. number of executors, tasks)
Generally on long-distance uploads its your long-haul network that is the bottleneck: you don't need that many workers to flood the network.
There's a famous tale of a distcp operation between two Yahoo! clusters which did manage to do exactly that to part of the backbone: the Hadoop ops team happy that the distcp was going so fast, while the networks ops team are panicing that their core services were somehow suffering due to the traffic between two sites. I believe this incident is the reason that distcp now has a -bandwidth option :)
Where there may be limitations in distcp, it's probably in task setup and execution: the decision of which files to copy is made in advance and there's not much (any?) intelligence in rescheduling work if some files copy fast but others are outstanding.
Distcp just builds up the list in advance and hands it off to the special distcp mappers, each of which reads its list of files and copies it over.
Someone could try doing a spark version of distcp; it could be an interesting project if someone wanted to work on better scheduling, relying on the fact that spark is very efficient at pushing out new work to existing executors: a spark version could push out work dynamically, rather than listing everything in advance. Indeed, it could still start the copy operation while enumerating the files to copy, for a faster startup time. Even so: cross-cluster bandwidth will usually be the choke point.

Spark is not really intended for data movement between Hadoop clusters. You may want to look into additional mappers for your distcp job using the "-m" option.

Related

Calling distcp from Spark

can anyone tell me what's the most robust way to copy files from HDFS to S3 in Pyspark ?
I am looking at 2 options:
I. Call distcp directly as in the following:
distcp_arglist =['/usr/lib/hadoop/bin/hadoop','distcp',
...,
'-overwrite',
src_path, dest_path]
II. Using s3-distcp - which seems a bit more involved.
https://gist.github.com/okomestudio/699edbb8e095f07bafcc
Any suggestions are welcome. Thanks.
I'm going to point you at a little bit of my code, cloudcp
This is a basic proof of concept of implementing distCp in spark
individual files are scheduled via the spark scheduler; not ideal for 0-byte files, but stops the job being held up by a large file off one node
does do locality via a special RDD which works out the location of every row (i.e file) differently (which has to be in the org.apache.spark package for scoped access)
shows how to do FS operations within a spark map
shuffles the input for a bit of randomness
collects results within an RDD
Doesn't do:
* incremental writes (you can't compare checksums between HDFS and S3 anyway, but it could do a check for fs.exists(path) before the copy.
* permissions. S3 doesn't have them
* throttling
* scheduling of the big files first. You ought to.
* recovery of job failure (no incremental, see)
Like I said, PoC to say "we be more agile by using spark for the heavy lifting"
Anyway, take it and play, you can rework it to operate within an existing spark context with ease, as long as you don't mind a bit of scala coding.
Distcp would probably be the way to go as it is well-proven solution for transfering data between the clusters. I guess any possible alternatives would do something similar - create mapreduce jobs for transfering the data. Important point here is how to tune this process for your particular data as it could really depend on many factors like networking or map-reduce settings. I recommend you to read HortonWorks article about how you can tune this process

Performance comparison between hadoop Pseudo-Distributed Operation and Standalone Operation

I'm a very beginner of hadoop. But I had this interesting observation.
Using the example in hadoop documentation,
By running the same example in Standalone Operation and Pseudo-Distributed Operation, the standalone one took less than 1 minute but Pseudo-distributed operation took more than 3 minutes. This is big difference. I could understand there are extra network and scheduling overhead in distributed mode. But the difference just seems to be too much. This may not be a real comparison because the example is very simple.
My question is, how much difference did you experience between the standalone and distributed mode for a real-world job?
These are reasonably different scenarios. In stand-alone mode, it never starts up a proper one-node Hadoop cluster. Everything happens locally, inline, in the JVM. Data never has to be even written out to disk, potentially. Pseudo-distributed operation is the smallest "real" Hadoop installation, of one local node. You have to read/write data to a local HDFS instance, spawn another JVM, etc. All of that adds a lot of overhead. Maybe the overhead is indeed a few minutes. This seems entirely sensible to me.
Hadoop frame work is meant for processing BIG DATA..
So the size of the data matters a lot ,because ,a smaller file would get processed in traditional file system very quickly than in hadoop because hadoop mapreduce frame work has internal work to do (to make chunks of data file and to send it to data nodes and while processing again access from data nodes ).So for a smaller files ,hadoop frame work is not suitable.
Coming to standalone and pseudo distributed mode ,one aspect u should consider is size of the file and second being actual difference in standalone and pseudo distributed mode.
In standalone mode there is no concept of HDFS,data is not copied to hadoop distributed file system (obviously time saved).Where as in pseudo distributed mode ,hdfs involved which need to be copied with the data that's need to be processed.
Small size data files better to use traditional file processing and if the file size become huge and huge ,hadoop framework gives better processing time!
Hope this helps!

Parallel processing of several files in a cluster

At the company I work for, everyday we have to process a few thousands of files, which takes some hours. The operations are basically CPU intensive, like converting PDF to high resolution images and later creating many different sizes os such images.
Each one of those tasks takes a lot of CPU, and therefore we can't simply start many instances on the same machine because there won't be any processing power available for everything. Thus, it takes some hours to finish everything.
The most obvious thing to do, as I see it, is to partition the set of files and have them processed by more machines concurrently (5, 10, 15 machines, I don't know yet how many would be necessary).
I don't want to reinvent the wheel and create a manager for task (nor do I want the hassle), but I am not sure which tool should I use.
Although we don't have big data, I have looked at Hadoop for a start (we are running at Amazon), and its capabilities of handling the nodes seem interesting. However, I don't know if it makes sense to use it. I am looking at Hazelcast as well, but I have no experience at all with it or the concepts yet.
What would be a good approach for this task?
Hadoop is being used for a wide variety of data processing problems, some of them are related to image processing also. The problem mentioned in the OP can also be easily solved using Hadoop. Note that in some cases where the data to processed is small, then there is an overhead using Hadoop.
If you are new to Hadoop, would suggest a couple of things
Buy the Hadoop : The Definitive Guide book.
Go through the MapReduce resources.
Start going through the tutorials (1 and 2) and setup Hadoop on a single node and a cluster. There is no need for Amazon, if 1-2 machines can be spared for learning.
Run the sample programs and understand how they work.
Start migrating the problem area to Hadoop.
The advantage of Hadoop over other s/w is the ecosystem around Hadoop. As of now the ecosystem around Hadoop is huge and growing, I am not sure of Hazelcast.
You can use Hazelcast distributed queue.
First you can put your files (file references) as tasks to a distributed queue.
Then each node takes a task from the queue processes it and puts the result into another distributed queue/list or write it to DB/storage.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data into nodes that are used for analytics (increasing the replication higher than the default 3 probably makes sense in such scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to different racks, so this seems to mesh well with what I'm describing.
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
Does that kind of solution make sense? I am thinking it could be more simple to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using a HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
Thanks.
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
--
Felix
Your approach has a few problems... even in rack aware mode, if you have more than a few racks I don't see how you can be guaranteed your nodes will be replicated on those nodes. If you lose one of your "live" nodes, then you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest instead of having hybrid nodes, just dedicate some nodes to your Hadoop cluster and some nodes to your Hbase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might have sense, since MR is mostly CPU intensive and HBASE is a memory hungry beast. What we do need - is to properly arrange resource management. I think it is possible in the following way:
a) CPU. We can define maximum number of MR mappers/reducers per slot and assuming that each mapper is single threaded we can limit CPU consumption of the MR. The rest will go to HBASE.
b) Memory.We can limit memory for mappers and reducers and the rest give to HBASE.
c) I think we can not properly manage HDFS bandwidth sharing, but I do not think it should be a problem for HBASE -since for it disk operations are not on the critical path.

Experience with Hadoop?

Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a Share-nothing architecture? Would that make sense?
I'm also interested into any performance results you have...
Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.
This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.
Hadoop MapReduce can run ontop of any number of file systems or even more abstract data sources such as databases. In fact there are a couple of built-in classes for non-HDFS filesystem support, such as S3 and FTP. You could easily build your own input format as well by extending the basic InputFormat class.
Using HDFS brings certain advantages, however. The most potent advantage is that the MapReduce job scheduler will attempt to execute maps and reduces on the physical machines that are storing the records in need of processing. This brings a performance boost as data can be loaded straight from the local disk instead of transferred over the network, which depending on the connection may be orders of magnitude slower.
As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where data is stored. Using HDFS has 2 main benefits IMHO 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication) and 2) the cluster as a whole is more resistant to failure due to data unavailability.
If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.
The best way to wrap your head around Hadoop is to download it and start exploring the include examples. Use a Linux box/VM and your setup will be much easier than Mac or Windows. Once you feel comfortable with the samples and concepts, then start to see how your problem space might map into the framework.
A couple resources you might find useful for more info on Hadoop:
Hadoop Summit Videos and Presentations
Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even at this point ( the book is ~40% complete ).
Parallel/ Distributed computing = SPEED << Hadoop makes this really really easy and cheap since you can just use a bunch of commodity machines!!!
Over the years disk storage capacities have increased massively but the speeds at which you read the data have not kept up. The more data you have on one disk, the slower the seeks.
Hadoop is a clever variant of the divide an conquer approach to problem solving.
You essentially break the problem into smaller chunks and assign the chunks to several different computers to perform processing in parallel to speed things up rather than overloading one machine. Each machine processes its own subset of data and the result is combined in the end. Hadoop on a single node isn't going to give you the speed that matters.
To see the benefit of hadoop, you should have a cluster with at least 4 - 8 commodity machines (depending on the size of your data) on a the same rack.
You no longer need to be a super genius parallel systems engineer to take advantage of distributed computing. Just know hadoop with Hive and your good to go.
yes, hadoop can be very well used without HDFS. HDFS is just a default storage for Hadoop. You can replace HDFS with any other storage like databases. HadoopDB is an augmentation over hadoop that uses Databases instead of HDFS as a data source. Google it, you will get it easily.
If you're just getting your feet wet, start out by downloading CDH4 & running it. You can easily install into a local Virtual Machine and run in "pseudo-distributed mode" which closely mimics how it would run in a real cluster.
Yes You can Use local file system using file:// while specifying the input file etc and this would work also with small data sets.But the actual power of hadoop is based on distributed and sharing mechanism. But Hadoop is used for processing huge amount of data.That amount of data cannot be processed by a single local machine or even if it does it will take lot of time to finish the job.Since your input file is on a shared location(HDFS) multiple mappers can read it simultaneously and reduces the time to finish the job. In nutshell You can use it with local file system but to meet the business requirement you should use it with shared file system.
Great theoretical answers above.
To change your hadoop file system to local, you can change it in "core-site.xml" configuration file like below for hadoop versions 2.x.x.
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
for hadoop versions 1.x.x.
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>

Resources