Experience with Hadoop? - performance

Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a Share-nothing architecture? Would that make sense?
I'm also interested into any performance results you have...

Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.
This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.

Hadoop MapReduce can run ontop of any number of file systems or even more abstract data sources such as databases. In fact there are a couple of built-in classes for non-HDFS filesystem support, such as S3 and FTP. You could easily build your own input format as well by extending the basic InputFormat class.
Using HDFS brings certain advantages, however. The most potent advantage is that the MapReduce job scheduler will attempt to execute maps and reduces on the physical machines that are storing the records in need of processing. This brings a performance boost as data can be loaded straight from the local disk instead of transferred over the network, which depending on the connection may be orders of magnitude slower.

As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where data is stored. Using HDFS has 2 main benefits IMHO 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication) and 2) the cluster as a whole is more resistant to failure due to data unavailability.
If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.

The best way to wrap your head around Hadoop is to download it and start exploring the include examples. Use a Linux box/VM and your setup will be much easier than Mac or Windows. Once you feel comfortable with the samples and concepts, then start to see how your problem space might map into the framework.
A couple resources you might find useful for more info on Hadoop:
Hadoop Summit Videos and Presentations
Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even at this point ( the book is ~40% complete ).

Parallel/ Distributed computing = SPEED << Hadoop makes this really really easy and cheap since you can just use a bunch of commodity machines!!!
Over the years disk storage capacities have increased massively but the speeds at which you read the data have not kept up. The more data you have on one disk, the slower the seeks.
Hadoop is a clever variant of the divide an conquer approach to problem solving.
You essentially break the problem into smaller chunks and assign the chunks to several different computers to perform processing in parallel to speed things up rather than overloading one machine. Each machine processes its own subset of data and the result is combined in the end. Hadoop on a single node isn't going to give you the speed that matters.
To see the benefit of hadoop, you should have a cluster with at least 4 - 8 commodity machines (depending on the size of your data) on a the same rack.
You no longer need to be a super genius parallel systems engineer to take advantage of distributed computing. Just know hadoop with Hive and your good to go.

yes, hadoop can be very well used without HDFS. HDFS is just a default storage for Hadoop. You can replace HDFS with any other storage like databases. HadoopDB is an augmentation over hadoop that uses Databases instead of HDFS as a data source. Google it, you will get it easily.

If you're just getting your feet wet, start out by downloading CDH4 & running it. You can easily install into a local Virtual Machine and run in "pseudo-distributed mode" which closely mimics how it would run in a real cluster.

Yes You can Use local file system using file:// while specifying the input file etc and this would work also with small data sets.But the actual power of hadoop is based on distributed and sharing mechanism. But Hadoop is used for processing huge amount of data.That amount of data cannot be processed by a single local machine or even if it does it will take lot of time to finish the job.Since your input file is on a shared location(HDFS) multiple mappers can read it simultaneously and reduces the time to finish the job. In nutshell You can use it with local file system but to meet the business requirement you should use it with shared file system.

Great theoretical answers above.
To change your hadoop file system to local, you can change it in "core-site.xml" configuration file like below for hadoop versions 2.x.x.
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
for hadoop versions 1.x.x.
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>

Related

use spark to copy data across hadoop cluster

I have a situation where I have to copy data/files from PROD to UAT (hadoop clusters). For that I am using 'distcp' now. but it is taking forever. As distcp uses map-reduce under the hood, is there any way to use spark to make the process any faster? Like we can set hive execution engine to 'TEZ' (to replace map-reduce), can we set execution engine to spark for distcp? Or is there any other 'spark' way to copy data across clusters which may not even bother about distcp?
And here comes my second question (assuming we can set distcp execution engine to spark instead of map-reduce, please don't bother to answer this one otherwise):-
As per my knowledge Spark is faster than map-reduce mainly because it stores data in the memory which it might need to process in several occasions so that it does not have to load the data all the way from disk. Here we are copying data across clusters, so there is no need to process one file (or block or split) more than once as each file will go up into the memory then will be sent over the network, gets copied to the destination cluster disk, end of the story for that file. Then how come Spark makes the process faster if the main feature is not used?
Your bottlenecks on bulk cross-cluster IO are usually
bandwidth between clusters
read bandwidth off the source cluster
write bandwidth to the destination cluster (and with 3x replication, writes do take up disk and switch bandwidth)
allocated space for work (i.e. number of executors, tasks)
Generally on long-distance uploads its your long-haul network that is the bottleneck: you don't need that many workers to flood the network.
There's a famous tale of a distcp operation between two Yahoo! clusters which did manage to do exactly that to part of the backbone: the Hadoop ops team happy that the distcp was going so fast, while the networks ops team are panicing that their core services were somehow suffering due to the traffic between two sites. I believe this incident is the reason that distcp now has a -bandwidth option :)
Where there may be limitations in distcp, it's probably in task setup and execution: the decision of which files to copy is made in advance and there's not much (any?) intelligence in rescheduling work if some files copy fast but others are outstanding.
Distcp just builds up the list in advance and hands it off to the special distcp mappers, each of which reads its list of files and copies it over.
Someone could try doing a spark version of distcp; it could be an interesting project if someone wanted to work on better scheduling, relying on the fact that spark is very efficient at pushing out new work to existing executors: a spark version could push out work dynamically, rather than listing everything in advance. Indeed, it could still start the copy operation while enumerating the files to copy, for a faster startup time. Even so: cross-cluster bandwidth will usually be the choke point.
Spark is not really intended for data movement between Hadoop clusters. You may want to look into additional mappers for your distcp job using the "-m" option.

Hadoop - CPU intensive application - Small data

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.
If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we do for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was do handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general purpose business use case. Thus, Hadoop as described in its common form is not targeted to the use case you and we have. But, Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a special built hardware environment, PSIKLOPS, that is powerful for small cluster (1-10) nodes with enough power at low cost for run 4-20 parallel jobs. We will be showcasing this in a series of web casts by Inside Analysis titled Tech Lab in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use to launch multiple concurrent containers of custom Java.

Data movement HDFS Vs Parallel file system Vs MPI

I'm currently working on implementation of machine learning algorithms on MR-MPI (MapReduce on MPI). And i'm also trying to understand about other MapReduce frameworks especially Hadoop, so the following is my basic question (I'm new to MapReduce frameworks, i aplogize if my question dosen't make sense).
Question: Since MapReduce can be implemented on top of many things such as a parallel file system(GPFS), HDFS, MPI, e.t.c.,. After the map step there is a collate operation and then followed by a reduce operation. For a collate operation we need some data movement to happen across the nodes. In this regard i would like to know what is the difference in data movement mechanisms(between nodes) in HDFS Vs GPFS Vs MPI.
I appreciate if you provide me some good explanation and can give me some good references on each of these so i can get into further details.
Thanks.
MapReduce as a paradigm can bi implemented on many storage systems. Indeed Hadoop has so called DFS (Distributed file system) abstraction which enable integration of different storage system and run MapReduce over them. For example there are Amazon S3, Local file system, Opens Stack Swift and other integrations.
In the same time HDFS integration has one special property - it reports to the MR engine (JobTracker, to be more specific) where data resides and it enable smart scheduling of Mapping in the way that data to be processed by each Mapper is usually collocated with the Mapper.
As a result during Mapping phase data is not moved over network when MR run over HDFS.
To be more general can be stated that idea of Hadoop MR is to move code to data and not opposite, and it should be important criteria when evaluating any scalable MR implementation - does this system care that mappers process local data.
The OP has mixed a couple of things - messaging and file system, so there are multiple ansewers.
Hadoop/MAPI is a WIP and you can find more details here.
Hadoop/GPFS is still open.
Hadoop/HDFS comes out of the box from Apache Hadoop. For data transfer between the mappers and reducers HTTP is used, not sure why.

Performance comparison between hadoop Pseudo-Distributed Operation and Standalone Operation

I'm a very beginner of hadoop. But I had this interesting observation.
Using the example in hadoop documentation,
By running the same example in Standalone Operation and Pseudo-Distributed Operation, the standalone one took less than 1 minute but Pseudo-distributed operation took more than 3 minutes. This is big difference. I could understand there are extra network and scheduling overhead in distributed mode. But the difference just seems to be too much. This may not be a real comparison because the example is very simple.
My question is, how much difference did you experience between the standalone and distributed mode for a real-world job?
These are reasonably different scenarios. In stand-alone mode, it never starts up a proper one-node Hadoop cluster. Everything happens locally, inline, in the JVM. Data never has to be even written out to disk, potentially. Pseudo-distributed operation is the smallest "real" Hadoop installation, of one local node. You have to read/write data to a local HDFS instance, spawn another JVM, etc. All of that adds a lot of overhead. Maybe the overhead is indeed a few minutes. This seems entirely sensible to me.
Hadoop frame work is meant for processing BIG DATA..
So the size of the data matters a lot ,because ,a smaller file would get processed in traditional file system very quickly than in hadoop because hadoop mapreduce frame work has internal work to do (to make chunks of data file and to send it to data nodes and while processing again access from data nodes ).So for a smaller files ,hadoop frame work is not suitable.
Coming to standalone and pseudo distributed mode ,one aspect u should consider is size of the file and second being actual difference in standalone and pseudo distributed mode.
In standalone mode there is no concept of HDFS,data is not copied to hadoop distributed file system (obviously time saved).Where as in pseudo distributed mode ,hdfs involved which need to be copied with the data that's need to be processed.
Small size data files better to use traditional file processing and if the file size become huge and huge ,hadoop framework gives better processing time!
Hope this helps!

Hadoop: Disadvantages of using just 2 machines?

I want to do log parsing of huge amounts of data and gather analytic information. However all the data comes from external sources and I have only 2 machines to store - one as backup/replication.
I'm trying to using Hadoop, Lucene... to accomplish that. But, all the training docs mention that Hadoop is useful for distributed processing, multi-node. My setup does not fit into that architecture.
Are they any overheads with using Hadoop with just 2 machines? If Hadoop is not a good choice are there alternatives? We looked at Splunk, we like it, but that is expensive for us to buy. We just want to build our own.
Hadoop should be used for distributed batch processing problems.
5-common-questions-about-hadoop
Analysis of log files is one of the more common uses of Hadoop, its one of the tasks Facebook use it for.
If you have two machines, you by definition have a multi-node cluster. You can use Hadoop on a single machine if you want, but as you add more nodes the time it takes to process the same amount of data is reduced.
You say you have huge amounts of data? These are important numbers to understand. Personally when I think huge in terms of data, i think in the 100s terabytes+ range. If this is the case, you'll probably need more than two machines, especially if you want to use replication over the HDFS.
The analytic information you want to gather? Have you determined that these questions can be answered using the MapReduce approach?
Something you could consider would be to use Hadoop on Amazons EC2 if you have a limited amount of hardware resources. Here are some links to get you started:
hadoop-world-building-data-intensive-apps-with-hadoop-and-ec2
Hadoop Wiki - AmazonEC2

Resources