Purpose of Hadoop MapReduce

I am currently reading some papers about Hadoop and the popular MapReduce algorithm. However, I cannot see the value of MapReduce and would be glad if someone could give some insight into it. Specifically:
It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the keys are the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should be able to do the same thing without a specific algorithm being implemented.
If everything is converted to key-value pairs, then isn't what Hadoop does just creating a Dictionary like in Java and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
What do I gain by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What might be the purpose of counting the number of words?
It is said that Hadoop can be used for unstructured data. If everything is converted to key-value pairs, then it is only natural that Hadoop can work with unstructured data! I could write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot get from other kinds of programming tools?
These questions may seem correlated with each other, but I believe they convey the idea behind my question. I would be glad if you could answer them.
Regards,
Edit:
Hi Guys,
Thank you very much for your responses.
Based on what I understood from your answers and from playing with Hadoop a little, I would like to state my conclusions in a very basic, high-level way:
Hadoop processes data through key-value pairs. Everything is converted to key-value pairs.
The main attention should be given to the definitions of the key and the value, which may change according to business needs.
Hadoop provides just an efficient (i.e., distributed, scalable, and able to handle huge amounts of data) implementation of a dictionary, nothing more.
Any comments on these outcomes are welcome.
As a final note, I would like to add that for a simple MapReduce implementation, I believe there should be a user interface which enables the user to select/define the keys and the appropriate values. This UI could also be extended for further statistical analysis.
Regards,

It is said that MapReduce receives a file and produces key-value pairs. What is a key? Just a word, a combination of words, or something else? If the keys are the words in the file, then what is the purpose of writing code for MapReduce? MapReduce should be able to do the same thing without a specific algorithm being implemented.
MapReduce should be visualized as a distributed computing framework. In the word count example the key is a word, but the key can be anything (APIs are available for common cases and we can write custom ones as well). The purpose of having a key is to partition, sort, and merge the sorted data in order to perform aggregations. The map phase is used to perform row-level transformations, filtering, etc., and the reduce phase takes care of aggregation. Map and Reduce have to be implemented by the developer; the shuffle phase, which typically works out of the box, takes care of partitioning, sorting, and merging.
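To make this concrete, here is a minimal sketch of the word count map and reduce functions using Hadoop's org.apache.hadoop.mapreduce API (class and variable names are my own, and the job/driver setup is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: row-level transformation. Input key = byte offset of the line,
// input value = the line itself; output = (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: aggregation. The framework has already partitioned, sorted
// and merged the map output, so all counts for a given word arrive together.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}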
If everything is converted to key-value pairs, then isn't what Hadoop does just creating a Dictionary like in Java and C#, right? Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
Covered as part of the previous question.
What do I gain by converting a file to key-value pairs? I know I can find the counts and frequencies of the words, but for what? What might be the purpose of counting the number of words?
You can perform transformations, filtering, aggregations, joins, and any custom task on unstructured data. The major difference is that the processing is distributed, so it scales better than legacy solutions.
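For instance, filtering is often written as a map-only job. Below is a minimal sketch (the class name and the "ERROR" pattern are made up for illustration); with job.setNumReduceTasks(0) the map output is written straight to HDFS:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filter: emit only the log lines that contain "ERROR".
// Configured with zero reducers, the map output becomes the job output.
class ErrorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains("ERROR")) {
            context.write(line, NullWritable.get());
        }
    }
}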
It is said that Hadoop can be used for unstructured data. If everything is converted to key-value pairs, then it is only natural that Hadoop can work with unstructured data! I could write a program in C# to generate the key-value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot get from other kinds of programming tools?
The key can be the line offset, and then you can process each record. It does not matter whether every record has the same structure or a different one.
Here are the advantages of using Hadoop:
Distributed file system (HDFS)
Distributed processing framework (map reduce)
Data locality (in many conventional applications, files are network-mounted, so the data, which is much bigger than the code, has to be copied to the servers where the code is deployed; in Hadoop the code goes to the data, and the Hadoop success stories do not rely on a network file system)
Limited network usage while storing and processing very large data sets
Cost-effective (open source software on commodity hardware)
and many more.

Take the word count example to get a better understanding.
What is a key? Just a word, a combination of words or something else?
For Mapper:
The key is the offset from the beginning of the file, and the value is the entire line. Once a line is read from the file, it is split into multiple key-value pairs for the Reducer; a delimiter such as a tab, a space, or characters like , and : is used to split the line into key-value pairs.
For Reducer:
The key is an individual word, and the value is the number of occurrences of that word.
Once you get the key-value pairs at the reducer, you can run many kinds of aggregation/summarization/categorization on the data and provide an analytical summary of it.
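As a small worked example (my own, following the word count flow described above): given an input line at offset 0 containing "to be or not to be", the mapper emits (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1); after the shuffle/sort the reducer receives (be, [1, 1]), (not, [1]), (or, [1]), (to, [1, 1]) and writes out (be, 2), (not, 1), (or, 1), (to, 2).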
Have a look at this use case article, which covers Financial, Energy, Telecom, Retail, etc.
Have a look at this article for a better understanding of the entire word count example, and at the MapReduce tutorial.
What is the purpose of writing code for MapReduce? MapReduce should be able to do the same thing without a specific algorithm being implemented.
Hadoop has four key components.
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Maybe Hadoop can create the dictionary in a more efficient way. Other than efficiency, what does Hadoop provide that a normal Dictionary object cannot?
Creating the dictionary is not the core purpose. Hadoop creates this dictionary and then uses the key-value pairs to solve business use cases, depending on the requirements.
The word count example may produce just a word and its count as output, but you can process structured, semi-structured, and unstructured data for various use cases:
Find the hottest year/month/day/hour for a given place, anywhere on the planet.
Find the number of buy/sell transactions of a particular stock on the NYSE on a given day. Provide minute-wise/hour-wise/day-wise summaries of transactions per stock. Find the top 10 most heavily traded stocks on a given day.
Find the number of tweets/retweets for a particular tag.
What may be the purpose of counting the number of words?
The purpose is explained in the earlier answers.
I can write a program in C# to generate the key value pairs instead of using Hadoop. What is the real value of Hadoop that I cannot utilize by using other kinds of programming tools?
How much data can you handle by writing C# code to produce key-value pairs and process the data? Can you process 10 petabytes of weather information on a 5000-node cluster using C#, with a distributed storage/processing framework developed in C#?
How do you summarize the data, or find the top 10 coolest/hottest places, using C#?
You would have to develop a framework to do all of these things, and Hadoop already provides that framework.
HDFS is used for distributed storage of data in volumes of petabytes. If you need to handle data growth, just add more nodes to the Hadoop cluster.
Hadoop MapReduce and YARN provide the framework for distributed data processing, working on data stored across thousands of machines in the Hadoop cluster.
Image source: kickstarthadoop (article author: Bejoy KS)

Related

How does Hadoop/MapReduce scale when input data is NOT stored?

The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored locally on the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes, regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how it is distributed can make a significant difference in how well/quickly a Hadoop job can run.
As to your question and example: if you are going to generate the input data, you can either generate it before the first job runs or generate it within the first mapper. If you generate it within the mapper, you can figure out which node the mapper is running on and then generate just the data that would be reduced in that partition (use a partitioner to direct data between mappers and reducers).
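As an illustration of the partitioner idea (my own sketch, assuming the map output keys are IPv4 addresses written as dotted Text values), a custom Partitioner could route each address to a reducer by its /16 prefix:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys are dotted IPv4 addresses (as Text), and
// each reducer handles a contiguous block of /16 prefixes.
public class SubnetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String[] octets = key.toString().split("\\.");
        int prefix = Integer.parseInt(octets[0]) * 256 + Integer.parseInt(octets[1]);
        return prefix % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(SubnetPartitioner.class).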
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would in Hadoop.
You are probably trying to run a non-MapReduce task on a MapReduce cluster then (e.g. IP scanning?). There may be more appropriate tools for this, you know...
Something that few people realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. Having checkpointing and recovery built into the architecture reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce. It's checkpointed before, and it's checkpointed after. And if it fails, only this part of the job is re-run.
You can easily outperform MapReduce by leaving out the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails, and then you will wish you had been doing similar checkpointing...
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are ways to hack around this. For example, you could use /16 subnets as input. Each mapper reads a /16 subnet and does its job on that. It's not that much fake input to generate once you realize that you don't need to materialize all 2^32 IPs yourself, unless you have that many nodes in your cluster...
The number of mappers depends on the number of splits generated by the InputFormat implementation.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally, and there are many reports that it does not work as expected.
If you really need to, you could create your own InputFormat implementation which generates the InputSplits for your virtual data and forces as many mappers as you need.
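Here is a minimal driver sketch of the NLineInputFormat approach (the driver name and the one-range-per-line seed file are my own assumptions; mapper, reducer, and output settings are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class IpScanDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ip-scan");
        job.setJarByClass(IpScanDriver.class);
        // Seed file: one IP range (e.g. a /16 subnet) per line. Each mapper
        // receives exactly one line and expands it into the addresses to process.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 1); // one split (one mapper) per line
        // mapper/reducer/output configuration omitted for brevity
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}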

cassandra and hadoop - realtime vs batch

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop---Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx
Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.
What are the differences in the architecture/implementation of Cassandra and Hadoop that account for this sort of difference in usage (in lay software-professional terms)?
I wanted to add this because I think there might be a misleading statement here suggesting that Cassandra performs well for reads.
Cassandra is not very good at random reads either. It is good, compared to other solutions out there, at reading randomly over a huge amount of data, but at some point, if the reads are truly random, you cannot avoid hitting the disk every single time, which is expensive; throughput may come down to something as low as a few thousand reads per second, depending on your cluster. So planning on doing lots of random queries might not be the best idea; you'll run into a wall if you start thinking like that. I'd say everything in big data works better when you do sequential reads or find a way to store the data sequentially. In most cases, even when you do real-time processing, you still want to find a way to batch your queries.
This is why you need to think beforehand about what you store under a key and try to get the most information possible out of each read.
It's also kind of funny that the statement puts "transaction" and "Cassandra" in the same sentence, because transactions are not really something Cassandra does.
On the other hand, Hadoop is meant to be batch almost by definition, but Hadoop is a distributed MapReduce framework, not a database. In fact, I've seen and used a lot of Hadoop over Cassandra; they're not antagonistic technologies.
Handling your big data in real time is doable but requires good thinking and care about when and how you hit the database.
Edit: Removed secondary indices example, as last time I checked that used random reads (though I've been away from Cassandra for more than a year now).
Vanilla Hadoop consists of a distributed file system (DFS) at its core and libraries that support the MapReduce model for writing analysis programs. The DFS is what enables Hadoop to be scalable: it takes care of chunking data across multiple nodes in a multi-node cluster, so that MapReduce can work on the individual chunks of data on the nodes where they are available, thus enabling parallelism.
The paper for the Google File System, which was the basis for the Hadoop Distributed File System (HDFS), can be found here.
The paper for the MapReduce model can be found here.
For a detailed explanation of MapReduce, read this post.
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional database but is more like a Hashtable or HashMap that stores key/value pairs. Unlike HBase, which runs on top of HDFS, Cassandra manages its own storage. Both Cassandra and HBase follow the design of Google's BigTable; the paper for Google BigTable can be found here.
BigTable makes use of a Sorted String Table (SSTable) to store key/value pairs. An SSTable is just a file in the underlying file system that stores keys followed by their values. Furthermore, BigTable maintains an index mapping each key to its offset in the file, which allows the value for a key to be read with a single seek to that offset. An SSTable is effectively immutable, which means that after the file is created, no modifications can be made to existing key/value pairs; new key/value pairs are simply appended to the file. Updates and deletes are also appended: an update as a newer key/value pair, and a delete as a key with a tombstone value, so duplicate keys are allowed in an SSTable. The index is also updated whenever an update or delete takes place, so that the offset for that key points to the latest value or to the tombstone.
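To make the append-plus-index idea concrete, here is a toy, single-machine sketch (my own, not Cassandra or BigTable code) of an append-only log with an index that always points at the latest record for a key, using a tombstone for deletes:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of an append-only key/value log with an index.
// Records are only ever appended; a delete is an appended tombstone;
// the index always points at the newest record for each key.
class ToySSTable {
    private static final String TOMBSTONE = "\u0000TOMBSTONE";

    private final List<String[]> log = new ArrayList<>();       // append-only (key, value) records
    private final Map<String, Integer> index = new HashMap<>(); // key -> position of latest record

    public void put(String key, String value) {
        log.add(new String[] { key, value }); // duplicate keys are allowed in the log
        index.put(key, log.size() - 1);       // index now points at the newest record
    }

    public void delete(String key) {
        put(key, TOMBSTONE);                  // deletion is just another append
    }

    public String get(String key) {
        Integer position = index.get(key);
        if (position == null) {
            return null;
        }
        String value = log.get(position)[1];
        return TOMBSTONE.equals(value) ? null : value; // tombstone means "deleted"
    }
}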
Thus you can see that Cassandra's internals allow fast reads and writes, which is crucial for real-time data handling, whereas vanilla Hadoop with MapReduce is used to process batch-oriented, passive data.
Hadoop consists of two fundamental components: a distributed datastore (HDFS) and a distributed computation framework (MapReduce). It reads a bunch of input data from the datastore and then writes the output back to it. It needs a distributed datastore because it performs parallel computing on data local to each machine in the cluster, to minimize data loading time.
Cassandra, on the other hand, is a datastore with linear scalability and fault tolerance. It lacks the parallel computation capability provided by MapReduce in Hadoop.
Hadoop's default datastore (HDFS) can be replaced with other storage backends, such as Cassandra, GlusterFS, Ceph, Amazon S3, Microsoft Azure's file system, MapR's FS, etc. However, each alternative has its pros and cons, and they should be evaluated based on your needs.
There are some resources that help you integrate Hadoop with Cassandra: http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configHadoop.html

Is Hadoop the right tech for this?

If I had millions of records of data that are constantly being updated and added to every day, and I needed to comb through all of the data for records matching specific logic and then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.
MapReduce is designed for algorithms that can be parallelized, where local results can be computed and then aggregated. A typical example is counting words in a document: you can split the work into multiple parts, counting some of the words on one node and some on another, and then add up the totals (this is obviously a trivial example, but it illustrates the type of problem).
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of having non-uniformly structured data, you might consider a NoSQL database, which is designed to handle data where a lot of the columns are null (such as MongoDB).
Hadoop/MapReduce is designed for batch processing, not for real-time processing, so other alternatives like Twitter Storm or HStreaming have to be considered.
Also, look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude, and a lot of improvement/work remains to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not great, and millions of records do not sound like they are, I would suggest trying to get the most out of an RDBMS, even if your schema is not properly normalized.
I think even a table with the structure K1, K2, K3, Blob would be more useful.
In NoSQL, key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is MongoDB/CouchDB's capability to index schemaless data; you would be able to get records by some attribute value.
Regarding Hadoop MapReduce - I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.

Is the input format responsible for implementing data locality in Hadoop's MapReduce?

I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach could be taken with HBase by querying to determine which regions are serving certain records.
If a developer writes their own input format would they be responsible for implementing data locality?
You're right. If you look at the FileInputFormat class and the getSplits() method, you will see that it searches for the block locations:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
This implies a FileSystem query. It happens inside the JobClient, and the results get written into a SequenceFile (actually it's just raw bytes).
The JobTracker then reads this file later on while initializing the job, and pretty much just assigns a task to each input split.
BUT the distribution of the data is the NameNode's job.
To your question now:
Normally you extend FileInputFormat, so you are required to return a list of InputSplits, and in the initialization step each split needs to have its locations set. For example, the FileSplit constructor:
public FileSplit(Path file, long start, long length, String[] hosts)
So you don't actually implement the data locality itself; you just tell the framework on which hosts the split can be found, and this is easily queryable through the FileSystem interface.
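Putting the two pieces together, a custom input format might build its splits roughly like this (a sketch with a made-up class name, one split per HDFS block; the record reader is omitted):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical input format: one split per HDFS block, passing the block's
// replica hosts to the FileSplit so the scheduler can place the map task
// on (or near) one of those hosts.
public class BlockPerSplitInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        for (FileStatus file : listStatus(job)) {
            FileSystem fs = file.getPath().getFileSystem(job.getConfiguration());
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            for (BlockLocation block : blocks) {
                // The hosts array is only a locality hint; HDFS decided the placement.
                splits.add(new FileSplit(file.getPath(), block.getOffset(),
                        block.getLength(), block.getHosts()));
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // A real implementation would return e.g. a LineRecordReader here.
        throw new UnsupportedOperationException("sketch only");
    }
}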
My understanding is that data locality is jointly determined by HDFS and the InputFormat. The former determines (via rack awareness) and stores the locations of HDFS blocks across datanodes, while the latter determines which blocks are associated with which split. The JobTracker tries to optimize which splits are delivered to which map task by making sure that the blocks associated with each split (there is a 1-to-1 mapping between splits and map tasks) are local to the TaskTracker.
Unfortunately, this approach to guaranteeing locality works in homogeneous clusters but breaks down in inhomogeneous ones, i.e., ones where datanodes have hard disks of different sizes. If you want to dig deeper into this, you should read this paper (Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters), which also touches on several topics related to your question.

hadoop behind the scenes

Can someone explain what Hadoop is in terms of the ideas behind the software? What makes it so popular and/or powerful?
Hadoop is a programming environment which enables running massive computations in parallel on a large cluster of machines. It is resilient to loss of several machines, scalable to enable faster computations by adding machines and trackable to report the computation status.
Hadoop is popular because it is a strong open source environment and because many users, including large ones such as Yahoo!, Microsoft and Facebook, employ it for large data-crunching projects.
It is powerful because it uses the map/reduce algorithm, which decomposes a computation into a sequence of two simple operations:
map - Take a list of items and perform the same simple operation on each of them. For example, take the text of a web page, tokenize it, and replace every token with the pair word:1.
reduce - Take a list of items and accumulate it using an accumulation operator. For example, take the list of word:1 pairs, count the occurrences of each word, and output a list of the form word:n, where n is the number of times the word appeared in the original list.
Using proper decomposition (which the programmer does) and task distribution and monitoring (which Hadoop does), you get a fast, scalable computation; in our example, a word-counting computation. You can chain tens of maps and reduces together and get implementations of sophisticated algorithms.
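As a single-machine illustration of the same decomposition (plain Java, not Hadoop; names are my own), the map step tokenizes the text and the reduce step sums per key:

import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Not Hadoop: the same map/reduce decomposition on one machine using streams.
// "map" tokenizes the text, grouping plays the role of the shuffle,
// and counting is the per-key reduce.
public class LocalWordCount {
    public static void main(String[] args) {
        String text = "to be or not to be";
        Map<String, Long> counts = Arrays.stream(text.split("\\s+"))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        System.out.println(counts); // prints each word with its count, e.g. be=2, to=2
    }
}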
This is the very high level view. Now go read about MapReduce and Hadoop in further detail.
Hadoop implements Google's MapReduce algorithm; to understand it better, you should read Google's MapReduce paper at http://research.google.com/archive/mapreduce.html
