I have read through the tutorials from Apache and Yahoo on DistributedCache. I am still confused about one thing though. Suppose I have a file which I want to be copied to all data nodes. So, I use
DistributedCache.addCacheFile(new URI(hdfsPath),job) in the job Driver to make the file available. Then, I call DistributedCache.getLocalCacheFiles(job) inside my Mapper.
Now, I want to create an array on the data node based on the contents of this file so that each time map() runs, it can access the elements of the array. Can I do this? I am confused because if I read the cached file and create the array within the Mapper class, it seems like it would create the array for each new input to the Mapper rather than just once per Mapper. How does this part actually work (i.e., where/when should I create the array)?
There are a few concepts mixed here.
The datanode has nothing to do directly with the DistributedCache; the cache is a concept of the MapReduce layer.
Wanting to reuse the same data derived from the cached file across mappers somewhat contradicts the functional nature of the MR paradigm: mappers should be logically independent.
What you want is a kind of optimization, and it makes sense if preprocessing the cached file for the mappers is relatively expensive.
You can do this to some extent by saving the preprocessed data in a static variable, evaluating it lazily, and configuring Hadoop to reuse the JVM between tasks. It is not a solution in the "MR" spirit, but it should work.
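A minimal sketch of that approach with the new (org.apache.hadoop.mapreduce) Mapper API - the class name, the static LOOKUP array and the simple line-per-element parsing are illustrative assumptions, not a prescribed layout:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Shared by every map() call of this task, and by later tasks too
    // if JVM reuse (mapred.job.reuse.jvm.num.tasks) is enabled.
    private static String[] LOOKUP;

    @Override
    protected void setup(Context context) throws IOException {
        if (LOOKUP != null) {
            return;                                   // already built in this JVM
        }
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        LOOKUP = lines.toArray(new String[0]);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // LOOKUP is available here on every call, without re-reading the file.
    }
}

setup() runs once per task attempt, so the file is parsed once per mapper rather than once per input record; with JVM reuse enabled, later tasks in the same JVM skip the parsing entirely because the static field is already populated.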
A better solution would be to preprocess the cached file into a form that is cheap for the mapper to consume.
Let's assume the whole idea is an optimization - otherwise reading and processing the file in each mapper is just fine.
Put differently: if preparing the file for each mapper is much cheaper than the map processing itself, or much cheaper than the overhead of running a mapper, we are fine.
By form I mean a file format that can be converted very efficiently into the in-memory structure we need. For example, if we need to search within the data, we can store it already sorted. That saves us sorting each time, which is usually much more expensive than sequential reading from disk.
If in your case it is a modest number of properties (say, thousands), I would assume that reading and initializing them is insignificant compared to a single mapper's work.
Related
I am curious whether Spark first reads the entire file into memory and only then starts processing it (applying transformations and actions), or whether it reads the first chunk of a file, applies transformations to it, reads the second chunk, and so on.
Is there any difference between Spark and Hadoop in this respect? I read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read the file for the first time and map the keys?
Thanks
I think a fair characterisation would be this:
Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem HDFS to begin with.
During the Mapping phase both will read all data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic.
Both of them do in fact try to cache the just-mapped data in memory, in addition to spilling it to disk for the Shuffle to do its work.
The difference here though is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node.
Since Spark also does lazy evaluation, its memory use ends up being very different from Hadoop's, because it plans computation and caching simultaneously.
In the steps of a word-count job, Hadoop does this (a code sketch follows the list):
Map all the words to 1.
Write all those mapped pairs of (word, 1) to a single file in HDFS (single file could still span multiple nodes on the distributed HDFS) (this is the shuffle phase)
Sort the rows of (word, 1) in that shared file (this is the sorting phase)
Have the reducers read sections (partitions) from that shared file that now contains all the words sorted and sum up all those 1s for every word.
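In code, those Hadoop steps look roughly like the classic word count below - a sketch with illustrative class names and simple whitespace tokenization:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Step 1: map every word to (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Steps 2-4 happen between these classes: the framework writes, shuffles and
    // sorts the (word, 1) pairs, then each reducer sums the 1s for its partition.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable one : ones) {
                sum += one.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}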
Spark on the other hand will go the other way around:
It figures that, like in Hadoop, it is probably most efficient to have all those words summed up in separate reducer runs, so it decides, based on some factors, to split the job into x parts and then merge them into the final result.
So it knows that words will have to be sorted which will require at least part of them in memory at a given time.
After that it evaluates that such a sorted list will require all words mapped to (word, 1) pairs to start the calculation.
It works through step 3, then 2, then 1.
Now the trick relative to Hadoop is that in step 3 it knows which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly K-V pairs) will be needed in the final step 1.
This allows Spark to plan the execution of jobs very efficiently, caching data it knows will be needed in later stages. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead into the following stages, simply cannot use memory this efficiently, and hence doesn't waste resources keeping in memory the large chunks that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs from a map phase will be needed in the next step.
So the fact that Spark appears to keep the whole dataset in memory isn't something Spark actively does; it is rather a result of the way Spark is able to plan the execution of a job.
On the other hand, Spark may actually be able to keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here, in my opinion.
Here Spark would have planned ahead and would immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory on shuffling the repeated words too (I acknowledge there are a million ways to make Hadoop do this as well, but it's not out of the box; there are also ways of writing your Spark job badly enough to break these optimisations, but it's not so easy to fool Spark here :)).
Hope this helps understand that the memory use is just a natural consequence of the way Spark works, but not something actively aimed at and also not something strictly required by Spark. It is also perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
For more insight into this I recommend learning about Spark's DAG scheduler to see how this is actually done in code.
You'll see that it always follows the pattern of working out where what data is and will be cached before figuring out what to calculate where.
Spark uses lazy iterators to process data and can spill data to disk if necessary. It doesn't read all data in memory.
The difference compared to Hadoop is that Spark can chain multiple operations together.
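To make the laziness concrete, here is a rough sketch using Spark's Java API (assuming Spark 2.x; the input/output paths and app name are placeholders) - nothing is read until the final action runs, and each partition is streamed through the chained operations rather than being loaded whole:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ChainedWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("chained-word-count"));

        // Nothing is read here yet: these calls only build the execution plan.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");   // path is a placeholder
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Only this action triggers reading; partitions flow through the chained
        // operations with lazy iterators, spilling to disk when memory runs short.
        counts.saveAsTextFile("hdfs:///data/output");                    // path is a placeholder

        sc.stop();
    }
}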
I was reading Hadoop: The Definitive Guide, where it is written that MapReduce is good for updating larger portions of a database, and that it uses Sort & Merge to rebuild the database, which is limited by transfer time.
It also says an RDBMS is good for updating only smaller portions of a big database; it uses a B-Tree, which is limited by seek time.
Can anyone elaborate on what both these claims really mean?
I am not really sure what the book means, but you will usually do a MapReduce job to rebuild the entire database (or anything else) if you still have the raw data.
The really good thing about Hadoop is that it's distributed, so performance is not really a problem since you can just add more machines.
Let's take an example: you need to rebuild a complex table with 1 billion rows. With an RDBMS, you can only scale vertically, so you will depend more on the power of the CPU and how fast the algorithm is. You will be doing it with some SQL commands: you will need to select some data, process it, and so on. So you will most likely be limited by the seek time.
With Hadoop MapReduce, you could just add more machines, so performance is not the problem. Let's say you use 10,000 mappers; that means the task will be divided among 10,000 mapper containers, and because of Hadoop's nature, all these containers usually already have the data stored locally on their hard drives. The output of each mapper is always a key-value structured format on its local hard drive, and these outputs are sorted by key by the mapper.
Now the problem is that the data needs to be combined, so all of it is sent to a reducer. This happens over the network and is usually the slowest part if you have big data. The reducer will receive all of the data and merge-sort it for further processing. In the end you have a file which can simply be uploaded to your database.
The transfer from mapper to reducer is usually what takes the longest if you have a lot of data, and the network is usually your bottleneck. Maybe that is what the book meant by being dependent on transfer time.
The mapping phase of my Hadoop program generates a great number of unique keys (around 200K keys for one data set and 900K for another). Each key is a string of 60 numerical characters. The sorting/shuffling phase of my Hadoop program takes too long. Is there any way to make the sorting/shuffling phase more efficient for such a great number of keys?
You should consider using combiners to reduce network overhead by combining the map-phase outputs before they are sent over to the reducers.
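A minimal driver-side sketch (MyMapper and MyReducer are placeholder classes; reusing the reducer as the combiner only works when the reduce function is associative and commutative, e.g. summing counts):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Inside the job driver:
Job job = Job.getInstance(new Configuration(), "sum-job");   // job name is illustrative
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setCombinerClass(MyReducer.class);   // pre-aggregates map output locally before the shuffle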
You are right regarding WritableComparator: it is better to implement your own, because as far as I know, the way Hadoop compares two objects in the sort phase is to deserialize the serialized mapper outputs in order to establish an order. It is much better to avoid that deserialization step and do the comparison at the byte level.
You have to be careful when overriding the compare method of WritableComparator, because it can be pretty challenging to do properly. The method I'm referring to is on GrepCode:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/io/WritableComparator.java#WritableComparator.compare%28byte%5B%5D%2Cint%2Cint%2Cbyte%5B%5D%2Cint%2Cint%29
EDIT
I'm adding what I consider a great article, for some pointers on improving MapReduce performance:
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
You should probably create a custom key type. There are a few reasons for this:
By having a numeric (binary) key, you can implement Comparable<BinaryComparable>, which compares bytes instead of text, allowing for a speed increase
You can have the key serialized in a binary format, which saves time when transmitting and reading the key. If we were to write a key class, we could extend BytesWritable, which already implements the interface mentioned in the first bullet; a sketch follows this list.
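A rough sketch of such a key - the class name and packing are illustrative; the point is the binary payload plus a registered raw comparator so the shuffle sorts serialized bytes without deserializing anything:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparator;

public class DigitStringKey extends BytesWritable {

    static {
        // Reuse BytesWritable's raw comparator for this subclass, so sorting stays at byte level.
        WritableComparator.define(DigitStringKey.class, new BytesWritable.Comparator());
    }

    public DigitStringKey() {
    }

    public DigitStringKey(String digits) {
        // The 60 numeric characters are ASCII, so byte order equals string order.
        super(digits.getBytes(StandardCharsets.US_ASCII));
    }
}

The driver would then call job.setMapOutputKeyClass(DigitStringKey.class); if the static registration is not picked up in your Hadoop version, setting job.setSortComparatorClass(BytesWritable.Comparator.class) explicitly should have the same effect.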
There are some job parameters that you should probably tune. For instance, you might want to consider tuning the io.sort options within your job. Because you have a lot of unique values, Hadoop probably isn't able to sort them all in memory, meaning that it must spill to the disk. When this happens, the data must be re-read and re-sorted, slowing down the shuffle. You can tell if spills are happening by looking through your log since spills are recorded. For tuning tips, see http://www.slideshare.net/cloudera/mr-perf
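For example, a sketch of the relevant knobs (the values are purely illustrative, and the property names changed across versions: io.sort.mb / io.sort.spill.percent / io.sort.factor in older releases, the mapreduce.* names below in Hadoop 2.x):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 512);              // larger in-memory sort buffer
conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);   // fill the buffer further before spilling
conf.setInt("mapreduce.task.io.sort.factor", 50);           // merge more spill files per pass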
I have ~5000 entries in my Hadoop input file, but I know in advance that some of the lines will take much longer to process than others (in the map stage).
(Mainly because I need to download a file from Amazon S3, and the size of the file will vary between tasks)
I want to make sure that the biggest map tasks are processed first, to make sure that all my hadoop nodes will finish working roughly at the same time.
Is there a way to do that with Hadoop? Or do I need to rework the whole thing? (I am new to Hadoop)
Thanks!
Well, if you implement a custom InputFormat (the getSplits() method contains the split-creation logic), then theoretically you could achieve what you want.
BUT you have to take special care, because the order in which the InputFormat returns the splits is not the order in which Hadoop will process them.
There is a split re-ordering code inside the JobClient:
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new NewSplitComparator());
which will make the whole thing more tricky.
But you could implement a custom InputFormat plus a custom InputSplit and make the InputSplit#getLength() dependent on its expected execution time.
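A rough sketch of what such a split could look like (the class name and the cost field are illustrative; note the caveat about record readers that also rely on getLength()):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// A split that reports an estimated processing cost as its "length", so the
// size-based reordering quoted above schedules the most expensive work first.
// If your RecordReader also uses getLength() to decide how many bytes to read,
// keep the real byte length in a separate field or use a custom RecordReader.
public class WeightedFileSplit extends FileSplit {

    private long estimatedCost;

    public WeightedFileSplit() {
        // required for Writable deserialization
    }

    public WeightedFileSplit(Path file, long start, long length,
                             String[] hosts, long estimatedCost) {
        super(file, start, length, hosts);
        this.estimatedCost = estimatedCost;
    }

    @Override
    public long getLength() {
        return estimatedCost;   // biggest estimated cost => scheduled first
    }

    @Override
    public void write(DataOutput out) throws IOException {
        super.write(out);
        out.writeLong(estimatedCost);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        super.readFields(in);
        estimatedCost = in.readLong();
    }
}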
I need to parse a few XMLs to TSV. The size of the XML files is on the order of 50 GB. I am unsure which implementation I should choose to parse them; I have two options:
use a SAXParser
use Hadoop
I have a fair idea about a SAXParser implementation, but I think that, having access to a Hadoop cluster, I should use Hadoop, since this is what Hadoop is for, i.e. Big Data.
It would be great if someone could provide a hint/doc on how to do this in Hadoop, or an efficient SAXParser implementation for such a big file - or rather, which should I go for: Hadoop or a SAXParser?
I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way... the other is to write SAX code) since you can still operate on the records in a dom-like fashion.
With these large files, one thing to keep in mind is that you'll most definitely want to enable compression on the mapper output: Hadoop, how to compress mapper output but not the reducer output... this will speed things up quite a bit.
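For reference, a minimal sketch of enabling this from a Java driver configuration (the property names below are the Hadoop 2.x ones; older releases use mapred.compress.map.output and mapred.map.output.compression.codec, and Snappy is just one reasonable codec choice):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output only
conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);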
I've written a quick outline of how I've handled all this, maybe it'll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees which makes things really simple....
I don't know about SAXParser, but Hadoop will definitely do your job if you have a Hadoop cluster with enough data nodes. 50 GB is nothing; I have run operations on more than 300 GB of data on my cluster. Write a MapReduce job in Java; the documentation for Hadoop can be found at http://hadoop.apache.org/
It is relatively trivial to process XML on Hadoop by having one mapper per XML file. This approach is fine for a large number of relatively small XMLs.
The problem is that in your case the files are big and their number is small, so without splitting, Hadoop's benefit will be limited. Taking Hadoop's overhead into account, the benefit may even be negative...
In Hadoop we need to be able to split input files into logical parts (called splits) to efficiently process large files.
In general, XML does not look like a "splittable" format, since there is no well-defined division into blocks that can be processed independently. At the same time, if the XML contains "records" of some kind, splitting can be implemented.
A good discussion about splitting XMLs in Hadoop is here:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
where Mahout's XML input format is suggested.
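A rough sketch of how such a record-splitting input format is typically wired up, assuming Mahout's XmlInputFormat from the linked article (the property keys, the <record> tag and the job name are assumptions on my part; check the class you actually ship, since its package differs between Mahout versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.set("xmlinput.start", "<record>");    // opening tag of one logical record
conf.set("xmlinput.end", "</record>");     // closing tag of one logical record

Job job = Job.getInstance(conf, "xml-to-tsv");
job.setInputFormatClass(XmlInputFormat.class);   // Mahout's class; each map() then sees one <record>...</record> block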
Regarding your case: I think that as long as the number of files is not much bigger than the number of cores you have in a single system, Hadoop will not be an efficient solution.
At the same time, if you want to accumulate files over time, you can also profit from Hadoop as scalable storage.
I think that SAX has traditionally been mistakenly associated with processing big XML files... in reality, VTD-XML is often the best option, far better than SAX in terms of performance, flexibility, code readability and maintainability... on the issue of memory, VTD-XML's in-memory model is only 1.3x~1.5X the size of the corresponding XML document.
VTD-XML has another significant benefit over SAX: its unparalleled XPath support. Because of it, VTD-XML users routinely report performance gain of 10 to 60x over SAX parsing over hundreds of MB XML files.
http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java#anch104307
Read this paper that comprehensively compares the existing XML parsing frameworks in Java.
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf