Apache Pig: Does ORDER BY with PARALLEL ensure consistent hashing/distribution?

If I load one dataset, order it on a specific key with a parallel clause, and then store it, I can get multiple files, part-r-00000 through part-r-00XXX, depending on what I specify in the parallel statement.
If I then load a new dataset, say another day's worth of data, with some new keys, and some of the same keys, order it, and then store it, is there any way to guarantee that part-r-00000 from yesterday's data will contain the same keyspace as part-r-00000 from today's data?
Is there even a way to guarantee that all of the records for a given key will be contained in a single part file, or is it possible that a key could get split across two files, given enough records?
I guess the question is really about how the ordering function works in Pig: does it use a consistent hash-mod algorithm to distribute data, or does it order the whole set and then divide it up?
The hope is that if the keyspace is consistently partitioned, it would be easy to perform rolling merges of data per part file. If it is not, the question becomes: is there some operator, or some way of writing the Pig script, that enables that sort of consistent hashing?
Not sure if my question is very clear, but any help would be appreciated - having trouble figuring it out based on the docs. Thanks in advance!

Alrighty, attempting to answer my own question.
It appears that Pig does NOT have a way of ensuring this consistent distribution of results into files. This is partly based on the docs, partly on how Hadoop works, and partly on observation.
When Pig does a partitioned order-by (e.g., using the PARALLEL clause to get more than one reducer), it seems to force an intermediate job between whatever comes before the order-by and the ordering itself. From what I can tell, Pig looks at 1-10% of the data (based on the number of mappers in the intermediate job being 1-10% of the number in the load step) and gets a sampled distribution of the keys you are attempting to sort on.
My guess/thinking is that pig figures out the key distribution, and then uses a custom partitioner from the mappers to the reducers. The partitioner maps a range of keys to each reducer, so it becomes a simple lexical comparison - "is this record greater than my assigned end_key? pass it down the line to the next reducer."
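That guess can be sketched: sample the keys, derive boundary keys, and route each record by lexical comparison against those boundaries. Below is a minimal Python sketch of the idea, not Pig's actual implementation; the key format, sample size, and reducer count are made up for illustration.

```python
import bisect
import random

def build_cut_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sorted sample so each
    reducer is assigned a roughly equal slice of the sampled key space."""
    s = sorted(sample_keys)
    step = len(s) / num_reducers
    return [s[int(step * i)] for i in range(1, num_reducers)]

def partition(key, cut_points):
    """Route a key to the reducer whose range contains it -- a simple
    lexical comparison against the sampled boundaries."""
    return bisect.bisect_right(cut_points, key)

random.seed(0)
keys = [f"user-{i:05d}" for i in range(10_000)]
sample = random.sample(keys, 100)          # the lightweight sampling pass
cuts = build_cut_points(sample, num_reducers=4)
buckets = [0] * 4
for k in keys:
    buckets[partition(k, cuts)] += 1
print(buckets)  # roughly equal counts, but the boundaries depend on the sample
```

Note that a different sample (different seed, different data) moves the boundaries, which is exactly why a key can land in a different part file from one run to the next.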
Of course, there are two factors to consider that basically mean Pig will not be consistent across different datasets, or even across re-runs of the same dataset. First, since Pig samples the data in the intermediate job, it is possible to get a different sample and thus a different key distribution. Second, consider two datasets with widely varying key distributions: Pig would necessarily come up with different boundaries, so if key X was in part-r-00111 one day, it would not necessarily end up there the next day.
Hope this helps anyone else looking into this.
EDIT
I found a couple of resources from O'Reilly that seem to back up my hypothesis.
One is about MapReduce patterns. It basically describes the standard total-order problem as being solvable by a two-pass approach: an "analyze" phase and a final sort phase.
The second is about Pig's order by specifically. It says (in case the link ever dies):
As discussed earlier in “Group”, skew of the values in data is very common. This affects order just as it does group, causing some reducers to take significantly longer than others. To address this, Pig balances the output across reducers. It does this by first sampling the input of the order statement to get an estimate of the key distribution. Based on this sample, it then builds a partitioner that produces a balanced total order...
An important side effect of the way Pig distributes records to minimize skew is that it breaks the MapReduce convention that all instances of a given key are sent to the same partition. If you have other processing that depends on this convention, do not use Pig’s order statement to sort data for it...
Also, Pig adds an additional MapReduce job to your pipeline to do the sampling. Because this sampling is very lightweight (it reads only the first record of every block), it generally takes less than 5% of the total job time.

Related

How to merge two presorted RDDs in Spark?

I have two large csv files presorted by one of the columns. Is there a way to use the fact that they are already sorted to get a new sorted RDD faster, without full sorting again?
The short answer: No, there is no way to leverage the fact that two input RDDs are already sorted when using the sort facilities offered by Apache Spark.
The long answer: Under certain conditions, there might be a better way than using sortBy or sortByKey.
The most obvious case is when the input RDDs are already sorted and represent distinct ranges. In this case, simply using rdd1.union(rdd2) is the fastest (virtually zero cost) way for combining the input RDDs, assuming that all elements in rdd1 come before all elements in rdd2 (according to the chosen ordering).
When the ranges of the input RDDs overlap, things get more tricky. Assuming that the target RDD shall only have a single partition, it might be efficient to use toLocalIterator on both RDDs and then do a merge manually. If the result has to be an RDD, one could do this inside the compute method of a custom RDD type, processing the input RDDs and generating the outputs.
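Assuming both inputs fit through the driver, that single-partition merge can be sketched with Python's `heapq.merge` standing in for the manual merge of the two `toLocalIterator` streams (the data is illustrative):

```python
import heapq

# Two already-sorted sequences, standing in for
# rdd1.toLocalIterator() and rdd2.toLocalIterator().
a = iter([1, 4, 5, 9])
b = iter([2, 3, 5, 8, 10])

# heapq.merge streams both inputs and yields one sorted sequence in
# linear time, holding only one element per input at a time.
merged = list(heapq.merge(a, b))
print(merged)  # [1, 2, 3, 4, 5, 5, 8, 9, 10]
```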
When the inputs are large and thus consist of many partitions, things get even trickier. In this case, you probably want multiple partitions in the output RDD as well. You could use the custom RDD mentioned earlier, but create multiple partitions (using a RangePartitioner). Each partition would cover a distinct range of elements (in the optimal case, these ranges would cover roughly equally sized parts of the output).
The tricky part with this is avoiding processing the complete input RDDs multiple times inside compute. This can be done efficiently using filterByRange from OrderedRDDFunctions when the input RDDs use a RangePartitioner. When they do not use a RangePartitioner, but you know that partitions are ordered internally and also have a global order, you would first need to find out the effective ranges covered by those partitions by actually probing into the data.
As the multiple partition case is rather complex, I would check whether the custom-made sort is really faster than simply using sortBy or sortByKey. The logic for sortBy and sortByKey is highly optimized regarding the shuffling process (transferring data between nodes). For this reason, it might well be that for many cases these methods are faster than the custom-made logic, even though the custom-made logic could be O(n) while sortBy / sortByKey can be O(n log(n)) at best.
If you are interested in learning more about the shuffling logic used by Apache Spark, there is an article explaining the basic concept.

Too many values for a particular key in Hadoop

I have written K-Means clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole data would be divided into two sets and each reducer would receive too many values for a particular key, i.e., the cluster centroid. How do I solve this?
Note: I use the iterative approach to calculate new centers.
Algorithmically, there is not much you can do, as the nature of this algorithm is as you describe. The only option, in this respect, is to use more clusters and divide your data across more reducers, but that yields a different result.
So the only thing you can do, in my opinion, is compression. And I do not only mean using one of Hadoop's compression codecs.
For instance, you could find a compact representation of your data. E.g., give an integer id to each element and pass only this id to the reducers. This will save network traffic (store elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) and memory on each reducer.
In the case of k-means, I think that a combiner is not applicable, but if it is, it would greatly reduce the network and reducer overhead.
EDIT: It seems that you CAN use a combiner if you follow this iterative implementation. Please edit your question to describe the algorithm that you have implemented.
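If a combiner does apply, the trick is to carry (sum, count) partials per centroid, which merge associatively. A hedged Python sketch of the idea follows; the 2-D points and the pair format are made up for illustration, and this is not Hadoop API code:

```python
from collections import defaultdict

def combine(pairs):
    """Combiner: collapse (centroid_id, point) pairs emitted by one mapper
    into a single [sum_x, sum_y, count] partial per centroid."""
    partials = defaultdict(lambda: [0.0, 0.0, 0])
    for cid, (x, y) in pairs:
        p = partials[cid]
        p[0] += x
        p[1] += y
        p[2] += 1
    return dict(partials)

def reduce_centroid(partials_for_cid):
    """Reducer: merge the partials and emit the new centroid."""
    sx = sum(p[0] for p in partials_for_cid)
    sy = sum(p[1] for p in partials_for_cid)
    n = sum(p[2] for p in partials_for_cid)
    return (sx / n, sy / n)

# Two mappers each pre-aggregate locally; the reducer then sees two
# records per centroid instead of one record per data point.
m1 = combine([(0, (1.0, 1.0)), (0, (3.0, 3.0)), (1, (10.0, 0.0))])
m2 = combine([(0, (2.0, 2.0)), (1, (12.0, 0.0))])
new_c0 = reduce_centroid([m1[0], m2[0]])
print(new_c0)  # (2.0, 2.0)
```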
If you have too much shuffle, you will run into OOM issues.
Try splitting the dataset into smaller chunks and tuning
yarn.app.mapreduce.client.job.retry-interval
and
mapreduce.reduce.shuffle.retry-delay.max.ms
so that there are more splits, but the retries of the job are long enough that there are no OOM issues.

Motivation behind an implicit sort after the mapper phase in map-reduce

I am trying to understand why MapReduce does an implicit sort during the shuffle and sort phase, on both the map side and the reduce side, which manifests as a mixture of in-memory and on-disk sorting (and can be really expensive for large sets of data).
My concern is that performance is a significant consideration when running MapReduce jobs, and an implicit sort on the keys before the mapper output is handed to the reducer can have a great impact on performance when dealing with large sets of data.
I understand that sorting can prove to be a boon in certain cases where it is explicitly required, but this is not always true. So why does the concept of implicit sorting exist in Hadoop MapReduce?
For a reference to what I mean by the shuffle and sort phase, see the post Map-Reduce: Shuffle and Sort on my blog, Hadoop-Some Salient Understandings.
One possible explanation, which came to my mind long after posting this question:
The sorting is done just to aggregate all the records corresponding to a particular key together, so that they can all be sent to a single reducer (the default partitioning logic in Hadoop MapReduce). In other words, sorting the records by key after the mapper phase simply brings all records for a single key together; the fact that the keys also end up in sorted order is only exploited by certain use cases, such as sorting large sets of data.
If people who think the same can verify the above, that would be great. Thanks.

Is Hadoop designed for solving problems that need multiple parallel computations of the same data but with different parameters?

From the little I have read, I understood that Hadoop is great for the following class of problems - having one massive question answered by distributing the computation between potentially many nodes.
Was Hadoop designed to solve problems that involve multiple calculations on the same dataset, but each with different parameters? For example, simulating different scenarios based on the same master dataset but with different parameters (e.g., testing a data mining model on the same data set, spawning multiple iterations of the simulation, each with a different set of parameters, and finding the best model).
E.g., for a model predicting weather that has a set of rules with different weights, does Hadoop support running the same model with each "node" using different weight values on a learning set, and comparing the prediction results to find the best model?
Or is this something that Hadoop was simply not designed to do?
It's not really something it was designed to do. Typically you would want to be able to distribute different parts of the dataset to different nodes. This is one of the main ideas behind hadoop: split a huge dataset over multiple nodes and bring the computation to the data. However, it still can definitely be accomplished without jumping through too many hoops.
I'm not sure how familiar you are with the MapReduce paradigm, but you could think of the parameters of your models as the "input" to your map task. Put the dataset in some location in HDFS, and write a MapReduce job such that the map tasks read in the dataset from HDFS, then evaluate the model using the given parameters. All map outputs get sent to one reducer, which simply outputs the parameters that gave the highest score. If you make the number of input files (model parameters) equal to the number of nodes, you should get one map task per node, which is what you want.
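That pattern can be sketched in miniature: treat each parameter set as a map input, evaluate the model against the shared dataset in the map phase, and let a single reduce pick the winner. Everything here (the toy dataset, the scoring function, the parameter values) is an illustrative stand-in:

```python
# Toy "dataset" shared by every map task; the model is y ~ weight * x.
dataset = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def score(weight):
    # Higher is better: negative squared error of the fit.
    return -sum((y - weight * x) ** 2 for x, y in dataset)

# Each "input file" holds one parameter set; each map task evaluates it.
param_inputs = [0.5, 1.0, 1.5, 2.0, 2.5]
map_outputs = [(w, score(w)) for w in param_inputs]   # map phase

# Single reducer: keep the parameters that gave the highest score.
best_w, best_score = max(map_outputs, key=lambda ws: ws[1])
print(best_w)  # 2.0
```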

Does someone really sort terabytes of data?

I recently spoke to someone who works for Amazon, and he asked me: how would I go about sorting terabytes of data using a programming language?
I'm a C++ guy and, of course, we spoke about merge sort; one possible technique is to split the data into smaller chunks, sort each of them, and finally merge them.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?
But in reality, do companies like Amazon/eBay sort terabytes of data? I know they store tons of info, but sorting them???

Yes. Last time I checked, Google processed over 20 petabytes of data daily.

Why wouldn't they keep them sorted in the first place instead of sorting terabytes of data, is my question in a nutshell.

EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
Consider log data from servers; Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. Thus if you want it sorted by product, you would need to sort the whole data set.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower network data using merge sort. The implementation I used was by Prof. Daniel Lemire.
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted by userid, with each line containing a userid followed by the userid of a person following him. However, in my case I wanted data about who follows whom, so I had to sort it again by the second userid on each line.
However, for sorting 1 TB I would use MapReduce with Hadoop.
Sorting is the default step after the map function. Thus I would choose the identity function as the map function, no reduce function, and set up a streaming job.
Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map per block. After the map function runs, the map output is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data, then I would make that element a key in XXX and the line the value, as the output of the map.
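The split-sort-merge technique mentioned in the question can be sketched in a few lines of Python, with in-memory lists standing in for on-disk runs (in a real external sort each run would be written to and re-read from disk):

```python
import heapq

def external_sort(records, run_size):
    """External merge sort sketch: sort fixed-size runs (as if each fit in
    memory, or one HDFS block), then stream-merge the sorted runs."""
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    return list(heapq.merge(*runs))

data = [9, 1, 7, 3, 8, 2, 6, 4, 5, 0]
print(external_sort(data, run_size=3))  # [0, 1, 2, ..., 9]
```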
Yes, certain companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
Yes, some companies certainly do. Or maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years: every change in the price offering, real deal prices (trades, AKA prints), etc. For highly volatile instruments, such as stocks, futures and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments going back years. Not to mention news that they correlate with the market, weather conditions and even moon phases. So, yes, there are guys who sort terabytes of data. Maybe not every day, but they do.
Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.
Big companies do sort terabytes and petabytes of data regularly. I've worked for more than one such company. As Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but key extraction, enriching, etc.) at massive scale. Despite all that, there may be situations where you will need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps.
Per security/privacy policies, certain fields in the log files needed to be encrypted before the data could be moved on for further processing. That meant that for each row, a custom encryption algorithm was applied. However, since the same field value appears hundreds of times in the file, it was more efficient to sort the file first, encrypt each distinct value once, and cache the result for the repeated values.
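That sort-then-cache idea can be sketched as follows; the field name, the row layout, and the `fake_encrypt` stand-in are all made up for illustration:

```python
import hashlib

def fake_encrypt(value):
    """Stand-in for the expensive per-value encryption step."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def encrypt_column(rows):
    # Sort on the sensitive field so repeated values are adjacent, then
    # encrypt each distinct value once and reuse the cached result.
    rows = sorted(rows, key=lambda r: r["device_id"])
    last_plain, last_cipher, calls = None, None, 0
    out = []
    for r in rows:
        if r["device_id"] != last_plain:
            last_plain = r["device_id"]
            last_cipher = fake_encrypt(last_plain)
            calls += 1
        out.append({**r, "device_id": last_cipher})
    return out, calls

rows = [{"device_id": d, "event": i} for i, d in enumerate("abacaba")]
encrypted, calls = encrypt_column(rows)
print(calls)  # 3 distinct values -> 3 encryption calls instead of 7
```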
