PIG: how to choose a good value for the PARALLEL clause?

I'm trying to minimize, for a given cluster (512 GB RAM, 100 vCores), the execution time of a workflow with multiple "instances" of the same Pig script.
Increasing the PARALLEL clause value for COGROUP operations gives better results. However, is there a formula for picking a good value for this clause? The Pig documentation is very evasive about it!

Unfortunately there is no firm rule for choosing the number of reducers. It is best determined empirically, by investigating the execution time of the COGROUP phase and experimenting with different values for PARALLEL (from my experience, 100 is a good starting point).
Nevertheless, the upper bound is usually defined as numReducers << heapSize / (2 * io.buffer.size). You can find more here.
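As a back-of-the-envelope illustration of that upper bound (the 1 GB reducer heap and 128 KB io.buffer.size below are assumed values, not settings taken from the question's cluster):

heap_size = 1 * 1024 ** 3      # assume a 1 GB reducer heap
io_buffer_size = 128 * 1024    # assume io.buffer.size = 128 KB

# rule of thumb from above: numReducers << heapSize / (2 * io.buffer.size)
upper_bound = heap_size // (2 * io_buffer_size)
print(upper_bound)             # 4096, so PARALLEL should stay well below this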

Related

repeatedly computing the fft over a varying number of rows

I am interested in computing the fft of the first rows of a matrix, but I do not know in advance how many rows I need. I need to do this repeatedly but the number of rows I need to transform can change.
I will illustrate with the following example. Suppose I have a 100 by 128 array. If I plan for 1-dimensional fft's on each row, FFTW produces the following plan:
(dft-ct-dit/8
(dftw-direct-8/28-x100 "t2fv_8_sse2")
(dft-vrank>=1-x8/1
(dft-direct-16-x100 "n1fv_16_sse2")))
Although I don't fully understand this output, I do see the key ingredients: 1) This is a single Cooley-Tukey pass; note that 8*16 = 128. 2) Because of the x100 postfix on two lines, the plan states that this needs to happen for 100 rows.
I see three possibilities:
One-size-fits-all planning: plan for the 100 by 128 array, and execute this big plan even when only the first (say) 20 rows are needed.
Pros: we need only one plan so there is little planning overhead. Cons: potentially substantial performance loss in the execution phase (transforming more than I need).
Exhaustive planning: obtain plans using the same input/output array but for every possible number of rows. In the example I would have 100 plans, where plan i carries out the fft for each of the first i rows. Pros: transforming exactly what I need. Cons: experiments show that I have to pay the planning penalty over and over, even though for, say, i=50 the plan will be the same as above but with x50 instead of x100. (I suppose there is no guarantee this will indeed be the plan identified by the FFTW planner, but I wouldn't mind losing "optimality" if I can cut out the planning time.)
Single-row planning: plan for a single row and use a loop to move data into input, transform, and move data out of output. Pros: I'm transforming exactly what I need. Cons: it seems to me this removes a lot of the FFTW optimizations, for instance when I use multiple threads. (I'm generally confused how multithreading works in FFTW since it is ill-documented... I know threading information is part of the plan, but printing the plan doesn't display any of it. This is off-topic though.)
I was thinking that I would combine all three ideas by first creating the one-size-fits-all plan, modifying this plan 99 times in a for loop instead of planning for the different sizes, and executing as under the exhaustive-planning approach. However I can't find any documentation on the plan/wisdom format, the output of wisdom with hexadecimal numbers is impenetrable. So I am wondering how I can carry out this hybrid approach.
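A rough sketch of one possible compromise, using the pyFFTW wrapper (pyfftw, its builders module, and its wisdom helpers are assumptions here; in C the equivalents are fftw_export_wisdom / fftw_import_wisdom and the advanced interface): plan per row count, but reuse accumulated wisdom so the repeated planning penalty shrinks, since the size-8 and size-16 sub-transforms are shared between the x100 plan and the smaller ones.

import pyfftw

# 100 rows of length 128, as in the example above
a = pyfftw.empty_aligned((100, 128), dtype='complex128')

# Plan the full 100-row transform once. The wisdom FFTW accumulates while
# doing this makes later planning for the same row length much cheaper.
plans = {100: pyfftw.builders.fft(a, axis=1, planner_effort='FFTW_MEASURE')}
wisdom = pyfftw.export_wisdom()

def fft_first_rows(k):
    # Plan (and cache) a transform for exactly k rows, reusing the wisdom,
    # then execute it on the current contents of the first k rows.
    if k not in plans:
        pyfftw.import_wisdom(wisdom)
        plans[k] = pyfftw.builders.fft(a[:k], axis=1,
                                       planner_effort='FFTW_MEASURE')
    return plans[k](a[:k])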

Hadoop: reducing the result to a single value

I started learning Hadoop, and am a bit confused by MapReduce. For tasks where the result is naturally a list of key-value pairs, everything seems clear. But I don't understand how I should solve tasks where the result is a single value (say, the sum of squared input decimals, or the centre of mass of input points).
On the one hand, I can emit all mapper results under the same key. But as far as I understand, in that case a single reducer will have to process the whole data set (calculating the sum, or the mean coordinates). That doesn't look like a good solution.
Another approach I can imagine is to group the mapper results: say, the mapper that processed examples 0-999 emits key 0, the one that processed 1000-1999 emits key 1, and so on. Since there will still be multiple reducer outputs, it would be necessary to build a chain of reducers (repeating the reduction until only one result remains). That looks much more computationally efficient, but a bit complicated.
I still hope that Hadoop has an off-the-shelf tool that chains reducers to reduce the whole data set to a single value as efficiently as possible, although I have failed to find one.
What is the best practice for solving tasks where the result is a single value?
If you are able to reformulate your task in terms of a commutative reduce, you should look at Combiners. In any case you should take a look at them; they can significantly reduce the amount of data to shuffle.
From my point of view, you are tackling the problem from the wrong angle.
Take the problem where you need to sum the squares of your input: let's assume you have many large text input files, each consisting of one number per line.
Then ideally you want to parallelize your sums in the mapper and then just sum up the sums in the reducer.
e.g.:
map: (input "x", temporary sum "s") -> s += x * x
At the end of the map phase, each mapper emits its temporary sum under a single global key.
In the reduce stage, you basically get all the sums from your mappers and sum the sums up, note that this is fairly small (n-times a single integer, where n is the number of mappers) in relation to your huge input files and therefore a single reducer is really not a scalability bottleneck.
You want to cut down the communication cost between the mappers and the reducer, not proxy all your data to a single reducer and read through it there; that would not parallelize anything.
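A minimal sketch of that pattern as a Hadoop Streaming job in Python (the file names and the choice of Streaming rather than the Java API are my own assumptions, not something from the answer above):

# mapper.py: keep a running partial sum of squares, emit it once at the end
import sys

partial = 0.0
for line in sys.stdin:
    line = line.strip()
    if line:
        x = float(line)
        partial += x * x
# a single tiny record per mapper, all under the same key
print("sum\t%r" % partial)

# reducer.py: add up the handful of partial sums (one per mapper)
import sys

total = 0.0
for line in sys.stdin:
    _, value = line.split("\t", 1)
    total += float(value)
print(total)

The two scripts would be wired together with the Hadoop Streaming jar (-input, -output, -mapper, -reducer).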
I think your analysis of the specific use cases you bring up is spot on. These use cases still fall within a rather inclusive scope of what you can do with Hadoop, and there are certainly other things that Hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big; then I'd follow your two-step approach.

How does Oracle calculate the cost in an explain plan?

Can anyone explain how the cost is evaluated in an Oracle explain plan?
Is there any specific algorithm to determine the cost of the query?
For example: full table scans have higher cost, index scan lower... How does Oracle evaluate the cases for full table scan, index range scan, etc.?
This link is same as what I am asking: Question about Cost in Oracle Explain Plan
But can anyone explain with an example, we can find the cost by executing explain plan, but how does it work internally?
There are many, many specific algorithms for computing the cost, far more than could realistically be discussed here. Jonathan Lewis has done an admirable job of walking through how the cost-based optimizer decides on the cost of a query in his book Cost-Based Oracle Fundamentals. If you're really interested, that's going to be the best place to start.
It is a fallacy to assume that full table scans will have a higher cost than, say, an index scan. It depends on the optimizer's estimates of the number of rows in the table and of the number of rows the query will return (which, in turn, depend on the optimizer's estimates of the selectivity of the various predicates), the relative cost of a single-block read vs. a multi-block read, the speed of the processor, the speed of the disk, the probability that blocks will be available in the buffer cache, your database's optimizer settings, your session's optimizer settings, the PARALLEL attribute of your tables and indexes, and a whole bunch of other factors (this is why it takes a book to really start to dive into this sort of thing). In general, Oracle will prefer a full table scan if your query is going to return a large fraction of the rows in your table and an index access if your query is going to return a small fraction of the rows in your table. And "small fraction" is generally much smaller than people initially estimate: if you're returning 20-25% of the rows in a table, for example, you're almost always better off using a full table scan.
If you are trying to use the COST column in a query plan to determine whether the plan is "good" or "bad", you're probably going down the wrong path. The COST is only valid if the optimizer's estimates are accurate. But the most common reason that query plans would be incorrect is that the optimizer's estimates are incorrect (statistics are incorrect, Oracle's estimates of selectivity are incorrect, etc.). That means that if you see one plan for a query that has a cost of 6 and a plan for a different version of that query that has a cost of 6 million, it is entirely possible that the plan that has a cost of 6 million is more efficient because the plan with the low cost is incorrectly assuming that some step is going to return 1 row rather than 1 million rows.
You are much better served ignoring the COST column and focusing on the CARDINALITY column. CARDINALITY is the optimizer's estimate of the number of rows that are going to be returned at each step of the plan. CARDINALITY is something you can directly test and compare against. If, for example, you see a step in the plan that involves a full scan of table A with no predicates and you know that A has roughly 100,000 rows, it would be concerning if the optimizer's CARDINALITY estimate was either way too high or way too low. If it was estimating the cardinality to be 100 or 10,000,000 then the optimizer would almost certainly be either picking the table scan in error or feeding that data into a later step where its cost estimate would be grossly incorrect, leading it to pick a poor join order or a poor join method. And it would probably indicate that the statistics on table A were incorrect. On the other hand, if you see that the cardinality estimates at each step are reasonably close to reality, there is a very good chance that Oracle has picked a reasonably good plan for the query.
Another place to get started on understanding the CBO's algorithms is this paper by Wolfgang Breitling. Jonathan Lewis's book is more detailed and more recent, but the paper is a good introduction.
In the 9i documentation Oracle produced an authoritative-looking mathematical model for cost:
Cost = (#SRds * sreadtim +
        #MRds * mreadtim +
        #CPUCycles / cpuspeed) / sreadtim
where:
#SRds is the number of single-block reads
#MRds is the number of multi-block reads
#CPUCycles is the number of CPU cycles
sreadtim is the single-block read time
mreadtim is the multi-block read time
cpuspeed is the number of CPU cycles per second
So it gives a good idea of the factors which go into calculating cost. This was why Oracle introduced the capability to gather system statistics: to provide accurate values for CPU speed, etc.
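As a purely arithmetic illustration of how those terms combine (every number below is made up; real values would come from gathered system statistics):

# hypothetical inputs, for illustration only
srds = 1000           # single-block reads
mrds = 200            # multi-block reads
cpu_cycles = 5000000  # CPU cycles
sreadtim = 5.0        # single-block read time (arbitrary time units)
mreadtim = 20.0       # multi-block read time (same units)
cpuspeed = 500000     # CPU cycles per unit time (same units)

cost = (srds * sreadtim + mrds * mreadtim + cpu_cycles / cpuspeed) / sreadtim
print(cost)           # 1802.0; the cost comes out as equivalent single-block reads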
Now we fast forward to the equivalent 11g documentation and we find the maths has been replaced with a cursory explanation:
"Cost of the operation as estimated by the optimizer's query approach.
Cost is not determined for table access operations. The value of this
column does not have any particular unit of measurement; it is merely
a weighted value used to compare costs of execution plans. The value
of this column is a function of the CPU_COST and IO_COST columns."
I think this reflects the fact that cost just isn't a very reliable indicator of execution time. Jonathan Lewis recently posted a pertinent blog piece. He shows two similar looking queries; their explain plans are different but they have identical costs. Nevertheless when it comes to runtime one query performs considerably slower than the other. Read it here.

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total number of results for App Engine queries that will return large numbers of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the random value of the entity I fetched into the following equation to estimate the total number of results (since I used 1000 as the offset above, OFFSET is 1000 in this case):
estimated total = OFFSET / RANDOM
The idea is that, since each entity has a random number assigned to it and I am sorting by that number, the random value found at a given offset should reflect what fraction of the full result set lies before that offset (in this case, 1000).
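A minimal sketch of that sampling approach, assuming the old google.appengine.ext.db API and a hypothetical Item model whose random property is set at creation time:

import random
from google.appengine.ext import db

class Item(db.Model):
    # hypothetical model: every entity gets a uniform random float at creation
    random = db.FloatProperty()

def create_item():
    Item(random=random.random()).put()

def estimate_total(offset=1000):
    # order by the random property and look at the entity found at `offset`
    sample = Item.all().order('random').fetch(limit=1, offset=offset)
    if not sample:
        return None  # there are fewer than `offset` results in total
    return offset / sample[0].random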
The problem I am having is that the results I am getting give me low estimates, and the lower the offset, the lower the estimates. I had anticipated that the lower the offset I used, the less accurate the estimate would be, but I thought the margin of error would fall both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions consistently follow a fourth-order polynomial (y = -5E-15*x^4 + 7E-10*x^3 - 3E-05*x^2 + 0.3781*x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
Edit:
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close together. When I made a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud": offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. That is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum error after each offset. For example, the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads; it's built and optimized for very fast request turnaround. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
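A rough sketch of the sharded counter idea mentioned above (the model, the shard count, and the key naming are assumptions; App Engine's own articles have a fuller version):

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # assumed; more shards means more write throughput

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(required=True, default=0)

def increment(name):
    # pick a shard at random and bump it inside a transaction
    index = random.randint(0, NUM_SHARDS - 1)
    def txn():
        key_name = '%s-%d' % (name, index)
        shard = CounterShard.get_by_key_name(key_name)
        if shard is None:
            shard = CounterShard(key_name=key_name, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    # the read path just sums the shards for this counter
    return sum(s.count for s in CounterShard.all().filter('name =', name))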
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math; I think the estimation method you proposed here can be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question becomes: what is the probability that your 1000th [2000th, 3000th, ...] sample falls in the interval [l, u], such that the total estimated from this sample has an acceptable error relative to 60000?
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]
I think the probability won't be very large.
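For reference, a small sketch of how that probability could be computed, using the standard fact that the k-th order statistic of n uniform(0, 1) samples follows a Beta(k, n - k + 1) distribution (scipy is an assumption here, not part of the original answer):

from scipy.stats import beta

n = 60000   # actual number of entities
k = 1000    # offset of the sampled entity
err = 0.05  # acceptable relative error on the estimate

# estimate = k / random_value, so a 5% error band on the estimate maps to
# this interval for the sampled random value itself
lower = k / (n * (1.0 + err))  # ~0.01587
upper = k / (n * (1.0 - err))  # ~0.01754

prob = beta.cdf(upper, k, n - k + 1) - beta.cdf(lower, k, n - k + 1)
print(prob)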
This doesn't directly deal with the calculation aspect of your question, but would using the count() method of a query object work for you? Or have you tried that out and found it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count

How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.
To me, sorting simply involves determining the relative position of an element in relation to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quick, bubble, ...) simply does this in a smart way.
In my mind, splitting the dataset into many pieces means you can sort a single piece, but then you still have to merge these pieces into the 'complete' fully sorted dataset. Given a terabyte dataset distributed over thousands of systems I expect this to be a huge task.
So how is this really done? How does this MapReduce sorting algorithm work?
Thanks for helping me understand.
Here are some details on Hadoop's implementation for Terasort:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i + 1."
So their trick is in the way they determine the keys during the map phase. Essentially they ensure that every value in a single reducer is guaranteed to be 'pre-sorted' against all other reducers.
I found the paper reference through James Hamilton's Blog Post.
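A tiny sketch of that sampled-range partitioning idea (the sample keys below are made up; Hadoop's real TeraSort partitioner is more heavily optimized, but the routing rule quoted above is the same):

import bisect

# cut points sampled from the input keys; N - 1 = 3 cut points -> 4 reducers
samples = [b"ccc", b"ggg", b"ppp"]

def partition(key, samples):
    # all keys with samples[i - 1] <= key < samples[i] go to reducer i;
    # keys below the first cut point go to reducer 0, keys at or above
    # the last cut point go to the last reducer
    return bisect.bisect_right(samples, key)

print([partition(k, samples) for k in [b"aaa", b"ccc", b"hhh", b"zzz"]])
# -> [0, 1, 2, 3]: concatenating the reducers' sorted outputs in order
#    yields a globally sorted result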
Google Reference: MapReduce: Simplified Data Processing on Large Clusters
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
That link has a PDF and HTML-Slide reference.
There is also a Wikipedia page with a description and implementation references.
There has also been criticism:
David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is. They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation". MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems.
I had the same question while reading Google's MapReduce paper. Yuval F's answer pretty much solved my puzzle.
One thing I noticed while reading the paper is that the magic happens in the partitioning (after map, before reduce).
The paper uses hash(key) mod R as the partitioning example, but this is not the only way to partition intermediate data to different reduce tasks.
Just to add boundary conditions to Yuval F's answer to make it complete: suppose min(S) and max(S) are the minimum and maximum keys among the sampled keys; all keys < min(S) are partitioned to one reduce task, and likewise all keys >= max(S) are partitioned to one reduce task.
There is no hard limitation on the sampled keys, like min or max. The more evenly these R keys are distributed among all the keys, the more "parallel" this distributed system is, and the less likely a reduce operator is to run into a memory overflow issue.
Just guessing...
Given a huge set of data, you would partition the data into some chunks to be processed in parallel (perhaps by record number i.e. record 1 - 1000 = partition 1, and so on).
Assign / schedule each partition to a particular node in the cluster.
Each cluster node will further break (map) its partition into its own mini-partitions, perhaps by alphabetical order of the key. So, in partition 1, get me all the things that start with A and output them into mini-partition A(x). Create a new A(x) if there is already an A(x). Replace x with a sequential number (perhaps this is the scheduler's job), i.e. give me the next unique A(x) id.
Hand over (schedule) the jobs completed by the mappers (previous step) to the "reduce" cluster nodes. The reduce nodes will then further refine the sort of each A(x) part, which can only happen when all the mapper tasks are done (you can't actually start sorting all the words starting with A while there is still a possibility that another A mini-partition is in the making). Output the result into the final sorted partitions (i.e. Sorted-A, Sorted-B, etc.).
Once done, combine the sorted partitions into a single dataset again. At this point it is just a simple concatenation of n files (where n could be 26 if you are only doing A-Z), etc.
There might be intermediate steps in between... I'm not sure :). I.e. further map and reduce after the initial reduce step.
