Is the output of an aggregator in Informatica always sorted? - sorting

Is the output of an aggregator component in Informatica always going to be sorted by the specified group, even when the Sorted Input box is not ticked and the component does the grouping itself? Or is there no guarantee?

If the input to the Aggregator transformation is not sorted, then the output will not be sorted either.
As for your other question: even if you do not use sorted input, the Aggregator transformation will group just fine. The option only impacts performance.
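The Aggregator's two modes are configured in the Designer rather than in code, but the difference can be sketched in plain Python (the function names and row format below are illustrative, not Informatica APIs): without sorted input the transformation has to buffer groups in a cache, and the order in which groups come out is whatever the cache yields; with sorted input it can stream each group out as soon as the key changes.

```python
from itertools import groupby

def aggregate_unsorted(rows, keyfunc, aggfunc):
    # Cache-based grouping: accepts rows in any order, but the
    # order in which groups are emitted is not guaranteed.
    groups = {}
    for row in rows:
        groups.setdefault(keyfunc(row), []).append(row)
    return {k: aggfunc(v) for k, v in groups.items()}

def aggregate_sorted(rows, keyfunc, aggfunc):
    # Streaming grouping: requires rows already sorted by key and
    # emits each group the moment the key changes (faster, less memory).
    return {k: aggfunc(list(g)) for k, g in groupby(rows, key=keyfunc)}
```

Both variants produce the same aggregates; only the streaming one depends on (and preserves) the input order.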

Hi Rusty. If the data is not sorted, the Aggregator will not sort it. If the data is already sorted before it is passed to the Aggregator transformation, tick the Sorted Input option, as it increases execution speed. To have some fun with Informatica, play with https://etlinfromatica.wordpress.com/

Related

In SPSS Modeler (17/18), what are the criteria for evaluating ties encountered while sorting a particular column using the sorting nugget?

I am sorting by a particular column using the sorting nugget in SPSS Modeler 17/18. However, I do not understand how ties are evaluated when values are repeated in the sorting column, since none of the other columns have any sequence associated with them. Can someone throw some light on this?
I have attached an illustration here where I am sorting on col3 (the Excel file is the original data). However, after sorting, none of the other columns (Key) seem to follow any sequence or order. How was the final data arrived at, then?
I have not been able to find any documentation to answer this question, but I believe that the order of ties after the sort is essentially random or at least determined by a number of factors that are outside of the user's control. Generally, I think it is determined by the order of the records in the source, but if you are querying a database or similar without specifying a sort order, you may see that the data will be sorted differently depending on the source system and it may even differ between each execution.
If your processing depends on the sort order of the data (including the order of the ties), the best approach is to specify the sort order in enough detail that ties cannot happen.
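The sort nugget itself is configured in the Modeler GUI, but the fix is the same as in any sort: extend the sort key with a tie-breaker column. A minimal Python sketch, using the column names from the question (col3 as the sort column, Key as the hypothetical tie-breaker):

```python
rows = [
    {"Key": 3, "col3": 10},
    {"Key": 1, "col3": 10},
    {"Key": 2, "col3": 5},
]
# Sorting on col3 alone leaves the order of the two col3 == 10 rows
# undefined; adding Key as a secondary sort field makes it stable.
rows.sort(key=lambda r: (r["col3"], r["Key"]))
# rows is now deterministic: Key order 2, 1, 3
```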

Not allowed to use Joiner transformation

I have an Expression transformation from which I am passing the data to two different transformations.
Later, downstream of these parallel flows, I am trying to apply a Joiner transformation, but I am not allowed to do so.
Is the Joiner transformation not allowed in such a case, similar to a self-join?
What could be an alternative approach to achieve such a transformation?
It would be great if somebody could help me sort out this issue.
You need to sort the data before the Joiner, and enable the Sorted Input option before connecting the second set of ports to the Joiner.
One word of caution though: carefully consider the key you join these data on. It should be unique across all records in at least one of the two data streams, otherwise you'll end up with a data explosion. I know this may sound very basic, but it is often forgotten in self-joins :)
The Joiner transformation will work. Assuming the data comes from the same source table and is passed through different pipelines, use the Sorted Input option in the Joiner transformation.
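Under the hood, the Sorted Input option lets the Joiner run a merge join over the two pre-sorted streams. The following Python sketch (a generic merge join, not Informatica's actual implementation) also shows the caution above in action: if a key repeats on both sides, the output contains the cross product of the matching rows.

```python
def sort_merge_join(left, right, key):
    # Both inputs must already be sorted by `key`.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) < key(right[j]):
            i += 1
        elif key(left[i]) > key(right[j]):
            j += 1
        else:
            k = key(left[i])
            # Collect every row sharing this key on each side...
            li, rj = i, j
            while i < len(left) and key(left[i]) == k:
                i += 1
            while j < len(right) and key(right[j]) == k:
                j += 1
            # ...and emit their cross product: duplicates on BOTH
            # sides are what cause the "data explosion".
            for l in left[li:i]:
                for r in right[rj:j]:
                    out.append((l, r))
    return out
```

With key 2 appearing twice on each side, two matching rows per side already yield four output rows.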

Hadoop map only job

My situation is like the following:
I have two MapReduce jobs.
The first is a MapReduce job that produces output sorted by key.
The second is a map-only job that extracts part of the data and just collects it.
I have no reducer in the second job.
The problem is that I am not sure whether the output of the map-only job will remain sorted, or whether it will be shuffled by the map function.
First of all: if your second job only contains a filter to include/exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.
A rather important fact about MapReduce is that the reducer will sort the records in "some way" that you do not control. When writing a job, you should assume the records are output in a random order.
If you really need all records to be output in a specific order, then using the SecondarySort mechanism in combination with a single reducer is an "easy" solution that doesn't scale well.
The "hard" solution is what the "Terasort" benchmark uses.
Read this SO question for more insight into how that works:
How does the MapReduce sort algorithm work?
No. As zsxwing said, there won't be any such processing unless you specify a reducer; only then is partitioning performed on the map side, with sorting and grouping done on the reduce side.
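The behaviour can be pictured with a small simulation (plain Python, not Hadoop API code): in a map-only job each mapper writes its output file directly, so the order within each output file is exactly the mapper's emit order, while no global order exists across the files.

```python
def run_map_only(splits, map_fn):
    # One inner list per split / mapper / output file. With no reducer
    # there is no partition, shuffle, or sort phase: records come out
    # in the order the mapper emits them, per output file.
    return [[out for rec in split for out in map_fn(rec)] for split in splits]

# Example map function: keep only records with an even value.
def keep_even(kv):
    return [kv] if kv[1] % 2 == 0 else []
```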

Hadoop MapReduce with already sorted files

I'm working with Hadoop MapReduce. I've got data in HDFS, and the data in each file is already sorted. Is it possible to force MapReduce not to re-sort the data after the map phase? I've tried changing map.sort.class to a no-op, but it didn't work (i.e. the data wasn't sorted as I'd expected). Has anyone tried something similar and managed to achieve it?
I think it depends on what kind of result you want: sorted or unsorted.
If you need the result to be sorted, I think Hadoop is not suitable for this job, for two reasons:
The INPUT DATA will be stored in different chunks (if big enough) and partitioned into multiple splits. Each split is mapped to one map task, and the output of all map tasks is gathered (after the partition/sort/combine/copy/merge stages) as the reducer's input. It is hard to keep the keys in order across these stages.
Sorting happens not only after the map process in the map task; the merge process during the reduce task also sorts.
If you do not need the result to be sorted, I think this patch may be what you want:
Support no sort dataflow in map output and reduce merge phrase : https://issues.apache.org/jira/browse/MAPREDUCE-3397

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want my reducer function to give me "the 10 worst values for each key" (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way to sort all values BEFORE they are sent into the reducer (and simply stop reading the input once I have read the first 10), or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Anyone has a code fragment on how to use this in the Hadoop 0.20 API?
This sounds definitely like a secondary-sort problem. Take a look at "Hadoop: The Definitive Guide" if you like; it's from O'Reilly, and you can also access it online. It describes a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner ensures that all key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms the groupings. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groupings, while the number of partitions should equal the number of reducers. The grouping also allows some sorting, as it implements a compareTo method.
With this method you can control which 10 best/worst/highest/lowest keys reach the reducer first. So after you have read those 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
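The idea behind reusing the reducer as combiner can be shown in a few lines of Python (a simulation of the word-count flow, not Hadoop code): because summing counts is associative, the same function works on the map side (partial counts) and on the reduce side (final counts).

```python
from collections import Counter

def reduce_counts(pairs):
    # Sum the counts per word; usable both as combiner and as reducer
    # because addition is associative and commutative.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return sorted(totals.items())

# Map side: raw (word, 1) pairs from one mapper...
mapper_out = [("a", 1), ("b", 1), ("a", 1)]
combined = reduce_counts(mapper_out)            # partial counts
# Reduce side: merge the combined output with other mappers' counts.
final = reduce_counts(combined + [("a", 3)])
```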
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs: <key, [set of {score, data}]>. It does a local sort (still on the mapper nodes) and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
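A sketch of that flow in Python (the score ordering is an assumption here: "worst" is taken as lowest score; swap `nsmallest` for `nlargest` to invert it):

```python
import heapq

TOP_N = 10

def combiner(key, score_data_pairs):
    # Map-side local top-N: only the N best (score, data) pairs per key
    # ever cross the network, and they leave already sorted by score.
    return key, heapq.nsmallest(TOP_N, score_data_pairs)

def reducer(key, lists_of_pairs):
    # Merge step of sort-merge over the pre-sorted local lists;
    # stop pulling as soon as N values have been taken.
    merged = heapq.merge(*lists_of_pairs)
    return key, [pair for _, pair in zip(range(TOP_N), merged)]
```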
update 2
So, now that we know that the rank is cumulative and that, as a result, you can't filter the data early using combiners, the only option is to do what you suggested: get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
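The mechanics of the secondary sort can be simulated in a few lines of Python (this mimics what the composite key, partitioner, and grouping comparator achieve in the real Java job; it is not Hadoop API code): sort by (natural key, score) so that each key's values reach the reducer already ordered, then stop after the first n.

```python
from itertools import groupby

def shuffle_with_secondary_sort(mapper_output):
    # Sort by the composite key (natural_key, score): within each
    # natural key the values are now ordered by score, which is what
    # the grouping comparator exposes to a single reduce() call.
    ordered = sorted(mapper_output)
    for natural_key, group in groupby(ordered, key=lambda kv: kv[0]):
        yield natural_key, (score for _, score in group)

def take_first_n(key, scores, n=10):
    # Values arrive pre-sorted, so take the first n and stop iterating.
    return key, [s for _, s in zip(range(n), scores)]
```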
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
