I have an expression transformation from which I am passing the data to two different transformations.
Later, downstream of these parallel flows, I am trying to apply a joiner transformation, but I am not allowed to do so.
Is the joiner transformation not allowed in such a case, which is similar to a self-join?
What could be an alternative approach if I wanted to achieve such a transformation?
It would be great if somebody can help me sort out this issue.
You need to sort the data before the joiner, and turn on 'sorted merge join' before connecting the second set of ports to the joiner.
One word of caution though: carefully consider the 'key' you join these data on. It should be a unique value across all records in at least one of the two data streams, otherwise you'll end up with a data explosion. I know this may sound very basic, but it is often forgotten in self joins :)
The Joiner transformation will work. I assume the data is from the same source table and is passed through different pipelines; in that case, use the SORTED INPUT option in the joiner transformation.
I am sorting by a particular column using the sorting nugget in SPSS Modeler 17/18. However, I do not understand how ties are resolved when values are repeated in the sorting column, since none of the other columns have any sequence associated with them. Can someone throw some light on this?
I have attached an illustration here where I am sorting on col3 (the Excel file is the original data). However, after sorting, none of the other columns (Key) seem to follow any sequence/order. How was the final data arrived at, then?
I have not been able to find any documentation to answer this question, but I believe that the order of ties after the sort is essentially random or at least determined by a number of factors that are outside of the user's control. Generally, I think it is determined by the order of the records in the source, but if you are querying a database or similar without specifying a sort order, you may see that the data will be sorted differently depending on the source system and it may even differ between each execution.
If your processing depends on the sort order of the data (including the order of the ties), the best approach is to specify the sort order in enough detail that ties cannot happen.
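For illustration only (this is not SPSS Modeler syntax), here is a small Java sketch of that principle: adding a secondary key to the sort makes the order of former ties deterministic. The field names col3 and key mirror the question; everything else is hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustration: making a sort deterministic by adding a secondary key,
// so records that tie on the primary sort column still have a well-defined order.
public class TieBreakSortDemo {
    record Row(String key, int col3) {}

    public static void main(String[] args) {
        Row[] rows = {
            new Row("B", 10), new Row("A", 10), new Row("C", 5)
        };

        // Sorting on col3 alone leaves the relative order of the two "10" rows unspecified.
        // Adding key as a tiebreaker removes the ambiguity.
        Arrays.sort(rows, Comparator
                .comparingInt(Row::col3)      // primary sort column
                .thenComparing(Row::key));    // secondary key breaks ties

        // Prints: [Row[key=C, col3=5], Row[key=A, col3=10], Row[key=B, col3=10]]
        System.out.println(Arrays.toString(rows));
    }
}
```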
Please help me understand the best way of storing information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and am a bit confused. People have suggested fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs coded into one big string like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance. I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over a user profiling output, like percentiles, engagement and stat summaries for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similar naming of column family and column. These are different concepts in HBase. A column family consists of several columns. This design improves the speed of access when you only need to read certain types of columns. E.g., if you have raw data and processed data, reading the processed data will not involve the raw data if they are stored in separate column families. You can have practically any number of columns per row key, but a row must fit in one region, i.e. no more than about 10GB. The design depends on what you want:
The first variant has no alternative when you need to store so much data per row key that it cannot fit in one region (more than 10GB).
The second is good when you need to read only a few metrics per single read per row key.
The last variant is suitable when you always read all metrics in a single read per row key.
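As a rough illustration of the second variant, here is a minimal Java sketch using the HBase client API: one row per hashed_uid+date+session_id, with each metric as its own column under a single column family. The table name, family name, qualifiers and values are assumptions for the example, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: one row per session, one column per metric, single column family "m".
public class ProfileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // Row key = hashed_uid + date + session_id (values are made up here)
            byte[] rowKey = Bytes.toBytes("a1b2c3" + "20240101" + "sess42");
            byte[] cf = Bytes.toBytes("m");

            Put put = new Put(rowKey);
            put.addColumn(cf, Bytes.toBytes("duration"), Bytes.toBytes(123L));
            put.addColumn(cf, Bytes.toBytes("depth"),    Bytes.toBytes(7L));
            put.addColumn(cf, Bytes.toBytes("location"), Bytes.toBytes("NYC"));

            table.put(put);
        }
    }
}
```

A single read of that row key then returns exactly the metrics you ask for, which is the point of this layout.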
I need to find the average of two values that are in separate lines.
My CSV file looks like this:
Name,ID,Marks
Mahi,1,90
Mahi,1,100
Andy,2,85
Andy,2,95
Now I need to store the average of the 2 marks in a database.
The "Average" column should add the two marks, divide by 2, and store that result via a SQL query.
Table:
Name,ID,Average
Mahi,2,95
Andy,2,90
Is it possible to find the average of two values in separate rows using NiFi?
Given a lot of assumptions, this is doable. You are definitely better off pre-processing the data in NiFi and exporting it to a tool better suited to this, like Apache Spark using the NiFi Spark Receiver library (instructions here), because this solution will not scale well.
However, you could certainly use a combination of SplitText processors to get the proper data into individual flowfiles (i.e. all Mahi rows in one, all Andy rows in another). Once you have a record that looks like:
Andy,2,85
Andy,2,95
you can use ExtractText with regular expressions to get 85 and 95 into attributes marks.1 and marks.2 (a good example of where scaling will break down -- doing this with 2 rows is easy; doing this with 100k is ridiculous). You can then use UpdateAttribute with the Expression Language to calculate the average of those two attributes (convert toNumber() first) and populate a third attribute marks.average (either through chaining plus() and divide() functions or with the math advanced operation (uses Java Reflection)). Once you have the desired result in an attribute, use ReplaceText to update the flowfile content, and MergeContent to merge the individual flowfiles back into a single instance.
If this were me, I'd first evaluate how static my incoming data format was, and if it was guaranteed to stay the same, probably just write a Groovy script that parsed the data and calculated the averages in place. I think that would even scale better (within reason) because of the flexibility of having written domain-specific code. If you need to offload this to cluster operations, Spark is the way to go.
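As a plain-Java sketch of that "parse and average in place" idea (the answer above suggests Groovy; this just shows the grouping and averaging logic, not a NiFi script), assuming the Name,ID,Marks layout from the question:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Group rows by (Name, ID) and compute the average of their marks.
public class AverageMarks {
    public static void main(String[] args) {
        List<String> lines = List.of(
            "Mahi,1,90", "Mahi,1,100", "Andy,2,85", "Andy,2,95");

        // Group marks by "Name,ID" while preserving input order.
        Map<String, List<Integer>> byKey = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            byKey.computeIfAbsent(f[0] + "," + f[1], k -> new ArrayList<>())
                 .add(Integer.parseInt(f[2]));
        }

        // Emit Name,ID,Average rows ready to be inserted into the target table.
        byKey.forEach((key, marks) -> {
            double avg = marks.stream().mapToInt(Integer::intValue).average().orElse(0);
            System.out.println(key + "," + avg);   // e.g. Mahi,1,95.0
        });
    }
}
```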
Is the output of an aggregator transformation in Informatica always going to be sorted by the specified group, even when the sorted input box is not ticked and the component does the grouping itself? Or is there no guarantee?
If your input to the aggregator transformation is not sorted, then the output will also not be sorted.
As for your other question, even if you do not use sorted input, the aggregator transformation will group just fine; it may only impact performance.
Hi Rusty, if the data is not sorted, the aggregator will not sort it. If it is already sorted before being passed to the aggregator transformation, check the sorted input option, as it increases execution speed. To have fun with Informatica, play with https://etlinfromatica.wordpress.com/
It is a common task to perform some evaluation on pairs of items:
Examples: de-duplication, collaborative filtering, finding similar items, etc.
This is basically a self-join or cross-product on the same source of data.
To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.
So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:
don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2
The mapper would emit the key->value pairs:
(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)
In the reducer, we'll get this:
(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
From here, you can do the inner-join logic, if you so choose. You'd just match up every item with every other item: load the data into an array list, then do an N x N loop over the items to compare each to the others, as in the sketch below.
Realize that reduce-side joins are expensive. They send pretty much all of the data to the reducers if you don't filter anything out. Also, be careful of loading the data up into memory in the reducers-- you may blow your heap on a hot join key by loading all of the data in an array list.
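Here is a minimal sketch of the mapper and reducer described above, using the Hadoop MapReduce Java API. The class names are illustrative, and the driver/job setup is omitted.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side self-join on the middle column of lines like "don,baltimore,12".
public class CitySelfJoin {

    public static class CityMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            // emit (city -> name,value), e.g. (baltimore -> don,12)
            ctx.write(new Text(f[1]), new Text(f[0] + "," + f[2]));
        }
    }

    public static class CityReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text city, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            // Buffer the group, then do the N x N pairing. As noted above,
            // a hot join key with many records can exhaust the heap.
            List<String> buffered = new ArrayList<>();
            for (Text r : records) {
                buffered.add(r.toString());
            }
            for (int i = 0; i < buffered.size(); i++) {
                for (int j = 0; j < buffered.size(); j++) {
                    if (i != j) {
                        ctx.write(city, new Text(buffered.get(i) + "|" + buffered.get(j)));
                    }
                }
            }
        }
    }
}
```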
The above is a bit different from the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both (see the sketch below).
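A sketch of that wiring is below: two inputs, each with its own mapper that tags values with which data set they came from. The class names, paths and "A:"/"B:" tags are all hypothetical; a join reducer (similar to the one sketched above, but pairing A-values with B-values per key) would still need to be plugged in.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Two data sets, two mappers, one job: each mapper emits (join key -> tagged value).
public class TwoSourceJoinDriver {

    public static class SourceAMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable k, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            ctx.write(new Text(f[0]), new Text("A:" + f[1])); // tag values from data set A
        }
    }

    public static class SourceBMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable k, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            ctx.write(new Text(f[0]), new Text("B:" + f[1])); // tag values from data set B
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two-dataset join");
        job.setJarByClass(TwoSourceJoinDriver.class);

        // Each input path gets its own mapper.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, SourceAMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, SourceBMapper.class);

        // job.setReducerClass(...) would point at a reducer that separates the
        // "A:" and "B:" values per key and pairs them to produce the joined output.
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```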
A cross product in the case where you don't have any constraints is a nightmare, i.e.,
select * from tablea, tableb;
There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.
If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.
One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.