Two date columns in source to decide the latest updated record in informatica - informatica-powercenter

I have a requirement as below:
I have a source table like
id | name | address | updt_date_1 | updt_date_2
1 | abc | xyz | 2000-01-01 | 1999-01-01
1 | abc | pqr | 2001-01-01 | 1999-01-01
2 | lmn | ghi | 1999-01-01 | 1999-01-01
2 | lmn | stu | 2000-01-01 | 2008-01-01
I want the target to contain:
1 | abc | pqr
2 | lmn | stu
That is, for each id I want the record with the latest update date in either of the two date columns, updt_date_1 or updt_date_2.
How can this be implemented in Informatica PowerCenter?

This requirement can be met efficiently with just three transformations (Source Qualifier, Expression and Filter). The steps are below.
1) Use the following SQL override in the Source Qualifier transformation to reduce the two update date fields to one
SELECT
id
,name
,address
,CASE WHEN updt_date_1 > updt_date_2 THEN updt_date_1 ELSE updt_date_2 END AS updt_date
FROM source_table
ORDER BY id, updt_date DESC
Now the first row for each id will be the required record.
2) Use an expression transformation to flag the first row of each id. Use the following ports in the same order in the expression transformation (prefix o_ means output port, v_ means variable port and i_ means input port)
PORT EXPRESSION
v_FIRST_ROW_FLAG - IIF(v_PREV_ID = i_id, 'N', 'Y')
v_PREV_ID - i_id
o_FIRST_ROW_FLAG - v_FIRST_ROW_FLAG
3) Next, add a Filter transformation that passes only the records satisfying the following condition
IIF(o_FIRST_ROW_FLAG = 'Y', TRUE, FALSE)
Connect this filter transformation to the target definition. This will give you the expected output.
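The three steps above can be traced outside PowerCenter with a small Python sketch. It is only an illustration of the logic (collapse the two dates, sort by id with the newest date first, keep the first row of each id), not Informatica code; the function name latest_per_id is made up for this example, and the rows are the sample data from the question.

```python
from datetime import date

# Sample rows from the question: (id, name, address, updt_date_1, updt_date_2)
rows = [
    (1, "abc", "xyz", date(2000, 1, 1), date(1999, 1, 1)),
    (1, "abc", "pqr", date(2001, 1, 1), date(1999, 1, 1)),
    (2, "lmn", "ghi", date(1999, 1, 1), date(1999, 1, 1)),
    (2, "lmn", "stu", date(2000, 1, 1), date(2008, 1, 1)),
]


def latest_per_id(rows):
    """Hypothetical helper mirroring the Source Qualifier, Expression and Filter steps."""
    # Step 1: collapse the two date columns into one (the CASE expression).
    reduced = [(i, name, addr, max(d1, d2)) for i, name, addr, d1, d2 in rows]
    # ORDER BY id, updt_date DESC, done as two stable sorts.
    reduced.sort(key=lambda r: r[3], reverse=True)
    reduced.sort(key=lambda r: r[0])

    result = []
    prev_id = None
    for i, name, addr, _ in reduced:
        # Step 2: flag the first row of each id (the variable ports).
        first_row_flag = "N" if prev_id == i else "Y"
        prev_id = i
        # Step 3: keep only the flagged rows (the Filter condition).
        if first_row_flag == "Y":
            result.append((i, name, addr))
    return result


print(latest_per_id(rows))
```

Note that the flag is computed before prev_id is updated, which is why the port order in the Expression transformation matters.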

Basically, we have to determine the maximum of updt_date_1 and the maximum of updt_date_2, and then choose whichever of the two is later.
Use a Source Qualifier and then sort the data by id, name.
Add an Aggregator after it. Pull in the id, name, updt_date_1 and updt_date_2 columns. Create two output columns, max_upd_dt1 and max_upd_dt2, calculated as MAX(updt_date_1) and MAX(updt_date_2) respectively. Set the group by to id, name.
Use a Joiner to join the Sorter output and the Aggregator output on id, name. Now you have two extra columns, max_upd_dt1 and max_upd_dt2.
Use an Expression transformation after the Joiner. Pull all columns in. Create two output ports with the logic below:
out_upd_dt1 = iif( max_upd_dt1 > max_upd_dt2, max_upd_dt1, updt_date_1 )
out_upd_dt2 = iif( max_upd_dt1 < max_upd_dt2, max_upd_dt2, updt_date_2 )
Use another Source Qualifier (sorted by id, name) and join it with the Expression transformation above. Join on:
id=id, name=name, out_upd_dt1=updt_date_1, out_upd_dt2=updt_date_2
Pick up id, name, address.
HTH
Koushik

Related

How to groupBy on one column in laravel?

I have a section table and class Table
class table is designed in this way
(id,class_name,section_id)
one class has many sections like
--------------------------------------------
| SN | ClassName | Section_id |
--------------------------------------------
| 1 | ClassOne | 1 |
| 2 | ClassOne | 2 |
| 3 | ClassOne | 3 |
| 4 | ClassOne | 4 |
--------------------------------------------
Now I want to group by ClassName only and display all the sections of that class
$data['classes'] = SectionClass::groupBy('class_name')->paginate(10);
I have grouped like this, but it only gives me one section id
Try this way...
$things = SectionClass::paginate(10);
$data['classes']= $things->groupBy('class_name');
You are getting just one row because that is what GROUP BY does: it groups a set of rows into a set of summary rows and returns one row for each group. In standard SQL, a query that includes a GROUP BY clause cannot refer to nonaggregated columns in the select list that are not named in the GROUP BY clause. For example, in SQL Server, if you try the following query
SELECT * FROM [Class] GROUP BY [ClassName]
you'll get the following error
"Column 'SN' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause"
Think about it: you are grouping by ClassName, and with your sample data this will return just one row. Your SELECT clause includes the column ClassName, which is easy to return because it is the same in every row of the group. But when you select another column, which of its values should be returned if only one can be?
Now, things change a little bit in MySQL. MySQL extends the standard SQL use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. You can find a complete explanation about this topic here https://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html
If you are expecting a result in one row, you can use GROUP_CONCAT() function to get something like
--------------------------------
| ClassName | Sections |
--------------------------------
| ClassOne | 1,2,3,4 |
--------------------------------
Your query must be something like:
select `ClassName`, group_concat(Section_id) from `class` group by `ClassName`
You can get this with a raw query in Laravel, or it's up to you to find a way to get the same result using the query builder ;)
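To see the GROUP_CONCAT result without a MySQL server, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in (SQLite also provides group_concat). The table and column names follow the query above, and the rows are the sample data from the question. SQLite does not promise the order of the concatenated values, so the sketch sorts them after splitting.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (an assumption
# made only so the example is runnable anywhere).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (SN INTEGER, ClassName TEXT, Section_id INTEGER)")
conn.executemany(
    "INSERT INTO class VALUES (?, ?, ?)",
    [(1, "ClassOne", 1), (2, "ClassOne", 2), (3, "ClassOne", 3), (4, "ClassOne", 4)],
)

# One row per class, with all section ids joined into one comma-separated string.
grouped = conn.execute(
    "SELECT ClassName, group_concat(Section_id) FROM class GROUP BY ClassName"
).fetchall()

# Split the string back into a sorted list of ids for display.
sections_by_class = {
    name: sorted(int(s) for s in ids.split(",")) for name, ids in grouped
}
print(sections_by_class)
```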

How to return the match record based on lookup table by using hive

Let's say we have a look up table (table_A) and another table (table_B) as follows:
We want to search the strings of Table_B for the keywords in Table_A and return the chemical type, forming Table_C as follows:
How can we implement this by using hive query under hadoop environment?
The challenging part is to search for multiple keywords within the same string and create a new row for each matched record.
Thank you!
I think you should structure Table_A differently (or keep the current structure but split by comma and use explode in hive) like so:
----------------------------
| Table A |
----------------------------
| Chemical Type | Keyword |
----------------------------
| HF | 100HF |
----------------------------
| HF | 100:HF |
----------------------------
| HCL | HCL200 |
----------------------------
| HCL | 500HCL |
----------------------------
etc...
Then, it seems that you need to perform a Cartesian product join:
select distinct b.machine, b.string, a.chemical_type
from Table_A as a, Table_B as b
where instr(b.string, a.keyword) > 0;
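The join above can be tried out with a small Python sketch that uses the standard-library sqlite3 module in place of Hive (SQLite also has instr). The Table_A rows come from the restructured table above. The Table_B rows, including the machine names and strings, are invented for illustration because the original sample tables are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (chemical_type TEXT, keyword TEXT)")
conn.execute("CREATE TABLE table_b (machine TEXT, string TEXT)")

# One keyword per row, as in the restructured lookup table.
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [("HF", "100HF"), ("HF", "100:HF"), ("HCL", "HCL200"), ("HCL", "500HCL")],
)
# Hypothetical Table_B content; M1 mentions two different chemicals.
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?)",
    [
        ("M1", "rinse 100HF then HCL200"),
        ("M2", "etch 500HCL"),
        ("M3", "water only"),
    ],
)

# Cross join, keeping only pairs where the keyword occurs in the string.
matches = conn.execute(
    "SELECT DISTINCT b.machine, b.string, a.chemical_type "
    "FROM table_a AS a, table_b AS b "
    "WHERE instr(b.string, a.keyword) > 0 "
    "ORDER BY b.machine, a.chemical_type"
).fetchall()

for row in matches:
    print(row)
```

M1 yields two rows, one per matched chemical type, and M3 yields none, which is the "new row for each matched record" behaviour the question asks for.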

How to select nth row in CockroachDB?

If I use something like a SERIAL (which is a random number) for my table's primary key, how can I select a numbered row from my table? In MySQL, I just use the auto incremented ID to select a specific row, but not sure how to approach the problem with an arbitrary numbering sequence.
For reference, here is the table I'm working with:
+--------------------+------+-------+
| id | name | score |
+--------------------+------+-------+
| 235451721728983041 | ABC | 1000 |
| 235451721729015809 | EDF | 1100 |
| 235451721729048577 | GHI | 1200 |
| 235451721729081345 | JKL | 900 |
+--------------------+------+-------+
Using the LIMIT and OFFSET clauses will return the nth row. For example SELECT * FROM tbl ORDER BY col1 LIMIT 1 OFFSET 9 returns the 10th row.
Note that it’s important to include the ORDER BY clause here because you care about the order of the results (if you don’t include ORDER BY, it’s possible that the results are arbitrarily ordered).
If you care about the order in which things were inserted, you could ORDER BY the SERIAL column (id in your case), though that order is not guaranteed, because transaction contention and other factors can cause the generated SERIAL values not to be strictly ordered.
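As a rough illustration of the LIMIT/OFFSET pattern, the following Python sketch runs it against an in-memory SQLite database, used here only as a stand-in for CockroachDB. The table name tbl and the helper nth_row are made up for the example; the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (235451721728983041, "ABC", 1000),
        (235451721729015809, "EDF", 1100),
        (235451721729048577, "GHI", 1200),
        (235451721729081345, "JKL", 900),
    ],
)


def nth_row(conn, n):
    """Return the n-th row (1-based) when ordered by id, or None if there is none."""
    return conn.execute(
        "SELECT id, name, score FROM tbl ORDER BY id LIMIT 1 OFFSET ?",
        (n - 1,),  # OFFSET skips the first n - 1 rows
    ).fetchone()


print(nth_row(conn, 3))
```

The ORDER BY is what makes "n-th" meaningful; ordering by a different column gives a different n-th row.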

cassandra query on map in select clause

I am new to Cassandra and I am trying to read a row from a table that contains these values
siteId | country | someMap
1 | US | {a:b, x:z}
2 | PR | {a:b, x:z}
I have also created an index on the table using create index on columnfamily(keys(someMap));
but when I query with select * from table where siteId=1 and someMap contains key 'a'
it still returns the entire map:
1 | US | {a:b, x:z}
Can somebody tell me what I should do to get the value as
1 | US | {a:b}
You cannot: even though each entry of a Map|List|Set is stored internally as a column, you can only retrieve the whole collection, not part of it. You are not asking Cassandra for the entry of the map containing X, but for the row whose map contains X.
HTH,
Carlo
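Since the whole collection comes back, one option is to trim the map on the client after the query returns. The Python sketch below assumes the row has already been fetched into a plain dict (how a row is represented depends on the driver, so no driver call is shown); project_map_key is a hypothetical helper name.

```python
def project_map_key(row, key):
    """Return a copy of the row whose someMap keeps only the given key (hypothetical helper)."""
    trimmed = dict(row)
    trimmed["someMap"] = {k: v for k, v in row["someMap"].items() if k == key}
    return trimmed


# The row as returned by the query in the question, held as a plain dict.
fetched = {"siteId": 1, "country": "US", "someMap": {"a": "b", "x": "z"}}
print(project_map_key(fetched, "a"))
```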

Hive: SemanticException [Error 10002]: Line 3:21 Invalid column reference 'name'

I am using the following hive query script for the version 0.13.0
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a DB expert, but I understand this: when you group the data together, you need to reduce the multiple values to a scalar value, or else all the values (if they are strings) should be the same, right?
For example, in my previous case I was grouping them together as strings. That is okay for list.id, list.name and list.genre, but the rating is always going to cause a problem here (I just learnt Pig along with Hive, and grouping works differently there).
So to tackle the problem, I cast the rating, averaged it and stored it in a FLOAT column. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT)) from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id, list.name,list.genre order by list.id DESC;
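The corrected query can be checked on the sample tables with a short Python sketch. It uses the standard-library sqlite3 module in place of Hive, so it only shows the grouping rule (every selected column is either in the GROUP BY or inside an aggregate) and says nothing about Hive's behaviour or performance. Table and column names follow the script above; the rows are the sample data from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list (id TEXT, name TEXT, genre TEXT)")
conn.execute("CREATE TABLE rating (id TEXT, userid TEXT, rating TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO list VALUES (?, ?, ?)",
    [("1", "Movie 1", "comedy"), ("2", "movie 2", "action"), ("3", "movie 3", "thriller")],
)
conn.executemany(
    "INSERT INTO rating VALUES (?, ?, ?, ?)",
    [
        ("1", "xyz", "5", "12345612"),
        ("1", "abc", "4", "23232312"),
        ("2", "zvc", "1", "12321123"),
        ("2", "zyx", "2", "12312312"),
    ],
)

# id, name and genre are all grouped; rating is reduced by AVG.
averages = conn.execute(
    "SELECT list.id, list.name, list.genre, AVG(CAST(rating.rating AS FLOAT)) "
    "FROM list LEFT JOIN rating ON list.id = rating.id "
    "GROUP BY list.id, list.name, list.genre "
    "ORDER BY list.id DESC"
).fetchall()

for row in averages:
    print(row)
```

Because of the LEFT JOIN, movie 3 still appears, with a NULL (None) average since it has no ratings.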
Thank you for your explanation. I might save the following question for the next thread but here is my observation:
The overall job is slower when grouping and joining are done together than when they are done in two separate queries. For the same job, I changed the code a bit to perform the grouping first and then join the data, and the overall time dropped by 40 seconds: earlier it took 140 seconds and now it takes 100 seconds. Any reasons for that?
Once again thank you for your explanation.
I came across the same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put the "charge_province" in the group by, the issue is gone. I don't know why.
