Create pivot table using KornShell - shell

I am trying to create a pivot table (csv) using KornShell (ksh).
This is the data (csv):
name | a1 | a2 | amount
----------------------------
product1 | 1 | 1000 | 1.5
product1 | 2 | 2000 | 2.6
product1 | 3 | 3000 | 1.2
product1 | 1 | 3000 | 2.1
product1 | 2 | 3000 | 4.1
product1 | 3 | 2000 | 3.1
The result should be:
a1 \ a2 | 1000 | 2000 | 3000
--------|------|------|-----
1       | 1.5  |      | 2.1
2       |      | 2.6  | 4.1
3       |      | 3.1  | 1.2
I want to "group" the data by the two attributes and create a table which contains the sums of the amount column for the respective attributes.
EDIT: The attributes a1 and a2 are dynamic. I do not know in advance which of them will exist and which will not, or how many attributes there will be at all.

It sounds like you are not using the database as effectively as you might. Here is a thought to help you make some headway: You can generate SQL using the shell.
You know what the beginning of each query is going to look like:
select a1,
So you would want to start building some report SQL:
Report_SQL="select a1, "
You then would need to generate the list of column expressions for an arbitrarily sized set of pivot columns (this is MySQL; other databases would require || concatenation):
select distinct concat('sum(case a2 when ', a2, ' then amount else null end) as _', a2, ',')
from my_database_table
order by 1
;
Because this is in the shell, it is easy to pull this into a variable as follows:
SQL=" select distinct concat('sum(case a2 when ', a2, ' then amount else null end) as _', a2,',') "
SQL+=" from my_database_table "
SQL+=" order by 1 "
# You would have to define a runsql command for your database platform.
a2_columns=$(runsql "$SQL")
At this point you will have an extra comma at the end of the a2_columns variable. That is easily removed:
a2_columns=${a2_columns%,}
Now we can concatenate these variables to create the report SQL that you really seem to need:
Report_SQL+="${a2_columns}"
Report_SQL+=" from my_database_table "
Report_SQL+=" group by 1"
Report_SQL+=" order by 1"
The resulting report SQL would look something like this:
select a1,
sum(case a2 when 1000 then amount else null end) as _1000,
sum(case a2 when 2000 then amount else null end) as _2000,
sum(case a2 when 3000 then amount else null end) as _3000
from my_database_table
group by 1
order by 1
;
The formatting of the report header is left as an exercise to the reader. :)
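One possible starting point for that exercise, if you also want the CSV header generated dynamically, is another small metadata query (a sketch; MySQL-specific, with the _-prefixed names matching the column aliases above):
select concat('a1,', group_concat(distinct concat('_', a2) order by a2 separator ','))
from my_database_table;
Captured into a shell variable the same way as a2_columns, this yields a header line such as a1,_1000,_2000,_3000.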

Related

Insert value based on min value greater than value in another row

It's difficult to explain the question well in the title.
I am inserting 6 values from (or based on values in) one row.
I also need to insert a value from a second row where:
The values in one column (ID) must be equal
The value in column (CODE) in the main source row must be IN (100, 200), whereas the other row must have a value of 300 or 400
The value in another column (OBJID) in the secondary row must be the lowest value above that in the primary row.
Source Table looks like:
OBJID | CODE | ENTRY_TIME | INFO | ID | USER
---------------------------------------------
1 | 100 | x timestamp| .... | 10 | X
2 | 100 | y timestamp| .... | 11 | Y
3 | 300 | z timestamp| .... | 10 | F
4 | 100 | h timestamp| .... | 10 | X
5 | 300 | g timestamp| .... | 10 | G
So to provide an example..
In my second table I want to insert OBJID, OBJID2, CODE, ENTRY_TIME, substr(INFO(...)), ID, USER
i.e. from my example a line inserted in the second table would look like:
OBJID | OBJID2 | CODE | ENTRY_TIME | INFO | ID | USER
-----------------------------------------------------------
1 | 3 | 100 | x timestamp| substring | 10 | X
4 | 5 | 100 | h timestamp| substring2| 10 | X
My insert for everything that just comes from one row works fine.
INSERT INTO TABLE2
(ID, OBJID, INFO, USER, ENTRY_TIME)
SELECT ID, OBJID, DECODE(CODE, 100, (SUBSTR(INFO, 12,
LENGTH(INFO)-27)),
600,'CREATE') INFO, USER, ENTRY_TIME
FROM TABLE1
WHERE CODE IN (100,200);
I'm aware that I'll need to use an alias on TABLE1, but I don't know how to get the rest to work, particularly in an efficient way. There are 2 million rows right now, but there will be closer to 20 million once I start using production data.
You could try this:
select primary.* ,
(select min(objid)
from table1 secondary
where primary.objid < secondary.objid
and secondary.code in (300,400)
and primary.id = secondary.id
) objid2
from table1 primary
where primary.code in (100,200);
Ok, I've come up with:
select OBJID,
min(case when code in (300,400) then objid end)
over (partition by id order by objid
range between 1 following and unbounded following
) objid2,
CODE, ENTRY_TIME, INFO, ID, USER1
from table1;
So you need an INSERT ... SELECT of the above query, with a WHERE objid2 IS NOT NULL AND code IN (100, 200) filter applied, as sketched below.
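Putting that together, a sketch of the final statement might look like this (the TABLE2 column list is assumed from the desired output in the question, and the DECODE/SUBSTR expression from the original INSERT can be substituted for INFO as needed):
INSERT INTO TABLE2
  (OBJID, OBJID2, CODE, ENTRY_TIME, INFO, ID, USER1)
SELECT OBJID, OBJID2, CODE, ENTRY_TIME, INFO, ID, USER1
FROM (
  SELECT OBJID,
         MIN(CASE WHEN code IN (300,400) THEN objid END)
           OVER (PARTITION BY id ORDER BY objid
                 RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) OBJID2,
         CODE, ENTRY_TIME, INFO, ID, USER1
  FROM table1
)
WHERE objid2 IS NOT NULL
  AND code IN (100,200);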

Two date columns in source to decide the latest updated record in informatica

I have a requirement as below:
I have a source table like
id | name | address | updt_date_1 | updt_date_2
1 | abc | xyz | 2000-01-01 | 1999-01-01
1 | abc | pqr | 2001-01-01 | 1999-01-01
2 | lmn | ghi | 1999-01-01 | 1999-01-01
2 | lmn | stu | 2000-01-01 | 2008-01-01
I would want to get in target as:
1 | abc | pqr
2 | lmn | stu
i.e. I would want the record with the latest update date in either of the two date columns - updt_date_1 or updt_date_2.
Please suggest how this can be implemented in Informatica PowerCenter.
This requirement can be achieved in an effective way by using just three transformations (Source Qualifier, Expression and Filter). Please see the steps below.
1) Use the following SQL override in the Source Qualifier transformation to reduce the two update-date fields into one:
SELECT
id
,name
,address
,CASE WHEN updt_date_1 > updt_date_2 THEN updt_date_1 ELSE updt_date_2 END AS updt_date
FROM source_table
ORDER BY id, updt_date DESC
Now the first row for each id will be the required record.
2) Use an expression transformation to flag the first row of each id. Use the following ports in the same order in the expression transformation (prefix o_ means output port, v_ means variable port and i_ means input port)
PORT EXPRESSION
v_FIRST_ROW_FLAG - IIF(v_PREV_ID = i_id, 'N', 'Y')
v_PREV_ID - i_id
o_FIRST_ROW_FLAG - v_FIRST_ROW_FLAG
3) Next add a Filter transformation to filter out records which do not satisfy the following condition:
IIF(o_FIRST_ROW_FLAG = 'Y', TRUE, FALSE)
Connect this filter transformation to the target definition. This will give you the expected output.
Basically we have to determine the maximum of updt_date_1 and updt_date_2 per group, and then pick the record that carries the later of the two.
Use a Source Qualifier and then sort the data based on id, name.
Add an Aggregator after it. Pull in the id, name, updt_date_1, updt_date_2 columns. Create two o/p columns - max_upd_dt1, max_upd_dt2 - and calculate MAX(updt_date_1), MAX(updt_date_2) respectively. Set group by to id, name.
Use a Joiner to join the sorter output and the aggregator output based on id, name. So now you have two extra columns - max_upd_dt1 and max_upd_dt2.
Use an Expression transformation after the Joiner. Pull all columns in. Create two output ports and set the logic like below -
out_upd_dt1 = iif( max_upd_dt1 > max_upd_dt2, max_upd_dt1, updt_date_1 )
out_upd_dt2 = iif( max_upd_dt1 < max_upd_dt2, max_upd_dt2, updt_date_2 )
Use another Source Qualifier (sorted by id, name) and join it with the above expression tx. Join based on -
id=id, name=name, out_upd_dt1=updt_date_1, out_upd_dt2=updt_date_2
Pick up id, name, address.
HTH
Koushik
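For reference, the record this mapping is meant to keep can also be expressed in plain SQL (a sketch only, to illustrate the intended result rather than any Informatica object; it assumes the source database supports GREATEST and analytic functions):
SELECT id, name, address
FROM (
    SELECT id, name, address,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY GREATEST(updt_date_1, updt_date_2) DESC
           ) AS rn
    FROM source_table
) t
WHERE rn = 1;
For id 1 this keeps the pqr row (latest date 2001-01-01) and for id 2 the stu row (latest date 2008-01-01), matching the expected target.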

Build adjacency matrix from list of weighted edges in BigQuery

Related issue:
How to create dummy variable columns for thousands of categories in Google BigQuery
I have a table of weighted edges, which is a list of user-item ratings; it looks like this:
| userId | itemId | rating
| 001 | 001 | 5.0
| 001 | 002 | 4.0
| 002 | 001 | 4.5
| 002 | 002 | 3.0
I want to convert this weighted edge list into an adjacency matrix:
| userId | item001 | item002
| 001 | 5.0 | 4.0
| 002 | 4.5 | 3.0
According to this post, we can do it in two steps: the first step is to extract the matrix entries' values to generate a query, and the second step is to run the query generated in the first step.
But my question is how to extract the rating value and use it in the IF() statement. My intuition is to put a nested query inside the IF() statement, like this:
IF(itemId = blah,
(select rating
from mytable
where
userId = blahblah
and itemId = blah),
0)
But this query looks too expensive, can someone give me an example?
Thanks
Unless I am missing something - it is quite similar to the post you referenced
Step 1 - generate query
SELECT 'SELECT userID, ' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(itemId = "' + STRING(itemId) + '", rating, 0)) AS item' + STRING(itemId)
)
+ ' FROM YourTable GROUP BY userId'
FROM (
SELECT itemId
FROM YourTable
GROUP BY itemId
)
Step 2 - run generated query
SELECT
userID,
SUM(IF(itemId = "001", rating, 0)) AS item001,
SUM(IF(itemId = "002", rating, 0)) AS item002
FROM YourTable
GROUP BY userId
Result as expected
userID item001 item002
001 5.0 4.0
002 4.5 3.0
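As a side note, if the table is queried with BigQuery Standard SQL rather than the legacy dialect used above, the same step-1 generation can be sketched with STRING_AGG (assuming itemId is stored as a string, as in the sample; otherwise a CAST is needed):
SELECT CONCAT(
  'SELECT userId, ',
  STRING_AGG(
    CONCAT('SUM(IF(itemId = "', itemId, '", rating, 0)) AS item', itemId),
    ', ' ORDER BY itemId),
  ' FROM YourTable GROUP BY userId')
FROM (
  SELECT DISTINCT itemId
  FROM YourTable
);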

How to transpose/pivot data in hive?

I know there's no direct way to transpose data in Hive. I followed this question: Is there a way to transpose data in Hive?, but as there is no final answer there, I could not get all the way.
This is the table I have:
| ID | Code | Proc1 | Proc2 |
| 1 | A | p | e |
| 2 | B | q | f |
| 3 | B | p | f |
| 3 | B | q | h |
| 3 | B | r | j |
| 3 | C | t | k |
Here Proc1 can have any number of values. ID, Code & Proc1 together form a unique key for this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 is the value in that column for the corresponding row. In essence, I'm trying to get something like:
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |
In the new transformed table, ID and Code are the only primary key. From the ticket I mentioned above, I could get this far using the to_map UDAF. (Disclaimer - this may not be a step in the right direction, but I am just mentioning it here, in case it is.)
| ID | Code | Map_Aggregation |
| 1 | A | {p:e} |
| 2 | B | {q:f} |
| 3 | B | {p:f, q:h, r:j } |
| 3 | C | {t:k} |
But don't know how to get from this step to the pivot/transposed table I want.
Any help on how to proceed will be great!
Thanks.
Here is the approach I used to solve this problem using Hive's internal UDF function "map":
select
b.id,
b.code,
concat_ws('',b.p) as p,
concat_ws('',b.q) as q,
concat_ws('',b.r) as r,
concat_ws('',b.t) as t
from
(
select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from (
select
id,
code,
map(proc1,proc2) as group_map
from
test_sample
) a
group by
a.id,
a.code
) b;
"concat_ws" and "map" are hive udf and "collect_list" is a hive udaf.
Here is the solution I ended up using:
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
select
id,
code,
group_map['p'] as p,
group_map['q'] as q,
group_map['r'] as r,
group_map['t'] as t
from ( select
id, code,
collect(proc1,proc2) as group_map
from test_sample
group by id, code
) gm;
The collect UDAF used here comes from the brickhouse repo: https://github.com/klout/brickhouse
Yet another solution.
Pivot using Hivemall to_map function.
SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t
uid c1 c2 c3
101 11 12 13
102 21 22 23
Unpivot
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
uid key value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23
I have not written this code, but I think you can use some of the UDFs provided by klout's brickhouse: https://github.com/klout/brickhouse
Specifically, you could do something like use their collect as mentioned here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
and then explode the arrays (they will be of differing length) using the methods detailed in this post http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra
I have created one dummy table called hive using the below query:
create table hive (id Int,Code String, Proc1 String, Proc2 String);
Loaded all the data into the table:
insert into hive values('1','A','p','e');
insert into hive values('2','B','q','f');
insert into hive values('3','B','p','f');
insert into hive values('3','B','q','h');
insert into hive values('3','B','r','j');
insert into hive values('3','C','t','k');
Now use the below query to achieve the output.
select id,code,
case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
from(
select id, code,
case when proc1 ='p' then proc2 end as p,
case when proc1 ='q' then proc2 end as q,
case when proc1 ='r' then proc2 end as r,
case when proc1 ='t' then proc2 end as t
from hive
) dummy group by id,code;
In the case of numeric values you can use the below Hive query:
Sample data
ID cust_freq Var1 Var2 frequency
220444 1 16443 87128 72.10140547
312554 6 984 7339 0.342452643
220444 3 6201 87128 9.258396518
220444 6 47779 87128 2.831972441
312554 1 6055 7339 82.15209213
312554 3 12868 7339 4.478333954
220444 2 6705 87128 15.80822558
312554 2 37432 7339 13.02712127
select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
( select id,
map(cust_freq,frequency) as group_map
from table
) a group by a.id having id in
( '220444',
'312554');
ID one two three six
220444 72.10140547 15.80822558 9.258396518 2.831972441
312554 82.15209213 13.02712127 4.478333954 0.342452643
In the above example I haven't used any custom UDF; it only uses built-in Hive functions.
Note: for a string value in the key, write the value as sum(a.group_map['1']) as One.
For pivoting (the Cost and Sales rows into side-by-side columns), we can simply use the below logic.
SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);
Below is also a way to unpivot (monthly columns back into rows):
SELECT TM1_Code, Product, Size, State_code, Description
, Promo_date
, Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
, MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
You can use case statements and some help from collect_set to achieve this. You can check the detailed answer at http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/
Here is the query for reference,
SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;

How many Includes can I use on an ObjectSet in Entity Framework and retain performance?

I am using the following LINQ query for my profile page:
var userData = from u in db.Users
.Include("UserSkills.Skill")
.Include("UserIdeas.IdeaThings")
.Include("UserInterests.Interest")
.Include("UserMessengers.Messenger")
.Include("UserFriends.User.UserSkills.Skill")
.Include("UserFriends1.User1.UserSkills.Skill")
.Include("UserFriends.User.UserIdeas")
.Include("UserFriends1.User1.UserIdeas")
where u.UserId == userId
select u;
It has a long object graph and uses many Includes. It is running perfectly right now, but when the site has many users, will it impact performance much?
Should I do it in some other way?
A query with includes returns a single result set, and the number of includes affects how big a data set is transferred from the database server to the web server. Example:
Suppose we have an entity Customer (Id, Name, Address) and an entity Order (Id, CustomerId, Date). Now we want to query a customer with her orders:
var customer = context.Customers
.Include("Orders")
.SingleOrDefault(c => c.Id == 1);
The resulting data set will have the following structure:
Id | Name | Address | OrderId | CustomerId | Date
---------------------------------------------------
1 | A | XYZ | 1 | 1 | 1.1.
1 | A | XYZ | 2 | 1 | 2.1.
It means that the Customer data is repeated for each Order. Now let's extend the example with two more entities - OrderLine (Id, OrderId, ProductId, Quantity) and Product (Id, Name). Now we want to query a customer with her orders, order lines and products:
var customer = context.Customers
.Include("Orders.OrderLines.Product")
.SingleOrDefault(c => c.Id == 1);
The resulting data set will have the following structure:
Id | Name | Address | OrderId | CustomerId | Date | OrderLineId | LOrderId | LProductId | Quantity | ProductId | ProductName
------------------------------------------------------------------------------------------------------------------------------
1 | A | XYZ | 1 | 1 | 1.1. | 1 | 1 | 1 | 5 | 1 | AA
1 | A | XYZ | 1 | 1 | 1.1. | 2 | 1 | 2 | 2 | 2 | BB
1 | A | XYZ | 2 | 1 | 2.1. | 3 | 2 | 1 | 4 | 1 | AA
1 | A | XYZ | 2 | 1 | 2.1. | 4 | 2 | 3 | 6 | 3 | CC
As you can see, the data becomes heavily duplicated. Generally, each include of a reference navigation property (Product in the example) will add new columns, and each include of a collection navigation property (Orders and OrderLines in the example) will add new columns and duplicate the already created rows for each row in the included collection.
It means that your example can easily have hundreds of columns and thousands of rows, which is a lot of data to transfer. The correct approach is to create performance tests, and if the results do not satisfy your expectations, you can modify your query and load navigation properties separately with their own queries or with the LoadProperty method.
Example of separate queries:
var customer = context.Customers
.Include("Orders")
.SingleOrDefault(c => c.Id == 1);
var orderLines = context.OrderLines
.Include("Product")
.Where(l => l.Order.Customer.Id == 1)
.ToList();
Example of LoadProperty:
var customer = context.Customers
.SingleOrDefault(c => c.Id == 1);
context.LoadProperty(customer, c => c.Orders);
Also, you should always load only the data you really need.
Edit: I just created a proposal on Data UserVoice to support an additional eager loading strategy, where eager loaded data would be passed in an additional result set (created by a separate query within the same database roundtrip). If you find this improvement interesting, don't forget to vote for the proposal.
You can improve the performance of many includes by splitting them into two or more smaller data requests to the database, as shown below (filtering on the same user as in the question).
In my experience, use a maximum of two includes per query, like below. More than that will give really bad performance.
var userData = (from u in db.Users
                    .Include("UserSkills.Skill")
                    .Include("UserIdeas.IdeaThings")
                where u.UserId == userId
                select u).FirstOrDefault();

userData = (from u in db.Users
                .Include("UserFriends.User.UserSkills.Skill")
                .Include("UserFriends1.User1.UserSkills.Skill")
            where u.UserId == userId
            select u).FirstOrDefault();
The above brings back smaller data sets, at the cost of more round trips to the database.
Yes it will. Avoid using Include if it expands multiple detail rows on a master table row.
I believe EF converts the query into one large join instead of several queries. Therefore, you'll end up duplicating your master table data over every row of the details table.
For example: Master -> Details. Say, master has 100 rows, Details has 5000 rows (50 for each master).
If you lazy-load the details, you return 100 rows (size: master) + 5000 rows (size: details).
If you use .Include("Details"), you return 5000 rows (size: master + details). Essentially, the master portion is duplicated over 50 times.
It multiplies upwards if you include multiple tables.
Check the SQL generated by EF.
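For a rough illustration only (the SQL EF actually emits is more verbose, with generated aliases and nested projections, and the table/column names here are just the Master/Details example above), the shape of such a query is a join that repeats every master column on each detail row:
SELECT m.Id, m.Name, d.Id AS DetailId, d.MasterId, d.Value
FROM Master m
LEFT OUTER JOIN Details d ON d.MasterId = m.Id;
With 100 master rows and 50 details per master, that is 5000 result rows, each carrying its own copy of the master columns.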
I would recommend you to perform load tests and measure the performance of the site under stress. If you are performing complex queries on each request you may consider caching some results.
The result of Include may change: it depends on the entity that calls the Include method.
As in the example proposed by Ladislav Mrnka, suppose that we have an entity
Customer (Id, Name, Address)
that maps to this table:
Id | Name | Address
-----------------------
C1 | Paul | XYZ
and an entity Order (Id, CustomerId, Total)
that maps to this table:
Id | CustomerId | Total
-----------------------
O1 | C1 | 10.00
O2 | C1 | 13.00
The relation is one Customer to many Orders
Example 1: Customer => Orders
var customer = context.Customers
.Include("Orders")
.SingleOrDefault(c => c.Id == "C1");
The LINQ will be translated into a fairly complex SQL query.
In this case the query will produce two records, and the information about the customer will be replicated:
Customer.Id | Customer.Name | Order.Id | Order.Total
-----------------------------------------------------------
C1 | Paul | O1 | 10.00
C1 | Paul | O2 | 13.00
Example 2: Order => Customer
var order = context.Orders
.Include("Customer")
.SingleOrDefault(c => c.Id == "O1");
The LINQ will be translated into a simple SQL join.
In this case the query will produce only one record, with no duplication of information:
Order.Id | Order.Total | Customer.Id | Customer.Name
-----------------------------------------------------------
O1 | 10.00 | C1 | Paul
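Roughly, the translated SQL for this second example has the shape below (illustrative only - the table names and aliases are assumptions, and EF's real output differs in detail):
SELECT o.Id, o.Total, c.Id AS CustomerId, c.Name
FROM Orders o
INNER JOIN Customers c ON c.Id = o.CustomerId
WHERE o.Id = 'O1';
Because an Order references at most one Customer, the join cannot multiply rows, which is why there is no duplication here.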
