Creating multiple extracts from a single mapping in Informatica - informatica-powercenter

I have the below source table.
Col1 Col2 Col3 Col4 col5 col6 col7 col8
A 2 1 2 3 4 5 AAA
B 3 1 1 8 5 6 AAA
C 4 1 2 9 6 7 CC
D 5 2 3 10 7 8 CC
E 2 2 4 11 8 9 CC
F 3 3 5 12 9 10 BB
G 4 3 6 13 10 11 BB
H 5 3 7 14 11 12 BB
I 6 3 8 15 12 13 BB
I want to create a single mapping (1 source and 1 target structure) which should create three extracts from the above source, as below. Each extract will have a different number of columns, based on some specific id.
Extract 1
Col1 Col2 Col3 Col4 col8
A 2 1 2 AAA
B 3 1 1 AAA
Extract 2
Col1 Col2 Col3 Col4 col5 col6 col7 col8
C 4 1 2 9 6 7 CC
D 5 2 3 10 7 8 CC
E 2 2 4 11 8 9 CC
Extract 3
Col1 Col2 col7 col8
F 3 10 BB
G 4 11 BB
H 5 12 BB
I 6 13 BB
I don't want to create three different target structures.
Please let me know if anyone has any idea on that.

You can split the flow into three pipelines with three Target Definitions that will all point to the same Target Table at session level. This keeps things clear, and effectively everything will end up in the same target structure.
Otherwise, if for some reason you'd like to avoid having three target instances, you can use a Union Transformation to span the three pipelines and send all data to one target instance.
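If it helps to see what that Router/Union flow does with this data, below is a minimal pandas sketch of the same logic (an analogy only, not Informatica code; the routing values in col8 and the column subsets per extract are taken from the question):

import pandas as pd

# A few sample rows from the question's source table
source = pd.DataFrame(
    [['A', 2, 1, 2, 3, 4, 5, 'AAA'],
     ['C', 4, 1, 2, 9, 6, 7, 'CC'],
     ['F', 3, 3, 5, 12, 9, 10, 'BB']],
    columns=['Col1', 'Col2', 'Col3', 'Col4', 'col5', 'col6', 'col7', 'col8'])

# "Router": one pipeline per col8 value, each keeping its own set of columns
extract1 = source.loc[source['col8'] == 'AAA', ['Col1', 'Col2', 'Col3', 'Col4', 'col8']]
extract2 = source.loc[source['col8'] == 'CC']  # keeps all columns
extract3 = source.loc[source['col8'] == 'BB', ['Col1', 'Col2', 'col7', 'col8']]

# "Union" into one target structure; columns a pipeline does not supply become NULL/NaN
target = pd.concat([extract1, extract2, extract3], ignore_index=True)
print(target)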

Related

Pandas Sort Order of Columns by Row

Given the following data frame:
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3', '7'],
                   'B': [7, 6, 5, 4],
                   'C': [5, 6, 7, 1],
                   'D': [1, 9, 9, 8]})
df = df.set_index('A')
df
B C D
A
1 7 5 1
2 6 6 9
3 5 7 9
7 4 1 8
I want to sort the order of the columns in descending order of the values in the bottom row, like this:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Thanks in advance!
The easiest way is to take the transpose, call sort_values, then transpose back.
df.T.sort_values('7', ascending=False).T
or
df.T.sort_values(df.index[-1], ascending=False).T
Gives:
D B C
A
1 1 7 5
2 9 6 6
3 9 5 7
7 8 4 1
Testing
my solution
%%timeit
df.T.sort_values(df.index[-1], ascending=False).T
1000 loops, best of 3: 444 µs per loop
alternative solution
%%timeit
df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]
1000 loops, best of 3: 525 µs per loop
You can use sort_values (passing the label of your last row as by) with axis=1; add ascending=False to get the descending order you want:
df.sort_values(by=df.index[-1], axis=1, ascending=False, inplace=True)
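A self-contained version of that one-liner, using the frame from the question, might look like this (sketch; the resulting column order D, B, C follows from the values in the last row):

import pandas as pd

# DataFrame from the question
df = pd.DataFrame({'A': ['1', '2', '3', '7'],
                   'B': [7, 6, 5, 4],
                   'C': [5, 6, 7, 1],
                   'D': [1, 9, 9, 8]}).set_index('A')

# Sort the columns (axis=1) by the values in the last row, largest first
df = df.sort_values(by=df.index[-1], axis=1, ascending=False)
print(df)  # columns come out as D, B, C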
Here is a variation that does not involve transposition:
df = df[[c for c in sorted(list(df.columns), key=df.iloc[-1].get, reverse=True)]]

Hive - add column with number of distinct values in groups

Suppose I have the following data.
number group
1 a
1 a
3 a
4 a
4 a
5 c
6 b
6 b
6 b
7 b
8 b
9 b
10 b
14 b
15 b
I would like to group the data by group and add a further column which says how many distinct values of number each group has.
My desired output would look as follows:
number group dist_number
1 a 3
1 a 3
3 a 3
4 a 3
4 a 3
5 c 1
6 b 7
6 b 7
6 b 7
7 b 7
8 b 7
9 b 7
10 b 7
14 b 7
15 b 7
What I have tried is:
> select *, count(distinct number) over(partition by group) from numbers;
11 11
As one can see, this aggregates globally and calculates the number of distinct values independently of the group.
One thing I could do is to use group by as follows:
hive> select *, count(distinct number) from numbers group by group;
a 3
b 7
c 1
And then join on group.
But I thought maybe there is an easier solution, e.g., using the over(partition by group) method?
You definitely want to use windowing functions here. I'm not exactly sure how you got 11 11 from the query you tried; I'm 99% sure that if you try to count(distinct _) in Hive with an over/partition it will complain. To get around this you can use collect_set() to get an array of the distinct elements in the partition, and then use size() to count the elements.
Query:
select *
, size(num_arr) dist_num
from (
select *
, collect_set(num) over (partition by grp) num_arr
from db.tbl ) x
Output:
4 a [4,3,1] 3
4 a [4,3,1] 3
3 a [4,3,1] 3
1 a [4,3,1] 3
1 a [4,3,1] 3
15 b [15,14,10,9,8,7,6] 7
14 b [15,14,10,9,8,7,6] 7
10 b [15,14,10,9,8,7,6] 7
9 b [15,14,10,9,8,7,6] 7
8 b [15,14,10,9,8,7,6] 7
7 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
6 b [15,14,10,9,8,7,6] 7
5 c [5] 1
I included the arrays in the output so you could see what is happening; obviously you can discard them in your query. As a note, doing a self-join here is really a disaster with regard to performance (and it's more lines of code).
As per your requirement, this may work:
select number,group1,COUNT(group1) OVER (PARTITION BY group1) as dist_num from table1;
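As an aside, if you ever need the same "distinct count per group, repeated on every row" in pandas rather than Hive, groupby().transform('nunique') expresses it directly. A small sketch using the sample data from the question (a cross-check only, not part of the Hive solution):

import pandas as pd

# Sample data from the question
df = pd.DataFrame({'number': [1, 1, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10, 14, 15],
                   'group':  ['a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b',
                              'b', 'b', 'b', 'b', 'b', 'b']})

# Count distinct 'number' values per 'group' and broadcast the count to every row
df['dist_number'] = df.groupby('group')['number'].transform('nunique')
print(df)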

How does pandas merge really sort its result when using its default, sort=False?

I'm a bit confused about how the default sort works when I run a merge/join with two pandas dataframes. I would expect the order of my result set from
A.merge(B, how='left', on=['Col1', 'Col2'])
to be the order A was previously sorted in, but in my experience the ordering is slightly off, and it follows the keys of A. I find that the order of A is maintained, except that when keys are duplicated, the result set is ordered by those keys. Below is a crude example of what I have been seeing.
A is:
Col1 Col2 Col3
1 4 5 6
2 6 6 8
3 2 4 5
4 4 5 3
B is:
Col1 Col2 Col4
1 6 6 0
2 2 4 5
3 4 5 7
A.merge(B, how='left', on=['Col1', 'Col2'])
Col1 Col2 Col3 Col4
1 4 5 6 7
2 4 5 3 7
3 6 6 8 0
4 2 4 5 5
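For anyone who wants to experiment with this, here is a minimal reproduction of the two frames from the question (the leftmost numbers in the post are just the index); the row order you observe in the result may depend on your pandas version:

import pandas as pd

# Frames A and B from the question
A = pd.DataFrame({'Col1': [4, 6, 2, 4],
                  'Col2': [5, 6, 4, 5],
                  'Col3': [6, 8, 5, 3]})
B = pd.DataFrame({'Col1': [6, 2, 4],
                  'Col2': [6, 4, 5],
                  'Col4': [0, 5, 7]})

# Left merge with the default sort=False; compare the row order of the result
# with the row order of A to see how duplicated keys are handled
result = A.merge(B, how='left', on=['Col1', 'Col2'])
print(result)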

Want to generate output as below in Oracle

I need output as below.
1,1
2,1
2,2
3,1
3,2
3,3
4,1
4,2
4,3
4,4
... and so on.
I tried to write the query as below, but it throws an error: single-row subquery returns more than one row.
with test1 as (
  SELECT LEVEL n
  FROM DUAL
  CONNECT BY LEVEL < 59)
select n, (
  SELECT LEVEL n
  FROM DUAL
  CONNECT BY LEVEL < n) from test1
I'd appreciate your help in solving this.
Here is one method for getting the desired result: generate the numbers once, then self-join that row set on a.col >= b.col, so each value of a.col is paired with every value up to and including itself:
SQL> with t1(col) as(
2 select level
3 from dual
4 connect by level <= 5
5 )
6 select a.col
7 , b.col
8 from t1 a
9 join t1 b
10 on a.col >= b.col
11 ;
COL COL
---------- ----------
1 1
2 1
2 2
3 1
3 2
3 3
4 1
4 2
4 3
4 4
5 1
5 2
5 3
5 4
5 5
15 rows selected

disappearing row names when using apply

Consider the following example (the values in the vectors are target-practice results, and I'm trying to automagically sort by shooting score). We generate three vectors, sort the values in columns 1:20 in ascending order, and sort the rows in descending order based on the out.tot column.
# Generate data
shooter1 <- round(runif(n = 20, min = 1, max = 10))
shooter2 <- round(runif(n = 20, min = 1, max = 10))
shooter3 <- round(runif(n = 20, min = 1, max = 10))
out <- data.frame(t(data.frame(shooter1, shooter2, shooter3)))
colnames(out) <- 1:ncol(out)
out.sort <- t(apply(out, 1, sort, na.last = FALSE))
out.tot <- apply(out , 1, sum)
colnames(out.sort) <- 1:ncol(out.sort)
out2 <- cbind(out.sort, out.tot)
out3 <- apply(out2, 2, sort, decreasing = TRUE, na.last = FALSE)
out2 has row names attached, while out3 has lost them. The only difference is that I used MARGIN = 2, which is probably the culprit (because it takes the data in column by column). I can match the rows by hand, but is there a way to keep the row names in out3 from disappearing?
> out2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 out.tot
shooter1 1 2 2 3 3 3 4 5 5 5 6 6 6 6 6 7 8 9 9 10 106
shooter2 1 3 3 3 3 4 4 4 5 5 5 5 5 6 7 8 8 9 9 10 107
shooter3 1 1 2 2 2 3 3 4 5 5 5 6 6 6 6 7 8 8 8 9 97
> out3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 out.tot
[1,] 1 3 3 3 3 4 4 5 5 5 6 6 6 6 7 8 8 9 9 10 107
[2,] 1 2 2 3 3 3 4 4 5 5 5 6 6 6 6 7 8 9 9 10 106
[3,] 1 1 2 2 2 3 3 4 5 5 5 5 5 6 6 7 8 8 8 9 97
If I understand your example, going from out2 to out3 you are sorting each column independently - meaning that the values in row 1 may not all come from the data generated by shooter1. It makes sense, then, that the row names are dropped, inasmuch as row names are names of observations and you are no longer keeping data from one observation on one row.
