"substr" statement in apache pig - hadoop

I have below user data structure in apache hadoop
21796346,83637,2990666,1,2,false,0,0
21827841,15748,8754621,1,7,true,0,1
First 4 digits of the 1st field represent the user type.
2nd field represents the department type.
I would like to query the number of user types in each department.
SQL statement is below
select dept_id, substr(User_Id,1,4) as user_type, count(*) as number_of_users from users group by dept_id,substr(User_Id,1,4)
I could not figure out how to define substr function in pig.

You could youse SUBSTRING in PIG
A = LOAD 'DATA' USING PigStorage(';') AS (User_Id, var1, var2, var3, var4, var5, var6, var7);
B = GROUP A By SUBSTRING(User_Id,1,4);
C = FOREACH B GENERATE group as user_typeX, COUNT(A) as number_of_users_with_the_same_user_typeX;
To get the number of all users you could GROUP BY ALL.

You can find the complete list of Pig's built-in functions here. The function you are looking for is called SUBSTRING. Note that function names in Pig are case-sensitive.

Related

Can I use FOR ALL ENTRIES with GROUP BY?

Currently the code looks something like this:
LOOP AT lt_orders ASSIGNING <fs_order>.
SELECT COUNT(*) AS cnt
FROM order_items
INTO <fs_order>-cnt
WHERE order_id = <fs_order>-order_id.
ENDLOOP.
It is the slowest part of the report. I want to speed it up.
How can I use FOR ALL ENTRIES with GROUP BY?
Check the documentation. You can't use GROUP BY. Maybe in this case, you could try selecting your items with FAE outside of the loop, then count them using a parallel cursor:
REPORT.
TYPES: BEGIN OF ty_result,
vbeln TYPE vbeln,
cnt TYPE i.
TYPES: END OF ty_result.
DATA: lt_headers TYPE SORTED TABLE OF ty_result WITH UNIQUE KEY vbeln,
lv_tabix TYPE sy-tabix VALUE 1.
"get the headers
SELECT vbeln FROM vbak UP TO 100 ROWS INTO CORRESPONDING FIELDS OF TABLE lt_headers.
"get corresponding items
SELECT vbeln, posnr FROM vbap FOR ALL ENTRIES IN #lt_headers
WHERE vbeln EQ #lt_headers-vbeln
ORDER BY vbeln, posnr
INTO TABLE #DATA(lt_items).
LOOP AT lt_headers ASSIGNING FIELD-SYMBOL(<h>).
LOOP AT lt_items FROM lv_tabix ASSIGNING FIELD-SYMBOL(<i>).
IF <i>-vbeln NE <h>-vbeln.
lv_tabix = sy-tabix.
EXIT.
ELSE.
<h>-cnt = <h>-cnt + 1.
ENDIF.
ENDLOOP.
ENDLOOP.
BREAK-POINT.
Or join header/item with a distinct count on the item id (whichever column that would be in your table).
You should be able to do something like
SELECT COUNT(order_item_id) AS cnt, order_id
FROM order_items
INTO CORRESPONDING FIELDS OF TABLE lt_count
GROUP BY order_id.
Assuming that order_item_id is a key in the order_items table. And assuming that lt_count has two fields: cnt of type int8 and order_id of same type as your other order_id fields
PS: then you can loop over lt_count and move the counts to lt_orders. Or the other way around. To speed up the loop, sort one of the tables and use READ ... BINARY SEARCH
I did with table KNB1 (customer master in company code), where we have customers, which are created in several company codes.
Please note, because of FOR ALL ENTRIES you have to SELECT the full key.
TYPES: BEGIN OF ty_knb1,
kunnr TYPE knb1-kunnr,
count TYPE i,
END OF ty_knb1.
TYPES: BEGIN OF ty_knb1_fae,
kunnr TYPE knb1-kunnr,
END OF ty_knb1_fae.
DATA: lt_knb1_fae TYPE STANDARD TABLE OF ty_knb1_fae.
DATA: lt_knb1 TYPE HASHED TABLE OF ty_knb1
WITH UNIQUE KEY kunnr.
DATA: ls_knb1 TYPE ty_knb1.
DATA: ls_knb1_db TYPE knb1.
START-OF-SELECTION.
lt_knb1_fae = VALUE #( ( kunnr = ... ) ). "add at least one customer which is created in several company codes
ls_knb1-count = 1.
SELECT kunnr bukrs
INTO CORRESPONDING FIELDS OF ls_knb1_db
FROM knb1
FOR ALL ENTRIES IN lt_knb1_fae
WHERE kunnr EQ lt_knb1_fae-kunnr.
ls_knb1-kunnr = ls_knb1_db-kunnr.
COLLECT ls_knb1 INTO lt_knb1.
ENDSELECT.
Create a range table for your lt_orders, like lt_orders_range.
Do select order_id, count( * ) where order_id in lt_orders_range.
If you think this is too much to create a range table, you will save a lot of performance by running just one select for all orders instead of single select for each order id.
Not directly, only through a CDS view
While all of the answers provide a faster solution than the one in the question, the fastest way is not mentioned.
If you have at least Netweaver 7.4, EHP 5 (and you should, it was released in 2014), you can use CDS views, even if you are not on HANA.
It still cannot be done directly, as OpenSQL does not allow FOR ALL ENTRIES with GROUP BY, and CDS views cannot handle FOR ALL ENTRIES. However, you can create one of each.
CDS:
#AbapCatalog.sqlViewName: 'zorder_i_fae'
DEFINE VIEW zorder_items_fae AS SELECT FROM order_items {
order_id,
count( * ) AS cnt,
}
GROUP BY order_id
OpenSQL:
SELECT *
FROM zorder_items_fae
INTO TABLE #DATA(lt_order_cnt)
FOR ALL ENTRIES IN #lt_orders
WHERE order_id = #lt_orders-order_id.
Speed
If lt_orders contains more than about 30% of all possible order_id values from table ORDER_ITEMS, the answer from iPirat is faster. (While using more memory, obviously)
However, if you need only a couple hunderd order_id values out of millions, this solution is about 10 times faster than any other answer, and 100 times faster than the original.

Count with WHERE clause from two tables in Oracle?

I have two tables, one called STUDENTS and the other CLASSES. I have to select all the students that are from the same class of one student, and this student has his own number id, and through this number id that I have to select everything.
TABLE STUDENTS
nr_rgm
nm_name
nm_father
nm_mother
dt_birth
id_sex
TABLE CLASSES
cd_class
nr_schoolyear
cd_school
cd_degree
nr_series
cd_class
cd_period
So I tried something like that :
SELECT count(*) FROM students, classes WHERE id_sex = 'M' AND
cd_class = (SELECT cd_class FROM classes WHERE nr_rgm = '12150');
But then it points an error, and the error is the follow :
single-row subquery returns more than one row
So, how can I make this work ?
you should use "in" and not "=" when applying subselects.
I think what you really would want to do is to simply join the two tables together rather than issuing a sub select:
SELECT count(*)
FROM students s, classes c
WHERE s.id_sex = 'M' AND c.nr_rgm = '12150' AND s.cd_class = c.cd_class;
This way you just tell the database: Please count all the occurrences where my students.id_sex = 'M' and my classes.nr_rgm = '12150' AND all found studends.cd_class match those of my classes.cd_class.
The reason why your statement above fails is because the ordinary = operation, when not used in a join, will only expect one single value, like you do with s.id_sex='M' while your statement returns multiple values. To cope with that you have to use the IN operator which operates on lists.
However, you can and will achieve the very same with just joining the two tables together, and it will be much more efficient on bigger data sets.
One more note to the example above. If classes.nr_rgm is a field of data type NUMBER, don't use the ' around the value 12150 as it will lead to implicit type conversion. With other words, '12150' is a string and will have to be converted to NUMBER first before doing a comparison.

Duplicate removal from a table using Informatica

I have a scenario to be implemented in informatica where I need to remove duplicate records from a table based on PK. But I need to keep the 1st occurrence of the PK values and remove the others(in case of duplicate PK).
For example, If my source has 1,1,1,2,3,3,4,5,4. I want to see my target data as 1,2,3,4,5. I have to read data from the same table and need to load into the same table., no new table can be introduced. please help me with your inputs.
Thanks in Advance!
I suppose you want the first occurrence because there are other (data) columns in addition to the key you entered. Therefore you want
1,b
1,c
1,a
2,d
3,c
3,d
4,e
5,f
4,b
Turned into
1,b
2,d
3,c
4,e
5,f
??
In that case try this mapping layout:
SRC -> SQ -> SRT -> AGG -> TGT
SEQ /
Where the sorter is set to sort on the KEY,sequence_port (desc)
And the aggregator is set to group by the KEY, and the sequence_port should not go out of the sorter
Hope you can follow me :)
There are multiple ways to do this, the simplest would be too do it in the SQL override.
Assuming the example quoted above, the SQL would be like this. General idea is to set a row number for a primary key ( so if you have 3 rows with same pk they will have 1,2,3 as row numbers before being reset for the next pk)
SQL:
select * from (
Select primary_key,column2 row_number() over (partition by primary_key order by primary_key) as distinct_key) where distinct_key=1
Before:
1,b
1,c
1,a
2,d
3,c
3,d
Intermediate query:
1,c,1
1,a,2
2,d,1
3,c,1
3,d,2
output:
1,c
2,d
3,d
I am able to achieve this by following the below steps.
1. Passing Sorted data(keys are EMP_ID, MOBILE, DEPTID) to an expression.
2. Creating the following variable ports in the expression and getting the counts.
V_CURR_EMP_ID = EMP_ID
V_CURR_MOBILE = MOBILE
V_CURR_DEPTID = DEPTID
V_COUNT =
IIF(V_CURR_EMP_ID=V_PREV_EMP_ID AND V_CURR_MOBILE=V_PREV_MOBILE AND V_CURR_DEPTID=V_PREV_DEPTID ,V_COUNT+1,1)
V_PREV_EMP_ID = EMP_ID
V_PREV_MOBILE = MOBILE
V_PREV_DEPTID = DEPTID
O_COUNT =V_COUNT
3. In the next transformation which is filter, I am taking only the records which have count more than 1 and deleting them using update strategy(DD_DELETE).
Here is the mapping flow.
SQ->SRTR->EXP->FIL->UPD->TGT
Also, when I tried to delete them using aggregator , it is deleting only the first occurrence of duplicates but not all.
Thanks again for your inputs!

HIVE equivalent of FIRST and LAST

I have a table with 3 columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1
set table1;
BY ID, CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependant on the order of rows in table1 in SAS, we aren't ordering by anything. I don't think row order is preserved when translating to a hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a by statement (BY ID CODE;), which tells SAS that the set dataset is sorted at those levels. So, not a random selection for first. and last..
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by Id order by code)fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by Id order by code)lcode
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then the fetch the table with where op_flag = 1
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code)fcode,
last_value(code) over (partition by id order by code)lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1) the BY group processing requires the input data to be sorted or indexed on BY variables, so though the code contains no ordering, the source data is processed in order. If the input data was not indexed/sorted, SAS will throw error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However the random aspect in SAS code/storage is based on physical order of rows, while the ROW_NUMBERs randomness within a group will be influenced by the implementation of the function in the engine.

Transforming hive IN subselect query combined with WHERE replacement

I know that one needs to replace IN query with semi-left-join (e.g. Hive doesn't support in, exists. How do I write the following query?), but I don't know how to combine it with a WHERE clause:
SELECT *
from foo
WHERE userId IN
(SELECT distinct(userId) FROM foo WHERE x=true ORDER BY RAND() LIMIT 100);
thanks.
EDIT: Changed query. Intention is to create a random sample of entries (statistics wise).
(Posting alternative approach for completeness.)
To sample a set of records from a table, you can use Hive's TABLESAMPLE syntax. For example, too select a random sample of 100 distinct userId's you would use:
SELECT userId
FROM (SELECT DISTINCT(userId) as userId FROM foo) f
TABLESAMPLE(100 ROWS);
The syntax allows you to specify your sample size in different ways. The following is also valid:
SELECT userId
FROM (SELECT DISTINCT(userId) as userId FROM foo) f
TABLESAMPLE(1 PERCENT);
For more details, check out the manual page for this topic.
Once you have your sample of userId's, you can use Manuel Aldana's earlier answer to select the corresponding records from your original table.
select id from foo
left semi join
(SELECT id_2 FROM bar WHERE x=true RAND() LIMIT 100) x
ON foo.id=x.id_2
Should be like this.
I just don't understand this part : x=true RAND()
Also, this doesn't handle nulls just like your query.

Resources