Load only particular field in PIG? - hadoop

This is my file:
Col1, Col2, Col3, Col4, Col5
I need only Col2 and Col3.
Currently I'm doing this:
a = load 'input' as (Col1:chararray,
Col2:chararray,
Col3:chararray,
Col4:chararray);
b = foreach a generate Col2, Col3;
Is there a way to directly load only Col2 and Col3, instead of loading the whole input and then generating the required columns?

Your method of only GENERATEing the columns you want is an effective way to do just what you ask. Remember that all of your data is stored on HDFS, and you're not loading it all into memory when you start your script. You still will have to read those bytes off the disk even if you are not keeping them around for use in your processing, so there is no performance advantage to never loading that data. The advantage comes in never having to send it to a reducer, which you have accomplished with your method.
In cases where Pig can tell that a column won't be used, it will "prune" it immediately, essentially doing for you what you did with your b = foreach a generate Col2, Col3;. This won't happen, however, if you are using a UDF that might access other fields, because Pig doesn't look inside the UDF to see if they get used. For example, suppose Col3 is an int. If you have
b = group a by Col2;
c = foreach b generate group, SUM(a.Col3);
then Pig will automatically prune the 1st and 4th columns for you, since it can see they're never used. However, if you instead did
b = group a by Col2;
c = foreach b generate group, COUNT(a);
then Pig can't prune, because it doesn't see inside the COUNT UDF and doesn't know that the other fields won't be used. When in doubt of whether Pig will do this pruning, you can use the foreach/generate method you already have. And Pig should print a diagnostic message when you start your script listing all the columns it was able to prune out.
If instead your problem is that you don't want to have to provide a full schema when you're interested in just a few columns, you can skip the schema entirely and put it in the GENERATE:
a = load 'input';
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;
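To make the positional projection concrete, here is a rough Python sketch of what selecting $1 and $2 amounts to. It assumes tab-delimited input (PigStorage's default delimiter); the function name project and the sample rows are made up for illustration and are not part of Pig.

```python
def project(lines, positions, delimiter="\t"):
    """Yield only the fields at the given zero-based positions of each line."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Every line is still read in full; only the selected fields are kept.
        yield tuple(fields[i] for i in positions)


# Hypothetical sample rows with five columns each.
rows = ["a1\tb1\tc1\td1\te1\n", "a2\tb2\tc2\td2\te2\n"]
projected = list(project(rows, positions=(1, 2)))
```

As in the Pig script, each full line is read and the unwanted fields are dropped right away, so they are never carried into later steps.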

Related

SAS join (or insert) little table to big table

I have a small problem. I have one big table and a few small tables, where each small table contains only some of the fields of the big table. How can I insert (or union) the tables so that fields with the same name are filled from the small table, and fields that the small table lacks are set to null/0 in the big table?
Example:
data temp1;
infile DATALINES dsd missover;
input a b c d e f g;
CARDS;
1, 2, 3, 4,5,6
2, 3, , 5
3, 3
4,,3,2,3,
;
run;
data temp2;
infile DATALINES dsd missover;
input a c e g;
CARDS;
5, 2, 3, 4
6, 3, , 5
7, 3
;
run;
Is there an elegant way to insert temp2 into temp1 so that the fields missing from temp2 are set to null in temp1?
Thank you for help!
That is exactly what SAS does by default.
data want ;
set have1 have2;
run;
It will match the variables by name and any variables that do not exist (in either source) will have missing values.
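As an illustration of that match-by-name behaviour, here is a small Python sketch. It is only an analogy, not SAS: rows are modelled as dicts, None stands in for a SAS missing value, and the function name concatenate and the sample data are assumptions.

```python
def concatenate(*tables):
    """Stack lists of dict rows, matching columns by name (like SET have1 have2)."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # A column absent from a row is filled with None (the "missing value").
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


have1 = [{"a": 1, "b": 2, "c": 3}]
have2 = [{"a": 5, "c": 2, "h": 9}]
want = concatenate(have1, have2)
```

Here the row from have1 gets a missing h, and the row from have2 gets a missing b.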
For better performance when appending a small table to a large table, you should use PROC APPEND instead of a data step, to avoid making a new copy of the large dataset. That is more like an "insert". The FORCE option allows the two datasets to have different variables. But since the new data is added to the old dataset, any extra variables that appear only in HAVE2 are simply ignored and their values are lost.
proc append base=have1 data=have2 force ;
run;
If you really did have to generate an actual INSERT statement (perhaps you are actually trying to generate SQL code to run in a foreign database) you might want to compare the metadata of the two datasets and find the common variables.
proc contents data=have1 out=cont1 noprint; run;
proc contents data=have2 out=cont2 noprint; run;
proc sql noprint;
select a.name into :varlist separated by ','
from cont2 a
inner join cont1 b
on upcase(a.name) = upcase(b.name)
;
...
insert into have1 (&varlist) select &varlist from have2 ;
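The idea of intersecting the two column lists and pasting the result into an INSERT statement can be sketched in Python as follows. The function name build_insert and the column lists are hypothetical; in the SAS version the lists come from the PROC CONTENTS output.

```python
def build_insert(target, source, target_cols, source_cols):
    """Build an INSERT ... SELECT that uses only columns present in both tables."""
    target_names = {name.upper() for name in target_cols}
    # Keep the source's column order; compare names case-insensitively.
    common = [name for name in source_cols if name.upper() in target_names]
    varlist = ",".join(common)
    return f"insert into {target} ({varlist}) select {varlist} from {source}"


statement = build_insert(
    "have1", "have2",
    target_cols=["a", "b", "c", "d", "e", "f", "g"],
    source_cols=["A", "c", "e", "g", "h"],
)
```

The column h exists only in the source, so it is left out of the generated statement.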
It is not very clear to me what operation you intend to do but some initial thoughts are:
To compare columns between two datasets (and check whether a value exists in one of them) it is good practice to use an outer join. You can do joins via MERGE clause in a datastep, or more elegantly use PROC SQL.
However, using either approach you will have to specify which two rows in temp1 and temp2 shall be compared - you are typically joining on a column that is available in both tables.
To help us resolve your issue, could you possibly provide the correct output for your desired operation, if you perform it on temp1 and temp2? This would show what options you've explored and what needs to be fixed there.
You should try PROC APPEND. It will be more efficient because it does not read your big table again, unlike the following data step:
/*reads temp1 which is big table and temp2*/
data temp3;
set temp1 temp2;
run;
/* this does pretty much same as above code but will not read your big table
and will be efficient*/
proc append base=temp1 data=temp2 force;
run;
More on PROC APPEND in the documentation: http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n19kwc3onglzh2n1l2k4e39edv3x.htm

HIVE equivalent of FIRST and LAST

I have a table with 5 columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1;
set table1;
BY ID CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependent on the order of rows in table1 in SAS; we aren't ordering by anything. I don't think row order is preserved when translating to a Hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a BY statement (BY ID CODE;), which tells SAS that the input dataset is sorted by those variables. So FIRST. and LAST. are not random selections.
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
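FIRST.CODE should translate to
first_value(code) over (partition by id order by code) as fcode
Similarly, LAST.CODE would be the following. The explicit frame is needed because, with an ORDER BY, the default window frame ends at the current row, so last_value would otherwise just return the current row's code:
last_value(code) over (partition by id order by code rows between unbounded preceding and unbounded following) as lcode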
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then fetch the rows from the table with where op_flag = 1.
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code) as fcode,
last_value(code) over (partition by id order by code rows between unbounded preceding and unbounded following) as lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1): BY-group processing requires the input data to be sorted or indexed on the BY variables, so although the code contains no ordering, the source data is processed in order. If the input data is not sorted or indexed, SAS will throw an error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However, the random aspect in the SAS code/storage is based on the physical order of rows, while ROW_NUMBER's randomness within a group depends on how the engine implements the function.
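For reference, the FIRST./LAST. logic of the SAS step in the question can be sketched in Python. This is only a model of the BY-group semantics, assuming the rows are already sorted by ID and CODE; the function name select_rows and the sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter


def select_rows(rows):
    """rows: (id, code, result) tuples, already sorted by id and code."""
    kept = []
    for _, group in groupby(rows, key=itemgetter(0, 1)):
        group = list(group)
        last_index = len(group) - 1
        for index, row in enumerate(group):
            first, last = index == 0, index == last_index
            # Mirrors: IF FIRST.CODE and RESULT='A' THEN OUTPUT;
            if first and row[2] == "A":
                kept.append(row)
            # Mirrors: ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
            elif last and row[2] != "A":
                kept.append(row)
    return kept


rows = [
    (1, "x", "A"), (1, "x", "B"),  # first row is 'A', last row is not: both kept
    (1, "y", "B"), (1, "y", "A"),  # neither condition is met
    (2, "x", "B"),                 # single row is first and last: kept once
]
kept = select_rows(rows)
```

Note that a group can contribute two rows (its first and its last), and that which row counts as first or last within a group depends entirely on the input order.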

Comparing Similar Hive Tables

I have two hive tables (t1 and t2) that I would like to compare. The second table has 5 additional columns that are not in the first table. Other than the five disjoint fields, the two tables should be identical. I am trying to write a query to check this. Here is what I have so far:
SELECT * FROM t1
UNION ALL
select * from t2
GROUP BY some_value
HAVING count(*) == 2
If the tables are identical, this should return 0 records. However, since the second table contains 5 extra fields, I need to change the second select statement to reflect this. There are almost 60 column names so I would really hate to write it like this:
SELECT * FROM t1
UNION ALL
select field1, field2, field3,...,fieldn from t2
GROUP BY some_value
HAVING count(*) == 2
I have looked around and I know there is no select * EXCEPT syntax, but is there a way to do this query without having to explicitly name each column that I want included in the final result?
You should have used UNION DISTINCT for the logic you are applying.
However, the number and names of the columns returned by each select_statement have to be the same; otherwise a schema error is thrown.
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
To skip the 5 extra fields, you could use the "--ignore-columns" option.
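The general idea of comparing two tables while ignoring some columns can be sketched in plain Python. This is not how hive_compared_bq works internally; it is a small in-memory illustration with a hypothetical function diff_tables and made-up rows.

```python
from collections import Counter


def diff_tables(rows1, rows2, ignore=()):
    """Return the rows (without the ignored columns) found only on each side."""
    def key(row):
        # Normalise a row to a sorted tuple of (column, value) pairs.
        return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

    counts1 = Counter(key(r) for r in rows1)
    counts2 = Counter(key(r) for r in rows2)
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    only1 = sorted((counts1 - counts2).elements())
    only2 = sorted((counts2 - counts1).elements())
    return only1, only2


t1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
t2 = [{"id": 1, "v": "a", "extra": 9}, {"id": 2, "v": "c", "extra": 8}]
only_t1, only_t2 = diff_tables(t1, t2, ignore=("extra",))
```

Two empty lists mean the tables agree on all shared columns.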

"substr" statement in apache pig

I have below user data structure in apache hadoop
21796346,83637,2990666,1,2,false,0,0
21827841,15748,8754621,1,7,true,0,1
First 4 digits of the 1st field represent the user type.
2nd field represents the department type.
I would like to query the number of user types in each department.
SQL statement is below
select dept_id, substr(User_Id,1,4) as user_type, count(*) as number_of_users from users group by dept_id,substr(User_Id,1,4)
I could not figure out how to define substr function in pig.
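You could use SUBSTRING in Pig. Note that its indices are zero-based and the stop index is exclusive, so the first four characters are SUBSTRING(User_Id, 0, 4). Your sample data is comma-delimited, hence PigStorage(','):
A = LOAD 'DATA' USING PigStorage(',') AS (User_Id:chararray, var1, var2, var3, var4, var5, var6, var7);
B = GROUP A BY SUBSTRING(User_Id, 0, 4);
C = FOREACH B GENERATE group as user_typeX, COUNT(A) as number_of_users_with_the_same_user_typeX;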
To get the number of all users you could use GROUP A ALL.
The complete list of Pig's built-in functions is in the Pig documentation. The function you are looking for is called SUBSTRING. Note that function names in Pig are case-sensitive.
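To check what the grouping in the question should produce, here is a small Python sketch over the two sample records. It counts users per department and user type, where the user type is the first four characters of the first field; the variable names are illustrative only.

```python
from collections import Counter

records = [
    "21796346,83637,2990666,1,2,false,0,0",
    "21827841,15748,8754621,1,7,true,0,1",
]

counts = Counter()
for record in records:
    fields = record.split(",")
    user_id, dept_id = fields[0], fields[1]
    # user_id[:4] is the zero-based, end-exclusive slice of the first 4 characters.
    counts[(dept_id, user_id[:4])] += 1
```

The slice user_id[:4] follows the same zero-based, end-exclusive convention as Pig's SUBSTRING.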

Selecting data from one table or another in multiple queries PL/SQL

The easiest way to ask my question is with a Hypothetical Scenario.
Let's say we have 3 tables. Singapore_Prices, Produce_val, and Bosses_unreasonable_demands.
So Prices is a pretty simple table. Item column containing a name, and a Price column containing a number.
Produce_Val is also simple 2 column table. Type column containing what type the produce is (Fruit or veggie) and then Name column (Tomato, pineapple, etc.)
The Bosses_unreasonable_demands only contains one column, Fruit, which CAN contain the names of some fruits.
OK? Ok.
SO, My boss wants me to write a query that returns the prices for every fruit in his unreasonable demands table. Simple enough. BUT, if he doesn't have any entries in his table, he just wants me to output the prices of ALL fruits that exist in produce_val.
Now, assuming I don't know where the DBA who designed this silly hypothetical system lives (and therefore can't get him to fix this), our query would look like this:
if <Logic to determine if Bosses demands are empty>
Then
select Item, Price
from Singapore_Prices
where Item in (select Fruit from Bosses_Unreasonable_demands)
Else
select Item, Price
from Singapore_Prices
where Item in (select Name from Produce_val where type = 'Fruit')
end if;
(Well, we'd select those into a variable, and then output the variable, probably with bulk-collect shenanigans, but that's not important)
Which works. It is entirely functional, and won't be slow, even if we extend it out to 2000 other stores other than Singapore. (Well, no slower than anything else that touches 2000 some tables) BUT, I'm still doing two different select statements that are practically identical. My Comp Sci teacher rolls in their grave every time my fingers hit ctrl-V. I can cut this code in half and only do one select statement. I KNOW I can.
I just have no earthly idea how. I can't use cursors as an in statement, I can't use nested tables or varrays, I can't use cleverly crafted strings, I... I just... I don't know. I don't know how to do this. Is there a way? Does it exist?
Or do I have to copy/paste forever?
Your best bet would be dynamic SQL, because you can't parameterize table or column names.
You will have a SQL query template, have a logic to determine tables and columns that you want to query, then blend them together and execute.
Another approach (still a lot of ctrl-v like code) is to use the set operator UNION ALL:
select 1st query where boss_condition
union all
select 2nd query where not boss_condition
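The UNION ALL idea can be tried out quickly with Python's sqlite3 module. This is a sketch against SQLite rather than Oracle, with made-up sample rows; the table and column names follow the question. The "boss condition" is expressed with NOT EXISTS on the second branch, while the first branch needs no extra condition because an IN over an empty table matches nothing.

```python
import sqlite3

QUERY = """
SELECT item, price FROM singapore_prices
WHERE item IN (SELECT fruit FROM bosses_unreasonable_demands)
UNION ALL
SELECT item, price FROM singapore_prices
WHERE NOT EXISTS (SELECT 1 FROM bosses_unreasonable_demands)
  AND item IN (SELECT name FROM produce_val WHERE type = 'Fruit')
ORDER BY item
"""


def fruit_prices(conn):
    """Run the single combined query and return (item, price) rows."""
    return conn.execute(QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singapore_prices (item TEXT, price REAL);
CREATE TABLE produce_val (type TEXT, name TEXT);
CREATE TABLE bosses_unreasonable_demands (fruit TEXT);
INSERT INTO singapore_prices VALUES ('Tomato', 1.0), ('Pineapple', 3.0), ('Carrot', 0.5);
INSERT INTO produce_val VALUES ('Fruit', 'Tomato'), ('Fruit', 'Pineapple'), ('Veggie', 'Carrot');
""")

all_fruit = fruit_prices(conn)  # the demands table is still empty here
conn.execute("INSERT INTO bosses_unreasonable_demands VALUES ('Pineapple')")
demanded = fruit_prices(conn)   # now only the demanded fruit comes back
```

Only one of the two branches ever returns rows, so the result is the same as the IF/ELSE version.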
Try this:
SELECT ITEM, PRICE
FROM (SELECT s.ITEM, s.PRICE, 'BOSS' AS FRUIT_SOURCE, c.BOSS_COUNT
FROM BOSSES_UNREASONABLE_DEMANDS b
INNER JOIN SINGAPORE_PRICES s
ON s.ITEM = b.FRUIT
CROSS JOIN (SELECT COUNT(*) AS BOSS_COUNT
FROM BOSSES_UNREASONABLE_DEMANDS) c
UNION ALL
SELECT s.ITEM, s.PRICE, 'NORMAL' AS FRUIT_SOURCE, c.BOSS_COUNT
FROM PRODUCE_VAL p
INNER JOIN SINGAPORE_PRICES s
ON s.ITEM = p.NAME
CROSS JOIN (SELECT COUNT(*) AS BOSS_COUNT
FROM BOSSES_UNREASONABLE_DEMANDS) c
WHERE p.TYPE = 'Fruit')
WHERE (BOSS_COUNT > 0 AND FRUIT_SOURCE = 'BOSS') OR
(BOSS_COUNT = 0 AND FRUIT_SOURCE = 'NORMAL')
Share and enjoy.
I think you can use nested tables. Assume you have a schema-level nested table type FRUIT_NAME_LIST (defined using CREATE TYPE).
SELECT fruit
BULK COLLECT INTO my_fruit_name_list
FROM bosses_unreasonable_demands
;
IF my_fruit_name_list.count = 0 THEN
SELECT name
BULK COLLECT INTO my_fruit_name_list
FROM produce_val
WHERE type='Fruit'
;
END IF;
SELECT item, price
FROM singapore_prices
WHERE item MEMBER OF my_fruit_name_list
;
(or, WHERE item IN (SELECT column_value FROM TABLE(CAST(my_fruit_name_list AS fruit_name_list))) if you like that better)
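The control flow of this approach (fill a list, fall back to all fruits if it is empty, then run one final select) can be modelled in a few lines of Python. Plain lists stand in for the tables and the nested table; the function name and the sample data are assumptions for illustration.

```python
def fruit_prices(prices, demands, produce):
    """prices: (item, price) pairs; demands: fruit names; produce: (type, name) pairs."""
    names = list(demands)
    if not names:
        # Fallback: every fruit listed in produce_val.
        names = [name for kind, name in produce if kind == "Fruit"]
    wanted = set(names)
    # The single final "select": one filter over the prices, whatever filled the list.
    return [(item, price) for item, price in prices if item in wanted]


prices = [("Tomato", 1.0), ("Pineapple", 3.0), ("Carrot", 0.5)]
produce = [("Fruit", "Tomato"), ("Fruit", "Pineapple"), ("Veggie", "Carrot")]
everything = fruit_prices(prices, [], produce)
demanded = fruit_prices(prices, ["Pineapple"], produce)
```

The duplication is reduced to the two short statements that fill the list; the price query itself is written once.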
