I have the following records in a table
session_id sequence timestamp
1 1 298349
1 2 299234
1 3 234255
2 1 153523
2 2 234524
3 1 123434
I want to have the following results
session_id start end
1 298349 234255
2 153523 234524
3 123434 123434
How can I do this in Pig?
register 'file:$piglib/datafu-1.2.0.jar';
define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
input_data = load 'so.txt' using PigStorage('\t') as (session_id:int, sequence:int, time:long);
g = group input_data by session_id;
r = foreach g {
    s1 = order input_data by sequence asc;
    s2 = order input_data by sequence desc;
    generate group as session_id, FirstTupleFromBag(s1, null).time as start, FirstTupleFromBag(s2, null).time as end;
}
dump r;
First we group by session_id, then order each group's bag by sequence in ascending and descending order, and take the first tuple of each sorted bag for the start and end respectively.
This makes use of the DataFu UDF library (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/FirstTupleFromBag.html).
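If you would rather not depend on DataFu, the same result should be reachable in plain Pig with a nested ORDER plus LIMIT 1 inside the FOREACH (the same pattern used in the rolling-count script further down); a sketch, reusing the input_data alias from above:
g2 = group input_data by session_id;
r2 = foreach g2 {
    s1 = order input_data by sequence asc;
    s2 = order input_data by sequence desc;
    first = limit s1 1;  -- row with the lowest sequence number
    last = limit s2 1;   -- row with the highest sequence number
    generate group as session_id, flatten(first.time) as start, flatten(last.time) as end;
}
dump r2;
Either way, the dump for the sample data should be:
(1,298349,234255)
(2,153523,234524)
(3,123434,123434)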
Related
I have an SQLite database that I can read as:
In [42]: df = pd.read_sql("SELECT * FROM all_vs_all", engine)
In [43]: df.head()
Out[43]:
user_data user_model \
0 037d05edbbf8ebaf0eca#172.16.199.165 037d05edbbf8ebaf0eca#172.16.199.165
1 037d05edbbf8ebaf0eca#172.16.199.165 060210bf327a3e3b4621#172.16.199.33
2 037d05edbbf8ebaf0eca#172.16.199.165 1141259bd36ba65bef02#172.21.44.180
3 037d05edbbf8ebaf0eca#172.16.199.165 209627747e2af1f6389e#172.16.199.181
4 037d05edbbf8ebaf0eca#172.16.199.165 303a1aff4ab6e3be82ab#172.21.112.182
score Class time_secs model_name bin_id
0 0.283141 0 1514764800 Flow 0
1 0.999300 1 1514764800 Flow 0
2 1.000000 1 1514764800 Flow 0
3 0.206360 1 1514764800 Flow 0
4 1.000000 1 1514764800 Flow 0
As the table is too big, rather than reading the full table I select a random subset of rows. This can be done very quickly as:
random_query = "SELECT * FROM all_vs_all WHERE abs(CAST(random() AS REAL))/9223372036854775808 < %f AND %s" % (ratio, time_condition)
df = pd.read_sql(random_query, engine)
The problem is that for each triplet [user_data, user_model, time_secs] I want to get all the rows containing that triplet. Each triplet appears once or twice.
A possible way to do it is to first sample a random set of triplets and then fetch all the rows matching one of the selected triplets, but this seems too slow.
Is there an efficient way to do it?
EDIT: If I could load all the data in pandas I would have done something like:
selected_groups = []
# groupby yields (key, sub-frame) pairs, so unpack both
for key, group in df.groupby(['user_data', 'user_model', 'time_secs']):
    if np.random.uniform(0, 1) < ratio:  # keep each triplet with probability `ratio`
        selected_groups.append(group)
res = pd.concat(selected_groups)
A few sample joins and SQL queries:
Currently admitted:
SELECT p.patient_no, p.pat_name, p.date_admitted, r.room_extension, p.date_discharged FROM patient p JOIN room r ON p.room_location = r.room_location WHERE p.date_discharged IS NULL ORDER BY p.patient_no, p.pat_name, p.date_admitted, r.room_extension, p.date_discharged;
Vacant rooms:
SELECT r.room_location, r.room_accomadation, r.room_extension FROM room r WHERE r.room_location NOT IN (SELECT p.room_location FROM patient p WHERE p.date_discharged IS NULL) ORDER BY r.room_location, r.room_accomadation, r.room_extension;
No charges yet:
SELECT p.patient_no, p.pat_name, COALESCE(b.charge, 0.00) AS charge FROM patient p LEFT JOIN billed b ON p.patient_no = b.patient_no WHERE p.patient_no NOT IN (SELECT patient_no FROM billed) ORDER BY p.patient_no, p.pat_name, charge;
Highest and lowest salaried:
SELECT phy_id, phy_name, salary FROM physician WHERE salary IN (SELECT MAX(salary) FROM physician) UNION
SELECT phy_id, phy_name, salary FROM physician WHERE salary IN (SELECT MIN(salary) FROM physician) ORDER BY phy_id, phy_name, salary;
Items consumed by each patient:
SELECT p.pat_name, i.discription, COUNT(i.item_code) AS item_count FROM patient p JOIN billed b ON p.patient_no = b.patient_no JOIN item i ON b.item_code = i.item_code GROUP BY p.patient_no, i.item_code ORDER BY ...
Patients who have not received treatment:
SELECT p.patient_no,p.pat_name FROM patient p where p.patient_no NOT IN (SELECT t.patient_no FROM treats t)
ORDER BY p.patient_no,p.pat_name;
Two highest paid:
SELECT phy_id, phy_name, date_of_joining, salary FROM physician ORDER BY salary DESC, phy_id, phy_name LIMIT 2;
Over 200:
SELECT patient_no, SUM(charge) AS total_charges FROM billed GROUP BY patient_no HAVING SUM(charge) > 200 ORDER BY patient_no, total_charges;
I am trying to create an ID based on multiple values in different columns in Power Query.
The idea is to check the following values:
IF
ID_STORE = 1
ID_PRODUCT = 1
ID_CATEGORY = 1
SALE_DATE = 01/01/2018
ID_COSTUMER = 1
THEN CREATE THE SAME ID FOR THE ROWS THAT HAVE THIS INFO.
The idea is to check the rows that have that info (1 and 01/01/2018) in the various columns (ID_STORE, ID_PRODUCT, ID_CATEGORY, etc.) and give them all the same ID.
Thanks in advance.
Note: this is my first post, so feel free to correct me in any way.
You need to add 'and' clauses to your 'if' statement:
if [ID_STORE] = 1
and [ID_PRODUCT] = 1
and [ID_CATEGORY] = 1
and [SALE_DATE] = #date(2018, 1, 1) // Text.ToDate is not a standard M function; use a date literal
and [ID_COSTUMER] = 1
then 1 else 0
I have a dataset that looks like
gr col1 col2
A 2 'haha'
A 4 'haha'
A 3 'haha'
B 5 'hoho'
B 1 'hoho'
As you can see, each group gr has a numeric variable col1 and a string variable col2 that is constant within the group.
How can I express the following pseudo-code in Pig?
foreach group gr : generate the mean of col1 and get the first occurrence of col2
so the output would look like:
gr mean name
A 3 'haha'
B 3 'hoho'
thanks!
GROUP BY gr, col2 and take the AVG of col1. This assumes the fields are tab-separated.
PigScript
A = load 'test6.txt' USING PigStorage('\t') as (gr:chararray,col1:int,col2:chararray);
B = GROUP A BY (gr,col2);
C = FOREACH B GENERATE FLATTEN(group) as (gr,name),AVG(A.col1) as mean;
DUMP C;
Note: if you want the columns in the order shown in the expected output, add an extra step:
D = FOREACH C GENERATE $0 as gr,$2 as mean,$1 as name;
Output (for the sample data; AVG returns a double):
(A,haha,3.0)
(B,hoho,3.0)
I have multiple files with same columns and I am trying to aggregate the values in two columns using SUM.
The column structure is below
ID first_count second_count name desc
1 10 10 A A_Desc
1 25 45 A A_Desc
1 30 25 A A_Desc
2 20 20 B B_Desc
2 40 10 B B_Desc
How can I sum the first_count and second_count?
ID first_count second_count name desc
1 65 80 A A_Desc
2 60 30 B B_Desc
Below is the script I wrote, but when I execute it I get the error "Could not infer matching function for SUM as multiple or none of them fit. Please use an explicit cast."
A = LOAD '/output/*/part*' AS (id:chararray,first_count:chararray,second_count:chararray,name:chararray,desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group as id,
SUM(A.first_count) as first_count,
SUM(A.second_count) as second_count,
A.name as name,
A.desc as desc;
Your load statement is wrong: first_count and second_count are loaded as chararray, and SUM can't add two strings. If you are sure that these columns will only ever hold numbers, load them as int. Try this:
A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
It should work.
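Note that after the GROUP, A.name and A.desc are bags, so the GENERATE above would still emit a bag of repeated values for those two columns. Since name and desc appear constant within each id in the sample, collapsing them with MAX is one option; a sketch under that assumption:
A = LOAD '/output/*/part*' AS (id:chararray, first_count:int, second_count:int, name:chararray, desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group AS id,
    SUM(A.first_count) AS first_count,
    SUM(A.second_count) AS second_count,
    MAX(A.name) AS name,  -- collapses the bag; any per-group value works if they are all equal
    MAX(A.desc) AS desc;
DUMP C;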
I asked this question regarding how to get a rolling count of distinct users using SQL, but I also have Hadoop at my disposal, and now I'm wondering whether this analysis isn't better suited to Hadoop. Unfortunately, I'm new to Hadoop, so beyond getting the data loaded and running the most basic MapReduce jobs I don't know how to approach this. Assuming this is a good candidate for Hadoop, what is the best approach to take?
One possible way to model this in map reduce:
M/R Job 1:
Mapper: extract and output the USER_ID (output key) and date portion from the CONNECT_TS (output value).
Reducer: For each USER_ID group (output key), output the minimum date observed (output value)
M/R Job 2:
Mapper: from the previous job output, swap the key / value pairs (DATE as output key, USER_ID as value)
Reducer (single): for each input key (date), count the number of users (value), output this number along with an accumulated running total.
In Pig, everything except keeping the running total could be done using the following script:
A = LOAD '/home/cswhite/data.tsv' USING PigStorage('\t') AS (SESSION, USER_ID, TIMESTAMP);
B = foreach A GENERATE USER_ID, SUBSTRING(TIMESTAMP, 0, 10) AS DATE;
BF = filter B by DATE > '2013-01-01';
C = group BF by USER_ID;
D = foreach C {
sorted = order BF by DATE;
earliest = limit sorted 1;
generate group, flatten(earliest);
}
E = foreach D generate group as USER_ID, earliest::DATE as DATE;
F = group E by DATE;
G = foreach F generate group as DATE, COUNT(E) as USERS_CNT;
H = group G all;
I = foreach H generate SUM(G.USERS_CNT) as TOTAL_USERS;
So for the following tab separated input data:
1 99 2013-01-01 2:23:33
2 101 2013-01-01 2:23:55
3 104 2013-01-01 2:24:41
4 101 2013-01-01 2:24:43
5 101 2013-01-02 2:25:01
6 102 2013-01-02 2:26:01
7 99 2013-01-03 2:27:01
8 92 2013-01-04 2:28:01
9 234 2013-01-05 2:29:01
The alias G is as follows:
(2013-01-02,2)
(2013-01-03,1)
(2013-01-04,1)
(2013-01-05,1)
And alias 'I' is:
(5)
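For completeness, the running total that the script leaves out can also be computed in Pig by pairing every date with each earlier-or-equal date, though the CROSS makes this quadratic in the number of distinct dates; a sketch (G2 is a renamed copy of G so the relation can be crossed with itself):
G2 = foreach G generate DATE as DATE2, USERS_CNT as USERS_CNT2;
pairs = cross G, G2;
upto = filter pairs by DATE2 <= DATE;  -- ISO dates compare correctly as strings
byDate = group upto by DATE;
running = foreach byDate generate group as DATE, SUM(upto.USERS_CNT2) as RUNNING_TOTAL;
dump running;
For the sample above this should give (2013-01-02,2), (2013-01-03,3), (2013-01-04,4) and (2013-01-05,5).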