Pig Script How-To - Hadoop

I am trying to clean up this employee volunteer data. There is no way to track whether an employee is already a registered volunteer, so the same person can sign up as a new volunteer and get a new VOLUNTEER_ID. I have a data feed through which I can tie each VOLUNTEER_ID to its EMP_ID. The volunteer data needs to be cleaned up so we can figure out how an employee moved from one volunteer level to another, and when.
The business logic is that, when there are overlapping dates, we give the employee the highest level for the timeframe between START_DATE and END_DATE.
I posted an input sample of the data and what the output should be.
Is it possible to do this with a Pig script? Can someone please help me?
INPUT:
EMP_ID VOLUNTEER_ID V_LEVEL STATUS START_DATE END_DATE
10001 100 1 A 1/1/2006 12/31/2007
10001 200 1 A 5/1/2006
10001 100 1 A 1/1/2008
10001 300 3 P 3/1/2008 3/1/2008
10001 300 3 A 3/2/2008 12/1/2008
10001 1001 2 A 5/1/2008 6/30/2008
10001 1001 3 A 7/1/2008
10001 300 2 A 12/2/2008
OUTPUT NEEDED: (VOLUNTEER_ID is not needed in the output, but it is included below to show which IDs were selected for the output and which were not)
EMP_ID VOLUNTEER_ID V_LEVEL STATUS START_DATE END_DATE
10001 100 1 A 1/1/2006 12/31/2007
10001 300 3 P 3/1/2008 3/1/2008
10001 300 3 A 3/2/2008 12/1/2008
10001 1001 2 A 5/1/2008 6/30/2008
10001 1001 3 A 7/1/2008

It seems like you want the row in your data with the earliest start date for each combination of V_LEVEL, STATUS, EMP_ID, and VOLUNTEER_ID.
First we add a Unix-time column and then find the minimum of that column (ToUnixTime is only available in recent versions of Pig, so you may need to update your version).
-- assuming the input is loaded with a schema along these lines (adjust the path and field types to your data):
data = load 'volunteer_data' using PigStorage() as (EMP_ID:chararray, VOLUNTEER_ID:chararray, V_LEVEL:int, STATUS:chararray, START_DATE:chararray, END_DATE:chararray);
data_with_unix = foreach data generate EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS, START_DATE, END_DATE, ToUnixTime(ToDate(START_DATE, 'M/d/yyyy')) as unix_time;
grp = group data_with_unix by (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS);
min_start = foreach grp generate group, MIN(data_with_unix.unix_time) as min_unix_time;
Then join the minimum back into your dataset to recover the start and end dates, since it doesn't look like there is currently a way to convert Unix time back to a date; a sketch of that join is below.
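A minimal sketch of that join, reusing the relation and field names from the snippet above (the flattened aliases are assumptions so the join keys can be addressed):
min_start_flat = foreach min_start generate flatten(group) as (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS), min_unix_time;
-- keep only the rows whose start time equals the per-group minimum
joined = join data_with_unix by (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS, unix_time), min_start_flat by (EMP_ID, VOLUNTEER_ID, V_LEVEL, STATUS, min_unix_time);
earliest = foreach joined generate data_with_unix::EMP_ID .. data_with_unix::END_DATE;
dump earliest;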

Related

Lag() Function in SQLiteStudio

I want to return the last transaction date for each CustomerID, and I am using SQLiteStudio 3.2.1. My table looks like this:
CustomerID Date TransactionID Amount
1 2000-07-01 1 20.00
2 2000-07-04 2 40.00
1 2002-08-01 3 20.00
1 2007-01-01 4 60.00
2 2010-05-09 5 70.00
1 2012-06-25 6 35.00
This is what I would like the end result to look like:
CustomerID Date TransactionID Amount Last Transaction Date
1 2000-07-01 1 20.00 NULL
2 2000-07-04 2 40.00 NULL
1 2002-08-01 3 20.00 2000-07-01
1 2007-01-01 4 60.00 2002-08-01
2 2010-05-09 5 70.00 2000-07-04
1 2012-06-25 6 35.00 2007-01-01
I was attempting to use the following code:
SELECT CustomerID, Date, Amount, LAG(Date,1) OVER (PARTITIONED BY CustomerID ORDER BY Date)
FROM table
However, the LAG function is not supported in SQLiteStudio (or maybe I am missing something?), and the SQL editor is not recognizing the PARTITION BY clause either. Is there a way to use the LAG function or the PARTITION BY clause in the SQL Function Editor? Any help would be greatly appreciated! Thanks!
Also: does anyone have any resources for aggregate function creation in the SQL Function Editor for SQLiteStudio? I know it takes the three parameters of "Initialization code", "Per step code", and "Final step implementation code", but I am looking for examples of the syntax/requirements for these three parameters in SQLiteStudio. (Thanks again!)
Your partition clause, as you pasted it above, has a typo: it should be PARTITION BY, not PARTITIONED BY. If that is the only problem, then just fix the typo:
SELECT CustomerID, Date, Amount,
LAG(Date) OVER (PARTITION BY CustomerID
ORDER BY Date) AS "Last Transaction Date"
FROM yourTable
ORDER BY Date;
If the above still does not work, then perhaps your version of SQLite does not support LAG (window functions require SQLite 3.25 or later). One workaround in this case would be to use a correlated subquery in place of LAG:
SELECT CustomerID, Date, Amount,
(SELECT t2.Date
FROM yourTable t2
WHERE t2.CustomerID = t1.CustomerID AND
t2.TransactionID < t1.TransactionID
ORDER BY t2.TransactionID DESC
LIMIT 1) AS "Last Transaction Date"
FROM yourTable t1
ORDER BY Date;

Select single random sample from group by in Hive

I have a table that looks like so:
Name Age Num_Hobbies Num_Shoes
Jane 31 10 2
Bob 23 3 4
Jane 60 2 200
Jane 31 100 6
Bob 10 8 7
etc etc
I would like to group this table by Name and Age, and pick one row at random from each group.
In pandas, I would do the following:
df.groupby(['Name', 'Age']).apply(lambda x: x.sample(n=1))
In Hive, I know how to create the groups, but not how to choose a single random sample from each group.
I saw this question on Stack Overflow: How to sample for each group in hive?
However, I do not understand how to apply dynamic partitions or Hive bucketing to select a single sample from a group.
You can use rank() or row_number() together with rand():
select * from
(
  select t.*, rank() over (partition by name, age order by rand()) as rnk
  from your_table t
) ranked
where rnk = 1;

Oracle - Insert x amount of rows with random data

I am currently doing some testing and need a large amount of data (around 1 million rows).
I am using the following table:
CREATE TABLE OrderTable(
OrderID INTEGER NOT NULL,
StaffID INTEGER,
TotalOrderValue DECIMAL(8,2),
CustomerID INTEGER);
ALTER TABLE OrderTable ADD CONSTRAINT OrderID_PK PRIMARY KEY (OrderID);
CREATE SEQUENCE seq_OrderTable
MINVALUE 1
START WITH 1
INCREMENT BY 1
CACHE 10000;
and want to randomly insert 1000000 rows into it with the following rules:
OrderID needs to be sequential (1, 2, 3, etc.)
StaffID needs to be a random number between 1 and 1000
CustomerID needs to be a random number between 1 and 10000
TotalOrderValue needs to be a random decimal value between 0.00 and 9999.99
Is this even possible to do? I know I could generate the random values with an update statement like the one below, but I am not sure how to generate a million rows in one go.
Thanks for any help on this matter.
This is how I would randomly generate the number in an update:
UPDATE StaffTable SET DepartmentID = DBMS_RANDOM.value(low => 1, high => 5);
For testing purposes I created the table and populated it in one shot, with this query:
CREATE TABLE OrderTable(OrderID, StaffID, CustomerID, TotalOrderValue)
as (select level, ceil(dbms_random.value(0, 1000)),
ceil(dbms_random.value(0,10000)),
round(dbms_random.value(0,10000),2)
from dual
connect by level <= 1000000)
/
A few notes: it is better to use NUMBER as the data type; NUMBER(8,2) is the Oracle equivalent of DECIMAL(8,2). For populating this kind of table it is much more efficient to use the "hierarchical query without PRIOR" trick (the "connect by level <= ..." trick) to generate the order IDs.
If your table is already created, an insert with the same subquery as in my code (see the sketch below) should work just as well. You may be better off adding the PK constraint only after you create the data, though, so as not to slow things down.
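A minimal sketch of that insert, assuming the OrderTable definition from the question:
INSERT INTO OrderTable (OrderID, StaffID, CustomerID, TotalOrderValue)
SELECT level,
       ceil(dbms_random.value(0, 1000)),
       ceil(dbms_random.value(0, 10000)),
       round(dbms_random.value(0, 10000), 2)
FROM dual
CONNECT BY level <= 1000000;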
A small sample from the table created (total time to create the table on my cheap laptop - 1,000,000 rows - was 7.6 seconds):
SQL> select * from OrderTable where orderid between 500020 and 500030;
ORDERID STAFFID CUSTOMERID TOTALORDERVALUE
---------- ---------- ---------- ---------------
500020 666 879 6068.63
500021 189 6444 1323.82
500022 533 2609 1847.21
500023 409 895 207.88
500024 80 2125 1314.13
500025 247 3772 5081.62
500026 922 9523 1160.38
500027 818 5197 5009.02
500028 393 6870 5067.81
500029 358 4063 858.44
500030 316 8134 3479.47

What if I want Oracle SQL to GROUP BY this but not that?

This is what I got previously from another SQL query:
Customer Id week_ending Purchase Id Price
1234 2/28/2015 8604220 15
1234 2/28/2015 8604220 13.75
1234 2/28/2015 8604220 12.95
1234 2/28/2015 8604220 18.95
567890 8/15/2015 6376243 5.15
567890 8/15/2015 6376243 0.89
567890 8/15/2015 6376243 3.99
567890 8/15/2015 6376243 2.3
1234 1/24/2015 8824241 0.99
1234 1/24/2015 8824241 3.99
1234 1/24/2015 8824241 3.89
Now I want to sum the price by Purchase ID, since it is unique for every customer order, but I don't want my SQL to sum it by Customer ID (each customer could order multiple times with multiple Purchase IDs). The following is the code I wrote, but I'm afraid it would sum them by customer_id. How do I avoid this double counting? Thanks in advance!
WITH example AS(SELECT
customer_id
,MAX(nvl(promised_arrival_day, ship_day)) OVER (PARTITION BY purchase_id) AS ship2_day
,purchase_id
,SUM(price) AS order_size
FROM
my_table
GROUP BY
customer_id
,MAX(nvl(promised_arrival_day, ship_day)) OVER (PARTITION BY customer_purchase_id)
,purchase_id)
SELECT
example.customer_id
,TO_CHAR(example.ship2_day + (7-TO_CHAR(example.ship2_day,'d')),'MM-DD-YYYY') AS week_ending
,example.purchase_id
,example.order_size
FROM
example;
Just
SELECT customer_id, purchase_id, sum(price)
FROM your_table
GROUP BY customer_id, purchase_id
Each record will be counted only once; there is no "double counting", as you call it.
You will get one record for each unique combination of customer_id/purchase_id in your data.
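If you also need the ship2_day / week_ending logic from your original query, a minimal sketch (assuming the promised_arrival_day and ship_day columns from that query) is to use a plain MAX aggregate instead of the analytic version, so nothing extra ends up in the GROUP BY:
SELECT customer_id,
       purchase_id,
       MAX(nvl(promised_arrival_day, ship_day)) AS ship2_day,
       SUM(price) AS order_size
FROM my_table
GROUP BY customer_id, purchase_id;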
Looking at your data, I would do something like this:
with tbl as (
  -- the query that produces the dataset shown above
)
select customer_id, purchase_id, sum(price) as total_price
from tbl
group by customer_id, purchase_id;

Grouping data by date ranges

I wonder how I can select a range of data depending on the date range.
I have this data in my payment table, with dates in dd/mm/yyyy format:
Id Date Amount
1 4/1/2011 300
2 10/1/2011 200
3 27/1/2011 100
4 4/2/2011 300
5 22/2/2011 400
6 1/3/2011 500
7 1/1/2012 600
The closing date is on the 27th of every month, so I would like to group all the data from the 27th until the 26th of the next month into one group.
That is, I would like the output to look like this:
Group 1
1 4/1/2011 300
2 10/1/2011 200
Group 2
1 27/1/2011 100
2 4/2/2011 300
3 22/2/2011 400
Group 3
1 1/3/2011 500
Group 4
1 1/1/2012 600
The context of your question is not clear. Are you querying a database?
If so, you are asking about datetime values, but it seems you have a column in string format.
First of all, convert your data to a datetime data type (or some equivalent; which DB engine are you using?), and then use a grouping criterion like this:
GROUP BY datepart(month, dateadd(day, -26, [datefield])), DATEPART(year, dateadd(day, -26, [datefield]))
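For example, a minimal sketch in SQL Server syntax, assuming [datefield] has already been converted to a datetime type:
SELECT datepart(year, dateadd(day, -26, [datefield])) AS period_year,
       datepart(month, dateadd(day, -26, [datefield])) AS period_month,
       SUM(Amount) AS total_amount,
       COUNT(*) AS payments
FROM payment
GROUP BY datepart(year, dateadd(day, -26, [datefield])),
         datepart(month, dateadd(day, -26, [datefield]))
ORDER BY period_year, period_month;
Shifting each date back 26 days maps the 27th-to-26th window onto a single calendar month.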
EDIT:
So, you are using LINQ?
Different language, same logic:
.GroupBy(x => DateTime
.ParseExact(x.Date, "dd/MM/yyyy", CultureInfo.InvariantCulture) // assuming your date field is a string
.AddDays(-26)
.ToString("yyyyMM"));
If you are going to do this frequently, it would be worth investing in a table that assigns a unique identifier to each month and the start and end dates:
CREATE TABLE MonthEndings
(
MonthID INTEGER NOT NULL PRIMARY KEY,
StartDate DATE NOT NULL,
EndDate DATE NOT NULL
);
INSERT INTO MonthEndings VALUES(201101, '27/12/2010', '26/01/2011');
INSERT INTO MonthEndings VALUES(201102, '27/01/2011', '26/02/2011');
INSERT INTO MonthEndings VALUES(201103, '27/02/2011', '26/03/2011');
INSERT INTO MonthEndings VALUES(201112, '27/11/2011', '26/12/2011');
INSERT INTO MonthEndings VALUES(201201, '27/12/2011', '26/01/2012');
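If you would rather not maintain these rows by hand, a minimal sketch that generates them with a recursive CTE (SQL Server-style date functions assumed; adjust the start date and cut-off to your needs):
WITH months AS (
    SELECT CAST('2010-12-27' AS DATE) AS StartDate
    UNION ALL
    SELECT DATEADD(month, 1, StartDate) FROM months WHERE StartDate < '2011-12-27'
)
INSERT INTO MonthEndings (MonthID, StartDate, EndDate)
SELECT YEAR(DATEADD(month, 1, StartDate)) * 100 + MONTH(DATEADD(month, 1, StartDate)),
       StartDate,
       DATEADD(day, -1, DATEADD(month, 1, StartDate))
FROM months;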
You can then group accurately using:
SELECT M.MonthID, P.Id, P.Date, P.Amount
FROM Payments AS P
JOIN MonthEndings AS M ON P.Date BETWEEN M.StartDate and M.EndDate
ORDER BY M.MonthID, P.Date;
Any group headings etc are best handled out of the DBMS - the SQL gets you the data in the correct sequence, and the software retrieving the data presents it to the user.
If you can't translate SQL to LINQ, that makes two of us. Sorry, I have never used LINQ, so I've no idea what is involved.
SELECT *, CASE WHEN datepart(day, date) < 27
               THEN datepart(year, date) * 100 + datepart(month, date)
               ELSE datepart(year, dateadd(month, 1, date)) * 100 + datepart(month, dateadd(month, 1, date))
          END AS group_name
FROM payment
