Pig Help: Splitting a Field into Multiple Fields - hadoop

Hi, I am playing around with Pig for the first time and am curious how to split a field into multiple other fields.
I have a bag, A, like the one below:
grunt> Dump A;
(text, text, Mon Mar 07 12:00:00 CDT 2016)
What I'd like to do is split the Date-Time field into multiple fields so that I can explore the distribution of the data set and do group bys on the Day of Week, Month, Year, etc.
I have been looking at TOKENIZE but am not sure it meets my needs, since I want the new fields to be named in the bag or placed in a nested bag.
Any ideas?

Assuming that the value is already of datatype datetime, you can use the following built-in functions to extract individual elements (see the DateTime Functions section of the Pig built-in function reference).
B = FOREACH A GENERATE f1, f2,
    GetDay(f3) as f3_Day,
    GetMonth(f3) as f3_Month,
    GetYear(f3) as f3_Year,
    GetHour(f3) as f3_Hour,
    GetMinute(f3) as f3_Minute,
    GetSecond(f3) as f3_Second;
If the datatype is chararray then use the ToDate() function to convert it to datetime and extract the date parts.
B = FOREACH A GENERATE f1,f2,ToDate(f3,'choose your datetime format') as f3_Date;
C = FOREACH B GENERATE f1, f2,
    GetDay(f3_Date) as f3_Day,
    GetMonth(f3_Date) as f3_Month,
    GetYear(f3_Date) as f3_Year,
    GetHour(f3_Date) as f3_Hour,
    GetMinute(f3_Date) as f3_Minute,
    GetSecond(f3_Date) as f3_Second;
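To sanity-check what each extracted field should contain for the sample value in the question, the same split can be sketched outside Pig. This is only an illustration in Python, not Pig code: the function name split_timestamp is hypothetical, and dropping the zone abbreviation (CDT) before parsing is an assumption made to keep the sketch simple.

```python
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def split_timestamp(raw: str) -> dict:
    """Split a string like 'Mon Mar 07 12:00:00 CDT 2016' into named parts."""
    tokens = raw.split()
    # Assumption: the 5th token is a zone abbreviation, which this sketch ignores.
    without_zone = " ".join(tokens[:4] + tokens[5:])
    parsed = datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")
    return {
        "day_of_week": DAY_NAMES[parsed.weekday()],
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
        "hour": parsed.hour,
        "minute": parsed.minute,
        "second": parsed.second,
    }

parts = split_timestamp("Mon Mar 07 12:00:00 CDT 2016")
print(parts)
```

Each key corresponds to one of the named fields produced by the FOREACH statements above, plus a day-of-week field for grouping.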

Related

Fill a new column with data converted from an existing column in Laravel

I have a pre-existing application built with a huge database. In this database, there's a column expires_in that holds a date but the date is saved in this format:
02 Mar 2018 (i.e. 'd M Y' format)
I want to copy this data into a new expiration_date column, where it will be saved as a date (like this: 2018-03-02).
I want to do this in the most efficient way because the database has thousands of records.
Thank you.
Use the query builder's chunk method, which fetches small subsets of the data in a loop. Otherwise, with thousands of records, your request could time out or run out of memory.
You can then use Carbon's createFromFormat() method to specify how to parse the current date format and update each record.
Accounts::orderBy('id')->chunk(100, function ($accounts) {
    foreach ($accounts as $account) {
        $account->update([
            'expiration_date' => Carbon::createFromFormat('d M Y', $account->expires_in)
        ]);
    }
});
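I found a post about converting a string to a date in MySQL. STR_TO_DATE can do the conversion in a single statement (%b matches an abbreviated month name such as Mar):
UPDATE `accounts` SET `expiration_date` = STR_TO_DATE(expires_in, '%d %b %Y')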
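The conversion itself is easy to check in isolation before running it against the whole table. A minimal sketch in Python (the function name to_iso_date is hypothetical; it assumes the default English month abbreviations):

```python
from datetime import datetime

def to_iso_date(text: str) -> str:
    """Parse a 'dd Mon yyyy' string and return it as 'yyyy-mm-dd'."""
    return datetime.strptime(text, "%d %b %Y").date().isoformat()

converted = to_iso_date("02 Mar 2018")
print(converted)
```

For the example value from the question this gives the target format, 2018-03-02.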

Apache Pig: Calculate number of days between a date and current date

I have a list of movies in the form (#,title,year,rating,duration):
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
...
I have the year in each tuple, and I need to treat it as 1st Jan of that year.
I need to calculate the number of days between this date and the current date.
My approach:
movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
daysbetween_data = foreach movies generate DaysBetween(ToDate(year,'<WHAT FORMAT TO GIVE HERE>'), ToDate(<CURRENT DATE HERE>));
Any idea how to do this?
Load the year into a chararray field, use CONCAT to prepend 01-01- to the year so that the value matches the format 'MM-dd-yyyy', and then use ToDate and DaysBetween.
movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int,name:chararray,year:chararray,rating:double,duration:int);
daysbetween_data = foreach movies generate DaysBetween(ToDate(CONCAT('01-01-',year),'MM-dd-yyyy'),CurrentTime());
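The expected numbers can be cross-checked with a small sketch outside Pig. This is an illustration in Python under stated assumptions: the function name days_since_jan_first and the optional today parameter (added so the result is reproducible) are hypothetical and not part of the Pig answer.

```python
from datetime import date
from typing import Optional

def days_since_jan_first(year: int, today: Optional[date] = None) -> int:
    """Days from 1 January of `year` to `today` (defaults to the current date)."""
    if today is None:
        today = date.today()
    return (today - date(year, 1, 1)).days

# Fixed reference date so the output does not depend on when this runs.
elapsed = days_since_jan_first(1993, today=date(1994, 1, 1))
print(elapsed)
```

Comparing a few rows against values computed this way is a quick way to confirm that the sign and magnitude of the Pig result are what you expect.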

Retrieve database records between two weekdays

I have several records in my database, and the table has a column named "weekday" where I store a weekday like "mon" or "fri". When a user runs a search from the frontend, the parameters posted to the server are startDay and endDay.
I would like to retrieve all records between startDay and endDay. We can assume startDay is "mon" and endDay is "sun". I do not currently know how to do this.
Create another table with the names of the days and their corresponding number. Then you'd just need to join up your current table with the days table by name, and then use the numbers in that table to do your queries.
Not exactly practical, but it is possible to convert sun, mon, tue to numbers using MySQL.
Set up a static year and week number, such as 201610 for the 10th week of 2016, then use a combination of DATE_FORMAT and STR_TO_DATE:
DATE_FORMAT(STR_TO_DATE('201610 sun', '%X%V %a'), '%w')
DATE_FORMAT(STR_TO_DATE('201610 mon', '%X%V %a'), '%w')
DATE_FORMAT(STR_TO_DATE('201610 tue', '%X%V %a'), '%w')
These three statements evaluate to 0, 1, and 2 respectively.
This converts the %a format (Sun-Sat) to the %w format (0-6, where 0 is Sunday).
I don't know the architecture of your application, and I don't think storing and querying a weekday string is ideal, but here is a workaround.
Write a helper function that returns an array of the weekdays in the range (this version assumes a Monday-first week and a start day that is not after the end day):
function getWeekDaysArray($startWeekDay, $endWeekDay) {
    $days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'];
    $start = array_search($startWeekDay, $days);
    $end = array_search($endWeekDay, $days);
    return array_slice($days, $start, $end - $start + 1);
}
$daysRangeArray = getWeekDaysArray('mon', 'wed'); // ['mon', 'tue', 'wed']
With this array you can query the table:
DB::table('TableName')->whereIn('weekday', $daysRangeArray)->get();
Hope this helps.
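The range-building step of the helper can also be sketched in Python. The names WEEKDAYS and weekdays_between are hypothetical, and treating a start day later than the end day as a range that wraps around the end of the week is an assumption, not something the question specifies.

```python
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def weekdays_between(start: str, end: str) -> list:
    """Return the weekday names from start to end, inclusive."""
    i = WEEKDAYS.index(start)
    j = WEEKDAYS.index(end)
    if i <= j:
        return WEEKDAYS[i:j + 1]
    # Assumption: a range such as sat..mon wraps past the end of the week.
    return WEEKDAYS[i:] + WEEKDAYS[:j + 1]

print(weekdays_between("mon", "wed"))
print(weekdays_between("sat", "mon"))
```

The resulting list is what would be passed to a whereIn-style query on the weekday column.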

Is there a way to do datetime addition in pig?

What I would like to do in Pig is something that is very common in SQL.
I have a date field of the form yyyy-mm-dd hh:mm:ss and another field containing an integer that represents a number of hours. Is there an easy way to add the integer to the datetime field so that the result follows normal clock arithmetic?
Example: the date is 2013-06-01 23:12:12.
Then I add 2 hours.
I should get 2013-06-02 01:12:12.
With the latest release of Pig (0.11.0) this should be possible, but the number of hours must be given in ISO 8601 duration format. Pig provides the AddDuration function, which adds a duration to a datetime value. See the Pig documentation for more about AddDuration.
Edit :
Yes, you can add negative hours. I tried this on my Ubuntu box :
Input :
2009-01-07T01:07:01.000Z,PT1S
2008-02-06T02:06:02.000Z,PT1M
2007-03-05T03:05:03.000Z,PT-1H
Query :
grunt> a = LOAD '/pig.txt' USING PigStorage(',') AS (dt:datetime, dr:chararray);
grunt> b = FOREACH a GENERATE AddDuration(dt, dr) AS dt1;
grunt> dump b;
Output :
(2009-01-07T01:07:02.000Z)
(2008-02-06T02:07:02.000Z)
(2007-03-05T02:05:03.000Z)
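The clock arithmetic expected in the question can be verified with a short sketch outside Pig. This Python illustration uses a hypothetical function name, add_hours, and assumes the yyyy-mm-dd hh:mm:ss layout from the question with an integer number of hours rather than an ISO 8601 duration string.

```python
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%d %H:%M:%S"

def add_hours(stamp: str, hours: int) -> str:
    """Add a (possibly negative) number of hours to a timestamp string."""
    shifted = datetime.strptime(stamp, FORMAT) + timedelta(hours=hours)
    return shifted.strftime(FORMAT)

print(add_hours("2013-06-01 23:12:12", 2))
print(add_hours("2007-03-05 03:05:03", -1))
```

The first call rolls over to the next day, matching the example in the question, and the second mirrors the PT-1H row above.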

How do I get the aggr of two aggrs in QlikView?

If I want to find the maximum value of a column from two states aggregated by a member's ID, should this work?
=Aggr(
MaxString(
Aggr(NODISTINCT MinString({[State1]}DATE_STRING),MBR_ID)
+
Aggr(NODISTINCT MinString({[State2]}DATE_STRING),MBR_ID)
) , MBR_ID)
So if I had this data:
MBR_ID  DATE_STRING
1       20120101
1       20120102
1       20120103
If State1 had 20120101 selected and State2 had 20120103 selected, my expression should return 20120103 for member 1.
Thanks!
Edit: In SQL, this would look like:
WITH MinInfo (DATE_STRING, MBR_ID)
AS (SELECT MIN(DATE_STRING), MBR_ID FROM Table WHERE TYPE IN ('State1', 'State2') GROUP BY MBR_ID, TYPE)
SELECT MAX(DATE_STRING) DATE_STRING, MBR_ID FROM MinInfo GROUP BY MBR_ID
It would be easier to accomplish your goal if you convert that string to an actual date field.
Assuming that you are using a chart where MBR_ID is the Dimension, if you want the maximum date (latest date) you can do the following:
=nummax(Max({[State1]}DATE_STRING),Max({[State2]}DATE_STRING))
To convert to a date, you can use this function:
date#(DATE_STRING,'[text format of the date]')
(The date format looks like YYYYMMDD to me, but if it's day then month, you would use YYYYDDMM.)
I'd suggest you format it in the script, so that you won't have to worry about it every time you need to use that date.
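The logic the question describes (the earliest date per state, then the latest of those per member) can be written out as a small sketch to check the expected result. This is a Python illustration, not QlikView syntax: the function name and the (state, member, date) row layout are hypothetical, and it relies on YYYYMMDD strings sorting in date order.

```python
def max_of_state_minimums(rows):
    """rows: iterable of (state, member_id, date_string) for the selected values."""
    # Step 1: earliest date per (member, state).
    minimums = {}
    for state, member, date_string in rows:
        key = (member, state)
        if key not in minimums or date_string < minimums[key]:
            minimums[key] = date_string
    # Step 2: latest of those minimums per member.
    result = {}
    for (member, _state), earliest in minimums.items():
        if member not in result or earliest > result[member]:
            result[member] = earliest
    return result

selected = [("State1", 1, "20120101"), ("State2", 1, "20120103")]
latest = max_of_state_minimums(selected)
print(latest)
```

With the selections from the question, member 1 maps to 20120103.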
