Unable to process Timeseries data in PIG - hadoop

I have time-series data, e.g. 2018-10-12 01:25:37, and have extracted the date (2018-10-12) and the time (1:25:37) from each timestamp. The requirement now is to filter the time values on a condition, e.g. comparing each time value with an atom from another bag that contains time data (hh:mm:ss). Pig has no 'TIME' datatype for hh:mm:ss values.
What datatype should be used to load 'time' values in Pig?

To extract the parts of a date (year, month, hour, minute, etc.), use these functions:
For year: GetYear()
For month: GetMonth()
For day: GetDay()
For hour: GetHour()
For minute: GetMinute()
date.txt
2018-10-12 11:15:43
2018-10-12 12:25:12
A = LOAD 'date.txt' AS (ts:chararray);
B = FOREACH A GENERATE ToDate(ts, 'yyyy-MM-dd HH:mm:ss') AS dt;
-- a FOREACH has to end in GENERATE, so each part is emitted as a named field
C = FOREACH B GENERATE
    GetYear(dt) AS year,
    GetMonth(dt) AS month,
    GetDay(dt) AS day,
    GetHour(dt) AS hour,
    GetMinute(dt) AS minute;
-- finally you can concatenate year, month and day, or hour and minute, using the CONCAT function
-- (CONCAT works on chararray values, so cast the int parts first)
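On the datatype question itself: since Pig has no time-of-day type, one option is to keep the time as a zero-padded HH:mm:ss chararray, because fixed-width, zero-padded time strings sort in chronological order. The Java sketch below only illustrates that property; it is not Pig code, and the class and method names are made up for illustration. It also shows that a value written without the leading zero, such as 1:25:37, does not compare correctly as a string.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

class TimeOfDayOrdering {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter HMS = DateTimeFormatter.ofPattern("HH:mm:ss");

    // Returns the zero-padded HH:mm:ss part of a "yyyy-MM-dd HH:mm:ss" timestamp.
    static String timeOfDay(String timestamp) {
        return LocalDateTime.parse(timestamp, TS).toLocalTime().format(HMS);
    }

    // True when a sorts strictly before b under plain string comparison.
    static boolean earlierAsString(String a, String b) {
        return a.compareTo(b) < 0;
    }

    public static void main(String[] args) {
        String a = timeOfDay("2018-10-12 01:25:37");
        String b = timeOfDay("2018-10-12 11:15:43");
        // Zero-padded strings: string order and time order agree.
        System.out.println(a + " before " + b + " as strings: " + earlierAsString(a, b));
        System.out.println("as LocalTime: " + LocalTime.parse(a).isBefore(LocalTime.parse(b)));
        // Without the leading zero, string order no longer matches time order.
        System.out.println("1:25:37 before 11:15:43 as strings: " + earlierAsString("1:25:37", "11:15:43"));
    }
}
```

If the time values are kept in this padded form on both sides, an ordinary string comparison in a filter behaves like a time comparison.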

Related

combine Date and Time into a DateTime from api response

What is an efficient way to combine date and time strings into a single DateTime? I am using football-api; in the response, the time attribute has the format "08:50" and the date attribute has the format "01.01.2018". I want to save the value in the database's date field in the format 2018-01-01 08:50:00.
a = match["formatted_date"].to_date.strftime("%Y, %m, %d")
y = a[0..3].to_i
m = a[6..7].to_i
d = a[10..11].to_i
date = Date.new(y, m, d).to_datetime + Time.parse(match["time"]).seconds_since_midnight.seconds
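The underlying idea is to parse each string with an explicit pattern and then combine the two parts, rather than slicing characters out of a formatted string. Below is a minimal sketch of that idea in Java rather than Ruby; the class and method names are hypothetical, and it assumes the date string is day.month.year (the sample "01.01.2018" does not show which field comes first).

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

class CombineDateTime {
    // Assumption: dates arrive as dd.MM.yyyy and times as HH:mm.
    private static final DateTimeFormatter DATE_IN = DateTimeFormatter.ofPattern("dd.MM.yyyy");
    private static final DateTimeFormatter TIME_IN = DateTimeFormatter.ofPattern("HH:mm");
    private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parses both parts, joins them, and formats the result for storage.
    static String combine(String date, String time) {
        LocalDate d = LocalDate.parse(date, DATE_IN);
        LocalTime t = LocalTime.parse(time, TIME_IN);
        return LocalDateTime.of(d, t).format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(combine("01.01.2018", "08:50"));
    }
}
```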

PIG - Filter or how to get in side of a bag or tuple

As you can see, we can apply the first filter because we can use aggregates on the temperature. Now how do we apply the second filter on strings?
We are only trying to filter e by the conditions clear and partly cloudy.
Weather = LOAD 'hdfs:/home/hduser/final/Weather.csv' USING PigStorage(',');
A = FOREACH Weather GENERATE (int)$0 AS year, (int)$1 AS month, (int)$2 AS day, (int)$4 AS temp, $14 AS cond, (double)$5 as dewpoint , (double)$10 as wind;
group_by_day = GROUP A BY (year,month,day);
Schema:
{day: (year: int, month: int, day: int), temperature: {(temp: int)},
condition: {(cond: bytearray)}, dewPoint: {(dewpoint: double)},
windSpeed: {(wind: double)}}
You have to cast cond to chararray, as in the statement below. Since you have not specified datatypes in your load statement, all fields are loaded as bytearray, which is the default datatype chosen by PigStorage.
A = FOREACH Weather GENERATE (int)$0 AS year, (int)$1 AS month, (int)$2 AS day, (int)$4 AS temp, (chararray)$14 AS cond, (double)$5 as dewpoint , (double)$10 as wind;
EDIT
I was able to get the results by using the BagToString function. You can do the filtering in one step.
D = FILTER C BY (MIN(temperature) >= 60 AND MAX(temperature) <= 79) AND (BagToString(condition) == 'clear' OR BagToString(condition) == 'partly cloudy');
Or in your case
f = FILTER e BY BagToString(condition) == 'clear' OR BagToString(condition) == 'partly cloudy';

How to Perform Roundup of Date in Pig

I want to apply a filter in Pig that keeps only the data belonging to the current date, the current hour, or the current week.
The input data looks like this:
2016-01-05 16:59:50,text11
2016-01-05 17:59:50,text11
I load and filter it like this:
A = LOAD '/hoursbetween-poc/input/' using PigStorage(',') as (time:chararray,colval:chararray) ;
G = FILTER A BY HoursBetween(CurrentTime(),ToDate(time, 'yyyy-MM-dd HH:mm:ss'))<1;
dump G;
But this subtracts 60 minutes from the current time. I want to filter all records belonging to the current hour.
E.g. if the current time is 6:30, the code filters everything before 5:30; I want to round to the hour and filter only what is before 5:00.
How can I achieve this in Pig?
Input :
2016-01-05 10:00:50,text1
2016-01-05 10:59:50,text2
2016-01-05 11:10:50,text3
2016-01-05 09:00:50,text4
Pig Script :
A = LOAD 'a.csv' USING PigStorage(',') AS (time:chararray,colval:chararray) ;
B = FOREACH A GENERATE GetHour(CurrentTime()) AS cur_hr, GetHour(ToDate(time, 'yyyy-MM-dd HH:mm:ss')) AS act_hr, time, colval;
C = FILTER B BY (cur_hr - act_hr) <= 1;
DUMP C;
Output :
(11,10,2016-01-05 10:00:50,text1)
(11,10,2016-01-05 10:59:50,text2)
(11,11,2016-01-05 11:10:50,text3)
The script was executed at 2016-01-05 11:40; as the output shows, it selected records from 10:00 onwards.
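Note that comparing only the hour numbers ignores the date, so a record from 11:10 on an earlier day would pass the same test, and the subtraction goes negative just after midnight. A way to avoid both is to truncate the current time to the start of its hour and compare full timestamps against that boundary. The Java sketch below illustrates the logic only; it is not Pig code, the class and method names are hypothetical, and "now" is passed in as a parameter so the behaviour is reproducible.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

class CurrentHourFilter {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // True when recordTime falls in [start of now's hour, start of the next hour).
    static boolean inCurrentHour(String recordTime, LocalDateTime now) {
        LocalDateTime hourStart = now.truncatedTo(ChronoUnit.HOURS);
        LocalDateTime t = LocalDateTime.parse(recordTime, TS);
        return !t.isBefore(hourStart) && t.isBefore(hourStart.plusHours(1));
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2016, 1, 5, 11, 40);
        String[] times = {
            "2016-01-05 10:00:50", "2016-01-05 10:59:50",
            "2016-01-05 11:10:50", "2016-01-05 09:00:50"
        };
        for (String t : times) {
            System.out.println(t + " in current hour: " + inCurrentHour(t, now));
        }
    }
}
```

Moving the lower bound back by one hour (hourStart.minusHours(1)) would also include the previous full hour, which is the window the script above selects for same-day records.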

How to find date lies in which week of month

Suppose I have a date in year-month-day format, say "2015-02-12". Now I want to find which week of the month this date lies in; here, the 12th lies in the 2nd week of February. I would like to do something like
LocalDate date = 2015-02-12;
date.getWeekOfMonth should give me 2, because the 12th lies in the 2nd week of February. How can I do it?
Thanks
Edit
Hi, I am so sorry. I should have posted this before you asked. I tried the following code
String input = "2015-01-31";
SimpleDateFormat df = new SimpleDateFormat("w");
Date date = df.parse(input);
Calendar cal = Calendar.getInstance();
cal.setTime(date);
int week = cal.get(Calendar.WEEK_OF_MONTH);
System.out.println(week);
It prints 2.
While when I check with the following code
String valuee="2015-01-31";
Date currentDate =new SimpleDateFormat("yyyy-MM-dd").parse(valuee);
System.out.println(new SimpleDateFormat("w").format(currentDate));
It prints 5.
Try this one. Remember to feed it with your date format and string with this date as input.
SimpleDateFormat df = new SimpleDateFormat(format);
Date date = df.parse(input);
Calendar cal = Calendar.getInstance();
cal.setTime(date);
int week = cal.get(Calendar.WEEK_OF_MONTH);
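With Calendar.getInstance(), the first day of the week and the minimum number of days in the first week come from the default locale, so the result can differ between machines. As an alternative sketch, java.time lets you state that convention explicitly. The version below assumes weeks start on Sunday and that any partial first week counts as week 1; the class and method names are made up for illustration.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.WeekFields;

class WeekOfMonthDemo {
    // Assumption: Sunday-based weeks, and a first week of even one day is week 1.
    private static final WeekFields WEEKS = WeekFields.of(DayOfWeek.SUNDAY, 1);

    // Returns the week of the month for a date given as yyyy-MM-dd.
    static int weekOfMonth(String isoDate) {
        return LocalDate.parse(isoDate).get(WEEKS.weekOfMonth());
    }

    public static void main(String[] args) {
        System.out.println(weekOfMonth("2015-02-12")); // 2
        System.out.println(weekOfMonth("2015-01-31")); // 5
    }
}
```

A different convention, for example WeekFields.ISO with Monday-based weeks, can give a different week number for dates near the start of a month.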

entity framework grouping by datetime

Every 5 minutes a row in a sql server table is added. The fields are:
DateTime timeMark,Int total.
Using entity framework I want to populate a new list covering a whole week of five minute values using an average of the totals from the last three months.
How would I accomplish this with Entity Framework?
Assuming your log is really exact on the "five minutes", and that I understood correctly, you want a list with 7 days * 24 hours * (60/5) slots per hour, so 2016 results?
//define a startDate
var beginningDate = <the date 3 month ago to start with>;
//get the endDate
var endDate = beginningDate.AddMonths(3);
var list = myTable.Where(m => m.TimeMark >= beginningDate && m.TimeMark <=endDate)
//group by dayofWeek, hour and minute will give you data for each distinct day of week, hour and minutes
.GroupBy(m => new {
dayofWeek = SqlFunctions.DatePart("weekday", m.TimeMark),
hour = SqlFunctions.DatePart("hour", m.TimeMark),
minute = SqlFunctions.DatePart("minute", m.TimeMark)
})
.Select(g => new {
g.Key.dayofWeek,
g.Key.hour,
g.Key.minute,
total = g.Average(x => x.Total)
});
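The grouping logic itself is independent of Entity Framework: every row is assigned to its (day of week, hour, minute) slot and the totals in each slot are averaged. The following in-memory Java sketch shows only that step, under the same assumption that rows land exactly on five-minute marks. The Row class and all names in it are hypothetical stand-ins for the table's timeMark and total columns.

```java
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WeeklySlotAverages {
    // Hypothetical row type standing in for the (timeMark, total) columns.
    static final class Row {
        final LocalDateTime timeMark;
        final int total;
        Row(LocalDateTime timeMark, int total) {
            this.timeMark = timeMark;
            this.total = total;
        }
    }

    // Key such as "MONDAY 08:05" identifying one five-minute slot of the week.
    static String slotKey(LocalDateTime t) {
        return String.format("%s %02d:%02d", t.getDayOfWeek(), t.getHour(), t.getMinute());
    }

    // Groups rows by weekly slot and averages the totals within each slot.
    static Map<String, Double> averageBySlot(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> slotKey(r.timeMark),
                TreeMap::new,
                Collectors.averagingInt(r -> r.total)));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(LocalDateTime.of(2024, 1, 1, 8, 5), 10),   // a Monday
                new Row(LocalDateTime.of(2024, 1, 8, 8, 5), 20),   // the following Monday
                new Row(LocalDateTime.of(2024, 1, 1, 8, 10), 7));
        averageBySlot(rows).forEach((slot, avg) -> System.out.println(slot + " -> " + avg));
    }
}
```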
