Suggestions on what patterns/analysis to derive from Airlines Big Data - hadoop

I recently started learning Hadoop and found this data set: http://stat-computing.org/dataexpo/2009/the-data.html (2009 Data Expo).
What kinds of patterns or analyses can I derive from it with Hadoop MapReduce? I just need something to get started with. If anyone knows of a better data set for learning, please share the link.
The attributes are as follows:
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
Thanks
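One possible starting point, offered as a suggestion only: average arrival delay per carrier is a simple aggregation that fits the map and reduce steps well. The sketch below simulates both phases in plain Python instead of running on Hadoop. The function names (map_line, reduce_carrier, run_job) are made up for illustration, and the column positions come from the attribute list above (UniqueCarrier is column 9, ArrDelay is column 15). It assumes comma-separated input and skips any row whose ArrDelay is not numeric, such as a header row.

```python
from collections import defaultdict

CARRIER_IDX = 8     # column 9 in the attribute list: UniqueCarrier
ARR_DELAY_IDX = 14  # column 15 in the attribute list: ArrDelay


def map_line(line):
    """Map step: emit (carrier, arrival delay) for one CSV line."""
    fields = line.rstrip("\n").split(",")
    if len(fields) <= ARR_DELAY_IDX:
        return  # malformed row, emit nothing
    try:
        delay = float(fields[ARR_DELAY_IDX])
    except ValueError:
        return  # header row or non-numeric delay, emit nothing
    yield fields[CARRIER_IDX], delay


def reduce_carrier(carrier, delays):
    """Reduce step: average all delays seen for one carrier."""
    delays = list(delays)
    return carrier, sum(delays) / len(delays)


def run_job(lines):
    """Simulate the shuffle phase by grouping mapper output by key."""
    groups = defaultdict(list)
    for line in lines:
        for carrier, delay in map_line(line):
            groups[carrier].append(delay)
    return dict(reduce_carrier(c, d) for c, d in groups.items())
```

Grouping by Origin, Month or DayOfWeek instead of the carrier gives further analyses in the same pattern, for example the share of cancelled flights per airport.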

Related

time difference between two date removing closing time

My company has a number of shops across many locations. The shops raise requests for items to be delivered to them so that they can sell them. We want to understand how much time the company takes to deliver an item, in minutes. However, we don't want to count the time during which the shop is closed.
Let's consider the shop opening and closing times are:
Now the elapsed time:
When I subtract the complain time from the resolution time, I get the calculated elapsed time in minutes, but I need the required elapsed time in minutes. So in the first case, the minutes during which the shop was closed have to be deducted from the 2090 minutes. I need to write an Oracle query that calculates the required elapsed time in minutes (shown in green).
What query can we write?
One formula to get the net time is as follows:
For every day involved, add up the opening hours. For your first example this is two days, 2021-01-11 and 2021-01-12, with 13 opening hours each (09:00 - 22:00). That makes 26 hours, or 1560 minutes.
If the first day starts after the store opens, subtract the difference. 10:12 - 09:00 = 1:12 = 72 minutes.
If the last day ends before the store closes, subtract the difference. 22:00 - 21:02 = 0:58 = 58 minutes.
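As a quick check of the formula, here is a minimal Python sketch of the three steps. The function name net_minutes and its parameters are hypothetical, and like the formula it assumes that both timestamps fall within opening hours and that the shop has the same hours every day.

```python
from datetime import datetime


def net_minutes(complain_time, resolution_time, opening_minute, closing_minute):
    """Elapsed minutes between two timestamps, counting only opening hours.

    opening_minute and closing_minute are minutes since midnight.
    Assumes both timestamps lie within opening hours.
    """
    # Step 1: full opening time for every calendar day involved.
    days = (resolution_time.date() - complain_time.date()).days + 1
    total = days * (closing_minute - opening_minute)

    # Step 2: drop the part of the first day before the request came in.
    start = complain_time.hour * 60 + complain_time.minute
    if start > opening_minute:
        total -= start - opening_minute

    # Step 3: drop the part of the last day after the request was resolved.
    end = resolution_time.hour * 60 + resolution_time.minute
    if end < closing_minute:
        total -= closing_minute - end

    return total


# First example: 1560 - 72 - 58 = 1430 minutes
print(net_minutes(datetime(2021, 1, 11, 10, 12), datetime(2021, 1, 12, 21, 2), 9 * 60, 22 * 60))
```

For the first example this gives 1430 minutes, compared with 2090 minutes of raw elapsed time.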
Oracle doesn't have a TIME datatype, so I assume you are using Oracle's datetime data type they call DATE to store the opening and closing time and we must ignore the date part. And you are probably using the DATE type for the complain_time and the resolution_time, too.
In the query below I convert the time parts to minutes right away, so the calculations get a tad more readable later.
with s as
(
  select
    shop,
    -- Oracle's EXTRACT can only take HOUR and MINUTE from a timestamp, not from a DATE, hence the casts
    extract(hour from cast(opening_time as timestamp)) * 60 + extract(minute from cast(opening_time as timestamp)) as opening_minute,
    extract(hour from cast(closing_time as timestamp)) * 60 + extract(minute from cast(closing_time as timestamp)) as closing_minute
  from shops
)
, r as
(
  select
    request, shop, complain_time, resolution_time,
    trunc(complain_time) as complain_day,
    trunc(resolution_time) as resolution_day,
    extract(hour from cast(complain_time as timestamp)) * 60 + extract(minute from cast(complain_time as timestamp)) as complain_minute,
    extract(hour from cast(resolution_time as timestamp)) * 60 + extract(minute from cast(resolution_time as timestamp)) as resolution_minute
  from requests
)
select
  r.request, r.shop, r.complain_time, r.resolution_time,
  -- opening minutes of every calendar day involved
  (r.resolution_day - r.complain_day + 1) * (s.closing_minute - s.opening_minute)
  -- minus the part of the first day before the request came in
  - case when r.complain_minute > s.opening_minute then r.complain_minute - s.opening_minute else 0 end
  -- minus the part of the last day after the request was resolved
  - case when r.resolution_minute < s.closing_minute then s.closing_minute - r.resolution_minute else 0 end
  as net_duration_in_minutes
from r
join s on s.shop = r.shop
order by r.request;

How to round off seconds to the nearest minute?

I am converting seconds ([ss]) to mm:ss format.
But I also have to round the value to the nearest minute.
For example, 19:29 -> 19 minutes and 19:32 -> 20 minutes.
I tried the MROUND function, but it did not work:
=MROUND(19.45,15/60/24) gives 19.44791667.
It should come out as 20 minutes.
Try it like this, where column B is formatted as Time:
=ARRAYFORMULA(IF(LEN(A1:A), MROUND(A1:A, "00:01:00"), ))
=TEXT(MROUND("00:"&TO_TEXT(B5), "00:01:00"), "mm:ss")
=ARRAYFORMULA(TEXT(MROUND(SUM(TIME(0,
REGEXEXTRACT(TO_TEXT(C3:C11), "(.+):"),
REGEXEXTRACT(TO_TEXT(C3:C11), ":(.+)"))), "00:01:00"), "[mm]:ss"))
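For comparison, the rounding itself is plain integer arithmetic once the value is held as a number of seconds. The Python sketch below illustrates the idea outside the spreadsheet; the helper names parse_mmss and round_to_minute are hypothetical, and a value exactly on the half minute (for example 19:30) is rounded up.

```python
def parse_mmss(text):
    """Convert an 'mm:ss' string to a total number of seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def round_to_minute(total_seconds):
    """Round a non-negative number of seconds to the nearest whole minute."""
    # Adding 30 seconds before the floor division rounds halves upwards.
    return (total_seconds + 30) // 60


for value in ("19:29", "19:32"):
    print(value, "->", round_to_minute(parse_mmss(value)), "minutes")
```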

Facebook Analytics - how to group events based on time (breakdown to 24 segments)

I want to know at what time of day most of my events happen over a given period. For example:
Event : Initiate Checkout
time 00:00 ~ 01:00 = 80 events
time 01:00 ~ 02:00 = 145 events
time 02:00 ~ 03:00 = 300 events
...
time 23:00 ~ 24:00 = 20 events
for the date range 1 Nov ~ 30 Nov 2018
Note: the result shouldn't be 720 (30*24) time fragments, but 24 time fragments.
How can I do that using Facebook Analytics?
You can go to the "Events" section, choose "Initiate Checkout", and on the charts set "Time Interval" to "Hourly". For the date range, you can choose 1 Nov ~ 30 Nov in the top left corner.
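To make the 24-segment breakdown concrete, here is a small Python sketch of the aggregation itself. It assumes the raw event timestamps are already available as datetime objects. How to get them out of Facebook Analytics is not covered here, and the function name hourly_histogram is made up for illustration.

```python
from collections import Counter
from datetime import datetime


def hourly_histogram(timestamps):
    """Count events per hour of day (0-23), ignoring the date part."""
    counts = Counter(ts.hour for ts in timestamps)
    # Always return 24 buckets, including hours without any event.
    return [counts.get(hour, 0) for hour in range(24)]


events = [
    datetime(2018, 11, 1, 0, 15),
    datetime(2018, 11, 2, 0, 45),
    datetime(2018, 11, 30, 23, 59),
]
for hour, count in enumerate(hourly_histogram(events)):
    if count:
        print(f"{hour:02d}:00 ~ {hour + 1:02d}:00 = {count} events")
```

Because only the hour is used as the key, events from all 30 days fall into the same 24 buckets.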

Algorithm to calculate a date for complex occupation management

Hello fellow Stack Overflowers,
I have a situation where I need some help choosing the best way to make an algorithm work. The objective is to manage the occupation of a resource (let's call it resource A) that has multiple tasks, where each task takes a specified amount of time to complete. At this first stage I don't want to involve multiple variables, so let's keep it simple and assume the resource only has a schedule of working days.
For example:
1 - We have 1 resource, resource A.
2 - Resource A works from 8 am to 4 pm, Monday to Friday. To keep it simple for now, he doesn't take a lunch break, so that is 8 hours of work a day.
3 - Resource A has 5 tasks to complete. To avoid complexity at this level, let's suppose each one takes exactly 10 hours to complete.
4 - Resource A will start working on these tasks on 2018-05-16 at exactly 2 pm.
Problem:
Now, all I need to know is the correct finish date for all 5 tasks, taking all the previous limitations into account.
In this case, he needs 6 working days and additionally 2 hours of a 7th day (50 hours at 8 hours a day).
The expected result would be: 2018-05-24 (at 4 pm).
Implementation:
I thought about 2 options and would like feedback on them, or on other options that I might not be considering.
Algorithm 1
1 - Create a list of "slots", where each "slot" represents 1 hour, for x days.
2 - Cross this list of slots with the hour schedule of the resource, to remove all the slots where the resource isn't there. This would return a list of the slots in which he can actually work.
3 - Occupy the remaining slots with the tasks that I have for him.
4 - Finally, check the date/hour of the last occupied slot.
Disadvantage: I think this might be overkill, considering that I don't want to track his occupation in the future; all I want to know is when the tasks will be completed.
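For illustration, Algorithm 1 does not need a materialised slot list. A minimal Python sketch can walk the hourly slots one at a time and only count those inside the schedule. The function name finish_by_slots and its parameters are hypothetical, and it assumes tasks start on a whole hour and the schedule is the same on every working day.

```python
from datetime import datetime, timedelta


def finish_by_slots(start, hours_needed, day_start=8, day_end=16):
    """Walk one-hour slots from `start` and occupy only working slots.

    A slot counts if it falls on Monday to Friday and begins within
    [day_start, day_end). Returns the end of the last occupied slot.
    """
    slot = start
    remaining = hours_needed
    while remaining > 0:
        is_working_day = slot.weekday() < 5  # Monday is 0, Sunday is 6
        if is_working_day and day_start <= slot.hour < day_end:
            remaining -= 1  # this slot is occupied by a task
        slot += timedelta(hours=1)
    return slot


# 5 tasks of 10 hours each, starting Wednesday 2018-05-16 at 2 pm
print(finish_by_slots(datetime(2018, 5, 16, 14), 50))
```

For the example in the question this returns 2018-05-24 16:00, at the cost of one loop iteration per calendar hour.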
Algorithm 2
1 - Add the task hours (50 hours) to the starting date to get the expectedFinishDate. (This would give expectedFinishDate = 2018-05-18 (at 4 pm).)
2 - Cross the hours between the starting date and expectedFinishDate with the schedule to get the number of hours that he won't work. (This would basically give the unavailable hours, 16 hours a day, resulting in remainingHoursForCalc = 32 hours.)
3 - Calculate a new expectedFinishDate from the unavailable hours by adding these 32 hours to the previous 2018-05-18 (at 4 pm).
4 - Repeat points 2 and 3 with the new expectedFinishDate until remainingHoursForCalc = 0.
Disadvantage: This would result in a recursive method or in a very weird while loop. Again, I think this might be overkill for calculating a simple date.
What would you suggest? Is there any other option that I might not be considering that would make this simpler? Or do you think there is a way to improve either of these 2 algorithms to make it work?
Improved version:
import java.util.Calendar;
import java.util.Date;

public class Main {
    public static void main(String args[]) throws Exception
    {
        Date d=new Date();
        System.out.println(d);
        d.setMinutes(0);
        d.setSeconds(0);
        d.setHours(13);
        Calendar c=Calendar.getInstance();
        c.setTime(d);
        c.set(Calendar.YEAR, 2018);
        c.set(Calendar.MONTH, Calendar.MAY);
        c.set(Calendar.DAY_OF_MONTH, 17);
        //c.add(Calendar.HOUR, -24-5);
        d=c.getTime();
        //int workHours=11;
        int hoursArray[] = {1,2,3,4,5, 10,11,12, 19,20, 40};
        for(int workHours : hoursArray)
        {
            try
            {
                Date end=getEndOfTask(d, workHours);
                System.out.println("a task starting at "+d+" and lasting "+workHours
                    + " hours will end at " +end);
            }
            catch(Exception e)
            {
                System.out.println(e.getMessage());
            }
        }
    }

    public static Date getEndOfTask(Date startOfTask, int workingHours) throws Exception
    {
        int totalHours=0;//including non-working hours
        //startOfTask +totalHours =endOfTask
        int startHour=startOfTask.getHours();
        if(startHour<8 || startHour>16)
            throw new Exception("a task cannot start outside the working hours interval");
        System.out.println("startHour="+startHour);
        int startDayOfWeek=startOfTask.getDay();//start date's day of week; Wednesday=3
        System.out.println("startDayOfWeek="+startDayOfWeek);
        if(startDayOfWeek==6 || startDayOfWeek==0)
            throw new Exception("a task cannot start on Saturdays or Sundays");
        int remainingHoursUntilDayEnd=16-startHour;
        System.out.println("remainingHoursUntilDayEnd="+remainingHoursUntilDayEnd);
        /*some discussion here: if task starts at 12:30, we have 3h30min
         * until the end of the program; however, getHours() will return 12, which
         * subtracted from 16 will give 4h. It will work fine if task starts at 12:00,
         * or, generally, at the beginning of the hour; let's assume a task will start at HH:00*/
        int remainingDaysUntilWeekEnd=5-startDayOfWeek;
        System.out.println("remainingDaysUntilWeekEnd="+remainingDaysUntilWeekEnd);
        int completeWorkDays = (workingHours-remainingHoursUntilDayEnd)/8;
        System.out.println("completeWorkDays="+completeWorkDays);
        //excluding both the start day, and the end day, if they are not fully occupied by the task
        int workingHoursLastDay=(workingHours-remainingHoursUntilDayEnd)%8;
        System.out.println("workingHoursLastDay="+workingHoursLastDay);
        /* workingHours=remainingHoursUntilDayEnd+(8*completeWorkDays)+workingHoursLastDay */
        int numberOfWeekends=(int)Math.ceil( (completeWorkDays-remainingDaysUntilWeekEnd)/5.0 );
        if((completeWorkDays-remainingDaysUntilWeekEnd)%5==0)
        {
            if(workingHoursLastDay>0)
            {
                numberOfWeekends++;
            }
        }
        System.out.println("numberOfWeekends="+numberOfWeekends);
        totalHours+=(int)Math.min(remainingHoursUntilDayEnd, workingHours);//covers the case
        //when task lasts 1 or 2 hours, and we have maybe 4h until end of day; that's why I use Math.min
        if(completeWorkDays>0 || workingHoursLastDay>0)
        {
            totalHours+=8;//the hours of the current day between 16:00 and 24:00
            //it might be the case that completeWorkDays is 0, yet the task spans up to tomorrow
            //so we still have to add these 8h
        }
        if(completeWorkDays>0)//redundant if, because 24*0=0
        {
            totalHours+=24*completeWorkDays;//for every 8 working h, we have a total of 24 h that have
            //to be added to the date
        }
        if(workingHoursLastDay>0)
        {
            totalHours+=8;//the hours between 00:00 and 08:00
            totalHours+=workingHoursLastDay;
        }
        if(numberOfWeekends>0)
        {
            totalHours+=48*numberOfWeekends;//every weekend between start and end dates means two days
        }
        System.out.println("totalHours="+totalHours);
        Calendar calendar=Calendar.getInstance();
        calendar.setTime(startOfTask);
        calendar.add(Calendar.HOUR, totalHours);
        return calendar.getTime();
    }
}
You may adjust hoursArray[], or d.setHours(...) along with c.set(Calendar.DAY_OF_MONTH, ...), to test various start dates and task lengths.
There is still a bug, due to the addition of the 8 hours between 16:00 and 24:00:
a task starting at Thu May 17 13:00:00 EEST 2018 and lasting 11 hours will end at Sat May 19 00:00:00 EEST 2018.
I've kept a lot of print statements; they are useful for debugging purposes.
Here is the terminology explained:
I agree that algorithm 1 is overkill.
I think I would first make sure I had the conditions right: hours per day (8), working days (Mon, Tue, Wed, Thu, Fri). I would then divide the hours required (5 * 10 = 50) by the hours per day to get the minimum number of working days needed (50 / 8 = 6, remainder 2). Slightly more advanced: divide by hours per week first (50 / 40 = 1 week, remainder 10 hours). Count working days from the start date to get a first shot at the end date. There was probably a remainder from the division, so use it to determine whether the tasks can end on this day or run into the next working day.
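One way to read this division-based approach is sketched below in Python. It is only an interpretation: the function name finish_by_division and its parameters are hypothetical, and it assumes the task starts on a whole hour, within working hours, on a working day. It uses up the rest of the start day first, then splits the remaining hours into full working days plus a last partial day.

```python
from datetime import datetime, time, timedelta


def finish_by_division(start, hours_needed, day_start=8, day_end=16):
    """Finish time for `hours_needed` working hours, Monday to Friday."""
    hours_per_day = day_end - day_start
    left_today = day_end - start.hour
    if hours_needed <= left_today:
        return start + timedelta(hours=hours_needed)  # fits into the start day

    remaining = hours_needed - left_today
    full_days, last_day_hours = divmod(remaining, hours_per_day)
    if last_day_hours == 0:
        # The work ends exactly at closing time, so treat the final
        # full day as the last day instead of spilling into the next.
        full_days -= 1
        last_day_hours = hours_per_day

    # Move forward over the full days plus the last day, skipping weekends.
    day = start.date()
    days_to_advance = full_days + 1
    while days_to_advance > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday is 0, Sunday is 6
            days_to_advance -= 1

    return datetime.combine(day, time(day_start)) + timedelta(hours=last_day_hours)


print(finish_by_division(datetime(2018, 5, 16, 14), 50))  # example from the question
print(finish_by_division(datetime(2018, 5, 17, 13), 11))  # 3 hours Thursday, 8 hours Friday
```

The loop only runs once per calendar day, and for very long tasks the weekend skipping could also be replaced by the divide-by-hours-per-week step.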

Information Retrieval: URL hits in a time frame

Algorithm challenge:
Problem statement:
How would you design a logging system for something like Google? You should be able to query for the number of times a URL was opened between two timestamps.
Input: start_time, end_time, URL1
Output: number of times URL1 was opened between start_time and end_time.
Some specs:
A database is not an optimal solution.
A URL might have been opened multiple times at a given timestamp.
A URL might have been opened a large number of times between two timestamps.
start_time and end_time can be a month apart.
Time can be granular to a second.
One solution:
Hash of a hash
Key              Value
URL hash ---->   T1         CumFrequency
E.g.:
Amazon hash -->  T          CumFreq
                 11:00 am   3    (opened 3 times at 11:00 am)
                 11:15 am   4    (opened 1 time at 11:15 am, cumfreq is 3+1=4)
                 11:30 am   11   (opened 7 times at 11:30 am, cumfreq is 3+1+7=11)
Input: 11:10 am, 11:37 am, Amazon
The output can be obtained by subtracting the cumulative frequency at the last timestamp before 11:10 am, which is 11:00 am, from the cumulative frequency at the last active timestamp before 11:37 am, which is 11:30 am. Hence the result is
11 - 3 = 8.
Can we do better?
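The cumulative-frequency lookup can be sketched with binary search over a sorted list of timestamps per URL. This is only an illustration of the idea described above, not a full logging design. The function name count_between is hypothetical, timestamps are stored as minutes since midnight for brevity, and both interval ends are treated as inclusive, which is an assumption.

```python
from bisect import bisect_left, bisect_right


def count_between(times, cum_freq, start, end):
    """Number of opens with start <= timestamp <= end.

    times    -- sorted timestamps at which the URL was opened
    cum_freq -- cum_freq[i] is the total number of opens up to times[i]
    """
    def cum_before(index):
        # Cumulative count over the first `index` timestamps.
        return cum_freq[index - 1] if index > 0 else 0

    up_to_end = cum_before(bisect_right(times, end))       # timestamps <= end
    before_start = cum_before(bisect_left(times, start))   # timestamps < start
    return up_to_end - before_start


# Outer hash keyed by URL; values hold the timestamps and cumulative counts
# from the Amazon example (11:00, 11:15 and 11:30 as minutes since midnight).
index = {"Amazon": ([660, 675, 690], [3, 4, 11])}

times, cum_freq = index["Amazon"]
print(count_between(times, cum_freq, 670, 697))  # 11:10 am to 11:37 am
```

Each query costs two binary searches, so it is logarithmic in the number of distinct timestamps stored for the URL, regardless of how many opens fall inside the range.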
