Elasticsearch count overlapping timeranges in date histogram

I have events stored in Elasticsearch 6.6 that have a start and end time, e.g.:
{
  "startTime": "2019-01-11T14:49:16.719Z",
  "endTime": "2019-01-11T16:31:56.483Z"
}
I want to display a date histogram which shows the number of overlapping events in each hour.
Example:
Hour of Day:
12  13  14  15  16  17  18  19
Events:
    <====E1====>    <==E2==>
            <====E3====>
            <==E4==>
Result:
 0   1   1   3   2   2   1   0
Is there a way to do this with an Elasticsearch aggregation, or do I have to implement it in the application?
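Later Elasticsearch releases can run a date_histogram directly over a date_range field, where each range counts toward every bucket it overlaps, but on 6.6 the simplest route is probably the application. A minimal client-side sketch in Python, assuming the matching events have already been fetched (the helper and the hourly bucketing scheme are mine):

from collections import Counter
from datetime import datetime, timedelta

def overlap_histogram(events):
    """Count, for each hourly bucket, how many events overlap it."""
    counts = Counter()
    for event in events:
        # assumes ISO-8601 UTC timestamps like the example above
        start = datetime.fromisoformat(event["startTime"].rstrip("Z"))
        end = datetime.fromisoformat(event["endTime"].rstrip("Z"))
        bucket = start.replace(minute=0, second=0, microsecond=0)
        while bucket <= end:
            counts[bucket] += 1
            bucket += timedelta(hours=1)
    return counts

events = [{"startTime": "2019-01-11T14:49:16.719Z",
           "endTime": "2019-01-11T16:31:56.483Z"}]
for bucket, n in sorted(overlap_histogram(events).items()):
    print(bucket.isoformat(), n)  # the 14:00, 15:00 and 16:00 buckets each get 1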

Related

Using Featuretools to aggregate per time of day

I'm wondering if there's any way to calculate all the same variables I'm already getting from deep feature synthesis (i.e. counts, sums, means, etc.) for different time segments within a day.
For example, a count of morning events (hours 0-12) as a separate variable from evening events (hours 13-24).
Also, in the same vein, what would be the easiest way to get counts by day of week, day of month, day of year, etc.? Custom aggregation primitives?
Yes, this is possible. First, let's generate some random data, and then I'll walk through how to do it.
import featuretools as ft
import pandas as pd
import numpy as np

# make some random data
n = 100
events_df = pd.DataFrame({
    "id": range(n),
    "customer_id": np.random.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
    "amount": np.random.rand(n) * 100
})
events_df
The first thing we want to do is add a new column for the segment we want to calculate features for:
def to_part_of_day(x):
    if x < 12:
        return "morning"
    elif x < 18:
        return "afternoon"
    else:
        return "evening"

events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
Now we have a dataframe like this:
id customer_id timestamp amount time_of_day
0 0 a 2019-01-01 00:00:00 44.713802 morning
1 1 c 2019-01-01 01:00:00 58.776476 morning
2 2 a 2019-01-01 02:00:00 94.671566 morning
3 3 a 2019-01-01 03:00:00 39.271852 morning
4 4 a 2019-01-01 04:00:00 40.773290 morning
5 5 c 2019-01-01 05:00:00 19.815855 morning
6 6 a 2019-01-01 06:00:00 62.457129 morning
7 7 b 2019-01-01 07:00:00 95.114636 morning
8 8 b 2019-01-01 08:00:00 37.824668 morning
9 9 a 2019-01-01 09:00:00 46.502904 morning
Next, let's load it into our entityset:
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         index="id",  # use the generated id column as the entity index
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
es.plot()
Now we are ready to set the segments we want to create aggregations for by using interesting_values:
es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
Then we can run DFS, listing the aggregation primitives we want computed on a per-segment basis in the where_primitives parameter:
fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])
fm
In the resulting feature matrix, you can now see we have aggregations per morning, afternoon, and evening:
COUNT(events) MEAN(events.amount) SUM(events.amount) COUNT(events WHERE time_of_day = afternoon) COUNT(events WHERE time_of_day = evening) COUNT(events WHERE time_of_day = morning) MEAN(events.amount WHERE time_of_day = afternoon) MEAN(events.amount WHERE time_of_day = evening) MEAN(events.amount WHERE time_of_day = morning) SUM(events.amount WHERE time_of_day = afternoon) SUM(events.amount WHERE time_of_day = evening) SUM(events.amount WHERE time_of_day = morning)
customer_id
a 37 49.753630 1840.884300 12 7 18 35.098923 45.861881 61.036892 421.187073 321.033164 1098.664063
b 30 51.241484 1537.244522 3 10 17 45.140800 46.170996 55.300715 135.422399 461.709963 940.112160
c 33 39.563222 1305.586314 9 7 17 50.129136 34.593936 36.015679 451.162220 242.157549 612.266545
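As for the day-of-week / day-of-month part of the question: the same interesting_values pattern should extend to any categorical column you add. A sketch reusing events_df from above (the weekday encoding is my assumption; day of month or day of year would work the same way):

# add a day-of-week column (0 = Monday ... 6 = Sunday)
events_df["day_of_week"] = events_df["timestamp"].dt.weekday

# rebuild the entityset so the new column is picked up
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         index="id",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")

# one where-clause per weekday value
es["events"]["day_of_week"].interesting_values = list(range(7))

fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count"],
                trans_primitives=[],
                where_primitives=["count"])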

Laravel 5.7 Database Design Layout / Average from Collection

I have a situation where each Order can have Feedback. If the product is physical, the Feedback can have many packaging_feedbacks. The packaging_feedbacks are meant to be a relation to the packaging_feedback_details.
Feedback Model
public function packagingFeedbacks()
{
    return $this->hasManyThrough('App\PackagingFeedbackDetail', 'App\PackagingFeedback',
        'feedback_id', 'id', 'id', 'user_selection');
}
packaging_feedback_details
id | type_id (groups the "names" for each feedback option) | name
 1 | 0 | well packed
 2 | 0 | bad packaging
 3 | 1 | fast shipping
 4 | 1 | express delivery
packaging_feedbacks
id | feedback_id | user_selection (points to the id of packaging_feedback_details)
 1 | 1 |  2
 2 | 1 |  6
 3 | 1 |  7
 4 | 1 | 12
 5 | 1 | 15
 6 | 1 | 17
 7 | 2 |  1
 8 | 2 |  6
 9 | 2 |  7
10 | 2 | 12
11 | 2 | 15
12 | 2 | 17
13 | 3 |  1
14 | 3 |  6
15 | 3 |  7
16 | 3 | 12
17 | 3 | 15
18 | 3 | 17
Now I would like to be able to get the average selection of the users for a physical product. I started by using:
$result = Product::with('userFeedbacks.packagingFeedbacks')->where('id', 1)->first();
$collection = collect();
foreach ($result->userFeedbacks as $key) {
    foreach ($key->packagingFeedbacks as $skey) {
        $collection->push($skey);
    }
}
foreach ($collection->groupBy('type_id') as $key) {
    echo($key->average('type_id'));
}
But it does not return what I need, since average() computes the arithmetic mean rather than the most common selection. Is there a better way? I don't think this is the cleverest approach. Also, is my database design, in general, the "best" way to handle this?
The type of average you're looking for here is the mode. Laravel's collection instances have a mode() method, introduced in 5.2, which when given a key returns an array containing the highest-occurring value for that key.
If I have understood your question correctly, this should give you what you're after:
$result->userFeedbacks
    ->flatMap->packagingFeedbacks
    ->groupBy('type_id')
    ->map->mode('id');
The above takes advantage of flatMap() and higher order messages on collections.

Calculate features at multiple training windows in Featuretools

I have a table with customers and transactions. Is there a way to get features filtered for the last 3/6/9/12 months? I would like to automatically generate features such as:
number of trans in last 3 months
....
number of trans in last 12 months
average trans in last 3 months
...
average trans in last 12 months
I've tried using training_window=["1 month", "3 months"], but it does not seem to return multiple features for each window.
Example:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
window_features = ft.dfs(entityset=es,
                         target_entity="customers",
                         training_window=["1 hour", "1 day"],
                         features_only=True)
window_features
Do I have to do individual windows separately and then merge the results?
As you mentioned, in Featuretools 0.2.1 you have to build the feature matrices individually for each training window and then merge the results. With your example, you would do that as follows:
import pandas as pd
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                             "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
features = ft.dfs(entityset=es,
                  target_entity="customers",
                  agg_primitives=['count'],
                  trans_primitives=[],
                  features_only=True)
fm_1 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1h',
                                   verbose=True)
fm_2 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1d',
                                   verbose=True)
new_df = fm_1.reset_index()
new_df = new_df.merge(fm_2.reset_index(), on="customer_id", suffixes=("_1h", "_1d"))
Then, the new dataframe will look like:
customer_id  COUNT(sessions)_1h  COUNT(transactions)_1h  COUNT(sessions)_1d  COUNT(transactions)_1d
          1                   1                      17                   3                      43
          2                   3                      36                   3                      36
          3                   0                       0                   1                      25
          4                   0                       0                   0                       0
          5                   1                      15                   2                      29
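To generalize this to 3/6/9/12 month windows, the same build-then-merge idea can be looped over a list of windows (a sketch; expressing the windows in days and the column-suffix scheme are my choices):

# one feature matrix per training window, concatenated on the customer_id index
windows = ["90 days", "180 days", "270 days", "365 days"]
fms = []
for window in windows:
    fm = ft.calculate_feature_matrix(features,
                                     entityset=es,
                                     cutoff_time=cutoff_times,
                                     training_window=window)
    # suffix the columns, e.g. COUNT(transactions) -> COUNT(transactions)_90days
    fms.append(fm.add_suffix("_" + window.replace(" ", "")))
combined = pd.concat(fms, axis=1)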

SparkR - Retaining the previous value in another column

I have a Spark DataFrame that looks like this:
id dates value
1 11 2013-11-15 10
2 11 2013-11-16 15
3 22 2013-11-15 20
4 22 2013-11-16 21
5 22 2013-11-17 3
I wish to retain the value from the previous date per id.
The final result should look like this:
id dates value prev_value
1 11 2013-11-15 10 NA
2 11 2013-11-16 15 10
3 22 2013-11-15 20 NA
4 22 2013-11-16 21 20
5 22 2013-11-17 3 21
The solution from this question would not work for various reasons.
I would appreciate the help!
So after playing with it for a while, here's the workaround that I found:
First of all, here's the example DF
id <- c(11, 11, 22, 22, 22)
dates <- as.Date(c('2013-11-15', '2013-11-16', '2013-11-15', '2013-11-16', '2013-11-17'), "%Y-%m-%d")
value <- c(10, 15, 20, 21, 3)
example <- as.DataFrame(data.frame(id = id, dates = dates, value = value))
I copy the example DF, add one day to the original date, and rename the column:
example_p <- example
example_p$dates <- date_add(example_p$dates, 1)
colnames(example_p) <- c("id", "dates", "prev_value")
Finally, I merge the new DF into the original one:
result <- select(merge(example, example_p, by = intersect(names(example), names(example_p)),
                       all.x = TRUE),
                 c("id_x", "dates_x", "value", "prev_value"))
showDF(result)
+----+----------+-----+----------+
|id_x| dates_x|value|prev_value|
+----+----------+-----+----------+
|22.0|2013-11-15| 20.0| null|
|11.0|2013-11-15| 10.0| null|
|11.0|2013-11-16| 15.0| 10.0|
|22.0|2013-11-16| 21.0| 20.0|
|22.0|2013-11-17| 3.0| 21.0|
+----+----------+-----+----------+
Obviously, this is somewhat clumsy, and I will be happy to give the points to anyone who can suggest a solution that works faster than this.
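For something faster, a window function avoids the self-join entirely. In PySpark terms (SparkR exposes the same functionality through windowPartitionBy(), orderBy(), over() and lag()), a sketch:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(11, "2013-11-15", 10), (11, "2013-11-16", 15),
     (22, "2013-11-15", 20), (22, "2013-11-16", 21), (22, "2013-11-17", 3)],
    ["id", "dates", "value"])

# lag(value) over a window partitioned by id and ordered by date
# returns the previous row's value, or null for the first row per id
w = Window.partitionBy("id").orderBy("dates")
result = df.withColumn("prev_value", F.lag("value", 1).over(w))
result.show()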

Suggestions on what patterns/analysis to derive from Airlines Big Data

I recently started learning Hadoop.
I found this data set: http://stat-computing.org/dataexpo/2009/the-data.html (2009 data).
I want some suggestions on what types of patterns or analysis I can do in Hadoop MapReduce; I just need something to get started with. If anyone has a link to a better data set that I can use for learning, please share it here.
The attributes are as follows (a starter MapReduce sketch follows below):
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
Thanks
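A classic first job on this data set is the average departure delay per carrier. Below is a minimal Hadoop Streaming sketch in Python; the file names are mine, and it assumes the CSV columns appear in the order listed above (UniqueCarrier at index 8, DepDelay at index 15):

# mapper.py: emit (carrier, departure delay) pairs
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    # skip the header row and rows with a missing DepDelay
    if fields[0] == "Year" or len(fields) < 16 or fields[15] == "NA":
        continue
    print("%s\t%s" % (fields[8], fields[15]))

# reducer.py: average the delays per carrier (input arrives grouped by key)
import sys

current, total, count = None, 0.0, 0
for line in sys.stdin:
    carrier, delay = line.strip().split("\t")
    if carrier != current:
        if current is not None:
            print("%s\t%.2f" % (current, total / count))
        current, total, count = carrier, 0.0, 0
    total += float(delay)
    count += 1
if current is not None:
    print("%s\t%.2f" % (current, total / count))

Run it with something like: hadoop jar hadoop-streaming.jar -input airline -output avg_dep_delay -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py. Other starter ideas from these attributes: cancellation rates by CancellationCode, delay breakdowns (CarrierDelay vs. WeatherDelay vs. NASDelay), and busiest Origin/Dest airport pairs.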
