Using Featuretools to aggregate per time of day - feature-extraction

I'm wondering if there's any way to calculate, for different time segments within a day, all the same variables I'm already calculating with deep feature synthesis (i.e. counts, sums, means, etc.)?
E.g. a count of morning events (hours 0-12) as a separate variable from a count of evening events (hours 13-24).
Also, in the same vein, what would be the easiest way to get counts by day of week, day of month, day of year, etc.? Custom aggregate primitives?

Yes, this is possible. First, let's generate some random data and then I'll walk through how to calculate the features you describe.
import featuretools as ft
import pandas as pd
import numpy as np

# make some random data
n = 100
events_df = pd.DataFrame({
    "id": range(n),
    "customer_id": np.random.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
    "amount": np.random.rand(n) * 100
})
The first thing we want to do is add a new column for the segment we want to calculate features for:
def to_part_of_day(x):
    if x < 12:
        return "morning"
    elif x < 18:
        return "afternoon"
    else:
        return "evening"

events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
Now we have a dataframe like this:
id customer_id timestamp amount time_of_day
0 0 a 2019-01-01 00:00:00 44.713802 morning
1 1 c 2019-01-01 01:00:00 58.776476 morning
2 2 a 2019-01-01 02:00:00 94.671566 morning
3 3 a 2019-01-01 03:00:00 39.271852 morning
4 4 a 2019-01-01 04:00:00 40.773290 morning
5 5 c 2019-01-01 05:00:00 19.815855 morning
6 6 a 2019-01-01 06:00:00 62.457129 morning
7 7 b 2019-01-01 07:00:00 95.114636 morning
8 8 b 2019-01-01 08:00:00 37.824668 morning
9 9 a 2019-01-01 09:00:00 46.502904 morning
Next, let's load it into our entityset
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         index="id",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers",
                    index="customer_id",
                    base_entity_id="events")
es.plot()
Now we are ready to set the segments we want to create aggregations for, using interesting_values:
es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
Then we can run DFS, placing the aggregation primitives we want calculated on a per-segment basis in the where_primitives parameter:
fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])
fm
In the resulting feature matrix, you can now see we have aggregations per morning, afternoon, and evening
COUNT(events) MEAN(events.amount) SUM(events.amount) COUNT(events WHERE time_of_day = afternoon) COUNT(events WHERE time_of_day = evening) COUNT(events WHERE time_of_day = morning) MEAN(events.amount WHERE time_of_day = afternoon) MEAN(events.amount WHERE time_of_day = evening) MEAN(events.amount WHERE time_of_day = morning) SUM(events.amount WHERE time_of_day = afternoon) SUM(events.amount WHERE time_of_day = evening) SUM(events.amount WHERE time_of_day = morning)
customer_id
a 37 49.753630 1840.884300 12 7 18 35.098923 45.861881 61.036892 421.187073 321.033164 1098.664063
b 30 51.241484 1537.244522 3 10 17 45.140800 46.170996 55.300715 135.422399 461.709963 940.112160
c 33 39.563222 1305.586314 9 7 17 50.129136 34.593936 36.015679 451.162220 242.157549 612.266545
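As for the second part of the question (counts by day of week, day of month, etc.), no custom aggregation primitives are needed: the same interesting_values mechanism works on any categorical column you add. Here is a sketch under the same setup; the column name and the use of pandas day_name() are my choices, not from the original answer:

# add a day-of-week segment column, then rebuild the entityset so it is included
events_df["day_of_week"] = events_df["timestamp"].dt.day_name()  # e.g. "Tuesday"

es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         index="id",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers",
                    index="customer_id",
                    base_entity_id="events")

# one WHERE clause is generated per interesting value
es["events"]["day_of_week"].interesting_values = ["Monday", "Tuesday", "Wednesday",
                                                  "Thursday", "Friday", "Saturday", "Sunday"]

fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])

Day of month or day of year work the same way with dt.day or dt.dayofyear; just keep in mind that the number of generated features grows with the number of interesting values.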

Related

Calculate features at multiple training windows in Featuretools

I have a table with customers and transactions. Is there a way how to get features that would be filtered for last 3/6/9/12 months? I would like to automatically generate features:
number of trans in last 3 months
....
number of trans in last 12 months
average trans in last 3 months
...
average trans in last 12 months
I've tried using training_window=["1 month", "3 months"], but it does not seem to return multiple features for each window.
Example:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
window_features = ft.dfs(entityset=es,
                         target_entity="customers",
                         training_window=["1 hour", "1 day"],
                         features_only=True)
window_features
Do I have to do individual windows separately and then merge the results?
As you mentioned, in Featuretools 0.2.1 you have to build the feature matrices individually for each training window and then merge the results. With your example, you would do that as follows:
import pandas as pd
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                             "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
features = ft.dfs(entityset=es,
                  target_entity="customers",
                  agg_primitives=['count'],
                  trans_primitives=[],
                  features_only=True)
fm_1 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1h',
                                   verbose=True)
fm_2 = ft.calculate_feature_matrix(features,
                                   entityset=es,
                                   cutoff_time=cutoff_times,
                                   training_window='1d',
                                   verbose=True)
new_df = fm_1.reset_index()
new_df = new_df.merge(fm_2.reset_index(), on="customer_id", suffixes=("_1h", "_1d"))
Then, the new dataframe will look like:
customer_id COUNT(sessions)_1h COUNT(transactions)_1h COUNT(sessions)_1d COUNT(transactions)_1d
1 1 17 3 43
2 3 36 3 36
3 0 0 1 25
4 0 0 0 0
5 1 15 2 29
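To get the 3/6/9/12-month features from the original question, the same pattern can be put in a loop. A sketch, reusing features, es, and cutoff_times from above (the suffix scheme is just illustrative):

windows = ["3 months", "6 months", "9 months", "12 months"]
merged = None
for window in windows:
    fm = ft.calculate_feature_matrix(features,
                                     entityset=es,
                                     cutoff_time=cutoff_times,
                                     training_window=window).reset_index()
    # tag each window's columns, e.g. COUNT(transactions)_3months
    suffix = "_" + window.replace(" ", "")
    fm = fm.rename(columns={c: c + suffix
                            for c in fm.columns if c != "customer_id"})
    merged = fm if merged is None else merged.merge(fm, on="customer_id")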

SparkR - Retaining the previous value in another column

I have a Spark DataFrame that looks like this:
id dates value
1 11 2013-11-15 10
2 11 2013-11-16 15
3 22 2013-11-15 20
4 22 2013-11-16 21
5 22 2013-11-17 3
I wish to retain the value from the previous date per id.
The final result should look like this:
id dates value prev_value
1 11 2013-11-15 10 NA
2 11 2013-11-16 15 10
3 22 2013-11-15 20 NA
4 22 2013-11-16 21 20
5 22 2013-11-17 3 21
The solution from this question would not work for various reasons.
I would appreciate the help!
So after playing with it for a while, here's the workaround that I found:
First of all, here's the example DF
id <- c(11, 11, 22, 22, 22)
dates <- as.Date(c('2013-11-15', '2013-11-16', '2013-11-15', '2013-11-16', '2013-11-17'), "%Y-%m-%d")
value <- c(10, 15, 20, 21, 3)
example <- as.DataFrame(data.frame(id = id, dates = dates, value = value))
I copy the example DF and add 1 day to the original date, then rename the column
example_p <- example
example_p$dates <- date_add(example_p$dates, 1)
colnames(example_p) <- c("id", "dates", "prev_value")
Finally, I merge the new DF to the original one
result <- select(merge(example, example_p,
                       by = intersect(names(example), names(example_p)),
                       all.x = TRUE),
                 c("id_x", "dates_x", "value", "prev_value"))
showDF(result)
+----+----------+-----+----------+
|id_x| dates_x|value|prev_value|
+----+----------+-----+----------+
|22.0|2013-11-15| 20.0| null|
|11.0|2013-11-15| 10.0| null|
|11.0|2013-11-16| 15.0| 10.0|
|22.0|2013-11-16| 21.0| 20.0|
|22.0|2013-11-17| 3.0| 21.0|
+----+----------+-----+----------+
Obviously, this is somewhat clumsy and I will be happy to give the points to anyone who can suggest a solution that works faster than this.
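For what it's worth, newer Spark releases support window functions, which avoid the self-join entirely (SparkR gained equivalents around 2.0; shown here as a PySpark sketch):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(11, "2013-11-15", 10), (11, "2013-11-16", 15),
     (22, "2013-11-15", 20), (22, "2013-11-16", 21), (22, "2013-11-17", 3)],
    ["id", "dates", "value"])

# previous value per id, ordered by date; the first row of each id gets null
w = Window.partitionBy("id").orderBy("dates")
df.withColumn("prev_value", F.lag("value").over(w)).show()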

Hash key sort week numbers using Ruby

I have a hash whose keys are week numbers and whose values are attendance scores. I am trying to calculate the average attendance for each month based on the week numbers, i.e. the keys.
Below is the example of the hash
weekly_attendance = {31 => 40.0, 32 => 100.00, 33 => 34.00, 34 => 23.78, 35 => 56.79, 36 => 44.50, 37 => 67.00, 38 => 55.00 }
Since a month consists of 4 weeks and the beginning week of a month is divisible by 4, the attendance needs to be grouped as follows:
Month 1 attendance consists of weeks 31, 32, i.e. (40.00 + 100.00)/2 = 70.0
Month 2 attendance consists of weeks 33, 34, 35, 36,
i.e. (34.00 + 23.78 + 56.79 + 44.50)/4 = 39.77
Month 3 attendance consists of weeks 37, 38, i.e. (67.00 + 55.00)/2 = 61.0
The output should be
monthly_attendance = [70.0, 39.77, 61.0]
I tried each and select approaches, using the modulo condition, i.e. week % 4 == 0, to add up the attendance values, but I could not effectively group them by month:
tmp = 0
monthly_attendance = []
weekly_attendance.select do |k, v|
  tmp += v
  monthly_attendance << tmp if k % 4 == 0
end
I am unable to group the week numbers into ranges using the above code.
You can try something like this:
results = weekly_attendance.group_by { |week, value| (week + 3) / 4 }.map do |month, groups|
  values = groups.map(&:last)
  average = values.inject(0) { |sum, val| sum + val } / values.length
  [month, average]
end.to_h
p results # {8=>70.0, 9=>39.7675, 10=>61.0}
But the logic of converting weeks to months is flawed here; it's better to use some calendar function instead of just dividing by 4.
You can get the real month numbers using:
require 'date'
weekly_attendance.group_by { |week, value| Date.commercial(Time.now.year, week, 1).month }
But the result will not match the result you expect, because for example week 31 is in July, while week 32 is in August (this year), instead of being the same month like you expect.
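As an aside, the same real-calendar grouping is easy to express in Python; a sketch, assuming Python 3.8+ for date.fromisocalendar and the year 2015:

from datetime import date

weekly_attendance = {31: 40.0, 32: 100.0, 33: 34.0, 34: 23.78,
                     35: 56.79, 36: 44.5, 37: 67.0, 38: 55.0}

# group each ISO week under the month containing the week's Monday
by_month = {}
for week, score in weekly_attendance.items():
    month = date.fromisocalendar(2015, week, 1).month
    by_month.setdefault(month, []).append(score)

averages = {m: sum(v) / len(v) for m, v in by_month.items()}
# week 31's Monday falls in July, week 32's in August, matching the caveat above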
I assume that if x units are produced in a given week, x/7 units are produced on each day of that week. The code below could be easily changed if this assumption were changed.
First construct a hash whose keys are months (1-12) and whose values are hashes whose keys are weeks and whose values are the numbers of days in the given week for the given month. (Whew!)
require 'date'

def months_to_weeks(year)
  day = Date.new(year)
  days = day.leap? ? 366 : 365  # number of days in the year
  days.times.with_object(Hash.new { |h,k| h[k] = Hash.new(0) }) do |_,h|
    h[day.month][day.cweek] += 1
    day = day.next
  end
end
The doc for Hash#new provides an explanation of the statement:
Hash.new { |h,k| h[k] = Hash.new(0) }
In brief, this creates an empty hash with a default given by the block. If h is the hash that is created, and h does not have a key k, h[k] will cause the block to be executed, which adds that key to the hash and sets its value to an empty hash with a default value of 0. The latter hash is often referred to as a "counting hash". I realize this is still rather a mouthful for a Ruby newbie.
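(If it helps to see it in another language, the Python analogue of that construct is a defaultdict of counting defaultdicts:)

from collections import defaultdict

# analogue of Hash.new { |h,k| h[k] = Hash.new(0) }: a missing outer key
# auto-creates an inner dict whose missing counts default to 0
months_to_weeks = defaultdict(lambda: defaultdict(int))
months_to_weeks[1][1] += 1   # no KeyError; the count is created and set to 1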
Let's generate this hash for the current year:
year = 2015
mon_to_wks = months_to_weeks(year)
#=> {1 =>{1 =>4, 2 =>7, 3 =>7, 4 =>7, 5=>6},
# 2 =>{5 =>1, 6 =>7, 7 =>7, 8 =>7, 9=>6},
# 3 =>{9 =>1, 10=>7, 11=>7, 12=>7, 13=>7, 14=>2},
# 4 =>{14=>5, 15=>7, 16=>7, 17=>7, 18=>4},
# 5 =>{18=>3, 19=>7, 20=>7, 21=>7, 22=>7},
# 6 =>{23=>7, 24=>7, 25=>7, 26=>7, 27=>2},
# 7 =>{27=>5, 28=>7, 29=>7, 30=>7, 31=>5},
# 8 =>{31=>2, 32=>7, 33=>7, 34=>7, 35=>7, 36=>1},
# 9 =>{36=>6, 37=>7, 38=>7, 39=>7, 40=>3},
# 10=>{40=>4, 41=>7, 42=>7, 43=>7, 44=>6},
# 11=>{44=>1, 45=>7, 46=>7, 47=>7, 48=>7, 49=>1},
# 12=>{49=>6, 50=>7, 51=>7, 52=>7, 53=>4}}
Because of how Date#cweek is defined, the weeks in this hash begin on Mondays. In January, for example, there are 4 days in week 1. These four days, Jan. 1-4, 2015, were the first Thursday, Friday, Saturday and Sunday of 2015. (Check your calendar.)
If the first day of each week is to be a day other than Monday (Sunday, for example) the hash calculation would have to be changed slightly.
This shows, for example, that in January of 2015, there are 4 days in week 1, 7 days in weeks 2, 3 and 4 and 6 days in week 5. The remaining day of week 5 is the first day in February.
Once this hash has been constructed, it is a simple matter to compute the averages for each month:
weekly_attendance = { 31 => 40.00, 32 => 100.00, 33 => 34.00, 34 => 23.78,
                      35 => 56.79, 36 => 44.50, 37 => 67.00, 38 => 55.00 }

prod_by_mon = (1..12).each_with_object(Hash.new(0)) do |i,h|
  mon_to_wks[i].each do |week, days|
    h[i] += (days/7.0)*weekly_attendance[week] if weekly_attendance.key?(week)
  end
end
#=> {7=>28.571428571428573, 8=>232.3557142857143, 9=>160.14285714285714}
prod_by_mon.merge(prod_by_mon) { |_,v| v.round(2) }
#=> {7=>28.57, 8=>232.36, 9=>160.14}
This shows that production in month 7 was 28.57, and so on. Note that:
28.57 + 232.36 + 160.14 #=> 421.07
weekly_attendance.values.reduce(:+) #=> 421.07

Smaller variation between times of different days

I have been working on an algorithm that selects a set of date/time objects with a certain characteristic, but with no success.
The data to be used are in a list of lists of date/time objects, e.g.:
lstDays[i][j], where i selects the day and j selects the time
What is the problem? I need a set of nearest date/time objects, where each time in the set comes from a different day.
For example: [2012-09-09 12:00, 2012-09-10 12:00, 2012-09-11 12:00]
This set is the best possible one, because the spread between its times is minimized to zero.
Important
Trying to contextualize this: I want to observe whether a phenomenon occurs at the same time on different days. If not, I want to evaluate whether the distance between the times is reasonable for my study.
I would like a generic algorithm for any number of days and times. The algorithm should return every such set of datetime objects along with its time spread:
[2012-09-09 12:00,2012-09-10 12:00, 2012-09-11 12:00], 0
[2012-09-09 13:00,2012-09-10 13:00, 2012-09-11 13:05], 5
and so on.
:: "0", because the diff between all times on the first line from datetime objects is zero seconds.
:: "5", because the diff between all times on the second line from datetime objects is five seconds.
Edit: Code here
for i in range(len(lstDays)):
    for j in range(len(lstDays[i])):
        print lstDays[i][j]
Output:
2013-07-18 11:16:00
2013-07-18 12:02:00
2013-07-18 12:39:00
2013-07-18 13:14:00
2013-07-18 13:50:00
2013-07-19 11:30:00
2013-07-19 12:00:00
2013-07-19 12:46:00
2013-07-19 13:19:00
2013-07-22 11:36:00
2013-07-22 12:21:00
2013-07-22 12:48:00
2013-07-22 13:26:00
2013-07-23 11:18:00
2013-07-23 11:48:00
2013-07-23 12:30:00
2013-07-23 13:12:00
2013-07-24 11:18:00
2013-07-24 11:42:00
2013-07-24 12:20:00
2013-07-24 12:52:00
2013-07-24 13:29:00
Note: lstDays[i][j] is a datetime object.
lstDays = [ [/*datetime objects from day i*/], [/*datetime objects from day i+1*/], [/*datetime objects from day i+2*/], ... ]
And I am not worried about performance, a priori.
Hope that you can help me! (:
Generate a histogram:
hours = [0] * 24
for object in objects:  # whatever your objects are
    # assuming object.date_time looks like '2013-07-18 10:55:00'
    hour = object.date_time[11:13]  # the hour is in positions 11-12
    hours[int(hour)] += 1

for hour in xrange(24):
    print '%02d: %d' % (hour, hours[hour])
You can always resort to collecting the times into a list, then estimating the differences, and grouping those objects that fall below a limit, all packed into a dictionary with the difference as the value and the timestamps as keys. If this is not exactly what you need, I'm pretty sure it should be easy to select whatever result you need from it.
import numpy
import datetime

times_list = [object1.time(), object2.time(), ..., objectN.time()]
limit = 5  # limit of five seconds
groups = {}
for time in times_list:
    delta_times = numpy.asarray([(tt - time).total_seconds() for tt in times_list])
    whr = numpy.where(abs(delta_times) < limit)[0]
    similar = [str(times_list[ii]) for ii in whr]
    if len(similar) > 1:
        similar.sort()
        max_time = numpy.max(delta_times[whr])  # max? median? mean?
        groups[tuple(similar)] = max_time
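For the original question as stated (pick exactly one time from each day and report each combination's spread), another direct approach is to enumerate the cross-product of the day lists. A minimal sketch, assuming the lists are small enough for brute force (the function name is just illustrative):

from itertools import product

def sets_with_spread(lstDays):
    # one time per day: every combination across the day lists
    results = []
    for combo in product(*lstDays):
        # compare times of day only, as seconds since midnight
        secs = [t.hour * 3600 + t.minute * 60 + t.second for t in combo]
        results.append((combo, max(secs) - min(secs)))
    results.sort(key=lambda pair: pair[1])  # smallest spread first
    return results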

Mac dayofweek issue

Would anyone know why the following code works correctly on Windows and not on Mac?
Today (24/11/2010) should return 47, not the 48 that MacOS gives.
import java.text.SimpleDateFormat

def fm_date = '24/11/2010'
def lPad = { it ->
    st = '00' + it.toString()
    return st.substring(st.length() - 2, st.length())
}

dfm = new SimpleDateFormat("dd/MM/yyyy")
cal = Calendar.getInstance()
cal.setTime(dfm.parse(fm_date))
now = cal.get(Calendar.WEEK_OF_YEAR)
cal.add(Calendar.DAY_OF_MONTH, -7)
prev = cal.get(Calendar.WEEK_OF_YEAR)
cal.add(Calendar.DAY_OF_MONTH, 14)
next = cal.get(Calendar.WEEK_OF_YEAR)
prev = 'diary' + lPad(prev) + '.shtml'
next = 'diary' + lPad(next) + '.shtml'
return 'diary' + lPad(now) + '.shtml'
I believe it's an ISO week number issue...
If I use this code adapted (and groovyfied) from yours:
import java.text.SimpleDateFormat

def fm_date = '24/11/2010'
Calendar.getInstance().with { cal ->
    // We want ISO week numbers
    cal.firstDayOfWeek = MONDAY
    cal.minimalDaysInFirstWeek = 4
    setTime new SimpleDateFormat( 'dd/MM/yyyy' ).parse( fm_date )
    now = cal[ WEEK_OF_YEAR ]
}
"diary${"$now".padLeft( 2, '0' )}.shtml"
I get diary47.shtml returned
As the documentation for GregorianCalendar explains, if you want ISO week numbers:
Values calculated for the WEEK_OF_YEAR field range from 1 to 53. Week 1 for a year is the earliest seven day period starting on getFirstDayOfWeek() that contains at least getMinimalDaysInFirstWeek() days from that year. It thus depends on the values of getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (as needed).

For example, January 1, 1998 was a Thursday. If getFirstDayOfWeek() is MONDAY and getMinimalDaysInFirstWeek() is 4 (these are the values reflecting ISO 8601 and many national standards), then week 1 of 1998 starts on December 29, 1997, and ends on January 4, 1998. If, however, getFirstDayOfWeek() is SUNDAY, then week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998; the first three days of 1998 then are part of week 53 of 1997.
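(As a quick cross-check of the expected value, any ISO-week implementation should agree; for example, in Python:)

from datetime import date

# ISO 8601 weeks start on Monday; week 1 contains at least 4 days of January
print(date(2010, 11, 24).isocalendar()[1])   # => 47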
Edit
Even Groovier (from John's comment)
def fm_date = '24/11/2010'
Calendar.getInstance().with { cal ->
    // We want ISO week numbers
    cal.firstDayOfWeek = MONDAY
    cal.minimalDaysInFirstWeek = 4
    cal.time = Date.parse( 'dd/MM/yyyy', fm_date )
    now = cal[ WEEK_OF_YEAR ]
}
"diary${"$now".padLeft( 2, '0' )}.shtml"
Edit2
Just ran this on Windows using VirtualBox, and got the same result
