I'm wondering if there's any way to calculate all the same variables I already am using deep feature synthesis (ie counts, sums, mean, etc) for different time segments within a day?
I.e. count of morning events (hours 0-12) as a separate variable from evening events (13-24).
Also, within the same vein, what would be the easiest to get counts by day of week, day of month, day of year, etc. Custom aggregate primitives?
Yes, this is possible. First, let's generate some random data, and then I'll walk through how.
import featuretools as ft
import pandas as pd
import numpy as np
# make some random data
n = 100
events_df = pd.DataFrame({
    "id": range(n),
    "customer_id": np.random.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
    "amount": np.random.rand(n) * 100
})
events_df
The first thing we want to do is add a new column for the segment we want to calculate features for:
def to_part_of_day(x):
    # x is the hour of the day (0-23)
    if x < 12:
        return "morning"
    elif x < 18:
        return "afternoon"
    else:
        return "evening"
events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
Now we have a dataframe like this:
id customer_id timestamp amount time_of_day
0 0 a 2019-01-01 00:00:00 44.713802 morning
1 1 c 2019-01-01 01:00:00 58.776476 morning
2 2 a 2019-01-01 02:00:00 94.671566 morning
3 3 a 2019-01-01 03:00:00 39.271852 morning
4 4 a 2019-01-01 04:00:00 40.773290 morning
5 5 c 2019-01-01 05:00:00 19.815855 morning
6 6 a 2019-01-01 06:00:00 62.457129 morning
7 7 b 2019-01-01 07:00:00 95.114636 morning
8 8 b 2019-01-01 08:00:00 37.824668 morning
9 9 a 2019-01-01 09:00:00 46.502904 morning
Next, let's load it into our entityset
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
es.plot()
Now, we are ready to set the segments we want to create aggregations for by using interesting_values
es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
Then we can run DFS and place the aggregation primitives we want to do on a per segment basis in the where_primitives parameter
fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])
fm
In the resulting feature matrix, you can now see we have aggregations per morning, afternoon, and evening
COUNT(events) MEAN(events.amount) SUM(events.amount) COUNT(events WHERE time_of_day = afternoon) COUNT(events WHERE time_of_day = evening) COUNT(events WHERE time_of_day = morning) MEAN(events.amount WHERE time_of_day = afternoon) MEAN(events.amount WHERE time_of_day = evening) MEAN(events.amount WHERE time_of_day = morning) SUM(events.amount WHERE time_of_day = afternoon) SUM(events.amount WHERE time_of_day = evening) SUM(events.amount WHERE time_of_day = morning)
customer_id
a 37 49.753630 1840.884300 12 7 18 35.098923 45.861881 61.036892 421.187073 321.033164 1098.664063
b 30 51.241484 1537.244522 3 10 17 45.140800 46.170996 55.300715 135.422399 461.709963 940.112160
c 33 39.563222 1305.586314 9 7 17 50.129136 34.593936 36.015679 451.162220 242.157549 612.266545
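For the second part of the question (counts by day of week, day of month, day of year), the same idea should carry over: derive one column per calendar segment and aggregate on it. Below is a minimal pandas-only sketch, not featuretools output. The column names day_of_week, day_of_month and day_of_year are my own, and the data is regenerated with a seeded generator so the snippet stands alone.

```python
import numpy as np
import pandas as pd

# Regenerate comparable random data (seeded, so results are repeatable)
rng = np.random.default_rng(0)
n = 100
events_df = pd.DataFrame({
    "id": range(n),
    "customer_id": rng.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("2019-01-01", freq="1h", periods=n),
    "amount": rng.random(n) * 100,
})

# One derived column per calendar segment (names are illustrative)
events_df["day_of_week"] = events_df["timestamp"].dt.dayofweek   # 0 = Monday
events_df["day_of_month"] = events_df["timestamp"].dt.day
events_df["day_of_year"] = events_df["timestamp"].dt.dayofyear

# Event counts per customer and day of week, one column per weekday
counts_by_dow = (
    events_df.groupby(["customer_id", "day_of_week"])
    .size()
    .unstack(fill_value=0)
)
print(counts_by_dow)
```

Each of these derived columns could presumably be given interesting_values in the same way as time_of_day, so no custom aggregation primitive should be needed for this.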
I have a hash whose keys are week numbers and whose values are attendance scores. I am trying to calculate the average attendance for each month based on the week numbers, i.e. the keys.
Below is an example of the hash:
weekly_attendance = {31 => 40.0, 32 => 100.00, 33 => 34.00, 34 => 23.78, 35 => 56.79, 36 => 44.50, 37 => 67.00, 38 => 55.00 }
Since a month consists of 4 weeks and the last week of each month is divisible by 4, the attendance needs to be grouped as follows:
Month 1 attendance consists of weeks 31,32 i.e. (40.00+100.00)/2 = 70.0
Month 2 attendance consists of weeks 33,34,35,36
i.e. (34.00+23.78+56.79+44.50)/4 = 39.77
Month 3 attendance consists of weeks 37, 38 i.e. (67.00+55.00)/2 = 61.0
The output should be
monthly_attendance = [70.0, 39.77, 61.0]
I tried the each and select approaches with a modulo condition, i.e. week % 4 == 0, to add up the attendance values, but could not group them effectively by month:
tmp = 0
monthly_attendance = []
weekly_attendance.select do |k,v|
  tmp += v
  monthly_attendance << tmp if k % 4 == 0
end
I am unable to group the week numbers into ranges using the above code.
You can try something like this:
results = weekly_attendance.group_by { |week, value| (week + 3) / 4 }.map do |month, groups|
  values = groups.map(&:last)
  average = values.inject(0) { |sum, val| sum + val } / values.length
  [month, average]
end.to_h
p results # {8=>70.0, 9=>39.7675, 10=>61.0}
But the logic of converting weeks to months is flawed here; it's better to use a calendar function instead of simply dividing by 4.
You can get the real month numbers using:
require 'date'
weekly_attendance.group_by { |week, value| Date.commercial(Time.now.year, week, 1).month }
But the result will not match the result you expect, because for example week 31 is in July, while week 32 is in August (this year), instead of being the same month like you expect.
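To make the difference between the two groupings concrete, here is a rough Python sketch of both. The helper names average_by and iso_month are mine, and the year 2015 is an assumption (the Ruby snippet above uses the current year instead).

```python
from collections import defaultdict
from datetime import date

weekly_attendance = {31: 40.0, 32: 100.0, 33: 34.0, 34: 23.78,
                     35: 56.79, 36: 44.5, 37: 67.0, 38: 55.0}

def average_by(key_func, data):
    """Group the weekly values by key_func(week) and average each group."""
    groups = defaultdict(list)
    for week, value in data.items():
        groups[key_func(week)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Fixed 4-week blocks: weeks 29-32 -> 8, 33-36 -> 9, 37-40 -> 10
fixed_blocks = average_by(lambda week: (week + 3) // 4, weekly_attendance)

def iso_month(week, year=2015):
    """Calendar month of the Monday of the given ISO week (year is assumed)."""
    return date.fromisocalendar(year, week, 1).month

calendar_months = average_by(iso_month, weekly_attendance)

print(fixed_blocks)
print(calendar_months)
```

With the fixed blocks, weeks 31 and 32 land in the same group; with real calendar months (for 2015) they do not, which is why the two results differ.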
I assume that if x units are produced in a given week, x/7 units are produced on each day of that week. The code below could be easily changed if this assumption were changed.
First construct a hash whose keys are months (1-12) and whose values are hashes whose keys are weeks and whose values are the numbers of days in the given week for the given month. (Whew!)
require 'date'
def months_to_weeks(year)
  day = Date.new(year)
  # a leap year has 366 days, any other year 365
  days = day.leap? ? 366 : 365
  days.times.with_object(Hash.new { |h,k| h[k] = Hash.new(0) }) do |_,h|
    h[day.month][day.cweek] += 1
    day = day.next
  end
end
The doc for Hash#new provides an explanation of the statement:
Hash.new { |h,k| h[k] = Hash.new(0) }
In brief, this creates an empty hash with a default given by the block. If h is the hash that is created, and h does not have a key k, h[k] will cause the block to be executed, which adds that key to the hash and sets its value to an empty hash with a default value of 0. The latter hash is often referred to as a "counting hash". I realize this is still rather a mouthful for a Ruby newbie.
Let's generate this hash for the current year:
year = 2015
mon_to_wks = months_to_weeks(year)
#=> {1 =>{1 =>4, 2 =>7, 3 =>7, 4 =>7, 5=>6},
#    2 =>{5 =>1, 6 =>7, 7 =>7, 8 =>7, 9=>6},
#    3 =>{9 =>1, 10=>7, 11=>7, 12=>7, 13=>7, 14=>2},
#    4 =>{14=>5, 15=>7, 16=>7, 17=>7, 18=>4},
#    5 =>{18=>3, 19=>7, 20=>7, 21=>7, 22=>7},
#    6 =>{23=>7, 24=>7, 25=>7, 26=>7, 27=>2},
#    7 =>{27=>5, 28=>7, 29=>7, 30=>7, 31=>5},
#    8 =>{31=>2, 32=>7, 33=>7, 34=>7, 35=>7, 36=>1},
#    9 =>{36=>6, 37=>7, 38=>7, 39=>7, 40=>3},
#    10=>{40=>4, 41=>7, 42=>7, 43=>7, 44=>6},
#    11=>{44=>1, 45=>7, 46=>7, 47=>7, 48=>7, 49=>1},
#    12=>{49=>6, 50=>7, 51=>7, 52=>7, 53=>4}}
Because of how Date#cweek is defined, the weeks in this hash begin on Mondays. In January, for example, there are 4 days in week 1. These four days, Jan. 1-4, 2015, would be the first Thursday, Friday, Saturday and Sunday of 2015. (Check your calendar.)
If the first day of each week is to be a day other than Monday (Sunday, for example) the hash calculation would have to be changed slightly.
This shows, for example, that in January of 2015, there are 4 days in week 1, 7 days in weeks 2, 3 and 4 and 6 days in week 5. The remaining day of week 5 is the first day in February.
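The nested "counting hash" idea can also be sketched in Python, where defaultdict plays the role of Hash.new with a block. This is only an illustration of the counting technique, not a port of the author's code; it walks every day of the year and uses the ISO week number from date.isocalendar().

```python
from collections import defaultdict
from datetime import date, timedelta

def months_to_weeks(year):
    """Map month -> {ISO week number -> days of that week falling in the month}."""
    # Missing months get an empty counter; missing weeks start at 0
    table = defaultdict(lambda: defaultdict(int))
    day = date(year, 1, 1)
    while day.year == year:
        table[day.month][day.isocalendar()[1]] += 1
        day += timedelta(days=1)
    return table

mon_to_wks = months_to_weeks(2015)
for month in sorted(mon_to_wks):
    print(month, dict(mon_to_wks[month]))
```

Looping until the year changes avoids having to hard-code the number of days in the year.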
Once this hash has been constructed, it is a simple matter to compute the averages for each month:
weekly_attendance = {31 => 40.00, 32 => 100.00, 33 => 34.00, 34 => 23.78,
                     35 => 56.79, 36 => 44.50, 37 => 67.00, 38 => 55.00 }
prod_by_mon = (1..12).each_with_object(Hash.new(0)) do |i,h|
  mon_to_wks[i].each do |week, days|
    h[i] += (days/7.0)*weekly_attendance[week] if weekly_attendance.key?(week)
  end
end
#=> {7=>28.571428571428573, 8=>232.3557142857143, 9=>160.14285714285714}
prod_by_mon.merge(prod_by_mon) { |_,v| v.round(2) }
#=> {7=>28.57, 8=>232.36, 9=>160.14}
This shows that production in month 7 was 28.57, and so on. Note that:
28.57 + 232.36 + 160.14 #=> 421.07
weekly_attendance.values.reduce(:+) #=> 421.07
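The same day-weighted split can be written without the intermediate month-to-week table, by giving each day one seventh of its week's value. A small Python sketch under the same assumption (x/7 per day); the function name monthly_totals is mine.

```python
from collections import defaultdict
from datetime import date, timedelta

weekly_attendance = {31: 40.0, 32: 100.0, 33: 34.0, 34: 23.78,
                     35: 56.79, 36: 44.5, 37: 67.0, 38: 55.0}

def monthly_totals(year, weekly):
    """Spread each weekly value evenly over its 7 days, then sum per month."""
    totals = defaultdict(float)
    day = date(year, 1, 1)
    while day.year == year:
        week = day.isocalendar()[1]      # ISO week, Monday-based
        if week in weekly:
            totals[day.month] += weekly[week] / 7.0
        day += timedelta(days=1)
    return dict(totals)

totals = monthly_totals(2015, weekly_attendance)
print({month: round(value, 2) for month, value in totals.items()})
print(round(sum(totals.values()), 2))
```

Because every week contributes exactly seven sevenths, the monthly totals add up to the sum of the weekly values, as long as all of the weeks fall inside the chosen year.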
Can anyone tell me how to get the result of a LINQ query that contains a group by into a DataTable?
var query= from d in dtable.AsEnumerable()
group d by d["Id"];
WId FirstName LastName Age
1 Jass we 23
1 Mady wer 54
3 Servy gr 22
4 Jan fr 11
Expected
WId FirstName LastName Age
1 Jass we 23
3 Servy gr 22
4 Jan fr 11
Thanks
Pradeep
If you just want to take the first person per ID-Group:
var distinctIdPersons = from p in dtable.AsEnumerable()
group p by p.Field<int>("WId") into IdGroups
select IdGroups.First();
or in method syntax:
distinctIdPersons = dtable.AsEnumerable().GroupBy(r => r.Field<int>("WId"))
.Select( g => g.First());
If you want to see the result (e.g. for testing purposes), you can use string.Join:
var output = string.Join(", ", distinctIdPersons.Select(r =>
r.Field<string>("FirstName") + " " + r.Field<string>("LastName")));
Console.WriteLine(output); // Jass we, Servy gr, Jan fr
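For comparison, "first row per group" looks like this in pandas. This is just a sketch of the same technique on the sample rows from the question, not a replacement for the DataTable code.

```python
import pandas as pd

# Sample rows from the question
people = pd.DataFrame({
    "WId": [1, 1, 3, 4],
    "FirstName": ["Jass", "Mady", "Servy", "Jan"],
    "LastName": ["we", "wer", "gr", "fr"],
    "Age": [23, 54, 22, 11],
})

# Keep only the first row seen for each WId, preserving the original order
first_per_id = people.drop_duplicates(subset="WId", keep="first")

names = ", ".join(first_per_id["FirstName"] + " " + first_per_id["LastName"])
print(names)
```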
I need to build a Linq query that will show the results as follows:
Data:
Sales Month
----------------------
10 January
20 February
30 March
40 April
50 May
60 June
70 July
80 August
90 September
100 October
110 November
120 December
I need to get the results based on this scenario:
month x = month x + previous month
that will result in:
Sales Month
--------------------
10 January
30 February (30 = February 20 + January 10)
60 March (60 = March 30 + February 30)
100 April (100 = April 40 + March 60)
.........
Any help on how to build this query?
Thanks a lot!
Since you wanted it in LINQ...
void Main()
{
    List<SaleCount> sales = new List<SaleCount>() {
        new SaleCount() { Sales = 10, Month = 1 },
        new SaleCount() { Sales = 20, Month = 2 },
        new SaleCount() { Sales = 30, Month = 3 },
        new SaleCount() { Sales = 40, Month = 4 },
        ...
    };
    var query = sales.Select((s, i) => new
    {
        CurrentMonth = s.Month,
        CurrentAndPreviousSales = s.Sales + sales.Take(i).Sum(sa => sa.Sales)
    });
}
public class SaleCount
{
    public int Sales { get; set; }
    public int Month { get; set; }
}
...but in my opinion, this is a case where coming up with some fancy LINQ isn't going to be as clear as just writing out the code that the LINQ query is going to generate. This also doesn't scale. For example, including multiple years worth of data gets even more hairy when it wouldn't have to if it was just written out the "old fashioned way".
If you don't want to add up all of the previous sales for each month, you will have to keep track of the total sales somehow. The Aggregate function works okay for this because we can build a list and use its last element as the current total for calculating the next element.
var sales = Enumerable.Range(1,12).Select(x => x * 10).ToList();
var sums = sales.Aggregate(new List<int>(), (list, sale) => list.Concat(new[] { list.LastOrDefault() + sale }).ToList());
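The running-total idea itself (carry the previous sum forward instead of re-summing everything for each month) is easy to see in a few lines of Python, where itertools.accumulate does exactly this. A sketch on the same twelve monthly values:

```python
from itertools import accumulate

# Monthly sales: 10, 20, ..., 120
sales = [month * 10 for month in range(1, 13)]

# Each element is the previous running total plus the current month's sales
running_totals = list(accumulate(sales))

print(running_totals[:4])
print(running_totals[-1])
```

This makes a single pass over the data, whereas summing sales.Take(i) for every i repeats work for each month.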
Would anyone know why the following code works correctly on Windows and not on Mac??
Today (24/11/2010) should return 47 not 48 as per MacOS
def fm_date = '24/11/2010'
import java.text.SimpleDateFormat
def lPad = {it ->
    st = '00' + it.toString()
    return st.substring(st.length()-2, st.length())
}
dfm = new SimpleDateFormat("dd/MM/yyyy")
cal=Calendar.getInstance()
cal.setTime( dfm.parse(fm_date) )
now = cal.get(Calendar.WEEK_OF_YEAR)
cal.add(Calendar.DAY_OF_MONTH,-7)
prev = cal.get(Calendar.WEEK_OF_YEAR)
cal.add(Calendar.DAY_OF_MONTH,14)
next = cal.get(Calendar.WEEK_OF_YEAR)
prev = 'diary' + lPad(prev) + '.shtml'
next = 'diary' + lPad(next) + '.shtml'
return 'diary' + lPad(now) + '.shtml'
I believe it's an ISO week number issue...
If I use this code adapted (and groovyfied) from yours:
import java.text.SimpleDateFormat
def fm_date = '24/11/2010'
Calendar.getInstance().with { cal ->
    // We want ISO Week numbers
    cal.firstDayOfWeek = MONDAY
    cal.minimalDaysInFirstWeek = 4
    setTime new SimpleDateFormat( 'dd/MM/yyyy' ).parse( fm_date )
    now = cal[ WEEK_OF_YEAR ]
}
"diary${"$now".padLeft( 2, '0' )}.shtml"
I get diary47.shtml returned
As the documentation for GregorianCalendar explains, if you want ISO week numbers:
Values calculated for the WEEK_OF_YEAR field range from 1 to 53. Week 1 for a year is the earliest seven day period starting on getFirstDayOfWeek() that contains at least getMinimalDaysInFirstWeek() days from that year. It thus depends on the values of getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (as needed).
For example, January 1, 1998 was a Thursday. If getFirstDayOfWeek() is MONDAY and getMinimalDaysInFirstWeek() is 4 (these are the values reflecting ISO 8601 and many national standards), then week 1 of 1998 starts on December 29, 1997, and ends on January 4, 1998. If, however, getFirstDayOfWeek() is SUNDAY, then week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998; the first three days of 1998 then are part of week 53 of 1997.
Edit
Even Groovier (from John's comment)
def fm_date = '24/11/2010'
Calendar.getInstance().with { cal ->
    // We want ISO Week numbers
    cal.firstDayOfWeek = MONDAY
    cal.minimalDaysInFirstWeek = 4
    cal.time = Date.parse( 'dd/MM/yyyy', fm_date )
    now = cal[ WEEK_OF_YEAR ]
}
"diary${"$now".padLeft( 2, '0' )}.shtml"
Edit2
Just ran this on Windows using VirtualBox, and got the same result
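To see how the two settings change the answer, here is a small Python re-implementation of the week-numbering rule quoted from the GregorianCalendar docs. It is only a sketch: the function names are mine, and the guess that the Mac was using Sunday as first day of week with a minimum of 1 day in the first week (a typical US-style default) is an assumption, not something confirmed in the question.

```python
from datetime import date, timedelta

def week1_start(year, first_dow, minimal_days):
    """Start of week 1 of `year`; first_dow uses Monday=0 ... Sunday=6."""
    jan1 = date(year, 1, 1)
    # Days between the start of the week containing Jan 1 and Jan 1 itself
    offset = (jan1.weekday() - first_dow) % 7
    start = jan1 - timedelta(days=offset)
    # That week only counts as week 1 if enough of its days are in `year`
    if 7 - offset < minimal_days:
        start += timedelta(days=7)
    return start

def week_of_year(d, first_dow, minimal_days):
    """Week number of d; late-December days may belong to next year's week 1."""
    for year in (d.year + 1, d.year, d.year - 1):
        start = week1_start(year, first_dow, minimal_days)
        if d >= start:
            return (d - start).days // 7 + 1

day = date(2010, 11, 24)
print(week_of_year(day, first_dow=0, minimal_days=4))   # ISO 8601 settings
print(week_of_year(day, first_dow=6, minimal_days=1))   # assumed US-style settings
```

With the ISO settings this agrees with date.isocalendar() for the date in question; with the assumed Sunday/1 settings the week number comes out one higher, which matches the discrepancy described in the question.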