How to pivot / unpivot a stream - apache-kafka-streams

Is there a way to PIVOT / UNPIVOT (explode, transpose) a stream with Kafka Streams?
If I have an input stream like
machineId  ts                   VarName  VarValue
m1         2017-10-01 00:00:00  var1     1.0
m1         2017-10-01 00:00:00  var2     2.0
m2         2017-10-01 00:00:00  var1     3.0
m2         2017-10-01 00:00:00  var3     4.0
m3         2017-10-01 00:00:00  var4     5.0
...
I want a way to get the output stream
machineId  ts                   Vars
m1         2017-10-01 00:00:00  [[var1, 1.0], [var2, 2.0]]
m2         2017-10-01 00:00:00  [[var1, 3.0], [var3, 4.0]]
m3         2017-10-01 00:00:00  [[var4, 5.0]]
...

You can use an aggregation with output type List. Something like
KStream<MachineId, V> inputStream = ...
KTable<MachineId, List<V>> result = inputStream.groupByKey()
.aggregate(...);
The Initializer returns an empty List<V> and the Aggregator would append values to the list.
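For illustration, here is a slightly fuller sketch. The Reading value type, the topic name, and the two Serde instances are assumptions for the example, not part of the question:
// Sketch only: Reading stands in for V; the topic name and Serdes are assumed.
KStream<MachineId, Reading> inputStream = builder.stream("machine-readings");

KTable<MachineId, List<Reading>> result = inputStream
    .groupByKey()
    .aggregate(
        () -> new ArrayList<Reading>(),             // Initializer: start each key with an empty list
        (machineId, reading, list) -> {             // Aggregator: append the incoming value
            list.add(reading);
            return list;
        },
        Materialized.with(machineIdSerde, readingListSerde)); // Serdes for the state store
If you need a record stream again downstream, result.toStream() converts the KTable back into a KStream.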
Check out the docs and examples for more details:
https://docs.confluent.io/current/streams/developer-guide.html#aggregating
https://github.com/confluentinc/kafka-streams-examples

Related

Can't get value from map[time.Time]Measure in Golang with if val, ok := mapMeasures[ts]; ok {}

I have a map that is defined like:
mapMeasures := make(map[time.Time]models.Measure, 0)
with
type Measure struct {
    Delta float64 // only one field here, to simplify
}
An initial loop fills values from 22/01/20 10:10:00 to 22/01/20 12:00:00, so it stores 12 key/value pairs (10-minute timestep).
Then it loops over those timestamps again and adds each delta to the existing value.
So, I need to check if there is already a key corresponding to my actual timestamp:
if val, ok := mapMeasures[ts]; ok { // ts already exists, we must sum delta values
    measure.Delta += val.Delta
}
But it appears this condition is never true.
I debugged the code, and I can see the timestamp is actually present inside map:
mapMeasures = {map[time.Time]gitlab.com/company/common/models.Measure}
0 = ->
key = {time.Time} 2020-01-22 11:40:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132460}
1 = ->
key = {time.Time} 2020-01-22 12:30:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132780}
2 = ->
key = {time.Time} 2020-01-22 12:50:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc0001328c0}
3 = ->
key = {time.Time} 2020-01-22 11:00:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132140}
4 = ->
key = {time.Time} 2020-01-22 11:10:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132280}
5 = ->
key = {time.Time} 2020-01-22 11:20:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132320}
6 = ->
key = {time.Time} 2020-01-22 11:30:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc0001323c0}
7 = ->
key = {time.Time} 2020-01-22 11:50:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132500}
8 = ->
key = {time.Time} 2020-01-22 12:00:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc0001325a0}
9 = ->
key = {time.Time} 2020-01-22 12:10:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132640}
10 = ->
key = {time.Time} 2020-01-22 12:20:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc0001326e0}
11 = ->
key = {time.Time} 2020-01-22 12:40:00 +0100
value = {*gitlab.com/company/common/models.Measure | 0xc000132820}
Actual ts:
{time.Time} 2020-01-22 11:00:00 +0100
Is there any issue with using a timestamp as a map key? Should I convert it to a string or an int?
Quoting from the time.Time documentation:
Note that the Go == operator compares not just the time instant but also the Location and the monotonic clock reading. Therefore, Time values should not be used as map or database keys without first guaranteeing that the identical Location has been set for all values, which can be achieved through use of the UTC or Local method, and that the monotonic clock reading has been stripped by setting t = t.Round(0). In general, prefer t.Equal(u) to t == u, since t.Equal uses the most accurate comparison available and correctly handles the case when only one of its arguments has a monotonic clock reading.
Do not use time.Time values as map keys; instead use the Unix timestamp returned by Time.Unix(). The Unix timestamp is "free" of both the location and the monotonic clock reading.
If your keys must also include the location (time zone), then use a struct which includes the Unix timestamp and the zone offset, e.g.:
type Key struct {
    ts     int64
    offset int
}
See related question: Why do 2 time structs with the same date and time return false when compared with ==?
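For illustration, a minimal runnable sketch of the Unix-timestamp approach; the Measure type mirrors the question, everything else is made up for the example:
package main

import (
	"fmt"
	"time"
)

// Measure mirrors the struct from the question.
type Measure struct {
	Delta float64
}

func main() {
	// Key on the Unix timestamp (int64) instead of time.Time.
	measures := make(map[int64]Measure)

	ts := time.Date(2020, 1, 22, 11, 0, 0, 0, time.Local)
	measures[ts.Unix()] = Measure{Delta: 1.5}

	// A lookup with an equal instant succeeds even if the Location or the
	// monotonic clock reading differ, because Unix() ignores both.
	if val, ok := measures[ts.UTC().Unix()]; ok {
		val.Delta += 2.5
		measures[ts.UTC().Unix()] = val
	}

	fmt.Println(measures) // e.g. map[1579687200:{4}] (exact key depends on the local time zone)
}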

What is this time format? (10 digits, 5 decimals)

So a website that I'm using has a websocket and they provide the broadcast time in the following manner:
"broadcasted_at":1574325570.71308
What is this time format and how do they generate it?
Unix epoch time: the number of seconds that have elapsed since the Unix epoch, that is, 00:00:00 UTC on 1 January 1970.
now : 1574327074 : Thu Nov 21 03:04:34 2019
start of day : 1574316000 : Thu Nov 21 00:00:00 2019
1574325570 : 1574325570 : Thu Nov 21 02:39:30 2019
convert online : https://www.epochconverter.com/
... or download and build the code for a command-line program that performs the conversion: https://github.com/darrenjs/c_dev_utils
I'm guessing the fractional part is the number of microseconds within the current second.
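Going the other way, a small Java sketch (the field name comes from the question; java.time.Instant is assumed to be imported) that turns such a value back into an instant:
// Sketch: treat the value as fractional seconds since the Unix epoch.
double broadcastedAt = 1574325570.71308;
long seconds = (long) broadcastedAt;
long nanos = Math.round((broadcastedAt - seconds) * 1_000_000_000L);
Instant instant = Instant.ofEpochSecond(seconds, nanos);
System.out.println(instant); // roughly 2019-11-21T08:39:30.71308Z, subject to double rounding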
… and how do they generate it?
I don’t know, of course, what language or libraries your website is using. So this is just an example. To generate a value like 1574325570.71308 in Java:
Instant now = Instant.now();
double epochSeconds = now.getEpochSecond()
        + (double) now.getNano() / (double) TimeUnit.SECONDS.toNanos(1);
String result = String.format(Locale.ROOT, "%f", epochSeconds);
System.out.println("result: " + result);
When I ran this snippet just now (2019-12-15T11:18:01.562699Z), the output was:
result: 1576408681.562699
If you always want exactly 5 decimals, another way is to use a DateTimeFormatter:
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
        .appendValue(ChronoField.INSTANT_SECONDS)
        .appendPattern(".SSSSS")
        .toFormatter();
String result = formatter.format(now);
result: 1576408681.56269

Using Featuretools to aggregate per time of day

I'm wondering if there's any way to calculate all the same variables I'm already getting from deep feature synthesis (i.e. counts, sums, means, etc.) for different time segments within a day.
E.g. a count of morning events (hours 0-12) as a separate variable from evening events (hours 13-24).
Also, in the same vein, what would be the easiest way to get counts by day of week, day of month, day of year, etc.? Custom aggregation primitives?
Yes, this is possible. First, let's generate some random data and then I'll walk through how:
import featuretools as ft
import pandas as pd
import numpy as np
# make some random data
n = 100
events_df = pd.DataFrame({
    "id": range(n),
    "customer_id": np.random.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
    "amount": np.random.rand(n) * 100
})
The first thing we want to do is add a new column for the segment we want to calculate features for:
def to_part_of_day(x):
    if x < 12:
        return "morning"
    elif x < 18:
        return "afternoon"
    else:
        return "evening"

events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
Now we have a dataframe like this:
id customer_id timestamp amount time_of_day
0 0 a 2019-01-01 00:00:00 44.713802 morning
1 1 c 2019-01-01 01:00:00 58.776476 morning
2 2 a 2019-01-01 02:00:00 94.671566 morning
3 3 a 2019-01-01 03:00:00 39.271852 morning
4 4 a 2019-01-01 04:00:00 40.773290 morning
5 5 c 2019-01-01 05:00:00 19.815855 morning
6 6 a 2019-01-01 06:00:00 62.457129 morning
7 7 b 2019-01-01 07:00:00 95.114636 morning
8 8 b 2019-01-01 08:00:00 37.824668 morning
9 9 a 2019-01-01 09:00:00 46.502904 morning
Next, let's load it into our entityset
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
es.plot()
Now, we are ready to set the segments we want to create aggregations for by using interesting_values
es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
Then we can run DFS and place the aggregation primitives we want to do on a per segment basis in the where_primitives parameter
fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])
fm
In the resulting feature matrix, you can now see we have aggregations per morning, afternoon, and evening
COUNT(events) MEAN(events.amount) SUM(events.amount) COUNT(events WHERE time_of_day = afternoon) COUNT(events WHERE time_of_day = evening) COUNT(events WHERE time_of_day = morning) MEAN(events.amount WHERE time_of_day = afternoon) MEAN(events.amount WHERE time_of_day = evening) MEAN(events.amount WHERE time_of_day = morning) SUM(events.amount WHERE time_of_day = afternoon) SUM(events.amount WHERE time_of_day = evening) SUM(events.amount WHERE time_of_day = morning)
customer_id
a 37 49.753630 1840.884300 12 7 18 35.098923 45.861881 61.036892 421.187073 321.033164 1098.664063
b 30 51.241484 1537.244522 3 10 17 45.140800 46.170996 55.300715 135.422399 461.709963 940.112160
c 33 39.563222 1305.586314 9 7 17 50.129136 34.593936 36.015679 451.162220 242.157549 612.266545
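For the second part of the question (counts by day of week, day of month, and so on), the same where_primitives mechanism can be reused. A sketch along the same lines, reusing the imports and events_df from above; the day_of_week column and the weekday list are illustrative additions, not part of the original answer:
# Sketch: add the segment column before building the entityset, exactly as
# was done for time_of_day above, then mark each weekday as an interesting value.
events_df["day_of_week"] = events_df["timestamp"].dt.day_name()

es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")

es["events"]["day_of_week"].interesting_values = ["Monday", "Tuesday", "Wednesday",
                                                  "Thursday", "Friday", "Saturday", "Sunday"]

fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count"],
                trans_primitives=[],
                where_primitives=["count"])
This yields features such as COUNT(events WHERE day_of_week = Monday); day of month or day of year can be handled the same way with their own columns.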

How can I retain NA's in timestamp when using dapply

I'm trying to convert many date character columns to timestamp format using dapply. However, rows that contain empty strings are being converted to the origin date "1970-01-01".
df <- data.frame(a = c("12/31/2016", "12/31/2016", "12/31/2016"),
                 b = c("01/01/2016", "01/01/2017", ""))
ddf <- as.DataFrame(df)
schema <- structType(
  structField("a", 'timestamp'),
  structField("b", 'timestamp'))
converted_dates <- dapply(ddf,
                          function(x){ as.data.frame(lapply(x, function(y) as.POSIXct(y, format = "%m/%d/%Y"))) },
                          schema)
head(converted_dates)
head(converted_dates)
a b
1 2016-12-31 2016-01-01
2 2016-12-31 2017-01-01
3 2016-12-31 1970-01-01
Whereas running the function from the dapply call on a plain R data.frame retains the NA result for the empty date value:
as.data.frame(lapply(df, function(y) as.POSIXct(y, format = "%m/%d/%Y")))
a b
1 2016-12-31 2016-01-01
2 2016-12-31 2017-01-01
3 2016-12-31 <NA>
Using Spark 2.0.1

Month's name in FreePascal

Input: Month name (January / February / ... / December)
Output: Season (Winter / Spring / Summer / Autumn)
Winter: Dec - Feb
Spring: Mar - May
Summer: Jun - Aug
Autumn: Sept - Nov
I have tried:
Program Months;
var
  Month: String;
begin
  writeln('Insert month name:');
  readln(Month);
  if Month = 'March' or Month = 'April' or Month = 'May' then
  begin
    writeln(Month, ' is Spring month');
  end
  ...
  etc
  ...
end.
But this program is not working.
Operator precedence - it's important. You need to write:
if (Month = 'March') or (Month = 'April') or (Month = 'May') then
This is because, in Pascal, or has a higher priority than =, so what is actually being evaluated is:
if ((Month = ('March' or Month)) = ('April' or Month)) = 'May' then
Which is obviously meaningless and will not compile (I might've made a mistake on the line above, but it's the general idea). Refer to the Free Pascal documentation on operator precedence to learn more.
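Putting it together, a minimal sketch of the full season mapping with properly parenthesized conditions (the program name and messages are illustrative):
Program Seasons;
var
  Month: String;
begin
  writeln('Insert month name:');
  readln(Month);
  if (Month = 'December') or (Month = 'January') or (Month = 'February') then
    writeln(Month, ' is a Winter month')
  else if (Month = 'March') or (Month = 'April') or (Month = 'May') then
    writeln(Month, ' is a Spring month')
  else if (Month = 'June') or (Month = 'July') or (Month = 'August') then
    writeln(Month, ' is a Summer month')
  else if (Month = 'September') or (Month = 'October') or (Month = 'November') then
    writeln(Month, ' is an Autumn month')
  else
    writeln('Unknown month name');
end.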
