How can I retain NAs in timestamps when using dapply - SparkR

I'm trying to convert many date character columns to timestamp format using dapply. However, rows containing empty strings are being converted to the origin date "1970-01-01".
df <- data.frame(a = c("12/31/2016", "12/31/2016", "12/31/2016"),
                 b = c("01/01/2016", "01/01/2017", ""))
ddf <- as.DataFrame(df)
schema <- structType(
  structField("a", 'timestamp'),
  structField("b", 'timestamp'))
converted_dates <- dapply(ddf,
                          function(x){ as.data.frame(lapply(x, function(y) as.POSIXct(y, format = "%m/%d/%Y"))) },
                          schema)
head(converted_dates)
           a          b
1 2016-12-31 2016-01-01
2 2016-12-31 2017-01-01
3 2016-12-31 1970-01-01
whereas running the same function directly on a plain R data.frame retains the NA result for the empty date value:
as.data.frame(lapply(df, function(y) as.POSIXct(y, format = "%m/%d/%Y")))
           a          b
1 2016-12-31 2016-01-01
2 2016-12-31 2017-01-01
3 2016-12-31       <NA>
Using Spark 2.0.1
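A possible workaround, sketched here but not verified on 2.0.1, is to skip dapply for the parsing step and let Spark do the conversion itself; unix_timestamp() yields NULL for strings it cannot parse, and NULL is collected into R as NA rather than the epoch:
# Sketch: parse with Spark's unix_timestamp instead of as.POSIXct inside dapply;
# unparseable values such as "" become NULL (collected as NA), not "1970-01-01".
# Note the format string follows Java's SimpleDateFormat, not R's strptime.
ddf$a <- cast(unix_timestamp(ddf$a, "MM/dd/yyyy"), "timestamp")
ddf$b <- cast(unix_timestamp(ddf$b, "MM/dd/yyyy"), "timestamp")
head(ddf)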

Related

HIVE - How do I convert Timestamp and sum values

I have a timestamp value like
2021-09-01T00:16:18.971228-03:00
and I need to separate the date and the hours. After that, I want to summarize the day values and the night values.
My problem is: how can I do this with this kind of format?
Maybe this helps you. I created a tibble just to show all the variables together.
x <- "2021-09-01T00:16:18.971228-03:00"
library(lubridate)
library(dplyr)
tibble(x) %>%
  mutate(
    dttm = ymd_hms(x),
    date = as_date(dttm),
    hour = hour(dttm)
  )
# A tibble: 1 x 4
  x                                dttm                date        hour
  <chr>                            <dttm>              <date>     <int>
1 2021-09-01T00:16:18.971228-03:00 2021-09-01 03:16:18 2021-09-01     3
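Note that ymd_hms() converts the instant to UTC, which is why 00:16-03:00 prints as 03:16; pass a tz argument if you want a local clock. From there, summarizing day versus night values is a grouped aggregation. Here is a sketch of that step, where the value column and the 06:00-18:00 "day" window are my own assumptions:
# Sketch: flag each row as day or night by hour, then sum per date and period.
tibble(x = c("2021-09-01T00:16:18.971228-03:00",
             "2021-09-01T13:05:00.000000-03:00"),
       value = c(5, 7)) %>%
  mutate(dttm = ymd_hms(x),
         period = if_else(hour(dttm) >= 6 & hour(dttm) < 18, "day", "night")) %>%
  group_by(date = as_date(dttm), period) %>%
  summarise(total = sum(value), .groups = "drop")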

Using Featuretools to aggregate per time of day

I'm wondering if there's any way to calculate all the same variables I'm already getting from deep feature synthesis (i.e. counts, sums, means, etc.) for different time segments within a day?
I.e. a count of morning events (hours 0-12) as a separate variable from evening events (hours 13-24).
Also, in the same vein, what would be the easiest way to get counts by day of week, day of month, day of year, etc.? Custom aggregation primitives?
Yes, this is possible. First, let's generate some random data, and then I'll walk through how to do it.
import featuretools as ft
import pandas as pd
import numpy as np

# make some random data
n = 100
events_df = pd.DataFrame({
    "id" : range(n),
    "customer_id": np.random.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
    "amount": np.random.rand(n) * 100
})
The first thing we want to do is add a new column for the segment we want to calculate features for:
def to_part_of_day(x):
    if x < 12:
        return "morning"
    elif x < 18:
        return "afternoon"
    else:
        return "evening"

events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
Now we have a dataframe like this:
   id customer_id           timestamp     amount time_of_day
0   0           a 2019-01-01 00:00:00  44.713802     morning
1   1           c 2019-01-01 01:00:00  58.776476     morning
2   2           a 2019-01-01 02:00:00  94.671566     morning
3   3           a 2019-01-01 03:00:00  39.271852     morning
4   4           a 2019-01-01 04:00:00  40.773290     morning
5   5           c 2019-01-01 05:00:00  19.815855     morning
6   6           a 2019-01-01 06:00:00  62.457129     morning
7   7           b 2019-01-01 07:00:00  95.114636     morning
8   8           b 2019-01-01 08:00:00  37.824668     morning
9   9           a 2019-01-01 09:00:00  46.502904     morning
Next, let's load it into our entityset:
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
es.plot()
Now, we are ready to set the segments we want to create aggregations for by using interesting_values
es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
Then we can run DFS and place the aggregation primitives we want computed on a per-segment basis in the where_primitives parameter:
fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])
fm
In the resulting feature matrix, you can now see we have aggregations per morning, afternoon, and evening
COUNT(events) MEAN(events.amount) SUM(events.amount) COUNT(events WHERE time_of_day = afternoon) COUNT(events WHERE time_of_day = evening) COUNT(events WHERE time_of_day = morning) MEAN(events.amount WHERE time_of_day = afternoon) MEAN(events.amount WHERE time_of_day = evening) MEAN(events.amount WHERE time_of_day = morning) SUM(events.amount WHERE time_of_day = afternoon) SUM(events.amount WHERE time_of_day = evening) SUM(events.amount WHERE time_of_day = morning)
customer_id
a 37 49.753630 1840.884300 12 7 18 35.098923 45.861881 61.036892 421.187073 321.033164 1098.664063
b 30 51.241484 1537.244522 3 10 17 45.140800 46.170996 55.300715 135.422399 461.709963 940.112160
c 33 39.563222 1305.586314 9 7 17 50.129136 34.593936 36.015679 451.162220 242.157549 612.266545
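For the day-of-week (or day-of-month, day-of-year) part of the question, the same interesting_values pattern works on any categorical segment column, so custom aggregation primitives aren't strictly needed. A sketch under that assumption, rebuilding the entityset so the new weekday column (my own naming) is picked up:
# Sketch: segment by day of week via the same interesting_values mechanism.
events_df["weekday"] = events_df["timestamp"].dt.day_name()
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
es["events"]["weekday"].interesting_values = list(events_df["weekday"].unique())
fm2, fl2 = ft.dfs(target_entity="customers",
                  entityset=es,
                  agg_primitives=["count"],
                  trans_primitives=[],
                  where_primitives=["count"])
This should yield features such as COUNT(events WHERE weekday = Monday) per customer.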

Sorting data with gnuplot

Sometimes it might be required to sort data. Unfortunately, gnuplot (as far as I know) doesn't offer this possibility. Of course, you can use external tools like awk, Perl, Python, etc. However, for maximum platform independence, to avoid installing additional programs and the complications that come with them, and also out of curiosity, I was interested in whether gnuplot can nevertheless sort somehow.
I will be grateful for comments on improvements, limitations.
Does anybody have ideas how to sort alphanumerical data with gnuplot only?
### Sorting with gnuplot
reset session
# generate some random example data
N = 10
set samples N
RandomNo(n) = sprintf("%.02f",rand(0)*n)
set table $Data
plot '+' u (RandomNo(10)):(RandomNo(10)):(RandomNo(10)) w table
unset table
print $Data
# Settings for sorting
ColNo = 2 # ColumnNo for sorting
stats $Data nooutput # get the number of rows if data is from file
RowCount = STATS_records # with the example data above, of course RowCount=N
# create the sortkey and put it into an array
array SortKey[RowCount]
set table $Dummy
plot $Data u (SortKey[$0+1] = sprintf("%.06f%02d",column(ColNo),$0+1)) w table
unset table
# print $Dummy
# get lines as whole into array
set datafile separator "\n"
array DataSeq[RowCount]
set table $Dummy2
plot $Data u (SortKey[$0+1]):(DataSeq[$0+1] = stringcolumn(1)) with table
unset table
print $Dummy2
set datafile separator whitespace
# do the actual sorting with 'smooth unique'
set table $Dummy3
plot $Dummy2 u 1:0 smooth unique
unset table
# print $Dummy3
# extract the sorted sortkeys
set table $Dummy4
plot $Dummy3 u (SortKey[$0+1]=$2) with table
unset table
# print $Dummy4
# create the table with sorted lines
set table $DataSorted
plot $Data u (DataSeq[SortKey[$0+1]+1]) with table
unset table
print $DataSorted
### end of code
The first datablock holds the unsorted data, the second the intermediate data with sort keys, and the third the data sorted by the second column.
Output:
5.24 6.68 3.09
1.64 1.27 9.82
6.44 9.23 7.03
8.14 8.87 3.82
4.27 5.98 0.93
7.96 3.64 6.15
6.21 6.28 6.17
1.52 3.17 3.58
4.24 2.16 8.99
8.73 6.54 1.13
6.68000001 5.24 6.68 3.09
1.27000002 1.64 1.27 9.82
9.23000003 6.44 9.23 7.03
8.87000004 8.14 8.87 3.82
5.98000005 4.27 5.98 0.93
3.64000006 7.96 3.64 6.15
6.28000007 6.21 6.28 6.17
3.17000008 1.52 3.17 3.58
2.16000009 4.24 2.16 8.99
6.54000010 8.73 6.54 1.13
1.64 1.27 9.82
4.24 2.16 8.99
1.52 3.17 3.58
7.96 3.64 6.15
4.27 5.98 0.93
6.21 6.28 6.17
8.73 6.54 1.13
5.24 6.68 3.09
8.14 8.87 3.82
6.44 9.23 7.03
Out of curiosity, I wanted to know whether an alphanumerical sort could be implemented with gnuplot code only.
This avoids the need for external tools and ensures maximum platform compatibility.
I am not yet aware of an external tool that could assist gnuplot and works under Windows, Linux, and MacOS alike.
I am happy to take comments and suggestions about bugs, simplifications, improvements, performance comparisons, and limits.
For an alphanumerical sort, the first building block is alphanumerical string comparison, which to my knowledge does not exist in gnuplot directly. So the first part, Compare.plt, implements string comparison.
### compare function for strings
# Compare.plt
# function cmp(a,b,cs) returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
reset session
ASCII = ' !"' . "#$%&'()*+,-./0123456789:;<=>?@".\
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\`".\
"abcdefghijklmnopqrstuvwxyz{|}~"
ord(c) = strstrt(ASCII,c)>0 ? strstrt(ASCII,c)+31 : 0
# comparing char: case-sensitive
cmpcharcs(c1,c2) = sgn(ord(c1)-ord(c2))
# comparing char: case-insensitive
cmpcharci(c1,c2) = sgn(( cmpcharci_o1=ord(c1), ((cmpcharci_o1>96) && (cmpcharci_o1<123)) ?\
cmpcharci_o1-32 : cmpcharci_o1) - \
( cmpcharci_o2=ord(c2), ((cmpcharci_o2>96) && (cmpcharci_o2<123)) ?\
cmpcharci_o2-32 : cmpcharci_o2) )
# function cmp returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
cmp(a,b,cs) = ((cmp_r=0, cmp_flag=0, cmp_maxlen=strlen(a)>strlen(b) ? strlen(a) : strlen(b)),\
(sum[cmp_i=1:cmp_maxlen] \
((cmp_flag==0 && (cmp_c1 = substr(a,cmp_i,cmp_i), cmp_c2 = substr(b,cmp_i,cmp_i), \
(cmp_r = (cs==0 ? cmpcharci(cmp_c1,cmp_c2) : cmpcharcs(cmp_c1,cmp_c2) ) )!=0 ? \
(cmp_flag=1, cmp_r) : 0)), 1 )), cmp_r)
cmpsymb(a,b,cs) = (cmpsymb_r = cmp(a,b,cs))<0 ? "<" : cmpsymb_r>0 ? ">" : "="
### end of code
Example:
### example compare strings
load "Compare.plt"
a="Alligator"
b="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Zebra"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
### end of code
Result:
-1: Alligator < Tiger
0: Tiger = Tiger
1: Zebra > Tiger
The second part makes use of the comparison for sorting.
### alpha-numerical sort with gnuplot
reset session
load "Compare.plt"
$Data <<EOD
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
EOD
stats $Data u 0 nooutput
RowCount = STATS_records
ColSort = 3
array Key[RowCount]
array Index[RowCount]
set table $Dummy
plot $Data u (Key[$0+1]=stringcolumn(ColSort),Index[$0+1]=$0+1) w table
unset table
# Bubblesort
do for [n=RowCount:2:-1] {
do for [i=1:n-1] {
if ( cmp(Key[i],Key[i+1],0) > 0) {
tmp=Key[i]; Key[i]=Key[i+1]; Key[i+1]=tmp
tmp2=Index[i]; Index[i]=Index[i+1]; Index[i+1]=tmp2
}
}
}
set datafile separator "\n"
set table $Dummy # and reuse Key-array
plot $Data u (Key[$0+1]=stringcolumn(1)) with table
unset table
set datafile separator whitespace
set table $DataSorted
plot $Data u (Key[Index[$0+1]]) with table
unset table
print $DataSorted
set grid xtics,ytics
plot [-0.5:RowCount-0.5][0:1.1] $DataSorted u 0:2:xtic(3) w lp lt 7 lc rgb "red"
### end of code
Input:
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
Output:
2 0.456 Apple
5 0.654 Banana
7 0.111 Lemon
1 0.123 Orange
3 0.789 Peach
4 0.987 Pineapple
6 0.321 Raspberry
The final plot command draws the output graph of the sorted data (image not reproduced here).

How can I extract parts of one column and append them to other columns?

I have a large .csv file that I need to extract information from and add this information to another column. My csv looks something like this:
file_name,#,Date,Time,Temp (°C) ,Intensity
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,
I want to create two new columns that contain data extracted from the "file_name" column. I want to extract the one or two digits that follow the text "trap" and the c or u that comes after them, and create new columns with these values. The data should look something like this after processing:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
I suspect the way to do this is with awk and a regular expression, but I'm not sure how to implement the regular expression. How can I extract parts of one column and append them to other columns?
Using sed you can do this:
sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
gawk approach:
awk -F, 'NR==1{ print $0,"can_und,trap_no" }
NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file
The output:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
NR==1{ print $0,"can_und,trap_no" } - print the header line
match($1,/^trap([0-9]+)([a-z])/,a) - matches the number following the word trap and the suffix letter right after it, storing the captured groups in array a
Using sed, this would be:
sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file
Use sed -i ... to modify the file in place.
Using the Python pandas reader, because Python is awesome for numerical analysis:
First, I had to modify the data header row by appending 3 commas so that the column counts were consistent:
file_name,#,Date,Time,Temp (°C) ,Intensity,,,
There is probably a way to tell pandas to ignore the column differences, but I am still a noob.
Python code to read your data into columns and create 2 new columns named 'cu_int' and 'cu_char' which contain the parsed elements of the filenames:
import pandas

def main():
    df = pandas.read_csv("file.csv")
    df['cu_int'] = 0     # Add the new columns to the data frame.
    df['cu_char'] = ' '
    for index, df_row in df.iterrows():
        file_name = df['file_name'][index].strip()
        trap_string = file_name.split("_")[0]       # Get the file_name string prior to the underscore.
        numeric_offset_beg = len("trap")            # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1   # Leave off the 'c' or 'u' char.
        numeric_value = trap_string[numeric_offset_beg : numeric_offset_end]
        cu_value = trap_string[len(trap_string) - 1]
        df.loc[index, 'cu_int'] = int(numeric_value)   # Assign per row, not the whole column.
        df.loc[index, 'cu_char'] = cu_value
    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print(df)

if __name__ == "__main__":
    main()
The printed output (note there are inconsistencies in the data set posted - see row 1 as an example):
$ python read_csv.py
file_name # Date Time Temp (°C) Intensity Unnamed: 6 Unnamed: 7 Unnamed: 8 cu_int cu_char
0 trap12u_10733862_150809.txt 1 05/28/15 06:00:00.0 20.424 215.3 NaN NaN NaN 12 u
1 trap12u_10733862_150809.txt 2 05/28/15 07:00:00.0 21.091 1.0 130.2 NaN NaN 12 u
2 trap12u_10733862_150809.txt 3 05/28/15 08:00:00.0 26.195 3.0 100.0 NaN NaN 12 u
3 trap11u_10733862_150809.txt 4 05/28/15 09:00:00.0 25.222 3.0 444.5 NaN NaN 11 u
4 trap11u_10733862_150809.txt 5 05/28/15 10:00:00.0 26.195 3.0 100.0 NaN NaN 11 u
5 trap11u_10733862_150809.txt 6 05/28/15 11:00:00.0 25.902 2.0 927.8 NaN NaN 11 u
6 trap11u_10733862_150809.txt 7 05/28/15 12:00:00.0 25.708 2.0 325.0 NaN NaN 11 u
7 trap12c_10733862_150809.txt 8 05/28/15 13:00:00.0 26.292 3.0 100.0 NaN NaN 12 c
8 trap12c_10733862_150809.txt 9 05/28/15 14:00:00.0 26.390 2.0 66.7 NaN NaN 12 c
9 trap12c_10733862_150809.txt 10 05/28/15 15:00:00.0 26.097 1.0 463.9 NaN NaN 12 c
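As a follow-up on the pandas approach: the row loop can be replaced with pandas' vectorized string methods. A sketch, where the two regex capture groups map onto the two new columns:
import pandas as pd

df = pd.read_csv("file.csv")
# Each capture group becomes one column of `extracted`: the digits, then the letter.
extracted = df["file_name"].str.extract(r"^trap(\d+)([a-z])", expand=True)
df["cu_int"] = extracted[0].astype(int)
df["cu_char"] = extracted[1]
print(df)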

SparkR - Retaining the previous value in another column

I have a Spark DataFrame that looks like this:
  id      dates value
1 11 2013-11-15    10
2 11 2013-11-16    15
3 22 2013-11-15    20
4 22 2013-11-16    21
5 22 2013-11-17     3
I wish to retain the value from the previous date per id.
The final result should look like this:
  id      dates value prev_value
1 11 2013-11-15    10         NA
2 11 2013-11-16    15         10
3 22 2013-11-15    20         NA
4 22 2013-11-16    21         20
5 22 2013-11-17     3         21
The solution from this question would not work for various reasons.
I would appreciate the help!
So after playing with it for a while, here's the workaround that I found:
First of all, here's the example DF:
id <- c(11, 11, 22, 22, 22)
dates <- as.Date(c('2013-11-15', '2013-11-16', '2013-11-15', '2013-11-16', '2013-11-17'), "%Y-%m-%d")
value <- c(10, 15, 20, 21, 3)
example <- as.DataFrame(data.frame(id = id, dates = dates, value = value))
I copy the example DF, add 1 day to the original dates, and rename the columns:
example_p <- example
example_p$dates <- date_add(example_p$dates, 1)
colnames(example_p) <- c("id", "dates", "prev_value")
Finally, I merge the new DF onto the original one:
result <- select(merge(example, example_p,
                       by = intersect(names(example), names(example_p)),
                       all.x = T),
                 c("id_x", "dates_x", "value", "prev_value"))
showDF(result)
+----+----------+-----+----------+
|id_x| dates_x|value|prev_value|
+----+----------+-----+----------+
|22.0|2013-11-15| 20.0| null|
|11.0|2013-11-15| 10.0| null|
|11.0|2013-11-16| 15.0| 10.0|
|22.0|2013-11-16| 21.0| 20.0|
|22.0|2013-11-17| 3.0| 21.0|
+----+----------+-----+----------+
Obviously, this is somewhat clumsy, and I will be happy to give the points to anyone who can suggest a solution that works faster than this.
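One candidate, sketched below but not benchmarked against the merge: SparkR's window functions (available since Spark 2.0) express this directly with lag() over a window partitioned by id and ordered by dates, avoiding the self-join entirely:
# Sketch: previous value per id via a window function instead of a merge.
ws <- orderBy(windowPartitionBy("id"), "dates")
result <- withColumn(example, "prev_value", over(lag(example$value, 1), ws))
showDF(result)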
