SparkR - Retaining the previous value in another column

I have a spark dataFrame that looks like this:
  id      dates value
1 11 2013-11-15    10
2 11 2013-11-16    15
3 22 2013-11-15    20
4 22 2013-11-16    21
5 22 2013-11-17     3
I wish to retain the value from the previous date per id.
The final result should look like this:
  id      dates value prev_value
1 11 2013-11-15    10         NA
2 11 2013-11-16    15         10
3 22 2013-11-15    20         NA
4 22 2013-11-16    21         20
5 22 2013-11-17     3         21
The solution from this question would not work for various reasons.
I would appreciate the help!

So after playing with it for a while, here's the workaround that I found:
First of all, here's the example DF
id<-c(11,11,22,22,22)
dates<-as.Date(c('2013-11-15','2013-11-16','2013-11-15','2013-11-16','2013-11-17'), "%Y-%m-%d")
value <- c(10,15,20,21,3)
example<-as.DataFrame(data.frame(id=id,dates=dates, value))
I copy the example DF, add 1 day to the original dates, and then rename the value column to prev_value:
example_p <- example
example_p$dates <- date_add(example_p$dates, 1)
colnames(example_p) <- c("id", "dates", "prev_value")
Finally, I merge the new DF to the original one
result <- select(merge(example, example_p,
                       by = intersect(names(example), names(example_p)),
                       all.x = TRUE),
                 c("id_x", "dates_x", "value", "prev_value"))
showDF(result)
+----+----------+-----+----------+
|id_x| dates_x|value|prev_value|
+----+----------+-----+----------+
|22.0|2013-11-15| 20.0| null|
|11.0|2013-11-15| 10.0| null|
|11.0|2013-11-16| 15.0| 10.0|
|22.0|2013-11-16| 21.0| 20.0|
|22.0|2013-11-17| 3.0| 21.0|
+----+----------+-----+----------+
Obviously, this is somewhat clumsy, and I will be happy to give the points to anyone who can suggest a solution that runs faster than this.
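For what it's worth, if your Spark version exposes SparkR window functions (roughly Spark 2.0 and later, which is an assumption here), a lag() over a window partitioned by id should avoid the self-join entirely. A minimal sketch using the example DF from above:
library(SparkR)
# window over each id, ordered by date
ws <- orderBy(windowPartitionBy("id"), "dates")
# previous value within each id's window; null/NA for the first date of each id
example$prev_value <- over(lag(example$value, 1), ws)
showDF(example)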

Related

Why is my DXF file not working in AutoCAD, giving me "ID 11 incorrect: already used"?

I have generated a DXF file, but when I open it with AutoCAD, it crashes AutoCAD and gives the message "ID 11 incorrect: already used".
The DXF content: https://github.com/tarikjabiri/dxf/blob/dev/examples/latest.dxf
I can't spot the problem; I have been trying to solve it for 3 days.
I think something is wrong with the APPID, because it holds the ID 11 (the handle, in DXF terms).
I have a working DXF for comparison: https://github.com/tarikjabiri/dxf/blob/dev/examples/Minimal_DXF_AC1021.dxf
Thanks in advance.
There are two minor issues:
DIMSTYLE table
0
TABLE
2
DIMSTYLE
105 <<< the handle group code of the table "head" should be 5, as usual
8
100
AcDbSymbolTable
100
AcDbDimStyleTable
70
1
0
DIMSTYLE
5 <<< the handle group code of a DIMSTYLE table entry should be 105
12
330
8
100
AcDbSymbolTableRecord
100
AcDbDimStyleTableRecord
2
STANDARD
70
0
40
1
BLOCK_RECORD table entries for *MODEL_SPACE and *PAPER_SPACE
0
TABLE
2
BLOCK_RECORD
5
9
330
0
100
AcDbSymbolTable
70
2
0
BLOCK_RECORD
5
14
330
9
100
AcDbSymbolTableRecord
100
AcDbRegAppTableRecord <<< the subclass marker string should be "AcDbBlockTableRecord"
2
*MODEL_SPACE
70
0
70
0
280
After these changes the file opens in Autodesk DWG TrueView 2022.

How to avoid row names in further analysis in R?

I'm just running the following example from the GGEBiplotGUI package and, of course, it works properly.
library(GGEBiplotGUI)
data("Ontario")
Ontario
GGEBiplot(Data = Ontario)
But it does not work properly when I download the "Ontario" data and run the script above on my PC. See the example below.
Ontario <- read.csv("Book.csv")
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario)
The result is the following table (columns 0 to 10), which takes the row numbers (1 to 17) as genotypes and "X" as another location.
See the result below, please.
X BH93 EA93 HW93 ID93 KE93 NN93 OA93 RN93 WP93
1 ann 4.460 4.150 2.849 3.084 5.940 4.450 4.351 4.039 2.672
2 ari 4.417 4.771 2.912 3.506 5.699 5.152 4.956 4.386 2.938
3 aug 4.669 4.578 3.098 3.460 6.070 5.025 4.730 3.900 2.621
4 cas 4.732 4.745 3.375 3.904 6.224 5.340 4.226 4.893 3.451
5 del 4.390 4.603 3.511 3.848 5.773 5.421 5.147 4.098 2.832
6 dia 5.178 4.475 2.990 3.774 6.583 5.045 3.985 4.271 2.776
7 ena 3.375 4.175 2.741 3.157 5.342 4.267 4.162 4.063 2.032
8 fun 4.852 4.664 4.425 3.952 5.536 5.832 4.168 5.060 3.574
9 ham 5.038 4.741 3.508 3.437 5.960 4.859 4.977 4.514 2.859
10 har 5.195 4.662 3.596 3.759 5.937 5.345 3.895 4.450 3.300
11 kar 4.293 4.530 2.760 3.422 6.142 5.250 4.856 4.137 3.149
12 kat 3.151 3.040 2.388 2.350 4.229 4.257 3.384 4.071 2.103
13 luc 4.104 3.878 2.302 3.718 4.555 5.149 2.596 4.956 2.886
14 m12 3.340 3.854 2.419 2.783 4.629 5.090 3.281 3.918 2.561
15 reb 4.375 4.701 3.655 3.592 6.189 5.141 3.933 4.208 2.925
16 ron 4.940 4.698 2.950 3.898 6.063 5.326 4.302 4.299 3.031
17 rub 3.786 4.969 3.379 3.353 4.774 5.304 4.322 4.858 3.382
How can I fix this problem, i.e., keep the row names and the "X" column from being treated as variables in the GGEBiplotGUI analysis?
I have also tried the following, and it didn't work:
attributes(Ontario)$row.names <- NULL
print(Ontario, row.names = F)
row.names(Ontario) <- NULL
Ontario[, -1] ## deletes the first column ("X"), not the 0 (row-number) one
Many thanks in advance!
This code worked properly.
Ontario <- read.csv("Libro.csv")
rownames(Ontario) <- Ontario$X  # use the genotype names as row names
Ontario1 <- Ontario[, -1]       # drop the X column from the data
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario1)
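As a side note (assuming the genotype names really are in the first column of the CSV, as in the printout above), read.csv can do the same in one step through its row.names argument:
library(GGEBiplotGUI)
# read the first column directly as row names instead of as a data column
Ontario <- read.csv("Libro.csv", row.names = 1)
GGEBiplot(Data = Ontario)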

Removing bad data from a data file using Pig

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
Now, using a Pig script, I want to remove the bad data, i.e., drop the rows that contain characters or have empty fields.
I tried this:
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines:
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on all whitespace:
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int, temp:int, quality:chararray);
Filter by valid strings:
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) cast to an int:
D = FOREACH C GENERATE year, temp, (int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))

Elasticsearch count overlapping timeranges in date histogram

I have events stored in Elasticsearch 6.6 that have a start and end time e.g.:
{
  "startTime": "2019-01-11T14:49:16.719Z",
  "endTime": "2019-01-11T16:31:56.483Z"
}
I want to display a date histogram which shows the number of overlapping events in each hour.
Example:
Hour of Day:  12   13   14   15   16   17   18   19
Events:            <=====E1======>     <===E2===>
                             <=====E3======>
                             <===E4===>
Result:        0    1    1    3    2    2    1    0
Is there a way to do this with an elasticsearch aggregation or do I have to implement it in the application?

Using Featuretools to aggregate per time of day

I'm wondering if there's any way to calculate all the same variables I'm already calculating with deep feature synthesis (i.e. counts, sums, means, etc.), but for different time segments within a day.
For example, the count of morning events (hours 0-12) as a separate variable from evening events (hours 13-24).
Also, in the same vein, what would be the easiest way to get counts by day of week, day of month, day of year, etc.? Custom aggregation primitives?
Yes, this is possible. First, let's generate some random data and then I'll walk through how to do it.
import featuretools as ft
import pandas as pd
import numpy as np
# make some random data
n = 100
events_df = pd.DataFrame({
"id" : range(n),
"customer_id": np.random.choice(["a", "b", "c"], n),
"timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
"amount": np.random.rand(n) * 100
})
the first thing we want to do is add a new column for the segment we want to calculate features for
def to_part_of_day(x):
if x < 12:
return "morning"
elif x < 18:
return "afternoon"
else:
return "evening"
events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
now we have a dataframe like this
id customer_id timestamp amount time_of_day
0 0 a 2019-01-01 00:00:00 44.713802 morning
1 1 c 2019-01-01 01:00:00 58.776476 morning
2 2 a 2019-01-01 02:00:00 94.671566 morning
3 3 a 2019-01-01 03:00:00 39.271852 morning
4 4 a 2019-01-01 04:00:00 40.773290 morning
5 5 c 2019-01-01 05:00:00 19.815855 morning
6 6 a 2019-01-01 06:00:00 62.457129 morning
7 7 b 2019-01-01 07:00:00 95.114636 morning
8 8 b 2019-01-01 08:00:00 37.824668 morning
9 9 a 2019-01-01 09:00:00 46.502904 morning
Next, let's load it into our entityset
es = ft.EntitySet()
es.entity_from_dataframe(entity_id="events",
                         time_index="timestamp",
                         dataframe=events_df)
es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
es.plot()
Now, we are ready to set the segments we want to create aggregations for by using interesting_values
es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
Then we can run DFS and place the aggregation primitives we want to do on a per segment basis in the where_primitives parameter
fm, fl = ft.dfs(target_entity="customers",
                entityset=es,
                agg_primitives=["count", "mean", "sum"],
                trans_primitives=[],
                where_primitives=["count", "mean", "sum"])
fm
In the resulting feature matrix, you can now see we have aggregations per morning, afternoon, and evening
COUNT(events) MEAN(events.amount) SUM(events.amount) COUNT(events WHERE time_of_day = afternoon) COUNT(events WHERE time_of_day = evening) COUNT(events WHERE time_of_day = morning) MEAN(events.amount WHERE time_of_day = afternoon) MEAN(events.amount WHERE time_of_day = evening) MEAN(events.amount WHERE time_of_day = morning) SUM(events.amount WHERE time_of_day = afternoon) SUM(events.amount WHERE time_of_day = evening) SUM(events.amount WHERE time_of_day = morning)
customer_id
a 37 49.753630 1840.884300 12 7 18 35.098923 45.861881 61.036892 421.187073 321.033164 1098.664063
b 30 51.241484 1537.244522 3 10 17 45.140800 46.170996 55.300715 135.422399 461.709963 940.112160
c 33 39.563222 1305.586314 9 7 17 50.129136 34.593936 36.015679 451.162220 242.157549 612.266545
