How to calculate a delta array from a cumulated array in a performant way using JSONata?

I have a JSON structure consisting of an array of 253 location objects. Each location object has an array of 67 elements reporting a cumulated total over time.
I would like to extend each location object with a delta array that reports the increase at each step, derived from the cumulated total array.
The following JSONata query does this correctly:
(
  covid ~> | locations | {
    "new_total": $map(cumulated_total, function($v, $i, $a) { $a[$i] - ($i = 0 ? 0 : $a[$i-1]) })
  } |;
)
The problem with this query is that it takes about 66 seconds to execute on my Intel NUC.
Here is a JSONata Exerciser link containing an example of the JSON input structure together with the query I have used:
https://try.jsonata.org/BGqIUkWe7
Note that when I open that link it reports "Expression evaluation timeout: Check for infinite loop", because the query also takes too long to execute in the JSONata Exerciser.

Not sure if it makes much difference, but you could try pulling the function definition up to the top level so that it only gets done once (since you're not relying on any lexical context in the function body):
(
  $fn := function($v, $i, $a) { $v - ($i = 0 ? 0 : $a[$i-1]) };
  covid ~> | locations | {
    "new_total": $map(cumulated_total, $fn)
  } |;
)

The following query is more performant (47 s total time), which is 19 s faster than the query in the original post (66 s total time).
(
  covid ~> | locations | {
    "new_total": $map(cumulated_total, function($v, $i, $a) { $v - ($i = 0 ? 0 : $a[$i-1]) })
  } |;
)
As you can see, the only change is replacing $a[$i] with $v in the function body.
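Combining the two changes above (the hoisted function definition from the answer and the use of $v instead of $a[$i]) gives the following sketch, which simply applies both tweaks at once:

(
  /* defined once at the top level; uses $v rather than re-indexing $a[$i] */
  $fn := function($v, $i, $a) { $v - ($i = 0 ? 0 : $a[$i-1]) };
  covid ~> | locations | {
    "new_total": $map(cumulated_total, $fn)
  } |;
)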

Related

Extract multiple values from unformatted text

My question is how to extract the values following Count:, Temp:, Total: and Used: from this multi-line text in Go.
Welcome, user [User CP] [Count: 1,014,747.1] [some] [Ohter: 0]
Temp: 14.231 Total: 10.0 TB Used: 964.57 GB On line: 2 0 Traffic Count: 1995
10 (0 New) 0
so that I can get the values 1,014,747.1, 14.231, 10.0 TB and 964.57 GB and assign them to a Go struct like
struct {
Count float64
Temp float64
Total string
Used string
}
I've tried regexp, but that way I need to write four regular expressions and run them four times over the same text to extract the values one by one:
var count = regexp.MustCompile(`(?m)(Count:\s*(\d+([\,]\d+)*([\.]\d+)))`)
var temp = regexp.MustCompile(`(?m)(Temp:\s*(\d+[\.]?\d*))`)
var total = regexp.MustCompile(`(?m)(Total:\s*(\d+\.?\d*\s\w\w))`)
var used = regexp.MustCompile(`(?m)(Used:\s*(\d+\.?\d*\s\w\w))`)
// run these regexp to get values
I've also tried using a single regexp, but the match result contains a varying number of empty elements, so I can't get the values by a fixed index.
package main

import (
    "fmt"
    "regexp"
)

func main() {
    var re = regexp.MustCompile(`(?m)(Count:\s*(\d+([\,]\d+)*([\.]\d+)))|(Temp:\s*(\d+[\.]?\d*))|(Total:\s*(\d+\.?\d*\s\w\w))|(Used:\s*(\d+\.?\d*\s\w\w))`)
    var str = `Welcome, user [User CP] [Count: 1,014,747.1] [some] [Ohter: 0]
Temp: 14.231 Total: 10.0 TB Used: 964.57 GB On line: 2 0 Traffic Count: 1995
10 (0 New) 0`
    for i, match := range re.FindAllStringSubmatch(str, -1) {
        fmt.Println(match, "found at index", i)
    }
}
The result is shown below; each match contains a different number of empty elements, so I can't get the values via a fixed index.
[Count: 1,014,747.1 Count: 1,014,747.1 1,014,747.1 ,747 .1 ] found at index 0
[Temp: 14.231 Temp: 14.231 14.231 ] found at index 1
[Total: 10.0 TB Total: 10.0 TB 10.0 TB ] found at index 2
[Used: 964.57 GB Used: 964.57 GB 964.57 GB] found at index 3
1,014,747.1 is at index 2, 14.231 at index 6, 10.0 TB at index 8 and 964.57 GB at index 10, so I can't pick the values out by a fixed index.
A clearer view of the subgroups is at https://regex101.com/r/jenOHn/3; the match information there shows the problem.
So is there a more elegant way to extract these values? The order of the values may vary, and there may be extra (or missing) words in between, so extracting by position or length is not possible.
I've thought about using a finite state machine, but I can't figure out how to implement one and I'm not sure it's the right approach anyway.
It looks like you've got a ton of capturing groups in there that you aren't actually trying to capture, a lot of unnecessarily specific patterns, and a missing s flag. I've cleaned up the expression and it works: https://play.golang.org/p/D9WxFCYQ8s0
(?ms)Count:\s*([0-9,.]+).*Temp:\s*([0-9.]+).*Total:\s*([0-9.]+).*Used:\s*([0-9.]+)
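As a rough sketch of how that cleaned-up expression could be used from Go: the struct mapping, comma stripping and float parsing below are my own additions for illustration, not part of the answer, and note that as written the Total and Used groups capture only the number, without the TB/GB unit.

package main

import (
    "fmt"
    "regexp"
    "strconv"
    "strings"
)

// Cleaned-up expression from the answer above; a single match yields all four values.
var re = regexp.MustCompile(`(?ms)Count:\s*([0-9,.]+).*Temp:\s*([0-9.]+).*Total:\s*([0-9.]+).*Used:\s*([0-9.]+)`)

// Field names follow the struct in the question; Total/Used hold just the number here.
type stats struct {
    Count float64
    Temp  float64
    Total string
    Used  string
}

func main() {
    str := `Welcome, user [User CP] [Count: 1,014,747.1] [some] [Ohter: 0]
Temp: 14.231 Total: 10.0 TB Used: 964.57 GB On line: 2 0 Traffic Count: 1995
10 (0 New) 0`

    m := re.FindStringSubmatch(str)
    if m == nil {
        fmt.Println("no match")
        return
    }
    // m[1]..m[4] are always present and in a fixed order, because every
    // group in the single pattern must match for the match to succeed.
    count, _ := strconv.ParseFloat(strings.ReplaceAll(m[1], ",", ""), 64)
    temp, _ := strconv.ParseFloat(m[2], 64)
    fmt.Printf("%+v\n", stats{Count: count, Temp: temp, Total: m[3], Used: m[4]})
}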

Referencing macro values by index

I defined the macros below as levels of the variables id, var1 and var2:
levelsof id, local(id_lev) sep(,)
levelsof var1, local(var1_lev) sep(,)
levelsof var2, local(var2_lev) sep(,)
I'd like to be able to reference the level values stored in these macros by their index during foreach and forval loops. I'm learning how to use macros, so I'm not sure if this is possible.
When I try to access a single element of any of the above macros, every element of the macro is displayed. For example, if I try to display the first element of id_lev, all of the elements are displayed as one long string (and the last element is reported as an invalid name, which I don't understand):
. di `id_lev'[1]
0524062407240824092601260226032604 invalid name
r(198);
Furthermore, if I attempt to refer to elements of any of the macros in a loop (examples of what I've tried are given below), I receive an error saying that the third value in the list of levels is an invalid number.
foreach i of numlist 1/10 {
    whatever `var1'[i] `var2'[i], gen(newvar)
}
forval i = 1/10 {
    local var1_ `: word `i' of `var1''
    local var2_ `: word `i' of `var2''
    whatever `var1_' `var2_', gen(newvar)
}
Is it not possible to reference elements of a macro by its index?
Or am I referencing the index values incorrectly?
Update 1:
I've gotten everything to work (thank you), except for adapting the forval loop given in William's answer to my loops above, in which I am trying to access the macros of two variables at the same index value.
Specifically, I want to call on the first, second, ..., last elements of var1 and var2 simultaneously so that I can use the elements in a loop to produce a new variable. How can I adapt the forval loop suggested by William to accomplish this?
Update 2:
I was able to adapt the code given by William below to create the functioning loop:
levelsof id, clean local(id_lev)
macro list _id_lev
local nid_lev : word count `id_lev'
levelsof var1, local(var1_lev)
macro list _var1_lev
local nvar1_lev : word count `var1_lev'
levelsof var2, local(var2_lev)
macro list _var2_lev
local nvar2_lev : word count `var2_lev'
forval i = 1/`nid_lev' {
    local id : word `i' of `id_lev'
    macro list _id
    local v1 : word `i' of `var1_lev'
    macro list _v1
    local v2 : word `i' of `var2_lev'
    macro list _v2
    whatever `v1' `v2', gen(newvar)
}
You will benefit, as I mentioned in my closing remark on your previous question, from close study of section 18.3 of the Stata User's Guide PDF.
sysuse auto, clear
tab rep78, missing
levelsof rep78, missing local(replvl)
macro list _replvl
local numlvl : word count `replvl'
macro list _numlvl
forval i = 1/`numlvl' {
    local level : word `i' of `replvl'
    macro list _level
    display `level'+1000
}
yields
. sysuse auto, clear
(1978 Automobile Data)
. tab rep78, missing
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.70 2.70
2 | 8 10.81 13.51
3 | 30 40.54 54.05
4 | 18 24.32 78.38
5 | 11 14.86 93.24
. | 5 6.76 100.00
------------+-----------------------------------
Total | 74 100.00
. levelsof rep78, missing local(replvl)
1 2 3 4 5 .
. macro list _replvl
_replvl: 1 2 3 4 5 .
. local numlvl : word count `replvl'
. macro list _numlvl
_numlvl: 6
. forval i = 1/`numlvl' {
2. local level : word `i' of `replvl'
3. macro list _level
4. display `level'+1000
5. }
_level: 1
1001
_level: 2
1002
_level: 3
1003
_level: 4
1004
_level: 5
1005
_level: .
.

Convert date with milliseconds using PIG

Really stuck on this! Assume I have the following data set:
A | B
------------------
1/2/12 | 13:3.8
04:4.1 | 12:1.4
15:4.3 | 1/3/13
Observations A and B are generally in the format minutes:seconds.milliseconds, where A is a click and B is a response. Sometimes the value has the form month/day/year instead, when the event happens at the beginning of a new day.
What I want is to calculate the average difference between B and A. I can easily handle m:s.ms by splitting each of A and B into two parts, casting them as DOUBLE and performing the needed operations, but it all fails when m/d/yy values are introduced. The easiest way would be to omit them, but that is not really good practice. Is there a clean way to handle such exceptions using Pig?
A thought worth contemplating: see http://pig.apache.org/docs/r0.12.0/func.html for the String and Date functions used.
Input:
1/2/12|13:3.8
04:4.1|12:1.4
15:4.3|1/3/13
Pig Script:
A = LOAD 'input.csv' USING PigStorage('|') AS (start_time:chararray, end_time:chararray);
B = FOREACH A GENERATE
        (INDEXOF(end_time,'/',0) > 0 AND LAST_INDEX_OF(end_time,'/') > 0 AND (INDEXOF(end_time,'/',0) != LAST_INDEX_OF(end_time,'/'))
            ? (ToUnixTime(ToDate(end_time,'MM/dd/yy')))
            : (ToUnixTime(ToDate(end_time,'mm:ss.S'))))
      - (INDEXOF(start_time,'/',0) > 0 AND LAST_INDEX_OF(start_time,'/') > 0 AND (INDEXOF(start_time,'/',0) != LAST_INDEX_OF(start_time,'/'))
            ? (ToUnixTime(ToDate(start_time,'MM/dd/yy')))
            : (ToUnixTime(ToDate(start_time,'mm:ss.S')))) AS diff_time;
C = FOREACH (GROUP B ALL) GENERATE AVG(B.diff_time);
DUMP C;
N.B. In place of ToUnixTime we could use the ToMilliSeconds() method.
Output:
(1.0569718666666666E7)

Conversion between different implementations of Seq in Scala

From what I have read, one should prefer to use generic Seq when defining sequences instead of specific implementations such as List or Vector.
However, in some parts of my code a sequence is used mostly for full traversal (mapping, filtering, etc.), while in other parts the same sequence is used for indexing operations (indexOf, lastIndexWhere).
In the first case, I think it is better to use LinearSeq (whose default implementation is List), whereas in the second case it is better to use IndexedSeq (whose default implementation is Vector).
My question is: do I need to explicitly call the conversion methods toList and toIndexedSeq in my code, or is the conversion done intelligently under the hood? If I do use these conversions, is there a performance penalty when going back and forth between IndexedSeq and LinearSeq?
Thanks in advance
Vector will almost always outperform List: unless your algorithm uses only ::, head and tail, Vector will be faster than List.
Using List is more of a conceptual choice about your algorithm (the data is stack-structured, you only access head/tail, you only add elements by prepending, you use pattern matching, which also works with Vector but feels more natural to me with List); a sketch of that style is shown below.
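For example (a minimal sketch of my own, not from the linked question), the kind of code where List feels natural is head/tail recursion with pattern matching:

def sumAll(xs: List[Int]): Int = xs match {
  case Nil          => 0                         // nothing left to add
  case head :: tail => head + sumAll(tail)       // head/tail access on prepend-structured data
}

sumAll(List(1, 2, 3))   // 6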
You might want to look at Why should I use vector in scala
Now for some numbers to compare (obviously not a 'real' benchmark, but still):
val l = List.range(1,1000000)
val a = Vector.range(1,1000000)
import System.{currentTimeMillis => milli}
val startList = milli
l.map(_*2).map(_+2).filter(_%2 == 0)
println(s"time for list map/filter operations : ${milli - startList}")
val startVector = milli
a.map(_*2).map(_+2).filter(_%2 == 0)
println(s"time for vector map/filter operations : ${milli - startVector}")
Output:
time for list map/filter operations : 1214
time for vector map/filter operations : 364
Edit:
Just realized this doesn't actually answer your question. As far as I know, you will have to call toList/toVector yourself. As for performance, it depends on your sequences, but unless you're going back and forth all the time, it shouldn't be a problem.
Once again, not a serious benchmark, but :
val startConvToList = milli
a.toList
println(s"time for conversion to List: ${milli - startConvToList}")
val startConvToVector = milli
l.toVector
println(s"time for conversion to Vector: ${milli - startConvToVector}")
Output:
time for conversion to List: 48
time for conversion to Vector: 18
I have done the same for indexOf, and Vector is also more performant:
val l = List.range(1,1000000)
val a = Vector.range(1,1000000)
import System.{currentTimeMillis => milli}
val startList = milli
l.indexOf(500000)
println("time for list index operation : " + (milli - startList))
val startVector = milli
a.indexOf(500000)
println("time for vector index operation : " + (milli - startVector))
Output:
time for list index operation : 36
time for vector index operation : 33
So I guess I should use Vector in my internal implementations, but expose Seq when I build interfaces, as discussed here:
Difference between a Seq and a List in Scala
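A minimal sketch of that split, with hypothetical names: accept the general Seq at the API boundary and convert once to a Vector internally for the indexed work.

// Hypothetical helper: Seq in the signature, Vector inside.
def positionsOfEven(xs: Seq[Int]): Seq[Int] = {
  val v = xs.toVector                    // single up-front conversion
  v.indices.filter(i => v(i) % 2 == 0)   // effectively constant-time indexed access
}

// Callers may pass a List or a Vector; the internal representation stays hidden.
positionsOfEven(List(1, 2, 3, 4))   // indices 1 and 3
positionsOfEven(Vector(2, 4, 6))    // indices 0, 1 and 2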

R: Why is the [[ ]] approach for subsetting a list faster than using $?

I've been working on a few projects that have required a lot of list subsetting, and while profiling my code I realised that the object[["nameHere"]] approach to subsetting lists is usually faster than the object$nameHere approach.
As an example if we create a list with named components:
a.long.list <- as.list(rep(1:1000))
names(a.long.list) <- paste0("something",1:1000)
Why is this:
system.time (
for (i in 1:10000) {
a.long.list[["something997"]]
}
)
user system elapsed
0.15 0.00 0.16
faster than this:
system.time (
for (i in 1:10000) {
a.long.list$something997
}
)
user system elapsed
0.23 0.00 0.23
My question is simply: is this behaviour universally true, so that I should avoid $ subsetting wherever possible, or does the most efficient choice depend on other factors?
The [[ function first goes through all elements looking for an exact match, and only then tries partial matching. The $ function tries both exact and partial matching on each element in turn. If you execute:
system.time (
for (i in 1:10000) {
a.long.list[["something9973", exact=FALSE]]
}
)
i.e. a partial match where there is no exact match, you will find that $ is in fact ever so slightly faster.
