Group time intervals by date (in d3.js)

For instance, there is an array of objects with start, end and duration (in hours) attributes.
[{start: new Date(2013, 2, 4, 0),
end: new Date(2013, 2, 4, 8),
duration: 8},
{start: new Date(2013, 2, 4, 22),
end: new Date(2013, 2, 5, 2),
duration: 4},
{start: new Date(2013, 2, 5, 5),
end: new Date(2013, 2, 7, 5),
duration: 48}]
I'd like to visualize them into something like the following (y - hours, x - dates):
I'm thinking about creating additional objects to fill the empty spaces between events like this
[{start: new Date(2013, 2, 4, 0),
end: new Date(2013, 2, 4, 8),
status: "busy"},
{start: new Date(2013, 2, 4, 8, 0, 1),
end: new Date(2013, 2, 4, 21, 59, 59),
status: "free"},
{start: new Date(2013, 2, 4, 22),
end: new Date(2013, 2, 4, 23, 59, 59),
status: "busy"},
{start: new Date(2013, 2, 5, 0),
end: new Date(2013, 2, 5, 2),
status: "busy"}]
And then map this to Stack Layout.
So my question is: what is the best way to split and group the array to make this visualization easier? Are there any built-in D3.js features for this?

I would consider changing the data format to
[{start: new Date(2013, 2, 4, 0),
end: new Date(2013, 2, 4, 8)},
{start: new Date(2013, 2, 4, 22),
end: new Date(2013, 2, 5, 2)},
{start: new Date(2013, 2, 5, 5),
end: new Date(2013, 2, 7, 5)}]
Since you have the start and end date, you don't really need a duration. Alternatively you could have just the start date and a duration.
I'm not very familiar with the stack layout, but it might be sufficient (and easier) for this project to simply append rect elements at the right positions. I made an example here: http://tributary.io/inlet/5841372. It doesn't take into account the fact that you need to wrap events that start one day and end the next; it just displays all events in the same column, with the white space representing free time.
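The wrapping step that the example leaves out can be sketched separately from the drawing code. The following is a minimal sketch in Python rather than d3.js; the helper name split_by_day and the output keys (date, start_hour, hours) are hypothetical, chosen only for illustration. Note that JavaScript months are zero-based, so new Date(2013, 2, 4) corresponds to 4 March 2013.

```python
from datetime import datetime, time, timedelta

def split_by_day(start, end):
    """Split the interval [start, end) into pieces that each stay within one calendar day."""
    pieces = []
    cursor = start
    while cursor < end:
        day_start = datetime.combine(cursor.date(), time.min)
        next_midnight = day_start + timedelta(days=1)
        # Cut the piece at midnight if the event runs into the next day.
        piece_end = min(end, next_midnight)
        pieces.append({
            "date": cursor.date(),                                      # x position (column)
            "start_hour": (cursor - day_start).total_seconds() / 3600,  # y offset in hours
            "hours": (piece_end - cursor).total_seconds() / 3600,       # rect height in hours
        })
        cursor = piece_end
    return pieces

# The three events from the question, with 1-based months.
events = [
    (datetime(2013, 3, 4, 0), datetime(2013, 3, 4, 8)),
    (datetime(2013, 3, 4, 22), datetime(2013, 3, 5, 2)),
    (datetime(2013, 3, 5, 5), datetime(2013, 3, 7, 5)),
]
segments = [piece for start, end in events for piece in split_by_day(start, end)]
```

Each resulting piece maps to one rect: the date picks the column, start_hour the vertical offset, and hours the height. Free time is whatever is left uncovered, so no extra "free" objects are needed.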

Related

Algorithm to find line segments closest to the X-Axis

We have a list of line segments (intervals) :
input Array of objects in the following order:
// Start and end represents the x coordinate and distance represents the y coordinate.
{start: 13, end: 15, distance: 1}, // S[0] -- pale red
{start: 12, end: 15, distance: 2}, // S[1] -- pale orange
{start: 2, end: 5, distance: 1}, // S[2] -- pale yellow
{start: 7, end: 9, distance: 2}, // S[3] -- pale green 1
{start: 7, end: 9, distance: 2}, // S[4] -- pale green 2
{start: 6, end: 8, distance: 2}, // S[5] -- fresh green
{start: 2, end: 5, distance: 4}, // S[6] -- pale gray
{start: 5, end: 11, distance: 4}, // S[7] -- air blue
{start: 9, end: 10, distance: 1}, // S[8] -- cyan blue
{start: 1, end: 11, distance: 3}, // S[9] -- magenta purple
We want to find the parts of each interval that are closest to the x axis:
{start: 1, end: 2, in: S[9]},
{start: 2, end: 5, in: S[2]},
{start: 5, end: 6, in: S[9]},
{start: 6, end: 8, in: S[5]},
{start: 8, end: 9, in: S[3]},
{start: 9, end: 10, in: S[8]},
{start: 10, end: 11, in: S[9]},
{start: 12, end: 13, in: S[1]},
{start: 13, end: 15, in: S[0]},
One way of achieving this is:
First, sort the array of objects.
Then use N stacks to push each interval, keep the one that is closest to the x axis (lowest distance), and build the final set.
But this won't be optimal.
What would be the optimal solution for this?
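One common way to approach this kind of problem is a sweep over the sorted endpoints with a min-heap of active segments, which runs in O(n log n). The sketch below is an illustration under stated assumptions, not a confirmed answer from the thread: the function name lowest_envelope is hypothetical, segments are passed as (start, end, distance) tuples, and ties on distance are broken in favour of the segment that started earlier (an assumption chosen because it reproduces the expected output above).

```python
import heapq

def lowest_envelope(segments):
    """For each stretch of the x axis, report the index of the segment with the lowest distance."""
    xs = sorted({x for start, end, _ in segments for x in (start, end)})
    by_start = sorted(range(len(segments)), key=lambda i: segments[i][0])
    heap = []      # entries are (distance, start, index); expired entries are removed lazily
    result = []    # entries are (start, end, index)
    nxt = 0
    for left, right in zip(xs, xs[1:]):
        # Activate every segment that has started by the left edge of this stretch.
        while nxt < len(by_start) and segments[by_start[nxt]][0] <= left:
            i = by_start[nxt]
            heapq.heappush(heap, (segments[i][2], segments[i][0], i))
            nxt += 1
        # Drop segments at the top of the heap that have already ended.
        while heap and segments[heap[0][2]][1] <= left:
            heapq.heappop(heap)
        if not heap:
            continue  # gap: no segment covers this stretch
        winner = heap[0][2]
        if result and result[-1][2] == winner and result[-1][1] == left:
            result[-1] = (result[-1][0], right, winner)  # extend the previous piece
        else:
            result.append((left, right, winner))
    return result

S = [(13, 15, 1), (12, 15, 2), (2, 5, 1), (7, 9, 2), (7, 9, 2),
     (6, 8, 2), (2, 5, 4), (5, 11, 4), (9, 10, 1), (1, 11, 3)]
envelope = lowest_envelope(S)
```

Each segment is pushed and popped at most once, so the heap work is O(n log n) in total, on top of the initial sort.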

Any form for a year-to-date or rolling sum function in Power Query?

I'm quite new to Power Query. I have a date column called MyDate (format dd/mm/yy) and another column called TotalSales. Is there any way of obtaining a column TotalSalesYTD, holding the year-to-date sum of TotalSales for each row? I've seen you can do that in Power Pivot or Power BI, but I didn't find anything for Power Query.
Alternatively, is there a way of creating a variable TotalSales12M, for the rolling sum of the last 12 months of TotalSales?
I wasn't able to test this properly, but the following code gave me your expected result:
let
initialTable = Table.FromRows({
{#date(2020, 5, 1), 150},
{#date(2020, 4, 1), 20},
{#date(2020, 3, 1), 54},
{#date(2020, 2, 1), 84},
{#date(2020, 1, 1), 564},
{#date(2019, 12, 1), 54},
{#date(2019, 11, 1), 678},
{#date(2019, 10, 1), 885},
{#date(2019, 9, 1), 54},
{#date(2019, 8, 1), 98},
{#date(2019, 7, 1), 654},
{#date(2019, 6, 1), 45},
{#date(2019, 5, 1), 64},
{#date(2019, 4, 1), 68},
{#date(2019, 3, 1), 52},
{#date(2019, 2, 1), 549},
{#date(2019, 1, 1), 463},
{#date(2018, 12, 1), 65},
{#date(2018, 11, 1), 45},
{#date(2018, 10, 1), 68},
{#date(2018, 9, 1), 65},
{#date(2018, 8, 1), 564},
{#date(2018, 7, 1), 16},
{#date(2018, 6, 1), 469},
{#date(2018, 5, 1), 4}
}, type table [MyDate = date, TotalSales = Int64.Type]),
ListCumulativeSum = (numbers as list) as list =>
let
accumulator = (listState as list, toAdd as nullable number) as list =>
let
previousTotal = List.Last(listState, 0),
combined = listState & {List.Sum({previousTotal, toAdd})}
in combined,
accumulated = List.Accumulate(numbers, {}, accumulator)
in accumulated,
TableCumulativeSum = (someTable as table, columnToSum as text, newColumnName as text) as table =>
let
values = Table.Column(someTable, columnToSum),
cumulative = ListCumulativeSum(values),
columns = Table.ToColumns(someTable) & {cumulative},
toTable = Table.FromColumns(columns, Table.ColumnNames(someTable) & {newColumnName})
in toTable,
yearToDateColumn =
let
groupKey = Table.AddColumn(initialTable, "$groupKey", each Date.Year([MyDate]), Int64.Type),
grouped = Table.Group(groupKey, "$groupKey", {"toCombine", each
let
sorted = Table.Sort(_, {"MyDate", Order.Ascending}),
cumulative = TableCumulativeSum(sorted, "TotalSales", "TotalSalesYTD")
in cumulative
}),
combined = Table.Combine(grouped[toCombine]),
removeGroupKey = Table.RemoveColumns(combined, "$groupKey")
in removeGroupKey,
rolling = Table.AddColumn(yearToDateColumn, "TotalSales12M", each
let
inclusiveEnd = [MyDate],
exclusiveStart = Date.AddMonths(inclusiveEnd, -12),
filtered = Table.SelectRows(yearToDateColumn, each [MyDate] > exclusiveStart and [MyDate] <= inclusiveEnd),
sum = List.Sum(filtered[TotalSales])
in sum
),
sortedRows = Table.Sort(rolling, {{"MyDate", Order.Descending}})
in
sortedRows
There might be more efficient ways to do what this code does, but if the size of your data is relatively small, then this approach should be okay.
For the year-to-date cumulative total, the data is grouped by year, then sorted in ascending order, then a running total column is added.
For the rolling 12-month total, the data is grouped into 12-month windows and then the sales are totaled within each window. The totaling is a bit inefficient (since all rows are re-processed as opposed to only those which have entered/left the window), but you might not notice it.
Table.Range could have been used instead of Table.SelectRows when creating the 12-month windows, but I figured Table.SelectRows makes fewer assumptions about the input data (i.e. whether it's sorted, whether any months are missing, etc.) and is therefore safer/more robust.
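To make the two calculations easier to follow outside of M, here is a minimal Python sketch of the same logic. It is not Power Query code; the names add_months and with_running_totals are hypothetical, and rows are assumed to be (date, sales) tuples. The rolling window uses the same bounds as the query above: exclusive start 12 months back, inclusive end at the row's own date.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the end of the target month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)

def with_running_totals(rows):
    """Return (date, sales, year_to_date, rolling_12_months) tuples in ascending date order."""
    rows = sorted(rows)
    out = []
    ytd, current_year = 0, None
    for d, sales in rows:
        if d.year != current_year:      # the running total restarts every calendar year
            current_year, ytd = d.year, 0
        ytd += sales
        window_start = add_months(d, -12)
        # Re-scans all rows for each window, like Table.SelectRows in the query above.
        rolling = sum(s for other, s in rows if window_start < other <= d)
        out.append((d, sales, ytd, rolling))
    return out

sample = [
    (date(2020, 2, 1), 84),
    (date(2020, 1, 1), 564),
    (date(2019, 12, 1), 54),
    (date(2019, 1, 1), 463),
]
totals = with_running_totals(sample)
```

As in the M version, filtering by date rather than by row position means the rolling sum does not depend on the rows being sorted or on every month being present.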

Find Top N Most Frequent Sequence of Numbers in List of a Billion Sequences

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in the i=M position (or i=M+1 position, or i=M+2 position, ..., i=L position) then we count the subsequence of length M where target is in the final position in the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
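The counting rules above can be checked on small inputs with a brute-force baseline. This is only a sketch for validating results, not something that would scale to billions of lists; the function name top_subsequences is hypothetical, and ties at the cutoff are all returned, matching the N=3 case described above.

```python
from collections import Counter

def top_subsequences(sequences, target, m, n):
    """Count every length-m window that ends in target and return the top n (with ties)."""
    counts = Counter()
    for seq in sequences:
        for i, value in enumerate(seq):
            # The window must fit entirely inside the list and end at position i.
            if value == target and i + 1 >= m:
                counts[tuple(seq[i + 1 - m:i + 1])] += 1
    if not counts:
        return []
    ranked = counts.most_common()
    cutoff = ranked[min(n, len(ranked)) - 1][1]   # count of the n-th ranked window
    return [(list(window), count) for window, count in ranked if count >= cutoff]

x = [[1, 2, 3, 4, 5, 6, 7],
     [6, 5, 10, 11],
     [9, 8, 2, 3, 4, 5],
     [12, 12, 6, 5],
     [5, 8, 3, 4, 2],
     [1, 5],
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
     [7, 1, 7, 3, 4, 1, 2],
     [9, 4, 12, 12, 6, 5, 1]]
top_two = top_subsequences(x, target=5, m=4, n=2)
top_three = top_subsequences(x, target=5, m=4, n=3)
```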
Important
This is super simplified but, in reality, my actual list-of-sequences
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there known algorithms for performing this kind of analysis for various combinations of N and M? I've looked at suffix trees but I'd have to roll my own custom version to even get close to what I need.
For the same dataset, I need to repeatedly query for different combinations of target, N, and M (where target <= 10,000, N <= 100 and M <= 100). How can I do this efficiently?
Extending on my comment. Here is a sketch how you could approach this using an out-of-the-box suffix array:
1) reverse and concatenate your lists with a stop symbol (I used 0 here).
[7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 5, 6, 0, 5, 4, 3, 2, 8, 9, 0, 5, 6, 12, 12, 0, 2, 4, 3, 8, 5, 0, 5, 1, 0, 6, 5, 12, 4, 1, 9, 5, 3, 8, 8, 2, 0, 2, 1, 4, 3, 7, 1, 7, 0, 1, 5, 6, 12, 12, 4, 9]
2) Build a suffix array
[53, 45, 24, 30, 12, 19, 33, 7, 32, 6, 47, 54, 51, 38, 44, 5, 46, 25, 16, 4, 15, 49, 27, 41, 37, 3, 14, 48, 26, 59, 29, 31, 40, 2, 13, 10, 20, 55, 35, 11, 1, 34, 21, 56, 52, 50, 0, 43, 28, 42, 17, 18, 39, 60, 9, 8, 23, 36, 58, 22, 57]
3) Build the LCP array. The LCP array will tell you how many numbers a suffix has in common with its neighbour in the suffix array. However, you need to stop counting when you encounter a stop symbol
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 3, 2, 2, 1, 0, 1, 1, 1, 4, 1, 2, 4, 1, 0, 1, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0]
4) When a query comes in (target = 5, M = 4) you search for the first occurrence of your target in the suffix array and scan the corresponding part of the LCP array until the starting number of the suffixes changes. Below is the part of the LCP array that corresponds to all suffixes starting with 5.
[..., 1, 1, 1, 4, 1, 2, 4, 1, 0, ...]
This tells you that there are two sequences of length 4 that occur two times. Brushing over some details using the indexes you can find the sequences and revert them back to get your final results.
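The steps above can be sketched in a few lines of Python. This is a simplified illustration with two loud assumptions: the suffix array is built naively by sorting whole suffixes (far too slow for real data, where a linear-time construction would be used), and the LCP scan of step 4 is replaced by directly counting the length-M prefixes of the matching suffixes. The names build_index and query are hypothetical.

```python
def build_index(sequences, stop=0):
    """Steps 1 and 2: reverse, concatenate with a stop symbol, and sort the suffixes."""
    text = []
    for k, seq in enumerate(sequences):
        if k:
            text.append(stop)
        text.extend(reversed(seq))
    # Naive construction for illustration only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array

def query(text, suffix_array, target, m, stop=0):
    """Count length-m subsequences ending in target, using the sorted suffixes."""
    # Binary search for the first suffix whose leading number is >= target.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]] < target:
            lo = mid + 1
        else:
            hi = mid
    counts = {}
    pos = lo
    # Scan only the suffixes that start with target.
    while pos < len(suffix_array) and text[suffix_array[pos]] == target:
        start = suffix_array[pos]
        window = text[start:start + m]
        if len(window) == m and stop not in window:
            key = tuple(reversed(window))   # undo the reversal from step 1
            counts[key] = counts.get(key, 0) + 1
        pos += 1
    return counts

x = [[1, 2, 3, 4, 5, 6, 7], [6, 5, 10, 11], [9, 8, 2, 3, 4, 5],
     [12, 12, 6, 5], [5, 8, 3, 4, 2], [1, 5],
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], [7, 1, 7, 3, 4, 1, 2],
     [9, 4, 12, 12, 6, 5, 1]]
text, suffix_array = build_index(x)
counts = query(text, suffix_array, target=5, m=4)
```

Because the suffixes are sorted, equal windows sit next to each other in the scanned range, which is what the LCP array exploits to count them without comparing the windows themselves.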
Complexity
Building the suffix array is O(n) in both time and space, where n is the total number of elements in all lists
Building the LCP array is also O(n) in both time and space
Searching for a target number in the suffix array is O(log n) on average
The cost of scanning through the relevant subsequences is linear in the number of times the target occurs, which should be about n/10000 on average according to your given parameters.
The first two steps happen offline. Querying is technically O(n) (due to step 4) but with a small constant (0.0001).

How can I find the minimum value for a field each day over a timespan using RethinkDb?

I have a database of readings from weather sensors. One of the items measured is 'sky temperature'. I want to find the minimum sky temperature each day over a period of a month or two.
The first thing I tried was this:
r.db('Weather').table('TAO_SkyNet', {readMode:'outdated'})
.group(r.row('time').dayOfYear(),{index:'time'})
.min('sky')
I think that might work, except that it is a large database and the query times out after 300 seconds. Fair enough, I really don't want the data back to the beginning of time. A few weeks will do. So I tried to restrict the records examined like this:
r.db('Weather').table('TAO_SkyNet', {readMode:'outdated'})
.between(r.time(2018,3,1,'Z'), r.now())
.group(r.row('time').dayOfYear(),{index:'time'})
.min('sky')
..and I get...
e: Expected type TABLE but found TABLE_SLICE:
SELECTION ON table(TAO_SkyNet) in:
r.db("Weather").table("TAO_SkyNet", {"readMode": "outdated"}).between(r.time(2018, 3, 1, "Z"), r.now()).group(r.row("time").dayOfYear(), {"index": "time"}).min("sky")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So, I'm stuck here. How do I group on a subset of the table?
between returns a table slice, and table slices don't support indexes.
table.between(lowerKey, upperKey[, options]) → table_slice
By the way, between operates over indexes itself.
Once you remove {index:'time'} from your group clause (if TAO_SkyNet has time as its primary key):
r.db('Weather')
.table('TAO_SkyNet', {readMode: 'outdated'})
.between(r.time(2018, 3, 1, 'Z'), r.now())
.group(r.row('time').dayOfYear())
.min('sky')
or move the index option to the between clause (if TAO_SkyNet has time as its secondary key)
r.db('Weather')
.table('TAO_SkyNet', {readMode: 'outdated'})
.between(r.time(2018, 3, 1, 'Z'), r.now(), {index: 'time'})
.group(r.row('time').dayOfYear())
.min('sky')
it should work fine.
Test dataset:
r.db('Weather').table('TAO_SkyNet').insert([
// day 1
{time: r.time(2018, 3, 1, 0, 0, 0, 'Z'), sky: 10},
{time: r.time(2018, 3, 1, 8, 0, 0, 'Z'), sky: 4}, // min
{time: r.time(2018, 3, 1, 16, 0, 0, 'Z'), sky: 7},
// day 2
{time: r.time(2018, 3, 2, 0, 0, 0, 'Z'), sky: 2}, // min
{time: r.time(2018, 3, 2, 8, 0, 0, 'Z'), sky: 4},
{time: r.time(2018, 3, 2, 16, 0, 0, 'Z'), sky: 9},
// day 3
{time: r.time(2018, 3, 3, 0, 0, 0, 'Z'), sky: 7},
{time: r.time(2018, 3, 3, 8, 0, 0, 'Z'), sky: 7},
{time: r.time(2018, 3, 3, 16, 0, 0, 'Z'), sky: 1} // min
]);
Query result:
[{
"group": 60,
"reduction": {"sky": 4, "time": Thu Mar 01 2018 08:00:00 GMT+00:00}
},
{
"group": 61,
"reduction": {"sky": 2, "time": Fri Mar 02 2018 00:00:00 GMT+00:00}
},
{
"group": 62,
"reduction": {"sky": 1, "time": Sat Mar 03 2018 16:00:00 GMT+00:00}
}]
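For readers who want to see what the grouped minimum computes, here is a plain Python stand-in that applies the same logic to the test dataset above. It does not use the RethinkDB driver at all; the function name daily_minimum and the in-memory list of readings are assumptions made for illustration.

```python
from datetime import datetime, timezone

def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)

readings = [
    {"time": utc(2018, 3, 1, 0), "sky": 10},
    {"time": utc(2018, 3, 1, 8), "sky": 4},
    {"time": utc(2018, 3, 1, 16), "sky": 7},
    {"time": utc(2018, 3, 2, 0), "sky": 2},
    {"time": utc(2018, 3, 2, 8), "sky": 4},
    {"time": utc(2018, 3, 2, 16), "sky": 9},
    {"time": utc(2018, 3, 3, 0), "sky": 7},
    {"time": utc(2018, 3, 3, 8), "sky": 7},
    {"time": utc(2018, 3, 3, 16), "sky": 1},
]

def daily_minimum(rows, since):
    """Keep the row with the lowest 'sky' value for each day of the year."""
    best = {}
    for row in rows:
        if row["time"] < since:          # plays the role of between(...)
            continue
        day = row["time"].timetuple().tm_yday   # plays the role of dayOfYear()
        if day not in best or row["sky"] < best[day]["sky"]:
            best[day] = row
    return best

minimums = daily_minimum(readings, since=utc(2018, 3, 1))
```

The keys 60, 61 and 62 correspond to 1, 2 and 3 March in a non-leap year, which matches the group values in the query result. Grouping by day of year alone would merge the same calendar day from different years, so the time filter matters if the range spans more than twelve months.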

python appending issues, function keeps changing values of list

I was trying to visualize bubblesort by making an animated plot of some unsorted list, say np.random.permutation(10),
so naturally I append the list every time it's altered within the bubblesort function until it's completely sorted. Here's the code:
def bubblesort(A):
    instant = []
    for i in range(len(A)-1):
        lindex=0
        while lindex+1<len(A):
            if A[lindex]> A[lindex+1]:
                swap(A,lindex,lindex+1)
                lindex+=1
            else:
                lindex+=1
        instant.append(A)
    return instant
The problem is though, instant only returns
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
which is obviously not right. What has gone wrong? Thanks!
A is being operated on in-place, and bubblesort is returning a list of references to this array. Notice that if you check A now, it is also sorted.
Changing
            if A[lindex]> A[lindex+1]:
                swap(A,lindex,lindex+1)
to
            if A[lindex]> A[lindex+1]:
                A = A.copy()
                swap(A,lindex,lindex+1)
making a copy before changing anything, should show the progress of the sort.
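An alternative that keeps the in-place sort is to store a copy of A at the moment of appending, so each entry is an independent snapshot. The sketch below is a variation on the fix above, not the answerer's exact code. The swap helper was not shown in the question, so the version here is an assumption, and a plain list is used in place of a NumPy array (both provide .copy()).

```python
def swap(A, i, j):
    """Assumed helper: exchange two elements in place."""
    A[i], A[j] = A[j], A[i]

def bubblesort(A):
    """Sort A in place and return a snapshot of A after each pass."""
    snapshots = []
    for _ in range(len(A) - 1):
        for lindex in range(len(A) - 1):
            if A[lindex] > A[lindex + 1]:
                swap(A, lindex, lindex + 1)
        # Append a copy, not A itself, so later passes cannot change this entry.
        snapshots.append(A.copy())
    return snapshots

data = [3, 2, 1]
frames = bubblesort(data)
```

Here frames holds one state per pass, while data itself ends up sorted. Copying once per pass also makes fewer copies than copying before every swap.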

Resources