Why can't "asof" join be both "nearest" and "strict"? - empirical-lang

Going through the demo, why does Empirical prohibit the following?
>>> join trades, events on symbol asof timestamp nearest strict
Error: join 'asof' cannot be both 'nearest' and 'strict'
Is there a way to match a timestamp that is closest but not exact?

This is not permitted because the matched rows could come back out of order. Imagine this setup:
data LeftItem: time: Time, code1: Char end
data RightItem: time: Time, code2: Char end
let left = !LeftItem([Time("09:30"), Time("09:31")], ['A', 'B'])
let right = !RightItem([Time("09:30"), Time("09:31")], ['a', 'b'])
We now have these Dataframes:
>>> left
time code1
09:30:00 A
09:31:00 B
>>> right
time code2
09:30:00 a
09:31:00 b
If there were a "nearest strict", then the result would be
time code1 code2
09:30:00 A b
09:31:00 B a
It's correct in the sense that each match is the closest row that isn't exact, but the result doesn't make sense: we expect time to increase monotonically, so the matched rows should never appear in reverse order.
So the most sensible approach is to allow "strict" on "backward" and "forward" directions, but not on "nearest".
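To make the crossed matching concrete, here is a tiny Python sketch (not Empirical; the helper is purely illustrative) that applies a hypothetical nearest-and-strict rule to the example above:

def nearest_strict_match(left_times, right_times):
    # for each left timestamp, pick the nearest right timestamp that is
    # not an exact match -- the rule a 'nearest strict' join would imply
    matches = []
    for lt in left_times:
        candidates = [rt for rt in right_times if rt != lt]
        matches.append((lt, min(candidates, key=lambda rt: abs(rt - lt))))
    return matches

# minutes since midnight for the example tables
left = [570, 571]    # 09:30 -> 'A', 09:31 -> 'B'
right = [570, 571]   # 09:30 -> 'a', 09:31 -> 'b'
print(nearest_strict_match(left, right))   # [(570, 571), (571, 570)] -- crossed, so 'A' pairs with 'b'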

Related

Alternative to using ungroup in kdb?

I have two tables in KDB.
One is a timeseries with datetime and sym columns (spanning multiple dates, e.g. 1mm or 2mm rows). Each timepoint has the same number of syms and a few other standard columns such as price.
Let's call this t1:
`date`datetime`sym`price
The other table is of this structure:
`date`sym`factors`weights
where factors is a list and weights is a list of equal length for each sym.
Let's call this t2.
I'm doing a left join on these two tables and then an ungroup.
The factors and weights lists are not the same length for every sym; the count varies by sym.
I'm doing the following:
select sum (weights*price) by date, factors from ungroup t1 lj `date`sym xkey t2
However, this is very slow: it can take 5-6 seconds when t1 has a million rows or more.
Calling all kdb experts for some advice!
EDIT:
here's a full example:
(apologies for the roundabout way of defining t1 and t2)
interval: `long$`time$00:01:00;
hops: til 1+ `int$((`long$(et:`time$17:00)-st:`time$07:00))%interval;
times: st + `long$interval*hops;
dates: .z.D - til .z.D-.z.D-10;
timepoints: ([] date: dates) cross ([] time:times);
syms: ([] sym: 300?`5);
universe: timepoints cross syms;
t1: update datetime: date+time, price:count[universe]?100.0 from universe;
t2: ([] date:dates) cross syms;
/ note: my real-life t2 doesn't have exactly 10 weights/factors for each sym; the count can vary by sym.
t2: `date`sym xkey update factors: count[t2]#enlist 10?`5, weights: count[t2]#enlist 10?10 from t2;
/ what is slow is the ungroup
select sum weights*price by date, datetime, factors from ungroup t1 lj t2
One approach to avoid the ungroup is to work with matrices (aka lists of lists) and take advantage of the optimised matrix-multiply $ seen here: https://code.kx.com/q/ref/mmu/
In my approach below, instead of joining t2 to t1 to ungroup, I group t1 and join to t2 (thus keeping everything as lists of lists) and then use some matrix manipulation (with a final ungroup at the end on a much smaller set)
q)\ts res:select sum weights*price by date, factors from ungroup t1 lj t2
4100 3035628112
q)\ts resT:ungroup exec first factors,sum each flip["f"$weights]$price by date:date from t2 lj select price by date,sym from t1;
76 83892800
q)(0!res)~`date`factors xasc `date`factors`weights xcol resT
1b
As you can see it's much quicker (at least on my machine) and the result is identical save for ordering and column names.
You may still need to modify this solution somewhat for your actual use case (with variable-length weights etc.); in that case, perhaps enforce a uniform number of weights across syms, padding with zeros where necessary.
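For readers more comfortable outside q, here is a rough NumPy sketch (toy numbers, not the poster's data) of the linear-algebra idea behind this answer: instead of materialising one row per (sym, factor) pair and grouping, compute the per-factor weighted sums for a date with a single matrix-vector product.

import numpy as np

# toy data for one date: 3 syms, 4 factors
weights = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.1, 0.2, 0.2],
                    [0.3, 0.3, 0.2, 0.2]])   # shape (syms, factors)
prices = np.array([100.0, 50.0, 75.0])       # shape (syms,)

# ungroup-style: one (sym, factor) row at a time, then group-sum -- wasteful
by_factor_slow = np.zeros(weights.shape[1])
for s in range(weights.shape[0]):
    for f in range(weights.shape[1]):
        by_factor_slow[f] += weights[s, f] * prices[s]

# matrix-style: a single matrix-vector product, no row explosion
by_factor_fast = weights.T @ prices

assert np.allclose(by_factor_slow, by_factor_fast)
print(by_factor_fast)   # weighted price summed per factor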

What is the most efficient algorithm/data structure for finding the smallest range containing a point?

Given a data set of a few millions of price ranges, we need to find the smallest range that contains a given price.
The following rules apply:
Ranges can be fully nested (ie, 1-10 and 5-10 is valid)
Ranges cannot be partially nested (ie, 1-10 and 5-15 is invalid)
Example:
Given the following price ranges:
1-100
50-100
100-120
5-10
5-20
The result for searching price 7 should be 5-10
The result for searching price 100 should be 100-120 (smallest range containing 100).
What's the most efficient algorithm/data structure to implement this?
Searching the web, I only found solutions for searching ranges within ranges.
I've been looking at Morton codes and the Hilbert curve, but can't wrap my head around how to use them for this case.
Thanks.
Since you did not mention this ad hoc algorithm, I'll propose it as a simple first answer to your question.
This is a Python function, but it's fairly easy to understand and convert to another language.
def min_range(ranges, value):
    # ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
    # value = 100
    # INIT
    import math
    best_range = None
    best_range_len = math.inf
    # LOOP THROUGH ALL RANGES
    for b, e in ranges:
        # PICK THE SMALLEST
        if b <= value <= e and e - b < best_range_len:
            best_range = (b, e)
            best_range_len = e - b
    print(f'Minimal range containing {value} = {best_range}')
I believe there are more efficient and complicated solutions (if you can do some precomputation for example) but this is the first step you must take.
EDIT: Here is a better solution, probably O(log(n)) per query, but it's not trivial. It is a tree where each node is an interval and has a child list of all strictly non-overlapping intervals contained inside it.
Preprocessing is done in O(n log(n)) time; queries are O(n) in the worst case (when no two ranges are disjoint) and probably O(log(n)) on average.
Two classes. The tree class holds the structure and answers queries:
class tree:
    def __init__(self, ranges):
        # sort the ranges by lowest starting and then greatest ending
        ranges = sorted(ranges, key=lambda i: (i[0], -i[1]))
        # recursive building -> might want to optimize that in python
        self.node = node((-float('inf'), float('inf')), ranges)

    def __str__(self):
        return str(self.node)

    def query(self, value):
        # bisect is for binary search
        import bisect
        curr_sol = self.node.inter
        node_list = self.node.child_list
        while True:
            # which of the child ranges can include our value ?
            i = bisect.bisect_left(node_list, (value, float('inf'))) - 1
            # does it include it ?
            if i < 0 or i == len(node_list):
                return curr_sol
            if value > node_list[i].inter[1]:
                return curr_sol
            else:
                # if it does then go deeper
                curr_sol = node_list[i].inter
                node_list = node_list[i].child_list
Node that holds the structure and information:
class node:
    def __init__(self, inter, ranges):
        # all elements in ranges will be descendant of this node !
        import bisect
        self.inter = inter
        self.child_list = []
        for i, r in enumerate(ranges):
            if len(self.child_list) == 0:
                # append a new child when list is empty
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))
            else:
                # the current range r is included in a previous range
                # r is not a child of self but a descendant !
                if r[0] < self.child_list[-1].inter[1]:
                    continue
                # else -> this is a new child
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))

    def __str__(self):
        # fancy
        return f'{self.inter} : [{", ".join([str(n) for n in self.child_list])}]'

    def __lt__(self, other):
        # this is '<' operator -> for bisect to compare our items
        return self.inter < other
and to test that:
ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20), (50, 51)]
t = tree(ranges)
print(t)
print(t.query(10))
print(t.query(5))
print(t.query(40))
print(t.query(50))
Preprocessing that generates disjoint intervals
(I call the source segments "ranges" and the resulting segments "intervals".)
For every range border (both start and end) make a tuple (value, start/end field, range length, id) and put these tuples in an array/list.
Sort the tuples by the first field. In case of a tie, put the longer range to the left for start borders and to the right for end borders.
Make a stack.
Make a StartValue variable.
Walk through the list:
    if current tuple contains start:
        if interval is opened:              // we close it
            if current value > StartValue:  // interval is not empty
                make interval with          // note id remains in stack
                  (start = StartValue, end = current value, id = stack.peek)
                add interval to result list
        StartValue = current value          // we open new interval
        push id from current tuple onto stack
    else:                                   // end of range
        if current value > StartValue:      // interval is not empty
            make interval with              // note id is removed from stack
              (start = StartValue, end = current value, id = stack.pop)
            add interval to result list
        if stack is not empty:
            StartValue = current value      // we open new interval
After that we have a sorted list of disjoint intervals containing start/end values and the id of the source range (note that many intervals might correspond to the same source range), so we can use binary search easily.
If we add source ranges one by one in nested order (each nested range after its parent), we can see that every new range generates at most two new intervals, so the overall number of intervals is M <= 2*N and the overall complexity is O(N log N + Q * log N), where Q is the number of queries.
Edit:
Added if stack is not empty section
Result for your example 1-100, 50-100, 100-120, 5-10, 5-20 is
1-5(0), 5-10(3), 10-20(4), 20-50(0), 50-100(1), 100-120(2)
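Here is a minimal Python sketch of that sweep (the tie-breaking choice of processing end borders before start borders at the same value is an assumption on my part, made so that the stack top always matches the range being closed):

import bisect

def build_disjoint_intervals(ranges):
    # sweep the sorted borders once, producing disjoint intervals labelled
    # with the id of the innermost (smallest) range covering them
    events = []
    for rid, (s, e) in enumerate(ranges):
        length = e - s
        events.append((e, 0, length, rid))    # ends first on ties, shorter range first
        events.append((s, 1, -length, rid))   # starts, longer range first
    events.sort()

    intervals = []        # (start, end, range_id), sorted and non-overlapping
    stack = []            # ids of currently open ranges, innermost on top
    start_value = None
    for value, is_start, _, rid in events:
        if is_start:
            if stack and value > start_value:
                intervals.append((start_value, value, stack[-1]))
            stack.append(rid)
            start_value = value
        else:
            if value > start_value:
                intervals.append((start_value, value, stack[-1]))
            stack.pop()
            if stack:
                start_value = value
    return intervals

def smallest_containing(intervals, ranges, value):
    # binary-search the disjoint intervals for the one covering value
    # (in real code, precompute the list of starts once)
    starts = [iv[0] for iv in intervals]
    i = bisect.bisect_right(starts, value) - 1
    if i >= 0 and intervals[i][0] <= value <= intervals[i][1]:
        return ranges[intervals[i][2]]
    return None

ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
intervals = build_disjoint_intervals(ranges)
print(intervals)                                     # [(1, 5, 0), (5, 10, 3), (10, 20, 4), ...]
print(smallest_containing(intervals, ranges, 7))     # (5, 10)
print(smallest_containing(intervals, ranges, 100))   # (100, 120)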
Since pLOPeGG already covered the ad hoc case, I will answer the question under the premise that preprocessing is performed in order to support multiple queries efficiently.
General data structures for efficient queries on intervals are the Interval Tree and the Segment Tree
What about an approach like this? Since we only allow nesting and not partial nesting, it looks doable.
Split segments into (left, val) and (right, val) pairs.
Order them with respect to their vals and left/right relation.
Search the list with binary search. We get two outcomes: found and not found.
If found, check whether it is a left or a right. If it is a left, go right until you find a right without finding a left. If it is a right, go left until you find a left without finding a right. Pick the smallest.
If not found, stop when high-low is 1 or 0. Then compare the queried value with the value of the node you are at, and search to its right and left just like before.
As an example:
We would have (l,10) (l,20) (l,30) (r,45) (r,60) (r,100). When searching for, say, 65 you land on (r,100), so you go left and can't find an (l,x) with x >= 65; you keep going left until the lefts and rights balance, and the first right and last left give your interval. The preprocessing part will be long, but since you keep the structure around, that cost is paid once. The search is still O(n) in the worst case, but that worst case requires everything to be nested inside everything else and the query to hit the outermost range.

How to find a series of transactions happening in a range of time?

I have a dataset with nodes that are companies linked by transactions.
A company has these properties : name, country, type, creation_date
The relationships "SELLS_TO" have these properties : item, date, amount
All dates are in the following format YYYYMMDD.
I'm trying to find a series of transactions that :
1) include 2 companies from 2 distinct countries
2) where between the first node in the series and the last one, there is a company that has been created less than 90 days ago
3) where the total time between the first transaction and the last transaction is < 15 days
I think I can handle the conditions 1) and 2) but I'm stuck on 3).
MATCH (a:Company)-[r:SELLS_TO]->(b:Company)-[v:SELLS_TO*]->(c:Company)
WHERE NOT(a.country = c.country) AND (b.creation_date + 90 < 20140801)
Basically I don't know how to get the date of the last transaction in the series. Does anyone know how to do that?
jvilledieu,
In answer to your most immediate question, you can access the collections of nodes and relationships in the matched path and get the information you need. The query would look something like this.
MATCH p=(a:Company)-[rs:SELLS_TO*]->(c:Company)
WHERE a.country <> c.country
WITH p, a, c, rs, nodes(p) AS ns
WITH p, a, c, rs, filter(n IN ns WHERE n.creation_date - 20140801 < 90) AS bs
WITH p, a, c, rs, head(bs) AS b
WHERE NOT b IS NULL
WITH p, a, b, c, head(rs) AS r1, last(rs) AS rn
WITH p, a, b, c, r1, rn, rn.date - r1.date AS d
WHERE d < 15
RETURN a, b, c, d, r1, rn
This query finds a chain with at least one :SELLS_TO relationship between :Company nodes and assigns the matched path to 'p'. The match is then limited to cases where the first and last company have different countries. At this point the WITH clauses develop the other elements that you need. The collection of nodes in the path is obtained and named 'ns'. From this, a collection of nodes where the creation date is less than 90 days from the target date is found and named 'bs'. The first node of the 'bs' collection is then found and named 'b', and the match is limited to cases where a 'b' node was found. The first and last relationships are then found and named 'r1' and 'rn'. After this, the difference in their dates is calculated and named 'd'. The match is then limited to cases where d is less than 15.
So that gives you an idea of how to do this. There is another problem though. At least, in the way you have described the problem, you will find that the date math will fail. Dates that are represented as numbers, such as 20140801, are not linear, and thus cannot be used for interval math. As an example, 15 days from 20140820 is 20140904. If you subtract these two date 'numbers', you get 84. One example of how to do this is to represent your dates as days since an epoch date.
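For illustration only (Python rather than Cypher; the conversion helper below is hypothetical, not something Neo4j provides), this is the kind of epoch-based arithmetic being described:

from datetime import date

def yyyymmdd_to_epoch_days(d):
    # convert an integer like 20140820 into days since 1970-01-01
    y, m, day = d // 10000, (d // 100) % 100, d % 100
    return (date(y, m, day) - date(1970, 1, 1)).days

print(yyyymmdd_to_epoch_days(20140904) - yyyymmdd_to_epoch_days(20140820))  # 15 real days
print(20140904 - 20140820)                                                  # 84 with naive subtraction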
Grace and peace,
Jim

Finding smallest set of criteria for uniqueness

I have a collection of objects with properties. I want to find the simplest set of criteria that will specify exactly one of these objects (I do not care which one).
For example, given {a=1, b=1, c=1}, {a=1, b=2, c=1}, {a=1, b=1, c=2}, specifying b==2 (or c==2) will give me a unique object.
Likewise, given {a=1, b=1, c=1}, {a=1, b=2, c=2}, {a=1, b=2, c=1}, specifying b==2 and c==2 (or b==1 && c==1, or b==2 && c==1) will give me a unique object.
This sounds like a known problem, with a known solution, but I haven't been able to find the correct formulation of the problem to allow me to Google it.
It is indeed a known problem in AI: feature selection. There are many algorithms for doing this; just Google "feature selection" "artificial intelligence".
The main issue is that when the samples set is large, you need to use some sort of heuristics in order to reach a solution within a reasonable time.
Feature Selection in Data Mining: "The main idea of feature selection is to choose a subset of input variables by eliminating features with little or no predictive information."
The freedom of choosing the target is sort of unusual. If the target is specified, then this is essentially the set cover problem. Here are two corresponding instances side by side.
A={1,2,3} B={2,4} C={3,4} D={4,5}
0: {a=0, b=0, c=0, d=0} # separate 0 from the others
1: {a=1, b=0, c=0, d=0}
2: {a=1, b=1, c=0, d=0}
3: {a=1, b=0, c=1, d=0}
4: {a=0, b=1, c=1, d=1}
5: {a=0, b=0, c=0, d=1}
While set cover is NP-hard, your problem has an O(m^(log n + O(1)) * poly(n)) algorithm, where m is the number of attributes and n is the number of items (the optimal set of criteria has size at most log n), which makes it rather unlikely that an NP-hardness proof is forthcoming. I'm reminded of the situation with the Junta problem (basically the theoretical formulation of feature selection).
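Because the optimal criteria set is small, even a plain brute force over attribute subsets is workable for modest numbers of attributes. A rough Python sketch (the function and its structure are mine, not from the answer above):

from itertools import combinations
from collections import Counter

def smallest_unique_criteria(objects):
    # try attribute subsets in increasing size and return the first
    # (attribute -> value) assignment matched by exactly one object
    attrs = sorted(objects[0])
    for k in range(1, len(attrs) + 1):
        for subset in combinations(attrs, k):
            counts = Counter(tuple(o[a] for a in subset) for o in objects)
            for values, n in counts.items():
                if n == 1:
                    return dict(zip(subset, values))
    return None   # objects are indistinguishable

objs = [{'a': 1, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 1}, {'a': 1, 'b': 1, 'c': 2}]
print(smallest_unique_criteria(objs))   # {'b': 2}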
I don't know how easily this could be translated into an algorithm, but using SQL, which is already set-based, it could go like this:
construct a table with all possible combinations of columns from the input table
select the combinations whose number of distinct values equals the number of records in the input table.
SQL Script
;WITH q (a, b, c) AS (
SELECT '1', '1', '1'
UNION ALL SELECT '1', '2', '2'
UNION ALL SELECT '1', '2', '1'
UNION ALL SELECT '1', '1', '2'
)
SELECT col
FROM (
SELECT val = a, col = 'a' FROM q
UNION ALL SELECT b, 'b' FROM q
UNION ALL SELECT c, 'c' FROM q
UNION ALL SELECT a+b, 'a+b' FROM q
UNION ALL SELECT a+c, 'a+c' FROM q
UNION ALL SELECT b+c, 'b+c' FROM q
UNION ALL SELECT a+b+c, 'a+b+c' FROM q
) f
GROUP BY
col
HAVING
COUNT(DISTINCT (val)) = (SELECT COUNT(*) FROM q)
Your problem can be defined as follows:
1 1 1 -> A
1 2 1 -> B
1 1 2 -> C
.
.
where 1 1 1 is called the feature vector and A is the object class. You can then use decision trees (with pruning) to find a set of rules to classify objects. So, if your objective is to automatically decide on the set of criteria that identifies object A, you can read off the path in the decision tree that leads to A.
If you have access to MATLAB, it is really easy to obtain a decision tree for your data.
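If MATLAB isn't at hand, a comparable sketch in Python with scikit-learn (using scikit-learn here is my suggestion; the answer itself only mentions MATLAB):

from sklearn.tree import DecisionTreeClassifier, export_text

# feature vectors and object classes from the example above
X = [[1, 1, 1], [1, 2, 1], [1, 1, 2]]
y = ['A', 'B', 'C']

clf = DecisionTreeClassifier().fit(X, y)
# the printed rules show which feature tests lead to each class
print(export_text(clf, feature_names=['a', 'b', 'c']))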

How to decide on weights?

For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per one given date (the sum of all weights = 1).
The basic idea is that the closest date to today's date should receive the highest weight, the second closest date will get the second highest weight, and so on...
Any ideas?
Thanks in advance!
First, for each date in your input set, compute the amount of time between that date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, invert the values so that their relative order is reversed (the closest date gets the largest value). The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to invert the values. Think about what the nature of the weights should be (do you want noticeable differences between two far-away dates, or should two far-away dates get roughly equal weights? Should a date very close to today get an extremely larger weight or just a reasonably larger weight?).
Note that you should choose an inverting procedure that cannot divide by zero. In the example above, today's value is 0, so dividing by val[i] fails. One method to avoid division by zero is called smoothing. The most trivial way to "smooth" your data is add-one smoothing, where you just add one to each value (so today becomes 1, tomorrow becomes 2, next week becomes 8, etc.).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i
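A small Python sketch of this recipe (the function name, the 1/val inversion, and the example dates are illustrative choices, not prescribed by the answer):

from datetime import date

def date_weights(dates, today=None):
    # nearer dates get larger weights; add-one smoothing avoids division
    # by zero when a date equals today
    today = today or date.today()
    vals = [1.0 / (abs((today - d).days) + 1) for d in dates]
    total = sum(vals)
    return [v / total for v in vals]

dates = [date(2024, 1, 1), date(2024, 1, 10), date(2024, 1, 14)]
print(date_weights(dates, today=date(2024, 1, 15)))   # ~[0.09, 0.23, 0.68], sums to 1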
Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10 or whatever you need - these value can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter

# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))

weights, stepping = [], 10
for date in sorted(past_dates):
    weights.append( (date, stepping) )
    stepping += 10

sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]

pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]
How to weight: just compute the difference between each date and the current date:
x(i) = abs(date(i) - current_date)
You can then use different expressions to assign weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2)
Use a Gaussian distribution - more complicated, not recommended.
Then use the normalized weights w(i)/sum(w(i)) so that the sum is 1.
(Note that exponential weighting is commonly used by statisticians in survival analysis.)
The first thing that comes to my mind is to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2,
2 days ago would be 1/4,
and so on. (With a finite set of dates, renormalize at the end so the weights still sum to exactly one.)
Assign weights equal to Ni / D, where:
i is the index of the i-th date,
D0 is the first date,
Ni is the difference in days between the i-th date and the first date D0,
D is the normalization factor (the sum of all the Ni, so the weights sum to 1).
Convert the dates to yyyymmddhhmiss format (24-hour), sum all these values to get a total, divide each value by the total, and sort by this value.
declare @Data table
(
  Date bigint,
  Weight float
)
declare @sumTotal decimal(18,2)

insert into @Data (Date)
select top 100
  replace(replace(replace(convert(varchar,Datetime,20),'-',''),':',''),' ','')
from Dates

select @sumTotal=sum(Date)
from @Data

update @Data set
  Weight=Date/@sumTotal

select * from @Data order by 2 desc
