Alternative to using ungroup in kdb? - performance

I have two tables in KDB.
One is a timeseries with datetime and sym columns (spanning multiple dates; it could be 1mm or 2mm rows). Each timepoint has the same number of syms, plus a few other standard columns such as price.
Let's call this t1:
`date`datetime`sym`price
The other table is of this structure:
`date`sym`factors`weights
where, for each sym, factors is a list and weights is a list of the same length.
Let's call this t2.
I'm doing a left join on these two tables and then an ungroup.
The number of factors/weights is not the same for every sym; it varies from sym to sym.
I'm doing the following:
select sum (weights*price) by date, factors from ungroup t1 lj `date`sym xkey t2
However, this is very slow; it can take 5-6 seconds when t1 has a million rows or more.
Calling all kdb experts for some advice!
EDIT:
here's a full example:
(apologies for the roundabout way of defining t1 and t2)
interval: `long$`time$00:01:00;
hops: til 1+ `int$((`long$(et:`time$17:00)-st:`time$07:00))%interval;
times: st + `long$interval*hops;
dates: .z.D - til .z.D-.z.D-10;
timepoints: ([] date: dates) cross ([] time:times);
syms: ([] sym: 300?`5);
universe: timepoints cross syms;
t1: update datetime: date+time, price:count[universe]?100.0 from universe;
t2: ([] date:dates) cross syms;
/ note: my real-life t2 doesn't have exactly 10 weights/factors for each sym; the count can vary by sym.
t2: `date`sym xkey update factors: count[t2]#enlist 10?`5, weights: count[t2]#enlist 10?10 from t2;
/ what is slow is the ungroup
select sum weights*price by date, datetime, factors from ungroup t1 lj t2

One approach to avoiding the ungroup is to work with matrices (i.e. lists of lists) and take advantage of the optimised matrix multiply $ (mmu) documented here: https://code.kx.com/q/ref/mmu/
In my approach below, instead of joining t2 to t1 and ungrouping, I group t1 and join it to t2 (thus keeping everything as lists of lists), then use some matrix manipulation, with a final ungroup at the end on a much smaller table.
q)\ts res:select sum weights*price by date, factors from ungroup t1 lj t2
4100 3035628112
q)\ts resT:ungroup exec first factors,sum each flip["f"$weights]$price by date:date from t2 lj select price by date,sym from t1;
76 83892800
q)(0!res)~`date`factors xasc `date`factors`weights xcol resT
1b
As you can see, it's much quicker (at least on my machine), and the result is identical apart from ordering and column names.
You may still need to modify this solution somewhat for your actual use case (with variable numbers of weights etc.); in that case you could perhaps enforce a uniform number of weights across syms, padding with zeros where necessary.
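To see the linear-algebra idea outside of q, here is a rough Python/NumPy sketch with hypothetical data and made-up names; it assumes, as in the example above, that every sym on a given date carries the same factor list. Stacking one date's per-sym weight lists into a matrix and its prices into a vector turns the grouped sum into a single matrix product, the same trick the answer above exploits with the optimised $ (mmu), so no row expansion is needed.

import numpy as np

# Hypothetical stand-in for one date's worth of data: 3 syms, each
# carrying the same 4 factors with its own weights (all names made up).
factors = ["f1", "f2", "f3", "f4"]
weights = np.array([[0.1, 0.2, 0.3, 0.4],    # sym A's weight per factor
                    [0.5, 0.1, 0.1, 0.3],    # sym B
                    [0.2, 0.2, 0.2, 0.4]])   # sym C
price = np.array([101.0, 99.5, 100.25])      # one price per sym

# "ungroup" style: expand every (sym, factor) pair into its own row,
# multiply, then sum back up by factor
by_factor_slow = {f: 0.0 for f in factors}
for w_row, p in zip(weights, price):
    for f, w in zip(factors, w_row):
        by_factor_slow[f] += w * p

# matrix style: one matrix-vector product per date, no row expansion
by_factor_fast = weights.T @ price           # shape (n_factors,)

assert np.allclose([by_factor_slow[f] for f in factors], by_factor_fast)
print(dict(zip(factors, by_factor_fast)))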

Related

Indexing Strategy in Oracle

I have a table with 2 million rows.
The NDV (number of distinct values) of the columns is as follows:
A - 3
B - 60
D - 150
E - 600,000
The most frequently updated columns are A & B ( NDV = 3 for both ).
Assuming every query will have either column D or column E in the WHERE clause, which of the following will be the best set of indexes for SELECT statements:
D
D,E,A
E,A
A,E
Not really enough information to give a definitive assessment, but some things to consider:
You're unlikely to get a skip scan benefit, so if you want snappy
response from predicates with leading E or leading D, that will be 2
indexes. (One leading with D, and one leading with E).
If A/B are updated frequently (although that's a generic term),
you might choose to leave them out of the index definition in
order to reduce index maintenance overhead.

How does Excel Solver compare 2 [n*n] matrices in a constraint

Suppose I have 2 tables t1, t2 (or matrices) that are both square (e.g. both are 3x3).
In Solver I add the following constraint :
t1 >= t2
Then how does Solver compare these values?
- The value at (1,1) in t1 >= the value at (1,1) in t2, (1,2) in t1 >= (1,2) in t2, ...
- Any value in t1 must be >= the largest value in t2
- ...
If it is not the first, how can I obtain the first behaviour? Do I have to enter every value comparison by hand (since that would take quite some time)?
It makes the comparison element-wise. You can confirm by getting the "Answer Report".
Here are the matrices in C5:G9 and J5:N9
You add the constraint stating C5:G9 <= (or >=) J5:N9
As you can see from the formula column, it makes the comparisons element-wise (C5<J5, D5<K5, ..., G9<N9).
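If you want to see what this element-wise interpretation amounts to outside of Solver, here is a small Python/NumPy sketch with made-up 3x3 matrices standing in for the two worksheet ranges:

import numpy as np

# Made-up 3x3 matrices standing in for the two worksheet ranges
t1 = np.array([[5, 7, 9],
               [4, 6, 8],
               [3, 5, 7]])
t2 = np.array([[1, 2, 3],
               [4, 5, 6],
               [2, 4, 6]])

# Element-wise comparison: one boolean per cell, matching the per-cell
# rows Solver lists in its Answer Report
cellwise = t1 >= t2
print(cellwise)

# The constraint "t1 >= t2" holds only if every single cell passes
print(bool(cellwise.all()))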

How to find a series of transactions happening in a range of time?

I have a dataset with nodes that are companies linked by transactions.
A company has these properties : name, country, type, creation_date
The relationships "SELLS_TO" have these properties : item, date, amount
All dates are in the following format YYYYMMDD.
I'm trying to find a series of transactions that :
- include 2 companies from 2 distinct countries
- where, somewhere between the first node in the series and the last one, there is a company that was created less than 90 days ago
- where the total time between the first transaction and the last transaction is < 15 days
I think I can handle the conditions 1) and 2) but I'm stuck on 3).
MATCH (a:Company)-[r:SELLS_TO]->(b:Company)-[v:SELLS_TO*]->(c:Company)
WHERE NOT(a.country = c.country) AND (b.creation_date + 90 < 20140801)
Basically I don't know how to get the date of the last transaction in the series. Anyone knows how to do that?
jvilledieu,
In answer to your most immediate question, you can access the collections of nodes and relationships in the matched path and get the information you need. The query would look something like this.
MATCH p=(a:Company)-[rs:SELLS_TO*]->(c:Company)
WHERE a.country <> c.country
WITH p, a, c, rs, nodes(p) AS ns
WITH p, a, c, rs, filter(n IN ns WHERE n.creation_date - 20140801 < 90) AS bs
WITH p, a, c, rs, head(bs) AS b
WHERE NOT b IS NULL
WITH p, a, b, c, head(rs) AS r1, last(rs) AS rn
WITH p, a, b, c, r1, rn, rn.date - r1.date AS d
WHERE d < 15
RETURN a, b, c, d, r1, rn
This query finds a chain with at least one :SELLS_TO relationship between :Company nodes and assigns the matched path to 'p'. The match is then limited to cases where the first and last company have different countries. At this point the WITH clauses develop the other elements that you need. The collection of nodes in the path is obtained and named 'ns'. From this, a collection of nodes where the creation date is less than 90 days from the target date is found and named 'bs'. The first node of the 'bs' collection is then found and named 'b', and the match is limited to cases where a 'b' node was found. The first and last relationships are then found and named 'r1' and 'rn'. After this, the difference in their dates is calculated and named 'd'. The match is then limited to cases where d is less than 15.
So that gives you an idea of how to do this. There is another problem though. At least, in the way you have described the problem, you will find that the date math will fail. Dates that are represented as numbers, such as 20140801, are not linear, and thus cannot be used for interval math. As an example, 15 days from 20140820 is 20140904. If you subtract these two date 'numbers', you get 84. One example of how to do this is to represent your dates as days since an epoch date.
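As a concrete illustration of the epoch-day idea (plain Python outside Neo4j, just to show the arithmetic; the helper name is my own), converting the YYYYMMDD integers to day counts makes interval math behave:

from datetime import date

def yyyymmdd_to_days(n, epoch=date(1970, 1, 1)):
    # Convert an integer like 20140820 into days since an epoch date
    d = date(n // 10000, (n // 100) % 100, n % 100)
    return (d - epoch).days

# Subtracting the raw integers gives nonsense (84 "days"):
print(20140904 - 20140820)                                        # 84

# Subtracting epoch-day values gives the real interval (15 days):
print(yyyymmdd_to_days(20140904) - yyyymmdd_to_days(20140820))    # 15

If you store that day count on the nodes and relationships (say, as a hypothetical epoch_day property) instead of the raw YYYYMMDD number, the rn.date - r1.date < 15 comparison in the query above becomes a genuine 15-day test.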
Grace and peace,
Jim

Pearson Correlation from multiple rows

I want to calculate the Pearson Correlation between two arrays.
The function CORR only accepts two columns, which have to be in a table. In my procedure I select multiple rows of numbers from two different sets and I want to calculate the correlation between them.
EDIT:
The CORR function is an Oracle function which calculates the Pearson correlation between two columns. Here is the problem: I want to calculate the correlation between two arrays, which would tell me, for example, that array1 is 50% similar to array2.
You can simply calculate the average of the pairwise correlations:
select
    (abs(corr1) + abs(corr2) + abs(corr3))/3 as Avg_Corr
from (
    SELECT
        CORR(a.col1, b.col1) as corr1,
        CORR(a.col2, b.col2) as corr2,
        CORR(a.col3, b.col3) as corr3
    FROM table1 a, table2 b
    WHERE a.id = b.id
)
or use a more complex but more adequate generalization of the Pearson correlation (there is no built-in function in Oracle for this).
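If you can pull the two arrays out of the database (or just want to sanity-check the SQL), here is a small Python/NumPy sketch with made-up arrays that computes the plain Pearson correlation between two equal-length arrays directly:

import numpy as np

# Made-up example arrays; they must have the same length
array1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
array2 = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two arrays
r = np.corrcoef(array1, array2)[0, 1]
print(r)                               # close to 1.0 here

# Expressed as a rough "similarity percentage" as in the question
print(f"{abs(r) * 100:.0f}% similar")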

Leads to find tables correlation

I have these two tables:
TableA
ID  Opt1  Opt2  Type
1   A     Z     10
2   B     Y     20
3   C     Z     30
4   C     K     40
and
TableB
ID  Opt1  Type
1   Z     57
2   Z     99
3   X     3000
4   Z     3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of using Apriori in some way, but it doesn't seem too practical. What do you guys say?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
1. for each column c1 in table1
2.     for each column c2 in table2
3.         if approximately_isomorphic(c1, c2) then
4.             emit (c1, c2)
approximately_isomorphic(c1, c2)
1. pairs = empty set
2. for i = 1 to min(|c1|, |c2|) do
3.     add the pair (c1[i], c2[i]) to pairs
4. if |pairs| - unique_count(c1) < error_margin then return true
5. else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, collect the set of distinct linkings between corresponding elements of the two columns. If that set contains the same number of linkings as there are unique elements in the first column, then you have a perfect isomorphism; if it has a few more, you have a near isomorphism; if it has many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
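A rough, runnable Python version of the same idea (the tables are the ones from the question; the error margin is an arbitrary placeholder):

def approximately_isomorphic(c1, c2, error_margin=2):
    # Columns are close to isomorphic if the number of distinct
    # (c1, c2) pairs barely exceeds the number of distinct c1 values
    n = min(len(c1), len(c2))
    pairs = {(c1[i], c2[i]) for i in range(n)}
    return len(pairs) - len(set(c1[:n])) < error_margin

def find_related_columns(table1, table2):
    # table1 / table2: dicts mapping column name -> list of values
    for name1, col1 in table1.items():
        for name2, col2 in table2.items():
            if approximately_isomorphic(col1, col2):
                yield name1, name2

table_a = {"ID": [1, 2, 3, 4], "Opt1": ["A", "B", "C", "C"],
           "Opt2": ["Z", "Y", "Z", "K"], "Type": [10, 20, 30, 40]}
table_b = {"ID": [1, 2, 3, 4], "Opt1": ["Z", "Z", "X", "Z"],
           "Type": [57, 99, 3000, 3000]}

print(list(find_related_columns(table_a, table_b)))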
Example on your input:
ID & anything   : perfect isomorphism, since all values of ID are unique
Opt1 & ID       : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt1 & Opt1     : ditto above
Opt1 & Type     : 3 mappings and 3 unique values; perfect isomorphism
Opt2 & ID       : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt2 & Opt1     : ditto above
Opt2 & Type     : ditto above
Type & anything : perfect isomorphism, since all values of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
