I have a table with 2 million rows.
The NDV (number of distinct values) of each column is as follows:
A - 3
B - 60
D - 150
E - 600,000
The most frequently updated columns are A and B (NDV = 3 and 60 respectively).
Assuming every query will have either column D or column E in the WHERE clause, which of the following would be the best set of indexes for SELECT statements:
D
D,E,A
E,A
A,E
Not really enough information to give a definitive assessment, but some things to consider:
- You're unlikely to get a skip-scan benefit, so if you want snappy response from predicates with a leading E or a leading D, that means two indexes: one leading with D and one leading with E.
- If A/B are updated frequently (although "frequently" is a vague term), you might choose to leave them out of the index definitions in order to reduce index maintenance overhead.
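As a rough back-of-the-envelope check (a Python sketch, assuming roughly uniform value distributions, which the question doesn't state), you can compare how many rows an equality predicate on each column would leave behind:

rows = 2_000_000
for col, ndv in [("A", 3), ("B", 60), ("D", 150), ("E", 600_000)]:
    # expected rows matching an equality predicate on this column alone
    print(col, rows // ndv)
# A ~666,666  B ~33,333  D ~13,333  E ~3
# So a predicate on E is extremely selective on its own, while D alone
# still leaves roughly 13k rows per value.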
I have two tables in KDB.
One is a time series with datetime and sym columns (spanning multiple dates; it could be 1mm or 2mm rows). Each timepoint has the same number of syms and a few other standard columns such as price.
Let's call this t1:
`date`datetime`sym`price
The other table is of this structure:
`date`sym`factors`weights
where factors is a list and weights is a list of equal length for each sym.
Let's call this t2.
I'm doing a left join on these two tables and then an ungroup.
factors and weights do not have the same length for every sym; the length varies from sym to sym (though for a given sym, factors and weights are the same length).
I'm doing the following:
select sum (weights*price) by date, factors from ungroup t1 lj `date`sym xkey t2
However this is very slow and can be as slow as 5-6 seconds if t1 has a million rows or more.
Calling all kdb experts for some advice!
EDIT:
here's a full example:
(apologies for the roundabout way of defining t1 and t2)
interval: `long$`time$00:01:00;
hops: til 1+ `int$((`long$(et:`time$17:00)-st:`time$07:00))%interval;
times: st + `long$interval*hops;
dates: .z.D - til .z.D-.z.D-10;
timepoints: ([] date: dates) cross ([] time:times);
syms: ([] sym: 300?`5);
universe: timepoints cross syms;
t1: update datetime: date+time, price:count[universe]?100.0 from universe;
t2: ([] date:dates) cross syms;
/ note: my real-life t2 doesn't have exactly 10 weights/factors for each sym; the count can vary by sym.
t2: `date`sym xkey update factors: count[t2]#enlist 10?`5, weights: count[t2]#enlist 10?10 from t2;
/ what is slow is the ungroup
select sum weights*price by date, datetime, factors from ungroup t1 lj t2
One approach to avoid the ungroup is to work with matrices (i.e. lists of lists) and take advantage of the optimised matrix multiply $ (mmu) described here: https://code.kx.com/q/ref/mmu/
In my approach below, instead of joining t2 to t1 and ungrouping, I group t1 and join it to t2 (thus keeping everything as lists of lists) and then use some matrix manipulation, with a final ungroup at the end on a much smaller set.
q)\ts res:select sum weights*price by date, factors from ungroup t1 lj t2
4100 3035628112
q)\ts resT:ungroup exec first factors,sum each flip["f"$weights]$price by date:date from t2 lj select price by date,sym from t1;
76 83892800
q)(0!res)~`date`factors xasc `date`factors`weights xcol resT
1b
As you can see it's much quicker (at least on my machine), and the result is identical apart from ordering and column names.
You may still need to modify this solution somewhat for your actual use case (with variable-length weights etc.); in that case, perhaps enforce a uniform number of weights across syms, filling with zeros where necessary.
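As a language-agnostic illustration of why grouping plus matrix multiply helps (a rough numpy sketch, not q, assuming the simplified case from the example where every sym shares one fixed-length factor list): the per-factor totals for a date come from a single matrix multiply instead of exploding rows with ungroup.

import numpy as np

n_syms, n_times, n_factors = 300, 601, 10      # one date: 300 syms, minutes 07:00-17:00, 10 factors
prices = np.random.rand(n_syms, n_times)       # price per sym per timepoint
weights = np.random.rand(n_syms, n_factors)    # weight per sym per factor

# sum of weights*price by factor: (factors x syms) @ (syms x times), then sum over time
per_factor_totals = (weights.T @ prices).sum(axis=1)
print(per_factor_totals.shape)                 # (10,)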
I'm very confused when it comes to the topic of cardinality in Relational Algebra. I understand that cardinality essentially refers to the uniqueness of a table or data set. So I'll walk through a problem I attempted to solve and maybe someone can help me out, or give me better resources than the ones I've found.
I've got a table R2, with attributes D, E, and F, where D is a Primary Key, and E and F are Foreign Keys relating to the Primary Keys of the following table. Table R3, with attributes G, H, and I, where G and H are PKs. R2 has cardinality N2 = 100, R3 has cardinality N3 = 200. So what would the min and max cardinality be of a table created by joining R2 to R3 with the condition that E = G and F = H?
My answer is that the minimum is 1, and max is 200, or N3. My thought process is that E and F are FKs, so they can have many repeating values so long as they come from G and H, but since G and H are PKs, at least one value for E and F would be unique, and D is a PK as well, so at least one value is unique there too. So I assume those unique values mean the cardinality must be at least 1, and at most, it can have the same cardinality as R3, which is 200. But honestly, my own reasoning doesn't even make sense to me...
The whole idea seems really abstract to me. Attribute I is the only non FK/PK in the problem, so how does that affect the cardinality? Sorry for the long winded question, I'm just very confused by the whole idea of this and would love any help in general regarding the subject.
You are not equijoining FK-to-CK (foreign key to candidate key); you are equijoining on EF subtuples matching GH subtuples. Although every E value appears as some G and every F value appears as some H, a given EF pair does not have to match any GH pair. G and H are each unique, so GH is unique, so each EF can match at most one GH; hence there could be anywhere from 0 to 100 rows in the result.
If you want to make sound analyses, you need to work out the minimum and maximum result sizes for the various kinds of joins on column sets that reference (i.e. have to appear elsewhere as) others. You can handle more cases by dealing with superkeys (unique column sets) rather than CKs (candidate keys, i.e. superkeys containing no smaller superkeys). You mean CK when you say "PK" (primary key); there can be at most one PK per table. For tables with no duplicates or nulls, SQL UNIQUE declares a superkey and FOREIGN KEY declares a foreign superkey.
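A tiny Python sketch of the two extremes (scaled down from N2 = 100 and N3 = 200; the values are made up), showing that the join can produce anywhere from 0 rows up to |R2| rows while E and F still reference existing G and H values:

# R3(G, H, I) with (G, H) unique
R3 = [(1, 10, "p"), (2, 20, "q"), (3, 30, "r")]

# every (E, F) pair matches exactly one (G, H) pair -> |R2| output rows
R2_max = [(d, 1, 10) for d in range(5)]
# E and F each reference existing G and H values, but no (E, F) pair
# equals any (G, H) pair -> 0 output rows
R2_min = [(d, 1, 20) for d in range(5)]

def join(r2, r3):
    return [(d, e, f, g, h, i)
            for (d, e, f) in r2
            for (g, h, i) in r3
            if e == g and f == h]

print(len(join(R2_max, R3)), len(join(R2_min, R3)))  # 5 0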
I have a dataset with nodes that are companies linked by transactions.
A company has these properties: name, country, type, creation_date
The relationships "SELLS_TO" have these properties: item, date, amount
All dates are in the following format YYYYMMDD.
I'm trying to find a series of transactions that:
1) includes 2 companies from 2 distinct countries
2) contains, somewhere between the first node and the last node in the series, a company that was created less than 90 days ago
3) has a total time between the first transaction and the last transaction of less than 15 days
I think I can handle conditions 1) and 2), but I'm stuck on 3).
MATCH (a:Company)-[r:SELLS_TO]->(b:Company)-[v:SELLS_TO*]->(c:Company)
WHERE NOT(a.country = c.country) AND (b.creation_date + 90 < 20140801)
Basically I don't know how to get the date of the last transaction in the series. Anyone knows how to do that?
jvilledieu,
In answer to your most immediate question, you can access the collections of nodes and relationships in the matched path and get the information you need. The query would look something like this.
MATCH p=(a:Company)-[rs:SELLS_TO*]->(c:Company)
WHERE a.country <> c.country
WITH p, a, c, rs, nodes(p) AS ns
WITH p, a, c, rs, filter(n IN ns WHERE n.creation_date - 20140801 < 90) AS bs
WITH p, a, c, rs, head(bs) AS b
WHERE NOT b IS NULL
WITH p, a, b, c, head(rs) AS r1, last(rs) AS rn
WITH p, a, b, c, r1, rn, rn.date - r1.date AS d
WHERE d < 15
RETURN a, b, c, d, r1, rn
This query finds a chain with at least one :SELLS_TO relationship between :Company nodes and assigns the matched path to 'p'. The match is then limited to cases where the first and last company have different countries. At this point the WITH clauses develop the other elements that you need. The collection of nodes in the path is obtained and named 'ns'. From this, a collection of nodes where the creation date is less than 90 days from the target date is found and named 'bs'. The first node of the 'bs' collection is then found and named 'b', and the match is limited to cases where a 'b' node was found. The first and last relationships are then found and named 'r1' and 'rn'. After this, the difference in their dates is calculated and named 'd'. The match is then limited to cases where d is less than 15.
So that gives you an idea of how to do this. There is another problem though. At least, in the way you have described the problem, you will find that the date math will fail. Dates that are represented as numbers, such as 20140801, are not linear, and thus cannot be used for interval math. As an example, 15 days from 20140820 is 20140904. If you subtract these two date 'numbers', you get 84. One example of how to do this is to represent your dates as days since an epoch date.
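For example, here is a small Python sketch (assuming you can preprocess the dates before loading them into Neo4j; the function name is just illustrative) that converts YYYYMMDD integers to days since an epoch, so that subtraction gives real day intervals:

from datetime import date

EPOCH = date(1970, 1, 1)

def yyyymmdd_to_epoch_days(yyyymmdd):
    # convert an integer date like 20140820 to days since 1970-01-01
    y, m, d = yyyymmdd // 10000, (yyyymmdd // 100) % 100, yyyymmdd % 100
    return (date(y, m, d) - EPOCH).days

print(yyyymmdd_to_epoch_days(20140904) - yyyymmdd_to_epoch_days(20140820))  # 15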
Grace and peace,
Jim
Let's say I have a dataset with the following schema:
ItemName (String), Length (long)
I need to find items that are duplicates based on their length. That's pretty easy to do in PIG:
raw_data = LOAD...dataset
grouped = GROUP raw_data BY Length;
items = FOREACH grouped GENERATE COUNT(raw_data) AS count, raw_data.ItemName;
dups = FILTER items BY count > 1;
STORE dups....
The above finds exact duplicates. Given the set below:
a, 100
b, 105
c, 100
It will output 2, (a,c)
Now I need to find duplicates using a threshold. For example, a threshold of 5 would mean items match if their lengths are within +/- 5 of each other. So the output should look like:
3, (a,b,c)
Any ideas how I can go about doing this?
It is almost like I want PIG to use a UDF as its comparator when it is comparing records during its join...
I think the only way to do what you want is to load the data into two tables and do a Cartesian join of the data set onto itself, so that each value can be compared to every other value.
Pseudo-code:
r1 = load dataset
r2 = load dataset
rcross = cross r1, r2
rcross is a cartesian product that will allow you to check the difference in length between each pair.
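To make that concrete, here is a rough Python sketch of the same cross-and-filter idea (plain Python rather than Pig, using the sample data from the question), pairing every record with every other and keeping pairs whose lengths differ by at most the threshold:

from itertools import product

records = [("a", 100), ("b", 105), ("c", 100)]
THRESHOLD = 5

pairs = [(n1, n2)
         for (n1, l1), (n2, l2) in product(records, records)
         if n1 < n2 and abs(l1 - l2) <= THRESHOLD]
print(pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]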
I was solving a similar problem once and came up with one crazy and dirty solution.
It is based on the following lemma:
If |a - b| < r, then there exists an integer x with 0 <= x < r such that
floor((a+x)/r) = floor((b+x)/r)
(from here on, division means integer division and I will omit the floor() function, i.e. 5/2 = 2)
This lemma is fairly obvious, so I'm not going to prove it here. For example, with a = 100, b = 105 and r = 6, taking x = 2 gives (100+2)/6 = (105+2)/6 = 17.
Based on this lemma you can do the following join:
RESULT = JOIN A BY A.len / r, B BY B.len / r
This gets you some, but not all, of the pairs satisfying |A.len - B.len| < r.
But by doing this r times:
RESULT0 = JOIN A BY A.len / r, B BY B.len / r
RESULT1 = JOIN A BY (A.len+1) / r, B BY (B.len+1) / r
...
RESULT{r-1} = JOIN A BY (A.len+r-1) / r, B BY (B.len+r-1) / r
you will get all the needed pairs. Of course you will get more rows than you need (the same pair can show up in several of the joins), but as I said already it's a dirty solution (i.e. it's not optimal, but it works).
The other big disadvantage of this solution is that the JOINs have to be written dynamically, and their number gets big for big r.
Still, it works if you know r in advance and it is rather small (like r = 6 in your case, since a threshold of +/- 5 means |a - b| < 6).
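As an illustration of the trick in plain Python (not Pig; the record list is the sample data from the question): each of the r rounds buckets both sides on (len + x) / r, and the union of the per-round matches covers every pair within the threshold.

from collections import defaultdict

records = [("a", 100), ("b", 105), ("c", 100)]
r = 6  # threshold of 5 -> |a - b| < 6

matches = set()
for x in range(r):
    buckets = defaultdict(list)
    for name, length in records:
        buckets[(length + x) // r].append((name, length))
    # pairs that land in the same bucket in this round
    for group in buckets.values():
        for i, (n1, l1) in enumerate(group):
            for n2, l2 in group[i + 1:]:
                matches.add(tuple(sorted((n1, n2))))

print(sorted(matches))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]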
Hope it helps
I have these two tables:
TableA
ID Opt1 Opt2 Type
1 A Z 10
2 B Y 20
3 C Z 30
4 C K 40
and
TableB
ID Opt1 Type
1 Z 57
2 Z 99
3 X 3000
4 Z 3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of using Apriori in some way, but it doesn't seem too practical. What do you guys say?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
for each column c1 in table1
    for each column c2 in table2
        if approximately_isomorphic(c1, c2) then
            emit (c1, c2)

approximately_isomorphic(c1, c2)
    hmap = hash()    (maps each c1 value to the set of c2 values it pairs with)
    for i = 1 to min(|c1|, |c2|) do
        add c2[i] to the set hmap[c1[i]]
    linkings = total number of links in hmap (the sum of the set sizes)
    if linkings - unique_count(c1) < error_margin then return true
    else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains the same number of linkings as unique elements of the first column, then you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
Example on your input:
ID & anything : perfect isomorphism, since all values of ID are unique
Opt1 & ID : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt1 & Opt1 : ditto above
Opt1 & Type : 3 mappings and 3 unique values; a perfect isomorphism
Opt2 & ID : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt2 & Opt1 : ditto above
Opt2 & Type : ditto above
Type & anything : perfect isomorphism, since all values of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
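If it helps, here is a rough Python sketch of the procedure above (the tables are hard-coded from your example, and error_margin is a tunable assumption):

from collections import defaultdict

table1 = {"ID": [1, 2, 3, 4],
          "Opt1": ["A", "B", "C", "C"],
          "Opt2": ["Z", "Y", "Z", "K"],
          "Type": [10, 20, 30, 40]}
table2 = {"ID": [1, 2, 3, 4],
          "Opt1": ["Z", "Z", "X", "Z"],
          "Type": [57, 99, 3000, 3000]}

def approximately_isomorphic(c1, c2, error_margin=1):
    # count distinct (c1 value, c2 value) linkings; a perfect isomorphism has
    # exactly as many linkings as distinct values in c1
    links = defaultdict(set)
    for a, b in zip(c1, c2):
        links[a].add(b)
    linkings = sum(len(s) for s in links.values())
    return linkings - len(set(c1)) < error_margin

for name1, c1 in table1.items():
    for name2, c2 in table2.items():
        if approximately_isomorphic(c1, c2):
            print(name1, "~", name2)
# with error_margin=1 only the perfect isomorphisms are reported,
# including Opt1 ~ Type (the C -> 3000 relation); raise the margin to
# surface the near isomorphisms as well.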