Oracle hierarchical SQL - cycle clause or collection of seen elements?

I have an undirected and potentially disconnected graph represented as a table of edges.
I need to return the list of all edges reachable from a given initial set of edges.
This is a common task that can be found on many web sites, and the recursive query with a cycle clause appears in many tutorials.
What particularly occupies my mind is:
In what respect is the cycle clause better than detecting cycles "manually"?
Example:
    1
 1-----2
 |    /|
 |   / |
3|  /5 |2
 | /   |
 |/    |
 3-----4
    4
with graph (a, id, b) as (
select 1, 1, 2 from dual union all
select 2, 2, 4 from dual union all
select 1, 3, 3 from dual union all
select 3, 4, 4 from dual union all
select 2, 5, 3 from dual union all
select null, null, null from dual where 0=1
) --select * from graph;
, input (id) as (
select column_value from table(sys.ku$_objnumset(2,4))
) --select * from input;
, s (l, path, dup, seen, a, id, b) as ( -- solution using set of seen edges
select 0, '/' || g.id, 1
, cast(collect(to_number(g.id)) over () as sys.ku$_objnumset)
, g.a, g.id, g.b
from graph g
where g.id in (select i.id from input i)
union all
select s.l + 1, s.path || '/' || g.id, row_number() over (partition by g.id order by null)
, s.seen multiset union distinct cast(collect(to_number(g.id)) over () as sys.ku$_objnumset)
, g.a, g.id, g.b
from s
join graph g on s.id != g.id
and g.id not member of (select s.seen from dual)
and (s.a in (g.a, g.b) or s.b in (g.a, g.b))
where s.dup = 1
)
, c (l, path, a, id, b) as ( -- solution using cycle clause
select 0, '/' || g.id
, g.a, g.id, g.b
from graph g
where g.id in (select i.id from input i)
union all
select c.l + 1, c.path || '/' || g.id
, g.a, g.id, g.b
from c
join graph g on c.id != g.id
and (c.a in (g.a, g.b) or c.b in (g.a, g.b))
)
cycle id set is_cycle to 1 default 0
--select * from s; --6 rows
--select distinct id from s order by id; --5 rows
select * from c order by l; --214 rows (!)
--select distinct id from c where is_cycle = 0 order by id; --5 rows
There are two different solutions, represented by the CTEs s and c.
In both solutions an edge is expanded from another edge if they share a common vertex.
Solution s (seen-set based) works like a flood.
It is based on collecting all edges of a particular recursion level at once, thanks to the collect() over () clause.
Input edges are on the 0th level, their neighbors on the 1st level, etc.
Each edge belongs to exactly one level.
An edge can occur multiple times on a given level because it was expanded from several edges on the parent level (for instance edge 5 in the sample graph), but these duplicates are eliminated on the next level using the dup column.
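In procedural terms, solution s is just a level-by-level (breadth-first) expansion over the edge set with a running set of seen edges; a small Python sketch of that idea (my own illustration, not a translation of the SQL):
# Edges of the sample graph: id -> (vertex a, vertex b).
edges = {1: (1, 2), 2: (2, 4), 3: (1, 3), 4: (3, 4), 5: (2, 3)}

def reachable(initial_ids):
    """Expand level by level; every edge is visited on exactly one level."""
    seen = set(initial_ids)
    frontier = set(initial_ids)
    while frontier:
        nxt = set()
        for eid, (a, b) in edges.items():
            if eid in seen:
                continue
            # Expand an unseen edge if it shares a vertex with any frontier edge.
            if any({a, b} & set(edges[f]) for f in frontier):
                nxt.add(eid)
        seen |= nxt
        frontier = nxt
    return seen

print(sorted(reachable({2, 4})))  # [1, 2, 3, 4, 5], matching "select distinct id from s"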
Solution c (cycle clause based) relies on the built-in cycle detection.
The substantial difference from solution s is in the way rows on the next recursion level are expanded.
Every row in the recursive part is aware only of the information of a single ancestor row from the previous recursion level.
Hence there are many repetitions, since the graph traversal effectively generates all distinct walks.
For instance, if the initial edges are {2,4}, neither of them is aware of the other, so edge 2 expands to edge 4 and edge 4 expands to edge 2; the same happens on further levels, where the effect is multiplied.
The cycle clause eliminates duplicates only within the ancestor chain of a given row, without regard to sibling rows.
Various sources on the web recommend postprocessing such a huge result set with distinct or an analytic function (see here).
In my experience this does not eliminate the explosion of possibilities. For a real graph with 65 edges, which is still small, query c didn't finish, while query s finished in hundreds of milliseconds.
I am interested in knowing why the cycle-based solution is favoured so much in tutorials and literature.
I prefer using standard ways and don't strive to build my own cycle-detecting solution, as I've been taught here; however, the s solution works much better for me, which makes me a little bit confused. (Note that the explain plan looks less expensive for the s solution. I also tried an Oracle-proprietary connect by based solution, which was slow too - I omit it here for brevity.)
My question is:
Do you see any substantial drawbacks of the s solution, or do you have any idea how to improve the c solution so that it avoids traversing unnecessary combinations?

My experience working with collection functions such as COLLECT, DISTINCT, UNION, MEMBER OF, etc. is that they get quite slow as the number of elements grows. You won't notice that until you start testing with really big collections, so without going into details I would say that straight away, just based on previous experience.
That does not mean it will not work. These functions are really easy to use and the code looks more readable, but they are slow compared with alternative methods.

Related

Alternative to using ungroup in kdb?

I have two tables in KDB.
One is a timeseries with datetime and sym columns (spanning multiple dates; e.g. it could be 1mm or 2mm rows). Each timepoint has the same number of syms, plus a few other standard columns such as price.
Let's call this t1:
`date`datetime`sym`price
The other table is of this structure:
`date`sym`factors`weights
where factors is a list and weights is a list of equal length for each sym.
Let's call this t2.
I'm doing a left join on these two tables and then an ungroup.
factors and weights are not of the same length for every sym (the length varies by sym).
I'm doing the following:
select sum (weights*price) by date, factors from ungroup t1 lj `date`sym xkey t2
However, this is very slow - as slow as 5-6 seconds if t1 has a million rows or more.
Calling all kdb experts for some advice!
EDIT:
here's a full example:
(apologies for the roundabout way of defining t1 and t2)
interval: `long$`time$00:01:00;
hops: til 1+ `int$((`long$(et:`time$17:00)-st:`time$07:00))%interval;
times: st + `long$interval*hops;
dates: .z.D - til .z.D-.z.D-10;
timepoints: ([] date: dates) cross ([] time:times);
syms: ([] sym: 300?`5);
universe: timepoints cross syms;
t1: update datetime: date+time, price:count[universe]?100.0 from universe;
t2: ([] date:dates) cross syms;
/ note here my real life t2, doesn't have a count of 10 weights/factors for each sym, it can vary by sym.
t2: `date`sym xkey update factors: count[t2]#enlist 10?`5, weights: count[t2]#enlist 10?10 from t2;
/ what is slow is the ungroup
select sum weights*price by date, datetime, factors from ungroup t1 lj t2
One approach to avoid the ungroup is to work with matrices (aka lists of lists) and take advantage of the optimised matrix-multiply $ seen here: https://code.kx.com/q/ref/mmu/
In my approach below, instead of joining t2 to t1 and ungrouping, I group t1 and join it to t2 (thus keeping everything as lists of lists), and then use some matrix manipulation (with a final ungroup at the end on a much smaller set).
q)\ts res:select sum weights*price by date, factors from ungroup t1 lj t2
4100 3035628112
q)\ts resT:ungroup exec first factors,sum each flip["f"$weights]$price by date:date from t2 lj select price by date,sym from t1;
76 83892800
q)(0!res)~`date`factors xasc `date`factors`weights xcol resT
1b
As you can see it's much quicker (at least on my machine) and the result is identical save for ordering and column names.
You may still need to modify this solution somewhat to work in your actual use case (with variable weights etc. - in that case, perhaps enforce a uniform number of weights across each sym, filling with zeros if necessary).

How to find a series of transactions happening in a range of time?

I have a dataset with nodes that are companies linked by transactions.
A company has these properties : name, country, type, creation_date
The relationships "SELLS_TO" have these properties : item, date, amount
All dates are in the following format YYYYMMDD.
I'm trying to find a series of transactions that:
1) includes 2 companies from 2 distinct countries
2) where, between the first node in the series and the last one, there is a company that was created less than 90 days ago
3) where the total time between the first transaction and the last transaction is < 15 days
I think I can handle conditions 1) and 2), but I'm stuck on 3).
MATCH (a:Company)-[r:SELLS_TO]->(b:Company)-[v:SELLS_TO*]->(c:Company)
WHERE NOT(a.country = c.country) AND (b.creation_date + 90 < 20140801)
Basically I don't know how to get the date of the last transaction in the series. Does anyone know how to do that?
jvilledieu,
In answer to your most immediate question, you can access the collections of nodes and relationships in the matched path and get the information you need. The query would look something like this.
MATCH p=(a:Company)-[rs:SELLS_TO*]->(c:Company)
WHERE a.country <> c.country
WITH p, a, c, rs, nodes(p) AS ns
WITH p, a, c, rs, filter(n IN ns WHERE n.creation_date - 20140801 < 90) AS bs
WITH p, a, c, rs, head(bs) AS b
WHERE NOT b IS NULL
WITH p, a, b, c, head(rs) AS r1, last(rs) AS rn
WITH p, a, b, c, r1, rn, rn.date - r1.date AS d
WHERE d < 15
RETURN a, b, c, d, r1, rn
This query finds a chain with at least one :SELLS_TO relationship between :Company nodes and assigns the matched path to 'p'. The match is then limited to cases where the first and last company have different countries. At this point the WITH clauses develop the other elements that you need. The collection of nodes in the path is obtained and named 'ns'. From this, a collection of nodes where the creation date is less than 90 days from the target date is found and named 'bs'. The first node of the 'bs' collection is then found and named 'b', and the match is limited to cases where a 'b' node was found. The first and last relationships are then found and named 'r1' and 'rn'. After this, the difference in their dates is calculated and named 'd'. The match is then limited to cases where d is less than 15.
So that gives you an idea of how to do this. There is another problem, though: at least in the way you have described it, the date math will fail. Dates that are represented as numbers, such as 20140801, are not linear, and thus cannot be used for interval math. As an example, 15 days from 20140820 is 20140904; if you subtract these two date 'numbers', you get 84. One way to handle this is to represent your dates as days since an epoch date.
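Storing the dates as days since an epoch makes the subtractions in the query meaningful. A small Python sketch of the conversion, just to illustrate the arithmetic (how you write the converted values back onto your nodes and relationships is up to you):
from datetime import datetime

def yyyymmdd_to_epoch_days(d):
    """Convert a date stored as a YYYYMMDD integer to days since an epoch."""
    return datetime.strptime(str(d), "%Y%m%d").toordinal()

# As plain numbers, 20140904 - 20140820 = 84, but the real gap is 15 days:
print(yyyymmdd_to_epoch_days(20140904) - yyyymmdd_to_epoch_days(20140820))  # 15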
Grace and peace,
Jim

Finding smallest set of criteria for uniqueness

I have a collection of objects with properties. I want to find the simplest set of criteria that will specify exactly one of these objects (I do not care which one).
For example, given {a=1, b=1, c=1}, {a=1, b=2, c=1}, {a=1, b=1, c=2}, specifying b==2 (or c==2) will give me a unique object.
Likewise, given {a=1, b=1, c=1}, {a=1, b=2, c=2}, {a=1, b=2, c=1}, specifying b==2 && c==2 (or b==1 && c==1, or b==2 && c==1) will give me a unique object.
This sounds like a known problem, with a known solution, but I haven't been able to find the correct formulation of the problem to allow me to Google it.
It is indeed a known problem in AI - feature selection. There are many algorithms for doing this; just Google "feature selection" "artificial intelligence".
The main issue is that when the sample set is large, you need to use some sort of heuristic in order to reach a solution within a reasonable time.
Feature Selection in Data Mining
The main idea of feature selection is to choose a subset of input variables by eliminating features with little or no predictive information.
The freedom of choosing the target is sort of unusual. If the target is specified, then this is essentially the set cover problem. Here are two corresponding instances side by side.
A={1,2,3} B={2,4} C={3,4} D={4,5}
0: {a=0, b=0, c=0, d=0} # separate 0 from the others
1: {a=1, b=0, c=0, d=0}
2: {a=1, b=1, c=0, d=0}
3: {a=1, b=0, c=1, d=0}
4: {a=0, b=1, c=1, d=1}
5: {a=0, b=0, c=0, d=1}
Set cover is NP-hard; however, your problem has an O(m^(log n + O(1)) · poly(n)) algorithm, where m is the number of attributes and n is the number of items (the optimal set of criteria has size at most log n), which makes it rather unlikely that an NP-hardness proof is forthcoming. I'm reminded of the situation with the Junta problem (basically the theoretical formulation of feature selection).
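Since the optimal set of criteria has size at most log n, a brute-force search over criteria sets in increasing size is often tolerable for small inputs. A rough Python sketch of that search (the dict-of-attributes representation and the function name are my own, just for illustration):
from itertools import combinations

def smallest_unique_criteria(objects):
    """Return a smallest set of (attribute, value) tests matched by exactly one object."""
    # Candidate criteria: every attribute/value pair occurring in the data.
    candidates = sorted({(k, v) for obj in objects for k, v in obj.items()})
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            matches = [o for o in objects if all(o.get(k) == v for k, v in combo)]
            if len(matches) == 1:
                return combo
    return None

objs = [{'a': 1, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 1}, {'a': 1, 'b': 1, 'c': 2}]
print(smallest_unique_criteria(objs))  # (('b', 2),) - the first example from the question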
I don't know how easily this could be translated into an algorithm, but using SQL, which is already set-based, it could go like this:
construct a table with all possible combinations of columns from the input table;
select all combinations whose number of distinct values equals the number of records present in the input table.
SQL Script
;WITH q (a, b, c) AS (
SELECT '1', '1', '1'
UNION ALL SELECT '1', '2', '2'
UNION ALL SELECT '1', '2', '1'
UNION ALL SELECT '1', '1', '2'
)
SELECT col
FROM (
SELECT val = a, col = 'a' FROM q
UNION ALL SELECT b, 'b' FROM q
UNION ALL SELECT c, 'c' FROM q
UNION ALL SELECT a+b, 'a+b' FROM q
UNION ALL SELECT a+c, 'a+c' FROM q
UNION ALL SELECT b+c, 'b+c' FROM q
UNION ALL SELECT a+b+c, 'a+b+c' FROM q
) f
GROUP BY
col
HAVING
COUNT(DISTINCT (val)) = (SELECT COUNT(*) FROM q)
Your problem can be defined as follows:
1 1 1 -> A
1 2 1 -> B
1 1 2 -> C
.
.
where 1 1 1 is called the feature vector and A is the object class. You can then use decision trees (with pruning) to find a set of rules to classify objects. So, if your objective is to automatically determine the set of criteria that identifies object A, you can observe the path in the decision tree which leads to A.
If you have access to MATLAB, it is really easy to obtain a decision tree for your data.
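If MATLAB isn't at hand, the same idea can be tried with scikit-learn's DecisionTreeClassifier (my own substitution, not what the answer used): fit a tree on the feature vectors with each object as its own class, then read off the path that leads to the object of interest.
from sklearn.tree import DecisionTreeClassifier, export_text

# Feature vectors in the question's notation; each object is its own class.
X = [[1, 1, 1], [1, 2, 1], [1, 1, 2]]
y = ["A", "B", "C"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# The printed rules show which tests single out each object;
# a path such as "b > 1.5 -> class: B" corresponds to the criterion b == 2.
print(export_text(tree, feature_names=["a", "b", "c"]))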

Leads to find tables correlation

I have these two tables:
TableA
ID Opt1 Opt2 Type
1 A Z 10
2 B Y 20
3 C Z 30
4 C K 40
and
TableB
ID Opt1 Type
1 Z 57
2 Z 99
3 X 3000
4 Z 3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of apriori in some way, but it doesn't seem too practical. What do you guys say?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
1. for each column c1 in table1
2. for each column c2 in table2
3. if approximately_isomorphic(c1, c2) then
4. emit (c1, c2)
approximately_isomorphic(c1, c2)
1. hmap = hash()   # keyed by (c1 value, c2 value) pairs, i.e. distinct linkings
2. for i = 1 to min(|c1|, |c2|) do
3. hmap[(c1[i], c2[i])] = true
4. if |hmap| - unique_count(c1) < error_margin then return true
5. else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains the same number of linkings as unique elements of the first column, then you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
Example on your input:
ID & anything : perfect isomorphism since all of ID are unique
Opt1 & ID : 4 mappings and 3 unique values; not a perfect
isomorphism, but not too far away.
Opt1 & Opt1 : ditto above
Opt1 & Type : 3 mappings & 3 unique values, perfect isomorphism
Opt2 & ID : 4 mappings & 3 unique values, not a perfect
isomorphism, but not too far away
Opt2 & Opt2 : ditto above
Opt2 & Type : ditto above
Type & anything: perfect isomorphism since all of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
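A runnable Python rendering of the same idea, counting distinct value pairs per column pair (my own transcription of the pseudocode above, not tied to any particular database):
from itertools import product

def approximately_isomorphic(c1, c2, error_margin=1):
    """Columns are (nearly) functionally related when the number of distinct
    (c1 value, c2 value) pairs is close to the number of distinct c1 values."""
    n = min(len(c1), len(c2))
    links = {(c1[i], c2[i]) for i in range(n)}
    return len(links) - len(set(c1[:n])) < error_margin

table_a = {"ID": [1, 2, 3, 4], "Opt1": ["A", "B", "C", "C"],
           "Opt2": ["Z", "Y", "Z", "K"], "Type": [10, 20, 30, 40]}
table_b = {"ID": [1, 2, 3, 4], "Opt1": ["Z", "Z", "X", "Z"], "Type": [57, 99, 3000, 3000]}

# Pairwise comparison of every TableA column with every TableB column.
for (name1, col1), (name2, col2) in product(table_a.items(), table_b.items()):
    if approximately_isomorphic(col1, col2):
        print(name1, "->", name2)   # e.g. "Opt1 -> Type", the Opt1 = C / Type = 3000 relation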

Turn Multiple (M) Rows of (N) Columns into One Row with (M*N) Columns In Oracle

I've had a lot of trouble finding examples or information on this. I've looked into PIVOT but the examples I've found left me a little confused as to what is actually going on. I'm also not looking to sum or group the data.
Essentially, I have a query that returns 2 rows of 4 columns
A B C D
--------
1 2 4 8
2 4 8 0
And I want that to look like:
A B C D A2 B2 C2 D2
-------------------
1 2 4 8 2 4 8 0
Is this something I can accomplish without PL/SQL?
EDIT: If there is a way to do this for a fixed number of rows and columns, I'd still welcome that approach. Ideally it would also work on SQL Server, but I'd be happy with an Oracle-specific solution.
Since you have mentioned that even a solution with fixed number of rows and columns is ok, you can try this...
SELECT MIN(DECODE(rownum, 1, A, null)) A,
MIN(DECODE(rownum, 1, B, null)) B,
MIN(DECODE(rownum, 1, C, null)) C,
MIN(DECODE(rownum, 1, D, null)) D,
MIN(DECODE(rownum, 2, A, null)) A2,
MIN(DECODE(rownum, 2, B, null)) B2,
MIN(DECODE(rownum, 2, C, null)) C2,
MIN(DECODE(rownum, 2, D, null)) D2
FROM <test_table>
I have never seen a mechanism to vary the number of columns in the result set based on the number of rows returned. You might be able to dynamically generate a query based on the result set, then execute the query.
PIVOT tables are more about switching the X and Y coordinates. I don't think this is what you are looking for.
Given you have a fixed maximum you could always return 6*N columns and use decode on rownum to select data into the appropriate columns.
A select statement always needs to have a fixed number of columns determinable at parse time. What you are asking is to be able to determine the columns at fetch time.
Can't be done.
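If the row count isn't fixed, the "dynamically generate a query" suggestion above can be as simple as building the DECODE projection from the row count first and then executing the generated statement. A rough Python sketch of the string generation (table and column names are placeholders):
def build_pivot_sql(table, columns, row_count):
    """Generate the MIN(DECODE(rownum, ...)) projection for a known row count."""
    parts = []
    for r in range(1, row_count + 1):
        suffix = "" if r == 1 else str(r)
        for col in columns:
            parts.append(f"MIN(DECODE(rownum, {r}, {col}, null)) {col}{suffix}")
    return f"SELECT {', '.join(parts)} FROM {table}"

# Query the row count first, then build and execute the statement, e.g.:
print(build_pivot_sql("my_table", ["A", "B", "C", "D"], 2))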
