MySQL query over many-to-many relation: unions? - performance

As a follow-up to the question "SQL query that gives distinct results that match multiple columns",
which had a very neat solution, I was wondering how the next step would look:
DOCUMENT_ID | TAG
----------------------------
1 | tag1
1 | tag2
1 | tag3
2 | tag2
3 | tag1
3 | tag2
4 | tag1
5 | tag3
So, to get all the document_ids that have tag 1 and 2 we would perform a query like this:
SELECT document_id
FROM table
WHERE tag = 'tag1' OR tag = 'tag2'
GROUP BY document_id
HAVING COUNT(DISTINCT tag) = 2
Now, what would be interesting to know is how we would get all the distinct document_ids that have tags 1 and 2, and in addition to that the ids that have tag 3.
We could imagine making the same query and performing a union between them:
SELECT document_id
FROM table
WHERE tag = "tag1" OR tag = "tag2"
GROUP BY document_id
HAVING COUNT(DISTINCT tag) = 2
UNION
SELECT document_id
FROM table
WHERE tag = "tag3"
GROUP BY document_id
But I was wondering if with that condition added, we could think of another initial query. I am imagining having many "unions" like that with different tags and tag counts.
Wouldn't it be very bad in terms of performance to create chains of unions like that?

This still uses unions of sorts but may be easier to read and control. I am really interested in the speed of this query on a large data set, so please let me know how fast it is. When I ran it on your small data set it took 0.0001 seconds.
SELECT DISTINCT (dt1.document_id)
FROM
document_tag dt1,
(SELECT document_id
FROM document_tag
WHERE tag = 'tag1'
) AS t1s,
(SELECT document_id
FROM document_tag
WHERE tag = 'tag2'
) AS t2s,
(SELECT document_id
FROM document_tag
WHERE tag = 'tag3'
) AS t3s
WHERE
(dt1.document_id = t1s.document_id
AND dt1.document_id = t2s.document_id
)
OR dt1.document_id = t3s.document_id
This will make it easy to add new parameters because you have already specified the result set for each tag.
For example adding:
OR dt1.document_id = t2s.document_id
to the end will also pick up document_id 2

It's possible to do this within a single query; however, you'll need to move the tag conditions from the WHERE clause into the HAVING clause in order to use a disjunction.
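A minimal sketch of that single-query idea, assuming the table is called document_tag as in the answer above. It is wrapped in Python's sqlite3 module only so it can be run as-is; the helper names make_tag_db and docs_matching are made up for illustration, and the SELECT itself uses only standard constructs that MySQL should accept unchanged.

```python
import sqlite3


def make_tag_db():
    """Build an in-memory copy of the sample data from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document_tag (document_id INTEGER, tag TEXT)")
    conn.executemany(
        "INSERT INTO document_tag VALUES (?, ?)",
        [(1, "tag1"), (1, "tag2"), (1, "tag3"), (2, "tag2"),
         (3, "tag1"), (3, "tag2"), (4, "tag1"), (5, "tag3")],
    )
    return conn


def docs_matching(conn):
    """Documents that have (tag1 AND tag2) OR tag3, in one grouped query."""
    sql = """
        SELECT document_id
        FROM document_tag
        GROUP BY document_id
        -- first condition: both tag1 and tag2 are present
        HAVING COUNT(DISTINCT CASE WHEN tag IN ('tag1', 'tag2') THEN tag END) = 2
        -- second condition: tag3 is present
            OR MAX(CASE WHEN tag = 'tag3' THEN 1 ELSE 0 END) = 1
        ORDER BY document_id
    """
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    print(docs_matching(make_tag_db()))
```

Each extra tag group becomes one more OR branch in the HAVING clause, so the table is still scanned and grouped only once.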

You're correct: that will get slower and slower as you add new tags you want to look for in additional UNION clauses. Each UNION clause is an additional query that needs to be planned and executed. Plus, any sorting has to be applied to the combined result with an ORDER BY after the last SELECT, not inside the individual queries.
You're looking for a basic data warehousing technique. First, let me recreate your schema with one additional table.
create table a (document_id int, tag varchar(10));
insert into a values (1, 'tag1'), (1, 'tag2'), (1, 'tag3'), (2, 'tag2'),
(3, 'tag1'), (3, 'tag2'), (4, 'tag1'), (5, 'tag3');
create table b (tag_group_id int, tag varchar(10));
insert into b values (1, 'tag1'), (1, 'tag2'), (2, 'tag3');
Table b contains "tag groups". Group 1 includes tag1 and tag2, while group 2 contains tag3.
Now you can modify table b to represent the query you are interested in. When you are ready to query, you create temp tables to store aggregate data:
create temporary table c
(tag_group_id int, count_tags_in_group int, tags_in_group varchar(255));
insert into c
select
tag_group_id,
count(tag),
group_concat(tag)
from b
group by tag_group_id;
create temporary table d (document_id int, tag_group_id int, document_tag_count int);
insert into d
select
a.document_id,
b.tag_group_id,
count(a.tag) as document_tag_count
from a
inner join b on a.tag = b.tag
group by a.document_id, b.tag_group_id;
Now c contains the number of tags in each tag group, and d contains the number of tags each document has for each tag group. If a row in c matches a row in d, then that document has all of the tags in that tag group.
select
d.document_id as "Document ID",
c.tags_in_group as "Matched Tag Group"
from d
inner join c on d.tag_group_id = c.tag_group_id
and d.document_tag_count = c.count_tags_in_group
One cool thing about this approach is that you could run reports like 'How many documents have 50% or more of the tags in each of these tag groups?'
select
d.document_id as "Document ID",
c.tags_in_group as "Matched Tag Group"
from d
inner join c on d.tag_group_id = c.tag_group_id
and d.document_tag_count >= 0.5 * c.count_tags_in_group
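For anyone who wants to try the tag-group approach without a MySQL server, here is a rough sketch using Python's sqlite3 module. The table names a, b, c and d follow the answer; the function names and the min_fraction parameter are hypothetical, and the output returns tag_group_id instead of the group_concat string to keep the result easy to compare.

```python
import sqlite3


def make_group_db():
    """Tables a (document tags) and b (tag groups) from the answer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (document_id INTEGER, tag TEXT);
        INSERT INTO a VALUES (1,'tag1'),(1,'tag2'),(1,'tag3'),(2,'tag2'),
                             (3,'tag1'),(3,'tag2'),(4,'tag1'),(5,'tag3');
        CREATE TABLE b (tag_group_id INTEGER, tag TEXT);
        INSERT INTO b VALUES (1,'tag1'),(1,'tag2'),(2,'tag3');
    """)
    return conn


def build_aggregates(conn):
    """Create temp tables c (tags per group) and d (tags per document and group)."""
    conn.executescript("""
        DROP TABLE IF EXISTS c;
        DROP TABLE IF EXISTS d;
        CREATE TEMP TABLE c AS
            SELECT tag_group_id, COUNT(tag) AS count_tags_in_group
            FROM b GROUP BY tag_group_id;
        CREATE TEMP TABLE d AS
            SELECT a.document_id, b.tag_group_id,
                   COUNT(a.tag) AS document_tag_count
            FROM a INNER JOIN b ON a.tag = b.tag
            GROUP BY a.document_id, b.tag_group_id;
    """)


def matches(conn, min_fraction=1.0):
    """(document_id, tag_group_id) pairs holding at least min_fraction of a group."""
    sql = """
        SELECT d.document_id, d.tag_group_id
        FROM d
        INNER JOIN c ON d.tag_group_id = c.tag_group_id
                    AND d.document_tag_count >= ? * c.count_tags_in_group
        ORDER BY d.tag_group_id, d.document_id
    """
    return conn.execute(sql, (min_fraction,)).fetchall()


if __name__ == "__main__":
    db = make_group_db()
    build_aggregates(db)
    print(matches(db))        # documents with every tag of a group
    print(matches(db, 0.5))   # documents with at least half the tags
```

Changing the question only means editing the rows in b and rebuilding c and d; the final join stays the same.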

How to compare the column names in one table to the values in another in Impala

The first is the main table and the second is the lookup table.
I need to compare the column names of the first table to the values in the second table, and if a certain column name is found in any row of the second table, fetch some fields from the second table.
Is it possible to do this in Impala?
Table 1
source |location |origin
----------+----------+-------
s1 |india |xxx
Table 2
extractedfrom|lct |lkp_value|map_value
-------------+----------+---------+---------
s1 |location |india |india_x
s1 |origin |xxx |yyyyyy
I need to have something like this:
Final view required
source |location |origin |location_lkp|origin_lkp
----------+----------+----------+------------+----------
s1 |india |xxx |india_x |yyyyyy
You should edit your post to be more specific about what you are trying to do and how you wish to join the tables.
The following query should work for you given the example you provided.
SELECT t1.source,
t1.location,
t1.origin,
t2_loc.map_value AS location_lkp,
t2_ori.map_value AS origin_lkp
FROM Table1 t1
JOIN Table2 t2_loc ON t1.source = t2_loc.extractedfrom
AND t1.location = t2_loc.lkp_value
JOIN Table2 t2_ori ON t1.source = t2_ori.extractedfrom
AND t1.origin = t2_ori.lkp_value
WHERE t2_loc.lct = 'location'
AND t2_ori.lct = 'origin'
The trick is that you join to Table2 multiple times, once for each column you wish to match on.
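To check the double join against the sample rows, here is a small sketch that runs the same query through Python's sqlite3 module rather than Impala. That is an assumption made purely for convenience, since the query uses only plain joins; the function names are made up for illustration.

```python
import sqlite3


def make_lookup_db():
    """Table1 (main) and Table2 (lookup) with the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Table1 (source TEXT, location TEXT, origin TEXT);
        INSERT INTO Table1 VALUES ('s1', 'india', 'xxx');
        CREATE TABLE Table2 (extractedfrom TEXT, lct TEXT,
                             lkp_value TEXT, map_value TEXT);
        INSERT INTO Table2 VALUES ('s1', 'location', 'india', 'india_x'),
                                  ('s1', 'origin',   'xxx',   'yyyyyy');
    """)
    return conn


def lookup_view(conn):
    """Join Table2 once per looked-up column (location and origin)."""
    sql = """
        SELECT t1.source, t1.location, t1.origin,
               t2_loc.map_value AS location_lkp,
               t2_ori.map_value AS origin_lkp
        FROM Table1 t1
        JOIN Table2 t2_loc ON t1.source = t2_loc.extractedfrom
                          AND t1.location = t2_loc.lkp_value
        JOIN Table2 t2_ori ON t1.source = t2_ori.extractedfrom
                          AND t1.origin = t2_ori.lkp_value
        WHERE t2_loc.lct = 'location'
          AND t2_ori.lct = 'origin'
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(lookup_view(make_lookup_db()))
```

Note that with inner joins a Table1 row disappears if either lookup is missing; switching to LEFT JOIN and moving the lct filters into the ON clauses would keep such rows with NULLs.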

Restructure Inline view

SELECT MAX(column1)
FROM table1 B , table2 A, table3 H
WHERE B.unit=A.unit
AND B.value=A.value
AND B.unit=H.unit
AND B.value=H.value
AND A.number=1234
Can someone help me restructure this query as an inline view?
SAMPLE
Table1
------
Value Unit
001 A1
002 B1
003 C2
002 A1
Table2
--------
Value Unit Number
001 B4 11
002 B1 1234
004 B1 22
TABLE3
-------
VALUE UNIT NUMBER COLUMN1
001 B4 11 555
002 B1 1234 557
002 B1 1234 559
OUTPUT
------
MAX(COLUMN1)
-----------
559
Your query does not need an inline view.
If it is rewritten with inline views, it will look like this:
Select Max(Column1)
From (Select Value,Unit From Table1)B,
(Select Value,Unit,Number From Table2)A,
Table3 H
Where B.Unit=A.Unit
And B.Value=A.Value
AND B.unit=H.unit
And B.Value=H.Value
AND A.number=1234;
Below is an example of when to use an inline view. Hope this helps!
An inline view is a SELECT statement in the FROM clause. As mentioned in the View section, a view is a virtual table that has the characteristics of a table yet does not hold any actual data. In an inline view construct, instead of specifying table name(s) after the FROM keyword, the source of the data actually comes from a view that is created within the SQL statement. The syntax for an inline view is,
SELECT "column_name" FROM (Inline View);
When should we use inline view? Below is an example:
Assume we have two tables: The first table is User_Address, which maps each user to a ZIP code; the second table is User_Score, which records all the scores of each user. The question is, how to write a SQL query to find the number of users who scored higher than 200 for each ZIP code?
Without using an inline view, we can accomplish this in two steps:
Query 1
CREATE TABLE User_Higher_Than_200 AS
SELECT User_ID, SUM(Score) FROM User_Score
GROUP BY User_ID
HAVING SUM(Score) > 200;
Query 2
SELECT a2.ZIP_CODE, COUNT(a1.User_ID)
FROM User_Higher_Than_200 a1, User_Address a2
WHERE a1.User_ID = a2.User_ID
GROUP BY a2.ZIP_CODE;
In the above code, we introduced a temporary table, User_Higher_Than_200, to store the list of users who scored higher than 200. User_Higher_Than_200 is then used to join to the User_Address table to get the final result.
We can simplify the above SQL using the inline view construct as follows:
Query 3
SELECT a2.ZIP_CODE, COUNT(a1.User_ID)
FROM
(SELECT User_ID, SUM(Score) FROM
User_Score GROUP BY User_ID HAVING SUM(Score) > 200) a1,
User_Address a2
WHERE a1.User_ID = a2.User_ID
GROUP BY a2.ZIP_CODE;
There are two advantages to using an inline view here:
We do not need to create the temporary table. This prevents the database from having too many objects, which is a good thing as each additional object in the database costs resources to manage.
We can use a single SQL query to accomplish what we want.
Notice that we treat the inline view exactly the same as we treat a table. Comparing Query 2 and Query 3, we see that the only difference is we replace the temporary table name in Query 2 with the inline view statement in Query 3. Everything else stays the same.
An inline view is sometimes referred to as a derived table. These two terms are used interchangeably.
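The inline view from the ZIP code example can be tried with a small sketch like the one below. It runs through Python's sqlite3 module with made-up sample rows (the user IDs, ZIP codes and scores are assumptions for illustration only), and it joins the two sources on User_ID.

```python
import sqlite3


def make_score_db():
    """User_Address and User_Score tables with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE User_Address (User_ID INTEGER, ZIP_CODE TEXT);
        INSERT INTO User_Address VALUES (1, '10001'), (2, '10001'), (3, '20002');
        CREATE TABLE User_Score (User_ID INTEGER, Score INTEGER);
        -- user 1 totals 250, user 2 totals 50, user 3 totals 300
        INSERT INTO User_Score VALUES (1, 150), (1, 100), (2, 50), (3, 300);
    """)
    return conn


def high_scorers_per_zip(conn):
    """Count users whose total score exceeds 200, per ZIP code."""
    sql = """
        SELECT a2.ZIP_CODE, COUNT(a1.User_ID)
        FROM
            -- inline view: one row per user with a total above 200
            (SELECT User_ID, SUM(Score) AS total
             FROM User_Score
             GROUP BY User_ID
             HAVING SUM(Score) > 200) a1,
            User_Address a2
        WHERE a1.User_ID = a2.User_ID
        GROUP BY a2.ZIP_CODE
        ORDER BY a2.ZIP_CODE
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(high_scorers_per_zip(make_score_db()))
```

The subquery in the FROM clause plays the role of the User_Higher_Than_200 table, so nothing has to be created or cleaned up afterwards.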
I need to show a column from the other table for the row that has the maximum column value:
SELECT MAX( H.column1 ) AS max_column1,
MAX( A.number ) KEEP ( DENSE_RANK LAST ORDER BY H.column1 ) AS max_number
FROM table1 B
INNER JOIN table2 A
ON ( B.unit = A.unit AND B.value = A.value )
INNER JOIN table3 H
ON ( B.unit = H.unit AND B.value = H.value )
WHERE A.number=1234

In Elasticsearch, how can I establish join query with conditions and later perform percentile and count functions?

I have a set of tables in my database: table A, which holds a set of categories, and table B, which holds a set of repositories. A and B are related by categoryid. Then there is table C, which holds a set of properties for each repoId. Tables C and B are associated by repoId.
Table C can have multiple values for a repoId.
The data in table C is a property, say a number string like 12345XXXX (at most 10 characters). I have to find the top matching first 6 characters of that value in table C, and the count of repoIds associated with that 6-character prefix, for a particular categoryid from table A.
Table A (set of categories) ---------> Table B (set of repositories, associated with A by categoryid) ---------> Table C (set of FMProperties against a repoId)
Currently, this is achieved by using joins and substring queries on these tables, and it is very slow.
I have to achieve this functionality using Elasticsearch, but I don't have a clear view of how to start.
Do I create separate documents/indexes for tables A, B, and C, or fetch the info using an SQL query and create a single document?
And how can we apply the analytics part explained above?
I am very new to this technology, but I am following the tutorials provided on the Elasticsearch site.
Please find below the MySQL query for this logic:
(select 'fms' as fmstype, C.fmscode as fmsCode,
count(C.repoId) as countOffms from tableC C, tableB B
where B.repoId = C.repoId and B.categoryid = 175
group by C.fmscode
order by countOffms desc
limit 1)
UNION ALL
(select 'fms6' as fmstype, t1.fmscode, t2.countOffms from tableC t1
inner join
(
select substring(C.fmscode,1,6) as first6,
count(C.repoId) as countOffms from tableC C, tableB B
where B.repoId = C.repoId and B.categoryid = 175 and length(C.fmscode) = 6
group by substring(C.fmscode,1,6) order by countOffms desc
limit 1 ) t2
ON
substring(t1.fmscode,1,6) = t2.first6 and length(t1.fmscode) = 6
group by t1.fmscode
order by count(t1.fmscode) desc
limit 1)

Hibernate HQL GroupBy in Oracle

I created this query using HQL with Hibernate and Oracle
select c from Cat c
left join c.kittens k
where (c.location= 1 OR c.location = 2)
and (i.activo = 1)
group
by c.id,
c.name,
c.fulldescription,
c.kittens
order by count(e) desc
The problem is that in HQL you need to specify all fields of Cat in order to perform a GROUP BY, but fulldescription is a CLOB, and you cannot group by a CLOB (I get a "Not a Group By Expression" error). I've seen a few solutions around for a pure SQL statement but none for HQL.
This is a known issue with GROUP BY in HQL: an entity is treated differently in the GROUP BY clause and in the SELECT list. In GROUP BY only its id field is considered, but in the SELECT list all of its fields are.
So you can use a subquery with GROUP BY that returns only the id of your object, and use that result as the input for the main query, as in the query I wrote for you below.
Note that some aliases (i and e) are not defined, so this query won't run as written, but you will know how to fix that.
Try this:
select c2 from Cat c2
where c2.id in (
select c.id from Cat c
left join c.kittens k
where (c.location= 1 OR c.location = 2)
and (i.activo = 1) <-- who is i alias??
group by c.id)
order by count(e) desc <-- who is e alias???

How do I write a LINQ query to combine multiple rows into one row?

I have one table, 'a', with id and timestamp. Another table, 'b', has N multiple rows referring to id, and each row has 'type', and "some other data".
I want a LINQ query to produce a single row with id, timestamp, and "some other data" x N. Like this:
1 | 4671 | 46.5 | 56.5
where 46.5 is from one row of 'b', and 56.5 is from another row; both with the same id.
I have a working query in SQLite, but I am new to LINQ. I don't know where to start, and I don't think this is a JOIN at all.
SELECT
a.id as id,
a.seconds,
COALESCE(
(SELECT b.some_data FROM
b WHERE
b.id=a.id AND b.type=1), '') AS 'data_one',
COALESCE(
(SELECT b.some_data FROM
b WHERE
b.id=a.id AND b.type=2), '') AS 'data_two'
FROM a
WHERE a.id=1
GROUP BY a.id
You didn't mention whether you are using LINQ to SQL or LINQ to Entities. However, the following query should get you there:
(from x in a
join y in b on x.id equals y.id
select new{x.id, x.seconds, y.some_data, y.type}).GroupBy(x=>new{x.id,x.seconds}).
Select(x=>new{
id = x.Key.id,
seconds = x.Key.seconds,
data_one = x.Where(z=>z.type == 1).Select(g=>g.some_data).FirstOrDefault(),
data_two = x.Where(z=>z.type == 2).Select(g=>g.some_data).FirstOrDefault()
});
Obviously, you have to prefix your table names with the DataContext or ObjectContext, depending upon the underlying provider.
What you want to do is similar to pivoting, see Is it possible to Pivot data using LINQ?. The difference here is that you don't really need to aggregate (like a standard pivot), so you'll need to use Max or some similar method that can simulate selecting a single varchar field.
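To make the "pivot with Max" idea concrete, here is a sketch of the same shape expressed as conditional aggregation in SQL, run through Python's sqlite3 module. This is not LINQ; it only illustrates the pattern that the GroupBy/FirstOrDefault query imitates. The column layout of a and b is assumed from the question, and the function names are hypothetical.

```python
import sqlite3


def make_pivot_db():
    """Tables a (id, seconds) and b (id, type, some_data) with one sample id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER, seconds INTEGER);
        INSERT INTO a VALUES (1, 4671);
        CREATE TABLE b (id INTEGER, type INTEGER, some_data REAL);
        INSERT INTO b VALUES (1, 1, 46.5), (1, 2, 56.5);
    """)
    return conn


def one_row_per_id(conn):
    """Collapse the rows of b into columns, one column per type."""
    sql = """
        SELECT a.id,
               a.seconds,
               -- MAX ignores the NULLs produced for rows of the other type
               MAX(CASE WHEN b.type = 1 THEN b.some_data END) AS data_one,
               MAX(CASE WHEN b.type = 2 THEN b.some_data END) AS data_two
        FROM a
        JOIN b ON b.id = a.id
        GROUP BY a.id, a.seconds
        ORDER BY a.id
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(one_row_per_id(make_pivot_db()))
```

MAX here is only a way to pick the single non-NULL value per type; if a type can occur more than once for an id, it silently keeps the largest one.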
