How to implement tag system - algorithm

I was wondering what the best way is to implement a tag system, like the one used on SO. I was thinking of this but I can't come up with a good scalable solution.
I was thinking of a basic three-table solution: a tags table, an articles table and a tag_to_articles table.
Is this the best solution to this problem, or are there alternatives? With this method the tag_to_articles table would grow extremely large over time, and I assume searching it would not be very efficient. On the other hand, it is not that important that the query executes fast.

I believe you'll find this blog post interesting: Tags: Database schemas
The Problem: You want to have a database schema where you can tag a
bookmark (or a blog post or whatever) with as many tags as you want.
Later then, you want to run queries to constrain the bookmarks to a
union or intersection of tags. You also want to exclude (say: minus)
some tags from the search result.
“MySQLicious” solution
In this solution, the schema has just one table; it is denormalized. This type is called “MySQLicious solution” because MySQLicious imports del.icio.us data into a table with this structure.
Intersection (AND)
Query for “search+webservice+semweb”:
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"
Union (OR)
Query for “search|webservice|semweb”:
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"
Minus
Query for “search+webservice-semweb”
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"
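The three single-table queries above can be tried out with a small sketch. It assumes an in-memory SQLite database standing in for MySQL, a `tags` column holding space-separated tags, and made-up rows; the helper name `matching_ids` is hypothetical. The third row is there to show that a leading-wildcard LIKE also matches tags that merely contain the search term.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delicious (id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO delicious (id, tags) VALUES (?, ?)",
    [
        (1, "search webservice semweb"),
        (2, "search webservice"),
        (3, "research"),  # contains "search" only as a substring
    ],
)


def matching_ids(where_clause):
    """Return the ids of all rows that satisfy the given WHERE clause."""
    sql = f"SELECT id FROM delicious WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


# Intersection (AND): every tag must appear somewhere in the string.
intersection = matching_ids(
    "tags LIKE '%search%' AND tags LIKE '%webservice%' AND tags LIKE '%semweb%'"
)
# Union (OR): row 3 is a false hit, because "research" contains "search".
union = matching_ids(
    "tags LIKE '%search%' OR tags LIKE '%webservice%' OR tags LIKE '%semweb%'"
)
# Minus: search AND webservice AND NOT semweb.
minus = matching_ids(
    "tags LIKE '%search%' AND tags LIKE '%webservice%' AND tags NOT LIKE '%semweb%'"
)

print(intersection, union, minus)
```

With this data the union query returns row 3 as well, which is the substring problem of this schema.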
“Scuttle” solution
Scuttle organizes its data in two tables. The table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table.
Intersection (AND)
Query for “bookmark+webservice+semweb”:
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3
First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).
Union (OR)
Query for “bookmark|webservice|semweb”:
Just leave out the HAVING clause and you have union:
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
Minus (Exclusion)
Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2
Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
“Toxi” solution
Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB schema is also used by WordPress.
The queries are quite the same as in the “scuttle” solution.
Intersection (AND)
Query for “bookmark+webservice+semweb”
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3
Union (OR)
Query for “bookmark|webservice|semweb”
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
Minus (Exclusion)
Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b.*
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'semweb')
GROUP BY b.id
HAVING COUNT( b.id ) =2
Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
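As a rough, runnable illustration of the “Toxi” queries, here is a sketch using an in-memory SQLite database in place of MySQL. The table and column names follow the queries above; the sample rows, the URLs and the composite primary key on tagmap are assumptions made for the example, and the joins are written in explicit JOIN syntax rather than the comma style of the blog post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
    CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Assumption: the composite key forbids duplicate (bookmark, tag) pairs,
    -- which the HAVING COUNT(...) = n trick relies on.
    CREATE TABLE tagmap (
        bookmark_id INTEGER NOT NULL REFERENCES bookmark (id),
        tag_id INTEGER NOT NULL REFERENCES tag (tag_id),
        PRIMARY KEY (bookmark_id, tag_id)
    );
    INSERT INTO bookmark (id, url) VALUES
        (1, 'http://example.com/a'),
        (2, 'http://example.com/b'),
        (3, 'http://example.com/c');
    INSERT INTO tag (tag_id, name) VALUES
        (1, 'bookmark'), (2, 'webservice'), (3, 'semweb');
    INSERT INTO tagmap (bookmark_id, tag_id) VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2),
        (3, 3);
    """
)

JOINED = """
    SELECT b.id
    FROM bookmark b
    JOIN tagmap bt ON bt.bookmark_id = b.id
    JOIN tag t ON t.tag_id = bt.tag_id
"""


def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Intersection: a bookmark must match all three tags.
intersection = ids(
    JOINED
    + "WHERE t.name IN ('bookmark', 'webservice', 'semweb') "
    + "GROUP BY b.id HAVING COUNT(*) = 3 ORDER BY b.id"
)
# Union: the same query without the HAVING clause.
union = ids(
    JOINED
    + "WHERE t.name IN ('bookmark', 'webservice', 'semweb') "
    + "GROUP BY b.id ORDER BY b.id"
)
# Minus: bookmark AND webservice AND NOT semweb.
minus = ids(
    JOINED
    + "WHERE t.name IN ('bookmark', 'webservice') "
    + "AND b.id NOT IN ("
    + "  SELECT bt2.bookmark_id FROM tagmap bt2"
    + "  JOIN tag t2 ON t2.tag_id = bt2.tag_id WHERE t2.name = 'semweb') "
    + "GROUP BY b.id HAVING COUNT(*) = 2 ORDER BY b.id"
)

print(intersection, union, minus)
```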

Nothing wrong with your three-table solution.
Another option is to limit the number of tags that can be applied to an article (like 5 in SO) and add those directly to your article table.
Normalizing the DB has its benefits and drawbacks, just like hard-wiring things into one table has benefits and drawbacks.
Nothing says you can't do both. It goes against relational DB paradigms to repeat information, but if the goal is performance you may have to break the paradigms.

Your proposed three table implementation will work for tagging.
Stack Overflow, however, uses a different implementation. They store tags in a varchar column in the posts table as plain text and use full-text indexing to fetch posts that match the tags. For example, posts.tags = "algorithm system tagging best-practices". I am sure that Jeff has mentioned this somewhere, but I forget where.

The proposed solution is the best (if not the only practicable) way I can think of to address the many-to-many relationship between tags and articles. So my vote is for 'yes, it's still the best.' I'd be interested in any alternatives though.

If your database supports indexable arrays (like PostgreSQL, for example), I would recommend an entirely denormalized solution - store tags as an array of strings on the same table. If not, a secondary table mapping objects to tags is the best solution. If you need to store extra information against tags, you can use a separate tags table, but there's no point in introducing a second join for every tag lookup.

I would like to suggest an optimised MySQLicious for better performance.
First, the drawback of the Toxi (three-table) solution:
If you have a million questions with 5 tags each, there will be 5 million entries in the tagmap table. So we first have to filter out, say, 10 thousand tagmap entries based on the tag search, and then filter out the matching questions for those 10 thousand. If the article id is a simple number, that is fine, but if it is a kind of UUID (a 32-character varchar), the filtering needs larger comparisons even though the column is indexed.
My solution:
Whenever a new tag is created, increment a counter (base 10) and convert that counter into base64. Each tag name then has a base64 id. Pass this id to the UI along with the name.
This way you will have ids of at most two characters until 4095 tags have been created in the system. Now concatenate the ids of a question's tags into a tag column on the question table, adding a delimiter and keeping the ids sorted.
So the table looks like this
When querying, query on the id instead of the real tag name.
Since the ids are SORTED, an AND condition on tags will be more efficient (LIKE '%|a|%|c|%|f|%').
Note that a single space delimiter is not enough; we need a delimiter on both sides of each tag to differentiate tags like sql and mysql, because LIKE "%sql%" will return mysql results as well. It should be LIKE "%|sql|%".
I know the search is not indexed, but you might still have indexes on other article columns such as author or dateTime; otherwise this will lead to a full table scan.
Finally, with this solution no inner join is required in which a million records have to be compared with 5 million records on the join condition.
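A minimal sketch of this id scheme follows, under stated assumptions: the helper names (`encode_tag_id`, `pack_tags`, `like_pattern`), the standard base64 alphabet, and the `question` table are all hypothetical, and in-memory SQLite stands in for MySQL. Each id is wrapped in its own pair of delimiters so that a pattern such as '%|A|%|C|%' can match.

```python
import sqlite3

# Assumption: the standard base64 alphabet; any 64 distinct characters
# without '%', '_' or the '|' delimiter would do.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_tag_id(counter):
    """Convert a base-10 counter into a short base-64 string id."""
    if counter == 0:
        return ALPHABET[0]
    digits = []
    while counter > 0:
        counter, remainder = divmod(counter, 64)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def pack_tags(tag_ids):
    """Store the ids sorted, each wrapped in its own pair of delimiters."""
    return "".join(f"|{tag_id}|" for tag_id in sorted(tag_ids))


def like_pattern(tag_ids):
    """Build one LIKE pattern that requires all ids, in sorted order."""
    return "%" + "%".join(f"|{tag_id}|" for tag_id in sorted(tag_ids)) + "%"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE question (id INTEGER PRIMARY KEY, tags TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO question (id, tags) VALUES (?, ?)",
    [
        (1, pack_tags([encode_tag_id(0), encode_tag_id(1), encode_tag_id(2)])),
        (2, pack_tags([encode_tag_id(1), encode_tag_id(2)])),
    ],
)

# Note: LIKE ignores ASCII case by default in SQLite (and under MySQL's
# default collations), so ids such as 'a' and 'A' would need a
# case-sensitive comparison; this demo only uses upper-case ids.
pattern = like_pattern([encode_tag_id(2), encode_tag_id(0)])
matched = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM question WHERE tags LIKE ? ORDER BY id", (pattern,)
    )
]
print(pattern, matched)
```

The case-sensitivity caveat in the comment matters for the design: with a case-insensitive LIKE, a mixed-case base64 alphabet would make distinct tag ids indistinguishable.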

CREATE TABLE Tags (
tag VARCHAR(...) NOT NULL,
bid INT ... NOT NULL,
PRIMARY KEY(tag, bid),
INDEX(bid, tag)
)
Notes:
This is better than TOXI in that it does not go through an extra many:many table which makes optimization difficult.
Sure, my approach may be slightly more bulky (than TOXI) due to the redundant tags, but that is a small percentage of the whole database, and the performance improvements may be significant.
It is highly scalable.
It does not have (because it does not need) a surrogate AUTO_INCREMENT PK. Hence, it is better than Scuttle.
MySQLicious sucks because it cannot use an index (LIKE with leading wild card; false hits on substrings)
For MySQL, be sure to use ENGINE=InnoDB in order to get 'clustering' effects.
Related discussions (for MySQL):
many:many mapping table optimization
ordered lists
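Here is a small sketch of how the two-column Tags table above could be queried. It runs on in-memory SQLite rather than MySQL, so `WITHOUT ROWID` is used as a rough stand-in for InnoDB's clustered primary key; the column types, the index name `tags_bid_tag` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumption: TEXT/INTEGER replace the elided MySQL column types.
    CREATE TABLE Tags (
        tag TEXT NOT NULL,
        bid INTEGER NOT NULL,
        PRIMARY KEY (tag, bid)
    ) WITHOUT ROWID;
    CREATE INDEX tags_bid_tag ON Tags (bid, tag);
    INSERT INTO Tags (tag, bid) VALUES
        ('bookmark', 1), ('webservice', 1), ('semweb', 1),
        ('bookmark', 2), ('webservice', 2),
        ('semweb', 3);
    """
)

# Intersection: bookmarks carrying all three tags. The (tag, bid) primary
# key serves the lookup by tag and guarantees one row per pair.
intersection = [
    row[0]
    for row in conn.execute(
        """
        SELECT bid
        FROM Tags
        WHERE tag IN ('bookmark', 'webservice', 'semweb')
        GROUP BY bid
        HAVING COUNT(*) = 3
        ORDER BY bid
        """
    )
]

# Reverse lookup: all tags of one bookmark, served by the (bid, tag) index.
tags_of_first = [
    row[0]
    for row in conn.execute("SELECT tag FROM Tags WHERE bid = ? ORDER BY tag", (1,))
]

print(intersection, tags_of_first)
```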

Related

How to speed-up a spatial join in BigQuery?

I have a BigQuery table with point records across a whole country, and I need to assign a "censal zone" to each one of them; the zone polygons are stored in another table. I've been trying to do so using a query like this one:
SELECT id_point, code_censal_zone
FROM `points_table`
JOIN `zones_table`
ON ST_CONTAINS(zone_polygon, point_geo)
The first table is quite large, so the query performs very inefficiently, as it compares every possible (point, censal zone) pair. However, both tables have a column identifying the municipality they are in, so the question is: can I rewrite my query so that ST_CONTAINS(*) is only evaluated for (point, censal zone) pairs that belong to the same municipality, rather than comparing each point against all possible censal zones in the country? Can I do this without having to read points_table multiple times?
SELECT id_point, code_censal_zone
FROM `points_table` AS p
JOIN `zones_table` AS z
ON p.municipality = z.municipality
AND ST_CONTAINS(zone_polygon, point_geo)
I'm quite new to BigQuery, so I don't really know if a query like this would actually do what I'm expecting, as I couldn't find anything in the documentation.
Thanks!

Most efficient way to select in bulk from a multi million records table

I'm interested in getting and doing some processing on all the entities A returned by a query of the form:
SELECT * FROM A a WHERE a.id not in (select b.id from B)
Where A is a "complex" entity in the sense that it inherits (InheritanceType.JOINED) from other entities and that several of its attributes are other entities (@OneToOne and @ManyToOne).
The query itself takes a few minutes to yield results, hence my desire to execute it as few times as possible.
Here are the different approaches I tried to get those A elements as efficiently as possible:
Pagination using setFirstResult/setMaxResults
Does the job, but pretty slowly, as the query seems to be executed every time (around 50 elements processed/sec).
Getting IDs first, A objects next
Keeping all the IDs in memory is doable, so I execute once
SELECT a.id FROM A a WHERE a.id not in (select b.id from B)
and then select a from A a WHERE a.id = :id, which goes relatively fast as the id column is indexed. This is currently the most efficient solution (around 100 elements processed/sec).
Using ScrollableResults
I had high hopes for this solution, but it ended up being slower than the other alternatives, leaving me at around 20 elements processed/sec.
As a neophyte, I don't know what other options to investigate, or if I did something wrong in any of my attempts.
Hence my questions:
Are there (factually) other approaches to efficiently tackle this kind of problem ?
Is it normal that ScrollableResults performed so poorly ? Is there something I should have paid attention to while implementing this solution?
EDIT:
Here's the execution plan

Access 2013: Check if a value is present in another table

I've just discovered Access, having always been an Excel/VBA man... and now I've hit a roadblock!
I'm building an inventory database for my employer. I have 2 tables, one containing one column of 'stockID's (let's call this table 'tblWarehouse'), and another containing two columns: a column of 'orderID's and a column of 'stockID's (let's call this table 'tblOrders'). (For the sake of this question, let's disregard things like quantity, price etc.)
We don't keep all the goods we sell in our own warehouse, some are sourced directly from the manufacturer to the customer, which means that not all tblOrders!stockID will be present in the list tblWarehouse!stockID. I need to find out when this is the case!
I want to create a third column in tblOrders containing a dummy variable = 1 if that particular item is in our warehouse. In other words, I want to create a calculated column = 1 if tblOrders!stockID can be found in tblWarehouse!stockID. Can this be done?
I've found that I can't reference another table directly, so I've been trying my hand at queries, user defined functions and relationships, but to no avail. I've also been having trouble with the Access-lingo and veritable forest of different places to input seemingly the same expressions... so please, if you have an answer for me, be sure to specify where things are located!
Much obliged!!
If you are linking the two tables in a query using an inner join, only order records having at least one stock entry will be included in the result. In order to include those with no stock entry at all, create a left outer join.
SELECT O.OrderID, IIf(IsNull(MAX(W.StockID)), 0, 1) AS StockAvailable
FROM
tblOrders O
LEFT JOIN tblWarehouse W
ON O.StockID = W.StockID
GROUP BY O.OrderID
You can also determine the join type in the query designer by right-clicking a relation line and selecting "Join Properties" and then selecting "Include ALL records from tblOrders ...". You can make a grouping query by clicking the big Sigma-symbol in the symbol list.
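To see what the left outer join produces, here is a sketch that runs the same idea on in-memory SQLite instead of Access. Since this is not Access SQL, the IIf(IsNull(...)) expression is replaced by a portable CASE expression; the table names come from the question, while the integer stock ids and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblWarehouse (stockID INTEGER PRIMARY KEY);
    CREATE TABLE tblOrders (orderID INTEGER NOT NULL, stockID INTEGER NOT NULL);
    INSERT INTO tblWarehouse (stockID) VALUES (10), (20);
    INSERT INTO tblOrders (orderID, stockID) VALUES
        (1, 10),            -- in the warehouse
        (2, 30),            -- sourced from the manufacturer
        (3, 20), (3, 40);   -- one item in the warehouse, one not
    """
)

# LEFT JOIN keeps every order row; W.stockID is NULL where no warehouse
# entry matches. Grouping by order yields 1 if at least one item matched.
rows = conn.execute(
    """
    SELECT O.orderID,
           CASE WHEN MAX(W.stockID) IS NULL THEN 0 ELSE 1 END AS StockAvailable
    FROM tblOrders O
    LEFT JOIN tblWarehouse W ON O.stockID = W.stockID
    GROUP BY O.orderID
    ORDER BY O.orderID
    """
).fetchall()

print(rows)
```

Note that grouping by order flags an order as soon as any of its items is in the warehouse; dropping the GROUP BY and MAX would give one flag per order line instead.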

Oracle query with multiple tables

I am trying to display volunteer information together with the duty and the performance it is allocated to.
I want to display this information. However, when I run the query, it does not gather the different dates for the same performance, and availability_date is mixed up. Is this the right query for it? I am not sure.
Could you give me some feedback?
Thanks.
Here is the query:
SELECT Production.name, performance.performance_date, volunteer_duty.availability_date, customer.name "Customer volunteer", volunteer.volunteerid, membership.name "Member volunteer", membership.membershipid
FROM Customer, Membership, Volunteer, volunteer_duty, duty, performance_duty, performance, production
WHERE
Customer.customerId (+) = Membership.customerId AND
Membership.membershipId = Volunteer.membershipId AND
volunteer.volunteerid = volunteer_duty.volunteerid AND
duty.dutyid = volunteer_duty.dutyid AND
volunteer_duty.dutyId = performance_duty.dutyId AND
volunteer_duty.volunteerId = performance_duty.volunteerId AND
performance_duty.performanceId = performance.performanceId AND
Performance.productionId = production.productionId
The query seems reasonable, in terms of it having what appear to be the appropriate join conditions between all the tables. It's not clear to me what issue you are having with the results; it might help if you explained in more detail and/or showed a relevant subset of the data.
However, since you say there is some issue related to availability_date, my first thought is that you want to have some condition on that column, to ensure that a volunteer is available for a given duty on the date of a given performance. This might mean simply adding volunteer_duty.availability_date = performance.performance_date to the query conditions.
My more general recommendation is to start writing the query from scratch, adding one table at a time, and using ANSI join syntax. This will make it clearer which conditions are related to which joins, and if you add one table at a time hopefully you will see the point at which the results are going wrong.
For instance, I'd probably start with this:
SELECT production.name, performance.performance_date
FROM production
JOIN performance ON production.productionid = performance.productionid
If that gives results that make sense, then I would go on to add a join to performance_duty and run that query. Et cetera.
I suggest that you explicitly write JOINS, instead of using the WHERE-Syntax.
Using INNER JOINs, the query you are describing could look like:
SELECT *
FROM volunteer v
INNER JOIN volunteer_duty vd ON (v.volunteerId = vd.volunteerId)
INNER JOIN performance_duty pd ON (vd.dutyId = pd.dutyId AND vd.volunteerId = pd.volunteerId)
INNER JOIN performance p ON (pd.performanceId = p.performanceId)

How to automatically exclude items already visited in recommendation algorithm?

I'm now using Slope One for recommendations.
How can I exclude visited items from the result?
I can't simply filter out the visited ones with not in (visited_id_list), because that will have scalability issues for an old user!
I've come up with a solution without not in:
select b.property,count(b.id) total from propertyviews a
left join propertyviews b on b.cookie=a.cookie
left join propertyviews c on c.cookie=0 and b.property=c.property
where a.property=1 and a.cookie!=0 and c.property is null
group by b.property order by total;
Seriously, if you are using MySQL, look at 12.2.10.3. Subqueries with ANY, IN, and SOME
For example:
SELECT s1 FROM t1 WHERE s1 IN (SELECT s1 FROM t2);
This is available in all versions of MySQL I looked at, albeit that the section numbers in the manual are different in the older versions.
EDIT in response to the OP's comment:
OK ... how about something like SELECT id FROM t1 WHERE ... AND NOT id IN (SELECT seen_id FROM user_seen_ids where user = ? ). This form avoids having to pass thousands of ids in the SQL statement.
If you want to entirely avoid the "test against a list of ids" part of the query, I don't see how it is even possible in theory, let alone how you would implement it.
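A minimal sketch of the subquery form, run against in-memory SQLite rather than MySQL: the table names t1 and user_seen_ids follow the example above, while the column name user_id (used here in place of user), the NOT NULL constraints and the sample rows are assumptions. An equivalent NOT EXISTS query is included for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    -- Assumption: seen_id is NOT NULL, so NOT IN never compares with NULL.
    CREATE TABLE user_seen_ids (
        user_id INTEGER NOT NULL,
        seen_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, seen_id)
    );
    INSERT INTO t1 (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO user_seen_ids (user_id, seen_id) VALUES (7, 2), (7, 4), (8, 1);
    """
)

user_id = 7

# The seen ids stay in the database; only the user id is passed in.
not_in = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM t1
        WHERE id NOT IN (SELECT seen_id FROM user_seen_ids WHERE user_id = ?)
        ORDER BY id
        """,
        (user_id,),
    )
]

# The same filter written as a correlated NOT EXISTS.
not_exists = [
    row[0]
    for row in conn.execute(
        """
        SELECT id FROM t1
        WHERE NOT EXISTS (
            SELECT 1 FROM user_seen_ids s
            WHERE s.user_id = ? AND s.seen_id = t1.id
        )
        ORDER BY id
        """,
        (user_id,),
    )
]

print(not_in, not_exists)
```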

Resources