How to automatically exclude items already visited in recommendation algorithm?

How to automatically exclude items already visited in recommendation algorithm? - algorithm

I'm now using slope One for recommendation.
How to exclude visited items from result?
I can't do it simply by not in (visited_id_list) to filter those visited ones because it will have scalability issue for an old user!
I've come up with a solution without not in：
select b.property,count(b.id) total from propertyviews a
left join propertyviews b on b.cookie=a.cookie
left join propertyviews c on c.cookie=0 and b.property=c.property
where a.property=1 and a.cookie!=0 and c.property is null
group by b.property order by total;

Seriously, if you are using MySQL, look at 12.2.10.3. Subqueries with ANY, IN, and SOME
For example:
SELECT s1 FROM t1 WHERE s1 IN (SELECT s1 FROM t2);
This is available in all versions of MySQL I looked at, albeit that the section numbers in the manual are different in the older versions.
EDIT in response to the OP's comment:
OK ... how about something like SELECT id FROM t1 WHERE ... AND NOT id IN (SELECT seen_id FROM user_seen_ids where user = ? ). This form avoids having to pass thousands of ids in the SQL statement.
If you want to entirely avoid the "test against a list of ids" part of the query, I don't see how it is even possible in theory, let alone how you would implement it.

Related

Most efficient way to select in bulk from a multi million records table

I'm interested in getting and doing some processing on all the entities A returned by a query of the form:
SELECT * FROM A a WHERE a.id not in (select b.id from B)
Where A is a "complex" entity in the sense that it inherits (InheritanceTyped.Joined) from other entities and that several of its attributes are other entities (#OneToOne and #ManyToOne).
The query itself takes a few minutes to yield results hence my desire to execute it as few as possible.
Here are the different approaches i tried to get those A elements as efficiently as possible :
Pagination using setFirstResult/ setMaxResults
Do the job, but pretty slowly as the query seems to be executed everytime.(around 50 elements processed/sec)
Getting IDs first, A objects next
Keeping all the IDs in memory is doable, so I execute once
SELECT a.id FROM A a WHERE a.id not in (select b.id from B)
and then select a from A a WHERE a.id= :id, which goes relatively fast as the id column is indexed. This is currently the solution that is the most efficient with (around 100 elements processed/sec)
Using ScollableResults I had high hope with this solution, but it ended up being slower than other alternatives, leaving me at around 20 elements processed/sec ...
As a neophyte, I don't know what other options to investigate, or if I did something wrong in any of my attempts.
Hence my questions:
Are there (factually) other approaches to efficiently tackle this kind of problem ?
Is it normal that ScrollableResults performed so poorly ? Is there something I should have paid attention to while implementing this solution?
EDIT:
Here's the execution plan

Oracle duplicate field but still correct

So i built a query for my leadership team that was correct, but i dont understand why oracle gave me the correct answer.
i have 3 tables that i needed to get data out of in order to get the total billed amount.
Here is my query (please forgive me, my 2nd post and im not sure how to properly format my querys)
select b.total_amount_billed as billed from t1.billing_information b
where b.billing_no in
(select h.billing_no
from t1.res_history h where h.res_seq_no in
(Select r.reservation_seq_no
from t1.res r where r.customer_order_no in ('THO40000') ))
so in the deepest select, i take the the sequence number where my customer order number was THO40000, this query returns 2 sequence numbers.
the second sub query returns the billing numbers for my order from the history table where the sequence number match, in this case for this order they both use the same billing number, 312000.
the final select, returns my total billed amount where it matched my billing numbers it found, in my case $110.
the query works, but what i dont understand is why is it not duplicated? why does it not return 110, for each time it found 312000, giving me 2 records of 110? the billing number is a PK in the billing_information table. im not sure why it worked without me using the distinct keyword on the query for the billing number.
anyway thanks for the help, ill do my best to explain if you have questions!

You are being saved because you used IN to get the billing_no values to use, rather than an INNER JOIN between the two tables using b.billing_no = h.billing_no. A join would have duplicated the records, but your IN query is essentially this:
select b.total_amount_billed as billed
from t1.billing_information b
where b.billing_no in (312000, 312000);
If there is a single row in billing_information having billing_no equal to 312000, it is in the list, so the WHERE condition is true and it is included in the results. The fact that it is in the list twice doesn't make the IN condition "more true".

Merging Group by and order by in Oracle

This is what I have and below what I want to do. d. is a table, and act is a view that consists of about 50 combined tables:
SELECT d.extensionaspect1 AS "Exception ID",
d.causedat AS "Date",
max(act.date_) AS "Causedate",
act.prodsteppath AS "Production Step",
--I'm calling other selections on d, but i'm not going to incude
--them for the sake of clarity
FROM pasx.deviationevent d,
pasx.vtemp_prodsteppath act
--There's more here, but I'm not going to include them for the sake of clarity
WHERE act.date_ < d.causedat
--as before, there's a large amount of AND clauses here that make
--sure that I get the timestamps I want. This is a pretty complex query
GROUP BY d.causedat,
ORDER BY d.extensionaspect1;
Running like this gives me ORA-00936; missing expression. I'm very new at this, so I'm probably fundamentally misunderstanding how to use select max(act.date_) and use it to find the timestamp that occurred just before d.causedate. act.date_ contains all the possible timestamps that could lead to throwing an exception, but only the one before the timestamps in d.causedat are relevant to me.
vtemp_prodsteppath is a view and contains no order or group by's. How do I join these two tables (d and act) together? What expression am I missing?

Oracle query with multiple tables

I am trying to display volunteer information with duty and what performance is allocated.
I want to display this information. However, when I run the query, it did not gather the different date from same performance. And also availability_date is mixed up. Is it right query for it? I am not sure it is right query.
Could you give me some feedback for me?
Thanks.
Query is here.
SELECT Production.name, performance.performance_date, volunteer_duty.availability_date, customer.name "Customer volunteer", volunteer.volunteerid, membership.name "Member volunteer", membership.membershipid
FROM Customer, Membership, Volunteer, volunteer_duty, duty, performance_duty, performance, production
WHERE
Customer.customerId (+) = Membership.customerId AND
Membership.membershipId = Volunteer.membershipId AND
volunteer.volunteerid = volunteer_duty.volunteerid AND
duty.dutyid = volunteer_duty.dutyid AND
volunteer_duty.dutyId = performance_duty.dutyId AND
volunteer_duty.volunteerId = performance_duty.volunteerId AND
performance_duty.performanceId = performance.performanceId AND
Performance.productionId = production.productionId
--Added image--
Result:

The query seems reasonable, in terms of it having what appear to be the appropriate join conditions between all the tables. It's not clear to me what issue you are having with the results; it might help if you explained in more detail and/or showed a relevant subset of the data.
However, since you say there is some issue related to availability_date, my first thought is that you want to have some condition on that column, to ensure that a volunteer is available for a given duty on the date of a given performance. This might mean simply adding volunteer_duty.availability_date = performance.performance_date to the query conditions.
My more general recommendation is to start writing the query from scratch, adding one table at a time, and using ANSI join syntax. This will make it clearer which conditions are related to which joins, and if you add one table at a time hopefully you will see the point at which the results are going wrong.
For instance, I'd probably start with this:
SELECT production.name, performance.performance_date
FROM production
JOIN performance ON production.productionid = performance.productionid
If that gives results that make sense, then I would go on to add a join to performance_duty and run that query. Et cetera.

I suggest that you explicitly write JOINS, instead of using the WHERE-Syntax.
Using INNER JOINs the query you are describing, could look like:
SELECT *
FROM volunteer v
INNER JOIN volunteer_duty vd ON(v.volunteerId = vd.colunteerId)
INNER JOIN performance_duty pd ON(vd.dutyId = pd.dutyId AND vd.volunteerId = pd.colunteerId)
INNER JOIN performance p ON (pd.performanceId = p.performanceId)

How to implement tag system

I was wondering what the best way is to implement a tag system, like the one used on SO. I was thinking of this but I can't come up with a good scalable solution.
I was thinking of having a basic 3 table solution: having a tags table, an articles tables and a tag_to_articles table.
Is this the best solution to this problem, or are there alternatives? Using this method the table would get extremely large in time, and for searching this is not too efficient I assume. On the other hand it is not that important that the query executes fast.

I believe you'll find interesting this blog post: Tags: Database schemas
The Problem: You want to have a database schema where you can tag a
bookmark (or a blog post or whatever) with as many tags as you want.
Later then, you want to run queries to constrain the bookmarks to a
union or intersection of tags. You also want to exclude (say: minus)
some tags from the search result.
“MySQLicious” solution
In this solution, the schema has got just one table, it is denormalized. This type is called “MySQLicious solution” because MySQLicious imports del.icio.us data into a table with this structure.
Intersection (AND)
Query for “search+webservice+semweb”:
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"
Union (OR)
Query for “search|webservice|semweb”:
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"
Minus
Query for “search+webservice-semweb”
SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"
“Scuttle” solution
Scuttle organizes its data in two tables. That table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table.
Intersection (AND)
Query for “bookmark+webservice+semweb”:
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3
First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).
Union (OR)
Query for “bookmark|webservice|semweb”:
Just leave out the HAVING clause and you have union:
SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
Minus (Exclusion)
Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2
Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
“Toxi” solution
Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used by wordpress.
The queries are quite the same as in the “scuttle” solution.
Intersection (AND)
Query for “bookmark+webservice+semweb”
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3
Union (OR)
Query for “bookmark|webservice|semweb”
SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
Minus (Exclusion)
Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2
Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.

Nothing wrong with your three-table solution.
Another option is to limit the number of tags that can be applied to an article (like 5 in SO) and add those directly to your article table.
Normalizing the DB has its benefits and drawbacks, just like hard-wiring things into one table has benefits and drawbacks.
Nothing says you can't do both. It goes against relational DB paradigms to repeat information, but if the goal is performance you may have to break the paradigms.

Your proposed three table implementation will work for tagging.
Stack overflow uses, however, different implementation. They store tags to varchar column in posts table in plain text and use full text indexing to fetch posts that match the tags. For example posts.tags = "algorithm system tagging best-practices". I am sure that Jeff has mentioned this somewhere but I forget where.

The proposed solution is the best -if not the only practicable- way I can think of to address the many-to-many relationship between tags and articles. So my vote is for 'yes, it's still the best.' I'd be interested in any alternatives though.

If your database supports indexable arrays (like PostgreSQL, for example), I would recommend an entirely denormalized solution - store tags as an array of strings on the same table. If not, a secondary table mapping objects to tags is the best solution. If you need to store extra information against tags, you can use a separate tags table, but there's no point in introducing a second join for every tag lookup.

I would like to suggest optimised MySQLicious for better performance.
Before that the drawbacks of Toxi (3 table) solution is
If you have millions of questions, and it has 5 tags in each, then there will be 5 million entries in tagmap table. So first we have to filter out 10 thousand tagmap entries based on tag search then again filter out matching questions of those 10 thousand. So while filtering out if the artical id is simple numeric then it is ok, but if it is kind of UUID (32 varchar) then filtering out needs larger comparison though it is indexed.
My solution:
Whenever new tag is created, have counter++ (base 10), and convert that counter into base64. Now each tag name will have base64 id. and pass this id to UI along with name.
This way you will be having maximum of two char id till we have 4095 tags created in our system. Now concatenate these multiple tags into each question table tag column. Add delimiter as well and make it sorted.
So table looks like this
While querying, query on id instead of real tag name.
Since it is SORTED, and condition on tag will be more efficient (LIKE '%|a|%|c|%|f|%).
Note that single space delimiter is not enough and we need double delimiter to differentiate tags like sql and mysql because LIKE "%sql%" will return mysql results as well. Should be LIKE "%|sql|%"
I know the search is non indexed but still you might have indexed on other columns related to article like author/dateTime else will lead to full table scan.
Finally with this solution, no inner join required where million records have to be compared with 5 millions records on join condition.

CREATE TABLE Tags (
tag VARHAR(...) NOT NULL,
bid INT ... NOT NULL,
PRIMARY KEY(tag, bid),
INDEX(bid, tag)
)
Notes:
This is better than TOXI in that it does not go through an extra many:many table which makes optimization difficult.
Sure, my approach may be slightly more bulky (than TOXI) due to the redundant tags, but that is a small percentage of the whole database, and the performance improvements may be significant.
It is highly scalable.
It does not have (because it does not need) a surrogate AUTO_INCREMENT PK. Hence, it is better than Scuttle.
MySQLicious sucks because it cannot use an index (LIKE with leading wild card; false hits on substrings)
For MySQL, be sure to use ENGINE=InnoDB in order to get 'clustering' effects.
Related discussions (for MySQL):
many:many mapping table optimization
ordered lists

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio