Correct syntax for using the div operator in relational algebra?

My question is also based on "How to find all pizzerias that serve every pizza eaten by people over 30?"
This is the database I am using for my queries:
https://class.stanford.edu/c4x/DB/RA/asset/pizzadata.html
The syntax for writing out relational algebra queries is based on http://www.cs.duke.edu/~junyang/ra/ .
My objective is to "Find all pizzerias that serve only pizzas eaten by people over 30."
I have the first couple of steps down, but I am struggling with finding pizzerias that have a complete match. Based on the last thread I looked at for this question, it seemed like the div operator was necessary to find the pizzerias that have a complete match with a given list of pizzas. However, I couldn't find such a div operator at http://www.cs.duke.edu/~junyang/ra/. Is there a workaround for the div operator? The other answers on that thread were too convoluted and didn't work.
This is my query so far
\project_{pizzeria, pizza} (
\select_{age > 30} (
Person \join Eats
)
\join
Serves
)
What I have here is a query that first locates people over 30 and joins those people with the pizzas they eat. Then I join with Serves so I can see the pizzerias that serve those pizzas. I understand I need a complete match - pizzerias that serve only what people over 30 eat. That is why I feel like the div operator would be very helpful.

Find all pizzerias that serve every pizza eaten by people over ...
and
Find all pizzerias that serve only pizzas eaten by people over ...
are different problems already.
The first should return all the pizzerias whose set of served pizzas is a superset of the set of pizzas eaten by anyone over ...
The second should return all the pizzerias whose set of served pizzas is a subset of the set of pizzas eaten by anyone over ...
Yet another problem would be to find the pizzerias whose set of served pizzas is equal to the set of pizzas eaten by anyone over ...
This aspect of the problem is probably the major reason why RA systems (including SQL processors) offer no direct support for the divide operator: relational divide comes in many flavours and it is hard to tell which one the user will be needing.
Rephrase the problem so that it states the equivalent, then translate that to RA or SQL, e.g. something like "find all the pizzerias Z such that there does not exist any pizza P and person PRS such that PRS is over ..., PRS eats P, and P is not served by Z".
Or something like "find all the pizzerias that are not in the set of those that do not serve some pizza that is eaten by any person over 30".
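The same rewrite-with-double-negation trick covers the asker's stated objective (pizzerias that serve only pizzas eaten by people over 30), using set difference instead of a dedicated divide operator. A sketch in the notation of the Duke RA tool, assuming its \diff operator for set difference and the Serves(pizzeria, pizza, price) schema from the pizza data:
\project_{pizzeria} (Serves)
\diff
\project_{pizzeria} (
    Serves
    \join
    (
        \project_{pizza} (Serves)
        \diff
        \project_{pizza} (\select_{age > 30} (Person \join Eats))
    )
)
The inner difference is the set of served pizzas not eaten by anyone over 30; joining it back to Serves and projecting gives the offending pizzerias, and subtracting those from all pizzerias leaves the ones that serve only pizzas eaten by people over 30.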

Related

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issue.
I have a huge list of street names in Indonesia (> 100k rows) stored in a database.
Each street name may have more than one word. For example: "Sudirman", "Gatot Subroto", or "Jalan Asia Afrika" are all legitimate street names.
I also have a bunch of texts (> 1 million rows) in databases, which I split into sentences. Now, the feature (function, to be exact) that I need to build is a test of whether there are street names inside a sentence or not, so just a true/false test.
I have tried to solve it by doing these steps:
a. Putting the street names into a Key,Value Hash
b. Split each sentences into words
c. Test whether words are in the hash
This is fast, but will not work with multi-word street names.
Another alternative that I thought of is to do these steps:
a. Split each sentences into words
b. Query the database with a LIKE statement (i.e. SELECT #### FROM street_table WHERE name LIKE '%word%')
c. If query returned a row, it means that the sentence contains street names
Now, this solution is going to be very I/O intensive.
So my question is: "What is the most efficient way to do this test?", regardless of the programming language. I do this in Python mainly, but any language will do as long as I can grasp the concepts.
============EDIT 1 =================
Will this be periodic?
Yes, I will call this feature/function with an interval of 1 minute. Each call will take at least 100 rows of text and test them against the street name database.
A simple solution would be to create a dictionary/multimap with first-word-of-street-name => full-street-name(s). As you iterate over each word in your sentence you look up potential street names and check whether you have a match (by looking at the next words).
This algorithm should be fairly easy to implement and should perform pretty well too.
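A minimal sketch of that multimap idea in Python (the sample names and the whitespace tokenization are only for illustration):
from collections import defaultdict

def build_index(street_names):
    index = defaultdict(list)
    for name in street_names:
        words = name.split()
        index[words[0]].append(words)   # first word -> full street name(s)
    return index

def contains_street_name(sentence, index):
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        for candidate in index.get(token, []):
            # check whether the following tokens complete the street name
            if tokens[i:i + len(candidate)] == candidate:
                return True
    return False

index = build_index(["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"])
print(contains_street_name("Kantor kami ada di Gatot Subroto nomor 12", index))  # True
In practice you would also want to normalize case and punctuation before tokenizing.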
Using NLP, you can determine the proper nouns in a sentence. Please refer to the link below.
http://nlp.stanford.edu/software/lex-parser.shtml
The Stanford parser is accurate in its calculation. Once you have the proper nouns, you can decide which approach to follow.
So you have a document and want to search whether it contains any of your list of street names?
Turbo Boyer-Moore is a good starting point for doing that.
Here is more information on Turbo Boyer-Moore.
But I strongly believe you will have to do something about the organisation of your list of street names. There should be some bucketed access to it, i.e. you can easily filter the street names:
Here an example:
Street name: Asia-Pacific-street
You can access your list by:
A (getting a starting point for all that start with an A)
AS (getting a starting point for all that start with an AS)
and so on...
I believe you should have lots of buckets for that, at least 26 (first letter) * 26 (second letter)
more information about bucketing
The Aho-Corasick algorithm could be pretty useful. One of its advantages is that its run time is independent of how many words you are searching for (it depends only on the length of the text you are searching through). It will be especially useful if your list of street names does not change frequently.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
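A short sketch of how this could look with the third-party pyahocorasick package, assuming it is installed; the automaton is built once from the street names and reused for every sentence:
import ahocorasick

def build_automaton(street_names):
    automaton = ahocorasick.Automaton()
    for name in street_names:
        automaton.add_word(name, name)
    automaton.make_automaton()
    return automaton

def contains_street_name(sentence, automaton):
    # iter() yields (end_index, matched_name) for every occurrence in the sentence
    return any(True for _ in automaton.iter(sentence))

automaton = build_automaton(["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"])
print(contains_street_name("Lokasi kantor di Jalan Asia Afrika, Bandung", automaton))  # True
Note that this matches substrings anywhere, so you may want to check word boundaries around each hit.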

Multi Attribute Matching of Profiles

I am trying to solve a problem for a dating site. Here is the problem:
Each user of the app will have some attributes - like the books he reads, the movies he watches, music, TV shows, etc. These are defined top-level attribute categories. Each of these categories can have any number of values, e.g. in books: Fountain Head, Love Story, ...
Now, I need to match users based on profile attributes. Here is what I am planning to do:
Store the data with reverse indexing, i.e. each of Fountain Head, Love Story, etc. is an index key to the set of users with that attribute.
When a new user joins, get the attributes of this user, find the index keys for this user, get all the users for those keys, and bucket sort (or radix sort or similar) them on the basis of how many times each user appears in this merged list.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes for every interest. That is, similarity in different fields has different weights.
Like you said, you can keep, for each interest (like a book, a song, etc.), a list of the people who have it in their profile. Then, say you want to find matches for guy g (sketched in Python below):
# people_by_interest maps each interest to the list of users who have it;
# match_weight maps each interest to its weight (books vs. music, etc.).
def find_matches(g, people_by_interest, match_weight):
    scores = {}
    for interest in g.interests:
        for p in people_by_interest.get(interest, []):
            if not compatible_preferences(p, g):  # assumed helper: skip mismatching sexual preferences
                continue
            # p enters g's match list on the first shared interest, then accumulates weight
            scores[p] = scores.get(p, 0) + match_weight[interest]
    # sort g's match list based on score, best match first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, an interest of one person may result in outright rejection by the other no matter what other matching interests exist (take two very similar people where one loves and the other hates Twilight, for example).

Algorithm to find all possible results using text search

I am currently making a web crawler to crawl all the possible characters on a video game site (Final Fantasy XIV Lodestone).
My interface for doing this is using the site's search. http://lodestone.finalfantasyxiv.com/rc/search/characterForm
If the search finds more than 1000 characters it only returns the first 1000. The text search does not seem to understand either *, ? or _.
If I search for the letter a, I get all the characters that have an a in their names rather than all characters whose names start with a.
I'm guessing I could do searches for all character combinations: aa, ab, ba, etc. But that doesn't guarantee that:
I will never get more than 1000 results.
It also doesn't seem very efficient, as many characters would appear multiple times and would need to be filtered out.
I'm looking for an algorithm on how to construct my search text.
Considered as a practical problem, have you asked Square Enix for some kind of API access or database dump? They might prefer this to having you scrape their search results.
Considered purely in the abstract, it's not clear that any search strategy will succeed in finding all the results. For suppose there were a character called "Ar": how would you find it? If you search for "ar", the results only go as far as Ak—. If you search for "a" or "r", the situation is even worse. Any other search fails to find this character. (In practice you might be able to find "Ar" by guessing its world and/or main skill, but in theory there might be so many characters with that skill on that world that this remains ineffective.)
The main question here is what you are planning to do with all those characters. What is the purpose of your program? Putting that aside, you can search for a single letter and filter by both main skill and world (using a double loop). It is highly unlikely that you will ever have more than 1000 hits that way for any consonant. If you want to search for names starting with a vowel, then use a bigraph vowel-other_letter in a loop that iterates other_letter from A to Z.
An additional optimization is to try to guess at what page the list for the needed letter will start. If you have the total number of pages (TNOP), then your list will start somewhere near page TNOP * LETTER / 27, where LETTER is the order of the letter in the alphabet.
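A tiny helper encoding that estimate (the function and argument names are hypothetical):
def estimated_start_page(total_pages, letter):
    order = ord(letter.lower()) - ord('a') + 1   # 'a' -> 1, 'b' -> 2, ...
    return max(1, total_pages * order // 27)

print(estimated_start_page(270, 'c'))  # roughly page 30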

Categorizing Words and Category Values

We were set an algorithm problem in class today, as an "if you figure out a solution you don't have to do this subject" challenge. So of course, we all thought we would give it a go.
Basically, we were provided a DB of 100 words and 10 categories. There is no given mapping between the words and the categories. So it's basically a list of 100 words and 10 categories.
We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorithmically.
E.g. one of the words is "fishing" and one of the categories is "sport" --> so this word would go into that category. There is some overlap between words and categories such that some words could go into more than one category.
If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.
Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?
Even a keyword DB or something might be helpful ? Anyone know of any free ones?
First of all you need sample text to analyze, to get the relationships between words.
A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
A different approach would be naive Bayes text categorization. Sample texts with assigned categories are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category; see Bayesian spam filtering. I don't know how well that works with single words.
A really poor answer (it demonstrates no "understanding") - but as a crazy stab you could hit Google (through code) for, for example, "+Fishing +Sport", "+Fishing +Cooking", etc. (i.e. cross join each word and category) - and let the Google fight win! I.e. the combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

class Program {
    static void Main() {
        string[] words = { "fish", "ball", "hat", "trousers", "snowball", "tornado" };
        string[] categories = { "sport", "fashion", "weather" };
        using (WebClient client = new WebClient()) {
            foreach (string word in words) {
                var bestCategory = categories.OrderByDescending(
                    cat => Rank(client, word, cat)).First();
                Console.WriteLine("{0}: {1}", bestCategory, word);
            }
        }
    }

    static int Rank(WebClient client, string word, string category) {
        string s = client.DownloadString("http://www.google.com/search?q=%2B" +
            Uri.EscapeDataString(word) + "+%2B" +
            Uri.EscapeDataString(category));
        // scrape the hit count out of the "of about <b>N</b>" fragment of the results page
        var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");
        int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
        Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
        return rank;
    }
}
Maybe you are all making this too hard.
Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.
Or not...depends on the teacher.
So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.
In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.
Another, slightly more outside-the-box option would be to make use of crowd sourcing. Consider the following:
1. Start by more or less randomly assigning name-value pairs.
2. Output the results.
3. Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
4. Input the results of the AMT evaluation back into the system along with the random assignments.
5. If everything was approved, then we are done.
6. Otherwise, retain the correct hits and process them to see if any pattern can be established, and generate a new set of name-value pairs.
7. Return to step 3.
Granted, this would entail some financial outlay, but it might also be one of the simplest and most accurate versions of the data you are going to get on a fairly easy basis.
You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
EDIT:
Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.
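A rough sketch of that keyword-density idea, assuming you have already pulled the article text per category into a dict (category -> concatenated text; the names here are hypothetical):
def best_category(word, category_articles):
    def density(text):
        # fraction of tokens in the category's articles that equal the word
        tokens = text.lower().split()
        return tokens.count(word.lower()) / max(1, len(tokens))
    return max(category_articles, key=lambda cat: density(category_articles[cat]))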
This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.
Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.
I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.
The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?
I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.
Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?
Fire this teacher.
The only solution to this problem is to already have the solution to the problem. I.e. you need a table of keywords and categories to build your code that puts keywords into categories.
Unless, as you suggest, you add a system which "understands" English. This is the person sitting in front of the computer, or an expert system.
If you're building an expert system and don't even know it, the teacher is not good at giving problems.
Google is forbidden, but they have almost a perfect solution - Google Sets.
Because you need to understand the semantics of the words, you need external data sources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or on linked pages.
Yeah I'd go for the wordnet approach.
Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem.
Hope this helps,
X.
Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)
Since you say you can use external data, I would suggest using WordNet and its links between words. For instance, using WordNet:
S: (n) fishing, sportfishing (the act of someone who fishes as a diversion)
    direct hypernym => S: (n) outdoor sport, field sport (a sport that is played outdoors)
        direct hypernym => S: (n) sport, athletics (an active diversion requiring physical exertion and competition)
What we see here is a chain of relationships between words. The term fishing relates to outdoor sport, which relates to sport.
Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" as "sport" - say, based on the linear distance of the word chain, or the number of occurrences, et al. (It should be trivial to find resources on how to construct similarity measures using WordNet. When the prof says "not to use Google", I assume he means programmatically and not as a means to get information to read up on!)
As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/
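As an illustration (shown in Python with NLTK's WordNet corpus rather than the C# library above, but the same idea ports over), one simple measure is the maximum path similarity between any synset of the word and any synset of the category; this is a sketch, assuming the nltk package and its wordnet corpus are installed:
from nltk.corpus import wordnet as wn

def best_category(word, categories):
    scores = {}
    for cat in categories:
        best = 0.0
        for s1 in wn.synsets(word):
            for s2 in wn.synsets(cat):
                # path_similarity can return None across parts of speech
                best = max(best, s1.path_similarity(s2) or 0.0)
        scores[cat] = best
    return max(scores, key=scores.get)

print(best_category("fishing", ["sport", "fashion", "weather"]))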
My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)
That might be considered cheating, though.
Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong...
I would do a few passes. First, query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with; 5 is a good start I think) and scan the text. In particular, I'll search for the 10 categories. If a category appears more than x times (x again being some threshold you need to experiment with), it's a match.
Based on that x threshold (i.e. how many times a category appears in the text) and how many of the top y pages it appears in, you can assign a weight to a word-category pair.
For better accuracy you can then do another pass with those non-Google search engines with the word-category pair (with an AND relationship) and apply the number of resulting pages to the weight of that pair. Then simply assume the word-category pair with the highest weight is the right one (assuming you'll even have more than one option). You can also assign a word to multiple categories if the weights are close enough (a z threshold maybe).
Based on that you can introduce any number of words and any number of categories. And You'll win your challenge.
I also think this method is good to evaluate the weight of potential adwords in advertising. but that's another topic....
Good luck
Harel
Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.
Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse than to start by reading existing research and benchmarks.
Apart from Google there exist other "encyclopedic" datasets you can build off of, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English language Wikipedia.
Be creative. There is other data out there besides Google.
My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.
My naive approach:
Create a huge text file like this (read the article for inspiration)
For every word, scan the text and, whenever you match that word, count the 'categories' that appear within N (a maximum, i.e. a radius) positions to the left and right of it.
The word is likely to belong to the category with the greatest counter.
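A rough sketch of that windowed counting, with a hypothetical token list for the corpus and n as the radius:
from collections import Counter

def classify(words, categories, corpus_tokens, n=10):
    result = {}
    for word in words:
        counts = Counter()
        for i, token in enumerate(corpus_tokens):
            if token == word:
                # look n tokens to the left and right of each occurrence
                window = corpus_tokens[max(0, i - n):i + n + 1]
                counts.update(c for c in categories if c in window)
        result[word] = counts.most_common(1)[0][0] if counts else None
    return result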
Scrape delicious.com and search for each word, looking at collective tag counts, etc.
Not much more I can say about that, but delicious is old, huge, incredibly heavily tagged and contains a wealth of current, relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis for scraping.
The knowledge is in the tags.
As you don't need to attend the subject if you solve this 'riddle', it's not supposed to be easy, I think.
Nevertheless, I would do something like this (told in a very simplistic way):
Build up a neural network to which you give some input (an (e)book, or several (e)books)
=> no Google needed
This network classifies words (neural networks are great for 'unsure' classification). I think you may simply learn which word belongs to which category from the co-occurrences in the text ('fishing' is likely to be mentioned near 'sports').
After some training of the neural network, it should "link" the words to the categories for you.
You might be able to put the WordNet database to use: create some metric to determine how closely linked two words (the word and the category) are, and then choose the best category to put the word in.
You could implement a learning algorithm to do this using a monte carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match." If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations since it knows it doesn't belong there. This will get very accurate results.
This will work for the 100 word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The google method would probably work the best, since google's already done a ton of work on it, but barring that you could, for example, pull data from your facebook account using the facebook apis and try to figure out which words are statistically more likely to appear with previously categorized words.
Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ascii text for the name :P

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when you send an email to someone, which is all well and good until the list gets really big and you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers, does anyone have any input? Other numbers are also welcome if you can give a reason why you think they're better than mine.
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
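For reference, the proposed scheme written out literally (the thresholds are the "feel about right" numbers above, not a recommendation):
from datetime import datetime, timedelta

def recency_points(last_contact, now=None):
    age = (now or datetime.now()) - last_contact
    if age <= timedelta(days=1):   return 5
    if age <= timedelta(days=3):   return 4
    if age <= timedelta(days=7):   return 3
    if age <= timedelta(days=30):  return 2
    if age <= timedelta(days=180): return 1
    return 0

def frequency_points(count):
    for threshold, points in [(25, 5), (15, 4), (10, 3), (5, 2), (2, 1)]:
        if count >= threshold:
            return points
    return 0

def score(last_contact, count):
    # sort the auto-complete candidates by this combined score, descending
    return recency_points(last_contact) + frequency_points(count)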
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move it to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self-organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. That means you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed - this is what gives us good performance statistically.
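A minimal sketch of the tweaked heuristic from the edit, using a plain Python list and dict instead of a linked list: on each selection, bump the address's count and move it to the front of the group of entries with the same count.
def touch(addresses, counts, address):
    counts[address] = counts.get(address, 0) + 1
    addresses.remove(address)            # assumes the address is already in the list
    i = 0
    while i < len(addresses) and counts[addresses[i]] > counts[address]:
        i += 1                           # skip entries with a strictly higher count
    addresses.insert(i, address)         # front of its own weight group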
This kind of thing seems similar to what Firefox does when hinting which site you are typing.
Unfortunately I don't know exactly how Firefox does it. A point system seems good as well; maybe you'll just need to balance your points :)
I'd go for something similar to:
NoM = Number of Mails
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write to during the last month (this could be changed) will have 0 points. You could then sort those by NoM sent in total (since they are on the contact list :). They will be shown after the contacts with points > 0.
It's just an idea; anyway, the point is to give different importance to the most recently and the most frequently mailed contacts.
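Written out as code, with the three counts as inputs (argument names are just illustrative):
def score(sent_today, sent_last_week, sent_last_month):
    return sent_today + 0.5 * sent_last_week / 7 + (1.0 / 3.0) * sent_last_month / 30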
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
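The paper's exact formulation isn't reproduced here, but a hedged sketch of how such a blended score can work (an LRFU-style weighting, shown only to illustrate the lambda knob) is:
def combined_score(reference_ages, lam):
    # lam = 0: every past reference counts equally (frequency / LFU-like)
    # lam near 1: the most recent reference dominates (recency / LRU-like)
    return sum(0.5 ** (lam * age) for age in reference_ages)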
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter on each use, but by some larger-than-one value, like 10 (to add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo@bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.
