Why do I have to use the key order in my queries, but sort in contain or in Table? It confuses me a little, and it isn't covered in the French docs; maybe it is in the English docs?
Does anyone know the reason behind these naming choices?
I will start by quoting an answer from the English Language Stack Exchange:
"Sort" and "order" are generally interchangeable, and in your example, they are perfect synonyms. The one difference is that "order" can only be used for things that actually have a pre-defined ordering, like alphabetical or numeric. I could ask you to sort buttons by color, but asking you to order them by color wouldn't make much sense.
It's more natural to use the word "order" in your queries because table columns are meant to store values with a pre-defined ordering, like dates, IDs, or last names.
"Sort" is more generic, and that is my theory for why it's used in other parts of the framework.
I'm not sure whether this was planned or just the natural development of the framework; you could try asking one of the core contributors about the decision.
I think "sort" is used primarily for pagination and "order" is used for the initial query perhaps. In standard MYSQL there is no "SORT" so it's just how Cake decides to differentiate the two.
How are keyword clouds constructed?
I know there are a lot of NLP methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use NLP methods to detect proper nouns, people, places, and possibly subjects. This will be a very large list given a sufficiently long article, but I assume I can winnow the list down by comparing articles in some way. How to do this properly is what I am confused about.)
Each item can have a list of keywords, but how are the keywords picked so that they are neither overly specific nor overly general across items?
For example, trivially, "the" could be a keyword shared by a lot of items,
while "supercalifragilistic" might appear in only one.
I suppose I could use a heuristic: keep a word if it appears in roughly n% of the items, where n is sufficiently small but still returns a nice sublist (say 5% of 1000 articles is 50, which seems reasonable). However, my issue with this approach is that two entirely different sets of items will most likely differ in how interrelated their items are, and I'm throwing that information away.
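To make that heuristic concrete, here is a minimal sketch of the document-frequency cutoff (the toy items and the cutoff fractions are assumptions, not anything from a real system):

from collections import Counter

# Hypothetical input: each item is a list of candidate keywords.
items = [
    ["the", "supercalifragilistic", "budget", "senate"],
    ["the", "senate", "vote"],
    ["the", "budget", "deficit"],
]

# For each word, count the number of items it appears in (document frequency).
doc_freq = Counter()
for keywords in items:
    doc_freq.update(set(keywords))

n_items = len(items)
min_frac, max_frac = 0.05, 0.80  # arbitrary cutoffs: not too rare, not too common

kept = {
    word
    for word, df in doc_freq.items()
    if min_frac <= df / n_items <= max_frac
}
print(kept)  # drops "the" (in every item), keeps the rest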
This is very unsatisfying.
I feel that, given the popularity of keyword clouds, a solution must already exist. However, I don't want to use a library, because I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is OK, by the way, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between their documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but I'm still considering it.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the IDF vector, as you say determined a priori, estimated from Wikipedia). A popular alternative is the TextRank algorithm, based on PageRank, which has an off-the-shelf implementation in Gensim.
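If it helps to see the mechanics, here is a rough sketch using scikit-learn's TfidfVectorizer rather than a prebuilt extractor (the toy documents and the top-3 cutoff are assumptions, and here the IDF is estimated from your own collection instead of Wikipedia):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; in practice these would be your articles.
docs = [
    "the senate passed the budget after a long budget debate",
    "the team won the championship after a dramatic final",
    "the new budget cuts research funding at the university",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)      # rows = documents, columns = vocabulary
vocab = vectorizer.get_feature_names_out()

top_k = 3
for i in range(tfidf.shape[0]):
    row = tfidf[i].toarray().ravel()
    best = row.argsort()[::-1][:top_k]      # highest tf-idf scores first
    print(f"doc {i}:", [vocab[j] for j in best if row[j] > 0])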
If you decide to roll your own implementation, note that all of these algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is remove stopwords that you know can never be keywords (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can detect them). You get better results if you run coreference resolution and substitute pronouns with what they refer to. There are endless options for improvement.
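A rough sketch of that preprocessing step with spaCy (assumes the en_core_web_sm model is installed; the part-of-speech whitelist is just one reasonable choice):

import spacy

# One-time model download: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB"}   # parts of speech worth keeping

def candidate_keywords(text):
    doc = nlp(text)
    # Named entities are good multiword candidates.
    entities = [ent.text for ent in doc.ents]
    # Single-token candidates: drop stopwords, punctuation, unwanted POS.
    tokens = [
        tok.lemma_.lower()
        for tok in doc
        if tok.pos_ in KEEP_POS and not tok.is_stop and tok.is_alpha
    ]
    return entities + tokens

print(candidate_keywords("Angela Merkel discussed the federal budget in Berlin."))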
First and foremost, no, I'm not asking you to tell me how Google is built in two sentences. What I'm asking is slightly different. I have a database filled with textual data that users input, and we also give them the ability to search that data later. The problem is that we currently do a simple full-text search and return the results in no particular order. I'd like to return the results based on a weight: how often users have typed in a given term. As an example, users might type in the following:
"foo"
"bo"
"bob"
"bob"
"bob"
"bo"
"foo2"
Based on the above data, a search on 'b' should return bo and bob, but bob should be listed first, since it is the most relevant based on usage.
I'm curious what algorithm I should research to build this effectively. Are there any books on common web algorithms (I know this isn't just web specific) that explain this?
There are various search algorithms out there.
Here's a little guidepost to some of them:
http://en.wikipedia.org/wiki/Search_algorithm
I'm not an expert in this area myself, so I can't recommend a specific one.
I don't know how you'd do this in the context of a database, but here's one way to go about it:
Use a trie to store each unique word and a count of how often it was used. When your user starts typing, the trie lets you efficiently grab all the strings with the given prefix, which you can then sort using the words' counts as keys.
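A minimal sketch of that idea, using the example input from the question (a plain dict-based trie, nothing database-specific):

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # how many times this exact word was entered

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        # Most frequently used words first.
        return sorted(results, key=lambda wc: wc[1], reverse=True)

    def _collect(self, node, word, results):
        if node.count:
            results.append((word, node.count))
        for ch, child in node.children.items():
            self._collect(child, word + ch, results)

trie = Trie()
for term in ["foo", "bo", "bob", "bob", "bob", "bo", "foo2"]:
    trie.add(term)

print(trie.suggest("b"))   # [('bob', 3), ('bo', 2)]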
We use Apache Solr for our search.
In that technology, I think this is normally done via boosting: index your data, and then every day or so boost individual documents based on user queries.
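To make "boosting" a bit more concrete, here is one common pattern sketched as a plain HTTP query: keep a numeric popularity field per document (updated from your query logs) and apply it as a query-time boost with the edismax parser. The core name, field names, and boost function below are assumptions; check the Solr documentation for the parameters your version supports.

import requests

SOLR = "http://localhost:8983/solr/contacts/select"   # hypothetical core

def search(user_input):
    params = {
        "q": f"name:{user_input}*",        # simple prefix match on a 'name' field
        "defType": "edismax",
        # Multiply relevance by a function of how often the term was used;
        # 'popularity' is a numeric field we keep updated from query logs.
        "boost": "log(sum(popularity,1))",
        "fl": "name,popularity,score",
        "rows": 10,
        "wt": "json",
    }
    return requests.get(SOLR, params=params).json()["response"]["docs"]

print(search("b"))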
My application may have strings from different alphabets/languages in a single list. I can't seem to find any information on the correct method for sorting these, or any indication that ICU supports this functionality.
Example List:
Apple
яблоко
μήλο
Baby
βρέφος
ребенок
There is no sensible way to do this well. There is no universal sort for all languages, even within the same alphabet. Different languages (cultures, basically) have come up with different collation rules for how words should be sorted.
The only way to do this consistently at all, I think, is to use plain old codepoint sorting (e.g. in Java, String.compareTo).
You could come up with some heuristics, depending on what your data represents. You could group the strings based on guesses about the alphabet and language, and then use locale-specific sorting for each group. But you'd have to do this the hard way (code it yourself), I think, because you would guess differently depending on the terms (e.g. is 'mar' the English verb or the Spanish noun?). It's conceivable that you would end up with a worse result than the naive Unicode numerical sort, in terms of unpredictable "errors".
As with anything else, it depends on how much you can afford to put into the solution, and what kind of performance you need.
This suggestion is not the answer you're looking for: if there's any way to identify the locale when initially storing the strings, you should do so, and record it as part of the string's metadata. Then you won't have this problem.
With all the caveats above, there is one "standard" universal multilingual sort: the Unicode Collation Algorithm (UCA), which is NOT codepoint order. From a cursory glance at its documentation, ICU seems to handle the mixture of UCA and local preferences.
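For illustration, a small sketch comparing plain codepoint order with UCA collation via the PyICU bindings (assumes the third-party PyICU package; the root locale gives the default UCA ordering, and a specific locale applies that language's tailorings):

from icu import Collator, Locale

words = ["Apple", "яблоко", "μήλο", "Baby", "βρέφος", "ребенок"]

# Plain codepoint order: all Latin first, then Greek, then Cyrillic.
print(sorted(words))

# Default UCA ordering (root locale): still script by script, but
# linguistically saner within each script.
root_collator = Collator.createInstance(Locale.getRoot())
print(sorted(words, key=root_collator.getSortKey))

# Locale-specific tailoring, e.g. Greek conventions.
greek_collator = Collator.createInstance(Locale("el_GR"))
print(sorted(words, key=greek_collator.getSortKey))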
As mentioned by @Zac, there is no universal sort. A codepoint sort will be consistent, but it may not be what the user expects.
So you should probably use the preferred sort order for the user's selected locale. Any code points not defined in that sort order will be grouped together.
You could transliterate into your 'target' language (all in one script) and then sort. But languages have conflicting rules for sorting.
Imagine you have some products, or items, or just anything that you want to see in some order of importance, the way a search engine orders websites. You don't know how to sort them, but you have some criteria that give you a clue. You have a bag of criteria, and each of them gives you a sorting, but you cannot aggregate them into one preference list.
Well, I can. It's part of my thesis, and I'd like to show its practical usefulness, so I would appreciate suggestions on what to sort and which criteria to use.
I thought about things like a DVD store sorting DVDs according to quality of the medium, match with the query string, and user votes.
So I would like a real-world problem, including data, where users could tell me whether they like my sorting and where I can see whether the resulting order is useful. That's really the point: is this better than the standard algorithms?
cheers,
niko
You could sort programming questions by date, preferred and disliked tags, number of answers, votes,... to find the most interesting ones :)
The Netflix Prize? See here: http://www.netflixprize.com/ Maybe it is more about clustering than sorting, though.
I have a simple contacts database, but I'm having problems with users entering duplicate data. I've implemented a simple data comparison, but unfortunately the duplicated data being entered is not exactly the same. For example, names are spelled incorrectly, or one person will put in "Bill Smith" and another will put in "William Smith" for the same person.
So is there some sort of algorithm that can give a percentage for how similar an entry is to another?
Algorithms such as Soundex and edit distance (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, they will not be enough. As others have stated, "Bill" does not sound anything like "William".
The best solution I have found is to use a reduction algorithm and mapping table to reduce each name to its root name.
To your regular Person table, add root versions of the names, e.g.
Person (FirstName, RootFirstName, Surname, RootSurname, ...)
Now, create a mapping table.
FirstNameMappings (PRIMARY KEY FirstName, RootName)
Populate your mapping table with:
INSERT IGNORE INTO FirstNameMappings (FirstName, RootName)
SELECT DISTINCT FirstName, 'UNDEFINED' FROM Person;
This will add every first name you have in your Person table, each with a RootName of "UNDEFINED".
Now, sadly, you will have to go through all the unique first names and map each to a root name. For example, "Bill", "Billl" and "Will" should all be translated to "William".
This is very time consuming, but if data quality really is important for you I think it's one of the best ways.
Now use the newly created mapping table to update the RootFirstName field in your Person table. Repeat for surname and address. Once this is done, you should be able to detect duplicates without suffering from spelling errors.
You can compare the names with the Levenshtein distance. If the names are the same, the distance is 0; otherwise it is the minimum number of operations needed to transform one string into the other.
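A minimal pure-Python sketch of that distance, plus one naive way of turning it into a similarity percentage (the normalisation by the longer string is just one possible choice):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # 100% when identical, lower as more edits are needed.
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a, b) / longest)

print(levenshtein("Bill Smith", "Bill Smyth"))            # 1
print(round(similarity("Bill Smith", "William Smith")))   # about 69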
I imagine that this problem is well understood but what occurs to me on first reading is:
compare fields individually
count those that match (for a possibly loose definition of match, possibly weighting the fields differently)
present for human intervention any cases which pass some threshold (see the sketch below)
Use your existing database to get a good first guess for the threshold, and correct as you accumulate experience.
You may prefer a fairly strong bias toward false positives, at least at first.
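A minimal sketch of that compare/count/threshold idea (the field names, weights, and threshold are made-up assumptions to be tuned against your own data):

def field_match(a, b):
    # Loose definition of "match": case-insensitive, ignoring surrounding spaces.
    return a.strip().lower() == b.strip().lower()

# Heavier weights for fields that identify a person more strongly.
WEIGHTS = {"first_name": 1.0, "surname": 2.0, "email": 3.0, "phone": 2.0}
REVIEW_THRESHOLD = 4.0   # tune this against your existing database

def duplicate_score(contact_a, contact_b):
    score = 0.0
    for field, weight in WEIGHTS.items():
        if field_match(contact_a.get(field, ""), contact_b.get(field, "")):
            score += weight
    return score

a = {"first_name": "Bill", "surname": "Smith", "email": "bs@example.com", "phone": "555-1234"}
b = {"first_name": "William", "surname": "Smith", "email": "bs@example.com", "phone": "555-1234"}

score = duplicate_score(a, b)
if score >= REVIEW_THRESHOLD:
    print(f"possible duplicate (score {score}), send to a human for review")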
While I do not have an algorithm for you, my first action would be to take a look at the process involved in entering a new contact. Perhaps users do not have an easy way to find the contact they are looking for. Much like on Stack Overflow's new question form, you could suggest contacts that already exist on the new contact screen.
If you have access to SSIS, check out the Fuzzy Grouping and Fuzzy Lookup transformations.
http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
http://msdn.microsoft.com/en-us/library/ms137786.aspx
If you have a large database with string fields, you can very quickly find a lot of duplicates by using the simhash algorithm.
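In case it helps, a rough sketch of the simhash idea applied to a single string field: build a 64-bit fingerprint from character trigrams, and treat records whose fingerprints differ in only a few bits as candidate duplicates (the trigram features and MD5 hashing are just one simple choice):

import hashlib

def features(text, n=3):
    # Character n-grams of the lower-cased string.
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def simhash(text, bits=64):
    totals = [0] * bits
    for feat in features(text):
        h = int.from_bytes(hashlib.md5(feat.encode()).digest()[:8], "big")
        for b in range(bits):
            totals[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if totals[b] > 0:
            fingerprint |= 1 << b
    return fingerprint

def hamming(x, y):
    return bin(x ^ y).count("1")

# Similar strings -> small Hamming distance between fingerprints.
print(hamming(simhash("Bill Smith"), simhash("Bill Smyth")))         # small
print(hamming(simhash("Bill Smith"), simhash("Zaphod Beeblebrox")))  # much larger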
This may or may not be related, but minor misspellings can be detected by a Soundex search; e.g., it would allow you to consider Britney Spears, Britanny Spares, and Britny Spears as duplicates.
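A quick sketch of that, assuming the third-party jellyfish package for the Soundex encoding (encoding each word of the full name separately is my own choice of how to apply it):

import jellyfish  # third-party: pip install jellyfish

names = ["Britney Spears", "Britanny Spares", "Britny Spears", "Bill Smith"]

def soundex_key(full_name):
    # Encode each part of the name, e.g. "Britney Spears" -> ("B635", "S162").
    return tuple(jellyfish.soundex(part) for part in full_name.split())

for name in names:
    print(name, soundex_key(name))
# The three Spears variants share a key; "Bill Smith" does not.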
Nickname contractions, however, are difficult to treat as duplicates, and I doubt it is wise: there are bound to be multiple people named Bill Smith and William Smith, and you would have to repeat the mapping for Charles->Chuck, Robert->Bob, etc.
Also, if you are considering, say, Muslim users, the problem becomes harder still (there are very many Muslims, for example, who are named Mohammed/Mohammad).
I'm not sure it will work well for the names vs nicknames problem, but the most common algorithm in this sort of area would be the edit distance / Levenshtein distance algorithm. It's basically a count of the number of character changes, additions and removals required to turn one item into another.
For names, I'm not sure you're ever going to get good results with a purely algorithmic approach - What you really need is masses of data. Take, for example, how much better Google spelling suggestions are than those in a normal desktop application. This is because Google can process billions of web queries and look at what queries lead to each other, what 'did you mean' links actually get clicked etc.
There are a few companies that specialise in the name-matching problem (mostly for national security and fraud applications). The one I can remember, Search Software America, seems to have been bought out by these guys: http://www.informatica.com/products_services/identity_resolution/Pages/index.aspx, but I suspect any of these solutions would be far too expensive for a contacts application.
FullContact.com has APIs that can solve this for you; see their documentation here: http://www.fullcontact.com/developer/docs/?category=name.
They have APIs for Name Normalization (Bill into William), Name Deducer (for raw text), and Name Similarity (comparing two names).
All the APIs are free at the moment, so it could be a good way to get started.
You might also want to look into probabilistic matching.
For those wandering around the web who end up here, might I suggest trying a Google Sheets add-on I created called Flookup.
It's particularly good with names and it has a couple of other awesome features which I'll describe below:
Say you have a list of names and there are 2 people called "John Smith". You can use the rank parameter from Flookup to instruct the algorithm to return the 1st, 2nd, 3rd or nth best match. This is helpful if you have additional information that you can use to identify the "John Smith" you want.
Say you have an additional database/list of apartment numbers. You can specify which "John Smith" you want by typing John Smith & Apartment A or John Smith & Apartment B as the lookup parameter to help distinguish between the two names.
I hope you find Flookup as beneficial as others have.