I'm implementing phrase and keyword search together (most likely this kind of search has a name, but I don't know it). To illustrate, the search "I like turtles" should match:
I like turtles
He said I like turtles
I really like turtles
I really like those reptiles called turtles
Turtles is what I like
In short, a string must contain all keywords to match.
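For the matching step by itself, here is a minimal Java sketch of that rule (my own illustration, assuming case-insensitive, whitespace-separated keywords):

import java.util.Arrays;

class KeywordMatcher {
    // A string matches when it contains every keyword of the query, in any order.
    static boolean matches(String text, String query) {
        String lower = text.toLowerCase();
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .allMatch(lower::contains);
    }
}

For example, matches("Turtles is what I like", "I like turtles") returns true.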
Then comes the problem of sorting the search results.
Naively, I'm assuming that the closer the matches are to the beginning of the result AND to the original query, the better the result. How can I express this in code?
My first approach was to assign a score to each keyword in each result based on how close the keyword is to its expected position, derived from the original query. In pseudo-code:
score(result, query) {
    keywords = query.split(" ")
    score = 0
    for i = 0 to keywords.length() - 1 {
        score += score(result, query, keywords, i)
    }
    return score
}

score(result, query, keywords, i) {
    index = result.indexOf(keywords[i])
    if (i == 0) return index
    previousIndex = result.indexOf(keywords[i-1])
    indexInSearch = query.indexOf(keywords[i])
    previousIndexInSearch = query.indexOf(keywords[i-1])
    expectedIndex = previousIndex + (indexInSearch - previousIndexInSearch)
    return abs(index - expectedIndex)
}
The lower the score the better the result. The scores for the above examples seem decent enough:
I like turtles = 0
I really like turtles = 7
He said I like turtles = 8
I really like those reptiles called turtles = 38
Turtles is what I like = 39
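For reference, a directly runnable Java version of the pseudocode above (a sketch: it is case-sensitive and assumes every keyword occurs in the result, i.e. the match test has already passed):

class ProximityScore {
    // Lower is better: the first keyword is penalized by its distance from the start of the
    // result, every following keyword by its distance from where the query layout predicts it.
    static int score(String result, String query) {
        String[] keywords = query.split(" ");
        int total = 0;
        for (int i = 0; i < keywords.length; i++) {
            total += keywordScore(result, query, keywords, i);
        }
        return total;
    }

    private static int keywordScore(String result, String query, String[] keywords, int i) {
        int index = result.indexOf(keywords[i]);
        if (i == 0) {
            return index;
        }
        int previousIndex = result.indexOf(keywords[i - 1]);
        int indexInSearch = query.indexOf(keywords[i]);
        int previousIndexInSearch = query.indexOf(keywords[i - 1]);
        int expectedIndex = previousIndex + (indexInSearch - previousIndexInSearch);
        return Math.abs(index - expectedIndex);
    }
}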
Is this a viable approach to sort search results?
Leaving any kind of semantic analysis aside, what else could I be considering to improve it?
Related
Is there any way to handle the scenario below?
Example text: "I want to eat"
I will try to match "ant to", which partially matches the phrase "want to".
What I want is the following:
My ideal search result would be (tokenized, space-separated positions):
startToken = 1 ("want" is token index 1)
startChar = 1 ("a" is at char index 1 within token 1)
endToken = 2
endChar = 1
but this does not seem to be natively supported in Elasticsearch. Can Elasticsearch at least give me the result like this (full-text, zero-based character positions)?
startChar = 3
endChar = 8
After searching the Internet I got some clues about using highlighting, but when I tried it, it failed for partial matches.
Can you give me some best practices for this scenario in Elasticsearch?
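For illustration, here is the kind of fallback I could compute client-side in Java once I have the stored text and the matched fragment back (this is not an Elasticsearch feature, just the positions from the example above; names are my own):

class OffsetHelper {
    // For text "I want to eat" and fragment "ant to":
    //   zero-based character offsets: startChar = 3, endChar = 8
    //   token positions: startToken = 1 ("want"), char index 1 within that token
    static void locate(String text, String fragment) {
        int startChar = text.indexOf(fragment);           // 3
        int endChar = startChar + fragment.length() - 1;  // 8
        // Translate the start offset into (token index, char index inside the token).
        String[] tokens = text.split(" ");
        int pos = 0;
        for (int t = 0; t < tokens.length; t++) {
            if (startChar < pos + tokens[t].length()) {
                int startToken = t;                // 1
                int charInToken = startChar - pos; // 1
                System.out.println("startChar=" + startChar + " endChar=" + endChar
                        + " startToken=" + startToken + " charInToken=" + charInToken);
                return;
            }
            pos += tokens[t].length() + 1; // skip the token and the following space
        }
    }
}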
I have a couple of questions about scripting in Elasticsearch; I hope someone can help me. I need to add several values from the document to _score and sort by the total. First, I will describe the data that needs to be added:
rating - a number from 1 to 9
duration_bucket - a number from 0 to 2
rating_adj - an array of objects:
{
  text - if the passed parameter matches this value, the score is adjusted by the next field;
  adj - the number by which the score is adjusted
}
and _score itself, which usually ranges from 1 to 4.
Initially, I just had a sort in this order:
score
rating
duration_bucket
But this gave a slightly different result.
Therefore, I wrote a small script that adds all these values together:
// Look up a rating_adj entry whose text matches the passed parameter, if any.
def found = null;
if (params.text != null) {
    found = params._source['rating_adj'].find(item -> item.text == params.text);
}
// Base value: the relevance score plus the two document fields.
def res = _score + params._source['duration_bucket'] + params._source['rating'];
if (found != null) {
    return res + found.adj;
}
return res;
Now for the first question. I've tried two options (sketched below):
through function_score, then sorting by that score;
directly via a script-based sort.
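Roughly, the two request bodies look like this (a sketch only; my_index, the match query and the params are placeholders, and <script from above> stands for the script shown earlier):

POST my_index/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "some text" } },
      "script_score": {
        "script": {
          "source": "<script from above>",
          "params": { "text": "some text" }
        }
      },
      "boost_mode": "replace"
    }
  }
}

POST my_index/_search
{
  "query": { "match": { "title": "some text" } },
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": {
        "source": "<script from above>",
        "params": { "text": "some text" }
      }
    }
  }
}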
I did not notice any difference in performance between them; are there any significant differences between these approaches?
And the second question: when using this script the CPU is fully loaded, in contrast to regular sorting. Are there any ways to optimize scripts, or is it all down to hardware?
I need to find the objects which pass a certain test and, among all the objects that passed that test, the one with the lowest value in another test. Say I have to find the Swedish female with the lowest score in Tetris from a sample group of people (assuming everyone has played Tetris (of course they have)).
I'm obviously doing a for loop and running the tests, comparing each Tetris score to the lowest score so far. But what should the first score be compared to?
Normally I could just take the first element and compare everything else to it afterwards, but here it has to pass the first test too. I could also use an arbitrarily big number, but that just feels wrong.
I could also make two loops: gather all the Swedish females in the first pass and find the lowest score in the second. But is there a shorter, simpler way?
Mock-up in C#:
bool AreYouSwedishFemale(Human patient)
{
if(patient.isFemale && patient.isSwedish) {return true;}
else {return false;}
}
int PlayTetris(Human patient)
{
return someInt;
}
void myMainLoop()
{
Human[] testSubjects = {humanA, humanB, humanC};
Human dreamGirl;
int lowestScoreSoFar; //What should this be?
//Loop through testSubjects
foreach(Human testSubject in testSubjects)
{
//Check if it's a Swedish Female
if(AreYouSwedishFemale(testSubject))
{
//If so, compare her score to the lowest score so far
if(PlayTetris(testSubject) < lowestScoreSoFar) //Error, uninitialized variable
{
//If the lowest, save the object to a variable
dreamGirl = testSubject;
//And save the score, to compare the remaining scores to it
lowestScoreSoFar = PlayTetris(testSubject);
}
}
}
//In the end we have the result
dreamGirl.marry();
}
Yea, I'm not really looking for girls to beat in Tetris, I'm coding in Unity, but tried to keep this independent of the context.
You could just do an "is it initialized yet" check on lowestScoreSoFar before the PlayTetris() comparison. Assuming the lowest possible score is 0, you can initialize lowestScoreSoFar to -1, then edit your loop like this:
//Loop through testSubjects
foreach(Human testSubject in testSubjects)
{
//Check if it's a Swedish Female
if(AreYouSwedishFemale(testSubject))
{
if( lowestScoreSoFar < 0 || PlayTetris(testSubject) < lowestScoreSoFar)
{
//If the lowest, save the object to a variable
dreamGirl = testSubject;
//And save the score, to compare the remaining scores to it
lowestScoreSoFar = PlayTetris(testSubject);
}
}
}
Basically, if your "lowest score so far" hasn't been set yet, the first Swedish female you find will set it.
If for some reason the scores are arbitrary (so no sentinel value is safe), you could instead have a "lowestWasSet" bool that flips when the first girl is found.
Even better, you could just test (dreamGirl == null) instead of (lowestScoreSoFar < 0), because your dream girl is null until you find the first Swedish female. C# short-circuits its OR checks, so when the first condition passes it immediately enters the block: dreamGirl == null evaluates to true and PlayTetris() < lowestScoreSoFar is never evaluated. (You will still need to give both locals an initial value, e.g. null and 0, to satisfy the compiler's definite-assignment check, but the initial score is never actually used.)
What is the right way to split a string into words?
(The string doesn't contain any spaces or punctuation marks.)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here?
Update: for those who think this question is just out of curiosity: this algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and this algorithm is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks whether w is a word using a dictionary. Let's also assume for simplicity that, for now, you only want to know whether some string w can be split at all. This can easily be done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the prefix w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (there exists j in {2..i} such that S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time, if dictionary queries take constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
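A minimal Java sketch of this table (zero-based indices here, with the dictionary as a Set<String>); it also records one witness split so the segmentation itself can be reconstructed:

import java.util.Set;

class WordBreak {
    // S[i] == true means the prefix w[0..i) can be split into dictionary words.
    // prev[i] remembers where the last word of one such split starts, so a
    // witness segmentation can be rebuilt by walking backwards.
    static String split(String w, Set<String> dictionary) {
        int n = w.length();
        boolean[] S = new boolean[n + 1];
        int[] prev = new int[n + 1];
        S[0] = true;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                if (S[j] && dictionary.contains(w.substring(j, i))) {
                    S[i] = true;
                    prev[i] = j;
                    break;
                }
            }
        }
        if (!S[n]) {
            return null; // no segmentation exists
        }
        StringBuilder out = new StringBuilder();
        for (int i = n; i > 0; i = prev[i]) {
            out.insert(0, w.substring(prev[i], i)).insert(0, " ");
        }
        return out.toString().trim();
    }
}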
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if used properly (that is, by incrementally testing for words), as sketched after this list.
(b) typing "segmentation dynamic programming" yields a number of more detailed answers, from university-level lectures with pseudo-code algorithms, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach for dealing with words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If the string has n characters, there are n-1 possible places to split. For example, the string cat can be split before the a and before the t, which gives 4 possible splittings.
You could look at this problem as choosing where to split the string and how many splits to make. So there are Sum(i = 0 to n-1) of (n-1 choose i) possible splittings, and by the binomial theorem (with x = y = 1) this is equal to 2^(n-1).
Granted, a lot of this computation rests on common subproblems, so dynamic programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split does not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that each segment starts right after the previous one ends, i.e. i(k+1) = j(k) + 1.
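For concreteness, a sketch of that precomputation (here the dictionary is just a Set<String>; M uses inclusive indices):

import java.util.Set;

class WordMatrix {
    // M[i][j] is true iff the substring s[i..j] (both ends inclusive) is a dictionary word.
    static boolean[][] build(String s, Set<String> dictionary) {
        int n = s.length();
        boolean[][] M = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                M[i][j] = dictionary.contains(s.substring(i, j + 1));
            }
        }
        return M;
    }
}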
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records how much whitespace occurs in the original string before each position, like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
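The preprocessing step itself (building the lowercase, space-free haystack together with NumSpaces) might look roughly like this in Java (a sketch; names are illustrative):

import java.util.Arrays;

class PreprocessedText {
    final String stripped;  // the original text with whitespace removed, lowercased
    final int[] numSpaces;  // numSpaces[i] = whitespace characters before stripped.charAt(i) in the original

    PreprocessedText(String original) {
        StringBuilder sb = new StringBuilder();
        int[] spaces = new int[original.length()];
        int seen = 0;
        for (char c : original.toCharArray()) {
            if (Character.isWhitespace(c)) {
                seen++;
            } else {
                spaces[sb.length()] = seen;
                sb.append(Character.toLowerCase(c));
            }
        }
        this.stripped = sb.toString();
        this.numSpaces = Arrays.copyOf(spaces, stripped.length());
    }
}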
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
This can actually be done (to a certain degree) without a dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor), and apply the learned model to new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of the knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in the Wikipedia article.
With a fairly sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.
Create a list of possible words and sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove it and append it to your sentence with a space. Repeat this.
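A literal sketch of that greedy idea in Java (with the caveat that when the greedy choice gets stuck, a dynamic-programming approach like the ones above would backtrack instead):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GreedySplitter {
    // Repeatedly peel the longest matching dictionary word off the front of the string.
    static String split(String s, List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        StringBuilder sentence = new StringBuilder();
        String rest = s;
        while (!rest.isEmpty()) {
            String match = null;
            for (String w : sorted) {
                if (rest.startsWith(w)) {
                    match = w;
                    break;
                }
            }
            if (match == null) {
                return null; // the greedy choice got stuck
            }
            sentence.append(match).append(' ');
            rest = rest.substring(match.length());
        }
        return sentence.toString().trim();
    }
}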
A simple Java solution which has O(n^2) running time.
import java.util.HashMap;
import java.util.HashSet;

public class Solution {

    // Should contain all valid words; any other lookup structure (e.g. a Trie) works too.
    private HashSet<String> dictionary;

    public Solution(HashSet<String> dictionary) {
        this.dictionary = dictionary;
    }

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    // Memoized recursion: the map caches the best split (or null) for every suffix already tried.
    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little hard to explain my algorithm in words, so let me share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** This way, one does not query the dictionary for word validity
 * on every substring; it is queried once and for all,
 * eliminating multiple trips to the data storage.
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while (mainword != "") {
    matched = false;
    for (x = 0; x < validwords.length; x++) {
        if (mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.substring(validwords[x].length);
            matched = true;
            x = -1; // restart the scan on the shortened string
        }
    }
    /**
     * If no valid word matched, drop the first character and start again.
     * (You may want to keep that character in the result.)
     */
    if (!matched && mainword != "") {
        mainword = mainword.substring(1);
    }
}
string result = segments.join(" ");
Imagine I have a situation where I need to index sentences. Let me explain it a little more deeply.
For example I have these sentences:
The beautiful sky.
Beautiful sky dream.
Beautiful dream.
As far as I can imagine the index should look something like this:
(image: http://img7.imageshack.us/img7/4029/indexarb.png)
But also I would like to do search by any of these words.
For example, if I search for "the", it should give me a connection to "beautiful".
If I search for "beautiful", it should give me connections to (previous) "The", and (next) "sky" and "dream". If I search for "sky", it should give a (previous) connection to "beautiful", and so on.
Any ideas? Maybe you already know an existing algorithm for this kind of problem?
Short Answer
Create a struct with two vectors of previous/forward links.
Then store the word structs in a hash table with the key as the word itself.
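A minimal sketch of that structure in Java (illustrative names; the adjacency is stored as plain lists):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {
    static class WordNode {
        final String word;
        final List<String> previous = new ArrayList<>(); // words seen directly before this one
        final List<String> next = new ArrayList<>();     // words seen directly after this one
        WordNode(String word) { this.word = word; }
    }

    private final Map<String, WordNode> nodes = new HashMap<>();

    void addSentence(String sentence) {
        String[] words = sentence.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        for (int i = 0; i < words.length; i++) {
            WordNode node = nodes.computeIfAbsent(words[i], WordNode::new);
            if (i > 0) node.previous.add(words[i - 1]);
            if (i < words.length - 1) node.next.add(words[i + 1]);
        }
    }

    WordNode lookup(String word) {
        return nodes.get(word.toLowerCase());
    }
}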
Long Answer
This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.
I went to the park basketball court.
Would you park the car.
Your linking algorithm will create sentences like:
I went to the park the car.
Would you park basketball court.
I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.
I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this:
sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];
inverted_index =
{
'the': {(0,0)},
'beautiful': {(0,1), (1,0), (2,0)},
'sky' : {(0,2),(1,1)},
'dream':{(1,2), (2,1)}
};
Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.
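For illustration, a small Java sketch of this structure and the previous/next lookup (sentence ids and positions as in the example above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvertedIndex {
    // word -> list of (sentenceId, position) occurrences
    private final Map<String, List<int[]>> postings = new HashMap<>();
    private final List<String[]> sentences = new ArrayList<>();

    void add(String[] sentence) {
        int id = sentences.size();
        sentences.add(sentence);
        for (int pos = 0; pos < sentence.length; pos++) {
            postings.computeIfAbsent(sentence[pos], k -> new ArrayList<>()).add(new int[]{id, pos});
        }
    }

    // For every occurrence of a word, the word right before and right after it (null at sentence edges).
    List<String[]> neighbours(String word) {
        List<String[]> result = new ArrayList<>();
        for (int[] occ : postings.getOrDefault(word, new ArrayList<>())) {
            String[] s = sentences.get(occ[0]);
            String prev = occ[1] > 0 ? s[occ[1] - 1] : null;
            String next = occ[1] < s.length - 1 ? s[occ[1] + 1] : null;
            result.add(new String[]{prev, next});
        }
        return result;
    }
}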
Hope this helps.
You can try digging into Markov chains formed from the words of the sentences. You'll also require a both-ways chain (i.e. to find the next and the previous words), i.e. store the probable words that appear just after a given word and just before it.
Of course, a Markov chain is a stochastic process for generating content; however, a similar approach can be used to store the information you need.
That looks like it could be stored in a very simple database with the following tables:
Words:
Id integer primary-key
Word varchar(20)
Following:
WordId1 integer foreign-key Words(Id) indexed
WordId2 integer foreign-key Words(Id) indexed
Then, whenever you parse a sentence, just insert the ones that aren't already there, as follows:
The beautiful sky.
Words (1,'the')
Words (2, 'beautiful')
Words (3, 'sky')
Following (1, 2)
Following (2, 3)
Beautiful sky dream.
Words (4, 'dream')
Following (3, 4)
Beautiful dream.
Following (2, 4)
Then you can query to your heart's content on what words follow or precede other words.
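For example, listing the words that can follow a given word is a simple join (SQL; table and column names as above):

SELECT w2.Word
FROM Following f
JOIN Words w1 ON w1.Id = f.WordId1
JOIN Words w2 ON w2.Id = f.WordId2
WHERE w1.Word = 'beautiful';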
This oughta get you close, in C#:
using System.Collections.Generic;
using System.Linq;

class Program
{
public class Node
{
private string _term;
private Dictionary<string, KeyValuePair<Node, Node>> _related = new Dictionary<string, KeyValuePair<Node, Node>>();
public Node(string term)
{
_term = term;
}
public void Add(string phrase, Node previous, string [] phraseRemainder, Dictionary<string,Node> existing)
{
Node next= null;
if (phraseRemainder.Length > 0)
{
if (!existing.TryGetValue(phraseRemainder[0], out next))
{
existing[phraseRemainder[0]] = next = new Node(phraseRemainder[0]);
}
next.Add(phrase, this, phraseRemainder.Skip(1).ToArray(), existing);
}
_related.Add(phrase, new KeyValuePair<Node, Node>(previous, next));
}
}
static void Main(string[] args)
{
string [] sentences =
new string [] {
"The beautiful sky",
"Beautiful sky dream",
"beautiful dream"
};
Dictionary<string, Node> parsedSentences = new Dictionary<string,Node>();
foreach(string sentence in sentences)
{
string [] words = sentence.ToLowerInvariant().Split(' ');
Node startNode;
if (!parsedSentences.TryGetValue(words[0],out startNode))
{
parsedSentences[words[0]] = startNode = new Node(words[0]);
}
if (words.Length > 1)
startNode.Add(sentence,null,words.Skip(1).ToArray(),parsedSentences);
}
}
}
I took the liberty of assuming you wanted to preserve the actual initial phrase. At the end of this, you'll have a list of the words used in the phrases and, for each one, a list of the phrases that use that word, with references to the next and previous words in each phrase.
Using an associative array will allow you to quickly parse sentences in Perl. It is much faster than you would anticipate, and it can be effectively dumped out into a tree-like structure for subsequent use by a higher-level language.
Tree search algorithms (like BST, etc.)