How to get positions in Elasticsearch from partial search query keywords? - elasticsearch

Is there a way to handle the scenario below?
Example text: "I want to eat"
I search for "ant to", which partially matches the phrase "want to".
Ideally, the search result would give positions relative to the whitespace-separated tokens (zero-based):
startToken = 1 ("want" is token 1)
startChar = 1 ("a" is character 1 within token 1)
endToken = 2 ("to" is token 2)
endChar = 1 ("o" is character 1 within token 2)
This does not seem to be supported natively in Elasticsearch. Can Elasticsearch at least give me a result like this (zero-based character offsets into the full text)?
startChar = 3
endChar = 8
Searching the Internet pointed me to highlighting, but when I tried it, it did not work for partial matches.
Can you give me some best practices for this scenario in Elasticsearch?
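For reference, both position formats asked for above can be derived from the stored text on the client side once a matching document has been retrieved. The following is only a minimal sketch of that post-processing step, not an Elasticsearch feature: the function name match_positions is hypothetical, tokens are assumed to be whitespace-separated, and only the first occurrence of the query is reported.

```python
import re


def match_positions(text, query):
    """Locate the first occurrence of query in text.

    Returns zero-based character offsets into the full text and
    (token index, character index within token) pairs, or None if
    the query does not occur.
    """
    if not query:
        return None
    start = text.find(query)
    if start == -1:
        return None
    end = start + len(query) - 1  # inclusive end offset

    # (start, end_exclusive) span of every whitespace-separated token
    spans = [m.span() for m in re.finditer(r"\S+", text)]

    def locate(offset):
        for token_index, (tok_start, tok_end) in enumerate(spans):
            if tok_start <= offset < tok_end:
                return token_index, offset - tok_start
        return None  # the offset falls on whitespace

    return {
        "startChar": start,
        "endChar": end,
        "startToken": locate(start),
        "endToken": locate(end),
    }


print(match_positions("I want to eat", "ant to"))
```

For the example text this reports startChar 3 and endChar 8, with the start at character 1 of token 1 and the end at character 1 of token 2, which matches the values listed in the question.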

Related

Fast search algorithm

Suppose we have tons of posts.
As a user, I want to find all posts containing the words "hello" and "world".
Let's say there is a post with the text "Hello world, this place is beautiful".
Now:
a) Find the text if the user searches for "hello",
b) Find the text if the user searches for "hello", "world",
c) Don't find the text if the user searches for "hello", "world", "funny".
To reduce the number of possible candidates I was thinking about this:
for each post (
    if number_of_search_words == number_of_post_words -> proceed with search logic
    if number_of_search_words < number_of_post_words -> proceed with search logic
    if number_of_search_words > number_of_post_words -> don't proceed with search logic
)
but that would also require storing the word count of each post, which adds complexity.
Is there an elegant way of solving this?
You should use bit containers (bit vectors), for example BitMagic.
Initially, assign each post a sequential integer ID, postID.
Then create N bit containers (N = number of distinct search words), each sized to the maximal postID.
Then build the indices: parse each post, and for each term in the post, set the bit at index postID to 1 in the container associated with that term.
To search:
get the bit containers for your words "hello" and "world".
AND those bit containers.
The resulting container has 1 bits at the postIDs of the posts that contain both search terms.
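The steps above can be sketched without BitMagic by using plain Python integers as bit containers. This is only an illustration under stated assumptions: build_index and search are hypothetical names, the postID is simply the position of the post in the list, and words are extracted with a simple lowercase regular expression.

```python
import re
from collections import defaultdict


def build_index(posts):
    """Map each word to an int used as a bit container:
    bit i is set when posts[i] contains the word."""
    index = defaultdict(int)
    for post_id, text in enumerate(posts):
        for word in re.findall(r"\w+", text.lower()):
            index[word] |= 1 << post_id
    return index


def search(index, words, post_count):
    """AND the bit containers of all search words and
    return the postIDs whose bit is still set."""
    result = (1 << post_count) - 1  # start with every post as a candidate
    for word in words:
        result &= index.get(word.lower(), 0)  # unknown word -> no matches
    return [post_id for post_id in range(post_count)
            if (result >> post_id) & 1]


posts = ["Hello world, this place is beautiful", "Hello there"]
index = build_index(posts)
print(search(index, ["hello"], len(posts)))
print(search(index, ["hello", "world"], len(posts)))
print(search(index, ["hello", "world", "funny"], len(posts)))
```

A word that never occurs in any post maps to an empty container, so case c) from the question falls out of the AND without any per-post word count.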

Phrase matching with Sitecore ContentSearch API

I am using Sitecore 7.2 with a custom Lucene index and Linq. I need to give additional (maximum) weight to exact matches.
Example:
A user searches for "somewhere over the rainbow"
Results should include items which contain the word "rainbow", but items containing the exact and entire term "somewhere over the rainbow" should be given maximum weight. They will be displayed to users as the top results. That is, an item containing the entire phrase should weigh more heavily than an item which contains the word "rainbow" 100 times.
I may need to handle ranking logic outside of the ContentSearch API by collecting "phrase matches" separately from "wildcard matches", and that's fine.
Here's my existing code, truncated for brevity. The code works, but exact phrase matches are not treated as I described.
using (var context = ContentSearchManager.GetIndex("sitesearch-index").CreateSearchContext())
{
    var pred = PredicateBuilder.False<SearchResultItem>();
    pred = pred
        .Or(i => i.Name.Contains(term)).Boost(1)
        .Or(i => i["Field 1"].Contains(term)).Boost(3)
        .Or(i => i["Field 2"].Contains(term)).Boost(1);
    IQueryable<SearchResultItem> query = context.GetQueryable<SearchResultItem>().Where(pred);
    var hits = query.GetResults().Hits;
    ...
}
How can I perform exact phrase matching and is it possible with the Sitecore.ContentSearch.Linq API?
Answering my own question. The problem was the placement of the parentheses. It should be
.Or(i => i.Name.Contains(term).Boost(1))
rather than
.Or(i => i.Name.Contains(term)).Boost(1)
With the latter form, the boosts were not being applied.
I think the following will solve this:
Split your search string on spaces.
Create a predicate for each resulting term, all with an equal boost value.
Create an additional predicate with the complete search string and a higher boost value.
Combine all these predicates in one "OR" predicate.
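The weighting idea behind these steps can be illustrated independently of Sitecore. The sketch below is not the ContentSearch API and does not reproduce Lucene's scoring: boosted_score and its boost values are hypothetical, and matching is reduced to a case-insensitive substring test.

```python
def boosted_score(text, search, term_boost=1.0, phrase_boost=10.0):
    """Sum an equal boost for every search term found in text,
    plus a larger boost when the complete search string is found."""
    text_lower = text.lower()
    search_lower = search.lower().strip()
    total = 0.0
    for term in search_lower.split():
        if term in text_lower:  # one OR clause per individual term
            total += term_boost
    if search_lower and search_lower in text_lower:  # whole-phrase clause
        total += phrase_boost
    return total


query = "somewhere over the rainbow"
print(boosted_score("Somewhere over the rainbow, way up high", query))
print(boosted_score("rainbow rainbow rainbow", query))
```

In a real Lucene-backed index the term frequency also feeds into the score, so the phrase boost would have to be chosen large enough to outweigh many repetitions of a single term.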
I also recommend checking the following:
Sitecore Solr Search Score Value
http://sitecoreinfo.blogspot.com/2015/10/sitecore-solr-search-result-items.html

Sorting search results

I'm implementing phrase and keyword search together (most likely this kind of search has a name, but I don't know it). For example, the search "I like turtles" should match:
I like turtles
He said I like turtles
I really like turtles
I really like those reptiles called turtles
Turtles is what I like
In short, a string must contain all keywords to match.
Then comes the problem of sorting the search results.
Naively, I'm assuming that the closer the matches are to the beginning of the result AND to their positions in the original query, the better the result. How can I express this in code?
My first approach was to assign a score to each keyword in each result based on how close the keyword is to an expected position, derived from the original query. In pseudo-code:
score(result, query) {
    keywords = query.split(" ")
    score = 0
    for i = 0 to keywords.length() - 1 {
        score += score(result, query, keywords, i)
    }
    return score
}
score(result, query, keywords, i) {
    index = result.indexOf(keywords[i])
    if (i == 0) return index
    previousIndex = result.indexOf(keywords[i-1])
    indexInSearch = query.indexOf(keywords[i])
    previousIndexInSearch = query.indexOf(keywords[i-1])
    expectedIndex = previousIndex + (indexInSearch - previousIndexInSearch)
    return abs(index - expectedIndex)
}
The lower the score the better the result. The scores for the above examples seem decent enough:
I like turtles = 0
I really like turtles = 7
He said I like turtles = 8
I really like those reptiles called turtles = 38
Turtles is what I like = 39
Is this a viable approach to sort search results?
Leaving any kind of semantic analysis aside, what else could I be considering to improve it?
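For experimenting with the approach, the pseudo-code can be written as runnable Python. This is a sketch under one assumption that the pseudo-code leaves open: matching is done as a case-insensitive substring search, with str.find standing in for indexOf (both return -1 when nothing is found).

```python
def keyword_score(result, query, keywords, i):
    """Distance between where keyword i occurs in result and where it
    would be expected given the position of keyword i-1."""
    index = result.find(keywords[i])
    if i == 0:
        return index
    previous_index = result.find(keywords[i - 1])
    index_in_query = query.find(keywords[i])
    previous_index_in_query = query.find(keywords[i - 1])
    expected_index = previous_index + (index_in_query - previous_index_in_query)
    return abs(index - expected_index)


def score(result, query):
    """Lower is better; 0 means the result starts with the exact query."""
    result, query = result.lower(), query.lower()
    keywords = query.split(" ")
    return sum(keyword_score(result, query, keywords, i)
               for i in range(len(keywords)))


candidates = [
    "Turtles is what I like",
    "I really like those reptiles called turtles",
    "He said I like turtles",
    "I really like turtles",
    "I like turtles",
]
for text in sorted(candidates, key=lambda t: score(t, "I like turtles")):
    print(score(text, "I like turtles"), text)
```

Under this reading the scores come out as 0, 7, 8, 29 and 39, so "I really like those reptiles called turtles" gets 29 here rather than the 38 listed above, although the ranking is unchanged. Running it also exposes one weakness worth considering: the substring search finds the keyword "i" inside "said" and "is", so matching on whole words would probably give more meaningful positions.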

XPath :: running counter two levels

Using the count(preceding-sibling::*) XPath expression one can obtain incrementing counters. However, can the same also be accomplished in a two-level-deep sequence?
example XML instance
<grandfather>
    <father>
        <child>a</child>
    </father>
    <father>
        <child>b</child>
        <child>c</child>
    </father>
</grandfather>
code (with Saxon HE 9.4 jar on the CLASSPATH for XPath 2.0 features)
Trying to get a counter sequence of 1, 2 and 3 for the three child nodes with different kinds of XPath expressions:
XPathExpression expr = xpath.compile("/grandfather/father/child");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    Node node = nodes.item(i);
    System.out.printf("child's index is: %s %s %s, name is: %s\n"
        , xpath.compile("count(preceding-sibling::*)").evaluate(node)
        , xpath.compile("count(preceding-sibling::child)").evaluate(node)
        , xpath.compile("//child/position()").evaluate(doc)
        , xpath.compile(".").evaluate(node));
}
The above code prints:
child's index is: 0 0 1, name is: a
child's index is: 0 0 1, name is: b
child's index is: 1 1 1, name is: c
None of the three XPaths I tried managed to produce the correct sequence: 1, 2, 3. Clearly it can trivially be done using the i loop variable, but I want to accomplish it with XPath if possible. I also need to keep the basic framework of evaluating an XPath expression to get all the nodes to visit and then iterating over that set, since that's how the real application I work on is structured. Basically I visit each node and then need to evaluate a number of XPath expressions on it (node) or on the document (doc); one of these XPath expressions is supposed to produce this incrementing sequence.
Use the preceding axis with a name test instead.
count(preceding::child)
This yields 0, 1 and 2 for the three child nodes; add 1 to get the sequence 1, 2, 3.
Using XPath 2.0, there is a much better way to do this. Fetch all <child/> nodes and use the position() function to get the index:
//child/concat("child's index is: ", position(), ", name is: ", text())
You don't say efficiency is important, but I really hate to see this done with O(n^2) code! Jens' solution shows how to do that if you can use the result in the form of a sequence of (position, name) pairs. You could also return an alternating sequence of strings and numbers using //child/(string(.), position()): though you would then want to use the s9api API rather than JAXP, because JAXP can only really handle the data types that arise in XPath 1.0.
If you need to compute the index of each node as part of other processing, it might still be worth computing the index for every node in a single initial pass, and then looking it up in a table. But if you're doing that, the simplest way is surely to iterate over the result of //child and build a map from nodes to the sequence number in the iteration.
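The single-pass idea from the last paragraph can be sketched as follows. Note that this swaps in Python's standard xml.etree.ElementTree for the Java/Saxon setup of the question, purely as an illustration; number_children is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML = """<grandfather>
    <father>
        <child>a</child>
    </father>
    <father>
        <child>b</child>
        <child>c</child>
    </father>
</grandfather>"""


def number_children(root, tag="child"):
    """Build a map from each matching node to its 1-based sequence
    number in one pass; Element.iter() walks in document order."""
    return {node: position
            for position, node in enumerate(root.iter(tag), start=1)}


root = ET.fromstring(XML)
positions = number_children(root)

# Later processing looks the index up instead of recounting
# preceding nodes for every visited node.
for node in root.findall("./father/child"):
    print(f"child's index is: {positions[node]}, name is: {node.text}")
```

Each lookup is a constant-time dictionary access, which avoids the O(n^2) cost of evaluating count(preceding::child) once per node.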

Lucene SpanQuery weak spots

I have a ~20GB index of documents that have words with several attributes associated with them, e.g.:
WORD: word_1 word_2 ... word_n
POS: pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 ... lemma_n_1:lemma_n_2
Field tokens separated by ':' are ambiguous, i.e. they correspond to the same position in the document.
An important detail of ambiguous word attributes is that, e.g., pos1_1 corresponds only to lemma1_1, not to lemma1_2 or lemma1_3, so one must not match word_1 when searching for pos1_1 & lemma1_3 at the same position.
I handle the positions of ambiguous tokens with the standard positionIncrement = 0, and the correspondence between attribute numbers with token payloads. For example, lemma1_1 has payload 1 and lemma1_2 has payload 2; pos1_1 has payload 1, pos1_2 has payload 2, and so on. When searching for token attributes at the same position, I use a payload filter that checks that the payloads of all matched tokens are the same.
The problem: SpanNearQueries run very slowly on that index (tens of seconds). The majority of documents in the index match a regular query.
I don't actually know how SpanQueries work in depth, but is there some inefficiency in them by design? Or is payload retrieval that expensive?
I'm just wondering if I'm missing something obvious that slows down the entire search.

Resources