Crate fulltext query syntax

I'm thinking about migrating from Sphinx to Crate, but I can't find any documentation on its fulltext query syntax. In Sphinx I can search:
("black cat" -catalog) | (awesome creature)
This matches documents that EITHER contain the exact phrase "black cat" and not the term "catalog", OR contain both "awesome" and "creature" at any position.
black << big << cat
This requires the document to contain all of the terms "black", "big" and "cat", and also requires the match position of "black" to be less than that of "big", and so on.
I also need to search at a specific place in the document. In Sphinx I was able to use the proximity operator as follows:
hello NEAR/10 (mother|father -dear)
This requires the document to contain the term "hello" plus either "mother" or "father" at most 10 terms away from "hello", while the term "dear" must not appear within 10 terms of "hello".
The last construction with NEAR is heavily used in my application. Is all of this possible in Crate?

Unfortunately I cannot comment on how it compares to Sphinx, but I will stick to your questions :)
Crate's fulltext search combines SQL with Lucene's matching power, and should therefore be able to handle complex queries. I'll just provide the queries matching your examples; I think they should be quite readable.
("black cat" -catalog) | (awesome creature)
select *
from mytable
where
    (match(indexed_column, 'black cat') using phrase
     and not match(indexed_column, 'catalog'))
    or match(indexed_column, 'awesome creature') using best_fields with (operator = 'and');
black << big << cat
select *
from mytable
where
    match(indexed_column, 'black big cat') using phrase with (slop = 100000);
This one is tricky: there doesn't seem to be an operator that does exactly the same as in Sphinx, but it can be approximated with a "slop" value. The slop is the number of positional moves allowed between the terms of a phrase query, so a very large slop essentially only requires all the terms to occur somewhere in the document; note that a large enough slop also lets terms match out of order, so this is only an approximation of Sphinx's strict ordering. Depending on the use case there might be another (better) solution as well...
hello NEAR/10 (mother|father -dear)
select *
from mytable
where
    (match(indexed_column, 'hello mother') using phrase with (slop = 10)
     or match(indexed_column, 'hello father') using phrase with (slop = 10))
    and not match(indexed_column, 'hello dear') using phrase with (slop = 10);
They might look a bit clunky compared to Sphinx's language, but they work fine :)
Performance-wise, they should still be very fast, thanks to Lucene.
Cheers, Claus

Related

Best Data Structure For Text Processing

Given a sentence like the one below, I have this data structure (a dictionary of lists, with UNIQUE keys in the dictionary):
{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
A courageous feline was recently spotted in the neighborhood protecting her mini kitten
How would I efficiently process this text to convert the synonyms of the word cat into the word cat itself, such that the output is like this:
A fearless cat was recently spotted in the neighborhood protecting her little cat
The algorithm I want is something that can process the initial text and convert each synonym into its ROOT word (the key in the dictionary); the lists of keywords and synonyms will grow longer over time as well.
Hence, first, I want to ask whether the data structure I am using can perform efficiently, and whether there is a more efficient structure.
For now, I can only think of looping through each list inside the dictionary, searching for the synonyms and then mapping them back to their keyword.
edit: Refined the question
Your dictionary is organised in the wrong way. It will allow you to quickly find a target word, but that is not helpful when you have an input that does not have the target word, but some synonyms of it.
So organise your dictionary in the opposite sense:
d = {
    'feline': 'cat',
    'kitten': 'cat'
}
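If you already maintain the root-to-synonyms dictionary from your question, you don't have to write the inverted one by hand; it can be derived. A minimal sketch (assuming the question's original dict is named synonyms):
synonyms = {'cat': ['feline', 'kitten'],
            'brave': ['courageous', 'fearless'],
            'little': ['mini']}

# Invert: map each synonym to its root word (the dictionary key).
d = {syn: root for root, syns in synonyms.items() for syn in syns}
# d == {'feline': 'cat', 'kitten': 'cat', 'courageous': 'brave',
#       'fearless': 'brave', 'mini': 'little'}
Deriving d this way also picks up the 'brave' and 'little' entries, which the desired output in your question relies on.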
To make the replacements, you could create a regular expression and call re.sub with a callback function that will look up the translation of the found word:
import re

# One alternation of all the synonyms, escaped, wrapped in \b word
# boundaries so only complete words match.
regex = re.compile(rf"\b(?:{'|'.join(map(re.escape, d))})\b")
s = "A feline was recently spotted in the neighborhood protecting her little kitten"
# The callback looks up the replacement for each matched word.
print(regex.sub(lambda match: d[match[0]], s))
The regular expression makes sure that the match is a complete word and not a substring: "cafeline" as input will not give a match for "feline". For the sentence above, this prints: A cat was recently spotted in the neighborhood protecting her little cat

Solr query conundrum

I've recently swapped from using Lucene for Sitecore to Solr.
For the most part it has been smooth, but some of the queries I was writing through the Sitecore.ContentSearch.Linq abstraction no longer seem to be compatible.
Specifically, I have a situation where I've got "global" content and "regional" content, like so:
Home (ID: 000)
    X
    Y
    Z
Regions (ID: 111)
    Region 1 (ID: 221)
        A
        B
    Region 2 (ID: 222)
        D
My code worked on Lucene but now doesn't on Solr. It should find all "global" content plus a single region's content, excluding all other regions' content. As an example, if the user's current region was Region 1, I'd want the query to return content X, Y, Z, A, B.
Sitecore's Item Crawler has a field on each item in the index called "_path", a multivalued string field of IDs; as an example, Region 1's _path field value would be [000, 111, 221].
When I write this using the Linq abstraction it comes out as below which doesn't return results.
-_path:(111) OR _path:(221)
But _path:(111) on its own does return results. Mind blown.
When I use the Solr interface and wrap each side of the OR in extra brackets like below (which I'd consider redundant) it works! Mind blown v2.
(-_path:(111)) OR (_path:(221))
Firstly, what's the difference between those queries?
Secondly, my real problem is that I can't add these extra brackets, as I'm working through the Linq abstraction, so the brackets will be "optimized" out.
Any advice would be awesome! Cheers.
The problem here is that Lucene's negative queries don't work the way you think they do. They only remove results from what has already been found. -_path:111 doesn't find all documents which aren't in 111; on its own it doesn't find anything at all. It only removes results. So you are finding all results with path "221", then removing any that also have path "111", which, from your hierarchy, I assume is all of them. See my answer here for a bit more on that topic.
The OR makes it seem like it ought to work, but really -_path:(111) OR _path:(221) is the same as -_path:(111) _path:(221). The moral here is: don't use Lucene's AND/OR/NOT syntax if you can help it; use +/- instead. The +/- syntax actually expresses how the query operates; AND/OR/NOT doesn't. It attempts to shoehorn the query into a different, SQL-like retrieval model, and that leads to unexpected behavior like this.
So, what about: (-_path:(111)) OR (_path:(221))
Well, first, does it actually work? Or does it just get some results?
If it just gets some results, and they are the same results as _path:221 alone: the reason is that -_path:111 gets no results, so your query is, in practice, something like (nothing) OR (_path:221), which is equivalent to _path:221.
If it really does get the results you expect (I'm guessing it probably does): something is translating your query into something like: (*:* -_path:111) (_path:221). Solr does have some logic along these lines, though I'm not quite sure it applies in this case. Essentially, it puts a match-all in front of any lonely negative clause it finds, allowing it to do what you were expecting. If the implicit *:* makes you nervous about performance, well, it should. Lucene is an inverted index; it does well at finding matches on a term quickly. Getting everything that doesn't match goes against the grain of that retrieval model, and will pretty much require a full scan of the index.
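Since the Linq abstraction strips the extra brackets, one workaround is to send the raw query through any client that accepts a plain q string. Your stack is Sitecore/C#, but purely to illustrate the corrected query in runnable form, here is a sketch using the Python pysolr client (the URL and core name are hypothetical):
import pysolr

# Hypothetical Solr core URL; substitute your own.
solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)

# Match-all minus region 111, unioned with region 221's subtree.
results = solr.search("(*:* -_path:111) OR _path:221")
for doc in results:
    print(doc.get("_path"))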

RethinkDB text search?

I am trying to learn some RethinkDB for my next project. My backend is in Haskell, and the RethinkDB Haskell driver looks a bit better than the MongoDB one, so I want to try it.
My question is: how do you do simple text search with RethinkDB?
Nothing too complex. Just find the fields whose value contains the given words.
I assume this should be built in, as even the smallest blog app needs a search facility of some kind, right?
So I am looking for a mongodb equivalent of:
var search = { "$text": { "$search": "some text" } };
Thank you.
EDIT
I am not looking for regular expressions and the match function:
It is extremely slow for more or less large data sets.
It does not have any notion of indexes.
It does not have any notion of stemming.
With the rethinkdb driver documented here:
run h $ table "table" # R.filter (\row -> match "some text" (row ! "field"))
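For comparison, the same filter through the official Python driver would look roughly like this (a sketch; it assumes a local server and the classic import style of the pre-2.4 driver). Note it is still a regex-based scan, since RethinkDB has no built-in full-text index:
import rethinkdb as r

conn = r.connect("localhost", 28015)

# filter with match() runs a regular expression over every row --
# a full table scan, not an indexed full-text search.
results = r.table("table").filter(
    lambda row: row["field"].match("some text")
).run(conn)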

Fuzzy matching of words with phrases with ruby

I want to match a bunch of data against a short list of services.
My data would look something like this
{"title" : "blorb",
"category" : "zurb"
"description" : "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."
}
and I have to match it with
["Swedish Massage", "Haircut"]
Clearly "Swedish Massage" should be the winner, but running a benchmark shows that "Haircut" is:
require 'amatch'

# The description string from the data above.
description = "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."

arr = [:levenshtein_similar, :hamming_similar, :pair_distance_similar,
       :longest_subsequence_similar, :longest_substring_similar,
       :jaro_similar, :jarowinkler_similar]

arr.each do |method|
  ["Swedish Massage", "Haircut"].each do |sh|
    pp ">>> #{sh} matched with #{method}"
    pp sh.send(method, description)
  end
end
result:
">>> Swedish Massage matched with jaro_similar"
# 0.5246896118183247
">>> Haircut matched with jaro_similar"
# 0.5353606789250354
">>> Swedish Massage matched with jarowinkler_similar"
# 0.5246896118183247
">>> Haircut matched with jarowinkler_similar"
# 0.5353606789250354
The scores for the rest of the methods are well below 0.1.
What would be a better approach to solving this problem?
Search is a constant battle between precision and recall. One thing you could try is splitting your input into words: this will produce a much stronger match on "Massage", but with the consequence of broadening the result set; you will now also find sentences containing only words close to "Swedish". You could then try to control that broadening by averaging the results for multiple words, using stop lists to avoid common words like "and", boosting tokens found close to each other, and so on, but you will never see truly perfect results. A sketch of the word-splitting idea follows below. If you're really interested in fine-tuning this, I recommend Elasticsearch: relatively easy to learn and powerful.
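Here is a rough sketch of that word-splitting idea in Python, using the standard library's difflib in place of amatch (an assumption for illustration; any per-word similarity measure would do). Each service is scored by its words' best matches among the description's tokens, averaged over the service's words:
from difflib import SequenceMatcher

description = ("Massage is the manipulation of superficial and deeper "
               "layers of muscle and connective tissue using various techniques")
services = ["Swedish Massage", "Haircut"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(service, text):
    tokens = text.split()
    # Best match in the text for each word of the service name,
    # averaged over the service's words.
    best = [max(similarity(word, t) for t in tokens) for word in service.split()]
    return sum(best) / len(best)

for s in services:
    print(s, round(score(s, description), 3))
# "Swedish Massage" now scores far higher, because "Massage"
# matches a description token exactly.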

How to scan a string for an exact keyword match?

I'm scanning names and descriptions of different items in order to see if there are any keyword matches.
The code below returns things like 'googler' or 'applecobbler', when what I'm trying to do is get exact matches only:
[name, description].join(" ").downcase.scan(/apple|microsoft|google/)
How should I do this?
My regex skills are pretty weak, but I think you need to use a word boundary:
[name, description].join(" ").downcase.scan(/\b(apple|microsoft|google)\b/)
Rubular example
It depends on what information you want, but if you just want an exact match, you do not need a regex for the comparison part; just compare the relevant strings.
splitted_strings = [name, description].join(" ").downcase.split(/\b/)
splitted_strings & %w[apple microsoft google]
# => the words that match, in their order of appearance
Add proper word-boundary anchors (\b) to your regexp. You can also use the #grep method instead of joining:
array.grep(your_regexp)
Looking at the question, and the situations where I'd want to do those things in an actual program, where I had lists of sources and their associated texts and wanted to know the hits, I'd probably write something like this:
require 'pp'

names = ['From: Apple', 'From: Microsoft', 'From: Google.com']
descriptions = [
  '"an apple a day..."',
  'Microsoft Excel flight simulator... according to Microsoft',
  'Searches of Google revealed multiple hits for "google"'
]
targets = %w[apple microsoft google]
regex = /\b(?:#{ Regexp.union(targets).source })\b/i

names.zip(descriptions) do |n, d|
  name_hits, description_hits = [n, d].map { |s| s.scan(regex) }
  pp [name_hits, description_hits]
end
Which outputs:
[["Apple"], ["apple"]]
[["Microsoft"], ["Microsoft", "Microsoft"]]
[["Google"], ["Google", "google"]]
This would let me know the letter-case of the words, so I could try to differentiate the apple fruit from Apple the company, and get word counts, helping to show relevance of the text.
The regex looks like:
/\b(?:apple|microsoft|google)\b/i
It's case-insensitive, but scan will return the words in their original case.
names, descriptions and targets could all come from a database or separate files, helping to separate the data from the code and avoiding the need to modify the code as the targets change.
I'd use a list of target words and use Regexp.union to quickly build the pattern.
