Stemming and partial search using MongoDB 2.4 - ruby

What is the correct way of doing full text search and partial searches in MongoDB?
E.g. the Norwegian word "sokk" (sock).
When searching for "sokk" I want to match "sokker" (the plural, socks), "sokk" and "sokkepose".
A search for "sokker" should match "sokk" and "sokker".
I get the wanted result by using this ruby snippet:
def self.search(q)
  # Full text search first
  results = Ad.text_search(q).to_a
  # Then search for parts of the word; escape q so that regex
  # metacharacters in user input are matched literally
  pattern = /#{Regexp.escape(q)}/i
  results += Ad.any_of({ title: pattern }, { description: pattern }).to_a
  results.uniq
end
Any suggestions? :)
Cheers,
Martin Stabenfeldt

Martin,
A few suggestions / recommendations / corrections:
Full Text Search in 2.4 is not production ready and should not be deployed in production without knowing the tradeoffs being made. You can find more details at http://docs.mongodb.org/manual/tutorial/enable-text-search/
For Text Search to work, you need to provide the appropriate language for the document when adding it (or for specific fields in 2.6). This ensures that the words are stemmed appropriately and that stop words are removed when indexing that field.
Specify the language when searching as well, so that the query is stemmed appropriately and stop words are removed before searching and ranking the results. You can find more details about both indexing and searching at http://docs.mongodb.org/manual/reference/command/text/ . That page also lists the languages supported by MongoDB FTS.
Ideally you would not use regular expressions for a full text search, but rather specify the words / strings that you are looking for along with the language.
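If the regex fallback from the question is kept for partial matches such as "sokkepose" (which stemming alone will not find), the user input should at least be escaped before it is interpolated into a pattern. The following is a minimal Python sketch of that idea outside MongoDB; the sample titles and function names are made up for illustration.

```python
import re

# Made-up sample data standing in for ad titles
TITLES = ["Sokk", "sokker", "Sokkepose", "Sko", "c++ bok"]

def partial_match_pattern(query):
    # re.escape makes regex metacharacters in user input match literally
    return re.compile(re.escape(query), re.IGNORECASE)

def partial_matches(query, candidates):
    # Keep every candidate that contains the query as a substring
    pattern = partial_match_pattern(query)
    return [c for c in candidates if pattern.search(c)]

print(partial_matches("sokk", TITLES))
print(partial_matches("c++", TITLES))
```

Without the escaping step, a query such as "c++" would raise a regex error instead of matching literally.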

Related

How to search emoticon/emoji in elasticsearch?

I am trying to search for text containing emoticons/emoji in Elasticsearch. I have already inserted tweets into ES. Now I want to search, for example, for tweets containing smiley or sad faces. I tried the following:
1) Used the Unicode escape for the smiley, but it didn't work. No results were returned.
GET /myindex/twitter_stream/_search
{
  "query": {
    "match": {
      "text": "\u1f603"
    }
  }
}
How do I set up emoji search in Elasticsearch? Do I have to encode the raw tweets before ingesting them into Elasticsearch? What would the query be? Any experienced approaches? Thanks.
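A side note on the query above: a JSON \u escape takes exactly four hex digits, so "\u1f603" is read as U+1F60 followed by the digit 3, not as the smiley U+1F603. Characters outside the Basic Multilingual Plane need a surrogate pair (or the raw character). The short Python check below illustrates this with the standard json module; fixing the escape alone may still return nothing if the analyzer drops emoji at index time.

```python
import json

# "\u1f603" in JSON: four hex digits are consumed, the "3" is a separate character
wrong = json.loads('"\\u1f603"')
# The surrogate pair for U+1F603 decodes to a single code point
right = json.loads('"\\ud83d\\ude03"')

print(len(wrong), [hex(ord(c)) for c in wrong])  # 2 ['0x1f60', '0x33']
print(len(right), hex(ord(right)))               # 1 0x1f603

# json.dumps shows the escape a JSON body would need
print(json.dumps("\U0001F603"))                  # "\ud83d\ude03"
```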
The Unicode emoji specification explains how to search for emoji:
Searching includes both searching for emoji characters in queries, and
finding emoji characters in the target. These are most useful when
they include the annotations as synonyms or hints. For example, when
someone searches for ⛽︎ on yelp.com, they see matches for “gas
station”. Conversely, searching for “gas pump” in a search engine
could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would
expect a search for ⛽︎ to result in matches for “Tankstelle”.
You can keep the real Unicode character and expand it to its annotation in each language you aim to support.
This can be done with a synonym filter. But the Elasticsearch standard tokenizer will remove the emoji, so there is quite a lot of work to do:
remove emoji modifier, clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
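To make the four steps concrete, here is a rough Python sketch of the same analysis chain outside Elasticsearch. It is not the configuration from the linked post; the synonym table, the modifier ranges and the function name are assumptions chosen for illustration.

```python
import re

# Hypothetical annotation table; a real one would come from CLDR annotations
EMOJI_SYNONYMS = {
    "\u26fd": ["gas", "station"],       # fuel pump
    "\U0001F603": ["smile", "happy"],   # smiling face
}

# Variation selectors (U+FE0E, U+FE0F) and skin-tone modifiers (U+1F3FB..U+1F3FF)
MODIFIERS = re.compile("[\ufe0e\ufe0f\U0001F3FB-\U0001F3FF]")
# Leading or trailing punctuation on a token
PUNCTUATION = re.compile(r"^[.,!?;:\"'()]+|[.,!?;:\"'()]+$")

def analyze(text):
    # 1. remove emoji modifiers
    cleaned = MODIFIERS.sub("", text)
    tokens = []
    # 2. tokenize on whitespace only, so emoji survive
    for raw in cleaned.split():
        # 3. strip undesired punctuation
        token = PUNCTUATION.sub("", raw).lower()
        if not token:
            continue
        tokens.append(token)
        # 4. expand an emoji token to its synonyms
        tokens.extend(EMOJI_SYNONYMS.get(token, []))
    return tokens

print(analyze("Cheap \u26fd\ufe0e nearby!"))
```

Note that whitespace tokenizing does not split an emoji glued to a word ("hi" followed directly by an emoji stays one token), so a real setup needs extra handling for that case.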
The way I have seen emoticons work is that a string is stored in place of their image counterparts when they are saved to a database. For example, a smile is stored as :smile:. You can verify whether that is true in your case. If it is, you can add a custom tokenizer that does not split on colons, so that an exact match for the emoticon can be made. Then, when searching, you just need to convert the emoticon image in the query to the appropriate string and Elasticsearch will be able to find it. Hope it helps.
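A small Python sketch of that shortcode idea, assuming a hand-written emoji-to-shortcode table (the table and function names are hypothetical): the same conversion is applied to documents and to queries, and tokens are split on whitespace only so the colons stay attached.

```python
# Hypothetical mapping from emoji to the shortcode stored in the index
SHORTCODES = {
    "\U0001F603": ":smile:",
    "\U0001F622": ":cry:",
}

def to_shortcodes(text):
    # Replace each known emoji with its shortcode, padded so it becomes its own token
    for emoji, code in SHORTCODES.items():
        text = text.replace(emoji, f" {code} ")
    return text

def tokenize(text):
    # Whitespace-only splitting keeps ":smile:" intact
    return to_shortcodes(text).lower().split()

doc_tokens = tokenize("Great day\U0001F603")
query_tokens = tokenize("\U0001F603")
print(doc_tokens, query_tokens)
```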

Sphinx-doc: Search tool not returning expected results?

Sphinx-doc is great in helping us generate HTML pages for our Python libraries and make the documentation searchable! However, the search tool is not returning results that I expect...
For example:
def foo():
    """
    show ip address
    """
    pass
If I search for just "show" or "address" using the HTML search box, it matches the foo() function (which is expected). If I search for "ip", there is no match. If I search for all three words, there is no match either. I noticed that "ip" is not in searchindex.js, and it appears that words of 3 characters or fewer are not indexed. This is a problem, since anyone using a search string that contains such a short word will not get the expected result (this is worse than just ignoring those words during the search).
Is there any way around it? Is this a bug?

ElasticSearch Nest AutoComplete based on words split by whitespace

I have AutoComplete working with ElasticSearch (Nest) and it's fine when the user types in the letters from the beginning of the phrase, but I would like to be able to use a specialized type of autocomplete, if possible, that caters for words in a sentence.
To clarify further, my requirement is to be able to "auto complete" like such:
Imagine the full indexed string is "this is some title". When the user types in "th", this comes back as a suggestion with my current code.
I would also like the same thing to be returned if the user types in "som" or "title" or any letters that form a word (word being classified as a string between two spaces or the start/end of the string).
The code I have is:
var result = _client.Search<ContentIndexable>(
    body => body
        .Index(indexName)
        .SuggestCompletion("content-suggest" + Guid.NewGuid(),
            descriptor => descriptor
                .OnField(t => t.Title.Suffix("completion"))
                .Text(searchTerm)
                .Size(size)));
And I would like to see if it would be possible to write something that matches my requirement using SuggestCompletion (and not by doing a match query).
Many thanks,
Update:
This question already has an answer here but I leave it here since the title/description is probably a little easier to search by search engines.
The correct solution to this problem can be found here:
Elasticsearch NEST client creating multi-field fields with completion
@Kha, I think it's better to use the NGram Tokenizer.
You should use this tokenizer when you create the mapping.
If you want more info, and maybe an example, write back.
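To illustrate why n-gram style tokenizing helps here, the following Python sketch indexes word-level edge n-grams (prefixes of each whitespace-separated word), which is a close relative of the NGram tokenizer named above rather than the tokenizer itself. The function names and length limits are assumptions for illustration, not Elasticsearch or NEST APIs.

```python
def edge_ngrams(phrase, min_len=2, max_len=10):
    # Collect every prefix of every whitespace-separated word
    grams = set()
    for word in phrase.lower().split():
        for n in range(min_len, min(len(word), max_len) + 1):
            grams.add(word[:n])
    return grams

def build_index(titles):
    # Map each prefix to the titles that contain a word starting with it
    index = {}
    for title in titles:
        for gram in edge_ngrams(title):
            index.setdefault(gram, []).append(title)
    return index

def suggest(index, typed):
    return index.get(typed.lower(), [])

index = build_index(["this is some title"])
print(suggest(index, "th"))
print(suggest(index, "som"))
print(suggest(index, "title"))
```

A full (non-edge) n-gram tokenizer would also match fragments from the middle of a word, such as "itle", at the cost of a larger index.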

Amazon Cloudsearch not searching with partial string

I'm testing Amazon CloudSearch for my web application and I'm running into some strange issues.
I have the following domain indexes: name, email, id.
For example, I have data such as: John Doe, John@example.com, 1
When I search for jo I get nothing. If I search for joh I still get nothing. But if I search for john then I get the above document as a hit. Why are there no hits for partial strings? I even put suggesters on name and email with fuzzy matching enabled. Is there something else I'm missing? I have read the following:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-compound-queries.html
I'm doing the searches using boto as well as with the form on AWS page.
What you're trying to do -- finding "john" by searching "jo" -- is called a prefix search.
You can accomplish this either by searching
(prefix field=name 'jo')
or
q=jo*
Note that if you use the q=jo* method of appending * to all your queries, you may want to do something like q=jo* |jo because john* will not match john.
This can seem a little confusing but imagine if google gave back results for prefix matches: if you searched for tort and got back a mess of results about tortoises and torture instead of tort (a legal term), you would be very confused (and frustrated).
A suggester is also a viable approach, but it will give you back suggestions (like john, jordan and jostle) rather than results, and you would then need to search for those; it does not return matching documents to you.
See "Searching for Prefixes in Amazon CloudSearch" at http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
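The two query forms above can be assembled as request parameters roughly as follows. This is a sketch that only builds the q and q.parser parameters and does not call AWS; the helper names are made up, and the quote escaping in the structured form is an assumption about what the structured parser expects.

```python
from urllib.parse import urlencode, parse_qs

def simple_prefix_params(term):
    # Simple parser: prefix match OR exact term, as in q=jo* |jo
    return {"q": f"{term}* |{term}"}

def structured_prefix_params(field, term):
    # Structured parser: (prefix field=name 'jo')
    # Assumption: backslashes and single quotes inside the string are backslash-escaped
    escaped = term.replace("\\", "\\\\").replace("'", "\\'")
    return {"q": f"(prefix field={field} '{escaped}')", "q.parser": "structured"}

print(simple_prefix_params("jo"))
print(structured_prefix_params("name", "jo"))

# URL-encode for use in a search request's query string
query_string = urlencode(structured_prefix_params("name", "jo"))
print(query_string)
```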
Are your index field types "Text"? If they are just "Literals", they have to be an exact match.
I think you must have your name and email fields set as the literal type instead of the text type, otherwise a simple text search of 'jo' or 'Joh' should've found the example document.
While using a prefix search may have solved your problem (and that makes sense if the fields are set as the literal type), the accepted answer isn't really correct. The notion that it's "like a google search" isn't based on anything in the documentation. It actually contradicts the example they use, and in general muddies up what's possible with the service. From the docs:
When you search text and text-array fields for individual terms, Amazon CloudSearch finds all documents that contain the search terms anywhere within the specified field, in any order. For example, in the sample movie data, the title field is configured as a text field. If you search the title field for star, you will find all of the movies that contain star anywhere in the title field, such as star, star wars, and a star is born. This differs from searching literal fields, where the field value must be identical to the search string to be considered a match.

Highlighting a query word in a document

I have a document and a query term. I want to
Find the query term in the document.
Pad each occurrence of the query term with a certain text marker.
For example
Text: I solemnly swear that I am up to no good.
Query: swear
Output: I solemnly MATCHSTART swear MATCHEND that I am up to no good.
Assuming that I have multiple query words and a large document, how can I do this efficiently?
I did go over various links on the internet but couldn't find anything very conclusive or definite. Moreover, this is just a programming question and has nothing to do with search engine development or information retrieval.
Any help would be appreciated. Thanks.
If each query is a single word (a substring that contains no spaces, tabs, newlines, etc.) and a very low probability of false positives is acceptable (that is, occasionally marking a word that is not in the query set), you can use a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter
First, load your query words into the Bloom filter. Then scan the document and test each word against the filter. If the lookup is positive, mark the word.
You can use my implementation of bloom filter: http://olegh.cc.st/src/bloom.c.txt
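For readers who prefer Python over the C implementation linked above, here is a minimal Bloom filter sketch of the same load-then-scan procedure. It is not a port of that code; the bit-array size, the number of hashes and the SHA-256-based hashing are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting the word with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "possibly present"; False means "definitely absent"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

def mark(text, queries):
    bloom = BloomFilter()
    for query in queries:
        bloom.add(query.lower())
    out = []
    for word in text.split():
        # Ignore surrounding punctuation and case when testing membership
        core = word.strip(".,;:!?").lower()
        out.append(f"MATCHSTART {word} MATCHEND" if core and core in bloom else word)
    return " ".join(out)

print(mark("I solemnly swear that I am up to no good.", ["swear"]))
```

For a small query set, a plain Python set gives exact answers with no false positives; the Bloom filter mainly pays off when the query set is too large to hold comfortably in memory.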
In Python:
text = "I solemnly swear that I am up to no good."  # read in however you like
query = input("Query: ")
print(text.replace(" " + query + " ", " MATCHSTART " + query + " MATCHEND "))
OUTPUT:
I solemnly MATCHSTART swear MATCHEND that I am up to no good.
You could also use regex, but that's slower, so I just used string concatenation to add whitespace to the beginning and end of the word (so as not to match "swears", "swearing" or "sportswear"). Note that this misses occurrences at the very start or end of the text, or directly next to punctuation. This is easily translatable to whatever language you prefer.
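For comparison, a regex version with word boundaries handles several query words in one pass and also catches matches at the start or end of the text or next to punctuation. This is a sketch using only the standard re module; the function name and default markers are illustrative.

```python
import re

def highlight(text, queries, start="MATCHSTART", end="MATCHEND"):
    if not queries:
        return text
    # Escape each query and try longer alternatives first
    alternation = "|".join(
        re.escape(q) for q in sorted(queries, key=len, reverse=True)
    )
    # \b keeps "swear" from matching inside "swears" or "sportswear"
    pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{start} {m.group(1)} {end}", text)

print(highlight("I solemnly swear that I am up to no good.", ["swear", "good"]))
```

The pattern is compiled once per query set, so the document is scanned a single time regardless of how many query words there are.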

Resources