How to search emoticon/emoji in elasticsearch? - elasticsearch

I am trying to search for text containing emoticons/emoji in Elasticsearch. I have already indexed tweets in ES, and now I want to find, for example, tweets with smiling or sad faces. I tried the following:
1) I used the Unicode escape for the smile emoji, but it didn't work; no results were returned.
GET /myindex/twitter_stream/_search
{
  "query": {
    "match": {
      "text": "\u1f603"
    }
  }
}
How do I set up emoji search in Elasticsearch? Do I have to encode the raw tweets before ingesting them into Elasticsearch? What would the query be? Are there any proven approaches? Thanks.
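A side note on the escape used in the query: a JSON `\u` escape takes exactly four hex digits, so `"\u1f603"` is read as U+1F60 followed by the digit 3, not as the emoji U+1F603. Characters above U+FFFF have to be written as a surrogate pair or as the raw UTF-8 character. The short Python check below illustrates this. It is only a sketch, and fixing the escape alone may not be enough if the analyzer drops emoji at index time.

```python
import json

# "\u1f603" in JSON is U+1F60 (a Greek letter) followed by the digit "3".
wrong = json.loads('"\\u1f603"')

# U+1F603 lies outside the Basic Multilingual Plane, so the JSON escape
# form needs a surrogate pair (json.dumps escapes non-ASCII by default).
smile = "\U0001F603"
escaped = json.dumps(smile)

print(len(wrong), [hex(ord(c)) for c in wrong])
print(escaped)
print(json.loads(escaped) == smile)
```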

The Unicode emoji specification explains how to search for emoji:
Searching includes both searching for emoji characters in queries, and
finding emoji characters in the target. These are most useful when
they include the annotations as synonyms or hints. For example, when
someone searches for ⛽︎ on yelp.com, they see matches for “gas
station”. Conversely, searching for “gas pump” in a search engine
could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would
expect a search for ⛽︎ to result in matches for “Tankstelle”.
You can keep the real Unicode character and expand it to its annotations in each language you aim to support.
This can be done with a synonym filter. However, Elasticsearch's standard tokenizer removes emoji, so there is quite a lot of work to do:
remove emoji modifier, clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
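As a rough illustration of the tokenize-and-expand steps listed above, the sketch below builds index settings as a plain Python dict: a custom analyzer using the `whitespace` tokenizer followed by a `synonym` token filter. The names `emoji_text` and `emoji_synonyms` and the tiny annotation table are made up for the example, and the modifier-removal and punctuation-cleaning steps are left out, so treat it as a starting point rather than the setup from the linked article.

```python
import json

# Hypothetical emoji annotations; a real list would be much larger
# and language-specific.
EMOJI_SYNONYMS = {
    "\U0001F603": ["smile", "happy"],
    "\U0001F622": ["sad", "cry"],
}

def build_settings(synonyms):
    """Build index settings with a whitespace analyzer and an emoji synonym filter."""
    rules = [
        f"{emoji} => {emoji}, {', '.join(words)}"
        for emoji, words in synonyms.items()
    ]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "emoji_synonyms": {"type": "synonym", "synonyms": rules}
                },
                "analyzer": {
                    "emoji_text": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "emoji_synonyms"],
                    }
                },
            }
        }
    }

settings = build_settings(EMOJI_SYNONYMS)
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The resulting JSON would be sent as the body when creating the index, with the `text` field mapped to the `emoji_text` analyzer.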

The way I have seen emoticons handled is that a string is stored in place of their image counterparts when they are saved to a database. For example, a smile is stored as :smile:. You can verify whether that is true in your case. If it is, you can add a custom tokenizer that does not split on colons, so that an exact match on the emoticon can be made. Then, at search time, you just need to convert the emoticon image in the query to the appropriate string and Elasticsearch will be able to find it. Hope it helps.
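If the tweets really are stored with shortcodes, the query-side conversion could look roughly like this. The `SHORTCODES` table is hypothetical, and the `whitespace_tokens` helper only mimics what a whitespace tokenizer does (it keeps `:smile:` as a single token because it never splits on colons).

```python
# Hypothetical emoji-to-shortcode table; real data would be far larger.
SHORTCODES = {
    "\U0001F603": ":smile:",
    "\U0001F622": ":cry:",
}

def to_shortcodes(text):
    """Replace each known emoji in the text with its shortcode."""
    for emoji, code in SHORTCODES.items():
        text = text.replace(emoji, f" {code} ")
    # Collapse the extra spaces introduced around the shortcodes.
    return " ".join(text.split())

def whitespace_tokens(text):
    """Mimic a whitespace tokenizer, which leaves the colons in place."""
    return text.split()

query = to_shortcodes("good morning\U0001F603")
print(query)
print(whitespace_tokens(query))
```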

Related

Matching two words as a single word

Consider that I have a document which has a field with the following content: 5W30 QUARTZ INEO MC 3 5L
A user wants to be able to search for MC3 (no space) and get the document; however, search for MC 3 (with spaces) should also work. Moreover, there can be documents that have the content without spaces and that should be found when querying with a space.
I tried indexing without spaces (e.g. 5W30QUARTZINEOMC35L), but that does not really work as using a wildcard search I would match too much, e.g. MC35 would also match, and I only want to match two exact words concatenated together (as well as exact single word).
So far I'm thinking of additionally indexing all combinations of two words, e.g. 5W30QUARTZ, QUARTZINEO, INEOMC, MC3, 35L. However, does Elasticsearch have a native solution for this?
I'm pretty sure what you want can be done with the shingle token filter. Depending on your mapping, I would imagine you'd need to add a filter looking something like this to your content field to get your tokens indexed in pairs:
"filter_shingle": {
  "type": "shingle",
  "max_shingle_size": 2,
  "min_shingle_size": 2,
  "output_unigrams": "true"
}
Note that these are also the filter's default values; I only spelled them out for clarity.
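One detail worth checking: the shingle filter joins tokens with a space by default, so the pair "MC" + "3" is indexed as "mc 3" rather than "mc3". Setting the filter's `token_separator` option to an empty string should produce the concatenated form. The small simulation below is not Elasticsearch code; it only mimics which terms the filter would emit (the real filter interleaves them by position).

```python
def shingles(tokens, size=2, separator=" ", output_unigrams=True):
    """Mimic a shingle filter: optional unigrams plus adjacent-token shingles."""
    out = list(tokens) if output_unigrams else []
    for i in range(len(tokens) - size + 1):
        out.append(separator.join(tokens[i:i + size]))
    return out

tokens = "5w30 quartz ineo mc 3 5l".split()

default_shingles = shingles(tokens)               # separator " " is the default
joined_shingles = shingles(tokens, separator="")  # like "token_separator": ""

print("mc 3" in default_shingles, "mc3" in default_shingles)
print("mc3" in joined_shingles, "mc35" in joined_shingles)
```

With the empty separator, "mc3" is indexed next to the unigrams "mc" and "3", while "mc35" is not, which matches the requirement in the question.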

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" are separate tokens. The "A" token might get cut out by the stop token filter (See Standard Analyzer). So without any custom analyzers, you will typically get any documents with just "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea would be that "PDF/A" gets transformed into something like "pdf_a" at both index and query time. A simple match query will then work just fine. This is a very simplistic approach, though; you might want to consider how '/' characters are used in your content and use slightly more complex regex filters, which are also not perfect solutions.
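A minimal sketch of such an analyzer, written as a Python dict that would be serialized into the index settings. The names `slash_to_underscore` and `slash_aware` are invented for the example, and the `simulate` helper is only a rough local stand-in for the analyzer on a single term, not the real analysis chain.

```python
import json

def build_settings():
    """Index settings with a mapping char filter that rewrites "/" to "_"."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "slash_to_underscore": {
                        "type": "mapping",
                        "mappings": ["/ => _"],
                    }
                },
                "analyzer": {
                    "slash_aware": {
                        "type": "custom",
                        "char_filter": ["slash_to_underscore"],
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    }
                },
            }
        }
    }

def simulate(term):
    """Rough stand-in for the analyzer on one term: map "/" to "_", then lowercase."""
    return term.replace("/", "_").lower()

print(json.dumps(build_settings(), indent=2))
print(simulate("PDF/A"), simulate("pdf/a"))
```

Because the same analyzer runs at index and query time, both the document text and the user's unescaped input end up as the same term.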
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters, I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html).
Since it does not use the full query-string parser, it is a bit limited (e.g. no field queries like id:5), but it serves the purpose.
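For illustration, a request body for that query type could be built like this. The field name `body` is an assumption about the mapping, and how "pdf/a" is finally matched still depends on the analyzer configured for that field.

```python
import json

def simple_query(text, fields):
    """Build a simple_query_string search body for the given text and fields."""
    return {
        "query": {
            "simple_query_string": {
                "query": text,
                "fields": fields,
                "default_operator": "and",
            }
        }
    }

# "body" is a hypothetical field name; the slash is passed through unescaped.
body = simple_query("pdf/a", ["body"])
print(json.dumps(body, indent=2))
```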

ElasticSearch Nest AutoComplete based on words split by whitespace

I have AutoComplete working with ElasticSearch (Nest), and it's fine when the user types in the letters from the beginning of the phrase. However, I would like to use a specialized type of autocomplete, if possible, that caters for words in a sentence.
To clarify further, my requirement is to be able to "auto complete" like such:
Imagine the full indexed string is "this is some title". When the user types in "th", this comes back as a suggestion with my current code.
I would also like the same thing to be returned if the user types in "som" or "title" or any letters that form a word (word being classified as a string between two spaces or the start/end of the string).
The code I have is:
var result = _client.Search<ContentIndexable>(
    body => body
        .Index(indexName)
        .SuggestCompletion("content-suggest" + Guid.NewGuid(),
            descriptor =>
                descriptor
                    .OnField(t => t.Title.Suffix("completion"))
                    .Text(searchTerm)
                    .Size(size)));
And I would like to see if it would be possible to write something that matches my requirement using SuggestCompletion (and not by doing a match query).
Many thanks,
Update:
This question already has an answer here but I leave it here since the title/description is probably a little easier to search by search engines.
The correct solution to this problem can be found here:
Elasticsearch NEST client creating multi-field fields with completion
@Kha, I think it's better to use the NGram tokenizer.
You would configure this tokenizer when you create the mapping.
If you want more information, or an example, write back.
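To show the idea behind the n-gram suggestion, here is a small Python simulation of per-word edge n-grams, a variant of plain n-grams that only keeps prefixes of each word. It does not use NEST or the completion suggester; the function names and the gram sizes are assumptions for the example. In Elasticsearch this would roughly correspond to a whitespace tokenizer followed by an edge n-gram token filter, queried with a match query.

```python
def edge_ngrams(word, min_gram=2, max_gram=10):
    """Return the prefixes of a word between min_gram and max_gram characters."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]

def index_terms(title):
    """Lowercase, split on whitespace, and collect edge n-grams for every word."""
    terms = set()
    for word in title.lower().split():
        terms.update(edge_ngrams(word))
    return terms

terms = index_terms("this is some title")
for typed in ("th", "som", "title", "itl"):
    print(typed, typed in terms)
```

Input such as "th", "som", or "title" matches because each is a prefix of some word, while "itl" does not because it starts in the middle of a word.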

Elastic Search - Exact phrase search with wildcards

I am looking for help on exact phrase search with wild card.
QueryBuilders.multiMatchQuery("Java Se", "title", "subtitle")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX);
The above query, returns the following results.
1) Java Search
2) Elastic Java Search
Trailing wildcard works.
But when I search with the query below,
QueryBuilders.multiMatchQuery("ava Se", "title", "subtitle")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX);
It does not return anything as nothing matches exactly "ava Se".
I was expecting the same result as above.
Leading wildcard does not work.
Is there any way to achieve this?
Thanks,
Baskar.S
If you have a look at the javadoc for "Type.PHRASE_PREFIX" you will see that only the last term in the string is used as a prefix, thus only "Se" in your case.
I tried this query in my index and it worked:
.setQuery(QueryBuilders.matchQuery("body", "(.*?)ing the").type(MatchQueryBuilder.Type.PHRASE_PREFIX))
It returned documents that contain phrases like "We are strengthening the proposals..", "By using the.."
You need to use an nGram analyzer; edgeNGram might be an even better idea.
Once you have done that, your index might be a bit heavier, but affix search will work fine without wildcards.
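The following Python sketch mimics why n-gram indexing lets "ava Se" match "Java Search": every word is indexed as its character n-grams, so a word fragment becomes an ordinary term lookup. It is a simplified local model with assumed gram sizes, not Elasticsearch code, and it also keeps each whole word as a term so that full-word queries still match.

```python
def ngrams(word, min_gram=2, max_gram=3):
    """Return all character n-grams of a word for lengths min_gram..max_gram."""
    return {
        word[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(word) - n + 1)
    }

def phrase_matches(query, text):
    """True if the query words match consecutive text words by n-gram membership."""
    q = query.lower().split()
    words = [ngrams(w) | {w} for w in text.lower().split()]
    return any(
        all(q[j] in words[i + j] for j in range(len(q)))
        for i in range(len(words) - len(q) + 1)
    )

for title in ("Java Search", "Elastic Java Search", "Elastic Search"):
    print(title, phrase_matches("ava Se", title))
```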

Stemming and partial search using MongoDB 2.4

What is the correct way of doing full text search and partial searches in MongoDB?
E.g. the Norwegian word "sokk" (sock).
When searching for "sokk" I want to match on "sokker" (sock in plural), "sokk" and "sokkepose"
A search for "sokker" should match "sokk" and "sokker".
I get the wanted result by using this ruby snippet:
def self.search(q)
  result = []
  # Full text search first
  result << Ad.text_search(q).to_a
  # Then search for parts of the word
  result << Ad.any_of({ title: /.*#{q}.*/i }, { description: /.*#{q}.*/i }).to_a
  result.flatten!
  result.uniq
end
Any suggestions? :)
Cheers,
Martin Stabenfeldt
Martin,
A few suggestions / recommendations / corrections:
Full Text Search in 2.4 is not production ready and should not be deployed in production without knowing the tradeoffs being made. You can find more details at - http://docs.mongodb.org/manual/tutorial/enable-text-search/
For Text Search to work, you need to provide the appropriate language for the document when adding it (or for specific fields in 2.6). This ensures the words are appropriately stemmed and stop words are removed when indexing that field.
Specify the language while searching for a specific field as well, so that the search terms are appropriately stemmed and stop words are removed before searching and ranking the results. You can find more details about both indexing and searching at http://docs.mongodb.org/manual/reference/command/text/ . You can also see the languages that are supported by the MongoDB FTS on that webpage.
Ideally you would not be using regular expressions while doing a full text search, but rather specify the words / strings that you are looking for along with the language.
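As a sketch of the language-related points above, the helpers below only build the documents you would pass to a driver; they do not talk to a server. The collection name `ads`, the field names, and the exact shape of the 2.4 `text` command are assumptions here and should be checked against the linked documentation.

```python
def text_index_spec(fields, language):
    """Return (keys, options) for a text index with a default language."""
    keys = [(field, "text") for field in fields]
    return keys, {"default_language": language}

def text_command(collection, search, language):
    """Build a 2.4-style text command document; the command name comes first."""
    return {"text": collection, "search": search, "language": language}

# Hypothetical collection and fields, following the question's Ad model.
keys, options = text_index_spec(["title", "description"], "norwegian")
command = text_command("ads", "sokker", "norwegian")

print(keys, options)
print(command)
```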
