Solr: How to search for a full match on a text field? Is there a hidden equal() operator?

It is quite simple to describe:
q=mydynamicfield_txt:"video"
I want hits only when mydynamicfield is exactly "video".
Put the other way round: how do I suppress hits where "video" is only part of the field value (like "home video")?
Is this supported out of the box in Solr 3.1, or do I have to add my own special brackets like "SOLRSTARTSOLR video SOLRENDSOLR" to my index, so that I can later retrieve my term between "START" and "END"? A kind of manual regex anchoring.
That would be a PITA because it needs special handling in the index and the GUI, and it breaks highlighting.
What is the way to go?
regards
Peter
(=PA=)

One solution is to create an untokenized (KeywordAnalyzer) field and search within it; the whole field value then becomes a single distinct token in the Solr index (see the schema sketch below).
Another solution is to write a filter that reads the token count from the index and compares it to the number of query tokens, i.e. filter out entities where doc_tokens > query_tokens, assuming that all query tokens are matched.
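A minimal sketch of the first solution for a Solr 3.x schema.xml; the field type and dynamic field names here are made up for illustration. KeywordTokenizerFactory keeps the entire value as one token, so only a full match succeeds:

<fieldType name="string_exact" class="solr.TextField">
  <analyzer>
    <!-- keep the whole field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- optional: case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<dynamicField name="*_exact" type="string_exact" indexed="true" stored="true"/>

With that in place, q=mydynamicfield_exact:"video" matches only documents whose entire field value is "video", so "home video" no longer matches.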

Related

How to search for an exact word in a text in Elasticsearch

Let's say I have two texts:
Text 1 - "The fox has been living in the wood cabin for days."
Text 2 - "The wooden hammer is a dangerous weapon."
And I would like to search for the word "wood" without it matching "wooden hammer". How would I do that in Elasticsearch or NEST?
A term query is used for exact-match searches. However, it's not recommended against text fields, as the following quote from the term query documentation explains:
To better search text fields, the match query also analyzes your provided search term before performing a search. This means the match query can search text fields for analyzed tokens rather than an exact term.
The term query does not analyze the search term. The term query only searches for the exact term you provide. This means the term query may return poor or no results when searching text fields.
The problem with exact matches on text fields, as described in the term query documentation:
By default, Elasticsearch changes the values of text fields as part of analysis. This can make finding exact matches for text field values difficult.
So, the document's data is modified (i.e., analyzed) before indexing. How it is analyzed depends on the index mapping definition for each field, which falls back to the index's default analyzer or, failing that, the standard analyzer.
But the default standard analyzer will not change the token "Wooden" to "Wood"; that would only happen if you used stemming for this field.
This means that if you don't use a different analyzer or stemming, querying for "Wood" shouldn't match the "Wooden" token.
To summarize: indexed data is modified/analyzed before indexing (based on the field mapping definition). The match query analyzes the search query, while the term query doesn't. So you have to choose the field mapping and the search query that properly suit your use case.
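As a quick illustration of the difference, assume a hypothetical index articles with a text field body indexed by the standard analyzer (which lowercases tokens):

GET /articles/_search
{ "query": { "match": { "body": "Wood" } } }

GET /articles/_search
{ "query": { "term": { "body": "Wood" } } }

The match query analyzes "Wood" into the token wood and finds documents containing it; the term query looks up the literal term Wood, which never exists in the index because everything was lowercased at index time, so it returns nothing.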
For some use cases, like storing email addresses, phone numbers, or keyword fields that always hold the same value, consider using the keyword type, which is suitable for exact matches. However, ES recommends:
Avoid using keyword fields for full-text search. Use the text field type instead.
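For instance, a minimal mapping sketch (index and field names are made up for illustration):

PUT /users
{
  "mappings": {
    "properties": {
      "email": { "type": "keyword" },
      "bio":   { "type": "text" }
    }
  }
}

A term query on email then matches only the exact stored value, while bio stays analyzed for full-text search.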
So, for a practical solution to your use case, it would help if you shared the field mapping you use and what you want to achieve.

Avoid part of a string search in elasticsearch

I have a scenario where I want to search for 'bank of india', and the documents retrieved have hits for 'reserve bank of india', 'state bank of india', etc. Basically, the named entity in the search string is part of another named entity as well.
What are the ways to avoid this in Elasticsearch?
If you use the keyword type instead of text as the mapping for your entity field, you will no longer have those partial matches. keyword says "treat this text like a single unit" (named entities are like this), while text says "treat each word as a unit and consider the field a bag of words", so the query looks for the most word matches, regardless of order or whether all of the words are there. There are queries that can require order (match_phrase) and require all words to match (the minimum_should_match parameter), but I like to use the term query if you follow the keyword mapping strategy, as sketched below. Does that make sense?
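A minimal sketch of that strategy, with a made-up index name entities:

PUT /entities
{ "mappings": { "properties": { "entity": { "type": "keyword" } } } }

GET /entities/_search
{ "query": { "term": { "entity": "bank of india" } } }

Because 'reserve bank of india' is indexed as one single token, the term query above only matches documents whose entire entity value is exactly 'bank of india'.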

Elastic search giving strange results

I am following this tutorial on Elasticsearch.
Two employees have an 'about' value of:
"about": "I love to go rock climbing"
"about": "I like to collect rock albums"
I run the following query:
GET /megacorp/employee/_search
{"query":{"match":{"about":"rock coll"}}}
Both of the above entries are returned, but surprisingly with the same score:
"_score": 0.2876821
Shouldn't the second one have a higher score, since its 'about' value contains both 'rock' and 'coll' while the first one contains only 'rock'?
That totally depends on what analyzer you are using. If you are using the standard or english analyzer, this result is correct. I recommend spending some time with Elasticsearch's Analyze API to get familiar with how each analyzer affects your text; see the example below.
By the way, if you want the second document to have a higher score, take a look at partial matching.
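For example, the _analyze endpoint shows exactly which tokens the standard analyzer produces for the second document's text:

GET /_analyze
{ "analyzer": "standard", "text": "I like to collect rock albums" }

This returns the tokens i, like, to, collect, rock, albums. There is no token coll, so only rock can match the query.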
When we search on a full-text field, we need to pass the query string through the same analysis process as we have when we index a document, to ensure that we are searching for terms in the same form as those that exist in the index.
The analysis process usually consists of normalization and tokenization (the string is split into individual terms by a tokenizer).
As for the match query:
If you run a match query against a full-text field, it will analyze the query string with the correct analyzer for that field before executing the search, and then look for exactly the words that come out of that analysis.
So, in your match query, Elasticsearch will look for occurrences of the whole separate words rock and/or coll.
Your second document doesn't contain the separate word coll but was matched by the word rock.
Conclusion: the two documents are equivalent in their _score value because they were matched by the same single word, rock.
Elasticsearch analyzes each text field before storing it. The default analyzer (the standard analyzer) splits the text on whitespace and lowercases it. The output of the analysis process is a list of tokens, which are used to match your query tokens. If any of the tokens match exactly, the relevant document is returned. That being said, your second document doesn't contain the token coll, and that's why you get the same score for both documents.
Even if you build your custom analyzer and use stemming, the word collect won't be stemmed as coll.
You can also build a custom analyzer that produces tokens of length one character; Elasticsearch will then treat each single character as a token, and you can search for the existence of any character in your documents, as sketched below.
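A sketch of such an analyzer using the built-in ngram tokenizer (the index, analyzer, and field names are made up; note that single-character tokens blow up the index size and are rarely what you actually want):

PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "single_char": { "type": "ngram", "min_gram": 1, "max_gram": 1 }
      },
      "analyzer": {
        "single_char_analyzer": {
          "type": "custom",
          "tokenizer": "single_char",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "about": { "type": "text", "analyzer": "single_char_analyzer" }
    }
  }
}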

Get top 10 most used words in text fields

I have an index containing thousands of documents, each one of them having a full text field.
I want to search through all those fields and fetch the 10 words that come up most often.
I would also like a way of visualizing it on Kibana if that's possible.
The most common way to achieve that is to duplicate your full-text field with a keyword datatype sub-field, which makes it possible to run a terms aggregation on that field (see the docs). You could also consider a significant_terms aggregation (see the docs) to avoid the presence of stopwords and common words. In ES 6.x you could also use the significant_text aggregation (see the docs) without creating the keyword field, but I have never tried it and don't know how it works. If instead you need to retrieve the frequency of the words for each document, you should use the term vectors API (see the docs). A sketch of the first approach follows.
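A minimal sketch of the keyword-duplicate route (index and field names are made up; note that a terms aggregation on a keyword field buckets whole field values, so for counts of individual words you would still need the term-vector route or fielddata on the text field):

PUT /docs
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": { "raw": { "type": "keyword" } }
      }
    }
  }
}

GET /docs/_search
{
  "size": 0,
  "aggs": {
    "top_values": { "terms": { "field": "content.raw", "size": 10 } }
  }
}

In Kibana, the same terms aggregation can back a data table or tag cloud visualization.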

Exact phrase search using lucene without increasing number of fields

For a phrase search, we want to bring up results only if there's an exact match (without ignoring stopwords). For a non-phrase search, we are fine displaying results even if only the root form of the word matches.
We currently pass our data through a StandardTokenizer, StopFilter, PorterStemFilter, and LowerCaseFilter. Because of this, when a user searches for "password management", the search brings up results containing "password manager".
If I remove the StemFilter, I will no longer be able to match the root form of the word for non-phrase queries. I was thinking of indexing the same data in two fields of the document.
I have asked the same question at Different indexing and search strategies on same field without doubling index size?. However, folks at the office are not happy about indexing the same data in two fields (we currently have around 20 text fields in the Lucene document). Is there any way to support both of the cases listed above using TokenFilters?
Say, for a StopFilter, change it so that it emits both the input token and ? (for an ignored word) with the same position increment. Similarly, the StemFilter would emit both the input token and the stemmed token with the same position increment. Basically, input and output tokens (even ignored ones) share the same positions.
Is it safe to go ahead with this approach? Has anyone else faced these requirements? Are there any readily available Filters that do something similar to what I described?
Thanks
I don't understand what you mean by "input and output tokens." Are you storing the data twice - once as stemmed and once non-stemmed?
If you aren't storing it twice, I don't think your method will work. Suppose the stored word is jumping and they search for jumped. Your query parser can emit jump and jumped but it still won't match jumping unless you have a value stored as jump.
And if you're going to store the value once as stemmed and once as non-stemmed, then why not just store it in two fields? Then you won't have to deal with weird tokenizer changes.
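For what it's worth, the stemming half of the idea can be done in a single field with stock Lucene filters: KeywordRepeatFilter emits each token twice at the same position (one copy flagged as a keyword so the stemmer leaves it alone, one copy that gets stemmed), and RemoveDuplicatesTokenFilter drops the extra copy when stemming changed nothing. A sketch of such an analyzer for a recent Lucene version (the filter classes are real Lucene classes; the chain itself is illustrative and leaves the StopFilter question aside):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemAndKeepOriginalAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Emit every token twice at the same position: one copy marked as a
        // keyword (skipped by the stemmer), one copy to be stemmed.
        stream = new KeywordRepeatFilter(stream);
        stream = new PorterStemFilter(stream);
        // If stemming left the token unchanged, drop the duplicate.
        stream = new RemoveDuplicatesTokenFilter(stream);
        return new TokenStreamComponents(source, stream);
    }
}

With this, "management" is indexed as both management and manag at one position. If the query side skips stemming for phrase queries, the exact surface form is still in the index to match against, while stemmed non-phrase queries continue to match the root form.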