Solr: preserve hyphenated words for suggest - filter

I use Solr 3.3 and I need to use the suggest component to build an autocomplete.
I would like to preserve hyphenated words in the suggestions (for example: "Wi-fi").
With the different field type configurations I have tried, I end up with the word "wifi" or "wi".
Does anyone know which filter can do this?
Thanks

What does your schema look like (the autocomplete type)?
You could use solr.WhitespaceTokenizerFactory. It doesn't tokenize on extraneous characters like hyphens.
If you want to remove these characters instead, you need to use solr.PatternReplaceFilterFactory, solr.PatternReplaceCharFilterFactory, or even create your own custom Tokenizer.
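To keep the hyphen, a field type along these lines (the name "suggest_text" is only illustrative) leaves "Wi-fi" as a single token:

<fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, so "Wi-fi" stays one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercase so suggestions match regardless of case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>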

Related

Is there any way to ignore the mapping character filter?

I set up the mapping character filter with mappings_path.
Is it possible to ignore the mapping in some special cases when searching?
Thanks!
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html#analysis-mapping-charfilter
If I understand the question correctly, you can use multi-fields, which let you apply multiple analyzers to the same source field. Please check this document. It would be great if you could give more details.
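As a rough sketch (the index, field, and analyzer names are only placeholders), you could index the field twice: one sub-field with the mapping char filter applied and one without it, and choose the variant at query time:

PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings_path": "analysis/mappings.txt"
        }
      },
      "analyzer": {
        "mapped_text": {
          "type": "custom",
          "char_filter": ["my_mapping"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "mapped_text",
        "fields": {
          "no_mapping": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

A query against title then goes through the mapping char filter, while a query against title.no_mapping skips it.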

Searching for a term as both a single string and a multi-word string

I'm setting up my Elasticsearch instance in a schema-less manner (no up-front mappings), and the application requires that users be able to search against a field containing a word that may or may not be tokenized into multiple strings. For example, the field may contain the word "ONETWO". The spec requires that a user be able to search "ONETWO", "ONE", or "TWO" and retrieve that same document. There doesn't seem to be any easy way to accomplish this, even with a custom tokenizer (and I don't think there SHOULD be an easy way to do this -- or any way at all). I just want to confirm my thoughts.
It is quite easy to cover your requirement with a custom analyzer that uses the n-gram tokenizer. You can also add a lowercase token filter, so that even though your text is ONETWO, a user searching for one, One, or ONE will still get a result. Note that for this you need to apply a different analyzer at search time; read more about it at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html.
Refer to https://devticks.com/how-to-improve-your-full-text-search-in-elasticsearch-with-ngram-tokenizer-e346f29f8ddb for more information, and let me know if you need anything else.
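A minimal sketch of that setup (index and field names are my own, and it does require defining index settings up front, unlike a purely schema-less setup):

PUT my-index
{
  "settings": {
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Here "ONETWO" is indexed as lowercased 3- to 10-character grams (one, two, onetwo, ...), while the standard search analyzer leaves the user's query as a single lowercased term, so ONE, TWO, and ONETWO all match the same document.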

Wildcard searches with query_string

Is it possible to enable wildcard queries by default using query_string?
I'm having to manually append * to each of the terms. I had a look at the documentation but couldn't find anything.
No, there is no way to enable it by default. You can only enable or disable leading wildcards with "allow_leading_wildcard". The way it works is that Elasticsearch tries to match tokens: if you search for car it will only match car; only when you search for car* will it also match cars (this depends on the analysis, of course).
I don't know what your exact use case is, but you should look into dealing with language (stemming and the like); it should help. Also note that using a leading wildcard can cause performance issues, which is why it is sometimes better to disable it.
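To make that concrete, a query along these lines (index and field names are placeholders) matches cars as well as car only because the * is written into the query string itself; allow_leading_wildcard merely controls whether patterns such as *car are accepted at all:

GET my-index/_search
{
  "query": {
    "query_string": {
      "query": "car*",
      "default_field": "title",
      "allow_leading_wildcard": false
    }
  }
}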

Elasticsearch: what analyzer to use for searching code

I'm writing a search tool for searching code, but I'm having a hard time finding the right analyzer to use. I've tried a whitespace analyzer, but you end up with issues where you might have dbo.My_Procedure and searching "my_procedure" should work, as should searching ".My_Procedure". My idea is to split on special characters but also store them in their own tokens. But then if you type my_procedure as a search, it will just look for my, _ and procedure anywhere in the file unless you wrap it in quotes (even though to the user it looks like just one word). What approach have people taken for analyzing code?
If your code is in Java, then according to Java naming conventions your methods and classes should be camel-case, so you should not run into names like my_search but rather mySearch.
If that is the case, you can use the (default) standard analyzer, which uses word boundaries as delimiters for splitting.
That said, if there's no way around it and you have to handle names like my_search in the tokenizing step, you can implement your own custom analyzer.
This answer shows an example of setting up a custom analyzer.
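Purely as an illustration (this is not the linked answer; the index and analyzer names are made up), a custom analyzer that keeps the original token but also splits on dots, underscores, and case changes could look roughly like this:

PUT code-index
{
  "settings": {
    "analysis": {
      "filter": {
        "code_split": {
          "type": "word_delimiter_graph",
          "split_on_case_change": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["code_split", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "source": {
        "type": "text",
        "analyzer": "code_analyzer"
      }
    }
  }
}

With this, dbo.My_Procedure is indexed both as the full token and as dbo, my, and procedure, so searches for my_procedure, procedure, or the whole name can all match. In practice you might use a simpler search-time analyzer, since preserve_original is mainly useful at index time.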

How to configure the standard tokenizer in Elasticsearch

I have a multi-language data set, and the standard analyzer takes care of tokenizing it very nicely. The only bad part is that it removes special characters like #, :, etc. Is there any way that I can use the standard tokenizer and still be able to search on the special characters?
I have already looked into the combo analyzer plugin, which did not work as I had hoped. Apparently combined analyzers do not work in a chain like token filters do; they work independently, which is not useful for me.
I also looked into the char mapping filter in order to process the data before tokenizing it, but it does not work like the word delimiter token filter, where you can specify a "type_table" to treat a special character as an ALPHANUM. It just maps one word to another word, so I still would not be able to search on the special characters.
I have also looked into the pattern analyzers, which would work for the special characters, but they are not recommended for a multi-language data set.
Can anybody point me in the right direction to solve this problem?
Thanks in advance!
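For reference, the word delimiter mechanism mentioned above looks roughly like this (the names are purely illustrative, and note that it relies on the whitespace tokenizer rather than the standard one, which is exactly the constraint in the question):

PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "keep_specials": {
          "type": "word_delimiter",
          "type_table": [
            "# => ALPHANUM",
            ": => ALPHANUM"
          ]
        }
      },
      "analyzer": {
        "whitespace_specials": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["keep_specials", "lowercase"]
        }
      }
    }
  }
}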
