I have a multi-language data set and a Standard analyzer that takes care of the tokenizing for this data set very nicely. The only bad part is that it removes special characters like #, :, etc. Is there any way that I can use the standard tokenizer and still be able to search on the special characters?
I have already looked into the combo analyzer plugin, which did not work as I had hoped. Apparently the combined analyzers do not work in a chain like token filters do. They work independently, which is not useful for me.
Also I looked into the char mapping filter in order to process the data before tokenizing it, but it does not work like the word delimiter token filter where we can specify "type_table" to convert a special character into an ALPHANUM. It just maps one word to another word. As a result I won't be able to search on the special characters.
Also, I have looked into the pattern analyzers, which would work for the special characters but they are not recommended for a multi language data set.
Can anybody point me in the right direction in order to solve this problem?
Thanks in advance!
Related
I'm setting up my elastic instance in a schema-less manner (no up front mappings) and the application requires users be able to search against a field that contains a word that may or may not be tokenized into multiple strings. For example, the field may contain the word "ONETWO". The spec requires that a user should be able to search "ONETWO", "ONE", and "TWO" and retrieve that same document. There doesn't seem to be any easy way to accomplish this even with a custom tokenizer (and I don't think there SHOULD be an easy way to do this -- or any way at all). Just want to confirm my thoughts.
It's very easy to meet your requirement with a custom analyzer that uses the n-gram tokenizer. You can also add a lowercase token filter, so that even though your text is ONETWO, a user who searches for one, One, or ONE still gets a result. For this you need to apply a different analyzer at search time; read more about it at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html.
Refer to https://devticks.com/how-to-improve-your-full-text-search-in-elasticsearch-with-ngram-tokenizer-e346f29f8ddb for more information, and let me know if you need anything else.
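As a rough illustration of the n-gram idea above, here is a small Python sketch. It simulates what an n-gram tokenizer plus lowercase filter would put in the index, and what a search-time analyzer without n-grams would look up. The gram sizes (3 to 6), the analyzer names and the field name "name" are assumptions chosen for the example, and the settings dictionary is only a sketch of what an index creation body might look like, not a tested configuration.

```python
def ngrams(text, min_gram, max_gram):
    """Emit every substring of text whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def index_tokens(value, min_gram=3, max_gram=6):
    # Index side: n-grams of the lowercased value.
    return set(ngrams(value.lower(), min_gram, max_gram))


def search_token(query):
    # Search side: lowercase only, no n-grams.
    return query.lower()


# Assumed index body; all names here are hypothetical.
# index.max_ngram_diff must allow max_gram - min_gram (here 6 - 3 = 3).
INDEX_BODY = {
    "settings": {
        "index": {"max_ngram_diff": 3},
        "analysis": {
            "tokenizer": {
                "substring_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 6}
            },
            "analyzer": {
                "substring_index": {
                    "type": "custom",
                    "tokenizer": "substring_tokenizer",
                    "filter": ["lowercase"],
                },
                "substring_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "substring_index",
                "search_analyzer": "substring_search",
            }
        }
    },
}

indexed = index_tokens("ONETWO")
for query in ["ONETWO", "ONE", "two"]:
    print(query, search_token(query) in indexed)
```

Note that the full word only matches because max_gram is at least as long as the word; with shorter grams, a search for the whole word would need n-grams on the query side as well.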
We have an index with mixed Greek and English data for an ATG-Endeca application. The indexed Greek data contains words with accents. If the search terms are entered without accents, they don't match any data (or they match only because autocorrection replaces the unaccented character with the accented one, which is not the desired functionality). The Dgidx --diacritic-folding flag configuration doesn't include a mapping for Greek characters (https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html).
Is it possible to extend this OOTB functionality through a properties file on the Endeca side, or through Nucleus or code?
In the documentation you provide it states:
Dgidx supports mapping Latin1, Latin extended-A, and Windows CP1252 international characters to their simple ASCII equivalents during indexing.
This suggests that Greek is not supported, since it doesn't fall into any of these character sets (I believe Greek is ISO 8859-7). That said, you could try setting a language flag at the record level (since you indicate that your data includes both English and Greek), assuming that each language has its own record. Alternatively, you could try to set a global language using the dgidx and dgraph parameters, but this will affect things like stemming for records or properties not in the global language.
dgidx --lang el
dgraph --lang el
Though I'm not sure it will work based on the original statement.
Alternatively, you can implement a process of diacritic removal using a custom Accessor, which extends the atg.repository.search.indexing.PropertyAccessorImpl class (an option since you refer to Nucleus, so I assume you are using ATG/Oracle Commerce). Using this, you specify a normalised searchable field in your index that duplicates the searchable fields in your current index but with all diacritics removed. The same logic you apply in the Accessor then needs to be applied as a preprocessor on your search terms, so that you normalise the input to match the indexed values. Lastly, make your original fields in the index (with the accented characters) display-only and the normalised fields searchable (but don't display them).
The result will be matching on your normalised text, but the downside is that you have duplicated data, so your index will be bigger. This is not a big issue with small data sets. There may also be an impact on how the OOTB functionality, like stemming, behaves with the normalised data set. You'll have to do some testing with various scenarios in Greek and English to see whether precision and recall are adversely affected.
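The normalisation step itself is small. The sketch below shows one common way to strip diacritics, using Unicode decomposition; it is written in Python purely to illustrate the logic and is not the ATG Accessor itself. The function name is hypothetical, and the same function would have to be applied both when building the normalised field and when preprocessing search terms.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining accent marks while keeping the base letters."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


# Index side and query side must go through the same function.
indexed_value = strip_diacritics("καλημέρα")
search_term = strip_diacritics("καλημερα")
print(indexed_value == search_term)
```

Because the marks are removed by category rather than by a hand-written mapping table, the same code also folds Latin accents, which suits a mixed Greek and English index.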
I'm writing a search tool for searching code but I'm having a hard time finding the right analyzer to use. I've tried doing a whitespace analyzer but you end up with issues where you might have dbo.My_Procedure and searching "my_procedure" should work as well as searching ".My_Procedure". My idea is to split on special characters but store them into their own tokens as well. But then if you write my_procedure as a search it will just look for my, _ and procedure anywhere in the file unless you wrap it in quotes (even though to the user it looks like it's just one word). What approach have people taken for analyzing code?
If your code is in Java, according to Java naming conventions your methods and classes should be camel-case so you should not run into names like my_search but rather mySearch.
If that is the case - you can use the (default) standard analyzer which uses word boundaries as delimiters for split.
That said, if there's no way around it and you have to consider names like my_search in the tokenizing part, you can implement your own custom analyzer.
This answer shows an example of setting a custom-analyzer.
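To make the custom-analyzer idea a little more concrete, here is a minimal Python sketch of one possible tokenization strategy for code: keep each whitespace-separated chunk whole, and also emit the pieces obtained by splitting on "." and on "_". This is an assumption about what such an analyzer could do, not a description of any built-in analyzer, and the function name is hypothetical.

```python
import re


def code_tokens(text):
    """Emit each chunk at several granularities, lowercased, without duplicates."""
    tokens = []
    for chunk in text.lower().split():
        candidates = [chunk]
        # Qualified-name parts, e.g. "dbo" and "my_procedure".
        candidates += [part for part in chunk.split(".") if part]
        # Individual words, e.g. "my" and "procedure".
        candidates += [part for part in re.split(r"[._]", chunk) if part]
        for token in candidates:
            if token not in tokens:
                tokens.append(token)
    return tokens


print(code_tokens("dbo.My_Procedure"))
```

With tokens at every level in the index, a search for my_procedure can be matched as a single term instead of as the separate words my and procedure.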
I have a string field with values like PA2456U or PA23U-RB and I would like to do a partial match, so that I can search for PA24 and get the first result, or search for PA23U-RB and find the second result (so that would be a full match).
I tried using ngram, but it ignores the numeric values, so if I enter pa111 it returns anything that starts with pa.
See this gist for an example.
This may be a separate question, or related, but searching for 12345001 should also match 12345-001
Thanks
Update
The final analyzer I used is here: https://gist.github.com/3803180
Making ngrams looks like a good choice based on your requirements, but I think edge_ngrams should be enough. This way your index would grow a little more slowly, since you'd be indexing fewer terms. The real problem is that you should not apply the same analyzer to the query; otherwise querying for pa111 means querying for all the ngrams that can be made out of it, which gives you a lot more matches than you'd expect.
You just need to change your search_analyzer to an analyzer that doesn't make ngrams. You can use the same one you already have and remove the ngram token filter (only for the search_analyzer; the index_analyzer is fine).
Regarding the dash question, have a look at the Word delimiter token filter. You need to configure it to make it work as you expect. I guess the generate_number_parts=false, generate_word_parts=false and split_on_numerics=false options should make it work as you want. That way the dash won't be indexed. You need to apply the token filter at both index time and query time.
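The effect of analyzing the index side and the query side differently can be sketched in a few lines of Python. This is a simulation, not Elasticsearch configuration: the gram sizes are assumptions, and dropping hyphens in normalize is a simple stand-in for whatever word delimiter configuration ends up being used.

```python
def edge_ngrams(term, min_gram=2, max_gram=20):
    """Return the prefixes of term that are min_gram to max_gram characters long."""
    upper = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, upper + 1)]


def normalize(value):
    # Applied at both index time and query time: lowercase and drop hyphens.
    return value.lower().replace("-", "")


def index_terms(value):
    # Index side: normalize, then expand into edge n-grams.
    return set(edge_ngrams(normalize(value)))


def search_term(query):
    # Query side: normalize only, no n-grams.
    return normalize(query)


def matches(query, value):
    return search_term(query) in index_terms(value)


print(matches("PA24", "PA2456U"))
print(matches("pa111", "PA2456U"))
print(matches("PA23U-RB", "PA23U-RB"))
print(matches("12345001", "12345-001"))
```

If the query were also expanded into edge n-grams, pa111 would produce the term pa, which is why it matched everything starting with pa.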
I use Solr 3.3 and I need to use the Suggest component to build an autocomplete.
I would like to preserve hyphenated words in the suggestions (for example: "Wi-fi").
With the different field type configurations I have tried, I get the word "wifi" or "wi" instead.
Does anyone know which filter can do this?
Thanks
What does your schema look like (the autocomplete type)?
You could use solr.WhitespaceTokenizerFactory. It doesn't tokenize on extraneous characters like hyphens.
If you want to remove these characters, you need to use solr.PatternReplaceFilterFactory or solr.PatternReplaceCharFilterFactory, or even create your own custom Tokenizer.
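The difference between the two options can be illustrated with a short Python simulation. It only mimics the behaviour described above (splitting on whitespace versus additionally replacing a pattern); it is not Solr configuration, and the function names are hypothetical.

```python
import re


def whitespace_tokens(text):
    # Split on whitespace only, so hyphenated words stay in one piece.
    return text.split()


def remove_hyphens(token):
    # Pattern replacement applied per token, for when the hyphen should go.
    return re.sub(r"-", "", token)


tokens = whitespace_tokens("Wi-fi router")
print(tokens)
print([remove_hyphens(token) for token in tokens])
```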