Lucene: how to keep Lithuanian language symbols in StandardAnalyzer? - utf-8

I have built my own analyzer for removing unnecessary data and stop words with Lucene (version 4.3.0).
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, new CharArraySet(Version.LUCENE_43, stopWords, true));
Everything works as expected, but my language is Lithuanian, so I would like to keep the Lithuanian symbols: 'ĄČĘĖĮŠŲŪŽąčęėįšųūž'. The main problem is that Lithuanian doesn't have its own analyzer.
At the moment, words are truncated (the ĄČĘĖĮŠŲŪŽąčęėįšųūž symbols are dropped).
Any suggestions on how to override the format method / keep these symbols? I don't need a stemming tool.

My bad.. Yes, StandardAnalyzer is not the problem here; I was reading the data as the wrong encoding (UTF-8) when it was actually written in Windows-1257. This produced garbage symbols, which were then treated as rubbish by the analyzer. Changing to the right encoding solved the issue :)
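For anyone hitting the same thing, here is a minimal sketch of the fix (the file name, stop words and the indexing loop are illustrative, not from the original code): read the source with the charset it was actually written in, then hand the correctly decoded text to the analyzer as usual.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class LithuanianIndexing {
    public static void main(String[] args) throws IOException {
        List<String> stopWords = Arrays.asList("ir", "bet");  // illustrative stop words
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43,
                new CharArraySet(Version.LUCENE_43, stopWords, true));

        // Decode the input with the charset it was actually written in
        // (Windows-1257, the Baltic code page) instead of assuming UTF-8.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("data.txt"), Charset.forName("windows-1257")))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // feed the correctly decoded line to the analyzer / index here
            }
        }
        analyzer.close();
    }
}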

Related

Extend Endeca's diacritic folding mapping

We have an index with mixed Greek and English data for an ATG-Endeca application. The indexed Greek data has words with accents. If the search terms are without accents, they don't match any data (or they match due to autocorrection that maps the character without the accent to the character with the accent, and this is not the desired functionality). The Dgidx --diacritic-folding flag configuration doesn't include a mapping for Greek characters (https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html).
Is it possible to extend this out-of-the-box functionality through a properties file on the Endeca side, or through Nucleus or code?
In the documentation you provide it states:
Dgidx supports mapping Latin1, Latin extended-A, and Windows CP1252 international characters to their simple ASCII equivalents during indexing.
This suggests that Greek is not supported, since it doesn't fall into any of these character sets (I believe Greek is Latin-7). That said, you could try setting a language flag at record level (since you indicate that your data includes both English and Greek), assuming that each language has its own record, or try to set a global language using the dgidx and dgraph parameters, but this will affect things like stemming for records or properties not in the global language.
dgidx --lang el
dgraph --lang el
Though I'm not sure it will work based on the original statement.
Alternatively, you can implement a process of diacritic removal using a custom accessor, which extends the atg.repository.search.indexing.PropertyAccessorImpl class (an option since you refer to Nucleus, so I assume you are using ATG/Oracle Commerce). Using this, you specify a normalised searchable field in your index that duplicates each searchable field in your current index, but with all diacritics removed. The same logic you apply in the accessor then needs to be applied as a preprocessor on your search terms, so that you normalise the input to match the indexed values. Lastly, make your original fields in the index (with the accented characters) display-only and the normalised fields searchable (but don't display them).
The result will be matching against your normalised text, but the downside is that you have duplicated data, so your index will be bigger. Not a big issue with small data sets. There may also be an impact on how the OOTB functionality, like stemming, behaves with the normalised data set. You'll have to do some testing with various scenarios in Greek and English to see whether precision and recall are adversely affected.
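As a rough illustration of the normalisation step (the class and method names below are a sketch, not Endeca/ATG API; the same helper would be called from your custom accessor at index time and from the query preprocessor at search time):

import java.text.Normalizer;

public final class DiacriticStripper {

    // Decompose accented characters (NFD) and drop the combining marks,
    // so e.g. Greek "καλημέρα" becomes "καλημερα".
    public static String strip(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}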

ElasticSearch: what analyzer to use for searching code

I'm writing a search tool for searching code, but I'm having a hard time finding the right analyzer to use. I've tried a whitespace analyzer, but you end up with issues where you might have dbo.My_Procedure, and searching "my_procedure" should work as well as searching ".My_Procedure". My idea is to split on special characters but also store them as their own tokens. But then if you write my_procedure as a search, it will just look for my, _ and procedure anywhere in the file unless you wrap it in quotes (even though to the user it looks like just one word). What approach have people taken for analyzing code?
If your code is in Java, according to Java naming conventions your methods and classes should be camel-case so you should not run into names like my_search but rather mySearch.
If that is the case, you can use the (default) standard analyzer, which splits on word boundaries.
That said, if there's no way around it and you have to consider names like my_search in the tokenizing part, you can implement your own custom analyzer.
This answer shows an example of setting up a custom analyzer.
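Since Elasticsearch analyzers are Lucene analyzers underneath, here is a rough Lucene-level sketch of the idea (Lucene 4.x API as used earlier on this page; the class name and flag choices are illustrative, and in Elasticsearch you would express the same thing as a custom analyzer in the index settings): keep the original token, but also emit the parts it splits into on punctuation, underscores and case changes, and apply the same analysis at query time so a query like my_procedure matches on its parts.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

public class CodeAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The whitespace tokenizer keeps "dbo.My_Procedure" as one token ...
        WhitespaceTokenizer source = new WhitespaceTokenizer(Version.LUCENE_43, reader);
        // ... then the word delimiter filter also emits the sub-words
        // "dbo", "My", "Procedure" while preserving the original token.
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                | WordDelimiterFilter.PRESERVE_ORIGINAL;
        TokenStream filter = new WordDelimiterFilter(source, flags, null);
        filter = new LowerCaseFilter(Version.LUCENE_43, filter);
        return new TokenStreamComponents(source, filter);
    }
}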

How to configure standard tokenizer in elasticsearch

I have a multi-language data set and a standard analyzer that takes care of the tokenizing for this data set very nicely. The only bad part is that it removes special characters like #, :, etc. Is there any way that I can use the standard tokenizer and still be able to search on the special characters?
I have already looked into the combo analyzer plugin, which did not work as I had hoped. Apparently the combination of analyzers does not work in a chain like the token filters; they work independently, which is not useful for me.
I also looked into the char mapping filter in order to process the data before tokenizing it, but it does not work like the word delimiter token filter, where we can specify "type_table" to convert a special character into an ALPHANUM. It just maps one word to another word. As a result I won't be able to search on the special characters.
I have also looked into the pattern analyzers, which would work for the special characters, but they are not recommended for a multi-language data set.
Can anybody point me in the right direction in order to solve this problem?
Thanks in advance!

Things to take into account when internationalizing a web app to handle the Chinese language

I have an MVC3 web app with i18n in 4 Latin languages... but I would like to add Chinese in the future.
I'm working with standard resource files.
Any tips?
EDIT: Anything about reading direction? Numbers? Fonts?
I would start with these observations:
Chinese text is not word-delimited, meaning that a search engine (if needed) cannot rely on punctuation and whitespace alone to find words (basically, each character is a word); also, you might have mixed Latin and Chinese words
make sure to use UTF-8 for all your HTML documents (.resx files are UTF-8 by default)
make sure that your database collation supports Chinese - or use a separate database with an appropriate collation
make sure you don't reverse strings or do other unusual text operations that might break with multi-byte characters
make sure you don't call ToLower and ToUpper to check user-input text, because again this might break with other alphabets (or rather scripts) - aka the Turkey Test (see the sketch after this list)
To test for all of the above and other possible issues, a good way is pseudolocalization.
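The point above is about .NET's ToLower/ToUpper, but the same pitfall is easy to demonstrate in Java (a minimal sketch; the strings are arbitrary): case mapping depends on the locale, so a check that works in English silently breaks for Turkish users.

import java.util.Locale;

public class TurkeyTest {
    public static void main(String[] args) {
        String input = "TITLE";
        // In the Turkish locale, 'I' lower-cases to dotless 'ı',
        // so the result is "tıtle" rather than "title".
        System.out.println(input.toLowerCase(new Locale("tr", "TR")));
        // A locale-independent lowering (or equalsIgnoreCase) avoids this.
        System.out.println(input.toLowerCase(Locale.ROOT));
        System.out.println("title".equalsIgnoreCase(input));
    }
}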

Validating Kana Input

I am working on an application that allows users to input Japanese characters. I am trying to come up with a way to determine whether the user's input is Japanese (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but there are Unicode ranges for katakana and hiragana listed on Wikipedia (which I would expect are also available from unicode.org):
Hiragana: U+3040–U+309F
Katakana: U+30A0–U+30FF
Checking the input against those ranges should work as a language-agnostic validation for hiragana or katakana.
For kanji, I would expect it to be a little more complicated, as I expect that the Chinese characters used in Chinese and Japanese are both included in the same range, but then again, I may be wrong here. (I can't expect Simplified Chinese and Traditional Chinese to be included in the same range...)
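A minimal, language-agnostic sketch of that range check in Java, using the standard Character.UnicodeBlock lookup (the class and method names are mine; which blocks count as valid for a given field is up to the application):

public class JapaneseScriptCheck {

    public static boolean isHiragana(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.HIRAGANA;
    }

    public static boolean isKatakana(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.KATAKANA;
    }

    public static boolean isKanji(int codePoint) {
        // CJK Unified Ideographs covers the common kanji, but Chinese hanzi
        // live in the same block, so this cannot tell Japanese from Chinese.
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // e.g. restrict a katakana-only field:
    public static boolean isAllKatakana(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(JapaneseScriptCheck::isKatakana);
    }
}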
Oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
Regex is great because you double the problems. And I did it in PHP, my choice for extra-strong automatic problem generation.
--edit--
// matches runs of anything that is not a word character, hiragana (ぁ-ゔ), katakana (ァ-ヺ), the prolonged sound mark ー, or a common kanji (U+4E00–U+9FAF)
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.
