We have an index with mixed Greek and English data for an ATG-Endeca application. The indexed Greek data contains accented words. If the search terms are entered without accents they don't match any data (or they match only because of autocorrection from the unaccented character to the accented one, which is not the desired behaviour). The Dgidx --diacritic-folding configuration doesn't include a mapping for Greek characters (https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html).
Is it possible to extend this out-of-the-box functionality through a properties file on the Endeca side, through Nucleus, or in code?
In the documentation you provided, it states:
Dgidx supports mapping Latin1, Latin extended-A, and Windows CP1252 international characters to their simple ASCII equivalents during indexing.
This suggests that Greek is not supported, since it doesn't fall into any of these character sets (I believe Greek falls under ISO 8859-7, which is not in that list). That said, you could try setting a language flag at record level (since you indicate that your data includes both English and Greek), assuming each language has its own record, or try setting a global language using the dgidx and dgraph parameters, but this will affect things like stemming for records or properties not in the global language:
dgidx --lang el
dgraph --lang el
Though I'm not sure it will work based on the original statement.
Alternatively, you can implement diacritic removal using a custom accessor that extends the atg.repository.search.indexing.PropertyAccessorImpl class (an option since you refer to Nucleus, so I assume you are using ATG/Oracle Commerce). With this approach you add a normalised searchable field to your index that duplicates each searchable field in your current index, but with all diacritics removed. The same logic you apply in the accessor then needs to be applied as a preprocessor on your search terms, so that the input is normalised to match the indexed values. Lastly, make your original fields in the index (with the accented characters) display-only and the normalised fields searchable (but don't display them).
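A minimal sketch of the normalisation itself (the ATG accessor override is omitted; the class and method names here are purely illustrative, and the same method would be called from both the custom accessor and your query preprocessor):

import java.text.Normalizer;

public class DiacriticStripper {

    // Decompose accented characters (Unicode NFD) and drop the combining marks,
    // so "παιδί" becomes "παιδι" and "café" becomes "cafe".
    public static String strip(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}

Applying strip() both when populating the normalised index field and when preprocessing the incoming search terms keeps the two sides in agreement.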
The result will be matching against your normalised text, but the downside is that you have duplicated data, so your index will be bigger. That is not a big issue with small data sets. There may also be an impact on how OOTB functionality, like stemming, behaves with the normalised data set. You'll have to test various scenarios in Greek and English to see whether precision and recall are adversely affected.
Based on our client requirements we configure our Oracle (12c) deployments to support single-byte or multi-byte data (through the character set setting). There is a need to cache third-party multi-byte data (JSON) for performance reasons. We found that we could encode the data in UTF-8 and persist it (after converting it to bytes) in a BLOB column of an Oracle table. This is a hack that allows us to store multi-byte data in single-byte deployments. There are certain limitations that come with this approach, such as:
The data cannot be queried or updated through SQL code (stored procedures).
Search operations using, for example, LIKE operators cannot be performed.
Marshaling and unmarshaling overhead for every operation at the application layer (Java); a sketch of this round trip follows the list.
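Concretely, the round trip looks roughly like this, assuming a table CACHE_ENTRY(ID NUMBER, PAYLOAD BLOB) (the table and column names are just examples):

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobJsonCache {

    // Marshal: encode the JSON text as UTF-8 bytes and bind them to the BLOB column.
    public void put(Connection conn, long id, String json) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO CACHE_ENTRY (ID, PAYLOAD) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setBytes(2, json.getBytes(StandardCharsets.UTF_8));
            ps.executeUpdate();
        }
    }

    // Unmarshal: read the bytes back and decode them as UTF-8.
    public String get(Connection conn, long id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT PAYLOAD FROM CACHE_ENTRY WHERE ID = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? new String(rs.getBytes(1), StandardCharsets.UTF_8) : null;
            }
        }
    }
}

(For very large payloads we would stream via setBinaryStream instead of setBytes.)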
Assuming we accept these limitations, are there any other drawbacks that we should be aware of?
Thanks.
Ok, I am summarizing my comments in a proper answer.
You have two possible solutions:
store them in NVARCHAR2/NCLOB columns
re-encode JSON values in order to use only ASCII characters
1. NCLOB/NVARCHAR
The "N" character in "NVARCHAR2" stands for "National": this type of column has been introduced exactly to store characters that can't be represented in the "database character set".
Oracle actually supports TWO character sets:
"Database Character Set" it is the one used for regular varchar/char/clob fields and for the internal data-dictionary (in other words: it is the character set you can use for naming tables, triggers, columns, etc...)
"National Character Sets": the character set used for storing NCLOB/NCHAR/NVARCHAR values, which is supposed to be used to be able to store "weird" characters used in national languages.
Normally the second one is a Unicode character set, so you can store any kind of data there, even in older installations.
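On the JDBC side this is straightforward; a minimal sketch, assuming a table CACHE_ENTRY(ID NUMBER, PAYLOAD NVARCHAR2(2000)) (the names and the NVARCHAR2 size are just examples):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NationalCharJsonStore {

    // setNString/getNString keep the value in the national character set,
    // so it survives even if the database character set is single-byte.
    public void put(Connection conn, long id, String json) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO CACHE_ENTRY (ID, PAYLOAD) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setNString(2, json);
            ps.executeUpdate();
        }
    }

    public String get(Connection conn, long id) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT PAYLOAD FROM CACHE_ENTRY WHERE ID = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getNString(1) : null;
            }
        }
    }
}

For payloads too big for NVARCHAR2, an NCLOB column with setNClob/getNCharacterStream does the same job; depending on the driver configuration you may also need the oracle.jdbc.defaultNChar connection property so that string binds are sent as national-character data.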
2. Encode JSON values using only ASCII characters
It is true that the JSON standard is designed with Unicode in mind, but it is also true that it allows characters to be expressed as escape sequences using the hexadecimal representation of their code points. If you do that for every character with a code point greater than 127, you can express ANY Unicode object using only ASCII characters.
This ASCII JSON string: '{"UnicodeCharsTest":"ni\u00f1o"}' represents the very same object as this other one: '{"UnicodeCharsTest" : "niño"}'.
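A minimal sketch of that escaping step in Java (just the character escaping; it assumes the input is already a well-formed JSON string produced by whatever JSON library you use):

public class AsciiJson {

    // Replace every character above code point 127 with its \\uXXXX escape,
    // leaving the ASCII part of the JSON text untouched.
    public static String toAscii(String json) {
        StringBuilder out = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c > 127) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}

Running toAscii over a JSON string containing "niño" produces the ni\u00f1o form shown above; characters outside the Basic Multilingual Plane come out as the usual surrogate-pair escapes, which is still valid JSON.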
Personally I prefer this second approach because it lets me share these JSON strings easily with systems using antiquated legacy protocols, and it also ensures the JSON strings are read correctly by any client regardless of its national settings (the Oracle client protocol can try to translate strings into the character set used by the client, and this is a complication I don't want to deal with; by the way, this might be the reason for the problems you are experiencing with SQL clients).
How do I index mixed-language content in Elasticsearch? Let's say we have a system where people submit content from various parts of the world: the US, Canada, Europe, Japan, Korea, India, China, Kenya, the Arab world, Russia, and everywhere else.
Content can be in any language, which we can't know beforehand, and a single submission can even mix languages. We don't want to guess the language of the content and create a separate language-specific index for each input language; we believe this is unmanageable.
We need a simple way to index this content efficiently in Elasticsearch, with full-text search as well as fuzzy string matching. Can anyone help?
What is the target you want to achieve? Do you want to have hits only in the language used at query time? Or would you also accept hits in any other language?
One approach would be to run all of Elasticsearch's different language analyzers on the input and store the results in separate fields, for instance suffixed by the language of the current analyzer.
Then, at query time, you would have to search in all of these fields if you have no method to guess the most relevant ones.
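A rough sketch of that idea using the Elasticsearch low-level REST client (the index name, field names, and the particular set of analyzers are just examples, and the mapping syntax assumes a recent, typeless Elasticsearch version):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class MultiLanguageIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // One source field, analyzed several times via multi-fields,
            // one sub-field per language analyzer.
            Request createIndex = new Request("PUT", "/contents");
            createIndex.setJsonEntity(
                "{ \"mappings\": { \"properties\": { \"body\": {"
              + "   \"type\": \"text\","
              + "   \"fields\": {"
              + "     \"en\": { \"type\": \"text\", \"analyzer\": \"english\" },"
              + "     \"de\": { \"type\": \"text\", \"analyzer\": \"german\" },"
              + "     \"ja\": { \"type\": \"text\", \"analyzer\": \"cjk\" }"
              + "   } } } } }");
            client.performRequest(createIndex);

            // Query side: search all the language sub-fields at once,
            // with fuzziness for approximate matching.
            Request search = new Request("GET", "/contents/_search");
            search.setJsonEntity(
                "{ \"query\": { \"multi_match\": {"
              + "   \"query\": \"child\","
              + "   \"fields\": [\"body\", \"body.en\", \"body.de\", \"body.ja\"],"
              + "   \"fuzziness\": \"AUTO\" } } }");
            client.performRequest(search);
        }
    }
}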
However, this is likely to blow up your index size, since you create a multitude of unused duplicates. It is IMHO also less elegant than having separate indices.
I would strongly recommend evaluating whether you really do not know the number of languages you will see in production. Having a distinct index per language would give you much more control over the input/output and let you fine-tune your engine for the actual use case.
Alternatively, you may start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case).
You will not get language-specific stemming, but you will at least have usable token streams for most languages.
First, this question (filtering design documents out of _all_docs) already seemed to be solved, as described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and it worked at first. However, in a different setup (actually just a different deployment), the query suddenly returns only an empty collection []. It seems like the ordering changed: without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot manage to filter out the design documents again.
In the end I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but I don't like that this duplicates data and causes some inconveniences with the changes feed (which is needed in another context). The filter, on the other hand, will be executed for every document.
Is it a bug that endkey=%22_%22 no longer works, and is there a more convenient way that still works?
/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order falls between the uppercase letters and the lowercase letters. So if your doc ids start with lowercase letters (the default behaviour), they will show up after any design docs; if your doc ids start with uppercase letters, they will show up before the design docs.
Try creating a document with an id of "ABC". You will see it show up before the design doc, and your trick to filter design docs would work in that case.
However, I recommend you stop using the _all_docs view altogether. Instead, use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function(doc) {
  emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
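Assuming the view above lives in a design document named app under the view name all (both names are just examples), the query would look like:
/mydb/_design/app/_view/all?include_docs=true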
Also, please look at the Unicode collation order; this is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html
I have a multi-language data set and a Standard analyzer that takes care of tokenizing this data set very nicely. The only bad part is that it removes special characters like #, :, etc. Is there any way that I can use the standard tokenizer and still be able to search on the special characters?
I have already looked into the combo analyzer plugin, which did not work as I had hoped. Apparently the combined analyzers do not work in a chain like token filters; they work independently, which is not useful for me.
I also looked into the char mapping filter in order to process the data before tokenizing it, but it does not work like the word delimiter token filter, where we can specify a "type_table" to treat a special character as ALPHANUM; it just maps one string to another string. As a result I still wouldn't be able to search on the special characters.
I have also looked into the pattern analyzers, which would work for the special characters, but they are not recommended for a multi-language data set.
Can anybody point me in the right direction in order to solve this problem?
Thanks in advance!
I'm currently building a hash key string (collapsed from a map) where the values are delimited by the special ASCII unit separator, character 31 (0x1F).
This nicely avoids having to guess which ASCII characters won't appear in the string values, and I don't need to worry about escaping or quoting values, etc.
However, reading about its history, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenised using this special character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application; however, I'm interested to know how this applies more generally in other languages such as Java and C#, and with Unicode.
The lower 128 characters of ASCII are set in stone in the Unicode standard, including characters 0-31. The only reason you don't see the special ASCII control characters used in strings very often is human-interface limitations: they do not visualise well (if at all) when displayed on screen or written to a file, and you can't easily type them on a keyboard either. They're also not allowed in unescaped form in various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
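For example, a minimal Java sketch of the join/split with an escape for the rare case where a value itself contains the separator (the escape convention here is invented purely for illustration):

import java.util.ArrayList;
import java.util.List;

public class UnitSeparatorCodec {

    private static final char US  = '\u001F';  // ASCII 31, the unit separator used as delimiter
    private static final char ESC = '\u001B';  // ASCII 27, used here as an escape character

    // Join values with US; if a value happens to contain US or ESC, prefix that
    // character with ESC so the record can always be split back unambiguously.
    public static String join(List<String> values) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                out.append(US);
            }
            for (char c : values.get(i).toCharArray()) {
                if (c == US || c == ESC) {
                    out.append(ESC);
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    // Single-pass decode: an ESC means "take the next character literally",
    // an unescaped US ends the current field.
    public static List<String> split(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == ESC && i + 1 < record.length()) {
                current.append(record.charAt(++i));
            } else if (c == US) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}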
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is: why invent another format instead of using XML or JSON, or, if you need a compact encoding, a "binary" variant of those two (Fast Infoset, MessagePack, who knows what else), or ASN.1? There are probably a whole bunch of other issues you'll encounter when rolling your own that the design and tooling for those formats have already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So consider the ramifications if you think your hash key could end up on a barcode.