I am building a web application that uses Elasticsearch for search and is designed to support multiple languages. In the mapping, I have a few fields that look like this:
"myfield": {"properties": {"en": {}, "zh_TW": {}, "ar": {}, ....}}
However, at launch it will support only one language (English). Support for other languages will be added later, and we plan to support many languages eventually.
Should I add all possible language codes (such as "en", "zh_TW", ...) to the mapping now (obviously a very long list)? Or should I add a language code only when that language is introduced into the system?
For the second approach, what is the extra work or operational impact? Do I have to re-index all documents? What else do I need to know now?
Thanks for any input!
Having to reindex the documents is not a concern because you'll have to update the documents anyway to add the content in the new language, won't you?
So, put the mapping for each newly supported language just before putting the first text in that language.
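With your nested structure, that boils down to one mapping update when a language is introduced. A rough sketch for, say, German (the index name "my_index" and the choice of the built-in "german" analyzer are my assumptions):

# add a "de" sub-field under the existing "myfield" object
PUT /my_index/_mapping
{
  "properties": {
    "myfield": {
      "properties": {
        "de": { "type": "text", "analyzer": "german" }
      }
    }
  }
}

Adding a new sub-field like this does not touch documents that were already indexed.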
I suggest that you duplicate your fields for each supported language, like:
"myfield_en" : ...,
"myfield_zh_TW" : ...,
"myfield_ar" : ...
because it makes it easier to add the mappings.
When you start supporting, say, German, put a mapping with the German analyzer for a new field "myfield_de". After that, every time you insert or update a document with a German translation, the German field will be analyzed.
Documents without a German translation will not need to be reindexed.
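As a sketch (index name and document ID are placeholders), the whole "extra work" for German is then a single mapping update before the first German document arrives:

# add the German-analyzed field; existing documents are untouched
PUT /my_index/_mapping
{
  "properties": {
    "myfield_de": { "type": "text", "analyzer": "german" }
  }
}

# from now on, documents carrying a German translation simply include the new field
PUT /my_index/_doc/42
{
  "myfield_en": "Hello world",
  "myfield_de": "Hallo Welt"
}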
Conclusion: it is pointless to put a mapping for a field when you don't have any text to write to it yet.
Related
I'm setting up my Elasticsearch instance in a schema-less manner (no up-front mappings), and the application requires that users be able to search against a field containing a word that may or may not be tokenized into multiple strings. For example, the field may contain the word "ONETWO". The spec requires that a user be able to search "ONETWO", "ONE", and "TWO" and retrieve that same document. There doesn't seem to be any easy way to accomplish this even with a custom tokenizer (and I don't think there SHOULD be an easy way to do this -- or any way at all). Just want to confirm my thoughts.
It's actually quite easy to meet this requirement with a custom analyzer that uses the n-gram tokenizer. You can also pass the tokens through a lowercase token filter, so that even though your text is "ONETWO", a user searching for "one", "One", or "ONE" will still get a result. Note that for this you need to apply a different analyzer at search time; read more about that at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html.
See https://devticks.com/how-to-improve-your-full-text-search-in-elasticsearch-with-ngram-tokenizer-e346f29f8ddb for more details, and let me know if you need anything else.
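A minimal sketch of such a mapping, assuming an index called "my_index" and a field called "code" (both placeholder names):

PUT /my_index
{
  "settings": {
    # allow min_gram and max_gram to differ by more than 1
    "index": { "max_ngram_diff": 5 },
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 8,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": {
        "type": "text",
        "analyzer": "my_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

With this, "ONETWO" is indexed as lowercased 3- to 8-character grams (which include "one", "two", and "onetwo"), while queries go through the standard analyzer, so searches for "ONE", "TWO", or "ONETWO" all match the same document.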
How do I index mixed-language content in Elasticsearch? Let's say we have a system where people submit content from various parts of the world: the US, Canada, Europe, Japan, Korea, India, China, Kenya, the Arab world, Russia, and everywhere else.
Content can be in any language, which we can't know beforehand, and a single piece of content can even mix languages. We don't want to guess the language of the content and create a separate language-specific index for each input language; we believe that would be unmanageable.
We need a simple way to index this content efficiently in Elasticsearch, with full-text search as well as fuzzy string matching. Can anyone help?
What is the goal you want to achieve? Do you want hits only in the language used at query time, or would you also accept hits in any other language?
One approach would be to run all of Elasticsearch's different language analyzers on the input and store the results in separate fields, for instance suffixed by the language of the analyzer used.
Then, at query time, you would have to search in all of these fields if you have no method to guess the most relevant ones.
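A hedged sketch of that idea using multi-fields (the index name, field name, and the particular languages are just examples):

PUT /content
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "de": { "type": "text", "analyzer": "german" },
          "fr": { "type": "text", "analyzer": "french" }
        }
      }
    }
  }
}

# since the language is unknown, query all variants at once
GET /content/_search
{
  "query": {
    "multi_match": {
      "query": "your search terms",
      "fields": ["body", "body.en", "body.de", "body.fr"]
    }
  }
}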
However, this is likely to blow up your index, since you create a multitude of unused duplicates. It is IMHO also less elegant than having separate indices.
I would strongly recommend evaluating whether you really cannot know the set of languages you will see in production. Having a distinct index per language would give you much more control over the input/output and let you fine-tune your engine for the actual use case.
Alternatively, you could start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case).
You will not get language-specific stemming, but you will at least get reasonable token streams for most languages.
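For reference, that baseline can be as small as a custom analyzer built on the whitespace tokenizer (adding a lowercase filter is my assumption, not part of the original suggestion; names are placeholders):

PUT /content_plain
{
  "settings": {
    "analysis": {
      "analyzer": {
        "plain_whitespace": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "plain_whitespace" }
    }
  }
}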
I'm creating a multi-page document in two different languages (English and French), with possibly other languages to be added. The URL of a given document will take the form prefix/en/name.html or prefix/fr/name.html, i.e. only the "en" or "fr" part will differ. Is it possible to include some code in the main template (layout.html ... or elsewhere?) that would take the URL of the current (English) document, replace "/en/" with "/fr/", and insert it as a link to the "French" version? Something like
automatically retrieve:
prefix/en/this_document.html
transform into a link to:
prefix/fr/this_document.html (with "French" as the link text)
I essentially found the answer I needed in this post: https://groups.google.com/forum/#!topic/sphinx-users/Xmbs5AbnVKY
Basically, what I do is insert the following:
{{"English version"}}
where needed.
I'm working on a system that downloads articles from various news sites and performs various NLP analyses on the texts. I want to store multiple versions and aspects of each article, including
The raw HTML
A cleaned-up text-only version
CoreNLP output of the article.
Since I want to store the text-only version in Elasticsearch, I thought about storing everything else in Elasticsearch as well. I have no Elasticsearch experience, so I can't tell which is the better way to store these:
Have one record per article, with the HTML, text, and CoreNLP outputs as properties of that article: {html: '....', text: '....', CoreNLP: '....'}
Store each type of information in its own type: /articles/html/1, /articles/text/1, /articles/corenlp/1, etc...
Which one is more common? Is there a third, better option?
It depends on where you want to do the CoreNLP processing, the HTML tidy-up, etc. If you want to do this in Elasticsearch, I would use multi-fields:
https://www.elastic.co/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
If you do the processing outside of Elasticsearch, which would not be common since this is a good task for Elasticsearch, you could use the multiple-fields approach.
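For what it's worth, option 1 from the question (one document per article) could look roughly like this; the field names follow the question, and turning off indexing for the raw HTML and the CoreNLP output is my assumption, since only the cleaned text needs to be searchable:

PUT /articles
{
  "mappings": {
    "properties": {
      "html":    { "type": "text", "index": false },
      "text":    { "type": "text" },
      "corenlp": { "type": "object", "enabled": false }
    }
  }
}

# one document holds all three versions of an article
PUT /articles/_doc/1
{
  "html": "<html>...</html>",
  "text": "Cleaned-up, text-only version of the article",
  "corenlp": { "sentences": [] }
}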
I am currently trying to figure out analysis schemes for my ElasticSearch cluster. I am using ES to index PDF, Word, PowerPoint, and Excel documents, and Apache Tika to extract the text.
My problem is that I do not know beforehand what languages the file contents will be in. They could be written in any language.
My question is, is there a way to make ES analyze text regardless of the language? Or should I have a pre-defined field for each language with its own tokenizer, analyzer and stopwords?
I suggest taking a look at the ElasticSearch plugin elasticsearch-mapper-attachments. I used it to build document search functionality.
When it comes to supporting multiple languages, we have had the best experience with one index per language. If you can identify the language before indexing you can insert the document into the appropriate index. This makes it easier to add new languages vs. a field per language approach.
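A rough sketch of that layout (index and field names are placeholders), with one index per language, each using the matching built-in language analyzer:

PUT /docs_en
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "english" }
    }
  }
}

PUT /docs_de
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "german" }
    }
  }
}

Each document goes into the index matching its language, and adding a new language is just creating one more index; at search time you can target one index or several at once (e.g. GET /docs_en,docs_de/_search).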
One thing to remember is the "Don't use types for languages" note at the bottom of the One Language per Document page. Doing that can mess up search in a way that is very difficult to debug.
If you need to detect the language, there are two options mentioned at the bottom of the Pitfalls of Mixing Languages page.