What is the best way to index Couchbase data on Elastic Search - elasticsearch

I work with Couchbase DB and I want to index part of its data on Elastic Search (ES).
The data from Couchbase should be synced, i.e. if the document on CB changes, it should change the document on ES.
I have several questions about what is the best way to do it:
1) What is the best way to sync the data? I saw that there is a CB plugin for ES (http://www.couchbase.com/couchbase-server/connectors/elasticsearch), but is that the recommended way?
2) I don't want to store the whole CB document on ES, only part of it, e.g. some fields should be stored and some not. How can I do that?
3) My documents may have different attributes, and the difference may be big (e.g. 50 different attributes/fields). Assuming I want to index all these attributes in ES, will it affect performance because I have a lot of fields indexed?
Thanks,

Given the doc link, I am assuming you are using Couchbase and not CouchDB.
You are following the correct link for use of Elastic Search with Couchbase. Per the documentation, configure the Cross Data Center Replication (XDCR) capabilities of Couchbase to push data to ES automatically as mutations occur.
Without a defined mapping file, ES will create a default mapping. You can provide your own mapping file (or alter the one it generates) to control which fields get indexed. Refer to the enabled property in the ES documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html.
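As a rough sketch of the mapping approach, the Python snippet below builds a mapping body that turns off indexing for one object field via the enabled property. The type name couchbaseDocument and the "doc" wrapper are what the Couchbase plugin uses; the field names title and raw_payload are made up for illustration, and the "string" type assumes an ES 1.x/2.x cluster. It only builds the JSON; sending it to ES is left out.

```python
import json


def build_mapping(indexed_fields, skipped_objects):
    """Build a mapping body for the couchbaseDocument type.

    indexed_fields: names of string fields to index normally.
    skipped_objects: names of object fields ES should not parse or index.
    """
    props = {name: {"type": "string"} for name in indexed_fields}
    for name in skipped_objects:
        # "enabled": false applies to object fields: ES skips parsing and indexing them.
        props[name] = {"type": "object", "enabled": False}
    # The Couchbase plugin nests the original document under "doc".
    return {"couchbaseDocument": {"properties": {"doc": {"properties": props}}}}


# Hypothetical field names, for illustration only.
mapping = build_mapping(["title"], ["raw_payload"])
print(json.dumps(mapping, indent=2))
```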
Yes, indexing all fields will affect performance. You can find some performance management tips for the Couchbase integration at http://docs.couchbase.com/couchbase-elastic-search/#managing-performance. The preferred approach to the integration is to perform the search in ES and get back only the keys of the matched documents. You then make a multiget call against the Couchbase cluster to retrieve the document details themselves. So while ES will index many fields, you do not store all fields there, nor do you retrieve their values from ES. The in-memory multiget against Couchbase is the fastest way to retrieve the matching documents, using the IDs from ES.
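The "search in ES, fetch from Couchbase" flow could look roughly like the sketch below. It relies only on the shape of an ES search response (hits.hits[]._id). The multi_get argument is a hypothetical stand-in for whatever bulk-get call your Couchbase client offers, and the response and bucket contents are fake data so the example runs without a cluster.

```python
def extract_ids(search_response):
    """Return the document IDs from an Elasticsearch search response body."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


def fetch_documents(search_response, multi_get):
    """Look up the full documents for the matched IDs.

    multi_get is a hypothetical callable (ids -> {id: document}) standing in
    for the Couchbase client's bulk get.
    """
    ids = extract_ids(search_response)
    found = multi_get(ids)
    # Keep the relevance order returned by ES; skip IDs that no longer exist.
    return [found[i] for i in ids if i in found]


# Stand-ins for a real ES response and a real Couchbase bucket.
fake_response = {"hits": {"hits": [{"_id": "user::2"}, {"_id": "user::1"}]}}
fake_bucket = {"user::1": {"name": "Ann"}, "user::2": {"name": "Bob"}}

docs = fetch_documents(
    fake_response,
    lambda ids: {i: fake_bucket[i] for i in ids if i in fake_bucket},
)
print(docs)
```

Passing the bulk get in as a callable keeps the sketch independent of any particular Couchbase SDK version.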

Lots of questions!
Let me answer them one by one:
1) The best, already available solution is to use the river plugin to sync the data dynamically. It indexes only the documents that changed, which helps performance a lot.
2) Yes, you can restrict which fields are indexed with the river plugin. The plugin's documentation is available on the Couchbase website:
http://docs.couchbase.com/couchbase-elastic-search/
The river on GitHub is still in development, but you can take the code and modify it to your needs:
https://github.com/mschoch/elasticsearch-river-couchbase
3) If you index all the fields, yes, there will be some performance lag, so it is better to index only the fields you need. If you need a field only for storage, mark it as not_analyzed in the mapping. That will reduce both indexing time and search time.
Hope it helps!

You might find this additional explanation regarding Don Stacy's answer to question 2 useful:
When replicating from Couchbase, there are three ways to override Elasticsearch's default mapping (before you start XDCR) and thus, as desired, avoid storing certain fields by setting "store" = false:
Create manual mappings on your index
Create a dynamic template
Edit couchbase_template.json
Hints:
Note that when we do XDCR from Couchbase to Elasticsearch, Couchbase wraps the original document in a "doc" field. This means that you have to take this modified structure into account when you create your mapping. It would look something like this:
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
  "couchbaseDocument": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "doc": {
        "properties": {
          "your_field_name": {
            "store": true,
            ...
          },
          ...
        }
      }
    }
  }
}'
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Including/Excluding fields from _source: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/dynamic-templates.html
https://forums.couchbase.com/t/about-elasticsearch-plugin/2433
https://forums.couchbase.com/t/custom-maps-for-jsontypes-with-elasticsearch-plugin/395

Related

How about including JSON doc version? Is it possible for elastic search, to include different versions of JSON docs, to save and to search?

We are using ElasticSearch to save and manage information on complex transactions. We might need to add more information for every transaction, on the near future.
How about including JSON doc version?
Is it possible for elastic search, to include different versions of JSON docs, to save and to search?
How does this affect performance in Elasticsearch?
It's completely possible. By default, Elasticsearch uses dynamic mapping to index new documents such as your JSON documents. For each field in your documents, Elasticsearch builds an inverted index, and search queries are executed against it. So regardless of how the fields vary between document versions, as long as you know which field you want to query, throughput and performance will not be affected.
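To make the per-field inverted index idea concrete, here is a toy version in Python. It is only an illustration of the concept, not how Elasticsearch or Lucene implement it, and the document IDs and fields are invented. Two "versions" of a transaction document live in the same index, and a query on one field is unaffected by the extra field in the newer version.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map (field, term) -> set of document IDs containing that term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            for term in str(value).lower().split():
                index[(field, term)].add(doc_id)
    return index


def search(index, field, term):
    """Return the sorted IDs of documents whose field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


# Two document "versions": the second one has an extra field.
docs = {
    "v1-tx": {"status": "paid"},
    "v2-tx": {"status": "paid", "currency": "EUR"},
}
index = build_inverted_index(docs)
print(search(index, "status", "paid"))
print(search(index, "currency", "eur"))
```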

Is there any tool out there for generating elasticsearch mapping

Mostly I assemble the mapping by hand, choosing the correct types myself.
Is there any tool which facilitates this?
For example, one that reads a class (C#, Java, etc.) and chooses the closest ES types accordingly.
I've never seen such a tool, however I know that ElasticSearch has a REST API over HTTP.
So you can create a simple HTTP query with JSON body that will depict your object with your fields: field names + types (Strings, numbers, booleans) - pretty much like a Java/C# class that you've described in the question.
Then you can ask ES to store the data in a non-existing index (to "index" your document, in ES terms). It will index the document, but it will also create the index and, most importantly for your question, create a mapping for you "dynamically", so that later you can query the mapping structure (again via REST).
Here is the link to the relevant chapter about dynamically created mappings in the ES documentation
And Here you can find the API for querying the mapping structure
At the end of the day you'd still want to retain some control over how your mapping is generated. I'd recommend:
syncing some sample documents w/o a mapping
investigating what mapping was auto generated and
dropping the index & using dynamic_templates to pseudo-auto-generate / update the mapping as new documents come in.
This GUI could help too.
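If you want to preview what a generated mapping might look like before touching a cluster, a toy inference over sample documents can be sketched as below. The type rules are my own simplification (ES 1.x/2.x-style names), not Elasticsearch's actual dynamic mapping algorithm; arrays, nulls and dates are not handled, and the sample documents are invented.

```python
import json


def infer_type(value):
    """Guess an ES 1.x/2.x-style type name for a JSON scalar (toy rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_mapping(samples):
    """Build a 'properties' dict from sample documents, recursing into objects."""
    props = {}
    for doc in samples:
        for field, value in doc.items():
            if isinstance(value, dict):
                child = props.setdefault(field, {})
                child.setdefault("properties", {}).update(infer_mapping([value]))
            else:
                # The first type seen for a field wins.
                props.setdefault(field, {"type": infer_type(value)})
    return props


# Hypothetical sample documents.
samples = [{"name": "a", "age": 3}, {"ok": True, "geo": {"lat": 1.5}}]
print(json.dumps({"properties": infer_mapping(samples)}, indent=2))
```

This covers only the "sample documents, then inspect the mapping" part of the workflow; the dynamic_templates step still has to be written by hand.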
Currently, there is no such tool available to generate a mapping for Elasticsearch.
It is similar to having to design a database schema in MySQL.
If we want to avoid that, we could use MongoDB, which requires no predefined schema.
But Elasticsearch is very dynamic too, which lets us play around with it. One of its most important features is that it tries to get out of your way and lets you start exploring your data as quickly as possible, much like a MongoDB schema, which can be changed dynamically.
To index a document, you don't have to first define a mapping or schema with your fields and their data types.
You can just index a document, and the index, type, and fields will be created automatically.
For further details you can go through the below documentation:
Elastic Dynamic Mapping

ElasticSearch1.5 : Add new field in existing working Index

I have an existing index named "MyIndex", which I use to store one kind of data in Elasticsearch. The index holds millions of records. I am using Elasticsearch 1.5.
Now I have a new requirement: I want to add two more fields to the documents stored in "MyIndex", and I want to use documents with both the old and the new schema in the future.
What can I do?
Can I insert new documents into the same index?
Do we need to change the Elasticsearch mapping?
If we don't change anything, will it affect the existing search capability?
Please help me resolve this issue with your opinions.
Thanks in advance.
You can add new fields to an existing index by updating the mapping, but in many cases it is fine to index documents with the new fields directly and let ES infer the types (although this is not always recommended). It depends on what type of data you're indexing and whether you need special analyzers for strings.
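A minimal sketch of the "update the mapping" route, in Python: it merges two new field definitions into the existing properties and refuses to redefine a field that already exists with a different definition, since an existing field's type generally cannot be changed in place. The field names and types are invented, and the function only prepares the properties; actually sending them to the put-mapping API is left out.

```python
def add_fields(existing_props, new_props):
    """Return a copy of existing_props extended with new_props.

    Raises ValueError if a field already exists with a different definition.
    """
    merged = dict(existing_props)
    for name, definition in new_props.items():
        if name in merged and merged[name] != definition:
            raise ValueError("conflicting definition for field %r" % name)
        merged[name] = definition
    return merged


# Hypothetical existing mapping and the two new fields.
old = {"title": {"type": "string"}}
new = add_fields(old, {"price": {"type": "double"}, "tags": {"type": "string"}})
print(sorted(new))
```

Old documents simply lack the new fields; they stay searchable on the fields they do have.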

Can we migrate non stored Index data in SOLR to Elastic search?

We are currently using SOLR for full-text search and are planning to move to ElasticSearch. During this process I read somewhere that there are plugins that migrate data from SOLR to ElasticSearch, but that they cannot migrate records that are not stored in SOLR. Is there a plugin that can migrate non-stored index data from SOLR to ElasticSearch? If so, please let me know.
Currently I am using the SOLR-to-ES plugin, but it does not migrate the non-stored index data.
Thanks
If the field is not stored, then you don't have the original value. If it is indexed, what is in the index is the value after it has gone through the analysis chain, so it is probably different from the original (no stopwords, probably lowercased, maybe stemmed, and so on).
There are a couple of cases in which the original content is still available even though the field is not stored:
The field is indexed and was analyzed with just the keyword tokenizer: then the indexed value is the original value.
The field has docValues=true: then the original value is also stored. This feature was introduced later, so your index might not be using it.
The issue is that the common plugins might not take advantage of those cases where stored=true is not strictly necessary. You need to check them.
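To see why an analyzed, non-stored field cannot give back its original value, here is a toy analysis chain in Python. The stopword list and the whitespace splitting are simplifications of my own, not Solr's real analyzers; the point is only that the ordinary chain throws information away while a keyword-style chain keeps the input intact.

```python
STOPWORDS = {"the", "a", "of"}  # tiny made-up stopword list


def analyze(text):
    """Toy analysis chain: split on whitespace, lowercase, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def keyword_analyze(text):
    """Keyword-style analysis: the whole input becomes a single token."""
    return [text]


original = "The Lord of the Rings"
print(analyze(original))
print(keyword_analyze(original))
```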

Elasticsearch remove "one level" from the mapping

I need to flatten my index mapping.
My index has the following mapping:
"A": {
  "properties": {
    "B": {
      "properties": {
        -c
        -d
        -e
      }
    }
  }
}
What I need is to delete "one level" in order to have a mapping like this:
"A": {
  "properties": {
    -c
    -d
    -e
  }
}
Is it possible to obtain this result without reindexing all my data?
Short answer, No.
Longer answer, also No. This question has been asked many times. The answer will always be no, and this is why:
You can only find that which is stored in your index. In order to make your data searchable, your database needs to know what type of data each field contains and how it should be indexed. If you switch a field type from e.g. a string to a date, all of the data for that field that you already have indexed becomes useless. One way or another, you need to reindex that field.
This applies not just to Elasticsearch, but to any database that uses indices for searching. And if it isn't using indices then it is sacrificing speed for flexibility.
Elasticsearch (and Lucene) stores its indices in immutable segments; each segment is a "mini" inverted index. These segments are never updated in place. Updating a document actually creates a new document and marks the old document as deleted. As you add more documents (or update existing documents), new segments are created. A merge process runs in the background merging several smaller segments into a new big segment, after which the old segments are removed entirely.
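The segment behaviour described above can be modelled with a small toy class. This is purely an illustration of the idea (append-only segments, delete markers, merge), not how Lucene is implemented, and every name in it is made up.

```python
class ToyIndex:
    """Toy model of immutable segments; not how Lucene is implemented."""

    def __init__(self):
        self.segments = []    # each segment: tuple of (doc_id, doc) pairs, never modified
        self.deleted = set()  # (segment_number, doc_id) pairs marked as deleted

    def index(self, doc_id, doc):
        # An update marks every older copy as deleted, then writes a new segment.
        for seg_no, segment in enumerate(self.segments):
            if any(d == doc_id for d, _ in segment):
                self.deleted.add((seg_no, doc_id))
        self.segments.append(((doc_id, doc),))

    def merge(self):
        # Copy the live documents into one new segment and drop the old ones.
        live = tuple(
            (d, doc)
            for seg_no, segment in enumerate(self.segments)
            for d, doc in segment
            if (seg_no, d) not in self.deleted
        )
        self.segments = [live]
        self.deleted = set()

    def get(self, doc_id):
        for seg_no, segment in enumerate(self.segments):
            for d, doc in segment:
                if d == doc_id and (seg_no, d) not in self.deleted:
                    return doc
        return None


idx = ToyIndex()
idx.index("1", {"v": 1})
idx.index("1", {"v": 2})  # the "update" creates a second segment
print(len(idx.segments), idx.get("1"))
idx.merge()
print(len(idx.segments), idx.get("1"))
```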
Typically, an index in Elasticsearch will contain documents of different types. Each _type has its own schema or mapping. A single segment may contain documents of any type. So, if you want to change the field definition for a single field in a single type, you have little option but to reindex all of the documents in your index.
If you are interested in more info, you can read the rest of the excerpt here by Clinton Gormley.
I also suggest the following reading:
Elasticsearch Zero Downtime Reindexing – Problems and Solutions
The SO question : Is there a smarter way to reindex elasticsearch?
You have to create a new index with the updated mapping (one level removed) and reindex your data into it. You cannot update the existing mapping to achieve what you want.
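During such a reindex, each document has to be reshaped to match the new mapping. A possible sketch in Python, assuming the documents look like {"B": {"c": ..., "d": ..., "e": ...}} and that hits come back in the usual _id/_source shape; the index name passed in is hypothetical, and actually reading from the old index and writing to the new one is left out.

```python
def flatten_level(source, key="B"):
    """Return a copy of source with the object under `key` lifted one level up."""
    flat = {k: v for k, v in source.items() if k != key}
    for name, value in source.get(key, {}).items():
        if name in flat:
            raise ValueError("field %r exists at both levels" % name)
        flat[name] = value
    return flat


def reindex_actions(hits, new_index):
    """Yield (index, id, document) triples to write into the new index."""
    for hit in hits:
        yield new_index, hit["_id"], flatten_level(hit["_source"])


# Fake hits standing in for documents read from the old index.
hits = [{"_id": "1", "_source": {"B": {"c": 1, "d": 2, "e": 3}}}]
print(list(reindex_actions(hits, "my_index_v2")))
```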
