Migrating data from RDBMS to Elasticsearch using Apache NiFi

We are trying to migrate data from an RDBMS to Elasticsearch using Apache NiFi. We have created pipelines in NiFi and are able to transfer data, but we are facing a couple of issues and wanted to check whether someone has already solved them.
Please provide input on the items below.
1. How do we avoid the auto-generated _id in Elasticsearch? We want it to be set from a DB column. We tried providing the column name in the "Identifier Record Path" property of the PutElasticsearchHttpRecord processor but got an error that the attribute name is not valid. What is the acceptable format?
2. How do we load nested objects into the index using NiFi? We want to maintain one-to-many relationships in the index using nested objects but could not find a configuration for this. Is there a NiFi processor that can do it?
Thanks in advance!

It needs to be a RecordPath statement like /myidfield
You need to define the nested fields in the Elasticsearch index mapping yourself. This is not a NiFi limitation but how Elasticsearch works; if you posted the same document with cURL, you would run into the same issue.
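For reference, here is a minimal sketch of the Elasticsearch side of both points, using the Python client. The index name (orders), type (order), field names, and ES 5.x-style types are assumptions for illustration only. The nested mapping has to exist before NiFi writes any documents, and supplying the _id explicitly at index time is what PutElasticsearchHttpRecord does when "Identifier Record Path" is set to a record path such as /order_id.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

# Create the index up front with an explicit mapping; "items" is declared
# as a nested type so the one-to-many children keep their field boundaries.
es.indices.create(
    index="orders",                      # hypothetical index name
    body={
        "mappings": {
            "order": {                   # hypothetical type name (pre-7.x style)
                "properties": {
                    "order_id": {"type": "keyword"},
                    "customer": {"type": "keyword"},
                    "items": {           # nested => one-to-many relationship
                        "type": "nested",
                        "properties": {
                            "sku":      {"type": "keyword"},
                            "quantity": {"type": "integer"},
                        },
                    },
                }
            }
        }
    },
)

# Index a document with an explicit _id taken from a DB column; this is the
# effect of "Identifier Record Path" = /order_id in the NiFi processor.
row = {"order_id": "1001", "customer": "acme",
       "items": [{"sku": "A-1", "quantity": 2}, {"sku": "B-7", "quantity": 1}]}
es.index(index="orders", doc_type="order", id=row["order_id"], body=row)
```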

Related

Multi-cluster insertion with Spring Data Elasticsearch

I am trying to insert records into multiple clusters (not nodes). I am using spring-data-elasticsearch (version 2.0.4). I have tried creating two ElasticsearchOperations instances in my configuration with different bean names, and I insert using those objects with the index(IndexQuery indexQuery) method. However, when inserting, the field mappings (not_analyzed field types) are not kept. Can someone please help me keep the mapping when inserting an entity into Elasticsearch?
Finally I was able to achieve it. I created two ElasticsearchTemplate instances based on separate client instances, instead of ElasticsearchOperations, and was able to insert into the two different clusters.
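Spring Data specifics aside (two ElasticsearchTemplate beans built from two separate Client instances), the underlying idea can be sketched with the Python client: one client per cluster, and the mapping applied explicitly on each cluster before inserting so the not_analyzed settings are kept. The cluster addresses, index, and field names below are made up for illustration, and the mapping uses ES 2.x-style syntax to match spring-data-elasticsearch 2.x.

```python
from elasticsearch import Elasticsearch

# Two independent clients, one per cluster (addresses are assumptions).
cluster_a = Elasticsearch(["http://cluster-a:9200"])
cluster_b = Elasticsearch(["http://cluster-b:9200"])

# Mapping with a non-analyzed field; it has to be applied to *each* cluster,
# otherwise dynamic mapping will analyze the field on the first insert.
mapping = {
    "mappings": {
        "customer": {
            "properties": {
                "code": {"type": "string", "index": "not_analyzed"},  # ES 2.x style
                "name": {"type": "string"},
            }
        }
    }
}

doc = {"code": "C-42", "name": "Jane Doe"}

for es in (cluster_a, cluster_b):
    if not es.indices.exists(index="customers"):
        es.indices.create(index="customers", body=mapping)
    es.index(index="customers", doc_type="customer", id=doc["code"], body=doc)
```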

Can Beats update existing documents in Elasticsearch?

Consider the following use case:
I want the information from one particular log line to be indexed into Elasticsearch, as a document X.
I want the information from some log line further down the log file to be indexed into the same document X (not overriding the original, just adding more data).
The first part, I can obviously achieve with filebeat.
For the second, does anyone have any idea about how to approach it? Could I still use filebeat + some pipeline on an ingest node for example?
Clearly, I can use the ES API to update the said document, but I was looking for some solution that doesn't require changes to my application - rather, it is all possible to achieve using the log files.
Thanks in advance!
No, this is not something that Beats were intended to accomplish. Enrichment like you describe is one of the things that Logstash can help with.
Logstash has an Elasticsearch input that would allow you to retrieve data from ES and use it in the pipeline for enrichment. And the Elasticsearch output supports upsert operations (update if exists, insert new if not). Using both those features you can enrich and update documents as new data comes in.
You might also want to consider ingesting the log lines as-is into Elasticsearch, and then using Logstash to build a separate, entity-specific index driven by the data from the logs.
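Whatever tool performs the update (Logstash's elasticsearch output with upsert enabled, or a small script of your own), the underlying Elasticsearch operation is a partial update with doc_as_upsert. A minimal sketch with the Python client, with made-up index, type, and field names, assuming both log lines share a correlation id:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

# The first log line creates document X (or updates it if it already exists).
es.update(
    index="app-logs", doc_type="event", id="request-123",   # id ties the two log lines together
    body={"doc": {"started_at": "2017-01-01T10:00:00Z"}, "doc_as_upsert": True},
)

# A later log line enriches the same document without overwriting the earlier fields.
es.update(
    index="app-logs", doc_type="event", id="request-123",
    body={"doc": {"finished_at": "2017-01-01T10:00:03Z", "status": "ok"},
          "doc_as_upsert": True},
)
```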

Does ElasticSearch store a duplicate copy of each record?

I started looking into ElasticSearch, and most examples of creating and reading involve POSTing data to the ElasticSearch server and then doing a GET to retrieve them.
Is this data that is POSTed stored separately by the ElasticSearch server? So, if I want to use ElasticSearch with MongoDB, does the raw data, not including the search indices, get stored twice (once copy for MongoDB and one for ElasticSearch)?
In conjunction with an answer to this question, a description or a link to a description of how ElasticSearch and the primary data store interact would be very helpful.
Yes. Elasticsearch can only search within its own data store, so a separate copy of the data will be there.
You can use the mongodb connector to keep the data in elastic in sync with the mongo database: https://github.com/mongodb-labs/mongo-connector

Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi into Elasticsearch via POST and want to do the following:
Map the created_at field as a date. Should I use a mapping or an index template for this?
Make some fields not analyzed, like hashtags, URLs, etc.
Store only some important fields rather than the entire tweet: the text, a few user fields instead of all the user information, hashtags, and the URLs from entities. I don't need the quoted source, etc.
What should I use in this case? A template? Pre-processing the tweets with some ETL step in order to extract the data I need and index it in ES?
I am a bit confused and will really appreciate advice.
Thanks in advance.
I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight document. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use as a starting point for your own (i.e. add the created_at date field if you like).
Since you don't want to index everything, the way to go is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly Twitter indices as you see fit.
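For what it's worth, here is a hedged sketch of what such an index template could look like, created with the Python client. The template name, index pattern, field selection, and the created_at date format are illustrative assumptions, not taken from the question (pre-5.x string/not_analyzed syntax is used to match the question's wording; on ES 5+ you would use keyword instead).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

# Index template applied to every daily/weekly/monthly twitter index.
es.indices.put_template(
    name="twitter",
    body={
        "template": "twitter-*",              # "index_patterns" on newer ES versions
        "mappings": {
            "tweet": {
                "properties": {
                    "created_at": {           # stored as a real date
                        "type": "date",
                        # assumes Twitter's raw created_at string; adjust if you
                        # reformat the date in NiFi before posting
                        "format": "EEE MMM dd HH:mm:ss Z yyyy",
                    },
                    "text":     {"type": "string"},
                    "hashtags": {"type": "string", "index": "not_analyzed"},
                    "urls":     {"type": "string", "index": "not_analyzed"},
                    "user": {
                        "properties": {
                            "screen_name": {"type": "string", "index": "not_analyzed"}
                        }
                    },
                }
            }
        },
    },
)
```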

elasticsearch-hadoop for Spark: send documents from an RDD to different indices (by day)

I work on a complex workflow using Spark (parsing, cleaning, Machine Learning ...).
At the end of the workflow I want to send aggregated results to elasticsearch so my portal could query data.
There will be two types of processing: streaming and the possibility to relaunch workflow on all available data.
Right now I use elasticsearch-hadoop, and particularly its Spark support, to send documents to Elasticsearch with the saveJsonToEs("myindex/mytype") method.
The target is to have an index by day using the proper template that we built.
AFAIK, elasticsearch-hadoop does not let you choose the target index based on a field of the document being sent.
What is the proper way to implement this feature?
Should I have a special step that uses Spark and the bulk API so that each executor sends documents to the proper index based on the relevant field of each line?
Is there something that I missed in elasticsearch-hadoop?
I tried to send JSON to _bulk using saveJsonToEs("_bulk"), but the pattern has to follow index/type.
Thanks to Costin Leau, I have the solution.
Simply use dynamic indexing with something like saveJsonToEs("my-index-{date}/my-type"). "date" has to be a field in the document being sent.
Discussion on elasticsearch google group: https://groups.google.com/forum/#!topic/elasticsearch/5-LwjQxVlhk
Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-write-dyn
You can use ("custom-index-{date}/customtype") to create a dynamic index; {date} can be any field in the given RDD.
If you want to format the date: ("custom-index-{date:{YYYY.mm.dd}}/customtype")
[This answers the question asked by Amit_Hora in the comments; as I don't have enough privilege to comment, I am adding it here.]
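The answers above use the Scala API (saveJsonToEs). For completeness, the same multi-resource write pattern can be sketched from PySpark through the es-hadoop Spark SQL data source, assuming Spark 2.x, the elasticsearch-hadoop jar on the classpath (e.g. via --jars or --packages), and that each row already carries a pre-formatted date field; all names below are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-es").getOrCreate()

# Assume each row carries a pre-formatted "date" string, e.g. "2017.01.31".
df = spark.createDataFrame(
    [("2017.01.31", "doc one"), ("2017.02.01", "doc two")],
    ["date", "body"],
)

# es-hadoop resolves {date} per document, so each row lands in its own daily
# index (custom-index-2017.01.31/customtype, custom-index-2017.02.01/customtype, ...).
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost:9200")          # assumed cluster address
   .option("es.resource", "custom-index-{date}/customtype")
   .save())
```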
