Using Elasticsearch-generated IDs in the Kafka Elasticsearch connector - elasticsearch

I noticed that documents indexed in Elasticsearch using the Kafka Elasticsearch connector have their IDs in the format topic+partition+offset.
I would prefer to use IDs generated by Elasticsearch. It seems topic+partition+offset is not always unique, so I am losing data.
How can I change that?

As Phil says in the comments, topic-partition-offset should be unique, so I don't see how this is causing data loss for you.
Regardless, you can either let the connector generate the key (as you are doing), or you can define the key yourself (key.ignore=false). There is no other option.
You can use Single Message Transformations with Kafka Connect to derive a key from the fields in your data. Based on your message in the Elasticsearch forum it looks like there is an id in your data - if that's going to be unique you could set that as your key, and thus as your Elasticsearch document ID too. Here's an example of defining a key with SMT:
# Add the `id` field as the key using Single Message Transformations
transforms=InsertKey, ExtractId
# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
# `ExtractField`: convert key from an object to a plain field
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id
(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)

@Robin Moffatt, as I see it, topic-partition-offset can cause duplicates if you upgrade your Kafka cluster not in a rolling-upgrade fashion but by replacing the cluster with a new one (which is sometimes easier). In that case you will experience data loss because existing documents get overwritten.
Regarding your excellent example, this can be the solution for many cases, but I'd add another option. Maybe you could append an epoch timestamp to the key, making it topic-partition-offset-current_timestamp.
What do you think?

Related

ElasticSearch Long ID and search performance

We are using Elasticsearch on Amazon as our search engine. I recently discussed upsert tactics with one of our developers.
In my view (I am not a very experienced ES developer) it's OK to have a composite key as _id, e.g. Result-1, Data-2, etc. It helps with upserts and data deduplication. But a concern was raised about the key datatype: a long key such as a string, SHA-1 digest, hex value, etc. could hurt search performance, and it would be better to use short keys, or to send documents to ES without a predefined _id and deduplicate using the document body or some specific properties.
I haven't read anything about ID performance anywhere, from the official docs to Medium posts and blogs.
Is the concern valid, and should I follow it?
Thank you!
The concern about using custom ID fields applies to the indexing phase: with auto-generated IDs, Elasticsearch can safely index the document without first checking whether that ID already exists. If you are OK with your indexing rate, then you should be fine.
If you look in the docs under Tune for search speed, there is no advice about using auto-generated IDs.
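For illustration, the difference looks roughly like this (a sketch using the typeless _doc API; my_index and the result field are just placeholders). The first request lets Elasticsearch generate the _id and is effectively append-only; the second supplies Result-1 as the _id, so Elasticsearch first has to check whether such a document already exists:
POST my_index/_doc/
{
  "result": "Result-1"
}

PUT my_index/_doc/Result-1
{
  "result": "Result-1"
}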
Relevant reads:
Choosing a fast unique identifier (UUID) for Lucene
Tune Elasticsearch for Search Speed

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that produces metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying/sorting. I also only care about the most recent data that's in there. The system producing the data could also deliver it late.
Currently I can think of two options:
I can have one index with a date field that contains the date the metric was created. I am unsure, however, how to write the query so that if multiple days' worth of data are in the index I filter it down to just the most recent set.
I could also try to split the data up into different indices (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be the downtime while data is being moved into and/or added to the recent index.
Thoughts?
A common approach to solving this problem with Elasticsearch is to store the data in a form that allows historic querying, and then again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
Then it can be indexed into your current-values index using a composite key constructed from the document (obviously, for this to work you'd need to be able to construct a composite key from your document!). For example, the identity for this document might be the type and name concatenated. You then leverage the update API with doc_as_upsert so that every write goes to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
  "doc": {
    "type": "OperationsPerSecond",
    "name": "Questions",
    "value": 10
  },
  "doc_as_upsert": true
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
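For example, fetching the latest values for a metric from that index could look something like this (a sketch; the field names follow the example document above):
GET current_metrics/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "type": "OperationsPerSecond" } },
        { "match": { "name": "Questions" } }
      ]
    }
  }
}
And because the composite key is the document ID, a single metric can also be fetched directly with GET current_metrics/_doc/OperationsPerSecond-Questions.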
To store your historic data, you change your primary key strategy; it would probably be most straightforward to use the index API and let Elasticsearch generate the document ID for you.
POST all_metrics/_doc/
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an Elasticsearch range query, such as a createdDate field with a value that looks like a datetime, you should be able to query historic data.
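Such a range query might look like this (a sketch; it assumes the documents carry a createdDate field mapped as a date):
GET all_metrics/_search
{
  "query": {
    "range": {
      "createdDate": {
        "gte": "now-7d/d",
        "lte": "now"
      }
    }
  }
}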
The main thing is: don't worry about duplicating your data for different purposes; Elasticsearch does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization and is a pretty common technique in data warehousing and big data.

Implement created_on and updated_on logic on client side

In Elasticsearch versions after 2.x there is no _timestamp field, so we have to explicitly populate time fields like created_on and updated_on ourselves.
One way I know to populate these fields is to check whether the item to be written already exists in the database using a uid (assume the uid is generated on the client side from some item properties). If the item exists, update all fields except created_on. If it does not exist, create the entry with created_on set to the current time.
My questions are:
* Isn't checking every time I create/update redundant?
* Is there a better way to implement created_on and updated_on logic on the client side without this redundancy (i.e. without querying Elasticsearch)?
Using a "middleware" for this is a good way to avoid having this kind of logic in the client, once you change the design, you would need to perform changes on every client implementation, so I think is a good use case for ingesting pipelines and there is an example in the doc.
Accessing Ingest Metadata Fields:
Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes. These metadata properties are accessible under the _ingest key. Currently ingest adds the ingest timestamp under the _ingest.timestamp key of the ingest metadata. The ingest timestamp is the time when Elasticsearch received the index or bulk request to pre-process the document.
If you need more intelligent middleware, look at the script processor, which allows inline and stored scripts to be executed within ingest pipelines.
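As a rough sketch of that approach (the pipeline and field names here are placeholders, not the exact example from the docs), a set processor can stamp created_on with the ingest timestamp, and clients then index through the pipeline without any extra lookup:
PUT _ingest/pipeline/add_created_on
{
  "description": "Populate created_on from the ingest timestamp",
  "processors": [
    {
      "set": {
        "field": "created_on",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

PUT my_index/_doc/some-uid?pipeline=add_created_on
{
  "title": "example item"
}
Note that a plain set processor stamps the field on every index request; keeping the original created_on across updates is exactly where the script processor mentioned above comes in.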

Changing live data coming into Elasticsearch?

I've been given a setup where a program creates live data and posts it into Elasticsearch.
I am trying to visualise this data in Kibana, but I'm running into many problems, such as numbers in a field being of type string instead of integer, or certain fields missing entirely.
For now, though, the most useful fix would be having certain fields be integers instead of strings. How do I go about this? Is it possible?
I have no access to source code of the system creating the live events data.
Thanks in advance.
Update: I should also mention that for now I am restricted to Elasticsearch version 2.4.
If your data is coming straight into Elasticsearch, your options are limited.
The best option is to have the program that is creating the data send valid, properly formatted data.
If that's not an option, you can set your Elasticsearch mapping to force the field to be numeric. This will have the side-effect of dropping all documents where this field is not numeric.
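On 2.4 that mapping is set when the index is created, something along these lines (a sketch; my_index, my_type, and response_time are placeholders). Numeric strings such as "42" are still coerced by default, but documents whose values cannot be parsed as numbers will be rejected:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "response_time": {
          "type": "integer"
        }
      }
    }
  }
}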
There is also the Elasticsearch ingest node, which allows for some (Logstash-like) transformations of the data. Converting a field's type is one such allowed "processor".
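Bear in mind that the ingest node only arrived in Elasticsearch 5.0, so it won't help on 2.4, but for reference a type conversion there is just a convert processor (the pipeline and field names are placeholders):
PUT _ingest/pipeline/strings_to_ints
{
  "processors": [
    {
      "convert": {
        "field": "response_time",
        "type": "integer"
      }
    }
  ]
}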

Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi into Elasticsearch via POST and want to do the following:
Make the created_at field a date. Should I use a mapping or an index template for this?
Make some fields not analyzed, like hashtags, URLs, etc.
Store not the entire tweet but only some important fields: the text, some of the user fields (not all of the user information), hashtags, and the URLs from entities (the posted URLs). I don't need the quoted source, etc.
What should I use in this case? A template? Or pre-process the tweets with some ETL step to extract the data I need before indexing into ES?
I am a bit confused and would really appreciate advice.
Thanks in advance.
I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight one. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use in order to mold your own (i.e. add the created_at date field if you like).
Since you don't want to index everything, the way to go for you is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly Twitter indices as you see fit.
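As a starting point, such a template could look roughly like this (a sketch using the legacy _template API; on Elasticsearch 6+ the "template" key becomes "index_patterns", and on 2.x you would use a not_analyzed string instead of keyword). The date format shown assumes the raw Twitter created_at format:
PUT _template/tweets
{
  "template": "tweets-*",
  "mappings": {
    "tweet": {
      "properties": {
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "text": { "type": "text" },
        "hashtags": { "type": "keyword" },
        "urls": { "type": "keyword" }
      }
    }
  }
}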
