unable to map topics to specified index in kafka elasticsearch connector - elasticsearch

I tried to map the topic "localtopic" to the index "indexoftopic", but it creates two new indices in Elasticsearch ("localtopic" and "indexoftopic"), and the topic's data is visible only in the index named after the topic ("localtopic"). No errors are shown by the connector (distributed mode).
My config is:
"config" : {
"connector.class" : "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max" : "1",
"topics" : "localtopic",
"topic.index.map" : "localtopic:indexoftopic",
"connection.url" : "aws elasticsearch url",
"type.name" : "event",
"key.ignore" : "false",
"schema.ignore" : "true",
"schemas.enable" : "false",
"transforms" : "InsertKey,extractKey",
"transforms.InsertKey.type" : "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.InsertKey.fields" : "event-id",
"transforms.extractKey.type" : "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field" : "event-id"
}
The index "indexoftopic" is created in Elasticsearch, but the data only appears under the index "localtopic".
Kafka version: 2.3, connector version: 5, Elasticsearch version: 3.2.0
The logs even show INFO -- topics.regex = "". I don't know this option; can anyone suggest how to use it?

The below worked for me, though I was only mapping one topic to a different index name:
"transforms": "addSuffix",
"transforms.addSuffix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.addSuffix.regex": "topic1.*",
"transforms.addSuffix.replacement": "index1",
So, with the above transform, any topic such as topic1, topic1-test, or topic1<anystring> will be mapped to index1.
Alternately, you can reuse part of the topic name in the replacement as well, by changing the last two lines as below; the capture group picks up whatever follows "topic" and appends it to "index":
"transforms.addSuffix.regex": "topic(.*)",
"transforms.addSuffix.replacement": "index$1",
Basically, you can replace a partial or the full topic name using the regex.

It is advised that you use the RegexRouter transform instead. If you look at the config options for
topic.index.map
you'll see: "This option is now deprecated. A future version may remove it completely. Please use single message transforms, such as RegexRouter, to map topic names to index names."
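For the question above, a minimal sketch of that approach would be to drop the deprecated topic.index.map entry and append a RegexRouter step to the existing transform chain (the alias renameIndex is an arbitrary name chosen here; the InsertKey and extractKey entries stay as they are):
"transforms" : "InsertKey,extractKey,renameIndex",
"transforms.renameIndex.type" : "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.renameIndex.regex" : "localtopic",
"transforms.renameIndex.replacement" : "indexoftopic"
Because the Elasticsearch sink derives the index name from the (transformed) topic, records from localtopic should then be written only to indexoftopic.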

Related

Conditional indexing in metricbeat using Ingest node pipeline creates a datastream

I am trying to achieve conditional indexing for namespaces in Elastic using ingest node pipelines. I used the below pipeline, but when I add the pipeline in metricbeat.yml the index that gets created is in the form of a data stream.
PUT _ingest/pipeline/sample-pipeline
{
    "processors": [
        {
            "set": {
                "field": "_index",
                "copy_from": "metricbeat-dev",
                "if": "ctx.kubernetes?.namespace==\"dev\"",
                "ignore_failure": true
            }
        }
    ]
}
The expected index name is metricbeat-dev, but the value I get in _index is .ds-metricbeat-dev.
This works fine when I test with one document, but when I implement it in the yml file I get an index name starting with .ds-. Why is this happening?
Update, here is the template:
{
    "metricbeat" : {
        "order" : 1,
        "index_patterns" : [
            "metricbeat-*"
        ],
        "settings" : {
            "index" : {
                "lifecycle" : {
                    "name" : "metricbeat",
                    "rollover_alias" : "metricbeat-metrics"
                },
If you have data streams enabled in the index templates, there is the potential to create a data stream. This depends on how you configure the priority: if no priority is mentioned, a legacy index is created, but if a priority higher than 100 is mentioned in the index template, a data stream is created (legacy templates have priority 100, so use a priority value above 100 if you want the index in the form of a data stream).
If it creates a data stream and that is not expected, check whether there is a template pointing to the index you are writing to that has data streams enabled. This was the reason in my case.
I have been working with this for a few months, and this is what I have observed.
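A quick way to check this (a sketch, assuming a version of Elasticsearch new enough to have composable index templates and the simulate API, and the default localhost:9200 endpoint) is to ask Elasticsearch which template would apply to the index name you are writing to and whether it enables data streams:
curl -XGET 'localhost:9200/_index_template/metricbeat*?pretty'
curl -XPOST 'localhost:9200/_index_template/_simulate_index/metricbeat-dev?pretty'
If the winning template contains a "data_stream" section, writes to that name will create a data stream (hence the .ds- prefix) rather than a plain index.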

Too many schemas created using Debezium mongodb CDC with Avro serializing

I'm using the Debezium MongoDB connector to stream changes from a 30GB collection in Mongo.
This is my configuration:
"config": {
"connector.class" : "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max" : "1",
"mongodb.hosts" : "",
"mongodb.name" : "",
"mongodb.user" : "",
"mongodb.password" : "",
"database.whitelist" : "mydb",
"collection.whitelist" : "mydb.activity",
"database.history.kafka.bootstrap.servers" : "kafka:9092",
"transforms": "unwrap",
"transforms.unwrap.type" : "io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope",
"key.converter" : "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url" : "http://schema-registry:8081",
"value.converter" : "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url" : "http://schema-registry:8081",
"internal.key.converter" : "org.apache.kafka.connect.json.JsonConverter",
"internal.value.converter" : "org.apache.kafka.connect.json.JsonConverter",
"schema.compatibility" : "NONE"
}
At first I got a "Too many schemas created for subject" error, so I added:
"value.converter.max.schemas.per.subject" : "100000"
Now Kafka Connect slows down drastically after many schemas are created in Schema Registry for the topic value.
I use this topic in a Kafka Streams application, so moving the SMT to the sink is not possible (there is no sink connector).
The schema does change between collection items, but no more than 500 times, and it is backward compatible, so I don't understand why so many schemas are being created.
Any advice will help.
I ended up writing an SMT that keeps the schema in a cache, to achieve two objectives:
1. keep the order of the fields;
2. have all fields optional in each record schema.
This way the schema evolves until it contains all of the fields, and then it does not change any more unless a new field is added.
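As a quick diagnostic, you can ask Schema Registry how many versions it has actually registered for the topic's value subject (a sketch, assuming the default TopicNameStrategy, where the subject is <topic>-value; replace <your-topic> with the real topic name):
curl -s 'http://schema-registry:8081/subjects'
curl -s 'http://schema-registry:8081/subjects/<your-topic>-value/versions'
If the versions list is far larger than the roughly 500 schema variations you expect, the converter really is registering a new schema per field permutation, which is what the caching/optional-fields SMT above works around.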

Elasticsearch querying alias with routing giving partial results

In an effort to create a multi-tenant architecture for my project, I've created an Elasticsearch cluster with an index 'tenant':
"tenant" : {
"some_type" : {
"_routing" : {
"required" : true,
"path" : "tenantId"
},
Now, I've also created some aliases:
"tenant" : {
"aliases" : {
"tenant_1" : {
"index_routing" : "1",
"search_routing" : "1"
},
"tenant_2" : {
"index_routing" : "2",
"search_routing" : "2"
},
"tenant_3" : {
"index_routing" : "3",
"search_routing" : "3"
},
"tenant_4" : {
"index_routing" : "4",
"search_routing" : "4"
}
I've added some data with tenantId = 2.
After all that, I tried to query 'tenant_2', but I only got partial results, while querying the 'tenant' index directly returns the full results. Why is that?
I was sure that routing is supposed to query all the shards that documents with tenantId = 2 reside on.
When you have created aliases in Elasticsearch, you have to do all operations through the aliases only, be it indexing, updating, or searching.
Try reindexing the data again and check, if possible (I hope it is a test index).
Remove all the indices:
curl -XDELETE 'localhost:9200/_all' # Warning!! Don't use this in production.
Use this command only if it is a test index.
Create the index again and create the aliases again. Do all indexing, search, and delete operations on the alias name. Even the import of data should be done via the alias name.
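A rough sketch of that workflow with curl (index, type, and alias names taken from the question; the document body is made up), assuming the aliases are recreated with the same routing values:
# create the alias with both index and search routing for tenant 2
curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions" : [
    { "add" : { "index" : "tenant", "alias" : "tenant_2", "index_routing" : "2", "search_routing" : "2" } }
  ]
}'
# index through the alias so the routing value is applied at write time
curl -XPOST 'localhost:9200/tenant_2/some_type' -d '{ "tenantId" : 2, "message" : "hello tenant 2" }'
# search through the alias so only the routed shard is queried
curl -XGET 'localhost:9200/tenant_2/_search?pretty'
Documents that were originally indexed without routing may sit on other shards, which is why reindexing through the alias is needed before routed searches return complete results.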

MongoDB embedded vs reference schema for large data documents

I am designing my first MongoDB database schema for a log management system. I would like to store information from log files in MongoDB, and I can't decide which schema I should use for large documents (embedded vs. reference).
Note: a project has many sources, and a source has many logs (in some cases over 1,000,000 logs).
{
    "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
    "name" : "projectName",
    "source" : [{
        "name" : "sourceName",
        "log" : [{
            "time" : ISODate("2012-07-20T13:15:37Z"),
            "host" : "127.0.0.1",
            "status" : 200.0,
            "level" : "INFO",
            "message" : "test"
        }, {
            "time" : ISODate("2012-07-20T13:15:37Z"),
            "host" : "127.0.0.1",
            "status" : 200.0,
            "level" : "ERROR",
            "message" : "test"
        }]
    }]
}
My focus is on performance when reading data from the database (NOT writing), e.g. filtering, searching, pagination, etc. Users can filter source logs by date, status, etc., so I want to focus on read performance when users search or filter data.
I know that MongoDB has a 16MB document size limit, so I am worried about how this will work if I have 1,000,000 logs for one source (and I can have many sources per project, and sources can have many logs). Which is the better solution for working with large documents while keeping good read performance: an embedded or a reference schema? Thanks.
The answer to your question is neither. Instead of embedding or using references, you should flatten the schema to one doc per log entry so that it scales beyond whatever can fit in the 16MB doc limit and so that you have access to the full power and performance of MongoDB's querying capabilities.
So get rid of the array fields and move everything up to top-level fields using an approach like:
{
    "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
    "name" : "projectName",
    "sourcename" : "sourceName",
    "time" : ISODate("2012-07-20T13:15:37Z"),
    "host" : "127.0.0.1",
    "status" : 200.0,
    "level" : "INFO",
    "message" : "test"
}, {
    "_id" : ObjectId("5141e051e2f56cbb680b77fa"),
    "name" : "projectName",
    "sourcename" : "sourceName",
    "time" : ISODate("2012-07-20T13:15:37Z"),
    "host" : "127.0.0.1",
    "status" : 200.0,
    "level" : "ERROR",
    "message" : "test"
}
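With that flattened shape, filtering, searching, and pagination become ordinary indexed queries. A rough sketch in the mongo shell (the collection name logs is just an assumption here):
// compound index to support filtering by project, source and time range
db.logs.createIndex({ name: 1, sourcename: 1, time: -1 })
// e.g. page through a source's ERROR logs for one day, newest first
db.logs.find({
    name: "projectName",
    sourcename: "sourceName",
    level: "ERROR",
    time: { $gte: ISODate("2012-07-20T00:00:00Z"), $lt: ISODate("2012-07-21T00:00:00Z") }
}).sort({ time: -1 }).skip(0).limit(50)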
I think having logs in an array might get messy. If the project and source entities don't have any attributes (keys) other than a name, and the logs are not to be stored for long, you may use a capped collection with one log per document:
{
    _id: ObjectId("5141e051e2f56cbb680b77f9"),
    p: "project_name",
    s: "source_name",
    time: ISODate("2012-07-20T13:15:37Z"),
    host: "127.0.0.1",
    status: 200.0,
    level: "INFO",
    message: "test"
}
Refer to this as well: http://docs.mongodb.org/manual/use-cases/storing-log-data/
Capped collections maintain natural order, so you don't need an index on the timestamp to return the logs in natural order. In your case, you may want to retrieve all logs for a particular source/project; you can create the index { p: 1, s: 1 } to speed up this query, as sketched below.
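A minimal sketch of that setup in the mongo shell (the collection name logs and the 1GB cap are arbitrary choices here):
// capped collection: fixed size on disk, preserves insertion (natural) order
db.createCollection("logs", { capped: true, size: 1024 * 1024 * 1024 })
// speeds up "all logs for this project/source" lookups
db.logs.createIndex({ p: 1, s: 1 })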
But I'd recommend you do some benchmarking to check performance. Try the capped-collection approach above, and also try bucketing documents with the fully embedded schema you suggested. This technique is used in the classic blog-comments problem: you store only a limited number of logs from each source inside a single document and overflow to a new document whenever the custom-defined size is exceeded.

Index huge data into Elasticsearch

I am new to Elasticsearch and have a huge amount of data (more than 16k large rows in a MySQL table). I need to push this data into Elasticsearch and am facing problems indexing it.
Is there a way to make indexing the data faster? How should I deal with huge data?
Expanding on the Bulk API: you will make a POST request to the /_bulk endpoint.
Your payload will follow this format, where \n is the newline character:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
...
Make sure your JSON is not pretty printed.
The available actions are index, create, update, and delete.
Bulk Load Example
To answer your question, here is what to do if you just want to bulk load data into your index.
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
The first line contains the action and metadata. In this case, we are calling create. We will be inserting a document of type type1 into the index named test with a manually assigned id of 3 (instead of elasticsearch auto-generating one).
The second line contains all the fields in your mapping, which in this example is just field1 with a value of value3.
You will just concatenate as many of these as you'd like to insert into your index.
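As a small end-to-end sketch (the file name requests.ndjson is just a placeholder), you can put those action/source line pairs in a file and send it with curl; --data-binary preserves the newlines, and the body must end with a newline:
curl -XPOST 'localhost:9200/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @requests.ndjson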
This may be an old thread, but I thought I would comment anyway for anyone who is looking for a solution to this problem. The JDBC river plugin for Elasticsearch is very useful for importing data from a wide array of supported DBs.
Link to the JDBC river source here..
Using Git Bash's curl command, I PUT the following configuration document to allow communication between the ES instance and the MySQL instance:
curl -XPUT 'localhost:9200/_river/uber/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "strategy" : "simple",
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://localhost:3306/elastic",
        "user" : "root",
        "password" : "root",
        "sql" : "select * from tbl_indexed",
        "poll" : "24h",
        "max_retries": 3,
        "max_retries_wait" : "10s"
    },
    "index": {
        "index": "uber",
        "type" : "uber",
        "bulk_size" : 100
    }
}'
Ensure you have mysql-connector-java-VERSION-bin in the river-jdbc plugin directory, which contains the jdbc-river's necessary JAR files.
Try the Bulk API:
http://www.elasticsearch.org/guide/reference/api/bulk.html
