Configure Logstash to create an Elasticsearch document with nested arrays

Configure Logstash to create an Elasticsearch document with nested arrays - elasticsearch

I'm indexing my PostgreSQL data for Elasticsearch using the Logstash JDBC Input Plugin. I have two tables called REQUEST and ASSIGNMENT listed below.
How can I use Logstash to index the two tables into one Elasticsearch document of type REQUEST with a nested arrays for all child ASSIGNMENT records?
Table: REQUEST
REQUEST_ID | POC
---------- | ----------------
1234 | Jon Snow
1256 | Tyrion Lannister
Table: ASSIGNMENT
ASSIGN_ID | REQUEST_ID | STATUS | CREATED
--------- | ---------- | ------- | ----------
2345 | 1234 | New | 2017-01-06
2364 | 1234 | Working | 2017-03-12
2399 | 1234 | Working | 2017-05-20
5736 | 1256 | New | 2017-06-28
This is what I want my Elasticsearch document to look like. It is a sample of the _source value of the search result:
"_source": {
"request_id": "1234",
"poc": "Jon Snow",
"assignments": [
{
"assign_id": "2345",
"status": "New",
"created": "2017-01-06"
},
{
"assign_id": "2364",
"status": "Working",
"created": "2017-03-12"
},
{
"assign_id": "2399",
"status": "Working",
"created": "2017-05-20"
}
]
}

Related

Kafka jdbc sink connector creates data types that do not matching the original

I am using Kafka and Kafka Connect to replicate MS SQL Server database to MySQL using debezium sql server CDC source connector and confluent JDBC sink connector. The "auto.create" is set to true and the sink connector did create the tables, but some of the data types do not match. In SQL Sever, I have
CREATE TABLE employees (
id INTEGER IDENTITY(1001,1) NOT NULL PRIMARY KEY,
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL UNIQUE,
start_date DATE,
salary INT,
secret FLOAT,
create_time TIME
);
but in MySQL, it created the following:
mysql> desc employees;
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| id | int | NO | PRI | NULL | |
| first_name | text | NO | | NULL | |
| last_name | text | NO | | NULL | |
| email | text | NO | | NULL | |
| start_date | int | YES | | NULL | |
| salary | int | YES | | NULL | |
| secret | double | YES | | NULL | |
| create_time | bigint | YES | | NULL | |
| messageTS | datetime(3) | YES | | NULL | |
+-------------+-------------+------+-----+---------+-------+
ignore messgeTS, that's an extra field I added in the SMT.
The data types for first_name, last_name, email, start_date and create time all do not match. It
converts VARCHAR(255) to text, DATE to int, and TIME to bigint.
Just wondering if anything is misconfigured?
I'm running SQL Server 2019 and MySQL 9.0.28 using docker.
I've also tried the suggestion of disabling autocreate and autoevolve and pre-create the tables with the proper data types.
mysql> desc employees;
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| first_name | varchar(255) | NO | | NULL | |
| last_name | varchar(255) | NO | | NULL | |
| email | varchar(255) | NO | | NULL | |
| start_date | date | NO | | NULL | |
| salary | int | NO | | NULL | |
| secret | double | NO | | NULL | |
| create_time | datetime | NO | | NULL | |
| messageTS | datetime | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
But it gives the following exceptions when trying to insert into the database:
kafka-connect | [2022-03-04 19:55:07,331] INFO Setting metadata for table "employees" to Table{name='"employees"', type=TABLE columns=[Column{'first_name', isPrimaryKey=false, allowsNull=false, sqlType=VARCHAR}, Column{'secret', isPrimaryKey=false, allowsNull=false, sqlType=DOUBLE}, Column{'salary', isPrimaryKey=false, allowsNull=false, sqlType=INT}, Column{'start_date', isPrimaryKey=false, allowsNull=false, sqlType=DATE}, Column{'email', isPrimaryKey=false, allowsNull=false, sqlType=VARCHAR}, Column{'id', isPrimaryKey=true, allowsNull=false, sqlType=INT}, Column{'last_name', isPrimaryKey=false, allowsNull=false, sqlType=VARCHAR}, Column{'messageTS', isPrimaryKey=false, allowsNull=false, sqlType=DATETIME}, Column{'create_time', isPrimaryKey=false, allowsNull=false, sqlType=DATETIME}]} (io.confluent.connect.jdbc.util.TableDefinitions)
kafka-connect | [2022-03-04 19:55:07,382] WARN Write of 4 records failed, remainingRetries=0 (io.confluent.connect.jdbc.sink.JdbcSinkTask)
kafka-connect | java.sql.BatchUpdateException: Data truncation: Incorrect date value: '19055' for column 'start_date' at row 1
The value of the message is
{"id":1002,"first_name":"George","last_name":"Bailey","email":"george.bailey#acme.com","start_date":{"int":19055},"salary":{"int":100000},"secret":{"double":0.867153569942739},"create_time":{"long":1646421476477}}
The schema of the message for the start_date field is
{
"name": "start_date",
"type": [
"null",
{
"type": "int",
"connect.version": 1,
"connect.name": "io.debezium.time.Date"
}
],
"default": null
}
It looks like it does not know how to convert an io.debezium.time.Date to a Date and treated it as an int instead.
Any pointers on this are greatly appreciated.
Source Config:
{
"name": "SimpleSQLServerCDC",
"config":{
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max":1,
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"confluent.topic.bootstrap.servers":"kafka:29092",
"database.hostname" : "sqlserver",
"database.port" : "1433",
"database.user" : "sa",
"database.password" : "",
"database.dbname" : "testDB",
"database.server.name" : "corporation",
"database.history.kafka.topic": "dbhistory.corporation",
"database.history.kafka.bootstrap.servers" : "kafka:29092",
"topic.creation.default.replication.factor": 1,
"topic.creation.default.partitions": 10,
"topic.creation.default.cleanup.policy": "delete"
}
}
Sink Config:
{
"name": "SimpleMySQLJDBC",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": "jdbc:mysql://mysql:3306/sinkdb",
"connection.user": "user",
"connection.password": "",
"tasks.max": "2",
"topics.regex": "corporation.dbo.*",
"auto.create": "true",
"auto.evolve": "true",
"dialect.name": "MySqlDatabaseDialect",
"insert.mode": "upsert",
"pk.mode": "record_key",
"pk.fields":"id",
"delete.enabled": "true",
"batch.size": 1,
"key.converter":"io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"transforms":"unwrap,dropPrefix,insertTS",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"corporation.dbo.(.*)",
"transforms.dropPrefix.replacement":"$1",
"transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones":"false",
"transforms.unwrap.delete.handling.mode":"drop",
"transforms.insertTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertTS.timestamp.field": "messageTS",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.tolerance":"all",
"errors.deadletterqueue.topic.name":"dlq-mysql",
"errors.deadletterqueue.context.headers.enable": "true",
"errors.deadletterqueue.topic.replication.factor":"1"
}
}

converts VARCHAR(255) to text
The character limit of the fields is not carried through the Connect API datatypes. Any String-like data will become TEXT column types.
DATE to int, and TIME to bigint
I think, by default, datetime values are converted into Unix epoch. You can use the TimestampConverter transform to convert to a different format
Overall, if you want to accurately preserve types, disable the auto-creation of tables from the sink connector and pre-create tables with the types you want.

I just made an SMT that converts all timestamp fields to strings. hopefully, it could help.
https://github.com/FX-HAO/kafka-connect-debezium-tranforms

You need to make 2 changes
In Source Connector add "time.precision.mode":"connect"
In sink connector add
"transforms": "TimestampConverter",
"transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.TimestampConverter.target.type": "Timestamp",
"transforms.TimestampConverter.field": "dob",

ElasticSearch Search for specified value within `FromX` and `ToY` fields

I want to query by specified value within range that made by value of two FromX and ToY fields, and search Title as text field by query_string query.
This example shows my goal:
Id | FromX | ToY | Title
-----------------------------
1 | 1 | 7 | Mohammad
2 | 2 | 3 | Ali
3 | 1 | 6 | MohammadAli
4 | 2 | 5 | MohammadReza
5 | 1 | 2 | AliReza
6 | 2 | 2 | Sayed Ali
example query:
value: 2 AND title: *Ali*
result for query:
Id | FromX | ToY | Title
-----------------------------
2 | 2 | 3 | Ali
3 | 1 | 6 | MohammadAli
5 | 1 | 2 | AliReza
6 | 2 | 2 | Sayed Ali
Update 1:
Add last record with Id=6 in the sample data and result.

The following query should give you what you expect:
{
"query": {
"bool": {
"filter": [
{
"range": {
"FromX": {
"lte": 2
}
}
},
{
"range": {
"ToY": {
"gte": 2
}
}
},
{
"query_string": {
"query": "*ali*"
}
}
]
}
}
}
However, not that prefix wildcards should be avoided at all cost as they will penalize the performance of your query. You should probably analyze your title field using ngrams and do normal match queries on the Title field, instead.

Undefined binding(s) detected when compiling SELECT query

I am following a tutorial for strapi and am stuck at a part where I query for dishes belonging to a restaurant. I'm sure everything is set up properly with a one(restaurant) to many(dishes) relationship defined but the query doesn't work. I've traced it to the actual query which is:
query {
restaurant(id: "1") {
id
name
dishes {
name
description
}
}
}
which returns an error when I run it in playground. The query doesn't show any issues while I write it and doesn't allow me to write anything like:
query {
restaurant(where:{id: "1"}) {
id
name
dishes {
name
description
}
}
}
My database is mysql and the two tables look like this:
mysql> describe dishes;
+-------------+---------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------+------+-----+-------------------+-----------------------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | YES | MUL | NULL | |
| description | longtext | YES | | NULL | |
| price | decimal(10,2) | YES | | NULL | |
| restaurant | int(11) | YES | | NULL | |
| created_at | timestamp | NO | | CURRENT_TIMESTAMP | |
| updated_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+-------------+---------------+------+-----+-------------------+-----------------------------+
7 rows in set (0.00 sec)
mysql> describe restaurants;
+-------------+--------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+-------------------+-----------------------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | YES | MUL | NULL | |
| description | longtext | YES | | NULL | |
| created_at | timestamp | NO | | CURRENT_TIMESTAMP | |
| updated_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+-------------+--------------+------+-----+-------------------+-----------------------------+
5 rows in set (0.00 sec)
These tables where auto generated by strapi.
The full error in playground is this:
{
"errors": [
{
"message": "Undefined binding(s) detected when compiling SELECT query: select `restaurants`.* from `restaurants` where `restaurants`.`id` = ? limit ?",
"locations": [
{
"line": 2,
"column": 3
}
],
"path": [
"restaurant"
],
"extensions": {
"code": "INTERNAL_SERVER_ERROR",
"exception": {
"stacktrace": [
"Error: Undefined binding(s) detected when compiling SELECT query: select `restaurants`.* from `restaurants` where `restaurants`.`id` = ? limit ?",
" at QueryCompiler_MySQL.toSQL (/Users/redqueen/development/deliveroo/server/node_modules/knex/lib/query/compiler.js:85:13)",
" at Builder.toSQL (/Users/redqueen/development/deliveroo/server/node_modules/knex/lib/query/builder.js:72:44)",
" at /Users/redqueen/development/deliveroo/server/node_modules/knex/lib/runner.js:37:34",
"From previous event:",
" at Runner.run (/Users/redqueen/development/deliveroo/server/node_modules/knex/lib/runner.js:33:30)",
" at Builder.Target.then (/Users/redqueen/development/deliveroo/server/node_modules/knex/lib/interface.js:23:43)",
" at runCallback (timers.js:705:18)",
" at tryOnImmediate (timers.js:676:5)",
" at processImmediate (timers.js:658:5)",
" at process.topLevelDomainCallback (domain.js:120:23)"
]
}
}
}
],
"data":
Any idea why this is happening?

It seems this was a bug with the alpha.v20 and alpha.v21 versions of strapi. A bug fix has been published to solve it, an issue thread on github is here.

Elasticsearch query time boosting produces result in inadequate order

The ES search result for the given search keyword one two three seems to be wrong after applying boost feature per keyword. Please help me modifying my "faulty" query in order to accomplish "expected result" below as I described. I'm on ES 1.7.4 with LUCENE 4.10.4
Boosting criteria -three is regarded as the most important keyword:
word - boost
---- -----
one 1
two 2
three 3
ES index content - just showing MySQL dump to make the post shorter
mysql> SELECT id, title FROM post;
+----+-------------------+
| id | title |
+----+-------------------+
| 1 | one |
| 2 | two |
| 3 | three |
| 4 | one two |
| 5 | one three |
| 6 | one two three |
| 7 | two three |
| 8 | none |
| 9 | one abc |
| 10 | two abc |
| 11 | three abc |
| 12 | one two abc |
| 13 | one two three abc |
| 14 | two three abc |
+----+-------------------+
14 rows in set (0.00 sec)
Expected ES query result - The user is searching for one two three. I'm not fussed about the order of equally scored records. I mean if record 6 and 13 switches places, I don't mind.
+----+-------------------+
| id | title | my scores for demonstration purposes
+----+-------------------+
| 6 | one two three | (1+2+3 = 6)
| 13 | one two three abc | (1+2+3 = 6)
| 7 | two three | (2+3 = 5)
| 14 | two three abc | (2+3 = 5)
| 5 | one three | (1+3 = 4)
| 4 | one two | (1+2 = 3)
| 12 | one two abc | (1+2 = 3)
| 3 | three | (3 = 3)
| 11 | three abc | (3 = 3)
| 2 | two | (2 = 2)
| 10 | two abc | (2 = 2)
| 1 | one | (1 = 1)
| 9 | one abc | (1 = 1)
| 8 | none | <- This shouldn't appear
+----+-------------------+
14 rows in set (0.00 sec)
Unexpected ES query result - Unfortunately, This is what I get.
+----+-------------------+
| id | title | _score
+----+-------------------+
| 6 | one two three | 1.0013864
| 13 | one two three abc | 1.0013864
| 4 | one two | 0.57794875
| 3 | three | 0.5310148
| 7 | two three | 0.50929534
| 5 | one three | 0.503356
| 14 | two three abc | 0.4074363
| 11 | three abc | 0.36586377
| 12 | one two abc | 0.30806428
| 10 | two abc | 0.23231897
| 2 | two | 0.12812772
| 1 | one | 0.084527075
| 9 | one abc | 0.07408653
+----+-------------------+
ES query
curl -XPOST "http://127.0.0.1:9200/_search?post_dev" -d'
{
"query": {
"bool": {
"must": {
"match": {
"title": {
"query": "one two three"
}
}
},
"should": [
{
"match": {
"title": {
"query": "one",
"boost": 1
}
}
},
{
"match": {
"title": {
"query": "two",
"boost": 2
}
}
},
{
"match": {
"title": {
"query": "three",
"boost": 3
}
}
}
]
}
},
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"from": "0",
"size": "100"
}'
Some more test queries:
This query doesn't produce any result.
This query doesn't order correctly as seem here.

# Index some test data
curl -XPUT "localhost:9200/test/doc/1" -d '{"title": "one"}'
curl -XPUT "localhost:9200/test/doc/2" -d '{"title": "two"}'
curl -XPUT "localhost:9200/test/doc/3" -d '{"title": "three"}'
curl -XPUT "localhost:9200/test/doc/4" -d '{"title": "one two"}'
curl -XPUT "localhost:9200/test/doc/5" -d '{"title": "one three"}'
curl -XPUT "localhost:9200/test/doc/6" -d '{"title": "one two three"}'
curl -XPUT "localhost:9200/test/doc/7" -d '{"title": "two three"}'
curl -XPUT "localhost:9200/test/doc/8" -d '{"title": "none"}'
curl -XPUT "localhost:9200/test/doc/9" -d '{"title": "one abc"}'
curl -XPUT "localhost:9200/test/doc/10" -d '{"title": "two abc"}'
curl -XPUT "localhost:9200/test/doc/11" -d '{"title": "three abc"}'
curl -XPUT "localhost:9200/test/doc/12" -d '{"title": "one two abc"}'
curl -XPUT "localhost:9200/test/doc/13" -d '{"title": "one two three abc"}'
curl -XPUT "localhost:9200/test/doc/14" -d '{"title": "two three abc"}'
# Make test data available for search
curl -XPOST "localhost:9200/test/_refresh?pretty"
# Search using function score
curl -XPOST "localhost:9200/test/doc/_search?pretty" -d'{
"query": {
"function_score": {
"query": {
"match": {
"title": "one two three"
}
},
"functions": [
{
"filter": {
"query": {
"match": {
"title": "one"
}
}
},
"weight": 1
},
{
"filter": {
"query": {
"match": {
"title": "two"
}
}
},
"weight": 2
},
{
"filter": {
"query": {
"match": {
"title": "three"
}
}
},
"weight": 3
}
],
"score_mode": "sum",
"boost_mode": "replace"
}
},
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"from": "0",
"size": "100"
}'

Tokenizing a multi-language text field in Elasticsearch

I have the following table which contains millions of documents data in the form of a json file:
+-------+---------------------------------------+------------+
| doc_id| doc_text | doc_lang |
+-------+---------------------------------------+------------+
| doc1 | "first /resource X 'title' " | en |
| doc2 | "<r>ressource 2 #titre en France" | Fr |
| doc3 | "die Tür geöffnet?" | ge |
| doc4 | "$risorsa 4 <in> lingua italiana" | It |
| ... | " ........." | .. |
| ... | "........." | .. |
+-------+---------------------------------------+------------+
I need to do the following:
Tokenizing, filtering and stopwords removing for each document text using an appropriate analyzer (dynamically) according to the text language shown in doc_lang field (let's say European languages).
Getting TF and IDF for each term inside doc_text field.(no search operations are required, just for scoring)
Q) Could anybody advice me if Elasticsearch is a good choice in this case?
P.S. I am looking for something compatible with Apache Spark.

Include language code in the doc_text field when indexing like
{ "doc_id": "doc", "doc_text_en": "xxx", "doc_lang": "en"}
Then you will be able to specify dynamic mapping of lang-specific analyzer.
https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Configure Logstash to create an Elasticsearch document with nested arrays - elasticsearch

Related

Kafka jdbc sink connector creates data types that do not matching the original

ElasticSearch Search for specified value within `FromX` and `ToY` fields

Undefined binding(s) detected when compiling SELECT query

Elasticsearch query time boosting produces result in inadequate order

Tokenizing a multi-language text field in Elasticsearch

Categories

Resources