Update part of document in Elasticsearch from Kafka

I have multiple Kafka Connectors and Topics that all house different sources of data, yet all contain a reference to the same primary key (let's call it "id"). Can you update Elasticsearch using this same id?
For example, source 1 has the following schema
{
  "id": 123,
  "some_value": "yo",
  "details": {}
}
Source 2 has the following
{
  "id": 123,
  "reference": 1
},
{
  "id": 123,
  "reference": 2
}
Is there a way I can create my expected outcome within ES to mimic the following
{
  "id": 123,
  "some_value": "yo",
  "details": [
    {
      "id": 123,
      "reference": 1
    },
    {
      "id": 123,
      "reference": 2
    }
  ]
}
I have tried using Kafka Connect's single message transforms (HoistField) but have been unsuccessful.
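If the SMT route keeps failing, one workaround is to do the merge on the Elasticsearch side instead: consume source 2 with a small custom consumer and issue scripted upserts against the Update API, appending each record to the details array of the document keyed by id. A minimal sketch, assuming an index named my-index and that details is an array (both are assumptions, and the URL uses the ES 7+ form of the Update API):
POST /my-index/_update/123
{
  "script": {
    "source": "if (ctx._source.details == null) { ctx._source.details = [] } ctx._source.details.add(params.entry)",
    "params": {
      "entry": { "id": 123, "reference": 1 }
    }
  },
  "upsert": {
    "id": 123,
    "details": [ { "id": 123, "reference": 1 } ]
  }
}
Note that the stock Elasticsearch sink connector's upsert mode merges fields rather than appending to arrays, which is why something script-based is needed to accumulate the details entries.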

Directus Filtering multiple fields

I'm trying to filter a Directus CMS data set through URL parameters.
This is a sample data set. I can successfully filter the data set by a single parameter.
{
  "data": [
    {
      "id": "1",
      "status": "published",
      "category": "Novel",
      "section": "Kids"
    },
    {
      "id": "2",
      "status": "published",
      "category": "Novel",
      "section": "Adults"
    }
  ]
}
/items/books?filter[category][_eq]=Novel
gives me exactly what I expected, which is records 1 & 2.
But I need to filter on both the "category" & "section" fields:
/items/books?filter[category][_eq]=Novel&filter[section][_eq]=Adults
For the above I receive an empty data set.
Why is this failing? Where do I need to fix it? I appreciate your support in advance. Thanks!
Try the following query (since you want records matching both conditions, combine them with _and):
/items/books?filter={"_and":[{"category":{"_eq":"Novel"}},{"section":{"_eq":"Adults"}}]}
An expanded version of the filter:
{
  "_and": [
    {
      "category": {
        "_eq": "Novel"
      }
    },
    {
      "section": {
        "_eq": "Adults"
      }
    }
  ]
}
Visit the official docs to read more about filtering rules and logical operators.

How to cleanly batch queries together in Gremlin

I am writing a GraphQL resolver that retrieves all vertices connected by a particular edge, using the following query (traversing created returns vertices with label person):
software {
  created {
    name
  }
}
This would resolve to the following Gremlin query for each software node found:
g.V().hasLabel('software').has('name', 'ripple').in('created')
This returns a result that includes all properties of the object:
{
  "result": [
    {
      "#type": "d",
      "#rid": "#24:0",
      "#version": 6,
      "#class": "person",
      "in_knows": [
        "#35:0"
      ],
      "name": "josh",
      "out_created": [
        "#32:0",
        "#33:0"
      ],
      "age": 32,
      "#fieldTypes": "in_knows=g,out_created=g"
    }
  ],
  "dbStats": {
    ...
  }
}
I realize that this will fall foul of GraphQL's N+1 query problem, so I'm trying to batch queries together using a DataLoader pattern. (I'm also hoping to do property selection, so I'm not asking the database to return too much info.)
So I'm trying to craft a query like so:
g.V().union(
  __.hasLabel('software').has('name', 'ripple').
    project('parent', 'child').by('id').
    by(__.in('created').fold()),
  __.hasLabel('software').has('name', 'lop').
    project('parent', 'child').by('id').
    by(__.in('created').fold())
)
But this results in the following, where the props are missing and only the ids of the vertices I want are included:
{
  "result": [
    {
      "parent": "ripple",
      "child": [
        "#24:0"
      ]
    },
    {
      "parent": "lop",
      "child": [
        "#22:0",
        "#23:0",
        "#24:0"
      ]
    }
  ],
  "dbStats": {
    ...
  }
}
My question is: how can I have the Gremlin query return all of the props for the found vertices, and none of the other fields? Should I even be doing batching this way?
For anyone else reading: the query I was trying to write wouldn't work because the TraversalSet created in the .by(__.in('created').fold()) can't be cast from a List to an ElementMap, as the stream cardinality wouldn't be enforced. (You can only have one record per row, I think?)
My working query duplicates the parent key for each row and specifies the props needed. (The query below is OK for Gremlin 3.3 as used in OrientDB; if you're on Gremlin 3.4 or later, you can replace the last by() step with by(__.elementMap('name', 'age')).)
g.V().union(
  __.hasLabel('software').has('name', 'ripple').
    as('parent').
    in('created').as('child').
    select('parent', 'child').
    by(values('name')).
    by(properties('id', 'name', 'age').
      group().by(__.key()).
      by(__.value())),
  __.hasLabel('software').has('name', 'lop').
    as('parent').
    in('created').as('child').
    select('parent', 'child').
    by(values('name')).
    by(properties('id', 'name', 'age').
      group().by(__.key()).
      by(__.value()))
)
So that you get a result like this:
{"data": [
{
"parent": "ripple",
"child": {
"id": 5717,
"name": "josh",
"age": 32
}
},
{
"parent": "lop",
"child": {
"id": 5709,
"name": "peter",
"age": 35
}
},
{
"parent": "lop",
"child": {
"id": 5713,
"name": "marko",
"age": 29
}
},
{
"parent": "lop",
"child": {
"id": 5717,
"name": "josh",
"age": 32
}
}
]
}
This lets you create a lookup where you concatenate all results for "lop" and "ripple" into arrays.
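For the DataLoader batch function itself, turning those flat rows into per-parent arrays is straightforward. A minimal TypeScript sketch, assuming rows holds the result array above (the Row type and groupByParent name are illustrative, not part of any library):
// Group flat { parent, child } rows into a parent -> children lookup,
// so the batch function can return children in the same order as its keys.
type Row = { parent: string; child: Record<string, unknown> };

function groupByParent(rows: Row[]): Map<string, Row["child"][]> {
  const lookup = new Map<string, Row["child"][]>();
  for (const { parent, child } of rows) {
    const children = lookup.get(parent) ?? [];
    children.push(child);
    lookup.set(parent, children);
  }
  return lookup;
}

// Usage inside a DataLoader batch function:
// const lookup = groupByParent(rows);
// return keys.map((key) => lookup.get(key) ?? []);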

ReferenceManyFields (One to Many Relationship)

I am working on a project where I have to create a one-to-many relationship that fetches all the records referenced by id in another table, and I have to display the selected data in a multi-select field (SelectArrayInput). Please help me out with this; an example would be great.
Thanks in advance.
Example:
district:
id | name
---+-----
1  | A
2  | B
3  | C

block:
id | district_id | name
---+-------------+-----
1  | 1           | ABC
2  | 1           | XYZ
3  | 2           | DEF
I am using the https://github.com/Steams/ra-data-hasura-graphql dataprovider for my application.
You're likely looking for "nested object queries" (see: https://hasura.io/docs/1.0/graphql/manual/queries/nested-object-queries.html#nested-object-queries)
An example...
query MyQuery {
  district(where: {id: {_eq: 1}}) {
    id
    name
    blocks {
      id
      name
    }
  }
}
result:
{
  "data": {
    "district": [
      {
        "id": 1,
        "name": "A",
        "blocks": [
          {
            "id": 1,
            "name": "ABC"
          },
          {
            "id": 2,
            "name": "XYZ"
          }
        ]
      }
    ]
  }
}
Or...
query MyQuery2 {
  block(where: {district: {name: {_eq: "A"}}}) {
    id
    name
    district {
      id
      name
    }
  }
}
result:
{
  "data": {
    "block": [
      {
        "id": 1,
        "name": "ABC",
        "district": {
          "id": 1,
          "name": "A"
        }
      },
      {
        "id": 2,
        "name": "XYZ",
        "district": {
          "id": 1,
          "name": "A"
        }
      }
    ]
  }
}
Setting up the tables this way...
[Screenshots of the "blocks" and "districts" table setups, with block.district_id referencing district.id.]
Aside: I recommend using plural table names ("districts" and "blocks"), as they are more standard.
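On the react-admin side, the multi-select is typically a ReferenceArrayInput wrapping a SelectArrayInput. A minimal sketch (react-admin v4 style), assuming a "block" resource and an array-of-ids field block_ids on district; that field name is an assumption about how your dataprovider exposes the relationship:
// Edit form for a district: picks related blocks in a multi-select
// backed by the "block" resource.
import { Edit, ReferenceArrayInput, SelectArrayInput, SimpleForm, TextInput } from "react-admin";

export const DistrictEdit = () => (
  <Edit>
    <SimpleForm>
      <TextInput source="name" />
      {/* "block_ids" is an assumed array-of-ids field on district */}
      <ReferenceArrayInput source="block_ids" reference="block">
        <SelectArrayInput optionText="name" />
      </ReferenceArrayInput>
    </SimpleForm>
  </Edit>
);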

Dedup elasticsearch results using multiple fields as unique key

There have been similar questions asked to this (see "Remove duplicate documents from a search in Elasticsearch"), but I haven't found a way to dedupe using multiple fields as the "unique key". Here's a simple example to illustrate a bit of what I'm looking for:
Say this is our raw data:
{ "name": "X", "event": "A", "time": 1 }
{ "name": "X", "event": "B", "time": 2 }
{ "name": "X", "event": "B", "time": 3 }
{ "name": "Y", "event": "A", "time": 4 }
{ "name": "Y", "event": "C", "time": 5 }
I would essentially like to get the distinct event counts based on name and event. I want to avoid double-counting event B, which happened for the same name X twice, so the counts I'd be looking for are:
event: A, count: 2
event: B, count: 1
event: C, count: 1
Is there a way to set up an agg query as seen in the related question? Another option I've considered is to index each object with a special key field (i.e. "X_A", "X_B", etc.). I could then simply dedupe on this field. I'm not sure which is the preferred approach, but I'd personally prefer not to index the data with extra metadata.
You can specify a script in a terms aggregation in order to build a key out of multiple fields:
POST /test/dedup/_search
{
  "aggs": {
    "dedup": {
      "terms": {
        "script": "[doc.name.value, doc.event.value].join('_')"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
This will basically provide the following results:
X_A: 1
X_B: 2
Y_A: 1
Y_C: 1
Note: There's only one event C in your sample data, so the count cannot be two unless I'm missing something.
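If what you ultimately want is the per-event distinct counts from the question (A: 2, B: 1, C: 1) rather than per-pair buckets, another option is a terms aggregation on event with a cardinality sub-aggregation on name. A sketch, assuming both fields are aggregatable (e.g. keyword / not_analyzed):
POST /test/dedup/_search
{
  "size": 0,
  "aggs": {
    "events": {
      "terms": { "field": "event" },
      "aggs": {
        "distinct_names": {
          "cardinality": { "field": "name" }
        }
      }
    }
  }
}
Keep in mind that cardinality is approximate on large data sets, though it is exact at counts this small.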

Query nested field with index support

Can anyone tell me whether there is a way to query a nested field with index support? I created a nested index like:
r.table('comments').indexCreate('authorName', r.row("author")("name")).run(conn, callback)
but I can't see any way to query all comments that have a specified author name. The documentation says that the getAll command takes a number, string, bool, pseudotype, or array, and that the filter command does not currently have an optimizer for indexes.
I've just tried creating a nested r.row("author")("name") secondary index named "authorName" for a table with the following rows:
[
  {
    "author": {
      "name": "Lennon"
    },
    "text": "c1",
    "id": "4f66dcac-be74-49f2-b8dc-5fc352f4f928"
  },
  {
    "author": {
      "name": "Cobain"
    },
    "text": "c2",
    "id": "82936ae0-bc4d-435b-b19a-6786339da232"
  }
]
It seems that
r.table('comments').getAll("Cobain", {index: "authorName"}).run(conn, callback)
is working and returns
{
  "author": {
    "name": "Cobain"
  },
  "text": "c2",
  "id": "82936ae0-bc4d-435b-b19a-6786339da232"
}
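One gotcha: a freshly created secondary index is built asynchronously, so a getAll issued immediately after indexCreate can fail. You can block until the index is ready with indexWait, e.g. (a small sketch, connection setup elided):
r.table('comments').indexWait('authorName').run(conn, function(err, result) {
  // the index is now ready; secondary-index lookups will work
  r.table('comments').getAll("Cobain", {index: "authorName"}).run(conn, callback);
});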
