MongoDB sink connector: How to upsert deep objects - apache-kafka-connect

I have a use case of reading messages from Kafka topics and loading them into MongoDB. As part of this process, I also need to handle updates to existing data.
For example, consider this Kafka message:
{
  "_id": "123",
  "meta": {
    "id": "456",
    "name": "abc",
    "lastname": "xyz"
  }
}
After this is written to MongoDB by the sink, consider the next message:
{
  "meta": {
    "id": "456",
    "lastname": "oxy"
  }
}
I expect the sink connector to update only the lastname field without overwriting the other fields, so the document should look like:
{
  "_id": "123",
  "meta": {
    "id": "456",
    "name": "abc",
    "lastname": "oxy"
  }
}
Basically, how can I achieve this upsert behaviour in a MongoDB sink connector write model strategy? Here are the custom strategy and the sink configuration:
public class UpsertAsPartOfDocumentStrategy implements WriteModelStrategy, Configurable {

  private static final String ID_FIELD_NAME = "_id";
  private boolean isPartialId = false;

  private static final String CREATE_PREFIX = "%s%s.";
  private static final String ELEMENT_NAME_PREFIX = "%s%s";
  private static final UpdateOptions UPDATE_OPTIONS = new UpdateOptions().upsert(true);

  static final String FIELD_NAME_MODIFIED_TS = "_modifiedTS";
  static final String FIELD_NAME_INSERTED_TS = "_insertedTS";

  @Override
  public WriteModel<BsonDocument> createWriteModel(SinkDocument document) {
    BsonDocument vd =
        document
            .getValueDoc()
            .orElseThrow(
                () ->
                    new DataException(
                        "Could not build the WriteModel, the value document was missing unexpectedly"));

    BsonValue idValue = vd.get(ID_FIELD_NAME);
    if (idValue == null || !idValue.isDocument()) {
      throw new DataException(
          "Could not build the WriteModel, the value document does not contain an _id field of"
              + " type BsonDocument which holds the business key fields.\n\n If you are including an"
              + " existing `_id` value in the business key then ensure `document.id.strategy.overwrite.existing=true`.");
    }

    BsonDocument businessKey = idValue.asDocument();
    if (isPartialId) {
      businessKey = flattenKeys(businessKey);
    }
    System.out.println("document " + vd);
    return new UpdateOneModel<>(businessKey, vd, UPDATE_OPTIONS);
  }
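One way to merge nested fields instead of replacing the whole value is to turn the value document into a $set update whose keys use dot notation (for example meta.lastname), so MongoDB only touches the listed paths. A minimal sketch, assuming the rest of the strategy above stays the same; toDotNotation is a hypothetical helper, not part of the connector API (assumed imports: org.bson.BsonDocument, org.bson.BsonValue, java.util.Map):
  // Recursively flatten nested documents into dot-notation paths,
  // e.g. {"meta": {"lastname": "oxy"}} -> {"meta.lastname": "oxy"}.
  private static BsonDocument toDotNotation(String prefix, BsonDocument source, BsonDocument target) {
    for (Map.Entry<String, BsonValue> entry : source.entrySet()) {
      String path = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
      if (entry.getValue().isDocument()) {
        toDotNotation(path, entry.getValue().asDocument(), target);
      } else {
        target.append(path, entry.getValue());
      }
    }
    return target;
  }

  // In createWriteModel, instead of passing vd as the update document:
  // BsonDocument setUpdate = new BsonDocument("$set", toDotNotation("", vd, new BsonDocument()));
  // return new UpdateOneModel<>(businessKey, setUpdate, UPDATE_OPTIONS);
Depending on the id strategy, you may also want to remove the _id entry from vd before flattening, so the update never attempts to modify the immutable _id of an existing document.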
Reference: https://github.com/mongodb/mongo-kafka/blob/r1.7.0/src/main/java/com/mongodb/kafka/connect/sink/writemodel/strategy/UpdateOneBusinessKeyTimestampStrategy.java
https://www.mongodb.com/docs/drivers/go/current/fundamentals/crud/write-operations/upsert/
Sink properties
# Connection details
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
connection.uri=mongodb://<connection>
tasks.max=1
topics=topc
database=db
collection=col
# Specific global MongoDB Sink Connector configuration
document.id.strategy.overwrite.existing=true
#writemodel.strategy=com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy
writemodel.strategy=custom.writestrategy.UpsertAsPartOfDocumentStrategy
document.id.strategy=com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy
document.id.strategy.partial.value.projection.list=meta.id
document.id.strategy.partial.value.projection.type=AllowList
errors.tolerance=all
errors.deadletterqueue.topic.name=error_queue
errors.deadletterqueue.context.headers.enable=true
errors.log.include.messages=true

Related

Update inner object element in Json using Gson

I have the JSON below and need to update elements. The code below works for elements at the top level; how can I extend it to work for elements inside another inner level (a nested object)?
JSON:
{
  "name": "George",
  "version": "2.0",
  "reqParams": {
    "headerId": "this needs to be updated",
    "queue": "draft"
  }
}
In the code below I am passing, for example:
keyPath = "headerId"
updateText = "123456"
jsonText = the JSON above
Code:
public String updateValue(String keyPath, String updateText, String jsonText) {
    String[] keys = keyPath.split("/");
    JsonParser jsonParser = new JsonParser();
    JsonObject jsonObject = (JsonObject) jsonParser.parse(jsonText);
    JsonObject returnVal = jsonObject; // This holds the ref to the target json object
    JsonPrimitive jp = new JsonPrimitive(updateText);
    String finalKey = keys[keys.length - 1];
    for (String key : keys) {
        if (jsonObject.get(key) != null && jsonObject.get(key).isJsonObject()) {
            jsonObject = (JsonObject) jsonObject.get(key);
        }
    }
    jsonObject.remove(finalKey);
    jsonObject.add(finalKey, jp);
    return returnVal.toString();
}
Expected output JSON:
{
  "name": "George",
  "version": "2.0",
  "reqParams": {
    "headerId": "123456",
    "queue": "draft"
  }
}
Actual result:
{
  "name": "George",
  "version": "2.0",
  "reqParams": {
    "headerId": "this needs to be updated",
    "queue": "draft"
  },
  "headerId": "123456"
}
Pass keyPath as "reqParams/headerId", because headerId is inside reqParams and not at the root level of the JSON.
I updated the code slightly and pass the parameters as suggested by @Smile's answer:
keyPath: reqParams/headerId
someId (if it exists at the root level)
Code:
public String updateValue(String keyPath, String updateText, String jsonText) {
    String[] keys = keyPath.split("/");
    JsonParser jsonParser = new JsonParser();
    JsonObject jsonObject = (JsonObject) jsonParser.parse(jsonText);
    JsonObject returnVal = jsonObject; // This holds the ref to the target json object
    JsonPrimitive jp = new JsonPrimitive(updateText);
    String finalKey = keys[keys.length - 1];
    for (String key : keys) {
        if (jsonObject.get(key) != null && jsonObject.get(key).isJsonObject()) {
            jsonObject = (JsonObject) jsonObject.get(key);
            jsonObject.remove(finalKey);
            jsonObject.add(finalKey, jp);
            return returnVal.toString();
        } else if (jsonObject.get(finalKey) == null) {
            return returnVal.toString();
        }
    }
    jsonObject.remove(finalKey);
    jsonObject.add(finalKey, jp);
    return returnVal.toString();
}
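For reference, a small usage sketch of the updated method with a nested path (the JSON literal is just the example above inlined; nothing beyond the method shown is assumed):
String json = "{\"name\":\"George\",\"version\":\"2.0\","
        + "\"reqParams\":{\"headerId\":\"this needs to be updated\",\"queue\":\"draft\"}}";
String updated = updateValue("reqParams/headerId", "123456", json);
System.out.println(updated);
// {"name":"George","version":"2.0","reqParams":{"headerId":"123456","queue":"draft"}}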

How to process a CSV file using Reactor Flux and output as JSON

I've got a CSV file which I want to process using Spring Reactor Flux.
Given a CSV file with a header, where the first two columns are fixed and there can be more than one optional data column:
Id, Name, Group, Status
6EF3C06E-6240-1A4A-17D6-27E73F0CDD31, Harlan Ferguson, xy1, true
6B261437-217C-0FDF-741A-92477EE354EC, Risa Greene, xy2, false
4FADC070-FCD0-C7E8-1963-A7FACDB6D8D1, Samson Blanchard, xy3, false
562C3486-E009-2C2D-9D3E-14355DB7D4D7, Damian Carson, xy4, true
...
...
...
I want to process the input using Flux
So that the output is
[{
"Id": "6EF3C06E-6240-1A4A-17D6-27E73F0CDD31",
"Name": "Harlan Ferguson",
"data": {
"Group": "xyz1",
"Status": true
}
}, {
"Id": "6B261437-217C-0FDF-741A-92477EE354EC",
"Name": "Risa Greene",
"data": {
"Group": "xy2",
"Status": false
}
}, {
"Id": "4FADC070-FCD0-C7E8-1963-A7FACDB6D8D1",
"Name": "Samson Blanchard",
"data": {
"Group": "xy3",
"Status": false
}
}, {
"Id": "562C3486-E009-2C2D-9D3E-14355DB7D4D7",
"Name": "Damian Carson",
"data": {
"Group": "xy4",
"Status": true
}
}]
I'm using CSVReader to stream the file and I create a Flux using:
CSVReader reader = new CSVReader(Files.newBufferedReader(file));
Flux<String[]> fluxOfCsvRecords = Flux.fromIterable(reader);
I'm coming back to Spring Reactor after a couple of years, so my understanding is a bit rusty.
I create a Mono of the header using:
Mono<String[]> headerMono = fluxOfCsvRecords.next();
And then,
fluxOfCsvRecords.skip(1)
.flatMap(csvRecord -> headerMono.map(header -> header[0] + " : " + csvRecord[0]))
.subscribe(System.out::println);
This is halfway code, just to test that I'm able to combine data from the header with the rest of the Flux; I'm expecting to see
Id : 6EF3C06E-6240-1A4A-17D6-27E73F0CDD31
Id : 6B261437-217C-0FDF-741A-92477EE354EC
Id : 4FADC070-FCD0-C7E8-1963-A7FACDB6D8D1
Id : 562C3486-E009-2C2D-9D3E-14355DB7D4D7
But my output is just
4FADC070-FCD0-C7E8-1963-A7FACDB6D8D1 : 6EF3C06E-6240-1A4A-17D6-27E73F0CDD31
I'd appreciate it if anyone can help me understand how to achieve this.
Update
I tried another approach:
Flux<String[]> take1 = fluxOfCsvRecords.take(1);
take1.flatMap(header -> fluxOfCsvRecords.map(csvRecord -> header[0] + " : " + csvRecord[0]))
.subscribe(System.out::println);
The output is
Id : 6B261437-217C-0FDF-741A-92477EE354EC
Id : 4FADC070-FCD0-C7E8-1963-A7FACDB6D8D1
Id : 562C3486-E009-2C2D-9D3E-14355DB7D4D7
The row right after the header is missing.
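A note on the half-way test, as a hedged sketch rather than a definitive fix: both attempts subscribe more than once to a Flux backed by the same CSVReader, so each subscription consumes rows from the shared reader and the results drift. Caching the source makes later subscriptions replay from the start instead of pulling fresh rows (at the cost of buffering the whole file in memory):
Flux<String[]> fluxOfCsvRecords = Flux.fromIterable(reader).cache(); // replay for late subscribers
Mono<String[]> headerMono = fluxOfCsvRecords.next();

fluxOfCsvRecords.skip(1)
        .flatMap(csvRecord -> headerMono.map(header -> header[0] + " : " + csvRecord[0]))
        .subscribe(System.out::println);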
Add two classes like these:
public class TopJson {
    private String id;
    private String name;
    private InnerJson data;

    public TopJson() {}

    public TopJson(String id, String name, InnerJson data) {
        this.id = id;
        this.name = name;
        this.data = data;
    }
}

class InnerJson {
    private String group;
    private String status;

    public InnerJson() {}

    public InnerJson(String group, String status) {
        this.group = group;
        this.status = status;
    }
}
Then map each CSV record to these objects (converting values to other types where needed):
fluxOfCsvRecords.skip(1)
        .map(csvRecord -> new TopJson(csvRecord[0], csvRecord[1],
                new InnerJson(csvRecord[2], csvRecord[3])))
        .collect(Collectors.toList());
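Putting it together for the nested JSON output shown in the question, here is a hedged end-to-end sketch. It assumes opencsv's CSVReader and Jackson's ObjectMapper; the class and method names (CsvToJson, process) are illustrative only. Columns beyond the first two are handled generically so any optional columns land under "data":
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.opencsv.CSVReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import reactor.core.Exceptions;
import reactor.core.publisher.Flux;

public class CsvToJson {

    static void process(Path file) throws IOException {
        Iterator<String[]> rows = new CSVReader(Files.newBufferedReader(file)).iterator();
        String[] header = rows.next();            // consume only the header row
        Iterable<String[]> remaining = () -> rows; // remaining rows; single subscription only

        Flux.fromIterable(remaining)
            .map(row -> {
                Map<String, Object> record = new LinkedHashMap<>();
                record.put(header[0].trim(), row[0].trim());
                record.put(header[1].trim(), row[1].trim());
                Map<String, Object> data = new LinkedHashMap<>();
                for (int i = 2; i < row.length; i++) {
                    data.put(header[i].trim(), row[i].trim()); // values stay Strings; parse booleans/numbers here if needed
                }
                record.put("data", data);
                return record;
            })
            .collectList()
            .map(list -> {
                try {
                    return new ObjectMapper().writeValueAsString(list);
                } catch (JsonProcessingException e) {
                    throw Exceptions.propagate(e);
                }
            })
            .subscribe(System.out::println);
    }
}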

How could I build this Elasticsearch query?

I'm using Elasticsearch with Spring Data and I have this configuration:
public class Address {
    //...
    @MultiField(
        mainField = @Field(type = FieldType.Text),
        otherFields = {
            @InnerField(suffix = "raw", type = FieldType.Keyword)
        }
    )
    private String locality;
    //...
}
User can filter addresses by locality, so I'm trying to find the proper Elasticsearch query.
Say there are 2 documents:
{ /* ... */, locality: "Granada" }
{ /* ... */, locality: "Las Palmas de Gran Canaria" }
Given the user input granada or Granada, I want just the first document to be returned. However, using this query, both of them are returned:
{
"query": {
"match": {
"address.locality": "granada"
}
}
}
I have also tried with:
{
"query": {
"term": {
"address.locality.raw": "granada"
}
}
}
But in that case the query is case-sensitive and only returns the first document when the input is Granada, not granada.
How could I achieve that behaviour?
I wonder why you get both documents with your query; nothing is returned when I try it, because address is not a property of your Document class.
The query should be
{
"query": {
"match": {
"locality": "granada"
}
}
}
Then it returns just the one document.
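If you are building that query from Java with Spring Data Elasticsearch (3.x style), a rough sketch looks like the following; the ElasticsearchOperations bean and index setup are assumed and not part of the answer above:
SearchQuery query = new NativeSearchQueryBuilder()
        .withQuery(QueryBuilders.matchQuery("locality", "granada")) // analyzed, so case does not matter
        .build();
List<Address> result = elasticsearchOperations.queryForList(query, Address.class);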
The mapping that is produced using Spring Data Elasticsearch 3.2.0.RC2 when using this class:
@Document(indexName = "address")
public class Address {

    @Id private Long id;

    @MultiField(mainField = @Field(type = FieldType.Text),
            otherFields = { @InnerField(suffix = "raw", type = FieldType.Keyword) })
    private String locality;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public String getLocality() {
        return locality;
    }

    public void setLocality(String locality) {
        this.locality = locality;
    }
}
is:
{
  "address": {
    "mappings": {
      "address": {
        "properties": {
          "id": {
            "fields": {
              "keyword": {
                "ignore_above": 256,
                "type": "keyword"
              }
            },
            "type": "text"
          },
          "locality": {
            "fields": {
              "raw": {
                "type": "keyword"
              }
            },
            "type": "text"
          }
        }
      }
    }
  }
}
The first thing to notice is that with match() queries, Elasticsearch analyzes (pre-processes) the query text (tokenization is performed: it chops off spaces, removes punctuation and more), in the same way the field was analyzed when it was indexed.
So if your "address.locality" string field is indexed as 'text', the standard analyzer is used both for indexing and for search (with a match() query).
Term queries are not analyzed before the search is executed, and thus different results might appear.
So in your example, our analysis process will look like:
locality: 'Granada' >> ['granada']
locality.raw: 'Granada' >> ['Granada']
locality: 'Las Palmas de Gran Canaria' >> ['las', 'palmas', 'de', 'gran', 'canaria']
locality.raw: 'Las Palmas de Gran Canaria' >> ['Las Palmas de Gran Canaria']
As for the second case, "address.locality.raw" is indexed as 'keyword', which uses the keyword analyzer; this analyzer indexes the entire value as a single token (it does not chop off anything).
Possible solutions:
For the first part: it should actually return only one document if you set up your property as P.J mentioned above.
For the second part: index the inner field as type = FieldType.Text, which will analyze 'Granada' to 'granada'; a term() query for 'granada' will then match, but any other term() query will not. Any match() query for 'Granada', 'GRANADA', 'granada', etc. will match as well (as it will be analyzed to 'granada' by the standard analyzer). Check this against your future use cases; maybe keyword indexing is relevant in other use cases, in which case just change the query itself.
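To make the analysis difference concrete in code, a hedged sketch of the two query types against this mapping, using QueryBuilders from the Elasticsearch client (index and client setup omitted):
// match on the analyzed 'text' field: "granada", "Granada", "GRANADA" all hit the document,
// because the query string is analyzed to ['granada'], the same as the indexed tokens.
QueryBuilder analyzed = QueryBuilders.matchQuery("locality", "granada");

// term on the 'keyword' inner field: only the exact stored value matches,
// so "Granada" hits but "granada" does not.
QueryBuilder exact = QueryBuilders.termQuery("locality.raw", "Granada");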

Java 8 collect and change the format of the result

I have a data structure called MyPojo which has fields called time, name and timetaken (all Strings). I'm trying to do some grouping as follows:
List<MyPojo> myPojos = Arrays.asList(
new MyPojo("2017", "ABC", "30"),
new MyPojo("2017", "ABC", "20"),
new MyPojo("2016", "ABC", "25"),
new MyPojo("2017", "XYZ", "40")
);
Map<String, Map<String, Double>> resultMap = myPojos.stream()
.collect(Collectors.groupingBy(MyPojo::getName,
Collectors.groupingBy(MyPojo::getTime,
Collectors.averagingDouble(MyPojo::getTimeTakenAsDouble))));
Please note that I have a method called getTimeTakenAsDouble to convert the timetaken string to a double value.
This produces the following result:
{ABC={2017=25.0, 2016=25.0}, XYZ={2017=40.0}}
However, my frontend developer wanted the data either in the following format:
{ABC={2017=25.0, 2016=25.0}, XYZ={2017=40.0, 2016=0.0}}
or
[
{
"time": "2017",
"name": "ABC",
"avgTimeTaken": 25.0
},
{
"time": "2017",
"name": "XYZ",
"avgTimeTaken": 40.0
},
{
"time": "2016",
"name": "ABC",
"avgTimeTaken": 25.0
},
{
"time": "2016",
"name": "XYZ",
"avgTimeTaken": 0.0
}
]
I'm thinking of iterating over the resultMap again to prepare the second format. Is there any other way to handle this?
Actually, what you are trying to achieve is pretty interesting; it's like some sort of logical padding. The way I've done it is with Collectors.collectingAndThen: once the result is there, I simply pad it with the needed data.
Notice that I'm using Sets.difference from Guava, but that can easily be replaced by a small static method. There are also some additional operations performed.
So I assume your MyPojo looks like this:
static class MyPojo {
    private final String time;
    private final String name;
    private final String timetaken;

    public MyPojo(String time, String name, String timetaken) {
        this.name = name;
        this.time = time;
        this.timetaken = timetaken;
    }

    public String getName() {
        return name;
    }

    public String getTime() {
        return time;
    }

    public String getTimetaken() {
        return timetaken;
    }

    public static double getTimeTakenAsDouble(MyPojo pojo) {
        return Double.parseDouble(pojo.getTimetaken());
    }
}
And the input data that I've checked against is:
List<MyPojo> myPojos = Arrays.asList(
new MyPojo("2017", "ABC", "30"),
new MyPojo("2017", "ABC", "20"),
new MyPojo("2016", "ABC", "25"),
new MyPojo("2017", "XYZ", "40"),
new MyPojo("2018", "RDF", "80"));
Here is the code that does what you want:
Set<String> distinctYears = myPojos.stream().map(MyPojo::getTime).collect(Collectors.toSet());

Map<String, Map<String, Double>> resultMap = myPojos.stream()
        .collect(Collectors.groupingBy(MyPojo::getName,
                Collectors.collectingAndThen(
                        Collectors.groupingBy(MyPojo::getTime,
                                Collectors.averagingDouble(MyPojo::getTimeTakenAsDouble)),
                        map -> {
                            Set<String> localYears = map.keySet();
                            SetView<String> diff = Sets.difference(distinctYears, localYears);
                            Map<String, Double> toReturn = new HashMap<>(localYears.size() + diff.size());
                            toReturn.putAll(map);
                            diff.stream().forEach(e -> toReturn.put(e, 0.0));
                            return toReturn;
                        })));
Result of that would be:
{ABC={2016=25.0, 2018=0.0, 2017=25.0},
RDF={2016=0.0, 2018=80.0, 2017=0.0},
XYZ={2016=0.0, 2018=0.0, 2017=40.0}}
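If the front end prefers the second, flat list format, the padded map can be flattened afterwards. A hedged sketch, where AvgEntry is a hypothetical result holder and not part of the answer above:
class AvgEntry {
    final String time;
    final String name;
    final double avgTimeTaken;

    AvgEntry(String time, String name, double avgTimeTaken) {
        this.time = time;
        this.name = name;
        this.avgTimeTaken = avgTimeTaken;
    }
}

List<AvgEntry> flat = resultMap.entrySet().stream()
        .flatMap(byName -> byName.getValue().entrySet().stream()
                .map(byYear -> new AvgEntry(byYear.getKey(), byName.getKey(), byYear.getValue())))
        .collect(Collectors.toList());
// e.g. entries for (2016, ABC, 25.0), (2018, ABC, 0.0), (2017, ABC, 25.0), (2016, RDF, 0.0), ...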

Does Nest.ConnectionSettings.SetJsonSerializerSettingsModifier even work?

Here is my question. Due to project needs, we have to keep the dates within our Elasticsearch index in a consistent format. What we've tried is the following:
var connectionPool = new SniffingConnectionPool(nodeList);
var connectionSettings = new ConnectionSettings(connectionPool)
.SetJsonSerializerSettingsModifier(
m => m.DateFormatString = "yyyy-MM-ddTHH:mm:ss.fffffffK")
// other configuration goes here
But it didn't work out. Searching through the ES index, I saw dates with dropped trailing zeros (like 2015-05-05T18:55:27Z instead of the expected 2015-05-05T18:55:27.0000000Z). Neither did the next option help:
var connectionPool = new SniffingConnectionPool(nodeList);
var connectionSettings = new ConnectionSettings(connectionPool)
.SetJsonSerializerSettingsModifier(m =>
{
m.Converters.Add(new IsoDateTimeConverter { DateTimeFormat = "yyyy'-'MM'-'dd'T'HH':'mm':'ss.fffffffK"});
})
// other configuration goes here
Digging into ElasticClient at run time, I found that there is a contract resolver which seems to override all those settings:
public class ElasticContractResolver : DefaultContractResolver
{
    protected override JsonContract CreateContract(Type objectType)
    {
        JsonContract contract = base.CreateContract(objectType);
        ...
        if (objectType == typeof(DateTime) || objectType == typeof(DateTime?))
            contract.Converter = new IsoDateTimeConverter();
        ...
        if (this.ConnectionSettings.ContractConverters.HasAny())
        {
            foreach (var c in this.ConnectionSettings.ContractConverters)
            {
                var converter = c(objectType);
                if (converter == null)
                    continue;
                contract.Converter = converter;
                break;
            }
        }
        return contract;
    }
}
So if I have it right, without specifying a converter explicitly (via ConnectionSettings.AddContractJsonConverters()), my JSON settings are lost, since IsoDateTimeConverter is instantiated with its default settings rather than the ones I passed through SetJsonSerializerSettingsModifier.
Has anyone run into this issue? Or I'm just missing something? Thanks in advance!
This is how I handled custom date format for my needs:
public class Document
{
    [ElasticProperty(DateFormat = "yyyy-MM-dd", Type = FieldType.Date)]
    public string CreatedDate { get; set; }
}

client.Index(new Document { CreatedDate = DateTime.Now.ToString("yyyy-MM-dd") });
My document in ES
{
"_index": "indexname",
"_type": "document",
"_id": "AU04kd4jnBKFIw7rP3gX",
"_score": 1,
"_source": {
"createdDate": "2015-05-09"
}
}
Hope it will help you.
