How to convert my JsonObject (com.google.gson.JsonObject) to GenericRecord (org.apache.avro.generic.GenericRecord) type - gson

We are creating a Dataflow pipeline which receives JSON and writes it to a Parquet file. We are using the org.apache.beam.sdk.io.parquet package to write the file; ParquetIO.Sink allows you to write a PCollection of GenericRecord into a Parquet file (from here https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). Now we need to know how to convert a JsonObject (with a complex structure) to a GenericRecord.
We tried to build the GenericRecord using GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder), with the JsonObject coming from com.google.gson.JsonObject, but we got stuck on how to generate a GenericRecord for a JsonArray of objects.
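For context, the Parquet write itself follows the pattern from the ParquetIO Javadoc linked above; a rough sketch (not our exact code, the output path is hypothetical) looks like this:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

// Sketch: given a PCollection<GenericRecord>, write it out as Parquet files.
static void writeParquet(PCollection<GenericRecord> records, Schema schema) {
    records.apply(FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))       // Avro schema, e.g. the one shown below
            .to("gs://my-bucket/events/")      // hypothetical output location
            .withSuffix(".parquet"));
}

The open question is how to produce those GenericRecord values from the JSON below.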
Our sample JSON:
{
"event_name": "added_to_cart",
"event_id": "AMKL9877",
"attributes": [
{"key": "total", "value": "8982", "type": "double"},
{"key": "order_id", "value": "AKM1011", "type": "string"}
]
}
Our schema
{
"type":"record",
"name":"event",
"fields":[
{
"name":"event_name",
"type":"string"
},
{
"name":"event_id",
"type":"string"
},
{
"name":"attributes",
"type":{
"type":"array",
"items":{
"type":"record",
"name":"attribute_data",
"fields":[
{
"name":"key",
"type":"string"
},
{
"name":"value",
"type":"string"
},
{
"name":"type",
"type":"string"
}
]
}
}
}
]
}
Our code to convert the JsonObject to a GenericRecord using GenericRecordBuilder:
JsonObject event = element.getAsJsonObject();
GenericRecordBuilder recordBuilder = new GenericRecordBuilder(SCHEMA);
for (Schema.Field field : SCHEMA.getFields()) {
    System.out.println(field);
    String at_header = field.getProp(FIELD_AT_HEADER_PROPERTY);
    System.out.println(at_header);
    if (at_header != null && at_header.equals(Boolean.TRUE.toString())) {
        recordBuilder.set(field.name(), null);
    } else {
        JsonElement keyElement = event.get(field.name());
        recordBuilder.set(field.name(), getElementAsType(field.schema(), keyElement));
    }
}
return recordBuilder.build();
Object getElementAsType(Schema schema, JsonElement element) {
    if (element == null || element.isJsonNull())
        return null;
    switch (schema.getType()) {
        case BOOLEAN:
            return element.getAsBoolean();
        case DOUBLE:
            return element.getAsDouble();
        case FLOAT:
            return element.getAsFloat();
        case INT:
            return element.getAsInt();
        case LONG:
            return element.getAsLong();
        case NULL:
            return null;
        case ARRAY:
            // ??? this is where we are stuck
        case MAP:
            // ??? this is where we are stuck
        default:
            return element.getAsString();
    }
}
We need to know how to build a GenericRecord for complex types, like an array of objects or a map, from the JSON. Thanks in advance.

I found my answer on this page: https://avro.apache.org/docs/1.8.2/api/java/org/apache/avro/generic/package-summary.html
A generic representation for Avro data.
This representation is best for applications which deal with dynamic data, whose schemas are not known until runtime.
Avro schemas are mapped to Java types as follows:
Schema records are implemented as GenericRecord.
Schema enums are implemented as GenericEnumSymbol.
Schema arrays are implemented as Collection.
Schema maps are implemented as Map.
Schema fixed are implemented as GenericFixed.
Schema strings are implemented as CharSequence.
Schema bytes are implemented as ByteBuffer.
Schema ints are implemented as Integer.
Schema longs are implemented as Long.
Schema floats are implemented as Float.
Schema doubles are implemented as Double.
Schema booleans are implemented as Boolean.
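Following that mapping, here is a minimal sketch (my own illustration, not taken from the Avro docs) of how the missing branches of getElementAsType can build a Collection for ARRAY, a GenericRecord for RECORD, and a Map for MAP:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

Object getElementAsType(Schema schema, JsonElement element) {
    if (element == null || element.isJsonNull())
        return null;
    switch (schema.getType()) {
        case BOOLEAN:
            return element.getAsBoolean();
        case DOUBLE:
            return element.getAsDouble();
        case FLOAT:
            return element.getAsFloat();
        case INT:
            return element.getAsInt();
        case LONG:
            return element.getAsLong();
        case NULL:
            return null;
        case ARRAY: {
            // Avro arrays map to java.util.Collection: convert each JSON array element
            // recursively using the array's element schema.
            List<Object> items = new ArrayList<>();
            for (JsonElement item : element.getAsJsonArray()) {
                items.add(getElementAsType(schema.getElementType(), item));
            }
            return items;
        }
        case RECORD: {
            // Avro records map to GenericRecord: build one field by field.
            JsonObject obj = element.getAsJsonObject();
            GenericData.Record record = new GenericData.Record(schema);
            for (Schema.Field field : schema.getFields()) {
                record.put(field.name(), getElementAsType(field.schema(), obj.get(field.name())));
            }
            return record;
        }
        case MAP: {
            // Avro maps map to java.util.Map with string keys.
            Map<String, Object> map = new HashMap<>();
            for (Map.Entry<String, JsonElement> entry : element.getAsJsonObject().entrySet()) {
                map.put(entry.getKey(), getElementAsType(schema.getValueType(), entry.getValue()));
            }
            return map;
        }
        default:
            return element.getAsString();
    }
}

With this in place, the attributes element of the sample JSON becomes a List of GenericData.Record values, which the GenericRecordBuilder accepts for the array-of-record field in the schema above.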

Related

Jackson deserialization with Spring Boot: to get the field names present in the request along with the respective field mapping

I have a requirement to throw a different error in each of the scenarios below, and there are many such fields, not just one.
e.g.
{
"id": 1,
"name": "nameWithSpecialChar$"
}
Here it should throw an error for the special character.
{
"id": 1,
"name": null
}
Here it should throw a field-null error.
{
"id": 1
}
Here it should throw a field-missing error.
Handling the 1st and 2nd scenarios is easy, but for the 3rd one, is there any way to get a list of the field names that were present in the input JSON at deserialization time with Jackson?
One way I am able to do it is by mapping the request to a JsonNode, checking whether nodes are present for the required fields, then deserializing that JsonNode manually and validating the rest of the members, as below.
public ResponseEntity myGetRequest(@RequestBody JsonNode requestJsonNode) {
    if (!requestJsonNode.has("name")) {
        // throw the field-missing error here
    }
    MyRequest request = objectMapper.convertValue(requestJsonNode, MyRequest.class);
    validateIfFieldsAreInvalid(request);
    // ...
}
But I do not like this approach; is there any other way of doing it?
You can define a JSON schema and validate your object against it. In your example, your schema may look like this:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"id": {
"description": "The identifier",
"type": "integer"
},
"name": {
"description": "The item name",
"type": "string",
"pattern": "^[a-zA-Z]*$"
}
},
"required": [ "id", "name" ]
}
To validate your object, you could use the json-schema-validator library. This library is built on Jackson. Since you're using Spring Boot anyway, you already have Jackson imported.
The example code looks more or less like this:
String schema = "<define your schema here>";
String data = "<put your data here>";
JsonSchemaFactory factory = JsonSchemaFactory.byDefault();
ObjectMapper m = new ObjectMapper();
JsonSchema jsonSchema = factory.getJsonSchema(m.readTree(schema));
JsonNode json = m.readTree(data);
ProcessingReport report = jsonSchema.validate(json);
System.out.println(report);
The report includes detailed errors for different input cases. For example, with this input
{
"id": 1,
"name": "nameWithSpecialChar$"
}
this output is printed out
--- BEGIN MESSAGES ---
error: ECMA 262 regex "^[a-zA-Z]*$" does not match input string "nameWithSpecialChar$"
level: "error"
schema: {"loadingURI":"#","pointer":"/properties/name"}
instance: {"pointer":"/name"}
domain: "validation"
keyword: "pattern"
regex: "^[a-zA-Z]*$"
string: "nameWithSpecialChar$"
--- END MESSAGES ---
Or, instead of just printing out the report, you can loop through all the errors and apply your own specific logic:
for (ProcessingMessage message : report) {
// Add your logic here
}
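For example (a rough sketch of my own, continuing the snippet above; it assumes the library's ProcessingMessage.asJson() accessor, and FieldMissingException and InvalidFieldValueException are hypothetical application exceptions):

for (ProcessingMessage message : report) {
    JsonNode details = message.asJson();                   // same structure as printed in the report above
    String keyword = details.path("keyword").asText();
    if ("required".equals(keyword)) {
        // a required property such as "name" was missing from the input
        throw new FieldMissingException(message.getMessage());
    } else if ("pattern".equals(keyword)) {
        // the value did not match the regex, e.g. the special-character case
        throw new InvalidFieldValueException(message.getMessage());
    }
}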
You could check the example code to gain more information about how to use the library.

How to create a HashMap with custom object as a key?

In Elasticsearch, I have an object that contains an array of objects. Each object in the array has type, id, updateTime, and value fields.
My input parameter is an array that contains objects of the same type but different values and update times. I'd like to update the objects with the new value when they exist and create new ones when they don't.
I'd like to use a Painless script to update them while keeping them distinct, as some of them may overlap. The issue is that I need to use both type and id to keep them unique. So far I've done it with a brute-force approach, a nested for loop comparing the elements of both arrays, but I'm not too happy about that.
One idea is to take the array from the source, build a temporary HashMap for fast lookup, process the input, and later store all the objects back into the source.
Can I create a HashMap with a custom object (a class with type and id) as a key? If so, how do I do it? I can't add a class definition to the script.
Here's the mapping. All fields are 'disabled' as I use them only as intermediate state and query using other fields.
{
"properties": {
"arrayOfObjects": {
"properties": {
"typ": {
"enabled": false
},
"id": {
"enabled": false
},
"value": {
"enabled": false
},
"updated": {
"enabled": false
}
}
}
}
}
Example doc.
{
"arrayOfObjects": [
{
"typ": "a",
"id": "1",
"updated": "2020-01-02T10:10:10Z",
"value": "yes"
},
{
"typ": "a",
"id": "2",
"updated": "2020-01-02T11:11:11Z",
"value": "no"
},
{
"typ": "b",
"id": "1",
"updated": "2020-01-02T11:11:11Z"
}
]
}
And finally, part of the script in its current form. The script does some other things too, which I've stripped out for brevity.
if (ctx._source.arrayOfObjects == null) {
ctx._source.arrayOfObjects = new ArrayList();
}
for (obj in params.inputObjects) {
def found = false;
for (existingObj in ctx._source.arrayOfObjects) {
if (obj.typ == existingObj.typ && obj.id == existingObj.id && isAfter(obj.updated, existingObj.updated)) {
existingObj.updated = obj.updated;
existingObj.value = obj.value;
found = true;
break;
}
}
if (!found) {
ctx._source.arrayOfObjects.add([
"typ": obj.typ,
"id": obj.id,
"value": params.inputValue,
"updated": obj.updated
]);
}
}
There's technically nothing suboptimal about your approach.
A HashMap could potentially save some time, but since you're scripting you're already bound to its innate inefficiencies. By the way, here's how you can initialize and work with HashMaps in a script like yours.
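Since Painless does not let you define classes, a composite string key ("typ|id") is the usual stand-in for a custom key object. A minimal sketch of the lookup-map idea (my illustration, not the original answer's code, reusing the field names, params, and isAfter helper from your script; it assumes the null check for arrayOfObjects has already run):

// Build a lookup map keyed by "typ|id" so each input object is matched in O(1)
Map lookup = new HashMap();
for (existingObj in ctx._source.arrayOfObjects) {
  lookup.put(existingObj.typ + '|' + existingObj.id, existingObj);
}
for (obj in params.inputObjects) {
  def key = obj.typ + '|' + obj.id;
  def existingObj = lookup.get(key);
  if (existingObj == null) {
    // not present yet: add it to the source array and to the lookup map
    def newObj = ["typ": obj.typ, "id": obj.id, "value": params.inputValue, "updated": obj.updated];
    ctx._source.arrayOfObjects.add(newObj);
    lookup.put(key, newObj);
  } else if (isAfter(obj.updated, existingObj.updated)) {
    existingObj.updated = obj.updated;
    existingObj.value = obj.value;
  }
}

The '|' separator is only safe as long as it cannot appear in typ or id.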
Another approach would be to rethink your data structure -- instead of arrays of objects use keyed objects or similar. Arrays of objects aren't great for frequent updates.
Finally, a tip: you said these fields are only used to store some intermediate state. If that weren't the case (or won't be in the future), I'd recommend mapping the array as a nested field so the objects in the array can be queried independently of one another.

azure logic app with table storage get last rowKey

How can I use the "Get Entity for Azure table storage" connector in a Logic App to return the last rowKey?
This would be used in a situation where the rowKey is, say, an integer incremented each time a new entity is added. I recognize the flaw in this design, but this question is about how some sort of where clause or "last" condition could be used in the Logic App.
Currently the Logic App code view snippet looks like this:
"actions": {
"Get_entity": {
"inputs": {
"host": {
"connection": {
"name": "#parameters('$connections')['azuretables']['connectionId']"
}
},
"method": "get",
"path": "/Tables/#{encodeURIComponent('contactInfo')}/entities(PartitionKey='#{encodeURIComponent('a')}',RowKey='#{encodeURIComponent('b')}')"
},
"runAfter": {},
"type": "ApiConnection"
}
Where I have hard-coded:
RowKey='@{encodeURIComponent('b')}'
This is fine if I always want this rowKey. What I want, though, is the last rowKey, so something like:
RowKey = last(RowKey)
Any idea on how this can be achieved?
This is fine if I always want this rowKey. What I want, though, is the last rowKey, so something like: RowKey = last(RowKey)
AFAIK, there are no built-in functions to achieve this. I assume you could use the Azure Functions connector to retrieve the new RowKey value. Here are the detailed steps; you can refer to them:
For a test, I created a C# HTTP trigger function, added an Azure Table Storage input binding, retrieved all the items under the specific PartitionKey, ordered them by RowKey, and calculated the new RowKey.
function.json:
{
"bindings": [
{
"authLevel": "function",
"name": "req",
"type": "httpTrigger",
"direction": "in"
},
{
"name": "$return",
"type": "http",
"direction": "out"
},
{
"type": "table",
"name": "inputTable",
"tableName": "SampleTable",
"take": 50,
"connection": "AzureWebJobsDashboard",
"direction": "in"
}
],
"disabled": false
}
run.csx:
#r "Microsoft.WindowsAzure.Storage"
using Microsoft.WindowsAzure.Storage.Table;
using System.Net;
public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, IQueryable<SampleTable> inputTable,TraceWriter log)
{
log.Info("C# HTTP trigger function processed a request.");
// parse query parameter
string pk = req.GetQueryNameValuePairs()
.FirstOrDefault(q => string.Compare(q.Key, "pk", true) == 0)
.Value;
// Get request body
dynamic data = await req.Content.ReadAsAsync<object>();
// Set name to query string or body data
pk = pk ?? data?.pk;
if(pk==null)
return req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a pk on the query string or in the request body");
else
{
var latestItem=inputTable.Where(p => p.PartitionKey == pk).ToList().OrderByDescending(i=>Convert.ToInt32(i.RowKey)).FirstOrDefault();
if(latestItem==null)
return req.CreateResponse(HttpStatusCode.OK,new{newRowKey=1});
else
return req.CreateResponse(HttpStatusCode.OK,new{newRowKey=int.Parse(latestItem.RowKey)+1});
}
}
public class SampleTable : TableEntity
{
public long P1 { get; set; }
public long P2 { get; set; }
}
For more details about Azure Functions Storage table bindings, you could refer to here.
Azure Table storage entities are sorted lexicographically by row key. So choose a row key that actually decreases every time you add a new entity, i.e. if your row key is an integer that gets incremented when a new entity is created, then store the row key as Int.Max - entity.RowKey. The latest entity for that partition key will always be at the top, since it will have the lowest row key, so all you need to do to retrieve it is query with the partition key only and Take(1). This is called the log tail pattern, if you want to read more about it.
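To illustrate the arithmetic (my own sketch in Java, since the key computation is language-agnostic; the zero-padding keeps the lexicographic sort correct for keys of different lengths):

// Log tail pattern: store an inverted, zero-padded row key so that lexicographic
// order in Table storage puts the newest entity first.
int sequence = 42; // hypothetical incrementing counter for the new entity
String rowKey = String.format("%010d", Integer.MAX_VALUE - sequence);
// sequence 42 -> "2147483605"
// sequence 43 -> "2147483604"  (sorts before the previous key, so the newest entity is on top)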

How to merge split FlowFiles with data from Elasticsearch?

I have a problem with merging split FlowFiles. Let me explain the problem step by step.
This is my sequence of processors.
In Elasticsearch I have this index and mapping:
PUT /myindex
{
"mappings": {
"myentries": {
"_all": {
"enabled": false
},
"properties": {
"yid": {"type": "keyword"},
"days": {
"properties": {
"Type1": { "type": "date" },
"Type2": { "type": "date" }
}
},
"directions": {
"properties": {
"name": {"type": "keyword"},
"recorder": { "type": "keyword" },
"direction": { "type": "integer" }
}
}
}
}
}
}
I get directions from Elasticsearch using QueryElasticsearchHTTP and then split them using SplitJson in order to get 10 FlowFiles. Each FlowFile has this content: {"name": "X", "recorder": "X", "direction": "X"}
After this, for each of the 10 FlowFiles I generate a filename attribute using UpdateAttribute and ${UUID()}. Then I enrich each FlowFile with some constant data from Elasticsearch. In fact, the data that I merge into each FlowFile is the same, so ideally I would like to run "Get constants from Elastic" only once instead of 10 times.
But anyway, the key problem is different: the FlowFiles that come from "Get constants from Elastic" have different filename values, so they cannot be merged with the files that come from "Set the attribute filename". I also tried to use EvaluateJsonPath, but had the same problem. Any idea how to solve this?
UPDATE:
The Groovy code used in "Merge inputs...". I am not sure whether it works when batches of 10 and 10 files that should be merged arrive in the input queues:
import org.apache.nifi.processor.FlowFileFilter;
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
//get first flow file
def ff0 = session.get()
if(!ff0)return
def filename = ff0.getAttribute('filename')
//try to find files with same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
public FlowFileFilterResult filter(FlowFile ff) {
if( filename == ff.getAttribute('filename') )return FlowFileFilterResult.ACCEPT_AND_CONTINUE
return FlowFileFilterResult.REJECT_AND_CONTINUE
}
})
//let's assume you require at least one additional file in the queue with the same attribute
if( !ffList || ffList.size()<1 ){
session.rollback(false)
return
}
//let's put all in one list to simplify later iterations
ffList.add(ff0)
//create empty map (aka json object)
def json = [:]
//iterate through files parse and merge attributes
ffList.each{ff->
session.read(ff).withStream{rawIn->
def fjson = new JsonSlurper().parse(rawIn)
json.putAll(fjson)
}
}
//create new flow file and write merged json as a content
def ffOut = session.create()
ffOut = session.write(ffOut,{rawOut->
rawOut.withWriter("UTF-8"){writer->
new JsonBuilder(json).writeTo(writer)
}
} as OutputStreamCallback )
//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")
session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)

Changing bags into arrays in Pig Latin

I'm doing some transformations on a data set and need to publish it in a sane-looking format. Currently my final set looks like this when I run describe:
{memberId: long,companyIds: {(subsidiary: long)}}
I need it to look like this:
{memberId: long,companyIds: [long] }
where companyIds is the key to an array of IDs of type long.
I'm really struggling with how to manipulate things in this way. Any ideas? I've tried using FLATTEN and other commands to no avail. I'm using AvroStorage to write the files with this schema:
The field schema I need to write this data to looks like this:
"fields": [
{ "name": "memberId", "type": "long"},
{ "name": "companyIds", "type": {"type": "array", "items": "int"}}
]
There is no array type in Pig (http://pig.apache.org/docs/r0.10.0/basic.html#data-types). However, if all you need is good-looking output and you don't have too many elements in companyIds, you may want to write a simple UDF that converts the bag into a nicely formatted string.
Java code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.commons.lang.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class BagToString extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        List<String> strings = new ArrayList<String>();
        DataBag bag = (DataBag) input.get(0);
        if (bag.size() == 0) {
            return null;
        }
        for (Iterator<Tuple> it = bag.iterator(); it.hasNext();) {
            Tuple t = it.next();
            strings.add(t.get(0).toString());
        }
        return StringUtils.join(strings, ":");
    }
}
PIG script
foo = foreach bar generate memberId, BagToString(companyIds);
I know this is a bit old, but I recently ran into the same problem.
Based on the AvroStorage documentation, using the latest versions of Pig and AvroStorage, it is possible to directly cast a bag to an Avro array.
In your case, you may want something like:
STORE blah INTO 'blah' USING AvroStorage('schema','{your schema}');
where the array field in the schema is
{
"name":"companyIds",
"type":[
"null",
{
"type":"array",
"items":"long"
}
],
"doc":"company ids"
}
