Streamsets Data Collector: Replace a Field With Its Child Value - etl

I have a data structure like this
{
"id": 926267,
"updated_sequence": 2304899,
"published_at": {
"unix": 1589574240,
"text": "2020-05-15 21:24:00 +0100",
"iso_8601": "2020-05-15T20:24:00Z"
},
"updated_at": {
"unix": 1589574438,
"text": "2020-05-15 21:27:18 +0100",
"iso_8601": "2020-05-15T20:27:18Z"
}
}
I want to replace the updated_at field with the value of its unix field using StreamSets Data Collector. As far as I know, it can be done with Field Replacer, but I still don't understand how to write the mapping expression. How can I achieve that?

In Field Replacer, set Fields to /rec/updated_at and New value to ${record:value('/rec/updated_at/unix')}, and it will replace the field with the value of its unix child.
Cheers,
Dash
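The effect of that setting can be pictured outside StreamSets with a few lines of plain Python. This is only a sketch of the resulting record shape, not Data Collector code: the helper name replace_with_child is made up for illustration, and the record is the one from the question (trimmed, and without the /rec wrapper).

```python
import copy


def replace_with_child(record, field, child):
    """Return a copy of record in which record[field] becomes record[field][child].

    Hypothetical helper: it only mimics the outcome of pointing Field Replacer
    at a field and using the value of one of its children as the new value.
    """
    result = copy.deepcopy(record)
    result[field] = result[field][child]
    return result


record = {
    "id": 926267,
    "updated_sequence": 2304899,
    "published_at": {"unix": 1589574240, "iso_8601": "2020-05-15T20:24:00Z"},
    "updated_at": {"unix": 1589574438, "iso_8601": "2020-05-15T20:27:18Z"},
}

replaced = replace_with_child(record, "updated_at", "unix")
print(replaced["updated_at"])
```

After the call, updated_at holds the plain number while published_at keeps its nested structure.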

Related

Splitting a json array format with same fields name

Currently I have a JSON array whose elements all share the same fields. What I want is to split this data into independent fields, with each field name taken from the "name" field.
events.parameters (this is the field name of the JSON array)
{
"name": "USER_EMAIL",
"value": "dummy#yahoo.com"
},
{
"name": "DEVICE_ID",
"value": "Wdk39Iw-akOsiwkaALw"
},
{
"name": "SERIAL_NUMBER",
"value": "9KJUIHG"
}
expected output:
events.parameters.USER_EMAIL : dummy#yahoo.com
events.parameters.DEVICE_ID: Wdk39Iw-akOsiwkaALw
events.parameters.SERIAL_NUMBER : 9KJUIHG
Thanks.
TL;DR:
There is no filter that does exactly what you are looking for.
You will have to use the ruby filter.
I just fixed the problem. For anyone wondering, here's my Ruby script:
if [events][parameters] {
  ruby {
    code => '
      # Copy each name/value pair into its own field named after "name"
      event.get("[events][parameters]").each { |a|
        name = a["name"]
        value = a["value"]
        event.set("[events][parameters_split][#{name}]", value)
      }
    '
  }
}
The output was just what I wanted.
Cheers!
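For readers who want to see the transformation without Logstash, the same reshaping can be sketched in plain Python. It assumes every array element has exactly a "name" and a "value" key, as in the question; the function name split_parameters is only illustrative.

```python
def split_parameters(parameters):
    """Turn [{"name": N, "value": V}, ...] into {N: V, ...}."""
    return {item["name"]: item["value"] for item in parameters}


parameters = [
    {"name": "USER_EMAIL", "value": "dummy#yahoo.com"},
    {"name": "DEVICE_ID", "value": "Wdk39Iw-akOsiwkaALw"},
    {"name": "SERIAL_NUMBER", "value": "9KJUIHG"},
]

parameters_split = split_parameters(parameters)
for name, value in parameters_split.items():
    print(f"events.parameters_split.{name}: {value}")
```

If two elements share the same name, the later one wins in this sketch, which matches what repeated event.set calls on the same field would do.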

How do I use FreeFormTextRecordSetWriter

In my NiFi controller I want to configure the FreeFormTextRecordSetWriter, but I have no idea what I should put in the "Text" field. I'm getting the text from my source (in my case GetSolr) and just want to write it out, period.
The documentation and the mailing list do not seem to explain how this is done; any help is appreciated.
EDIT: Here is the sample input and the output I want to achieve (as you can see, no transformation is needed: plain text, no JSON input).
EDIT: I now realize that I can't tell GetSolr to return just CSV data; I have to use JSON.
So referencing fields as attributes seems to be fine. What the documentation omits is that the ${flowFile} attribute should contain the complete flowfile that is returned.
Sample input:
{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 0,
"params": {
"q": "*:*",
"_": "1553686715465"
}
},
"response": {
"numFound": 3194,
"start": 0,
"docs": [
{
"id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}",
"MimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"FileName": "Test.docx",
"DateLastModified": "2019-03-27T08:05:00.103Z",
"_version_": 1629145864291221504,
"LAST_UPDATE": "2019-03-27T08:16:08.451Z"
}
]
}
}
Wanted output
{402EBE69-0000-CD1D-8FFF-D07756271B4E}
BTW: The documentation says this:
The text to use when writing the results. This property will evaluate the Expression Language using any of the fields available in a Record.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
I want to use my source's text, so I'm confused
You need to use expression language as if the record's fields are the FlowFile's attributes.
Example:
Input:
{
"t1": "test",
"t2": "ttt",
"hello": true,
"testN": 1
}
Text property in FreeFormTextRecordSetWriter:
${t1} k!${t2} ${hello}:boolean
${testN}Num
Output(using ConvertRecord):
test k!ttt true:boolean
1Num
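The substitution in the example above can be imitated in plain Python to make the idea concrete. This is only a sketch of "record fields behave like attributes in the template"; it uses Python's string.Template, which happens to share the ${name} syntax, and is not NiFi's Expression Language. The helper names to_text and render are made up for illustration.

```python
from string import Template


def to_text(value):
    """Render a JSON value as text; booleans become true/false as in JSON."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def render(text, record):
    """Fill every ${field} placeholder in text from the record's fields."""
    fields = {name: to_text(value) for name, value in record.items()}
    return Template(text).substitute(fields)


record = {"t1": "test", "t2": "ttt", "hello": True, "testN": 1}
text = "${t1} k!${t2} ${hello}:boolean\n${testN}Num"

print(render(text, record))
```

Running it prints the same two lines as the ConvertRecord output shown above.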
EDIT:
It seems that what you need is to read from Solr and write a single-column CSV. For that you need CSVRecordSetWriter.
Consider upgrading to 1.9.1: starting from 1.9.0, the schema can be inferred for you.
Otherwise, set Schema Access Strategy to Use 'Schema Text' Property,
then use the following schema in Schema Text:
{
"name": "MyClass",
"type": "record",
"namespace": "com.acme.avro",
"fields": [
{
"name": "id",
"type": "string"
}
]
}
this should work
I'll edit it into my answer. If it works for you, please choose my answer :)
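To make the target concrete, here is a plain-Python sketch of "take the Solr JSON response and write one CSV column holding each document's id". It is not NiFi configuration; the function name ids_to_csv is illustrative, and the input is a trimmed copy of the sample response from the question. Note that the id is a brace-wrapped GUID, so it is text rather than a number.

```python
import csv
import io
import json


def ids_to_csv(solr_json):
    """Return a single-column CSV with the id of every doc in the response."""
    response = json.loads(solr_json)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for doc in response["response"]["docs"]:
        writer.writerow([doc["id"]])
    return out.getvalue()


solr_json = """
{
  "response": {
    "numFound": 3194,
    "start": 0,
    "docs": [
      {"id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}", "FileName": "Test.docx"}
    ]
  }
}
"""

print(ids_to_csv(solr_json), end="")
```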

Elasticsearch - ignore fields from indexing document

I have a simple question. I'm indexing JSON files that contain several fields into Elasticsearch. How can I tell Elasticsearch to ignore (not index or store at all) a defined group of fields from the incoming file, or to work only with a defined group of fields? Is this possible using the mapping?
For example: I have JSON files like this:
{
"id": 123456789,
"name": "Name value",
"screenName": "Nick name",
"location": "Location value",
"description": "Description text",
"url": "url value",
"anotherField": 456789,
"status": null,
"anotherField2": "9AE4E8",
"color": "333333",
.......
}
Now I want Elasticsearch to work only with certain fields (for example "id;name;description;location;status;url") and ignore the others.
Any help? Thanks.
When you serialize the data to JSON from a DTO (POJO), you can mark the fields you don't want to index with the @JsonIgnore annotation, provided you are using the Jackson serializer/deserializer.
E.g.:
@JsonIgnore
private Date creationDate;
The annotation comes from the package com.fasterxml.jackson.annotation.
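The same idea, dropping unwanted fields on the client before the document is ever sent to Elasticsearch, can be sketched in Python with a whitelist. The field list comes from the question; the helper name keep_allowed and the sample values are only illustrative, and no Elasticsearch client call is shown.

```python
ALLOWED_FIELDS = ("id", "name", "description", "location", "status", "url")


def keep_allowed(document, allowed=ALLOWED_FIELDS):
    """Return a copy of document containing only the whitelisted fields."""
    return {key: document[key] for key in allowed if key in document}


document = {
    "id": 123456789,
    "name": "Name value",
    "screenName": "Nick name",
    "location": "Location value",
    "description": "Description text",
    "url": "url value",
    "anotherField": 456789,
    "status": None,
    "anotherField2": "9AE4E8",
    "color": "333333",
}

print(keep_allowed(document))
```

Fields whose value is null, such as status here, are kept because the filter looks only at the key.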

Which is the better design for this API response

I'm trying to decide on the best response format for my API. I need to return a reports response which provides information on the report itself and the fields it contains. Fields can be of differing types, for example SelectList, TextArea, Location, etc.
They each use different properties, so "SelectList" might use "Value" to store its string value and "Location" might use "ChildItems" to hold "Longitude" "Latitude" etc.
Here's what I mean:
"ReportList": [
  {
    "Fields": [
      {
        "Id": {},
        "Label": "",
        "Value": "",
        "FieldType": "",
        "FieldBankFieldId": {},
        "ChildItems": [
          {
            "Item": "",
            "Value": ""
          }
        ]
      }
    ]
  }
]
The problem with this is that I'm expecting users to know when a value is supposed to be null. So I'm expecting a person who wants the value of "Location" to extract it from "ChildItems" and not from "Value". The benefit, however, is that it's much easier to query than the alternative, which is the following:
"ReportList": [
  {
    "Fields": [
      {
        "SelectList": [
          {
            "Id": {},
            "Label": "",
            "Value": ""
          }
        ],
        "Location": [
          {
            "Id": {},
            "Label": "",
            "Latitude": "",
            "Longitude": "",
            "etc": ""
          }
        ]
      }
    ]
  }
]
So this one is a list of reports, each containing a list of fields, and each field in turn contains one list per field type I have (15 or so). This is opposed to just having a list of reports with a list of fields that carry a "FieldType" enum, which I think is fairly easy to manipulate.
So the Question: Which format is best for a response? Any alternatives and comments appreciated.
EDIT:
To query all fields by fieldtype in a report and get values with the first way it would go something like this:
foreach (var field in fields)
{
    switch (field.FieldType)
    {
        case FieldType.Location:
            // Location keeps its data in ChildItems
            var locationValue = field.ChildItems;
            break;
        case FieldType.SelectList:
            // SelectList keeps its data in Value
            var selectListValue = field.Value;
            break;
    }
}
The second one would be like:
foreach (var field in fields)
{
    foreach (var location in field.Locations)
    {
        var latitude = location.Latitude;
    }
    foreach (var selectList in field.SelectLists)
    {
        var value = selectList.Value;
    }
}
I think the right answer is the first one, with the switch statement. It makes it easier to query for things like "get me the value of the field with this GUID as its id". It just means putting it through a big switch statement.
I went with the first one because it's easier to query for the most common use case. I'll expect the client code to map it into its own schema if they want to change it.
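A client consuming the first format might look roughly like the Python sketch below: find a field by id, then dispatch on its FieldType to decide where the value lives. The ids ("f1", "f2"), labels, and values are hypothetical sample data, and only two of the field types are modeled.

```python
from enum import Enum


class FieldType(Enum):
    SELECT_LIST = "SelectList"
    LOCATION = "Location"


def field_value(field):
    """Pick the property that carries the value for this field's type."""
    field_type = FieldType(field["FieldType"])
    if field_type is FieldType.LOCATION:
        # Location stores its parts (Latitude, Longitude, ...) in ChildItems.
        return {item["Item"]: item["Value"] for item in field["ChildItems"]}
    if field_type is FieldType.SELECT_LIST:
        # SelectList stores a plain string in Value.
        return field["Value"]
    raise ValueError(f"unhandled field type: {field_type}")


def find_value(fields, field_id):
    """Return the value of the field with the given id, or None if absent."""
    for field in fields:
        if field["Id"] == field_id:
            return field_value(field)
    return None


fields = [
    {"Id": "f1", "Label": "Colour", "FieldType": "SelectList",
     "Value": "Red", "ChildItems": []},
    {"Id": "f2", "Label": "Site", "FieldType": "Location", "Value": None,
     "ChildItems": [{"Item": "Latitude", "Value": "51.5"},
                    {"Item": "Longitude", "Value": "-0.1"}]},
]

print(find_value(fields, "f1"))
print(find_value(fields, "f2"))
```

The lookup by id is a single loop over one flat list, which is the "most common use case" argument made above; the cost is the per-type branching inside field_value.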

How to remove a key from a RethinkDB document?

I'm trying to remove a key from a RethinkDB document.
My approaches (which didn't work):
r.db('db').table('user').replace(function(row){delete row["key"]; return row})
Other approach:
r.db('db').table('user').update({key: null})
This one just sets row.key = null (which looks reasonable).
Examples tested on rethinkdb data explorer through web UI.
Here's the relevant example from the documentation on RethinkDB's website: http://rethinkdb.com/docs/cookbook/python/#removing-a-field-from-a-document
To remove a field from all documents in a table, you need to use replace to update the document to not include the desired field (using without):
r.db('db').table('user').replace(r.row.without('key'))
To remove the field from one specific document in the table:
r.db('db').table('user').get('id').replace(r.row.without('key'))
You can change the selection of documents to update by using any of the selectors in the API (http://rethinkdb.com/api/), e.g. db, table, get, get_all, between, filter.
You can use replace with without:
r.db('db').table('user').replace(r.row.without('key'))
You do not need to use replace to update the entire document.
Here is the relevant documentation: ReQL command: literal
Assume your user document looks like this:
{
"id": 1,
"name": "Alice",
"data": {
"age": 19,
"city": "Dallas",
"job": "Engineer"
}
}
And you want to remove city from the data property. Normally, update will just merge your new data with the old data. r.literal can be used to treat the data object as a single unit.
r.table('users').get(1).update({ data: r.literal({ age: 19, job: 'Engineer' }) }).run(conn, callback)
// Result passed to callback
{
"id": 1,
"name": "Alice",
"data": {
"age": 19,
"job": "Engineer"
}
}
or
r.table('users').get(1).update({ data: { city: r.literal() } }).run(conn, callback)
// Result passed to callback
{
"id": 1,
"name": "Alice",
"data": {
"age": 19,
"job": "Engineer"
}
}
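The difference between a plain update and one that uses r.literal can be modeled in a few lines of plain Python. This is a toy model of the merge behaviour described above, not the RethinkDB driver: the Literal class and the update function are stand-ins written for illustration.

```python
_MISSING = object()


class Literal:
    """Stand-in for r.literal: replace the old value instead of merging.

    Literal() with no argument marks the key for removal.
    """

    def __init__(self, value=_MISSING):
        self.value = value


def update(old, new):
    """Merge new into old recursively, honouring Literal markers."""
    result = dict(old)
    for key, value in new.items():
        if isinstance(value, Literal):
            if value.value is _MISSING:
                result.pop(key, None)      # Literal(): drop the key
            else:
                result[key] = value.value  # Literal(x): replace wholesale
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = update(result[key], value)  # nested objects merge
        else:
            result[key] = value
    return result


user = {"id": 1, "name": "Alice",
        "data": {"age": 19, "city": "Dallas", "job": "Engineer"}}

merged = update(user, {"data": {"age": 19, "job": "Engineer"}})
replaced = update(user, {"data": Literal({"age": 19, "job": "Engineer"})})
removed = update(user, {"data": {"city": Literal()}})

print(merged["data"])
print(replaced["data"])
print(removed["data"])
```

In this model the plain merge leaves city in place, while both Literal variants end up with a data object that has only age and job.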

Resources