How can I use Pig scripts to generate nested Avro fields? - hadoop

I am new to Pig. My input data is in the following format:
Record 1:
{
label:int,
id: long
},
Record 2:
{
...
}
...
And what I want as output is:
Record 1:
{
data:{
label:int,
id:long
}
},
Record 2:
{
...
}
...
I tried:
result = FOREACH input GENERATE (id, label) AS data;
but this results in a nested tuple structure that looks like this:
Record 1:
{
data:{
TUPLE_1:{
label:int,
id: long
}
}
}
How can I get rid of the extra "TUPLE_1" level? It looks like I'm missing a trivial setting.

You probably need to specify a schema when you STORE the data.
If you use org.apache.pig.piggybank.storage.avro.AvroStorage, it can take a schema definition as a parameter.
result = FOREACH input GENERATE label, id;
STORE result INTO 'result.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema', '{"type": "record","name": "data","fields": [{"name": "label","type": "int"},{"name": "id", "type": "long"}]}');
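If what you actually want is the nested "data" record from the question rather than this flat record, one possible approach (a rough sketch, not tested; the outer and inner record names are placeholders) is to generate "data" as a tuple and declare it as a nested record in the schema, since AvroStorage writes Pig tuples as Avro records:
result = FOREACH input GENERATE TOTUPLE(label, id) AS data:tuple(label:int, id:long);
STORE result INTO 'result.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema',
'{"type": "record", "name": "Outer", "fields": [
  {"name": "data", "type": {"type": "record", "name": "Data", "fields": [
    {"name": "label", "type": "int"},
    {"name": "id", "type": "long"}
  ]}}
]}');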

My final solution looks like this:
First of all, to create an Avro file with a certain schema, I make sure I pass the schema to AvroStorage like this:
STORE sth INTO 'someplace' USING AvroStorage('schema','
{
### AN AVRO SCHEMA JSON STRING ###
}
');
I found that this kind of indentation really helps keep the schema definition clean. I also need to make sure I escape all special characters, especially quotes (they may exist in "doc" strings, which is tricky).
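For example, because the whole schema is passed as a single-quoted Pig string, a single quote inside a "doc" value has to be escaped with a backslash so it does not terminate the string literal (the field below is made up, just to show the escaping):
{"name": "label", "type": "string", "doc": "the item\'s label"}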
Then, to make sure sth has the correct Pig structure to be stored, I need to construct the entire data structure properly. One good trick is to use DESCRIBE if there are already some examples of the target data files. For my original question, the code looks like this:
in = LOAD '$INPUT_PATHS' USING AvroStorage();
in = FOREACH in GENERATE foo.label AS label, bar.id AS id;
out = FOREACH in GENERATE TOMAP('id', (long)id, 'label', (chararray)label) AS data;
RMF $OUTPUT_PATH;
STORE out INTO '$OUTPUT_PATH' USING AvroStorage('schema', '
{
"type": "records",
"name": "XXItem",
"namespace": "com.xxx.xxx",
"fields": [
{
"name": "data",
"type": {"type": "map", "values" : ["string", "long", "int"]}
}
]
}
');

Related

fetch attribute type from terraform provider schema

I am trying to find a way to fetch the attribute type of a resource/data_source from a Terraform provider's schema (I am currently using GCP, but will be extending to pretty much all providers).
My current setup flow:
1) I run terraform providers schema -json to fetch the provider's schema
2) This generates a huge JSON file with the schema structure of the provider
ref:
How to get that list of all the available Terraform Resource types for Google Cloud?
https://developer.hashicorp.com/terraform/cli/commands/providers/schema
3) From this I am trying to fetch the type of each attribute, e.g.:
"google_cloud_run_service": {
"version": 1,
"block": {
"attributes": {
"autogenerate_revision_name": {
"type": "bool",
"description_kind": "plain",
"optional": true
},
4) My end goal is to generate variables.tf from the above schema for all resources and all attributes supported in that resource along with the type constraint
ref: https://developer.hashicorp.com/terraform/language/values/variables
5) I already got some help on how to generate that
ref: Get the type of value using cty in hclwrite
6) Now the challenge is to handle complex structures like the one below.
The following is one of the attributes of "google_cloud_run_service":
"status": {
"type": [
"list",
[
"object",
{
"conditions": [
"list",
[
"object",
{
"message": "string",
"reason": "string",
"status": "string",
"type": "string"
}
]
],
"latest_created_revision_name": "string",
"latest_ready_revision_name": "string",
"observed_generation": "number",
"url": "string"
}
]
],
"description": "The current status of the Service.",
"description_kind": "plain",
"computed": true
}
7) So based on the above complex structure type, I want to generate the variables.tf file for this kind of attribute using the code sample from point #5, and the desired output should look something like the following in variables.tf:
variable "autogenerate_revision_name" {
type = string
default = ""
description = "Sample description"
}
variable "status" {
type = list(object({
conditions = list(object({
"message" = string
"reason" = string
"status" = string
" type" = string
}))
"latest_created_revision_name" = string
"latest_ready_revision_name" = string
"observed_generation" = number
"url" = string
}))
default = "default values in the above type format"
}
The above was manually written, so it might not exactly align with the schema, but I hope it makes clear what I am trying to achieve.
The first variable in the above code is from the example I gave in point #3, which is easy to generate, but the second example, from point #6, is a complex type constraint, and I am seeking help to get that generated.
Is it possible to generate this using the helper schema SDK (https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2#v2.24.0/helper/schema), along with the code example given in point #5?
Summary: I am generating the JSON schema of a Terraform provider using terraform providers schema -json, reading that JSON file, and generating HCL code for each resource, but I am stuck on generating type constraints for the attributes/variables, hence I am seeking help on this.
Any sort of help is really appreciated, as I have been stuck on this for quite a while.
If you've come this far, thank you for reading such a lengthy question; any pointers are welcome.

Match keys with sibling object JSONATA

I have a JSON object with the structure below. When looping over key_two I want to create a new object that I will return. The returned object should contain a title with the value from key_one's name where the id of key_one matches the currently looped-over node from key_two.
Both objects contain other keys that will also be included, but the first step I can't figure out is how to grab data from a sibling object while looping and match it to the current value.
{
"key_one": [
{
"name": "some_cool_title",
"id": "value_one",
...
}
],
"key_two": [
{
"node": "value_one",
...
}
],
}
This is a good example of a 'join' operation (in SQL terms). JSONata supports this in a path expression. See https://docs.jsonata.org/path-operators#-context-variable-binding
So in your example, you could write:
key_one@$k1.key_two[node = $k1.id].{
"title": $k1.name
}
You can then add extra fields into the resulting object by referencing items from either of the original objects. E.g.:
key_one@$k1.key_two[node = $k1.id].{
"title": $k1.name,
"other_one": $k1.other_data,
"other_two": other_data
}
See https://try.jsonata.org/--2aRZvSL
I seem to have found a solution for this.
[key_two].$filter($$.key_one, function($v, $k){
$v.id = node
}).{"title": name ? name : id}
Gives:
[
{
"title": "value_one"
},
{
"title": "value_two"
},
{
"title": "value_three"
}
]
Leaving this here in case someone has a similar issue in the future.

How do I use FreeFormTextRecordSetWriter

In my NiFi controller I want to configure the FreeFormTextRecordSetWriter, but I have no idea what I should put in the "Text" field. I'm getting the text from my source (in my case GetSolr), and just want to write this, period.
The documentation and mailing list do not seem to tell me how this is done; any help is appreciated.
EDIT: Here is the sample input plus the output I want to achieve (as you can see: no transformation needed, plain text, no JSON input).
EDIT: I now realize that I can't tell GetSolr to return just CSV data - I have to use JSON.
So referencing via attributes seems to be fine. What the documentation omits is that the ${flowFile} attribute should contain the complete FlowFile that is returned.
Sample input:
{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 0,
"params": {
"q": "*:*",
"_": "1553686715465"
}
},
"response": {
"numFound": 3194,
"start": 0,
"docs": [
{
"id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}",
"MimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"FileName": "Test.docx",
"DateLastModified": "2019-03-27T08:05:00.103Z",
"_version_": 1629145864291221504,
"LAST_UPDATE": "2019-03-27T08:16:08.451Z"
}
]
}
}
Wanted output:
{402EBE69-0000-CD1D-8FFF-D07756271B4E}
BTW: The documentation says this:
The text to use when writing the results. This property will evaluate the Expression Language using any of the fields available in a Record.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
I want to use my source's text, so I'm confused.
You need to use Expression Language as if the record's fields were the FlowFile's attributes.
Example:
Input:
{
"t1": "test",
"t2": "ttt",
"hello": true,
"testN": 1
}
Text property in FreeFormTextRecordSetWriter:
${t1} k!${t2} ${hello}:boolean
${testN}Num
Output (using ConvertRecord):
test k!ttt true:boolean
1Num
EDIT:
It seems like what you need is to read from Solr and write a single-column CSV. You need to use CSVRecordSetWriter. To get the schema right:
I would suggest you consider upgrading to 1.9.1. Starting from 1.9.0, the schema can be inferred for you.
Otherwise, you can set Schema Access Strategy to Use 'Schema Text' Property,
then use the following schema in Schema Text:
{
"name": "MyClass",
"type": "record",
"namespace": "com.acme.avro",
"fields": [
{
"name": "id",
"type": "int"
}
]
}
This should work.
I'll edit it into my answer. If it works for you, please choose my answer :)

Try to understand normalizr's schema.entity VS Array and Object

All:
I am trying to understand the relationship between Entity, Array, and Object:
Are they just different formats to describe different structures of data, or is Entity quite different from the other two?
The normalized data result has a structure like {result:, entities:}. Are only data structures defined with schema.Entity put inside entities, or can schema.Array and Object go there as well? When I define a schema using only Object and Array, it seems nothing is put into entities, and I am not sure whether that is my schema definition's fault or just how normalizr works.
If only data defined with schema.Entity() can be put into entities, then how can I put a data array into it, something like {0:.., 1:.., 2:..}?
For example, I have data like:
var data = [
{
id:"0",
items:[
{
id: "0",
data: {name:"data-0-0"}
},
{
id: "1",
data: {name:"data-0-1"}
}
]
},
{
id:"1",
items:[
{
id: "0",
data: {name:"data-1-0"}
},
{
id: "1",
data: {name:"data-1-1"}
}
]
}
]
const normalizedData = normalize(data, [{items:[{data:{}}]}]);
And the normalized data is like:
{
"entities": {},
"result": {
"0": {
"id": "0",
"items": [
{
"id": "0",
"data": {
"name": "data-1-0"
}
}
]
}
}
}
Thanks
Question: Are they just different formats to describe different structures of data, or is Entity quite different from the other two?
Answer: Yes. An Entity is a singular object that has a unique identifier associated with it. Array and Object are more generic structures that can't be uniquely identified. In your case, it looks like you only need to use Array and Entity for the data you're describing.
Question: Are only the data structures defined with schema.Entity put inside entities?
Answer: Yes.

Changing bags into arrays in Pig Latin

I'm doing some transformations on a data set and need to publish it in a sane-looking format. Currently my final set looks like this when I run DESCRIBE:
{memberId: long,companyIds: {(subsidiary: long)}}
I need it to look like this:
{memberId: long,companyIds: [long] }
where companyIds is the key to an array of ids of type long.
I'm really struggling with how to manipulate things in this way. Any ideas? I've tried using FLATTEN and other commands to no avail. I'm using AvroStorage to write the files with this schema:
The field schema I need to write this data to looks like this:
"fields": [
{ "name": "memberId", "type": "long"},
{ "name": "companyIds", "type": {"type": "array", "items": "int"}}
]
There is no array type in Pig (http://pig.apache.org/docs/r0.10.0/basic.html#data-types). However, if all you need is a good-looking output and you don't have too many elements in companyIds, you may want to write a simple UDF that converts the bag into a nicely formatted string.
Java code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.commons.lang.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class BagToString extends EvalFunc<String>
{
@Override
public String exec(Tuple input) throws IOException
{
List<String> strings = new ArrayList<String>();
DataBag bag = (DataBag) input.get(0);
if (bag.size() == 0) {
return null;
}
for (Iterator<Tuple> it = bag.iterator(); it.hasNext();) {
Tuple t = it.next();
strings.add(t.get(0).toString());
}
return StringUtils.join(strings, ":");
}
}
Pig script
foo = foreach bar generate memberId, BagToString(companyIds);
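Note that the UDF has to be compiled into a jar and registered before the FOREACH above can call it; a minimal sketch (the jar name and package are placeholders):
REGISTER bag-to-string-udf.jar;
DEFINE BagToString com.example.pig.BagToString();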
I know this is a bit old, but I recently ran into the same problem.
Based on the AvroStorage documentation, using the latest versions of Pig and AvroStorage, it is possible to directly cast a bag to an Avro array.
In your case, you may want something like:
STORE blah INTO 'blah' USING AvroStorage('schema','{your schema}');
where the array field in the schema is
{
"name":"companyIds",
"type":[
"null",
{
"type":"array",
"items":"long"
}
],
"doc":"company ids"
}
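Putting it together with the record from the question, the full 'schema' argument might look something like this (just a sketch based on the DESCRIBE output above; the record name is made up and this is untested):
STORE blah INTO 'blah' USING AvroStorage('schema','
{
"type": "record",
"name": "Member",
"fields": [
{"name": "memberId", "type": "long"},
{"name": "companyIds", "type": ["null", {"type": "array", "items": "long"}], "doc": "company ids"}
]
}
');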
