Changing bags into arrays in Pig Latin - hadoop

I'm doing some transformations on a data set and need to publish it in a sane-looking format. Currently my final relation looks like this when I run DESCRIBE:
{memberId: long,companyIds: {(subsidiary: long)}}
I need it to look like this:
{memberId: long,companyIds: [long] }
where companyIds is the key to an array of ids of type long.
I'm really struggling with how to manipulate things in this way. Any ideas? I've tried using FLATTEN and other commands to no avail. I'm using AvroStorage to write the files.
The field schema I need to write this data to looks like this:
"fields": [
{ "name": "memberId", "type": "long"},
{ "name": "companyIds", "type": {"type": "array", "items": "int"}}
]

There is no array type in Pig (http://pig.apache.org/docs/r0.10.0/basic.html#data-types). However, if all you need is a good-looking output and you don't have too many elements in companyIds, you may want to write a simple UDF that converts the bag into a nicely formatted string.
Java code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class BagToString extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        List<String> strings = new ArrayList<String>();
        DataBag bag = (DataBag) input.get(0);
        if (bag.size() == 0) {
            return null;
        }
        // collect the first field of every tuple in the bag
        for (Iterator<Tuple> it = bag.iterator(); it.hasNext();) {
            Tuple t = it.next();
            strings.add(t.get(0).toString());
        }
        return StringUtils.join(strings, ":");
    }
}
Pig script
foo = foreach bar generate memberId, BagToString(companyIds);
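Note that before the UDF can be called from the script above, the jar containing it has to be registered and the class given a short alias. A minimal sketch, assuming the class is packaged into bagtostring.jar under the package com.example (both names are placeholders):

REGISTER 'bagtostring.jar';
DEFINE BagToString com.example.BagToString();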

I know this is a bit old, but I recently ran into the same problem.
Based on the AvroStorage documentation, with the latest versions of Pig and AvroStorage it is possible to directly cast a bag to an Avro array.
In your case, you may want something like:
STORE blah INTO 'blah' USING AvroStorage('schema','{your schema}');
where the array field in the schema is
{
  "name": "companyIds",
  "type": [
    "null",
    {
      "type": "array",
      "items": "long"
    }
  ],
  "doc": "company ids"
}
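Putting it together for the schema in the question, a full STORE might look roughly like this (relation name, output path, and record name are placeholders; this is a sketch assuming a recent AvroStorage version that supports the bag-to-array cast):

final_data = FOREACH joined GENERATE memberId, companyIds;

STORE final_data INTO '/output/members' USING AvroStorage('schema', '
{
  "type": "record",
  "name": "Member",
  "fields": [
    { "name": "memberId", "type": "long" },
    { "name": "companyIds", "type": ["null", {"type": "array", "items": "long"}], "doc": "company ids" }
  ]
}
');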

Related

Splitting a json array format with same fields name

Currently I have this kind of JSON array with the same fields; what I want is to split this data into independent fields, where each field name is based on the "name" field.
events.parameters (this is the field name of the JSON array)
{
  "name": "USER_EMAIL",
  "value": "dummy#yahoo.com"
},
{
  "name": "DEVICE_ID",
  "value": "Wdk39Iw-akOsiwkaALw"
},
{
  "name": "SERIAL_NUMBER",
  "value": "9KJUIHG"
}
expected output:
events.parameters.USER_EMAIL : dummy#yahoo.com
events.parameters.DEVICE_ID: Wdk39Iw-akOsiwkaALw
events.parameters.SERIAL_NUMBER : 9KJUIHG
Thanks.
TL;DR:
There is no filter that does exactly what you are looking for.
You will have to use the Logstash ruby filter.
I just fixed the problem; for everyone wondering, here's my ruby filter:
if [events][parameters] {
  ruby {
    code => '
      event.get("[events][parameters]").each { |a|
        name = a["name"]
        value = a["value"]
        event.set("[events][parameters_split][#{name}]", value)
      }
    '
  }
}
The output was just like what I wanted.
Cheers!

Jackson deserialization with Spring Boot: to get field names present in request along with respective field mapping

I have a requirement to throw a different error for each of the scenarios below, and there are many such fields, not just one.
e.g.
{
  "id": 1,
  "name": "nameWithSpecialChar$"
}
Here it should throw an error for the special character.
{
  "id": 1,
  "name": null
}
Here it should throw a null-field error.
{
  "id": 1
}
Here it should throw a missing-field error.
Handling the 1st and 2nd scenarios is easy, but for the 3rd one, is there any way to get a list of the names of the fields that were passed in the input JSON at deserialization time with Jackson?
One way I am able to do it is by mapping the request to a JsonNode, checking whether nodes are present for the required fields, then deserializing that JsonNode manually and validating the rest of the members, as below.
public ResponseEntity myGetRequest(@RequestBody JsonNode requestJsonNode) {
    if (!requestJsonNode.has("name")) {
        // throw some error
    }
    // objectMapper is an injected ObjectMapper instance
    MyRequest request = objectMapper.convertValue(requestJsonNode, MyRequest.class);
    validateIfFieldsAreInvalid(request);
    ...
}
But I do not like this approach; is there any other way of doing it?
You can define a JSON schema and validate your object against it. In your example, your schema may look like this:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "id": {
      "description": "The identifier",
      "type": "integer"
    },
    "name": {
      "description": "The item name",
      "type": "string",
      "pattern": "^[a-zA-Z]*$"
    }
  },
  "required": [ "id", "name" ]
}
To validate your object, you could use the json-schema-validator library. This library is built on Jackson. Since you're using Spring Boot anyway, you already have Jackson imported.
The example code looks more or less like this:
String schema = "<define your schema here>";
String data = "<put your data here>";
JsonSchemaFactory factory = JsonSchemaFactory.byDefault();
ObjectMapper m = new ObjectMapper();
JsonSchema jsonSchema = factory.getJsonSchema(m.readTree(schema));
JsonNode json = m.readTree(data);
ProcessingReport report = jsonSchema.validate(json);
System.out.println(report);
The report includes detailed errors for different input cases. For example, with this input
{
  "id": 1,
  "name": "nameWithSpecialChar$"
}
this output is printed out
--- BEGIN MESSAGES ---
error: ECMA 262 regex "^[a-zA-Z]*$" does not match input string "nameWithSpecialChar$"
level: "error"
schema: {"loadingURI":"#","pointer":"/properties/name"}
instance: {"pointer":"/name"}
domain: "validation"
keyword: "pattern"
regex: "^[a-zA-Z]*$"
string: "nameWithSpecialChar$"
--- END MESSAGES ---
Or, instead of just printing out the report, you can loop through all errors and apply your specific logic:
for (ProcessingMessage message : report) {
    // Add your logic here
}
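For instance, a minimal sketch of mapping the report back to the three scenarios in the question might look like the following; the keyword and instance properties come from the validator's JSON output shown above, while the exception types are placeholders:

for (ProcessingMessage message : report) {
    JsonNode details = message.asJson();
    String keyword = details.path("keyword").asText();
    if ("required".equals(keyword)) {
        // scenario 3: a required property such as "name" was not sent at all
        throw new MissingFieldException(details.path("missing").toString());
    } else if ("pattern".equals(keyword)) {
        // scenario 1: the value is present but contains disallowed characters
        throw new InvalidCharacterException(details.path("instance").path("pointer").asText());
    } else if ("type".equals(keyword)) {
        // scenario 2: the value is present but null (or of the wrong type)
        throw new NullFieldException(details.path("instance").path("pointer").asText());
    }
}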
You could check the example code to gain more information about how to use the library.

How do I update MongoDB query results using inner query?

BACKGROUND
I have a collection of JSON documents that represent chemical compounds. A compound has an id and a name. An external process generates new compound documents at intervals, and ids may change across iterations. Compound documents whose ids have changed need to be updated to point to the most recent iteration's ids, and for that a "lastUpdated" field and a "relatedCompoundIds" field are added. To demonstrate, consider the following compounds across 3 steps:
Step 1: initial compound document for 'acetone' is generated with id="001".
{
  "id": "001",
  "name": "acetone",
  "lastUpdated": "2000-01-01"
}
Step 2: another iteration generates acetone, but with a different id.
{
  "id": "001",
  "name": "acetone",
  "lastUpdated": "2000-01-01"
}
{
  "id": "002",
  "name": "acetone",
  "lastUpdated": "2000-01-02"
}
Step 3: the compound with id "001" gets a "relatedCompoundIds" array appended, pointing to any other compounds with the same name.
{
  "id": "001",
  "name": "acetone",
  "lastUpdated": "2000-01-02",
  "relatedCompoundIds": ["002"]
}
{
  "id": "002",
  "name": "acetone",
  "lastUpdated": "2000-01-02"
}
I'm using MongoDB to house these records, and to resolve relatedCompoundId "pointers". I'm accessing Mongo using Spring ReactiveMongoTemplate. My process is as follows:
Upsert newly generated compounds into MongoDB.
For each record where "lastUpdated" is before now:
Get all related compounds (searching by name), and set "relatedCompoundIds".
CODE
public class App {

    private static final ReactiveMongoTemplate mongoOps =
            new ReactiveMongoTemplate(MongoClients.create(), "CompoundStore");

    public static void main(String[] args) {
        Date updatedDate = new Date();
        String readPath = "<path to newly generated compounds>"; // placeholder
        upsertAll(updatedDate, readPath);
        setRelatedCompounds(updatedDate);
    }

    private static void upsertAll(Date updatedDate, String readPath) {
        // [upsertion code here] <- this is working fine
    }

    private static void setRelatedCompounds(Date updatedDate) {
        mongoOps.find(
                Query.query(Criteria.where("lastUpdated").lt(updatedDate)), Compound.class, "compound")
                .doOnNext(compound -> {
                    findRelatedCompounds(updatedDate, compound)
                            .doOnSuccess(rc -> {
                                if (rc.size() > 0) {
                                    compound.setRelatedCompoundIDs(rc);
                                    mongoOps.save(Mono.just(compound)).subscribe();
                                }
                            })
                            .subscribe();
                }).blockLast();
    }

    private static Mono<List<String>> findRelatedCompounds(Date updatedDate, Compound compound) {
        Query query = new Query().addCriteria(new Criteria().andOperator(
                Criteria.where("lastUpdated").gte(updatedDate),
                Criteria.where("name").is(compound.getName())));
        query.fields().include("id");
        return mongoOps.find(query, Compound.class)
                .map(c -> c.getId())
                .filter(cid -> !StringUtils.isEmpty(cid))
                .distinct().collectSortedList();
    }
}
ERROR
Upon running, I get the following error:
17:08:35.957 [Thread-12] ERROR org.mongodb.driver.client - Callback onResult call produced an error
com.mongodb.MongoException: org.springframework.data.mongodb.UncategorizedMongoDbException: Too many operations are already waiting for a connection. Max number of operations (maxWaitQueueSize) of 500 has been exceeded.; nested exception is com.mongodb.MongoWaitQueueFullException: Too many operations are already waiting for a connection. Max number of operations (maxWaitQueueSize) of 500 has been exceeded.
at com.mongodb.MongoException.fromThrowableNonNull(MongoException.java:79)
Is there a better way to accomplish what I'm trying to do?
How do I adjust backpressure so as not to overload the mongo?
Other advice?
EDIT
The above error can be resolved by adding a limitRate modifier after the find method inside setRelatedCompounds.
private static void setRelatedCompounds(Date updatedDate) {
    mongoOps.find(
            Query.query(Criteria.where("lastUpdated").lt(updatedDate)), Compound.class, "compound")
            .limitRate(500)
            .doOnNext(compound -> {
                // do work here
            }).blockLast();
}
Still open to suggestions for alternative solutions.
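One possible alternative (a sketch only, not verified against the data above, reusing the method and class names from the question): the wait queue fills up because every doOnNext starts its own inner subscription, so nothing bounds how many saves are in flight at once. Replacing the nested subscribe calls with flatMap and an explicit concurrency argument lets the pipeline itself apply backpressure:

private static void setRelatedCompounds(Date updatedDate) {
    mongoOps.find(
            Query.query(Criteria.where("lastUpdated").lt(updatedDate)), Compound.class, "compound")
            // process at most 100 compounds concurrently so the Mongo wait queue stays bounded
            .flatMap(compound -> findRelatedCompounds(updatedDate, compound)
                    .filter(rc -> rc.size() > 0)
                    .flatMap(rc -> {
                        compound.setRelatedCompoundIDs(rc);
                        return mongoOps.save(compound);
                    }), 100)
            .blockLast();
}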

GraphQL: Explore API without a wildcard (*)?

I am new to GraphQL and I wonder how I can explore an API without a possible wildcard (*) (https://github.com/graphql/graphql-spec/issues/127).
I am currently setting up a headless Craft CMS with GraphQL and I don't really know how my data is nested.
Even with the REST API I have no chance of just getting all the data, because I have to set up all the endpoints and therefore have to know all the field names as well.
So how could I easily explore my CraftCMS data structure?
Thanks for any hints on this.
Cheers
merc
------ Edit -------
If I use @simonpedro's suggestion:
{
  __schema {
    types {
      name
      kind
      fields {
        name
      }
    }
  }
}
I can see a lot of types (?)/fields (?)...
For example I see:
{
  "name": "FlexibleContentTeaser",
  "kind": "OBJECT",
  "fields": [
    { "name": "id" },
    { "name": "enabled" },
    { "name": "teaserTitle" },
    { "name": "text" },
    { "name": "teaserLink" },
    { "name": "teaserLinkConnection" }
  ]
}
But now I would like to know how a teaserLink is structured.
I somehow found out that the teaserLink (it is a field with the type Entries, where I can link to another page) has the properties url & title.
But how would I set up a query to explore the properties available within teaserLink?
I tried all sorts of queries, but I am always confronted with error messages.
I would be really glad if somebody could give me another pointer on how I can find out which properties I can actually query...
Thank you
As far as I know, there is currently no GraphQL implementation with that capability. However, if what you want to do is explore the "data structure", i.e. the schema, you should use schema introspection, which was designed exactly for that (exploring the GraphQL schema). For example, a simple GraphQL introspection query would look like this:
{
  __schema {
    types {
      name
      kind
      fields {
        name
      }
    }
  }
}
References:
- https://graphql.org/learn/introspection/
UPDATE for edit:
What I think you want to do is the following:
Make a query like this
{
  __schema {
    types {
      name
      kind
      fields {
        name
        type {
          fields {
            name
          }
        }
      }
    }
  }
}
And then find the desired type to grab more information (its fields) from it. Something like this (I don't know if this works, just an idea):
const typeFlexibleContentTeaser = data.__schema.types.find(t => t.name === "FlexibleContentTeaser");
const teaserLinkField = typeFlexibleContentTeaser.fields.find(f => f.name === "teaserLink");
const teaserLinkTypeFields = teaserLinkField.type.fields;
i.e. you have to traverse recursively through the type field.
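Alternatively, standard introspection also lets you ask for a single type directly via the __type field and unwrap wrapper types with ofType, which avoids walking the whole schema by hand; the type name below is just the one taken from your example output:

{
  __type(name: "FlexibleContentTeaser") {
    name
    fields {
      name
      type {
        name
        kind
        ofType {
          name
          kind
        }
      }
    }
  }
}

The type/ofType part tells you, for example, whether teaserLink is a list of some object type, and that object type's name can then be fed into another __type query to list its own fields (such as url and title).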

How can I use Pig scripts to generate a nested Avro field?

I am new to Pig. My input data is in this format:
Record 1:
{
  label: int,
  id: long
},
Record 2:
{
  ...
}
...
And what I want as output is:
Record 1:
{
  data: {
    label: int,
    id: long
  }
},
Record 2:
{
  ...
}
...
I tried:
result = FOREACH input GENERATE (id, label) AS data;
but this results in a nested tuple structure that looks as below:
Record 1:
{
  data: {
    TUPLE_1: {
      label: int,
      id: long
    }
  }
}
How can I get rid of the extra "TUPLE_1" wrapping? It looks like I missed a trivial setting.
You probably need to specify schema when you STORE the data.
If you use org.apache.pig.piggybank.storage.avro.AvroStorage, it can take schema definition as parameter.
result = FOREACH input GENERATE label AS label:int, id AS id:long;
STORE result INTO 'result.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage('schema', '{"type": "record","name": "data","fields": [{"name": "label","type": "int"},{"name": "id", "type": "long"}]}');
My final solution is like this:
First of all, to create an Avro file with a certain schema, I make sure I do the following schema configuration for AvroStorage:
STORE sth INTO 'someplace' USING AvroStorage('schema','
{
### AN AVRO SCHEMA JSON STRING ###
}
');
I found that using such indentation really helps to keep the schema definition clean. I also need to make sure I escape all special characters, especially quotes (they may exist in "doc" fields, which is tricky).
Then, to make sure sth has the correct Pig structure to be stored, I need to construct the entire data structure properly. One good trick is to use DESCRIBE if there are already some examples of the target data files. In my previous question, the code should look like this:
in = LOAD '$INPUT_PATHS' USING AvroStorage();
in = FOREACH in GENERATE foo.label AS label, bar.id AS id;
out = FOREACH in GENERATE TOMAP('id', (long)id, 'label', (chararray)label) AS data;

RMF $OUTPUT_PATH;
STORE out INTO '$OUTPUT_PATH' USING AvroStorage('schema', '
{
    "type": "record",
    "name": "XXItem",
    "namespace": "com.xxx.xxx",
    "fields": [
        {
            "name": "data",
            "type": {"type": "map", "values": ["string", "long", "int"]}
        }
    ]
}
');
