diff between ResourceSchema & Schema in pig - hadoop

What's the difference between ResourceSchema & Schema in Pig?
There is already a Schema class provided, so why does Pig bother to add another Schema-like class called ResourceSchema (its API is almost like Schema's: you set each ResourceFieldSchema's name and type, and it can also have a child ResourceSchema) for storage functions?

The API docs back up @zsxwing's comment:
Schema - The Schema class encapsulates the notion of a schema for a relational operator. A schema is a list of columns that describe the output of a relational operator.
Each column in the relation is represented as a FieldSchema, a static class inside the Schema. A column by definition has an alias, a type and a possible schema (if the column is a bag or a tuple).
In addition, each column in the schema has a unique auto generated name used for tracking the lineage of the column in a sequence of statements. The lineage of the column is tracked using a map of the predecessors' columns to the operators that generate the predecessor columns.
The predecessor columns are the columns required in order to generate the column under consideration. Similarly, a reverse lookup of operators that generate the predecessor column to the predecessor column is maintained.
ResourceSchema - A representation of a schema used to communicate with load and store functions. This is separate from Schema, which is an internal Pig representation of a schema.
So one of the main differences I can see from the API docs is that a Schema is able to track the input columns required to build it, whereas a ResourceSchema is just the schema definition: field names, types (and optional sub-schemas).
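To make that concrete, here is a minimal sketch of building the simpler ResourceSchema the way a load/store function typically does. This is written from memory of the org.apache.pig API, so treat the exact method names as assumptions, and the column names are made up:

import org.apache.pig.ResourceSchema;
import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.data.DataType;

public class ExampleSchemas {

    // Builds the ResourceSchema a LoadFunc/StoreFunc would hand to Pig,
    // describing two output columns: name:chararray, age:int.
    public static ResourceSchema twoColumnSchema() {
        ResourceFieldSchema nameField = new ResourceFieldSchema();
        nameField.setName("name");
        nameField.setType(DataType.CHARARRAY);

        ResourceFieldSchema ageField = new ResourceFieldSchema();
        ageField.setName("age");
        ageField.setType(DataType.INTEGER);

        ResourceSchema schema = new ResourceSchema();
        schema.setFields(new ResourceFieldSchema[] { nameField, ageField });
        // Note there is no alias/lineage bookkeeping here; that only exists in
        // Pig's internal Schema class. (If I recall correctly, a UDF that already
        // has an internal Schema can convert it via the ResourceSchema(Schema)
        // constructor.)
        return schema;
    }
}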

Related

Inject a list into SpringData to be used as a kind of virtual database table

I have a database table on which a sorted query needs to be done.
To do the sorting, a join on another table is required. The problem is that this other table does not exist in the database, because we read the required data from a CSV file at the service's startup and keep it as an in-memory list.
Is it possible to somehow inject this list as a kind of virtual database into Spring Data? So that it could use this list to make the required join and sorting.
As far as I know, the only other options I have would be to create a real database table from this in-memory list or load the whole table and do the sorting in the service itself.
You can add a special ORDER BY expression through e.g. a Spring Data Specification, but that is going to be very ugly. In HQL it looks like this:
case rootAlias.attribute when 'value1' then 1 when 'value2' then 2 ... else null end
which will return some integer value by which you can sort ascending or descending, based on the mapping you have.
Even if you have lots of values, I would rather recommend you don't do a join at all, and instead try to make this attribute of your main table sortable, so that you don't need this mapping. You can maybe create a trigger that maintains a column based on the mapping, which can be used for sorting directly. If you do all your changes through JPA/Hibernate, you could also use a @PreUpdate/@PrePersist listener to handle the maintenance of this column.
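If you do go the Specification route anyway, a rough sketch of the CASE-based ordering with the JPA Criteria API could look like the following (javax.persistence here, jakarta.persistence on newer stacks; the class, method, and attribute names are invented for illustration, and the map is the list you load from the CSV at startup):

import java.util.Map;
import javax.persistence.criteria.CriteriaBuilder;
import org.springframework.data.jpa.domain.Specification;

public final class CsvRankOrdering {

    // Builds: case root.<attribute> when 'k1' then v1 when 'k2' then v2 ... else null end
    // and sorts ascending by it.
    public static <T> Specification<T> orderedByRank(String attribute, Map<String, Integer> ranks) {
        return (root, query, cb) -> {
            CriteriaBuilder.Case<Integer> rank = cb.selectCase();
            for (Map.Entry<String, Integer> e : ranks.entrySet()) {
                rank = rank.when(cb.equal(root.get(attribute), e.getKey()), e.getValue());
            }
            query.orderBy(cb.asc(rank.otherwise(cb.nullLiteral(Integer.class))));
            return cb.conjunction(); // no extra filtering, ordering only
        };
    }
}

A repository extending JpaSpecificationExecutor could then call findAll(CsvRankOrdering.orderedByRank("nationality", ranksFromCsv)), with the attribute name and map replaced by whatever your service actually holds.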

Migrating table with PutDatabaseRecord with different column name at the target table

I need to migrate data from a DB2 table to an MSSQL table; one column has a different name but the same datatype.
Db2 table:
NROCTA,NUMRUT,DIASMORA2
MSSQL table:
NROCTA,NUMRUT,DIAMORAS
As you see DIAMORAS is different.
I'm using the following flow:
ExecuteSQL -> SplitAvro -> PutDatabaseRecord
In PutDatabaseRecord I have an AvroReader as the Record Reader, configured this way:
Schema Access Strategy: Use Embedded Avro Schema.
Schema Text: ${avro.schema}
The flow only inserts the first two columns. How can I do the mapping between the DIASMORA2 and DIAMORAS columns?
Thanks in advance!
First thing, you probably don't need SplitAvro in your flow at all, unless there's some logical subset of rows that you are trying to send as individual transactions.
For the column name change, use UpdateRecord and set the field /DIASMORAS to the record path /DIASMORA2, and change the name of the field in the AvroRecordSetWriter's schema from DIASMORA2 to DIASMORAS.
That last part is a little trickier since you are using the embedded schema in your AvroReader. If the schema will always be the same, you can stop the UpdateRecord processor and put in an ExtractAvroMetadata processor to extract the avro.schema attribute. That will put the embedded schema in the flowfile's avro.schema attribute.
Then before you start UpdateRecord, start the ExecuteSQL and ExtractAvroMetadata processors, then inspect a flow file in the queue to copy the schema out of the avro.schema attribute. Then in your AvroRecordSetWriter in ConvertRecord, instead of Inheriting the schema, you can choose to Use Schema Text, then paste in the schema from the attribute, changing DIASMORA2 to DIASMORAS. This approach puts values from the DIASMORA2 field into the DIASMORAS field, but since DIASMORA2 is not in the output schema, it is ignored, thereby effectively renaming the field (although under the hood it is a copy-and-remove).
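For reference, the Schema Text you paste into the AvroRecordSetWriter would end up looking roughly like this. The record name and field types below are placeholders; copy the real ones from the avro.schema attribute and only change the last field's name, which must match the actual MSSQL column:

{
  "type" : "record",
  "name" : "mora_registro",
  "fields" : [
    { "name" : "NROCTA", "type" : [ "null", "string" ] },
    { "name" : "NUMRUT", "type" : [ "null", "string" ] },
    { "name" : "DIASMORAS", "type" : [ "null", "int" ] }
  ]
}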

Data Transformation for Large data in a file

I am new to Ensemble and need a clarification regarding Data Transformations.
I have 2 schemas as follows,
PatientID,
Patient Name,
Patient Address (combination of door number, Street, District, State)
and another schema as,
PatientID,
Patient Name,
Door Number
Street
District
State
Now there is an incoming text file with 1000's of records as per the first schema ('|' separated) as below,
1001|John|220,W Maude Ave,Suisun City, CA
Like this, there are thousands of records in the input file.
My requirement is to convert this as per the second schema (i.e to separate the Address) and store in the file like,
1001|John|220|W Maude Ave|Suisun City|CA
One solution I implemented was to loop through each line in the file and replace the commas in the address with '|'.
My question is whether we can do this through a DTL. If the answer is yes, how do we loop through thousands of records using DTL?
Will the DTL be time consuming, given that we need to load the schema and then do the transformations?
Please help.
You can use a DTL with any class that inherits from Ens.VirtualDocument or %XML.Adaptor. Ensemble uses the class dictionary to represent the schema, so for basic classes there is no problem: if you extend %XML.Adaptor, Ensemble can represent it. In the case of virtual documents, the object's DocType has to be set.
In order to do the loop, there is a <foreach> action in DTL.
Yes, DTLs can parse 1000's of records. You can do the following:
1) Create a record map to parse the incoming file that has schema 1
2) Define an intermediate object that maps schema 2 fields to object properties
3) Create a DTL whose source object is the record map object from 1 above and whose target is the object from 2 above.

Transform normalized Table (1NF) to a nested table in Oracle 11g

Is there any implemented function/procedure in Oracle 11g to simply transform a standard normalized table into a nested table? Or does someone have an idea for the following problem?
Reason: All of our tables are normalized in a way that if a certain attribute value is set twice or more for one record, the whole record (except this attribute value) has to be stored redundantly. Meaning, if I have an attribute "ID_CARD_NUMBER" in my person table and one person has two nationalities (and two ID cards), there is a second record with redundant attribute values except "ID_CARD_NUMBER". Moving such an attribute out into an extra table is NOT an option, and I also don't want to denormalize in a way that one attribute value consists of a concatenation of the previously separate attribute values.
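For illustration, assuming a PERSON table with just NAME and ID_CARD_NUMBER columns (names invented here), the dual-nationality case currently produces two almost identical rows:

NAME  | ID_CARD_NUMBER
------+---------------
Smith | DE-1234567
Smith | FR-7654321

and the goal is a single Smith row whose ID_CARD_NUMBER attribute holds both values as a nested collection.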

Altova Mapforce: Joining XML Input and conditional SQL Join using two tables

I'm trying to get the following done: Using Altova Mapforce, I use an XML file with schema as a source. I want to map it to exactly the same output, but only add data to one field.
The value of the field (it's Tax) is determined using a two table SQL join with a WHERE clause over both tables. The tables are joined using foreign keys, the relation is recognized by Mapforce.
The first field of the WHERE clause comes from the first table (a header-type table), the second and third fields from the second table (a lines-type table).
However, I cannot seem to create the logical and correct equivalent of what I am describing here. I've tried complex AND constructions, but then it inserts the one field I need multiple times. I've tried WHERE clauses, but they fail because they never supply both tables at the same time, and there seems to be no way to use a pre-specified JOIN of two tables as a source. The WHERE clause then recognizes only the fields from the first table, not the second one.
Is there an example for this? Joining two (or more) tables, using WHERE to determine the exact row, then using a value from that row?
Best wishes.
