Does anyone have examples of OrientDB ETL transformers that chain multiple transforms, or something that can create class identifiers on the fly? For example, if you want to create Organization entities, the id could be a hash of the organization name. Essentially, the JSON we are importing does not exactly match the schema we want in the destination.
What about using block code in your ETL configuration file? You can use it in the begin phase, so you could transform the id column in your .csv input file. It is not an ideal solution, I agree.
See the Block documentation.
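For illustration, here is a minimal sketch of a configuration that derives the id in a code transformer instead, assuming the JSON input has a name field; the file path, database URL, and the Organization class are placeholders, and the "hash" is simply the Java hashCode of the name:

{
  "source":    { "file": { "path": "/tmp/organizations.json" } },
  "extractor": { "json": {} },
  "transformers": [
    { "code": { "language": "javascript",
                "code": "record.field('id', record.field('name').hashCode()); record;" } },
    { "vertex": { "class": "Organization" } }
  ],
  "loader": { "orientdb": { "dbURL": "plocal:/tmp/databases/orgs", "dbType": "graph",
              "classes": [ { "name": "Organization", "extends": "V" } ] } }
}

The code transformer runs once per record and must end with the record so it is handed on to the vertex transformer; a begin block, by contrast, runs once before extraction, which is why a per-record id is easier to compute in a transformer. A real import would probably use a stronger hash than hashCode().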
I have to do a data transformation using Apache NiFi standard processors for the input data mentioned below. I have to add two new fields, class and year, and drop the extra price fields.
Below are my input data and transformed data.
Input data
Expected output
Disclaimer: I am assuming that your input headers are not dynamic, which means that you can maintain a predictable input schema. If that is true, you can do this with the standard processors as of 1.12.0, but it will require a little work.
Here's a blog post of mine about how to use ScriptedTransformRecord to take input from one schema, build a new data structure and mix it with another schema. It's a bit involved.
I've used that methodology recently to convert a much larger set of data into summary records, so I know it works. The summary of what's involved is this:
Create two schemas, one that matches input and one for output.
Set up ScriptedTransformRecord to use a writer that explicitly sets which schema to use, since ScriptedTransformRecord doesn't support changing the schema configuration internally.
Create a fat jar with Maven or Gradle that compiles your Avro schema into an object that exposes a static RecordSchema (NiFi API) to your script.
Write a Groovy script that generates a new MapRecord (a rough sketch follows below).
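As a rough sketch of what that last step can look like (not the blog post's actual script), assuming ScriptedTransformRecord is configured for Groovy, that OUTPUT_SCHEMA is the static RecordSchema exposed by the fat jar from the previous step, and that the input has item, date, and price_* columns:

import org.apache.nifi.serialization.record.MapRecord

// 'record' is bound by ScriptedTransformRecord for each incoming record
def values = [:]
values['item']  = record.getValue('item')               // carried over unchanged (assumed input field)
values['class'] = 'A'                                    // new constant field (assumed value)
values['year']  = record.getAsString('date')?.take(4)    // derived field (assumed source column)
// the price_* columns are simply never copied, so they are dropped
return new MapRecord(OUTPUT_SCHEMA, values)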
We have fixed-length format files in S3. We want to create an Athena table after converting them to Parquet. We have around 50-60 different such files.
Currently I can think of two approaches:
Put the fixed-length parsing logic in the Athena table creation script.
Create a Glue job that parses the files and writes Parquet, then create the Athena table on that output.
Approach 1:
Though it may involve minimal code, that code would live in the create-table script. We are using Terraform to create the infrastructure, so the parsing logic (a regex or Grok pattern) would become part of the infra code, and I am skeptical about putting logic in infra code.
Approach 2:
This would be a Glue job written with Spark. It is flexible enough to parse fixed-length files, and we could write reusable fixed-length parsing code shared across all the different files. The parsing logic would stay with the developers, Athena would have an external table on the Glue job's output location, and the infra code would contain only the create statement.
Could you please share your views?
My recommendation would be to go with Approach 2. Using Spark's file readers (for example spark.read.text), you can read most fixed-length format files and convert them to Parquet. You can also run validations or quick transformations before saving to Parquet if needed.
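As a minimal PySpark sketch of that idea (the S3 paths and the column offsets are made-up placeholders; a Glue job would wrap the same logic):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.appName("fixed-width-to-parquet").getOrCreate()

# Each line of the fixed-length file arrives as a single 'value' column.
raw = spark.read.text("s3://my-bucket/input/employees.txt")

# Slice each line by position; the (start, length) pairs below are illustrative only.
parsed = raw.select(
    trim(col("value").substr(1, 10)).alias("emp_id"),
    trim(col("value").substr(11, 30)).alias("name"),
    trim(col("value").substr(41, 8)).alias("hire_date"),
)

# Write Parquet to the location the Athena external table will point at.
parsed.write.mode("overwrite").parquet("s3://my-bucket/output/employees/")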
I have a lot of Parquet files. I need to read them through AWS Glue and then provide column names for the table that is being read.
The problem is that the Parquet files already have column names, which the crawler picks up and shows in the table. Is it possible to provide my own column names for these Parquet files in Glue?
To replace the detected column names with names of your own, you could either:
Use one of the following built-in transformations on DynamicFrame (see the sketch after this list):
ApplyMapping - Applies a declarative mapping to this DynamicFrame and returns a new DynamicFrame with those mappings applied. (source column, source type, target column, target type)
RenameField - Renames a field in this DynamicFrame and returns a new DynamicFrame with the field renamed. (oldName -> newName)
See the Scala or Python ETL programming guides for more detail.
Or try updating the Data Catalog field names manually if you don't need to continuously re-crawl the data (or if you do, it is possible to prevent a Glue crawler from updating existing Data Catalog tables via the crawler configuration).
Alternatively, if your requirements call for per-record logic, the Map transform is available to convert each DynamicRecord in the DynamicFrame into a new DynamicRecord of your choosing.
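For instance, a minimal Glue PySpark sketch using ApplyMapping (the database, table, and column names are placeholders):

from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled Parquet table from the Data Catalog.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_parquet_table"
)

# Rename the detected columns to the names you want, keeping the types.
renamed = ApplyMapping.apply(
    frame=frame,
    mappings=[
        ("col0", "string", "customer_id", "string"),
        ("col1", "double", "order_total", "double"),
    ],
)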
I want to do something like "Read only certain columns from xls" in Jaspersoft ETL Express 5.4.1, but I don't know the schema of the files. However, from what I read here, it looks like I can only do this with the Enterprise version's Dynamic Schema feature.
Is there no other way?
You can do it using the tMap component. Design the job like below:
tFileInputExcel --main--> tMap --main--> your output
Create metadata for your input file (the Excel file).
Then use this metadata in your input component.
In tMap, select only the required columns for the output.
See the image of tMap where I am selecting only two columns from the input flow.
The Enterprise version has many features, and dynamic schema is one of the most important, but for your case it is not required. It is only needed when the schema is variable and you don't know how many columns you will receive in your feed.
I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to be attached to the files at all. The data files are just flat files (unless you use something like a SequenceFile). Each application that wants to work with those files has its own way of representing their schema.
For example, I load a file into the HDFS and want to transform it with Pig. In order to work effectively with it I need to specify the schema of the file when I load the data:
EMP = LOAD 'myfile' USING PigStorage() AS (first_name: chararray, last_name: chararray, deptno: int);
Now, I know that when storing a file using PigStorage, the schema can optionally be written out alongside it, but in order to get a file into Pig in the first place it seems like you need to specify a schema.
If I want to work with the same file in Hive, I need to create a table and specify the schema with that too:
CREATE EXTERNAL TABLE EMP (
  first_name string,
  last_name string,
  deptno int)
LOCATION 'myfile';
It seems to me like this is extremely fragile. If the file format changes even slightly then the schema must be manually updated in each application. I'm sure I'm being naive but wouldn't it make sense to store the schema with the data file? That way the data is portable between applications and the barrier to using another tool would be lower since you wouldn't need to re-code the schema for each application.
So the question is: Is there a way to specify the schema of a data file in Hadoop/HDFS or do I need to specify the schema for the data file in each application?
It looks like you are looking for Apache Avro. With Avro your schema is embedded in your data, so you can read it without having to worry about schema issues and it makes schema evolution really easy.
The great thing about Avro is that it is completely integrated in Hadoop and you can use it with a lot of Hadoop sub-projects like Pig and Hive.
For example with Pig you could do:
EMP = LOAD 'myfile.avro' using AvroStorage();
I would advise looking at the documentation for AvroStorage for more details.
You can also use Avro with Hive, as described here. I have not used that personally, but it should work in much the same way.
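As a rough, untested sketch of the Hive side (the data location and the schema URL are placeholders), an external table can be pointed at the Avro files with the Avro SerDe, reading the same .avsc schema that Pig uses:

CREATE EXTERNAL TABLE EMP
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/me/emp_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/me/schemas/emp.avsc');

Because the schema lives in the .avsc file (and in the data), Pig and Hive both read the same definition rather than each maintaining its own copy.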
What you need is HCatalog:
"Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
This includes:
Providing a shared schema and data type mechanism.
Providing a table abstraction so that users need not be concerned with where or how their data is stored.
Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive."
You can take a look at the "data flow example" in the docs to see exactly the scenario you are talking about.
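For instance, once a table is registered in HCatalog, Pig can load it without restating the schema; a rough sketch, assuming an HCatalog table named emp (the loader's package name varies with the HCatalog/Hive version):

EMP = LOAD 'emp' USING org.apache.hcatalog.pig.HCatLoader();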
Apache Zebra seems to be a tool that could provide a common schema definition across MapReduce, Pig, and Hive. It has its own schema store, and an MR job can use its built-in TableStore to write to HDFS.