Data Integration: Bring Data to a Standard Format [closed] - etl

I am trying to build a data integration process using an ETL tool (Talend).
The challenge I am facing is bringing data from various sources (in different formats) into a single format.
The sources may have different column names and structures (order, datatypes, etc.), so different metadata.
As I see it, this is a very common case, but the tool does not seem able to handle it as it does not provide any dynamic mapping feature.
What is the best approach to handle such a scenario?

Talend does provide a dynamic mapping tool. It's called the tMap or tXmlMap for XML data.
There's also the tHMap (Hierarchical Mapping tool), which is a lot more powerful, but I've yet to use it because it's very raw in the version of Talend I'm using (5.4); it should be more usable in 5.5.
Your best approach here may be to use a tMap after each of your components to standardise the schema of your data.
First you should pick what the output schema should look like (this could be the same as one of your current schemas or something entirely different if necessary) and then simply copy and paste the schema to the output table of each tMap. Then map the relevant data across.
An example job may look something like this:
(screenshot of the example job layout)
Where the schemas and the contained data for each "file" (I'm using tFixedFlowInput components to hard-code the data in the job rather than read in a file, but the premise is the same) are as follows:
(screenshots of the schemas and data for File 1, File 2 and File 3)
And they are then mapped to match the schema of the first "file":
(screenshots of the tMap configurations for File 1, File 2 and File 3)
Notice how the first tMap configuration shows no changes as we are keeping the schema exactly the same.
Now that our inputs all share the same schema we can use a tUnite component to union (much like SQL's UNION operator) the data.
After this we take one final step and use a tReplace component so that we can easily standardise the "sex" field to M or F:
(screenshot of the tReplace configuration)
And then lastly I output that to the console, but this could go to any available output component.
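For anyone who prefers code to screenshots, here is a minimal plain-Java sketch of what the tUnite and tReplace steps effectively do once the inputs share one schema. The Person fields, the raw sex values and the output format are my own illustrative assumptions rather than the actual schemas from the job above.

import java.util.ArrayList;
import java.util.List;

public class UniteAndReplaceSketch {

    // One row after the three inputs have been mapped onto the shared schema.
    record Person(String name, String sex) { }

    // tReplace-style rule: collapse the different spellings onto a single code.
    static String normaliseSex(String raw) {
        return switch (raw.trim().toLowerCase()) {
            case "m", "male" -> "M";
            case "f", "female" -> "F";
            default -> raw; // leave anything unexpected untouched
        };
    }

    public static void main(String[] args) {
        // tUnite: a plain concatenation of rows that already share the same schema.
        List<Person> united = new ArrayList<>();
        united.addAll(List.of(new Person("Alice", "female"), new Person("Bob", "M")));
        united.addAll(List.of(new Person("Carol", "f")));

        // tReplace + console output: normalise the sex field and print each row.
        united.forEach(p -> System.out.println(p.name() + "|" + normaliseSex(p.sex())));
    }
}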
For a truly dynamic option without having to predefine the mapping you would need to read in all of your data with a dynamic schema. You could then parse the structure out into a defined output.
In this case you could read the data from your files in as a dynamic schema (single column) and then drop it straight into a temporary database table. Talend will automatically create the columns as per the headers in the original file.
From here you could then use a transformation mapping file and the database's data dictionary to extract the data in the source columns and map it directly to the output columns.
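As a rough sketch of that dynamic option, assuming a relational staging database reachable over JDBC, a landed table called tmp_landed_file, a hand-maintained column_mapping table and a standard_target table (all illustrative names): read the landed columns from the data dictionary, look up the source-to-target mapping, and generate the INSERT ... SELECT.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryDrivenMapping {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/staging", "etl", "secret")) {

            // 1. Discover the columns Talend created for the landed file.
            List<String> sourceColumns = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT column_name FROM information_schema.columns "
                  + "WHERE table_name = 'tmp_landed_file' ORDER BY ordinal_position");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    sourceColumns.add(rs.getString(1));
                }
            }

            // 2. Look up the source-to-target mapping maintained in a mapping table.
            Map<String, String> mapping = new LinkedHashMap<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT source_column, target_column FROM column_mapping "
                  + "WHERE source_table = 'tmp_landed_file'");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    mapping.put(rs.getString(1), rs.getString(2));
                }
            }

            // 3. Build an INSERT ... SELECT that renames the columns on the way out;
            //    unmapped source columns are simply dropped.
            StringBuilder select = new StringBuilder();
            StringBuilder target = new StringBuilder();
            for (String col : sourceColumns) {
                if (!mapping.containsKey(col)) {
                    continue;
                }
                if (select.length() > 0) {
                    select.append(", ");
                    target.append(", ");
                }
                select.append(col);
                target.append(mapping.get(col));
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO standard_target (" + target + ") SELECT " + select
                  + " FROM tmp_landed_file")) {
                ps.executeUpdate();
            }
        }
    }
}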

Related

How to do data transformation using an Apache NiFi standard processor?

I have to do data transformation using Apache NiFi standard processors for the input data mentioned below. I have to add two new fields, class and year, and drop the extra price fields.
Below are my input data and transformed data.
Input data: (sample omitted)
Expected output: (sample omitted)
Disclaimer: I am assuming that your input headers are not dynamic, which means that you can maintain a predictable input schema. If that is true, you can do this with the standard processors as of 1.12.0, but it will require a little work.
Here's a blog post of mine about how to use ScriptedTransformRecord to take input from one schema, build a new data structure and mix it with another schema. It's a bit involved.
I've used that methodology recently to convert a much larger set of data into summary records, so I know it works. The summary of what's involved is this:
Create two schemas, one that matches the input and one for the output.
Set up ScriptedTransformRecord to use a writer that explicitly sets which schema to use, since ScriptedTransformRecord doesn't support changing the schema configuration internally.
Create a fat jar with Maven or Gradle that compiles your Avro schema into an object that can be used with the NiFi API to expose a static RecordSchema (NiFi API) to your script.
Write a Groovy script that generates a new MapRecord.
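As a rough illustration of steps 2 to 4, here is a Java-flavoured sketch of the record construction (the Groovy script body is nearly identical, since ScriptedTransformRecord exposes the current record as the variable record and uses the script's return value as the output). The field names title, class and year and the hard-coded values are assumptions taken from the question above, not the blog post's actual code.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.nifi.serialization.SimpleRecordSchema;
import org.apache.nifi.serialization.record.MapRecord;
import org.apache.nifi.serialization.record.Record;
import org.apache.nifi.serialization.record.RecordField;
import org.apache.nifi.serialization.record.RecordFieldType;
import org.apache.nifi.serialization.record.RecordSchema;

public class TransformSketch {

    // Output schema: keep title, add class and year, drop the price fields entirely.
    private static final RecordSchema OUTPUT_SCHEMA = new SimpleRecordSchema(Arrays.asList(
            new RecordField("title", RecordFieldType.STRING.getDataType()),
            new RecordField("class", RecordFieldType.STRING.getDataType()),
            new RecordField("year", RecordFieldType.INT.getDataType())));

    // In the Groovy script this logic sits in the script body, where `record` is the
    // incoming record bound by ScriptedTransformRecord.
    public static Record transform(Record record) {
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("title", record.getValue("title")); // carried over from the input
        values.put("class", "books");                  // new, hard-coded field
        values.put("year", 2021);                      // new, hard-coded field
        // The price fields are simply not copied, which drops them from the output.
        return new MapRecord(OUTPUT_SCHEMA, values);
    }
}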

How do I read this mapping document?

I am new to ETL and am working with Talend. I was given this document and was told to make an "extraction job." How exactly do I read this document for the Talend job that I have to make?
Well, ETL basically means Extract-Transform-Load.
From your example, I can understand that you have to create a Target table which will pull data from the Source table based on certain conditions. These conditions are mentioned in your image.
You basically have to look at the Source File columns in your image. They clearly state:
1.) File (Table name): this tells you which table in the Source DB this attribute is flowing in from.
2.) Attribute(s) (Field Name): this is the name of the column.
3.) Extract logic: the logic that has to be applied while extracting this column from the Source. (Straight Move) means just dump the source value into the Target.
This is just to get you started, as nobody will actually create the whole ETL flow for you here.

Can I keep data of different file formats in same hive table?

I am receiving data in formats like CSV, XML and JSON, and I want to keep all the files in the same Hive table. Is that achievable?
Hive expects all the files for one table to use the same delimiter, the same compression, etc., so you cannot use one Hive table on top of files with multiple formats.
The solution you may want to use is:
Create a separate table (JSON/XML/CSV) for each of the file formats
Create a view for the UNION of the 3 tables created above.
This way the consumer of the data has to query only one view/object, if that's what you are looking for.
Yes, you can achieve this through a combination of different external tables.
Because different SerDes with different specifications for how to read columns in the different files will be needed, you will need to create one external table per type of file (and table). The data from each of these external tables can then be combined into a view with UNION, as suggested by Ramesh. The view could then be used for reading from these, and you could e.g. insert the data into a managed table.
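As a hedged sketch of this setup through the Hive JDBC driver (table names, columns, locations and the connection URL are all illustrative assumptions; XML would additionally need a third-party XML SerDe declared the same way):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MultiFormatHiveSetup {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // CSV files in one location, read as plain delimited text.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS events_csv (id INT, payload STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                       + "LOCATION '/data/events/csv'");

            // JSON files in another location, read with the HCatalog JSON SerDe.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS events_json (id INT, payload STRING) "
                       + "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' "
                       + "LOCATION '/data/events/json'");

            // One view that the consumers of the data query.
            stmt.execute("CREATE VIEW IF NOT EXISTS events_all AS "
                       + "SELECT id, payload FROM events_csv "
                       + "UNION ALL "
                       + "SELECT id, payload FROM events_json");
        }
    }
}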

How to import only some columns from XLS with ETL?

I want to do something like Read only certain columns from xls in Jaspersoft ETL Express 5.4.1, but I don't know the schema of the files. However, from what I read here, it looks like I can only do this with the Enterprise Version's Dynamic Schema thing.
Is there no other way?
You can do it using the tMap component. Design the job like below:
tFileInputExcel--main--tMap--main--your output
Create metadata for your input file (the Excel file),
then use this metadata in your input component.
In the tMap, select only the required columns in the output.
See the image of the tMap wherein I am selecting only two columns from the input flow.
The Enterprise version has many features, and dynamic schema is one of the most important, but for your case it is not required. It is only needed when the schema is variable and you don't know how many columns you will receive in your feed.
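If it helps to see the idea outside the tool, this is roughly what the tFileInputExcel + tMap combination does, sketched with Apache POI; the file name, sheet index and column positions (0 and 2) are illustrative assumptions that would come from your metadata in a real job.

import java.io.File;

import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class SelectExcelColumns {

    public static void main(String[] args) throws Exception {
        DataFormatter formatter = new DataFormatter();
        try (Workbook workbook = WorkbookFactory.create(new File("input.xls"))) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                // Only the two required columns are carried forward; deselecting the
                // rest in the tMap output has the same effect.
                String name = formatter.formatCellValue(row.getCell(0));
                String city = formatter.formatCellValue(row.getCell(2));
                System.out.println(name + ";" + city);
            }
        }
    }
}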

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in Hive. I find myself currently working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is thrown out in 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information, or better yet, a way to get it in a more direct manner?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is solely used inside of Hive, using its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.
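One more direct route, sketched under the assumption that you can reach the metastore's Thrift endpoint, is to skip parsing the DESCRIBE EXTENDED blob and ask the metastore client directly for the table, its non-partition columns and the partition locations (the URI, database and table names below are placeholders):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.Table;

public class ScrapeHiveMetadata {

    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            Table table = client.getTable("default", "my_table");

            // Regular columns only; partition keys are kept separately on the table.
            for (FieldSchema column : table.getSd().getCols()) {
                System.out.println("column: " + column.getName() + " " + column.getType());
            }

            // Partition values and the HDFS location backing each partition.
            for (Partition partition : client.listPartitions("default", "my_table", (short) 100)) {
                System.out.println("partition " + partition.getValues()
                        + " -> " + partition.getSd().getLocation());
            }
        } finally {
            client.close();
        }
    }
}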
