I have developed an ETL process with shell scripting. After that, I found that there is an existing solution, Talend Open Studio, and I am thinking of using it for my future tasks.
My problem is that the files I want to integrate into the database must first be transformed in structure. This is the structure I have:
19-08-02 Name appel ok hope local merge (mk)
juin nov sept oct
00:00:t1 T1 299 0 24 8 3 64
F2 119 0 11 8 3 62
I1 25 0 2 9 4 64
F3 105 0 10 7 3 61
Regulated F2 0 0 0
FR T1 104 0 10 7 3 61
I must transform it into a flat file format.
Does Talend offer the possibility to perform several transformations before integrating data from CSV files into the database?
Edit
This is an example of the flat file that I want to achieve before integrating the data into the database (only the first row is concerned):
Timer,T1,F2,I1,F3,Regulated F2,FR T1
00:00:t1,299,119,25,105,0,104
00:00:t2,649,119,225,165,5,102
00:00:t5,800,111,250,105,0,100
We can split the task into three pieces: extract, transform, load.
Extract
First you have to find out how to connect to the source. With Talend it is possible to connect to different kinds of sources, like databases, XML files, flat files, CSV files, etc. The corresponding components are called tFileInput or tMySQLInput, to name a few.
Transform
Then you have to tell Talend how to split the data into columns. In your example, this could be done on the whitespace, although the splitting might be difficult because the Name field also contains whitespace.
Afterwards, since it is a column-to-row transposition, you have to write some Java code in a tJavaRow component, or alternatively use a tMap component with a conditional mapping such as (row.Name.equals("T1") ? row.value : 0).
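To make the transposition concrete, the row-by-row logic could look roughly like the following sketch (shown in Python for illustration only; in Talend the same logic would live in a tJavaRow component, and the file names, the fixed column list and the timer pattern are assumptions based on your sample):

import csv
import re

# Output columns, in the order of the desired flat file.
COLUMNS = ["T1", "F2", "I1", "F3", "Regulated F2", "FR T1"]
TIMER_RE = re.compile(r"^\d{2}:\d{2}:\w+$")    # matches e.g. "00:00:t1" (assumed format)

def transpose(lines):
    """Collect one output row per timer value from the raw report lines."""
    rows, timer, current = [], None, {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if TIMER_RE.match(tokens[0]):          # a new block starts with the timer value
            if timer is not None:
                rows.append((timer, current))
            timer, current = tokens[0], {}
            tokens = tokens[1:]
        # The label is every leading non-numeric token ("T1", "Regulated F2", "FR T1", ...).
        label_parts = []
        while tokens and not tokens[0].isdigit():
            label_parts.append(tokens.pop(0))
        label = " ".join(label_parts)
        if label in COLUMNS and tokens:
            current[label] = tokens[0]          # keep only the first numeric column
    if timer is not None:
        rows.append((timer, current))
    return rows

with open("source.txt") as src, open("flat.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["Timer"] + COLUMNS)
    for timer, values in transpose(src):
        writer.writerow([timer] + [values.get(col, "0") for col in COLUMNS])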
Load
Then the transformation is complete and your data can be stored in a database, target file, etc. The components here are called tFileOutput or tOracleOutput, for example.
Conclusion
Yes, it is possible to build your ETL process in Talend. The transposition could be a little bit complicated if you are new to Talend, but if you keep in mind that Talend processes data row by row (as your script does, I assume), it is not that big of a problem.
I have a big file (an attribute file) in my Amazon S3 bucket in .zip form. It is around 30 GB when unzipped. The file is updated every 2 days.
INDEX HEIGHT GENDER AGE
00125 155 MALE 15
01002 161 FEMALE 18
00410 173 MALE 17
00001 160 MALE 20
00010 159 FEMALE 22
.
.
.
My use case is such that I want to iterate once through the sorted attribute file in one program run. Since the zipped file is around 3.6 GB and is updated every 2 days, my code downloads it from S3 every time. (I could probably use caching, but currently I am not.)
I want the file to be sorted. Since the unzipped file is large and expected to keep growing, I do not want to unzip it during the code run.
What I am trying to achieve is the following:
I have other files, metric files. They are relatively small (~20-30 MB) and are sorted.
INDEX MARKS
00102 45
00125 62
00342 134
00410 159
.
.
.
Using the INDEX, I want to create a METRIC-ATTRIBUTE file for each METRIC file. If the attribute file were also sorted, I could do something similar to merging two sorted lists, taking only the rows with a common INDEX. It would take O(SIZE_OF_ATTRIBUTE_FILE + SIZE_OF_METRIC_FILE) space and time.
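Roughly, the merge step I have in mind would look like this (assuming both files are whitespace-delimited and sorted by a unique INDEX; the file names are placeholders):

def merge_join(metric_path, attribute_path, output_path):
    """Stream two INDEX-sorted files once and keep only rows present in both."""
    with open(metric_path) as metrics, \
         open(attribute_path) as attributes, \
         open(output_path, "w") as out:
        next(metrics)                              # skip the header lines
        next(attributes)
        m_line = metrics.readline()
        a_line = attributes.readline()
        while m_line and a_line:
            m_fields = m_line.split()
            a_fields = a_line.split()
            if m_fields[0] == a_fields[0]:         # common INDEX: emit the joined row
                out.write(" ".join(m_fields + a_fields[1:]) + "\n")
                m_line = metrics.readline()
                a_line = attributes.readline()
            elif m_fields[0] < a_fields[0]:        # advance whichever side is behind
                m_line = metrics.readline()
            else:
                a_line = attributes.readline()

merge_join("metric.txt", "sorted_attribute.txt", "metric_attribute.txt")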
What is the best way to sort it (the attribute file)? An AWS solution is preferred.
You may be able to use the AWS Athena service. Athena can operate over big S3 files (even compressed ones) in place. You could create one table for every S3 file and run SQL queries against them.
Sorting is possible, as you can use an ORDER BY clause, but I don't know how efficient it will be.
It may not be necessary, though: you will be able to join the tables on the INDEX column.
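For example, once the two tables exist in Athena, a join could be driven from code roughly like this with boto3 (the database, table and column names, the region and the result bucket are all assumptions):

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")       # region is an assumption

# Join the metric table with the attribute table on INDEX; Athena does the work,
# so the 30 GB attribute file never has to be unzipped or sorted locally.
QUERY = """
SELECT m.index, m.marks, a.height, a.gender, a.age
FROM metrics m
JOIN attributes a ON m.index = a.index
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "mydb"},                               # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # assumed bucket
)

# Poll until the query finishes; the result lands as a CSV under OutputLocation.
query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
print("Query finished with state:", state)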
I have a use case where I have 4 TB of data in HBase tables that I query through Hive tables.
Now I want to extract 5K files out of the 30 tables that I have created in Hive.
These 5K files will be created by 5K predefined queries.
Can somebody suggest what approach I should follow for this?
The time allowed for this is 15 hours.
Should I write Java code to generate all these files?
File generation is fast: out of the 5K text files, there are 50 that take around 35 minutes; the rest are created very quickly.
I have to generate zipped files and send them to the client using FTP.
If I understand your question correctly, you can accomplish your task by first exporting the query results via one of the methods from here: How to export a Hive table into a CSV file?, then compressing the files into a zip archive and FTPing them to the client. You can write a shell script to automate the process.
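A rough sketch of such an automation, shown in Python rather than shell for readability (the query list, file paths, FTP host and credentials are placeholders, and note that plain hive -e output is tab-separated rather than true CSV):

import subprocess
import zipfile
from ftplib import FTP

QUERIES = {
    "report_001": "SELECT * FROM some_table LIMIT 10",   # placeholder; the 5K predefined queries go here
    # ...
}

# 1. Export each query result to a local file via the Hive CLI.
for name, query in QUERIES.items():
    with open(f"/tmp/{name}.tsv", "w") as out:
        subprocess.run(
            ["hive", "--hiveconf", "hive.cli.print.header=true", "-e", query],
            stdout=out, check=True,
        )

# 2. Compress all exported files into one zip archive.
with zipfile.ZipFile("/tmp/export.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    for name in QUERIES:
        archive.write(f"/tmp/{name}.tsv", arcname=f"{name}.tsv")

# 3. Send the archive to the client over FTP.
with FTP("ftp.client.example.com") as ftp:                # host and login are placeholders
    ftp.login("user", "password")
    with open("/tmp/export.zip", "rb") as fh:
        ftp.storbinary("STOR export.zip", fh)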
I have a delimited file like the one below:
donaldtrump 23 hyd tedcruz 25 hyd james 27 hyd
The first set of three fields should be one record, the second set of three fields should be another record, and so on. What is the best way to load this file into a Hive table like the following (emp_name, age, location)?
A very, very dirty way to do that could be:
design a simple Perl script (or Python script, or sed command line) that takes source records from stdin, breaks them into N logical records, and pushes these to stdout (a sketch follows this list)
tell Hive to use that script/command as a custom Map step, using the TRANSFORM syntax; the manual is rather cryptic, so you had better search for some working examples
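For example, a minimal Python version of such a splitter, assuming whitespace-separated input and exactly three fields per logical record, could look like this:

#!/usr/bin/env python
# Reads TRANSFORM input from stdin, writes one logical record per line,
# tab-separated, so Hive can map it to (emp_name, age, location).
import sys

FIELDS_PER_RECORD = 3          # assumption: every logical record has exactly 3 fields

for line in sys.stdin:
    tokens = line.split()
    for i in range(0, len(tokens), FIELDS_PER_RECORD):
        chunk = tokens[i:i + FIELDS_PER_RECORD]
        if len(chunk) == FIELDS_PER_RECORD:
            sys.stdout.write("\t".join(chunk) + "\n")

It would then be wired in with something along the lines of SELECT TRANSFORM(line) USING 'python split_records.py' AS (emp_name, age, location) FROM raw_lines; where the script, table and column names are made up for the example.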
Caveat: this "streaming" pattern is rather slow, because of the necessary serialization/deserialization to plain text. But once you have a working example, the development cost is minimal.
Additional caveat: of course, if the source records must be processed in order -- because a logical record can spill onto the next row, for example -- then you have a big problem, because Hadoop may split the source file arbitrarily and feed the splits to different Mappers. And you have no criterion for a DISTRIBUTE BY clause in your example. In that case, a very-very-very dirty trick would be to compress the source file with GZIP so that it is de facto un-splittable.
I receive an input file which has 200 million records. The records are just keys.
For each record from this file (which I'll call SAMPLE_FILE), I need to retrieve all records from a database (which I'll call EVENT_DATABASE) that match the key. The EVENT_DATABASE can have billions of records.
For example:
SAMPLE_FILE
1234
2345
3456
EVENT_DATABASE
2345 - content C - 1
1234 - content A - 3
1234 - content B - 5
4567 - content D - 7
1234 - content K - 7
1234 - content J - 2
So the system will iterate through each record from SAMPLE_FILE and get all events which have the same key. For example, taking 1234 and querying the EVENT_DATABASE will retrieve:
1234 - content A - 3
1234 - content B - 5
1234 - content K - 7
1234 - content J - 2
Then I will execute some calculations using the result set, for example count, sum, and mean:
F1 = 4 (count)
F2 = 17 (sum(3+5+7+2))
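In other words, per key the calculation boils down to something like this (field layout taken from the example above):

def calculate(events):
    """events: the (key, content, value) rows returned for one key."""
    values = [value for _key, _content, value in events]
    return {
        "F1": len(values),                                   # count
        "F2": sum(values),                                   # sum
        "mean": sum(values) / len(values) if values else 0,  # mean
    }

events_for_1234 = [
    ("1234", "content A", 3),
    ("1234", "content B", 5),
    ("1234", "content K", 7),
    ("1234", "content J", 2),
]
print(calculate(events_for_1234))   # {'F1': 4, 'F2': 17, 'mean': 4.25}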
My approach is to store the EVENT_DATABASE in HBase. Then I will run a MapReduce job, and in the map phase I will query HBase, get the events, and execute the calculations.
The process can run in batch; it does not need to be real time.
Does anyone suggest another architecture? Do I really need a MapReduce job? Can I use another approach?
I personally have solved this kind of problem using MapReduce, HDFS and HBase for batch analysis. Your approach seems good for implementing your use case; I am guessing you are going to store the calculations back into HBase.
Storm could also be used to implement the same use case, but Storm really shines with streaming data and near-real-time processing rather than data at rest.
You don't really need to query HBase for every single record. In my opinion, the following would be a better approach:
Create an external table in Hive over your input file.
Create an external table in Hive for your HBase table using the Hive HBase integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration).
Do a join on both tables and retrieve the results.
Your approach would have been good if you were only querying for a subset of your input file, but since you are querying HBase for all records (200M), using a join will be more efficient.
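A sketch of those three steps driven from a script (the table names, HBase column family, HDFS location and column types are assumptions; the storage handler syntax follows the HBaseIntegration page linked above):

import subprocess

HIVE_SCRIPT = """
-- 1. External table over the input file of sample keys (HDFS location is an assumption).
CREATE EXTERNAL TABLE IF NOT EXISTS sample_keys (id STRING)
LOCATION '/data/sample_file/';

-- 2. External table mapped onto the HBase events table (column family 'cf' is assumed).
CREATE EXTERNAL TABLE IF NOT EXISTS events (id STRING, content STRING, val INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:content,cf:val")
TBLPROPERTIES ("hbase.table.name" = "event_database");

-- 3. A single join computes the per-key aggregates instead of 200M point lookups.
SELECT e.id, COUNT(*) AS f1, SUM(e.val) AS f2, AVG(e.val) AS mean
FROM sample_keys s
JOIN events e ON s.id = e.id
GROUP BY e.id;
"""

subprocess.run(["hive", "-e", HIVE_SCRIPT], check=True)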
We have a Hive table that has around 500 million rows. Each row represents a "version" of the data, and I've been tasked with creating a table which contains just the final version of each row. Unfortunately, each version of the data only contains a link to its previous version. The actual computation for deriving the final version of a row is not trivial, but I believe the example below illustrates the issue.
For example:
id | xreference
----------------
1 | null -- original version of 1
2 | 1 -- update to id 1
3 | 2 -- update to id 2-1
4 | null -- original version of 4
5 | 4 -- update to version 4
6 | null -- original version of 6
When deriving the final versions from the above table, I would expect the rows with ids 3, 5 and 6 to be produced.
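To make the expected output concrete, on this small example the derivation is just chain-following (in-memory Python for illustration only; at 500 million rows this is exactly what does not scale):

# (id, xreference) pairs from the example table above.
rows = [
    (1, None), (2, 1), (3, 2),    # chain 1 -> 2 -> 3
    (4, None), (5, 4),            # chain 4 -> 5
    (6, None),                    # a row with no later versions
]

# Map each id to the id of the row that supersedes it.
superseded_by = {xref: rid for rid, xref in rows if xref is not None}

final_versions = []
for rid, xref in rows:
    if xref is None:                        # start from each original version
        current = rid
        while current in superseded_by:     # walk the chain to its last link
            current = superseded_by[current]
        final_versions.append(current)

print(sorted(final_versions))               # [3, 5, 6]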
I implemented a solution for this in Cascading which, although correct, has an O(n^2) runtime and takes half a day to complete.
I also implemented a solution using Giraph which worked great on small data sets, but I kept running out of memory for larger sets. With this implementation I basically created a vertex for every id and an edge between each id/xreference pair.
We have now been looking into consolidating/simplifying our ETL process, and I've been asked to provide an implementation that can run as a Hive UDF. I know that Oracle provides functions for just this sort of thing, but I haven't found much in the way of Hive functions.
I'm looking for ways to implement this type of recursion specifically with Hive, but I'd welcome any suggestions.
Hadoop 1.3
Hive 0.11
Cascading 2.2
12 node development cluster
20 node production cluster