I made a job design which consists of tFileInputDelimited -> tMap -> tDBOutput(Oracle)
The CSV I am using has columns which are not currently in the table, which I didn't think should be a problem, but when I run my job I get multiple ORA-00904 "invalid identifier" errors.
I checked my DB in Oracle SQL Developer and no rows have been updated.
I'm looking for some help on how to fix this. I looked up the error and it refers me to a SQL error code, but I am not using SQL, only a CSV file to upload.
Thank you!
You say that your CSV has columns that are not in your table. That is a problem if you map those columns to the tMap output. Only those columns which are present in your target table should be in the tMap output flow going to tDBOutput.
I want to get the row count for all tables under a folder called "planning" in a Hadoop Hive database, but I couldn't figure out a way to do so. Most of these tables are not inter-linkable, so I can't use a full join with a common key.
Is there a way to get the row count of each table and output the results to one table, with each row representing one table name (something along the lines of the sketch after the list below)?
Table names that I have:
add_on
sales
ppu
ssu
car
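For illustration only, here is the kind of thing I am imagining, sketched in Spark SQL purely as an example; the result table name is my own placeholder and I have not run this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PlanningTableCounts")
  .enableHiveSupport()
  .getOrCreate()

// the tables under the "planning" database that I want to count
val tables = Seq("add_on", "sales", "ppu", "ssu", "car")

// build a UNION ALL of one COUNT(*) query per table, so each row of the
// result represents one table name with its row count
val countsSql = tables
  .map(t => s"SELECT '$t' AS table_name, COUNT(*) AS row_count FROM planning.$t")
  .mkString(" UNION ALL ")

spark.sql(countsSql).write.mode("overwrite").saveAsTable("planning.table_counts")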
Secondly, I am a SAS developer. Is the above process doable in SAS? I tried the data dictionary, but "nobs" is completely blank for this library, while all other SAS datasets display "nobs" properly. I wonder why that is and how to get around it.
I am working on a project and I am stuck on the following scenario.
I have a table: superMerge(id, name, salary)
and I have 2 other tables: table1 and table2
All the tables (table1, table2 and superMerge) have the same structure.
Now, my challenge is to insert into / update the superMerge table from table1 and table2.
table1 is updated every 10 minutes and table2 every 20 minutes, so at time t=20 minutes I have 2 jobs trying to update the same table (superMerge in this case).
I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.
The problem here is that the two jobs can't communicate with each other, not knowing what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:
Each job creates an (empty) file in a specific folder on HDFS indicating that the update/insert is in progress, and removes that file when the job is done.
Now, each job has to check whether such a file exists prior to starting the update/insert. If it exists, the job must wait until the file is gone.
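A minimal sketch of that lock-file idea, assuming both jobs can share a small helper and using the Hadoop FileSystem API (the lock path, polling interval and helper name are placeholders, not a tested implementation):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsLock {
  // placeholder location for the marker file
  private val lockPath = new Path("/locks/superMerge.lock")

  def withLock(conf: Configuration)(body: => Unit): Unit = {
    val fs = FileSystem.get(conf)
    // createNewFile only succeeds for one caller, so whichever job creates
    // the marker first holds the "lock"; the other keeps polling
    while (!fs.createNewFile(lockPath)) {
      Thread.sleep(10000)
    }
    try {
      body // run the insert/update while the marker file exists
    } finally {
      fs.delete(lockPath, false) // remove the marker when the job is done
    }
  }
}
Each job would then wrap its insert/update into superMerge in HdfsLock.withLock(...); stale locks left by crashed jobs would still need some cleanup strategy.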
Can you control the code of job1 & job2? How do you schedule them?
In general you can convert those two jobs into one that runs every 10 minutes. Once every 20 minutes this unified job runs in a different mode (merging from 2 tables), while the default mode is to merge from 1 table only.
So when you have the same driver, you don't need any synchronisation between the two jobs (e.g. locking). This solution assumes that the jobs finish in under 10 minutes.
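As a rough sketch of that mode switch (assuming the scheduler passes a run counter as an argument to the unified Spark driver's main method; mergeIntoSuperMerge is a hypothetical helper, not a real API):
// one driver, scheduled every 10 minutes; the run number decides the mode
val runNumber = args(0).toInt

if (runNumber % 2 == 0) {
  // every second run (i.e. every 20 minutes): merge from both source tables
  mergeIntoSuperMerge(Seq("table1", "table2"))
} else {
  // default mode: merge from table1 only
  mergeIntoSuperMerge(Seq("table1"))
}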
How large are your datasets? Are you planning to do it in batch (Spark) or could you stream your inserts/updates (Spark Streaming)?
Let's assume you want to do it in batch:
Launch only one job every 10 minutes that can process the two tables: if you have table1 and table2, do a union and join with superMerge, as Igor Berman suggested.
Be careful: as your superMerge table gets bigger, your join will take longer.
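A minimal sketch of that batch flow, assuming Spark 2.x with spark as the SparkSession, that id is the key, and that incoming rows should simply replace existing rows with the same id:
// read the two source tables and combine them into one set of fresh rows
val updates = spark.table("table1").union(spark.table("table2"))

val merged = spark.table("superMerge")
  .join(updates, Seq("id"), "left_anti") // keep only the old rows that were not updated
  .union(updates)                        // append every new/updated row

// write to a staging table first: Spark cannot safely overwrite a table
// that it is reading from in the same job
merged.write.mode("overwrite").saveAsTable("superMerge_staging")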
I faced this situation: write the tb1 DataFrame df1 to location1 and the tb2 DataFrame df2 to location2, and at the end just switch the paths into the super merge table. You can also do a table-to-table insert, but that consumes a lot of runtime, especially in Hive.
Overwriting to the staging locations location1 and location2:
df1.write.mode("overwrite").partitionBy("partition").parquet(location1)
df2.write.mode("overwrite").partitionBy("partition").parquet(location2)
Switching paths into the super merge table:
hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location1/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location2/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")
This way you can do the parallel merging without one job overwriting the other.
I am a newbie to Talend ETL and am using Talend Open Studio for Big Data version 6.2.
I have developed a simple Talend ETL job that picks up data from a tFileInputExcel and a tOracleInput (date dimension) and inserts data into my local Oracle database.
Below is how my package looks:
This job runs, but I get 0 rows inserted into my local Oracle database.
Your picture shows that no rows come out of your tMap component. Verify that the links inside the tMap are correct.
It seems there is no data that matches between fgf.LIBELLE_MOIS and row2.B.
I have two tables in my database, i.e. Columns and Data. The data in these tables looks like:
Columns:
ID
Name
Data:
1 John
2 Steve
Now I want to create a package which will create a CSV file like:
ID NAME
------------
1 John
2 Steve
Can we achieve the same output? I have searched on Google but I haven't found any solution.
Please help me.
You can achieve this effect through a Script Task, or you can create a temporary dataset in SQL Server where you combine the first row of your Columns table and append the data from the Data table to it. My guess is you would have to fight with metadata issues while doing anything of this sort. Another suggestion I can think of is to dump the combination to a flat file, but again you will have to take care of the metadata conversion.