I have a situation!
Ten modules take close to 2 hours each to process, which includes loading data from external files. A 20-hour run is unreasonable, and they must run sequentially because of the way the code is written. Each module runs the same set of scripts but deals with a different set of data.
Components:
a) Tables: TempTableA, FinalTableA, TempTableB, FinalTableB.
Each of these tables is uniquely identified by a module key.
The module key defaults to '-99'.
b) External files (FileA, FileB) do not have a module key; they contain only data.
c) The script knows the module key for its module.
d) A .ctr control file for SQL*Loader.
The code inside each module more or less has the following steps:
truncate table TempTableA;
sqlldr $USER/$PASSWRD@$PRD_SID control=ctr/fileA.ctr log=log/fileA.log bad=log/fileA.bad skip=1 rows=10000 silent=FEEDBACK
update TempTableA set moduleKey = $moduleKey where moduleKey = '-99';
insert into FinalTableA select * from TempTableA;
Now I can't run these modules in parallel because the temporary tables are being truncated.
Is there a better solution?
I am aware of external tables, but this isn't about using external tables; it is about how I get around the problem of using shared temporary tables. And this may not be about running parallel loads either.
You can use SQL*Loader in parallel and make it a direct path load. To do so, configure the command with DIRECT=TRUE and PARALLEL=TRUE.
Suppress the UPDATE statement; replace it with a constant in the CTL file.
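For example, here is a rough sketch of what a per-module control file and load command could look like (the column names, the data file name, and the literal key value are assumptions; the script would need to generate or parameterize one control file per module so each gets its own constant):

-- fileA.ctr (sketch), generated per module
LOAD DATA
INFILE 'fileA.dat'
APPEND
INTO TABLE TempTableA
FIELDS TERMINATED BY ','
(
  colA,
  colB,
  -- this module's key, replacing the post-load UPDATE
  moduleKey CONSTANT '17'
)

sqlldr $USER/$PASSWRD@$PRD_SID control=ctr/fileA.ctr log=log/fileA.log bad=log/fileA.bad skip=1 DIRECT=TRUE PARALLEL=TRUE silent=FEEDBACK

Note that parallel direct-path loads require APPEND and come with restrictions on index and constraint maintenance, so check the SQL*Loader documentation against your tables.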
I think you can use a hint like /*+ APPEND */ in the last statement to INSERT the data. But check your requirements to see whether the final table could be partitioned; in that case the final step wouldn't be an INSERT INTO statement but an EXCHANGE PARTITION, and this is the fastest way to upload the data to the final table.
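For instance (the partition name is an assumption; FinalTableA would have to be partitioned on moduleKey, and the exchanged tables must have matching column definitions):

-- option 1: direct-path insert from the temp table
INSERT /*+ APPEND */ INTO FinalTableA SELECT * FROM TempTableA;
COMMIT;

-- option 2: if FinalTableA is list-partitioned on moduleKey, swap the loaded
-- temp table in as that module's partition instead of copying the rows
ALTER TABLE FinalTableA EXCHANGE PARTITION p_module_17 WITH TABLE TempTableA;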
Regards from Chile.
Pável
We have a flow where GenerateTableFetch takes input from SplitJson, which provides TableName and ColumnName as arguments. Multiple tables are passed as input to GenerateTableFetch at once, and then ExecuteSql executes the queries.
Now I want to trigger a new process when all the files for a table have been processed by the downstream processors (at the end there is a PutFile).
How can I find out that all the files created for a table have been processed?
You may need NIFI-5601 to accomplish this; there is a patch currently under review at the time of this writing, and I hope to get it into NiFi 1.9.0.
EDIT: Adding potential workarounds in the meantime
If you can use ListDatabaseTables instead of getting your table names from a JSON file, you can set Include Count to true. Then you will get attributes for the table name and the count of its rows. Divide the count by the value of Partition Size in GTF (rounding up) and that will give you the number of fetches (let's call it X). Then add an attribute via UpdateAttribute called "parent" or something, and set it to ${UUID()}. Keep these attributes in the flow files going into GTF and ExecuteSQL; then you can use Wait/Notify to wait until X flow files are received, setting Target Signal Count to ${X} and using ${parent} as the Release Signal Identifier.
If you can't use ListDatabaseTables, you may be able to put ExecuteSQLRecord after your SplitJson and execute something like SELECT COUNT(*) FROM ${table.name}. If you use ExecuteSQL, you may need a ConvertAvroToJSON; if you use ExecuteSQLRecord, use a JSONRecordSetWriter. Then you can extract the count from the flow file contents using EvaluateJsonPath.
Once you have the table name and the row count in attributes, you can continue with the flow I outlined above (i.e. determine the number of flow files that GTF will generate, etc.).
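As a rough sketch of the attribute wiring (the attribute names row.count and fetch.count are mine, and 10000 stands in for whatever Partition Size you configure in GTF):

In UpdateAttribute:
parent = ${UUID()}
fetch.count = ${row.count:toNumber():plus(9999):divide(10000)}

In Wait (on the flow file carrying these attributes):
Release Signal Identifier = ${parent}
Target Signal Count = ${fetch.count}

In Notify (after PutFile, once per processed flow file):
Release Signal Identifier = ${parent}

The plus/divide expression is just an integer way of rounding count / Partition Size up.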
Hi, I don't understand why this code takes so much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
It's supposed that no bytes are transferred to the driver program, isn't it? How does read.parquet work?
What I can see from the Spark web UI is that read.parquet fires about 4000 tasks (there are a lot of Parquet files inside that folder).
The issue most likely is the file indexing that has to occur as the first step of loading a DataFrame. You said the spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit in the folders you wish to read or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it will still have to get the FileStatus for the files within the folders you query. If you add new partitions, you can either add the partition explicitly or force a full repair table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")
I've started learning the Hadoop stack for one of my projects (I'm quite a newbie in the Hadoop stack). I'm trying to figure out the best approach for an ETL process for putting data into Hive. I have a working solution, but I suppose it's not optimal and there are better options.
My Case:
I have raw data in binary files generated by a system. Before putting them on HDFS/Hive I have to parse them with a (quite complex) Unix console program into text lines of data, and then place them in a Hive table.
My current solution:
The system adds a message to Kafka that there is a new binary file waiting for processing.
I have a Python script on the Hadoop master node (at least for now):
A) Receiving Kafka messages
B) Downloading the file
C) Executing the console program
D) Saving the text output to CSV
E) Pushing the CSV file to HDFS
F) Creating a temporary table in Hive from the CSV file
G) Inserting data from the temporary table into a separate permanent ORC table
H) Deleting the temporary table
My Questions:
Is this flow optimal? Maybe there is something that could be simpler?
Is it possible to schedule/deploy/execute this Python script (or some better technology?) automatically on every Hadoop node?
Any clues about tools/options to make the whole process easy to maintain, schedule, and run efficiently?
I assume your point 2 -> D has a constant layout for the CSV. In that case, you may combine points F and H: rather than creating and dropping the table every time, you can create a template temp table once and just overwrite its data the next time.
For example:
create external table template
(
  -- your CSV schema goes here
)
row format delimited fields terminated by ','
stored as textfile;
Next, you may try the following type of load:
LOAD DATA LOCAL INPATH '%s' OVERWRITE INTO TABLE template;
This will reduce some time in your processing.
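For the remaining step G, a minimal sketch (the table name final_orc and the create-once CTAS are my assumptions):

-- create the permanent ORC table once, copying the template's schema
create table if not exists final_orc stored as orc as select * from template where 1 = 0;

-- after each LOAD DATA ... OVERWRITE into template, append to the ORC table
insert into table final_orc select * from template;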
I am not sure about Java, but I have used a lot of Python and have implemented similar requirements at my work. I never felt any challenges with Python due to its diversity and the different modules available.
If you are implementing this on a Unix box, you may use either cron or Oozie to schedule the whole automation.
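For example, if the script is written as a batch job that polls Kafka and exits, a crontab entry along these lines could drive it (all paths here are hypothetical):

# hypothetical paths: run the loader every 5 minutes and append to a log
*/5 * * * * /usr/bin/python /opt/etl/binary_loader.py >> /var/log/etl/binary_loader.log 2>&1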
I was wondering, for folks familiar with DataStage, whether Oracle SQLLDR can be used in DataStage. I have some sets of control files that I would like to incorporate into DataStage. A step-by-step way of accomplishing this would be greatly appreciated. Thanks.
My guess is that you can run it with an external stage in DataStage.
You simply put the SQLLDR command in the external stage and it will be executed.
Try it and tell me what happens.
We can use Oracle SQL*Loader in DataStage.
If you check the Oracle docs, there are two types of loading under SQL*Loader:
1) Direct path load - less validation on the database side
2) Conventional path load
There is less validation in a direct path load compared to a conventional load.
In the SQL*Loader process we have to specify things like the following (see the sketch after this list):
Direct or not
Parallel or not
Constraint and index options
Control, discard, and log files
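For illustration only (the file names, table name, and values are assumptions), these choices typically show up at the top of the control file or on the command line:

OPTIONS (DIRECT=TRUE, PARALLEL=TRUE, ERRORS=1000)
LOAD DATA
INFILE 'input.dat'
BADFILE 'input.bad'
DISCARDFILE 'input.dsc'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY ','
(col1, col2)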
In DataStage, we have the Oracle Enterprise and Oracle Connector stages.
Oracle Enterprise -
We have a load option in this stage to load data in fast mode, and we can set the OPTIONS environment variable for Oracle; an example is below:
OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)
Oracle Connector -
We have a bulk load option for it, and other properties related to SQL*Loader are available in the properties tab.
Example - the control and discard file values are all set by DataStage, but you can set these properties and others manually.
As you know, SQLLDR basically loads data from files into the database. DataStage allows you to use any input data file: you read the input from a data file stage such as a Sequential File stage, pass the format and the schema of the table, and it will create an in-memory template table. Then you use a database connector like ODBC or DB2, and that will load your data into your table, simple as that.
NOTE: if your table does not already exist at the backend, then for the first execution set the stage to create the table, and afterwards set it to append or truncate.
Steps:
Read the data from the file (Sequential File stage).
Load it using the Oracle Connector. (You could use bulk load so that the direct path load method is used via SQL*Loader; the data file and control file settings can be configured manually.) Bulk load operation: it receives records from the input link and passes them to the Oracle database, which formats them into blocks and appends the blocks to the target table, as opposed to storing them in the available free space in existing blocks.
You could refer to the IBM documentation for more details.
Remember, there might be some restrictions on loading when it comes to handling rejects, triggers, or constraints when you use bulk load. It all depends on your requirements.
I am working on a new requirement and I am new to this, so I'm seeking your help.
Requirement - from Siebel base tables (S_ORG_EXT, S_CONTACT, S_PROD_INT) I have to export data and put it into two staging tables (S1 and S2), and from these staging tables I need to create pipe-delimited .dat files that also include a row count. For staging table S1, we should have accounts with their associated contacts, and for S2, we should have accounts with their associated contacts and products.
How should I go about this? Should I use an Informatica job directly to pull data from the Siebel base tables, or run an EIM export job to get the data into EIM tables and from there into the staging tables?
Kindly help me understand which way I should go.
Access the base tables directly using Informatica, limiting the extract to only the rows and columns you need.
I'd recommend unloading these to flat files before loading them into the Staging Tables (it gives you a point of recovery if something goes wrong in your Staging Table load, and means you don't have to hit the Siebel DB again).
Then from there you can either unload the staging tables, or just use your flat file extract, to generate your delimited files with row counts.
I tend to favour modular processes, with sensible recovery points, over 'streaming' the data through for (arguably) faster execution time, so here's what I'd do (one mapping for each):
1. Unload from Base Tables to flat files.
2. Join the flat file entities as required and create new flat files in the Staging Table format.
3. Load staging tables.
4. Unload staging tables (optional, if you can get away with using the files created in Step 2).
5. Generate .dat files in pipe-delimited format with the row count.
If the loading of a staging table is only for audit purposes etc., and you can base Step 5 on the files you created in Step 2, then you could perform stage (3) concurrently with stage (5), which may reduce overall runtime.
If this is a one-off process, or you just want to write it in a hurry, you could skip writing out the flat files and just do it all in one or two mappings. I wouldn't do this though, because
a) it's harder to test and
b) there are fewer recovery points.
Cheers!