Please do advice the best way to perform the bulk data load from multiple tables to single table.
Need to pivot the data from two tables compare the same with third table and load the data to fourth table.
Related
We are stuck with a problem where-in we are trying to do a near real time sync between a RDBMS(Source) and hive (Target). Basically the source is pushing the changes (inserts, updates and deletes) into HDFS as avro files. These are loaded into external tables (with avro schema), into the Hive. There is also a base table in ORC, which has all the records that came in before the Source pushed in the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).
We are now performing a de-dupe using rank() over partitioned keys on the union of external table and base table. And then the result is then pushed into a new table, swap the names. This is taking a lot of time.
We tried using merges, acid transactions, but rank over partition and then filtering out all the rows has given us the best possible time at this moment.
Is there a better way of doing this? Any suggestions on improving the process altogether? We are having quite a few tables, so we do not have any partitions or buckets at this moment.
You can try with storing all the transactional data into Hbase table.
Storing data into Hbase table using Primary key of RDBMS table as Row Key:-
Once you pull all the data from RDBMS with NiFi processors(executesql,Querydatabasetable..etc) we are going to have output from the processors in Avro format.
You can use ConvertAvroToJson processor and then use SplitJson Processor to split each record from array of json records.
Store all the records in Hbase table having Rowkey as the Primary key in the RDBMS table.
As when we get incremental load based on Last Modified Date field we are going to have updated records and newly added records from the RDBMS table.
If we got update for the existing rowkey then Hbase will overwrite the existing data for that record, for newly added records Hbase will add them as a new record in the table.
Then by using Hive-Hbase integration you can get the Hbase table data exposed using Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
By using this method we are going to have Hbase table that will take care of all the upsert operations and we cannot expect same performance from hive-hbase table vs native hive table will perform faster,as hbase tables are not meant for sql kind of queries, hbase table is most efficient if you are accessing data based on Rowkey,
if we are going to have millions of records then we need to do some tuning to the hive queries
Tuning Hive Queries That Uses Underlying HBase Table
I am new to HDFS/HIVE. Need some advice. I have a background of RDBMS Data modelling.
I have a requirement of a daily report. The report requires fetching of data from two staging Tables(HIVE).
What if I create a table in HIVE, write a view to fetch records from staging to populate HIVE table. create a HIVE view pointing to HIVE table with where clause of selecting one-day data?
HIVE staging tables ---> 2. View to populate HIVE table --> 3. HIVE table ----> 4. View to fetch data from HIVE table created in 3.
what if I create a view on top of two staging HIVE tables (joining two tables with where clause to fetch one-day data)?
HIVE staging tables ---> 2. View to fetch data from HIVE staging tables
I want to know HIVE best practice and solution strategies.
View or not View but you need ETL process to load tables. ETL process can join, aggregate, etc, so you will be able use finally joined and aggregated data in the form star/snowflake or report table. Why do you need Views here? To reuse some common queries, to reduce complexity of some long complex queries, make interfaces to data, create logical entities, etc. You do not necessarily need View simply to join tables and load data to another table. All depends on your requirements. If reports should query data fast then data should be precalculated by ETL process. View is just wrapper over query, it will be calculated each time you query data.
I think its best if you have zero views, 1 single table, and make your partition the date field (but you can't partition on the date, so you have to store it as a string) ... this make it easier for the end user to have only 1 table... fewer tables.
This gives your users the ability to engage only the latest date they want, or leverage the full table.
I wanna know how hive partitioning works I know the concept but I am trying to understand how its working and store the in exact partition.
Let say I have a table and I have created partition on year its dynamic, ingested data from 2013 so how hive create partition and store the exact data in exact partition.
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned(eg. by year) data are stored separately in different directories. Each directory is corresponding to one year.
For a non-partitioned table, when you want to fetch the data of year=2010, hive have to scan the whole table to find out the 2010-records. If the table is partitioned, hive just go to the year=2010 directory. More faster and IO efficient
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of hash function of some column of a table.
Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time.
Consider the following scenario:
Main Control Table: 100 rows (Denormalized table with multiple processing ID's).
Set of 10 Parent Tables populated based on Control table.
Set of 10 Child Tables populated based on the Parent tables.
For daily processing:
We need to delete the data from Child tables first.
Parent Tables next.
Control table last.
Then insert data into Control table using multiple Insert Statements as it is denormalized.
Is this possible in one mapping?
One suggestion is to use SQL Transform and just execute the SQL's one after the other.
Is there an alternative way of Handling this?
I am going to create a lot of data scripts such as INSERT INTO and UPDATE
There will be 100,000 plus records if not 1,000,000
What is the best way to get this data into Oracle quickly? I have already found that SQL Loader is not good for this as it does not update individual rows.
Thanks
UPDATE: I will be writing an application to do this in C#
Load the records in a stage table via SQL*Loader. Then use bulk operations:
INSERT INTO SELECT (for example "Bulk Insert into Oracle database")
mass UPDATE ("Oracle - Update statement with inner join")
or a single MERGE statement
To keep It as fast as possible I would keep it all in the database.
Use external tables (to allow Oracle to read the file contents),
and create a stored procedure to do the processing.
The update could be slow, If possible, It may be a good idea to consider creating a new table based on all the records in the old (with updates) then switch the new & old tables around.
How about using a spreadsheet program like MS Excel or LibreOffice Calc? This is how I perform bulk inserts.
Prepare your data in a tabular format.
Let's say you have three columns, A (text), B (number) & C (date). In the D column, enter the following formula. Adjust accordingly.
="INSERT INTO YOUR_TABLE (COL_A, COL_B, COL_C) VALUES ('"&A1&"', "&B1&", to_date ('"&C1&"', 'mm/dd/yy'));"