Is it possible to write multiple Oracle database tables into one parquet file?

I have a requirement to convert my Oracle DB data to Parquet. My database has multiple tables, for example Employee and Department.
So is it possible to insert the data of both tables into a single Parquet file, or do I need to create a separate Parquet file for each table?
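For context: a single Parquet file carries one schema, so tables with different columns (like Employee and Department) are normally written to separate Parquet files; you would only combine them by first joining or unioning them into one result set. A minimal Spark SQL sketch of the per-table approach, with made-up connection details and assuming the Oracle JDBC driver is on the classpath:
-- Hypothetical JDBC details; register the Oracle table as a view.
CREATE TEMPORARY VIEW employee_src
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:oracle:thin:@//dbhost:1521/ORCL',
  dbtable 'EMPLOYEE',
  user 'scott',
  password 'tiger'
);
-- Write this table to its own Parquet output; repeat for DEPARTMENT.
CREATE TABLE employee_parquet USING parquet AS SELECT * FROM employee_src;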

Related

How to import data from parquet file to existing Hadoop table?

I have created some tables in my Hadoop cluster, and I have some Parquet files with data to load into them. How do I do this? I want to stress that I already have empty tables, created with DDL commands, and they are also stored as Parquet, so I don't need to create tables, only to import data.
You can take advantage of a Hive feature that lets you use a Parquet file to import data, even though you don't want to create a new permanent table. I'm assuming the Parquet file's schema is the same as the existing empty table; if that isn't the case, the statements below won't work as-is and you will have to select only the columns that you need.
Here, the empty table you already have is called emptyTable and lives in myDatabase. The new data you want to add is at /path/to/parquet/hdfs_path_of_parquet_file.
-- Temporary table created from the Parquet file so its schema is picked up automatically.
CREATE EXTERNAL TABLE myDatabase.my_temp_table
LIKE PARQUET '/path/to/parquet/hdfs_path_of_parquet_file'
STORED AS PARQUET
LOCATION '/path/to/parquet/';
-- Copy the rows into the existing empty table.
INSERT INTO myDatabase.emptyTable
SELECT * FROM myDatabase.my_temp_table;
-- EXTERNAL means dropping the temp table does not delete the Parquet data underneath it.
DROP TABLE myDatabase.my_temp_table;
You said you didn't want to create tables, but I think the above kind of works around that restriction.
The other option, again assuming the Parquet schema already matches the empty table you have, is:
ALTER TABLE myDatabase.emptyTable SET LOCATION '/path/to/parquet/';
This technically isn't creating a new table, but it does require altering the table you already created, so I'm not sure if that's acceptable.
You said this is a Hive thing, so I've given you a Hive answer, but really, if emptyTable's definition understands Parquet in the exact format of /path/to/parquet/hdfs_path_of_parquet_file, you could just drop that file into the folder defined by the table definition, which you can check with:
show create table myDatabase.emptyTable;
This would automatically add the data to the existing table, provided the table definition matches. Hive is schema-on-read, so you don't actually need to "import" the data, only enable Hive to "interpret" it.
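If you prefer to let Hive move the file instead of copying it into the folder by hand, LOAD DATA does the same thing. A minimal sketch, reusing the path and table name from above:
-- Moves the file into emptyTable's location; Hive does not rewrite or validate the data.
LOAD DATA INPATH '/path/to/parquet/hdfs_path_of_parquet_file' INTO TABLE myDatabase.emptyTable;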

How to compare data between two tables where one table is in Oracle and the other is in Postgres

I have two tables with the same structure, one in Oracle and the other in Postgres. I would like to compare the data between the two tables. I cannot use a DB link because of some connectivity issues.
I have copied both tables' contents to an Excel sheet, but I am still having issues comparing the data.
Please suggest a suitable option to compare the data between the two tables.

Refresh hive tables in Hive

I have a few tables in Hive, and every day a new CSV file is added to the Hive table location. When new data is available, I need to refresh the tables so that I can see the new data in them.
Steps we follow to load the data:
first, create a table with CSV SerDe properties
then create another table stored as Parquet for use in production
insert the data from the first table into the second table.
Initial:
1,a
2,b
3,c
New file:
4,d
I searched on Google and found this can be done via:
1) an incremental table: loading the new file into an incremental table and doing an insert statement. In my case we have more than 100 tables, so I don't want to create that many incremental tables
2) using the refresh command via the Impala shell.
Our initial tables are stored in CSV SerDe format, so when I do a refresh on the initial tables I get an error that Impala doesn't support SerDe properties.
Can you please provide a solution for my case?
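For reference, a minimal HiveQL sketch of the load pipeline described above, with hypothetical table names and paths (staging_csv and prod_parquet):
-- Staging table reads whatever CSV files are dropped into its location (schema on read).
CREATE EXTERNAL TABLE staging_csv (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/staging_csv/';
-- Production table stored as Parquet.
CREATE TABLE prod_parquet (id STRING, name STRING) STORED AS PARQUET;
-- Copies everything currently in the staging location; re-running it after a new file lands
-- would duplicate rows already copied, which is why the question asks about incremental loads.
INSERT INTO TABLE prod_parquet SELECT * FROM staging_csv;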

Solution for Dynamic Schemas - HIVE/AVRO

The requirement is to keep up with schema evolution for a target ORC table. I am receiving JSON events from the source. We plan to convert these to AVRO (since it supports schema evolution). Since the schema can change daily/weekly, we need to keep ingesting new JSON data files, convert them to AVRO, and store all the data (old and new) in an ORC Hive table. How do we solve this?
You can follow the approach below, which is one among many different ways to solve this.
1. Create HBASE Table
Read the AVRO data and initially create a table in HBase. (You can use Spark to do this efficiently.)
The HBase table will take care of schema evolution, even in the future.
2. Create Hive Wrapper Table
Create a Hive wrapper table (using the HBase storage handler) pointing to the HBase table. (You can read more about it here.)
3. Create ORC Table
Now create the ORC table from the wrapper table created in step 2.
4. Things you need to handle
Since Hive tables are tightly coupled with a schema, you need to handle one step before writing the data into the Hive wrapper table from step 2: identify any new columns and add them to the existing wrapper and ORC tables. This can be done with NiFi, Spark, or even a simple shell script; choose the right tool for your use case (a sketch of steps 2-4 follows below).
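As an illustration, here is a minimal HiveQL sketch of steps 2-4, assuming a hypothetical HBase table named events with a single column family d and hypothetical column names:
-- Step 2: Hive wrapper table backed by the HBase table via the HBase storage handler.
CREATE EXTERNAL TABLE events_wrapper (rowkey STRING, id STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:id,d:name')
TBLPROPERTIES ('hbase.table.name' = 'events');
-- Step 3: materialize the ORC table from the wrapper table.
CREATE TABLE events_orc STORED AS ORC AS SELECT * FROM events_wrapper;
-- Step 4: when a new field shows up in the incoming events, add it to the ORC table
-- (and extend the wrapper table's column list and hbase.columns.mapping accordingly).
ALTER TABLE events_orc ADD COLUMNS (new_col STRING);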

Is it possible to join a hive table with oracle table?

I have a problem writing a query using HiveQL.
Is it possible to join a hive table with oracle table?
if yes how?
if no why?
To access data stored in your Hive tables from Oracle, including joining on them, you will need an Oracle Big Data connector.
From the documentation:
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other database-resident data. If required, you can also load data into the database using SQL.
You first access Hive tables from Oracle Database via external tables. The external table definition is generated automatically from the Hive table definition, and the Hive table data can be accessed by querying this external table. The data can be queried with Oracle SQL and joined with other tables in the database.
In other words, you keep using the Hive table as it is and simply access it from Oracle Database through the generated external table.
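For illustration, once the connector has generated an external table for the Hive table (call it employee_hive_ext, a hypothetical name), the join is plain Oracle SQL:
-- Join the Hive-backed external table with a regular Oracle table.
SELECT e.emp_id, e.emp_name, d.dept_name
FROM employee_hive_ext e
JOIN departments d ON d.dept_id = e.dept_id;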
