Solution for Dynamic Schemas - HIVE/AVRO - hadoop

The requirement is to keep up with the schema evolution for target ORC table. I am receiving JSON events from source. We plan to convert these to AVRO (since it supports schema evolution). Since schema can change daily/weekly, we need to keep ingesting new data JSON files, convert them to AVRO and store all the data (old/new) in an ORC hive table. How do we solve for this?

You can follow below approach, which is one among many different ways that you can implement to solve this.
1. Create HBASE Table
Read the AVRO data and create table in HBASE initially.( You can use spark to do this efficiently)
HBASE table will take care of schema evolution even in the future.
2. Create Hive Wrapper Table
Create a hive wrapper table (storage handlers) pointing to the HBASE table. (You can read more about it here
3. Create ORC Table
Now create ORC table from the table created in step2
4. Things you need to handle
Since Hive tables are tightly coupled with a schema, you need to handle a step before writing the data into Hive wrapper table in step 2. You need to identify the new columns here and then add the columns appropriately to the existing wrapper or ORC table. This again can be achieved by NiFi or Spark or as simple as a shell script. Choose the right tools according to your use case.

Related

How to import data from parquet file to existing Hadoop table?

I have created some tables in my Hadoop cluster, and I have some parquet tables with data to put it in. How do I perform this? I want to stress, that I already have empty tables, created with some DDL commands, and they are also stored as parquet, so I don't have to create tables, only to import data.
You should take advantage of a hive feature that enables you to use parquet to import data. Even if you don't want to create a new table. I think it's implied that the parquet table schema is the same as the existing empty table. If this isn't the case then below won't work as is. You will have to select the columns that you need. There
Here the table that you already have this is empty is called emptyTable located in myDatabase. The new data you want to add is located /path/to/parquet/hdfs_path_of_parquet_file
CREATE TABLE myDatabase.my_temp_table
LIKE PARQUET '/path/to/parquet/hdfs_path_of_parquet_file'
STORED AS PARQUET
LOCATION '/path/to/parquet/';
INSERT INTO myDatabase.emptyTable as
SELECT * from myDatabase.my_temp_table;
DELETE TABLE myDatabase.my_temp_table;
You said you didn't want to create tables but I think the above kinda cheats around your ask.
The other option again assuming the schema for parquet is already the same as the table definition that is empty that you already have:
ALTER TABLE myDatabase.emptyTable SET LOCATION '/path/to/parquet/';
This technically isn't creating a new table but does require altering you table you already created so I'm not sure if that's acceptable.
You said this is a hive things so I've given you hive answer but really if emptyTable table definition understands parquet in the exact format that you have the /path/to/parquet/hdfs_path_of_parquet_file in you could just drop this file into the folder defined by the table definition:
show create table myDatabase.emptyTable;
This would automatically add the data to the existing table. Provided the table definition matched. Hive is Schema on read so you don't actually need to "import" only enable hive to "interpret".

De-duplication from two hive tables

We are stuck with a problem where-in we are trying to do a near real time sync between a RDBMS(Source) and hive (Target). Basically the source is pushing the changes (inserts, updates and deletes) into HDFS as avro files. These are loaded into external tables (with avro schema), into the Hive. There is also a base table in ORC, which has all the records that came in before the Source pushed in the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).
We are now performing a de-dupe using rank() over partitioned keys on the union of external table and base table. And then the result is then pushed into a new table, swap the names. This is taking a lot of time.
We tried using merges, acid transactions, but rank over partition and then filtering out all the rows has given us the best possible time at this moment.
Is there a better way of doing this? Any suggestions on improving the process altogether? We are having quite a few tables, so we do not have any partitions or buckets at this moment.
You can try with storing all the transactional data into Hbase table.
Storing data into Hbase table using Primary key of RDBMS table as Row Key:-
Once you pull all the data from RDBMS with NiFi processors(executesql,Querydatabasetable..etc) we are going to have output from the processors in Avro format.
You can use ConvertAvroToJson processor and then use SplitJson Processor to split each record from array of json records.
Store all the records in Hbase table having Rowkey as the Primary key in the RDBMS table.
As when we get incremental load based on Last Modified Date field we are going to have updated records and newly added records from the RDBMS table.
If we got update for the existing rowkey then Hbase will overwrite the existing data for that record, for newly added records Hbase will add them as a new record in the table.
Then by using Hive-Hbase integration you can get the Hbase table data exposed using Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
By using this method we are going to have Hbase table that will take care of all the upsert operations and we cannot expect same performance from hive-hbase table vs native hive table will perform faster,as hbase tables are not meant for sql kind of queries, hbase table is most efficient if you are accessing data based on Rowkey,
if we are going to have millions of records then we need to do some tuning to the hive queries
Tuning Hive Queries That Uses Underlying HBase Table

create a Parquet backed Hive table by using a schema file

Cloudera documentation, shows a simple way to "create a Avro backed Hive table by using an Avro schema file." This works great. I would like to do the same thing for a Parquet backed Hive table, but the relevant documentation in this case lists out every column type rather than reading from a schema. Is it possible to read the Parquet columns from a schema, in the same way as Avro data?
Currently, the answer appears to be no. There is an open issue with Hive.
https://issues.apache.org/jira/browse/PARQUET-76
The issue has been active recently, so hopefully in the near future Hive will offer the same functionality for Parquet as it does for Avro.

Why we need to move external table to managed hive table?

I am new to Hadoop and learning Hive.
In Hadoop definative guide 3rd edition page no. 428 last paragraph
I don't understand below paragraph regarding external table in HIVE.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody explain briefly what above phrase says?
Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (like modifying some columns adding columns, making aggregation etc) and to store it in a specific way (partitions / buckets / sorted etc) so that the queries would benefit from these optimizations.
The key difference between external and managed table in Hive is that data in the external table is not managed by Hive.
When you create external table you define HDFS directory for that table and Hive is simply "looking" in it and can get data from it but Hive can't delete or change data in that folder. When you drop external table Hive only deletes metadata from its metastore and data in HDFS remains unchanged.
Managed table basically is a directory in HDFS and it's created and managed by Hive. Even more - all operations for removing/changing partitions/raw data/table in that table MUST be done by Hive otherwise metadata in Hive metastore may become incorrect (e.g. you manually delete partition from HDFS but Hive metastore contains info that partition exists).
In Hadoop definative guide I think author meant that it is a common practice to write MR-job that produces some raw data and keeps it in some folder. Than you create Hive external table which will look into that folder. And than safelly run queries without the risk to drop table etc.
In other words - you can do MR job that produces some generic data and than use Hive external table as a source of data for insert into managed tables. It helps you to avoid creating boring similar MR jobs and delegate this task to Hive queries - you create query that takes data from external table, aggregates/processes it how you want and puts the result into managed tables.
Another goal of external table is to use as a source data from remote servers, e.g. in csv format.
There is no reason to move table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in future, better use current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed
tables
DROP deletes data for managed tables while it only deletes
metadata for external ones
ACID/Transactional only works for
managed tables
Query Results Caching only works for managed
tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing in common with external/managed table type. If you want to change structure you do not necessarily need to change table managed/external type
It is also mentioned in the book.
when your table is external table.
you can use other technologies like PIG,Cascading or Mapreduce to process it .
You can also use multiple schemas for that dataset.
and You can also create data lazily if it is external table.
when you decide that dataset should be used by only Hive,make it hive managed table.

Hadoop & Hive as warehouse: daily data deliveries

I am evaluating the combination of hadoop & hive (& impala) as a repolacement for a large data warehouse. I already set up a version and performance is great in read access.
Can somebody give me any hint what concept should be used for daily data deliveries to a table?
I have a table in hive based on a file I put into hdfs. But now I have on a daily basis new transactional data coming in.
How do I add them ti the table in hive.
Inserts are not possible. HDFS cannot append. So whats the gernal concept I need to follow.
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows for data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date based searches and gives you the ability to use options 1, 2, & 3 at either the table or partition level.
Inserts are not possible
Inserts are possible ,like you can create a new table and insert the data from new table to old table.
But simple solution is You can load data of the file into Hive table with the below command.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite then only existing data replced with new data otherwise It is appending only.
You can even schedule the script by creating a shell script.

Resources