How Hive Partition works - hadoop

I wanna know how hive partitioning works I know the concept but I am trying to understand how its working and store the in exact partition.
Let say I have a table and I have created partition on year its dynamic, ingested data from 2013 so how hive create partition and store the exact data in exact partition.

If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned(eg. by year) data are stored separately in different directories. Each directory is corresponding to one year.
For a non-partitioned table, when you want to fetch the data of year=2010, hive have to scan the whole table to find out the 2010-records. If the table is partitioned, hive just go to the year=2010 directory. More faster and IO efficient

Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of hash function of some column of a table.
Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time.

Related

De-duplication from two hive tables

We are stuck with a problem where-in we are trying to do a near real time sync between a RDBMS(Source) and hive (Target). Basically the source is pushing the changes (inserts, updates and deletes) into HDFS as avro files. These are loaded into external tables (with avro schema), into the Hive. There is also a base table in ORC, which has all the records that came in before the Source pushed in the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).
We are now performing a de-dupe using rank() over partitioned keys on the union of external table and base table. And then the result is then pushed into a new table, swap the names. This is taking a lot of time.
We tried using merges, acid transactions, but rank over partition and then filtering out all the rows has given us the best possible time at this moment.
Is there a better way of doing this? Any suggestions on improving the process altogether? We are having quite a few tables, so we do not have any partitions or buckets at this moment.
You can try with storing all the transactional data into Hbase table.
Storing data into Hbase table using Primary key of RDBMS table as Row Key:-
Once you pull all the data from RDBMS with NiFi processors(executesql,Querydatabasetable..etc) we are going to have output from the processors in Avro format.
You can use ConvertAvroToJson processor and then use SplitJson Processor to split each record from array of json records.
Store all the records in Hbase table having Rowkey as the Primary key in the RDBMS table.
As when we get incremental load based on Last Modified Date field we are going to have updated records and newly added records from the RDBMS table.
If we got update for the existing rowkey then Hbase will overwrite the existing data for that record, for newly added records Hbase will add them as a new record in the table.
Then by using Hive-Hbase integration you can get the Hbase table data exposed using Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
By using this method we are going to have Hbase table that will take care of all the upsert operations and we cannot expect same performance from hive-hbase table vs native hive table will perform faster,as hbase tables are not meant for sql kind of queries, hbase table is most efficient if you are accessing data based on Rowkey,
if we are going to have millions of records then we need to do some tuning to the hive queries
Tuning Hive Queries That Uses Underlying HBase Table

Can I directly consider the Hive partition columns similar to the partitions columns present in source (Teradata) tables?

Can I directly consider the Hive partition columns similar to the partitions columns present in my source (Teradata) tables? or do I have consider any other parameters to decide the Hive partitioning columns ? Please help.
This is not best practice. if you create data in this manner then a person who is trying to access HDFS data directly will not find 'partition columns' in each partition. For example say Teradata table is partitioned by date column then if hive table is also partitioned by date then HDFS partition say 2016-08-06 will not have date field. So to make it easy for end user partition by a dummy column say date_d which will exactly same values as date column.
Abstractly, partitioning in Teradata and Hive are similar.To begin
with you can probably use the same columns as in your source to
partition the tables.
If you data size is huge in each single partition, then consider
partitioning it further, to improve the performance.The multilevel
partitioning would mostly depend on the number of filters you apply
on your queries.

comparing data in two tables taking time

I need to query table1 find all orders and created date ( key is order number an date)).
In table 2 ( key is order number an date) Check if the order exists for a a date.
For this i am scanning table 1 and for each record checking if it exists in table 2. Any better way to do this
In this situation in which your key is identical for both tables, it makes sense to have a single table in which you store both data for Table 1 and Table 2. In that way you can do a single scan on your data and know straight away if the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan. For example, in the case where you will not be populating rows at all in Table 2, you would simply use a ColumnPrefixFilter
If, however, you do need to keep this data separately in 2 tables, you could pre-split the tables with the same region boundaries for both tables - this will be helpful when you do the query that you are aiming for - load all rows in Table 1 when row exists in Table 2. Essentially this would be a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper will have corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map side join)

Hive difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS and insert overwrite with PARTITIONED and CLUSTER BY?

I've seen some good explanations of creating a table with partitions which are CLUSTERED BY and SORTED BY. How does this compare with creating a table with partitions, then populating the table (with INSERT OVERWRITE for instance) using CLUSTER BY? Is the CLUSTER BY a persistent sort within the table?
Even if INSERT OVERWRITE + CLUSTER BY would produce table with persistently sorted data there is no way to tell hive that data is already sorted other than create CLUSTERED BY table. you can benefit from sorted data (sort-merge-join for example) only when the Hive knows about it and therefore can optimize the query. The data is not necessarily written to the disk in the same order it was produced or passed to the writer unless you specified that table is clustered(sorted). Usual (heap) tables are not sorted in theory. Writer process does not write data in the same order that the input because it is buffered (deferred write) and parallel.

Can't drop Oracle index partition -- any alternative besides dropping entire index and rebuilding?

So, I have a .NET program doing batch loading of records into partitioned tables using array bound stored procedure calls via Oracle ODP.NET, but that's neither here nor there.
What I would like to know is: because I have a partitioned index on said tables, the speed of the batch load is pretty slow. I fully understand that I cannot drop an index partition, but I would obviously prefer not to have to drop and rebuild the entire index since that will take considerably more time to execute. Is this my only recourse?
Is there a fairly simple way to drop the partition itself and then rebuild the partition and index partition that would save time and go about accomplishing my goal?
Are you loading an entire partition at once? Or are you merely adding new rows to an existing partition? Are all the indexes equipartitioned with the table?
Normally, if you are loading data into a partitioned table, your partitioning scheme is chosen so that each load will put data into a fresh partition. If that is the case, you can use partition exchange to load the data. In a nutshell, you load data into an (unindexed) staging table whose structure matches the real table, you create the indexes to match the indexes on the real table, and then do
ALTER TABLE partitioned_table
EXCHANGE PARTITION new_partition_name
WITH TABLE staging_table_name
WITHOUT VALIDATION;

Resources