I am new to Hive, and to the Hadoop ecosystem in general. From what I have learnt of the basics of Hive, you can create partitions on a Hive table based on certain attributes. If a query mentions that attribute, it should get a performance boost, because Hive scans only that particular partition instead of the whole table. My question is: suppose the data has some hierarchical structure. Say I partition a table on unique state values, so that every time a query filters on state, Hive scans only that state's partition instead of the whole table. However, every state also has unique district names. If I query only on district values, would Hive scan the whole table?
If so, is there some way to change the query so that I can manually instruct Hive to read only the state partition to which the district belongs, and then perform the other operations on that partition alone, instead of scanning the whole table for matching district values?
One of the strengths of Hive is that it has strong support for partitioning. However, it cannot read your mind when you write queries.
If you have a partition on state, then you need state in the where clause for partition pruning. So, if you query only on district, the whole table would be scanned.
If you have a partition on district, then you need the district. A query on state would scan the whole table.
If you have a partition on both . . . well, then it is a little more complicated to declare, but your queries would read a minority of partitions with either state or district.
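To make the pruning behaviour concrete, here is a minimal Python sketch, not Hive itself. It models a table partitioned by state as a dictionary of per-state row lists. The sample data, the `scan` helper and the `district_to_state` lookup are all made up for illustration.

```python
# Toy model of a table partitioned by state (illustrative data only).
partitions = {
    "state=KA": [{"district": "Mysuru", "sales": 10}, {"district": "Udupi", "sales": 7}],
    "state=TN": [{"district": "Salem", "sales": 4}],
    "state=MH": [{"district": "Pune", "sales": 9}, {"district": "Nashik", "sales": 3}],
}


def scan(district, state=None):
    """Return (matching rows, number of partitions read)."""
    if state is not None:
        # Partition key in the filter: only that partition is read.
        chosen = [f"state={state}"]
    else:
        # No partition key in the filter: every partition must be read.
        chosen = list(partitions)
    rows = [r for p in chosen for r in partitions.get(p, []) if r["district"] == district]
    return rows, len(chosen)


# Hypothetical lookup kept by the caller so the state predicate can be added.
district_to_state = {"Mysuru": "KA", "Udupi": "KA", "Salem": "TN", "Pune": "MH", "Nashik": "MH"}

rows_full, read_full = scan("Pune")
rows_pruned, read_pruned = scan("Pune", state=district_to_state["Pune"])
print(read_full, read_pruned)  # 3 1
```

Both calls return the same row, but only the second one names the partition key, which is the equivalent of adding the state to the where clause next to the district.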
If you are just learning about partitions, I would advise you to start with date partitions. These are the most common and a good way to get familiar with the concept.
Related
I have 30 million records in a table, but finding a single record takes too much time. Could you suggest how I should generate the row key so that records can be fetched fast?
Right now I use an auto-increment ID (1, 2, 3 and so on) as the row key. What steps do I need to take to improve performance? Let me know your thoughts.
Generally, when tuning the performance of a structured SQL table, we follow some basic steps: apply proper indexes to the columns used in queries, apply proper logical partitioning or bucketing to the table, and give the buffer enough memory for complex operations.
When it comes to big data, and especially Hadoop, the real problems come from context switching between hard disk and buffer, and from context switching between different servers. You need to reduce this context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query structure and try to improve performance.
If you are using an integer row key, it is going to give the best performance, but always create the row key/index when the table is created, because adding it later hurts performance.
When creating external tables in Hive / Impala against HBase tables, map the HBase row key to a string column in Hive / Impala. If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table. Use BETWEEN or =, <, >=.
If you are not filtering on the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of data.
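The difference between a range filter and a pattern match on the row key can be sketched in plain Python. This is an illustration only, not HBase or Impala code. It assumes row keys are stored in sorted order and uses made-up, zero-padded keys so that string order matches numeric order; `range_scan` and `contains_scan` are hypothetical helpers.

```python
import bisect

# Row keys kept in sorted order (sample keys are made up).
row_keys = sorted(f"user{n:04d}" for n in range(1, 1001))


def range_scan(keys, start, stop):
    """Keys in [start, stop): two binary searches, then a contiguous slice."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return keys[lo:hi]


def contains_scan(keys, fragment):
    """Substring match, similar to LIKE '%x%': every key has to be inspected."""
    return [k for k in keys if fragment in k]


in_range = range_scan(row_keys, "user0100", "user0110")
matched = contains_scan(row_keys, "0100")
print(len(in_range), len(matched))  # 10 1
```

The range scan touches only the slice between the two boundaries, while the substring match has to walk all 1000 keys to find its single hit.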
I come from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In data warehouse projects we generally divide the feed into facts and dimensions. Ex:
I am completely new to Hadoop and I have learnt that I can build a data warehouse in Hive. I am familiar with using a GUID, which I think is applicable as a primary key in Hive. So, is the strategy below the right way to load facts and dimensions in Hive?
Load source data into a Hive table; let's say Sales_Data_Warehouse
Generate dimensions from Sales_Data_Warehouse; ex:
SELECT New_Guid(), Customer_Name, Customer_Address From Sales_Data_Warehouse
When all dimensions are done then load the fact table like
SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key...
FROM Sales_Data_Warehouse AS source
JOIN Customer_Dimension AS Customer
  ON source.Customer_Name = Customer.Customer_Name
  AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS Store
  ON Store.Store_Name = source.Store_Name
JOIN Product_Dimension AS Product ON .....
Is this the way I should load my fact and dimension tables in Hive?
Also, in general warehouse projects we need to update dimension attributes (ex: Customer_Address is changed to something else) or update a fact table foreign key (rarely, but it does happen). So, how can I do an INSERT-UPDATE load in Hive (like a Lookup in SSIS or a MERGE statement in T-SQL)?
We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.
The Hadoop File System is immutable. We can only add data, not update it. As a result we can only append records to dimension tables (Hive has added an update feature and transactions, but this seems to be rather buggy). Slowly Changing Dimensions become the default behaviour on Hadoop. To get the latest, most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
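The first option, a view over an append-only dimension table, might look roughly like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for Hive so that it can be run anywhere (window functions need SQLite 3.25 or later). The `loaded_at` column and the view name `customer_dimension_latest` are assumptions for illustration and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Append-only dimension table: an address change is a new row, not an update.
    CREATE TABLE customer_dimension (
        customer_name TEXT,
        customer_address TEXT,
        loaded_at TEXT
    );
    INSERT INTO customer_dimension VALUES
        ('Alice', '1 Old Street', '2020-01-01'),
        ('Alice', '9 New Avenue', '2021-06-01'),
        ('Bob',   '5 High Road',  '2020-03-15');

    -- View exposing only the most recent row per customer.
    CREATE VIEW customer_dimension_latest AS
    SELECT customer_name, customer_address, loaded_at
    FROM (
        SELECT customer_name, customer_address, loaded_at,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_name
                   ORDER BY loaded_at DESC
               ) AS rn
        FROM customer_dimension
    ) ranked
    WHERE rn = 1;
""")

latest = conn.execute(
    "SELECT customer_name, customer_address FROM customer_dimension_latest "
    "ORDER BY customer_name"
).fetchall()
print(latest)  # [('Alice', '9 New Avenue'), ('Bob', '5 High Road')]
```

Readers of the view always see one row per customer, while the full history stays in the underlying table.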
The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster. This makes it relatively cheap to join very large tables, because no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS, tables are split into big chunks and distributed across the nodes of the cluster. We don't have any control over how individual records and their keys are spread across the cluster. As a result, joins between two very large tables are quite expensive on Hadoop, as data has to travel across the network. We should avoid joins where possible. For a large fact table and a large dimension table, we can de-normalise the dimension table directly into the fact table. For two very large transaction tables, we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
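The nesting idea can be sketched in a few lines of plain Python. This is only an illustration of the data shape, not Hive or BigQuery code. The `orders` and `order_lines` tables and their fields are invented for the example.

```python
from collections import defaultdict

# Illustrative parent and child rows (orders and their line items).
orders = [{"order_id": 1, "customer": "Alice"}, {"order_id": 2, "customer": "Bob"}]
order_lines = [
    {"order_id": 1, "product": "pen", "qty": 2},
    {"order_id": 1, "product": "ink", "qty": 1},
    {"order_id": 2, "product": "pad", "qty": 5},
]

# Nest once, at load time: each parent row carries an array of its child rows.
lines_by_order = defaultdict(list)
for line in order_lines:
    lines_by_order[line["order_id"]].append({"product": line["product"], "qty": line["qty"]})
nested = [{**o, "lines": lines_by_order[o["order_id"]]} for o in orders]

# Flatten at query time: no join, just expand the array inside each row.
flat = [
    (o["order_id"], o["customer"], l["product"], l["qty"])
    for o in nested
    for l in o["lines"]
]
print(flat)  # [(1, 'Alice', 'pen', 2), (1, 'Alice', 'ink', 1), (2, 'Bob', 'pad', 5)]
```

Because each parent row already holds its children, flattening works on one row at a time and needs no lookup in a second table.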
I would also question the usefulness of surrogate keys. Why not use the natural key? Maybe performance for complex compound keys may be an issue but otherwise surrogate keys are not really useful and I never use them.
I assume the answer is "no" in this scenario, but I figured I'd ask and see if there was something I was missing:
I have an Oracle table which is partitioned for ease of data loading -- data is loaded into six separate tables and then partition-switched into the main table. The only thing differentiating these loading tables is the source of the data, so each one has a unique datasource column which is used to partition the main table. We occasionally have some ad hoc queries which look at this datasource in the main table, but the standard reports querying this table ignore this column entirely. Nothing insert/update/deletes individual records from this table, so there's no concern about updating any indexes.
In this case, is there any reason to use local indexes instead of global ones?
A local index makes a lot of sense - if you use partitioning for performance reasons.
If your queries always contain the partition key, then Oracle will only scan that specific partition (this is known as "partition pruning").
If you then have additional conditions that would benefit from an index lookup, the database only needs to check the local index, which is much smaller than a global index, and thus the lookup will be faster.
In your case, if you never (or almost never) include the partition key in the queries, you are right that the local index wouldn't be helpful.
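A toy Python model of the trade-off, assuming one dictionary per index (this is not how Oracle stores indexes, and the `order_no` column and `lookup_local` helper are invented for the example):

```python
# Toy data: rows partitioned by a datasource column, as in the question.
rows = [
    {"datasource": "A", "order_no": 101},
    {"datasource": "B", "order_no": 102},
    {"datasource": "C", "order_no": 103},
    {"datasource": "A", "order_no": 104},
]

# One global index over the whole table: order_no -> row.
global_index = {r["order_no"]: r for r in rows}

# One local index per partition: datasource -> (order_no -> row).
local_indexes = {}
for r in rows:
    local_indexes.setdefault(r["datasource"], {})[r["order_no"]] = r


def lookup_local(order_no, datasource=None):
    """Return (matching rows, number of local indexes probed)."""
    # Without the partition key, every local index has to be probed.
    targets = [datasource] if datasource is not None else list(local_indexes)
    hits = [local_indexes[ds][order_no] for ds in targets if order_no in local_indexes.get(ds, {})]
    return hits, len(targets)


with_key = lookup_local(103, datasource="C")
without_key = lookup_local(103)
print(with_key[1], without_key[1])  # 1 3
```

With the partition key a single small index is probed. Without it all three are probed, whereas `global_index[103]` needs one probe either way.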
I have a question on Hive view partitions.
I have a base table which is partitioned on a date field. My view is a simple view which does a select * from the base table.
My question is: would the view be partition aware when it is queried by the end user, or do I need to execute any other commands for the view to use the partitions?
I ask because of the following statement on this topic at https://cwiki.apache.org/confluence/display/Hive/PartitionedView:
1.One possible approach mentioned in HIVE-1079 is to infer view partitions automatically based on the partitions of the underlying tables. A command such as SHOW PARTITIONS could then synthesize virtual partition descriptors on the fly. This is fairly easy to do for use case #1, but potentially very difficult for use cases #2 and #3. So for now, we are punting on this approach.
Regards,
Nish
At my prior engagement we used views extensively and all of our tables were partitioned. We relied on the ability of the hive query planner to perform proper partition pruning in these views and it did so successfully. In fact there were several edge cases/complicated scenarios that required updates to the hive source code by Hortonworks. But in the general/simpler cases the partition pruning was working.
My HBase row keys are different, and I also need to aggregate the data and store it separately. Which approach is best in this use case?
What is the best approach: creating multiple HBase tables, or multiple column families in a single HBase table?
I am refining my question.
Below is my use case.
I am processing weblogs which contain retailer, category and product clicks.
I am storing the weblogs above in one HBase table (Log) with separate row keys and the same column family.
Ex.
A.
for Retailer -- IP | DateTime | Sid | Retailer
B.
for Category -- IP | DateTime | Sid | Retailer | Category
C.
for Product -- IP | DateTime | Sid | Retailer | Category |Product
From the table above I am calculating day clicks and storing them in other HBase tables (Retailer_Day_cnt, Category_Day_Cnt, Product_Day_Cnt).
My question is: what is the best way to store the data in HBase for cases 1 and 2 above, separate HBase tables or column families?
Note: In case 1 I am doing only writes, but in case 2 I will do multiple reads and writes.
Thanks in advance
Surendra
From a performance perspective, the fewer column families the better. All the column families in a table are flushed at the same time, even if some of them hold very little data, which makes flushes less efficient. If your table is write-heavy, this results in a lot of HFiles, which means more compactions and more GC pauses, and that can make the whole of HBase very slow. So don't use multiple column families unless you really need them or all the column families will hold about the same amount of data.
Find more details here:
Hbase Book
Similar question
This depends on your use case.
If you have the same row key but different data, you can divide the data into different column families. But if the row keys are different, put the data into different tables.
This will also depend on whether you have a single write and multiple reads (i.e. low write throughput is OK) or you want high write throughput, and on how your data is distributed. If one column family holds a lot of data (in size) compared to the rest, it is better to put the column families into different tables.
If you give more details on your use case I can be more specific.
Row key design is the main challenge in these scenarios.
If you are able to design your row key so that you can use it for all of your purposes, then you may proceed with different column families; otherwise multiple tables would be the only option. In your case, it seems you are storing aggregated results in the second table, which must have a different logical row key. So you should go with the two-table approach, where the first table stores all the inputs (write once, read multiple times) and the second table stores the processed/aggregated data.
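A rough Python sketch of the two-table layout, with plain dictionaries standing in for HBase tables. The raw row key follows the `IP | DateTime | Sid | Retailer` layout from the question. The `Retailer | Day` key for the aggregated table is an assumption, as are the sample clicks and helper names.

```python
from collections import Counter

# Sample click events (made-up values) with the fields named in the question.
clicks = [
    {"ip": "10.0.0.1", "ts": "2014-05-01T10:00:00", "sid": "s1", "retailer": "acme"},
    {"ip": "10.0.0.2", "ts": "2014-05-01T11:30:00", "sid": "s2", "retailer": "acme"},
    {"ip": "10.0.0.1", "ts": "2014-05-02T09:15:00", "sid": "s3", "retailer": "bolt"},
]


def log_row_key(c):
    """Row key for the raw Log table: IP | DateTime | Sid | Retailer."""
    return "|".join([c["ip"], c["ts"], c["sid"], c["retailer"]])


def day_count_row_key(c):
    """Assumed row key for Retailer_Day_cnt: Retailer | Day."""
    return "|".join([c["retailer"], c["ts"][:10]])


# Table 1: write-once raw events, keyed per click.
log_table = {log_row_key(c): c for c in clicks}

# Table 2: aggregated counts, keyed by a different logical row key.
retailer_day_cnt = Counter(day_count_row_key(c) for c in clicks)
print(dict(retailer_day_cnt))  # {'acme|2014-05-01': 2, 'bolt|2014-05-02': 1}
```

The two key functions share no common prefix, which is why the raw and aggregated data cannot sit side by side as column families of a single row.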