How does indexing work internally in Hive? - hadoop

An index is essentially a pointer to a particular column of a table; creating an index means creating that pointer on the column. If a column in a table is indexed, how is the data of that column pointed to when that specific column is queried?

From the documentation:
The goal of Hive indexing is to improve the speed of query lookup on
certain columns of a table. Without an index, queries with predicates
like 'WHERE tab1.col1 = 10' load the entire table or partition and
process all the rows. But if an index exists for col1, then only a
portion of the file needs to be loaded and processed. The improvement
in query speed that an index can provide comes at the cost of
additional processing to create the index and disk space to store the
index.
Behind the scenes, Hive essentially creates a map from the values of the indexed column to the files and offsets in HDFS where the corresponding data is located. That way, Hive does not need to scan all the data to search for a certain value. Here is a good article explaining the basic concepts:
https://acadgild.com/blog/indexing-in-hive/
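As an illustration (not part of the original answer), here is a minimal hedged sketch of that mechanism; the table tab1, the column col1 and the index name are placeholders. Hive materializes a compact index as a separate table mapping each indexed value to the HDFS file and block offsets holding it:
-- Build a compact index on col1; DEFERRED REBUILD lets you populate it later.
CREATE INDEX idx_col1 ON TABLE tab1 (col1) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX idx_col1 ON tab1 REBUILD;
-- Hive stores the index in its own table (typically named default__tab1_idx_col1__);
-- each row pairs a col1 value with the file (_bucketname) and offsets (_offsets) where it occurs.
SELECT * FROM default__tab1_idx_col1__ LIMIT 10;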

Related

How to coalesce large partitioned data into single directory in Spark/Hive

I have a requirement:
Huge data is partitioned and inserted into Hive. To bind this data, I am using DF.coalesce(10). Now I want to bind this partitioned data into a single directory. If I use DF.coalesce(1), will the performance decrease, or is there another way to do it?
From what I understand, you are trying to ensure that there are fewer files per partition. So, by using coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL"), where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive. df.repartition($"COL")
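For illustration (not from the original answer), a minimal Spark sketch of the suggestion above; df, COL and the output path are placeholders:
// Hedged sketch: repartition by the Hive partition column so each task writes
// data for (ideally) one partition, keeping the number of output files low.
import org.apache.spark.sql.functions.col
val repartitioned = df.repartition(col("COL"))
repartitioned.write.mode("append").partitionBy("COL").orc("path/to/hive/table/")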

Hive partitioned column doesn't appear in rdd via sc.textFile

The Hive partition column is not part of the underlying saved data. I need to know how it can be pulled in via the sc.textFile(filePath) syntax and loaded into an RDD.
I know the other way, creating a Hive context and so on, but I was wondering whether there is a way to get it directly via the sc.textFile(filePath) syntax and use it.
When the data is partitioned by a column on save, that column's data is stored in the file structure (the directory names) and not in the actual files. Since sc.textFile(filePath) is made for reading single files, I do not believe it supports reading partitioned data.
I would recommend reading the data as a dataframe, for example:
val df = hiveContext.read.format("orc").load("path/to/table/")
The wholeTextFiles() method could also be used. Then you would get a tuple of (file path, file data) and from that it should be possible to parse out the partitioned data column and then add it as a new column.
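A hedged sketch of that wholeTextFiles() approach, assuming the usual Hive layout where each partition directory is named colA=<value> (the path and column name are placeholders):
// Each element is (file path, file contents); the partition value is recovered from
// the directory name (e.g. .../colA=foo/part-00000) rather than from the file data.
val raw = sc.wholeTextFiles("path/to/table/*")
val withColA = raw.flatMap { case (path, contents) =>
  val colA = path.split("/").collectFirst { case p if p.startsWith("colA=") => p.stripPrefix("colA=") }
  contents.split("\n").map(line => (colA.getOrElse(""), line))
}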
If the storage size is no problem, then an alternative solution would be to store the information of the partitioned column twice. Once in the file structure (done by partitioning on that column), and once more in the data itself. This is achieved by duplicating the column before saving it to file. Say the column in question is named colA,
val df2 = df.withColumn("colADup", $"colA")
df2.write.partitionBy("colADup").orc("path/to/save/")
This can also easily be extended to multiple columns.
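For example, a hedged extension of the snippet above with a second hypothetical column colB:
// Duplicate every partition column you want to keep in the data itself, then partition by the duplicates.
val df3 = df.withColumn("colADup", $"colA").withColumn("colBDup", $"colB")
df3.write.partitionBy("colADup", "colBDup").orc("path/to/save/")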

Is it ok to create an HBase table with 300 column families?

I have a scenario where each object has 300 variants, so I want to store them in HBase, with each row storing the original object and its 300 variants in different column families. The access model is to insert the objects into the table in a batch every morning and then just read them. I have no idea whether it is ok to create an HBase table with 300 column families for my scenario.
The documentation suggests that the number of column families should be at most about 10, and that a typical schema has between one and three.
Do you have any objection to storing the three hundred variants as columns in one column family instead?
There is a practical limit on the number of column families in HBase. There is one MemStore (a write cache which stores new data before writing it into HFiles) per column family, and when one is full, they all flush.
The more column families you add, the more MemStores are created and the more frequent MemStore flushes become, which degrades performance.
Stick to a very low number of column families: 1 or 2.
A column family maps to files in the underlying file system and thus imposes a load on HBase.
The way to do this is by creating 300 columns in a single column family instead.
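A hedged sketch of that layout using the HBase client API from Scala; the table name objects, the family v, the row key and the variant_N qualifiers are all placeholders, not from the original question:
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
// One column family ("v") instead of 300.
conn.getAdmin.createTable(new HTableDescriptor(TableName.valueOf("objects")).addFamily(new HColumnDescriptor("v")))
// The original object and its 300 variants become column qualifiers inside that single family.
val table = conn.getTable(TableName.valueOf("objects"))
val put = new Put(Bytes.toBytes("objectId-1"))
put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("original"), Bytes.toBytes("<serialized object>"))
(1 to 300).foreach { i =>
  put.addColumn(Bytes.toBytes("v"), Bytes.toBytes(s"variant_$i"), Bytes.toBytes("<serialized variant>"))
}
table.put(put)
table.close(); conn.close()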

How to count rows that got copied from one Hive table to another

I am moving data from one Hive table to another Hive table. While moving the data, I add a few new columns, add a partition, and also apply compression.
I wanted to know if there is an easy way to check that the number of rows moved from one table to the other is the same, just to validate the move.
Currently I am doing a count on both tables, which is taking too much time as the number of rows is on the order of 10^10.
Thanks
When a map-reduce job is triggered during the transfer of data from the first table to the second, you can use the RECORDS counter from map/reduce to validate the row count.
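That counter shows up in the console output / job history of the Hive job that performs the INSERT. As a hedged alternative (not part of the original answer): if hive.stats.autogather was enabled when the target table was loaded, Hive records numRows in the table and partition metadata, which can be read back without rescanning the data. target_table and the partition spec below are placeholders:
-- Look for numRows under Table Parameters / Partition Parameters.
DESCRIBE FORMATTED target_table;
DESCRIBE FORMATTED target_table PARTITION (dt='2014-10-20');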

How to optimize Hive queries with external table and serde

Part 1: my environment
I have following files uploaded to Hadoop:
They are plain text
Each line contains JSON like:
{code:[int], customerId:[string], data:{[something more here]}}
codes are numbers from 1 to 3000,
customerIds total up to 4 million, with up to 0.5 million daily
All files are gzipped
In Hive I created an external table with a custom JSON SerDe (let's call it CUSTOMER_DATA)
All files from each date are stored in a separate directory - and I use that as the partitions of the Hive table
Most of the queries I run filter by date, code and customerId. I also have a second file with the format (let's call it CUSTOMER_ATTRIBUTES):
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so it has up to 4 million rows.
I query and filter my data in the following way:
Filtering by date - partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
Filtering by code, using a statement like WHERE code IN (1,4,5,33,6784)
Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA with a condition like
SELECT customerId
FROM CUSTOMER_DATA
JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId)
WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: question
Is there any efficient way I can optimize my queries? I read about indexes and buckets, but I don't know if I can use them with external tables and whether they will optimize my queries.
Performance on search:
Internal or external tables make no difference as far as performance is concerned. You can build indexes on both. Either way, building indexes on large data sets is counter-intuitive.
Bucketing the data on the columns you search would give a lot of performance gains. But whether you can bucket your data or not depends on your use case.
You can consider more partitioning (if possible) on code/customerId to get more gains. Hopefully you don't have too many unique codes or customer ids.
Rather than trying these things out on your textual JSON-formatted data, I would strongly suggest you move away from JSON text data. Parsing JSON (text) is a big performance killer.
These days there are a lot of file formats which work pretty well. If you can't change the component which produces the data, you can use a series of queries and tables to convert to another file format. This will be a one-time job for each partition of data. After that your search queries will run faster on the newer file format.
For example, the RCFile format is supported by Hive. If you pull out code and customerId as separate columns in RCFile, then the query engine can completely skip the data column for rows not matching code IN (1,4,5,33,6784), reducing IO heavily.
Also, storing data in RCFile, i.e. columnar storage, will help your joins. With RCFile, when you run a query with a join, the Hive execution engine will only read in the required columns, again significantly reducing IO. On top of this, if you bucket the columns which are part of the JOIN keys, it will lead to more performance gains.
If you need to keep JSON due to the nested nature of the data, then I would suggest you look at Parquet.
It will give you the performance gains of RCFile plus a binary format (Avro, Thrift, etc.).
At my work we had 2 columns of heavily nested JSON data. We tried storing this as compressed text and as the sequence file format. We then broke up the complex nested JSON columns into multiple, less-nested columns and pulled out some frequently searched keys into other columns. We stored this as RCFile, and the performance gains we observed on searching were huge.
Right now, with more growth in data, we need to improve further. After trying a few more things and talking to the Cloudera guys, there is only one big area left to improve: move away from JSON parsing. Parquet seems to be the ideal candidate for this.
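A hedged sketch of such a one-time conversion per partition, reusing the table name from the question; the target table, its columns and the dates are illustrative assumptions:
-- Hypothetical Parquet-backed target table; adjust the columns to match the JSON SerDe schema.
CREATE TABLE CUSTOMER_DATA_PARQUET (code INT, customerId STRING, data STRING)
PARTITIONED BY (partitionDate INT)
STORED AS PARQUET;

-- One-time load, repeated (or scripted) for each existing partition.
INSERT OVERWRITE TABLE CUSTOMER_DATA_PARQUET PARTITION (partitionDate = 20141020)
SELECT code, customerId, data
FROM CUSTOMER_DATA
WHERE partitionDate = 20141020;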
Yes, you can use indexes with external tables. Indexes do optimize search queries.
CREATE INDEX your_index_name ON TABLE your_table_name(field_you_want_to_index) AS 'COMPACT' WITH DEFERRED REBUILD;
Indexing takes a lot of time for a huge dataset, so we can do a deferred rebuild, i.e. after production hours :)
ALTER INDEX your_index_name ON your_table_name REBUILD;
you can even rebuild a specific partition.
ALTER INDEX your_index_name ON your_table_name PARTITION(your_field = 'any_thing') REBUILD;
When you JOIN two tables, bucketing is the best option to go with; it does a lot of optimization.
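For example, a hedged sketch of bucketing both tables on the join key customerId; the bucket count, the *_BUCKETED table names and the attribute columns are placeholders:
-- Both tables bucketed (and sorted) on the join key.
CREATE TABLE CUSTOMER_DATA_BUCKETED (code INT, customerId STRING, data STRING)
PARTITIONED BY (partitionDate INT)
CLUSTERED BY (customerId) SORTED BY (customerId) INTO 64 BUCKETS
STORED AS ORC;

CREATE TABLE CUSTOMER_ATTRIBUTES_BUCKETED (customerId STRING, attribute_1 STRING)
CLUSTERED BY (customerId) SORTED BY (customerId) INTO 64 BUCKETS
STORED AS ORC;

-- Enforce bucketing when populating, and enable bucket map joins at query time.
SET hive.enforce.bucketing = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;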
