Hive Bucketing - How to run a Hive query for a specific bucket - hadoop

I have a Hive query which reads 5 large tables and outputs the records to the next process. All of these tables are partitioned on proc_dt and bucketed on user_id (5 buckets). Joins are done on user_id and filtering on proc_dt.
How can I run this query for a specific bucket of all the tables? For example, I want to run the query for just the first bucket of each table.
The reason for doing this is that once I complete the query for the first bucket, I can send the output data to the next process. While the next process is running, I can complete the query for the next bucket, and so on. This way the next process is not waiting for the entire query to finish.
If I had one more column holding user_id mod 5, I would have gone for partitioning on it, but there is no such column and I cannot add one.
Could anyone please suggest a solution for this? Any suggestions would be really helpful.

I found the answer to this. You can reference a specific bucket number in the join query. Check the link below for more detail.
https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/
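For reference, here is a minimal HiveQL sketch of the idea using TABLESAMPLE, assuming the tables really were written with 5 buckets on user_id (bucketing enforced at load time); the table names, the column names and the proc_dt value are illustrative, not from the original question:

-- Read only bucket 1 of each table. Because the sample denominator (5)
-- equals the bucket count and the sample column is the bucketing column,
-- Hive prunes the input to that single bucket file per partition.
SELECT a.user_id, a.some_col, b.other_col
FROM table_a TABLESAMPLE(BUCKET 1 OUT OF 5 ON user_id) a
JOIN table_b TABLESAMPLE(BUCKET 1 OUT OF 5 ON user_id) b
  ON a.user_id = b.user_id
WHERE a.proc_dt = '2016-01-01'
  AND b.proc_dt = '2016-01-01';

Running the same statement with BUCKET 2, 3, 4 and 5 covers the remaining slices.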

You can specify partitions within query statements, but not buckets. Buckets are used for optimization purposes, e.g. faster sampling and map-side joins, but they are not visible to SQL statements.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
So here is the documentation example:
CLUSTERED BY(user_id) INTO 256 BUCKETS;
This clearly does not permit access to individual buckets by value/name.

Related

How do we define an HBase row key so that records are retrieved efficiently when there are millions of records in the table

I have 30 million records in a table, but when I try to find one of the records it takes too much time to retrieve it. Could you suggest how I should generate the row key so that records can be fetched fast?
Right now I use an auto-increment ID (1, 2, 3, and so on) as the row key. What steps should I take to improve performance? Let me know your thoughts.
Generally, when tuning performance for a structured SQL table, we apply some basic/general measures: proper indexes on the columns used in queries, proper logical partitioning or bucketing of the table, and enough buffer memory for complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problems come from the back-and-forth between hard disk and buffer, and between different servers. You need to figure out how to reduce that movement to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query plan and look for ways to improve performance.
An integer row key will give the best performance, but design the row key/index when the table is created; retrofitting it later kills performance.
When creating external tables in Hive/Impala against HBase tables, map the HBase row key to a string column in Hive/Impala. If this is not done, the row key is not used in the query and the entire table is scanned (see the sketch after these notes).
Never use LIKE in a row-key query, because it scans the whole table. Use BETWEEN or =, <, >=.
If you are not filtering on the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of the data.
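To make the Hive/HBase mapping note concrete, here is a minimal sketch; the Hive table name, HBase table name, column family and columns are made up for the example:

-- Map the HBase row key (:key) to a STRING column so that Hive/Impala can
-- push row-key predicates down as range scans instead of full table scans.
CREATE EXTERNAL TABLE hbase_users (
  row_key STRING,
  name    STRING,
  city    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:city')
TBLPROPERTIES ('hbase.table.name' = 'users');

-- Range filters on the row key (BETWEEN, =, <, >=), never LIKE:
SELECT row_key, name, city
FROM hbase_users
WHERE row_key BETWEEN '00001000' AND '00002000';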

Get the top N records from two unconnected data sets

I have two Rails services that return data from distinct databases. In one data set I have records with fields that are something like this:
query, clicks, impressions
In the second I have records with fields something like this:
query, clicks, visitors
What I want to be able to do is get paged data from the merged set, matching on query. It also needs to include records that exist in only one of the two data sets, and then sort everything by the 'clicks' column.
In SQL if these two tables were in the same database I'd do this:
SELECT COALESCE(a.query, b.query), a.clicks, b.clicks, impressions, visitors
FROM a FULL OUTER JOIN b ON a.query = b.query
ORDER BY GREATEST(COALESCE(a.clicks, 0), COALESCE(b.clicks, 0)) DESC
LIMIT 100 OFFSET 1
Taking an individual "top 100" from each data set produces incorrect results because 'clicks' in data set 'a' may be significantly higher or lower than in data set 'b'.
As they aren't in the same database, I'm looking for help with the algorithm that makes this kind of query efficient and clean.
I never found a way to do this outside of a database. In the end, we just used PostgreSQL's Foreign Data Wrapper feature to connect the two databases together and let PostgreSQL handle the sorting and paging.
One trick for anyone heading down this path: we built VIEWs on the remote server that provided exactly the data needed in 'a' above. This was thousands of times faster than trying to join raw tables across the remote connection, where the value of the indexes was lost.
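For anyone who wants to see what that setup looks like, here is a minimal postgres_fdw sketch, with made-up server, credential, schema and view names; clicks_summary_v stands in for the pre-aggregated remote VIEW described above:

-- On the local database: register the remote server and credentials once.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER remote_stats
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote-host', dbname 'stats', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER remote_stats
  OPTIONS (user 'reporting', password 'secret');

-- Expose the remote, pre-aggregated VIEW as a local foreign table.
CREATE FOREIGN TABLE remote_visits (query text, clicks bigint, visitors bigint)
  SERVER remote_stats
  OPTIONS (schema_name 'public', table_name 'clicks_summary_v');

The FULL OUTER JOIN, ORDER BY and LIMIT/OFFSET from the question can then run entirely in the local PostgreSQL instance against remote_visits.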

Compound rowkey in Azure Table storage

I want to move some of my Azure SQL tables to Table storage. As far as I understand, I can save everything in the same table, separating it using PartitionKey and keeping it unique within each partition using RowKey.
Now, I have a table with a compound key:
ParentId: (uniqueidentifier)
ReportTime: (datetime)
I also understand RowKeys have to be strings. Will I need to combine these into a single string? Or can I combine multiple keys some other way? Do I perhaps need to create a new key?
Any help is appreciated.
UPDATE
My idea is to take data from several (three for now) database tables and put it in the same storage table, separating them with the partition key.
I will query using ParentId and a WeekNumber (another column). This table has about 1 million rows that are deleted weekly from the database. My two other tables have about 6 million and 3.5 million rows.
This question is pretty broad and there is no right answer.
The specific question - can you use compound keys with Azure Table Storage - yes, you can do that. But it involves manually serializing/deserializing your object's properties. You can achieve that by overriding the TableEntity's ReadEntity and WriteEntity methods. Check this detailed blog post on how you can override these methods to use your own custom serialization/deserialization.
I will further discuss my view on your broader question.
First of all, why do you want to put data from 3 (SQL) tables into one (Azure) table? Just have 3 Azure tables.
Second, as Fabrizio points out, think about how you are going to query the records, because the Windows Azure Table service has only one index, and that is the PartitionKey + RowKey pair of properties (columns). If you are pretty sure you will mostly query data by a known PartitionKey and RowKey, then Azure Tables suits you perfectly. However, you say that your combination for the RowKey is ParentId + WeekNumber; that means a record is uniquely identified by this combination. If that is true, you are even more ready to go.
Next, you say you are going to delete records every week. You should know that the DELETE operation acts on a single entity. You can use Entity Group Transactions to DELETE multiple entities at once, but with limits: (a) all entities in a batch operation must have the same PartitionKey, (b) the maximum number of entities per batch is 100, and (c) the maximum size of a batch operation is 4 MB. Say you have 1M records, as you state. In order to delete them, you first have to retrieve them in groups of 100, then delete them in groups of 100. That is, in the best case, 10k operations for retrieval and 10k operations for deletion. Even if that only costs 0.002 USD, think about the time taken to execute 10k operations against a REST API.
Since you have to delete entities on a regular basis, tied to a WeekNumber let's say, I suggest that you dynamically create your tables and include the week number in the table name. That way you achieve:
Even better partitioning of information
Easier and more granular information backup / delete
Deleting millions of entities requires just one operation - delete table.
There is no unique solution to your problem. Yes, you can use ParentId as PartitionKey and ReportTime as RowKey (or invert the assignment). But the 2 main questions are: how do you query your data, and with what conditions? And how much data do you store - 1,000 items, 1 million items, 1,000 million items? The total storage usage is important, but it is also very important to consider the number of transactions you will generate against the storage.
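As a small illustration of the ParentId-as-PartitionKey / ReportTime-as-RowKey option mentioned above, this is roughly how the compound key could be flattened into the two string keys when exporting from the SQL table; the table name dbo.Reports is made up for the example:

-- PartitionKey: the parent GUID as text. RowKey: a fixed-width, sortable
-- timestamp, so entities within a partition stay ordered by ReportTime.
SELECT
    CONVERT(varchar(36), ParentId)          AS PartitionKey,
    FORMAT(ReportTime, 'yyyyMMddHHmmssfff') AS RowKey
FROM dbo.Reports;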

Which is faster in Apache Pig: Split then Union or Filter and Left Join?

I am currently processing a large input table (10^7 rows) in Pig Latin: the table is filtered on some field, the matching rows are processed, and the processed rows are returned to the original table. When the processed rows are returned, the fields the filters are based on are changed, so that subsequent filtering ignores the already-processed rows.
Is it more efficient in Apache Pig to first split the table into processed and unprocessed parts on the filtering criteria, apply the processing, and union the two tables back together, or to filter the first table, apply the process to the filtered table, and perform a left join back into the original table using a primary key?
I can't say which one will actually run faster; I would simply run both versions and compare execution times. :)
If you go for the solution that involves a join, make sure to specify the smaller (if there is one) of the two tables first in the join operation (probably that's going to be the newly added data). The Pig documentation suggests that this will lead to a performance improvement because the last table is "not brought into memory but streamed through instead".

Match mass data records within thousands of queries

I have a lot of data records (about 1.5 billion) and a lot of queries (about 10 thousand).
Each record can match multiple queries (which can be determined by evaluating the query against the data record).
The records are stored in a distributed database. Each record has a field to store the IDs of the queries that match it.
I can scan all the records in about 15 minutes (doing nothing with the data).
For each record, I want to mark it with the IDs of the queries it matches, without a big delay (e.g. 1 hour). Is there a good algorithm to do this? Iterating over all the queries for each record is not a solution. I think some kind of indexing is needed. Please help! Thanks!
Apache Pig has multi-query execution switched on by default. If your queries share the same data source, Pig will execute them in parallel, so that the input data is read only once.
