How to sample for each group in Hive?

I have a large table in Hive with 1.5 billion+ rows. One of the columns is category_id, which has ~20 distinct values. I want to sample the table such that I have 1 million rows for each category.
I checked out Random sample table with Hive, but including matching rows and Hive: Creating smaller table from big table, and I figured out how to get a random sample from the entire table, but I still can't see how to get a sample for each category_id.

I understand you want to sample your table into multiple files. You might want to check Hive bucketing or dynamic partitions to balance your records across multiple folders/files.
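If the goal is a fixed-size random sample per category, a window function can do it directly. Below is a minimal sketch, not taken from the linked answers; it assumes the source table is named big_table, and the helper columns r and rn exist only for the sampling (list the real columns explicitly in the outer SELECT if you don't want them in the result):

CREATE TABLE sampled_table AS
SELECT *
FROM (
  SELECT t.*,
         row_number() OVER (PARTITION BY category_id ORDER BY r) AS rn
  FROM (
    SELECT s.*, rand() AS r FROM big_table s  -- attach a random sort key to each row
  ) t
) ranked
WHERE rn <= 1000000;  -- keep at most 1 million rows per category_id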

Related

How do I get the table count for all tables in the same folder in Hadoop Hive? And in SAS?

I want to get the row count for all tables under a folder called "planning" in the Hadoop Hive database, but I couldn't figure out a way to do so. Most of these tables are not inter-linkable, and hence I can't use a full join with a common key.
Is there a way to get the counts and output them to one table, with each row representing one table?
Table names that I have:
add_on
sales
ppu
ssu
car
Secondly, I am a SAS developer. Is the above process doable in SAS? I tried the data dictionary, but "nobs" is completely blank for this library, while all other SAS libraries display "nobs" properly. I wonder why, and how to work around it.
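No answer was given in the thread for the Hive part; one hedged sketch is to union the counts, assuming the tables listed above live in a Hive database called planning (on older Hive versions the UNION ALL may need to be wrapped in a subquery):

SELECT 'add_on' AS table_name, COUNT(*) AS row_count FROM planning.add_on
UNION ALL
SELECT 'sales',  COUNT(*) FROM planning.sales
UNION ALL
SELECT 'ppu',    COUNT(*) FROM planning.ppu
UNION ALL
SELECT 'ssu',    COUNT(*) FROM planning.ssu
UNION ALL
SELECT 'car',    COUNT(*) FROM planning.car;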

Can I directly consider the Hive partition columns similar to the partition columns present in source (Teradata) tables?

Can I directly consider the Hive partition columns similar to the partition columns present in my source (Teradata) tables? Or do I have to consider any other parameters to decide the Hive partitioning columns? Please help.
This is not best practice. If you create data in this manner, then a person trying to access the HDFS data directly will not find the partition columns in each partition. For example, say a Teradata table is partitioned by a date column; if the Hive table is also partitioned by date, then an HDFS partition, say 2016-08-06, will not contain the date field in its files. So to make it easy for the end user, partition by a dummy column, say date_d, which holds exactly the same values as the date column.
Abstractly, partitioning in Teradata and Hive are similar. To begin with, you can probably use the same columns as in your source to partition the tables.
If your data size is huge in each single partition, then consider partitioning it further to improve performance. The multilevel partitioning would mostly depend on the number of filters you apply in your queries.
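To illustrate the dummy-column approach from the first answer, here is a hedged sketch; the table and column names (sales_hive, sale_date, date_d, sales_staging) are made up for illustration:

CREATE TABLE sales_hive (
  order_id  BIGINT,
  amount    DOUBLE,
  sale_date STRING              -- original date column, kept inside the data files
)
PARTITIONED BY (date_d STRING); -- duplicate of sale_date, used only for partitioning

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sales_hive PARTITION (date_d)
SELECT order_id, amount, sale_date, sale_date AS date_d
FROM sales_staging;

This way each HDFS partition directory (e.g. date_d=2016-08-06) still carries the date value inside its files, so users reading HDFS directly are not missing the column.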

How Hive partitioning works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it actually works and stores the data in the exact partition.
Let's say I have a table with a dynamic partition on year, and I ingested data from 2013. How does Hive create the partition and store the data in the exact partition?
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned (e.g. by year), the data is stored separately in different directories, each directory corresponding to one year.
For a non-partitioned table, when you want to fetch the data of year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just goes to the year=2010 directory. Much faster and more IO-efficient.
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partitions, it is easy to query a portion of the data.
Tables or partitions can be sub-divided into buckets, to give the data extra structure that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, a query searches the whole table for the required information. However, if you partition the employee data by year and store each year separately, the query processing time is reduced.
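To make the year example concrete, here is a hedged sketch of a dynamic-partition insert; all names (employees_part, employees_raw, join_date) are illustrative:

CREATE TABLE employees_part (
  emp_id    BIGINT,
  name      STRING,
  join_date STRING
)
PARTITIONED BY (join_year INT);

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hive routes each row into the directory join_year=<value>, e.g. join_year=2013
INSERT OVERWRITE TABLE employees_part PARTITION (join_year)
SELECT emp_id, name, join_date, year(join_date) AS join_year
FROM employees_raw;

-- Only the join_year=2012 directory is scanned:
SELECT * FROM employees_part WHERE join_year = 2012;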

Can I insert data multiple times into a bucketed Hive table?

I have a bucketed hive table. It has 4 buckets.
CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;
Initially I inserted some records into this table using the following query.
set hive.enforce.bucketing = true;
insert into user
select * from second_user;
After this operation, in HDFS I see that 4 files were created under this table's directory.
Later I needed to insert another set of data into the user table, so I ran the query below.
set hive.enforce.bucketing = true;
insert into user
select * from third_user;
Now another 4 files are created under the user table directory, for a total of 8 files.
Is it fine to do this kind of multiple insert into a bucketed table?
Does it affect the bucketing of the table?
I figured it out!
If you do multiple inserts into a bucketed Hive table, Hive won't complain as such, and all Hive queries will still work.
Having said that, such an operation spoils the bucketing of the table: after multiple inserts into a bucketed table, sampling fails. TABLESAMPLE doesn't work properly after multiple inserts, and the sort-merge bucket map join doesn't work after such an operation either.
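If the bucketing does need to be restored, one hedged workaround (not from the thread) is to rewrite the whole table in a single insert, so each bucket ends up in exactly one file again; user_tmp is a throwaway staging name:

set hive.enforce.bucketing = true;
CREATE TABLE user_tmp AS SELECT * FROM user;  -- unbucketed staging copy
INSERT OVERWRITE TABLE user
SELECT user_id, firstname, lastname FROM user_tmp;
DROP TABLE user_tmp;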
I don't think that should be an issue, because you have declared that you want bucketing on user_id, so every insert will create 4 more files.
Bucketing is used for faster query processing, so if it is making 4 more files every time, it will be making your query processing even faster.

Comparing data in two tables is taking time

I need to query table 1 to find all orders and their created dates (the key is order number and date).
In table 2 (the key is also order number and date), I check whether the order exists for a date.
For this I am scanning table 1 and, for each record, checking whether it exists in table 2. Is there a better way to do this?
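If both tables are queryable through Hive, a set-based join avoids the per-record lookup entirely; a minimal sketch, with made-up table and column names:

-- all orders in table1 that also exist in table2 for the same date
SELECT t1.order_number, t1.created_date
FROM table1 t1
LEFT SEMI JOIN table2 t2
  ON t1.order_number = t2.order_number
 AND t1.created_date = t2.created_date;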
In this situation, in which your key is identical for both tables, it makes sense to have a single table in which you store the data for both Table 1 and Table 2. That way you can do a single scan on your data and know straight away whether the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan; for example, in the case where you will not be populating rows at all in Table 2, you could simply use a ColumnPrefixFilter.
If, however, you do need to keep this data separately in 2 tables, you could pre-split the tables with the same region boundaries for both tables. This will be helpful for the query you are aiming for: load all rows in Table 1 where the row exists in Table 2. Essentially this would be a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper gets the corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map-side join).
