How to create a Hive table with a user-specified number of records? - hadoop

Is it possible to create a Hive table with a user-specified number of records?
For example, I want to create a table with x rows (where x is defined by the user). The table would have two columns: 1. a unique row id [could be auto-incremented] 2. a randomly generated string.
Is this possible using Hive?

set N=7;
-- space(N-1) yields a string of N-1 spaces; split on ' ' gives N elements,
-- and posexplode numbers them 0..N-1, producing exactly N rows
select pe.i+1 as n
      ,java_method('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str
from (select 1) x
lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x
;
+---+------------+
| n | str        |
+---+------------+
| 1 | udttBCmtxT |
| 2 | kkrMQmirSG |
| 3 | iYDABgXOvW |
| 4 | DKHKgtXKPS |
| 5 | ylebKcdcGj |
| 6 | DaujBCkCtz |
| 7 | VMaWfbtzFY |
+---+------------+
Relevant references: posexplode, java_method, and Commons Lang's RandomStringUtils.
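To materialize the result as a table, which is what the question asks for, the same query can be wrapped in a CTAS; a minimal sketch (random_rows is a hypothetical table name):
set N=7;
-- creates a table with N rows: an incrementing id and a random 10-character string
create table random_rows as
select pe.i + 1 as id
      ,java_method('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str
from (select 1) x
lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x;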

Specifying a limit on the number of rows at table-creation time may not be possible, but it is possible to limit the number of rows inserted into the table using a LIMIT clause:
-- <filename:dbloader.sql>
create table ${hiveconf:TABLENAME} (id int, string1 string);
insert into table ${hiveconf:TABLENAME}
select id, string1 from oldtable limit ${hiveconf:ROWLIMIT};
and when submitting the Hive script:
hive --hiveconf TABLENAME='XYZ' --hiveconf ROWLIMIT=1000 -f dbloader.sql
As far as creating a unique incremental id goes, you will have to write a UDF for it.
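That said, if writing a UDF is not an option, a common workaround (my suggestion, not part of the original answer) is the row_number() windowing function, available since Hive 0.11:
-- assumes oldtable(string1 string) as above; 1000 is an example limit
select row_number() over (order by string1) as id
      ,string1
from (select string1 from oldtable limit 1000) t;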

Related

Insert data to table from another table containing null values and replace null values with the original table 1 values

I want to match the first column of both tables and insert table 2's values into table 1. But if a table 2 value is null, leave the table 1 value as it is. I am using Hive to do this. Please help.
You need to use coalesce to pick a non-null value to populate column b, and a case statement to decide how to populate column c.
Example:
hive> select t1.a,
             coalesce(t2.y, t1.b) as b,
             case when t2.y is null then t1.c
                  else t2.z
             end as c
      from table1 t1 left join table2 t2 on t1.a = t2.x;
+----+-----+----+--+
| a  | b   | c  |
+----+-----+----+--+
| a  | xx  | 5  |
| b  | bb  | 2  |
| c  | zz  | 7  |
| d  | dd  | 4  |
+----+-----+----+--+
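To actually write the merged result back into table1, the same select can feed an INSERT OVERWRITE; a sketch assuming the table1(a,b,c) / table2(x,y,z) schema implied above:
insert overwrite table table1
select t1.a
      ,coalesce(t2.y, t1.b) as b
      ,case when t2.y is null then t1.c else t2.z end as c
from table1 t1
left join table2 t2 on t1.a = t2.x;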

How do we get the descriptions of 1000 tables using Hive?

I have 1000 tables and need to run describe <table name>; for each one. Instead of running them one by one, can you please give me one command to fetch "N" tables in a single shot?
You can make a shell script and call it with a parameter. For example, the following script receives a schema name, prepares the list of tables in that schema, calls the DESCRIBE EXTENDED command for each one, extracts the location, and prints the table location for the first 1000 tables in the schema, ordered by name. You can modify it and use it as a single command:
#!/bin/bash
# Create table list for a schema (script parameter)
HIVE_SCHEMA=$1
echo Processing Hive schema $HIVE_SCHEMA...
tablelist=tables_$HIVE_SCHEMA
hive -e "set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" 1> $tablelist

# Number of tables
tableNum_limit=1000

# For each table do:
for table in $(cat $tablelist | sort | head -n "$tableNum_limit") # add proper sorting
do
  echo Processing table $table ...
  # Call DESCRIBE
  out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
  # Get location, for example
  table_location=$(echo "${out}" | egrep -o 'location:[^,]+' | sed 's/location://')
  echo Table location: $table_location
  # Do something else here
done
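Invocation, assuming the script is saved as describe_tables.sh (a hypothetical name) and made executable:
./describe_tables.sh my_schema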
Query the metastore
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;

select d.name as db_name
      ,t.tbl_name
      ,c.integer_idx + 1 as col_position
      ,c.column_name
      ,c.type_name
from DBS as d
join TBLS as t on t.db_id = d.db_id
join SDS as s on s.sd_id = t.sd_id
join COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name, t.tbl_name, c.integer_idx;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name                 |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 | 1            | i           | int                       |
| my_db_2 | my_tbl_2 | 1            | c1          | string                    |
| my_db_2 | my_tbl_2 | 2            | c2          | date                      |
| my_db_2 | my_tbl_2 | 3            | c3          | decimal(12,2)             |
| my_db_3 | my_tbl_3 | 1            | x           | array<int>                |
| my_db_3 | my_tbl_3 | 2            | y           | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+
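To mirror the 1000-table cap from the shell-script approach, the metastore query can be limited as well; a sketch in MySQL syntax using only the metastore tables shown above:
-- first 1000 tables of one schema, ordered by name
select t.tbl_name
from DBS as d
join TBLS as t on t.db_id = d.db_id
where d.name = 'my_db_1'
order by t.tbl_name
limit 1000;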

Dropping hive partition based on certain condition in runtime

I have a table in hive built using the following command:
create table t1 (x int, y int, s string) partitioned by (wk int) stored as sequencefile;
The table has the data below:
select * from t1;
+-------+-------+-------+--------+--+
| t1.x  | t1.y  | t1.s  | t1.wk  |
+-------+-------+-------+--------+--+
| 1     | 2     | abc   | 10     |
| 4     | 5     | xyz   | 11     |
| 7     | 8     | pqr   | 12     |
+-------+-------+-------+--------+--+
Now the ask is to drop the oldest partition when the partition count is >= 2.
Can this be handled in HQL or through a shell script, and how?
Note that I will be passing the database name as a variable, e.g. hive -e 'use "$dbname"; show partitions t1'
If your partitions are ordered by date, you could write a shell script that uses hive -e 'SHOW PARTITIONS t1' to get all partitions; in your example, it will return:
wk=10
wk=11
wk=12
Then you can issue hive -e 'ALTER TABLE t1 DROP PARTITION (wk=10)' to remove the first partition.
So something like:
db=mydb
if (( $(hive -e "use $db; SHOW PARTITIONS t1" | grep wk | wc -l) < 2 )); then
  exit
fi
partition=$(hive -e "use $db; SHOW PARTITIONS t1" | grep wk | head -1)
hive -e "use $db; ALTER TABLE t1 DROP PARTITION ($partition)"
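Since the question mentions passing the database name as a variable, the hard-coded db=mydb line can be replaced with a positional parameter; a minimal sketch, assuming the script is saved as drop_oldest_partition.sh (a hypothetical name):
#!/bin/bash
# usage: ./drop_oldest_partition.sh mydb
db=$1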

How To Parse a String (From a different Table) in Hive (Hadoop) And Load It To a Different Table

I have this Table as an Input:
Table Name:Deals
Columns: Doc_id(BIGINT),Nv_Pairs_Feed(STRING),Nv_Pairs_Category(STRING)
For Example:
Doc_id: 4997143658422483637
Nv_Pairs_Feed: "TYPE:Wiper Blade;CONDITION:New;CATEGORY:Auto Parts and Accessories;STOCK_AVAILABILITY:Y;ORIGINAL_PRICE:0.00"
Nv_Pairs_Category: "Condition:New;Store:PartsGeek.com;"
I am trying to parse the fields "Nv_Pairs_Feed" and "Nv_Pairs_Category" and extract their N:V pairs (pairs are separated by ';', and each name and value are separated by ':').
My goal is to insert each N:V pair as a row into this table:
Doc_id | Name | Value | Source_Field
Example for desired Result:
4997143658422483637 | Condition | New | Nv_Pairs_Category
4997143658422483637 | Store | PartsGeek.com | Nv_Pairs_Category
4997143658422483637 | TYPE | Wiper Blade | Nv_Pairs_Feed
4997143658422483637 | CONDITION | New | Nv_Pairs_Feed
4997143658422483637 | CATEGORY | Auto Parts and Accessories | Nv_Pairs_Feed
4997143658422483637 | STOCK_AVAILABILITY | Y | Nv_Pairs_Feed
4997143658422483637 | ORIGINAL_PRICE | 0.00 | Nv_Pairs_Feed
You can convert the strings to maps using the standard Hive UDF str_to_map, and then use the Brickhouse UDFs (http://github.com/klout/brickhouse) map_key_values, combine, and numeric_range to explode those maps, i.e. something like the following:
create view deals_map_view as
select doc_id,
       map_key_values(
           combine( str_to_map( nv_pairs_feed, ';', ':'),
                    str_to_map( nv_pairs_category, ';', ':'))) as deals_map_key_values
from deals;

select doc_id,
       array_index( deals_map_key_values, i ).key as name,
       array_index( deals_map_key_values, i ).value as value
from deals_map_view
lateral view numeric_range( size( deals_map_key_values ) ) i1 as i;
You can probably do something similar with an explode_map UDF.
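If Brickhouse is not available, a similar result can be sketched with only built-in Hive functions, str_to_map plus a lateral view explode over the resulting map (assuming the Deals schema above; note that older Hive versions may require wrapping the UNION ALL in a subquery):
select doc_id, kv.key as name, kv.value as value, 'Nv_Pairs_Feed' as source_field
from deals
lateral view explode(str_to_map(nv_pairs_feed, ';', ':')) kv as key, value
union all
select doc_id, kv.key as name, kv.value as value, 'Nv_Pairs_Category' as source_field
from deals
lateral view explode(str_to_map(nv_pairs_category, ';', ':')) kv as key, value;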

Oracle explain plan over simple select performs multiple hash joins when multiple columns are indexed in a table

I am currently running into an issue with my Oracle instance. I have two simple select statements:
select * from dog_vets
and
select * from dog_statuses
and the following fiddle
My explain plan on dog_vets is as follows:
0 | Select Statement
1 | Table Access Full Scan dog_vets
my explain plan on dog_statuses is as follows:
ID | Operation                | Name                  | Rows | Bytes | Cost   | Time
 0 | SELECT STATEMENT         |                       | 20G  | 500M  | 100000 | 999:99:17
 1 |  VIEW                    | index$_join$_001      | 20G  | 500M  | 100000 | 999:99:17
 2 |   HASH JOIN              |                       |      |       |        |
 3 |    HASH JOIN             |                       |      |       |        |
 4 |     INDEX FAST FULL SCAN | dog_statuses_check_up | 20G  | 500M  | 100000 | 32:15:00
 5 |     INDEX FAST FULL SCAN | dog_statuses_sick     | 20G  | 500M  | 100000 | 35:19:00
To get this type of output execute the following statement:
explain plan for
select * from dog_vets;
OR
explain plan for
select * from dog_statuses;
and then
select * from table(dbms_xplan.display);
Now my question is: why do multiple indexes imply a view (materialized, I assume) being created in the statements above, and what kind of performance hit am I taking on this type of query? As it stands, dog_vets has ~300 million records and dog_statuses has about 500 million. I have yet to get select * from dog_statuses to return in under 10 hours, primarily because the query dies before it completes.
DDL
In case the SQL Fiddle dies:
create table dog_vets
(
name varchar2(50),
founded timestamp,
staff_count number
);
create table dog_statuses
(
check_up timestamp,
sick varchar2(1)
);
create index dog_vet_name
on dog_vets(name);
create index dog_status_check_up
on dog_statuses(check_up);
create index dog_status_sick
on dog_statuses(sick);
You could try telling the optimizer to ignore the indexes:
SELECT /*+ NO_INDEX(dog_statuses) */ *
FROM dog_statuses;
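To verify the hint actually changes the plan, you can re-run the same EXPLAIN steps shown earlier:
explain plan for
select /*+ NO_INDEX(dog_statuses) */ * from dog_statuses;

select * from table(dbms_xplan.display);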
