How To Parse a String (From One Table) in Hive (Hadoop) and Load It into a Different Table

I have this table as input:
Table name: Deals
Columns: Doc_id (BIGINT), Nv_Pairs_Feed (STRING), Nv_Pairs_Category (STRING)
For Example:
Doc_id: 4997143658422483637
Nv_Pairs_Feed: "TYPE:Wiper Blade;CONDITION:New;CATEGORY:Auto Parts and Accessories;STOCK_AVAILABILITY:Y;ORIGINAL_PRICE:0.00"
Nv_Pairs_Category: "Condition:New;Store:PartsGeek.com;"
I am trying to parse the fields "Nv_Pairs_Feed" and "Nv_Pairs_Category" and extract their name:value pairs (pairs are separated by ';'; within each pair, the name and value are separated by ':').
My goal is to insert each name:value pair as a row in this table:
Doc_id | Name | Value | Source_Field
Example for desired Result:
4997143658422483637 | Condition | New | Nv_Pairs_Category
4997143658422483637 | Store | PartsGeek.com | Nv_Pairs_Category
4997143658422483637 | TYPE | Wiper Blade | Nv_Pairs_Feed
4997143658422483637 | CONDITION | New | Nv_Pairs_Feed
4997143658422483637 | CATEGORY | Auto Parts and Accessories | Nv_Pairs_Feed
4997143658422483637 | STOCK_AVAILABILITY | Y | Nv_Pairs_Feed
4997143658422483637 | ORIGINAL_PRICE | 0.00 | Nv_Pairs_Feed

You can convert the strings to maps using the standard Hive UDF str_to_map, and then use the Brickhouse UDFs ( http://github.com/klout/brickhouse ) map_key_values, combine, and numeric_range to explode those maps, i.e. something like the following:
create view deals_map_view as
select doc_id,
       map_key_values(
           combine( str_to_map( nv_pairs_feed, ';', ':'),
                    str_to_map( nv_pairs_category, ';', ':'))) as deals_map_key_values
from deals;
select
    doc_id,
    array_index( deals_map_key_values, i ).key as name,
    array_index( deals_map_key_values, i ).value as value
from deals_map_view
lateral view numeric_range( size( deals_map_key_values ) ) i1 as i;
You could probably do something similar with an explode_map UDF.
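For reference, a minimal sketch using only built-in Hive (no Brickhouse dependency), assuming the deals table described above: exploding the map returned by str_to_map emits one row per pair, and the Source_Field column is supplied as a literal. Note the trailing ';' in Nv_Pairs_Category may produce an empty key worth filtering out, and top-level union all requires Hive 1.2+ (wrap in a subquery on older versions):
select doc_id, kv.name, kv.value, 'Nv_Pairs_Feed' as source_field
from deals
lateral view explode(str_to_map(nv_pairs_feed, ';', ':')) kv as name, value
union all
select doc_id, kv.name, kv.value, 'Nv_Pairs_Category' as source_field
from deals
lateral view explode(str_to_map(nv_pairs_category, ';', ':')) kv as name, value;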

Related

How do we get the 1000 tables description using hive?

I have 1000 tables and need to check describe <table name>; for each one. Instead of running them one by one, can you please give me one command to fetch "N" tables in a single shot?
You can write a shell script and call it with a parameter. For example, the following script receives a schema name, builds the list of tables in that schema, calls the DESCRIBE EXTENDED command for each, extracts the location, and prints the table location for the first 1000 tables in the schema, ordered by name. You can modify it and use it as a single command:
#!/bin/bash
# Create a table list for the schema passed as the first script parameter
HIVE_SCHEMA=$1
echo "Processing Hive schema $HIVE_SCHEMA..."
tablelist=tables_$HIVE_SCHEMA
hive -e "set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" 1> "$tablelist"
# Number of tables to process
tableNum_limit=1000
# For each table, sorted by name, do:
for table in $(sort "$tablelist" | head -n "$tableNum_limit")
do
  echo "Processing table $table ..."
  # Call DESCRIBE EXTENDED silently
  out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
  # Extract the location, for example
  table_location=$(echo "${out}" | egrep -o 'location:[^,]+' | sed 's/location://')
  echo "Table location: $table_location"
  # Do something else here
done
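Saved as, say, describe_tables.sh (a file name chosen here for illustration), it runs as a single command with the schema as its only argument:
chmod +x describe_tables.sh
./describe_tables.sh my_schema    # my_schema is a placeholder schema name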
Query the metastore
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;

select d.name as db_name
      ,t.tbl_name
      ,c.integer_idx + 1 as col_position
      ,c.column_name
      ,c.type_name
from DBS as d
join TBLS as t on t.db_id = d.db_id
join SDS as s on s.sd_id = t.sd_id
join COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name
        ,t.tbl_name
        ,c.integer_idx;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 | 1 | i | int |
| my_db_2 | my_tbl_2 | 1 | c1 | string |
| my_db_2 | my_tbl_2 | 2 | c2 | date |
| my_db_2 | my_tbl_2 | 3 | c3 | decimal(12,2) |
| my_db_3 | my_tbl_3 | 1 | x | array<int> |
| my_db_3 | my_tbl_3 | 2 | y | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+
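The same joins can be extended to pull other metadata; for example, a sketch that reads each table's storage location from the SDS table (written against the stock MySQL metastore schema; details can vary between Hive versions):
select d.name as db_name
      ,t.tbl_name
      ,s.location
from DBS as d
join TBLS as t on t.db_id = d.db_id
join SDS as s on s.sd_id = t.sd_id
where d.name like 'my\_db\_%'
order by d.name
        ,t.tbl_name;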

How to count the number of words in each column delimited by the "|" separator using Hive?

The input data is:
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like:
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
      ,size(split(coalesce(Genres,''),'[^|\\s]+')) - 1 as count_of_genres
from mytable
This solution covers several use-cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
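A hedged sketch of that extra test, guarding NULL and empty strings with a CASE expression (same table and column names as above):
select t.*,
       case
           when genres is null or genres = '' then 0
           else 1 + length(genres) - length(replace(genres, '|', ''))
       end as num_genres
from t;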

How to create Hive table with user specified number of records?

Is it possible to create a Hive table with a user-specified number of records?
For example, I want to create a table with x rows (where x is defined by the user). The table would have two columns: 1. a unique row id [could be auto-incremented]; 2. a randomly generated string.
Is this possible using Hive?
set N=7;
select pe.i+1 as n
,java_method ('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str
from (select 1) x
lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x
;
+---+------------+
| n | str |
+---+------------+
| 1 | udttBCmtxT |
| 2 | kkrMQmirSG |
| 3 | iYDABgXOvW |
| 4 | DKHKgtXKPS |
| 5 | ylebKcdcGj |
| 6 | DaujBCkCtz |
| 7 | VMaWfbtzFY |
+---+------------+
How it works: space(${hiveconf:N}-1) produces a string of N-1 spaces, split(..., ' ') turns it into an array of N empty tokens, and the posexplode UDTF emits each token together with its 0-based position i, giving rows numbered 1..N; java_method then calls org.apache.commons.lang.RandomStringUtils.randomAlphabetic(10) to generate the random string for each row.
Specifying a limit on the number of rows at table-creation time may not be possible, but it is possible to limit the number of rows inserted into the table using the LIMIT clause:
-- <filename:dbloader.sql>
create table ${hiveconf:TABLENAME} (id int, string1 string);
insert into table ${hiveconf:TABLENAME}
select id, string1 from oldtable limit ${hiveconf:ROWLIMIT};
and submit the Hive script like this:
hive --hiveconf TABLENAME='XYZ' --hiveconf ROWLIMIT=1000 -f dbloader.sql
As far as creating a unique incremental id goes, you will have to write a UDF for it.
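As a hedged alternative to a custom UDF (not part of the answer above): on Hive 0.11+ the row_number() windowing function can assign incremental ids, e.g.:
-- assign 1..N ids without a UDF; the ordering column is chosen arbitrarily
insert into table ${hiveconf:TABLENAME}
select row_number() over (order by string1) as id,
       string1
from oldtable
limit ${hiveconf:ROWLIMIT};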

Automatically generating documentation about the structure of the database

There is a database that contains several views and tables.
I need to create a report (documentation of the database) listing all the fields in these tables, indicating the type and, if possible, the minimum/maximum values and the value from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type | MinValue | MaxValue | FirstRow |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | day | date | ‘2010-09-17’ | ‘2016-12-10’ | ‘2016-12-10’ |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | price | double | 1030.8 | 29485.7 | 6023.8 |
:------------+--------+--------+--------------+--------------+--------------:
| … | | | | | |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | day | date | ‘2014-06-20’ | ‘2016-11-28’ | ‘2016-11-16’ |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | owner | string | NULL | NULL | ‘Joe’ |
'------------'--------'--------'--------------'--------------'--------------'
I think executing many queries like
SELECT MAX(column_name) as max_value, MIN(column_name) as min_value
FROM table_name
will be inefficient on the huge tables stored in Hadoop.
After reading the documentation I found an article about "Statistics in Hive".
It seems I must use a statement like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command ended with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this statement adds information to the table's description rather than displaying a result? Will it work with views?
Please suggest how to effectively and automatically create documentation for a database in Hive.
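For what it's worth, a minimal sketch of the statistics round-trip in standard Hive, assuming a table named Table1 with a price column as in the example: ANALYZE persists the statistics in the metastore, and DESCRIBE FORMATTED (Hive 0.14+ for column-level output) reads them back.
-- compute and store column statistics (no result set is displayed)
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;
-- read the stored statistics (min, max, number of nulls, ...) for one column
DESCRIBE FORMATTED Table1 price;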

Joining tables with same column names - ORACLE

I am using Oracle.
I am currently working on 2 tables which both have the same column names. Is there any way in which I can combine the 2 tables together as they are?
Simple example to show what I mean:
TABLE 1:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| a | 1 | w |
| b | 2 | x |
TABLE 2:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| c | 3 | y |
| d | 4 | z |
RESULT THAT I WANT:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| a | 1 | w |
| b | 2 | x |
| c | 3 | y |
| d | 4 | z |
Any help would be greatly appreciated. Thank you in advance!
You can use the union set operator to get the result of two queries as a single result set:
select column1, column2, column3
from table1
union all
select column1, column2, column3
from table2
union on its own implicitly removes duplicates; union all preserves them.
The column names don't need to be the same; you just need the same number of columns with the same datatypes, in the same order.
(This is not what is usually meant by a join, so the title of your question is a bit misleading; I'm basing this on the example data and output you showed.)
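A tiny illustration of the difference, runnable in Oracle against dual: if the same row existed in both tables, union would collapse it while union all keeps both copies.
-- returns 2 rows: union all preserves the duplicate
select 'a' as column1, 1 as column2, 'w' as column3 from dual
union all
select 'a', 1, 'w' from dual;
-- returns 1 row: union removes the duplicate
select 'a' as column1, 1 as column2, 'w' as column3 from dual
union
select 'a', 1, 'w' from dual;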
