Using bash to send a Hive script a variable number of fields - bash

I'm automating a data pipeline by using a bash script to move CSVs to HDFS and build external Hive tables on them. Currently this only works when the format of the table is predefined in an .hql file, but I want to be able to read the headers from the CSV and send them as arguments to Hive. So currently I do this inside a loop over the files:
# bash
hive -S -hiveconf VAR1=$target_db -hiveconf VAR2=$filename -hiveconf VAR3=$target_folder/$filename -f create_tables.hql
Which is sent to this...
-- hive
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
individual_pkey INT,
response CHAR(1))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
I want the hive script to look more like this...
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
${hiveconf:ROW1} ${hiveconf:TYPE1},
... ...
${hiveconf:ROW_N} ${hiveconf:TYPE_N})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
Is it possible to send it some kind of array that it would parse? Is this feasible or advisable?

I eventually figured out a way around this.
You can't really write an HQL script that takes in a variable number of fields. You can, however, write a bash script that generates an HQL script of variable length. I've implemented this for my team; the general idea is to write out how you want the HQL to look as a string in bash, use something like Rscript to read the CSV and identify the data types, store those types in an array alongside the CSV headers, and then loop through both arrays, appending each column definition to the HQL.
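For what it's worth, here is a minimal bash sketch of that idea. The paths, argument handling, and the naive numeric check standing in for real type detection (my actual version called out to Rscript) are all illustrative assumptions, not a drop-in implementation:
# bash
#!/bin/bash
# Usage (hypothetical): ./build_table.sh <target_db> <local_csv> <hdfs_target_folder>
target_db="$1"
csv_file="$2"
target_folder="$3"
filename=$(basename "$csv_file" .csv)

# Read the header row and one sample data row into arrays.
IFS=',' read -r -a headers < "$csv_file"
IFS=',' read -r -a sample < <(sed -n '2p' "$csv_file")

hql="CREATE DATABASE IF NOT EXISTS ${target_db};\n"
hql+="CREATE EXTERNAL TABLE IF NOT EXISTS ${target_db}.${filename} (\n"
for i in "${!headers[@]}"; do
    # Crude type inference: integer-looking sample values become INT, everything else STRING.
    if [[ "${sample[$i]}" =~ ^-?[0-9]+$ ]]; then
        coltype="INT"
    else
        coltype="STRING"
    fi
    sep=","
    [[ "$i" -eq $(( ${#headers[@]} - 1 )) ]] && sep=""
    hql+="  ${headers[$i]} ${coltype}${sep}\n"
done
hql+=")\nROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
hql+="LOCATION '/${target_folder}/${filename}';"

# Write the generated script and run it, just like the fixed create_tables.hql above.
printf '%b\n' "$hql" > "create_${filename}.hql"
hive -S -f "create_${filename}.hql"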

Related

Result of Hive unbase64() function is correct in the Hive table, but becomes wrong in the output file

There are two questions:
I use unbase64() to process data and the output is completely correct in both Hive and SparkSQL. But in Presto, it shows:
Then I insert the data into both a local path and HDFS, and the data in both output files is wrong:
The code I used to insert data:
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from tmp_ol.aaa;
My questions are:
1. Why is the processed data shown correctly in Hive and SparkSQL, but not in Presto? The Presto on my machine can display this kind of character.
2. Why can't the data be shown correctly in the output files? The files are in UTF-8 format.
You can try using CAST(... AS STRING) over the output of the unbase64() function:
spark.sql("""Select CAST(unbase64('UsImF1dGhvcml6ZWRSZXNvdXJjZXMiOlt7Im5h') AS STRING) AS values FROM dual""").show(false)
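For completeness, the same cast can also be applied on the Hive side when writing the output files. A hedged sketch, assuming the base64-encoded column is called encoded_col (a placeholder name) in the source table:
# bash
hive -e "
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select CAST(unbase64(encoded_col) AS STRING) from tmp_ol.aaa;
"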

How do I ignore brackets when loading an external table in HIVE

I'm trying to load an extract produced by a Pig script as an external table in Hive. Pig enclosed each row in parentheses () (tuples?), like this:
(1,2,3,a)
(2,4,5,b)
(4,2,6,c)
and I can't find a way to tell Hive to ignore those brackets, which results in NULL values for the first column since it is actually an integer.
Any thoughts on how to proceed?
I know I can use a FLATTEN command in Pig, but I would also like to learn how to deal with these files directly from Hive.
There is no way to do this in one step. You'd have to have another step, be it the use of flatten in Pig or an extra Hive INSERT INTO.
In Hive you could use split(string field, string pattern) several times to read from your external table and create the columns you want, and then load that into a new table. However, I'd always lean towards having Pig output the data in the format you want, unless something else reads this file and expects it in the current format; it will save an expensive re-read of all your data.
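To illustrate the split() route, here is a hedged sketch. It assumes the raw Pig output is first exposed to Hive as a one-column table raw_pig_output(raw_line STRING) and that a target table cleaned already exists; both names are made up, and regexp_replace() is used alongside split() to strip the parentheses:
# bash
hive -e "
INSERT INTO TABLE cleaned
SELECT
  CAST(split(regexp_replace(raw_line, '[()]', ''), ',')[0] AS INT),
  CAST(split(regexp_replace(raw_line, '[()]', ''), ',')[1] AS INT),
  CAST(split(regexp_replace(raw_line, '[()]', ''), ',')[2] AS INT),
  split(regexp_replace(raw_line, '[()]', ''), ',')[3]
FROM raw_pig_output;
"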
As Ben said, there is no way to do it in one step, but you can do it by creating one more temp table in Hive.
Not sure if I am making it more complicated with one more table, but it worked for me.
create external table A_TEMP (first string,second int,third int,fourth string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/hdfs/Adata';
Place your data under 'Adata' folder
create external table A (first int,second int,third int,fourth string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/user/hdfs/Afinaldata';
Now let's insert the data:
insert into table A
select cast(substr(first, 2, length(first) - 1) as int), second, third, substr(fourth, 1, length(fourth) - 1) from A_TEMP;
I know the type casting will hit performance, but for the given scenario this is the best I could come up with.

Using Hive in real world applications?

I am a newbie on the Hadoop stack; I have learned MapReduce and now Hive.
But I am not sure how Hive is used in practice.
In MapReduce we have one or more output files and that's our final result, but in Hive we select records using SQL-like queries (HQL) and there is no final output file; the results are only shown on the terminal.
Now my question is: how can the results of these HQL SELECT queries be passed on to some other analytics team?
There are a lot of ways to extract/export Hive query results.
If you want the result in RDBMS storage, you can use Sqoop.
I suggest you go through what Sqoop is and what it does.
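As a hedged example of what that looks like (the connection string, credentials, target table, and export directory are all placeholders), a Sqoop export of a Hive table's warehouse directory into MySQL is roughly:
# bash
sqoop export \
  --connect jdbc:mysql://dbhost:3306/analytics \
  --username etl_user -P \
  --table hive_results \
  --export-dir /user/hive/warehouse/mydb.db/results \
  --input-fields-terminated-by '\001'   # Hive's default field delimiter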
And if you want your query results in a file, there are several ways.
Hive supports exporting data from tables:
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from table;
Another simple approach is to redirect your Hive query output to a file while running your queries in the CLI:
hive -e "select * from table" > output.txt

Creating hive table using configuration file

I know the basic concepts of Hive. My question is about creating a Hive table using an external configuration/schema file.
I know the basic query to create a Hive table, where we pass the column headers and data types in the CREATE TABLE statement; in other words, we hard-code them.
But I want to create the Hive table so that it takes the column headers and data types from an external configuration file. Can this be done in Hive? It's fine even if we have to write a Unix shell script to achieve it, but I'm not sure how.
Below is the format of my configuration file :
Config.txt
id,Integer(2),NOT NULL
name,String(20)
state,String(5),NOT NULL
phone_no,Integer(4)
gender,Char(1)
As of now I have created an .hql file containing the Hive CREATE TABLE statement, and I call the .hql file from a bash script.
Below are the .hql file and .sh file:
hiveQ.hql:
create table goodrecs(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/hduser/Dataparse/goodrec' INTO TABLE goodrecs;
testscript.sh :
#!/bin/bash
hive -f hiveQ.hql
In hiveQ.hql, I want the column headers and data types to come from the Config.txt file.
How can this be done?
Thanks in advance
It is quite straightforward to turn Config.txt into a standard .hql file: use a map that converts the types in Config.txt to Hive column types, such as Integer to int and Char to string.
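A rough bash sketch of that approach, reusing the file names from the question (the type map and the handling of the length/constraint parts are the only assumptions here):
# bash
#!/bin/bash
# Generate hiveQ.hql from Config.txt by mapping config types to Hive column types.
config="Config.txt"
cols=()
# The third field (e.g. NOT NULL) is read but not used; the DDL here ignores constraints.
while IFS=',' read -r col ctype constraint || [ -n "$col" ]; do
    base=$(echo "$ctype" | cut -d'(' -f1 | tr '[:upper:]' '[:lower:]')
    case "$base" in
        integer)     hive_type="int"    ;;
        string|char) hive_type="string" ;;
        *)           hive_type="string" ;;   # default for anything unmapped
    esac
    cols+=("${col} ${hive_type}")
done < "$config"

{
  echo "create table goodrecs ("
  last=$(( ${#cols[@]} - 1 ))
  for i in "${!cols[@]}"; do
      if [ "$i" -lt "$last" ]; then echo "  ${cols[$i]},"; else echo "  ${cols[$i]}"; fi
  done
  echo ") row format delimited fields terminated by ',' stored as textfile;"
  echo "LOAD DATA INPATH '/user/hduser/Dataparse/goodrec' INTO TABLE goodrecs;"
} > hiveQ.hql

hive -f hiveQ.hql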

What is the best way to produce large results in Hive

I've been trying to run some Hive queries with largish result sets. My normal approach is to submit a job through the WebHCat API, and read the results from the resulting stdout file, or to just run hive at the console and pipe stdout to a file. However, with large results (more than one reducer used), the stdout is blank or truncated.
My current solution is to create a new table from the results (CREATE TABLE ... AS SELECT), which introduces an extra step and leaves a table to clean up afterwards if I don't want to keep the result set.
Does anyone have a better method for capturing all the results from such a Hive query?
You can write the data directly to a directory on either HDFS or the local file system, then do what you want with the files. For example, to generate CSV files:
INSERT OVERWRITE DIRECTORY '/hive/output/folder'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT ... FROM ...;
This is essentially the same as CREATE TABLE ... AS SELECT, but you don't have to clean up the table. Here's the full documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
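Since Hive writes one output file per reducer into that directory, something like the following (using the example folder above) pulls everything back as a single local file:
# bash
hdfs dfs -getmerge /hive/output/folder results.csv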
