Unable to load data from a CSV file into HIVE - hadoop

I am getting 'None' values while loading data from a CSV file into a Hive external table.
My CSV file structure is like this:
creation_month,accts_created
7/1/2018,40847
6/1/2018,67216
5/1/2018,76009
4/1/2018,87611
3/1/2018,99687
2/1/2018,92631
1/1/2018,111951
12/1/2017,107717
'creation_month' and 'accts_created' are my column headers.
create external table monthly_creation
(creation_month DATE,
accts_created INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/dir4/'
The location is '/user/dir4/' because that's where I put the 'monthly_acct_creation.csv' file.
I have no idea why the external table I created has all 'None' values when the source data has dates and numbers.
Can anybody help?

DATE values describe a particular year/month/day, in the form YYYY-MM-DD. For example, DATE '2013-01-01'.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-date
I suggest using string type for your date column, which you can convert later or parse into timestamps.
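For example, a minimal sketch (my suggestion, not part of the original answer) of parsing the M/d/yyyy strings into yyyy-MM-dd dates at query time, assuming creation_month is stored as a string:
-- hedged sketch: convert the M/d/yyyy strings into proper dates when querying
select to_date(from_unixtime(unix_timestamp(creation_month, 'M/d/yyyy'))) as creation_month,
       accts_created
from monthly_creation;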
Regarding the integer column, you'll need to skip the header row for the values to be properly converted to int.
By the way, new versions of HUE allow you to build Hive tables directly from CSV files.

The DATE data type in Hive only accepts the yyyy-MM-dd format; since your date field is not in that format, you get NULL values in the creation_month column.
Create the table with the creation_month field as a string datatype and skip the first line by using the skip.header.line.count property in the CREATE TABLE statement.
Try the DDL below:
hive> create external table monthly_creation
(creation_month string,
accts_created INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Location '/user/dir4/'
tblproperties ("skip.header.line.count"="1");
hive> select * from monthly_creation;
+-----------------+----------------+--+
| creation_month  | accts_created  |
+-----------------+----------------+--+
| 7/1/2018        | 40847          |
| 6/1/2018        | 67216          |
| 5/1/2018        | 76009          |
| 4/1/2018        | 87611          |
| 3/1/2018        | 99687          |
| 2/1/2018        | 92631          |
| 1/1/2018        | 111951         |
| 12/1/2017       | 107717         |
+-----------------+----------------+--+

Related

Concat variable with a string in location parameter in a create statement?

In Greenplum, I need to create an external table with a dynamic location parameter. For example:
CREATE READABLE TABLE_A(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION(:location)
FORMAT 'TEXT' (DELIMITER '|');
But I need to concat the :location parameter with a fixed string. I tried:
LOCATION (:location || '123')
But I get a syntax error, whereas in a SELECT statement it works perfectly. I'm passing the :location value like: " 'gphdfs://teste:1010/tmp' "
Can anyone help me?
You are missing a few things in your table definition. You forgot "external" and "table".
CREATE READABLE EXTERNAL TABLE table_a
(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION(:location)
FORMAT 'TEXT' (DELIMITER '|');
Note: gphdfs has been deprecated and you should use PXF or gpfdist instead.
Next, you just need to use double quotes around the location value.
[gpadmin@mdw ~]$ psql -f example.sql -v location="'gpfdist://teste:1010/tmp'"
CREATE EXTERNAL TABLE
[gpadmin@mdw ~]$ psql
psql (9.4.24)
Type "help" for help.
gpadmin=# \d+ table_a
                         External table "public.table_a"
  Column   |       Type        | Modifiers | Storage  | Stats target | Description
-----------+-------------------+-----------+----------+--------------+-------------
 date_inic | date              |           | plain    |              |
 func_name | character varying |           | extended |              |
 q_session | bigint            |           | plain    |              |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter '|' null '\N' escape '\'
External options: {}
External location: gpfdist://teste:1010/tmp
Execute on: all segments
And from bash, you can just concat the strings together too.
loc="gpfdist://teste"
port="1010"
dir="tmp"
location="'$loc:$port/$dir'"
psql -f example.sql -v location="$location"

Change table column name parquet format Hadoop

I have a table with columns a, b, c.
The data is stored on HDFS as Parquet; is it possible to change a specific column name even if the Parquet files were already written with the schema a, b, c?
read file in a loop
create a new df with changed column name
write new df in append mode in another dir
move this new dir to read dir
import subprocess

# OutDir / newdir and the SparkSession 'spark' are assumed to be defined already
cmd = ['hdfs', 'dfs', '-ls', OutDir]
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
for i in process.communicate():
    if i:
        for j in i.decode('utf-8').strip().split():
            if j.endswith('snappy.parquet'):
                print('reading file ', j)
                mydf = spark.read.format("parquet").option("inferSchema", "true") \
                    .option("header", "true") \
                    .load(j)
                print('df built on bad file')
                mydf.createOrReplaceTempView("dtl_rev")
                # rename the columns by aliasing them in the select
                ssql = """select old_name AS new_name,
                                 old_col AS new_col from dtl_rev"""
                newdf = spark.sql(ssql)
                print('df built on renamed file')
                newdf.write.format("parquet").mode("append").save(newdir)
We cannot rename a column in the existing files; Parquet stores the schema inside the data files.
We can check the schema using the command below:
parquet-tools schema part-m-00000.parquet
To rename a column, we have to take a backup into a temp table and re-ingest the history data.
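A minimal sketch of that backup-and-re-ingest approach, assuming a managed Parquet table named mytable with columns a, b, c (the table name, the column types, and the new column name new_a are placeholders, not from the original post):
-- hedged sketch: back up, recreate with the new column name, re-ingest
create table mytable_backup as select * from mytable;
drop table mytable;
create table mytable (new_a int, b int, c string) stored as parquet;
insert into mytable select a, b, c from mytable_backup;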
Try using ALTER TABLE:
desc p;
+-------------------------+------------+----------+--+
| col_name                | data_type  | comment  |
+-------------------------+------------+----------+--+
| category_id             | int        |          |
| category_department_id  | int        |          |
| category_name           | string     |          |
+-------------------------+------------+----------+--+
alter table p change column category_id id int
desc p;
+-------------------------+------------+----------+--+
| col_name                | data_type  | comment  |
+-------------------------+------------+----------+--+
| id                      | int        |          |
| category_department_id  | int        |          |
| category_name           | string     |          |
+-------------------------+------------+----------+--+

Unable to load data into hive table in correct format

I am trying to load the table below, which has two array-typed columns, into Hive.
Base table:
Array<int> col1 Array<string> col2
[1,2] ['a','b','c']
[3,4] ['d','e','f']
I have created the table in hive as below:
create table base(col1 array<int>,col2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
And then loaded the data as below:
load data local inpath '/home/hduser/Desktop/batch/hiveip/basetable' into table base;
I used the command below:
select * from base;
I got the output below:
[null,null] ["['a'","'b'","'c']"]
[null,null] ["['d'","'e'","'f]"]
I am not getting the data in the correct format.
Please help me figure out where I am going wrong.
You can change the datatype of col1 to array<string> instead of array<int>; then you can get the data for col1.
With col1 datatype as array<string>:
hive>create table base(col1 array<string>,col2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
hive>select * from base;
+--------------+------------------------+--+
| col1         | col2                   |
+--------------+------------------------+--+
| ["[1","2]"]  | ["['a'","'b'","'c']"]  |
| ["[3","4]"]  | ["['d'","'e'","'f']"]  |
+--------------+------------------------+--+
This behaviour occurs because Hive is not able to interpret the values inside the array as integers, since the 1,2 values are enclosed in [].
Accessing col1 elements:-
hive>select col1[0],col1[1] from base;
+------+------+--+
| _c0  | _c1  |
+------+------+--+
| [1   | 2]   |
| [3   | 4]   |
+------+------+--+
(or)
With col1 datatype as array<int>:
If you don't want to change the datatype, then you need to keep your input file as below, without the [] square brackets around the array (i.e. col1) values.
1,2 ['a','b','c']
3,4 ['d','e','f']
Then create the table just as you mentioned in the question, and Hive can parse the first field's 1,2 values as array elements of int type.
hive> create table base(col1 array<int>,col2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
hive> select * from base;
+--------+------------------------+--+
| col1   | col2                   |
+--------+------------------------+--+
| [1,2]  | ["['a'","'b'","'c']"]  |
| [3,4]  | ["['d'","'e'","'f']"]  |
+--------+------------------------+--+
Accessing array elements:-
hive> select col1[0] from base;
+------+--+
| _c0  |
+------+--+
| 1    |
| 3    |
+------+--+
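If you also want to strip the leftover quotes and square brackets from the col2 elements, one possible follow-up (my sketch, not part of the original answer) is to clean each element with regexp_replace:
-- hedged sketch: remove [ ] and ' characters from the individual col2 elements
select col1,
       regexp_replace(col2[0], "[\\[\\]']", "") as c2_first,
       regexp_replace(col2[1], "[\\[\\]']", "") as c2_second,
       regexp_replace(col2[2], "[\\[\\]']", "") as c2_third
from base;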

How to load an input file with a combination of delimiter types in Hive?

How do I load an input file with a combination of delimiter types in Hive?
The input file will have a combination of quoted values and XML. How do I load and process the data?
For example, the input data is:
"hi"|"welcome"|"to"|India|<xml>data</xml>
How do I handle this kind of issue if we face it?
Thanks in advance for any ideas or examples.
I need to load the data as
hi|welcome|to|india|data, so how do I handle the XML value when loading the data into Hive?
Use RegexSerDe:
create external table mytable (c1 string,c2 string,c3 string,c4 string,c5 string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
(
'input.regex' = '(".*?"|[^|]*)\\|(".*?"|[^|]*)\\|(".*?"|[^|]*)\\|(".*?"|[^|]*)\\|(.*)'
)
;
select * from mytable
;
+------+-----------+------+-------+-----------------+
| c1   | c2        | c3   | c4    | c5              |
+------+-----------+------+-------+-----------------+
| "hi" | "welcome" | "to" | India | <xml>data</xml> |
+------+-----------+------+-------+-----------------+
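To go one step further and actually produce the hi|welcome|to|india|data form the question asks for, a possible follow-up sketch (mine, not part of the original answer) strips the double quotes with regexp_replace and extracts the XML value with Hive's built-in xpath_string UDF:
-- hedged sketch: clean the quoted fields and pull the text out of <xml>...</xml>
select regexp_replace(c1, '"', '') as c1,
       regexp_replace(c2, '"', '') as c2,
       regexp_replace(c3, '"', '') as c3,
       c4,
       xpath_string(c5, 'xml') as c5
from mytable;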

How to create Hive table with user specified number of records?

Is it possible to create a hive table with user-specified number of records?
For example, I want to create a table with x rows (where x is defined by the user). The table would have two columns: 1. a unique row id [could be auto-incremented], 2. a randomly generated string.
Is this possible using Hive?
set N=7;
select pe.i+1 as n
,java_method ('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str
from (select 1) x
lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x
;
+---+------------+
| n | str        |
+---+------------+
| 1 | udttBCmtxT |
| 2 | kkrMQmirSG |
| 3 | iYDABgXOvW |
| 4 | DKHKgtXKPS |
| 5 | ylebKcdcGj |
| 6 | DaujBCkCtz |
| 7 | VMaWfbtzFY |
+---+------------+
See: posexplode, java_method, RandomStringUtils.
Specifying a limit on the number of rows at table-creation time may not be possible, but it is possible to limit the number of rows inserted into the table using the LIMIT clause.
-- <filename:dbloader.sql>
create table ${hiveconf:TABLENAME} ( id int, string1 string);
insert into ${hiveconf:TABLENAME}
select id, string1 from oldtable limit ${hiveconf:ROWLIMIT};
And while submitting the Hive script:
hive --hiveconf TABLENAME='XYZ' --hiveconf ROWLIMIT=1000 -f dbloader.sql
As far as creating a unique incremental id goes, you will have to write a UDF for it.
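One possible built-in alternative (my suggestion, not from the answer above) is to generate the incremental id with the row_number() window function during the insert, reusing the same oldtable and ROWLIMIT setup as in dbloader.sql:
-- hedged sketch: sequential id via row_number() instead of a custom UDF
insert into ${hiveconf:TABLENAME}
select row_number() over (order by string1) as id,
       string1
from oldtable
limit ${hiveconf:ROWLIMIT};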
