Impala minimum DDL - hadoop

I know that we can create an Impala table like
CREATE EXTERNAL TABLE SCHEMA.TableName LIKE PARQUET
'/rootDir/SecondLevelDir/RawFileThatKnowsDataTypes.parquet'
But I am not sure whether Impala can create a table from a file (preferably a text file) that has no known formatting. In other words, if I just dump an arbitrary file into Hadoop with a put command, can I wrap an Impala DDL around it and have a table created? Can anyone tell me?

If your file is newline-separated, I believe this should work as long as you specify the column delimiter with the ROW FORMAT clause, since text file is the default format. Drop the LIKE clause and choose names and data types for your columns. Note that LOCATION must name an HDFS directory that holds the data files, not a single file, so RawFile below is assumed to be such a directory. Something like this:
CREATE EXTERNAL TABLE SCHEMA.TableName (col1 STRING, col2 INT, col3 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/rootDir/SecondLevelDir/RawFile';

Related

How can I create a partitioned table that is semicolon-separated and has commas as decimal points?

I'm having problems with this type of table:
manager; sales
charles; 100,1
ferdand; 212,6
aldalbert; 23,4
chuck; 41,6
I'm using the code below to create and define the partitioned table:
CREATE TABLE db.table
(
manager string,
sales string
)
partitioned by (file_type string)
row format delimited fields terminated by ';'
lines terminated by '\n'
tblproperties ("skip.header.line.count"="1");
Afterwards, I use a regex to replace the commas with dots and then convert the sales field to a numeric data type.
I wonder if there is a better solution than that.
No, not unless you use Spark or Pig to clean the data before loading the Hive table. Otherwise you'll need to replace the comma and cast the sales column within HiveQL to get the format you want.
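The replace-and-cast approach could look roughly like the query below. This is only a sketch: the table and column names are the placeholders from the question, the DECIMAL(10,1) precision is an assumption based on the sample rows, and trim() is there because the sample data has a space after each semicolon.

```sql
-- Sketch only: db.table is the question's placeholder name
-- (a real table called "table" would need backticks in Hive).
-- DECIMAL(10,1) is an assumed precision, adjust it to your data.
SELECT
  manager,
  CAST(regexp_replace(trim(sales), ',', '.') AS DECIMAL(10,1)) AS sales_num,
  file_type
FROM db.table;
```

If the conversion is needed in many queries, the same SELECT could be wrapped in a view so the raw string column stays untouched.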

hive: external partitioned table without location

Is it possible to create an external partitioned table without a location? I want to add all the locations later, together with the partitions.
I tried:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
PARTITIONED BY day;
But I got ParseException: missing EOF at 'PARTITIONED' near 'TEXTFILE'
I don't think so, as said under "alter location".
Anyway, I think your query has some errors, and the correct script would be:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
PARTITIONED BY (day String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
;
I think the issue is that you have not specified a data type for your partition column "day" (the PARTITIONED BY clause also has to come before ROW FORMAT, which is what the parse error points at). You can create a Hive external table without a location and use the ALTER TABLE options later to change the location.
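As a rough illustration of adding the locations later together with the partitions, something along these lines should work against the corrected table above. The day value and the HDFS path are hypothetical and only stand in for your own.

```sql
-- Sketch only: the partition value and the path are made-up examples.
-- Each partition is attached to its own directory when it is added.
ALTER TABLE a.b ADD IF NOT EXISTS
  PARTITION (day = '2015-01-01')
  LOCATION '/some/hdfs/dir/2015-01-01';
```

Because the table is external, dropping a partition later removes only the metadata and leaves the files in that directory in place.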

printing null values in hive while declaring the column as decimal

I declared a column as decimal. The data looks like this:
4500.00
5321.00
532.00
Column definition: area DECIMAL(9,2)
but in Hive it shows like this:
NULL
NULL
If I declare the column as a string it works fine. But I need it in decimal only.
I believe this could be a delimiter mismatch in the external table.
If you are using an external table, you may have configured a delimiter that differs from the one actually present in the file.
Find the actual delimiter and alter the table with the command below:
alter table <table_name> set SERDEPROPERTIES ('field.delim' = '<actual_delimiter>');
Alternatively, set it in the CREATE TABLE statement, for example:
CREATE TABLE figure(length DOUBLE, width DOUBLE, area DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Replace '\t' with the actual delimiter from your data file.

How to use Parquet in my current architecture?

My current system is set up as follows.
A log parser parses the raw logs every 5 minutes and writes the output to HDFS in TSV format. I created a Hive table on top of the TSV files in HDFS.
From some benchmarks, I found that Parquet can save 30-40% of the space usage. I also found that I can create a Hive table on top of Parquet files starting with Hive 0.13. I would like to know if I can convert the TSV files to Parquet.
Any suggestions are appreciated.
Yes, in Hive you can easily convert from one format to another by inserting from one table to the other.
For example, if you have a TSV table defined as:
CREATE TABLE data_tsv
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
And a Parquet table defined as:
CREATE TABLE data_parquet
(col1 STRING, col2 INT)
STORED AS PARQUET;
You can convert the data with:
INSERT OVERWRITE TABLE data_parquet SELECT * FROM data_tsv;
Or you can skip the separate Parquet table DDL and create the table directly from the query:
CREATE TABLE data_parquet STORED AS PARQUET AS SELECT * FROM data_tsv;

Share data between hive and hadoop streaming-api output

I have several Hadoop streaming API programs that produce output with this output format:
"org.apache.hadoop.mapred.SequenceFileOutputFormat"
The streaming API programs can read the files with the input format "org.apache.hadoop.mapred.SequenceFileAsTextInputFormat".
The data in the output file looks like this:
val1-1,val1-2,val1-3
val2-1,val2-2,val2-3
val3-1,val3-2,val3-3
Now I want to read the output with Hive. I created a table with this script:
CREATE EXTERNAL
TABLE IF NOT EXISTS table1
(
col1 int,
col2 string,
col3 int
)
PARTITIONED BY (year STRING,month STRING,day STRING,hour STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/hive/table1';
When I query the data with
select * from table1
the result is
val1-2,val1-3
val2-2,val2-3
val3-2,val3-3
It seems the first column has been ignored. I think Hive uses only the values as output, not the keys. Any ideas?
You are correct. One of the limitations of Hive right now is that it ignores the keys of the SequenceFile format. By "right now" I am referring to Hive 0.7, but I believe it is a limitation of Hive 0.8 and Hive 0.9 as well.
To work around this, you might have to create a new input format in which the key is null and the value is the combination of your present key and value. Sorry, I know this was not the answer you were looking for!
It should be fields terminated by ','
instead of fields terminated by '\t', I think.
