Import flat files with key=value pairs in Hive - hadoop

I have raw files in HDFS in the format
name=ABC age=10 Location=QWERTY
name=DEF age=15 Location=IWIORS
How do I import data from these flat files into a Hive table with only the columns 'name' and 'location'?

You can do the following.
In table declaration, use:
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ' ' --space
        MAP KEYS TERMINATED BY '='
Your table will then have a single column with data type MAP.
You can then retrieve data from that column using the key, as shown in the sketch below.
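A minimal sketch of that approach (the table name kv_raw is illustrative; note that with a single MAP column the space separates the map entries, so the sketch uses COLLECTION ITEMS TERMINATED BY ' ' for it):
CREATE TABLE kv_raw (kv MAP<STRING,STRING>)
ROW FORMAT DELIMITED
        COLLECTION ITEMS TERMINATED BY ' '
        MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE;

-- pull out only the keys you need
SELECT kv['name'] AS name, kv['Location'] AS location FROM kv_raw;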
Other option:
Write your own SerDe. The link below explains the process for JSON data; I am sure you can customize it for your requirements:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

Related

Clickhouse : Inserting data with missing columns from parquet file

I have hundreds of different Parquet files that I want to add to a single table in a ClickHouse database. They all contain the same type of data, but some of them are missing a few columns.
Is there still a way to add the data directly from those Parquet files using a query such as the following?
cat {file_path} | clickhouse-client --query="INSERT INTO table FORMAT Parquet"
If I try doing this, I get an error like this one:
Code: 8. DB::Exception: Column "column_name" is not presented in input data: data for INSERT was parsed from stdin
I tried giving the missing column a NULL or DEFAULT value when creating the table, but I still get the same result, and the exception means no data at all is added from the affected Parquet file.
Is there an easy way to do this with ClickHouse, or do I just have to either fix my Parquet files, or preprocess my data and insert it with another query format that doesn't use Parquet?
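One possible workaround, offered only as a hedged sketch and not taken from this thread: name just the columns that are actually present in a given Parquet file in the INSERT statement, so ClickHouse fills the omitted columns with their defaults (the column names name and age here are hypothetical):
cat file_with_missing_columns.parquet | clickhouse-client --query="INSERT INTO table (name, age) FORMAT Parquet"
Newer ClickHouse versions also expose format settings for tolerating missing columns in Parquet input (something along the lines of input_format_parquet_allow_missing_columns), but whether and under what name that setting exists depends on your version, so verify it before relying on it.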

How to handle new line characters in hive?

I am exporting a table from Teradata to Hive. The table in Teradata has an address field which contains newline characters (\n). I first export the table to a mounted filesystem path from Teradata and then load it into Hive. Record counts mismatch between the Teradata table and the Hive table, since the newline characters are present in Hive.
NOTE: I don't want to handle this through Sqoop while bringing in the data; I want to handle the newline characters while loading into Hive from the local path.
I got this to work by creating an external table with the following options:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;
Then I created a partition pointing to the directory that contains the data files (my table uses partitions), i.e.:
ALTER TABLE STG_HOLD_CR_LINE_FEED ADD PARTITION (part_key='part_week53') LOCATION '/ifs/test/schema.table/staging/';
NOTE: Be sure that when creating your data file you use '\' as the escape character.
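Putting the pieces together, a minimal sketch of that setup (the column list here is invented for illustration; the table name, SerDe options, and partition come from the snippets above):
CREATE EXTERNAL TABLE STG_HOLD_CR_LINE_FEED (
  record_id INT,
  address STRING
)
PARTITIONED BY (part_key STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;

ALTER TABLE STG_HOLD_CR_LINE_FEED ADD PARTITION (part_key='part_week53') LOCATION '/ifs/test/schema.table/staging/';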
The LOAD DATA command in Hive only copies the data files directly into the HDFS table location.
The only reason Hive would split on a newline is if you defined the table as stored as TEXTFILE, which by default uses newlines as record separators, not field separators.
To redefine the table you need something like
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY 'x'
LINES TERMINATED BY 'y'
where x and y are, respectively, the escape character for fields containing newlines and the record delimiter.

Result of Hive unbase64() function is correct in the Hive table, but becomes wrong in the output file

There are two questions:
I use unbase64() to process data and the output is completely correct in both Hive and SparkSQL. But in Presto, the output is displayed incorrectly.
Then I insert the data into both a local path and HDFS, and the data in both output files is wrong.
The code I used to insert data:
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from tmp_ol.aaa;
My questions are:
1. Why is the processed data shown correctly in both Hive and SparkSQL, but not in Presto? Presto on my machine can display this kind of character.
2. Why is the data not shown correctly in the output files? The files are in UTF-8 format.
You can try applying CAST(... AS STRING) to the output of the unbase64() function.
spark.sql("""Select CAST(unbase64('UsImF1dGhvcml6ZWRSZXNvdXJjZXMiOlt7Im5h') AS STRING) AS values FROM dual""").show(false)
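Applied to the original INSERT, a hedged sketch (the column name col is a placeholder for whichever column holds the base64 data):
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select CAST(unbase64(col) AS STRING) from tmp_ol.aaa;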

Does field delimiter matter in binary file formats in Hive?

In TEXTFILE format, data is stored as text with fields separated by a field delimiter. That is why we prefer a non-printable delimiter like Ctrl-A ('\001').
But does the field delimiter have any effect when creating a Hive table stored as RCFile, ORC, Avro, or SequenceFile?
In some Hive tutorials, I have seen the delimiter used with these binary file formats too.
Example:
create table olympic_orcfile(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as orcfile;
Is the field delimiter ignored, or does it matter, in binary file formats in Hive?
It is ignored by RCFILE, ORC, and AVRO, but it does matter for SEQUENCEFILE.
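A short sketch of the difference (shortened column list from the example above; the table names are illustrative):
-- delimiter still matters: SequenceFile values are plain delimited text inside the container
create table olympic_seq (athlete STRING, age INT, country STRING)
row format delimited fields terminated by '\t'
stored as sequencefile;

-- delimiter is ignored: ORC stores columns in its own binary layout
create table olympic_orc (athlete STRING, age INT, country STRING)
stored as orc;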

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work with Hadoop. I want to understand the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I have structured data in .tar files, and the first 4 lines contain a description. I am a little confused about how to process this type of data.
1.a Can I process the data directly since these are tar files? If yes, how do I remove the data in the first four lines? Or do I need to untar the files and remove the first 4 lines?
1.b I want to process this data using Hive.
Please suggest how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the below solution.
If yes, how do I remove the data in the first four lines?
Starting with Hive 0.13.0, there is a table property, tblproperties ("skip.header.line.count"="1"), that you can set when creating a table to tell Hive the number of header rows to ignore. To ignore the first four lines, use tblproperties ("skip.header.line.count"="4").
-- skip.header.line.count goes on the table that reads the raw text file,
-- so the 4 description lines are skipped when the file is queried
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");

CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data (i.e. the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the result locally.
Create the metadata (i.e. the table) in Hive based on the description.
For example, if the description contains emp_id, emp_no, etc., create the table in Hive using that information. Also note the field separator used in the data file and use the corresponding field separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, the syntax to create the table in Hive is:
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since the data is in a structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
The local data will now be copied into HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;
