ClickHouse: Inserting data with missing columns from Parquet files

I have hundreds of different Parquet files that I want to add to a single table in a ClickHouse database. They all contain the same type of data, but some of them have a few missing columns.
Is there still a way to add the data directly from those Parquet files using a query such as
cat {file_path} | clickhouse-client --query="INSERT INTO table FORMAT Parquet"?
If I try doing this, I get an error like this one:
Code: 8. DB::Exception: Column "column_name" is not presented in input data: data for INSERT was parsed from stdin
I tried giving the missing columns a Nullable type or a DEFAULT value when creating the table, but I still get the same result, and the exception means no data at all is inserted from the affected Parquet file.
Is there an easy way to do this with ClickHouse, or do I have to either fix my Parquet files or preprocess my data and insert it with another query format that doesn't use Parquet?
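One workaround, sketched below as a shell example rather than a definitive answer, is to name only the columns that a given file actually contains in the INSERT statement; ClickHouse then fills the omitted columns with their DEFAULT (or NULL for Nullable columns). The column names col_a and col_b are placeholders. Recent ClickHouse versions also have a format setting, input_format_parquet_allow_missing_columns, which can be passed to clickhouse-client; check whether your server version supports it.
# Name only the columns present in this particular Parquet file;
# the omitted columns fall back to their DEFAULT/NULL values.
cat {file_path} | clickhouse-client --query="INSERT INTO table (col_a, col_b) FORMAT Parquet"
# Alternatively, on versions that support it, allow missing Parquet columns directly:
cat {file_path} | clickhouse-client --input_format_parquet_allow_missing_columns=1 --query="INSERT INTO table FORMAT Parquet"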

Related

Cannot read timestamp data from s3 table with parquet data through Hive, LongWritable cannot be cast to TimestampWritableV2

So I'm trying to read a table that's pointed to an s3 bucket with parquet files. The table DDL has the input format as: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
and the output format as: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
I'm getting this error when doing a simple select * from table:
org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2
I'm able to query the data in Athena, and I've searched endlessly on Google. One solution I've seen is to recreate the table but change the data type to string and then convert it to timestamp when querying the table.
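A minimal sketch of that workaround, assuming a hypothetical table with one problematic timestamp column; the table name, column names, and S3 location are placeholders, and depending on how the value is physically stored, from_unixtime() may be needed instead of a plain CAST:
-- Recreate the table with the problematic column declared as STRING
CREATE EXTERNAL TABLE events_raw (
  id BIGINT,
  event_ts STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
-- Convert to a timestamp at query time
SELECT id, CAST(event_ts AS TIMESTAMP) AS event_ts
FROM events_raw;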

Result of Hive unbase64() function is correct in the Hive table, but becomes wrong in the output file

There are two questions:
I use unbase64() to process data, and the output is completely correct in both Hive and SparkSQL. But in Presto the output is garbled.
Then I insert the data into both a local path and HDFS, and the data in both output files is wrong.
The code I used to insert data:
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from tmp_ol.aaa;
My questions are:
1. Why can the processed data be shown correctly in both Hive and SparkSQL, but not in Presto? The Presto on my machine can display this kind of character.
2. Why can't the data be shown correctly in the output files? The files are in UTF-8 format.
You can try using CAST(... AS STRING) over the output of the unbase64() function.
spark.sql("""SELECT CAST(unbase64('UsImF1dGhvcml6ZWRSZXNvdXJjZXMiOlt7Im5h') AS STRING) AS value""").show(false)
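The same cast can be applied on the Hive side when writing the output files, by wrapping the unbase64() call (or the already-decoded binary column, if tmp_ol.aaa stores the decoded value) in CAST(... AS STRING). This is only a sketch; encoded_col is a placeholder, since the schema of tmp_ol.aaa isn't shown:
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select CAST(unbase64(encoded_col) AS STRING) from tmp_ol.aaa;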

Convert data from gzip to SequenceFile format using Hive on Spark

I'm trying to read a large gzip file into Hive through the Spark runtime
to convert it into SequenceFile format,
and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, since gzip is not splittable.
Is there a way to change the number of mappers for a gzip file being read, or should I choose another format like Parquet?
I'm stuck currently.
The problem is that my log file is JSON-like data saved in txt format and then gzipped, so for reading I used org.apache.spark.sql.json.
The examples I have seen for converting data into SequenceFile use simple delimiters, as in CSV format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it into something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As a gzipped text file is not splittable, only one mapper will be launched; you have to choose other data formats if you want to use more than one mapper.
If you have huge JSON files and want to save storage on HDFS, use bzip2 compression (bzip2 is splittable) to compress your JSON files on HDFS. You can query bzip2-compressed JSON files from Hive without modifying anything.
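Putting the asker's two snippets together, here is a minimal sketch of the two-step approach, assuming the gzipped JSON really contains id and name fields and that the staging table name is a placeholder: stage the gzipped JSON through a JSON-backed table, then rewrite it into the SequenceFile table.
-- Staging table reading the gzipped JSON directly (a single mapper, since gzip is not splittable)
CREATE TABLE table_1_text
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
-- Target table stored as SequenceFile
CREATE TABLE table_1 (
  ID BIGINT,
  NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
STORED AS SEQUENCEFILE;
-- The rewrite step can run with more parallelism than the gzip read
INSERT OVERWRITE TABLE table_1 SELECT id, name FROM table_1_text;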

How to preprocess the data and load it into Hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
1. I got the data as .tar files of structured data, and the first 4 lines contain a description. I am a little bit confused about how to process this type of data.
1.a Can I directly process the data as these are tar files? If yes, how do I remove the data in the first four lines, or do I need to untar the files and remove the first 4 lines manually?
1.b I want to process this data using Hive.
Please suggest how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the solution below.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties("skip.header.line.count"="1"), specified while creating a table, that tells Hive the number of header rows to ignore; it applies to the text table that the raw file is loaded into. To ignore the first four lines, use tblproperties("skip.header.line.count"="4").
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence;
Reference: Compressed Data Storage
Follow the steps below to achieve your goal:
Copy the data (i.e., the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the result locally.
Create the metadata (i.e., the table) in Hive based on the description.
E.g., if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make a note of the field separator used in the data file and use the corresponding separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, the syntax to create the table in Hive is:
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ',';
Since the data is in a structured format, you can load it into the Hive table using the command below:
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be moved to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;

Moving partial data across Hadoop instances

I have to copy a certain chunk of data from one Hadoop cluster to another. I wrote a Hive query which dumps the data into HDFS. After copying the file to the destination cluster, I tried to load the data using the command "load data inpath '/a.txt' into table data". I got the following error message:
Failed with exception Wrong file format. Please check the file's format.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
I had dumped the data as a sequence file. Can anybody let me know what I am missing here?
You should use STORED AS SEQUENCEFILE while creating the table if you want to store a sequence file in the table. Also, you have written that you dumped the data as a sequence file, but your file name is a.txt; I didn't get that.
If you want to load a text file into a table that expects a sequence file as the data source, you can do one thing: first create a normal table and load the text file into this table. Then do:
insert into table seq_table select * from text_table;
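A minimal sketch of that two-step approach, assuming the dumped file is plain comma-delimited text with two example columns; the table names, column names, and delimiter are placeholders:
-- Staging table stored as plain text, matching the dumped file
CREATE TABLE text_table (id BIGINT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA INPATH '/a.txt' INTO TABLE text_table;
-- Target table stored as SequenceFile
CREATE TABLE seq_table (id BIGINT, name STRING)
STORED AS SEQUENCEFILE;
insert into table seq_table select * from text_table;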
