Insert data into a ClickHouse table from a file with deduplication - clickhouse

I'm importing data into a ClickHouse table from CSV files.
cat data.csv | clickhouse-client --config-file=config.xml --query="INSERT INTO data_pool FORMAT CSVWithNames"
Often CSV files contain duplicate entries that are already in the ClickHouse table. What is the most efficient way to insert new data from a CSV file, skipping the entries already in the table?

Something like this:
cat data.csv | clickhouse-client --echo -mn --config-file=config.xml --query="CREATE TABLE tmp_data_pool AS data_pool ENGINE=ReplacingMergeTree() ORDER BY f1,f2,f3...fieldN; INSERT INTO tmp_data_pool FORMAT CSVWithNames; INSERT INTO data_pool SELECT * FROM tmp_data_pool FINAL; DROP TABLE tmp_data_pool SYNC"
See the details about FINAL and ReplacingMergeTree() here:
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree/
https://clickhouse.com/docs/en/operations/settings/settings#max-final-threads
and
https://github.com/ClickHouse/ClickHouse/pull/19375
for optimizations that improve FINAL performance.
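The same pipeline can also be split into separate client calls, which is easier to read and debug (a sketch; the ORDER BY fields are placeholders, as in the one-liner above):
# 1. Staging table with the same schema but a deduplicating engine
clickhouse-client --config-file=config.xml --query="CREATE TABLE tmp_data_pool AS data_pool ENGINE=ReplacingMergeTree() ORDER BY (f1, f2, f3)"
# 2. Load the CSV into the staging table
cat data.csv | clickhouse-client --config-file=config.xml --query="INSERT INTO tmp_data_pool FORMAT CSVWithNames"
# 3. FINAL collapses rows with the same ORDER BY key before they are copied
clickhouse-client --config-file=config.xml --query="INSERT INTO data_pool SELECT * FROM tmp_data_pool FINAL"
# 4. Clean up the staging table
clickhouse-client --config-file=config.xml --query="DROP TABLE tmp_data_pool SYNC"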

Related

Convert data from gzip to SequenceFile format using Hive on Spark

I'm trying to read a large gzip file into Hive through the Spark runtime to convert it into SequenceFile format, and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, just as it does for text files.
Is there a way to change the number of mappers for a gzip file being read? Or should I choose another format like Parquet?
I'm currently stuck. The problem is that my log file is JSON-like data saved in txt format, which was then gzipped, so for reading it I used org.apache.spark.sql.json.
The examples I have seen that show converting data into SequenceFile use simple delimiters, like CSV format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it as something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As a gzipped text file is not splittable, only one mapper will be launched; you have to choose other data formats if you want to use more than one mapper.
If there are huge JSON files and you want to save storage on HDFS, use bzip2 compression to compress your JSON files on HDFS. Unlike gzip, bzip2 is splittable, so Hive can read it with multiple mappers. You can query .bzip2 JSON files from Hive without modifying anything.
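A minimal shell sketch of that route (the HDFS paths and file names are hypothetical):
# Pull the gzipped log, recompress it with splittable bzip2, and push it back to HDFS
hdfs dfs -cat /logs/file_name.txt.gz | gunzip | bzip2 > file_name.txt.bz2
hdfs dfs -put file_name.txt.bz2 /logs/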

How to preprocess the data and load it into Hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got the data as .tar files of structured files, and the first 4 lines contain a description. I'm a little confused about how to process this type of data.
1.a Can I directly process the data, given these are tar files? If yes, how do I remove the data in the first four lines, or do I need to untar the archive and remove the first 4 lines manually?
1.b I want to process this data using Hive.
Please suggest how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the solution below.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties("skip.header.line.count"="1"), set when creating a table, that tells Hive the number of header rows to ignore. To ignore the first four lines, use tblproperties("skip.header.line.count"="4"):
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4"); -- skip the 4 description lines when reading this table
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the steps below to achieve your goal:
1. Copy the data (i.e. the tar file) to the client system where Hadoop is installed.
2. Untar the file, manually remove the description lines, and save the result locally (see the shell sketch at the end of this answer).
3. Create the metadata (i.e. the table) in Hive based on the description.
E.g.: if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make note of the field separator used in the data file and use the corresponding field separator in the create table query. Assuming the file contains two columns separated by a comma, the syntax to create the table in Hive is:
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ',';
Since the data is in a structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be moved to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;
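A minimal shell sketch of step 2 (archive and file names are hypothetical):
# Unpack the archive
tar -xf data.tar
# Drop the first 4 description lines, keeping everything from line 5 onward
tail -n +5 datafile.csv > datafile_clean.csv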

Hive, create table ___ like ___ stored as ___

I have a table in hive stored as text files. I want to move all the data into another table with the same schema but stored as sequence files.
How do I create the second table? I wanted to use the Hive create table like command, but it doesn't support stored as sequencefile:
hive> create table test_sq like test_t stored as sequencefile;
FAILED: ParseException line 1:33 missing EOF at 'stored' near 'test_t'
I am looking for a programmatic way so that I can replicate the same process for more tables.
CREATE TABLE test_sq LIKE test_t;
This just copies the source table definition; the new table contains no rows. Since, as you said, you have to move all the data, the query above is not suitable.
Try this:
CREATE TABLE test_sq row format delimited fields terminated by '|' STORED AS sequencefile AS select * from test_t;
The target cannot be a partitioned table.
The target cannot be an external table.
This copies the structure as well as the data.
Note: if you don't want to specify row format delimited, remove it from the query. You can also add a WHERE clause to the query to copy only selected rows.
Try using create + insert together.
Use the normal DDL statement to create the table.
CREATE TABLE test2 (a INT) STORED AS SEQUENCEFILE;
then use
INSERT INTO TABLE test2 SELECT * FROM test;
test is the table with TEXTFILE as its data format and test2 is the table with SEQUENCEFILE as its data format.
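Since you want a programmatic way to repeat this for more tables, a minimal shell sketch driving the CTAS form from the first answer (the table names and the _seq suffix are hypothetical):
# Create a SequenceFile copy of each listed table
for t in test_t table_a table_b; do
  hive -e "CREATE TABLE ${t}_seq STORED AS SEQUENCEFILE AS SELECT * FROM ${t};"
done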

How to use Parquet in my current architecture?

My current system is architected in this way:
A log parser parses the raw logs every 5 minutes into TSV format and outputs them to HDFS. I created a Hive table over the TSV files in HDFS.
From some benchmarking, I found that Parquet can save 30-40% of the space usage. I also found that I can create a Hive table over Parquet files starting with Hive 0.13. I would like to know if I can convert the TSV files to Parquet files.
Any suggestion is appreciated.
Yes, in Hive you can easily convert from one format to another by inserting from one table to the other.
For example, if you have a TSV table defined as:
CREATE TABLE data_tsv
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
And a Parquet table defined as:
CREATE TABLE data_parquet
(col1 STRING, col2 INT)
STORED AS PARQUET;
You can convert the data with:
INSERT OVERWRITE TABLE data_parquet SELECT * FROM data_tsv;
Or you can skip the Parquet table DDL altogether with a CTAS:
CREATE TABLE data_parquet STORED AS PARQUET AS SELECT * FROM data_tsv;
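To check the space savings after the conversion, you can compare the sizes of the two table directories (a sketch; the paths assume Hive's default warehouse location):
hdfs dfs -du -h /user/hive/warehouse/data_tsv
hdfs dfs -du -h /user/hive/warehouse/data_parquet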

Hive - How to load data from a file with filename as a column?

I am running the following commands to create my table ABC and insert data from all files in my designated file path. Now I want to add a column with the filename, but I can't find a way to do that without looping through the files or something similar. Any suggestions on the best way to do this?
CREATE TABLE ABC
(NAME string
,DATE string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
hive -e "LOAD DATA LOCAL INPATH '${DATA_FILE_PATH}' INTO TABLE ABC;"
Hive does have virtual columns, which include INPUT__FILE__NAME; the Hive documentation on virtual columns shows how to use it in a statement.
To fill another table with the filename as a column, assuming the location of your data is hdfs://hdfs.location:port/data/folder/filename1:
DROP TABLE IF EXISTS ABC2;
CREATE TABLE ABC2 (
filename STRING COMMENT 'this is the file the row was in',
name STRING,
date STRING);
INSERT INTO TABLE ABC2 SELECT split(INPUT__FILE__NAME,'folder/')[1],* FROM ABC;
You can alter the split to change how much of the full path you actually want to store.
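For example, to store only the file name itself instead of everything after 'folder/', a sketch using Hive's built-in regexp_extract:
-- '[^/]+$' matches the last path segment; index 0 returns the whole match
INSERT INTO TABLE ABC2 SELECT regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0), * FROM ABC;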
