How to insert records chunk by chunk in Vertica - vertica

I have a CSV file containing large amount of data.
I am trying bulk insert by chunks.
Is it possible to insert 1000 records from that CSV file using COPY command of Vertica? And looping every 1000 records then next.
I am using :
copy bulk from local 'C:\FileUpload\bulk.csv'
delimiter ','
ENCLOSED BY '"'
null as ''
exceptions 'C:\FileUpload\bulk.log'
Thanks

Related

Convert data from gzip to sequenceFile format using Hive on spark

I'm trying to read a large gzip file into hive through spark runtime
to convert into SequenceFile format
And, I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read? or should I choose another format like parquet?
I'm stuck currently.
The problem is that my log file is json-like data save into txt-format and then was gzip - ed, so for reading I used org.apache.spark.sql.json.
The examples I have seen that show - converting data into SequenceFile have some simple delimiters as csv-format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it in something like that:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As gzip textfile file is not splitable ,only one mapper will be launched or
you have to choose other data formats if you want to use more than one
mappers.
If there are huge json files and you want to save storage on hdfs use bzip2
compression to compress your json files on hdfs.You can query .bzip2 json
files from hive without modifying anything.

how to preprocess the data and load into hive

I completed my hadoop course now I want to work on Hadoop. I want to know the workflow from data ingestion to visualize the data.
I am aware of how eco system components work and I have built hadoop cluster with 8 datanodes and 1 namenode:
1 namenode --Resourcemanager,Namenode,secondarynamenode,hive
8 datanodes--datanode,Nodemanager
I want to know the following things:
I got data .tar structured files and first 4 lines have got description.how to process this type of data im little bit confused.
1.a Can I directly process the data as these are tar files.if its yes how to remove the data in the first four lines should I need to untar and remove the first 4 lines
1.b and I want to process this data using hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data as these are tar files.
Yes, see the below solution.
if yes, how to remove the data in the first four lines
Starting Hive v0.13.0, There is a table property, tblproperties ("skip.header.line.count"="1") while creating a table to tell Hive the number of rows to ignore. To ignore first four lines - tblproperties ("skip.header.line.count"="4")
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE
tblproperties("skip.header.line.count"="4");
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data(ie.tar file) to the client system where hadoop is installed.
Untar the file and manually remove the description and save it in local.
Create the metadata(i.e table) in hive based on the description.
Eg: If the description contains emp_id,emp_no,etc.,then create table in hive using this information and also make note of field separator used in the data file and use the corresponding field separator in create table query. Assumed that file contains two columns which is separated by comma then below is the syntax to create the table in hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since, data is in structured format, you can load the data into hive table using the below command.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME.
Now, local data will be moved to hdfs and loaded into hive table.
Finally, you can query the hive table using SELECT * FROM TABLENAME;

SQLLDR with Sequence

I am using sqlldr for uploading data to oracle table from a csv.
For one column i need to use SEQUENCE.
My control file looks like this.
LOAD DATA
INFILE "D:\WORKER\ADS4014_GeneralJobWorkers_1426054269727.csv"
BADFILE dataFile.bad
APPEND INTO TABLE GJobWorkers
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(JobId,UserId,EmpId,ChkInTme,ChkOutTme,tDate,TotalHrs,JobCode,Modified_Time,GJ_SEQ "jobs_GJ.nextval")
But i get error in the log file :
Error on table GJobWorkers, column GJ_SEQ.
Column not found before end of logical record (use TRAILING NULLCOLS)
Can anyone help me doing this?
Thanks in advance.

Insert into Parquet file generates 512 MB files. How to generate 1 GB file?

I am testing Parquet file format and inserting data into Parquet file using Impala external table.
Following is the parameter set that may affect the Parquet file size:
NUM_NODES: 1
PARQUET_COMPRESSION_CODEC: none
PARQUET_FILE_SIZE: 1073741824
I am using following insert statement to write into Parquet file.
INSERT INTO TABLE parquet_test.parquetTable
PARTITION (pkey=X)
SELECT col1, col2, col3 FROM map_impala_poc.textTable where col1%100=X;
I want to generate file size of approximately 1 GB and partitioned data accordingly so that each partition has little less than 1 GB of data in Parquet format. But, this insert operation doesn't generate single file of more than 512 MB. It writes 512 MB of data into one file and then creates another file and writes rest of the data to another file. What can be done to write all the data into single file?
try setting parquet size in the same session you are executing the query
set PARQUET_FILE_SIZE=1g;
INSERT INTO TABLE parquet_test.parquetTable ...

Hive gzip file decompression

I have loaded bunch of .gz file into HDFS and when I create a raw table on top of them I am seeing strange behavior when counting number of rows. Comparing the result of the count(*) from the gz table versus the uncompressed table results in ~85% difference. The table that has the file gz compressed has less records. Has anyone seen this?
CREATE EXTERNAL TABLE IF NOT EXISTS test_gz(
col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION '/data/raw/test_gz'
;
select count(*) from test_gz; result 1,123,456
select count(*) from test; result 7,720,109
I was able to resolve this issue. Somehow the gzip files were not fully getting decompressed in map/reduce jobs (hive or custom java map/reduce). Mapreduce job would only read about ~450 MB of the gzip file and write out the data out to HDFS without fully reading the 3.5GZ file. Strange, no errors at all!
Since the files were compressed on another server, I decompressed them manually and re-compressed them on the hadoop client server. After that, I uploaded the newly compressed 3.5GZ file to HDFS, and then hive was able to fully count all the records reading the whole file.
Marcin

Resources