Vertica performance issue while copying parquet files from S3

I have 66 parquet files with GZIP compression, consisting of 2 billion records, in S3. I am using a Vertica COPY command to copy the data from S3 to Vertica as below.
COPY schema.table ( col1, col2, col3, col4, col5, col6 ) FROM 's3://test_path/*' PARQUET ABORT ON ERROR DIRECT NO COMMIT
We have 4 Vertica nodes. Copying these 2 billion rows takes 45+ minutes. The Vertica documentation says that loading files from S3 runs multithreaded on multiple nodes by default. I was told by our DBA that the way to reach the best performance is to run 66 queries (1 query per file) in parallel; this way each query would run on a different node and each query would load a different file.
The Vertica COPY command is called programmatically from Java. I don't want to run a COPY command per file; that becomes a hassle for maintaining the transaction, and the number of files might also increase to 1000+ during peak load.
I want to bring down the execution time using only one COPY command; any pointers and help will be really appreciated.
Thanks

Try this one:
COPY schema.table ( col1, col2, col3, col4, col5, col6 )
FROM 's3://test_path/*' ON ANY NODE PARQUET
ABORT ON ERROR DIRECT;
I don't have any idea how your DBA got the idea you should run one command per file ...
The command above will involve each of the 4 nodes in the parsing phase, usually even with several parallel parsers per node. Once the parsing phase is completed, the data will be segmented across the nodes according to the table's projections' segmentation scheme; each node will sort and encode its own data and finally write it to disk - and will commit at the end. Just remember the ON ANY NODE directive after the file glob string.
With a few thousand files instead of your 66 you might eventually get to a point where the performance deteriorates (with a few terabytes' worth of data, for example, depending on the RAM size of your nodes) - and then you might want to split the load into several commands using the usual divide-and-conquer approach. Just try it for starters ...
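If you do reach that point, here is a minimal sketch of such a split, assuming the Parquet files can be grouped by a common name prefix (the part-0*/part-1* globs are hypothetical). Each COPY is still parallelised across all nodes, and NO COMMIT plus a final explicit COMMIT keeps the whole load in one transaction:
COPY schema.table ( col1, col2, col3, col4, col5, col6 )
FROM 's3://test_path/part-0*' ON ANY NODE PARQUET
ABORT ON ERROR DIRECT NO COMMIT;
COPY schema.table ( col1, col2, col3, col4, col5, col6 )
FROM 's3://test_path/part-1*' ON ANY NODE PARQUET
ABORT ON ERROR DIRECT NO COMMIT;
-- ... one COPY per file group, then commit everything at once
COMMIT;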

Related

Spoon run slow from Postgres to Oracle

I have a Spoon ETL that reads a table from Postgres and writes into Oracle.
No transformation, no sort. SELECT col1, col2, ... col33 from table.
350,000 rows in input. The performance is 40-50 rec/sec.
If I read/write the same table from Postgres to Postgres with ALL columns (col1...col100), I get 4-5,000 rec/sec.
The same if I read/write from Oracle to Oracle: 4-5,000 rec/sec.
So, for me, it is not a network problem.
If I try with another Postgres table and only 7 columns, the performance is good.
Thanks for the help.
The same happened in my case: while loading data from Oracle and running it on my local machine (Windows), the processing rate was 40 r/s, but it was 3000 r/s for a Vertica database.
I couldn't figure out what the exact problem was, but I found a way to increase the row rate. It worked for me; you can do the same.
Right-click on the Table Input step and you will see "Change Number Of Copies to Start".
Please include the condition below in the WHERE clause; this is to avoid duplicates. When you choose "Change Number Of Copies to Start", the query is triggered N times and returns duplicates, but keeping the code below in the WHERE clause returns only distinct records:
where ora_hash(v_account_number,10)=${internal.step.copynr}
v_account_number is the primary key in my case.
The 10 is derived from the number of copies: if, for example, you have chosen 11 copies to start, then 11 - 1 = 10, so set it according to your copy count (see the example below).
Please note this works; I suggest using it on a local machine for testing purposes, but on the server you will definitely not face this issue, so comment out the line when deploying to servers.
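For example, a sketch of a hypothetical 4-copy setup (so the bucket count is 4 - 1 = 3); the table and column names are made up, while v_account_number and ${internal.step.copynr} are as above:
-- Hypothetical Table Input query for 4 copies to start: ora_hash buckets run 0..3 and
-- ${internal.step.copynr} is the zero-based copy number, so each copy reads a disjoint slice
SELECT col1, col2, col3
FROM my_table
WHERE ora_hash(v_account_number, 3) = ${internal.step.copynr}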

Avoiding Data Duplication when Loading Data from Multiple Servers

I have a dozen web servers each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded to hive using a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases the command fails and exits with a non-zero code, in which case our script waits and tries again. The problem is that in some failure cases the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
Example for such a "failure" where the data is loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data file into the HDFS directory mapped to LOCATION, then you could:
(a) just run hdfs dfs -ls on that directory from the command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...) (see the sketch below)
but in your case, you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed datafile)
...unless you explicitly add the original file name inside the log file, of course (either on a "special" header record, or at the beginning of each record - which can be done with good old sed)
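For illustration, a sketch of that EXTERNAL TABLE route; the table, columns, and LOCATION paths are hypothetical:
-- Hypothetical external table mapped onto the raw log directory in HDFS
CREATE EXTERNAL TABLE my_table_ext (ColA STRING, ColB STRING, ColC STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/my_table';
-- Register the partition the cron job just uploaded
ALTER TABLE my_table_ext ADD PARTITION (dt='2015-08-17-05') LOCATION '/data/logs/my_table/2015-08-17-05';
-- List which HDFS files back that partition
SELECT DISTINCT INPUT__FILE__NAME
FROM my_table_ext
WHERE dt = '2015-08-17-05';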
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  )
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
# Pass the command as a list of arguments so the embedded quotes survive intact
x = subprocess.check_output(["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print type(x)
print x
But you have to spend some time working with backslashes to get hive -e to work using python. It can be very difficult. It may be easier to write a file with that simple query in it first, and then use hive -f filename. Then, print the output of subprocess.check_output in order to see how the output is stored. You may need to do some regex or type conversions, but I think it should just come back as a string. Then simply use an if statement:
# the count comes back as a string, so convert it before comparing
if int(x.strip()) > 0:
    pass
else:
    subprocess.call(["hive", "-e", "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])

How can I download all observations from Hue/Hive output?

I am struggling with the following problem. My output table after executing a query in Hue/Hive has 1.2 million observations. When I try to download the results in .csv format, it is only possible to download the first 1 million observations. I know that I can execute a query to select the first 0.9 million observations and download those results, then execute a query to extract the last 0.3 million observations and download those, and finally merge them in, for example, the R statistical package. But maybe someone knows how to do it in a single pass?
You could bump the limit to more than 1 million, but beware it might slow down Hue: https://github.com/cloudera/hue/blob/master/desktop/conf.dist/hue.ini#L741
An alternative is to do a CREATE TABLE AS SELECT ... (this will scale but won't be CSV by default)
The easy solution for this would be to save the output to an HDFS directory and then download the data from there. Use a query like this to store the results:
insert overwrite directory "$path" select * from ...
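If you need the download to be comma-delimited, a sketch of the same idea (the path and table name are hypothetical, and specifying ROW FORMAT on INSERT OVERWRITE DIRECTORY needs a reasonably recent Hive version):
-- Write the result set to HDFS as comma-delimited text
INSERT OVERWRITE DIRECTORY '/user/myuser/query_output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_results;
The files can then be pulled down with hadoop fs -get or through Hue's File Browser.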

Forcing a reduce phase or a second map reduce job in hive

I am running a hive query of the following form:
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT /*+ MAPJOIN(...) */ * FROM ...
Because of the MAPJOIN, the result does not require a reduce phase. The map phase uses about 5000 mappers, and it ends up taking about 50 minutes to complete the job. It turns out that most of this time is spent copying those 5000 files to the local directory.
To try to optimize this, I replaced SELECT * ... with SELECT DISTINCT * ... (I know in advance that my results are already distinct, so this doesn't actually change my result), in order to force a second map reduce job. The first map reduce job is the same as before, with 5000 mappers and 0 reducers. The second map reduce job now has 5000 mappers and 3 reducers. With this change, there are now only 3 files to be copied, rather than 5000, and the query now only takes a total of about 20 minutes.
Since I don't actually need the DISTINCT, I'd like to know whether my query can be optimized in a less kludge-y way, without using DISTINCT?
What about wrapping your query in another SELECT, and maybe a useless WHERE clause, to make sure it kicks off a job?
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT *
FROM (
SELECT /*+ MAPJOIN(...) */ *
FROM ..
) x
WHERE 1 = 1
I'll run this when I get a chance tomorrow and delete this part of the answer if it doesn't work. If you get to it before me then great.
Another option would be to take advantage of the virtual columns for file name and line number to force distinct results. This complicates the query and introduces two meaningless columns, but has the advantage that you no longer have to know in advance that your results will be distinct. If you can't abide the useless columns, wrap it in another SELECT to remove them.
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT {{enumerate every column except the virtual columns}}
FROM (
SELECT DISTINCT /*+ MAPJOIN(...) */ *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM ..
) x
Both solutions are more kludge-y than what you came up with, but have the advantage that you are not limited to queries with distinct results.
We get another option if you aren't limited to Hive. You could get rid of the LOCAL and write the results to HDFS, which should be fast even with 5000 mappers. Then use hadoop fs -getmerge /result/dir/on/hdfs/ to pull the results into the local filesystem. This unfortunately reaches out of Hive, but maybe setting up a two step Oozie workflow is acceptable for your use case.

Load multiple data files into multiple tables using single control file using sql loader

I have a requirement to load billions of records into 5 different tables; each of these tables has its own data file. These 5 tables will be populated daily and will be truncated the next day before loading fresh data.
Que1: How do I load data into 5 different tables using 5 different data files with 1 control file?
Que2: Do I need 5 different discard, log, and bad files to keep track of these 5 different loads?
Que3: What is the better and more efficient way to load billions of records daily - using 5 different control files, 5 discard files, and 5 log files, OR will just 1 control file serve the purpose?
Que4: If one of the 5 loads fails, do I need to rerun SQL*Loader for all 5 tables again?
Note: As of now we are loading data into one table, but it is taking 5-6 hours to load, so we are looking for better performance.
I will be running sqlldr from a shell script.
There are 4 different data files containing data for 1 day, 7 days, 15 days
LOAD DATA
replace
INTO TABLE T1_1DAY_STG
FIELDS TERMINATED BY X'05'
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
MM_INTERVAL,
STRATEGY_ID ,
AGGREGATE_DATE date "YYYY-MM-DD"
)
INTO TABLE T1_7DAY_STG
FIELDS TERMINATED BY X'05'
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
MM_INTERVAL,
STRATEGY_ID ,
AGGREGATE_DATE date "YYYY-MM-DD"
)
I am planning a shell script like this:
echo "start SQL loader" >> ${LOG_FILE} 2>&1
DCTL=$( eval echo \${TX_SQLLDR_${i}_CTL_SP} )
DDATA=$( eval echo \${TX_${i}_DATA_FILE_SP} )
DLOG=$( eval echo \${TX_${i}_DATA_FILE_LOG_SP} )
DBAD=$( eval echo \${TX_${i}_DATA_FILE_BAD_SP} )
DDISCARD=$( eval echo \${TX_${i}_DATA_FILE_DISCARD_SP} )
${ORACLE_HOME}/bin/sqlldr ${ORACLE_USER}/${ORACLE_PASSWD}@${ORACLE_SID} control=${CTL_DIR}/${DCTL} data=${DATA_DIR}/${DDATA} log=${LOG_DIR}/${DLOG} bad=${LOG_DIR}/${DBAD} discard=${LOG_DIR}/${DDISCARD} errors=${ERRNUM} direct=true silent=FEEDBACK > ${TMP_LOG_FILE} 2>&1
Thanks
Sandy
I would definitely look at using external tables, as they have better support for parallel direct-path inserts. If you are loading from multiple files and there is some data element in each file that lets you determine which table a record should be loaded into, then you can use the following elements for best performance (a sketch follows the list):
NOLOGGING -- since you're reloading every day anyway
DIRECT PATH -- which you're already doing
Parallel read of data files -- http://docs.oracle.com/cd/B28359_01/server.111/b28319/et_concepts.htm#i1007483
Multitable Insert
Parallel Insert
No statistics gathering on the loaded table -- lock the table statistics without gathering them and rely on dynamic sampling.
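A rough sketch of how those pieces could fit together; the external table, directory object, file names, and the record_type routing column are all hypothetical and would need to match your actual file layout (the real delimiter is 0x05, shown here as '|' for readability):
-- Hypothetical external table over the daily data files; the files are read in parallel
CREATE TABLE t1_stg_ext (
  record_type    VARCHAR2(10),
  mm_interval    VARCHAR2(50),
  strategy_id    VARCHAR2(50),
  aggregate_date VARCHAR2(10)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY stg_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('t1_1day.dat', 't1_7day.dat')
)
PARALLEL
REJECT LIMIT UNLIMITED;
-- One direct-path, multi-table insert routes each record to its target
-- (targets assumed NOLOGGING since they are truncated and reloaded daily)
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND */ ALL
  WHEN record_type = '1DAY' THEN
    INTO T1_1DAY_STG (MM_INTERVAL, STRATEGY_ID, AGGREGATE_DATE)
    VALUES (mm_interval, strategy_id, aggregate_dt)
  WHEN record_type = '7DAY' THEN
    INTO T1_7DAY_STG (MM_INTERVAL, STRATEGY_ID, AGGREGATE_DATE)
    VALUES (mm_interval, strategy_id, aggregate_dt)
SELECT record_type,
       mm_interval,
       strategy_id,
       TO_DATE(aggregate_date, 'YYYY-MM-DD') AS aggregate_dt
FROM   t1_stg_ext;
COMMIT;
Since the whole routing is one atomic statement, a failure simply rolls back and you rerun the single insert rather than 5 separate loads.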
If you absolutely have to use SQL*Loader then consider splitting your data files into multiple smaller files and using parallel direct-path SQL*Loader sessions, but be aware that this means running multiple SQL*Loader processes.
