What is the significance of $conditions clause in sqoop import command?
select col1, col2 from test_table where \$CONDITIONS
Sqoop performs highly efficient data transfers by inheriting Hadoop's parallelism. To help Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the where clause of your query. Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task. While you could skip $CONDITIONS by forcing Sqoop to run only one job using the --num-mappers 1 parameter, such a limitation would have a severe performance impact.
For example, if you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. One mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on.
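As an illustration, here is a minimal free-form query import sketch; the connection string, database, and target directory are placeholders and not taken from the question:
sqoop import \
--connect jdbc:mysql://db.example.com/corp \
--query "SELECT col1, col2 FROM test_table WHERE \$CONDITIONS" \
--split-by col1 \
--num-mappers 4 \
--target-dir /user/foo/test_table
With --num-mappers 4 and --split-by col1, Sqoop replaces $CONDITIONS in each mapper's copy of the query with a different range predicate on col1, so the four mappers import disjoint slices in parallel.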
Related
I have a huge Oracle table (Transaction). The data is skewed on the column "Customer ID", so a few mappers take hours to finish the job while the other mappers finish in minutes. I couldn't see any other way to avoid the skew, as this is the only column the table can be split by. We could combine other columns such as Customer ID, Batch ID, and SEQ NUM to get a multi-column split, but I understand that Sqoop doesn't support multiple columns in split-by.
My objective is to pull the transaction data for a specific period (i.e. batch date unique for a month of data).
I tried the below options in sqoop with 10 mappers.
--split-by "my column name" //for example customer id
--where "my query condition" //for example batch date
Now I am thinking of using ROWID, which might split the rows evenly between the mappers. I thought of using a boundary query to get the MIN and MAX ROWID. Below is the Sqoop command I want to use.
sqoop import \
--table Transaction \
--split-by ROWID \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--boundary-query "SELECT MIN(ROWID) AS MIN, MAX(ROWID) AS MAX FROM Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') GROUP BY CUSTOMERID, BATCHNO,BATCHSEQNO " \
--num-mappers 10 \
--target-dir /user/trans
Need advice on whether this would be the right option, or whether there is any other way.
Also, I would like to know if we can use multiple columns in split-by by any chance.
Providing --boundary-query will only save the time spent evaluating the minimum and maximum values; the mappers will still run the same kind of range queries on the split-by column.
In your case, sqoop will generate boundary query like -
SELECT MIN(ROWID), MAX(ROWID) FROM (Select * From Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') ) t1
You can try this query and your custom boundary query on your JDBC client to check which one is faster and use that one.
Now coming to uneven mappers load.
Yes, you are right. Currently, Sqoop doesn't support multiple columns in split-by; you have to choose one column. If ROWID is evenly distributed (I am assuming it is), you should use it.
So your query looks good. Just compare the two boundary queries and use the faster one.
Edit
There is one issue: Sqoop has no proper Java type mapping for Oracle's ROWID type.
Add --map-column-java ROWID=String in your import command to map this to Java's String.
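Putting the pieces together, a sketch of the adjusted import might look like the following; as in the question, the connection options are omitted, and the boundary query is simplified to a single-row MIN/MAX query:
sqoop import \
--table Transaction \
--split-by ROWID \
--map-column-java ROWID=String \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--boundary-query "SELECT MIN(ROWID), MAX(ROWID) FROM Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--num-mappers 10 \
--target-dir /user/trans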
Do you have an index on SEQ-NUM? If so, you can use SEQ-NUM in --split-by (I am assuming SEQ-NUM is not generated randomly but is populated in an incremental fashion for each transaction). Your sqoop command may then look like this:
sqoop import \
--table Transaction \
--split-by SEQ-NUM \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--num-mappers 10 \
--target-dir /user/trans
I have external tables in Hive. When I run a select count(*) from table_name query, it returns instantaneously and gives a result that I think comes from already stored statistics. The result returned by the query is not correct. Is there a way to force a MapReduce job so that the query actually executes each time?
Note: This behavior is not followed for all external tables but some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I found a method that kicks off a MapReduce job to count the number of records in an ORC table.
ANALYZE TABLE 'table name' PARTITION('partition columns') COMPUTE STATISTICS;
--OR
ANALYZE TABLE 'table name' COMPUTE STATISTICS;
This is not a direct alternative to count(*), but it provides an up-to-date count of records in the table.
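As a usage sketch with a hypothetical partitioned ORC table named transactions (the partition column batch_dt is also hypothetical), the refreshed count can be read back from the table parameters:
ANALYZE TABLE transactions PARTITION(batch_dt='2016-03-31') COMPUTE STATISTICS;
DESCRIBE FORMATTED transactions PARTITION(batch_dt='2016-03-31');
-- numRows in the Partition Parameters section of the output now reflects the recomputed count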
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce for count(*) of an ORC file since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, you must add a non-deterministic clause e.g. "where MYKEY is null". Assuming that MYKEY is a String, otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the following: set hive.fetch.task.conversion=none in your Hive session, then trigger the select count(*) operation in the same session to force a MapReduce job.
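A minimal sketch of that session, assuming a hypothetical table named my_orc_table:
hive> SET hive.fetch.task.conversion=none;
hive> SELECT COUNT(*) FROM my_orc_table;  -- with fetch-task conversion disabled, the count runs as a job per the suggestion above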
I've got a table in Hbase let's say "tbl" and I would like to query it using
Hive. Therefore I mapped a table to hive as follows:
CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");
Queries like:
select * from tbl
select id from tbl
select id, data from tbl
are really fast.
But queries like
select id from tbl where substr(id, 0, 5) = "12345"
select id from tbl where data["777"] IS NOT NULL
are incredibly slow.
On the contrary, when running from the HBase shell:
scan 'tbl', { COLUMNS => 'data', STARTROW => '12345', ENDROW => '12346' }
or
scan 'tbl', { COLUMNS => 'data', FILTER => FilterList.new([qualifierFilter('777')]) }
it is lightning fast!
When I looked into the MapReduce job generated by Hive on the JobTracker, I discovered that "map.input.records" counts ALL the items in the HBase table, meaning the job makes a full table scan before it even starts any mappers!
Moreover, I suspect it copies all the data from the HBase table to the mappers' tmp input folder on HDFS before execution.
So, my questions are: Why does the HBase storage handler for Hive not translate Hive queries into the appropriate HBase operations? Why does it scan all the records and only then filter them using the "where" clause? How can this be improved?
Any suggestions to improve the performance of Hive queries mapped to HBase tables?
Can we create secondary index on HBase tables?
We are using HBase and Hive integration and trying to tune the performance of Hive queries.
Lots of questions! I'll try to answer them all and give you a few performance tips:
The data is not copied to HDFS, but the MapReduce jobs generated by Hive will store their intermediate data in HDFS.
Secondary indexes or alternative query paths are not supported by HBase.
Hive translates everything into MapReduce jobs, which need time to be distributed and initialized. If you have a very small number of rows, it's possible that a simple SCAN operation in the HBase shell is faster than a Hive query, but on big datasets, distributing the job among the datanodes is a must.
The Hive HBase handler doesn't do a very good job of extracting the start & stop row keys from the query; queries like substr(id, 0, 5) = "12345" won't use start & stop row keys.
Before executing your queries, run an EXPLAIN [your_query]; command and check for the filterExpr: part. If you don't find it, your query will perform a full table scan. On a side note, all expressions within the Filter Operator: will be transformed into the appropriate filters.
EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
tbl
TableScan
alias: tbl
filterExpr:
expr: ((id >= '12345') and (id < '12346'))
type: boolean
Filter Operator
....
Fortunately, there is an easy way to make sure start & stop row keys are used when you're looking for row-key prefixes: just convert substr(id, 0, 5) = "12345" into the simpler query id >= "12345" AND id < "12346". It will be detected by the handler, and the start & stop row keys will be provided to the SCAN (12345, 12346).
Now, here are a few tips in order to speed up your queries (by a lot); a combined example session is shown after the list:
Make sure you set the following properties to take advantage of batching and reduce the number of RPC calls (the value depends on the size of your columns):
SET hbase.scan.cache=10000;
SET hbase.client.scanner.cache=10000;
Make sure you set the following properties to run a distributed job on your task trackers instead of running a local job:
SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;
SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];
Reduce the number of columns in your SELECT statement to the minimum. Try not to SELECT *.
Whenever you want to use start & stop row keys to prevent full table scans, always provide key>=x and key<y expressions (don't use the BETWEEN operator)
Always EXPLAIN SELECT your queries before executing them.
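For instance, here is a sketch of a session that applies these tips to the tbl table above; jt.example.com and the zk*.example.com hosts are placeholders for your own job tracker and ZooKeeper quorum:
hive> SET hbase.scan.cache=10000;
hive> SET hbase.client.scanner.cache=10000;
hive> SET mapred.job.tracker=jt.example.com:8021;
hive> SET hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;
hive> EXPLAIN SELECT id FROM tbl WHERE id >= "12345" AND id < "12346";  -- confirm the plan contains filterExpr
hive> SELECT id FROM tbl WHERE id >= "12345" AND id < "12346";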
A hive insert statement of the following form:
insert into my_table select * from my_other_table;
is using ONE reducer - even though just prior the following had been executed:
set mapreduce.job.reduces=80;
Is there a way to force Hive to use more reducers? There is no clear reason why this particular query would use a single reducer, given there is no ORDER BY clause at the end.
BTW, the source and destination tables are both stored as Parquet.
SELECT * FROM table; in Hive does not use any reducers - it is a map-only job.
One way to force Hive to use reducers in a SELECT * is to GROUP BY all of the fields, for example:
SELECT field1, field2, field3 FROM table GROUP BY field1, field2, field3;
Though, note that this will remove duplicate records.
In the insert query you mentioned, Hive will try to write all the data into a single file because it is just a select * statement, hence 1 reducer.
But if you use bucketing, Hive will use the same number of reducers as you have buckets.
If you have 128 buckets, Hive will fire 128 reduce tasks and ultimately create 128 files, as sketched below.
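A minimal sketch of that approach, assuming hypothetical table and column names; hive.enforce.bucketing is only needed on older Hive versions, since later versions always enforce bucketing:
CREATE TABLE my_table_bucketed (field1 STRING, field2 STRING, field3 STRING)
CLUSTERED BY (field1) INTO 128 BUCKETS
STORED AS PARQUET;
SET hive.enforce.bucketing=true;  -- honor bucketing during insert on older Hive versions
INSERT INTO TABLE my_table_bucketed SELECT * FROM my_other_table;  -- runs with 128 reducers, one per bucket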
What is the importance of $CONDITIONS in every Sqoop import query?
For example:
select col1, col2 from test_table where \$CONDITIONS
What if I need to put my own where condition in the query? Will it still work?
The significance is explained in the Sqoop User Guide and in the Apache Sqoop Cookbook. To put it in a nutshell, Sqoop needs a placeholder in the query so it can populate it with generated slices and enable parallel import.
It acts as a placeholder that Sqoop replaces to make the query run in parallel. It is a must to put $CONDITIONS when you are importing query results. You can still place your own condition, as below:
Select col1, col2 from table where $CONDITIONS and "your condition"
Note: you can see directly in the console output which condition Sqoop substitutes for $CONDITIONS. For example, the first thing you will see is the condition "where 1=0", which Sqoop uses to fetch the metadata or schema from the source table.
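As an illustration, here is a hedged sketch of an import that combines your own predicate with $CONDITIONS; the connection string and target directory are placeholders:
sqoop import \
--connect jdbc:mysql://db.example.com/corp \
--query "SELECT col1, col2 FROM test_table WHERE col2 > 100 AND \$CONDITIONS" \
--split-by col1 \
--target-dir /user/foo/test_table_filtered
Each mapper receives the query with its own range predicate substituted for $CONDITIONS, and the col2 > 100 filter is applied in every slice.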