Equivalent of 'IN' or 'NOT' operator in Sqoop

I have a Sqoop job to run. The conditions include:
WHERE cond1='' AND (date = '2-12-xxxx' OR date = '3-12-xxxx' OR date = '3-12-xxxx').
Is there an IN conditional in Sqoop similar to SQL?

You can run a Sqoop import using --query and pass any query to get the data.
With --where you pass conditions like this: --where "cond1='value' AND cond2 IN (<comma separated values>)".
If you use a --where condition on a table, Sqoop applies it as select * from <table> where <condition specified in where clause> to fetch the data, so you can pass any valid SQL condition in --where.
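For example, a minimal sketch of an import with an IN list in --where (the connection details and table name are placeholders; the date values come from the question):
sqoop import \
  --connect jdbc:mysql://localhost/mydb \
  --username user --password pass \
  --table my_table \
  --where "cond1='' AND date IN ('2-12-xxxx', '3-12-xxxx')" \
  --target-dir /data/my_table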

Related

Sqoop --where condition with multiple clauses

I'm trying to get data from Oracle and import it into a Hadoop table. I'm making changes in an existing Sqoop job, and I have to use --where to filter the records. For now we have a date=somedate condition in the where clause; now I need to add another condition, like date = somedate AND status = 'Active'. I have to make this change in --where. I'm not allowed to use --query 🥺.
Can you guys help me with this?
You can try it like this:
--query "select * from table where status = 'Active' AND date=somedate AND $CONDITIONS"
Use the --where condition wrapped in double quotes, like below:
--where " date = somedate and status ='Active'"
And the good news is you can add as many conditions as needed. In fact, you can add a subquery as well; it just has to be syntactically correct for your database, as in the sketch below.
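A hedged sketch of a --where value containing a subquery (the dept column and dept_table are made up for illustration):
--where "date = somedate AND status = 'Active' AND dept IN (select dept from dept_table where region = 'US')"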
This syntax worked for me:
--query "select * from table where date=somedate AND status ='Active' AND ($CONDITIONS)"
There is no need to use --where.

Parametrizing a subquery with a JDBC PreparedStatement

I have the following query, where the first argument itself is a subquery.
The Java code is:
String query = "select * from (?) where ROWNUM < ?";
PreparedStatement statement = conn.prepareStatement(query);
statement.setString(1, "select * from foo_table");
statement.setInt(2, 3);
When I run the Java code, I get an exception. What alternatives do I have for making the first subquery, statement.setString(1, "select * from foo_table"), a parameter?
This is not possible: parameter placeholders can only represent values, not object names (like table names or column names), subselects, or other query elements.
You will need to dynamically create the query to execute using string concatenation or other string formatting/templating options.
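A minimal sketch of the string-concatenation approach (the connection URL and foo_table are placeholders from the question); because the subquery is spliced into the SQL text, it must come from trusted code, never from user input:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SubqueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own URL/credentials.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:1521/service", "user", "pass");

        // The subquery is a trusted, hard-coded string: it is spliced into
        // the SQL text because it cannot be sent as a bind parameter.
        String subquery = "select * from foo_table";
        String query = "select * from (" + subquery + ") where ROWNUM < ?";

        try (PreparedStatement statement = conn.prepareStatement(query)) {
            statement.setInt(1, 3); // the row limit can still be a bind parameter
            try (ResultSet rs = statement.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}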

How do I store the result of a query into a variable in HiveQL and then use it in another select statement?

For example, whenever I store a normal variable and use it in a select statement it works just fine.
SET a=1; SELECT CASE WHEN b > ${hiveconf:a} THEN NULL ELSE 1 END FROM my_table;
But when I try to put a query into the variable, it seems to store the query text instead of running it and storing the result, which then causes an error.
SET a=SELECT MAX(num) FROM my_other_table; SELECT CASE WHEN b > ${hiveconf:a} THEN NULL ELSE 1 END FROM my_table;
The error is: cannot recognize input near 'select' 'max' '(' in select clause.
Does anyone know a workaround for this? I am using Hive 0.13.
You can't do that with Hive alone.
If your Hive query is driven by an outer script like shell or Python, you can run the first query, capture its output, and then substitute it into the next SQL statement, as in the sketch below.
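A minimal shell sketch of that approach (the table and column names come from the question):
#!/bin/bash
# Run the first query silently (-S) and capture its scalar result.
max_num=$(hive -S -e "SELECT MAX(num) FROM my_other_table;")

# Substitute the captured value into the second query as a hiveconf variable.
# The \$ is escaped so the shell passes ${hiveconf:a} through to Hive.
hive --hiveconf a="${max_num}" -e \
  "SELECT CASE WHEN b > \${hiveconf:a} THEN NULL ELSE 1 END FROM my_table;"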
Or you can change your SQL to use a join. Your example can be changed to:
SELECT CASE WHEN b > t.a THEN NULL ELSE 1 END
FROM my_table
CROSS JOIN (SELECT MAX(num) AS a FROM my_other_table) t;

Is there a Hive equivalent of SQL “LIKE ANY ( SUBQUERY )”

Hive doesn't support multi-value LIKE queries of the kind some SQL dialects do, e.g.:
SELECT * FROM user_table WHERE first_name LIKE ANY ( 'root~%' , 'user~%' );
We can convert that into an equivalent Hive query:
SELECT * FROM user_table WHERE first_name LIKE 'root~%' OR first_name LIKE 'user~%'
Does anyone know an equivalent solution that Hive does support when a subquery is used with LIKE? Have a look at the example below:
SELECT * FROM user_table WHERE first_name LIKE ANY ( SELECT expr FROM exprTable);
As it doesn't have literal values in the expression, I can't use the same approach of generating multiple LIKE expressions separated by OR/AND operators. Initially I thought of writing a Hive UDF for it. Can you please help me support such an expression and find a Hive equivalent?
You can use Hive's RLIKE relational operator, as shown below:
SELECT * FROM user_table WHERE first_name RLIKE 'root~|user~|admin~';
Hope this helps!
This is a case involving theta joins in Hive. There is a wiki page for this and a JIRA request. Please go through the details on this page: https://cwiki.apache.org/confluence/display/Hive/Theta+Join
Your case is similar to the Side-Table Similarity case given on the page.
You need to collect the expr values and then use a regular expression to do the matching; see the sketch below. Alternatively, you can use UNION ALL with each LIKE expression in a separate SQL branch; the query can become tedious, so you can generate it programmatically.
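A minimal HiveQL sketch of the regular-expression approach (it assumes the values in expr are safe to use as regex alternatives):
-- Collapse all expr values into one alternation pattern, then match with RLIKE.
SELECT u.*
FROM user_table u
CROSS JOIN (SELECT concat_ws('|', collect_set(expr)) AS pattern
            FROM exprTable) p
WHERE u.first_name RLIKE p.pattern;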
What about this, using EXISTS:
SELECT * FROM user_table WHERE EXISTS ( SELECT * FROM exprTable WHERE first_name LIKE expr );

What are the following options in Sqoop?

Can anyone tell me what the use of --split-by and --boundary-query is in Sqoop?
sqoop import --connect jdbc:mysql://localhost/my --username user --password 1234 --query 'select * from table where id=5 AND $CONDITIONS' --split-by table.id --target-dir /dir
--split-by: It is used to specify the column of the table used to generate splits for imports. That is, it specifies which column will be used to create the splits while importing the data into your cluster. It can be used to enhance import performance by achieving greater parallelism: Sqoop creates splits based on the values in the column specified by --split-by in the import command. If it is not given, the primary key of the input table is used to create the splits.
Reason to use: sometimes the primary key doesn't have an even distribution of values between its min and max (which are used to create the splits when --split-by is not given). In such a situation you can specify some other column that has a proper distribution of data, to create splits for efficient imports.
--boundary-query: By default, Sqoop uses the query select min(<split-by column>), max(<split-by column>) from <table> to find the boundaries for creating splits. In some cases this query is not optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
Reason to use: if --split-by is not giving you optimal performance, you can use this to improve it further.
--split-by is used to distribute the values from the table across the mappers uniformly, i.e., say you have 100 unique records (primary key values) and there are 4 mappers; --split-by <primary key column> will help distribute your data set evenly among the mappers.
$CONDITIONS is used by the Sqoop process; Sqoop replaces it internally with a unique condition expression so each mapper fetches its own slice of the data set.
If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. e.g., one mapper may execute "select bla from foo WHERE (id >=0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)" and so on.
Sqoop allows you to import data in parallel, and --split-by and --boundary-query give you more control. If you're just importing a table, it'll use the PRIMARY KEY; however, if you're doing a more advanced query, you'll need to specify the column on which to do the parallel split.
i.e.,
sqoop import \
--connect 'jdbc:mysql://.../...' \
--direct \
--username uname --password pword \
--hive-import \
--hive-table query_import \
--boundary-query 'SELECT 0, MAX(id) FROM a' \
--query 'SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND $CONDITIONS' \
--num-mappers 3 \
--split-by a.id \
--target-dir /data/import \
--verbose
--boundary-query lets you specify an optimized query to get the min and max; otherwise Sqoop will attempt to run MIN(a.id), MAX(a.id) against your --query statement.
The result (if min=0 and max=30) is 3 queries that run in parallel:
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 0 AND 10;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 11 AND 20;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 21 AND 30;
Split-by:
Why is it used? -> To enhance the speed while fetching data from the RDBMS into Hadoop.
How does it work? -> By default Sqoop runs 4 mappers, so the import works in parallel. The entire data set is divided into equal partitions. Sqoop takes the primary key column for splitting the data, finds its minimum and maximum values, and then builds 4 ranges for the 4 mappers to work on.
E.g., with 1000 records in the primary key column, max value = 1000 and min value = 0, Sqoop will create 4 ranges, (0-250), (250-500), (500-750), (750-1000), and depending on the value of the column each record will be assigned to one of the 4 mappers and stored on HDFS.
So if the primary key column is not evenly distributed, with --split-by you can change the column name for even partitioning.
In short: it is used to partition the data to support parallelism and improve performance.
Also, if we specify the --query value within double quotes (" "), we need to precede $CONDITIONS with a backslash (\):
--query "select * from table where id=5 AND \$CONDITIONS"
Otherwise, use single quotes:
--query 'select * from table where id=5 AND $CONDITIONS'
