Sqoop2 import of a very large PostgreSQL table failed

I am trying to use a Sqoop2 transfer on CDH5 to import a large PostgreSQL table into HDFS. The whole table is about 15 GB.
First, I tried the import using just the basic information, by entering the schema and table name. It didn't work: I always get "GC overhead limit exceeded". I tried changing the JVM heap size in the Cloudera Manager configuration for YARN and Sqoop to the maximum (4 GB), but that didn't help.
Then I tried to transfer only part of the table by using an SQL statement; I added the following SQL statement in the field:
select * from mytable where id>1000000 and id<2000000 ${CONDITIONS}
(The partition column is id.)
The statement failed; in fact, any statement with my own "where" condition produced the error: "GENERIC_JDBC_CONNECTOR_0002:Unable to execute the SQL statement".
I also tried the boundary query. I can use "select min(id), 1000000 from mytable" and it works, but when I tried "select 1000000, 2000000 from mytable" to select data further ahead, it crashed the Sqoop server and brought it down.
Could someone help? How do I add a where condition, or how do I use the boundary query? I have searched in many places but haven't found any good documentation on how to write SQL statements with Sqoop2. Also, is it possible to use direct mode with Sqoop2?
Thanks

Related

sqoop import fails with numeric overflow

The sqoop import job failed, caused by: java.sql.SQLException: Numeric Overflow
I have to load an Oracle table; it has a column of type NUMBER in Oracle (without scale), and it gets converted to DOUBLE in Hive. These are the biggest possible numeric types in both Oracle and Hive. The question is: how do I overcome this error?
OK, my first answer assumed that your Oracle data was good and that your Sqoop job needed specific configuration to cope with NUMBER values.
But now I suspect that your Oracle data contains garbage, specifically NaN values, as a result of calculation errors.
See that post for example: When/Why does Oracle adds NaN to a row in a database table
And Oracle even has distinct "Not-a-Number" categories to represent "infinity", to make things even more complicated.
But on the Java side, BigDecimal does not support NaN -- from the documentation, for all conversion methods...
Throws:
NumberFormatException - if value is infinite or NaN.
Note that the JDBC driver masks that exception and displays NumericOverflow instead, making things even more complicated to debug...
So your issue looks like this one: Solr Numeric Overflow (from Oracle) -- but unfortunately Solr allows skipping errors while Sqoop does not, so you cannot use the same trick.
In the end, you will have to "mask" these NaN values with the Oracle function NANVL, using a free-form query in Sqoop:
$ sqoop import --query 'SELECT x, y, NANVL(z, Null) AS z FROM wtf WHERE $CONDITIONS'
Edit: this answer assumed that your Oracle data was good and that your Sqoop job needed specific configuration to cope with NUMBER values. That was not the case; see the alternate answer above.
In theory, it can be solved.
From the Oracle documentation about "Copying Oracle tables to Hadoop" (within their Big Data appliance), section "Creating a Hive table" > "About datatype conversion"...
NUMBER
INT when the scale is 0 and the precision is less than 10
BIGINT when the scale is 0 and the precision is less than 19
DECIMAL when the scale is greater than 0 or the precision is greater than 19
So you must find out the actual range of values in your Oracle table; then you will be able to specify the target Hive column as either a BIGINT, a DECIMAL(38,0), a DECIMAL(22,7), or whatever fits.
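For example, to get a rough idea of the magnitude and scale actually present, you could run a quick probe query. Here is a minimal sketch using cx_Oracle; the connection string, table name wtf and column z are placeholders, not values from the original question:
import cx_Oracle

# Placeholder connection string; adjust user/password/host/service to your environment.
conn = cx_Oracle.connect("user/password@dbserver:1521/dbname")
cur = conn.cursor()

# MIN/MAX show the magnitude; the third expression is 0 when no row has a fractional part,
# which would let you target BIGINT instead of DECIMAL.
cur.execute("SELECT MIN(z), MAX(z), MAX(ABS(z - TRUNC(z))) FROM wtf")
print(cur.fetchone())

cur.close()
conn.close()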
Now, from the Sqoop documentation about "sqoop - import" > "Controlling type mapping"...
Sqoop is preconfigured to map most SQL types to appropriate Java or
Hive representatives. However the default mapping might not be
suitable for everyone and might be overridden by --map-column-java
(for changing mapping to Java) or --map-column-hive (for changing
Hive mapping).
Sqoop is expecting comma separated list of mappings (...) for
example $ sqoop import ... --map-column-java id=String,value=Integer
Caveat #1: according to SQOOP-2103, you need Sqoop V1.4.7 or above to use that option with Decimal, and you need to "URL Encode" the comma, e.g. for DECIMAL(22,7)
--map-column-hive "wtf=Decimal(22%2C7)"
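Just to illustrate the encoding (this is not Sqoop-specific code): the %2C is simply the URL-encoded comma, which you could generate like this:
from urllib.parse import quote

# URL-encode the comma inside the Hive type, as required per SQOOP-2103.
hive_type = "Decimal(" + quote("22,7", safe="") + ")"
print("--map-column-hive wtf=" + hive_type)   # prints: --map-column-hive wtf=Decimal(22%2C7)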
Caveat #2: in your case, it is not clear whether the overflow occurs when reading the Oracle value into a Java variable, or when writing the Java variable into the HDFS file -- or even elsewhere. So maybe --map-column-hive will not be sufficient.
And again, according to that post which points to SQOOP-1493, --map-column-java does not support the Java type java.math.BigDecimal until at least Sqoop V1.4.7 (and it's not even clear whether it is supported by that specific option, or whether it should be given as BigDecimal or java.math.BigDecimal).
In practice, since Sqoop 1.4.7 is not available in all distros, and since your problem is not well diagnosed, it may not be feasible.
So I would advise just hiding the issue by converting your rogue Oracle column to a String at read time.
Cf. documentation about "sqoop - import" > "Free-form Query Imports"...
Instead of using the --table, --columns and --where arguments, you can
specify a SQL statement with the --query argument (...) Your query must include the token $CONDITIONS (...) For example:
$ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON a.id=b.id WHERE $CONDITIONS' ...
In your case: SELECT x, y, TO_CHAR(z) AS z FROM wtf, plus the appropriate format mask inside TO_CHAR so that you don't lose any information to rounding.
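For instance, here is a hedged sketch of the full import driven from Python via subprocess. The connection string, credentials, split column, target directory and the 'TM9' ("text minimum") format mask are all placeholders to adapt, not values from the original question:
import subprocess

# Free-form query that converts the rogue NUMBER column to a string at read time.
# Table/column names (wtf, x, y, z) follow the answer's example.
query = "SELECT x, y, TO_CHAR(z, 'TM9') AS z FROM wtf WHERE $CONDITIONS"

subprocess.check_call([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//dbserver:1521/dbname",  # placeholder
    "--username", "user", "--password", "password",           # placeholder
    "--query", query,
    "--split-by", "x",            # required with --query when using parallel mappers
    "--target-dir", "/user/hive/wtf_import",                  # placeholder
])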

How to execute a select query on an Oracle database using PySpark?

I have written a program using PySpark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load().show()
Now I do not want to load the entire table; I want to load only selected records. Can I specify a select query as part of this command? If yes, how?
Note: I can use the dataframe and execute a select query on top of it, but I do not want to do that. Please help!
You can use a subquery in the dbtable option:
.option("dbtable", "(SELECT * FROM tableName WHERE x = 1) tmp")
Here is a similar question, but about MySQL.
In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you now do df.select("a","b","c").where("d<10"), this should in general be pushed down to Oracle. You can check it by calling df.explain(True) on the final dataframe.
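Putting both suggestions together, here is a minimal PySpark sketch (Spark 1.x style, matching the question; the JDBC URL, SCHEMA.TABLE and the column names x, a, b, c, d are placeholders):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="oracle-select-sketch")
sqlContext = SQLContext(sc)

# Load through a subquery in "dbtable"; later projections/filters should be
# pushed down to Oracle where the optimizer supports it.
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
      .option("dbtable", "(SELECT * FROM SCHEMA.TABLE WHERE x = 1) tmp")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

result = df.select("a", "b", "c").where("d < 10")
result.explain(True)   # inspect the physical plan to verify the pushdown
result.show()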

Read Oracle Cluster name from Oracle RAC using SQL query

I'd like to find out my RAC cluster name using an SQL query. I've found out that it can be retrieved using the Oracle tool cemutlo -n or just ocrdump (see http://www.br8dba.com/tag/how-to-display-oracle-cluster-name/). However, that's not possible in this case, because on the target environment I can only execute SQL queries and I don't have access to the DBMS installation directory.
I've found out (here https://community.oracle.com/thread/2510788?tstart=0) that it can be done using some unusual queries:
SELECT a.ID, a.CLUSTER_ID FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_OC('CLUS_OC_1_15',NULL,NULL,1,0,0)) a
select * from table(dbms_data_mining.get_model_details_km('CLUS_KM_1_25'))
However, they don't work in my environment, and I'm unable to create a new model.
Ideally, I'd just read this from some kind of v$/gv$ view, but I can't find it there. I guess that's because the cluster sits far below the DBMS.
Finally, I found out that there is no way to do that :(.

How does the --direct parameter in Sqoop export work with Vertica?

I got a "Too many ROS containers ..." error when exporting a large amount of data from HDFS to Vertica. I know there is a DIRECT option for vsql COPY which bypasses the WOS and loads data straight into ROS containers. I also noticed the --direct option in Sqoop export (see the Sqoop User Guide). I'm just wondering whether these two "direct" options do the same thing.
I have tried modifying Vertica configuration parameters like MoveOutInterval and MergeOutInterval, but this didn't help much.
So does anyone know if the direct mode of Sqoop export will help solve the ROS containers issue? Thanks!
--direct is only supported by specific database connectors. Since there isn't one for Vertica, you would be using the Generic JDBC connector. I really doubt --direct does anything in that case... but if you really want to test this, you can look at the statements recorded in query_requests:
select *
from query_requests
where request_type = 'LOAD'
and start_timestamp > clock_timestamp() - interval '1 hour'
That will show you all load statements from the last hour. The Sqoop statements should get converted to a COPY -- I would really hope so, anyhow! If it turns out to be a bunch of INSERT ... VALUES statements, then I highly suggest NOT using it. If it is not producing a COPY, then you'll need to change the query above to look for the INSERTs:
select *
from query_requests
where request_type = 'QUERY'
and request ilike 'insert%'
and start_timestamp > clock_timestamp() - interval '1 hour'
Let me know what you find here. If it is doing INSERT...VALUES then I can tell you how to fix it (but it is a bit of work).
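If you want to script those checks instead of running them in vsql, here is a rough sketch using the vertica_python client; the connection settings are placeholders, and the two queries are the ones shown above:
import vertica_python

# Placeholder connection settings; adjust to your cluster.
conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "password", "database": "dbname"}

checks = {
    "COPY loads": ("SELECT request FROM query_requests "
                   "WHERE request_type = 'LOAD' "
                   "AND start_timestamp > clock_timestamp() - interval '1 hour'"),
    "row-by-row INSERTs": ("SELECT request FROM query_requests "
                           "WHERE request_type = 'QUERY' AND request ILIKE 'insert%' "
                           "AND start_timestamp > clock_timestamp() - interval '1 hour'"),
}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
for label, sql in checks.items():
    cur.execute(sql)
    rows = cur.fetchall()
    # Many INSERT ... VALUES and no COPY means the export path is the slow one.
    print(label, len(rows))
conn.close()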

Avoiding Data Duplication when Loading Data from Multiple Servers

I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases the command fails and exits with a non-zero code, in which case our script waits and tries again. The problem is that in some of these failures the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
An example of such a "failure" where the data is loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter partition.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Edit: Alternatively, is there a way to query Hive for the names of the files that were loaded into it? I can use DESCRIBE to see the number of files; can I also get their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
but in your case, you copy the data into a "managed" table, so there
is no way to retrieve the data lineage (i.e. which log file was used
to create each managed datafile)
...unless you add explicitly the original file name inside the log file, of
course (either on "special" header record, or at the beginning of each record - which can be done with good old sed)
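A possible pre-processing sketch for that last idea, tagging each record with its source file before the LOAD; the ".tagged" suffix and the tab separator are arbitrary choices for this illustration:
import sys

# Prepend the source file name to every record so the lineage survives the LOAD
# into the managed table, and duplicates can be detected later.
src = sys.argv[1]                     # e.g. myfile.log
with open(src) as fin, open(src + ".tagged", "w") as fout:
    for line in fin:
        fout.write(src + "\t" + line)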
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME)
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
x = subprocess.check_output(["hive", "-e",
    "select count(*) from my_table where dt='2015-08-17-05'"])
print type(x)
print x
But you may have to spend some time working with backslashes to get hive -e to work from Python; it can be quite fiddly. It may be easier to write the simple query to a file first and then use hive -f filename. Then print the output of subprocess.check_output to see how the result is stored. You may need some regex or type conversion, but it should just come back as a string. Then simply use an if statement:
if int(x.strip()) > 0:
    pass
else:
    subprocess.check_output(["hive", "-e",
        "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])
