How does --direct parameter in Sqoop export work with Vertica? - vertica

I got a "Too many ROS containers ..." error when exporting a large amount of data from HDFS to Vertica. I know there is a DIRECT option for the vsql COPY command which bypasses the WOS and loads data straight into ROS containers. I also noticed the --direct option in Sqoop export, see this Sqoop User Guide. I'm just wondering whether these two "direct" options have the same function.
I have tried modifying Vertica configuration parameters like MoveOutInterval, MergeOutInterval... but that didn't help much.
So does anyone know whether the direct mode of Sqoop export will help solve the ROS containers issue? Thanks!
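For reference, this is roughly what the direct load I mean looks like in vsql (just a sketch; the table name, file path and delimiter are made up):
-- Load straight into ROS, bypassing the WOS (names and paths are hypothetical)
COPY my_table FROM '/data/export/part-00000' DELIMITER '|' DIRECT;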

--direct is only supported by specific database connectors. Since there isn't one for Vertica, you would be using the generic JDBC one. I really doubt --direct does anything here... but if you really want to test this, you can look at the statements sent in query_requests.
select *
from query_requests
where request_type = 'LOAD'
and start_timestamp > clock_timestamp() - interval '1 hour'
That will show you all load statements within the last hour. The Sqoop statements should get converted to a COPY; I would really hope so, anyhow! If it is a bunch of INSERT ... VALUES statements then I highly suggest NOT using it. If it is not producing a COPY, you'll need to change the query above to look for the INSERTs.
select *
from query_requests
where request_type = 'QUERY'
and request ilike 'insert%'
and start_timestamp > clock_timestamp() - interval '1 hour'
Let me know what you find here. If it is doing INSERT...VALUES then I can tell you how to fix it (but it is a bit of work).

Related

sqoop import fails with numeric overflow

sqoop import job failed caused by: java.sql.SQLException: Numeric Overflow
I have to load an Oracle table; it has a column of type NUMBER in Oracle, without scale, and it gets converted to DOUBLE in Hive. These are the largest possible numeric types in Oracle and Hive respectively. The question is how to overcome this error?
OK, my first answer assumed that your Oracle data was good, and your Sqoop job needed specific configuration to cope with NUMBER values.
But now I suspect that your Oracle data contains garbage, and specifically NaN values, as a result of calculation errors.
See this post for example: When/Why does Oracle adds NaN to a row in a database table
And Oracle even has distinct "Not-a-Number" categories to represent "infinity", to make things even more complicated.
But on the Java side, BigDecimal does not support NaN -- from the documentation, in all conversion methods...
Throws:
NumberFormatException - if value is infinite or NaN.
Note that the JDBC driver masks that exception and reports a NumericOverflow instead, making things even more complicated to debug...
So your issue looks like this one: Solr Numeric Overflow (from Oracle) -- but unfortunately Solr allows skipping errors, while Sqoop does not, so you cannot use the same trick.
In the end, you will have to "mask" these NaN values with the Oracle function NANVL, using a free-form query in Sqoop:
$ sqoop import --query 'SELECT x, y, NANVL(z, Null) AS z FROM wtf WHERE $CONDITIONS'
Edit: that answer assumed that your Oracle data was good, and your Sqoop job needed specific configuration to cope with NUMBER values. That was not the case, see alternate answer.
In theory, it can be solved.
From the Oracle documentation about "Copying Oracle tables to Hadoop" (within their Big Data appliance), section "Creating a Hive table" > "About datatype conversion"...
NUMBER
INT when the scale is 0 and the precision is less than 10
BIGINT when the scale is 0 and the precision is less than 19
DECIMAL when the scale is greater than 0 or the precision is greater than 19
So you must find out the actual range of values in your Oracle table; then you will be able to declare the target Hive column as either a BIGINT, a DECIMAL(38,0), a DECIMAL(22,7), or whatever fits.
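For instance, a target-table sketch (the wtf table and x, y, z columns echo the free-form example further down; the exact DECIMAL precision/scale must match what you actually find in Oracle):
-- Hypothetical Hive DDL; z is the former Oracle NUMBER column
CREATE TABLE wtf (
  x BIGINT,
  y STRING,
  z DECIMAL(22,7)
);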
Now, from the Sqoop documentation about "sqoop - import" > "Controlling type mapping"...
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing mapping to Java) or --map-column-hive (for changing Hive mapping).
Sqoop is expecting comma separated list of mappings (...) for example $ sqoop import ... --map-column-java id=String,value=Integer
Caveat #1: according to SQOOP-2103, you need Sqoop V1.4.7 or above to use that option with Decimal, and you need to "URL Encode" the comma, e.g. for DECIMAL(22,7)
--map-column-hive "wtf=Decimal(22%2C7)"
Caveat #2: in your case, it is not clear whether the overflow occurs when reading the Oracle value into a Java variable, or when writing the Java variable into the HDFS file -- or even elsewhere. So maybe --map-column-hive will not be sufficient.
And again, according to that post, which points to SQOOP-1493, --map-column-java does not support the Java type java.math.BigDecimal until at least Sqoop V1.4.7 (and it's not even clear whether it is supported by that specific option, or whether it should be written as BigDecimal or java.math.BigDecimal).
In practice, since Sqoop 1.4.7 is not available in all distros, and since your problem is not well diagnosed, this may not be feasible.
So I would advise just hiding the issue by converting your rogue Oracle column to a String at read time.
Cf. documentation about "sqoop - import" > "Free-form Query Imports"...
Instead of using the --table, --columns and --where arguments, you can
specify a SQL statement with the --query argument (...) Your query must include the token $CONDITIONS (...) For example:
$ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON a.id=b.id WHERE $CONDITIONS' ...
In your case, use SELECT x, y, TO_CHAR(z) AS z FROM wtf, plus the appropriate format mask inside TO_CHAR so that you don't lose any information to rounding.
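For example, a sketch of the free-form query with an explicit format mask (the mask below assumes at most 12 integer digits and 7 decimal digits; size it to your real data so nothing is rounded away):
-- Format mask is an assumption; widen it to cover the real range of z
SELECT x, y, TO_CHAR(z, 'FM999999999990.9999999') AS z
FROM wtf
WHERE $CONDITIONS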

Avoiding Data Duplication when Loading Data from Multiple Servers

I have a dozen web servers, each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded into Hive by a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases the command fails and exits with a non-zero code, in which case our script waits and tries again. The problem is that in some failure cases the data load does not actually fail, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
Example for such a "failure" where the data is loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query Hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
but in your case, you copy the data into a "managed" table, so there
is no way to retrieve the data lineage (i.e. which log file was used
to create each managed datafile)
...unless you add explicitly the original file name inside the log file, of
course (either on "special" header record, or at the beginning of each record - which can be done with good old sed)
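A sketch of option (b), assuming an external table (the name is made up) mapped to the directory that holds the raw logs:
-- Hypothetical external table over the raw log directory
SELECT DISTINCT INPUT__FILE__NAME
FROM my_logs_ext;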
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
(SELECT DISTINCT 1
FROM Target trg
WHERE trg.SrcFileName = src.INPUT__FILE__NAME
)
Note the silly DISTINCT, which is actually required to avoid exhausting the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
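A minimal sketch of the first step, assuming tab-delimited log records with three columns (the column names match the query above; the delimiter and LOCATION path are made up):
-- Hypothetical raw-log layout: three tab-separated columns per record
CREATE EXTERNAL TABLE Source (
  ColA STRING,
  ColB STRING,
  ColC STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/my_logs';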
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
# run the count query through the hive CLI and capture its stdout
x = subprocess.check_output(
    ["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print type(x)
print x
But you may have to spend some time on quoting and backslashes to get hive -e to work from Python; it can be fiddly. It may be easier to first write that simple query to a file and then use hive -f filename. Then print the output of subprocess.check_output to see how it is stored. You may need some regex or type conversion, but I think it should just come back as a string. Then simply use an if statement:
if int(x.strip()) > 0:
    pass  # data is already there, nothing to do
else:
    subprocess.call(["hive", "-e",
        "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])

sql cannot operate correctly in Redshift

I want to run sql like:
CREATE TEMPORARY TABLE tmptable AS SELECT * FROM redshift_table WHERE date > #{date};
I can run this SQL from the command line in Redshift, but when I run it from my program it doesn't work correctly. When I change CREATE TEMPORARY TABLE to CREATE TABLE it works correctly.
I am using mybatis as OR mapper and driver is:
org.postgresql.Driver
org.postgresql:postgresql:9.3-1102-jdbc41
What's wrong?
I am assuming the #{date} is an actual date in your actual query.
Having said that, there is no reason this command shouldn't work; it follows the syntax listed here:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_AS.html
Have you tried posting it on the AWS Redshift forums? They are generally quite responsive. Please update this thread too if you find something; this is quite an interesting issue, thanks!
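For what it's worth, a literal form of the statement that should be valid CTAS syntax in Redshift (the date value is made up and simply replaces the #{date} placeholder):
CREATE TEMPORARY TABLE tmptable AS
SELECT *
FROM redshift_table
WHERE date > '2015-08-01';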

sqoop2 import very large postgreSQL table failed

I am trying to use a Sqoop transfer on CDH5 to import a large PostgreSQL table into HDFS. The whole table is about 15 GB.
First, I tried to import it using just the basic information, by entering the schema and table name, but it didn't work; I always get GC overhead limit exceeded. I tried changing the JVM heap size in the Cloudera Manager configuration for YARN and Sqoop to the maximum (4 GB), but it still didn't help.
Then I tried to use a SQL statement in the Sqoop transfer to move only part of the table; I added the following SQL statement in the field:
select * from mytable where id>1000000 and id<2000000 ${CONDITIONS}
(partition column is id).
The statement failed; in fact, any statement with my own "where" condition produced the error: "GENERIC_JDBC_CONNECTOR_0002: Unable to execute the SQL statement".
I also tried the boundary query. "select min(id), 1000000 from mytable" worked, but when I tried "select 1000000, 2000000 from mytable" to select data further ahead, it caused the Sqoop server to crash and go down.
Could someone help? How do I add a where condition, or how do I use the boundary query? I have searched in many places and didn't find any good documentation on how to write SQL statements with Sqoop2. Also, is it possible to use direct mode with Sqoop2?
Thanks
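For reference, a boundary query is normally expected to return exactly two values, the lower and upper bound of the partition column; a sketch for the id column described above:
-- Returns the lower and upper split bounds for the id column
SELECT MIN(id), MAX(id) FROM mytable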

Sqoop importing zero decimals as 0E-22

When I import a table from my MSSQL database using Hadoop and Sqoop, and that table has decimal columns, any values that are zero (e.g. 0.000000000000...) are saved as "0E-22".
This is quite a pain because casting the value to a decimal in my Map or Reduce throws an exception. So I either have to export the column as a varchar or do a check before trying to cast it. Neither is ideal.
Has anyone encountered this before and got a work around?
Thanks
I would suggest trying the soon-to-be-released Sqoop 1.4.3, where we fixed SQOOP-830; that might help you as well.
It is a strange case, but there is a workaround. You can rewrite your_table, i.e.:
INSERT OVERWRITE TABLE your_table
SELECT columns_no_need_for_change,
       CASE WHEN possible_bad_column = '0E-22' THEN '0' ELSE possible_bad_column END
FROM your_table
Let us know whether you succeed or not. GL!
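To check whether the rewrite caught everything, a quick count of the remaining bad literals might help (just a sketch; the column and table names come from the example above):
SELECT COUNT(*)
FROM your_table
WHERE possible_bad_column = '0E-22';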
