PostgreSQL Sqoop import + data line break issue

PostgreSQL Sqoop import + data line break issue - hadoop

We are trying to import PostgreSQL data using apache sqoop in to Hadoop environment. On which, identified that direct(keyword: --direct) mode of SQOOP import using the PostgreSQL COPY operation to fast import the data in to HDFS. If the column is having a line breaker(\n) as a value then the QUOTE is added in the column value(example as below:1) which was considered as another record in HIVE table(LOAD DATA INPATH). Is there alternative is available to make this work?
E1: Sample data in HDFS (tried importing with: Default or --input-escaped-by '\' or --input-escaped-by '\n' doesn't help)
value1,"The some data
has line break",value3
Hive table considered it as 2 records.(provided:--hive-delims-replacement '' seems HDFS level data has \n hive detects as new record)
value1 "the same data NULL
has line break" value3 NULL
It seems apache retired this project seems it no longer support bug fixes or any release.
Any of you faced the same problem or any one could help me on this?
Note: I am able to import using non-direct and select query mode.

You could try exporting your data to a non-text format (e.g. Parquet, "-as-parquetfile" sqoop flag). That would fix the issue with new lines.

Related

Sqoop date issue when importing from oracle

I'm trying to import a huge table from oracle 10g to HDFS (GCS since i'm using sqoop with Google Cloud Dataproc) as AVRO. Everything works fine when the table doesnt have any date columns, but when it does some dates are imported very wrong.
Like: Oracle data -> 30/07/76 and HDFS data -> 14976-07-30 20:02:00.0
Like: Oracle data -> 26/03/84 and HDFS data -> 10384-03-26 20:32:34.0
I'm already mapping the date fields as String to bring them like that. I was importing using the default sqoop way that is bringing the date fields as epoch ints but the conversion was incorrect too.
Like: Oracle data -> 01/01/01 and HDFS data -> -62135769600000 when it should be 978314400000
Please, hope someone help me to fix this issue.
Thanks
Aditional information:
Sqoop command that i'm running
import -Dmapreduce.job.user.classpath.first=true -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect=$JDBC_STR --username=$USER --password=$PASS --target-dir=gs://sqoop-dev-out-files/new/$TABLE --num-mappers=10 --fields-terminated-by="\t" --lines-terminated-by="\n" --null-string='null' --null-non-string='null' --table=$SCHEMA.$TABLE --as-avrodatafile --map-column-java="DATACADASTRO=String,DATAINICIAL=String,DATAFINAL=String"
Sqoop version: 1.4.7
JDBC version: 6

I think your date in oracle is 01/01/0001, try to_char(COLUMN,'DD/MM/YYYY').
My issue is that my date is really 01/01/0001, because of user mistyping, and I can't update the column in the origin oracle database.
My issue is that converting to unix should have come -62135596800000 but instead, it comes -62135769600000(30/12/0000).
At first, I thought that was a timezone issue but it is two days difference.

sqoop import fails with numeric overflow

sqoop import job failed caused by: java.sql.SQLException: Numeric Overflow
I have to load Oracle table, it has column type NUMBER in Oracle,without scale, and it's converted to DOUBLE in hive. This is the biggest possible size for both, Oracle and Hive numeric values. The question is how to overcome this error?

OK, my first answer assumed that your Oracle data was good, and your Sqoop job needed specific configuration to cope with NUMBER values.
But now I suspect that your Oracle data contains shit, and specifically NaN values, as a result of calculation errors.
See that post for example: When/Why does Oracle adds NaN to a row in a database table
And Oracle even has distinct "Not-a-Number" categories to represent "infinity", to make things even more complicated.
But on Java side, BigDecimal does not support NaN -- from the documentation, in all conversion methods...
Throws:
NumberFormatException - if value is infinite or NaN.
Note that the JDBC driver masks that exception and displays NumericOverflow instead, to make things more complicated to debug...
So your issue looks like that one: Solr Numeric Overflow (from Oracle) -- but unfortunately SolR allows to skip errors, while Sqoop does not; so you cannot use the same trick.
In the end, you will have to "mask" these NaN values with Oracle function NaNVL, using a free-form query in Sqoop:
$ sqoop import --query 'SELECT x, y, NANVL(z, Null) AS z FROM wtf WHERE $CONDITIONS'

Edit: that answer assumed that your Oracle data was good, and your Sqoop job needed specific configuration to cope with NUMBER values. That was not the case, see alternate answer.
In theory, it can be solved.
From the Oracle documentation about "Copying Oracle tables to Hadoop" (within their Big Data appliance), section "Creating a Hive table" > "About datatype conversion"...
NUMBER
INT when the scale is 0 and the precision is less than 10
BIGINT when the scale is 0 and the precision is less than 19
DECIMAL when the scale is greater than 0 or the precision is greater than 19
So you must find out what is the actual range of values in your Oracle table, then you will be able to specify the target Hive column either a BIGINT or a DECIMAL(38,0) or a DECIMAL(22,7) or whatever.
Now, from the Sqoop documentation about "sqoop - import" > "Controlling type mapping"...
Sqoop is preconfigured to map most SQL types to appropriate Java or
Hive representatives. However the default mapping might not be
suitable for everyone and might be overridden by --map-column-java
(for changing mapping to Java) or --map-column-hive (for changing
Hive mapping).
Sqoop is expecting comma separated list of mappings (...) for
example $ sqoop import ... --map-column-java id=String,value=Integer
Caveat #1: according to SQOOP-2103, you need Sqoop V1.4.7 or above to use that option with Decimal, and you need to "URL Encode" the comma, e.g. for DECIMAL(22,7)
--map-column-hive "wtf=Decimal(22%2C7)"
Caveat #2: in your case, it is not clear whether the overflow occurs when reading the Oracle value into a Java variable, or when writing the Java variable into the HDFS file -- or even elsewhere. So maybe --map-column-hive will not be sufficient.
And again, according to that post which points to SQOOP-1493, --map-column-java does not support Java type java.math.BigDecimal until at least Sqoop V1.4.7 (and it's not even clear whether it is supported in that specific option, and whether it is expected as BigDecimal or java.math.BigDecimal)
In practice, since Sqoop 1.4.7 is not available in all distros, and since your problem is not well diagnosed, it may not be feasible.
So I would advise to just hide the issue by converting your rogue Oracle column to a String, at read time.
Cf. documentation about "sqoop - import" > "Free-form Query Imports"...
Instead of using the --table, --columns and --where arguments, you can
specify a SQL statement with the --query argument (...) Your query must include the token $CONDITIONS (...) For example:
$ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON a.id=b.id WHERE $CONDITIONS' ...
In your case, SELECT x, y, TO_CHAR(z) AS z FROM wtf plus the appropriate formatting inside TO_CHAR so that you don't lose any information due to rounding.

Table or view not found-convert hive table to spark dataframe

I am trying to do the following operation:
import hiveContext.implicits._
val productDF=hivecontext.sql("select * from productstorehtable2")
println(productDF.show())
The error I am getting is
org.apache.spark.sql.AnalysisException: Table or view not found:
productstorehtable2; line 1 pos 14
I am not sure why that is occurring.
I have used this in spark configuration
set("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
and the location when I do describe formatted productstorehtable2
hdfs://quickstart.cloudera:8020/user/hive/warehouse/productstorehtable2
I have used this code for creating the table
create external table if not exists productstorehtable2
(
device string,
date string,
word string,
count int
)
row format delimited fields terminated by ','
location 'hdfs://quickstart.cloudera:8020/user/cloudera/hadoop/hive/warehouse/VerizonProduct2';
I use sbt (with spark dependencies) to run application. My OS is CentOS and I have spark 2.0
Could someone help me out in spotting where I am going wrong?
edit:
when I perform println(hivecontext.sql("show tables")) it just outputs a blank line
Thanks

How does --direct parameter in Sqoop export work with Vertica?

I got Too many ROS containers ... error when exporting large amount of data from HDFS to Vertica. I know there is a direct option for vsql COPY which will bypass the WOS and load data into ROS containers. I also notice the --direct in Sqoop Export, see this Sqoop User Guide. I'm just wondering if these two "direct" have same function.
I have tried modify Vertica configuration parameters like MoveOutInterval, MergeOutInterval... But this didn't help much.
So does anyone know if direct mode of Sqoop export will help to solve the ROS containers issue. Thanks!

--direct is only supported by specific database connectors. Since there isn't one for Vertica, you would be using the Generic JDBC one. I really doubt using --direct does anything... but if you really want to test this you can look at the statement sent in query_requests.
select *
from query_requests
where request_type = 'LOAD'
and start_timestamp > clock_timestamp() - interval '1 hour'
That will show you all load statements within the last hour. The sqoop statements should get converted to a COPY. I would really hope anyhow! If it is a bunch of INSERT ... VALUES statements then I highly suggest NOT using it. If it is not producing a COPY then you'll need to change the query above to look for the INSERT.
select *
from query_requests
where request_type = 'QUERY'
and request ilike 'insert%'
and start_timestamp > clock_timestamp() - interval '1 hour'
Let me know what you find here. If it is doing INSERT...VALUES then I can tell you how to fix it (but it is a bit of work).

Using Sqoop incremental import as chunk-wise

Is it really possible to import chunk-wise data through sqoop incremental import?
Say I have a table with rowid 1,2,3..... N (here N is 100) and now I want to import it as chunk. Like
1st import: 1,2,3.... 20
2nd import: 21,22,23.....40
last import: 81,82,83....100
I have read about the sqoop job with incremental import and also know the --last-value parameter but do not know how to pass the chunk size. For the above example, chunk size here is 20.

I ended up by writing a script which will modify the parameter file with new where clause after each successful sqoop run. I'm running both through Oozie coordinator. I wanted to use --boundary-query but it doesn't work with chunk. That's why I had to do this work-around. Details of this work-around can be found here:
http://tmusabbir.blogspot.com/2013/05/chunk-data-import-incremental-import-in.html

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

PostgreSQL Sqoop import + data line break issue - hadoop

You could try exporting your data to a non-text format (e.g. Parquet, "-as-parquetfile" sqoop flag). That would fix the issue with new lines.

Related

Sqoop date issue when importing from oracle

sqoop import fails with numeric overflow

Table or view not found-convert hive table to spark dataframe

How does --direct parameter in Sqoop export work with Vertica?

Using Sqoop incremental import as chunk-wise

Categories

Resources