Is it really possible to import chunk-wise data through sqoop incremental import?
Say I have a table with rowid 1, 2, 3, ..., N (here N is 100) and I want to import it in chunks, like:
1st import: 1,2,3.... 20
2nd import: 21,22,23.....40
last import: 81,82,83....100
I have read about Sqoop jobs with incremental import and I know the --last-value parameter, but I don't know how to pass the chunk size. For the above example, the chunk size is 20.
I ended up writing a script that modifies the parameter file with a new where clause after each successful Sqoop run. I'm running both through an Oozie coordinator. I wanted to use --boundary-query, but it doesn't work with chunks, which is why I had to resort to this work-around. Details of the work-around can be found here:
http://tmusabbir.blogspot.com/2013/05/chunk-data-import-incremental-import-in.html
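In short, each run ends up being a plain Sqoop import whose where clause the wrapper script rewrites before the next run. Roughly like this (only a sketch; the connection string, table name, id column and target directory are placeholders, not the real job):
# chunk 1: rows 1-20; the script bumps both bounds by 20 after each successful run
$ sqoop import \
    --connect jdbc:mysql://dbhost/mydb \
    --username USER --password PWD \
    --table MYTABLE \
    --where "id > 0 AND id <= 20" \
    --target-dir /data/mytable/chunk_01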
What is the best way to do this so that we waste as little time as possible on both the export and the import?
Take into account that we are talking about a huge table holding more than a decade of data.
What I've been planning so far. The export part:
directory=dumps
dumpfile=foo.dmp
parallel=8
logfile=foo_exp.log
tables=FOO
query=FOO:"WHERE TSP <= sysdate"
content=DATA_ONLY
The import part:
directory=dumps
dumpfile=foo.dmp
parallel=8
logfile=foo_imp.log
remap_table=FOO:FOO_REPARTITIONED
table_exists_action=REPLACE
Both scripts are going to be run like this:
nohup expdp USER/PWD@sid parfile=export.par &
nohup impdp USER/PWD@sid parfile=import.par &
Is the parallel parameter going to work as expected? Do I need to take anything else into account?
There are some things you need to consider.
The PARALLEL parameter of Data Pump will not work unless you specify multiple dump files using the %U substitution variable. So in your case:
directory=dumps
dumpfile=foo_%U.dmp
parallel=8
logfile=foo_exp.log
tables=FOO
query=FOO:"WHERE TSP <= sysdate"
content=DATA_ONLY
From the documentation:
The value that you specify for integer should be less than, or equal to, the number of files in the dump file set (or you should specify either the %U or %L substitution variables in the dump file specifications).
Also, take into consideration the following restrictions:
This parameter is valid only in the Enterprise Edition of Oracle Database 11g or later.
To export a table or table partition in parallel (using parallel query, or PQ, worker processes), you must have the DATAPUMP_EXP_FULL_DATABASE role.
Transportable tablespace metadata cannot be exported in parallel.
Metadata cannot be exported in parallel when the NETWORK_LINK parameter is also used.
The following objects cannot be exported in parallel: TRIGGER, VIEW, OBJECT_GRANT, SEQUENCE, CONSTRAINT, REF_CONSTRAINT.
So in your case, set the parameter to a value adequate for the hardware available on your server.
Update
Sorry for taking so long to answer, but I was kind of busy. You mentioned issues during the import. Well, if the structure of the tables is not the same (for example, the partition key), that can affect the import operation. Normally in this case I would suggest speeding up the import by splitting the operation into two steps:
First Step - Import Datapump into normal table
directory=dumps
dumpfile=foo_%U.dmp
parallel=8
logfile=foo_imp.log
remap_table=FOO:TMP_FOO
table_exists_action=TRUNCATE
TRANSFORM=DISABLE_ARCHIVE_LOGGING:Y
ACCESS_METHOD=DIRECT_PATH
content=DATA_ONLY
Be sure to have the table TMP_FOO created before starting the operation. The first step imports the Data Pump file (data only) into a non-partitioned table using direct path and without logging.
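If you don't have TMP_FOO yet, one simple way to create it is as an empty copy of FOO. Just a sketch; it assumes FOO is visible to the importing schema and that NOLOGGING is acceptable on your system:
$ sqlplus USER/PWD@sid <<'EOF'
-- empty, non-partitioned staging table with the same columns as FOO
CREATE TABLE TMP_FOO NOLOGGING AS SELECT * FROM FOO WHERE 1 = 0;
EXIT;
EOF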
Second Step - Direct Path Insert from TMP_FOO into your final table
alter session enable parallel dml;
alter session force parallel query;
insert /*+ append parallel(a, 8) */ into your_partitioned_table a
select /*+ parallel(b, 8) */ * from tmp_foo b;
commit;
I think this would make the time go down.
We are trying to import PostgreSQL data into a Hadoop environment using Apache Sqoop. We found that direct mode (the --direct keyword) of the Sqoop import uses the PostgreSQL COPY operation to import the data quickly into HDFS. If a column contains a line break (\n) in its value, the value gets quoted (example 1 below), and the line break is treated as the start of another record in the Hive table (LOAD DATA INPATH). Is there an alternative available to make this work?
Example 1: sample data in HDFS (tried importing with the defaults, with --input-escaped-by '\', and with --input-escaped-by '\n'; none of them helps):
value1,"The some data
has line break",value3
The Hive table treats this as 2 records. (I provided --hive-delims-replacement '', but since the HDFS-level data still contains the \n, Hive detects a new record.)
value1 "the same data NULL
has line break" value3 NULL
It seems Apache has retired this project, so it no longer gets bug fixes or releases.
Has any of you faced the same problem, or could anyone help me with this?
Note: I am able to import using non-direct mode and the select-query mode.
You could try importing your data in a non-text format (e.g. Parquet, via the --as-parquetfile Sqoop flag). That would fix the issue with newlines.
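Something along these lines (only a sketch; the connection string, table name and target directory are placeholders, and as far as I know the PostgreSQL direct connector only produces text output, so --direct is dropped here):
# regular JDBC import, writing Parquet files instead of delimited text
$ sqoop import \
    --connect jdbc:postgresql://pghost:5432/mydb \
    --username USER --password PWD \
    --table mytable \
    --as-parquetfile \
    --target-dir /data/mytable_parquet
You can then declare the Hive table as an external Parquet table over that directory, so embedded newlines never go through a text parser.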
I'm trying to import specific rows from a full '.dmp' file using the parfile parameter.
Import command:
IMP userid=user/password@db parfile=parfile.dat
parfile.dat file:
But I'm receiving the error below when executing the IMP command:
What can be the problem?
Is it possible to use a condition using the old IMP command?
If yes, why it is not working?
Thank you for your help,
While PARFILE is a valid parameter for the original IMP utility, QUERY is not, which answers your question:
is it possible to use a condition using the old IMP command?
No, it is not.
If yes, why is it not working?
Because it is not supported.
As you're on 12c, check the Original Import documentation. Have a look at its Parameters section: you won't find QUERY in there (to see the list of all parameters, expand the tree node on the left-hand side of the screen).
So, what to do?
Use Data Pump instead, if possible; a sketch of that route follows below.
If all you have is the DMP file created with the original EXP (and you can't obtain a new, Data Pump one), import the whole table and write a query that selects data from it, using the WHERE clause you meant to use in the PARFILE.
Alternatively, delete all rows that don't satisfy that condition.
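For the Data Pump route, the filter you were after would look roughly like this (table name, directory object, file names and predicate are placeholders, not your actual parfile; it also assumes you can re-export the data with expdp, since impdp cannot read a dump created by the original EXP):
# hypothetical parfile with an import-time filter
$ cat > import_filtered.par <<'EOF'
directory=DUMP_DIR
dumpfile=full_dp.dmp
logfile=mytable_imp.log
tables=MYTABLE
query=MYTABLE:"WHERE some_column = 'SOME_VALUE'"
EOF
$ impdp user/password@db parfile=import_filtered.par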
sqoop import job failed caused by: java.sql.SQLException: Numeric Overflow
I have to load an Oracle table. It has a column of type NUMBER in Oracle, without scale, and it's converted to DOUBLE in Hive. These are the biggest possible numeric types for Oracle and Hive respectively. The question is how to overcome this error.
OK, my first answer assumed that your Oracle data was good, and your Sqoop job needed specific configuration to cope with NUMBER values.
But now I suspect that your Oracle data contains shit, and specifically NaN values, as a result of calculation errors.
See that post for example: When/Why does Oracle adds NaN to a row in a database table
And Oracle even has distinct "Not-a-Number" categories to represent "infinity", to make things even more complicated.
But on the Java side, BigDecimal does not support NaN -- from the documentation, for all conversion methods...
Throws:
NumberFormatException - if value is infinite or NaN.
Note that the JDBC driver masks that exception and displays NumericOverflow instead, to make things more complicated to debug...
So your issue looks like this one: Solr Numeric Overflow (from Oracle) -- but unfortunately SolR allows you to skip errors, while Sqoop does not, so you cannot use the same trick.
In the end, you will have to "mask" these NaN values with the Oracle function NANVL, using a free-form query in Sqoop:
$ sqoop import --query 'SELECT x, y, NANVL(z, Null) AS z FROM wtf WHERE $CONDITIONS'
Edit: this answer assumed that your Oracle data was good and that your Sqoop job needed specific configuration to cope with NUMBER values. That was not the case; see the alternate answer.
In theory, it can be solved.
From the Oracle documentation about "Copying Oracle tables to Hadoop" (within their Big Data appliance), section "Creating a Hive table" > "About datatype conversion"...
NUMBER
INT when the scale is 0 and the precision is less than 10
BIGINT when the scale is 0 and the precision is less than 19
DECIMAL when the scale is greater than 0 or the precision is greater than 19
So you must find out the actual range of values in your Oracle table, and then you will be able to specify the target Hive column as either a BIGINT, a DECIMAL(38,0), a DECIMAL(22,7), or whatever fits.
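A quick way to probe that range is sqoop eval (just a sketch; the connection details are placeholders, and z / wtf stand for your column and table as in the examples below):
# runs the statement on the database and prints the result, nothing is written to HDFS
$ sqoop eval \
    --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
    --username USER --password PWD \
    --query "SELECT MIN(z), MAX(z) FROM wtf"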
Now, from the Sqoop documentation about "sqoop - import" > "Controlling type mapping"...
Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing mapping to Java) or --map-column-hive (for changing Hive mapping).
Sqoop is expecting comma separated list of mappings (...) for example:
$ sqoop import ... --map-column-java id=String,value=Integer
Caveat #1: according to SQOOP-2103, you need Sqoop V1.4.7 or above to use that option with Decimal, and you need to "URL Encode" the comma, e.g. for DECIMAL(22,7)
--map-column-hive "wtf=Decimal(22%2C7)"
Caveat #2: in your case, it is not clear whether the overflow occurs when reading the Oracle value into a Java variable, or when writing the Java variable into the HDFS file -- or even elsewhere. So maybe --map-column-hive will not be sufficient.
And again, according to that post which points to SQOOP-1493, --map-column-java does not support Java type java.math.BigDecimal until at least Sqoop V1.4.7 (and it's not even clear whether it is supported in that specific option, and whether it is expected as BigDecimal or java.math.BigDecimal)
In practice, since Sqoop 1.4.7 is not available in all distros, and since your problem is not well diagnosed, it may not be feasible.
So I would advise to just hide the issue by converting your rogue Oracle column to a String, at read time.
Cf. documentation about "sqoop - import" > "Free-form Query Imports"...
Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument (...) Your query must include the token $CONDITIONS (...) For example:
$ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON a.id=b.id WHERE $CONDITIONS' ...
In your case, use SELECT x, y, TO_CHAR(z) AS z FROM wtf, plus the appropriate format inside TO_CHAR so that you don't lose any information to rounding.
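Putting it all together, the job could look like this (a sketch only; the connection string, target directory, format mask and the single mapper are illustrative choices, so widen the mask as needed for your data):
# $CONDITIONS is escaped because the query sits inside shell double quotes;
# -m 1 avoids needing --split-by for this simple example
$ sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
    --username USER --password PWD \
    --query "SELECT x, y, TO_CHAR(z, 'FM999999999999999999999990.9999999') AS z FROM wtf WHERE \$CONDITIONS" \
    --target-dir /data/wtf \
    -m 1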
I am trying to use a Sqoop transfer on CDH5 to import a large PostgreSQL table into HDFS. The whole table is about 15 GB.
First, I tried to import it using just the basic information, entering the schema and table name, but it didn't work: I always get "GC overhead limit exceeded". I tried changing the JVM heap size in the Cloudera Manager configuration for YARN and Sqoop to the maximum (4 GB), but it still didn't help.
Then I tried to use a SQL statement in the Sqoop transfer to move only part of the table. I added the following SQL statement in the field:
select * from mytable where id>1000000 and id<2000000 ${CONDITIONS}
(the partition column is id).
The statement failed; in fact, any statement with my own "where" condition fails with the error: "GENERIC_JDBC_CONNECTOR_0002: Unable to execute the SQL statement".
I also tried to use the boundary query. "select min(id), 1000000 from mytable" worked, but when I tried "select 1000000, 2000000 from mytable" to select data further ahead, it caused the Sqoop server to crash and go down.
Could someone help? How do I add a where condition, or how do I use the boundary query? I have searched in many places and haven't found any good documentation on how to write SQL statements with Sqoop2. Also, is it possible to use direct mode with Sqoop2?
Thanks