I'm importing a table from mysql to hive. The table has 2115584 rows. During the import I see
13/03/20 18:34:31 INFO mapreduce.ImportJobBase: Retrieved 2115584 records.
But when I do a count(*) on the imported table I see that it has 49262250 rows. What is going on?
Update: the import works correctly when --direct is specified.
Figured it out. From the sqoop user manual:
Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data.
I just specified --hive-drop-import-delims and it works now.
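For reference, a minimal sketch of the fixed import command (the connection string, credentials, and table name below are placeholders, not the ones from my job):

sqoop import --connect jdbc:mysql://dbhost/mydb --username myuser -P \
    --table mytable \
    --hive-import \
    --hive-drop-import-delims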
Related
I am trying to move records with null values in a particular column into one table and non-null records into another during a Sqoop import. I tried to explore on Google, but there is not much beyond the --null-string and --null-non-string params, and those just replace nulls with the defined characters ...
I can think of the following ways to handle it:
Once the data is imported into Hive, run a dedup to filter out the records, but this is something to try only in the worst case.
Handle it at the Sqoop level itself (no clue on this).
Could any expert here help me with the above?
Environment details: it's a plain Apache Hadoop cluster, Sqoop version 1.4.6.
We can try making use of the --query option along with the sqoop import command:
--query 'select * from table where column is null and $CONDITIONS'
And in a similar way for the not-null condition.
There will be two Sqoop import jobs here, as sketched below.
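A rough sketch of the two jobs, assuming placeholder values for the connection string, table, column, and target directories (dbhost, mydb, mytable, mycol are not from the question):

sqoop import --connect jdbc:mysql://dbhost/mydb --username myuser -P \
    --query 'select * from mytable where mycol is null and $CONDITIONS' \
    --target-dir /user/hadoop/mytable_null -m 1

sqoop import --connect jdbc:mysql://dbhost/mydb --username myuser -P \
    --query 'select * from mytable where mycol is not null and $CONDITIONS' \
    --target-dir /user/hadoop/mytable_not_null -m 1

Note that --query requires an explicit --target-dir, and either -m 1 or a --split-by column.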
When importing data from an RDBMS to Hadoop using Sqoop, if my source system contains junk characters, how can we replace them?
Eg: 1,punâ€,travel,
The definition of junk characters can vary based on the data being stored and how the data is used. Sqoop import allows dropping Hive delimiters (via the --hive-drop-import-delims option) or replacing Hive delimiters (via the --hive-delims-replacement option). Other forms of data processing would need to be done after the import job has landed the data on Hadoop.
Per the Sqoop documentation:
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement: Replace \n, \r, and \01 from string fields with user defined string when importing to Hive.
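For example, a minimal sketch of the replacement variant, with placeholder connection details and table name; it swaps any \n, \r, or \01 found in string fields for a single space:

sqoop import --connect jdbc:mysql://dbhost/mydb --username myuser -P \
    --table mytable \
    --hive-import \
    --hive-delims-replacement ' '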
I am trying to use Sqoop job for importing data from Oracle and one of the column in Oracle table is of data type CLOB which contains newline characters.
In this case, the --hive-drop-import-delims option is not working; the Hive table doesn't read the \n characters properly.
Please suggest how I can import CLOB data into target directory parsing all the characters properly.
Here is my scenario: I have data in the Hive warehouse and I want to export it into a table named "sample" in the "test" database in MySQL. One column is the primary key in test.sample, and the Hive data we are exporting has duplicate values under that key, so obviously the job will fail. How can I handle this kind of scenario?
Thanks in Advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing duplicates on the primary key: take a distinct on that primary key column and then export to MySQL, for example as sketched below.
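One way to do that dedup step in Hive before exporting; this is only a sketch, assuming hypothetical column names (id as the primary key, col1/col2 as the rest) and a hypothetical staging table sample_dedup. It keeps one arbitrary row per key value:

hive -e "
CREATE TABLE sample_dedup AS
SELECT id, col1, col2
FROM (
    SELECT id, col1, col2,
           row_number() OVER (PARTITION BY id ORDER BY id) AS rn
    FROM sample
) t
WHERE t.rn = 1;
"

Then point --export-dir at the warehouse directory of sample_dedup instead of the original table.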
I can import SQL tables to Hive; however, when I try to import a SQL view, I am getting errors.
Any ideas?
From Sqoop documentation:
If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
I think that's your case here. Provide an additional option, like:
--split-by tablex_id
where tablex_id is a field in your view that can be used as a primary key or index.
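For example, something along these lines, with placeholder connection details and view name (my_view):

sqoop import --connect jdbc:mysql://dbhost/mydb --username myuser -P \
    --table my_view \
    --split-by tablex_id \
    --hive-import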
There is no special command to import a view from an RDBMS.
Use the plain import command with --split-by, which Sqoop uses the same way it uses a primary key.
You can import RDBMS view data, but you have to pass -m 1 or --split-by in your command; for example:
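With a single mapper, Sqoop does not need a split column at all; again, the connection details and view name below are placeholders:

sqoop import --connect jdbc:mysql://dbhost/mydb --username myuser -P \
    --table my_view -m 1 \
    --target-dir /user/hadoop/my_view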