Hash columns while copying table - sqoop

I need to hash certain columns (like email) while copying MySQL tables to HDFS using Sqoop.
Is there a built-in option in sqoop?
If not, how can this be achieved?
EDIT-1
Currently I could think of a very crude way to achieve this: passing a SQL query (instead of table-name) like following to sqoop
SELECT
`name`,
SHA1(`email`) AS `email`,
`dob`
FROM
`my_db`.`users`
Not sure if this would work at all [will update once I've tried]
Even if it works, it (most probably) would require generating SQL-query specific to underlying DB (MySQL, PostgreSQL etc)

Is there a built-in option in sqoop?
No
If not, how can this be achieved?
Approach-1: use SQL-query as already described in question
Approach-2: another straight-forward way would be to perform a 2-step import
do a sqoop import into a Hive temp-table
create a new Hive table from this temp table and perform hashing in the process (a good approach would be Hive CTAS)

Related

After performing sqoop import from rdbms , how to check whether the data is properly imported or not in hive

Is there any tools available?
Normally I check by doing manual checks like count(*), min , max , doing select where query in both rdbms and hive table. Is there any other way?
Please use --validate in sqoop import or export to get row count between source and destination.
Update: Column Level checking.
There is no in built parameter in sqoop to achieve this.But you can do this as below:
1.Store the data imported in a temp table.
Use shell script for below:
2.Get the data from source table and compare it with temp table using shell variables.
3.If it matches,then copy the data from temp to original table

Spark(2.3) not able to identify new columns in Parquet table added via Hive Alter Table command

I have a Hive Parquet table which I am creating using Spark 2.3 API df.saveAstable. There is a separate Hive process that alters the same parquet table to add columns (based on requirements).
However, next time when I try to read the same parquet table into Spark dataframe, the new column which was added to the parquet table using Hive Alter Table command is not showing up in the df.printSchema output.
Based on initial analysis, it seems that there might be some conflict, and Spark is using its own schema instead of reading the Hive metastore.
Hence, I tried the below options :
Changing the spark setting:
spark.sql.hive.convertMetastoreParquet=false
and Refreshing the spark catalog:
spark.catalog.refreshTable("table_name")
However, the above two options are not solving the problem.
Any suggestions or alternatives would be super helpful.
This sounds like a bug described in SPARK-21841. JIRA description also contains the idea for a possible workaround:
...Interestingly enough it appears that if you create the table
differently like:
spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")
Run your alter table on mydb.t1 val t1 = spark.table("mydb.t1")
Then it works properly...
To fix this solution, you have to use the same alter command used in hive to spark-shell as well.
spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")

Sqoop query from hive table to mysql

Is there a way in sqoop to query data from hive table and write result to RDBMS table.
For example - I want to execute this query
SELECT MAX(DATE) FROM hivedbname.hivetablename
and write(insert or update) the result(in this case,the maximum date)to a table in MYSQL DB.
I know that we can use python or any other programming lang. to achieve this. But, I just want to know is this possible with sqoop.
Thanks

How to transfer data & metadata from Hive to RDBMS

There are more than 300 tables in my hive environment.
I want to export all the tables from Hive to Oracle/MySql including metadata.
My Oracle database doesn't have any tables corresponding to these Hive tables.
Sqoop import from Oracle to Hive creates tables in Hive if the table doesn't exists.But Sqoop export from Hive to Oracle doesn't create table if not exists and fails with an exception.
Is there any option in Sqoop to export metadata also? or
Is there any other Hadoop tool through which I can achieve this?
Thanks in advance
The feature you're asking for isn't in Spark. I don't know of a current hadoop tool which can do what you're asking either unfortunately. A potential workaround is using the "show create table mytable" statement in Hive. It will return the create table statements. You can parse this manually or pragmatically via awk and get the create tables in a file, then run this file against your oracle db. From there, you can use sqoop to populate the tables.
It won't be fun.
Sqoop can't copy metadata or create table in RDBMS on the basis of Hive table.
Table must be there in RDBMS to perform sqoop export.
Why is it so?
Mapping from RDBMS to Hive is easy because hive have only few datatypes(10-15). Mapping from multiple RDBMS datatypes to Hive datatype is easily achievable. But vice versa is not that easy. Typical RDBMS has 100s of datatypes (that too different in different RDBMS).
Also sqoop export is newly added feature. This feature may come in future.

Why we need to move external table to managed hive table?

I am new to Hadoop and learning Hive.
In Hadoop definative guide 3rd edition page no. 428 last paragraph
I don't understand below paragraph regarding external table in HIVE.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody explain briefly what above phrase says?
Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (like modifying some columns adding columns, making aggregation etc) and to store it in a specific way (partitions / buckets / sorted etc) so that the queries would benefit from these optimizations.
The key difference between external and managed table in Hive is that data in the external table is not managed by Hive.
When you create external table you define HDFS directory for that table and Hive is simply "looking" in it and can get data from it but Hive can't delete or change data in that folder. When you drop external table Hive only deletes metadata from its metastore and data in HDFS remains unchanged.
Managed table basically is a directory in HDFS and it's created and managed by Hive. Even more - all operations for removing/changing partitions/raw data/table in that table MUST be done by Hive otherwise metadata in Hive metastore may become incorrect (e.g. you manually delete partition from HDFS but Hive metastore contains info that partition exists).
In Hadoop definative guide I think author meant that it is a common practice to write MR-job that produces some raw data and keeps it in some folder. Than you create Hive external table which will look into that folder. And than safelly run queries without the risk to drop table etc.
In other words - you can do MR job that produces some generic data and than use Hive external table as a source of data for insert into managed tables. It helps you to avoid creating boring similar MR jobs and delegate this task to Hive queries - you create query that takes data from external table, aggregates/processes it how you want and puts the result into managed tables.
Another goal of external table is to use as a source data from remote servers, e.g. in csv format.
There is no reason to move table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in future, better use current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed
tables
DROP deletes data for managed tables while it only deletes
metadata for external ones
ACID/Transactional only works for
managed tables
Query Results Caching only works for managed
tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing in common with external/managed table type. If you want to change structure you do not necessarily need to change table managed/external type
It is also mentioned in the book.
when your table is external table.
you can use other technologies like PIG,Cascading or Mapreduce to process it .
You can also use multiple schemas for that dataset.
and You can also create data lazily if it is external table.
when you decide that dataset should be used by only Hive,make it hive managed table.

Resources