Accessing a remote server to get data and put it in HDFS - hadoop

Folks,
I have a client question here. I want to get tables from SQL Server (RDBMS) into my HDFS (Hadoop cluster), but the servers are in different locations.
1) What is the best way to access the server, given that the data volume is huge?
2) Connecting to one server is fine, but we have many servers around the globe and need to get data from all of them.
3) Can we connect with Sqoop remotely to get the data into HDFS?

Your question is a little bit unclear, but yes, you can use sqoop to import the data from your servers into HDFS. You need to specify the connection parameters when importing the data:
sqoop import --connect <JDBC connection string> --table <tablename> --username <username> --password <password>
If you need to do multiple imports from multiple servers, I suggest you try Oozie to automate these imports. You can find a tutorial to achieve that here.
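While an Oozie workflow is being set up, the same import can also be wrapped in a small script to cover several source servers. This is a minimal sketch, not an Oozie workflow itself; all hosts, credentials, and paths below are placeholders:
# loop over per-server connection strings and land each server's data in its own directory
for host in emea-db.example.com apac-db.example.com; do
  sqoop import \
    --connect "jdbc:sqlserver://$host:1433;databaseName=sales" \
    --username sqoop_user --password-file /user/sqoop/db.password \
    --table orders \
    --target-dir "/data/raw/$host/orders"
done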

Before writing the sqoop import, you need a user for each of the remote nodes that your local DB can identify. For example:
create user 'username'@'<ip of remote node>' IDENTIFIED BY 'password';
You must also make sure these users have the necessary grant permissions, depending on your requirements.
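For example, a read-only grant on a MySQL source could look like the following (a minimal sketch, run on the database host; the database name and node IP are placeholders):
# grant the new user read access to the tables Sqoop will import
mysql -u root -p -e "GRANT SELECT ON <database_name>.* TO 'username'@'<ip of remote node>'; FLUSH PRIVILEGES;"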
You can then frame the sqoop import; one example is below:
$SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://<ip address of remote server node>:<port_number>/<database_name> --username user --password password --table <table to import>
This question is 5 months old as of this answer, so I hope the issue has already been resolved, but this is here in case someone wants a step-by-step procedure for this requirement.
Regards,
Adil

Related

sqoop import-all-tables - with SQL Server imports system tables

I am trying to use sqoop import-all-tables to get the data from SQL Server into HDFS from a particular database.
After it imports all the expected tables from the DB successfully, it also tries to import system tables in the DB. Is there a way to force sqoop to import only non-system tables?
Thanks.
It looks like a couple of system tables are listed as user tables. Hence the issue.
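If those extra tables keep showing up, one workaround is to exclude them by name with --exclude-tables. This is a minimal sketch; the connection details, paths, and the excluded table names are placeholders, so list whatever system tables sqoop list-tables reports on your instance:
# import everything except the named system tables
sqoop import-all-tables \
  --connect "jdbc:sqlserver://dbhost;databaseName=mydb" \
  --username sqoop_user -P \
  --exclude-tables "trace_xe_action_map,trace_xe_event_map" \
  --warehouse-dir /data/mydb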

What is the actual use case of using eval in production? If we need to query the DB, we can directly access it. Why would someone go to sqoop?

I would like to know the importance of eval in sqoop. As per the command, we can query the remote database through sqoop. But I would like to know a real use case for it, especially in production, as I don't see any.
First of all, the sqoop eval tool is for evaluation purposes only.
As per sqoop documentation:
Warning
The eval tool is provided for evaluation purposes only. You can use it to verify a database connection from within Sqoop or to test simple queries. It is not supposed to be used in production workflows.
Regarding use case of eval:
You can preview the result of SQL queries on the console. This will help the developer to preview sqoop import queries.
Sqoop eval is used to verify an established database connection and to preview query results.
sqoop eval --connect "<connection_url>" --username <username> -P --query "select count(*) from <table_name>"
Don't use it in a production environment; it's not good practice.
It's just there to verify your connection.
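As a concrete example of the "preview" use case above, eval can check what an import query would return before the actual import is launched. A minimal sketch; the connection URL, user, and table are placeholders:
# preview a handful of rows before running the full sqoop import
sqoop eval \
  --connect "jdbc:mysql://dbhost:3306/sales_db" \
  --username sqoop_user -P \
  --query "SELECT * FROM orders LIMIT 10"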

Sqoop data to non-default azure storage

Is it possible to import data from SQL database using Sqoop into a different blob storage, other than the default HDInsight cluster blob storage?
Even if I set azure storage access to "Public Blob", I get an error message "Container testcontainer in account nondefaultstorage.blob.core.windows.net not found, and we can't create it using anonymous credentials."
This is the sqoop command I am running:
import
--connect jdbc:sqlserver://sqlServerName;user=sqlLogin;password=sqlPass;database=sqlDbName
--table tableName
--target-dir wasb://testcontainer@nondefaultstorage.blob.core.windows.net/data/csv
It will work with linked storage accounts or with public containers. "Public Blob" won't work because the container itself is not accessible anonymously. For more information on the 3 access types, see https://azure.microsoft.com/en-us/documentation/articles/storage-manage-access-to-resources/#restrict-access-to-containers-and-blobs
Please note that Public Container and Public Blob only grant read access to everyone; you still need a Shared Access Signature or a Shared Key when writing.
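One way to make the non-default account usable from a Sqoop job is to supply its key through the standard WASB account-key property. A minimal sketch, assuming you can run sqoop from an SSH session on the cluster; the account key is a placeholder, and adding the account to core-site.xml as a linked storage account is the more usual route:
# pass the storage account key into the job configuration, then target the wasb:// container
sqoop import \
  -D fs.azure.account.key.nondefaultstorage.blob.core.windows.net=<storage-account-key> \
  --connect "jdbc:sqlserver://sqlServerName;user=sqlLogin;password=sqlPass;database=sqlDbName" \
  --table tableName \
  --target-dir wasb://testcontainer@nondefaultstorage.blob.core.windows.net/data/csv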

Oozie cannot access metastore database in HUE

I'm on CDH4, in HUE, I have a database in Metastore Manager named db1. I can run Hive queries that create objects in db1 with no problem. I put those same queries in scripts and run them through Oozie and they fail with this message:
FAILED: SemanticException 0:0 Error creating temporary folder on: hdfs://lad1dithd1002.thehartford.com:8020/appl/hive/warehouse/db1.db. Error encountered near token 'TOK_TMP_FILE'
I created db1 in the Metastore Manager as HUE user db1 and as HUE user admin, and nothing works. The db1 user also has a db1 ID on the underlying Linux cluster, if that helps.
I have chmod'd the /appl/hive/warehouse/db1.db to read, write, execute to owner, group, other, and none of that makes a difference.
I'm almost certain it's a rights issue, but what? Oddly, I have this working under another ID where I had hacked some combination of things that seemed to have worked, but I'm not sure how. It was all in HUE, so if possible, I'd like a solution doable in HUE so I can easily hand it off to folks who prefer to work at the GUI level.
Thanks!
Did you also add hive-site.xml to your Files and Job XML fields? Hue has a great tutorial about how to run a Hive job; watch it here. Adding hive-site.xml is described around 4:20.
Exact same error on Hadoop MapR.
Root cause: the main database and the temporary (scratch) database were created by different users.
Resolution: creating both folders with the same ID might help with this.
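To check whether the two locations really were created by different users, the ownership can be inspected and, if needed, aligned. A minimal sketch using the paths and user from the question; chown on HDFS normally has to be run as the HDFS superuser, and the group depends on your setup:
hdfs dfs -ls /appl/hive/warehouse          # who owns db1.db?
hdfs dfs -ls /tmp                          # who owns the Hive scratch directories?
hdfs dfs -chown -R db1:hive /appl/hive/warehouse/db1.db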

Sqoop Import Using Greenplum

I want to import data into hive from greenplum using sqoop.
I am able to import data from the default schema of Greenplum for the user successfully.
But I am not able to fetch data from tables present in other schemas of Greenplum.
I tried various options.
Can you please help?
Thanks in advance.
Which Sqoop version do you use?
With v1.4.3 you can set the schema parameter.
With v1.4.2 you can use a free-form query (--query) with the schema.
I tried it and it works fine.
Sqoop itself doesn't have a notion of "schema". Some specialized connectors (PostgreSQL, Microsoft SQL Server) expose the ability to specify a schema, but as Sqoop doesn't have a specialized connector for Greenplum, that won't help you here.
You should be able to use a query-based import instead of a table import and specify the schema name in the query, e.g. something like:
sqoop import --query 'select * from schema.tablename where $CONDITIONS'
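A free-form query import also needs a --target-dir and either a --split-by column or -m 1, so a fuller version might look like this (a minimal sketch; the connection URL, schema, table, split column, and path are placeholders, and the query is single-quoted so the shell does not expand $CONDITIONS):
sqoop import \
  --connect "jdbc:postgresql://gphost:5432/mydb" \
  --username sqoop_user -P \
  --query 'SELECT * FROM myschema.mytable WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /data/myschema/mytable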
You can take advantage of a custom schema.
Try with:
--schema <schema_name>
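Note that --schema is a connector-specific argument, so it goes after a bare "--" separator, and it only takes effect if the connection is handled by a connector that supports it (as the previous answer points out, there is no dedicated Greenplum connector, so this may not apply). A hedged sketch; host, database, schema, and table names are placeholders:
sqoop import \
  --connect "jdbc:postgresql://gphost:5432/mydb" \
  --username sqoop_user -P \
  --table mytable \
  --target-dir /data/mytable \
  -- --schema myschema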
