how to export data to mainframe from hadoop - hadoop

Is there a way to export data from Hadoop to a mainframe using Sqoop? I am pretty new to mainframes.
I understand that we can use Sqoop to import data from a mainframe into Hadoop. I skimmed through the Sqoop documentation, but it doesn't say anything about export.
I'd appreciate your help.

This appears to cover export: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_export_literal
While I've not used sqoop, it appears to use a JDBC connection to a mainframe database. If you have that and the mainframe data table is already created (note in the doc: "The target table must already exist in the database."), then you should be able to connect to the mainframe database as the export destination. Many mainframe data sources (e.g. Db2 z/OS) support this.
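As a rough sketch of what such an export could look like, the snippet below only assembles the `sqoop export` command line; it does not run it. The host, port, database location, table name, HDFS directory, user and password file are all made-up placeholders, and it assumes a Db2 JDBC driver is available to Sqoop.

```python
import shlex


def build_sqoop_export(jdbc_url, table, export_dir, username, password_file, num_mappers=4):
    """Return the argument list for a `sqoop export` run (nothing is executed here)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,             # JDBC URL of the target database
        "--table", table,                  # target table; must already exist
        "--export-dir", export_dir,        # HDFS directory holding the data to export
        "--username", username,
        "--password-file", password_file,  # file containing the password
        "--num-mappers", str(num_mappers), # number of parallel export tasks
    ]


# All values below are hypothetical placeholders.
cmd = build_sqoop_export(
    jdbc_url="jdbc:db2://mainframe.example.com:446/DBLOC1",
    table="ORDERS",
    export_dir="/user/etl/orders",
    username="etluser",
    password_file="/user/etl/.db2pass",
)
print(shlex.join(cmd))
```

Printing the command first makes it easy to review before handing it to a shell or a scheduler.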

Related

Does Vertica HCatalog Connector support non-standard Hive's StorageHandler?

I'm looking for a way to make HBase data available and queryable in Vertica. I have seen that Vertica has good integration with Hive's Metastore through the HCatalog Connector.
The connector can read a table definition out of Hive Metastore and use the description to read the data directly.
The question is whether the connector supports the reading of Hive external tables configured with non-standard StorageHandler, HBaseStorageHandler in particular.
I tried this a long time ago and was able to read Hive external tables using the HBaseStorageHandler (I think the name of the jar is hive-hbase-handler.jar). Please give it a try and let us know. You need to place this jar in /opt/vertica/packages/hcat/lib/.

HDFS for Teradata

As I understand it, HDFS is useful for data that is unstructured and large in volume. Is it possible to use HDFS with Teradata, given that Teradata is an RDBMS and its data is therefore structured?
Also, how does HDFS come into the picture with the database anyway? Does the file system hold the data, or how exactly does it work, in simple terms? Thanks
With Teradata DB itself - no.
However :), Teradata provides the so-called UDA (Unified Data Architecture), where Teradata, Aster DB and Hadoop (HDFS) are interconnected and can work together almost seamlessly :).
In general, if you want to work with unstructured data only, choose Aster, which is a Teradata product that can connect to HDFS directly. HDFS is used here as cheap and fast data storage.
An even more interesting solution will come with the new Aster version (6), where AFS (Aster File System) is going to be implemented. AFS is a distributed filesystem similar to HDFS. I'm looking forward to giving it a try as well ;)
To add some more details to the answer of xhudik.
To connect Teradata with Hadoop, you need a connector. One is called Teradata QueryGrid for Hadoop. It is an add-on to the Teradata DWH and connects to HCatalog, which in turn connects to HDFS.
You can also use the Teradata Connector for Hadoop, which is a Sqoop extension, so you can connect to Teradata from Hadoop.

Is there a way to use JDBC as an input resource for Hadoop's MapReduce?

I have data in a PostgreSQL DB and I'd like to read it, process it and save it to an HBase DB. Is it possible to somehow distribute the JDBC operation across map tasks?
Yes, you can do that with DBInputFormat:
DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases.
The DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop’s formalization of a data source; it can mean files formatted in a particular way, data read from a database, etc. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
LINK
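To make the "scanning entire tables" part more concrete, here is a conceptual sketch of how a table scan can be divided into chunks that separate map tasks could each read. It is not the Hadoop API: it uses Python's built-in sqlite3 as a stand-in database, and the `plan_splits` and `read_split` helpers and the `events` table are made up for illustration. My understanding is that DBInputFormat does something similar by counting the rows and giving each split its own LIMIT/OFFSET window, but treat that as an assumption.

```python
import sqlite3


def plan_splits(total_rows, num_splits):
    """Divide total_rows into contiguous (offset, length) chunks."""
    chunk = total_rows // num_splits
    splits = []
    for i in range(num_splits):
        offset = i * chunk
        # The last split absorbs any remainder rows.
        length = total_rows - offset if i == num_splits - 1 else chunk
        splits.append((offset, length))
    return splits


def read_split(conn, table, order_by, offset, length):
    """Read one chunk; a stable ORDER BY keeps the chunks from overlapping."""
    query = f"SELECT * FROM {table} ORDER BY {order_by} LIMIT ? OFFSET ?"
    return conn.execute(query, (length, offset)).fetchall()


# Stand-in database with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
splits = plan_splits(total, 3)
rows_per_split = [read_split(conn, "events", "id", off, n) for off, n in splits]

for (off, n), rows in zip(splits, rows_per_split):
    print(f"offset={off} length={n} ids={[r[0] for r in rows]}")
```

In a real job each chunk would be read by a different mapper over its own JDBC connection, so the database has to cope with that many concurrent queries.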
I think you're looking for Sqoop, which is designed to import data from SQL databases into the Hadoop stack. It pulls the data over a JDBC connection and writes it into HDFS, where it is distributed across your Hadoop DataNodes.
SQl to hadOOP = SQOOP, get it?
Sqoop can import into HBase. See this link.
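A minimal sketch of such an import from PostgreSQL into HBase follows. It only builds the `sqoop import` command line and does not run it. The connection URL, table names, column family and key column are placeholders invented for the example.

```python
import shlex


def build_sqoop_hbase_import(jdbc_url, table, hbase_table, column_family, row_key, username):
    """Return the argument list for a `sqoop import` that writes into HBase."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,                  # source table in the relational database
        "--hbase-table", hbase_table,      # destination HBase table
        "--column-family", column_family,  # HBase column family that receives the columns
        "--hbase-row-key", row_key,        # source column used as the HBase row key
        "--hbase-create-table",            # create the HBase table if it is missing
        "--split-by", row_key,             # column used to divide work among mappers
    ]


# All values below are hypothetical placeholders.
cmd = build_sqoop_hbase_import(
    jdbc_url="jdbc:postgresql://db.example.com:5432/shop",
    table="customers",
    hbase_table="customers",
    column_family="d",
    row_key="customer_id",
    username="etluser",
)
print(shlex.join(cmd))
```

A password option such as `--password-file` is left out to keep the sketch short; a real run would need one.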

SQOOP export command VS DB2 LOAD CLIENT

I have a scenario where I have to copy data from Hive to DB2. There are two ways I can implement this: one is the Sqoop export command and the other is the DB2 load client. I need to know which approach is better with respect to performance. Please give me a suggestion.
Sqoop can transfer large data files from HDFS to DB2 concurrently (using mappers). I have no experience with the DB2 load client.
It depends. If you are using DB2 LUW, Sqoop with its connector can be faster, depending on how many mappers your cluster can run. DB2 LOAD (at least in the z world) can load in parallel, so depending on how many CPs the database system has, that could be faster. So I guess it depends on your environment (the database system vs. the Hadoop cluster).

hadoop hive question

I'm trying to create tables programmatically using JDBC. However, I can't see the table I created from the Hive shell. What's worse, when I start the Hive shell from different directories, I see different contents in the database.
Is there any setting I need to configure?
Thanks in advance.
Make sure you run Hive from the same directory every time: when you launch the Hive CLI for the first time, it creates an embedded Derby metastore database in the current directory. This Derby DB contains the metadata of your Hive tables, so if you change directories, you end up with a separate, inconsistent set of metadata in each one. Also, the embedded Derby DB cannot handle multiple sessions. To allow concurrent Hive access, you need to use a real database to manage the metastore rather than the wimpy little Derby DB that comes with it. You can install MySQL for this and change the Hive JDBC connection properties to point at MySQL using its type 4 pure-Java driver.
Try emailing the Hive userlist or the IRC channel.
You probably need to set up a central Hive metastore (by default it is Derby, but it can be MySQL/Oracle/Postgres). The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
For more information, see http://wiki.apache.org/hadoop/HiveDerbyServerMode
Examine your Hadoop logs. For me this happened when my Hadoop system was not set up properly: the NameNode was not able to contact the DataNodes on other machines, etc.
Yeah, it's due to the metastore not being set up properly. Metastore stores the metadata associated with your Hive table (e.g. the table name, table location, column names, column types, bucketing/sorting information, partitioning information, SerDe information, etc.).
The default metastore is an embedded Derby database which can only be used by one client at any given time. This is obviously not good enough for most practical purposes. You, like most users, should configure your Hive installation to use a different metastore. MySQL seems to be a popular choice. I have used this link from Cloudera's website to successfully configure my MySQL metastore.
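For reference, these are the connection properties that typically go into hive-site.xml to point the metastore at MySQL. The small script below just renders them, as a sketch: the host name, database name, user and password are placeholders, and the driver class shown is the classic MySQL Connector/J name, which may differ for newer driver versions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def hive_site_xml(props):
    """Render a dict of Hive properties as a hive-site.xml document."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


# Host, database, user and password are hypothetical placeholders.
metastore_props = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://metastore.example.com:3306/hive_metastore?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hiveuser",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

xml_text = hive_site_xml(metastore_props)
parsed = ET.fromstring(xml_text)  # sanity check: the output is well-formed XML
print(xml_text)
```

The MySQL JDBC driver jar also has to be on Hive's classpath, and every Hive client needs to use the same settings so that they all see the same tables.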
