Consuming external web service data in Hadoop

I have the following requirement that I plan to fulfill through Hadoop frameworks.
I have 40% of the data sitting in a SQL Server database.
I have 20% of the data available through a web service.
I have the remaining 40% available through another database.
The data from the three sources needs to be joined into a fourth data set, which I then need to send to two systems: one through a web service call, the other through a direct database import.
To achieve this, I'm planning to use the Hadoop platform we already have. The database pulls and pushes can be managed through Sqoop, the transformation through SQL queries written in Hive, and all of it orchestrated by an Oozie workflow.
In the overall scheme of things, what I would like help on is:
a. Is it a good approach to invoke a web service directly from Hadoop to fetch the data? Or should I not use Hadoop at all if it involves fetching data from external web services? I don't believe so, as there are ways to make it work, but I would like your views.
b. If this approach is sound, how can I materialize it? One option is to provide an Oozie action that invokes the web service and writes the response to an HDFS location. Are there any better options?
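For reference, here is a rough sketch of the kind of main class I have in mind behind such an Oozie <java> action; the workflow would pass in the service URL and the HDFS output path, and all names here are only placeholders:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Hypothetical main class wired into an Oozie <java> action.
    // args[0]: web service URL, args[1]: HDFS output path (both supplied by the workflow).
    public class WebserviceToHdfs {
        public static void main(String[] args) throws Exception {
            URL url = new URL(args[0]);
            Path out = new Path(args[1]);

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            FileSystem fs = FileSystem.get(new Configuration());
            try (InputStream in = conn.getInputStream();
                 FSDataOutputStream os = fs.create(out, true)) {
                // Stream the HTTP response straight into the HDFS file.
                IOUtils.copyBytes(in, os, 4096, false);
            } finally {
                conn.disconnect();
            }
        }
    }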

Write a custom InputFormat and RecordReader for the web service; that way Hadoop treats it as ordinary input. The first thing you have to do is find a good way to split the web service input into smaller pieces, because MapReduce starts as many map tasks as you have input splits.
At the same time, there is already a JDBC InputFormat (DBInputFormat) you can use for your databases.
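To make that concrete, here is a minimal, hypothetical sketch of such a pair. It assumes the web service is paged and that a job property says how many pages there are; the endpoint, property names and line-per-record parsing are placeholders you would replace with your own logic:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class WebserviceInputFormat extends InputFormat<LongWritable, Text> {

        // One split per "page" of the service; the page count comes from the job configuration.
        @Override
        public List<InputSplit> getSplits(JobContext ctx) {
            int pages = ctx.getConfiguration().getInt("webservice.pages", 10);
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (int p = 0; p < pages; p++) {
                splits.add(new PageSplit(p));
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            return new PageRecordReader();
        }

        // A split is just a page number; it must be Writable so Hadoop can ship it to the task.
        public static class PageSplit extends InputSplit implements Writable {
            private int page;
            public PageSplit() {}
            public PageSplit(int page) { this.page = page; }
            public int getPage() { return page; }
            @Override public long getLength() { return 0; }
            @Override public String[] getLocations() { return new String[0]; } // no data locality
            @Override public void write(DataOutput out) throws IOException { out.writeInt(page); }
            @Override public void readFields(DataInput in) throws IOException { page = in.readInt(); }
        }

        // Fetches one page and hands its lines to the mapper as (lineNumber, line) pairs.
        public static class PageRecordReader extends RecordReader<LongWritable, Text> {
            private final List<String> lines = new ArrayList<String>();
            private int pos = -1;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
                int page = ((PageSplit) split).getPage();
                // Hypothetical paged endpoint; replace with your real service and parsing.
                String endpoint = ctx.getConfiguration().get("webservice.url") + "?page=" + page;
                try (Scanner sc = new Scanner(new URL(endpoint).openStream(), "UTF-8")) {
                    while (sc.hasNextLine()) {
                        lines.add(sc.nextLine());
                    }
                }
            }

            @Override public boolean nextKeyValue() { return ++pos < lines.size(); }
            @Override public LongWritable getCurrentKey() { return new LongWritable(pos); }
            @Override public Text getCurrentValue() { return new Text(lines.get(pos)); }
            @Override public float getProgress() { return lines.isEmpty() ? 1f : (float) pos / lines.size(); }
            @Override public void close() {}
        }
    }

Each PageSplit becomes one map task, which is exactly the splitting concern mentioned above.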

Related

How to expose Hadoop job and workflow metadata using Hive

What I would like to do is make workflow and job metadata, such as start date, end date and status, available in a Hive table so it can be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow fails at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the job browser and the Oozie dashboard. For workflows specifically, I am looking for the name, submitter, status, and start and end time. The reason I want this is that, in my opinion, the tool lacks a general overview and good search.
The idea is that once I locate this data I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS or is it scattered in local data nodes?
If it is stored in HDFS, where can I find it? If it is stored on local data nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in plain log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted in ways other than through Oozie, my approach won't be helpful.
We collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
First, think about what kind of information you need to retrieve.
If all your jobs are submitted through a bundle, then walk from the bundle to its coordinators and then to the workflows to find the info.
If you just want all the coordinator info, simply call the API with the number of coordinators to fetch and pull out the required fields.
We then loaded the fetched results into a Hive table, where one can filter for failed or timed-out coordinators and various other parameters.
You can start by looking at the example on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
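As a hedged illustration of that approach (the Oozie URL, the empty filter and the field choices below are assumptions for your cluster), the client calls look roughly like this; the tab-separated output is something you could then load into a Hive table:

    import java.util.List;

    import org.apache.oozie.client.CoordinatorJob;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    // Pull basic job metadata from the Oozie server; the URL is a placeholder for your cluster.
    public class OozieMetadataDump {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            // First 100 workflow jobs, no filter; print the fields you could later load into Hive.
            List<WorkflowJob> workflows = client.getJobsInfo("", 1, 100);
            for (WorkflowJob wf : workflows) {
                System.out.println(wf.getId() + "\t" + wf.getAppName() + "\t" + wf.getUser()
                        + "\t" + wf.getStatus() + "\t" + wf.getStartTime() + "\t" + wf.getEndTime());
            }

            // Same idea for coordinators.
            List<CoordinatorJob> coords = client.getCoordJobsInfo("", 1, 100);
            for (CoordinatorJob coord : coords) {
                System.out.println(coord.getId() + "\t" + coord.getAppName() + "\t" + coord.getStatus());
            }
        }
    }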
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie REST API or Java API. I haven't worked with the Hue interface for operating Oozie, but I would guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create a service that consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to link this information directly into Hive as a set of external tables through the JDBCStorageHandler. I'm not sure it works, but it's worth a try.
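If you do go the JDBC route, a minimal sketch could look like the following. The connection string, credentials and the WF_JOBS table/column names are assumptions about the Oozie schema; verify them against your own Oozie database first:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Read workflow metadata straight from the Oozie backing database over JDBC.
    // The URL, credentials and the WF_JOBS table/column names are assumptions; check your schema.
    public class OozieDbReader {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://oozie-db-host:3306/oozie";
            try (Connection conn = DriverManager.getConnection(url, "oozie", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT id, app_name, user_name, status, start_time, end_time FROM WF_JOBS")) {
                while (rs.next()) {
                    System.out.println(rs.getString("app_name") + "\t" + rs.getString("status")
                            + "\t" + rs.getTimestamp("start_time") + "\t" + rs.getTimestamp("end_time"));
                }
            }
        }
    }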

HDFS into Cassandra

Is it possible to migrate/replicate/copy/move files processed with Pig from a local HDFS (let's say 192.168.0.10) to a Cassandra cluster (192.168.0.20)?
What I have in mind is literally creating a Java application to parse the files and re-insert them into Cassandra.
Is there any other way of doing this?
Thanks a lot!
Writing a Java program to migrate Hadoop data into Cassandra tables is actually overkill, and it gets even worse if you have to do it periodically.
Instead, we can use a very useful feature of Hive that integrates Hive tables with external data sources: the Hive storage handler API, which works with external stores such as Cassandra, Oracle, MySQL, etc.
There is already a Hive-Cassandra storage handler implementation available that we can reuse; see the URL below.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create a Hive external table configured with storage handler properties describing the remote Cassandra host and table.
Any read or write performed on this external table is handled by Hive through MapReduce jobs that talk to Cassandra.
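As a rough sketch of what that looks like from Java over the Hive JDBC driver (the host names, table names, storage handler class and SERDE property keys below are assumptions; check the cassandra-handler project's README for the exact values):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Create a Hive external table backed by Cassandra through the storage handler,
    // then copy rows from an existing HDFS-backed Hive table into it.
    // The storage handler class and SERDE property keys are assumptions; see the project's README.
    public class HiveToCassandra {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hive-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS users_cassandra (id string, name string) "
                        + "STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' "
                        + "WITH SERDEPROPERTIES ("
                        + "  'cassandra.host' = '192.168.0.20', "
                        + "  'cassandra.ks.name' = 'my_keyspace', "
                        + "  'cassandra.cf.name' = 'users', "
                        + "  'cassandra.columns.mapping' = ':key,name')");

                // Hive turns this INSERT into a MapReduce job that writes into Cassandra.
                stmt.execute("INSERT INTO TABLE users_cassandra SELECT id, name FROM users_hdfs");
            }
        }
    }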
I think this is the ideal way to integrate Hive and Cassandra: it takes very little effort from us and is quite efficient too.
Hope this helps.
There are several ways to move the data from Hadoop to Cassandra.
Using Java HDFS API and Cassandra API (inefficient).
Using Java MapReduce program (Parallel loading).
Using Pig (Parallel loading).
Using Hive (Parallel loading).
Using Spark (Parallel loading).
Of all of these, Pig is the easiest way to load data from HDFS into Cassandra.
Pig has a store function called CassandraStorage, which allows us to load data into Cassandra in parallel.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig

Siebel applications Hadoop connectivity

I would like to understand whether Hadoop has support for Siebel applications; can anybody share experience doing that? I looked for online documentation and was not able to find any proper link explaining this, so I am posting the question here.
I have a Siebel application running on an Oracle database, and I would like to replace the database with Hadoop. Is that possible?
No is the answer.
Basically Hadoop isn't a database at all.
At its core, Hadoop is a distributed file system (HDFS): it lets you store large amounts of file data on a cluster of machines, handling data redundancy and so on.
On top of that distributed file system, it provides an API for processing the stored data using a model called MapReduce.

Hadoop with Hive

We want to develop a simple Java EE web application that does log file analysis using Hadoop. The following is the approach we plan to follow to develop the application, but we are unable to work all the way through it.
The log file would be uploaded to the Hadoop server from the client machine using SFTP/FTP.
Call a Hadoop job to fetch the log file and process it into the HDFS file system.
While processing the log file, the content will be stored into a Hive database.
Search the log content using a Hive JDBC connection from the client web application.
We browsed many samples to fulfill some of the steps, but we could not find any concrete sample application.
Please suggest whether the above approach is correct, and point us to links for a sample application developed in Java.
I would point out a few things:
a) You need to merge log files, or in some other way take care that you do not have too many of them. Consider Flume (http://flume.apache.org/), which is built to accept logs from various sources and put them into HDFS.
b) If you go with FTP, you will need some scripting to take data from the FTP server and put it into HDFS.
c) The main problem I see is running a Hive job as the result of a client's web request. A Hive query is not interactive: it will take at least dozens of seconds, and probably much more.
I would also be wary of concurrent requests: you probably cannot run more than a few in parallel.
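For point (c), a hedged sketch of what that Hive JDBC search from the web tier might look like (the host, the logs table and its columns are assumptions for illustration), which also shows why the latency concern matters:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Search log lines stored in a Hive table over JDBC (HiveServer2).
    // The host, the "logs" table and its columns are placeholders for illustration.
    public class LogSearch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String keyword = args.length > 0 ? args[0] : "ERROR";

            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hive-host:10000/default", "hive", "");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT log_time, log_level, message FROM logs WHERE message LIKE ?")) {
                ps.setString(1, "%" + keyword + "%");
                // This typically launches a MapReduce job, so expect seconds to minutes,
                // not the milliseconds a web request usually wants.
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getString(3));
                    }
                }
            }
        }
    }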
In my opinion, you could do the following:
1) Instead of accepting logs from various sources and putting them into HDFS, you can put them into one database, say SQL Server, and from there import your data into Hive (or HDFS) using Sqoop.
2) This will reduce the effort of writing various jobs to bring the data into HDFS.
3) Once the data is in Hive, you can do whatever you want with it.

Hadoop Basics: What do I do with the output?

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)
I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.
Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.
This question (Hadoop and MySQL Integration) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? The Hive ODBC driver? There must be a better way...
Enlighten me.
At foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.
I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using thrift directly is very easy and allows you to use any programming language.
If you're dealing with Hadoop directly (and can't use this), you should check out Sqoop, built by Cloudera.
Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
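As a small, hedged sketch of that batch export step driven from Java (the JDBC URL, credentials, target table and export directory are placeholders), you could simply shell out to the sqoop CLI:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Push Hive/HDFS output into a relational database with Sqoop, invoked from Java.
    // The JDBC URL, credentials, target table and export directory are placeholders.
    public class SqoopExportRunner {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "sqoop", "export",
                    "--connect", "jdbc:mysql://warehouse-host/reports",
                    "--username", "etl", "--password-file", "/user/etl/.sqoop-pass",
                    "--table", "nightly_summary",
                    "--export-dir", "/user/hive/warehouse/nightly_summary",
                    "--input-fields-terminated-by", ",");
            pb.redirectErrorStream(true);
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    System.out.println(line); // surface Sqoop's progress output
                }
            }
            System.exit(p.waitFor()); // propagate Sqoop's exit code
        }
    }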
Hope that helps.
