(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)
I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.
Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.
This question (Hadoop and MySQL Integration) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way...
Enlighten me.
At foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.
I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using Thrift directly is very easy and allows you to use any programming language.
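If it helps, here is a minimal sketch in Java of talking to Hive programmatically; it assumes a HiveServer2 instance reachable through the standard Hive JDBC driver (which speaks Thrift underneath), and the host, credentials, and table name are invented placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver talks to HiveServer2 over Thrift.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical host/port/database/credentials - adjust to your cluster.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server.example.com:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT dt, count(*) FROM nightly_output GROUP BY dt")) {

            // From here the rows can be pushed into a spreadsheet, another warehouse, etc.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```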
If you're dealing with Hadoop directly (and can't use this), you should check out Sqoop, built by Cloudera.
Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
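For a concrete (if simplified) picture of the batch path, the sketch below shells out to the sqoop CLI from Java to push Hive's HDFS output back into a relational warehouse. The connection string, table name, and export directory are made-up placeholders, and sqoop is assumed to be on the PATH:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SqoopExportRunner {
    public static void main(String[] args) throws Exception {
        // Table name can be passed in dynamically by a scheduler or job server.
        String table = args.length > 0 ? args[0] : "nightly_summary";

        // sqoop export: push the files Hive wrote in HDFS into a relational table.
        ProcessBuilder pb = new ProcessBuilder(
            "sqoop", "export",
            "--connect", "jdbc:mysql://warehouse.example.com/reports",   // hypothetical warehouse
            "--username", "etl", "--password-file", "/user/etl/.pw",
            "--table", table,
            "--export-dir", "/user/hive/warehouse/" + table,
            "--input-fields-terminated-by", ",");
        pb.redirectErrorStream(true);
        Process p = pb.start();

        // Stream Sqoop's console output so the calling job's log captures it.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.exit(p.waitFor());
    }
}
```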
Hope that helps.
I would like to know how Hadoop components are used in real time.
Here are my questions:
Data import/export:
I know the options available in Sqoop, but I'd like to know how Sqoop is commonly used in real-time implementations.
If I'm correct:
1.1 Sqoop commands are placed in shell scripts and called from schedulers/event triggers. Can I have a real-time code example of this, specifically passing parameters to Sqoop dynamically (such as a table name) in a shell script?
1.2 I believe an Oozie workflow could also be used. Any examples, please?
Pig
How are Pig commands commonly called in real time? Via Java programs?
Any real-time code examples would be a great help.
If I am correct, Pig is commonly used for data quality checks/cleanups on staging data before loading it into the actual HDFS path or into Hive tables,
and we would see Pig scripts called from shell scripts (in real-time projects).
Please correct me or add anything I missed.
Hive
Where will we see Hive commands in real-time scenarios?
In shell scripts, or in Java API calls for reporting?
HBase
HBase commands are commonly called as API calls in languages like Java.
Am I correct?
Sorry for so many questions. I don't see any articles/blogs on how these components are used in real-time scenarios.
Thanks in advance.
The reason you don't see articles on the use of those components in realtime scenarios is that those components are not realtime-oriented, but batch-oriented.
Sqoop: not used in realtime - it is batch-oriented.
I would use something like Flume to ingest data.
Pig, Hive: Again, not realtime-ready. Both are batch-oriented. The setup time of each query/script can be tens of seconds.
You can replace both with something like Spark Streaming (it even supports Flume).
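As a rough illustration (not a drop-in solution), a minimal Spark Streaming job in Java looks something like this; the socket source and the filter are placeholders for whatever stream and processing you actually have:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("realtime-sketch").setMaster("local[2]");
        // Process incoming data in 5-second micro-batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder source: lines of text from a socket; Flume, Kafka, etc. are also supported.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Placeholder processing: keep only error lines and print them each batch.
        JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
        errors.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```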
HBase: It is a NoSQL database on top of HDFS. Can be used for realtime. Quick on inserts. It can be used from Spark.
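To illustrate that point (and the earlier question about calling HBase from Java), a minimal write and read through the standard HBase client API could look like the sketch below; the table name, column family, and row key are invented:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum etc.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {   // hypothetical table

            // Insert one cell: row key "user123", column family "d", qualifier "last_seen".
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_seen"), Bytes.toBytes("2015-01-01"));
            table.put(put);

            // Read it back with a point Get - the kind of low-latency access HBase is good at.
            Result result = table.get(new Get(Bytes.toBytes("user123")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_seen"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```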
If you want to use those systems to support realtime apps, think of something like a Lambda architecture, which has a batch layer (using Hive, Pig, and so on) and a speed layer using streaming/realtime technologies.
Regards.
I have the following requirement that I plan to fulfill through Hadoop frameworks.
I have 40% of data sitting in a SQL Server Database
I have 20% of data available through a Web service
I have the remaining 40% available through another database.
The data from the three sources needs to be joined together to make a fourth data set, which I need to send to two systems - one through a web service call, another through direct database import.
To achieve the above, I'm planning to use the Hadoop platform that we already have. The database pulls and pushes can be managed through Sqoop. The transformation is managed through SQL queries written in Hive. All of this is orchestrated through an Oozie workflow.
In the complete gamut of things, what I would like to get help on is:
a. Is it a good approach to directly invoke a web service to fetch the data from Hadoop? Or should I not use Hadoop at all if it involves fetching data from external web services? I don't believe so, as there are ways to make it work, but I would like your views.
b. If this approach is good, how can I materialize it? One option is to provide an Oozie action that can invoke the web service and write the response to the HDFS location, something like the sketch below. Are there any other, better options?
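For illustration only, here is roughly what I have in mind for that Oozie java action; the service URL and HDFS path are invented, and error handling and retries are omitted:

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WebServiceToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml when run inside the cluster
        // (for example as an Oozie java action).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical endpoint and target path.
        URL endpoint = new URL("https://api.example.com/customers?format=csv");
        Path target = new Path("/data/staging/webservice/customers.csv");

        try (InputStream in = endpoint.openStream();
             FSDataOutputStream out = fs.create(target, true)) {
            // Copy the HTTP response straight into HDFS.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```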
Customize an InputFormat and RecordReader for the web service; that way, Hadoop just regards it as normal input. What you have to do first is find a good way to split the input from the web service into smaller pieces, because MapReduce will start as many tasks as you have input splits.
At the same time, there may already be a JDBC InputFormat for your DB.
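To make the first suggestion a bit more concrete, here is a rough, hypothetical skeleton of such an InputFormat using the new MapReduce API; the paging scheme (the ws.base.url and ws.pages properties) is invented purely for illustration and would have to match how your web service can actually be partitioned:

```java
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class WebServiceInputFormat extends InputFormat<LongWritable, Text> {

    // One split per "page" of the web service, so each map task fetches one page.
    @Override
    public List<InputSplit> getSplits(JobContext context) {
        int pages = context.getConfiguration().getInt("ws.pages", 1);   // hypothetical property
        List<InputSplit> splits = new ArrayList<>();
        for (int page = 0; page < pages; page++) {
            splits.add(new WebServiceSplit(page));
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WebServiceRecordReader();
    }

    // A split is just a page number; it must be Writable so the framework can ship it to tasks.
    public static class WebServiceSplit extends InputSplit implements Writable {
        private int page;

        public WebServiceSplit() { }                          // needed for deserialization
        public WebServiceSplit(int page) { this.page = page; }
        public int getPage() { return page; }

        @Override public long getLength() { return 0; }                     // size unknown up front
        @Override public String[] getLocations() { return new String[0]; }  // no data locality to exploit
        @Override public void write(DataOutput out) throws IOException { out.writeInt(page); }
        @Override public void readFields(DataInput in) throws IOException { page = in.readInt(); }
    }

    // Fetches one page and hands its lines to the mapper as (line number, line) records.
    public static class WebServiceRecordReader extends RecordReader<LongWritable, Text> {
        private final List<String> lines = new ArrayList<>();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();
        private int pos = -1;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            int page = ((WebServiceSplit) split).getPage();
            // Hypothetical paging URL; replace with however your service exposes partitions.
            String base = context.getConfiguration().get("ws.base.url", "https://api.example.com/records");
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new URL(base + "?page=" + page).openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }

        @Override
        public boolean nextKeyValue() {
            if (++pos >= lines.size()) {
                return false;
            }
            key.set(pos);
            value.set(lines.get(pos));
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return lines.isEmpty() ? 1f : (pos + 1) / (float) lines.size(); }
        @Override public void close() { }
    }
}
```

Each map task then receives one page of the service's output as ordinary (key, value) records, just as it would with file-based input.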
I have an application in SAS where I pull data from Oracle and produce reports in Excel using Base SAS and SAS macros. The problem is that my database is getting huge day by day, and fetching data from Oracle is taking more time; as a result, my jobs are running slowly.
So I want my application to be built on Hadoop for reporting and analysis purposes. Can someone please suggest an approach and the tools I need to use for this?
The short answer is: it depends.
For unloading data from Oracle I would recommend using Sqoop (http://sqoop.apache.org/); it is designed for this specific use case, can even do incremental loads, and can create a Hive table for the unloaded data.
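As a sketch (not tested against your environment), an incremental import from Oracle could be driven either from the command line or from Java, assuming Sqoop 1.x and its runTool entry point; the connection details, table, and check column below are invented:

```java
import org.apache.sqoop.Sqoop;

public class OracleIncrementalImport {
    public static void main(String[] args) {
        // Incremental append: only rows with ORDER_ID greater than --last-value are pulled.
        // (A first full load could instead use --hive-import --create-hive-table to land
        //  the data directly in a Hive table.)
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:oracle:thin:@//oradb.example.com:1521/ORCL",  // hypothetical database
            "--username", "etl",
            "--password-file", "/user/etl/.oracle.pw",
            "--table", "ORDERS",
            "--target-dir", "/data/staging/orders",
            "--incremental", "append",
            "--check-column", "ORDER_ID",
            "--last-value", "0"
        };
        // Sqoop.runTool is Sqoop 1.x's embedded entry point; the Oracle JDBC driver
        // must be on the classpath, just as for the command-line client.
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}
```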
When the data is unloaded, you can use Impala to build the report you need. Impala can natively work with Hive tables, so things are really simple. Of course, you would have to rewrite your SAS code into a set of SQL statements that would run on top of Impala.
Next, if you need a visualization tool to run on top of it, you can try something like Tableau or any other tool capable of using ODBC/JDBC to connect to Impala.
Finally, I think Hadoop + Sqoop + Impala would cover your needs. But I'd also recommend taking a look at MPP databases, because using SAS means you have pretty structured data, and an MPP database would be a better fit for this case.
I would like to understand whether Hadoop has support for Siebel applications; can anybody share experience in doing that? I looked for online documentation and was not able to find any proper link explaining this, so I am posting the question here.
I have a Siebel application running with an Oracle database, and I would like to replace it with Hadoop. Is that possible?
No is the answer.
Basically Hadoop isn't a database at all.
Hadoop is basically a distributed file system (HDFS) - it lets you store a large amount of file data on a cloud of machines, handling data redundancy, etc.
On top of that distributed file system, it provides an API for processing all stored data using something called MapReduce.
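To give a feel for what that API looks like, here is the classic word-count job, essentially as it appears in the Hadoop documentation; it has nothing to do with Siebel, it just shows the shape of a MapReduce program:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word in every input line stored in HDFS.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```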
We want to develop a simple Java EE web application with log file analysis using Hadoop. The following is the approach we are taking to develop the application, but we are unable to get through it.
The log file would be uploaded to the Hadoop server from the client machine using SFTP/FTP.
Call a Hadoop job to fetch the log file and process it into the HDFS file system.
While processing the log file, the content will be stored in a Hive database.
Search the log content by using a Hive JDBC connection from the client web application.
We browsed many samples to fulfill some of the steps, but we do not have any concrete sample application available.
Please suggest whether the above approach is correct or not, and provide links to a sample application developed in Java.
I would point out a few things:
a) You need to merge log files, or in some other way make sure you do not have too many of them. Consider Flume (http://flume.apache.org/), which is built to accept logs from various sources and put them into HDFS.
b) If you go with FTP, you will need some scripting to take data from FTP and put it into HDFS.
c) The main problem I see is running a Hive job as a result of the client's web request. A Hive request is not interactive - it will take at least dozens of seconds, and probably much more.
I would also be wary of concurrent requests - you probably cannot run more than a few in parallel.
In my opinion, you can do one thing:
1) Instead of accepting logs from various sources and putting them into HDFS, you can put them into one database, say SQL Server, and from there import your data into Hive (or HDFS) using Sqoop.
2) This will reduce your effort of writing the various jobs to bring the data into HDFS.
3) Once the data is in Hive, you can do whatever you want.