siebel applications hadoop connectivity - hadoop

I would like to understand does hadoop support for siebel applications , can any body share experience in doing that. I looked for online documentation and not able to find any proper link to explain this so posting question here
I have and siebel application run with Oracle database, I would like to replace with HAdoop ..is it possible ?

No is the answer.
Basically Hadoop isn't a database at all.
Hadoop is basically a distributed file system (HDFS) - it lets you store large amount of file data on a cloud of machines, handling data redundancy etc.
On top of that distributed file system it provides an API for processing all stored data using something called as Map-Reduce.

Related

How to start exploring BigData, Hadoop and its ecosystem components?

I have just started exploring BigData technology and the Hadoop framework.
But, getting confused with so many ecosystem components and framework. Could you please advise to get a structured start for learning ?
I mean which ecosystem component should one focus? Any in particular or all?
Help much appreciated!
Ranit
I wrote this answer on Quora few months back. Hope this will help:
1. Go through some introductory videos on Hadoop
Its very important to have some high level idea of hadoop before directly starting working on it. These introductory videos will help in understanding the scope of Hadoop and the use cases where it can be applied. There are a lot of resources available online for the same and going through any of the videos will be beneficial.
2. Understanding MapReduce
The second thing which helped me was to understand what Map Reduce is and how it works. It is explained very nicely in this paper: http://static.googleusercontent....
Another nice tutorial is available here : http://ksat.me/map-reduce-a-real...
For points 1 and 2, go through first four lectures for week one video lectures. The whole concept of distributed computing and map reduce is explained very nicely here. https://class.coursera.org/mmds-001/lecture
3. Getting started with Cloudera VM
Once you understand the basics of Hadoop, you can download the VM provided by cloudera and starting running some hadoop commands on it. You can download the VM from this link: http://www.cloudera.com/content/...
It would be nice to get familiar with basic Hadoop commands on the VM and understanding how it works.
4. Setting up the standalone/Pseudo distributed Hadoop
I would recommend setting up your own standalone Hadoop on your machine once you are familiar with Hadoop using the VM. The steps for installing are explained very nicely on this blog by Michael G. Noll : Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll
5. Understanding the Hadoop Ecosystem
It would be nice to get familiar with other components in the Hadoop ecosystem like Apache Pig, Hive, Hbase, Flume-NG, Hue etc. All these serve different purposes and having some information on all these will be really helpful in building any product around the hadoop ecosystem. You can install all these easily on your machine and get started with them. Cloudera VM by has most of these installed already.
6. Writing Map Reduce Jobs
Once you are done with steps 1-5, I don't think writing Map Reduce would be a challenge. It is explained thoroughly in The Definitive Guide. If MapReduce really interests you a lot, I would suggest reading this book Mining Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman : Page on Stanford
I would recommend going for Hadoop first, it's the basis for a lot of those other systems out there. Check out the main site: http://hadoop.apache.org/ and check out Cloudera, they provide a Virtual image (called CDH), that comes with everything pre-installed, so you can jump into action without having to deal with installation problems: http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html
After that, I would look into HDFS, just to understand a bit more how Hadoop stores that data, and then it would depend on what type of problems you're trying to solve, each particular system tackles a specific and (usually) different problem:
Hive / Cassandra: For database-like interaction
Pig: For data transformation.
Spark: For real time data analysis
Check out this link for more details: http://www.cloudera.com/content/cloudera/en/training/library/apache-hadoop-ecosystem.html
I hope you find that useful.
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy - From wikipedia
Hadoop is a a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
1.Hadoop Common: The common utilities that support the other Hadoop modules.
2.Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
3.Hadoop YARN: A framework for job scheduling and cluster resource management.
4.Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Before going further, Let's note that we have three different types of data.
Structured : Structured data has strong schema and schema will be checked during write & read operation. e.g. Data in RDBMS systems like Oracle, MySQL Server etc.
Unstructured: Data does not have any structure and it can be any form - Web server logs, E-Mail, Images etc.
Semi-structured: Data is not strictly structured but have some structure. e.g. XML files
Depending on type of data to be processed, we have to choose right technology.
Some more projects, which are part of Hadoop
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation
Hive Vs PIG comparison can be found at my other post in this question
HBASE won't replace Map Reduce. HBase is scalable distributed database & Map Reduce is programming model for distributed processing of data. Map Reduce may act on data in HBASE in processing.
You can use HIVE/HBASE for structured/semi-structured data and process it with Hadoop Map Reduce
You can use SQOOP to import structured data from traditional RDBMS database Oracle, SQL Server etc and process it with Hadoop Map Reduce
You can use FLUME for processing Un-structured data and process with Hadoop Map Reduce
Have a look at: Hadoop Use Cases
Hive should be used for analytical querying of data collected over a period of time. e.g Calculate trends , summarize website logs but it can't be used for real time queries.
HBase fits for real-time querying of Big Data. Facebook use it for messaging and real-time analytics.
PIG can be used to construct dataflows,run a scheduled jobs, crunch big volumes of data,aggregate/summarize it and store into relation database systems. Good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis but it can't support all un-structured data formats unlike PIG
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark is built on HDFS but bypasses MapReduce and instead uses its own data processing framework. Common uses cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
Mahout™: A Scalable machine learning and data mining library.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine
I have covered only some of key components of Hadoop ecosystem. If you like to have a look at all component of ecosystem, have a look at this ecosystem table
If above table is very difficult to digest, have a look at minified version of ecosystem at this article
But to understand all of these system, I would like you to start with Apache website first and explore other articles later.
Big data is not a technology in itself, instead it is a concept.
You can think of database, database is not a technology in itself, it is a concept. Oracle, DB2 etc are database technologies.
So coming back to big data, this concept is used to deal with huge data which is difficult to be analyzed using traditional databases or technologies. People think hadoop as synonym of bigdata but again let me tell you that Hadoop is nothing but a technology developed by Apache to implement bigdata concept.
Hadoop has its own file system called hdfs and it uses mapreduce to solve bigdata problems. Apart from Hadoop there is hive which is similar to sql but internally it uses map reduce. Hbase is similar to nosql database. Pig is scripting language which uses mapreduce internally.
There are many licensed version for big data like MapR, Hortonworks, Cloudera etc.
So start learning with Hadoop - HDFS, Mapreduce, Yarn, Hive.
Things I did to learn Hadoop.
a) Install Hadoop from scratch. I mean download CentOs, Hadoop , JAVA etc., and install them manually.
b) Understand how HDFS works.
c) Understand how MapReduce works.
d) Write word count in JAVA.
This will help you get started.

Hadoop with Hive

We want to develop one simple Java EE web application with log file analysis using Hadoop. The following are Approach following to develop the application. But we are unable to through the approach.
Log file would be uploaded into Hadoop server from client machine using sftp/ftp.
Call the Hadoop Job to fetch the log file and process the log file into HDFS file system.
While processing the log file the content will stored into HIVE database.
Search the log content by using HIVE JDBC connection from client web application
We browsed so many sample to full fill some of the steps. But we are not having any concrete sample are not available.
Please suggest the above approach is correct or not and get the links for sample application developed in Java.
I would point out a few thing:
a) You need to merge log files or in some other ways take care that you do not have too much of them. Consider Flume (http://flume.apache.org/) which is built to accept logs from various sources and put them into HDFS.
b) If you go with ftp - you will need some scripting to take data from FTP and put into HDFS.
c) Main problem I see is- to run Hive job as result of the client's web request. Hive request is not interactive - it will take at least dozens of seconds, and probably much more.
I also would be vary of concurrent requests - you proabbly can not run more then a few in parallel
According to me, you can do one thing that:
1)Instead of accepting logs from various sources and put them into HDFS, You can put into one database say SQL Server and from that you can import your data into Hive (or HDFS) using Sqoop.
2) This will reduce your effort for writing the various job to bring the data into HDFS.
3) Once the data come in Hive, you can do whatever you want.

What is Facebook's HiPal data analytics tool, and how does it work?

What are all the knowledge management features of Facebook's HiPal data analytics tool, and how does it work? Is it purely architectured for hadoop environment or can be used with other DBs?
Though this is just speculation as HiPal has not been released to the public.
HiPal is a UI for a SQL-like program called HIVE. Hive is a program that allows you to run SQL-like queries on files in a Hadoop File System. Hadoop is a distributed map/reduce architecture used for large(many terabytes) data sets.
But as it's not open source we can't get our hands on it. But this wouldn't be used for other database systems.
http://www.facebook.com/note.php?note_id=89508453919
Facebook uses Hive (http://borthakur.com/ftp/hadoopworld.pdf) to process data. Hive is a SQL-like framework interface that runs on top of Hadoop, created by the Facebook team themselves, and latter on donated to the apache community.
They say they analyse 20 PB of data with Hive/Hadoop.
Here is a quick start guide:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Hadoop Basics: What do I do with the output?

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)
I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.
Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.
This question: ( Hadoop and MySQL Integration ) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way.....
Enlighten me.
At foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.
I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using thrift directly is very easy and allows you to use any programming language.
If you're dealing with hadoop directly (and can't use this) you should check out Sqoop, built by Cloudera
Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real-time, and seems more aligned with putting data into hdfs than taking it out).
Hope that helps.

What is the best components stack for building distributed log aggregator (like Splunk)?

I'm trying to find the best components I could use to build something similar to Splunk in order to aggregate logs from a big number of servers in computing grid. Also it should be distributed because I have gigs of logs everyday and no single machine will be able to store logs.
I'm particularly interested in something that will work with Ruby and will work on Windows and latest Solaris (yeah, I got a zoo).
I see architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
Log crawler and distributed search engine are out of questions - logs will be parsed by Ruby script and ElasticSearch will be used to index log messages. Front end is also very easy to choose - Sinatra.
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work but this is something I don't want to even try).
Cassandra works great but it's just a disk space hog and it requires running autobalance everyday to spread the load between Cassandra nodes.
HDFS looked promising but FileSystem API is Java only and JRuby was a pain.
HBase looked like a best solution around but deploying it and monitoring is just a disaster - in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it also, and then start REST service and also check it.
So I'm stuck. Something tells me HDFS or HBase are the best thing to use as a log storage, but HDFS only works smoothly with Java and HBase is just a deploying/monitoring nightmare.
Can anyone share its thoughts or experience building similar systems using components I described above or with something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regards to Java and HDFS - using a tool like BeanShell, you can interact with the HDFS store via Javascript.

Resources