Pros and cons of integrating Hadoop with OBIEE - hadoop

I am looking into integrating Hadoop with OBIEE. However, I am unable to find any good article highlighting the pros and cons of doing so. If anyone has this information, kindly share the link or details.

Pros: You can get your data from Hadoop
Cons: Pointless, unless your data is in Hadoop
As a question this really doesn't make much sense. You integrate OBIEE with wherever your data is, in order to analyse it.

+1 to Robin. The point of a source-agnostic tool is to analyze data wherever it lies.
Pushing data to a new storage "just because" isn't adding value. You have to have a reason, such as performance, explicit physical modelling (multidimensional cubes, for example) or the like.

Related

Exasol vs HBase

I'm quite new to Big Data architecture, so please don't be too harsh on me.
I am trying to figure out the best alternative for building a BI architecture able to deal with huge amounts of data. As I see it, the solution has to be clustered/horizontally scalable to cope with system growth. I would like to be able to interact with the system using SQL, so HBase + Hive (or even Pig, which is not SQL but avoids writing MR jobs by hand) could be a solution. What would be the benefits and disadvantages of such an architecture compared with, for instance, Exasolution and their in-memory, MPP, columnar solution?
Are there other alternatives which might have some extra benefits? What about maintenance and configuration? Is there any Microsoft solution? (I may run into customer-specific needs regarding this.)
Sorry for posting such an open question, but I would like to see some discussion so that I can learn from you as much as possible.
Though being an EXASOL guy, I will not start to try to convince you that EXASOL is the one and only good solution out there. It heavily depends on the use case you are trying to implement, and the requirements you have to fulfill.
Hadoop is a very flexible, scalable system and used very often for storing and processing huge volumes of data.
EXASOL in contrast is a specialized RDBMS for complex analytic query processing.
I think that these two options don't really compete directly but complement each other. In many cases companies need a scalable data lake to store and preprocess their data, or to query it in rather simple ways. Once you want to enter the real-time business with complex analytics, where dozens, hundreds or even thousands of analysts are running lots of queries, then an in-memory RDBMS is a great choice.
King, the producer of Candy Crush, combines these two worlds into a powerful data management ecosystem. They store petabytes of data within Hadoop and use EXASOL on top as an in-memory layer for hundreds of terabytes of data. You can read more about that exciting use case here: http://bit.ly/1TR8APY
Another important difference between these two worlds is complexity. While EXASOL is tuning-free because it is a specialized system (similar to an appliance) for a certain use case, namely running SQL queries or R/Python/Java in-database analytics, the Hadoop stack is much more complex. You'll need a certain level of know-how to set up, maintain and tune this system. This doesn't need to be a reason for or against either of the two options. As mentioned, it heavily depends on what you want.
From a price perspective, Hadoop is free and so it should be much cheaper than an in-memory DB such as EXASOL, right? Wait a minute, it's not that easy. Again, you have to consider the whole picture: how much data you really want to store, how much of that needs to be queried for analysis, how much hardware you would need to buy, and how many people have to be hired and trained for the operation or the analytics deployed on the system.
Summary
To summarize my thoughts, the world is too complicated to directly compare these two technologies. Depending on the use case and your personal requirements, either one or the other could be the better option. And in my opinion, the trend in the market is to combine such systems into a data management ecosystem where you get the best out of the two worlds... Actually three worlds, because the world of operational data processing with NoSQL solutions should also be mentioned here.
I hope that helped a bit. If you need any further details especially about EXASOL, don't hesitate to contact me or connect with me on LinkedIn: de.linkedin.com/in/exagolo

Monolithic ETL to distributed/scalable solution and OLAP cube to Elasticsearch/Solr

I am relatively a newbie to big data processing looking for some specific guidance from the SO community.
We are currently set up with a monolithic/sequential ETL; needless to say, it is not scalable as our data grows. What are our options? (Distributing and parallelizing, sure, but I need specifics.) I have played with Hadoop and it may be appropriate to use here, but I am wondering what other options are out there. Maybe something that's easier to transition to for a database developer?
Related to the question above, we also have an OLAP cube for aggregated data. Are Elasticsearch or Solr good candidates for replacing an OLAP cube? Has anyone successfully done this? What are the gotchas?
We are currently working on the same kind of use case.
Our approach may be useful:
step 1: we Sqoop the data from the databases into HDFS
step 2: the ETL logic is written as Pig scripts
step 3: we build an index in Solr on the aggregated table data
step 4: we search Solr through a web interface.
In our use case we are developing Pig jobs that perform the transformation logic and store the results incrementally in final folders. Later, the MR indexer tool indexes that data into Solr. We are using Cloudera Search. Let me know if you have any questions.
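As a rough illustration of what steps 2 and 3 do conceptually, here is a minimal pure-Python sketch. It is not the poster's Pig script or the Cloudera indexer; the field names (`region`, `product`, `amount`) and the helper functions are hypothetical. It groups rows and sums a measure, which is roughly what a Pig GROUP followed by SUM does, and then turns each aggregated row into a flat document with a stable id that an indexer could consume.

```python
from collections import defaultdict


def aggregate_sales(rows):
    """Group rows by (region, product) and sum the amounts."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["region"], row["product"])] += row["amount"]
    return totals


def to_index_docs(totals):
    """Turn each aggregated row into a flat document with a stable id."""
    return [
        {
            "id": f"{region}|{product}",  # hypothetical id scheme
            "region": region,
            "product": product,
            "total": total,
        }
        for (region, product), total in sorted(totals.items())
    ]


# Hypothetical input rows, standing in for data landed in HDFS.
rows = [
    {"region": "EU", "product": "widget", "amount": 10.0},
    {"region": "EU", "product": "widget", "amount": 5.0},
    {"region": "US", "product": "gadget", "amount": 7.0},
]

docs = to_index_docs(aggregate_sales(rows))
for doc in docs:
    print(doc)
```

Deriving the id from the grouping key is a design choice that suits incremental runs: if the index treats `id` as its unique key, re-indexing an updated aggregate replaces the old document instead of adding a duplicate.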

Neo4j and Cluster Analysis

I'm developing a web application that will heavily depend on its ability to suggest items based on users with similar preferences. A friend of mine told me that what I'm looking for, mathematically, is some Cluster Analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other graph DB) was the kind of DB I should look at for this task (the preferences one).
I started studying both of these tools, and I'm having some doubts.
For Cluster Analysis purposes it looks to me like a standard SQL DB would still be the perfect choice, while Neo4j would be better suited to a Neural Network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
This depends on your data. Neo4j is capable of providing even complex recommendations in real time for one particular node. Let's say you want to recommend a product to a user: this can be handled within a graph DB in real time,
whereas a clustering system is the best way to compute recommendations for all users at once (and then maybe save them somewhere so you don't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you a single recommendation in an acceptable time.
Clustering needs more time for initialization (not seconds, but most likely minutes or hours) and is better suited to calculating recommendations for the whole dataset. In fact, looking strictly at the time for one calculation for a specific user, clustering can be faster than Neo4j, but the big restriction is the up-front initialization, which makes it a poor fit for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to compute recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction, and you always need to provide the newest recommendation, then use Neo4j.
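To make the "one recommendation for one node" idea concrete, here is a minimal sketch in plain Python rather than Neo4j or Cypher. The `likes` data and the `recommend` function are hypothetical. It walks from one user to the users who share liked items and scores their other items by the size of the overlap, which is the kind of local traversal a graph database can answer without any up-front model building.

```python
from collections import Counter

# Hypothetical user -> liked items edges, standing in for a graph.
likes = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "carol": {"b", "e"},
    "dave": {"f"},
}


def recommend(user, likes, top_n=3):
    """Suggest items liked by users who share likes with `user`.

    Each candidate item is scored by the summed overlap between `user`
    and the neighbours who like that item.
    """
    own = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        overlap = len(own & items)
        if overlap == 0:
            continue  # no shared preference, so not a neighbour
        for item in items - own:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("alice", likes))
```

Only the neighbourhood of the requested user is touched, so a new like is reflected in the very next call. A clustering approach would instead precompute groups over all users and need a rebuild to pick up changes.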
I am currently working on various topics related to recommendation and clustering with neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most links to a set of categories.
Let me introduce Reco4J (http://www.reco4j.org). It is an open source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro

MapReduce project with data mining

I am planning to do a MapReduce project involving Hadoop libraries and to test it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns and possibly graph algorithms, Hive and Pig Latin. I would really appreciate it if someone could give me some ideas. I have a few of my own in mind.
In the end I have to work on some large data set, extract information and derive some conclusions. I have used Weka for data mining before (using trees).
But I am not sure if that is the only thing I can work with right now (using Weka). Are there any other ways I can work on a large data set and derive conclusions from it?
Also, how can I involve graphs in this?
Basically I want to do a research project, but I am not sure what exactly I should be working on and what it should look like. Any thoughts? Suggested links or ideas? Knowledge sharing?
I suggest you check out Apache Mahout. It is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you a SQL-like language to query big data; essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
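To picture what "translating a query into MapReduce" means, here is a small conceptual sketch in plain Python. It is not Hive's actual execution plan, and the function names and the `country` field are hypothetical. It shows how a query shaped like `SELECT country, COUNT(*) FROM visits GROUP BY country` breaks down into a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (group key, 1) for every input row."""
    for row in rows:
        yield row["country"], 1


def shuffle(pairs):
    """Collect all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; summing the ones implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input table.
visits = [
    {"country": "DE", "page": "/home"},
    {"country": "US", "page": "/home"},
    {"country": "DE", "page": "/about"},
]

counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```

On a real cluster the map and reduce functions run in parallel on different machines and the shuffle moves data across the network; the structure of the computation is the same.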
Another suggestion is to consider writing your data processing algorithm in R, a statistical software package (similar to MATLAB). Instead of the standard R environment, I would recommend Revolution R, which is an environment for developing in R but with much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, which is another Map/Reduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
The book offers ideas for mining social-network graphs, and works with graphs in a couple of other ways too.
Hope it helps!

Hadoop Hypercube

Hey,
I am starting a Hadoop-based hypercube with a flexible number of dimensions.
Does anybody know of any existing approaches for this?
I just found PigOLAPSketch, but there is no code to use it.
Another approach is Zohmg from Last.fm, which uses HBase, but it seems to be very dead.
I think I will start with a Pig solution. Do you have any advice?
This would be very cool/useful. OpenTSDB is an HBase time-series database that might be interesting to look at, they have a clever approach to secondary indexing.
You can also look at a GPU-based database such as Kinetica (https://www.kinetica.com/),
but it is not open source, requires separate appliances and means moving data from Hadoop to the Kinetica infrastructure.
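For the original question about a cube with a flexible number of dimensions, the core computation can be sketched in a few lines of plain Python. This is only an in-memory illustration under assumed names (`build_cube`, the `ALL` marker and the sample fields are hypothetical), not a Pig or HBase implementation. Each input row contributes to every combination of "keep this dimension" or "roll it up", so the same code works for any list of dimensions.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marker for a rolled-up (aggregated-away) dimension


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions`.

    Keys are tuples in the order of `dimensions`; a rolled-up
    dimension is represented by ALL.
    """
    subsets = [
        set(combo)
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]
    cube = defaultdict(float)
    for row in rows:
        for kept in subsets:
            key = tuple(row[d] if d in kept else ALL for d in dimensions)
            cube[key] += row[measure]
    return dict(cube)


# Hypothetical fact rows.
rows = [
    {"country": "DE", "device": "mobile", "plays": 3},
    {"country": "DE", "device": "web", "plays": 2},
    {"country": "US", "device": "web", "plays": 5},
]

cube = build_cube(rows, ["country", "device"], "plays")
print(cube[("DE", ALL)], cube[(ALL, "web")], cube[(ALL, ALL)])
```

Every row is emitted once per subset of dimensions, that is 2^n times for n dimensions, so in a distributed version this emit step is the natural map phase and the per-key sum is the reduce phase. The exponential fan-out is also why the number of dimensions needs to stay modest or be pruned.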

Resources