Which is better, HBase or Neo4j? - hadoop

Hi guys, I am analyzing a few things for a proof of concept. I want to convert an employee payroll database to NoSQL. Which is better to use, HBase or Neo4j?
Or if you have any other suggestions, please tell me.

For the task at hand (payroll), and of these two choices, I would recommend going with Neo4j.
HBase is for truly big datasets (hundreds of gigabytes to terabytes). A payroll dataset is tiny by comparison.
HBase is not a full database; it is a data store. You will have to manually code and navigate links between entities, enforce foreign keys, handle transactions, etc.
HBase is geared more towards batch processing of large volumes of unstructured data than towards OLTP (which is what you need for payroll).
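To make the "you manage the links yourself" point concrete, here is a minimal sketch of what payroll access looks like through the HBase Java client API; the table, column family, and row-key names are made up for illustration, and notice there is no join or foreign-key check, only raw puts and gets:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PayrollHBaseSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table employees = conn.getTable(TableName.valueOf("employee"))) {

                // Store an employee row; the "link" to a department is just a stored value,
                // not a foreign key that the store enforces.
                Put put = new Put(Bytes.toBytes("emp-1001"));
                put.addColumn(Bytes.toBytes("pay"), Bytes.toBytes("salary"), Bytes.toBytes("55000"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("deptId"), Bytes.toBytes("dept-07"));
                employees.put(put);

                // Following the "relationship" means a second, hand-coded lookup elsewhere.
                Result row = employees.get(new Get(Bytes.toBytes("emp-1001")));
                String deptId = Bytes.toString(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("deptId")));
                System.out.println("Employee emp-1001 is in " + deptId);
            }
        }
    }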

'Analytics' is a pretty broad term. If you want to analyze relationships, or discover information via relationships, Neo4j would be good. HBase is typically used with Hadoop for big data analytics, where you already know the relationships in the data but want to analyze large tables of it. But I would think you need to understand both technologies a little better before going forward; they are not in the same category.
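For contrast, here is a minimal sketch of asking a relationship question in Neo4j from Java; it assumes the Neo4j Java driver (4.x), and the node labels, relationship type, and connection details are placeholders for illustration:

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Result;
    import org.neo4j.driver.Session;

    public class PayrollGraphQuery {
        public static void main(String[] args) {
            // Placeholder connection details.
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                // Relationships are first-class: find everyone in the same department as emp-1001.
                Result result = session.run(
                    "MATCH (e:Employee {id: 'emp-1001'})-[:WORKS_IN]->(d:Department)"
                    + "<-[:WORKS_IN]-(colleague) RETURN colleague.id AS id");
                result.forEachRemaining(record ->
                    System.out.println(record.get("id").asString()));
            }
        }
    }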

Related

What should be considered before choosing HBase?

I am very new to the big data space.
We got a suggestion from the team that we should use HBase instead of an RDBMS for high performance. We have no idea what should or must be considered before switching from an RDBMS to HBase. Any ideas?
One of my favourite books describes this well. Coming to @Whitefret's last point: there is something called the CAP theorem, on which this decision can be based:
- Consistency (all nodes see the same data at the same time)
- Availability (every request receives a response about whether it succeeded or failed)
- Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
In this context, HBase supports CP.
However, for migrating data from an RDBMS to HBase you can use Sqoop.
It's a difficult question; there are many things to consider.
Can you optimize your RDBMS? Adding indexes, denormalizing joins that cost too much ... there are many paths to consider, and I am no expert.
Is your data big? This is very vague, and there is a grey area between RDBMS and big data where you can't be sure which one to use. Millions of rows can still be handled efficiently by an RDBMS.
Do you need relations in your data? NoSQL databases don't use relations, which can be hard for people from a SQL background. There are frameworks that give SQL to HBase, but it is generally a bad idea to keep an RDBMS model when using big data.
If you can answer those questions and you think NoSQL is the way to go, ask your team how they feel about it. NoSQL databases come with problems you would never meet in the SQL world. They should build a prototype first to understand how all this works, and maybe some training should be made available for them.
In summary:
- Find out whether you need a non-relational database
- Choose the right one (is HBase really what you need? why not consider Cassandra or MongoDB?)
HBase, like all NoSQL databases, comes with great new features, but sadly nothing is free (not even mentioning the monetary cost).
With HBase, you really should check that every query you might want to run can be satisfied by the HBase data model. An important thing to consider is schema design (the modelling of the row key first and foremost).
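As an illustration of what row-key modelling means in practice, here is a small sketch of building a composite row key; the layout (a salt byte, an entity id, and a reversed timestamp so recent entries sort first) is just one common pattern, not a prescription:

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesign {

        // Composite key: 1-byte salt | entityId | reversed timestamp.
        // The salt spreads sequential writes across regions; the reversed timestamp
        // makes the most recent entries for an entity sort first in a scan.
        public static byte[] buildKey(String entityId, long timestampMillis, int saltBuckets) {
            byte salt = (byte) Math.floorMod(entityId.hashCode(), saltBuckets);
            long reversedTs = Long.MAX_VALUE - timestampMillis;
            return Bytes.add(new byte[] { salt },
                             Bytes.toBytes(entityId),
                             Bytes.toBytes(reversedTs));
        }

        public static void main(String[] args) {
            byte[] key = buildKey("emp-1001", System.currentTimeMillis(), 16);
            System.out.println("key length = " + key.length);
        }
    }

Whether a layout like this works for you depends entirely on the scans you need to run, which is why the row key should be designed from your queries backwards.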
I advise you to read this really good paper:
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf
I think that a really good answer to your question can be found on the HBase official site.
"HBase isn’t suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only."
https://hbase.apache.org/book.html

Indexing and searching a large set of documents in different formats using Hadoop and related technologies

We in our organization are trying to develop some competency around big data, Hadoop, and the related ecosystem.
We are thinking of doing a proof of concept in which our objective would be to store, index, and search a large set of PDF files, emails, and Word docs. First of all, I would like to know: is this a big data use case?
If it is, then is it a Hadoop use case? And if so, what technologies should we go for?
We tried storing PDFs in HDFS and successfully created Lucene indexes via mapper jobs in parallel, storing the indexes in the data nodes' local temporary directories.
But we are not sure whether we are doing it the right way, or how to make it a proper big data Hadoop use case, and we are struggling to decide on a tech stack: Hadoop, a NoSQL DB, Elasticsearch, Solr, etc.
Our objective is to do a proof of concept around searching a large set of documents in different formats, and we wanted to use Hadoop if possible... Can anybody please point us in the right direction?
Thanks
If you are not going to do any analysis on the files stored in HDFS, Hadoop might not be the right choice. If you have unstructured or semi-structured data and want to crunch it into tables for future analysis, you can use HDFS with Hive/Pig to extract it. You might not need NoSQL unless you want high availability or consistency, which in your case I don't think you do.
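For the extraction step, a minimal sketch of a mapper that pulls plain text out of PDFs stored in HDFS might look like the following; it assumes Apache PDFBox 2.x for the text extraction and an input file that simply lists one HDFS path per line, both of which are illustrative choices rather than the only way to do it:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Each input record is assumed to be the HDFS path of one PDF.
    // Emits (path, extracted text), which a later step could feed into
    // Lucene, Solr, or Elasticsearch for indexing.
    public class PdfTextMapper extends Mapper<Object, Text, Text, Text> {

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            Path pdfPath = new Path(value.toString().trim());
            FileSystem fs = pdfPath.getFileSystem(conf);

            try (InputStream in = fs.open(pdfPath);
                 PDDocument doc = PDDocument.load(in)) {
                String text = new PDFTextStripper().getText(doc);
                context.write(new Text(pdfPath.toString()), new Text(text));
            }
        }
    }

Whether you then index into embedded Lucene, Solr, or Elasticsearch is a separate decision from the extraction itself.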

Can I do cluster analysis with Hadoop, and can I hook Hadoop up to MongoDB?

I am a newbie to Hadoop. I went through a few blogs and skimmed through a couple of books on the subject. To guide my further studying I need answers to these two questions:
How much can I really do with MapReduce? From the examples I see that I can do min(), max(), sum(), count(). I can probably just as easily do average() and even standard_deviation(), but is that it? What if I want to run a query such as "customers who bought X also bought Y" (a sort of join of a table to itself, in SQL terminology)? What if I want to do graph analysis or cluster analysis? Is Hadoop's MapReduce of any help, or am I still pretty much on my own?
If I have an existing database, let's say it is big (1 petabyte) and distributed, let's say it is MongoDB with clusters, shards, and all that: can I hook Hadoop up to my existing MongoDB shards, or do I need to copy my data (and then keep it synchronized as it changes)? The latter, if that is what I really need to do, sounds like an expensive process; is there anything in Hadoop to help me do it?
A detailed, elaborate answer or a link to one would be much appreciated.
MapReduce is a pretty general structure for computation, and while it is not appropriate for every situation, it is flexible enough to be used for a wide variety of problems. For examples of what you have described, you can refer to Mahout: http://mahout.apache.org/users/clustering/clusteringyourdata.html
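As a rough illustration that plain MapReduce can express the "customers who bought X also bought Y" question, here is a minimal co-occurrence sketch; the input layout (one purchase per line as customerId<TAB>itemId) and the class names are assumptions made for the example, and a second, standard summing job would still be needed to total the pair counts:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Pass 1 of a co-occurrence computation: group purchases by customer and
    // emit every (itemA, itemB) pair that the customer bought together.
    public class CoPurchase {

        // Assumed input line: "customerId<TAB>itemId"
        public static class PurchaseMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length == 2) {
                    context.write(new Text(fields[0]), new Text(fields[1])); // customer -> item
                }
            }
        }

        public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void reduce(Text customer, Iterable<Text> items, Context context)
                    throws IOException, InterruptedException {
                List<String> bought = new ArrayList<>();
                for (Text item : items) {
                    bought.add(item.toString());
                }
                // Emit each ordered pair of distinct items bought by the same customer.
                for (String x : bought) {
                    for (String y : bought) {
                        if (!x.equals(y)) {
                            context.write(new Text(x + "\t" + y), ONE);
                        }
                    }
                }
            }
        }
    }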
Using the MongoDB connector you can directly access a Mongo database as a MapReduce InputFormat, without having to sync the data to HDFS: http://docs.mongodb.org/ecosystem/tools/hadoop
Alternatively, Mongo itself allows you to write MapReduce queries that are executed directly by the database. I'd recommend aggregating the data as much as possible in this way before interfacing with Hadoop/HDFS, to reduce the amount of data transferred between the two systems.
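A rough sketch of wiring that connector into a job driver might look like this; it assumes the mongo-hadoop connector's MongoInputFormat and MongoConfigUtil classes, a placeholder connection URI and document shape, and the PairReducer from the earlier co-occurrence sketch, so treat the details as illustrative and check the connector documentation for your version:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.bson.BSONObject;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class MongoJobDriver {

        // Assumed document shape: { customerId: ..., itemId: ... }
        public static class PurchaseFromMongoMapper
                extends Mapper<Object, BSONObject, Text, Text> {
            @Override
            protected void map(Object key, BSONObject doc, Context context)
                    throws IOException, InterruptedException {
                context.write(new Text(String.valueOf(doc.get("customerId"))),
                              new Text(String.valueOf(doc.get("itemId"))));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Read directly from the (sharded) collection instead of copying it to HDFS first.
            MongoConfigUtil.setInputURI(conf, "mongodb://mongos-host:27017/shop.purchases");

            Job job = Job.getInstance(conf, "mongo-co-purchase");
            job.setJarByClass(MongoJobDriver.class);
            job.setInputFormatClass(MongoInputFormat.class);
            job.setMapperClass(PurchaseFromMongoMapper.class);
            job.setReducerClass(CoPurchase.PairReducer.class); // reducer from the earlier sketch
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }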

Using Hadoop & related projects to analyze usage patterns that constantly change

We're strategizing on how to analyze user "interest" (clicks, likes, etc.) on 1M+ items on our site to generate a "similar items" list.
In order to process a large amount of raw data we're learning about Hadoop, Hive, and related projects.
My question is regarding this concern: Hadoop/Hive and the like seem to be geared more towards data dumps followed by processing cycles. Presumably the output at the end of a processing cycle is something like an indexed graph of links between related items.
If I'm on track so far, how is data typically processed in these scenarios? I.e.:
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Do we stream data as it comes in, analyze it and update the data store?
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Is this use case better addressed by Cassandra than Hive/HDFS?
I'm looking to better understand the common approach to this kind of big data processing.
I think this is a good use case for the Hadoop family of tools.
It looks to me like HDFS and Flume are obvious choices; I would look into either HBase or Hive depending on what kinds of analysis you are interested in and how flexible you want to be in organizing and querying the data.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Answer: Hadoop is very good for this. I would use HBase for this, but there are other choices.
Do we stream data as it comes in, analyze it and update the data store?
Answer: Flume is good for this.
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Answer: You have options to do both. Bulk updates would probably be a MapReduce job over HDFS, while piece-by-piece updates could be managed through HBase column-family values or Hive rows; a small sketch of the piece-by-piece case follows this answer. If you give more details, I could be more precise.
Is this use case better addressed by Cassandra than Hive/HDFS?
Answer: Cassandra and HBase are both modelled on Google's BigTable. I think the choice depends on how you need to organize, access, analyze, and update your data. I can provide more guidance if needed.
HBase is usually better for semi-structured, high R/W processing.
HDFS is generally a good choice for flexible, scalable storage of data dumps, as you call them.
Flume is applicable for moving streaming data.
I would also consider looking into Titan (which can run on top of HBase) if you are thinking in terms of a graph.
Hive would be applicable if you are interested in tabular-oriented data and using SQL-like queries.
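For the piece-by-piece case mentioned above, here is a minimal sketch of incrementally updating a co-interest counter in HBase; the table name, column family, and row-key layout are assumptions made for the example:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SimilarItemCounter {

        // Bump the counter for "users interested in itemA were also interested in itemB".
        // Assumed schema: row key = itemA, column family "sim", qualifier = itemB, value = long counter.
        public static void recordCoInterest(Connection conn, String itemA, String itemB)
                throws IOException {
            try (Table table = conn.getTable(TableName.valueOf("item_similarity"))) {
                table.incrementColumnValue(
                        Bytes.toBytes(itemA),   // row key
                        Bytes.toBytes("sim"),   // column family
                        Bytes.toBytes(itemB),   // qualifier
                        1L);                    // amount to add
            }
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                recordCoInterest(conn, "item-42", "item-99");
            }
        }
    }

The bulk alternative would be a periodic MapReduce job that recomputes the whole similarity table from the raw event log and replaces it.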

Assessing and comparing Hadoop for Business Intelligence: design considerations

I am considering various technologies for data warehousing and business intelligence, and have come upon this radical tool called Hadoop. Hadoop doesn't seem to be built exactly for BI purposes, but there are references to it having potential in this field (http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).
From what little information I have found on the internet, my gut tells me that Hadoop can become a disruptive technology in the space of traditional BI solutions. Information on this topic is really sparse, so I wanted to gather the gurus' thoughts here on the potential of Hadoop as a BI tool compared to traditional backend BI infrastructure like Oracle Exadata, Vertica, etc. For starters, I would like to ask the following question:
Design considerations: how would designing a BI solution with Hadoop be different from doing so with traditional tools? I know it should be different, as I read that one cannot create schemas in Hadoop. I also read that a major advantage would be the complete elimination of ETL tools (is this true?). Do we need Hadoop + Pig + Mahout to get a BI solution?
Thanks & Regards!
Edit: breaking this down into multiple questions. I will start with the one I think is most important.
Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is take in Data_A and output Data_B. Whatever is needed for BI but is not in a useful form can be processed using MapReduce and output in a useful form, be it CSV, Hive, HBase, MSSQL, or anything else used to view data.
I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigabytes of log files every hour, store them in Hive, and do daily aggregations that are loaded into an MSSQL server and viewed through a visualization layer.
The major design considerations I've run against are:
- Data flexibility: Do you want your users to view pre-aggregated data, or to have the flexibility to adjust the query and look at the data however they want?
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data set traversed, the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom-build a lot of pieces or be able to use something off the shelf? What constraints and flexibility are needed for your visualization? How flexible and changeable does the visualization need to be?
hth
Update: in response to @Bhat's comment asking about the lack of visualization...
The lack of a visualization tool that would allow us to effectively utilize the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, and pre-aggregated the data and stored it in HBase. To utilize this we were going to have to write a custom connector (we did this part) and a visualization layer. We looked at what we would be able to produce versus what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs; it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that takes the place of both Hive and HBase in our design.
Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do storage in that sense (HDFS aside); it does processing of data. Running MapReduce jobs (which is what Hive does under the hood) is going to be slower than MSSQL (or the like).
Hadoop is very well suited to storing colossal files that can represent fact tables. These tables can be partitioned by placing the individual files representing a table into separate directories. Hive understands such file structures and allows you to query them like partitioned tables. You can phrase your BI questions about the Hadoop data as SQL queries via Hive, but you will still need to write and run the occasional MapReduce job.
From a business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases where RDBMS / MPP solutions are not cost effective.
You should also consider Hadoop as a serious option if your data is not structured (HTML pages, for example).
We are creating a comparison matrix of BI tools for big data / Hadoop:
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html
It is a work in progress and we would love any input.
(Disclaimer: I am the author of this online book.)
