MapReduce vs Spark vs Storm vs Drill - For Small Files - hadoop

I know Spark does in-memory computation and is much faster than MapReduce.
I was wondering how well Spark works for small inputs, say fewer than 10,000 records?
I have a huge number of files (each with around 10,000 records and, say, 100 columns) coming into my Hadoop data platform, and I need to perform some data quality checks before I load them into HBase.
I do the data quality checks in Hive, which uses MapReduce at the back end. Each file takes about 8 minutes, and that's pretty bad for me.
Will Spark give me better performance, let's say 2-3 minutes?
I know I have to do some benchmarking, but I am trying to understand the basics here before I really get going with Spark.
As I recall, creating an RDD for the first time carries some overhead, and since I have to create a new RDD for each incoming file, that is going to cost me a bit.
I am confused about which would be the best approach for me: Spark, Drill, Storm, or MapReduce itself?

I am just exploring the performance of Drill vs Spark vs Hive over millions of records. Drill and Spark are both around 5-10 times faster than Hive in my case (I did not run any performance test on a cluster with significant RAM; I only tested on a single node). The reason for the fast computation is that both of them compute in memory.
The performance of Drill and Spark is almost comparable in my case, so I can't say which one is better. You need to try this at your end.
Testing on Drill will not take much time. Download the latest Drill, install it on your MapR Hadoop cluster, add the Hive storage plugin, and run the query.


The best way to filter large data sets

I have a query about how to filter relevant records from a large data set of financial transactions. We use Oracle 11g database and one of the requirements is to produce various end-of-day reports with all sorts of criteria.
The relevant tables look roughly like this:
trade_metadata 18m rows, 10 GB
trade_economics 18m rows, 15 GB
business_event 18m rows, 11 GB
trade_business_event_link 18m rows, 3 GB
One of our reports is now taking ages to run (> 5 hours). The underlying proc has been optimized time and again, but new criteria keep getting added, so we start struggling again. The proc is pretty standard: join all the tables and apply a host of where clauses (20 at the last count).
I was wondering if I have a problem large enough to consider big data solutions to get rid of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read up a bit about Hadoop + HBase, Cassandra, Apache Pig etc. but being very new to this space, am a little confused about the best way to proceed.
I imagine this is not a map-reduce problem. HBase does seem to offer Filters but I am not sure about their performance. Could the enlightened folks here please answer a few questions for me:
Is the data set large enough for big data solutions (Do I need entry into the billion club first?)
If it is, would HBase be a good choice to implement this?
We are not moving away from Oracle anytime soon even though the volumes are growing steadily. Am I looking at populating HDFS every day with a dump from the relevant tables? Or is a delta write possible every day?
Thanks very much!
Welcome to the incredibly varied big data eco-system. If your dataset size is big enough that it is taxing your ability to analyze it using traditional tools, then it is big enough for big data technologies. As you have probably seen, there are a huge number of big data tools available with many of them having overlapping capabilities.
First of all, you did not mention whether you have a cluster set up. If not, then I would suggest looking into the products by Cloudera and Hortonworks. These companies provide Hadoop distributions that include many of the most popular big data tools (HBase, Spark, Sqoop, etc.) and make it easier to configure and manage the nodes that will make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next you will need to get your data out of Oracle and into some format in the Hadoop cluster to analyze it. The tool often used to get data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, and files on the Hadoop Distributed Filesystem (HDFS). Sqoop can also do incremental imports for updates instead of whole-table loads. Which of these destinations you choose affects which tools you can use in the next step. HDFS is the most flexible in that you can access it from Pig, MapReduce code you write, Hive, Cloudera Impala, and others. I have found HBase to be very easy to use, but others highly recommend Hive.
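To make the incremental import idea concrete, here is a rough sketch that only assembles a Sqoop 1 command line for an append-mode import; it does not run anything. The JDBC URL, user name, check column (TRADE_ID), and target directory are placeholders I made up for illustration, and the sketch assumes the table has a monotonically increasing key. Authentication options are left out on purpose.

```python
import shlex


def sqoop_incremental_import_args(jdbc_url, username, table,
                                  check_column, last_value, target_dir):
    """Build the argument list for an append-mode incremental Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        # Only fetch rows whose check column is greater than the last value seen.
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]


# Hypothetical values, not taken from the question.
args = sqoop_incremental_import_args(
    jdbc_url="jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL",
    username="report_user",
    table="TRADE_METADATA",
    check_column="TRADE_ID",
    last_value=18000000,
    target_dir="/data/trade_metadata",
)
print(shlex.join(args))
```

Running the printed command each day with the previous run's highest key as the last value would pull only the new rows, which is what a daily delta load amounts to.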
An aside: there is a project called Apache Spark that is expected to be the replacement for Hadoop MapReduce. Spark claims up to a 100x speedup over traditional Hadoop MapReduce jobs. Many projects, including Hive, will run on Spark, giving you the ability to do SQL-like queries on big data and get results very quickly (Blog post).
Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, then you can reuse a lot of your SQL knowledge instead of having to program Java or learn Pig Latin (not that it’s very hard). Pig translates Pig Latin to MapReduce jobs (as does Hive’s query language for now), but, like Hive, Pig can target Spark as well. Regardless of which tool you choose for this step, I recommend looking into Oozie to automate the ingestion, analytics, and movement of results back out of the cluster (sqoop export for this). Oozie allows you to schedule recurring workflows like yours so you can focus on the results, not the process. The full capabilities of Oozie are documented here.
There are a crazy number of tools at your disposal, and the speed of change in this ecosystem can give you whiplash. Both Cloudera and Hortonworks provide virtual machines you can use to try their distributions. I strongly recommend spending less time deeply researching each tool and just trying some of them (like Hive, Pig, Oozie, ...) to see what works best for your application.

What can I expect about hive and hadoop in performance?

I'm currently trying to implement a solution with Hadoop using Hive on CDH 5.0 with YARN. My architecture is:
1 NameNode
3 DataNodes
I'm querying ~123 million rows with 21 columns
My nodes are virtualized, each with 2 vCPUs @ 2.27 GHz and 8 GB of RAM
I tried some queries and got some results, and after that I tried the same queries in a basic MySQL setup with the same dataset in order to compare the results.
MySQL is actually much faster than Hive, so I'm trying to understand why. I know some of the bad performance is down to my hosts. My main question is: is my cluster sized properly?
Do I need to add some DataNodes for this amount of data (which is not very enormous in my opinion)?
And if someone has tried similar queries on approximately the same architecture, you are welcome to share your results.
Thanks !
I'm querying ~123 million rows with 21 columns [...] which is not very enormous in my opinion
That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small datasets like the one you're using. It's like trying to use a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want realtime performance closer to that of a traditional RDBMS.
Hive 0.13+, which uses Tez, ORC, and a number of other optimizations that greatly improve response time.
Impala (part of CDH distributions), which bypasses MapReduce altogether but is more limited in file format support.
I'm saying that with 2 DataNodes I get the same performance as with 3
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...) it incurs all the cost that comes with MapReduce. This cost is more or less constant regardless of the size of data and number of datanodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.
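The fixed-cost argument can be put into a toy model. The sketch below assumes a constant job startup cost plus a per-row processing cost; both numbers (30 seconds and one microsecond per row) are invented for illustration and are not measurements from any cluster.

```python
def overhead_fraction(rows, startup_seconds=30.0, seconds_per_row=1e-6):
    """Fraction of total job time spent on fixed startup rather than on data."""
    processing_seconds = rows * seconds_per_row
    return startup_seconds / (startup_seconds + processing_seconds)


# Tiny, medium, and large inputs under the same assumed costs.
for rows in (100, 123_000_000, 10_000_000_000):
    share = overhead_fraction(rows)
    print(f"{rows:>14,} rows: {share:.1%} of the time is startup")
```

Under these assumptions the startup share is close to 100% for a hundred rows and shrinks toward zero as the row count grows, which matches the shape of the argument above even though the real constants will differ.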

Time taken by MapReduce jobs

I am new to Hadoop and MapReduce. I have a problem running my data in Hadoop MapReduce: I want the results to be returned in milliseconds. Is there any way I can execute my MapReduce jobs in milliseconds?
If not, what is the minimum time Hadoop MapReduce can take on a fully distributed multi-node cluster (5-6 nodes)?
The file to be analyzed in Hadoop MapReduce is around 50-100 MB.
The program is written in Pig. Any suggestions?
For ad hoc real-time querying of data, use Impala or Apache Drill (WIP). Drill is based on Google Dremel.
Hive jobs get converted into MapReduce, so Hive is also batch-oriented in nature and not real-time. A lot of work is going on to improve the performance of Hive (1 and 2), though.
It's not possible (AFAIK). Hadoop is not meant for real-time work in the first place; it is best suited to batch jobs. The MapReduce framework needs some time to accept and set up the job, which you can't avoid, and I don't think it's a wise decision to get ultra-high-end machines to set up a Hadoop cluster. Also, the framework has to do a few things before actually starting the job, such as creating the logical splits of your data.
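As a rough illustration of the split step, the sketch below estimates how many logical splits a single file would produce. It assumes the split size equals the HDFS block size (128 MB is a common default in Hadoop 2, 64 MB in older releases) and ignores details of the real input format logic, such as Pig combining small splits, so treat it as a simplification.

```python
import math

MB = 1024 * 1024


def estimated_splits(file_size_bytes, split_size_bytes=128 * MB):
    """Estimate the number of input splits, treating any file as at least one split."""
    return max(1, math.ceil(file_size_bytes / split_size_bytes))


for size_mb in (50, 100):
    print(size_mb, "MB ->",
          estimated_splits(size_mb * MB), "split(s) at 128 MB,",
          estimated_splits(size_mb * MB, 64 * MB), "split(s) at 64 MB")
```

With only one or two splits, a 50-100 MB input gives the cluster almost nothing to parallelize, so the job setup time dominates no matter how many nodes are available.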

Performance comparison : Hive & MapReduce

Hive provides an abstraction layer over Java MapReduce jobs, so it should have performance issues when compared to Java MapReduce jobs.
Do we have any benchmark comparing the performance of Hive queries and Java MapReduce jobs?
Actual use-case scenarios with run-time data would be a real help.
Your premise that "it should have performance issues when compared to Java MapReduce jobs" is wrong.
Hive (and Pig, Crunch, and other map/reduce abstractions) would be slower than a fully tuned, hand-written map/reduce job.
However, unless you're experienced with Hadoop and map/reduce, the chances are that the map/reduce you'd write would be slower on non-trivial queries than what Hive et al. will do.
I did some small tests in a VM some time back and I couldn't really notice any difference. Maybe Hive was a few seconds slower sometimes, but I can't really tell if that was Hive's performance or my VM hanging due to low memory. I think one thing to keep in mind is that Hive will always try to determine the fastest way to do a MapReduce job. Now, when you write small MapReduce jobs, you'll probably be able to find the fastest way yourself. But with large, complex jobs (with joins, etc.), will you always be able to compete with Hive?
Also, writing a MapReduce job with multiple classes and methods seems to take ages in comparison with writing a HiveQL query.
On the other hand, I had the feeling that when I wrote the job myself it was easier to know what was going on.
If you have a small dataset on your machine and want to process it using Apache Hive, the job will run slower than processing the same dataset using Hadoop MapReduce directly. Hive's performance degrades slightly on small datasets, whereas for large datasets Apache Hive's performance would be better compared to MapReduce.
When processing datasets in MapReduce, the dataset is stored in HDFS. MapReduce has no database of its own, whereas Hive has its metastore. Through Hive's metastore, data can be shared with Impala, Beeline, and JDBC and ODBC drivers.

Is Hadoop Suited to Serve 100 byte Records Out of 50GB Dataset?

We have a question on whether Hadoop is suitable for simple tasks that require no application running but do require very fast reads and writes of small amounts of data.
The requirement is to write messages roughly 100-200 bytes long, with a couple of indexes, at a rate of 30 per second, and at the same time to read them (searching by those two indexes) at a rate of roughly 10 per second. The read queries must be very fast, 100-200 milliseconds at most per query, and return a few matching records.
The total data volume is expected to reach 50-100 GB and is to be maintained at this level by removing older records (something like a daily task to delete records that are older than 14 days).
As you can see the total data volume is not really that big, but we are concerned that the search speed of Hadoop may be slower than our need anyway.
Is Hadoop a solution for this?
Hadoop, alone, is very bad at serving out many small segments of data. However, HBase is an indexed table database-like system meant to be run on top of Hadoop. It is excellent at serving out small indexed files. I would research that as a solution.
Another problem to keep an eye on is that importing data into HDFS or HBase is not trivial. It can slow your cluster down quite a bit, so if Hadoop is your choice, you have to also solve how to get those 75GB into HDFS so Hadoop can touch them.
As Sam noted, HBase is the Hadoop stack solution that can handle your requirements. However, I wouldn't go with Hadoop if these are your only requirements for the data.
You can go with other NoSQL solutions like MongoDB or CouchDB, or even MySQL or Postgres.
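As a minimal sketch of how the workload maps onto an ordinary indexed table, the example below uses SQLite from the Python standard library purely as a stand-in for MySQL or Postgres. The table name, column names (key_a, key_b), and sample values are hypothetical, and nothing here says anything about performance at 50-100 GB.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (key_a TEXT, key_b TEXT, payload BLOB, created_at REAL)"
)
# One index per search key, plus one on the timestamp for the daily purge.
conn.execute("CREATE INDEX idx_key_a ON messages (key_a)")
conn.execute("CREATE INDEX idx_key_b ON messages (key_b)")
conn.execute("CREATE INDEX idx_created_at ON messages (created_at)")


def write_message(key_a, key_b, payload, created_at=None):
    """Insert one message, stamping it with the current time by default."""
    ts = time.time() if created_at is None else created_at
    conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?)", (key_a, key_b, payload, ts))


def find_by_key_a(key_a):
    """Look up messages by the first indexed key."""
    cur = conn.execute("SELECT key_b, payload FROM messages WHERE key_a = ?", (key_a,))
    return cur.fetchall()


def purge_older_than(days, now=None):
    """Delete messages older than the given number of days; return how many were removed."""
    now = time.time() if now is None else now
    cur = conn.execute("DELETE FROM messages WHERE created_at < ?", (now - days * 86400,))
    return cur.rowcount


now = time.time()
write_message("acct-1", "dev-9", b"x" * 150, created_at=now - 15 * 86400)  # past retention
write_message("acct-1", "dev-7", b"y" * 150, created_at=now)
removed = purge_older_than(14, now=now)
print(removed, find_by_key_a("acct-1"))
```

The same three operations (indexed insert, indexed lookup, time-based delete) are what the requirement boils down to, which is why a conventional database is a reasonable thing to benchmark first.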
