Hadoop/Spark for building large analytics reports

I know nothing about distributed processing engines, so it's pretty hard for me to tell whether they suit my needs.
I have a huge table in a relational database that users work with every day (CRUD operations and search).
Now there is a new task: be able to build a huge aggregate report over a one- to two-year period, on demand, and build it fast.
All the table's records for the last two years are too big to fit in memory, so I should split the computation into chunks, right?
I don't want to reinvent the wheel, so my question is:
are distributed processing systems like Hadoop suited to this kind of task?

It might be.
The non-Hadoop way would be to create semi-aggregate reports that you can then combine into larger aggregates,
e.g. using 30 daily aggregates to create one monthly aggregate.
In some cases that may not be possible, so you can pull the data into a Spark cluster (or similar) and compute the aggregate there.
A relational database usually won't give you data locality, so you could move the data to a NoSQL store such as Cassandra, HBase, or Elasticsearch.
Another big question is whether you need the answer in real time. Unless you go through some extra effort (a job server, etc.), Spark or Hadoop jobs are usually batch jobs: you submit the job and get the answer later (Spark Streaming is an exception).
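If you do go the Spark route, a minimal batch sketch might look like the following: it reads the table over JDBC in parallel chunks and rolls the last two years up into monthly totals. The connection URL, table, and column names (trades, id, trade_date, amount) are hypothetical placeholders, and a matching JDBC driver would need to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MonthlyAggregateReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("monthly-aggregate").getOrCreate()

    // Read the big table in parallel chunks instead of loading it all at once:
    // partitionColumn/lowerBound/upperBound/numPartitions tell Spark how to
    // split the table into independent JDBC queries.
    val trades = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales") // hypothetical connection
      .option("dbtable", "trades")                          // hypothetical table
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "200000000")
      .option("numPartitions", "64")
      .load()

    // Roll the last two years of rows up into one row per month.
    val report = trades
      .filter(col("trade_date") >= add_months(current_date(), -24))
      .groupBy(date_trunc("month", col("trade_date")).as("month"))
      .agg(count(lit(1)).as("records"), sum("amount").as("total_amount"))

    report.orderBy("month").show()
    spark.stop()
  }
}
```

Submitted with spark-submit, this runs as a plain batch job; the same aggregation could be scheduled a few times a day if "on demand" turns out to mean "reasonably fresh" rather than "real time".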

Related

For Hadoop: which data storage?

Currently I am working on a solution for my internship that has to handle up to 100,000,000 records a day, each with about 10 columns. I have to save every record, and after 15 days we have about 1,500,000,000 records.
The situation:
Every day I receive about 100,000,000 (maybe a few million more) records, and with these records I have to do some calculations/analysis. To do this, I am thinking about using Hadoop for MapReduce and distributed computing. With the MapReduce pattern I can make sets of 100,000 records each and distribute them over the cluster for distributed analysis/calculations.
I don't know if this is a good solution, but if there is something else I should consider, please tell me.
Besides this, I also have to store all these records and use them every month to improve the algorithm for the calculations I do every day. Which store is best for this situation? I am thinking about HBase or CouchDB because I think they fit my requirements well.
Actually, Hadoop is not a database. Hadoop is a framework that enables distributed processing of large data sets across clusters of commodity servers.
It is designed to scale from a single server up to thousands of machines, with a very high degree of fault tolerance. Hadoop is best known for MapReduce and its distributed file system (HDFS).
HBase is a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
Hive is a distributed data warehouse. Hive manages data stored in HDFS and provides a SQL-based query language (which the runtime engine translates into MapReduce jobs) for querying the data.
What you can do is:
use HBase for storage,
use Hive for analytics;
you can also integrate the two and run Hive queries (SQL-based) over data stored in HBase.
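As a rough sketch of that split (HBase for storage, Hive for analytics): assuming the records are exposed as a Hive table, for example an external table defined over HBase via the Hive-HBase storage handler, a Spark job with Hive support can query them with plain SQL. The table and column names (events, region, event_date) are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object DailyAnalytics {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() assumes a Hive metastore is configured on the cluster.
    val spark = SparkSession.builder()
      .appName("daily-analytics")
      .enableHiveSupport()
      .getOrCreate()

    // "events" is assumed to be a Hive table (possibly an external table
    // backed by HBase); the query itself is ordinary Spark SQL / HiveQL.
    val summary = spark.sql(
      """SELECT region, event_date, COUNT(*) AS records
        |FROM events
        |WHERE event_date >= date_sub(current_date(), 15)
        |GROUP BY region, event_date""".stripMargin)

    summary.show()
    spark.stop()
  }
}
```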

Comparing data with the last 5 versions of feed data in C* using DataStax, Hadoop, Hive

I have a lot of data saved into Cassandra on a daily basis, and I want to compare one datapoint with the last 5 versions of the data for different regions.
Let's say there is a price datapoint for a product, and there are 2000 products in a context/region (say US). I want to show a heat-map dashboard showing when the price change happened for different regions.
I am new to Hadoop, Hive, and Pig. Which path would help me achieve my goal? Some details would be appreciated.
Thanks.
This sounds like a good use case for either traditional MapReduce or Spark. You have relatively infrequent updates, so a batch job running over the data and updating a table that in turn provides the data for the heat map seems like the right way to go. Since the updates are infrequent, you probably don't need to worry about Spark Streaming; a traditional batch job run a few times a day is fine.
Here's some info from datastax on reading from cassandra in a spark job: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSCcontext.html
For either Spark or MapReduce, you are going to want to leverage the framework's ability to partition the task. If you are manually connecting to Cassandra and reading/writing the data like you would from a traditional RDBMS, you are probably doing something wrong. If you write your job correctly, the framework will be responsible for spinning up multiple readers (one for each node that contains the source data you are interested in), distributing the calculation tasks, and routing the results to the appropriate machine to store them.
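To make that concrete, here is a minimal sketch of a distributed read with the DataStax Spark Cassandra connector; the keyspace, table, and column names (prices, product_prices, region, product_id, version, price) are made up for the example, and the real change-detection logic is reduced to a simple count.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object PriceChangeJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("price-change-heatmap")
      .set("spark.cassandra.connection.host", "127.0.0.1") // your Cassandra contact point

    val sc = new SparkContext(conf)

    // The connector creates Spark partitions from Cassandra token ranges,
    // so the read is distributed and data-local where possible.
    val priceRows = sc.cassandraTable("prices", "product_prices") // keyspace, table (hypothetical)
      .select("region", "product_id", "version", "price")
      .where("region = ?", "US")

    // Count the retained versions per product in the region as a stand-in
    // for the real "compare against the last 5 versions" logic.
    val versionsPerProduct = priceRows
      .map(row => (row.getString("product_id"), 1))
      .reduceByKey(_ + _)

    versionsPerProduct.take(10).foreach(println)
    sc.stop()
  }
}
```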
Some more examples are here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html
and
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/byoh/byohIntro.html
Either way, MapReduce is probably a little simpler, and Spark is probably a little more future-proof.

The best way to filter large data sets

I have a query about how to filter relevant records from a large data set of financial transactions. We use an Oracle 11g database, and one of the requirements is to produce various end-of-day reports with all sorts of criteria.
The relevant tables look roughly like this:
trade_metadata 18m rows, 10 GB
trade_economics 18m rows, 15 GB
business_event 18m rows, 11 GB
trade_business_event_link 18m rows, 3 GB
One of our reports is now taking ages to run (> 5 hours). The underlying proc has been optimized time and again, but new criteria keep getting added, so we start struggling again. The proc is pretty standard: join all the tables and apply a host of WHERE clauses (20 at the last count).
I was wondering if I have a problem large enough to consider big data solutions and get rid of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read a bit about Hadoop + HBase, Cassandra, Apache Pig, etc., but being very new to this space, I am a little confused about the best way to proceed.
I imagine this is not a map-reduce problem. HBase does seem to offer Filters, but I am not sure about their performance. Could the enlightened folks here please answer a few questions for me:
Is the data set large enough for big data solutions (do I need entry into the billion club first)?
If it is, would HBase be a good choice to implement this?
We are not moving away from Oracle anytime soon, even though the volumes are growing steadily. Am I looking at populating HDFS every day with a dump of the relevant tables? Or are daily delta writes possible?
Thanks very much!
Welcome to the incredibly varied big data ecosystem. If your data set is big enough that it is taxing your ability to analyze it with traditional tools, then it is big enough for big data technologies. As you have probably seen, there is a huge number of big data tools available, many of them with overlapping capabilities.
First of all, you did not mention whether you have a cluster set up. If not, I would suggest looking into the products from Cloudera and Hortonworks. These companies provide Hadoop distributions that include many of the most popular big data tools (HBase, Spark, Sqoop, etc.) and make it easier to configure and manage the nodes that will make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next you will need to get your data out of Oracle and into some format in the Hadoop cluster to analyze it. The tool most often used to get data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, or files on the Hadoop Distributed File System (HDFS). Sqoop also supports incremental imports for updates instead of whole-table loads. Which of these destinations you choose affects which tools you can use in the next step. HDFS is the most flexible in that you can access it from Pig, MapReduce code you write yourself, Hive, Cloudera Impala, and others. I have found HBase to be very easy to use, but others highly recommend Hive.
An aside: there is a project called Apache Spark that is expected to be the replacement for Hadoop MapReduce. Spark claims a 100x speedup compared to traditional Hadoop MapReduce jobs. Many projects, including Hive, will run on Spark, giving you the ability to do SQL-like queries on big data and get results very quickly (blog post).
Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, you can reuse a lot of your SQL knowledge instead of having to program Java or learn Pig Latin (not that it's very hard). Pig translates Pig Latin into MapReduce jobs (as does Hive's query language, for now), but, like Hive, Pig can target Spark as well. Regardless of which tool you choose for this step, I recommend looking into Oozie to automate the ingestion, analytics, and movement of the results back out of the cluster (Sqoop export for this). Oozie lets you schedule recurring workflows like yours so you can focus on the results, not the process. The full capabilities of Oozie are documented here.
There is a crazy number of tools at your disposal, and the speed of change in this ecosystem can give you whiplash. Both Cloudera and Hortonworks provide virtual machines you can use to try their distributions. I strongly recommend spending less time deeply researching each tool and just trying some of them (like Hive, Pig, Oozie, ...) to see what works best for your application.
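As a rough sketch of the reporting step above: assuming the four tables have been imported (for example with Sqoop) and are visible as Hive tables, the end-of-day report becomes an ordinary SQL join that Spark (or Hive) parallelizes across the cluster. Only the table names come from the question; the join keys, filter columns, and output path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object EndOfDayReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("end-of-day-report")
      .enableHiveSupport() // assumes the Sqoop-imported tables are registered in Hive
      .getOrCreate()

    // Same shape as the original proc: join the four tables and apply the
    // report's WHERE clauses; the framework distributes the join.
    val report = spark.sql(
      """SELECT m.trade_id, e.notional, b.event_type
        |FROM trade_metadata m
        |JOIN trade_economics e ON e.trade_id = m.trade_id
        |JOIN trade_business_event_link l ON l.trade_id = m.trade_id
        |JOIN business_event b ON b.event_id = l.event_id
        |WHERE m.trade_date = current_date()
        |  AND b.status = 'CONFIRMED'""".stripMargin)

    report.write.mode("overwrite").parquet("/reports/end_of_day") // hypothetical output location
    spark.stop()
  }
}
```

An Oozie coordinator (or any scheduler) could then run this job after the nightly import so the report is ready by the start of the next business day.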

Incremental MapReduce with Hadoop and HBase

I've been working with CouchDB for a while, and I'm considering doing a little academic project with HBase/Hadoop. I read some material on them but could not find a good answer to one question:
Both Hadoop/HBase and CouchDB use MapReduce as their main query method. However, there is a significant difference: CouchDB does it incrementally, using views, indexing every new piece of data added to the database, while Hadoop (in all the examples I have seen) is typically used to perform full queries over entire data sets. What I'm missing is the ability to use Hadoop MapReduce to build and, mainly, maintain indexes such as CouchDB's views. I have seen some examples of how MapReduce can be used to create an initial index, but nothing about incremental updates.
I believe the main challenge here is to run the indexing job only on rows that have changed since a given timestamp (the time of the last indexing job). This would keep each job short, allowing it to run frequently and keep the index relatively up to date.
I expected this usage pattern to be very common and was surprised not to find anything about it online. I have already seen IndexedHbase and HbaseIndexed, which both provide secondary indexing on HBase based on non-key columns. That is not what I need. I need the programmatic ability to define the index arbitrarily, based on the contents of one or more rows.
One way would be to use the timestamp as the rowkey. That allows you to work on rows from a given time range.
Since the rowkey would be based on the timestamp, use hashing (salting) to avoid hotspotting.
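To make that concrete, here is a small self-contained sketch of a salted, timestamp-based rowkey: a hash-derived salt prefix spreads writes across region servers, and an incremental indexing job can then issue one bounded scan per salt bucket covering only the time window since its last run. The bucket count and key layout are illustrative choices, not anything prescribed by HBase.

```scala
import java.nio.ByteBuffer

object SaltedRowKey {
  val NumBuckets = 16 // illustrative; size this to your cluster/region count

  /** Rowkey layout: [1-byte salt][8-byte big-endian timestamp][entity id]. */
  def rowKey(entityId: String, timestampMillis: Long): Array[Byte] = {
    val salt = ((entityId.hashCode & 0x7fffffff) % NumBuckets).toByte
    val idBytes = entityId.getBytes("UTF-8")
    ByteBuffer.allocate(1 + 8 + idBytes.length)
      .put(salt)
      .putLong(timestampMillis) // big-endian, so keys sort by time within a bucket
      .put(idBytes)
      .array()
  }

  /** Start/stop key pair for scanning one salt bucket over [from, until). */
  def bucketRange(bucket: Int, from: Long, until: Long): (Array[Byte], Array[Byte]) = {
    def boundary(ts: Long) = ByteBuffer.allocate(9).put(bucket.toByte).putLong(ts).array()
    (boundary(from), boundary(until))
  }

  def main(args: Array[String]): Unit = {
    val key = rowKey("product-42", System.currentTimeMillis())
    println(s"rowkey length = ${key.length}")
    // An incremental indexing job would compute bucketRange(b, lastRunTs, nowTs)
    // for each bucket 0 until NumBuckets and run one HBase scan per range,
    // touching only rows written since the previous run.
  }
}
```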

high volume transaction data informative pattern generation

I am trying to extract informative data patterns from a large volume of transactional data.
Typically my data is a set of records with well-defined columns (like sender, receiver, amount, currency, address, etc.; I have around 40-50 different columns). The data volume will be multi-million (maybe hundreds of millions of) records, and my aim is to generate informative transactional patterns from it, such as: who is purchasing a particular item the most, the highest-volume transaction recipients, expense patterns, who is getting the most transactions from the same sender, and so on.
Earlier I was planning to load the data into a relational database (Oracle/MySQL) and write complex SQL to fetch this information, but looking at the volume during my proof of concept, that approach doesn't seem very scalable.
I have been trying to get more information on distributed data processing using Hadoop and the like. I have just started reading about Hadoop; my initial understanding is that Hadoop is well suited to unstructured data processing and might not be very useful for relational data processing.
Any pointers/suggestions on open-source technologies I can quickly experiment with?
Hadoop can be used for both structured and unstructured data processing. Also, it's not a database: it does not maintain relationships or indexes the way a traditional RDBMS does.
With millions of rows, HBase or Cassandra, with or without Hive, can be used for batch querying. Batch querying on Hadoop has been around for some time and is mature.
For interactive querying, Drill or Impala can be used. Note that Drill development has just started and it is at the incubator stage, while Impala has only just been announced by Cloudera. Here is some interesting info on real-time engines.
Note that there are a lot of other open-source frameworks that might fit the requirements; only a couple of them are mentioned here. The appropriate framework has to be chosen based on a detailed requirements analysis and the pros and cons of the different options.
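Before committing to a framework, it can be worth prototyping one or two of the patterns directly in Spark: something like "who is purchasing a particular item the most" maps onto a grouped aggregation plus a window ranking. The column names (sender, receiver, item, amount) and the CSV input path below are made-up placeholders for whatever export of the transactions you have.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TransactionPatterns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("transaction-patterns").getOrCreate()

    // Hypothetical input: a CSV export of the transactions with a header row.
    val tx = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/transactions/*.csv")

    // "Who is purchasing a particular item the most": rank senders per item
    // by transaction count and keep the top one.
    val byItemSender = tx.groupBy("item", "sender").agg(count(lit(1)).as("purchases"))
    val topBuyerPerItem = byItemSender
      .withColumn("rank", row_number().over(
        Window.partitionBy("item").orderBy(col("purchases").desc)))
      .filter(col("rank") === 1)

    // "Highest-volume transaction recipients": a simple grouped sum.
    val topRecipients = tx.groupBy("receiver")
      .agg(sum("amount").as("total_amount"))
      .orderBy(col("total_amount").desc)

    topBuyerPerItem.show(20)
    topRecipients.show(20)
    spark.stop()
  }
}
```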
