Performing update operations on Hadoop

Hadoop is not designed to do updates. I tried with Hive; it has to do an INSERT OVERWRITE, which is a costly operation. We can also work around it using MapReduce, which again is costly.
Is there any other tool or way by which I can do frequent updates on Hadoop, or can I use Spark for the same? Please help me; I am not getting enough information about this even after googling 100 times. Thanks in advance.

If you need to update in real time on Hadoop, HBase is the solution you might want to take a look at. Hive is not meant for random/frequent updates; it is more of a data-crunching tool, not a replacement for an RDBMS.
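As a minimal sketch of what an update looks like with the HBase Java client (the table, family, and column names here are invented for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdateSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"))) { // hypothetical table
            // A Put on an existing row key acts as an update: readers see the latest cell.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("ACTIVE"));
            table.put(put);

            // Read it back immediately; no batch job or INSERT OVERWRITE needed.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"))));
        }
    }
}
```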

Related

The best way to filter large data sets

I have a query about how to filter relevant records from a large data set of financial transactions. We use an Oracle 11g database, and one of the requirements is to produce various end-of-day reports with all sorts of criteria.
The relevant tables look roughly like this:
trade_metadata: 18M rows, 10 GB
trade_economics: 18M rows, 15 GB
business_event: 18M rows, 11 GB
trade_business_event_link: 18M rows, 3 GB
One of our reports is now taking ages to run (> 5 hours). The underlying proc has been optimized time and again, but new criteria keep getting added, so we start struggling again. The proc is pretty standard: join all the tables and apply a host of WHERE clauses (20 at the last count).
I was wondering if I have a problem large enough to consider big data solutions, to get rid of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read up a bit about Hadoop + HBase, Cassandra, Apache Pig, etc., but being very new to this space, I am a little confused about the best way to proceed.
I imagine this is not a map-reduce problem. HBase does seem to offer Filters, but I am not sure about their performance. Could the enlightened folks here please answer a few questions for me:
Is the data set large enough for big data solutions (or do I need entry into the billion club first)?
If it is, would HBase be a good choice to implement this?
We are not moving away from Oracle anytime soon, even though the volumes are growing steadily. Am I looking at populating HDFS every day with a dump from the relevant tables? Or is a delta write possible every day?
Thanks very much!
Welcome to the incredibly varied big data eco-system. If your dataset size is big enough that it is taxing your ability to analyze it using traditional tools, then it is big enough for big data technologies. As you have probably seen, there are a huge number of big data tools available with many of them having overlapping capabilities.
First of all, you did not mention whether you have a cluster set up. If not, I would suggest looking into the products from Cloudera and Hortonworks. These companies provide Hadoop distributions that include many of the most popular big data tools (HBase, Spark, Sqoop, etc.), and they make it easier to configure and manage the nodes that will make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next you will need to get your data out of Oracle and into some format in the Hadoop cluster to analyze it. The tool often used to get data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, or files on the Hadoop Distributed File System (HDFS). Sqoop can also do incremental imports for updates instead of whole-table loads. Which of these destinations you choose affects which tools you can use in the next step. HDFS is the most flexible in that you can access it from Pig, MapReduce code you write, Hive, Cloudera Impala, and others. I have found HBase to be very easy to use, but others highly recommend Hive.
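As a rough sketch of such an incremental import, Sqoop 1 can be driven from Java through its tool entry point; every connection detail, table, and column below is a placeholder:

```java
import org.apache.sqoop.Sqoop;

public class IncrementalImportSketch {
    public static void main(String[] args) {
        // Pulls only rows whose LAST_UPDATED is newer than the previous run,
        // instead of re-importing the whole table. All values are placeholders.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",
            "--username", "reporting",
            "--password-file", "/user/etl/oracle.pwd",
            "--table", "TRADE_ECONOMICS",
            "--incremental", "lastmodified",
            "--check-column", "LAST_UPDATED",
            "--last-value", "2014-06-01 00:00:00",
            "--target-dir", "/data/trade_economics"
        };
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}
```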
An aside: there is a project called Apache Spark that is expected to be the replacement for Hadoop MapReduce. Spark claims a 100x speedup compared to traditional Hadoop MapReduce jobs. Many projects, including Hive, will run on Spark, giving you the ability to do SQL-like queries on big data and get results very quickly.
Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, you can reuse a lot of your SQL knowledge instead of having to program in Java or learn Pig Latin (not that it's very hard). Pig translates Pig Latin to MapReduce jobs (as does Hive's query language, for now), but, like Hive, Pig can target Spark as well. Regardless of which tool you choose for this step, I recommend looking into Oozie to automate the ingestion, analytics, and movement of results back out of the cluster (Sqoop export for this). Oozie allows you to schedule recurring workflows like yours, so you can focus on the results, not the process. Its full capabilities are covered in the Oozie documentation.
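If you go the Hive route, reusing SQL can be as direct as running the report through Hive's JDBC driver. A minimal sketch, assuming a HiveServer2 endpoint; the host, credentials, tables, and criteria are all invented for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EndOfDayReportSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "etl", "");
             Statement stmt = conn.createStatement()) {
            // The same join-and-filter shape as the Oracle proc, expressed in HiveQL.
            ResultSet rs = stmt.executeQuery(
                "SELECT m.trade_id, e.notional " +
                "FROM trade_metadata m " +
                "JOIN trade_economics e ON m.trade_id = e.trade_id " +
                "WHERE e.trade_date = '2014-06-30' AND e.currency = 'USD'");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```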
There are a crazy number of tools at your disposal, and the speed of change in this eco-system can give you whiplash. Both Cloudera and Hortonworks provide virtual machines you can use to try their distributions. I strongly recommend spending less time deeply researching each tool and just trying some of them (like Hive, Pig, Oozie, ...) to see what works best for your application.

Can I do cluster analysis with Hadoop, and can I hook Hadoop to MongoDB

I am a newbie to Hadoop. I went through a few blogs and skimmed through a couple of books on the subject. To guide my further study, I need answers to these two questions:
How much can I really do with MapReduce? From the examples I see, I can do min(), max(), sum(), and count(). You can probably just as easily do average() and even standard_deviation(), but is that it? What if I want to run a query such as "customers who bought X also bought Y" (a sort of join of a table to itself, in SQL terminology)? What if I want to do graph analysis or cluster analysis; is Hadoop's MapReduce of any help, or am I still pretty much on my own?
If I have an existing database, let's say it is big (1 petabyte) and distributed; let's say it is MongoDB with clusters, shards, and all that. Can I hook Hadoop to my existing MongoDB shards, or do I need to copy my data (and then keep it synchronized as it changes)? The latter, if that is what I really need to do, sounds like an expensive process; is there anything in Hadoop to help me do it?
A detailed, elaborate answer or a link to one will be much appreciated.
MapReduce is a pretty general structure for computation, and while it is not appropriate for every situation, it is flexible enough to be used for a wide variety of problems. For examples of the kinds of analyses you have described, you can refer to Mahout: http://mahout.apache.org/users/clustering/clusteringyourdata.html
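To make the "customers who bought X also bought Y" case concrete, here is a rough MapReduce sketch that counts product co-occurrences per order. The input layout (one comma-separated order per line) and all names are assumptions for illustration:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CoPurchaseSketch {

    // Assumes one order per line: "orderId,productA,productB,...".
    public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String[] products = Arrays.copyOfRange(fields, 1, fields.length);
            Arrays.sort(products); // normalize so (X,Y) and (Y,X) collapse to one key
            for (int i = 0; i < products.length; i++) {
                for (int j = i + 1; j < products.length; j++) {
                    ctx.write(new Text(products[i] + "|" + products[j]), ONE);
                }
            }
        }
    }

    // Sums the co-occurrence counts for each product pair.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "co-purchase pairs");
        job.setJarByClass(CoPurchaseSketch.class);
        job.setMapperClass(PairMapper.class);
        job.setCombinerClass(SumReducer.class); // summing is associative, so this is safe
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```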
Using the MongoDB connector, you can directly access a Mongo database as a MapReduce input format without having to sync the data to HDFS: http://docs.mongodb.org/ecosystem/tools/hadoop
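A rough sketch of wiring that connector into a job's configuration; the URI is a placeholder, and the class names are taken from the mongo-hadoop project:

```java
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MongoJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the job at a live Mongo collection; no copy to HDFS required.
        MongoConfigUtil.setInputURI(conf, "mongodb://mongohost:27017/shop.orders");

        Job job = Job.getInstance(conf, "read straight from MongoDB");
        job.setJarByClass(MongoJobSketch.class);
        // Splits are computed against the shards, and each map task reads
        // BSON documents directly from Mongo instead of from HDFS files.
        job.setInputFormatClass(MongoInputFormat.class);
        // ... set mapper, reducer, and output exactly as for any other MapReduce job.
    }
}
```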
Alternatively, Mongo itself allows you to write queries in MapReduce to be executed directly by the database. I'd recommend aggregating the data as much as possible this way before interfacing with Hadoop/HDFS, to reduce the amount of data transferred between the two systems.

SequenceFile alternative/extension that allows in-place updates

I like the convenience of a database where you can update a row in place. But Hadoop relies on sequence files that can be consumed in parallel.
I liked the idea of HBase, where I can rewrite just one row and also use the table as input to a MapReduce job. But HBase is not something a newbie should mess with, right? What is a good tool/method for this?
I don't think it's very difficult to learn and use HBase.
Coming to your original question: the reason we use HBase is the same as the reason for using any other DB, i.e. random, real-time read/write access. HDFS, like any other filesystem, lacks this. You could take the ext4 & MySQL pairing as an example of the same split.
And when you say re-write in HBase, it is actually not an update. You either put a new version of a cell, or delete a cell and put new data at the same location.
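A small sketch of that behaviour with the HBase Java client (table and column names invented; the column family must be configured to retain more than one version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("notes"))) { // hypothetical table
            byte[] row = Bytes.toBytes("row-1");
            byte[] fam = Bytes.toBytes("d");
            byte[] col = Bytes.toBytes("text");

            // Two puts to the same cell: the second does not overwrite in place,
            // it writes a new version with a newer timestamp.
            Put first = new Put(row);
            first.addColumn(fam, col, Bytes.toBytes("first"));
            table.put(first);
            Put second = new Put(row);
            second.addColumn(fam, col, Bytes.toBytes("second"));
            table.put(second);

            // Ask for up to two versions: both cells come back, newest first.
            Get get = new Get(row);
            get.setMaxVersions(2);
            for (Cell cell : table.get(get).rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```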
And you can't say that Hadoop relies on sequence files to provide parallelism. Parallelism is provided by Hadoop by virtue of its nature, i.e. being a distributed platform. You can handle almost any kind of file using Hadoop with almost equal parallelism. The only advantage of sequence files is that they are more suitable for MapReduce processing, as they already store data as key/value pairs.
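For example, writing a SequenceFile makes the key/value framing visible; a minimal sketch with an arbitrary path and types:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq"); // arbitrary path

        // Records are appended as (key, value) pairs; there is no way to
        // rewrite a pair in place later, only to append or rewrite the file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("row-1"), new IntWritable(1));
            writer.append(new Text("row-2"), new IntWritable(2));
        }
    }
}
```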
You have to take it with a pinch of salt, but frankly speaking, Hadoop doesn't understand updates. If you could elaborate on your use case a bit more, maybe I could suggest something better.

Hadoop/HBase performance improvement for bulk loading

I am loading 10 million records into an HBase table through the importtsv tool from a Hadoop multi-node cluster. Right now it takes 5 minutes for this task, but I was wondering how I could improve its performance. The importtsv tool does not seem to use reducers at all. I was wondering if I could somehow force it to use reducers, and whether that would improve performance. Any other way you think would improve performance would also be appreciated. Thank you.
Try importtsv with HFileOutputFormat and the completebulkload tool.
When it comes to performance, there is no easy answer. If the 5 minutes matches the speed of the network or the speed of the hard disk, you have to move the source data somewhere else or change the hardware.
I don't know importtsv. I would suggest you try a multi-way load; take a look at Sqoop.
You can get the best HBase bulk-load performance with HFileOutputFormat and CompleteBulkLoad.
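A hedged sketch of that flow with the HBase client and MapReduce APIs; the table name and paths are placeholders, and the mapper is elided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("mytable"); // placeholder table
        Path hfileDir = new Path("/tmp/hfiles");       // staging directory for HFiles

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(name);
             Admin admin = conn.getAdmin();
             RegionLocator locator = conn.getRegionLocator(name)) {

            Job job = Job.getInstance(conf, "hfile bulk load");
            job.setJarByClass(BulkLoadSketch.class);
            // ... set input path and a mapper emitting (ImmutableBytesWritable, Put) ...

            // Adds a reducer that sorts and writes region-aligned HFiles,
            // the step that makes bulk loading fast.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            FileOutputFormat.setOutputPath(job, hfileDir);

            if (job.waitForCompletion(true)) {
                // Moves the finished HFiles into the region servers; no write path,
                // no WAL, so it is much cheaper than millions of individual puts.
                new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
            }
        }
    }
}
```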

Getting close to real-time with Hadoop

I need some good references for using Hadoop for real-time systems, like searching with little response time. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop?
You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed systems foo? (Oh and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines).
Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at answering.
Take a look at HBase, which provides a structured, queryable datastore on top of HDFS, similar to Google's BigTable. http://hadoop.apache.org/hbase/
It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware to do just that: http://github.com/twitter/gizzard
Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
If you really, really need to do serious processing at query time, the way to do that is to run dedicated processes that do the specific kinds of computations you need, and use something like Thrift to send requests for computation and receive results back. Optimize them to have all the needed data in memory. The process that receives the query itself can then do nothing more than break the problem into pieces, send the pieces to compute nodes, and collect the results. This sounds like Hadoop, but is not, because it's built for computing specific problems over pre-loaded data rather than as a generic computation model for arbitrary computing.
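Stripped of the RPC layer, the scatter-gather shape described above looks roughly like this single-process sketch; in the real setup each task would be a Thrift call to a compute node holding its shard in memory:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGatherSketch {
    public static void main(String[] args) throws Exception {
        // Each array stands in for a shard of pre-loaded, in-memory data.
        List<int[]> shards = Arrays.asList(
            new int[]{1, 2, 3}, new int[]{4, 5}, new int[]{6, 7, 8, 9});

        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<Integer>> partials = new ArrayList<>();

        // Scatter: one shard per worker, mirroring one compute node each.
        for (int[] shard : shards) {
            partials.add(pool.submit(() -> {
                int sum = 0;
                for (int v : shard) sum += v; // stand-in for the real computation
                return sum;
            }));
        }

        // Gather: the front-end process only merges the partial results.
        int total = 0;
        for (Future<Integer> f : partials) total += f.get();
        System.out.println("result = " + total);
        pool.shutdown();
    }
}
```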
Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.
FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, set up a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, and so on.
This question is old, but it begs an answer. Even if there are millions of documents, if they are not changing in real time (like FAQ docs), Lucene + Solr for distribution should pretty much suffice. HathiTrust indexes billions of documents using the same combination.
It is a completely different problem if the index is changing in real time. Even Lucene will have problems dealing with updating its index, and you have to look at real-time search engines. There have been some attempts at reworking Lucene for real time, and maybe they will work. You can also look at HSearch, a real-time distributed search engine built on Hadoop and HBase, hosted at http://bizosyshsearch.sourceforge.net
