Is Hadoop the right tech for this? - hadoop

If I had millions of records of data, that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic and then take that matching subset and insert it into a separate database would I use Hadoop and MapReduce for such a task or is there some other technology I am missing? The main reason I am looking for something other than a standard RDMS is because all of the base data is from multiple sources and not uniformly structured.

Map-Reduce is designed for algorithms that can be parallelized and local results can be computed and aggregated. A typical example would be counting words in a document. You can split this up into multiple parts where you count some of the words on one node, some on another node, etc and then add up the totals (obviously this is a trivial example, but illustrates the type of problem).
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of having non-uniformly structured data, you might consider a NoSQL database, which is designed to handle data where a lot of a columns are null (such as MongoDB).

Hadoop/MR are designed for batch processing and not for real time processing. So, some other alternative like Twitter Storm, HStreaming has to be considered.
Also, look at Hama for real time processing of data. Note that real time processing in Hama is still crude and a lot of improvement/work has to be done.

I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.

If your data volumes are not great , and millions of records are not sounds as such I would suggest to try to get most from RDMBS, even if your schema will not be properly normalized.
I think even tavle of structure K1, K2, K3, Blob will be more useful t
In NoSQL KeyValue stores are built to support schemaless data in various flavors but their query capability are limited.
Only case I can think as usefull is MongoDB/ CoachDB capability to index schemaless data. You will be able to get records by some attribute value.
Regarding Hadoop MapReduce - i think it is not useful unless you want to harness a lot of CPUs for your processing or have a lot of data or need distributed sort capability.

Related

MapReduce for same task/different data

We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e., schema, but only carries a share of the data (and not the full data!).The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.
By distributing the data to multiple databases, we want to lower the amount of work processed by each database, and ultimately speed up search. At the end, we combine the results of each database to create the final results.
A friend of mine has recommended looking at MapReduce (Hadoop). In my opinion, map-reduce only makes sense if the single workers share the same data but perform different type of work on it (corresponds to multiple instruction, single data).
In our case, however, the workers should perform the same task, but perform that task on various data (corresponds to single instruction, multiple data).
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Yes.
I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., "same tables"), but different segments of that data. The parallel Map and Reduce tasks are the same tasks over different portions of the data. MapReduce is most definitely "single instruction, multiple data" from your definition.
Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:
Note that MapReduce is only really going to do batch analytics for you. Things like rollups and counts and aggregates. You won't be able to retrieve or search with MapReduce effectively. Also, updating data in Hadoop is not a typical way you want to do things-- you treat things as more "append only". For any of that, you'll probably want to look at HBase.
Hadoop's file system segments the data for you. From a file system perspective, it'll look like files in folders that contain CSV (or some other file format). Files get split up into blocks, which can then be operated on separately with map tasks. You won't have to manually shard the data like you are now.
Take a look at Hive. It's a abstraction layer on top of MapReduce that interprets a light version of SQL into MapReduce under the covers. It should allow you to convert some of your logic a bit easier.

what are the disadvantages of mapreduce?

What are the disadvantages of mapreduce? There are lots of advantages of mapreduce. But I would like to know the disadvantages of mapreduce too.
I would rather ask when mapreduce is not a suitable choice? I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where mapreduce is not a suitable choice :
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other(jobs run in isolation).
When your processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well are you using it. For example, you can't expect a MR job to give you the result in a couple of ms. You can't count it as its disadvantage either. It's just that you are using it at the wrong place. And it holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some usecases where MapReduce does not work very well.
When you need a response fast. e.g. say < few seconds (Use stream
processing, CEP etc instead)
Processing graphs
Complex algorithms e.g. some machine learning algorithms like SVM, and also see 13 drawfs
(The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again. e.g. KMeans - use Spark
When map phase generate too many keys. Thensorting takes for ever.
Joining two large data sets with complex conditions (equal case can
be handled via hashing etc)
Stateful operations - e.g. evaluate a state machine Cascading tasks
one after the other - using Hive, Big might help, but lot of overhead
rereading and parsing data.
You need to rethink/ rewrite trivial operations like Joins, Filter to achieve in map/reduce/Key/value patterns
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied with Java, of course you have Pig and Hive for rescue but you lose flexibility.
First of all, it streams the map output, if it is possible to keep it in memory this will be more efficient. I originally deployed my algorithm using MPI but when I scaled up some nodes started swapping, that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a hadoop book (Hadoop in action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies if you have too many files your Namenode could experience problems.
If you do not want to use the streaming API you have to write your program in the java language. I for example did a translation from C++. This has some side effects, for example Java has a large string overhead compared to C. Since my software is all about strings this is some sort of drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Pig vs Hive vs Native Map Reduce

I've basic understanding on what Pig, Hive abstractions are. But I don't have a clear idea on the scenarios that require Hive, Pig or native map reduce.
I went through few articles which basically points out that Hive is for structured processing and Pig is for unstructured processing. When do we need native map reduce? Can you point out few scenarios that can't be solved using Pig or Hive but in native map reduce?
Complex branching logic which has a lot of nested if .. else .. structures is easier and quicker to implement in Standard MapReduce, for processing structured data you could use Pangool, it also simplifies things like JOIN. Also Standard MapReduce gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it requires more time to code and introduce changes.
Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key), it is simpler to implement things like:
Get top N elements for each group;
Calculate total per each group and than put that total against each row in the group;
Use Bloom filters for JOIN optimisations;
Multiquery support (it is when PIG tries to minimise the number on MapReduce Jobs by doing more stuff in a single Job)
Hive is better suited for ad-hoc queries, but its main advantage is that it has engine that stores and partitions data. But its tables can be read from Pig or Standard MapReduce.
One more thing, Hive and Pig are not well suited to work with hierarchical data.
Short answer - We need MapReduce when we need very deep level and fine grained control on the way we want to process our data. Sometimes, it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
It should not be totally impossible to do, what you can using MapReduce, through Pig or Hive. With the level of flexibility provided by Pig and Hive you can somehow manage to achieve your goal, but it might be not that smooth. You could write UDFs or do something and achieve that.
There is no clear distinction as such among the usage of these tools. It totally depends on your particular use-case. Based on your data and the kind of processing you need to decide which tool fits into your requirements better.
Edit :
Sometime ago I had a use case wherein I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird. Some part of the data was EBCDIC encoded, while rest of the data was in binary format. It was basically a flat binary file with no delimiters like\n or something. I had a tough time finding some way to process these files using Pig or Hive. As a result I had to settle down with MR. Initially it took time, but gradually it became smoother as MR is really swift once you have the basic template ready with you.
So, like I said earlier it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig(just a foreach), but what if you need foreach n?? So, when you need "that" level of control over the way you need to process your data, MR is more suitable.
Another situation might be when you data is hierarchical rather than row-based or if your data is highly unstructured.
Metapatterns problem involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
And sometimes it is very very convenient to accomplish a particular task using some xyz tool as compared to do it using Pig/hive. IMHO, MR turns out to be better in such situations as well. For example if you need to do some statistical analyses on your BigData, R used with Hadoop streaming is probably the best option to go with.
HTH
Mapreduce:
Strengths:
works both on structured and unstructured data.
good for writing complex business logic.
Weakness:
long development type
hard to achieve join functionality
Hive :
Strengths:
less development time.
suitable for adhoc analysis.
easy for joins
Weakness :
not easy for complex business logic.
deals only structured data.
Pig
Strengths :
Structured and unstructured data.
joins are easily written.
Weakness:
new language to learn.
converted into mapreduce.
Hive
Pros:
Sql like
Data-base guys love that.
Good support for structured data.
Currently support database schema and views like structure
Support concurrent multi users, multi session scenarios.
Bigger community support. Hive , Hiver server , Hiver Server2, Impala ,Centry already
Cons:
Performance degrades as data grows bigger not much to do, memory over flow issues. cant do much with it.
Hierarchical data is a challenge.
Un-structured data requires udf like component
Combination of multiple techniques could be a nightmare dynamic portions with UTDF in case of big data etc
Pig:
Pros:
Great script based data flow language.
Cons:
Un-structured data requires udf like component
Not a big community support
MapReudce:
Pros:
Dont agree with "hard to achieve join functionality", if you understand what kind of join you want to implement you can implement with few lines of code.
Most of the times MR yields better performance.
MR support for hierarchical data is great especially implement tree like structures.
Better control at partitioning / indexing the data.
Job chaining.
Cons:
Need to know api very well to get a better performance etc
Code / debug / maintain
Scenarios where Hadoop Map Reduce is preferred to Hive or PIG
When you need definite driver program control
Whenever the job requires implementing a custom Partitioner
If there already exists pre-defined library of Java Mappers or Reducers for a job
If you require good amount of testability when combining lots of large data sets
If the application demands legacy code requirements that command physical structure
If the job requires optimization at a particular stage of processing by making the best use of tricks like in-mapper combining
If the job has some tricky usage of distributed cache (replicated join), cross products, groupings or joins
Pros of Pig/Hive :
Hadoop MapReduce requires more development effort than Pig and Hive.
Pig and Hive coding approaches are slower than a fully tuned Hadoop MapReduce program.
When using Pig and Hive for executing jobs, Hadoop developers need not worry about any version mismatch.
There is very limited possibility for the developer to write java level bugs when coding in Pig or Hive.
Have a look at this post for Pig Vs Hive comparison.
All the things which we can do using PIG and HIVE can be achieved using MR (sometimes it will be time consuming though). PIG and HIVE uses MR/SPARK/TEZ underneath. So all the things which MR can do may or may not be possible in Hive and PIG.
Here is the great comparison.
It specifies all the use case scenarios.

cassandra and hadoop - realtime vs batch

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop---Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx
Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.
What are the differences in the architecture/implementation of Cassandra and Hadoop which account for this sort of difference in usage. (in lay software professional terms)
I wanted to add, because I think there might be a misleading statement here saying Cassandra might perform good for reads.
Cassandra is not very good at random reads either, it's good compared to other solutions out there in how can you read randomly over a huge amount of data, but at some point if the reads are truly random you can't avoid hitting the disk every single time which is expensive, and it may come down to something useless like a few thousand hits/second depending on your cluster, so planning on doing lots of random queries might not be the best, you'll run into a wall if you start thinking like that. I'd say everything in big data works better when you do sequential reads or find a way to sequentially store them. Most cases even when you do real time processing you still want to find a way to batch your queries.
This is why you need to think beforehand what you store under a key and try to get the most information possible out of a read.
It's also kind of funny that statement says transaction and Cassandra in the same sentence, cause that really doesn't happen.
On the other hand hadoop is meant to be batch almost by definition, but hadoop is a distributed map reduce framework, not a db, in fact, I've seen and used lots of hadoop over cassandra, they're not antagonistic technologies.
Handling your big data in real time is doable but requires good thinking and care about when and how you hit the database.
Edit: Removed secondary indices example, as last time I checked that used random reads (though I've been away from Cassandra for more than a year now).
The Vanilla hadoop consists of a Distributed File System (DFS) at the core and libraries to support Map Reduce model to write programs to do analysis. DFS is what enables Hadoop to be scalable. It takes care of chunking data into multiple nodes in a multi node cluster so that Map Reduce can work on individual chunks of data available nodes thus enabling parallelism.
The paper for Google File System which was the basis for Hadoop Distributed File System (HDFS) can be found here
The paper for Map Reduce model can be found here
For a detailed explanation on Map Reduce read this post
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional database but is more like Hashtable or HashMap which stores a key/value pair. Cassandra works on top of HDFS and makes use of it to scale. Both Cassandra and HBase are implementations of Google's BigTable. Paper for Google BigTable can be found here.
BigTable makes use of a String Sorted Table (SSTable) to store key/value pairs. SSTable is just a File in HDFS which stores key followed by value. Furthermore BigTable maintains a index which has key and offset in the File for that key which enables reading of value for that key using only a seek to the offset location. SSTable is effectively immutable which means after creating the File there is no modifications can be done to existing key/value pairs. New key/value pairs are appended to the file. Update and Delete of records are appended to the file, update with a newer key/value and deletion with a key and tombstone value. Duplicate keys are allowed in this file for SSTable. The index is also modified with whenever update or delete take place so that offset for that key points to the latest value or tombstone value.
Thus you can see Cassandra's internal allow fast read/write which is crucial for real time data handling. Whereas Vanilla Hadoop with Map Reduce can be used to process batch oriented passive data.
Hadoop consists of two fundamental components: distributed datastore (HDFS) and distributed computation framework (MapReduce). It reads a bunch of input data then writes output from/to the datastore. It needs distributed datastore since it performs parallel computing with the local data on cluster of machines to minimize the data loading time.
While Cassandra is the datastore with linear scalability and fault-tolerance ability. It lacks of the parallel computation ability provided by MapReduce in Hadoop.
The default datastore (HDFS) of Hadoop can be replaced with other storage backend, such as Cassandra, Glusterfs, Ceph, Amazon S3, Microsoft Azure's file system, MapR’s FS, and etc. However, each alternatives has its pros and cons, they should be evaluated based on the needs.
There are some resources that help you integrate Hadoop with Cassandra: http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configHadoop.html

Using Hadoop & related projects to analyze usage patterns that constantly change

We're strategizing on how to analyze user "interest" (clicks, likes, etc) on 1M+ items on our site to generate a "similar items" list.
In order to process a large amount of raw data we're learning about Hadoop, Hive, and related projects.
My question is regarding this concern: Hadoop/Hive and the like seem to be geared more towards data dumps, followed by processing cycles. Presumably the end of the processing cycle is something to the extend of an indexed graph of links between related items.
If I'm on track so far, how is data typically processed in these scenarios: I.e.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Do we stream data as it comes in, analyze it and update the data store?
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Is this use case better addressed by Cassandra than Hive/HDFS?
I'm looking to better understand the common approach to this kind of big data processing.
I think this is a good use case for Hadoop family of tools.
It looks to me like HDFS and Flume might be obvious choices, I would look into either HBase or Hive depending on what kinds of analysis you are interested in, how flexible you are in organizing the data
and querying it.
Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Answer: Hadoop is very good for this. I would use HBase for this, but there are other choices.
Do we stream data as it comes in, analyze it and update the data store?
Answer: Flume is good for this.
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Answer: You have options to do both. Bulk would probably be a MapReduce job on HDFS where piece-by-piece could be managed through HBase column-family values or Hive rows. If you give more details, I could be more precise.
Is this use case better addressed by Cassandra than Hive/HDFS?
Answer: Cassandra and HBase are both implementations of Google's BigTable. I think that choice depends on
how do you need to organize, access, analyze and update data. I can provide more guidance if needed.
HBase is usually better for semi-structured, high R/W processing.
DHFS is generally good choice for flexible, scalable storage of data dumps as you call them.
Flume is applicable for moving streaming data.
I would also consider looking into Titan and HBase if you are thinking graph.
Hive would be applicable if you are interested in tabular-oriented data and using SQL-like queries.

Resources