where is data stored in hadoop

Though I have understood the architecture of Hadoop a bit, I have a gap in my understanding of where exactly the data is located.
My question is: suppose I have a large amount of data from some random books. Is the book data stored across multiple nodes in HDFS beforehand, and we then run MapReduce on each node and collect the result on our system?
OR
Do we store the data somewhere in a large database, and whenever we want to run a MapReduce operation, take chunks of it and store them on multiple nodes to perform the operation?

Either is possible; it really depends on your use case and needs. Generally, however, Hadoop MapReduce runs against data stored in HDFS. The system is designed around data locality, which requires the data to be in HDFS: map tasks run on the same piece of hardware where the data is stored in order to improve performance.
That said, if for some reason your data must be stored outside of HDFS and then processed using MapReduce, it can be done, but it is a bit more work and not as efficient as processing data locally in HDFS.
So let's take two use cases. Start with log files. Log files as they are are not particularly accessible; they just need to be stuck somewhere and stored for later analysis. HDFS is perfect for this. If you really need a log back out you can get it, but generally people will be looking for the output of the analytics. So store your logs in HDFS and process them normally.
However, data in the format ideal for HDFS and Hadoop MapReduce (many records in a single large flat file) is not what I would consider highly accessible. Hadoop MapReduce expects input files that are many megabytes in size with many records per file; the more you diverge from this, the more your performance will decline. Sometimes your data needs to be online at all times, and HDFS is not ideal for that. Take your book example: if these books are used in an application that needs the content accessible in an online fashion, i.e. for editing and annotating, you may choose to store them in a database. Then, when you need to run batch analytics, you use a custom InputFormat to retrieve the records from the database and process them in MapReduce.
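Hadoop actually ships with a DBInputFormat for this pattern. Below is a rough sketch of what such a job could look like; the table, columns, JDBC connection details, and the word-count map logic are all made up for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BooksFromDbJob {

    // One row from the hypothetical "books" table.
    public static class BookRecord implements Writable, DBWritable {
        long id;
        String title;
        String body;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong(1); title = rs.getString(2); body = rs.getString(3);
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id); ps.setString(2, title); ps.setString(3, body);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong(); title = in.readUTF(); body = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id); out.writeUTF(title); out.writeUTF(body);
        }
    }

    // Trivial map logic for the sketch: emit a word count per book title.
    public static class BookMapper extends Mapper<LongWritable, BookRecord, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, BookRecord book, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(book.title), new LongWritable(book.body.split("\\s+").length));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical JDBC connection details.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost:3306/library", "user", "password");

        Job job = Job.getInstance(conf, "analyze-books-from-db");
        job.setJarByClass(BooksFromDbJob.class);

        // Each map task pulls a slice of the query result instead of an HDFS block.
        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, BookRecord.class,
                "books", null, "id",          // table, WHERE conditions, ORDER BY split column
                "id", "title", "body");       // columns to read

        job.setMapperClass(BookMapper.class);
        job.setNumReduceTasks(0);             // map-only for the sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/results/book-word-counts"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```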
I am currently doing this with a web crawler that stores web pages individually in Amazon S3. Individual web pages are too small to serve as efficient inputs to MapReduce, so I have a custom InputFormat that feeds each mapper several files. The output of this MapReduce job is eventually written back to S3, and because I am using Amazon EMR, the Hadoop cluster goes away when the job completes.
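For a rough idea of what that looks like, here is a sketch using the stock CombineTextInputFormat, which groups many small files into each split in a similar spirit to my custom InputFormat (bucket names and split size are hypothetical; the S3A connector and credentials must already be configured):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawledPagesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(CrawledPagesJob.class);

        // Pack many small S3 objects into each split so one mapper processes several files.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB per mapper

        // Identity mapper, no reducers: the point here is only the input wiring.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Hypothetical bucket and prefixes; output goes back to S3 as in the EMR setup described above.
        FileInputFormat.addInputPath(job, new Path("s3a://my-crawl-bucket/pages/"));
        FileOutputFormat.setOutputPath(job, new Path("s3a://my-crawl-bucket/analysis-output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```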

Related

sensor data with SAP HANA and Hadoop/HDFS

I would like to save sensor data in a suitable database.
I have 100,000 writes every minute, each write about 100 bytes in size.
I also want to do analytics on the data.
I thought about Hadoop, because it has many different frameworks for analyzing the data (e.g. Apache Spark).
Now my problem:
HBase, a NoSQL database, would be a suitable solution because it has a column-family data model for accessing large columns. But it runs on top of HDFS.
HDFS has a 64 MB block size. What does that mean for me if my records are only 100 bytes?
I would also like to run machine learning on top of Hadoop. Would it be possible to use HBase and SAP HANA together? (SAP HANA works with Hadoop.)
Let me try to address your points step by step:
I would like to save sensor data in a suitable database.
I would suggest something like OpenTSDB running on HBase here, since you also want to run a Hadoop cluster anyhow.
I have 100,000 writes every minute, each write about 100 bytes in size.
As you correctly point out, small messages/files are an issue for HDFS. Not so for HBase though (the block size is abstracted away by HBase, no need to adjust it for the underlying HDFS).
A solution like OpenTSDB on HBase, or pure HBase, will work just fine for this load.
That said, since you apparently want to access your data via HBase and also SAP HANA (which will probably require aggregating measurements from many 100-byte records into larger files, because that is where the HDFS block size comes into play), I would suggest handling incoming data via Kafka first and then reading from Kafka into raw HDFS (in some form compatible with HANA) and into HBase via separate Kafka consumers.
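As a rough sketch of the HBase-writing consumer in that fan-out (topic, table, and column names are hypothetical; a second consumer group, not shown, would produce the HDFS/HANA-friendly files):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorHBaseWriter {
    public static void main(String[] args) throws Exception {
        // Kafka consumer in its own consumer group; another group can feed HDFS/HANA independently.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "hbase-writer");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
             Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = hbase.getTable(TableName.valueOf("sensor_readings"))) {

            consumer.subscribe(Collections.singletonList("sensor-data"));

            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> rec : records) {
                    // Row key: sensor id + timestamp keeps readings for one sensor together.
                    Put put = new Put(Bytes.toBytes(rec.key() + ":" + rec.timestamp()));
                    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("raw"), rec.value());
                    table.put(put);
                }
            }
        }
    }
}
```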
Would it be possible to use HBase and SAP HANA together?
See the explanation above; Kafka (or a similar distributed queue) is what you want for ingesting a stream of small messages into multiple stores, in my opinion.
HDFS has a 64 MB block size. What does that mean for me if my records are only 100 bytes?
It doesn't matter to HBase.
It doesn't matter to Kafka either (at least not at your throughput and message sizes :)).
Raw HDFS storage will require you to manually aggregate those 100-byte messages into larger files (Avro may be helpful for you here).
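For example, here is a minimal sketch of batching many small readings into one Avro container file (the record schema is made up for illustration); the resulting multi-megabyte files are what you would then copy into HDFS:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SensorBatchWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical record layout for one sensor reading.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
          + "{\"name\":\"sensorId\",\"type\":\"string\"},"
          + "{\"name\":\"timestamp\",\"type\":\"long\"},"
          + "{\"name\":\"payload\",\"type\":\"bytes\"}]}");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // One container file holds many small readings; ship it to HDFS once it is big enough.
            writer.create(schema, new File("readings-batch-0001.avro"));

            for (int i = 0; i < 100_000; i++) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("sensorId", "sensor-" + (i % 100));
                rec.put("timestamp", System.currentTimeMillis());
                rec.put("payload", java.nio.ByteBuffer.wrap(new byte[100])); // 100-byte measurement
                writer.append(rec);
            }
        }
    }
}
```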
I would also like to run machine learning on top of Hadoop.
Not an issue. HDFS is a distributed system, so you can scale up for more performance and add a machine-learning solution based on Spark, or whatever else you want to run on top of Hadoop, at any time. Worst case you will have to add another machine to your cluster, but there is no hard limit on the number of things you can run simultaneously on your data once it is stored in HDFS and your cluster is powerful enough.

doubts regarding migration to big data

I have a few doubts regarding Hadoop.
In one of the videos published by Cloudera, an instructor said that Hadoop has HDFS, where every file is stored as a set of chunks or blocks. Each block is replicated three times on different machines to minimize the chance of failure, and each mapper processes a single HDFS block.
From this I take the following scenario: I have a server with some 100 petabytes of logs that are stored in a traditional file system, not in HDFS.
Main doubt 1: If I want to analyze this huge data efficiently using MapReduce, do I have to transfer the data to a new server running HDFS, one with three times the storage of the old server?
In another video, also published by Cloudera, the instructor mentioned clearly that we don't need to migrate the traditional system to a new system; we can use Hadoop and MapReduce on top of it. This slightly contradicts the statement in the first point.
Main doubt 2: Let's assume the statement in point 2 is true. How is this possible? That is, how can we apply Hadoop and MapReduce to a traditional file system where there is no replication of blocks and no NameNode or DataNode daemon on each machine?
My main task is to facilitate fast analysis of a huge amount of logs that are currently not stored in HDFS. Will I need a new server for this or not?
P.S.: I need some good tutorials, books, or articles that could give me in-depth knowledge of big data so that I can start working on it.
Recommendations are most welcome.
Hadoop is just an infrastructure for running a MapReduce-style workload (for "big data" or "analytics") atop a cluster of servers.
You can use HDFS for data sharing across the nodes, then use Hadoop's built in workload management to distribute work to nodes where the data is stored. This is sometimes called "function shipping."
But it's also possible to not use HDFS. You can use another network file sharing / distribution mechanism. FTP (file copies), S3 (access from the Amazon Web Services cloud), and a variety of other clustered/distributed file systems are supported by various vendors/platforms. Some of these move the data to the system on which workload is being done ("data shipping").
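As a small illustration of the non-HDFS case, here is a minimal sketch assuming the S3A connector and credentials are configured (the bucket name is hypothetical); the job reads from and writes to S3 without the data ever living in HDFS:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogsFromS3Job {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(LogsFromS3Job.class);

        // Default (identity) mapper and no reducers: just a skeleton to show the storage wiring.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input is read straight from S3 ("data shipping"); output goes back to S3 as well.
        FileInputFormat.addInputPath(job, new Path("s3a://example-log-archive/2016/"));
        FileOutputFormat.setOutputPath(job, new Path("s3a://example-log-archive/processed/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```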
Which storage strategy is appropriate, efficient, and performant is a big question, and depends greatly on your infrastructure and your MapReduce app's data access patterns. In general, however, analytics jobs are resource hungry, so only small analytics apps tend to run on servers doing other work (the "original systems"). Processing "big data" therefore does tend to call for new servers (if not ones you buy, then ones you rent temporarily from a cloud service like AWS, Rackspace, etc.), and for data streamed from replicas/clones of data captured in production ("secondary storage") rather than data still resident on "primary storage."
If you're just starting out with small or modest apps, you might be able to access data in-place, directly from existing systems. But if you've got 100 PB of logs, you're going to want that processed on systems devoted to the task.

MapReduce for same task/different data

We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e. the same schema, but only carries a share of the data (not the full data!). The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.
By distributing the data to multiple databases, we want to lower the amount of work processed by each database, and ultimately speed up search. At the end, we combine the results of each database to create the final results.
A friend of mine has recommended looking at MapReduce (Hadoop). In my opinion, map/reduce only makes sense if the individual workers share the same data but perform different types of work on it (corresponding to multiple instruction, single data).
In our case, however, the workers should perform the same task, but perform that task on different data (corresponding to single instruction, multiple data).
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Yes.
I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., "same tables"), but on different segments of that data. The parallel map and reduce tasks are the same tasks running over different portions of the data. MapReduce is most definitely "single instruction, multiple data" by your definition.
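To make that concrete, here is a minimal sketch of a job with that shape; the CSV layout and the per-row calculation (a simple multiplication standing in for your 100 calculations) are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PerCustomerAggregation {

    // The SAME map code runs on every split of the data (single instruction, multiple data).
    public static class RowMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Hypothetical CSV layout: customerId,amount,...
            String[] fields = line.toString().split(",");
            double perRowResult = Double.parseDouble(fields[1]) * 1.19; // stand-in calculation
            ctx.write(new Text(fields[0]), new DoubleWritable(perRowResult));
        }
    }

    // The reducer plays the role of "combining the results of each database".
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text customer, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            ctx.write(customer, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(PerCustomerAggregation.class);
        job.setMapperClass(RowMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```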
Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:
Note that MapReduce is only really going to do batch analytics for you: things like rollups, counts, and aggregates. You won't be able to retrieve or search with MapReduce effectively. Also, updating data is not a typical thing to do in Hadoop; you treat data as more "append only". For any of that, you'll probably want to look at HBase.
Hadoop's file system segments the data for you. From a file system perspective, it'll look like files in folders that contain CSV (or some other file format). Files get split up into blocks, which can then be operated on separately with map tasks. You won't have to manually shard the data like you are now.
Take a look at Hive. It's an abstraction layer on top of MapReduce that translates a lightweight dialect of SQL into MapReduce under the covers. It should make it a bit easier to convert some of your logic.

cassandra and hadoop - realtime vs batch

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop---Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx
Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.
What are the differences in the architecture/implementation of Cassandra and Hadoop that account for this sort of difference in usage (in lay software-professional terms)?
I wanted to add to this, because I think there is a potentially misleading statement here suggesting that Cassandra performs well for reads.
Cassandra is not very good at random reads either. It is good, compared to other solutions out there, at reading randomly over a huge amount of data, but at some point, if the reads are truly random, you can't avoid hitting the disk every single time, which is expensive; it may come down to something as low as a few thousand hits per second, depending on your cluster. So planning on doing lots of random queries might not be the best idea; you'll run into a wall if you start thinking like that. I'd say everything in big data works better when you do sequential reads, or find a way to store the data sequentially. In most cases, even when you do real-time processing, you still want to find a way to batch your queries.
This is why you need to think beforehand about what you store under a key and try to get the most information possible out of a single read.
It's also kind of funny that the statement uses "transaction" and "Cassandra" in the same sentence, because that really doesn't happen.
On the other hand, Hadoop is meant to be batch-oriented almost by definition. But Hadoop is a distributed MapReduce framework, not a database; in fact, I've seen and used a lot of Hadoop on top of Cassandra. They are not antagonistic technologies.
Handling your big data in real time is doable but requires good thinking and care about when and how you hit the database.
Edit: Removed secondary indices example, as last time I checked that used random reads (though I've been away from Cassandra for more than a year now).
Vanilla Hadoop consists of a distributed file system (DFS) at its core and libraries supporting the MapReduce model for writing analysis programs. The DFS is what makes Hadoop scalable: it takes care of chunking data across the nodes of a multi-node cluster so that MapReduce can work on the individual chunks on the nodes where they are available, thus enabling parallelism.
The paper on the Google File System, which was the basis for the Hadoop Distributed File System (HDFS), can be found here.
The paper on the MapReduce model can be found here.
For a detailed explanation of MapReduce, read this post.
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional database but is more like a Hashtable or HashMap that stores key/value pairs. Unlike HBase, Cassandra does not run on top of HDFS; it manages its own storage on each node's local disk and scales by distributing data across nodes. HBase is an implementation of Google's Bigtable, and Cassandra borrows Bigtable's data model (combined with the distribution design of Amazon's Dynamo). The paper for Google Bigtable can be found here.
Bigtable makes use of a Sorted String Table (SSTable) to store key/value pairs. An SSTable is essentially a file that stores keys followed by their values (in Cassandra's case on the node's local disk rather than in HDFS). Alongside it, an index maps each key to its offset in the file, which allows the value for a key to be read with a single seek to that offset. An SSTable is effectively immutable: after the file is created, existing key/value pairs are never modified in place. New key/value pairs are appended, and updates and deletes are handled as appends as well, an update with a newer value for the key and a delete with the key and a tombstone value. Duplicate keys are therefore allowed in the file, and the index is adjusted whenever an update or delete takes place so that the offset for a key points to its latest value or tombstone.
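As a toy illustration of that idea (this is not Cassandra's actual code, just the append-only-file-plus-index pattern in miniature):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Toy append-only key/value file with an in-memory index, in the spirit of an SSTable.
public class TinySSTable implements AutoCloseable {
    private static final String TOMBSTONE = "\u0000TOMBSTONE\u0000";

    private final RandomAccessFile file;
    private final Map<String, Long> index = new HashMap<>(); // key -> offset of latest entry

    public TinySSTable(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Writes (and updates) only ever append; the index is repointed to the newest offset.
    public void put(String key, String value) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.writeUTF(key);
        file.writeUTF(value);
        index.put(key, offset);
    }

    // A delete is just an append of a tombstone value.
    public void delete(String key) throws IOException {
        put(key, TOMBSTONE);
    }

    // Reads need only one seek, straight to the offset recorded in the index.
    public String get(String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) return null;
        file.seek(offset);
        file.readUTF();                 // skip the key
        String value = file.readUTF();
        return TOMBSTONE.equals(value) ? null : value;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        try (TinySSTable store = new TinySSTable("tiny-sstable.dat")) {
            store.put("user:1", "alice");
            store.put("user:1", "alice-v2");         // update appends; old value stays in the file
            store.delete("user:2");                  // delete appends a tombstone
            System.out.println(store.get("user:1")); // alice-v2
            System.out.println(store.get("user:2")); // null
        }
    }
}
```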
Thus you can see that Cassandra's internals allow fast reads and writes, which is crucial for handling data in real time, whereas vanilla Hadoop with MapReduce is suited to processing batch-oriented, passive data.
Hadoop consists of two fundamental components: a distributed datastore (HDFS) and a distributed computation framework (MapReduce). It reads a bunch of input data from the datastore and writes its output back to it. It needs a distributed datastore because it performs parallel computation on the data local to each machine in the cluster, to minimize data-loading time.
Cassandra, by contrast, is a datastore with linear scalability and fault tolerance. It lacks the parallel computation ability provided by MapReduce in Hadoop.
Hadoop's default datastore (HDFS) can be replaced with other storage backends, such as Cassandra, GlusterFS, Ceph, Amazon S3, Microsoft Azure's file system, MapR's FS, etc. However, each alternative has its pros and cons, and they should be evaluated based on your needs.
There are some resources that help you integrate Hadoop with Cassandra: http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configHadoop.html

Difference between MapReduce of data store like couchdb and that of Hadoop?

Recently, in a webinar by Couchbase, they said that Hadoop should be used for processing large log files and Couchbase for presenting the results to the application layer. They claimed that the map/reduce of Couchbase and that of Hadoop are different and suited to the respective use cases mentioned.
I was going to use Couchbase map/reduce for processing a large amount of log files.
Can someone please clarify the exact difference between the two map/reduce implementations? Are there any features in Hadoop that make it more suitable for processing large log files?
Thanks...
The main difference is that Couchbase uses incremental map/reduce and won't rescan the whole data set once you update or remove items. Another difference is the magnitude of "large": if you need to process hundreds of gigabytes of logs in one go, then Couchbase isn't the best choice.
Couchbase is one of many NoSQL data stores. Data is stored as key/value pairs, with the keys indexed for quick retrieval.
Conversely, data in Hadoop is not indexed (other than by file name), and pulling a specific value from a file in HDFS is much slower, possibly involving a scan of many files.
You would typically use something like Hadoop MapReduce to process large files and then update/populate a NoSQL store (such as Couchbase).
Using a NoSQL datastore for processing large amounts of data will most probably be less efficient than using MapReduce to do the same job. But the NoSQL datastore will be able to serve a web layer considerably more efficiently than a MapReduce job (which can take tens of seconds to initialize and minutes or hours to run).
