Map Reduce : Which is the underlying Data Structure used - hadoop

I am wondering that if such a large datasets are used in Hadoop Map Reduce then what are the data structures which are used by hadoop. If possible please somebody provide me a detail view of underlying data structures in hadoop.

HDFS is the default underlying storage platform of Hadoop.
Its like any other file system in the sense that - it does not care what structure the files have. It only ensures that files will be saved in a redundant fashion and available for retrieval quickly.
So it is totally upto you the user, to store files with whatever structure you like inside them.
A Map Reduce program simply gets the file data fed to it as an input. Not necessarily the entire file, but parts of it depending on InputFormats etc. The Map program then can make
use of the data in whatever way it wants to.
'Hive' - on the other hand deals with TABLES (columns/rows). And you can query them in a SQL like fashion using Hive-QL.

Thanks to all of you
I got the answer of my question. The underlying HDFS uses block as a storing units a detail description of which is mentioned in the following book and networking streaming concepts.
All the details are available in the third chapter of Hadoop: The Definitive Guide.

Related

Data format and database choices Spark/hadoop

I am working on structured data (one value per field, the same fields for each row) that I have to put in a NoSql environment with Spark (as analysing tool) and Hadoop. Though, I am wondering what format to use. i was thinking about json or csv but I'm not sure. What do you think and why? I don't have enough experience in this field to properly decide.
2nd question : I have to analyse these data (stored in an HDFS). So, as far as I know I have two possibilities to query them (before the analysis):
direct reading and filtering. i mean that it can be done with Spark, for exemple:
data = sqlCtxt.read.json(path_data)
Use Hbase/Hive to properly make a query and then process the data.
So, I don't know what is the standard way of doing all this and above all, what will be the fastest.
Thank you by advance!
Use Parquet. I'm not sure about CSV but definitely don't use JSON. My personal experience using JSON with spark was extremely, extremely slow to read from storage, after switching to Parquet my read times were much faster (e.g. some small files took minutes to load in compressed JSON, now they take less than a second to load in compressed Parquet).
On top of improving read speeds, compressed parquet can be partitioned by spark when reading, whereas compressed JSON cannot. What this means is that Parquet can be loaded onto multiple cluster workers, whereas JSON will just be read onto a single node with 1 partition. This isn't a good idea if your files are large and you'll get Out Of Memory Exceptions. It also won't parallelise your computations, so you'll be executing on one node. This isn't the 'Sparky' way of doing things.
Final point: you can use SparkSQL to execute queries on stored parquet files, without having to read them into dataframes first. Very handy.
Hope this helps :)

Big Data - Lambda Architecture and Storing Raw Data

Currently I am using cassandra for storing data for my functional use cases (display time-series and consolidated data to users). Cassandra is very good at it, if you design correctly your data model (query driven)
Basically, data are ingested from RabbitMQ by Storm and save to Cassandra
Lambda architecture is just a design-pattern for big-data architect and technology independent, the layers can be combined :
Cassandra is a database that can be used as serving layer & batch layer : I'm using it for my analytics purpose with spark too (because data are already well formatted, like time-series, in cassandra)
As far as I know, one huge thing to consider is STORING your raw data before any processing. You need to do this in order to recover for any problem, human-based (algorithm problem, DROP TABLE in PROD, stuff like that this can happen..) or for future use or mainly for batch aggregation
And here I'm facing a choice :
Currently I'm storing it in cassandra, but i'm consider switching storing the raw data in HDFS for different reason : raw data are "dead", using cassandra token, using resource (mainly disk space) in cassandra cluster.
Can someone help me in that choice ?
HDFS makes perfect sense. Some considerations :
Serialization of data - Use ORC/ Parquet or AVRO if format is variable
Compression of data - Always compress
HDFS does not like too many small files - In case of streaming have a job which aggregates & write single large file on a regular interval
Have a good partitioning scheme so you can get to data you want on HDFS without wasting resources
hdfs is better idea for binary files. Cassandra is o.k. for storing locations where the files are etc etc but just pure files need to be modelled really really well so most of the people just give up on cassandra and complain that it sucks. It still can be done, if you want to do it there are some examples like:
https://academy.datastax.com/resources/datastax-reference-application-killrvideo
that might help you to get started.
Also the question is more material for quora or even http://www.mail-archive.com/user#cassandra.apache.org/ this question has been asked there a lot of time.

Can we store relational data in hdfs

I am trying to convert a application that have relational database as backend. Can I store the data relationaly in HDFS as well?
Just for the sake of storing, you can store anything in HDFS. But that won't make any sense. First of all, you should not think of Hadoop as a replacement to your RDBMS(which you are trying to do here). Both are meant for totally different purposes. Hadoop is not a good fit for your transactional, relational or real-time kind of needs. It was meant to serve your offline batch processing needs. So, it's better to analyze your use case properly and then freeze your decision.
As a suggestion I would like to point you to Hive. It provides you warehousing capabilities on top of your existing Hadoop cluster. It also provides an SQL like interface to your warehouse, which will make your life much easier if you are coming from SQL background. But again, Hive is also a batch processing system and is not a good fit if you need something real time.
You can have a look at HBase though, as suggested by abhinav. It's a DB that can run on top of your Hadoop cluster and provides you random, real time read/write access to your data. But you should keep 1 thing in mind that it's a NoSQL db. It doesn't follow the SQL terminologies and conventions. So, you might find it a bit alien initially. You might have to think about issues like how to store your data in a new storage style(columnar) unlike the row style storage of your RDBMS. Otherwise it's not a problem to setup and use it.
HTH
Any file can be stored in HDFS. But if you want an SQL type DB you should go for HBASE. If you directly store your data into HDFS you will not be able to store rationality.

Modeling Data in Hadoop

Currently I am bringing into Hadoop around 10 tables from an EDW (Enterprise Data Warehouse), these tables are closely related to a Star Schema model. I'm usig Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at what are some better ways to store these files before striking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at how might be some ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
Storing these files in csv is fine. Since you will be able to process these files using text output format and could also read it through hive using specific delimiter. You could change the delimiter if you do not like comma to pipe("|") that's what I do most of the time. Also you generally need to have large files in hadoop but if its large enough that you can partition these files and each file partition is in the size of few 100 gigs then it would be a good to partition these files into separate directory based on your partition column.
Also it would be better idea to have most of the columns in single table than having many normalized small tables. But that varies depending on your data size. Also make sure whenever you copy , move or create data you do all the constraint check on your applications as it will be difficult to make small changes in the table later on, you will need to modify the complete file for even small change.
Hive Partitioning and Bucketing concepts can be used to effectively used to put similar data together (not in nodes, but in files and folders) based on a particular column. Here are some nice tutorials for Partitioning and Bucketing.

Can HBase Access Text Documents and CSV Documents Just as Hadoop?

In Hadoop, I can easily create Map/Reduce apps which access and process data in huge text files and csv files. My question is can Hbase do the same and access such huge files, or HBase has other uses?
Hbase runs queries just as relational databases; so, I kind of have a hard time to understand the advantage of HBase, unless it can access huge text and csv files just as Hadoop does.
First of all Hbase is just a store. And a store never accesses anything. Rather you access the store to fetch or put the data. Like any other datastore Hbase has only one job to do, store your data and make it available to you whenever you need it. You can write MapReduce jobs or sequential Java programs etc etc to put data into Hbase or fetch data from it. It's totally upto you which path you prefer.
Coming to the second part of your question, Hbase never ever works like traditional relational databases. Everything, starting from storing the data to accessing the data, is totally different. The advantage of using Hbase is that you can store really really huge amount of data into it and have random read/write access. The data can be of any type viz. text, csv, tsv, binary etc etc. But, before going ahead, you must think well whether Hbase is a suitable choice for you or not, as one size doesn't fit all.
HTH

Resources