What does I/O variance mean for Cloud Storage vs. HDFS? - hadoop

According to Google Cloud documentation,
Cloud Storage may increase I/O variance. In many situations, Cloud Storage has a higher I/O variance than HDFS. This can be problematic if you have consistent I/O requirements, such as an application backed by HBase or another NoSQL database.
Can someone provide an example of variance? What does it mean?

It means the rate of access is highly variable rather than staying consistently within a threshold.
Any cloud object storage will have higher variance than a dedicated filesystem because it is multi-tenant.
The comment, in general, is saying you shouldn't use GCS as the backend for a database (specifically HBase, Kudu, or other Hadoop databases), as it will add latency to these typically low-latency systems, even though GCS can be used as the storage layer in place of HDFS.

You can translate I/O to reads (input) and writes (output).
When read and write rates fit in a narrow range, resource planning is easy, from network bandwidth to RAM and CPU.
As I/O variance increases, the range of read and write rates widens, and deploying an optimized infrastructure becomes harder.
As @OneCricketeer mentioned, the higher variance in cloud storage can be a result of multi-tenancy and of the large diversity of services deployed on it.
One way to mitigate this problem is to divide services into multiple I/O variance ranges (e.g., databases, object storage, file storage).
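To make "variance" concrete, here is a minimal sketch that compares two sets of read latencies. The sample values are invented for illustration and are not measurements of HDFS or Cloud Storage; the names `local_hdfs_ms`, `object_store_ms`, and `summarize` are hypothetical.

```python
import statistics

# Hypothetical read latencies in milliseconds (made-up numbers, not benchmarks).
local_hdfs_ms = [10, 11, 9, 10, 12, 10, 9, 11]
object_store_ms = [8, 35, 12, 90, 10, 15, 60, 9]

def summarize(samples):
    """Return mean, population variance, and worst case of latency samples."""
    return {
        "mean": statistics.mean(samples),
        "variance": statistics.pvariance(samples),
        "max": max(samples),
    }

hdfs_stats = summarize(local_hdfs_ms)
gcs_stats = summarize(object_store_ms)
print("dedicated filesystem:", hdfs_stats)
print("object storage:      ", gcs_stats)
```

The second series has some of the fastest individual reads, yet its spread is far wider. That spread, not the average, is what hurts a latency-sensitive database.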

Related

Why not leave data in HDFS?

Yesterday, while I was following my GCP courses on Coursera, they insisted that we should not leave data in HDFS. Instead, after a job finishes, we should copy the data to Cloud Storage and persist it there, and each time we want to launch a job, we should load the data into HDFS again and repeat the loop.
So my questions are the following:
Do we lose our data in HDFS if we turn off the Hadoop cluster?
Why should we not leave the data in HDFS?
Is it a pricing issue?
Thank you
The pros and cons of Cloud Storage vs. HDFS
Moving to Cloud Storage
1. The cons:
a. Cloud Storage may increase I/O variance.
b. Cloud Storage does not support file appends or truncates.
c. Cloud Storage is not POSIX-compliant.
d. Cloud Storage may not expose all file system information.
e. Cloud Storage may have greater request latency.
2. The pros:
a. Lower costs.
b. Separation of compute and storage.
c. Interoperability.
d. HDFS compatibility with equivalent (or better) performance.
e. High data availability.
f. No storage management overhead.
g. Quick startup.
h. Google IAM security.
i. Global consistency.

To hadoop or not to hadoop

We have data (not a lot at this point) that we want to transform/aggregate/pivot up the wazoo.
I had a look on the web, and all the answers to my questions point to Hadoop for being scalable, cheap to run (no SQL Server machine and license), fast (if you have a lot of data), and programmable (not little boxes that you drag around).
There is just one problem that I keep coming up against, namely 'Use Hadoop if you have more than 10 GB of data'.
We don't even have 1 GB of data at this stage. Is Hadoop still viable?
My other option is SSIS. We do use SSIS for some of our current ETL, but we don't have the resources for it, and putting a SQL Server in the cloud is going to cost too much, and don't even get me started on scalability cost and configuration.
Thanks
Your current data volume seems too low to justify moving to Hadoop. Enter the Hadoop ecosystem only if you are dealing with a huge volume of data (TB/year) and you expect the data volume to grow exponentially down the line.
Let me explain why I advise against Hadoop for such a low volume of data.
By default, Hadoop stores your files in 128 MB blocks, and during processing it also takes 128 MB chunks at a time to process in parallel. If your business requirement involves CPU-intensive processing, you can decrease the input chunk size below 128 MB. But by decreasing the amount of data each task processes, you increase the number of tasks and the number of I/O seeks (low-level block storage). In the end you might spend more resources on managing the tasks than the actual work takes. Hence, try to avoid distributed computing as a solution for your (low) data volume.
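A rough sketch of the trade-off described above, assuming one map task per input split and ignoring details of Hadoop's real split calculation. The helper `num_splits` is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # default block size mentioned in the answer

def num_splits(file_size_mb, split_size_mb=BLOCK_SIZE_MB):
    """Approximate number of map tasks for one file (one task per split)."""
    return max(1, math.ceil(file_size_mb / split_size_mb))

one_gb_mb = 1024
ten_tb_mb = 10 * 1024 * 1024

print("1 GB at 128 MB splits:", num_splits(one_gb_mb))
print("1 GB at 16 MB splits: ", num_splits(one_gb_mb, 16))
print("10 TB at 128 MB splits:", num_splits(ten_tb_mb))
```

With 1 GB of input there are only a handful of tasks to run in parallel at the default size. Shrinking the split size adds tasks, and each one carries scheduling and startup overhead, which is why the answer suggests the cluster is not worth it at this volume.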
As @Makubex has suggested, don't use Hadoop.
SSIS is a good option, as it handles the data in memory, so it performs data aggregations, data type conversions, merging, etc. at a much faster rate than writing to disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data; I would suggest it only for data in terabytes. It would be much slower than SSIS (which runs in memory) for small data sets.
Refer: When to use T-SQL or SSIS for ETL

Amount of data storage : HDFS vs NoSQL

In several sources on the internet, it's explained that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general when we go further than 1TB we must start thinking Hadoop (HDFS) and not NoSQL.
Besides the architecture and the fact that HDFS supports batch processing and that most NoSQL technologies (e.g. Cassandra) perform random I/O, and besides the schema design differences, why can't NoSQL Solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a Data Lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1TB/node for spinning disk and 3TB/node when using SSD.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore there are a few JIRA tickets to improve server density in the future.
There is a limit right now for server density in Cassandra because of:
Repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). But the issue of repair is mostly solved with incremental repair, introduced in Cassandra 2.1.
Compaction. With an LSM tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of obsolete or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has some tuning knobs to stop compacting data after a time threshold. A few people are using DateTiered compaction in production with density up to 10TB/node.
Node rebuild. Imagine one node crashes and is completely lost: you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer it takes to rebuild the node.
Load distribution. The more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This will greatly impact the node latency for real-time requests. Whereas a difference of 100ms is negligible for a batch scenario that takes 10h to complete, it is critical for a real-time database/application subject to a tight SLA.
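The node rebuild point can be illustrated with back-of-envelope arithmetic. The sustained streaming rate of 100 MB/s is purely an assumption for illustration, and `rebuild_hours` is a hypothetical helper; real rebuild times depend on the network, the disks, and throttling settings.

```python
def rebuild_hours(node_density_tb, stream_mb_per_s):
    """Hours needed to stream a node's full data set at a sustained rate."""
    total_mb = node_density_tb * 1024 * 1024  # TB -> MB
    return total_mb / stream_mb_per_s / 3600

ASSUMED_STREAM_MB_PER_S = 100  # assumption, not a Cassandra default

for density_tb in (1, 3, 10):
    hours = rebuild_hours(density_tb, ASSUMED_STREAM_MB_PER_S)
    print(f"{density_tb} TB/node -> about {hours:.1f} h to rebuild")
```

Rebuild time grows linearly with density, so a node holding ten times the data leaves the cluster running with one fewer replica for ten times as long.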

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Does this have something to do with the underlying data structures and algorithms?
RDBMSs have challenges handling huge data volumes of terabytes and petabytes. Even if you have a Redundant Array of Independent/Inexpensive Disks (RAID) and data sharding, an RDBMS does not scale well for huge volumes of data. You require very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS.
Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How Hadoop handles this:
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and can run on commodity hardware. It is useful for storing and retrieving huge volumes of data.
This scalability and efficiency are possible because of Hadoop's storage mechanism (HDFS) and its processing jobs (MapReduce jobs on YARN). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, high availability, and flexible processing of huge volumes of data (structured, unstructured, and semi-structured) are key to the success of Hadoop.
Data is stored on thousands of nodes, and processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key area of Hadoop's success.
This is achieved with the NameNode, DataNodes, and the ResourceManager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture, and HDFS Federation
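As a rough illustration of the MapReduce model described above, here is a single-process Python sketch of the map, shuffle, and reduce phases. It is not Hadoop code: the function names are hypothetical, and each string in `blocks` merely stands in for an HDFS block that a real cluster would process on the node storing it.

```python
from collections import defaultdict

def map_phase(block):
    """Mapper: emit (word, 1) for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for one block; mappers never need the other blocks.
blocks = ["error warn error", "info error", "warn info info"]
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each mapper sees only its own block, the map phase can run wherever the block already lives, and only the much smaller intermediate pairs cross the network.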
An RDBMS is still good for multiple writes/reads/updates and consistent ACID transactions on gigabytes of data, but not for processing terabytes and petabytes of data. NoSQL, which provides two of the three CAP theorem attributes (consistency, availability, partition tolerance), is good in some use cases.
But Hadoop is not meant for real-time transaction support with ACID properties. It is good for business intelligence reporting with batch processing: the "write once, read many" paradigm.
From slideshare.net
Have a look at one more related SE question :
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
An RDBMS scales vertically and Hadoop scales horizontally.
This means that to double the capacity of an RDBMS you need hardware with double the memory, double the storage, and double the CPU. That is very expensive and has limits. A single server with 10 TB of RAM, for example, is rare and very expensive. With Hadoop it is different: you don't need expensive cutting-edge technology; instead you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.
Another advantage is that instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is the claim that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I dealt with both HAWQ and Impala. They both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no silver bullet. It has been my experience that both structured and unstructured have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it. That structuring takes time somewhere. You either structure it before you put it into an RDBMS or at query time.
In an RDBMS, data is structured, or rather, indexed.
In a row-oriented RDBMS, retrieving any particular 'nth' column means reading entire rows and then selecting the 'nth' column.
Whereas in Hadoop, say Hive with a columnar file format such as ORC or Parquet, we load only the particular column from the entire data set.
Moreover, the data loading is also done by MapReduce programs, which run in a distributed fashion and reduce the overall time.
Hence, there are two advantages of using Hadoop and its tools.
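The column-access argument can be sketched with plain Python structures. This is a toy model with invented data, assuming a row-oriented layout on one side and a columnar layout (as in formats such as ORC or Parquet) on the other; it counts how many field values each layout touches to return one column.

```python
# Toy table: (id, name, age). The data is invented for illustration.
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]

def read_column_row_store(rows, n):
    """Row layout: every field of every row is read to extract column n."""
    touched = 0
    values = []
    for row in rows:
        touched += len(row)
        values.append(row[n])
    return values, touched

# Column layout: each column is stored contiguously on its own.
columns = {i: [row[i] for row in rows] for i in range(3)}

def read_column_column_store(columns, n):
    """Column layout: only the requested column is read."""
    values = columns[n]
    return values, len(values)

row_values, row_touched = read_column_row_store(rows, 2)
col_values, col_touched = read_column_column_store(columns, 2)
print(row_values, "fields touched:", row_touched)
print(col_values, "fields touched:", col_touched)
```

Both layouts return the same values, but the row layout touches every field in the table while the columnar layout touches only the requested column.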

doubts regarding migration to big data

I have a few doubts regarding Hadoop.
In one of the videos published by Cloudera, an instructor said that Hadoop has HDFS. Every file is stored as a set of chunks or blocks. Each block is replicated three times on different machines to minimize points of failure. Each mapper processes a single HDFS block.
From this I gathered the following: suppose I have a server holding some 100 petabytes of logs, stored in a traditional file system rather than in HDFS.
Main doubt 1: If I want to analyse this huge data set efficiently using MapReduce, do I have to transfer the data to a new server running HDFS and having three times the storage of the old server?
In another video, also published by Cloudera, the instructor clearly mentioned that we don't need to migrate the traditional system to a new system; we can use Hadoop and MapReduce on top of it. This somewhat contradicts the statement in the first point.
Main doubt 2: Let's assume that the second statement is true. How is this possible? I mean, how can we apply Hadoop and MapReduce to a traditional file system where there is no replication of blocks and no NameNode or daemon on each machine?
My main task is to facilitate fast analysis of a huge amount of logs which are currently not stored in HDFS. Will I need a new server for this or not?
P.S.: I need some good tutorials, books, or articles which could give me in-depth knowledge of big data so that I can start working on it.
So recommendations are most welcome.
Hadoop is just an infrastructure for running a MapReduce-style workload (for "big data" or "analytics") atop a cluster of servers.
You can use HDFS for data sharing across the nodes, then use Hadoop's built in workload management to distribute work to nodes where the data is stored. This is sometimes called "function shipping."
But it's also possible to not use HDFS. You can use another network file sharing / distribution mechanism. FTP (file copies), S3 (access from the Amazon Web Services cloud), and a variety of other clustered/distributed file systems are supported by various vendors/platforms. Some of these move the data to the system on which workload is being done ("data shipping").
Which storage strategy is appropriate, efficient, and performant is a big question, and depends greatly on your infrastructure and your MapReduce app's data access patterns. In general, however, analytics jobs are resource hungry, so only small analytics apps tend to run on servers doing other work (the "original systems"). So processing "big data" does tend to suggest new servers--if not ones you buy, ones you rent temporarily from a cloud service like AWS, RackSpace, etc.--and data streaming from replicas/clones of data captured in production ("secondary storage") rather than data still resident on "primary storage."
If you're just starting out with small or modest apps, you might be able to access data in-place, directly from existing systems. But if you've got 100 PB of logs, you're going to want that processed on systems devoted to the task.
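For the scenario in the question, a quick estimate shows what moving 100 PB into HDFS would imply. The replication factor of three comes from the question; the 128 MB block size is an assumed default, and `hdfs_footprint` is a hypothetical helper, not a Hadoop API.

```python
import math

def hdfs_footprint(data_pb, replication=3, block_mb=128):
    """Return (raw storage in PB, number of blocks before replication)."""
    raw_pb = data_pb * replication
    data_mb = data_pb * 1024 ** 3  # PB -> MB
    blocks = math.ceil(data_mb / block_mb)
    return raw_pb, blocks

raw_pb, blocks = hdfs_footprint(100)
print(f"raw storage needed: {raw_pb} PB")
print(f"blocks (one mapper each for a full scan): {blocks:,}")
```

At this scale the storage and the number of map tasks both call for a cluster of many machines rather than one larger server, which is consistent with the advice to process the logs on systems devoted to the task.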

Resources