To Hadoop or not to Hadoop

We have data (not a lot at this point) that we want to transform, aggregate, and pivot up the wazoo.
I had a look around the web, and all the answers to what I'm asking point to Hadoop: scalable, cheap to run (no SQL Server machine and license), fast (if you have a lot of data), and programmable (not little boxes that you drag around).
There is just one problem that I keep coming up against, namely 'use Hadoop if you have more than 10 GB of data'.
We don't even have 1 GB of data (at this stage). Is it still viable?
My other option is SSIS. We do use SSIS for some of our current ETL, but we don't have the resources for it, and putting a SQL Server in the cloud is just going to cost too much; don't even get me started on the scalability cost and configuration.
Thanks

Your current data volume seems too low to justify an entry into Hadoop. Enter the Hadoop ecosystem only if you are dealing with a huge volume of data (TBs per year) and you suspect the data volume will increase exponentially down the line.
Let me explain why I advise against Hadoop for such a low volume of data.
By default, Hadoop stores your files in 128 MB chunks, and during processing it likewise takes 128 MB chunks at a time to process in parallel. If your business requirement involves heavy CPU-intensive processing, you can decrease the input chunk size below 128 MB. But by decreasing the amount of data processed in parallel, you end up increasing the number of I/O seeks (low-level block storage). In the end you might spend more resources on managing the tasks than on the actual work itself. Hence, try to avoid distributed computing as a solution for your (low) data volume.
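To put rough numbers on this (a minimal sketch; 128 MB is the default HDFS block size, and the dataset sizes are illustrative):

```python
import math

def num_splits(file_size_mb: float, block_size_mb: float = 128.0) -> int:
    """Upper bound on parallel map tasks: one task per input split."""
    return max(1, math.ceil(file_size_mb / block_size_mb))

print(num_splits(1024))          # 1 GB  ->    8 splits: barely any parallelism
print(num_splits(1024 * 1024))   # 1 TB  -> 8192 splits: where Hadoop pays off
```

With under 1 GB of data, the per-task scheduling and startup overhead dominates the handful of splits you would get.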

As @Makubex has suggested, don't use Hadoop.
SSIS is a good option: it handles the data in-memory, so it performs data aggregations, data type conversions, merging, etc. at a much faster rate than writing to disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data; I would suggest it only for data in terabytes. It would be way slower than SSIS (which runs in-memory) for small data sets.
Refer: When to use T-SQL or SSIS for ETL

Related

Amount of data storage: HDFS vs NoSQL

Several sources on the internet explain that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general, once we go beyond 1 TB we must start thinking Hadoop (HDFS) and not NoSQL.
Besides the architecture, the fact that HDFS supports batch processing while most NoSQL technologies (e.g. Cassandra) perform random I/O, and the schema design differences, why can't NoSQL solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a data lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1 TB/node for spinning disks and 3 TB/node when using SSDs.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore, there are a few JIRA tickets to improve server density in the future.
There is a limit right now on server density in Cassandra because of:
- Repair: with an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). The issue of repair is mostly solved by the incremental repair introduced in Cassandra 2.1.
- Compaction: with an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of deprecated or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this, mainly the new DateTieredCompactionStrategy, which has tuning knobs to stop compacting data after a time threshold. A few people use DateTiered compaction in production with density up to 10 TB/node.
- Node rebuild: imagine one node crashes and is completely lost; you'll need to rebuild it by streaming data from the other replicas. The higher the node density, the longer it takes to rebuild the node (see the sketch after this list).
- Load distribution: the more data you have on a node, the higher the load average (high disk I/O and high CPU usage). This greatly impacts node latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch job that takes 10 h to complete, it is critical for a real-time database/application subject to a tight SLA.
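To make the rebuild point concrete, here is a back-of-the-envelope sketch; the 200 Mbit/s figure mirrors Cassandra's default stream_throughput_outbound_megabits_per_sec setting, and both numbers should be treated as assumptions:

```python
def rebuild_hours(node_density_tb: float, stream_mbit_s: float = 200.0) -> float:
    """Rough time to re-stream a lost node's data from its replicas."""
    data_mbit = node_density_tb * 1024 * 1024 * 8   # TB -> MB -> Mbit
    return data_mbit / stream_mbit_s / 3600         # seconds -> hours

print(f"{rebuild_hours(1):.0f} h")    # ~12 h at the 1 TB/node recommendation
print(f"{rebuild_hours(10):.0f} h")   # ~117 h at 10 TB/node
```

At higher densities the rebuild window stretches from hours into days, which is why the density recommendations above are so conservative.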

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Does this have something to do with the underlying data structures & algorithms?
RDBMSs have challenges handling huge data volumes of terabytes & petabytes. Even if you have a Redundant Array of Independent/Inexpensive Disks (RAID) & data sharding, it does not scale well for huge volumes of data. You require very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS:
- Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
- Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
- Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
- Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How does Hadoop handle this?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and can run on commodity hardware. It is useful for storing & retrieving huge volumes of data.
This scalability & efficiency are possible because of Hadoop's storage mechanism (HDFS) and its processing engine (YARN/MapReduce jobs). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, high availability, and flexible processing of huge volumes of data (structured, unstructured, and semi-structured) are key to the success of Hadoop.
Data is stored on thousands of nodes & processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key area of Hadoop's success.
This has been achieved with the NameNode, DataNodes & the ResourceManager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation.
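To make the MapReduce model concrete, here is a minimal word-count pair for Hadoop Streaming, as a sketch (the paths and the streaming-jar location below are assumptions that vary by installation):

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: emit one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts for each word can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Submitted with something like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out, the framework schedules each mapper on a node holding a replica of its input block, which is exactly the data locality described above.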
Still, an RDBMS is good for multiple writes/reads/updates and consistent ACID transactions on gigabytes of data, but not for processing terabytes & petabytes of data. NoSQL, which offers two of the three Consistency/Availability/Partition-tolerance attributes of the CAP theorem, is good in some use cases.
Hadoop, however, is not meant for real-time transaction support with ACID properties. It is good for business intelligence reporting with batch processing, following a "write once, read many" paradigm.
Have a look at one more related SE question:
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
An RDBMS scales vertically and Hadoop scales horizontally.
This means that to double the capacity of an RDBMS you need hardware with double the memory, double the storage, and double the CPU. That is very expensive and has limits: there is no server with 10 TB of RAM, for example. With Hadoop it's different; you don't need expensive edge technology. Instead, you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.
Another advantage is that instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware, while Greenplum and Vertica can be put on commodity hardware. (Now, I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I have dealt with both HAWQ and Impala; they both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no magic silver bullet. It has been my experience that both structured and unstructured systems have their place. Hadoop is great at ingesting large amounts of data and scanning through it a small number of times; we use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it, and that structuring takes time somewhere: you either structure it before you put it into an RDBMS, or at query time.
In an RDBMS, data is structured, and moreover it is indexed.
Retrieving any particular 'nth' column means loading the entire row set and then selecting the 'nth' column, whereas in Hadoop, say Hive, we load only the particular column from the entire data set.
Moreover, the data loading is done by MapReduce programs, which run in a distributed fashion and reduce the overall time.
Hence, there are two advantages to using Hadoop and its tools.
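As a hedged illustration of that column-pruning point: it applies when the data sits in a columnar format such as Parquet or ORC, and the path and column name below are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColumnPruning").getOrCreate()

# Hypothetical Parquet dataset; with a columnar format, only the bytes
# of the selected column are read from disk, not whole rows.
events = spark.read.parquet("hdfs:///warehouse/events")
events.select("user_id").show(5)   # "user_id" is a made-up column name
```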

Why is Spark fast at word count? [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
Test case: Spark counts words in 6 GB of data in 20+ seconds.
I understand the MapReduce, FP, and stream programming models, but I couldn't figure out why the word counting is so amazingly fast.
I think it's an I/O-intensive computation in this case, and it seems impossible to scan 6 GB of files in 20+ seconds. I guess some index is built before the word counting, as Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don't understand well enough.
I would appreciate it if anyone could explain RDDs for the word-counting case. Thanks!
First is startup time. Starting a Hadoop MapReduce job requires spinning up a number of separate JVMs, which is not fast. Starting a Spark job (on an existing Spark cluster) causes an existing JVM to fork new task threads, which is many times faster than starting a JVM.
Next, there is no indexing and no magic. A 6 GB file is stored in about 47 blocks of 128 MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/s scan rate, which means you can read this data in ~2 seconds. With a 10GbE network in your cluster you can transfer all of this data from one machine to another in just 7 seconds.
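That arithmetic in sketch form (the 70 MB/s and 10GbE figures are the answer's own assumptions; the results are raw lower bounds, before any overhead):

```python
FILE_GB   = 6      # size of the input file
BLOCK_MB  = 128    # HDFS block size
DISK_MB_S = 70     # per-disk scan rate assumed above
NET_GBIT  = 10     # 10GbE network

blocks        = FILE_GB * 1024 // BLOCK_MB   # ~48 blocks (the answer rounds to 47)
parallel_scan = BLOCK_MB / DISK_MB_S         # each disk scans its one block
transfer      = FILE_GB * 8 / NET_GBIT       # move everything over the wire

print(blocks, f"{parallel_scan:.1f} s", f"{transfer:.1f} s")
# 48 blocks, ~1.8 s of parallel disk scan, ~4.8 s of raw network transfer
```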
Lastly, Hadoop puts intermediate data on disk a number of times. It puts map output on disk at least once (and more if the map output is big and on-disk merges happen). It puts the data on disk again on the reduce side, before the reduce itself is executed. Spark puts the data on HDDs only once, during the shuffle phase, and the reference Spark configuration recommends increasing the filesystem write cache so that this 'shuffle' data does not hit the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question.
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. By default, Spark combines results on each node before sending them to other nodes.
In the case of a word-count job, Spark calculates the count for each word on a node and then sends the results to the other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class: job.setCombinerClass(CustomCombiner.class).
By using combineByKey() in Spark, you can specify a custom combiner, as in the sketch below.
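A minimal PySpark sketch of the idea (the input path is hypothetical; for a plain sum, reduceByKey would do the same with less code):

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Hypothetical input path; each word becomes a (word, 1) pair.
pairs = sc.textFile("hdfs:///tmp/input.txt") \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1))

# combineByKey takes three functions:
#   createCombiner - turn the first value seen for a key into an accumulator
#   mergeValue     - fold further node-local values into the accumulator
#                    (this is the combiner-like, pre-shuffle step)
#   mergeCombiners - merge accumulators from different nodes after the shuffle
counts = pairs.combineByKey(
    lambda v: v,
    lambda acc, v: acc + v,
    lambda a, b: a + b,
)

print(counts.take(10))
```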
Apache Spark processes data in-memory, while Hadoop MapReduce persists back to disk after each map or reduce action. But Spark needs a lot of memory.
Spark loads data into memory and keeps it there until further notice, for the sake of caching.
The Resilient Distributed Dataset (RDD) allows you to transparently store data in memory and persist it to disk if needed.
Since Spark works in-memory, there is no synchronisation barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend you're dealing with partitioned data, and that partitioning is what enables Spark to execute in parallel.
Spark lets you develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
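A hedged sketch of that caching point (the file path is made up):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CacheDemo")

# Hypothetical input path. persist() keeps partitions in memory and
# spills to disk only if they do not fit (MEMORY_AND_DISK).
words = sc.textFile("hdfs:///tmp/big.txt") \
          .flatMap(lambda line: line.split()) \
          .persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the cached partitions instead of re-reading HDFS,
# which is the in-memory data sharing across jobs described above.
total    = words.count()
distinct = words.distinct().count()
print(total, distinct)
```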
Hope this helps.

Streaming data access and latency in Hadoop applications

I am very new to Hadoop and going through the book 'Hadoop: The Definitive Guide'.
What is the meaning of streaming data access in Hadoop, and why do we say latency is high in Hadoop applications? Can anyone please explain? Thanks in advance.
OK, let me try. "Streaming data access" implies that instead of reading data as packets or chunks, data is read continuously at a constant bitrate, just as water flows from a tap. The application starts reading data from the start of a file and keeps reading it sequentially, without random seeks.
Coming to the second part of your question: latency is said to be high in Hadoop applications because the initial few seconds are spent on activities like job submission, resource distribution, split creation, mapper creation, etc.
HTH
Regarding latency, I can say that the completion time is always more than 30 seconds, even if you are working with KBs of data. I don't fully know why it takes so long, but this time goes on initialization, e.g. creating the job, determining which part of the data is going to be processed by which worker, and so on.
So, if you are going to work on a small amount of data, less than a few GB, don't go for Hadoop; just use your PC. Hadoop is only good for big data.
It refers to the fact that HDFS operations are read-intensive as opposed to write-intensive. In a typical scenario, the source data you would use for analysis is loaded into HDFS only when it is up to date, to ensure you have the latest data set.
During analysis, a copy of the original data (in almost its entire form) is made. Your MapReduce operation is then invoked on the copied data.
As you can see, this differs from the usual relationship between storage and processing. In normal operations (think of your PC/Mac) you ideally want files to open quickly, which means low latency, and you keep file sizes small to make that feasible.
Since HDFS inclines itself to working with petabytes (thousands of TBs), latency will be high, but in return it is realistically possible to work with very large data sets much more easily.

What does "local caching of data" mean in the context of this article?

The following paragraphs of text (http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large, sequentially readable files are not suitable for local caching, but I don't understand what "local" means here...
There are two possibilities in my opinion: one is that the client caches data from HDFS, and the other is that the datanode caches HDFS data in its local filesystem or memory for clients to access quickly. Can anyone explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a particular class of applications; it is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs that were made with HDFS. In particular:

- Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
- Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
- Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from HDFS source.
- Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails all together). While performance may degrade proportional to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.
Any real MapReduce job is probably going to process GBs (10s/100s/1000s) of data from HDFS.
Therefore, any one mapper instance is most probably going to be processing a fair amount of data (the typical block size is 64/128/256 MB depending on your configuration) in a sequential manner (it will read the file/block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time soon. More likely, multiple mapper instances will be processing other data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to the actual physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much: you'll most probably not get a cache hit on that data before another block is queried and ultimately replaces it in the cache.
