I am very much new to hadoop and going through the book 'Hadoop the definitive guide'
What is meaning of Streaming data access in Hadoop and why we say latency is high in Hadoop applications. Can anyone please explain me ? Thanks in advance
Ok..Let me try.."Streaming data access" implies that instead of reading data as packets or chunks, data is read continuously with a constant bitrate, just as water from a tap. The application starts reading data from the start of a file and keeps on reading it in a sequential manner without random seeks.
Coming to the second part of your question, latency is said to be high in Hadoop applications as the initial few seconds are spent in the activities like job submission, resource distribution, split creation, mappper(s) creation etc.
HTH
For latency, I can say that the completion time is always more than 30 sec, even if you are working with KB's of data. I don't totally know why it is so long but this time is initializations, e.g creating job, determination that which part of data is going to be processed by which worker, and so on.
So, if you are going to be working on small amount of data that is less than GB's, then don't go for hadoop, just use your pc. Hadoop is only good for big data
It refers to the fact that HDFS operations are read-intensive as opposed to write-intensive. In a typical scenario source data which is what you would use for analysis is loaded into HDFS only when it is up-to-date and to ensure you have the latest data set.
During analysis, a copy of the original data (in almost its entire form) is made. Your MapReduce operation will then be invoked on the copied data.
As you can see it is different to the usual relationship between storage and processing. In normal operations (think your PC/Mac) you would ideally want the files to open quickly, which is low latency and maintain small file sizes to make that feasible.
Since HDFS inclines itself to working with petabytes (1000s of GBs) latency will be high but in contrast it is realistically possible to work with large data sets much more easily.
Related
We have data (not allot at this point) that we want to transform/aggregate/pivot up to wazoo.
I had a look on the www and all the answers i am asking is pointing to hadoop for scalable,cheap to run(no SQL server machine and license),fast(if you have allot of data), programmable(not little boxes that you drag around).
There is just one problem that i keep coming up against
namely 'Use hadoop if you have more than 10gb of data'
Now we don't even have 1gb of data(at this stage) is it still viable.
My other option is SSIS. Now we do use SSIS for some of our current ETL but we don't have resources for it and putting a SQL in the cloud is just going to cost to much and don't even get me started on scalability cost and config.
thanks
Your current data volume seems to be too low for making an entry into hadoop. Enter into hadoop ecosystem only if you are dealing with huge volume of data(TB/year) and if you suspect the data volume to increase exponentially down the line.
Let me explain why I suggest against hadoop for such low volume of data.
By default hadoop stores your files into 128MB chunks of data and while processing also, it takes 128MB Chunks at a time to process(parallely). If your business requirement involves heavy CPU intensive processing, then you can decrease the input chunk size from 128MB to less. But then again by decreasing the amount of data to be processed parallely, you'll end up increasing the number of IO seaks(low level block storage). At the end you might be spending more resource on managing the tasks rather than what the actual task is taking. Hence, try avoiding distributed computing as a solution for your(low) data volume.
As #Makubex has suggested, don't use hadoop.
And SISS is a good option as it handles the data in-memory so it would perform data aggregations, data type conversions, merging, etc at a much faster rate than writing to the disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data I would suggest it only for data in terabytes. It would be way slower that SISS(which runs in-memory) for small data-sets.
Refer: When to use T-SQL or SSIS for ETL
I have a few doubts regarding hadoop
In one of the videos published by cloudera an instructer told that in hadoop there is HDFS. Every file will be stored as a set of chucks or blocks. Each block will be replicated three times in different machines to minimize the point of failure. Each mapper will process a single hdfs block.
From these logics i perceived that if i have a server having some 100 peta bytes of logs which are not stored in traditional file system unlike hdfs.
Main doubt 1. Now if i want to analyse this huge data efficiently using the mapreduce technique then do i have to transfer the data in a new server running hdfs and having three times the storage of the old server.
In one more video which was also published by cloudera..the instructer mentioned clearly that we dont need to migrate the traditional system to a new system, we can use hadoop and map reduce on top of that. This is little contradictry to the statement mentioned in first point.
Main doubt 2: Lets assume that point 2 statement is true. Now how can this be possible. I mean how can we apply hadoop and map reduce on a traditional file system where there is no replication of blocks or name node ..deamon on each machine.
My main task is to Facilitate fast analysis of a huge amount of logs which are currently not stored in hdfs. For doing this will i need a new server or not.
P.S: I need some good tutorial or Books or some articles which could give me in depth knowledge of big data so that i can start working on it.
So recomendations are most welcome.
Hadoop is just an infrastructure for running a MapReduce style workload (for "big data" or "analytics" atop a cluster of servers.
You can use HDFS for data sharing across the nodes, then use Hadoop's built in workload management to distribute work to nodes where the data is stored. This is sometimes called "function shipping."
But it's also possible to not use HDFS. You can use another network file sharing / distribution mechanism. FTP (file copies), S3 (access from the Amazon Web Services cloud), and a variety of other clustered/distributed file systems are supported by various vendors/platforms. Some of these move the data to the system on which workload is being done ("data shipping").
Which storage strategy is appropriate, efficient, and performant is a big question, and depends greatly on your infrastructure and your MapReduce app's data access patterns. In general, however, analytics jobs are resource hungry, so only small analytics apps tend to run on servers doing other work (the "original systems"). So processing "big data" does tend to suggest new servers--if not ones you buy, ones you rent temporarily from a cloud service like AWS, RackSpace, etc.--and data streaming from replicas/clones of data captured in production ("secondary storage") rather than data still resident on "primary storage."
If you're just starting out with small or modest apps, you might be able to access data in-place, directly from existing systems. But if you've got 100 PB of logs, you're going to want that processed on systems devoted to the task.
I'm a very beginner of hadoop. But I had this interesting observation.
Using the example in hadoop documentation,
By running the same example in Standalone Operation and Pseudo-Distributed Operation, the standalone one took less than 1 minute but Pseudo-distributed operation took more than 3 minutes. This is big difference. I could understand there are extra network and scheduling overhead in distributed mode. But the difference just seems to be too much. This may not be a real comparison because the example is very simple.
My question is, how much difference did you experience between the standalone and distributed mode for a real-world job?
These are reasonably different scenarios. In stand-alone mode, it never starts up a proper one-node Hadoop cluster. Everything happens locally, inline, in the JVM. Data never has to be even written out to disk, potentially. Pseudo-distributed operation is the smallest "real" Hadoop installation, of one local node. You have to read/write data to a local HDFS instance, spawn another JVM, etc. All of that adds a lot of overhead. Maybe the overhead is indeed a few minutes. This seems entirely sensible to me.
Hadoop frame work is meant for processing BIG DATA..
So the size of the data matters a lot ,because ,a smaller file would get processed in traditional file system very quickly than in hadoop because hadoop mapreduce frame work has internal work to do (to make chunks of data file and to send it to data nodes and while processing again access from data nodes ).So for a smaller files ,hadoop frame work is not suitable.
Coming to standalone and pseudo distributed mode ,one aspect u should consider is size of the file and second being actual difference in standalone and pseudo distributed mode.
In standalone mode there is no concept of HDFS,data is not copied to hadoop distributed file system (obviously time saved).Where as in pseudo distributed mode ,hdfs involved which need to be copied with the data that's need to be processed.
Small size data files better to use traditional file processing and if the file size become huge and huge ,hadoop framework gives better processing time!
Hope this helps!
Context:
We are considering an AMQP-compliant solution as a way to compute a constant live stream of data that amounts to 90 gb daily. What we'd like to achieve is live stats, more or less, based on all or some combination of the metrics we're observing. The considered strategy is to send data on the queue and have worker process deltas of the data, sending the data back on the queue as an aggregation of the original data.
Observation:
To me, this looks like a job for something like Hadoop, but concerns (and shields) were raised, mainly about speed. I didn't have the time to benchmark both, we're expecting to pump a good amount of data through the queue (anywhere in the neighborhood of 10~100 mb/s) though. I still think it looks like a job for a distributed computing system, and I also feel the queue solution will scale poorer than a distributed computing solution.
Question:
Put simply, am I right? I've read a bit on Hadoop + HDFS, I was thinking about using another FS, like Lustre or something, to circumvent the NodeName SPOF, and use some kind of solution to have some kind of tolerance to failure of nodes of any kind on the whole cluster.
Its really hard to write your own "distributed environment" solution when you need fail-tolarence, good balancing, etc.If you need near-realtime map/reduce you should checkout storm which is what twitter uses for their huge data needs. Its less complicated then hadoop, and better on consuming queue type input (In my opinion).
Also if you decide to analyze your data on hadoop don't worry too much on SPOF of name node, there are some ways to avoid it.
From the following paragraphs of Text——
(http://developer.yahoo.com/hadoop/tutorial/module2.html),It mentions that sequential readable large files are not suitable for local caching. but I don't understand what does local here mean...
There are two assumptions in my opinion: one is Client caches data from HDFS and the other is datanode caches hdfs data in its local filesystem or Memory for Clients to access quickly. is there anyone who can explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real Mapreduce job is probably going to process GB's (10/100/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (typical block size is 64/128/256 MB depending on your configuration) in a sequential nature (it will read the file / block in its entirety from start to end.
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future, more so that multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to actually physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is queried and will ultimately replace it in the cache.