I need to keep a global array of strings across all map and reduce tasks, which each one of them can update while running.
Is it possible to do that in Hadoop 1.2.1?
As far as I understood, counters only work with type long, and distributed cache files are read-only.
Would be great if someone can give pointers for this problem.
Thanks!
You really should not have shared variables in map-reduce programs.
But if you really need it, check out ZooKeeper. It is a distributed coordination service and a core part of the Hadoop ecosystem. You can use it to store any kind of shared data, including arrays of strings.
Can anyone tell me the most robust way to copy files from HDFS to S3 in PySpark?
I am looking at 2 options:
I. Call distcp directly as in the following:
distcp_arglist = ['/usr/lib/hadoop/bin/hadoop', 'distcp',
                  ...,
                  '-overwrite',
                  src_path, dest_path]
II. Using s3-distcp - which seems a bit more involved.
https://gist.github.com/okomestudio/699edbb8e095f07bafcc
Any suggestions are welcome. Thanks.
I'm going to point you at a little bit of my code, cloudcp.
This is a basic proof of concept of implementing distCp in Spark. It:
* schedules individual files via the Spark scheduler; not ideal for 0-byte files, but it stops the job being held up by a large file on one node
* does locality via a special RDD which works out the location of every row (i.e. file) differently (and which has to be in the org.apache.spark package for scoped access)
* shows how to do FS operations within a Spark map
* shuffles the input for a bit of randomness
* collects results within an RDD
Doesn't do:
* incremental writes (you can't compare checksums between HDFS and S3 anyway, but it could do a check for fs.exists(path) before the copy)
* permissions. S3 doesn't have them
* throttling
* scheduling of the big files first. You ought to.
* recovery from job failure (no incremental writes, see above)
Like I said, it is a PoC to say "we can be more agile by using Spark for the heavy lifting".
Anyway, take it and play, you can rework it to operate within an existing spark context with ease, as long as you don't mind a bit of scala coding.
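Two of the gaps listed above, the existence check before the copy and scheduling the big files first, can be sketched in a few lines. This is plain Python rather than the Scala in cloudcp, and it is not taken from that code: `plan_copies` is a hypothetical helper, and the `existing` set stands in for whatever an fs.exists(path) check against the destination would report.

```python
from typing import Iterable, List, Set, Tuple


def plan_copies(files: Iterable[Tuple[str, int]], existing: Set[str]) -> List[str]:
    """Return the source paths to copy, largest file first.

    files:    (path, size_in_bytes) pairs for the source listing.
    existing: paths already present at the destination (assumed to come
              from an fs.exists-style check done by the caller).
    """
    # Skip anything that is already at the destination.
    pending = [(path, size) for path, size in files if path not in existing]
    # Largest first, so a big file does not become the straggler at the end.
    pending.sort(key=lambda item: item[1], reverse=True)
    return [path for path, _ in pending]


# Hypothetical listing used only for illustration.
files = [("/data/a.csv", 10), ("/data/b.csv", 5000),
         ("/data/c.csv", 0), ("/data/d.csv", 300)]
already_copied = {"/data/d.csv"}
plan = plan_copies(files, already_copied)
print(plan)
```

The resulting list could then be parallelized as the input of the copy job instead of a shuffled listing.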
Distcp would probably be the way to go, as it is a well-proven solution for transferring data between clusters. I guess any possible alternative would do something similar: create map-reduce jobs to transfer the data. The important point here is how to tune this process for your particular data, as it can depend on many factors such as networking or map-reduce settings. I recommend reading the HortonWorks article about how to tune this process.
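If you go with plain distcp from a Python driver, a minimal sketch could look like the following. It only assembles the command line and hands it to `subprocess`; the helper names, the default binary path (copied from the question) and the example URIs are assumptions, and no tuning flags are included because the right values depend on your cluster.

```python
import subprocess
from typing import List


def build_distcp_args(src_path: str, dest_path: str, overwrite: bool = True,
                      hadoop_bin: str = "/usr/lib/hadoop/bin/hadoop") -> List[str]:
    """Assemble the argument list for a `hadoop distcp` call."""
    args = [hadoop_bin, "distcp"]
    if overwrite:
        args.append("-overwrite")
    args += [src_path, dest_path]
    return args


def run_distcp(src_path: str, dest_path: str, **kwargs) -> None:
    """Run distcp and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_distcp_args(src_path, dest_path, **kwargs), check=True)


# Hypothetical paths; nothing is executed here, the command is only built.
args = build_distcp_args("hdfs:///data/input", "s3a://example-bucket/input")
print(" ".join(args))
```

Keeping the argument building separate from the call makes it easy to log or unit test the exact command before it touches the cluster.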
I am facing a unique problem, and wanted your opinions here.
I have a legacy map-reduce application, where multiple map-reduce jobs run sequentially and the intermediate data is written back and forth to HDFS. Because the intermediate data is written to HDFS, the jobs with small data lose more than they gain from HDFS's features, and take considerably more time than a non-Hadoop equivalent would have taken. Eventually I plan to convert all my map-reduce jobs to Spark DAGs; however, that's a big-bang change, so I am reasonably procrastinating.
What I really want as a short-term solution is to change the storage layer, so that I continue to benefit from map-reduce parallelism but do not pay much of a penalty for the storage layer. In that direction, I am thinking of using Spark as the storage layer, where map-reduce jobs will store their outputs in Spark through the Spark Context, and the inputs will be read again (by creating Spark input splits, where each split will have its own Spark RDD) from the Spark Context.
In this way, I will be able to operate intermediate data read/write at memory speed, which will theoretically give me significant performance improvement.
My question is, does this architectural scheme make sense? Has anyone encountered situations like this? Am I missing something significant, which I should have considered even at this preliminary stage of the solution?
Thanks in advance!
does this architectural scheme make sense?
It doesn't. Spark has no standalone storage layer, so there is nothing you can use here. On top of that, at its core it uses standard Hadoop input formats for reading and writing data.
If you want to reduce the overhead of the storage layer, you should rather consider accelerated storage (like Alluxio) or a memory grid (like Ignite Hadoop Accelerator).
The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how that data is distributed can make a significant difference in how well/quickly a hadoop job can run.
To your question and example, if you will generate the input data you have the choice of generating it before the first job runs or you can generate it within the first mapper. If you generate it within the mapper then you can figure out what node the mapper's running on and then generate just the data that would be reduced in that partition (Use a partitioner to direct data between mappers and reducers)
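The partitioner mentioned above boils down to a function from a key to a reducer index. A minimal sketch of the usual hash-based scheme follows, written in Python for illustration only (a real Hadoop partitioner would be a Java class); `partition_for` is a hypothetical name and CRC32 is used here simply because it gives the same result on every machine.

```python
import zlib


def partition_for(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in [0, num_reducers)."""
    # Mask to a non-negative value, then take the remainder, which is the
    # same shape as hash-modulo partitioning.
    stable_hash = zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF
    return stable_hash % num_reducers


# Every record with the same key lands on the same reducer.
for key in ("10.0.0.1", "10.0.0.2", "10.0.0.1"):
    print(key, partition_for(key, 4))
```

A mapper that generates its own data can apply the same function in reverse: generate only the keys whose partition matches the slice of work it was given.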
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would in Hadoop.
You are probably trying to run a non-MapReduce task on a MapReduce cluster then (e.g. IP scanning?). There may be more appropriate tools for this, you know...
A thing many people do not realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. Having checkpointing and recovery built into the architecture reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce. It's checkpointed before, and it's checkpointed after. And if it fails, only this part of the job is re-run.
You can easily outperform MapReduce by leaving away the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails and you wish you had been doing similar checkpointing...
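The node counts above can be turned into a rough back-of-the-envelope calculation. The sketch below assumes that nodes fail independently and that each node has a 0.1% chance of failing during the job; both are illustrative assumptions, not measured numbers.

```python
def prob_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` machines fails,
    assuming independent failures with per-node probability `p_node`."""
    return 1.0 - (1.0 - p_node) ** nodes


# Assumed per-node failure probability for the duration of one job.
P_NODE = 0.001

for n in (10, 100, 1000):
    print(f"{n:>5} nodes: {prob_any_failure(n, P_NODE):.1%} chance of at least one failure")
```

Under these assumptions the risk grows from about 1% at 10 nodes to well over half at 1000 nodes, which is the regime where re-running only the failed part of a job starts to pay off.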
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are ways to hack around this. For example, you could use /16 subnets as input. Each mapper reads a /16 subnet and does its job on that. It's not that much fake input to generate if you realize that you don't need to generate all 2^32 IPs, unless you have that many nodes in your cluster...
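Generating that fake input is cheap. A minimal sketch using Python's standard `ipaddress` module is below; `subnets_16` is a hypothetical helper and 10.0.0.0/8 is just an example range. Each resulting line could be written to the input file as one record for a mapper to expand.

```python
import ipaddress
from typing import List


def subnets_16(network: str) -> List[str]:
    """Split a network into its /16 subnets, returned in CIDR notation."""
    net = ipaddress.ip_network(network)
    return [str(subnet) for subnet in net.subnets(new_prefix=16)]


# Example range only; the whole IPv4 space would be "0.0.0.0/0".
lines = subnets_16("10.0.0.0/8")
print(len(lines), lines[0], lines[-1])
```

The whole IPv4 space splits into 2^16 = 65536 such records, which is a small input file compared with 2^32 individual addresses.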
Number of Mappers depends on the number of Splits generated by the implementation of the InputFormat.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally and there are many reports that it does not work as expected.
If you really need it, you could create your own implementation of the InputFormat which generates the InputSplits for your virtual data and force as many mappers as you need.
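The core of such a custom InputFormat is the arithmetic that carves the virtual key space into splits. The sketch below shows only that arithmetic, in Python for illustration; in Hadoop it would live in the Java class's split-generation method, and `virtual_splits` is a hypothetical name.

```python
import ipaddress
from typing import List, Tuple


def virtual_splits(num_splits: int, total: int = 2 ** 32) -> List[Tuple[int, int]]:
    """Divide [0, total) into `num_splits` contiguous half-open ranges
    whose sizes differ by at most one."""
    base, extra = divmod(total, num_splits)
    splits = []
    start = 0
    for i in range(num_splits):
        # The first `extra` splits take one additional element each.
        length = base + (1 if i < extra else 0)
        splits.append((start, start + length))
        start += length
    return splits


# Three mappers over the IPv4 space; each range start maps back to an address.
splits = virtual_splits(3)
for start, end in splits:
    print(ipaddress.IPv4Address(start), "to", ipaddress.IPv4Address(end - 1))
```

Each mapper would then receive one (start, end) pair and generate the addresses in that range itself, so no input data has to be stored at all.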
I want to do log parsing of huge amounts of data and gather analytic information. However all the data comes from external sources and I have only 2 machines to store - one as backup/replication.
I'm trying to use Hadoop, Lucene... to accomplish that. But all the training docs mention that Hadoop is useful for distributed, multi-node processing. My setup does not fit into that architecture.
Are there any overheads with using Hadoop with just 2 machines? If Hadoop is not a good choice, are there alternatives? We looked at Splunk and we like it, but it is too expensive for us to buy. We just want to build our own.
Hadoop should be used for distributed batch processing problems.
5-common-questions-about-hadoop
Analysis of log files is one of the more common uses of Hadoop; it's one of the tasks Facebook uses it for.
If you have two machines, you by definition have a multi-node cluster. You can use Hadoop on a single machine if you want, but as you add more nodes the time it takes to process the same amount of data is reduced.
You say you have huge amounts of data? These are important numbers to understand. Personally, when I think huge in terms of data, I think of the 100s of terabytes+ range. If this is the case, you'll probably need more than two machines, especially if you want to use replication over HDFS.
The analytic information you want to gather? Have you determined that these questions can be answered using the MapReduce approach?
Something you could consider is using Hadoop on Amazon's EC2 if you have a limited amount of hardware resources. Here are some links to get you started:
hadoop-world-building-data-intensive-apps-with-hadoop-and-ec2
Hadoop Wiki - AmazonEC2
Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a Share-nothing architecture? Would that make sense?
I'm also interested in any performance results you have...
Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.
This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.
Hadoop MapReduce can run on top of any number of file systems or even more abstract data sources such as databases. In fact, there are a couple of built-in classes for non-HDFS filesystem support, such as S3 and FTP. You could easily build your own input format as well by extending the basic InputFormat class.
Using HDFS brings certain advantages, however. The most potent advantage is that the MapReduce job scheduler will attempt to execute maps and reduces on the physical machines that are storing the records in need of processing. This brings a performance boost as data can be loaded straight from the local disk instead of transferred over the network, which depending on the connection may be orders of magnitude slower.
As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where data is stored. Using HDFS has 2 main benefits IMHO 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication) and 2) the cluster as a whole is more resistant to failure due to data unavailability.
If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.
The best way to wrap your head around Hadoop is to download it and start exploring the included examples. Use a Linux box/VM and your setup will be much easier than on Mac or Windows. Once you feel comfortable with the samples and concepts, start to see how your problem space might map into the framework.
A couple resources you might find useful for more info on Hadoop:
Hadoop Summit Videos and Presentations
Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even at this point ( the book is ~40% complete ).
Parallel/ Distributed computing = SPEED << Hadoop makes this really really easy and cheap since you can just use a bunch of commodity machines!!!
Over the years disk storage capacities have increased massively but the speeds at which you read the data have not kept up. The more data you have on one disk, the slower the seeks.
Hadoop is a clever variant of the divide and conquer approach to problem solving.
You essentially break the problem into smaller chunks and assign the chunks to several different computers to perform processing in parallel to speed things up rather than overloading one machine. Each machine processes its own subset of data and the result is combined in the end. Hadoop on a single node isn't going to give you the speed that matters.
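The chunk, process, combine pattern described above can be shown on a single machine. The sketch below is a toy word count in which threads stand in for the separate computers; it is an illustration of the idea, not of Hadoop's API, and the function names are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def map_chunk(lines: List[str]) -> Counter:
    """'Map' step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def word_count(lines: List[str], num_chunks: int = 4) -> Counter:
    """Split the input into chunks, count each chunk in parallel,
    then combine the partial counts ('reduce' step)."""
    chunks = [lines[i::num_chunks] for i in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(lambda a, b: a + b, partials, Counter())


result = word_count(["a b a", "b c", "a"], num_chunks=2)
print(dict(result))
```

On a real cluster the chunks are file blocks on different machines and the combine step runs in the reducers, but the shape of the computation is the same.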
To see the benefit of Hadoop, you should have a cluster with at least 4 - 8 commodity machines (depending on the size of your data) on the same rack.
You no longer need to be a super genius parallel systems engineer to take advantage of distributed computing. Just know Hadoop with Hive and you're good to go.
Yes, Hadoop can very well be used without HDFS. HDFS is just the default storage for Hadoop. You can replace HDFS with other storage, such as databases. HadoopDB is an augmentation of Hadoop that uses databases instead of HDFS as a data source. Google it, you will find it easily.
If you're just getting your feet wet, start out by downloading CDH4 & running it. You can easily install into a local Virtual Machine and run in "pseudo-distributed mode" which closely mimics how it would run in a real cluster.
Yes, you can use the local file system by using file:// when specifying the input file etc., and this also works with small data sets. But the actual power of Hadoop is based on its distributed and sharing mechanism. Hadoop is used for processing huge amounts of data. That amount of data cannot be processed by a single local machine, or even if it can, it will take a lot of time to finish the job. Since your input file is in a shared location (HDFS), multiple mappers can read it simultaneously, which reduces the time to finish the job. In a nutshell, you can use it with the local file system, but to meet the business requirement you should use it with a shared file system.
Great theoretical answers above.
To change your hadoop file system to local, you can change it in "core-site.xml" configuration file like below for hadoop versions 2.x.x.
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
For Hadoop versions 1.x.x:
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>