What does a Spark cluster mean? [closed]

I have used Spark on my local machine, using Python, for analytical purposes.
Recently I've heard the term "Spark cluster" and I was wondering what it is exactly.
Is it just Spark running on some cluster of machines?
And how can it be used on a cluster without a Hadoop system? Is that possible? Can you please describe?

Apache Spark is a distributed computing system. While it can run on a single machine, it is meant to run on a cluster and to take advantage of the parallelism a cluster makes possible. Spark can use much of the Hadoop stack, such as the HDFS file system, but it overlaps considerably with Hadoop's distributed computing chain. Hadoop centers on the MapReduce programming pattern, while Spark is more general with regard to program design. Spark also has features to help increase performance.
For more information, see https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
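To make the local-versus-cluster distinction concrete, here is a minimal PySpark sketch; it assumes PySpark is installed, and the spark:// master URL is a placeholder for a real cluster:

    # A minimal sketch, assuming PySpark is installed; the spark:// master URL is a placeholder.
    from pyspark.sql import SparkSession

    # Local mode: Spark runs on a single machine, using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("local-demo").getOrCreate()

    # Cluster mode: the same application code, pointed at a cluster manager instead
    # (a standalone master here; YARN or Kubernetes work similarly).
    # spark = SparkSession.builder.master("spark://master-host:7077").appName("cluster-demo").getOrCreate()

    # The job itself does not change; only where the work runs does.
    df = spark.range(1000000)
    print(df.selectExpr("sum(id)").collect())
    spark.stop()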

Related

Can I apply Ambari to a running cluster? [closed]

I was looking for how to monitor Hadoop clusters more conveniently, and then I came across something called Ambari.
I want to apply Apache Ambari to my running Hadoop cluster.
Is it possible to apply Apache Ambari to a running Hadoop cluster?
If this is not possible, are there any future patches planned?
@Coldbrew No. Ambari should be installed on a fresh cluster. If you really need Ambari-managed Hadoop, I would recommend building a new Ambari cluster with Hadoop configured as close as possible to your existing setup, and then migrating the data from the existing Hadoop cluster to the new platform.
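If the data lives in HDFS, a rough sketch of that migration step could use DistCp to copy between the two clusters; host names and paths below are placeholders:

    # A hypothetical sketch of the data migration step, using DistCp between the old
    # and the new cluster. Host names and paths are placeholders.
    import subprocess

    src = "hdfs://old-namenode:8020/user/data"
    dst = "hdfs://new-namenode:8020/user/data"

    # "hadoop distcp" launches a MapReduce job that copies files in parallel across clusters.
    subprocess.run(["hadoop", "distcp", src, dst], check=True)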

File transfer to HDFS [closed]

I need to bring files (zip, csv, xml, etc.) from a Windows share location to HDFS. What is the best approach? I have Kafka -> Flume -> HDFS in mind. Please suggest an efficient way.
I tried producing the files to Kafka:
    producer.send(new ProducerRecord<>(topicName, key, value));
I expect an efficient approach.
Kafka is not designed to send files, only individual messages of up to 1MB, by default.
You can install the NFS Gateway in Hadoop; then you should be able to copy directly from the Windows share to HDFS without any streaming technology, using only a scheduled script on the Windows machine (or run externally).
Or you can mount the Windows share on some Hadoop node and schedule a cron job if you need continuous file delivery - https://superuser.com/a/1439984/475508
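As a rough sketch of the scheduled-script approach, assuming the share is mounted on a machine with the Hadoop client installed (mount point and HDFS directory below are placeholders):

    # A rough sketch only: copy files from the mounted share into HDFS with the hdfs CLI.
    # The share mount point and the HDFS target directory are placeholders.
    import subprocess
    from pathlib import Path

    SHARE_DIR = Path("/mnt/windows_share/incoming")  # mounted Windows share
    HDFS_DIR = "/data/landing"                       # target HDFS directory

    for f in SHARE_DIR.iterdir():
        if f.is_file():
            # -f overwrites files that were already copied on a previous run
            subprocess.run(["hdfs", "dfs", "-put", "-f", str(f), HDFS_DIR], check=True)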
Other solutions I've seen use tools like NiFi or StreamSets, which can be used to read and move files:
https://community.hortonworks.com/articles/26089/windows-share-nifi-hdfs-a-practical-guide.html

How does a search script like InoutScripts' Inout Spider attain scalability? [closed]

I want to know more about how search engine scripts like InoutScripts' Inout Spider attain scalability.
Is it because of the technology they are using?
Do you think it is because of the technology of combining Hadoop and Hypertable?
Hadoop is an open-source software framework used for storing and large-scale processing of data sets on clusters of commodity hardware. It is an Apache top-level project built and used by a global community of contributors and users. Hadoop detects and handles failures at the application layer itself, rather than relying on hardware to deliver high availability.

Is there a good library that helps chain MapReduce jobs using Hadoop Streaming and Python? [closed]

This question answers part of my question, but not completely.
How do I run a script that manages this - from my local filesystem? Where exactly do things like mrjob or Dumbo come into the picture? Are there any other alternatives?
I am trying to run K-Means with Hadoop Streaming and Python, where each iteration (a MapReduce job) produces output that becomes the input to the next iteration.
I do not have much experience, and any information should help me make this work. Thanks!
If you are not too tightly coupled to Python, then you have a very good option. There is a project from Cloudera called "Crunch" that lets you create pipelines of MapReduce jobs easily. It is a Java library that provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google's FlumeJava library.
There is another non-Python option. GraphLab is an open-source project that produces free implementations of scalable machine learning algorithms on multicore machines and clusters. A fast, scalable version of the K-Means++ algorithm is included in the package. See GraphLab for details.
GraphLab's clustering API can be found here.
This seems like a good application for Spark. It also has a streaming option, though I am afraid that only works with Scala; but they have a Python API, which is definitely worth a try. It is not that difficult to use (at least the tutorials aren't), and it can scale to large workloads.
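As a minimal sketch of what K-Means looks like with Spark's Python API (the HDFS input path and the parameters are placeholders), Spark handles the iteration internally, so there is no manual chaining of MapReduce jobs:

    # A minimal sketch, assuming PySpark is available; the HDFS path and k are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.mllib.clustering import KMeans

    spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

    # Each line of the (placeholder) input file holds space-separated feature values.
    points = spark.sparkContext.textFile("hdfs:///data/points.txt") \
        .map(lambda line: [float(x) for x in line.split()])

    # Spark iterates internally, so there is no need to chain separate jobs by hand.
    model = KMeans.train(points, k=5, maxIterations=20)
    print(model.clusterCenters)
    spark.stop()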
It should be possible to use GraphLab Create (in Python) running on Hadoop to do what you describe. The clustering toolkit can help implement the K-Means part. You can coordinate/script it from your local machine and use the graphlab.deploy API to run the job on Hadoop.
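To show where a tool like mrjob (mentioned in the question) could fit in, here is an illustrative sketch of one K-Means iteration expressed as a Hadoop Streaming job; it is not a complete K-Means. The centroid list is a hard-coded placeholder, and a small driver script on the local machine would rerun the job with updated centroids until they stop moving.

    # An illustrative sketch, not a complete K-Means: one assign/update iteration as
    # a Hadoop Streaming job via mrjob. CENTROIDS is a hard-coded placeholder; a real
    # run would ship the current centroids to the job (e.g. as an uploaded file).
    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRKMeansIteration(MRJob):

        CENTROIDS = [[0.0, 0.0], [5.0, 5.0]]  # placeholder centroids

        def steps(self):
            # Additional MRStep entries would chain further map/reduce phases in one job.
            return [MRStep(mapper=self.assign_points, reducer=self.update_centroids)]

        def assign_points(self, _, line):
            point = [float(x) for x in line.split()]
            # index of the nearest centroid by squared Euclidean distance
            nearest = min(range(len(self.CENTROIDS)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(point, self.CENTROIDS[i])))
            yield nearest, point

        def update_centroids(self, centroid_id, points):
            points = list(points)
            # new centroid = component-wise mean of the assigned points
            yield centroid_id, [sum(vals) / len(points) for vals in zip(*points)]

    if __name__ == "__main__":
        MRKMeansIteration.run()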

Using a "local" S3 as a replacement for HDFS? [closed]

I have been testing out the most recent Cloudera CDH4 hadoop-conf-pseudo (i.e. MRv2 or YARN) on a notebook, which has 4 cores, 8GB RAM, and an Intel X25MG2 SSD. The OS is Ubuntu 12.04LTS 64bit. So far so good.
Looking at Setting up hadoop to use S3 as a replacement for HDFS, I would like to do it on my notebook - on this notebook, there is an S3 emulator that my colleagues and I implemented.
Nevertheless, I can't find where I can set jets3t.properties to change the endpoint to localhost. I downloaded hadoop-2.0.1-alpha.tar.gz and searched the source without finding a clue. There is a similar question on SO, Using s3 as fs.default.name or HDFS?, but I want to use our own lightweight and fast S3 emulation layer, instead of AWS S3, for our experiments.
I would appreciate a hint as to how I can change the end point to a different hostname.
Regards,
--Zack
