I am new to the Hadoop and Hive world.
I have written a Hive query that processes 189 million rows (a 40 GB file). When I execute the query, it appears to run on a single machine while generating many map and reduce tasks. Is that expected behavior?
I have read in many articles that Hadoop is a distributed processing framework. My understanding was that Hadoop splits a job into multiple tasks and distributes those tasks across different nodes, and that once the tasks finish, the reducers combine the output. Please correct me if I am wrong.
I have 1 master and 2 slave nodes. I am using Hadoop 2.2.0 and Hive 0.12.0.

Your understanding of Hive is correct. Hive translates your query into a Hadoop job, which in turn is split into multiple tasks that are distributed to the nodes: map > sort & shuffle > reduce (aggregate) > results returned to the Hive CLI.
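To make that pipeline concrete, here is a minimal single-process sketch in plain Python of how a count-per-key query could break down into map, shuffle and reduce steps. It is not Hive or Hadoop code; the function names and the sample splits are made up for illustration, and on a real cluster each map task would run on whichever node holds its split.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (key, 1) pair for every record in one input split."""
    return [(record, 1) for record in split]

def shuffle(mapped_outputs):
    """Group values by key across all map outputs (the sort & shuffle step)."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def run_job(splits):
    # One map task per split; here they simply run one after another.
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))

# Three hypothetical input splits, as if the file had been cut into three pieces.
splits = [["a", "b", "a"], ["b", "c"], ["a"]]
print(run_job(splits))
```

The point of the sketch is only the shape of the work: many independent map tasks, one regrouping step, then the aggregation.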

If you have 2 slave nodes, Hive will split its workload across the two, provided your cluster is properly configured.
That being said, if your input file is not splittable (for example, it's a GZIP-compressed file), Hadoop will not be able to split or parallelize the work. You will be stuck with a single input split and thus a single mapper, limiting the workload to a single machine.
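The reason a GZIP file is not splittable can be shown with a small sketch that uses only the Python standard library (no Hadoop involved). The helper name `can_decompress_from` and the sample records are hypothetical; the idea is that a gzip decoder has to start at the beginning of the stream, so a second task cannot simply begin reading halfway through the file.

```python
import gzip
import zlib

def can_decompress_from(gz_bytes, offset):
    """Return True if a gzip decoder can start reading at the given byte offset."""
    decoder = zlib.decompressobj(31)  # 31 tells zlib to expect a gzip header
    try:
        decoder.decompress(gz_bytes[offset:])
        return True
    except zlib.error:
        return False

# Hypothetical input: 10,000 small text records compressed into one gzip stream.
records = "\n".join(f"row-{i}" for i in range(10000)).encode("utf-8")
compressed = gzip.compress(records)

# Reading from the start works; starting in the middle of the stream should not.
print(can_decompress_from(compressed, 0))
print(can_decompress_from(compressed, len(compressed) // 2))
```

An uncompressed text file has no such restriction, since a reader can seek to any offset and resume at the next line boundary.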

You are all correct: my job is converted into different tasks that are distributed to the nodes.
When I checked the Hadoop Web UI, the first level showed the job as running on a single node. When I drilled down further, it showed the mappers and reducers and where they are running.

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and Spark. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: Client chooses an input file and hands it off to
Hadoop (or YARN). Hadoop takes care of splitting the file based on
the user's InputFormat and stores it on as many nodes as are available
and configured. Client submits a job (map-reduce) to YARN, which
copies the jar to available Data Nodes and executes the job. YARN is
the orchestrator that takes care of all the scheduling and running of
the actual tasks.
Spark: Given a job, input and a bunch of configuration parameters, it
can run your job, which could be a series of transformations, and
provide you the output.
I also understand MapReduce is a batch-based processing paradigm and
Spark is more suited for micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how those two come together during an actual workflow. For example, when a client has a job to submit (read a huge file and do a bunch of transformations), what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as much detail as possible.
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100GB file will be divided into 800 blocks. Each block will be replicated and can be stored on a different node in the cluster.
When reading the file with a Hadoop InputFormat, a list of splits with their locations is obtained first. Then one task is created per split. That way you get 800 tasks that the runtime can execute in parallel.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them will process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once. MapReduce, by contrast, always has only a map phase and a reduce phase.
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirement (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too much of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
If you set 8 cores per Spark executor, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers + 1 Spark master (driver). Each of the workers will process 8 tasks in parallel (because of the 8 cores).
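The numbers above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch in plain Python, assuming one task per 128 MB block and enough executors to run every task in a single wave; the function names are made up, and in practice the number of executors is something you configure rather than something YARN derives.

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks (and so input splits/tasks) for a file."""
    return math.ceil(file_size_mb / block_size_mb)

def yarn_containers(num_tasks, cores_per_executor):
    """Executors needed to run all tasks at once, plus one container for the driver."""
    executors = math.ceil(num_tasks / cores_per_executor)
    return executors + 1

file_size_mb = 100 * 1024                  # the 100GB input file
tasks = hdfs_block_count(file_size_mb)     # 800 blocks, one task each
containers = yarn_containers(tasks, cores_per_executor=8)
print(tasks, containers)                   # 800 101
```

With fewer executors the job still runs; the 800 tasks are simply processed in several waves instead of one.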

Pig on a single machine

Imagine that I have a file with 100 MM records, and I want to use Pig to wrangle it.
I don't have a cluster, but I still want to use Pig for productivity reasons. Could I use Pig on a single machine, or will it perform poorly?
Will Pig simulate an MR job on a single machine, or will it use its own backend engine to execute the process?
Processing 100 MM records with Hadoop on a single machine will certainly not give you good performance.
For development/testing purposes you can use a single machine with a small or moderate amount of data, but not in production.
Hadoop scales its performance roughly linearly as you add more nodes to the cluster.
A single machine can also act as a cluster.
Pig can run in 2 modes: local and mapreduce.
In local mode, no Hadoop daemons or HDFS are required.
In mapreduce mode, your Pig script is converted to MR jobs, which are then executed.
Hope it helps!

How to fully utilize all Spark nodes in cluster?

I have launched a 10 node cluster with the ec2-script in standalone mode for Spark. I am accessing data in s3 buckets from within the PySpark shell, but when I perform transformations on the RDD, only one node is ever used. For example, the code below will read in data from the CommonCorpus:
bucket = ("s3n://#aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/"
data = sc.textFile(bucket)
When I run this, only one of my 10 slaves processes the data. I know this because only one slave (213) has any logs of the activity when viewed from the Spark web console. When I view the activity in Ganglia, this same node (213) is the only slave with a spike in memory usage when the activity was run.
Furthermore I have the exact same performance when I run the same script with an ec2 cluster of only one slave. I am using Spark 1.1.0 and any help or advice is greatly appreciated.
I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.
(Note, however, that Spark can load 10 gzipped files in parallel just fine; it's just that each of those 10 files can only be loaded by 1 task. You can still get parallelism across files, just not within a file.)
You can confirm that you only have 1 partition by checking the number of partitions in your RDD explicitly:
data.getNumPartitions()
The upper bound on the number of tasks that can run in parallel on an RDD is the number of partitions in the RDD or the number of slave cores in your cluster, whichever is lower.
In your case, it's the number of RDD partitions. You can increase that by repartitioning your RDD as follows:
data = sc.textFile(bucket).repartition(sc.defaultParallelism * 3)
Why sc.defaultParallelism * 3?
The Spark Tuning guide recommends having 2-3 tasks per core, and sc.defaultParallelism gives you the number of cores in your cluster.
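As a rough illustration of the reasoning above, here is a plain-Python sketch (no Spark required) of the parallelism bound and the repartitioning target. The function names are made up, and the figure of 4 cores per slave is an assumption for the example, not something stated in the question.

```python
def max_parallel_tasks(num_partitions, total_cores):
    """Upper bound on tasks running at once: partitions or cores, whichever is lower."""
    return min(num_partitions, total_cores)

def suggested_partitions(total_cores, tasks_per_core=3):
    """Partition count following the 2-3 tasks per core guideline."""
    return total_cores * tasks_per_core

total_cores = 10 * 4  # 10 slaves, assuming a hypothetical 4 cores each

# A single gzipped file gives 1 partition, so only 1 task can run.
print(max_parallel_tasks(1, total_cores))

# After repartitioning to 3 tasks per core, the cores become the limit.
target = suggested_partitions(total_cores)
print(target, max_parallel_tasks(target, total_cores))
```

Note that the initial read of the gzipped file is still done by one task; repartitioning only spreads the work that comes after it.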

Is it possible to specify which tasktrackers to use in a MapReduce job?

We have two types of jobs in our Hadoop cluster. One job uses MapReduce HBase scanning, the other one is just pure manipulation of raw files in HDFS. Within our HDFS cluster, part of the datanodes are also HBase regionservers, but others aren't. We would like to run the HBase scans only in the regionservers (to take advantage of the data locality), and run the other type of jobs in all the datanodes. Is this idea possible at all? Can we specify which tasktrackers to use in the MapReduce job configuration?
Any help is appreciated.

Time taken by MapReduce jobs

I am new to Hadoop and MapReduce. I have a problem running my data in Hadoop MapReduce: I want the results to be returned in milliseconds. Is there any way I can execute my MapReduce jobs in milliseconds?
If not, what is the minimum time Hadoop MapReduce can take on a fully distributed multi-node cluster (5-6 nodes)?
The file to be analyzed in Hadoop MapReduce is around 50-100 MB.
The program is written in Pig. Any suggestions?
For ad hoc real-time querying of data, use Impala or Apache Drill (WIP). Drill is based on Google Dremel.
Hive jobs get converted into MapReduce, so Hive is also batch oriented in nature and not real time. A lot of work is going on to improve the performance of Hive, though.
It's not possible (AFAIK). Hadoop is not meant for real-time work in the first place; it is best suited to batch jobs. The MapReduce framework needs some time to accept and set up the job, which you can't avoid, and I don't think it's a wise decision to buy ultra high-end machines to set up a Hadoop cluster. The framework also has to do a few things before actually starting the job, such as creating the logical splits of your data.
