Pig on a single machine - Hadoop

Imagine that I have a file with 100 million records, and I want to use Pig to wrangle it.
I don't have a cluster, but I still want to use Pig for productivity reasons. Can I use Pig on a single machine, or will it perform poorly?
Will Pig simulate an MR job on a single machine, or will it use its own backend engine to execute the process?

Processing 100 million records with Hadoop on a single machine certainly won't give you good performance.
For development/testing purposes you can use a single machine with a small or moderate amount of data, but not in production.
Hadoop scales its performance roughly linearly as you add more nodes to the cluster.
A single machine can also act as a cluster.
Pig can run in two modes: local and mapreduce.
In local mode there are no Hadoop daemons and no HDFS.
In mapreduce mode, your Pig script is converted to MR jobs, which are then executed on the cluster.
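For example, the mode is chosen with Pig's -x switch (the script name here is hypothetical):

    # local mode: single JVM, local filesystem, no Hadoop daemons needed
    pig -x local myscript.pig

    # mapreduce mode (the default): the script is compiled into MR jobs
    pig -x mapreduce myscript.pig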
Hope it helps!

Related

Performance of Pig in local mode vs mapreduce mode

I have a Hadoop cluster with 3 nodes and 12 GB of data (about 1.5 million records). I understand that Pig can be run in local mode (for development purposes) and in mapreduce mode.
For a little research project I am comparing the processing times of running Pig in local and mapreduce mode.
When doing performance measurements, the processing time in local mode is much faster than in mapreduce mode. (My code consists of loading the data file using JsonLoader with a schema, filtering, and dumping the result.)
Is there a rule of thumb for when mapreduce mode is faster than local mode?
Thank you!
It's not clear how you've tuned the YARN cluster to accommodate your workload, or how large the files you're reading actually are.
In general, 12 GB is not enough data to warrant the use of Hadoop/MapReduce, assuming Pig can handle the processing on its own on a single machine.
However, if the files are split amongst the datanodes, and you have allocated enough resources to each of those 3 machines, then the job should complete faster than on just one machine.
You could further improve runtimes by running Pig on the Tez or Spark execution engines.
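As a hedged sketch of the kind of pipeline the question describes, and of switching engines (the file name, schema, and filter condition are invented for illustration):

    # write a minimal load/filter/dump script
    cat > pipeline.pig <<'EOF'
    raw = LOAD 'input.json' USING JsonLoader('id:long, status:chararray');
    ok  = FILTER raw BY status == 'active';
    DUMP ok;
    EOF

    pig -x mapreduce pipeline.pig   # default MR engine
    pig -x tez pipeline.pig         # Tez engine, available since Pig 0.14

The same script runs unchanged on either engine; only the -x flag differs.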

Pig local vs mapreduce mode performance comparison

I have set up a 3-node Hadoop cluster with Cloudera Manager (CDH4). When I ran a Pig job in mapreduce mode, it took double the time of local mode on the same data set. Is that expected behavior?
Also, is there any documentation available on performance tuning options for mapreduce jobs?
Thanks much for any help!
This is probably because you are using a toy dataset, and the overhead of mapreduce is larger than the benefit of parallelization.
A good start for performance tuning is the "Making Pig Fly" chapter of the "Programming Pig" book.
Another reason is that when you run in -x local mode, Pig does not do the same jar compilation that it does in mapreduce mode. With small data sets and a complex Pig script, the jar compilation time becomes noticeable.
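A simple way to make that fixed overhead visible is to time the same script in both modes on the same small input (the script name is illustrative):

    time pig -x local wordcount.pig
    time pig -x mapreduce wordcount.pig

On a toy dataset, the local run will typically finish before the mapreduce run has even completed job setup.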

1 big Hadoop and HBase cluster vs 1 Hadoop cluster + 1 HBase cluster

Hadoop will run a lot of jobs that read data from HBase and write data to HBase. Suppose I have 100 nodes; then there are two ways I can build my Hadoop/HBase cluster:
1. A 100-node Hadoop & HBase cluster (1 big Hadoop & HBase)
2. Separate the database (HBase), so we have two clusters: a 60-node Hadoop cluster and a 40-node HBase cluster (1 Hadoop + 1 HBase)
Which option is better? Why?
Thanks.
I would say option 2 is better. My reasoning: even though your requirement is mostly to run lots of mapreduce jobs that read and write data to HBase, there is a lot going on behind the scenes for HBase to optimize those reads and writes for your submitted jobs. The HMaster will have to do load balancing often, unless your region keys are perfectly balanced. Table hotspotting can occur. On the region servers there will be major compactions, and if your JVM tuning skills are not that good, occasional stop-the-world garbage collections can happen. All the regions may start splitting at the same time. A region server can go down, and so on. The point is: tuning HBase takes time.

If you have just one node dedicated to HBase, the probability of the aforementioned problems is higher. It's always better to have more than one node, so all the performance pressure doesn't land on a single node. And by the way, the key selling point of HBase is its inherently distributed nature; you wouldn't want to kill that. All that said, you can experiment with the ratio of nodes between Hadoop and HBase, maybe 70:30 or 80:20. Mileage may vary according to your application requirements.
The main reason to separate HBase and Hadoop is when they have different usage scenarios, i.e. HBase does low-latency random reads and writes while Hadoop does sequential batch processing. In this case the different access patterns can interfere with each other, and it can be better to separate the clusters.
If you're just using HBase in batch mode, you can use the same cluster (and probably rethink using HBase at all, since it is slower than raw Hadoop in batch).
Note that you would need to tune HBase along the lines mentioned by Chandra Kant regardless of the path you take.
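As one concrete example of the kind of tuning mentioned above, pre-splitting a table spreads writes across region servers instead of hotspotting a single region; a minimal sketch using the HBase shell (table name and split points are illustrative):

    echo "create 'events', 'cf', SPLITS => ['row1000', 'row2000', 'row3000']" | hbase shell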

How to run Hive mapreduce tasks in all available nodes?

I am new to the Hadoop and Hive world.
I have written a Hive query that processes 189 million rows (a 40 GB file). While I am executing the query, Hive appears to run on a single machine while generating many map and reduce tasks. Is that expected behavior?
I have read in many articles that Hadoop is a distributed processing framework. My understanding was that Hadoop splits your job into multiple tasks, distributes those tasks to different nodes, and once the tasks finish, the reducer joins the output. Please correct me if I am wrong.
I have 1 master and 2 slave nodes. I am using Hadoop 2.2.0 and Hive 0.12.0.
Your understanding of Hive is correct: Hive translates your query into a Hadoop job, which in turn gets split into multiple tasks and distributed to the nodes (map > sort & shuffle > reduce/aggregate > return to the Hive CLI).
If you have 2 slave nodes, Hive will split its workload across the two, provided your cluster is properly configured.
That being said, if your input file is not splittable (for example, it's a GZIP-compressed file), Hadoop will not be able to split/parallelize the work, and you will be stuck with a single input split and thus a single mapper, limiting the workload to a single machine.
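One workaround, sketched under the assumption of a single gzipped input (the paths and file names are illustrative): recompress with bzip2, a codec Hadoop can split, so several mappers can share the file:

    hdfs dfs -get /data/big.json.gz .
    gunzip big.json.gz && bzip2 big.json
    hdfs dfs -put big.json.bz2 /data/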
Thank you all for your quick replies.
You are all correct: my job is converted into different tasks and distributed to the nodes.
When I checked the Hadoop Web UI, at the first level it showed the job running on a single node, but when I drilled down further it showed the mappers and reducers and where they are running.
Thanks :)

Time taken by MapReduce jobs

I am new to Hadoop and MapReduce. I have a problem running my data through Hadoop MapReduce: I want the results to be given in milliseconds. Is there any way I can execute my MapReduce jobs in milliseconds?
If not, what is the minimum time Hadoop MapReduce can take on a fully distributed cluster (5-6 nodes)?
The file size to be analyzed is around 50-100 MB.
The program is written in Pig. Any suggestions?
For ad-hoc, real-time querying of data, use Impala or Apache Drill (still a work in progress). Drill is based on Google's Dremel.
Hive jobs get converted into MapReduce, so Hive is also batch-oriented in nature and not real-time. A lot of work is going on to improve the performance of Hive, though.
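For example, Impala answers ad-hoc SQL through its own long-running daemons instead of launching MR jobs; a minimal sketch (host, port, and table name are illustrative):

    impala-shell -i impalad-host:21000 -q "SELECT COUNT(*) FROM logs"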
It's not possible (AFAIK). Hadoop is not meant for real-time work in the first place; it is best suited for batch jobs. The MapReduce framework needs some time to accept and set up the job, which you can't avoid. And I don't think it's a wise decision to buy ultra-high-end machines to set up a Hadoop cluster. Also, the framework has to do a few things before actually starting the job, such as creating the logical splits of your data.
