As we all know, Apache Pig is a data flow language. If I write a Pig script and Pig decides to split it into two or more MapReduce jobs to execute the task at hand, how does Pig store the data that it passes from job 1 to job 2?
I read the Pig documentation, which says:
"Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner."
(url : http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)
So does Pig have a writer that stores the output of an intermediate job in memory/RAM for better performance (spilling to disk if required), and then a reader that reads that data directly from memory to pass it to the next job for processing?
In MapReduce, we write the entire data to disk and then read it back again before the next job starts.
Does Pig have an upper hand here, by implementing readers and writers that write to RAM/memory (spilling if required) and read from RAM (and disk if required) for better performance?
Kindly share your expertise/views on the quoted passage from the Pig documentation: what does it actually mean, or is it stating something else?
Thanks in Advance,
Cheers :))
If a Pig script results in multiple jobs, then the output of each job is written to a temporary folder in HDFS, defined by pig.temp.dir (the default is /tmp). See "Storing Intermediate Results" in the Pig docs. While the script is running you can do hadoop fs -ls /tmp/pig*; sometimes, when jobs are interrupted, these folders are not cleaned up and need to be freed manually.
The spilling of bags refers to what happens within the mapper stage; there is no RAM-to-RAM communication between MR jobs.
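To make the cleanup step above concrete, here is a minimal sketch (not part of the original answer) that uses the Hadoop FileSystem API to list leftover Pig temp folders; the NameNode URL and the /tmp/pig* glob (mirroring the hadoop fs -ls command above) are assumptions to adapt to your cluster and your pig.temp.dir setting.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPigTempDirs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // List leftover Pig intermediate folders under the default pig.temp.dir (/tmp).
        // Delete them manually only after confirming no running Pig job still uses them.
        FileStatus[] leftovers = fs.globStatus(new Path("/tmp/pig*"));
        if (leftovers != null) {
            for (FileStatus status : leftovers) {
                System.out.println(status.getPath() + "\t" + status.getModificationTime());
            }
        }
        fs.close();
    }
}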
We have a log-collection agent running against HDFS; that is, the agent (like Flume) keeps collecting logs from some applications and then writes them to HDFS. The reading and writing run without a break, so the destination files on HDFS keep growing.
And here is the question, since the input data is changing continuously, what would happen to a MapReduce job if I set the collection agent's destination path as the job's input path?
FileInputFormat.addInputPath(job, new Path("hdfs://namenode:9000/data/collect"));
A MapReduce job processes only the data that is available when the job starts: the input splits are computed at submission time, so files that arrive afterwards are not picked up.
Map-Reduce is for batch data processing. For continuous data processing use tools like Storm or Spark Streaming.
If I create an MR job using TableMapReduceUtil (HBase), it seems that the HBase scanner feeds data into the mapper, and the reducer output is converted to the specific HBase output format to store it in an HBase table.
For this reason, I expect an HBase MapReduce job to take more time than a native MR job.
So, how much longer does an HBase job take than native MR?
In regard to reads: going through HBase can be 2-3 times slower than a native map/reduce job that reads the files directly.
In the recently announced HBase 0.98 they've added the capability to do map/reduce over HBase snapshots. You can see this presentation for details (slide 7 for API, slide 16 for speed comparison).
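For reference, here is a minimal sketch of the snapshot-based read path, assuming HBase 0.98+; the snapshot name, restore directory, and the row-counting mapper below are placeholders, not anything from the linked presentation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

    // Placeholder mapper that just counts the rows it reads from the snapshot.
    static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) {
            context.getCounter("snapshot", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-over-snapshot");
        job.setJarByClass(SnapshotScanJob.class);
        // Reads the snapshot's HFiles directly, bypassing the region servers.
        TableMapReduceUtil.initTableSnapshotMapperJob(
                "my_snapshot",                        // assumed snapshot name
                new Scan(),
                RowCountMapper.class,
                NullWritable.class,
                NullWritable.class,
                job,
                true,
                new Path("/tmp/snapshot-restore"));   // assumed scratch dir on HDFS
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}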
In regard to writes: you can write HFiles directly and then bulk load them into HBase. However, since HBase caches data and does bulk writes, you can also tune it and get comparable or better results.
In the first MapReduce job I am processing an HBase table and outputting a smaller list of the row keys. I need to use this list of strings in a second MapReduce job, which pulls from a different HBase table and outputs to another HBase table. What is the proper way to store and access the output of the first MapReduce job?
Hadoop doesn't support streaming the output of one MR job into another. So the output of the first MR job has to be stored in HDFS (or some other persistent storage) and then read by the second MR job. Create a DAG of jobs using Oozie or Azkaban; for a simple workflow, use Hadoop's JobControl API.
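A rough sketch of the JobControl route (the FirstJob/SecondJob helpers and the intermediate path are hypothetical): the second job simply reads the HDFS directory the first job wrote.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TwoStageFlow {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical helpers that configure each stage; job1 writes its row-key
        // list to /tmp/stage1-out and job2 uses that same path as its input.
        Job job1 = FirstJob.createJob(conf, "/tmp/stage1-out");
        Job job2 = SecondJob.createJob(conf, "/tmp/stage1-out");

        ControlledJob cj1 = new ControlledJob(job1, null);
        ControlledJob cj2 = new ControlledJob(job2, Collections.singletonList(cj1)); // runs only after cj1 succeeds

        JobControl control = new JobControl("two-stage-flow");
        control.addJob(cj1);
        control.addJob(cj2);

        // JobControl is a Runnable: run it in a thread and poll until both jobs finish.
        new Thread(control).start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}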
Apache Tez, which is still in the Apache Incubator, allows streaming of data across MR tasks; since it is still incubating, use it with a bit of caution.
I am using
HBase: 0.92.1-cdh4.1.2, and
Hadoop: 2.0.0-cdh4.1.2
I have a MapReduce program that loads data from HDFS into HBase using HFileOutputFormat in cluster mode.
In that MapReduce program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800,000-record
data set of about 7.3 GB, and it runs fine, but it does not run for a 900,000-record data set of about 8.3 GB.
In the 8.3 GB case my MapReduce program has 133 maps and one reducer; all maps complete successfully, but the reducer status stays Pending for a long time. There is nothing wrong with the cluster, since other jobs run fine and this job also runs fine up to 7.3 GB of data.
What could i be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/127.0.0.1:43455 has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503 GB refers to the free space available on one of the hard drives on that particular slave ("tracker_slave01.mydomain.com"), so the reducer apparently needs to copy all the data to a single drive.
The reason this happens is your table only has one region when it is brand new. As data is inserted into that region, it'll eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading chapter of the HBase book discusses this and presents two options for doing it. It can also be done via the HBase shell (see create's SPLITS argument, I think). The challenge, though, is defining your splits so that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
// Pre-split the new table into 100 regions covering the full positive int key space.
HTableDescriptor desc = new HTableDescriptor();
desc.setName(Bytes.toBytes(tableName));
desc.addFamily(new HColumnDescriptor("my_col_fam"));
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
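If you go the completebulkload route, the same step can also be driven from Java via LoadIncrementalHFiles, which is what the command-line tool wraps; the HFile directory and table name below are placeholders, so treat this as a sketch rather than a drop-in for your job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadHFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Directory of HFiles produced by the map-only job (one subdirectory per column family).
        Path hfileDir = new Path("/user/me/hfiles-out"); // assumed output path
        HTable table = new HTable(conf, "my_table");     // assumed pre-split table
        // Moves the HFiles into the table's regions, equivalent to the completebulkload tool.
        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
        table.close();
    }
}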
Your job is running with a single reducer, which means the entire data set is being processed by one task.
The main reason for this is that HFileOutputFormat starts a reducer that sorts and merges the data to be loaded into the HBase table.
Here, number of reducers = number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in the reducers. :)
You can get more details here:
http://databuzzprd.blogspot.in/2013/11/bulk-load-data-in-hbase-table.html
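For context, configureIncrementalLoad is what ties the reducer count to the region count: it inspects the table's region boundaries and sets up total-order partitioning, the sorting reducer, and one reducer per region. A minimal driver sketch follows; the input/output paths, table name, and the simple "rowkey,value" mapper are placeholders, not the asker's actual job.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Hypothetical mapper: parses "rowkey,value" text lines into KeyValues for column family "cf".
    static class HFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), kv);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hfile-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(HFileMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);

        FileInputFormat.addInputPath(job, new Path("/user/me/input"));        // assumed input
        FileOutputFormat.setOutputPath(job, new Path("/user/me/hfiles-out")); // assumed HFile output dir

        // Looks at the table's regions and sets the partitioner, the sorting
        // reducer, and number of reducers = number of regions.
        HTable table = new HTable(conf, "my_table");                          // assumed pre-split table
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}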
I'm confused as to how context.write works in a Hadoop reducer.
Why are there no locking issues in Hadoop reducers (if there is more than one reducer) if all of them are writing to the same file in HDFS?
Normally, if we wrote to the same file ourselves in a Hadoop mapper/reducer, we would get locking errors because we can't write to the same file concurrently.
If your MapReduce program runs on a multi-node cluster, then separate map and reduce tasks run on each node.
The reducer in MapReduce doesn't write to the file directly. It delegates this task to the OutputFormat, which is responsible for sinking the data: to a file, a database table, or any other location. FileOutputFormat sinks to a location in the Hadoop Distributed File System (HDFS); DBOutputFormat sinks to a database table (read this post).
For your question of file locks please have a look at this post at Yahoo Developer Network.
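To make the "no shared file" point concrete, here is a minimal sketch (the /in and /out paths are placeholders, and the identity Mapper/Reducer are used just to keep it short): with three reduce tasks, FileOutputFormat gives each task its own file under the output directory, part-r-00000 through part-r-00002, so there is nothing to lock.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiReducerOutput {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-reducer-output");
        job.setJarByClass(MultiReducerOutput.class);
        // Identity mapper and reducer: the point here is only the output layout.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Three reduce tasks: each writes its own part-r-0000N file, so no two
        // reducers ever touch the same HDFS file.
        job.setNumReduceTasks(3);

        FileInputFormat.addInputPath(job, new Path("/in"));    // assumed input dir
        FileOutputFormat.setOutputPath(job, new Path("/out")); // assumed output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}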