Pig is not able to process a big file - Hadoop

I am new to Hadoop and Pig.
I have set up a Hadoop cluster with 3 nodes. I have written a Pig script that reads the data and runs aggregate functions on it.
When I execute it on a 4.8 GB file with 36 million records, Pig produces output in 51 minutes.
When I execute it on a 9.6 GB file with 72 million records, the Pig script crashes and Hadoop gives the following error:
Unable to recreate exception from backed error: AttemptID:attempt_1389348682901_0050_m_000005_3 Info:Container killed by the ApplicationMaster.
Job failed, hadoop does not return any error message
I am using Hadoop 2.2.0 and Pig 0.12.0.
My node configuration is:
Master: 2 CPU, 2 GB RAM
Slave1: 2 CPU, 2 GB RAM
Slave2: 1 CPU, 2 GB RAM
Could you please advise me on this?
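Since the container is being killed by the ApplicationMaster, one thing often worth checking on 2 GB nodes is the YARN/MapReduce container memory settings. A minimal sketch of passing them on the Pig command line (the values and script name below are only illustrative guesses, not tuned recommendations):

    # Hypothetical example: cap each task container at 1 GB with a smaller JVM heap,
    # so a task fits comfortably on a 2 GB node. Adjust to your own cluster.
    pig -Dmapreduce.map.memory.mb=1024 \
        -Dmapreduce.map.java.opts=-Xmx768m \
        -Dmapreduce.reduce.memory.mb=1024 \
        -Dmapreduce.reduce.java.opts=-Xmx768m \
        myscript.pig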

After trying things with Pig, I moved to Hive.
What I observed when I was using Pig:
I was uploading the file into HDFS and then loading it in Pig, so Pig was loading that file again; I was processing the file twice.
For my scenario Hive fits. I upload the file into HDFS and load that file in Hive, which takes a few milliseconds, because Hive works seamlessly with files already in HDFS; there is no need to load the data again into Hive tables, and that saves a lot of time.
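A minimal sketch of that flow, assuming a tab-delimited file and an external table whose name, columns, and HDFS path are made up for illustration:

    # Hypothetical example: upload once, then point an external Hive table at the
    # directory, so the data is not copied a second time into a managed table.
    hdfs dfs -mkdir -p /data/records
    hdfs dfs -put records.txt /data/records/

    hive -e "
    CREATE EXTERNAL TABLE records (id INT, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/records/';
    "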
Both components are good; for me, Hive fits better.
Thanks all for your time and advice.

Related

How can I increase the size of the input data file processed by a map-reduce job?

Currently I am using a cluster with the following configuration:
1 namenode and 5 datanodes. Each datanode has 8.7 TB of disk and 32 GB of RAM.
When I try to execute a map-reduce job on more than 300 GB of data I get an error, but when I run the same code on a dataset below 300 GB it executes without any problem. It looks like my cluster cannot process more than 300 GB of data; is that the case? Can I run a map-reduce job on a dataset above 300 GB, and what configuration do I need to change? Do I need to make changes in my drivers?

Hive data processing taking longer than expected

I'm facing an issue with ORC data in Hive and need some suggestions if someone has faced a similar problem.
I have a huge amount of data stored in a Hive table (partitioned and stored as ORC). The ORC data size is around 4 TB. I'm trying to copy this data to an uncompressed, plain Hive table (same table structure).
The process runs forever and occupies a huge amount of non-DFS storage along the way. At present the process has been running for 12 hours and has occupied 130 TB of non-DFS storage, which is very abnormal for a Hadoop cluster with 20 servers.
Below are my parameters:
Hadoop running: HDP 2.4
Hive: 0.13
No. of servers: 20 (2 NN included)
I wonder what a simple join or a normal analytics operation on this ORC table would do, given that in theory the ORC format should improve performance for basic DML queries.
Can someone please let me know if I'm doing something wrong, or is this normal behavior? This is my first experience with ORC data.
For starters, I noticed that the YARN log files are being created at a huge size, and they mostly contain error logs.
Thanks

When to move data to HDFS/Hive?

So I'm developing an application that is expected to deal with large amounts of data, and as such I've decided to use Hadoop to process it.
My services node and datanodes are separate from the webapp, so I'm using HttpFS to let the app communicate with Hadoop.
So, whenever a new row of data is generated in my application, should I immediately call the corresponding HttpFS URL to append the data to an HDFS file? Or should I write this data to a file on the web server and upload it to HDFS with a cron job, for example every hour?
Should I have the Hive table updated or should I just load the data in there whenever I need to query it?
I'm pretty new to Hadoop so any link that could help will also be useful.
I prefer the approach below.
Do not call the HttpFS URL to append data to an HDFS file for every row update. HDFS is efficient when the data file size is more than 128 MB (in Hadoop 2.x) or 64 MB (in Hadoop 1.x).
Write the data on the web server. Use a rolling appender that rolls when the file size reaches a certain limit, in multiples of 128 MB, e.g. a 1 GB file.
You can have hourly cron jobs, but make sure that you are sending a big data file (e.g. 1 GB or a multiple of 128 MB) instead of just sending whatever log file has accumulated in 1 hour (see the sketch below).
Regarding loading of data, you can use internal or external Hive tables. Have a look at this article.
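A rough sketch of the cron-driven variant, assuming the rolling appender drops closed files of roughly 1 GB into a local staging directory (all paths and names here are made up for illustration):

    #!/bin/bash
    # Hypothetical hourly cron job: ship only files the appender has already
    # rolled (closed), so each upload is one large file rather than a partial log.
    STAGING=/var/log/myapp/rolled      # where the rolling appender leaves closed files
    HDFS_DIR=/data/myapp/incoming      # landing directory in HDFS (e.g. an external Hive table location)

    for f in "$STAGING"/*.log; do
        [ -e "$f" ] || continue                       # nothing rolled since the last run
        hdfs dfs -put "$f" "$HDFS_DIR"/ && rm "$f"    # delete locally only after a successful upload
    done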

How does Pig pass data from the first job to the next job

As we all know, Apache Pig is a data flow language. If I write a Pig script and Pig decides to split it and run two or more jobs to execute the task at hand, how does Pig store the data that it passes from job 1 to job 2?
I read the Pig documentation, which says:
"Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner."
(url : http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)
So does Pig have a writer that stores the output of an intermediate job in memory/RAM for better performance (spilling to disk if required), and has Pig implemented a reader that reads that data directly from memory to pass it to the next job for processing?
In MapReduce, we write the entire output to disk and then read it again before the next job starts.
Does Pig have an upper hand here, by implementing readers and writers that write to RAM (spilling if required) and read from RAM (and disk if required) for better performance?
Kindly share your expertise/views on the quoted passage from the Pig documentation: what does it actually mean?
Thanks in Advance,
Cheers :))
If a Pig script has multiple jobs, then the output of each job is written into a temporary folder in HDFS, which is defined by pig.temp.dir (the default is /tmp). See the Pig docs on "Storing Intermediate Results". Also, while the script is running, do hadoop fs -ls /tmp/pig*; sometimes when jobs are interrupted these folders are not cleaned up and need to be freed manually.
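For example, you can watch those temp folders while the script runs and point them at an explicit location that is easier to clean up afterwards (directory and script names below are just examples):

    # List Pig's intermediate output while a multi-job script is running
    hadoop fs -ls /tmp/pig*

    # Run the script with an explicit, per-user temp location instead of /tmp
    pig -Dpig.temp.dir=/user/myuser/pig_tmp myscript.pig

    # Free folders left behind by interrupted runs (inspect the listing first)
    hadoop fs -rm -r /tmp/pig*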
The spilling of bags refers to what happens within the mapper stage; there is no in-RAM communication between MR jobs.

Native MapReduce vs. HBase MapReduce

If I create an MR job using TableMapReduceUtil (HBase), it seems that the HBase scanner feeds data into the mapper, and the reducer output is converted to a specific HBase output format to be stored in an HBase table.
For this reason, I expect an HBase mapreduce job to take more time than a native MR job.
So how much longer does an HBase job take than native MR?
In regard to reads: going through HBase can be 2-3 times slower than native map/reduce that uses files directly.
In the recently announced HBase 0.98 they've added the capability to do map/reduce over HBase snapshots. You can see this presentation for details (slide 7 for API, slide 16 for speed comparison).
In regard to writes: you can write HFiles directly and then bulk load them into HBase. However, since HBase caches data and does bulk writes, you can also tune it and get comparable or better results.
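As an illustration of that write path, a common pattern is to have a MapReduce job emit HFiles and then hand them to HBase in one step. A hedged sketch using the stock ImportTsv and LoadIncrementalHFiles tools (table name, column mapping, and paths are made up):

    # Hypothetical example: generate HFiles with MapReduce instead of writing
    # through the HBase client API, then bulk load them into the table.
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        -Dimporttsv.columns=HBASE_ROW_KEY,cf:value \
        -Dimporttsv.bulk.output=/tmp/hfiles \
        mytable /data/input.tsv

    # Move the generated HFiles into the table's regions
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable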
