When we use MR job with HDFS input / output then MR job behave as transactional i.e. it gets executed successfully or in case of failure data written to HDFS gets rolled back, we do not get partial results say, 3 out of 10 lines are present in output.
But when we run same MR job over HBase it behave as non-transactional i.e. if I have to put 10 objects to HTable and I called context.write(...) 3 times and failed on 4th iteration then I can see 3 puts in HBase though MR job has failed.
Is there any way via which we can have transactional MR job over HBase ? i.e. either entire output is written to HBase or no-output is written to HBase.
Thanks in advance.
Related
I have a spark code that appends data from a hive table to parquet files partitioned on dates. The code runs absolutely correct when executed from the spark shell and the parquet files show the exact same number of rows as present in the hive table for the corresponding date.
However, when the same code is executed by putting the code in a jar file, which is called upon by a spark submit command, and the spark submit command is scheduled to execute daily at 9 AM via Nifi, the number of rows in the parquet partition files are coming out to be less. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
•Data on the source hive table gets updated by 4 AM approx
•Initially our Nifi job was scheduled to start running at 4:45 AM but the number of records did not match. On doing a manual update from the spark shell post 6 AM, the data was an exact match.
•Hence, we scheduled the job to run at 7 AM. On doing this, when the number of records were too less (approx. 20000 on weekends) as compared to weekdays (in the range of 150000 to >200000 records), the data got updated correctly via the Nifi Job. Again a manual run was done to backfill the missing data.
•Again, we postponed the job to 9 AM. Post doing this, there were 2 days when the number of records matched (between 160000 to 200000), however, since Jul-31, the data hasn't matched at all, irrespective of the number of records on any of the days, and we are having to do a manual backfill everyday.
We are unable to figure out any specific reason that maybe causing the code to run correctly from the spark shell at any time, but giving incorrect results from Nifi when Nifi is just schedculed to execute the spark submit command to run the jar file containing the same spark code.
Please help me with understanding why this would be happening and how I can fix this.
P.S.: I have checked the Nifi log files, and could not find any of the scheduled jobs giving an error.
I have a Spark job which reads some timeseries data and pushes that to HBASE using HBASE client API. I am executing this Spark job
on a 10 node cluster. Say at first when spark kicks off it picks machine1,machine2,machine3 as its executors. Now when the job inserts a row to HBASE. Below is what my undersatnding on what it does.
Based on the row key a particular region(from the META) would be chosen and that row will be pushed to that RegionServer's memstore and WAL and once the memestore is full it would be flushed to the disk.Now if assume a particular row is being processed by a executor on machine2 and the regionserver which handles that region to which the put is to be made is on machine6. Will the data be transferred from machine2 to machine6 over network and then the data will be stored in memstore of machine6. Or spark will wisely launch an executor on that machine during write(if the dynamic allocation is turned on) and pushes to it?
Spark does not know what happens in your job, it will not start an executor on the same node where writes happen. Data will be transferred over the network.
I have a MR job which runs fine in the cluster.
After the job completion I'm able to get YARN logs but I couldn't find the MR job counters like no of input records, output records.
Is it possible to get that information after job completion?
As we all know, Apache Pig is a data flow language. If I write a Pig Script and the Pig decides to split and run two or more jobs to execute the task in hand, so How does Pig Store the data which it passes from job1 to job 2. ???!!
I read the Pig documentation which says :-
"Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner."
(url : http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)
So Does Pig has a writer which stores the output of an intermediate job in Memory / RAM for better performance (spill to disk if required) and then if PIG has implemented a Reader which reads the data directly from memory to pass that data to the next Job for Processing???
In Mapreduce, we write the entire data to disk and then read it again for the next job to start.
Does Pig has a upper hand here, by implementing readers and writers which writes in RAM/memory (spill if required) and reads from RAM (and disk if required) for better Performance.
Kindly share your expertise/ views on the highlighted comment from the PIG documentation as to what does it actually mean or is stating otherwise.
Thanks in Advance,
Cheers :))
If pig script has multiple jobs,than output of each job is written into a temporary folder in HDFS, which is defined by pig.temp.dir (default is /tmp). See Pig Docs "Storing Intermediate Results". Also while running script do hadoop fs -ls /tmp/pig*, sometimes when jobs are interrupted, these folders are not cleaned up and need to be freed manually.
The spilling of the bags refers to what happens in the Mapper stage and there is not RAM communication between MR jobs.
In the first map reduce job I am processing an HBase table and outputting a smaller list of the rowkeys. I need to use this list of strings in order to process another map reduce job which is pulling from a different HBase table and outputting to another Hbase table. What is the proper way to store and access the ouput of the first map reduce job?
Hadoop doesn't support streaming the output of one MR job to another. So, the output of the first MR job has to be stored in HDFS (or some other persistent storage) and then read in the second MR job. Create a DAG of jobs using Oozie or Azkaban. For a simple work flow use Hadoop's JobControl API.
Apache Tez which is still in the incubator phase allows streaming of data across MR tasks. As mentioned, Tez is still in the Incubator stage, so use it with a bit of caution.