Pig Latin distinguishing Map or Reduce queries - hadoop

I have the following data sample:
AGE,EDU,SEX,SALARY
67,10th,Male,<=50K
17,10th,Female,<=50K
40,Assoc-voc,Male,>50K
35,Assoc-voc,Male,<=50K
57,Assoc-voc,Male,<=50K
49,Assoc-voc,Male,>50K
42,Bachelors,Male,>50K
30,Bachelors,Male,>50K
23,Bachelors,Female,<=50K
==============================================
I created the following Pig Latin/hadoop script:
sensitive = LOAD '/mdsba' using PigStorage(',') as (AGE,EDU,SEX,SALARY);
--Filter the data by salary
Data_filter1 = FILTER sensitive by (SALARY matches '<=50K');
Data_filter2 = FILTER sensitive by (SALARY matches '>50K');
--group both filters
B = FOREACH (GROUP Data_filter1 BY (AGE, EDU, SEX)) GENERATE Data_filter1;
C = FOREACH (GROUP Data_filter2 BY (AGE, EDU, SEX)) GENERATE Data_filter2;
DUMP B;
DUMP C;
=============================================================
Is there any way to determine whether the aliases B, C, Data_filter1, or Data_filter2 run in the map phase or the reduce phase? The following report is generated at the end of the job:
Elapsed: 35sec
Diagnostics:
Average Map Time: 12sec
Average Shuffle Time: 10sec
Average Merge Time: 0sec
Average Reduce Time: 2sec
With many thanks

Yes. When you launch the job you'll see a log line like this:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Alias1[73,14] C: Alias2[20, 9] R: Alias3[90, 78]
M stands for mapper, C for combiner, and R for reducer. In the general case, though, a given alias can run in both the map and the reduce phase.
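If you want to see this mapping before the job runs, Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans for an alias, and the MapReduce plan splits each job into a "Map Plan" and a "Reduce Plan". A minimal sketch against the script above (the output directory and script file name are just examples):
grunt> EXPLAIN B;                                      -- plans for one alias
grunt> EXPLAIN -out /tmp/plan -script myscript.pig;    -- write the plans for a whole script to a directory
The MapReduce plan answers the same question as the M:/C:/R: line, but in more detail and without having to run the job.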

Related

Performance issues of small files on Hive

I was reading an article about how small files degrade the performance of Hive queries.
https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part regarding overloading the NameNode.
However, what the article says regarding MapReduce doesn't seem to happen, for either MapReduce or Tez.
When a MapReduce job launches, it schedules one map task per block of
data being processed
I don't see a mapper task created per file. Maybe the reason is that the article refers to version 1 of MapReduce, and a lot has changed since then.
Hive Version: Hive 1.2.1000.2.6.4.0-91
My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;
Data:
The following command creates 100 small files, each containing only a few KB of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
However, I see only one mapper and one reducer task being created for the following query.
[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
Same result with map-reduce.
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.36 sec HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
This is because the following configuration is taking effect:
hive.hadoop.supports.splittable.combineinputformat
From the documentation:
Whether to combine small input files so that fewer mappers are
spawned.
So, essentially, Hive can infer that the input is a group of small files smaller than the block size and combine them, reducing the required number of mappers.
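If you want to confirm that split combining is what collapses everything into one mapper, a rough sketch is to turn it off (or shrink the grouping/split sizes), re-run the query, and compare the task counts. The property names below exist in Hive, MapReduce, and Tez, but the values are purely illustrative and the defaults vary by distribution:
set hive.hadoop.supports.splittable.combineinputformat=false;
-- MR engine: cap how much data a single split (and therefore a mapper) covers
set mapreduce.input.fileinputformat.split.maxsize=1048576;
-- Tez engine: split grouping is controlled separately
set tez.grouping.max-size=1048576;
select max(salary) from temp.emp_orc_small_files;
With combining disabled you would expect the mapper count to move toward one per file, which is exactly the behaviour the article describes and the reason the combining exists.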

Hive not running Map Reduce with "where" clause

I'm trying out something simple in Hive on HDFS.
The problem is that queries do not run MapReduce when I use a "where" clause. However, MapReduce runs for count(*) and even for group by clauses.
Here's data and queries with result:
Create External Table:
CREATE EXTERNAL TABLE testtab1 (
id STRING, org STRING)
row format delimited
fields terminated by ','
stored as textfile
location '/usr/ankuchak/testtable1';
Simple select * query:
0: jdbc:hive2://> select * from testtab1;
15/07/01 07:32:46 [main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+---------------+---------------+--+
| testtab1.id | testtab1.org |
+---------------+---------------+--+
| ankur | idc |
| user | idc |
| someone else | ssi |
+---------------+---------------+--+
3 rows selected (2.169 seconds)
Count(*) query
0: jdbc:hive2://> select count(*) from testtab1;
Query ID = ankuchak_20150701073407_e7fd66ae-8812-4e02-87d7-492f81781d15
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
15/07/01 07:34:08 [HiveServer2-Background-Pool: Thread-40]: ERROR mr.ExecDriver: yarn
15/07/01 07:34:08 [HiveServer2-Background-Pool: Thread-40]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1435425589664_0005, Tracking URL = http://slc02khv:8088/proxy/application_1435425589664_0005/
Kill Command = /scratch/hadoop/hadoop/bin/hadoop job -kill job_1435425589664_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
15/07/01 07:34:16 [HiveServer2-Background-Pool: Thread-40]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-07-01 07:34:16,291 Stage-1 map = 0%, reduce = 0%
2015-07-01 07:34:23,831 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.04 sec
2015-07-01 07:34:30,102 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.41 sec
MapReduce Total cumulative CPU time: 2 seconds 410 msec
Ended Job = job_1435425589664_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.41 sec HDFS Read: 6607 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 410 msec
OK
+------+--+
| _c0 |
+------+--+
| 3 |
+------+--+
1 row selected (23.527 seconds)
Group by query:
0: jdbc:hive2://> select org, count(id) from testtab1 group by org;
Query ID = ankuchak_20150701073540_5f20df4e-0bd4-4e18-b065-44c2688ce21f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
15/07/01 07:35:40 [HiveServer2-Background-Pool: Thread-63]: ERROR mr.ExecDriver: yarn
15/07/01 07:35:41 [HiveServer2-Background-Pool: Thread-63]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1435425589664_0006, Tracking URL = http://slc02khv:8088/proxy/application_1435425589664_0006/
Kill Command = /scratch/hadoop/hadoop/bin/hadoop job -kill job_1435425589664_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
15/07/01 07:35:47 [HiveServer2-Background-Pool: Thread-63]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2015-07-01 07:35:47,200 Stage-1 map = 0%, reduce = 0%
2015-07-01 07:35:53,494 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2015-07-01 07:36:00,799 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1435425589664_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.53 sec HDFS Read: 7278 HDFS Write: 14 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
+-------+------+--+
| org | _c1 |
+-------+------+--+
| idc | 2 |
| ssi | 1 |
+-------+------+--+
2 rows selected (21.187 seconds)
Now the simple where clause:
0: jdbc:hive2://> select * from testtab1 where org='idc';
OK
+--------------+---------------+--+
| testtab1.id | testtab1.org |
+--------------+---------------+--+
+--------------+---------------+--+
No rows selected (0.11 seconds)
It would be great if you could provide me with some pointers.
Please let me know if you need further information in this regard.
Regards,
Ankur
A map job is occurring in your last query, so it's not that MapReduce isn't happening. However, that query should be returning some rows. The likely culprit is that, for some reason, no value in the org column matches "idc". Check your table and make sure the rows for ankur and user really contain the string idc.
Try this to see if you get any results:
Select * from testtab1 where org rlike '.*(idc).*';
or
Select * from testtab1 where org like '%idc%';
These will grab any row that has a value containing the string 'idc'. Good luck!
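Another quick check, since exact equality is what fails here, is to look for invisible padding or case differences in the stored values; a small sketch (column and table names taken from the question, the alias is made up):
select concat('[', org, ']') as bracketed_org, length(org) from testtab1;
If this shows something like [idc ] or a length other than 3, the data was loaded with stray whitespace, and a predicate such as where trim(org)='idc' should then match. Both concat/length and trim are standard Hive string functions.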
Here are details of the same error, which was fixed recently. Try verifying the version you are using.

MRJOB reducer gives no output on EMR but provides output when run in local machine

When I execute a MapReduce job on a local setup I get the desired output from the reducer, while the same code on EMR does not produce any. I have a cluster of 1 master and 10 core nodes.
This is the counter output; no error is displayed:
Map-Reduce Framework
Map input records=3000
Map output records=378
Map output bytes=36054
Map output materialized bytes=40448
Input split bytes=1420
Combine input records=0
Combine output records=0
Reduce input groups=179
Reduce shuffle bytes=40448
Reduce input records=378
Reduce output records=0
Spilled Records=756
Shuffled Maps =380
Failed Shuffles=0
Merged Map outputs=380
GC time elapsed (ms)=23484
CPU time spent (ms)=125780
Physical memory (bytes) snapshot=9989242880
Virtual memory (bytes) snapshot=52768247808
Total committed heap usage (bytes)=6517702656
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=711180681
File Output Format Counters
Bytes Written=0
Following is the reducer code:
def reducer(self, key, val):
    # Track the "hottest" record seen for this key. Separate best_* names
    # avoid the original shadowing bug, where assignments like lat = lat
    # were no-ops and the emitted values came from the last record in the
    # group rather than the best one.
    best = -60
    best_name = None
    best_lat = 0
    best_longi = 0
    best_yr = 0
    best_genre = None
    for hot, name, lat, longi, yr, genre in val:
        if hot > best:
            best = hot
            best_name = name
            best_lat = lat
            best_longi = longi
            best_yr = yr
            best_genre = genre
    yield (key, (best, best_name, best_lat, best_longi, best_yr, best_genre))
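One way to narrow down why "Reduce output records" is 0 is to emit custom Hadoop counters from inside the reducer; they appear on the EMR counter page next to the built-in ones, so you can see whether records enter the loop and whether the yield is ever reached. A rough sketch using mrjob's increment_counter (the counter group and names are made up):
def reducer(self, key, val):
    best = -60
    best_record = None
    for hot, name, lat, longi, yr, genre in val:
        # Confirms records reach the loop and unpack into six fields.
        self.increment_counter('debug', 'reducer_records_seen', 1)
        if hot > best:
            best = hot
            best_record = (name, lat, longi, yr, genre)
    if best_record is not None:
        # Confirms the reducer actually emits something for this key.
        self.increment_counter('debug', 'reducer_keys_yielded', 1)
        yield key, (best,) + best_record
If reducer_records_seen matches the 378 reduce input records but reducer_keys_yielded stays at 0, the comparison hot > best is never true for the values actually arriving (for example because hot is not the numeric type the comparison assumes), which would also explain why the same code behaves differently on the local runner.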

hadoop map reduce job pending too long

I have a question about running a Hadoop MapReduce job. I have a table staff, partitioned by join date.
Create statement like that:
create table staff (id int, age int) partitioned by (join_date string) row format delimited fields terminated by '\;';
I put some data into partition '20130921'; when I execute the statement below, the result is OK:
select count(*) from staff where join_date='20130921';
But when I execute it on partition '20130922' (a partition without data), the MapReduce job stays pending for a very long time and seems to run forever:
hive> select count(*) from staff where join_date='20130922';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201309231116_0131, Tracking URL = ....jobid=job_201309231116_0131
Kill Command = /u01/hadoop-0.20.203.0/bin/../bin/hadoop job -kill job_201309231116_0131
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2013-09-23 17:19:07,182 Stage-1 map = 0%, reduce = 0%
The JobTracker shows the reduce task pending, and the job never seems to finish.
I'm using hadoop-0.20.203.0 and hive-0.10.0. I googled all day but didn't find any topic with the same problem; please help me.
Best regards.
This seems to be a problem with your Hive installation. I came across a similar problem. You can try restarting the Hive Server and the Hive Metastore; this fixed my problem.
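On an older, manually managed installation like hadoop-0.20.203 with hive-0.10, restarting usually means killing and relaunching the two Hive services by hand; a rough sketch (the process-name patterns and log paths are assumptions, adjust them to your setup):
# stop whatever HiveServer / metastore processes are running
pkill -f org.apache.hadoop.hive.metastore.HiveMetaStore
pkill -f org.apache.hadoop.hive.service.HiveServer
# start them again in the background
nohup hive --service metastore > /tmp/metastore.log 2>&1 &
nohup hive --service hiveserver > /tmp/hiveserver.log 2>&1 &
hive --service metastore and hive --service hiveserver are the standard launchers for that Hive generation.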

Pig integrated with Cassandra: simple distributed query takes a few minutes to complete. Is this normal?

I set up a test integration of Cassandra + Pig/Hadoop. 8 nodes are Cassandra + TaskTracker nodes, 1 node is the JobTracker/NameNode.
I fired up the Cassandra client and created the simple bit of data listed in the Readme.txt in the Cassandra distribution:
[default@unknown] create keyspace Keyspace1;
[default@unknown] use Keyspace1;
[default@Keyspace1] create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
[default@KS1] set Users[jsmith][first] = 'John';
[default@KS1] set Users[jsmith][last] = 'Smith';
[default@KS1] set Users[jsmith][age] = long(42);
Then I ran the sample pig query listed in CASSANDRA_HOME (using pig_cassandra):
grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
grunt> cols = FOREACH rows GENERATE flatten(columns);
grunt> colnames = FOREACH cols GENERATE $0;
grunt> namegroups = GROUP colnames BY (chararray) $0;
grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
grunt> orderednames = ORDER namecounts BY $0;
grunt> topnames = LIMIT orderednames 50;
grunt> dump topnames;
It took about 3 minutes to complete.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.0 0.9.1 root 2012-01-12 22:16:53 2012-01-12 22:20:22 GROUP_BY,ORDER_BY,LIMIT
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201201121817_0010 8 1 12 6 9 21 21 21 colnames,cols,namecounts,namegroups,rows GROUP_BY,COMBINER
job_201201121817_0011 1 1 6 6 6 15 15 15 orderednames SAMPLER
job_201201121817_0012 1 1 9 9 9 15 15 15 orderednames ORDER_BY,COMBINER hdfs://xxxx/tmp/temp-744158198/tmp-1598279340,
Input(s):
Successfully read 1 records (3232 bytes) from: "cassandra://Keyspace1/Users"
Output(s):
Successfully stored 3 records (63 bytes) in: "hdfs://xxxx/tmp/temp-744158198/tmp-1598279340"
Counters:
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
There were no errors or warnings in the logging.
Is this normal, or is there something wrong?
Yes, this is normal: running a MapReduce job on Hadoop usually takes about a minute just for startup, and Pig generates multiple MapReduce jobs depending on the complexity of the script. Your job stats show three chained jobs (the GROUP BY, the ORDER BY sampler, and the ORDER BY itself), so you are paying that startup cost roughly three times in a row.
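If you mainly want to sanity-check the script on tiny data, one option is to skip the cluster and run Pig in local mode, and EXPLAIN shows how many MapReduce jobs a script compiles into before you pay for them; a minimal sketch (the script file name is hypothetical):
$ pig -x local sample_query.pig     # LocalJobRunner, no per-job cluster startup cost
grunt> EXPLAIN topnames;            -- MapReduce plan; here it compiles into 3 chained jobs
Note that local mode reads the local filesystem by default, so the CassandraStorage loader still needs its Cassandra connection settings available; treat this as a rough way to separate Pig/Hadoop overhead from Cassandra read time rather than a drop-in replacement.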
