error when executing DUMP command in PIG - hadoop

grunt> a = load '/cleartrip/prodqueue_cleartrip-quick-Nov05th14-morning.txt' USING PigStorage('\t') AS
>> (col0:chararray, col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray, col6:chararray, col7:chararray);
grunt> b = limit a 10;
grunt> dump b;
Input(s):
Failed to read data from "/cleartrip/prodqueue_cleartrip-quick-Nov05th14-morning.txt"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
null -> null,
null

try to Copy ur Sample.txt to Home Folder..It might help you clearly
then get into the pig in Hadoop
or better do one thing
Run Pig in local file system
Pig -x local

Related

Performance issues of small files on Hive

I was reading an article regarding how small files degrade the performance of the hive query.
https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part regarding overloading the NameNode.
However, what he had said regrading map-reduce doesn't seem to happen. for both map-reduce and Tez.
When a MapReduce job launches, it schedules one map task per block of
data being processed
I don't see mapper task created per file.May the reason is, he is referring the version 1 of map-reduce and so much change haver been done after that.
Hive Version: Hive 1.2.1000.2.6.4.0-91
My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;
Data:
following code will create 100 small files it containing only few kb of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
However I see only one mapper and one reducer task being created for following query.
[root#sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
Same result with map-reduce.
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.36 sec HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
This is because the following configuration is taking effect
hive.hadoop.supports.splittable.combineinputformat
from the documentation
Whether to combine small input files so that fewer mappers are
spawned.
So essentially Hive can infer that the input is a group of small files smaller than the blocksize and combine them reducing the required number of mappers.

Json parse with elephantbird in Pig

I can't get the following data to parse in Pig. It's what the twitter API returns after getting all tweets from a certain user.
source data: (I removed some numbers to not invade on anyone's privacy by accident)
[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}, {another one goes here ....} ]
I have tried a lot of things but this is the current code I have:
REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';
--Load Json
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []);
describe loadJson;
--dump loadJson;
--PARSING JSON
--txt
--a = FOREACH loadJson GENERATE json#'text' AS ParsedInput;
dump loadJson;
c = FOREACH loadJson GENERATE flatten(json#'text') as (m:map[]);
If I'm not getting erros, I just get no returns (as in 0 bytes returned after the script is done running)
for instance:
success!
Input(s):
Successfully read 0 records (532459 bytes) from: "/user/cloudera/tweetwall"
Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-988640258/tmp-846532109"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
1. You need to give the root name for your input json
I added "tweets" as your root name
{"tweets":[<your input>]}
2. This is nested json, so you need to load your json file with 'nested' option in the loader
input.json
{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
PigScript:
REGISTER '/tmp/json-simple-1.1.jar';
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
loadJson = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
C = FOREACH B GENERATE FLATTEN(m#'text');
DUMP C;
Output:
(#Beace_ your nan makes me laugh with some of the things she comes out with)

MRJOB reducer gives no output on EMR but provides output when run in local machine

When I execute a MapReduce job on a local setup I get the desired output from the reducer while the same code on EMR does not produce any. I have a cluster setup of 1 master and 10 core.
This is the output. There is no error displayed
Map-Reduce Framework
Map input records=3000
Map output records=378
Map output bytes=36054
Map output materialized bytes=40448
Input split bytes=1420
Combine input records=0
Combine output records=0
Reduce input groups=179
Reduce shuffle bytes=40448
Reduce input records=378
Reduce output records=0
Spilled Records=756
Shuffled Maps =380
Failed Shuffles=0
Merged Map outputs=380
GC time elapsed (ms)=23484
CPU time spent (ms)=125780
Physical memory (bytes) snapshot=9989242880
Virtual memory (bytes) snapshot=52768247808
Total committed heap usage (bytes)=6517702656
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=711180681
File Output Format Counters
Bytes Written=0
Following the reducer code:
def reducer(self, key, val):
best = -60
best_name = None
lat = 0
longi = 0
yr = 0
genre = None
for hot, name,lat,longi,yr,genre in val:
if hot > best:
best = hot
best_name = name
lat = lat
longi = longi
yr = yr
genre = genre
yield (key,(best,best_name,lat,longi,yr,genre))

Hive failed with java.io.IOException(Max block location exceeded for split .... splitsize: 45 maxsize: 10)

The hive does need to process 45 files. Each size is about 1GB. After the mapper execution finished 100%, hive Failed with the error message above.
Driver returned: 1. Errors: OK
Hive history file=/tmp/hue/hive_job_log_hue_201308221004_1738621649.txt
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1376898282169_0441, Tracking URL = http://SH02SVR2882.hadoop.sh2.ctripcorp.com:8088/proxy/application_1376898282169_0441/
Kill Command = //usr/lib/hadoop/bin/hadoop job -kill job_1376898282169_0441
Hadoop job information for Stage-1: number of mappers: 236; number of reducers: 0
2013-08-22 10:04:40,205 Stage-1 map = 0%, reduce = 0%
2013-08-22 10:05:07,486 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 121.28 sec
.......................
2013-08-22 10:09:18,625 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7707.18 sec
MapReduce Total cumulative CPU time: 0 days 2 hours 8 minutes 27 seconds 180 msec
Ended Job = job_1376898282169_0441
Ended Job = -541447549, job is filtered out (removed at runtime).
Ended Job = -1652692814, job is filtered out (removed at runtime).
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job Submission failed with exception
'java.io.IOException(Max block location exceeded for split: Paths:/tmp/hive-beeswax-logging/hive_2013-08-22_10-04-32_755_6427103839442439579/-ext-10001/000009_0:0+28909,....,/tmp/hive-beeswax-logging/hive_2013-08-22_10-04-32_755_6427103839442439579/-ext-10001/000218_0:0+45856
Locations:10.8.75.17:...:10.8.75.20:; InputFormatClass: org.apache.hadoop.mapred.TextInputFormat
splitsize: 45 maxsize: 10)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 236 Cumulative CPU: 7707.18 sec HDFS Read: 63319449229 HDFS Write: 8603165 SUCCESS
Total MapReduce CPU Time Spent: 0 days 2 hours 8 minutes 27 seconds 180 msec
But I didn't set the maxsize. Executed many times but got the same error. I tried to add mapreduce.jobtracker.split.metainfo.maxsize property for hive. But in this case, hive failed without any map work.
set mapreduce.job.max.split.locations > 45
In our situation it solved problem.

Pig integrated with Cassandra: simple distributed query takes a few minutes to complete. Is this normal?

I set up a test integration of Cassandra + Pig/Hadoop. 8 nodes are Cassandra + TaskTracker nodes, 1 node is the JobTracker/NameNode.
I fired up the cassandra client and created a the simple bit of data listed in the Readme.txt in the Cassandra distribution:
[default#unknown] create keyspace Keyspace1;
[default#unknown] use Keyspace1;
[default#Keyspace1] create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
[default#KS1] set Users[jsmith][first] = 'John';
[default#KS1] set Users[jsmith][last] = 'Smith';
[default#KS1] set Users[jsmith][age] = long(42)
Then I ran the sample pig query listed in CASSANDRA_HOME (using pig_cassandra):
grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
grunt> cols = FOREACH rows GENERATE flatten(columns);
grunt> colnames = FOREACH cols GENERATE $0;
grunt> namegroups = GROUP colnames BY (chararray) $0;
grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
grunt> orderednames = ORDER namecounts BY $0;
grunt> topnames = LIMIT orderednames 50;
grunt> dump topnames;
It took about 3 minutes to complete.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.0 0.9.1 root 2012-01-12 22:16:53 2012-01-12 22:20:22 GROUP_BY,ORDER_BY,LIMIT
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201201121817_0010 8 1 12 6 9 21 21 21 colnames,cols,namecounts,namegroups,rows GROUP_BY,COMBINER
job_201201121817_0011 1 1 6 6 6 15 15 15 orderednames SAMPLER
job_201201121817_0012 1 1 9 9 9 15 15 15 orderednames ORDER_BY,COMBINER hdfs://xxxx/tmp/temp-744158198/tmp-1598279340,
Input(s):
Successfully read 1 records (3232 bytes) from: "cassandra://Keyspace1/Users"
Output(s):
Successfully stored 3 records (63 bytes) in: "hdfs://xxxx/tmp/temp-744158198/tmp-1598279340"
Counters:
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
There were no errors or warnings in the logging.
Is this normal, or is there something wrong?
Yes this is normal because running a Map/Reduce job on Hadoop usually takes about 1 minute just for startup. Pig generates multiple Map/Reduce jobs dependent on the complexity of the script.

Resources