Running Apache Pig tutorial problems - hadoop

I am having some difficulties running the "standard" Pig tutorial script, script1-hadoop.pig.
Because of how users are set up on our cluster, I had to modify the example a bit. The standard tutorial expects all files to be in / on HDFS, which I cannot use in my case, so I created a /pig directory for that purpose:
drwxrwxrwx - hdfs hdfs 0 2014-03-31 11:15 /pig
with the uploaded content
-rw-r--r-- 3 jakub hdfs 10408717 2014-03-31 10:41 /pig/excite.log.bz2
I also modified the Pig script script1-hadoop.pig to reflect those changes, as follows (mainly just the LOAD and STORE commands):
raw = LOAD '/pig/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
STORE ordered_uniq_frequency INTO '/pig/script1-hadoop-results' USING PigStorage();
I run the Pig script:
[jakub@hadooptools pigtmp]$ pig script1-hadoop.pig
but with no luck; I get this error:
2014-03-31 10:15:11,896 [main] ERROR - You don't have permission to perform the operation. Error from the server: Permission denied: user=jakub, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
at org.apache.hadoop.ipc.RPC$
at org.apache.hadoop.ipc.Server$Handler$
at org.apache.hadoop.ipc.Server$Handler$
at Method)
at org.apache.hadoop.ipc.Server$
I am not quite sure why the Pig script is trying to write to / on HDFS. I know that Pig can store some intermediate results on HDFS, so I modified the pig.temp.dir property (in /etc/pig/conf/) and created the location /pig/tmp on HDFS:
drwxrwxrwx - jakub hdfs 0 2014-03-31 11:15 /pig/tmp
Any idea what might be wrong? Pig works fine in local mode.

The user running the Pig script must have permission to write to the tmp directory you created, and /user/pig_user_running (the HDFS home directory of that user) must also exist on the cluster with permissions that allow that user to write there.
The super-user on HDFS is the user under which the namenode process is running, which is typically hdfs.
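To make the "Permission denied" message easier to read, here is a minimal Python sketch of a POSIX-style permission check. It is a simplified model and not Hadoop's actual FSPermissionChecker: the function name can_access is made up for illustration, jakub's group membership is an assumption, and the super-user bypass and ACLs are ignored.

```python
def can_access(user, groups, owner, group, mode, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x').

    `mode` is the 9-character permission string from a listing,
    e.g. 'rwxr-xr-x' (the leading 'd' or '-' already stripped).
    Simplified model: super-user bypass and ACLs are ignored.
    """
    if user == owner:
        triplet = mode[0:3]   # owner class
    elif group in groups:
        triplet = mode[3:6]   # group class
    else:
        triplet = mode[6:9]   # everyone else
    return want in triplet


# inode="/":hdfs:hdfs:drwxr-xr-x from the error; jakub's groups are assumed.
root_ok = can_access("jakub", ["jakub"], "hdfs", "hdfs", "rwxr-xr-x", "w")
# /pig/tmp is drwxrwxrwx and owned by jakub, so writing there is allowed.
tmp_ok = can_access("jakub", ["jakub"], "jakub", "hdfs", "rwxrwxrwx", "w")
print(root_ok, tmp_ok)
```

The stack trace shows the check happening inside mkdirs against "/", which suggests the job tried to create a directory whose nearest existing ancestor was the root. That is consistent with a missing home directory for the user.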


Cannot access Hive internal tables-AccessControlException

My user ID and my team cannot access any of the internal tables in the Hive DB. When we run queries in Hue or from the CLI, we get an 'AccessControlException'. Please find the log below:
INFO : set mapreduce.job.reduces=<number>
INFO : Cleaning up the staging area maprfs:/var/mapr/cluster/yarn/rm/staging/keswara/.staging/job_1494760161412_0139
ERROR : Job Submission failed with exception
(User keswara(user id 1802830393) does not have access to
maprfs:///user/hive/warehouse/bistore_sit.db/wt_consumer/d_partition_number=0/000114_0)' User keswara(user id 1802830393) does not have access to maprfs:///user/hive/warehouse/bistore_sit.db/wt_consumer/d_partition_number=0/000114_0
at com.mapr.fs.MapRFileSystem.getMapRFileStatus(
at com.mapr.fs.MapRFileSystem.getFileStatus(
at org.apache.hadoop.fs.FileSystem.getFileBlockLocations(
at org.apache.hadoop.fs.FileSystem$
at org.apache.hadoop.fs.FileSystem$
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(
at org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(
at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(
at
at
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
at org.apache.hadoop.mapreduce.Job$
at org.apache.hadoop.mapreduce.Job$
at Method)
at
at org.apache.hadoop.mapreduce.Job.submit(
at org.apache.hadoop.mapred.JobClient$
at org.apache.hadoop.mapred.JobClient$
at Method)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(
at org.apache.hadoop.mapred.JobClient.submitJob(
at
at
at org.apache.hadoop.hive.ql.exec.Task.executeTask(
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
at
None of the users can access the internal tables right now. I am part of the mapr group and a sudo user as well!
The tables and partitions are owned by the mapr group, and the permissions look good, though!
[mapr@SAN2LPMR03 mapr]$ hadoop fs -ls /user/hive/warehouse/bistore.db/wt_consumer
Found 1 items
drwxrwxrwt - mapr mapr 1 2017-03-24 11:51 /user/hive/warehouse/bistore.db/wt_consumer/d_partition_number=__HIVE_DEFAULT_PARTITION__
Please help me to sort this out! Really appreciate your help!
If the tables are in Parquet format, the files for that table will be writable only by the user who created the table.
To work around this, you can change the permissions on those files with a statement like the one below:
hdfs dfs -chmod 777 /user/hive/warehouse/bistore_sit.db/wt_consumer/d_partition_number=0/000114_0/*
This statement grants all users all permissions on those particular files.
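For readers unsure what an octal mode such as 777 grants, this short Python sketch renders a mode in the same symbolic form that hadoop fs -ls prints. It uses only the standard stat module and does not talk to the cluster; the helper name symbolic is made up for illustration.

```python
import stat


def symbolic(mode, is_dir=False):
    """Render an octal permission mode the way a file listing shows it."""
    kind = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(kind | mode)


print(symbolic(0o777))                # -rwxrwxrwx: read, write, execute for everyone
print(symbolic(0o644))                # -rw-r--r--: only the owner can write
print(symbolic(0o1777, is_dir=True))  # drwxrwxrwt: world-writable directory with sticky bit
```

Mode 777 is the most permissive setting, so it is worth narrowing again once the access problem is understood.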
I have noticed the following while testing some tables in both CSV and Parquet formats.
When you create a Hive table in CSV format, the table will have 777 permissions for all users who have access to the group you are part of.
But when the Hive table is created in Parquet format, only the user who created the table will have write access. I think it has something to do with the Parquet format.
[root@psnode44 hive-2.1]# hadoop fs -ls /user/hive/warehouse/
Found 1 items
drwxrw-rw- - mapr mapr 2 2017-06-28 12:49 /user/hive/warehouse/test
0: jdbc:hive2://> select * from test;
Error: User basa(user id 5005) does not have access to maprfs:/user/hive/warehouse/test (state=,code=0)
[root@psnode44 hive-2.1]# hadoop fs -ls /user/hive/warehouse/
Found 1 items
drwxrwxrwx - mapr mapr 2 2017-06-28 12:49 /user/hive/warehouse/test
Even though I changed the permissions on the warehouse directory, I still get the same error.
[root@psnode44 hive-2.1]# hadoop fs -chmod -R 777 /user/hive/warehouse/
[root@psnode44 hive-2.1]# hadoop fs -ls /user/hive/warehouse/
Found 1 items
drwxrwxrwx - mapr mapr 2 2017-06-28 12:49 /user/hive/warehouse/test
0: jdbc:hive2://> select * from test;
Error: User basa(user id 5005) does not have access to maprfs:/user/hive/warehouse/test (state=,code=0)

Hadoop MapReduce Error: Input path does not exist: hdfs://localhost:54310/user/hduser/input

I have installed Hadoop 2.6 on Ubuntu Linux 15.04 and it is running fine. But when I run a sample MapReduce test program, it gives the following error:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hduser/input.
Kindly help me. Below are the complete details of the error.
hduser@krishadoop:/usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
15/08/24 15:22:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/24 15:22:38 INFO Configuration.deprecation: is deprecated. Instead, use dfs.metrics.session-id
15/08/24 15:22:38 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/24 15:22:39 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/app/hadoop/tmp/mapred/staging/hduser1122930879/.staging/job_local1122930879_0001
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hduser/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
at org.apache.hadoop.mapreduce.Job$
at org.apache.hadoop.mapreduce.Job$
at Method)
at org.apache.hadoop.mapreduce.Job.submit(
at org.apache.hadoop.mapreduce.Job.waitForCompletion(
at org.apache.hadoop.examples.WordCount.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(
at org.apache.hadoop.examples.ExampleDriver.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.RunJar.main(
It seems you specified the wrong input path. Hadoop is looking for the input at /user/hduser/input. HDFS follows a Unix-like tree structure, and if you simply give a relative directory name such as input, it is resolved as /user/{username}/input. Create that directory and put your data there:
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put <datafile> /user/hduser/input
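The resolution rule described above can be sketched in a few lines of Python. This only models the rule (relative paths are taken against /user/<username>, absolute paths are left alone); the function name resolve_hdfs_path is made up for illustration and no Hadoop API is called.

```python
import posixpath


def resolve_hdfs_path(path, username):
    """Resolve a path the way a relative HDFS path is interpreted.

    Relative paths are taken against the user's home directory,
    /user/<username>; absolute paths are returned unchanged.
    """
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", username, path))


# "wordcount input output" run as hduser looks here:
print(resolve_hdfs_path("input", "hduser"))
# An absolute path is used as given:
print(resolve_hdfs_path("/input_dir", "hduser"))
```

This is why the error message mentions /user/hduser/input even though only "input" was typed on the command line.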
If you can see this path (file) on disk and still get the error, you may be confusing the local file system with the Hadoop Distributed File System (HDFS). To run this MapReduce job, the file must be located in HDFS (having it only in the local file system is not enough).
You can copy files from the local file system into HDFS with this command:
hadoop fs -put <local_file_path> <HDFS_directory>
You can confirm that the file you copied exists in HDFS with this command:
hadoop fs -ls <HDFS_path>
You must create and upload your input before executing your hadoop job. For example, if you need to upload input.txt file, you should do the following:
$HADOOP_HOME/bin/hdfs dfs -mkdir /user/hduser/input
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal $HADOOP_HOME/input.txt /user/hduser/input/input.txt
The first line creates the directory, and the second uploads your input file into HDFS (the Hadoop file system).
When you run a jar that takes an input and an output file/directory, make sure that the input already exists at the specified path and that the output does not exist yet.
If you want to use a text file as input, first copy it from the local file system to HDFS and then run the job, using the following commands:
hadoop fs -copyFromLocal /input.txt /user/hduser/input.txt
/usr/local/hadoop/sbin$ yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/hduser/input.txt /output
/input.txt may be replaced with the path of any text file.
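The rule that the input must exist and the output must not can be checked before submitting the job. The Python sketch below is one possible way to do it: hdfs_exists shells out to hadoop fs -test -e, and check_job_paths is a hypothetical helper name. The demo at the end uses an in-memory stand-in for HDFS so that it runs without a cluster.

```python
import subprocess


def hdfs_exists(path):
    """Return True if `hadoop fs -test -e <path>` exits with status 0."""
    return subprocess.run(["hadoop", "fs", "-test", "-e", path]).returncode == 0


def check_job_paths(input_path, output_path, exists=hdfs_exists):
    """Return the problems that would stop the job from being submitted."""
    problems = []
    if not exists(input_path):
        problems.append("Input path does not exist: " + input_path)
    if exists(output_path):
        problems.append("Output path already exists: " + output_path)
    return problems


# Demo with a stand-in for HDFS (a set of paths assumed to be present).
fake_hdfs = {"/user/hduser/input.txt"}
print(check_job_paths("/user/hduser/input.txt", "/output", fake_hdfs.__contains__))
print(check_job_paths("/user/hduser/input", "/output", fake_hdfs.__contains__))
```

On a real cluster, calling check_job_paths without the third argument would run the hadoop command for each path.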
You need to start Pig in local mode and not in cluster mode:
pig -x local
The program is not able to find the Hadoop path for the inputs. It is searching the local file system rather than Hadoop's DFS.
This problem will go away once your program can locate HDFS. We need to make the program aware of the HDFS location given in the configuration files. To do that, add these lines to your program code (adjust the paths to your installation):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
// Load the cluster settings so that paths resolve against HDFS
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml"));
You should create the directory in HDFS first, for instance:
hadoop fs -mkdir /input_dir
Then, when you run your MapReduce program, give the absolute path of the input directory:
hadoop jar jarFileName.jar className /input_dir /outputdir
The following does not find that directory, because a relative path is resolved under /user/<username>/ rather than /:
hadoop jar jarFileName.jar className input_dir outputdir
If you find "/bin/bash: /bin/java: No such file or directory" in the log, try setting JAVA_HOME in /etc/hadoop/

Getting permission denied error when executing Hive query

I'm getting the following error when executing a select count(*) from tablename query while connected through Beeline.
ERROR : Job Submission failed with exception ' denied
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkOwner(
I can execute show tables; successfully, but I get this error any time I execute a query. I am logged in as the hadoop user, which has access to both Hadoop and Hive.
I've granted full permissions on the folder where the tables reside:
drwxr-xr-x - hadoop supergroup 0 2015-06-03 15:44 /data1
drwxrwxrwx - hadoop hadoop 0 2015-06-05 15:23 /tmp
drwxrwxrwx - hadoop supergroup 0 2015-06-05 15:24 /user
The table is in the user directory.
Environment details:
OS: CentOS
Hadoop: HW 2.6.0
Hive: 1.2
Any help would be greatly appreciated.
Is this a Hive managed table? In that case, could you post what you get when you run:
hadoop fs -ls /user
hadoop fs -ls /user/hive
hadoop fs -ls /user/hive/warehouse
The error suggests that you are accessing the table as a user who is not its owner, and it seems that this user does not have read and execute access.
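As a rough aid for reading the listings requested above, the following Python sketch parses hadoop fs -ls directory lines and reports whether users who are neither the owner nor in the group get read and execute access. The function names are made up for illustration, and the sample lines are the ones posted in the question.

```python
def parse_ls_line(line):
    """Split one `hadoop fs -ls` line into the fields used below."""
    mode, _repl, owner, group, _size, _date, _time, path = line.split(None, 7)
    return {"mode": mode, "owner": owner, "group": group, "path": path}


def others_can_read_and_traverse(entry):
    """True if the 'other' class has read and execute permission."""
    other = entry["mode"][7:10]  # last three characters of e.g. 'drwxr-xr-x'
    # 't' is an execute bit combined with the sticky bit.
    return other[0] == "r" and other[2] in ("x", "t")


listing = [
    "drwxr-xr-x - hadoop supergroup 0 2015-06-03 15:44 /data1",
    "drwxrwxrwx - hadoop hadoop 0 2015-06-05 15:23 /tmp",
    "drwxrwxrwx - hadoop supergroup 0 2015-06-05 15:24 /user",
]
for line in listing:
    entry = parse_ls_line(line)
    print(entry["path"], others_can_read_and_traverse(entry))
```

All three top-level directories pass this check, which is one reason to look at the deeper levels such as /user/hive and /user/hive/warehouse.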

Cannot load a file from Hadoop HDFS from Pig Latin

I am having trouble loading a CSV file. I keep getting the following error:
Failed to read data from "hdfs://localhost:9000/user/der/1987.csv"
Failed to produce result in "hdfs://localhost:9000/user/der/totalmiles3"
Looking at the HDFS installed on my local machine, I can see the file. In fact, the file is located in multiple places, such as /, /user/, etc.
hdfs dfs -ls /user/der
Found 1 items
-rw-r--r-- 1 der supergroup 127162942 2015-05-28 12:42
My Pig script is as follows:
records = LOAD '1987.csv' USING PigStorage(',') AS
(Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime,
CRSArrTime, UniqueCarrier, FlightNum, TailNum,ActualElapsedTime,
CRSElapsedTime,AirTime,ArrDelay, DepDelay, Origin, Dest,
Distance:int, TaxIn, TaxiOut, Cancelled,CancellationCode,
Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
milage_recs= GROUP records ALL;
tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
STORE tot_miles INTO 'totalmiles3';
I ran Pig with the -x local option and was able to read the files from my local hard disk. I got the right answer, and tail -f on the Hadoop namenode log did not scroll, which shows that everything ran locally on the hard disk:
pig -x local totalmiles.pig
Now I am getting errors. It seems the Hadoop namenode is receiving requests, because I ran tail -f and saw the logs scroll.
pig totalmiles.pig
records = LOAD '/user/der/1987.csv' USING PigStorage(',') AS
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
job_local602774674_0001 milage_recs,records,tot_miles
GROUP_BY,COMBINER Message: ENOENT: No such file or directory
at$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.j
org.apache.hadoop.fs.FilterFileSystem.setPermission( 502)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSys
at org .apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
Failed to read data from "/user/der/1987.csv"
Failed to produce result in "hdfs://localhost:9000/user/der/totalmiles3"
I used hdfs to check permissions by running mkdir, and that seems OK:
hdfs dfs -mkdir /user/der/temp2
hdfs dfs -ls /user/der
Found 3 items
-rw-r--r-- 1 der supergroup 127162942 2015-05-28 12:42
drwxr-xr-x - der supergroup 0 2015-05-28 16:21
drwxr-xr-x - der supergroup 0 2015-05-28 15:57
I tried Pig with the mapreduce option and still get the same type of error:
pig -x mapreduce totalmiles.pig
5-05-28 20:58:44,608 [JobControl] INFO
ontrol.ControlledJob - PigLatin:totalmiles.pig while
ENOENT: No such file or directory
at$POSIX.chmodImpl(Na at$POSIX.chmod(
org.apache.hadoop.fs.RawLocalFileSystem.setPermissi at
at org.apache.hadoop.fs.FileSystem.mkdirs(
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(Jo
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobS
at org.apache.hadoop.mapreduce.Job$
My core-site.xml has the temp dir as follows:
<description>A base for other temporary directories.
and my hdfs-site.xml has the namenode and datanode as follows:
I've gotten a bit further in debugging the issue. It seems my namenode is misconfigured as I cannot reformat it:
[hadoop hdfs formatting gets error failed for Block pool ]
We have to give the Hadoop file path as /user/der/1987.csv:
records = LOAD '/user/der/1987.csv' USING PigStorage(',') AS
(Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime,
CRSArrTime, UniqueCarrier, FlightNum, TailNum,ActualElapsedTime,
CRSElapsedTime,AirTime,ArrDelay, DepDelay, Origin, Dest,
Distance:int, TaxIn, TaxiOut, Cancelled,CancellationCode,
Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
If it's for testing, you can run in local mode (pig -x local) and keep 1987.csv in the directory from which you execute the Pig script, i.e. have 1987.csv and the .pig file in the same location.

Cloudera hdfs another namenode already locked the storage directory

I am running CDH-5.3.2-1.cdh5.3.2.p0.10 with Cloudera Manager on CentOS 6.6.
My HDFS service was working on the cluster, but I wanted to change the mount point for the Hadoop data. That did not succeed, so I decided to roll back all the changes; discouragingly, the previous configuration does not work either.
I have two nodes in the cluster. One data node is reported as bad (DataNodes Health: Bad).
In the log I have a few errors:
1:40:10.821 PM ERROR org.apache.hadoop.hdfs.server.common.Storage
It appears that another namenode has already locked the storage directory
1:40:10.821 PM INFO org.apache.hadoop.hdfs.server.common.Storage
Cannot lock storage /dfs/nn. The directory is already locked
1:40:10.821 PM WARN org.apache.hadoop.hdfs.server.common.Storage Cannot lock storage /dfs/nn. The directory is already locked
1:40:10.822 PM FATAL org.apache.hadoop.hdfs.server.datanode.DataNode
Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to Exiting. All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(
I have tried many possible solutions, but without any luck:
formatting the namenode (hadoop namenode -format)
stopping the cluster, running rm -rf /dfs/*, and reformatting
making some adjustments to the /dfs/nn/current/VERSION file
removing the in_use.lock file and starting only the missing node
removing the file in /tmp/hsperfdata_hdfs/ named after the PID that is locking the directory
There are files in the directory:
[root@spark1 dfs]# ll
total 8
drwxr-xr-x 3 hdfs hdfs 4096 Apr 28 13:39 nn
drwx------ 3 hdfs hadoop 4096 Apr 28 13:40 snn
There is no dn directory, which is a bit odd.
I perform all operations on HDFS files as the hdfs user.
In the file /etc/hadoop/conf/hdfs-site.xml there is
Here is a similar thread from the CDH users Google group which might help you: !topic/cdh-user/FYu0gZcdXuE
Also, did you format the namenode from Cloudera Manager or from the command line? Ideally you should do it through Cloudera Manager and not the command line.
