pig local mode spill data issue - hadoop

I am trying to solve this issue but cannot understand it. The Pig script ran successfully on a 1.8 GB data file on my development machine.
When I run it on the server, it reports that it cannot find a local directory to spill data to (spill0.out).
I have modified the pig.temp.dir property in the pig.properties file to point to a location with enough free space.
error:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
So how can I find out where Pig is spilling the data, and can the Pig spill directory be changed somehow?
I am using Pig in local mode.
Any ideas or suggestions or workarounds will be of great help.
Thanks..

I found an answer.
We need to put the following properties in the $PIG_HOME/conf/pig.properties file:
mapreduce.jobtracker.staging.root.dir
mapred.local.dir
pig.temp.dir
and then test.
This has helped me solve the problem.
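As a rough sketch of what those entries might look like, the snippet below prints the three properties, each pointed at a subdirectory of one base directory. The base path /data/scratch and the subdirectory names are assumptions made for illustration and do not come from the answer; any location with enough free space should do.

```shell
#!/bin/sh
# Print pig.properties entries that point Pig's and Hadoop's local scratch
# directories at one base directory. pig_spill_props and the subdirectory
# names are hypothetical; only the property names come from the answer.
pig_spill_props() {
    base="$1"
    printf 'mapreduce.jobtracker.staging.root.dir=%s/staging\n' "$base"
    printf 'mapred.local.dir=%s/mapred-local\n' "$base"
    printf 'pig.temp.dir=%s/pig-temp\n' "$base"
}

# Example: review the output, then append it to $PIG_HOME/conf/pig.properties.
pig_spill_props /data/scratch
```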

This is not a problem with Pig.
I'm not using Pig and I also have exactly the same error.
The problem seems to be more related to Hadoop. I also use it in local mode. I'm using Hadoop 2.6.0

I had no luck with these answers: Pig (version 0.15.0) was still writing pigbag* files to the /tmp directory. So I renamed my /tmp directory and created a symbolic link to the desired location, like this:
sudo -s #change to root
cd /
mv tmp tmp_local
ln -s /desired/new/tmp/location tmp
chmod 1777 tmp
mv tmp_local/* tmp
Make sure there are no active applications writing to tmp folder at the time of running these commands.

Related

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount Problem.
Things I did so far:
Set up a Hadoop single-node cluster by following the link below.
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program by following the link below.
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem occurs when I execute the last line to run the program:
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, on a side note, I also tried the following directory structure in the jar command.
No avail! :/
I would really appreciate if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running in pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
Based on the stack trace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so it's better to get familiar with that rather than mess around with having to install Hadoop yourself from scratch, in my opinion
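To make the local-versus-HDFS distinction concrete, here is a small sketch that turns an absolute local path into the file:// form mentioned above. The helper name to_file_uri is made up for this example; the path is the one from the question.

```shell
#!/bin/sh
# Turn an absolute local path into a file:// URI so that Hadoop reads it from
# the local filesystem rather than from the default filesystem (HDFS in
# pseudo-distributed mode). to_file_uri is a hypothetical helper name.
to_file_uri() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  echo "expected an absolute path: $1" >&2; return 1 ;;
    esac
}

to_file_uri /usr/local/hadoop/input
```

The resulting URIs could then be passed as the input and output arguments of the hadoop jar command from the question.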

How can I get spark to access local HDFS on windows?

I have installed both hadoop and spark locally on a windows machine.
I can access HDFS files in hadoop, e.g.,
hdfs dfs -tail hdfs:/out/part-r-00000
works as expected. However, if I try to access the same file from the spark shell, e.g.,
val f = sc.textFile("hdfs:/out/part-r-00000")
I get an error that the file does not exist. Spark can access files in the windows file system using the file:/... syntax, though.
I have set the HADOOP_HOME environment variable to c:\hadoop which is the folder containing the hadoop install (in particular winutils.exe, which seems to be necessary for spark, is in c:\hadoop\bin).
Because it seems that HDFS data is stored in the c:\tmp folder, I was wondering whether there is a way to let Spark know about this location.
Any help would be greatly appreciated. Thank you.
If the error says the file doesn't exist, that means your Spark application (the code snippet) is able to connect to HDFS.
The HDFS file path that you are using seems to be wrong.
This should solve your issue:
val f = sc.textFile("hdfs://localhost:8020/out/part-r-00000")
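The host and port in that URI depend on the installation, so 8020 may not be right everywhere. As a sketch, the helper below joins a default-filesystem URI and an absolute HDFS path into the full form; the name full_hdfs_uri is hypothetical, and the first argument is assumed to be whatever `hdfs getconf -confKey fs.defaultFS` reports on your machine.

```shell
#!/bin/sh
# Join a default-filesystem URI and an absolute HDFS path into one full URI.
# full_hdfs_uri is a hypothetical helper; the default FS value is assumed to
# come from `hdfs getconf -confKey fs.defaultFS` on your installation.
full_hdfs_uri() {
    default_fs="${1%/}"   # drop one trailing slash, if present
    printf '%s%s\n' "$default_fs" "$2"
}

full_hdfs_uri "hdfs://localhost:8020" "/out/part-r-00000"
```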

How do you transfer files onto the Hadoop FS (HDFS) on Windows cmdline without Cygwin?

I have zero experience with Hadoop, but suddenly have to use it at work with Spark on Windows. My question, which has been asked a few times here, but I never could quite get the syntax for what I need, is this. I'm trying to transfer a simple file called:
gensortText.txt which let's say is at c:\gensortText.txt
I know you can use hadoop fs -copyFromLocal. I've tried these things:
hadoop fs -copyFromLocal C:\gensortText.txt hdfs://0.0.0.0:19000
ERROR: Relative path in absolute URI.
hadoop fs -copyFromLocal C:\gensortOutText.txt \tmp\hadoop-Administrator\dfs
ERROR: copyFromLocal: `tmphadoop-Administratordfs': No such file or directory
and a number of other variations with hdfs: and using the tmp directory which all returned similar errors.
I have Hadoop in c:\deploy as suggested in the Hadoop2Windows guide (which works and allowed me to run Hadoop; I can access the web GUI and all that). Hadoop has created my new HDFS at c:\temp. Please, someone help me figure out how to transfer files into the system. It can even be done manually if that's possible, but that doesn't seem to work: the file doesn't show up in the web GUI when I go to "Utilities->Browse the Filesystem". Nothing shows up there, actually.
Can someone please help? I can provide any relevant information, but I'm so new to this that I don't really know what would be helpful. I think it's just my syntax for the command-line tool. Can someone give me a concrete example of how to use hadoop fs -copyFromLocal, or another simple way to do this? Sorry for my ignorance on the subject, and thanks for any help.
To be able to run hadoop commands on Windows, you need to have winutils installed and visible to the Hadoop process.

unable to set up pseudo-distributed hadoop cluster

I am using centos 7. Downloaded and untarred hadoop 2.4.0 and followed the instruction as per the link Hadoop 2.4.0 setup
Ran the following command.
./hdfs namenode -format
Got this error :
Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode
I see a number of posts with the same error with no accepted answers and I have tried them all without any luck.
This error can occur if the necessary jarfiles are not readable by the user running the "./hdfs" command or are misplaced so that they can't be found by hadoop/libexec/hadoop-config.sh.
Check the permissions on the jarfiles under: hadoop-install/share/hadoop/*:
ls -l share/hadoop/*/*.jar
and if necessary, chmod them as the owner of the respective files to ensure they're readable. Something like chmod 644 should be sufficient to at least check if that fixes the initial problem. For the more permanent fix, you'll likely want to run the hadoop commands as the same user that owns all the files.
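To narrow the ls -l output down to the files that may need attention, a sketch like the following lists jars that are missing a read bit for user, group, or other. The helper name list_restricted_jars is made up here, and the check is stricter than strictly required, since a jar only has to be readable by the user running the command.

```shell
#!/bin/sh
# List jar files under a directory that are missing a read bit for user,
# group, or other (that is, anything stricter than r--r--r--).
# list_restricted_jars is a hypothetical helper name.
list_restricted_jars() {
    find "$1" -type f -name '*.jar' ! -perm -444
}

# Example: run from the Hadoop install directory.
if [ -d share/hadoop ]; then
    list_restricted_jars share/hadoop
fi
```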
I followed the link Setup hadoop 2.4.0
and I was able to get over the error message.
Seems like the documentation on hadoop site is not complete.

Error in Pig: Cannot locate pig-withouthadoop.jar. do 'ant jar-withouthadoop', and try again

I am trying to start Pig 0.12.0 on a Mac after installing Pig from the Apache website.
Before starting the Pig shell, I created a pig-env.sh file in the conf directory and copied the 4 lines below into it.
Export JAVA_HOME=/usr
Export PIG_HOME=/Users/Hadoop_Cluster/pig-0.12.0
Export HADOOP_HOME=Users/Hadoop_Cluster/hadoop-1.2.1
Export PIG_CLASSPATH=$HADOOP_HOME/conf/
Also, Added below text in pig.properties file:
Fs.default.name=hdfs://localhost:9000
Mapred.job.tracker=localhost:9001
I copied the core-site.xml, hdfs-site.xml and mapred-site.xml files from
Hadoop_home/conf to pig_home/conf
I get the error below when starting Pig from the command line in Pig's bin directory. The error says:
Cannot locate pig-withouthadoop.jar. do 'ant jar-withouthadoop', and Try again
If it is not there, copy pig-0.12.0-withouthadoop.jar (renamed or not, it shouldn't matter) to your $PIG_HOME, so that in the end the file /Users/Hadoop_Cluster/pig-0.12.0/pig-0.12.0-withouthadoop.jar exists.
Also be careful about lowercase and uppercase letters. Otherwise it should be fine.
Finally it works.
All I did was rename the file in the conf directory to "pig-withouthadoop.jar" instead of pig-0.12.0-withouthadoop. I also made sure Hadoop was not in safe mode.
I kept the same settings in the file, shown below, and copied all 3 hdp files
to the pig_home/conf directory.
export JAVA_HOME=/usr
export PIG_HOME=/Users/Hadoop_Cluster/pig-0.12.0
export HADOOP_HOME=/Users/Hadoop_Cluster/hadoop-1.2.1
export PIG_CLASSPATH=$HADOOP_HOME/conf/
I got the same error too. I solved it by removing /bin from the PIG_HOME path in .bashrc, then sourcing .bashrc and starting Pig:
export PIG_HOME=/home/hadoop/pig-0.13.0/bin ==> wrong
export PIG_HOME=/home/hadoop/pig-0.13.0 ==> correct..
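A quick way to catch this mistake is to check that the directory contains the bin/pig launcher. This is only a sketch: the function name looks_like_pig_home is hypothetical, and it assumes the standard layout in which the launcher lives at bin/pig under the install root.

```shell
#!/bin/sh
# Succeed only if the given directory looks like a Pig install root, meaning
# it contains the bin/pig launcher. looks_like_pig_home is a hypothetical name.
looks_like_pig_home() {
    [ -f "$1/bin/pig" ]
}

if looks_like_pig_home "${PIG_HOME:-}"; then
    echo "PIG_HOME looks right: $PIG_HOME"
else
    echo "PIG_HOME should be the install root, not its bin directory" >&2
fi
```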
You need to do what the error message says:
Cannot locate pig-withouthadoop.jar. do 'ant jar-withouthadoop'
Run the command ant jar-withouthadoop to build pig-withouthadoop.jar.
If ant is not installed, Ubuntu users can try apt-get install ant.
The command ant jar-withouthadoop takes roughly 15-20 minutes, so be patient.
I scratched my head all day and kept looking for solutions on Google, but none helped.
Extracting the Pig tarball does not create this jar in the home directory, so the steps above are needed to build the jar file and run Pig successfully.
I don't know exactly why this is necessary, but this is the solution that worked for me with Hadoop 1.2 (out of safe mode) and Pig 0.12.1.
The key is to find
pig-withouthadoop.jar
in your $PIG_HOME.
So use
find / -name '*withouthadoop*'
to locate it. If it is not named exactly
pig-withouthadoop.jar
you should rename it and copy it to $PIG_HOME. Worked for me.
