pig beginner's example [unexpected error] - hadoop

I am new to Linux and Apache Pig. I am following this tutorial to learn pig:
http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm
This is a basic word counting example. The data file 'input.txt' and the program file 'wordcount.pig' are in the Wordcount package, linked on the site.
I already have Pig 0.11.1 downloaded on my local machine, as well as Hadoop, and Java 6.
When I downloaded the Wordcount package it took me to a "tar.gz" file. I am unfamiliar with this type, and wasn't sure how to extract it.
It contains the files 'input.txt', 'wordcount.pig' and a Readme file. I saved 'input.txt' to my Desktop. I wasn't sure where to save wordcount.pig, so I decided to just type in the commands line by line in the shell.
I ran Pig in local mode as follows: pig -x local
and then I just copy-pasted each line of the wordcount.pig script at the grunt> prompt like this:
A = load '/home/me/Desktop/input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;
This generates the following errors:
...
Retrying connect to server: localhost/127.0.0.1:8021. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.
My questions:
1.
Should I be saving 'input.txt' and the original 'wordcount.pig' script to some special folder inside the directory pig-0.11.1? That is, should I create a folder called word inside pig-0.11.1, put 'wordcount.pig' and 'input.txt' there, and then type "wordcount.pig" at the grunt> prompt?
In general, if I have data in, say, 'dat.txt' and a script, say, 'program.pig', where should I save them to run 'program.pig' from the grunt shell? I think they should both go in pig-0.11.1, so I can do $ pig -x local wordcount.pig, but I am not sure.
2.
Why am I not able to run the script line by line as I tried to?
I have specified the location of the file 'input.txt' in the load statement.
So why does it not just run the commands line by line and dump the contents of D to my screen?
3.
When I try to run Pig in mapreduce mode using $ pig, it gives this error:
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-03 23:57:06,956 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage

This error indicates that Pig is unable to connect to Hadoop to run the job. You say you have downloaded Hadoop -- have you installed it? If you have installed it, have you started it up according to its docs -- have you run the bin/start-all.sh script? Using -x local tells Pig to use the local filesystem instead of HDFS, but it still needs a running Hadoop instance to perform the execution. Before trying to run Pig, follow the Hadoop docs to get your local "cluster" set up and make sure your NameNode, DataNodes, etc. are up and running.
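
A quick way to see whether anything is listening where Pig is trying to connect is to probe the port yourself. This is only a minimal sketch: the helper name port_open is made up for illustration, and the host and port (localhost, 8021) are simply the ones from the "Retrying connect to server" line in the question.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8021 is the port shown in the question's retry message.
    print("localhost:8021 reachable:", port_open("localhost", 8021))
```

If this prints False, nothing is accepting connections on that port, which is consistent with the retry messages Pig printed.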

Error 2043 occurs when Hadoop and Pig fail to communicate with each other.
Never use right click --> "Extract Here" when dealing with tar.gz files.
Always extract them from the terminal with tar -xzvf *.tar.gz.
I noticed that Pig doesn't get installed properly when you right-click the pig..tar.gz file and select "Extract Here". It's better to run tar -xzvf pig..tar.gz from the terminal.
Make sure Hadoop is running before you execute commands like pig -x local.
If you want to run *.pig files from the grunt> prompt, use:
grunt> exec *.pig
If you want to run Pig files outside the grunt> prompt, use:
$ pig -x local *.pig
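
Once the script runs, it helps to know what output to expect. Here is a rough Python sketch of what the five Pig statements in the question compute. Assumptions: tokens are split on whitespace only (Pig's TOKENIZE also splits on a few punctuation characters), the sample lines are invented rather than taken from the tutorial's input.txt, and the result is sorted only to make the printout stable, which Pig does not do.

```python
from collections import Counter


def word_count(lines):
    """Count tokens across lines, mirroring TOKENIZE -> group by word -> COUNT."""
    counts = Counter()
    for line in lines:
        # Assumption: whitespace-only tokenising, simpler than Pig's TOKENIZE.
        counts.update(line.split())
    # Same field order as `generate COUNT(B), group`: (count, word).
    return sorted((n, word) for word, n in counts.items())


# Hypothetical input lines, not the tutorial's data file.
sample = ["hello pig", "hello hadoop"]
for row in word_count(sample):
    print(row)
```

Each printed tuple corresponds to one row of D in the Pig script: the count first, then the word.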

Related

How to load results of bash command into gcs using Airflow?

I am trying to save JSON that is the result of a bash command to a GCS bucket. When I execute the bash command in my local terminal, everything works properly and the data is loaded into GCS. Unfortunately, the same bash command doesn't work via Airflow. Airflow marks the task as successfully done, but in GCS I see an empty file. I suspect this happens because Airflow runs out of memory, but I am not sure. If so, can someone explain how and where the results are stored in Airflow? I see in the BashOperator documentation that Airflow creates a temporary directory which is cleaned up after execution. Does that mean the results of the bash command are also cleaned up afterwards? Is there any way to save the results in GCS?
This is my dag:
get_data = BashOperator(
    task_id='get_data',
    bash_command="curl -X GET -H 'XXX: xxx' some_url | gsutil cp -L manifest.txt - gs://bucket/folder1/filename.json; rm manifest.txt",
    dag=dag
)

Permission denied: user=basi, access=WRITE, inode="/":

I'm new to Hadoop and Pig. I have installed Pig under my local user on Ubuntu and Hadoop as hduser. Pig works fine in local mode for small datasets. I started Pig in mapreduce mode and am trying to implement word count, but I get the permission denied error below.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=basi, access=WRITE, inode="/":hduser:supergroup:drwxr-xr-x
I started Hadoop in pseudo-distributed mode.
I started Pig as my local user: pig -x mapreduce
grunt> A = LOAD '/Wordcount.txt' AS (line:Chararray);
grunt> B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = group B by word;
grunt> wc = FOREACH grouped GENERATE group, COUNT(B);
grunt> DUMP wc;
/Wordcount.txt is a file in HDFS.
It's not clear how you loaded /Wordcount.txt into the root folder, but the error says you're trying to write into the root directory, which is only possible as the hduser account, not as basi, your local user.
One option is to switch to the other user.
Otherwise, don't use the root of HDFS as the dumping ground for all files; use your dedicated /user directory.
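To see why the write is refused, you can read the owner, group and mode straight out of the error message. The sketch below applies the usual owner/group/other rule to that mode string. It is simplified on purpose: the function name can_write is made up, the group membership of basi is an assumption, and it ignores the HDFS superuser and any ACLs.

```python
def can_write(user, user_groups, owner, group, mode):
    """Apply owner/group/other write bits from an ls-style mode like 'drwxr-xr-x'."""
    perms = mode[1:]  # drop the leading file-type character
    if user == owner:
        return perms[1] == "w"
    if group in user_groups:
        return perms[4] == "w"
    return perms[7] == "w"


# Values from the error: inode="/":hduser:supergroup:drwxr-xr-x
# Assumption: basi belongs only to its own group, not to supergroup.
print(can_write("basi", {"basi"}, "hduser", "supergroup", "drwxr-xr-x"))      # False
print(can_write("hduser", {"hduser"}, "hduser", "supergroup", "drwxr-xr-x"))  # True
```

basi falls into the "other" class, which has r-x on /, so access=WRITE is denied there.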
It is not Pig but Hadoop related. It happened to me with Spark. Probably you installed your Hadoop manually. You need to create supergroup and add hduser into supergroup.
sudo groupadd supergroup
sudo usermod -aG supergroup hduser
Then try again.
Proceed as follows:
chmod 777 /Wordcount.txt
chmod changes the permissions of the text file to rwxrwxrwx for owner, group and others respectively.
Then provide the complete location of the text file in the load command, similar to the following:
grunt> A = LOAD '/directory/abc/Wordcount.txt' AS (line:Chararray);
Then run the code again.
Hope this helps.
In Pig, the DUMP command first writes its output to /tmp/temp.... and then the client reads from it. My guess is that your cluster does not have /tmp. If that is the case, please try creating the /tmp directory (usually with permission 1777).
(Edited: Reading answers of others, I think the one about /user makes sense. Without it, you won't even be able to submit any jobs.)
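
In case the octal 1777 looks cryptic, Python's standard stat module can render a mode the way ls would show it. This is just an illustration of the notation; it does not touch HDFS.

```python
import stat

# S_IFDIR marks a directory; 0o1777 is rwx for everyone plus the sticky bit.
print(stat.filemode(stat.S_IFDIR | 0o1777))  # drwxrwxrwt

# For comparison, the mode of "/" shown in the error message.
print(stat.filemode(stat.S_IFDIR | 0o755))   # drwxr-xr-x
```

The trailing t is the sticky bit: everyone may create files in the directory, but only a file's owner may delete it.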

Hadoop MapReduce get job history in pseudo-distributed mode

I am running Hadoop MapReduce and YARN in pseudo-distributed mode and I want to get the job history log. To get it, I tried solution 2 in this question, so from the directory
hadoop-3.0.0/bin
I executed
$ ./hdfs dfs -ls /tmp/hadoop-uname/mapred
Following is what I get as response:
ls: `/tmp/hadoop-uname/mapred': No such file or directory
I get the same response for:
$ ./hdfs dfs -ls /tmp/hadoop-uname/mapred/staging
My questions are:
1) Are job history logs generated in pseudo-distributed mode?
2) Is logging turned on by default, or do I need to change some other setting to turn it on?
3) Am I missing anything else?

How to load data from the local system to HDFS using Pig

I have a CSV file, sample.csv, located at /home/hadoop/Desktop/script/sample.csv.
I tried to load it in Pig using
movies = load '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id,name,year,rating,duration);
But this Pig statement gives an error: when I run dump movies; it throws an error and reports that input and output failed.
Please suggest how to load the data using a Pig statement.
If your input file is on the local filesystem, you can enter the grunt shell by typing pig -x local.
Once you are in the grunt shell, you can type the statements below:
record = LOAD '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:int);
dump record;
If you want to read the file from HDFS instead, first copy it from the local filesystem to HDFS using the command below:
hadoop dfs -put <path of file at local> <path of hdfs dir>
Once your file is loaded into HDFS, you can enter mapreduce mode by typing pig.
The grunt shell will open again. I am assuming that your HDFS location is something like the one in the LOAD statement below:
record = LOAD '/user/hadoop/inputfiles/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:int);
dump record;
You can also use the copyFromLocal command in the grunt shell to copy a local file to HDFS.
Open the Pig shell in local mode with pig -x local; if your file is on HDFS, you can use pig to open the grunt shell.
$ pig -x local
grunt> movies = load '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:chararray);
grunt> dump movies;
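
If it is unclear what PigStorage(',') with a typed schema does to each line, here is a rough Python equivalent. It is only a sketch: parse_row is a made-up helper, the sample row is invented, and it assumes PigStorage splits purely on the delimiter (no quoted fields) and that a failed cast ends up as null, shown here as None.

```python
def parse_row(line, schema, delimiter=","):
    """Split one line on the delimiter and cast each field per the schema."""
    casts = {"int": int, "chararray": str}
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for (name, type_name), raw in zip(schema, fields):
        try:
            row[name] = casts[type_name](raw)
        except ValueError:
            # Assumption: a value that cannot be cast becomes null.
            row[name] = None
    return row


# Schema copied from the LOAD statement that uses duration:int.
schema = [("id", "int"), ("name", "chararray"), ("year", "chararray"),
          ("rating", "chararray"), ("duration", "int")]

# Hypothetical row, not taken from sample.csv.
print(parse_row("1,Some Movie,2001,4.5,5400", schema))
```

A header line such as id,name,year,rating,duration would come out with id and duration set to None, which is worth remembering if the CSV has a header row.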

Is it possible to execute CMD in the middle of a Dockerfile?

I am installing hadoop-0.20.2 using Docker. I have two Dockerfiles: one for the Java installation and another for the Hadoop installation. I am starting the services using the CMD instruction
CMD ["path/to/start-all.sh"]
Now I want to write a third Dockerfile which executes an example MapReduce job. The problem is that
the third Dockerfile depends on the second (Hadoop) Dockerfile. For example:
FROM sec_doc_file
RUN /bin/hadoop fs -mkdir input
It requires the Hadoop services, but they will be started only after running the second Dockerfile. I want to start them as part of the third Dockerfile, before starting the MR job. Is it possible? If so, please provide an example. If not, what could be the other possibilities?
#something like
FROM sec_doc_file
#Start_Service
RUN /bin/hadoop fs -mkdir input
#continue_map_reduce_job
The Docker image you use as the base for the new container is a base for files, not for processes that are supposed to be running. To do what you want, you would need to start the process(es) you need during docker build and run the setup commands. Each RUN creates a new AUFS layer, but does not keep any services started by previous steps running. So, if you need a service to be up to perform some setup during docker build, you need to run it in one line (by concatenating commands or with a custom script). Example:
FROM Gops/sec_doc_file
RUN path/to/start-all.sh && /bin/hadoop fs -mkdir input
So for setting up HDFS folders and files during docker build you'd need to run the hdfs daemons and perform the action you wish in the same RUN command:
RUN /etc/hadoop/hadoop-env.sh &&\
/opt/hadoop/sbin/start-dfs.sh &&\
/opt/hadoop/bin/hdfs dfs -mkdir input
