How to load data from the local system to HDFS using Pig - hadoop

I have a CSV file sample.csv located at /home/hadoop/Desktop/script/sample.csv.
I tried to load it in Pig using
movies = load '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id,name,year,rating,duration);
But this Pig statement is giving an error: when I issue dump movies;, it throws an error and reports that the input and output failed.
Please suggest how to load the data using a Pig statement.

If your input file is on the local file system, you can enter the grunt shell by typing pig -x local
Once you are in the grunt shell, you can type the statement below:
record = LOAD '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:int);
dump record;
If your input file is not local, you first need to copy it from the local file system to HDFS using the command below:
hadoop dfs -put <path of file at local> <path of hdfs dir>
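For example, with the sample.csv from the question and the HDFS directory used in the LOAD statement further down (adjust the target path to your setup):
hadoop dfs -put /home/hadoop/Desktop/script/sample.csv /user/hadoop/inputfiles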
Once your file is loaded into HDFS, you can enter MapReduce mode by typing pig.
The grunt shell will open again. I am assuming that your HDFS location looks like the one in the LOAD statement below:
record = LOAD '/user/hadoop/inputfiles/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:int);
dump record;

You can also use the copyFromLocal command in the grunt shell to move a local file to HDFS.
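For instance (the HDFS target path here is just an example):
grunt> copyFromLocal /home/hadoop/Desktop/script/sample.csv /user/hadoop/inputfiles/sample.csv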

Open the Pig shell in local mode with pig -x local; if your file is on HDFS, you can use pig to open the grunt shell.
$pig -x local
grunt> movies = load '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:chararray);
grunt> dump movies;

Related

Hive disable history logs and query logs

We are using Hive on our production machines, but it generates a lot of job logs in the /tmp/<user.name>/ directory. We would like to disable this logging as we don't need it, but we can't find any option to do so. Some of the answers we checked required us to modify the hive-log4j.properties file, but the only file available in /usr/lib/hive/conf is hive-site.xml.
When starting Hive, it prints the following information:
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.7.0.jar!/hive-log4j.properties
Hive history file=/tmp/adqops/hive_job_log_79c7f1c2-b4e5-4b7b-b2d3-72b032697bb5_1000036406.txt
So it seems that the hive-log4j.properties file is bundled inside a jar and we can't modify it.
Hive Version: hive-hwi-0.10.0-cdh4.7.0.jar
Any help/solution is greatly appreciated.
Thanks.
Since Hive expects a custom properties file name, I guess you cannot use the usual trick of setting -Dlog4j.configuration=my_custom_log4j.properties on the command-line.
So I fear you would have to edit hive-common-xxx.jar with some ZIP utility to:
- extract the default props file into /etc/hive/conf/ or any other directory that will be at the head of the CLASSPATH
- delete the file from the JAR
- edit the extracted file
Ex:
$ unzip -l /blah/blah/blah/hive-common-*.jar | grep 'log4j\.prop'
3505 12-02-2015 10:31 hive-log4j.properties
$ unzip /blah/blah/blah/hive-common-*.jar hive-log4j.properties -d /etc/hive/conf/
Archive: /blah/blah/blah/hive-common-1.1.0-cdh5.5.1.jar
inflating: /etc/hive/conf/hive-log4j.properties
$ zip -d /blah/blah/blah/hive-common-*.jar hive-log4j.properties
deleting: hive-log4j.properties
$ vi /etc/hive/conf/hive-log4j.properties
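In the extracted copy you can then raise the log level or move the log directory. The property names below are the ones commonly found in the stock hive-log4j.properties, and /var/log/hive is only an example target, so verify both against your extracted file:
hive.root.logger=FATAL,DRFA
hive.log.dir=/var/log/hive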
NB: proceed at your own risk... 0:-)
Effectively set the logging level to FATAL
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.level=FATAL -e "<query>"
OR redirect logs into another directory and just purge the dir
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -e "<query>"
It will create a log file in the logs folder. Make sure that the logs folder exists in the current directory.
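If you go that route, you can purge the directory periodically, e.g. (the file-name pattern and the one-day retention are just examples):
find ./logs -name 'hive.log*' -mtime +1 -delete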

Redirect Hadoop job output to a file

I'm running a Hadoop job and its output is displayed on the console.
Is there a way for me to redirect the output to a file? I tried the command below to redirect the output, but it does not work.
hduser#vagrant:/usr/local/hadoop$ hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output>joboutput
You can redirect the error stream to a file, which is where the Hadoop job writes its console output. That is, use:
hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output 2>joboutput
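If you want anything written to standard output captured in the same file as well, redirect both streams:
hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output > joboutput 2>&1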
If you are running the examples from the Hadoop homepage (https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) the output will be written to
/user/hduser/gutenberg-output
on HDFS and not the local file system.
You can see the output via
hadoop fs -text /user/hduser/gutenberg-output/*
And to dump that output to a local file
hadoop fs -text /user/hduser/gutenberg-output/* > local.txt
The -text option will decompress the data so you get textual output in case you have some type of compression enabled.
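Alternatively, you can copy the whole output directory to the local file system with -get (unlike -text, this does not decompress anything):
hadoop fs -get /user/hduser/gutenberg-output ./gutenberg-output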

How to import/export hbase data via hdfs (hadoop commands)

I have saved the data crawled by Nutch in HBase, whose file system is HDFS. Then I copied my data (one HBase table) from HDFS directly to a local directory with the command
hadoop fs -copyToLocal /hbase/input ~/Documents/output
After that, I copied the data back to another HBase instance (on another system) with the following command
hadoop fs -copyFromLocal ~/Documents/input /hbase/mydata
It is saved in HDFS, and when I use the list command in the HBase shell it shows up as another table, i.e. 'mydata', but when I run the scan command it says there is no table named 'mydata'.
What is the problem with the above procedure?
In simple words:
I want to copy an HBase table to my local file system using a Hadoop command.
Then, I want to save it directly into HDFS on another system using a Hadoop command.
Finally, I want the table to appear in HBase and display its data just like the original table.
If you want to export the table from one HBase cluster and import it into another, use any one of the following methods:
Using Hadoop
Export
$ bin/hadoop jar <path/to/hbase-{version}.jar> export \
<tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
NOTE: Copy the output directory in hdfs from the source to destination cluster
Import
$ bin/hadoop jar <path/to/hbase-{version}.jar> import <tablename> <inputdir>
Note: Both outputdir and inputdir are in hdfs.
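The cross-cluster copy mentioned in the notes is typically done with distcp; the NameNode addresses and paths below are placeholders:
hadoop distcp hdfs://source-namenode:8020/user/hadoop/mytable-export hdfs://dest-namenode:8020/user/hadoop/mytable-export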
Using Hbase
Export
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
<tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
Copy the output directory in hdfs from the source to destination cluster
Import
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
Reference: Hbase tool to export and import
If you can use the HBase commands instead to back up HBase tables, you can use the HBase ExportSnapshot tool, which copies the HFiles, logs, and snapshot metadata to another file system (local/HDFS/S3) using a MapReduce job.
Take snapshot of the table
$ ./bin/hbase shell
hbase> snapshot 'myTable', 'myTableSnapshot-122112'
Export to the required file system
$ ./bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to fs://path_to_your_directory
You can export it back from the local file system to hdfs://srv2:8082/hbase and run the restore command from the HBase shell to recover the table from the snapshot.
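For that HDFS destination, the export back could look like this (the address is the one from the example above; -mappers is optional):
$ ./bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot myTableSnapshot-122112 -copy-to hdfs://srv2:8082/hbase -mappers 16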
$ ./bin/hbase shell
hbase> disable 'myTable'
hbase> restore_snapshot 'myTableSnapshot-122112'
Reference: HBase Snapshots

pig beginner's example [unexpected error]

I am new to Linux and Apache Pig. I am following this tutorial to learn pig:
http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm
This is a basic word counting example. The data file 'input.txt' and the program file 'wordcount.pig' are in the Wordcount package, linked on the site.
I already have Pig 0.11.1 downloaded on my local machine, as well as Hadoop, and Java 6.
When I downloaded the Wordcount package, I got a "tar.gz" file. I am unfamiliar with this type and wasn't sure how to extract it.
It contains the files 'input.txt', 'wordcount.pig' and a Readme file. I saved 'input.txt' to my Desktop. I wasn't sure where to save wordcount.pig, and decided to just type in the commands line by line in the shell.
I ran Pig in local mode as follows: pig -x local
and then I just copy-pasted each line of the wordcount.pig script at the grunt> prompt like this:
A = load '/home/me/Desktop/input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;
This generates the following errors:
...
Retrying connect to server: localhost/127.0.0.1:8021. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.
My questions:
1.
Should I be saving 'input.txt' and the original 'wordcount.pig' script to some special folder inside the pig-0.11.1 directory? That is, should I create a folder called word inside pig-0.11.1, put 'wordcount.pig' and 'input.txt' there, and then type "wordcount.pig" at the grunt> prompt?
In general, if I have data in, say, 'dat.txt' and a script in, say, 'program.pig', where should I save them to run 'program.pig' from the grunt shell? I think they should both go in pig-0.11.1, so I can do $ pig -x local wordcount.pig, but I am not sure.
2.
Why am I not able to run the script line by line as I tried to?
I have specified the location of the file 'input.txt' in the load statement.
So why does it not just run the commands line by line and dump the contents of D to my screen?
3.
When I try to run Pig in mapreduce mode using $pig, it gives this error:
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-06-03 23:57:06,956 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
This error indicates that Pig is unable to connect to Hadoop to run the job. You say you have downloaded Hadoop -- have you installed it? If you have installed it, have you started it up according to its docs -- have you run the bin/start-all.sh script? Using -x local tells Pig to use the local filesystem instead of HDFS, but it still needs a running Hadoop instance to perform the execution. Before trying to run Pig, follow the Hadoop docs to get your local "cluster" set up and make sure your NameNode, DataNodes, etc. are up and running.
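A quick way to verify that the daemons are actually up before retrying is jps; on a classic Hadoop 1.x setup you would expect to see at least NameNode, DataNode, JobTracker and TaskTracker listed:
$ jps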
Error 2043 occurs when Hadoop and Pig fail to communicate with each other.
Never do a right click --> Extract Here when dealing with tar.gz files.
You should always do a tar -xzvf *.tar.gz in the terminal when extracting them.
I noticed that Pig doesn't get installed properly when you right-click on the pig..tar.gz file and select Extract Here. It's better to do a tar -xzvf pig..tar.gz from the terminal.
Make sure you are running Hadoop before you execute commands like pig -x local.
If you want to run *.pig files from the grunt> prompt, use:
grunt> exec *.pig
If you want to run pig files outside the grunt> prompt, use:
$ pig -x local *.pig

Shell script that writes in grunt shell?

I am trying to write a shell script that opens the grunt shell, runs a Pig file in it, and then copies the output files to the local machine. Is this possible? Any links would be helpful!
You can run a Pig script from the command line:
#> pig -f script.txt
The tail end of the script can execute fs commands to 'get' the data back to the local filesystem:
grunt> fs -get /path/in/hdfs /local/path
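Putting it together, a minimal wrapper script could look like this (the script name and paths are placeholders):
#!/bin/sh
# run the Pig script non-interactively
pig -f script.txt
# copy the job output from HDFS to the local machine
hadoop fs -get /path/in/hdfs /local/path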
