How to transfer files between machines in Hadoop and search for a string using Pig - hadoop

I have 2 questions:
I have a big file of records, a few million of them. I need to transfer this file from one machine to a Hadoop cluster machine. I guess there is no scp command in Hadoop (or is there?). How do I transfer files to the Hadoop machine?
Also, once the file is on my Hadoop cluster, I want to search for records which contain a specific string, say 'XYZTechnologies'. How do I do this in Pig? Some sample code would be great to give me a head start.
This is the very first time I am working on Hadoop/Pig. So please pardon me if it is a "too basic" question.
EDIT 1
I tried what Jagaran suggested and I got the following error:
2012-03-18 04:12:55,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "(" "( "" at line 3, column 26.
Was expecting:
<QUOTEDSTRING> ...
Also, please note that I want to search for the string anywhere in the record, so I am reading the tab-separated record as one single column:
A = load '/user/abc/part-00000' using PigStorage('\n') AS (Y:chararray);

For your first question, I think Guy has already answered it.
As for the second question, if you just want to search for records which contain a specific string, a bash script is better, but if you insist on Pig, this is what I suggest:
A = load '/user/abc/' using PigStorage(',') AS (Y:chararray);
B = filter A by Contains(Y, 'XYZTechnologies');
store B into 'output' using PigStorage();
Keep in mind that PigStorage's default delimiter is tab, so put a delimiter that does not appear in your file (and remember to REGISTER the jar containing the UDF).
Then you should write a UDF that returns a boolean for Contains, something like:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Contains extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // First argument: the field to search in; second argument: the substring to look for.
        return input.get(0).toString().contains(input.get(1).toString());
    }
}
I didn't test this, but this is the direction I would have tried.
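To sanity-check the UDF without a cluster, a quick hypothetical test (not part of the original answer; the class name ContainsTest and the sample strings are made up) can build a two-field tuple and call exec() directly:
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ContainsTest {
    public static void main(String[] args) throws Exception {
        Tuple t = TupleFactory.getInstance().newTuple(2);
        t.set(0, "foo XYZTechnologies bar"); // the record text
        t.set(1, "XYZTechnologies");         // the string to look for
        System.out.println(new Contains().exec(t)); // expected: true
    }
}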

For copying to Hadoop:
1. You can install the Hadoop client on the other machine and then run, from the command line:
hadoop dfs -copyFromLocal <local-file> <hdfs-path>
2. You could simply write Java code that uses the FileSystem API to copy into Hadoop.
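A minimal sketch of option 2 (the class name CopyToHdfs and both paths are illustrative, not from the original answer): open the cluster's FileSystem and call copyFromLocalFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS should point at the cluster's NameNode (e.g. via core-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy the local records file into the user's HDFS directory.
        fs.copyFromLocalFile(new Path("/local/path/records.txt"),
                             new Path("/user/abc/records.txt"));
        fs.close();
    }
}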
For Pig:
Assuming you know that field 2 may contain XYZTechnologies:
A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray, Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = filter A by Y matches '.*XYZTechnologies.*';
STORE B into '<hadoop-output-path>' using PigStorage();

Hi, you can grep the file in Hadoop to find the specific string.
For example, my file contains the following data:
Hi myself xyz. i like hadoop.
hadoop is good.
i am practicing.
So the Hadoop command is:
hadoop fs -text 'file name with path' | grep 'string to be found out'
Pig shell:
-- Load the file data into a Pig relation
data = LOAD 'file with path' USING PigStorage() AS (text:chararray);
-- Find the required text
txt = FILTER data BY ($0 MATCHES '.*string to be found out.*');
-- Display the data
DUMP txt; -- or use ILLUSTRATE txt;
-- Store it in another file
STORE txt INTO 'path' USING PigStorage();

Related

How to insert a header file as the first line into a data file in HDFS without using getmerge (performance issue while copying to local)?

I am trying to insert header.txt as the first line into data.txt without using getmerge. getmerge copies to local and merges into a third file, but I want the result in HDFS only.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want output in Data.txt file only like below :
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest whether we can implement this in HDFS only?
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
// FileSystem.concat takes the target path plus an array of source paths.
fs.concat(dataWithHeader, new Path[] { header, data });
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
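If the merged file must end up with the name Data.txt, as the question asks, a rename can follow; this is a small hedged addition (not part of the original answer) reusing the variables from the snippet above:
// Rename the merged file back to Data.txt so that only Data.txt remains.
fs.rename(dataWithHeader, new Path("Data.txt"));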
Thanks for your reply.
I found another way, like this:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
This has the drawback that the cat command reads the complete data, which impacts performance.

Apache Pig: How to concat strings in load function?

I am new to Pig and I want to use Pig to load data from a path. The path is dynamic and is stored in a txt file. Say we have a txt file called pigInputPath.txt
In the pig script, I plan to do the following:
First load the path using:
InputPath = Load 'pigInputPath.txt' USING PigStorage();
Second load data from the path using:
Data = Load 'someprefix' + InputPath + 'somepostfix' USING PigStorage();
But this does not work. I also tried CONCAT but it also gives me an error. Can someone help me with this? Thanks a lot!
First, find a way to pass your input path as a parameter. (References: Hadoop Pig: Passing Command Line Arguments, https://wiki.apache.org/pig/ParameterSubstitution)
Let's say you invoke your script as pig -f script.pig -param inputPath=blah
You could then LOAD from that path with required prefix and postfix as follows:
Data = LOAD 'someprefix$inputPath/somepostfix' USING PigStorage();
The catch with the somepostfix string is that it needs to be separated from the parameter by a / or another such special character, to tell Pig that the string is not part of the parameter name.
One option to avoid using special characters is by doing the following:
%default prefix 'someprefix'
%default postfix 'somepostfix'
Data = LOAD '$prefix$inputPath$postfix' USING PigStorage();

How to find the file name and size of the file from fsimage?

I am trying to find the files which are less than block size in HDFS.
Using OIV, I converted the fsimage into a text file with delimiters, like below.
hdfs oiv_legacy -i /tmp/fsimage -o /tmp/fsimage_$RUNDATE/fsimage.txt -p Delimited -delimiter '#'
Since the fsimage has a lot of data, how do I find the file name and file size of each and every file in HDFS from this?
Can anyone please help.
Thanks in advance....
Take a look at the scripts at the end of this documentation.
Starting from:
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
replication:int,
modTime:chararray,
accessTime:chararray,
blockSize:long,
numBlocks:int,
fileSize:long,
NamespaceQuota:int,
DiskspaceQuota:int,
perms:chararray,
username:chararray,
groupname:chararray);
-- Grab the pathname and filesize
B = FOREACH A generate path, fileSize;
-- Save results
STORE B INTO '$outputFile';
hadoop fs -find /tmp/fsimage size 64 -print
Note: I am using MapR Hadoop. The syntax might vary if it's Cloudera or Hortonworks.
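As a hedged alternative sketch (not from the original answers), the same check can be done directly against the live namespace with the standard FileSystem API, without dumping the fsimage at all; the class name SmallFileFinder and the starting directory /tmp are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SmallFileFinder {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Recursively walk the directory tree.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/tmp"), true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            // Report files whose length is below their own block size.
            if (status.isFile() && status.getLen() < status.getBlockSize()) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }
}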

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way the input reader fails to read the text from the gz files and instead reads binary gibberish.
So, I would like to know if I can set it to somehow read the text, or alternatively access the file name when using sc.textFile(...).
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat which cannot read gzipped files because they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {
  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with wholefileinputformat (not built into hadoop but all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class possibly extending WholeFileInputFormat to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZIPInputStream.
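A hedged sketch of that route (the class name GzipByHand and method readGzipped are made up for the example): read each file as binary with binaryFiles, then decompress it by hand so the file name stays paired with the decompressed text.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

import scala.Tuple2;

public class GzipByHand {
    public static JavaPairRDD<String, String> readGzipped(JavaSparkContext sc, String dir) {
        // binaryFiles keeps the file name as the key and hands the raw bytes as a stream.
        return sc.binaryFiles(dir).mapToPair(pair -> {
            String fileName = pair._1();
            PortableDataStream stream = pair._2();
            StringBuilder content = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new GZIPInputStream(stream.open())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append('\n');
                }
            }
            return new Tuple2<>(fileName, content.toString());
        });
    }
}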
UPDATE 2: If you have access to the directory name like in the OP's comment below you can get all the files like this.
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); //just uses the default one
FileStatus [] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
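A hypothetical continuation of that snippet (assuming sc is the JavaSparkContext and the imports shown): read each listed file separately with textFile, which does handle .gz decompression, so the file name stays associated with its contents.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// paths is the ArrayList<Path> built in the loop above.
List<Tuple2<String, JavaRDD<String>>> rddsByFile = new ArrayList<>();
for (Path p : paths) {
    // textFile decompresses .gz transparently; each file keeps its own RDD and name.
    rddsByFile.add(new Tuple2<>(p.getName(), sc.textFile(p.toString())));
}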
I faced the same issue while using spark to connect to S3.
My file was a gzipped CSV with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the code below:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (This seems like a hacky approach, but it solved my problem.)

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to be working fine. Out of curiosity I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contain part-00000 files. I would like to read the contents of these files for better understanding; the cat command doesn't seem to work, it prints some garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
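If you would rather read a part file from your own code than through SequenceFileDumper, a minimal hedged sketch using Hadoop's SequenceFile.Reader looks like this (the path and class name DumpSequenceFile are illustrative, and the Mahout jars must be on the classpath so Writable value classes such as VectorWritable can be loaded):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("bayes-model/trainer-weights/part-00000"); // illustrative path
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        // Instantiate key/value objects of whatever Writable types the file declares.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}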
You can read part-00000 files using Hadoop's filesystem -text option. Just go into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to add the HADOOP_CLASSPATH variable to your path. For example, if after running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$: bin/mahout seqdumper -s
~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000
-o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
