How to find the file name and size of the file from fsimage? - hadoop

I am trying to find the files in HDFS that are smaller than the block size.
Using OIV, I converted the fsimage into a delimited text file as shown below.
hdfs oiv_legacy -i /tmp/fsimage -o /tmp/fsimage_$RUNDATE/fsimage.txt -p Delimited -delimiter '#'
Since the fsimage contains a lot of data, how can I find the file name and file size of each and every file in HDFS from this output?
Can anyone please help.
Thanks in advance....

Take a look at the scripts at the end of this documentation.
Starting from:
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                 replication:int,
                                                 modTime:chararray,
                                                 accessTime:chararray,
                                                 blockSize:long,
                                                 numBlocks:int,
                                                 fileSize:long,
                                                 NamespaceQuota:int,
                                                 DiskspaceQuota:int,
                                                 perms:chararray,
                                                 username:chararray,
                                                 groupname:chararray);
-- Grab the pathname and filesize
B = FOREACH A generate path, fileSize;
-- Save results
STORE B INTO '$outputFile';
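If Pig is not an option, a rough Python sketch over the same delimited dump can do the filtering directly. The path, the '#' delimiter, and the column positions below are assumptions based on the oiv_legacy command and the schema above; adjust them to your actual output.
# Sketch: scan a '#'-delimited fsimage dump and print files smaller than
# their block size. Column positions follow the schema above.
with open('/tmp/fsimage.txt') as dump:
    for line in dump:
        fields = line.rstrip('\n').split('#')
        if len(fields) < 7:
            continue
        path, block_size, file_size = fields[0], int(fields[4]), int(fields[6])
        # Directories typically report a zero file size in the dump; skip them.
        if 0 < file_size < block_size:
            print(path, file_size)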

hadoop fs -find /tmp/fsimage -size 64 -print
Note: I am using MapR Hadoop. The syntax might vary if it's Cloudera or Hortonworks.

Related

Shell Script - Iterate through each line in text file and rename HDFS file

I have a text file in HDFS with records like below. The number of lines in the file may vary each time.
hdfs://myfile.txt
file_name_1
file_name_2
file_name_3
I have the below HDFS directory and file structure.
hdfs://myfolder/
hdfs://myfolder/file1.csv
hdfs://myfolder/file2.csv
hdfs://myfolder/file3.csv
Using a shell script, I am able to count the number of files in the HDFS directory and the number of lines in my HDFS text file. Only if the number of files in the directory matches the number of records in my text file will I proceed further with the process.
Now, I am trying to rename hdfs://myfolder/file1.csv to hdfs://myfolder/file_name_1.csv using the first record from my text file.
The second file should be renamed to hdfs://myfolder/file_name_2.csv and the third file to hdfs://myfolder/file_name_3.csv.
I am having difficulty looping through both the text file and the files in the HDFS directory.
Is there an optimal way to achieve this using a shell script?
You cannot do this directly from HDFS; you'd need to stream the file contents and then issue individual move commands.
e.g.
#!/bin/sh
COUNTER=0
for file in $(hdfs dfs -cat file.txt)
do
  NAME=$(echo "$file" | sed ...)  # replace text, as needed. TODO: extract the extension
  hdfs dfs -mv "$file" "${NAME}_${COUNTER}.csv"  # 'csv' for example - make sure the extension isn't duplicated!!
  COUNTER=$((COUNTER + 1))
done
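If the shell loop gets unwieldy, the same pairing can be sketched in Python with subprocess calls to the HDFS CLI. This is a rough, untested sketch: it assumes the lexical order of the .csv files matches the order of the names in the text file, and it reuses the hdfs://myfile.txt and hdfs://myfolder/ paths from the question.
import subprocess

# Names from the text file, and the .csv files currently in the directory.
names = subprocess.check_output(
    ['hdfs', 'dfs', '-cat', 'hdfs://myfile.txt']).decode().split()
listing = subprocess.check_output(
    ['hdfs', 'dfs', '-ls', 'hdfs://myfolder/']).decode().splitlines()
files = sorted(line.split()[-1] for line in listing if line.endswith('.csv'))

# Proceed only when the counts match, as required in the question.
if len(names) == len(files):
    for name, path in zip(names, files):
        subprocess.check_call(
            ['hdfs', 'dfs', '-mv', path, 'hdfs://myfolder/{}.csv'.format(name)])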

How to insert a header file as the first line into a data file in HDFS without using getmerge (performance issue while copying to local)?

I am trying to insert header.txt as the first line of data.txt without using getmerge. getmerge copies to local and merges into a third file, but I want this done in HDFS only.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want output in Data.txt file only like below :
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest whether we can implement this in HDFS only?
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
fs.concat(dataWithHeader, new Path[] { header, data });
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I found another way, like this:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
This has the drawback that the cat command reads the complete data, which impacts performance.

Pyspark: get list of files/directories on HDFS path

As per title. I'm aware of textFile but, as the name suggests, it works only on text files.
I need to access files/directories inside a path on either HDFS or the local filesystem. I'm using PySpark.
Using the JVM gateway may not be so elegant, but in some cases the code below can be helpful:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
for fileStatus in status:
    print(fileStatus.getPath())
I believe it's helpful to think of Spark only as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
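For example, a single glob can pull in many HDFS paths in one read (the host, port, and directory layout here are made up for illustration):
# Read every part file under any subdirectory of /some_dir in one job.
rdd = sc.textFile("hdfs://somehost:8020/some_dir/*/part-*")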
There are a few available tools to do what you want, including esutil and hdfs. The hdfs lib supports both a CLI and an API; you can jump straight to 'how do I list HDFS files in Python' in its documentation. It looks like this:
from hdfs import Config
client = Config().get_client('dev')
files = client.list('the_dir_path')
If you use PySpark, you can execute commands interactively:
List all files from a chosen directory:
hdfs dfs -ls <path> e.g.: hdfs dfs -ls /user/path:
import os
import subprocess
cmd = 'hdfs dfs -ls /user/path'
files = subprocess.check_output(cmd, shell=True).strip().split('\n')
for path in files:
    print path
Or search files in a chosen directory:
hdfs dfs -find <path> -name <expression> e.g.: hdfs dfs -find /user/path -name *.txt:
import os
import subprocess
cmd = 'hdfs dfs -find {} -name *.txt'.format(source_dir)
files = subprocess.check_output(cmd, shell=True).strip().split('\n')
for path in files:
    filename = path.split(os.path.sep)[-1].split('.txt')[0]
    print path, filename
This might work for you:
import subprocess, re
def listdir(path):
    files = str(subprocess.check_output('hdfs dfs -ls ' + path, shell=True))
    return [re.search(' (/.+)', i).group(1) for i in str(files).split("\\n") if re.search(' (/.+)', i)]
listdir('/user/')
This also worked:
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/user/')
[str(f.getPath()) for f in fs.get(conf).listStatus(path)]
If you want to read in all files in a directory, check out sc.wholeTextFiles [doc], but note that the file's contents are read into the value of a single row, which is probably not the desired result.
If you want to read only some files, then generating a list of paths (using a normal hdfs ls command plus whatever filtering you need) and passing it into sqlContext.read.text [doc] and then converting from a DataFrame to an RDD seems like the best approach.
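A rough sketch of that second approach, assuming the same sqlContext as mentioned above; the /some_dir directory and the '.log' suffix filter are placeholders:
import subprocess

# List the directory with the HDFS CLI and keep only the paths we care about.
listing = subprocess.check_output(['hdfs', 'dfs', '-ls', '/some_dir']).decode().splitlines()
paths = [line.split()[-1] for line in listing if line.endswith('.log')]

# Hand the filtered paths to Spark, then drop down to an RDD of plain strings.
df = sqlContext.read.text(paths)
rdd = df.rdd.map(lambda row: row.value)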
There is an easy way to do this using the snakebite library:
from snakebite.client import Client
hadoop_client = Client(HADOOP_HOST, HADOOP_PORT, use_trash=False)
for x in hadoop_client.ls(['/']):
    print x

How to transfer files between machines in Hadoop and search for a string using Pig

I have 2 questions:
I have a big file of records, a few million of them. I need to transfer this file from one machine to a Hadoop cluster machine. I guess there is no scp command in Hadoop (or is there?). How do I transfer files to the Hadoop machine?
Also, once the file is on my Hadoop cluster, I want to search for records which contain a specific string, say 'XYZTechnologies'. How do I do this in Pig? Some sample code would be great to give me a head start.
This is the very first time I am working on Hadoop/Pig. So please pardon me if it is a "too basic" question.
EDIT 1
I tried what Jagaran suggested and I got the following error:
2012-03-18 04:12:55,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "(" "( "" at line 3, column 26.
Was expecting:
<QUOTEDSTRING> ...
Also, please note that I want to search for the string anywhere in the record, so I am reading the tab-separated record as one single column:
A = load '/user/abc/part-00000' using PigStorage('\n') AS (Y:chararray);
For your first question, I think that Guy has already answered it.
As for the second question: if you just want to search for records which contain a specific string, a bash script is better, but if you insist on Pig, this is what I suggest:
A = load '/user/abc/' using PigStorage(',') AS (Y:chararray);
B = filter A by CONTAINS(Y, 'XYZTechnologies');
store B into 'output' using PigStorage();
Keep in mind that PigStorage's default delimiter is tab, so use a delimiter that does not appear in your file.
Then you should write a UDF that returns a boolean for CONTAINS, something like:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Contains extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        return input.get(0).toString().contains(input.get(1).toString());
    }
}
I didn't test this, but this is the direction I would have tried.
For copying to Hadoop:
1. You can install the Hadoop client on the other machine and then run hadoop dfs -copyFromLocal from the command line.
2. You could simply write Java code that uses the FileSystem API to copy to Hadoop.
For Pig, assuming you know that field 2 may contain XYZTechnologies:
A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray,Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = Filter A by Y matches '.*XYZTechnologies.*';
STORE B into '<hadoop-path>' using PigStorage();
Hi, you can use grep to find a specific string in a file on Hadoop.
For example, my file contains the following data:
Hi myself xyz. i like hadoop.
hadoop is good.
i am practicing.
So the Hadoop command is:
hadoop fs -text 'file name with path' | grep 'string to be found out'
Pig shell:
--Load the file data into the pig variable
data = LOAD 'file with path' using PigStorage() as (text:chararray);
-- find the required text
txt = FILTER data by ($0 MATCHES '.*string to be found out.*');
-- display the data
dump txt; -- or use ILLUSTRATE txt;
-- store it in another file
STORE txt INTO 'path' using PigStorage();

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
Each of these contains part-0000 files. I would like to read the contents of these files for better understanding, but the cat command doesn't seem to work; it just prints garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-0000 files using Hadoop's filesystem -text option. Just go into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to add the HADOOP_CLASSPATH variable to your path. For example, if after running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$: bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
