Pyspark: get list of files/directories on HDFS path - hadoop

As per title. I'm aware of textFile but, as the name suggests, it works only on text files.
I would need to access files/directories inside a path on either HDFS or a local path. I'm using pyspark.

Using the JVM gateway may not be the most elegant approach, but in some cases the code below can be helpful:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
for fileStatus in status:
    print(fileStatus.getPath())
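If you don't want to hard-code the namenode host and port, a variation on the same idea is to reuse the Hadoop configuration of the running SparkContext. This is only a sketch: sc._jsc.hadoopConfiguration() is an internal accessor, and the directory path below is illustrative.
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

conf = sc._jsc.hadoopConfiguration()  # configuration the job is already running with
fs = FileSystem.get(conf)             # resolves fs.defaultFS from that configuration
for f in fs.listStatus(Path('/some_dir/')):
    print(f.getPath(), f.isDirectory())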

I believe it's helpful to think of Spark only as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
There are a few available tools to do what you want, including esutil and hdfs. The hdfs lib supports both a CLI and an API; you can jump straight to 'how do I list HDFS files in Python' right here. It looks like this:
from hdfs import Config
client = Config().get_client('dev')
files = client.list('the_dir_path')
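If you don't have a config file set up for Config(), the same library can also be pointed straight at the WebHDFS endpoint. A minimal sketch, where the URL and user are placeholders for your cluster:
from hdfs import InsecureClient

client = InsecureClient('http://namenode-host:50070', user='some_user')  # placeholders
files = client.list('the_dir_path')                    # plain names
detailed = client.list('the_dir_path', status=True)    # (name, metadata) pairs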

If you use PySpark, you can execute commands interactively:
List all files from a chosen directory:
hdfs dfs -ls <path> e.g.: hdfs dfs -ls /user/path:
import subprocess

cmd = 'hdfs dfs -ls /user/path'
files = subprocess.check_output(cmd, shell=True).decode().strip().split('\n')
for path in files:
    print(path)
Or search files in a chosen directory:
hdfs dfs -find <path> -name <expression> e.g.: hdfs dfs -find /user/path -name *.txt:
import os
import subprocess

cmd = 'hdfs dfs -find {} -name "*.txt"'.format(source_dir)
files = subprocess.check_output(cmd, shell=True).decode().strip().split('\n')
for path in files:
    filename = path.split(os.path.sep)[-1].split('.txt')[0]
    print(path, filename)
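Keep in mind that hdfs dfs -ls prints a "Found N items" header plus permission/owner/size columns, so the lines above are full ls rows rather than bare paths. A small sketch of post-processing that extracts just the path column (it assumes the standard ls output layout):
import subprocess

out = subprocess.check_output(['hdfs', 'dfs', '-ls', '/user/path']).decode()
paths = [line.split()[-1]                    # the path is the last column
         for line in out.strip().split('\n')
         if line.startswith(('d', '-'))]     # skip the "Found N items" header
print(paths)
Passing the command as a list of arguments also avoids shell=True.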

This might work for you:
import re
import subprocess

def listdir(path):
    out = subprocess.check_output('hdfs dfs -ls ' + path, shell=True).decode()
    return [re.search(' (/.+)', line).group(1)
            for line in out.split('\n') if re.search(' (/.+)', line)]

listdir('/user/')
This also worked:
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/user/')
[str(f.getPath()) for f in fs.get(conf).listStatus(path)]
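The same FileSystem handle also understands Hadoop glob expressions, which is convenient when you only need a subset of the entries. A sketch reusing the hadoop, fs and conf objects from the snippet above (the pattern is illustrative):
pattern = hadoop.fs.Path('/user/*/part-*')   # illustrative glob pattern
matches = [str(f.getPath()) for f in fs.get(conf).globStatus(pattern)]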

If you want to read in all files in a directory, check out sc.wholeTextFiles [doc], but note that each file's contents are read into the value of a single record, which is probably not the desired result.
If you want to read only some of the files, then generating a list of paths (using a normal hdfs ls command plus whatever filtering you need), passing it into sqlContext.read.text [doc], and then converting from a DataFrame to an RDD seems like the best approach.
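A sketch of that path-list approach, assuming a live sqlContext (or SparkSession); the ls-plus-filter step and the paths are placeholders:
import subprocess

# Build the list of paths with whatever filtering you need (placeholder command).
out = subprocess.check_output(['hdfs', 'dfs', '-ls', '/some_dir']).decode()
paths = [line.split()[-1] for line in out.splitlines() if line.endswith('.txt')]

# read.text accepts a list of paths; .rdd yields Rows whose first field is the line text.
lines = sqlContext.read.text(paths).rdd.map(lambda r: r[0])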

There is an easy way to do this using the snakebite library:
from snakebite.client import Client

hadoop_client = Client(HADOOP_HOST, HADOOP_PORT, use_trash=False)
for x in hadoop_client.ls(['/']):
    print(x)

Related

Shell Script - Iterate through each line in text file and rename HDFS file

I have a text file in HDFS with records like below. The number of lines in the file may vary every time.
hdfs://myfile.txt
file_name_1
file_name_2
file_name_3
I have the below hdfs directory and file structure like below.
hdfs://myfolder/
hdfs://myfolder/file1.csv
hdfs://myfolder/file2.csv
hdfs://myfolder/file3.csv
Using a shell script I am able to count the number of files in the HDFS directory and the number of lines in my HDFS text file. Only if the count of files in the directory matches the number of records in my text file will I proceed further with the process.
Now, I am trying to rename hdfs://myfolder/file1.csv to hdfs://myfolder/file_name_1.csv using the first record from my text file.
The second file should be renamed to hdfs://myfolder/file_name_2.csv and the third file to hdfs://myfolder/file_name_3.csv.
I am having difficulty looping through both the text file and the files in the HDFS directory.
Is there an optimal way to achieve this using a shell script?
You cannot do this directly from HDFS; you'd need to stream the file contents and then issue individual move commands.
e.g.
#!/bin/sh
COUNTER=0
for file in $(hdfs dfs -cat file.txt)
do
  NAME=$(echo "$file" | sed ...)  # replace text, as needed. TODO: extract the extension
  hdfs dfs -mv "$file" "${NAME}_${COUNTER}.csv"  # 'csv' for example - make sure the extension isn't duplicated!!
  COUNTER=$((COUNTER + 1))
done
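If you would rather drive this from Python than pure shell, here is a hedged sketch of the same idea; the paths are illustrative, and it assumes the sorted directory listing lines up with the order of names in the control file, which you should verify for your data.
import subprocess

def hdfs_lines(cmd):
    # Run an hdfs CLI command (list of args) and return its output lines.
    return subprocess.check_output(cmd).decode().strip().split('\n')

# New base names, one per line in the control file (illustrative path).
new_names = hdfs_lines(['hdfs', 'dfs', '-cat', '/myfolder/names.txt'])

# Existing files: skip the "Found N items" header and keep only the path column.
ls_out = hdfs_lines(['hdfs', 'dfs', '-ls', '/myfolder'])
old_paths = sorted(line.split()[-1] for line in ls_out if line.endswith('.csv'))

if len(new_names) != len(old_paths):
    raise SystemExit('count mismatch - aborting')

for old, new in zip(old_paths, new_names):
    subprocess.check_call(['hdfs', 'dfs', '-mv', old, '/myfolder/{}.csv'.format(new)])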

how to insert header file as first line into data file in HDFS without using getmerge(performance issue while copying to local)?

I am trying to insert header.txt as the first line of data.txt without using getmerge. getmerge copies to local and merges into a third file, but I want this to happen in HDFS only.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want output in Data.txt file only like below :
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest whether this can be implemented in HDFS only?
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
fs.concat(dataWithHeader, new Path[] { header, data });
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I found another way, like:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
This has the drawback that the cat command reads the complete data, which impacts performance.

How to find the file name and size of the file from fsimage?

I am trying to find the files which are smaller than the block size in HDFS.
Using OIV, I converted the fsimage into a text file with delimiters, like below.
hdfs oiv_legacy -i /tmp/fsimage -o /tmp/fsimage_$RUNDATE/fsimage.txt -p Delimited -delimiter '#'
Since the fsimage has a lot of data, how can I find the file name and file size of each and every file in HDFS from it?
Can anyone please help?
Thanks in advance.
Take a look at scripts at the end of this documentation.
Starting from:
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
replication:int,
modTime:chararray,
accessTime:chararray,
blockSize:long,
numBlocks:int,
fileSize:long,
NamespaceQuota:int,
DiskspaceQuota:int,
perms:chararray,
username:chararray,
groupname:chararray);
-- Grab the pathname and filesize
B = FOREACH A generate path, fileSize;
-- Save results
STORE B INTO '$outputFile';
hadoop fs -find /tmp/fsimage size 64 -print
Note: I am using MapR Hadoop. The syntax might vary if it's Cloudera or Hortonworks.
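If Pig is not an option, the same filtering can be done in a few lines of Python over the '#'-delimited output. A sketch that follows the column order of the Pig schema above (the input path is illustrative, and directories usually show a fileSize of 0, so they are skipped):
# Columns per the schema above: path, replication, modTime, accessTime,
# blockSize, numBlocks, fileSize, NamespaceQuota, DiskspaceQuota, perms, user, group
with open('/tmp/fsimage_out/fsimage.txt') as f:   # illustrative path
    for line in f:
        cols = line.rstrip('\n').split('#')
        path, block_size, file_size = cols[0], int(cols[4]), int(cols[6])
        if 0 < file_size < block_size:
            print(path, file_size)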

Running Simple Hadoop Command using Java code

I would like to list files using the hadoop command "hadoop fs -ls filepath". I want to write Java code to achieve this. Can I write a small piece of Java code, make a jar of it, and supply it to a MapReduce job (Amazon EMR) to achieve this? Can you please point me to the code and steps with which I can achieve this?
You can list files in HDFS using Java code as below:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
...
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost:54310"), configuration);
FileStatus[] fileStatus = hdfs.listStatus(new Path("hdfs://localhost:54310/user/path"));
Path[] paths = FileUtil.stat2Paths(fileStatus);
for (Path path : paths) {
    System.out.println(path);
}
Use this in your MapReduce trigger code (main or run method) to get the list and pass it as args to your MapReduce class.
Option 2
create a shell script that reads the list of files using the hadoop fs -ls command
provide this script as part of the EMR bootstrap script to get the list of files
in the same script, write code to save the paths into text files under the path /mnt/
read this path from your MapReduce code and provide it to the arg list for your mappers and reducers
Here is my GitHub repository.
Simple commands like making a folder, putting files into HDFS, reading, listing, and writing data are present in the JAVA API folder.
And you can explore other folders to get map-reduce code in Java.

Merging MapReduce output

I have two MapReduce jobs which produce files in two separate directories which look like so:
Directory output1:
------------------
/output/20140102-r-00000.txt
/output/20140102-r-00000.txt
/output/20140103-r-00000.txt
/output/20140104-r-00000.txt
Directory output2:
------------------
/output-update/20140102-r-00000.txt
I want to merge these two directories into a new directory /output-complete/ where the 20140102-r-00000.txt replaces the original file in the /output directory and all of the "-r-0000x" is removed from the file name. The two original directories will then be empty, and the resulting directory should look as follows:
Directory output3:
-------------------
/output-complete/20140102.txt
/output-complete/20140102.txt
/output-complete/20140103.txt
/output-complete/20140104.txt
What is the best way to do this? Can I use only HDFS shell commands? Do I need to create a java program to traverse both directories and do the logic?
You can use Pig...
get_data = load '/output*/20140102*.txt' using Loader()
store get_data into "/output-complete/20140102.txt"
or an HDFS command...
hadoop fs -cat '/output*/20140102*.txt' > output-complete/20140102.txt
Single quotes may not work; if so, try double quotes.
You can use the hdfs command -getmerge for merging HDFS files.
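For reference, the usage is hdfs dfs -getmerge <src> <localdst>; it concatenates the files under <src> into a single file on the local filesystem, so you would still need a -put to land the merged result back in HDFS.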
