How to get absolute paths of end directories? - bash

I have a directory structure as follows in the HDFS,
/data/current/population/{p_1,p_2}
/data/current/sport
/data/current/weather/{w_1,w_2,w_3}
/data/current/industry
The folders population, sport, weather & industry each correspond to a different dataset. The end-folders, for example p_1 & p_2, pertain to different data sources where available.
I'm working on PySpark code that works on these end-folders (p_1, p_2, sport, w_1, w_2, w_3 & industry). Given a path like /data/current/ to your code, how do you extract the absolute paths of just the end folders?
The command hdfs dfs -ls -R /data/current gives the following output
/data/current
/data/current/population
/data/current/population/p_1
/data/current/population/p_2
/data/current/sport
/data/current/weather
/data/current/weather/w_1
/data/current/weather/w_2
/data/current/weather/w_3
/data/current/industry
But I want to end up with the absolute paths of the end-folders. My output should look like the following:
/data/current/population/p_1
/data/current/population/p_2
/data/current/sport
/data/current/weather/w_1
/data/current/weather/w_2
/data/current/weather/w_3
/data/current/industry
-Thanks in advance

Why don't you write some code using an HDFS client like Snakebite?
I'm attaching a Scala function below that does the same thing. It takes the root folder path and gives a List of all end paths. You can do the same in Python using Snakebite.
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Assumes `fs` is an initialized FileSystem, e.g. FileSystem.get(new Configuration())
def traverse(path: Path, col: ListBuffer[String]): ListBuffer[String] = {
  val stats = fs.listStatus(path)
  for (stat <- stats) {
    if (stat.isFile()) {
      col += stat.getPath.toString()
    } else {
      val nl = fs.listStatus(stat.getPath)
      if (nl.isEmpty)
        col += stat.getPath.toString()
      else {
        for (n <- nl) {
          if (n.isFile) {
            col += n.getPath.toString()
          } else {
            col ++= traverse(n.getPath, new ListBuffer)
          }
        }
      }
    }
  }
  col
}
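A rough Python sketch of the Snakebite route mentioned above (untested; the namenode host/port and the exact keys returned by ls, such as 'path' and 'file_type', are assumptions to check against your Snakebite version). It collects every directory from a recursive listing, then keeps only those that are not the parent of any other listed directory:
from snakebite.client import Client

client = Client("namenode-host", 8020)  # hypothetical namenode address

# every directory under /data/current, from a recursive listing
dirs = [e["path"] for e in client.ls(["/data/current"], recurse=True)
        if e["file_type"] == "d"]

# an "end folder" is a directory that is not the parent of any other directory
end_dirs = [d for d in dirs
            if not any(other.startswith(d + "/") for other in dirs)]

for d in sorted(end_dirs):
    print(d)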

The HDFS command below might help:
hdfs getconf -confKey fs.defaultFS
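Presumably the idea is that you can prefix the value it returns onto the paths above to get fully qualified URIs (hdfs://namenode:port/...). A small Python sketch of that idea, assuming the hdfs CLI is on the PATH (untested):
import subprocess

# value of fs.defaultFS, e.g. "hdfs://namenode:8020"
default_fs = subprocess.check_output(
    ["hdfs", "getconf", "-confKey", "fs.defaultFS"], text=True).strip()

paths = ["/data/current/population/p_1", "/data/current/sport"]
qualified = [default_fs + p for p in paths]
print(qualified)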

Related

Open csv from subdirectories with partially unknown name and save all csv in one big file

I have a bunch of files in different subfolders of the root folder. I want to open all the files with 'NBack' in the name and the '.csv' extension, but not containing the letter 'X'. Then I want to add two columns to each file and merge/concatenate all of the matching files into one big file.
I created this code so far, but for some reason it runs for an eternity and seems to process the same files again and again (but I'm not sure on this point). At the end I don't have a concatenated file, only one single file:
for root, folders, files in os.walk(path):
    for f in files:
        filteredResults = [f for f in files if not "X" in f]  # exclude files with the letter 'X'
        for ff in filteredResults:
            dd = [ff for ff in filteredResults if ff.endswith('.csv')]  # among remaining files, keep the .csv files
            for g in dd:
                r = [g for g in dd if 'NBack' in g]  # among those, keep those containing 'NBack'
                a = pd.DataFrame()  # empty dataset for the new big dataset
                for i in r:
                    o = [i for i in r if not '.pdf' in i]  # exclude .pdf's (for some reason including only .csv didn't work well enough)
                    appended = []  # necessary to append files before concatenating them????
                    for ii in o:  # for the final set of files
                        p = os.path.join(root, ii)
                        data = pd.read_csv(p)  # open .csv with specified characteristics in each subdirectory
                        split = ii.split("_")  # split file name to get additional information
                        data['Run'] = split[3]  # add this information as a new column
                        data['IDcheck'] = split[0]  # add this information as a new column
                        appended.append(data)  # necessary to append? creates a list of files
                    a = pd.concat([data])  # should create one big file but the variable a just contains one file
I would be happy for any comment or suggestion on what to try... where is the error?
This code works for me; sharing it in case someone ever has a similar question:
os.chdir(r'C:\Users\...')
rootdir = os.getcwd()
paths = []
df = pd.DataFrame()
for root, _, files in os.walk(rootdir):
    for f in files:
        path = root + "\\" + f
        # note: each condition needs its own test against path
        if ".csv" in path and "NBack" in path and "X" not in path:
            splitt = f.split('_')
            r = pd.read_csv(path)
            r['Run'] = splitt[2]
            r['IDcheck'] = splitt[0]
            df = pd.concat([df, r])
Thanks Yasir for the help!
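For reference, an alternative sketch that collects the matching frames first and calls pd.concat once at the end, which avoids growing the DataFrame inside the loop (untested; the column indices assume the same '_'-separated filename convention as above):
import os
import pandas as pd

rootdir = r'C:\Users\...'  # same root folder as above
frames = []
for root, _, files in os.walk(rootdir):
    for f in files:
        # keep NBack .csv files whose names do not contain the letter 'X'
        if f.endswith('.csv') and 'NBack' in f and 'X' not in f:
            data = pd.read_csv(os.path.join(root, f))
            parts = f.split('_')
            data['Run'] = parts[2]      # same positions as in the working code above
            data['IDcheck'] = parts[0]
            frames.append(data)
df = pd.concat(frames, ignore_index=True)  # one concat at the end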

How to count number of files under specific directory in hadoop?

I'm new to the map-reduce framework. I want to find out the number of files under a specific directory by providing the name of that directory.
e.g. Suppose we have 3 directories A, B, C, having 20, 30 and 40 part-r files respectively. I'm interested in writing a Hadoop job that counts the files/records in each directory, i.e. I want the output in a .txt file formatted like this:
A is having 20 records
B is having 30 records
C is having 40 records
These all directories are present in HDFS.
The simplest/native approach is to use the built-in hdfs commands, in this case -count:
hdfs dfs -count /path/to/your/dir >> output.txt
Or if you prefer a mixed approach via Linux commands:
hadoop fs -ls /path/to/your/dir/* | wc -l >> output.txt
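If you need the exact per-directory summary shown in the question, you can wrap -count in a small script. A rough Python sketch (untested), assuming the usual hdfs dfs -count output columns DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME:
import subprocess

dirs = ["/path/to/A", "/path/to/B", "/path/to/C"]  # hypothetical directories

with open("output.txt", "w") as out:
    for d in dirs:
        line = subprocess.check_output(["hdfs", "dfs", "-count", d], text=True)
        dir_count, file_count, content_size, pathname = line.split()
        out.write("%s is having %s records\n" % (pathname.rsplit('/', 1)[-1], file_count))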
Finally the MapReduce version has already been answered here:
How do I count the number of files in HDFS from an MR job?
Code:
// getConf() assumes this runs inside a class extending Configured (e.g. a Tool implementation)
int count = 0;
FileSystem fs = FileSystem.get(getConf());
boolean recursive = false;
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    count++;
    ri.next();
}
System.out.println("The count is: " + count);

Resizing and saving images on a new directory

I am writing a simple function that reads a sequence of images, re-sizes them and then saves each set of re-sized images to a new folder. Here is my code:
function [ image ] = FrameResize(Folder, ImgType)
    Frames = dir([Folder '/' ImgType]);
    NumFrames = size(Frames,1);
    new_size = 2;
    for i = 1 : NumFrames,
        image = double(imread([Folder '/' Frames(i).name]));
        for j = 2 : 10,
            new_size = power(new_size, j);
            % Creating a new folder called 'Low-Resolution' on the
            % previous directory
            mkdir ('.. Low-Resolution');
            image = imresize(image, [new_size new_size]);
            imwrite(image, 'Low-Resolution');
        end
    end
end
I mainly have two questions:
How can I save those images with specific names, like im_1_64, im_2_64, etc. according to the iteration and to the resolution?
How can I make the name of the folder being created change with each iteration so that I save images with the same resolution on the same folder?
Since you know the resolution will be: new_size x new_size, you can use this in the imwrite function:
imwrite(image, ['im_' num2str(i) '_' num2str(new_size) '.' ImgType]);
Assuming that ImgType holds the extension.
To setup the folders you can do something like this:
mkdir(num2str(new_size))
cd(num2str(new_size))
imwrite(image, ['im_' num2str(i) '_' num2str(new_size) '.' ImgType]);
cd ..
You have an answer you are satisfied with, but I strongly suggest doing two things differently:
Use fullfile to create/concatenate file and path names.
For example, instead of:
imread([Folder '/' Frames(i).name])
do
imread(fullfile(Folder,Frames(i).name))
It's good for relative paths too:
fullfile('..','Low-Resolution')
ans =
..\Low-Resolution
Use sprintf to create strings containing numerical data from variables. Instead of:
['im_' num2str(i) '_' num2str(new_size) '.' ImgType]
do
sprintf('im_%d_%d.%s', i, new_size, ImgType)
You can even specify how many digits you want per integer. Compare:
K>> sprintf('im_%d_%d.%s', i, new_size, ImgType)
ans =
im_2_64.png
K>> sprintf('im_%02d_%d.%s', i, new_size, ImgType)
ans =
im_02_64.png

Comparing many files in Bash

I'm trying to automate a task at work that I normally do by hand, that is taking database output from the permissions of multiple users and comparing them to see what they have in common. I have a script right now that uses comm and paste, but it's not giving me all the output I'd like.
Part of the problem comes in comm only dealing with two files at once, and I need to compare at least three to find a trend. I also need to determine if two out of the three have something in common, but the third one doesn't have it (so comparing the output of two comm commands doesn't work). I need these in comma separated values so it can be imported into Excel. Each user has a column, and at the end is a listing of everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).
In addition to the code I have to clean all the extra cruft off the raw csv file, here's what I have so far in comparing four users. It's highly inefficient, but it's what I know.
cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8
comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2
paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv
Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes it together. This works better than doing it by hand, but I know I could be doing more.
An example input file would be something like:
User1
Active Directory
Internet
S: Drive
Sales Records
User2
Active Directory
Internet
Pricing Lookup
S: Drive
User3
Active Directory
Internet
Novell
Sales Records
where they have AD and Internet in common, two out of three have Sales Records access and S: Drive permission, and only one each has Novell and Pricing access.
Can someone give me a hand in what I'm missing?
Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    pcount = asort(perms)
    ucount = asort(users)
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")
    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if (array[users[u], perms[p]]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
Save this file, perhaps calling it "correlate", then set it to be executable:
$ chmod u+x correlate
Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:
$ ./correlate user*
and you would get the following output based on your sample input:
                  user1   user2   user3
Active Directory   X       X       X
Internet           X       X       X
Novell                             X
Pricing Lookup             X
S: Drive           X       X
Sales Records      X               X
Edit:
This version doesn't use asort() and so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")
    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if (array[u, p]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
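Since the same idea works in any language with associative arrays, here is a rough Python sketch of the same correlation table (untested, shown only to illustrate the approach; the 8-character column width matches the awk version):
#!/usr/bin/env python3
# Print one column per input file and one row per permission line,
# marking an X where that user's file contains that permission.
import sys

users = sys.argv[1:]      # filenames double as column headers
perms = {}                # permission line -> set of users that have it

for user in users:
    with open(user) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line:
                perms.setdefault(line, set()).add(user)

width = max(len(p) for p in perms) + 2
print(" " * width + "".join(u.ljust(8) for u in users))
for perm in sorted(perms):
    row = "".join((" X" if user in perms[perm] else "").ljust(8) for user in users)
    print(perm.ljust(width) + row)
You would run it the same way as the awk version, e.g. ./correlate.py user*.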
You can use the diff3 program. From the man page:
diff3 - compare three files line by line
Given your sample inputs, above, running diff3 results in:
====
1:3,4c
S: Drive
Sales Records
2:3,4c
Pricing Lookup
S: Drive
3:3,4c
Novell
Sales Records
Does this get you any closer to what you're looking for?
I would use the strings command to remove any binary content from the files, cat them together, then sort the result and run uniq -c on it to get a count of occurrences of each string.

Could I do this blind relative to absolute path conversion (for perforce depot paths) better?

I need to "blindly" (i.e. without access to the filesystem, in this case the source control server) convert some relative paths to absolute paths. So I'm playing with dotdots and indices. For those that are curious I have a log file produced by someone else's tool that sometimes outputs relative paths, and for performance reasons I don't want to access the source control server where the paths are located to check if they're valid and more easily convert them to their absolute path equivalents.
I've gone through a number of (probably foolish) iterations trying to get it to work - mostly a few variations of iterating over the array of folders and trying delete_at(index) and delete_at(index-1) - but my index kept incrementing while I was deleting elements of the array out from under myself, which didn't work for cases with multiple dotdots. Any tips on improving it in general, or specifically on the lack of non-consecutive dotdot support, would be welcome.
Currently this is working with my limited examples, but I think it could be improved. It can't handle non-consecutive '..' directories, and I'm probably doing a lot of wasteful (and error-prone) things that I don't need to do, because I'm a bit of a hack.
I've found a lot of examples of converting other types of relative paths using other languages, but none of them seemed to fit my situation.
These are my example paths that I need to convert, from:
//depot/foo/../bar/single.c
//depot/foo/docs/../../other/double.c
//depot/foo/usr/bin/../../../else/more/triple.c
to:
//depot/bar/single.c
//depot/other/double.c
//depot/else/more/triple.c
And my script:
begin
  paths = File.open(ARGV[0]).readlines
  puts(paths)
  new_paths = Array.new
  paths.each { |path|
    folders = path.split('/')
    if ( folders.include?('..') )
      num_dotdots = 0
      first_dotdot = folders.index('..')
      last_dotdot = folders.rindex('..')
      folders.each { |item|
        if ( item == '..' )
          num_dotdots += 1
        end
      }
      if ( first_dotdot and ( num_dotdots > 0 ) ) # this might be redundant?
        folders.slice!(first_dotdot - num_dotdots..last_dotdot) # dependent on consecutive dotdots only
      end
    end
    folders.map! { |elem|
      if ( elem !~ /\n/ )
        elem = elem + '/'
      else
        elem = elem
      end
    }
    new_paths << folders.to_s
  }
  puts(new_paths)
end
Let's not reinvent the wheel... File.expand_path does that for you:
[
'//depot/foo/../bar/single.c',
'//depot/foo/docs/../../other/double.c',
'//depot/foo/usr/bin/../../../else/more/triple.c'
].map {|p| File.expand_path(p) }
# ==> ["//depot/bar/single.c", "//depot/other/double.c", "//depot/else/more/triple.c"]
Why not just use File.expand_path:
irb(main):001:0> File.expand_path("//depot/foo/../bar/single.c")
=> "//depot/bar/single.c"
irb(main):002:0> File.expand_path("//depot/foo/docs/../../other/double.c")
=> "//depot/other/double.c"
irb(main):003:0> File.expand_path("//depot/foo/usr/bin/../../../else/more/triple.c")
=> "//depot/else/more/triple.c"
For a DIY solution using Arrays, this comes to mind (also works for your examples):
absolute = []
relative = "//depot/foo/usr/bin/../../../else/more/triple.c".split('/')
relative.each { |d| if d == '..' then absolute.pop else absolute.push(d) end }
puts absolute.join('/')
Python code:
paths = ['//depot/foo/../bar/single.c',
         '//depot/foo/docs/../../other/double.c',
         '//depot/foo/usr/bin/../../../else/more/triple.c']

def convert_path(path):
    result = []
    for item in path.split('/'):
        if item == '..':
            result.pop()   # a '..' cancels the previous component
        else:
            result.append(item)
    return '/'.join(result)

for path in paths:
    print(convert_path(path))
prints:
//depot/bar/single.c
//depot/other/double.c
//depot/else/more/triple.c
You can use the same algorithm in Ruby.
