Ruby FTP Separating files from Folders - ruby

I'm trying to crawl an FTP server and pull down all the files recursively.
So far I have been listing each directory with:
ftp.list.each do |entry|
  if entry.split(/\s+/)[0][0, 1] == "d"
    out[:dirs] << entry.split.last unless black_dirs.include? entry.split.last
  else
    out[:files] << entry.split.last unless black_files.include? entry.split.last
  end
end
But it turns out that taking only the text after the last space breaks file and directory names that contain spaces.
I need a little help with the logic here.
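One possible way around the spaces problem, sketched under the assumption that the server returns Unix-style listings with exactly eight columns before the name: pass a field limit to split so that everything after the eighth column stays together. The method name parse_list_line is made up for this example, and symlink entries ("name -> target") are not handled.

```ruby
# Parse one Unix-style LIST line into a directory flag and a name.
# Assumption: eight whitespace-separated columns precede the name.
def parse_list_line(line)
  # A limit of 9 leaves the remainder of the line, spaces included, in the last field.
  fields = line.chomp.split(/\s+/, 9)
  return nil if fields.size < 9

  { dir: fields[0].start_with?('d'), name: fields[8] }
end

entry = parse_list_line('drwxr-xr-x   2 user group     4096 Jan 01 12:00 My Folder')
puts entry[:name]
```

Names that begin with a space would still lose their leading whitespace, so this is only as reliable as the listing format itself.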

You can avoid recursion if you list all files at once
files = ftp.nlst('**/*.*')
Directories are not included in the list, but the full FTP path is still available in each name.
EDIT
This assumes that every file name contains a dot and that directory names don't. Thanks to @Niklas B. for pointing this out.

There are a huge variety of FTP servers around.
We have clients who use obscure, proprietary Windows-based servers, and the file listings they return look completely different from those of Linux servers.
So what I ended up doing is for each file/directory entry I try changing directory into it and if this doesn't work - consider it a file :)
The following method is "bullet proof":
# Checks if the given file_name is actually a file.
def is_ftp_file?(ftp, file_name)
ftp.chdir(file_name)
ftp.chdir('..')
false
rescue
true
end
file_names = ftp.nlst.select {|fname| is_ftp_file?(ftp, fname)}
Works like a charm, but please note: if the FTP directory has tons of files in it - this method takes a while to traverse all of them.

You can also use a regular expression. I put one together; please verify that it works for you, as I don't know whether your directory listing looks different. Named groups require Ruby 1.9 or later, by the way.
reg = /^(?<type>.{1})(?<mode>\S+)\s+(?<number>\d+)\s+(?<owner>\S+)\s+(?<group>\S+)\s+(?<size>\d+)\s+(?<mod_time>.{12})\s+(?<path>.+)$/
match = entry.match(reg)
You are able to access the elements by name then
match[:type] contains a 'd' if it's a directory and, in Unix-style listings, a '-' if it's a regular file.
All the other elements are there as well. Most importantly match[:path].

Assuming that the FTP server returns Unix-like file listings, the following code works. At least for me.
regex = /^d[rwx-]+\s+\d+\s+\S+\s+\S+\s+\d+\s+\w+\s+\d+\s+[\d:]+\s(.+)/
ftp.ls.each do |line|
if dir = line.match(regex)
puts dir[1]
end
end
dir[1] contains the name of the directory (given that the inspected line actually represents a directory).

As @Alex pointed out, using patterns in filenames for this is hardly reliable. Directories CAN have dots in their names (.ssh for example), and listings can be very different on different servers.
His method works, but as he himself points out, takes too long.
I prefer using the .size method from Net::FTP.
It returns the size of a file, or throws an error if the file is a directory.
# Pass in an already-open connection instead of logging in again for every item.
def item_is_file?(ftp, item)
  ftp.size(item).is_a?(Numeric)
rescue Net::FTPPermError
  false
end

I'll add my solution to the mix...
Using ftp.nlst('**/*.*') did not work for me... server doesn't seem to support that ** syntax.
The chdir trick with a rescue seems expensive and hackish.
Assuming that all files have at least one char, a single period, and then an extension, I did a simple recursion.
def list_all_files(ftp, folder)
entries = ftp.nlst(folder)
file_regex = /.+\.{1}.*/
files = entries.select{|e| e.match(file_regex)}
subfolders = entries.reject{|e| e.match(file_regex)}
subfolders.each do |subfolder|
files += list_all_files(ftp, subfolder)
end
files
end
nlst seems to return the full path to whatever it finds, non-recursively. So each time you get a listing, separate the files from the folders, then process every folder you find recursively and collect all the file results.
To call, you can pass a starting folder
files = list_all_files(ftp, "my_starting_folder/my_sub_folder")
files = list_all_files(ftp, ".")
files = list_all_files(ftp, "")
files = list_all_files(ftp, nil)
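Since the dot heuristic misclassifies entries such as .ssh or README, the same recursion can take the directory test as a block, so that any of the checks from the other answers (the chdir probe, the size probe) can be plugged in. This is only a sketch: collect_files and FakeFtp are names invented here, FakeFtp is an in-memory stand-in so the example runs without a server, and it assumes nlst returns full paths as described above.

```ruby
# Recursively collect file paths. The block decides whether an entry is a directory.
def collect_files(ftp, folder, &directory)
  ftp.nlst(folder).flat_map do |entry|
    directory.call(entry) ? collect_files(ftp, entry, &directory) : [entry]
  end
end

# Hypothetical stand-in for an FTP connection: maps each folder to its entries.
class FakeFtp
  def initialize(tree)
    @tree = tree
  end

  # Assumption: like the server described above, returns full paths.
  def nlst(folder)
    @tree.fetch(folder, [])
  end

  def directory?(path)
    @tree.key?(path)
  end
end

ftp = FakeFtp.new(
  'pub' => ['pub/.ssh', 'pub/README', 'pub/my docs'],
  'pub/.ssh' => ['pub/.ssh/config'],
  'pub/my docs' => ['pub/my docs/a b.txt']
)
files = collect_files(ftp, 'pub') { |entry| ftp.directory?(entry) }
puts files
```

With a real connection the block could be something like { |entry| !is_ftp_file?(ftp, entry) }. Some servers also return "." and ".." from nlst, and those would have to be skipped to avoid endless recursion.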

Related

TarWriter help adding multiple directories and files

The code in this question works, but only with a single directory. I can also make it output a file archive as well. But not both a file and a directory, or two directories. I am hoping to make it work with a list of paths, including directories and files that are all placed in the same archive. If I try to add more than one path, the tarfile becomes corrupted. I thought I could continue adding files/data to archive as long as the TarWriter object is open.
QUESTION: In addition to how I can make the above example work with multiple paths (in linked post), can someone please help explain how files and directories are added into an archive? I have looked at the directory structure/format, but I can't seem to understand why this wouldn't work with more than one directory/file.
You can pass multiple glob patterns to Dir[]:
Dir[File.join(path1, '**/*'), File.join(path2, '**/*')]
After which the code would be something like this:
require 'rubygems/package'
BLOCKSIZE_TO_READ = 1024 * 1000
# Writes the contents of every directory in paths into a single archive.
def create_tarball(tar_filename, *paths)
  File.open(tar_filename, 'wb') do |tarfile|
    Gem::Package::TarWriter.new(tarfile) do |tar|
      paths.each do |path|
        Dir[File.join(path, '**/*')].each do |file|
          mode = File.stat(file).mode
          # Entry names are relative to the directory they came from.
          relative_file = file.sub(/^#{Regexp.escape(path)}\/?/, '')
          if File.directory?(file)
            tar.mkdir(relative_file, mode)
          else
            tar.add_file(relative_file, mode) do |tf|
              File.open(file, 'rb') do |f|
                while (buffer = f.read(BLOCKSIZE_TO_READ))
                  tf.write buffer
                end
              end
            end
          end
        end
      end
    end
  end
  tar_filename
end
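On the second part of the question, how entries get into the archive: as far as I understand it, each mkdir or add_file call appends one entry to the stream, and the end-of-archive marker is written only once, when the TarWriter is closed. So all paths have to go through the same writer. A small sketch that writes a directory and two files into an in-memory archive and reads the entries back (the entry names are made up for the example):

```ruby
require 'rubygems/package'
require 'stringio'

io = StringIO.new
io.binmode

# One writer for the whole archive; every call below appends one entry.
Gem::Package::TarWriter.new(io) do |tar|
  tar.mkdir('docs', 0o755)
  tar.add_file('docs/readme.txt', 0o644) { |f| f.write('hello') }
  tar.add_file('notes.txt', 0o644) { |f| f.write('top-level file') }
end

# Read the archive back and list what ended up in it.
io.rewind
entries = []
Gem::Package::TarReader.new(io).each do |entry|
  entries << [entry.full_name, entry.directory?]
end
entries.each { |name, is_dir| puts "#{is_dir ? 'dir ' : 'file'} #{name}" }
```

Opening a second TarWriter on the same file, or writing after the first one has closed, would put data after the end marker, which may be what corrupts the archive in the multi-path attempt.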

How to find text file in same directory

I am trying to read a list of baby names from the year 1880 in CSV format. My program, when run in the terminal on OS X, returns an error indicating that yob1880.txt doesn't exist.
No such file or directory @ rb_sysopen - /names/yob1880.txt (Errno::ENOENT)
from names.rb:2:in `<main>'
The location of both the script and the text file is /Users/*****/names.
lines = []
File.expand_path('../yob1880.txt', __FILE__)
IO.foreach('../yob1880.txt') do |line|
lines << line
if lines.size >= 1000
lines = FasterCSV.parse(lines.join) rescue next
store lines
lines = []
end
end
store lines
If you're running the script from the /Users/*****/names directory, and the files also exist there, you should simply remove the "../" from your pathnames to prevent looking in /Users/***** for the files.
Use this approach to referencing your files, instead:
File.expand_path('yob1880.txt', __FILE__)
IO.foreach('yob1880.txt') do |line|
Note that the File.expand_path is doing nothing at the moment, as the return value is not captured or used for any purpose; it simply consumes resources when it executes. Depending on your actual intent, it could realistically be removed.
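If the intent was to locate the file next to the script, the return value of File.expand_path has to be captured, and the base has to be the script's directory rather than the script itself. A minimal sketch, where data_path is a helper name invented for this example:

```ruby
# Build an absolute path to a data file that sits next to the given script.
def data_path(name, script = __FILE__)
  File.expand_path(name, File.dirname(script))
end

path = data_path('yob1880.txt')
# The captured path can then be used regardless of the current directory:
# IO.foreach(path) { |line| ... }
```

This works no matter which directory the script is started from, because nothing depends on the current working directory.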
Going deeper on this topic, it may be better for the script to be explicit about the directory in which it looks for files. Consider these approaches:
Change to the directory in which the script exists, prior to opening files
Dir.chdir(File.dirname(File.expand_path(__FILE__)))
IO.foreach('yob1880.txt') do |line|
This explicitly requires that the script and the data be stored relative to one another; in this case, they would be stored in the same directory.
Provide a specific path to the files
# do not use Dir.chdir or File.expand_path
IO.foreach('/Users/****/yob1880.txt') do |line|
This can work if the script is used in a small, contained environment, such as your own machine, but will be brittle if the data is moved to another directory or to another machine. Generally, this approach is not useful, except for short-lived scripts for personal use.
Never put a script using this approach into production use.
Work only with files in the current directory
# do not use Dir.chdir or File.expand_path
IO.foreach('yob1880.txt') do |line|
This will work if you run the script from the directory in which the data exists, but will fail if run from another directory. This approach typically works better when the script detects the contents of the directory, rather than requiring certain files to already exist there.
Many Linux/Unix utilities, such as cat and grep use this approach, if the command-line options do not override such behavior.
Accept a command-line option to find data files
require 'optparse'
base_directory = "."
OptionParser.new do |opts|
  opts.banner = "Usage: example.rb [options]"
  # Store the expanded directory; Dir.chdir would return 0, not a path.
  opts.on('-d', '--dir NAME', 'Directory name') { |v| base_directory = File.expand_path(v) }
end.parse!
IO.foreach(File.join(base_directory, 'yob1880.txt')) do |line|
  # do lines
end
This will give your script a -d or --dir option in which to specify the directory in which to find files.
Use a configuration file to find data files
This code would allow you to use a YAML configuration file to define where the files are located:
require 'yaml'
config_filename = File.expand_path("~/yob/config.yml")
config = {}
name = nil
config = YAML.load_file(config_filename)
base_directory = config["base"]
IO.foreach(File.join(base_directory, 'yob1880.txt')) do |line|
# do lines
end
This doesn't include any error handling related to finding and loading the config file, but it gets the point across. For additional information on using a YAML config file with error handling, see my answer on Asking user for information, and never having to ask again.
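For completeness, here is one way the missing error handling could look. This is a sketch and not the code from the answer referenced above; the method name base_directory_from and the choice to fall back to the current directory are assumptions made for this example.

```ruby
require 'yaml'

# Returns config["base"] from the YAML file, or the fallback when the file is
# missing, cannot be parsed, or has no "base" key.
def base_directory_from(config_filename, fallback = '.')
  config = YAML.load_file(config_filename)
  config.is_a?(Hash) && config['base'] ? config['base'] : fallback
rescue Errno::ENOENT, Psych::SyntaxError
  fallback
end

base_directory = base_directory_from('config.yml')
puts File.join(base_directory, 'yob1880.txt')
```

Falling back silently is convenient for a personal script; a tool used by others would probably want to report that the config file could not be read.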
Final thoughts
You have the tools to establish ways to locate your data files. You can even mix-and-match solutions for a more sophisticated solution. For instance, you could default to the current directory (or the script directory) when no config file exists, and allow the command-line option to manually override the directory, when necessary.
Here's a technique I always use when I want to normalize the current working directory for my scripts. This is a good idea because in most cases you code your script and place the supporting files in the same folder, or in a sub-folder of the main script.
This resets the current working directory to the same folder as where the script is situated in. After that it's much easier to figure out the paths to everything:
# Reset working directory to same folder as current script file
Dir.chdir(File.dirname(File.expand_path(__FILE__)))
After that you can open your data file with just:
IO.foreach('yob1880.txt')

Remove certain characters from several files

I want to remove the following characters from several files in a folder. What I have so far is this:
str.delete! '!@#$%^&*()'
which I think will work to remove the characters. What do I need to do to make it run through all the files in the folder?
You clarified your question, stating you want to remove certain characters from the contents of files in a directory. I created a straightforward way to traverse a directory (and optionally, subdirectories) and remove specified characters from the file contents. I used String#delete like you started with. If you want to remove more advanced patterns you might want to change it to String#gsub with regular expressions.
The example below will traverse a tmp directory (and all subdirectories) relative to the current working directory and remove all occurrences of !, $, and # inside the files found. You can of course also pass the absolute path, e.g., C:/some/dir. Notice I do not filter on files, I assume it's all text files in there. You can of course add a file extension check if you wish.
def replace_in_files(dir, chars, subdirs=true)
Dir[dir + '/*'].each do |file|
if File.directory?(file) # Traverse inner directories if subdirs == true
replace_in_files(file, chars, subdirs) if subdirs
else # Replace file contents
replaced = File.read(file).delete(chars)
File.write(file, replaced)
end
end
end
replace_in_files('tmp', '!$#')
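A note on switching to String#gsub: String#delete gives a leading ^ and a - between two characters special meaning (negation and ranges), so a character list such as '^-' does not delete those two characters literally. A possible gsub-based variant that treats every character in the list literally, with strip_chars being a name made up for this sketch:

```ruby
# Remove every character listed in chars from text, treating each one literally.
def strip_chars(text, chars)
  return text if chars.empty?

  # Regexp.escape neutralises characters that are special inside a regex.
  text.gsub(/[#{Regexp.escape(chars)}]/, '')
end

puts strip_chars('Pr!ce: $5 #1', '!$#')
```

Inside replace_in_files, the call File.read(file).delete(chars) could then become strip_chars(File.read(file), chars).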
I think this might work, although I'm a little shaky on the Dir class in Ruby.
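Dir.foreach('/path/to/dir') do |name|
  path = File.join('/path/to/dir', name)
  next unless File.file?(path)
  # Delete the characters from the file's contents, not from its name.
  File.write(path, File.read(path).delete('!@#$%^&*()'))
end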
There's a more general version of your question here: Iterate through every file in one directory
Hopefully a more thorough answer will be forthcoming but maybe this'll get you where you need.
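Dir.foreach('filepath') do |f|
  path = File.join('filepath', f)
  next if File.directory?(path)
  file = File.new(path, 'r+')
  text = file.read.delete('!@#$%^&*()')
  file.rewind
  file.write(text)
  file.truncate(file.pos) # drop the bytes left over from the longer original
  file.close
end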
The reason you can't do
file.write(file.read.delete('!@#$%^&*()'))
is that file.read leaves the "cursor" at the end of the text. Instead of writing over the file, you would be appending to the file, which isn't what you want.
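The cursor behaviour is easy to try out on a temporary file. The sketch below (delete_chars_in_file is a name invented for it) reads, rewinds, writes, and then truncates, because the cleaned text is shorter than the original and the old tail would otherwise remain at the end of the file.

```ruby
require 'tempfile'

# Rewrite a file in place without the given characters.
def delete_chars_in_file(path, chars)
  File.open(path, 'r+') do |file|
    text = file.read.delete(chars) # read leaves the position at end of file
    file.rewind                    # go back to the start before writing
    file.write(text)
    file.truncate(file.pos)        # cut off what remains of the old content
  end
end

Tempfile.create('demo') do |tmp|
  tmp.write('a!b@c#')
  tmp.flush
  delete_chars_in_file(tmp.path, '!@#')
  puts File.read(tmp.path)
end
```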
You could also add a method to the File class that would move the cursor to the beginning of the file.
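class File
  # Reads the whole file, then moves the cursor back to the beginning.
  def newRead
    data = self.read
    self.rewind
    data
  end
end
Dir.foreach('filepath') do |f|
  path = File.join('filepath', f)
  next if File.directory?(path)
  file = File.new(path, 'r+')
  file.write(file.newRead.delete('!@#$%^&*()'))
  file.truncate(file.pos) # drop the bytes left over from the longer original
  file.close
end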

How to check the content of each .txt file in a folder with Ruby

I have a folder that contains files. I was wondering how I can check whether each .txt file in the folder contains the word "BREAK". I know it must be very easy, but I can't quite figure out how to do it.
This is what I've tried so far
Dir.glob('/path/to/dir/*.txt') do |txt_file|
# And here I need a method that opens the 'txt_file'
# and checks if it contains "BREAK"
end
The below would return an array of files containing "BREAK"
files = Dir.glob('/path/to/dir/*.txt').select do |txt_file|
File.read(txt_file).include? "BREAK"
end
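File.read loads each file into memory in full. If the files can be large, a line-by-line variant that stops at the first match may be preferable. A sketch, with files_containing being a name chosen for this example:

```ruby
# Return the .txt files in dir that contain word, reading each file line by line.
def files_containing(dir, word)
  Dir.glob(File.join(dir, '*.txt')).select do |path|
    # any? stops reading as soon as one line matches.
    File.foreach(path).any? { |line| line.include?(word) }
  end
end

# files = files_containing('/path/to/dir', 'BREAK')
```

Unlike the whole-file version, this would miss a match that is split across a line break, which does not matter for a single word like "BREAK".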

RubyZip - files from different directories have path in zip

I'm trying to use RubyZip to package up some files. At the moment I have a method which happily zips one particular directory and its sub-directories.
def zip_directory(zipfile)
  Dir["#{@directory_to_zip}/**/**"].reject{|f| reject_file(f)}.each do |file_path|
    file_name = file_path.sub(@directory_to_zip+'/','')
zipfile.add(file_name, file_path)
end
end
However, I want to include a file from a completely different folder. I have the following method to solve this:
def zip_additional(zipfile)
additional_files.reject{|f| reject_file(f)}.each do |file_path|
file_name = file_path.split('\\').last
zipfile.add(file_name, file_path)
end
end
While the file is added, it also copies the directory structure instead of placing the file at the root of the folder. This is really annoying and makes it more difficult to work with.
How can I get around this?
Thanks
Ben
There is a setting in zip libraries to include or exclude the full path; check that setting.
It turns out the cause was that the file name still contained the full path. My split didn't work because the path used / instead of \ as the separator. With the path removed from the file name, it just worked.
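To avoid depending on which separator the path happens to use, the entry name can be derived with File.basename after normalising backslashes. A small sketch; zip_entry_name is a hypothetical helper, not part of RubyZip:

```ruby
# Derive the archive entry name from a path that may use / or \ as separator.
def zip_entry_name(file_path)
  File.basename(file_path.tr('\\', '/'))
end

puts zip_entry_name('C:\\reports\\2014\\summary.txt')
puts zip_entry_name('C:/reports/2014/summary.txt')
```

In zip_additional, the line that splits on '\\' could then become file_name = zip_entry_name(file_path).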
