Parsing a directory of files with regex and Ruby

I'm trying to do a simple regex to grab specific text out of a bunch of text files in a directory. The code I'm using is below:
input_dir = File.join('path/to/file/dir/', "*.txt")
Dir.glob(input_dir) do |file|
  if /\.txt$/i.match file
    File.open(file, "r") do |_file|
      /==BEGIN==(.*)==END==/.match _file.read
      puts $1
    end
  end
end
That works for exactly 1 of the files in the directory, but all other files return nil. Am I missing something here?

Hard to guess with so little data, but could it be that in most files (except one), ==BEGIN== and ==END== are on different lines?
Does /==BEGIN==(.*)==END==/m.match _file.read change anything? The /m modifier allows the dot to also match newlines in Ruby.
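A minimal sketch of that suggestion, assuming the markers may sit on different lines. The helper name extract_block is hypothetical, and the non-greedy .*? is an addition (not in the original answer) so that a file with several blocks stops at the first ==END==:

```ruby
# Hypothetical helper: returns the text between the markers, or nil.
def extract_block(text)
  # /m lets "." match newlines; ".*?" stops at the first ==END==.
  match = /==BEGIN==(.*?)==END==/m.match(text)
  match && match[1]
end

single_line = "==BEGIN==abc==END=="
multi_line  = "header\n==BEGIN==\nline 1\nline 2\n==END==\nfooter\n"

puts extract_block(single_line)
puts extract_block(multi_line)
```

Inside the Dir.glob loop this would be called as extract_block(File.read(file)).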

Related

Ignoring hidden files with regular expressions in Ruby

I'm trying to read every file in a specified directory. I'd like to ignore hidden files. I've found a way to do this, but I'm pretty sure it is the most inefficient way to do this.
This is what I've tried,
Find.find(directory) do |path|
  file_paths << path if path =~ /.*\./ and !path.split("/")[-1].to_s.starts_with?(".")
end
This works. But I hate it.
I then tried to do this,
file_paths << path if path =~ /.*\./ and path =~ /^\./
But this returned nothing for me. What am I doing wrong here?
You could just use Dir
file_paths = Dir.glob("#{directory}/*")
Dir.glob docs:
Returns the filenames found by expanding pattern which is an Array of the patterns or the pattern String, either as an array or as parameters to the block.
Note, this will not match Unix-like hidden files (dotfiles). In order to include those in the match results, you must use something like "{*,.*}".
Per @arco444, if you want this to search recursively:
file_paths = Dir.glob("#{directory}/**/*")
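To see the dotfile behaviour for yourself, here is a small sketch that compares a plain glob with one that passes File::FNM_DOTMATCH. The helper name visible_and_all and the file names are made up for the demonstration:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: basenames of the regular files directly inside dir,
# first without dotfiles, then with them.
def visible_and_all(dir)
  pattern = File.join(dir, '*')
  visible = Dir.glob(pattern).select { |p| File.file?(p) }
  all     = Dir.glob(pattern, File::FNM_DOTMATCH).select { |p| File.file?(p) }
  [visible, all].map { |list| list.map { |p| File.basename(p) }.sort }
end

Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'notes.txt'))
  FileUtils.touch(File.join(dir, '.hidden'))
  visible, all = visible_and_all(dir)
  p visible
  p all
end
```

The File.file? filter is there because, depending on the Ruby version, FNM_DOTMATCH can also return the "." entry of the directory itself.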
If you want to ignore files whose names start with ., the code below appends only the other files to the file_paths array:
require 'find'
file_paths = []
Find.find(directory) do |path|
  next unless File.file?(path)
  file_paths << path unless File.basename(path).start_with?(".")
end
Note that this will not necessarily ignore hidden files, for the reasons mentioned in the comments. It also currently includes "hidden" directories, i.e. a file such as /some/.hidden/directory/normal.file would be included in the list.
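If files under hidden directories should be skipped as well, Find.prune can stop the traversal from descending into them. A rough sketch, where the helper name visible_files is hypothetical and "hidden" is assumed to mean only "basename starts with a dot":

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: regular files under root, skipping dotfiles and
# everything beneath hidden directories.
def visible_files(root)
  paths = []
  Find.find(root) do |path|
    hidden = File.basename(path).start_with?('.')
    if File.directory?(path)
      # Never prune the starting directory itself (it may be ".").
      Find.prune if hidden && path != root
    elsif File.file?(path) && !hidden
      paths << path
    end
  end
  paths
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, '.hidden', 'directory'))
  FileUtils.touch(File.join(dir, '.hidden', 'directory', 'normal.file'))
  FileUtils.touch(File.join(dir, 'visible.txt'))
  FileUtils.touch(File.join(dir, '.dotfile'))
  puts visible_files(dir).map { |p| File.basename(p) }
end
```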

Remove certain characters from several files

I want to remove the following characters from several files in a folder. What I have so far is this:
str.delete! '!@#$%^&*()'
which I think will work to remove the characters. What do I need to do to make it run through all the files in the folder?
You clarified your question, stating you want to remove certain characters from the contents of files in a directory. I created a straightforward way to traverse a directory (and optionally, subdirectories) and remove specified characters from the file contents. I used String#delete like you started with. If you want to remove more advanced patterns, you might want to change it to String#gsub with regular expressions.
The example below will traverse a tmp directory (and all subdirectories) relative to the current working directory and remove all occurrences of !, $, and # inside the files found. You can of course also pass the absolute path, e.g., C:/some/dir. Notice I do not filter on files, I assume it's all text files in there. You can of course add a file extension check if you wish.
def replace_in_files(dir, chars, subdirs=true)
  Dir[dir + '/*'].each do |file|
    if File.directory?(file) # Traverse inner directories if subdirs == true
      replace_in_files(file, chars, subdirs) if subdirs
    else # Replace file contents
      replaced = File.read(file).delete(chars)
      File.write(file, replaced)
    end
  end
end
replace_in_files('tmp', '!$#')
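For the String#gsub variant mentioned above, the change could look roughly like this. The helper name strip_pattern and the sample pattern are only illustrations, not part of the answer's code:

```ruby
# Hypothetical helper: removes everything that matches pattern (a Regexp).
def strip_pattern(text, pattern)
  text.gsub(pattern, '')
end

# Removes the characters !, $ and # as well as anything in parentheses.
puts strip_pattern('Price: $100!! (approx) #sale', /[!$#]|\(.*?\)/)
```

In replace_in_files this would replace the File.read(file).delete(chars) call, with a Regexp passed in place of chars.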
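I think this might work, although I'm a little shaky on the Dir class in Ruby.
Dir.foreach('/path/to/dir') do |name|
  path = File.join('/path/to/dir', name)
  next unless File.file?(path) # skips ".", ".." and subdirectories
  File.write(path, File.read(path).delete('!@#$%^&*()'))
end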
There's a more general version of your question here: Iterate through every file in one directory
Hopefully a more thorough answer will be forthcoming but maybe this'll get you where you need.
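Dir.foreach('filepath') do |f|
  path = File.join('filepath', f)
  next if File.directory?(path) # skips ".", ".." and subdirectories
  file = File.new(path, 'r+')
  text = file.read.delete('!@#$%^&*()')
  file.rewind
  file.write(text)
  file.truncate(file.pos) # the new text is shorter, so cut off the old tail
  file.close
end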
The reason you can't do
file.write(file.read.delete('!@#$%^&*()'))
is that file.read leaves the "cursor" at the end of the text. Instead of writing over the file, you would be appending to the file, which isn't what you want.
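You could also add a method to the File class that reads the file and then moves the cursor back to the beginning.
class File
  # Reads the whole file, then rewinds so the next write starts at the top.
  def newRead
    data = read
    rewind
    data
  end
end
Dir.foreach('filepath') do |f|
  path = File.join('filepath', f)
  next if File.directory?(path) # skips ".", ".." and subdirectories
  file = File.new(path, 'r+')
  file.write(file.newRead.delete('!@#$%^&*()'))
  file.truncate(file.pos) # the new text is shorter, so cut off the old tail
  file.close
end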
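The cursor behaviour is easy to try out on a temporary file. The sketch below is only an illustration: delete_chars_in_place is a hypothetical helper, and the truncate call is an addition that matters whenever the rewritten text is shorter than the original:

```ruby
require 'tempfile'

# Hypothetical helper: rewrites an open read/write file without the given chars.
def delete_chars_in_place(file, chars)
  text = file.read.delete(chars) # read leaves the position at end of file
  file.rewind                    # go back to the start before writing
  file.write(text)
  file.truncate(file.pos)        # drop leftover bytes of the longer original
  file.flush
end

Tempfile.create('demo') do |f|
  f.write('he!!llo w#orld')
  f.rewind
  delete_chars_in_place(f, '!#')
  f.rewind
  puts f.read
end
```

Without the truncate call, the tail of the old content would remain after the newly written text.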

Parsing a Zip file and extracting records from text files

I am really new to Ruby and could use some help with a program. I need to open a zip file that contains multiple text files, each with many rows of data, e.g.:
CDI|3|3|20100515000000|20100515153000|2008|XXXXX4791|0.00|0.00
CDI|3|3|20100515000000|20100515153000|2008|XXXXX5648|0.00|0.00
CHO|3|3|20100515000000|20100515153000|2114|XXXXX3276|0.00|0.00
CHO|3|3|20100515000000|20100515153000|2114|XXXXX4342|0.00|0.00
MITR|3|3|20100515000000|20100515153000|0000|XXXXX7832|0.00|0.00
HR|3|3|20100515000000|20100515153000|1114|XXXXX0238|0.00|0.00
I first need to extract the zip file, read the text files located in the zip file and write only the complete rows that start with CDI or CHO to two output files, one for the rows starting with CDI and one for the rows starting with CHO (basically parsing the file). I have to do it with Ruby and, if possible, set the program up to run automatically as further zip files with the same structure arrive. I would appreciate any advice, direction or sample code anyone can give.
One way is to use the rubyzip gem. The names below are those of rubyzip 1.0 and later; older releases used require 'zip/zip' and Zip::ZipFile instead.
require 'zip'
# Open the zip file and pass each entry to the block
Zip::File.foreach(path_to_zip) do |entry|
  # Read the entry into a String, split it into lines, and check each one
  entry.get_input_stream.read.split("\n").each do |line|
    if line.start_with?("CDI") || line.start_with?("CHO")
      # Do something
    end
  end
end
I'm not sure if I entirely follow your question. For starters, if you're looking to unzip files using Ruby, check out this question. Once you've got the file unzipped to a readable format, you can try something along these lines to print to the two separate outputs:
cdi_output = File.open("cdiout.txt", "a")   # Open an output file for CDI
cho_output = File.open("choout.txt", "a")   # Open an output file for CHO
File.open("text.txt", "r") do |f|           # Open the input file
  while line = f.gets                       # Read each line in the input
    cdi_output.puts line if /^CDI/ =~ line  # Print if line starts with CDI
    cho_output.puts line if /^CHO/ =~ line  # Print if line starts with CHO
  end
end
cdi_output.close                            # Close cdi_output file
cho_output.close                            # Close cho_output file
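The same routing can be sketched with a lookup table from prefix to output, which makes it easy to add more record types later. Everything here is an assumption for illustration: route_lines is a hypothetical helper, StringIO objects stand in for real files, and the match is on the whole first |-separated field rather than on a leading substring:

```ruby
require 'stringio'

# Hypothetical helper: copies each input line to the IO registered for its
# first "|"-separated field; lines with other prefixes are skipped.
def route_lines(input, outputs)
  input.each_line do |line|
    prefix = line.split('|', 2).first
    target = outputs[prefix]
    target.puts(line) if target
  end
end

input = StringIO.new(<<~DATA)
  CDI|3|3|20100515000000|20100515153000|2008|XXXXX4791|0.00|0.00
  CHO|3|3|20100515000000|20100515153000|2114|XXXXX3276|0.00|0.00
  MITR|3|3|20100515000000|20100515153000|0000|XXXXX7832|0.00|0.00
DATA
cdi = StringIO.new
cho = StringIO.new
route_lines(input, { 'CDI' => cdi, 'CHO' => cho })
print cdi.string
print cho.string
```

With real files, the StringIO objects would be replaced by File objects opened for appending, and input by each unzipped text file.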

Ruby: Dir.chdir using data from a text file in windows

I am trying to use a script to change the working directory using Dir.chdir
This works:
dirs = ['//servername/share','//servername2/share']
dirs.each do |dir|
  Dir.chdir dir
end
If I put the above share information into a text file (each share on a new line) and try to load:
File.foreach("shares.txt") {|dir|
  Dir.chdir dir
}
I get this error:
'chdir': No such file or directory - //servername/share (Errno::ENOENT)
How can I read the shares from a text file and change to that directory? Is there a better way to do this?
Try
Dir.chdir dir.strip
or
Dir.chdir dir.chomp
Reason:
With File.foreach, each line you get still includes its trailing newline (\n).
strip removes leading and trailing whitespace; chomp removes only the trailing newline.
Another possibility: in your example you use absolute paths, which should work.
If you use relative paths, check which directory you are in, because every chdir changes it. To return to the original directory afterwards, you can use the block form of Dir.chdir.
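Putting both points together, here is a small sketch that strips each line and uses the block form of Dir.chdir, so the working directory is restored after every entry. The helper name each_listed_dir is hypothetical, and temporary directories stand in for the network shares:

```ruby
require 'tmpdir'

# Hypothetical helper: runs the block inside each directory named in list_file.
def each_listed_dir(list_file)
  File.foreach(list_file) do |line|
    dir = line.strip                 # removes the trailing newline
    next if dir.empty?               # skip blank lines
    Dir.chdir(dir) { yield Dir.pwd } # previous directory is restored afterwards
  end
end

Dir.mktmpdir do |root|
  shares = %w[share_a share_b].map { |name| File.join(root, name) }
  shares.each { |dir| Dir.mkdir(dir) }
  list = File.join(root, 'shares.txt')
  File.write(list, shares.join("\n") + "\n")
  each_listed_dir(list) { |pwd| puts File.basename(pwd) }
end
```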

Ruby FTP Separating files from Folders

I'm trying to crawl FTP and pull down all the files recursively.
Up until now I was trying to pull down a directory with
ftp.list.each do |entry|
  if entry.split(/\s+/)[0][0, 1] == "d"
    out[:dirs] << entry.split.last unless black_dirs.include? entry.split.last
  else
    out[:files] << entry.split.last unless black_files.include? entry.split.last
  end
end
But it turns out that if you take everything after the last space of each entry, file and directory names that contain spaces come out wrong.
Need a little help on the logic here.
You can avoid recursion if you list all files at once
files = ftp.nlst('**/*.*')
Directories are not included in the list but the full ftp path is still available in the name.
EDIT
I'm assuming that each file name contains a dot and directory names don't. Thanks to @Niklas B. for pointing this out.
There are a huge variety of FTP servers around.
We have clients who use obscure, proprietary Windows-based servers, and the file listings they return look completely different from the Linux ones.
So what I ended up doing is this: for each file/directory entry, I try changing directory into it, and if that doesn't work, I consider it a file :)
The following method is "bullet proof":
# Checks if the given file_name is actually a file.
def is_ftp_file?(ftp, file_name)
  ftp.chdir(file_name)
  ftp.chdir('..')
  false
rescue
  true
end
file_names = ftp.nlst.select {|fname| is_ftp_file?(ftp, fname)}
Works like a charm, but please note: if the FTP directory has tons of files in it - this method takes a while to traverse all of them.
You can also use a regular expression. I put one together. Please verify that it works for you as well, as I don't know whether your directory listing looks different. It needs Ruby 1.9 or later, by the way, because it uses named captures.
reg = /^(?<type>.{1})(?<mode>\S+)\s+(?<number>\d+)\s+(?<owner>\S+)\s+(?<group>\S+)\s+(?<size>\d+)\s+(?<mod_time>.{12})\s+(?<path>.+)$/
match = entry.match(reg)
You can then access the elements by name:
match[:type] contains a 'd' if it's a directory; in a typical Unix-style listing it is a '-' for a regular file.
All the other elements are there as well. Most importantly match[:path].
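As a rough illustration of how that regular expression could separate directories from files, including names with spaces, here is a sketch run against two made-up listing lines. The constant LISTING_LINE, the helper partition_listing, and the sample lines are assumptions; real servers may format their listings differently:

```ruby
# The regular expression from the answer above, stored in a constant.
LISTING_LINE = /^(?<type>.{1})(?<mode>\S+)\s+(?<number>\d+)\s+(?<owner>\S+)\s+(?<group>\S+)\s+(?<size>\d+)\s+(?<mod_time>.{12})\s+(?<path>.+)$/

# Hypothetical helper: splits listing lines into directory and file names.
def partition_listing(lines)
  out = { dirs: [], files: [] }
  lines.each do |line|
    m = LISTING_LINE.match(line)
    next unless m # skip lines that are not entries, such as "total 8"
    key = m[:type] == 'd' ? :dirs : :files
    out[key] << m[:path]
  end
  out
end

# Made-up sample in Unix "ls -l" style.
sample = [
  'drwxr-xr-x    2 user     group        4096 Jan 01 12:00 my folder',
  '-rw-r--r--    1 user     group         120 Jan 01 12:00 report final.txt'
]
p partition_listing(sample)
```

With a live connection, the sample array would be replaced by ftp.list.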
Assuming that the FTP server returns Unix-like file listings, the following code works. At least for me.
# \d+ (instead of [0-9]) also accepts link counts of 10 or more
regex = /^d[rwxsStT-]+\s+\d+\s+\S+\s+\S+\s+\d+\s+\w+\s+\d+\s+[\d:]+\s(.+)/
ftp.ls.each do |line|
  if dir = line.match(regex)
    puts dir[1]
  end
end
dir[1] contains the name of the directory (given that the inspected line actually represents a directory).
As @Alex pointed out, using patterns in filenames for this is hardly reliable. Directories CAN have dots in their names (.ssh for example), and listings can be very different on different servers.
His method works but, as he himself points out, it takes a long time on large directories.
I prefer using the size method from Net::FTP.
It returns the size of a file, or raises an error if the item is a directory.
# Takes the already-open connection instead of logging in again for every item.
def item_is_file?(ftp, item)
  ftp.size(item).is_a?(Numeric)
rescue Net::FTPPermError
  false
end
I'll add my solution to the mix...
Using ftp.nlst('**/*.*') did not work for me... server doesn't seem to support that ** syntax.
The chdir trick with a rescue seems expensive and hackish.
Assuming that all files have at least one char, a single period, and then an extension, I did a simple recursion.
def list_all_files(ftp, folder)
  entries = ftp.nlst(folder)
  file_regex = /.+\.{1}.*/
  files = entries.select{|e| e.match(file_regex)}
  subfolders = entries.reject{|e| e.match(file_regex)}
  subfolders.each do |subfolder|
    files += list_all_files(ftp, subfolder)
  end
  files
end
nlst seems to return the full path to whatever it finds, non-recursively. So each time you get a listing, separate the files from the folders, and then process any folder you find recursively. Collect all the file results.
To call, you can pass a starting folder
files = list_all_files(ftp, "my_starting_folder/my_sub_folder")
files = list_all_files(ftp, ".")
files = list_all_files(ftp, "")
files = list_all_files(ftp, nil)
