I have a group of large text files containing a lot of inconsistently formatted information. I don't care about most of it; I'm only trying to extract the IDs included in each file. I've drafted a fairly simple script to do this (an ID is 3 digits, a hyphen, then 7 digits).
puts("What's the name of the file you'd like to check? (don't include .txt)")
file_to_check = gets.chomp
file_to_write = file_to_check + "IDs" + ".txt"
file_to_check = file_to_check + ".txt"
output_text = ""
count_of_lines = 0
File.open(file_to_check, "r").each_line do |line|
count_of_lines += 1
if /.*\d{3}-\d{7}.*/ =~ line
temp_case = line.match(/\d{3}-\d{7}/).to_s
temp_case = temp_case + "\n"
output_text = output_text + temp_case
else
# puts("this failed")
end
end
File.open(file_to_write, "w") do |file|
file.puts(output_text)
file.puts(count_of_lines)
end
One of the files includes characters that VIM shows as ^Z, which seem to be killing the script before it actually gets to the end of the file.
Is there anything I can do to have Ruby ignore these characters and keep moving through the file?
Per Mircea's comment, the answer is here. I used "rt" based on one of the comments on the selected answer.
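For anyone landing here later, a minimal sketch of the idea, assuming the ^Z bytes (0x1A) are being treated as end-of-file by a Windows text-mode read. It opens the file in binary mode ("rb"), which is an alternative to the "rt" mode mentioned above, not necessarily what the linked answer recommends. The helper name extract_ids and the sample data are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: collect every NNN-NNNNNNN ID found in a file.
# Binary mode ("rb") avoids any text-mode handling of 0x1A (^Z).
def extract_ids(path)
  ids = []
  File.open(path, "rb") do |file|
    file.each_line do |line|
      # scan returns every match on the line, not just the first one
      ids.concat(line.scan(/\d{3}-\d{7}/))
    end
  end
  ids
end

# Demo on a throwaway file with a ^Z byte in the middle (made-up data).
sample = Tempfile.new(["ids", ".txt"])
sample.binmode
sample.write("case 123-4567890 opened\n\x1Ajunk\nsee 987-6543210\n")
sample.close

ids = extract_ids(sample.path)
puts ids
```

Using scan instead of match also picks up lines that contain more than one ID.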
I'm trying to write a very simple ruby script that opens a text file, removes the \n from the end of lines UNLESS the line starts with a non-alphabetic character OR the line itself is blank (\n).
The code below works fine, except that it skips all of the content beyond the last \n line. When I add \n\n to the end of the file, it works perfectly. Examples: A file with this text in it works great and pulls everything to one line:
Hello
there my
friend how are you?
becomes Hello there my friend how are you?
But text like this:
Hello
there
my friend
how
are you today
returns just Hello and there, and completely skips the last 3 lines. If I add 2 blank lines to the end, it picks up everything and behaves as I want it to.
Can anybody explain to me why this happens? Obviously I know I can fix this instance by appending \n\n to the end of the source file at the start, but that doesn't help me understand why the .gets isn't working as I'd expect.
Thanks in advance for any help!
source_file_name = "somefile.txt"
destination_file_name = "some_other_file.txt"
source_file = File.new(source_file_name, "r")
para = []
x = ""
while (line = source_file.gets)
if line != "\n"
if line[0].match(/[A-Za-z]/) #If the first character is a letter
x += line.chomp + " "
else
x += "\n" + line.chomp + " "
end
else
para[para.length] = x
x = ""
end
end
source_file.close
fixed_file = File.open(destination_file_name, "w")
para.each do |paragraph|
fixed_file << "#{paragraph}\n\n"
end
fixed_file.close
Your problem is that you add the string x to the para array only when you encounter an empty line ('\n'). Since your second example does not end with an empty line, the final contents of x are never added to the para array.
The easy way to fix this without changing any of your existing code is to add the following lines after the end of your while loop:
if(x != "")
para.push(x)
end
I would prefer to add the strings to the array right away rather than appending them onto x until an empty line comes up, but this should work with your solution.
Also,
para.push(x)
para << x
both read much nicer and look more straightforward than
para[para.length] = x
That one threw me off for a second, since in non-dynamic languages, that would give you an error. I advise using one of those instead, simply because it's more readable.
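To make the fix concrete, here is a small sketch of the asker's loop with the final flush added. It works on an array of lines instead of a file so it can be run as-is; the method name join_paragraphs is hypothetical.

```ruby
# Hypothetical helper wrapping the asker's loop so it runs on any list of lines.
def join_paragraphs(lines)
  para = []
  x = ""
  lines.each do |line|
    if line != "\n"
      if line[0].match(/[A-Za-z]/) # first character is a letter
        x += line.chomp + " "
      else
        x += "\n" + line.chomp + " "
      end
    else
      para << x
      x = ""
    end
  end
  # The missing step: keep whatever was collected after the last blank line.
  para << x unless x.empty?
  para
end

lines = ["Hello\n", "there\n", "\n", "my friend\n", "how\n", "are you today\n"]
result = join_paragraphs(lines)
p result
```

Without the `para << x unless x.empty?` line, the second paragraph is silently dropped, which matches the behaviour described in the question.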
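Your code reads like C to me. The Ruby way is usually much shorter; for example, this one line reads a whole file and writes it out again (it only copies the text, so the line-joining still has to be applied to the string in between):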
File.write "dest.txt", File.read("src.txt")
It's easier to use a multiline regex. Maybe:
source_file.read.gsub(/(?<!\n)\n([a-z])/im, ' \\1')
I have been trying to work out a file rename program in Ruby as a programming exercise for myself (I am aware of rename under Linux, but I want to learn Ruby, and rename is not available on Mac).
In the code below, the issue is that the .include? method always returns false even though I can see that the filename contains the search pattern. If I comment out the include? check, gsub() does not seem to generate a new file name at all (i.e. the file name remains the same). Can someone please take a look and see what I did wrong? Thanks a bunch in advance!
Here is the expected behavior:
Assuming that in current folder there are three files: a1.jpg, a2.jpg, and a3.jpg
The Ruby script should be able to rename it to b1.jpg, b2.jpg, b3.jpg
#!/Users/Antony/.rvm/rubies/ruby-1.9.3-p194/bin/ruby
puts "Enter the file search query"
searchPattern = gets
puts "Enter the target to replace"
target = gets
puts "Enter the new target name"
newTarget = gets
Dir.glob("./*").sort.each do |entry|
origin = File.basename(entry, File.extname(entry))
if origin.include?(searchPattern)
newEntry = origin.gsub(target, newTarget)
File.rename( origin, newEntry )
puts "Rename from " + origin + " to " + newEntry
end
end
Slightly modified version:
puts "Enter the file search query"
searchPattern = gets.strip
puts "Enter the target to replace"
target = gets.strip
puts "Enter the new target name"
newTarget = gets.strip
Dir.glob(searchPattern).sort.each do |entry|
if File.basename(entry, File.extname(entry)).include?(target)
newEntry = entry.gsub(target, newTarget)
File.rename( entry, newEntry )
puts "Rename from " + entry + " to " + newEntry
end
end
Key differences:
Use .strip to remove the trailing newline that you get from gets. Otherwise, this newline character will mess up all of your match attempts.
Use the user-provided search pattern in the glob call instead of globbing for everything and then manually filtering it later.
Use entry (that is, the complete filename) in the calls to gsub and rename instead of origin. origin is really only useful for the .include? test. Since it's a fragment of a filename, it can't be used with rename. I removed the origin variable entirely to avoid the temptation to misuse it.
For your example folder structure, entering *.jpg, a, and b for the three input prompts (respectively) should rename the files as you are expecting.
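Here is a self-contained sketch of the same idea that can be tried safely in a temporary directory. The helper name rename_matching is hypothetical, and it differs slightly from the code above in that it only rewrites the file name, not the directory part of the path.

```ruby
require "tmpdir"

# Hypothetical helper: rename every file matching `pattern` whose base name
# contains `target`, replacing `target` with `replacement`.
def rename_matching(pattern, target, replacement)
  Dir.glob(pattern).sort.each do |entry|
    dir, name = File.split(entry)
    next unless File.basename(name, File.extname(name)).include?(target)
    # Only touch the file name, so directory names are left alone.
    File.rename(entry, File.join(dir, name.gsub(target, replacement)))
  end
end

renamed = Dir.mktmpdir do |dir|
  %w[a1.jpg a2.jpg a3.jpg].each { |n| File.write(File.join(dir, n), "") }
  # gets would hand back "a\n"; strip removes the newline before matching.
  rename_matching(File.join(dir, "*.jpg"), "a\n".strip, "b")
  Dir.children(dir).sort
end
p renamed
```

Dir.children needs Ruby 2.5 or later; on older versions Dir.entries minus "." and ".." would do the same job.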
I used the accepted answer to fix a bunch of copied files' names.
Dir.glob('./*').sort.each do |entry|
if File.basename(entry).include?(' copy')
newEntry = entry.gsub(' copy', '')
File.rename( entry, newEntry )
end
end
Your problem is that gets returns a newline at the end of the string. So, if you type "foo" then searchPattern becomes "foo\n". The simplest fix is:
searchPattern = gets.chomp
I might rewrite your code slightly:
$stdout.sync = true
print "Enter the file search query: "; search = gets.chomp
print "Enter the target to replace: "; target = gets.chomp
print " Enter the new target name: "; replace = gets.chomp
Dir['*'].each do |file|
# Skip directories
next unless File.file?(file)
old_name = File.basename(file,'.*')
if old_name.include?(search)
# Are you sure you want gsub here, and not sub?
# Don't use `old_name` here, it doesn't have the extension
new_name = File.basename(file).gsub(target,replace)
File.rename( file, new_name )
puts "Renamed #{file} to #{new_name}" if $DEBUG
end
end
Here's a short version I've used today (without pattern matching)
Save this as rename.rb file and run it inside the command prompt with ruby rename.rb
count = 1
newname = "car"
Dir["/path/to/folder/*"].each do |old|
File.rename(old, newname + count.to_s)
count += 1
end
I had /Copy of _MG_2435.JPG converted into car1, car2, ...
I made a small script to rename the entire DBZ series by season and ended up with this:
count = 1
new_name = "Dragon Ball Z S05E"
format_file = ".mkv"
Dir.glob("dragon ball Z*").each do |old_name|
File.rename(old_name, new_name + count.to_s + format_file)
count += 1
end
The result would be:
Dragon Ball Z S05E1.mkv
Dragon Ball Z S05E2.mkv
Dragon Ball Z S05E3.mkv
In a folder, I wanted to remove the trailing underscore _ of any audio filename while keeping everything else. Sharing my code here as it might help someone.
What the program does:
Prompts the user for the:
Directory path: c:/your/path/here (make sure to use slashes /, not backslashes, \, and without the final one).
File extension: mp3 (without the dot .)
Trailing characters to remove: _
Looks for any file ending with c:/your/path/here/filename_.mp3 and renames it c:/your/path/here/filename.mp3 while keeping the file’s original extension.
puts 'Enter directory path'
path = gets.strip
directory_path = Dir.glob("#{path}/*")
# Get file extension
puts 'Enter file extension'
file_extension = gets.strip
# Get trailing characters to remove
puts 'Enter trailing characters to remove'
trailing_characters = gets.strip
suffix = "#{trailing_characters}.#{file_extension}"
# Rename file if condition is met
directory_path.each do |file_path|
next unless file_path.end_with?(suffix)
File.rename(file_path, "#{file_path.delete_suffix(suffix)}.#{file_extension}")
end
What's wrong with my code? Is FileNameArray being reused?
f.rb:17: warning: already initialized constant FileNameArray
number = 0
while number < 99
number = number + 1
if number <= 9
numbers = "000" + number.to_s
elsif
numbers = "00" + number.to_s
end
files = Dir.glob("/home/product/" + numbers + "/*/*.txt")
files.each do |file_name|
File.open(file_name,"r:utf-8").each do | txt |
if txt =~ /http:\/\//
if txt =~ /static.abc.com/ or txt =~ /static0[1-9].abc.com/
elsif
$find = txt
FileNameArray = file_name.split('/')
f = File.open("error.txt", 'a+')
f.puts FileNameArray[8], txt , "\n"
f.close
end
end
end
end
end
You might be a Ruby beginner, so I tried to rewrite the same code the Ruby way:
(1..99).each do |number|
Dir.glob("/home/product/" + ("%04d" % number) + "/*/*.txt").each do |file_name|
File.open(file_name,"r:utf-8").each do | txt |
next unless txt =~ /http:\/\//
next if txt =~ /static.abc.com/ || txt =~ /static0[1-9].abc.com/
$find = txt
file_name_array = file_name.split('/')
f = File.open("error.txt", 'a+')
f.puts file_name_array[8], txt , "\n"
f.close
end
end
end
Points to note:
In Ruby, a variable prefixed with the $ symbol is a global variable. Use $find only if it is really required.
In Ruby, a constant starts with a capital letter, and a constant is not supposed to be reassigned. Assigning FileNameArray again on every pass through the loop is what triggers the warning in your program.
(1..99) is a literal that creates an instance of the Range class, covering the values from 1 to 99.
In Ruby, the case of a variable name matters. Local variables must start with a lowercase character (or an underscore); constants start with an uppercase one.
So, please try renaming FileNameArray to fileNameArray.
Also, glob accepts richer patterns that can save you one loop and a dozen lines of code. In your case the expression would look something like this (note that it also matches 0000):
Dir.glob("/home/product/00[0-9][0-9]/*/*.txt")
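A tiny sketch of the two points made in these answers: zero-padding with a format string, and using a lowercase local variable inside the loop so nothing is flagged as an already initialized constant. The two sample paths are made up for illustration.

```ruby
# Zero-padding with a format string replaces the if/elsif on number <= 9.
dirs = (1..99).map { |number| "%04d" % number }
puts dirs.first
puts dirs.last

# A name starting with a lowercase letter is a local variable, so it can be
# assigned on every pass through the loop without any warning.
last_parts = nil
["/home/product/0001/x/a.txt", "/home/product/0002/y/b.txt"].each do |file_name|
  file_name_array = file_name.split("/")
  last_parts = file_name_array
end
p last_parts
```

Because the path starts with a slash, the first element of the split array is an empty string, which is worth remembering when picking an index such as [8].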
I want to split a text file into multiple files where each file contains no more than 5 MB. I know there are tools for this, but I need it for a project and want to do it in Ruby. Also, I would prefer to do this with File.open in block context if possible, but I fail miserably :o(
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
File.open("#{file_num}.txt", 'w') do |data_out|
data_in.each_line do |line|
data_out.puts line
bytes += line.length
if bytes > MAX_BYTES
bytes = 0
file_num += 1
# next file
end
end
end
end
This works, but I don't think it is elegant. Also, I still wonder if it can be done with File.open in block context only.
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
data_out = File.open("#{file_num}.txt", 'w')
data_in.each_line do |line|
data_out = File.open("#{file_num}.txt", 'w') if data_out.closed?
data_out.puts line
bytes += line.length
if bytes > MAX_BYTES
bytes = 0
file_num += 1
data_out.close
end
end
data_out.close unless data_out.closed?
end
Cheers,
Martin
[Updated] Wrote a short version without any helper variables and put everything in a method:
def chunker f_in, out_pref, chunksize = 1_073_741_824
File.open(f_in,"r") do |fh_in|
until fh_in.eof?
File.open("#{out_pref}_#{"%05d"%(fh_in.pos/chunksize)}.txt","w") do |fh_out|
fh_out << fh_in.read(chunksize)
end
end
end
end
chunker "inputfile.txt", "output_prefix" # optionally pass a desired chunk size as the third argument
Instead of a line loop you can use .read(length) and loop only on the EOF check and the file cursor.
This guarantees that the chunk files are never bigger than your desired chunk size.
On the other hand, it pays no attention to line breaks (\n), so a line can be cut in the middle!
The numbers for the chunk files are generated by integer division of the current file cursor position by chunksize, formatted with "%05d", which results in 5-digit numbers with leading zeros (00000, 00001, ...).
This is only possible because .read(chunksize) is used. In the second example below, it could not be used!
Update: Splitting with line break recognition
If you really need complete lines ending in \n, then use this modified code snippet:
def chunker f_in, out_pref, chunksize = 1_073_741_824
outfilenum = 1
File.open(f_in,"r") do |fh_in|
until fh_in.eof?
File.open("#{out_pref}_#{outfilenum}.txt","w") do |fh_out|
loop do
line = fh_in.readline
fh_out << line
break if fh_out.size > (chunksize-line.length) || fh_in.eof?
end
end
outfilenum += 1
end
end
end
I had to introduce a helper variable line because I want to keep the chunk file size below the chunksize limit (the check uses the length of the line just written as an estimate for the next one). If you don't do this extended check, you will also get file sizes above the limit. The break condition is only evaluated after the line has already been written. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)
Unfortunately you need a second EOF check, because the last chunk iteration will usually produce a smaller chunk.
Two helper variables are also needed: line is described above, and outfilenum is needed because the resulting file sizes usually do not match the exact chunksize.
For files of any size, split will be faster than scratch-built Ruby code, even taking the cost of starting a separate executable into account. It's also code that you don't have to write, debug or maintain:
system("split -C 1M -d test.txt ''")
The options are:
-C 1M Put lines totalling no more than 1M in each chunk
-d Use decimal suffixes in the output filenames
test.txt The name of the input file
'' Use a blank output file prefix
Unless you're on Windows, this is the way to go.
Instead of opening your outfile inside the infile block, open the file and assign it to variable. When you hit the filesize limit, close the file and open a new one.
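A rough sketch of that suggestion, under a few assumptions: the helper name split_by_lines and the tiny 10-byte limit are made up for the demo, files are opened in binary mode so byte counts match what is on disk, and lines are never split, so a chunk can overshoot the limit by up to one line.

```ruby
require "tmpdir"

# Hypothetical helper: split `path` into numbered files inside `out_dir`,
# starting a new file once the current one has reached `max_bytes`.
def split_by_lines(path, out_dir, max_bytes)
  file_num = 0
  bytes = 0
  out = File.open(File.join(out_dir, "#{file_num}.txt"), "wb")
  begin
    File.open(path, "rb") do |input|
      input.each_line do |line|
        if bytes >= max_bytes
          # Limit reached: close the current chunk and open the next one.
          out.close
          file_num += 1
          bytes = 0
          out = File.open(File.join(out_dir, "#{file_num}.txt"), "wb")
        end
        out.write(line)
        bytes += line.bytesize
      end
    end
  ensure
    out.close
  end
  file_num + 1 # number of chunk files written
end

chunks = Dir.mktmpdir do |dir|
  source = File.join(dir, "test.txt")
  File.binwrite(source, "aaaa\nbbbb\ncccc\n")
  count = split_by_lines(source, dir, 10)
  (0...count).map { |i| File.binread(File.join(dir, "#{i}.txt")) }
end
p chunks
```

The output handle lives in a plain variable, so only the input file uses File.open in block form; the ensure clause takes over the job the block would otherwise do.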
This code actually works; it's simple, and it buffers lines in an array, which makes it faster:
#!/usr/bin/env ruby
data = Array.new()
MAX_BYTES = 3500
MAX_LINES = 32
lineNum = 0
file_num = 0
bytes = 0
filename = 'W:/IN/tangoZ.txt_100.TXT'
r = File.exist?(filename)
puts 'File exists =' + r.to_s + ' ' + filename
line_count = File.foreach(filename).count
file_size = File.size(filename).to_f / 1_048_576
puts 'Total lines=' + line_count.to_s + ' size=' + file_size.to_s + ' MB'
puts ' '
file = File.open(filename,"r")
#puts '1 File open read ' + filename
file.each{|line|
bytes += line.length
lineNum += 1
data << line
if bytes > MAX_BYTES then
# if lineNum > MAX_LINES then
bytes = 0
file_num += 1
#puts '_2 File open write ' + file_num.to_s + ' lines ' + lineNum.to_s
File.open("#{file_num}.txt", 'w') {|f| f.write data.join}
data.clear
lineNum = 0
end
}
## write leftovers
file_num += 1
#puts '__3 File open write FINAL' + file_num.to_s + ' lines ' + lineNum.to_s
File.open("#{file_num}.txt", 'w') {|f| f.write data.join}
I'm writing this little HelloWorld as a followup to this and the numbers do not add up
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each do |line|
total_bytes += line.unpack("U*").length
end
puts "original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
The result is not the same as the file size. I think I just need to know what format I need to plug in... or maybe I've missed the point entirely. How can I measure the file size line by line?
Note: I'm on Windows, and the file is encoded as type ANSI.
Edit: This produces the same results!
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each_byte do |whatever|
total_bytes += 1
end
puts "Original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
so anybody who can help now...
IO#gets returns each line together with its line terminator, but on Windows a file opened in text mode has every \r\n pair translated to a single \n, so the per-line counts are not going to add up to the size on disk.
See the relevant Pickaxe section
May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...
Aha. I think I get it now.
Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:
class Chunkifier
def Chunkifier.to_chunks(path)
chunks, current_chunk_size = [""], 0
File.readlines(path).each do |line|
line.chomp! # strips off \n, \r or \r\n depending on OS
if chunks.last.size + line.size >= 4_000 # 4096?
chunks.last.chomp! # remove last line terminator
chunks << ""
end
chunks.last << line + "\n" # or whatever terminator you need
end
chunks
end
end
if __FILE__ == $0
require 'test/unit'
class TestFile < Test::Unit::TestCase
def test_chunking
chs = Chunkifier.to_chunks(PATH)
chs.each do |chunk|
assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
end
end
end
end
Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that, whatever the OS is doing, the bytes at the end are removed, so that \n or whatever can be forced into the output.
I would suggest using File#write rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:
class String
def size_in_bytes
self.unpack("C*").size
end
end
The unpack version is about 8 times faster than the each_byte one on my machine, btw.
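As a side note, current Ruby versions ship String#bytesize, so the monkey patch may not be needed at all. A minimal sketch comparing the three counts on a string with one multi-byte character (the sample text is made up):

```ruby
# "\u00E9" is an accented e, which takes two bytes in UTF-8.
text = "caf\u00E9\n"

chars = text.length                 # number of characters
bytes = text.bytesize               # number of bytes in the string's encoding
unpacked = text.unpack("C*").size   # same byte count, via unpack

puts "characters: #{chars}, bytes: #{bytes}, unpack: #{unpacked}"
```

length and bytesize only agree when every character in the string is a single byte.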
You might try IO#each_byte, e.g.
total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
file.each_byte {|b| total_bytes += 1}
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"
That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.
You potentially have several overlapping issues here:
Linefeed characters \r\n vs. \n (as per your previous post). Also EOF file character (^Z)?
Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?
Interaction of the $KCODE global variable (deprecated in ruby 1.9. See String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?
Your format string for #unpack. I think you want C* here if you really want to count bytes.
Note also the existence of IO#each_line (just so you can throw away the while and be a little more ruby-idiomatic ;-)).
The issue is that when you save a text file on Windows, each line break is two characters (characters 13 and 10) and therefore 2 bytes, whereas on Linux it is only 1 (character 10). However, Ruby on Windows reports both of these as a single character '\n' (character 10). What's worse is that if you're on Linux with a Windows file, Ruby will give you both characters.
So, if you know that your files are always coming from windows text files and executed on windows, every time you get a newline character you can add 1 to your count. Otherwise it's a couple of conditionals and a little state machine.
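An alternative to adding 1 per line, sketched here under the assumption that you only need byte counts: open the file in binary mode so no newline translation happens, and sum String#bytesize per line. The sample file contents are made up.

```ruby
require "tempfile"

# Throwaway file with Windows-style line endings and no trailing newline.
sample = Tempfile.new("crlf")
sample.binmode
sample.write("one\r\ntwo\r\nthree")
sample.close

total_bytes = 0
File.open(sample.path, "rb") do |file|
  # In binary mode each line still carries its original \r\n.
  file.each_line { |line| total_bytes += line.bytesize }
end
file_size = File.size(sample.path)

puts "Original size #{file_size}"
puts "Total bytes #{total_bytes}"
```

With "rb" the two numbers should agree on any platform, because the bytes reach the script exactly as they are stored.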
BTW there's no EOF 'character'.
f = File.new("log.txt")
begin
while (line = f.readline)
line = line.chomp # chomp returns a new string rather than modifying line
puts line.length
end
rescue EOFError
f.close
end
Here is a simple solution, presuming that the current file pointer is set to the start of a line in the read file:
last_pos = file.pos
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
file.seek(backup_dist, IO::SEEK_CUR)
In this example, file is the file from which you are reading. To do this in a loop:
last_pos = file.pos
loop do
next_line = file.gets
break if next_line.nil? # stop at end of file
current_pos = file.pos
backup_dist = last_pos - current_pos
last_pos = current_pos
file.seek(backup_dist, IO::SEEK_CUR)
end
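The same peek-and-rewind idea can be wrapped in a small method. This is only a sketch: the name peek_line is hypothetical, it uses an absolute seek (IO::SEEK_SET) instead of the relative one above, and it is shown on a StringIO so it runs without a file on disk.

```ruby
require "stringio"

# Hypothetical helper: return the next line without consuming it.
def peek_line(io)
  start = io.pos
  line = io.gets
  io.seek(start, IO::SEEK_SET) # jump back to where we started
  line
end

io = StringIO.new("first\nsecond\n")
peeked = peek_line(io)   # look ahead
read = io.gets           # the real read returns the same line
following = io.gets      # and the stream then moves on normally
p [peeked, read, following]
```

An absolute seek avoids having to track the distance between two positions, at the cost of remembering the starting offset.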