I have a text file ("dict.txt") with 8K+ English words:
apple -- description text
angry -- description text
bear -- description text
...
I need to delete all text after "--" on each line of my file.
What is the easiest and fastest way to solve this problem?
Starting with:
words = [
'apple -- description text',
'angry -- description text',
'bear -- description text',
]
If you want just the words preceding --:
words.map{ |w| w.split(/\s-+\s/).first } # => ["apple", "angry", "bear"]
Or:
words.map{ |w| w[/^(.+) --/, 1] } # => ["apple", "angry", "bear"]
If you want the words AND --:
words.map{ |w| w[/^(.+ --)/, 1] } # => ["apple --", "angry --", "bear --"]
If the goal is to create a version of the file without the descriptions:
File.open('new_dict.txt', 'w') do |fo|
  File.foreach('dict.txt') do |li|
    fo.puts li.split(/\s-+\s/).first
  end
end
In general, to avoid scalability problems if/when your input file grows to huge proportions, use foreach to iterate over the input file and process it line by line. Processing speed is a wash whether you iterate line by line or slurp the whole file in and process it as a buffer or an array, but slurping a huge file can slow a machine to a crawl, or crash your code outright, making it infinitely slower. Line-by-line IO is surprisingly fast and doesn't have that potential problem.
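If you want to check the "it's a wash" claim on your own data, here's a minimal benchmark sketch (it assumes dict.txt exists in the current directory):
require 'benchmark'

Benchmark.bm(8) do |x|
  x.report('foreach') do
    File.foreach('dict.txt') { |li| li.split(/\s-+\s/).first }
  end
  x.report('slurp') do
    File.readlines('dict.txt').map { |li| li.split(/\s-+\s/).first }
  end
end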
File.read("dict.txt").gsub(/(?<=--).*/, "")
Output:
apple --
angry --
bear --
...
lines_without_description = File.read('dict.txt').lines.map{|line| line[0..line.index('-')+1]}
File.open('dict2.txt', 'w'){|f| f.write(lines_without_description.join("\n"))}
If you want speed, you might want to think about doing it with sed on the command line. Note that sed's regexes don't support non-greedy .*?, so match the delimiter directly:
sed 's/ -- .*//' < dict.txt > new_dict.txt
This creates a new file new_dict.txt containing only the words.
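If you'd rather stay in Ruby but keep the one-liner style, the same idea works with ruby -pe (a sketch; sub! edits each line in place and -p prints it):
ruby -pe '$_.sub!(/ -- .*/, "")' dict.txt > new_dict.txt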
Related
I have a text file that starts with:
Title

aaa
bbb
ccc
I don't know what the lines will include, but I know that the structure of the file will be: Title, then an empty line, then the actual lines. I want to modify it to:
New Title

fff
aaa
bbb
ccc
I had this in mind:
lineArray = File.readlines(destinationFile).drop(2)
lineArray.insert(0, 'fff\n')
lineArray.insert(0, '\n')
lineArray.insert(0, 'new Title\n')
File.writelines(destinationFile, lineArray)
but writelines doesn't exist.
undefined method `writelines' for File:Class (NoMethodError)
Is there a way to delete the first two lines of the file and add three new lines?
I'd start with something like this:
NEWLINES = {
  0 => "New Title",
  1 => "\nfff"
}

File.open('test.txt.new', 'w') do |fo|
  File.foreach('test.txt').with_index do |li, ln|
    fo.puts(NEWLINES[ln] || li)
  end
end
Here's the contents of test.txt.new after running:
New Title

fff
aaa
bbb
ccc
The idea is to provide a list of replacement lines in the NEWLINES hash. As each line is read from the original file, its line number is looked up in the hash; if a replacement exists for that line number, it is used, otherwise the original line is used.
If you want to read the entire file and then substitute, it reduces the code a little, but it will have scalability issues:
NEWLINES = [
  "New Title",
  "",
  "fff"
]

file = File.readlines('test.txt')
File.open('test.txt.new', 'w') do |fo|
  fo.puts NEWLINES
  fo.puts file[(NEWLINES.size - 1) .. -1]
end
It's not very smart but it'll work for simple replacements.
If you really want to do it right, learn how diff works, create a diff file, then let it do the heavy lifting, as it's designed for this sort of task, runs extremely fast, and is used millions of times every day on *nix systems around the world.
Use puts with the whole array:
File.open("destinationFile", "w+") do |f|
f.puts(lineArray)
end
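Note that the original snippet has a second problem: the inserted strings are single-quoted, so '\n' is a literal backslash followed by "n", not a newline. A corrected sketch of the original approach, keeping the original variable names:
lineArray = File.readlines(destinationFile).drop(2)
lineArray.insert(0, "fff\n")       # double quotes, so \n is a real newline
lineArray.insert(0, "\n")
lineArray.insert(0, "New Title\n")

File.open(destinationFile, "w+") do |f|
  f.puts(lineArray)
end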
If your files are big, the performance and memory implications of reading them into memory in their entirety are worth thinking about. If that's a concern, then your best bet is to treat the files as streams. Here's how I would do it.
First, define your replacement text:
require "stringio"
replacement = StringIO.new <<END
New Title
fff
END
I've made this a StringIO object, but it could also be a File object if your replacement text is in a file.
Now, open your destination file (a new file) and write each line from the replacement text into it.
dest = File.open(dest_fn, 'wb')
replacement.each_line {|ln| dest << ln }
We could have done this more efficiently, but there's a good reason to do it this way: Now we can call replacement.lineno to get the number of lines read, instead of iterating over it a second time to count the lines.
Next, open the original file and seek ahead by calling gets replacement.lineno times:
orig = File.open(orig_fn, 'r')
replacement.lineno.times { orig.gets }
Finally, write the remaining lines from the original file to the new file. We'll do it more efficiently this time with File.copy_stream:
File.copy_stream(orig, dest)
orig.close
dest.close
That's it. Of course, it's a drag closing those files manually (and when we do we should do it in an ensure block), so it's better to use the block form of File.open to automatically close them. Also, we can move the orig.gets calls into the replacement.each_line loop:
File.open(dest_fn, 'wb') do |dest|
  File.open(orig_fn, 'r') do |orig|
    replacement.each_line {|ln| dest << ln; orig.gets }
    File.copy_stream(orig, dest)
  end
end
First create an input test file.
FNameIn = "test_in"
text = <<_
Title

How now,
brown cow?
_
#=> "Title\n\nHow now,\nbrown cow?\n"
File.write(FNameIn, text)
#=> 27
Now read and write line-by-line.
FNameOut = "test_out"
File.open(FNameIn) do |fin|
  fin.gets; fin.gets
  File.open(FNameOut, 'w') do |fout|
    fout.puts "New Title"
    fout.puts
    fout.puts "fff"
    until fin.eof?
      fout.puts fin.gets
    end
  end
end
Check the result:
puts File.read(FNameOut)
# New Title
#
# fff
# How now,
# brown cow?
Ruby will close each of the two files when its block terminates.
If the files are not large, you could instead write:
File.write(FNameOut,
  ["New Title\n", "\n", "fff\n"].concat(File.readlines(FNameIn).drop(2)).join)
I am extracting files from a zip archive in Ruby using RubyZip, and I need to label files based on characteristics of their filenames:
Example:
I have the following hash:
labels = {
  :data_file => /.\.dat/i,
  :metadata => /.\.xml/i,
  :text_location => /.\.txt/i
}
So, I have the file name of each file in the zip. Let's say an example is:
filename = "382582941917841df.xml"
Assume that each file will match only one regex in the labels hash; if not, it doesn't matter, just choose the first match. (In this case the regular expressions are all for detecting extensions, but it could be any filename mask, like DSC****.jpg, for example.)
I am doing this now:
label_match = labels.find {|key, value| filename =~ value}
# => [:metadata, /.\.xml/i]
label_sym = label_match.nil? ? nil : label_match.first
This works fine; however, it doesn't seem very Ruby-like. Is there something I'm missing to clean this up nicely?
A case/when does this effortlessly:
filename = "382582941917841df.xml"
category = case filename
when /.\.dat/i ; :data_file
when /.\.xml/i ; :metadata
when /.\.txt/i ; :text_location
end
p category # => :metadata ; nil if nothing matched
I think you're doing it backwards and the hard way. Ruby makes it easy to get the extension of a file, which then makes it easy to map it to something.
Starting with something like:
FILENAMES = %w[ foo.bar foo.baz 382582941917841df.xml DSC****.jpg]
FILETYPES = {
  '.bar' => 'bar',
  '.baz' => 'baz',
  '.xml' => 'metadata',
  '.dat' => 'data',
  '.jpg' => 'image'
}

FILENAMES.each do |fn|
  puts "#{ fn } is a #{ FILETYPES[File.extname(fn)] } file"
end
# >> foo.bar is a bar file
# >> foo.baz is a baz file
# >> 382582941917841df.xml is a metadata file
# >> DSC****.jpg is a image file
File.extname is built into Ruby. The File class contains many similar methods for finding out what the OS knows about a file, and for tearing apart file paths and file names, so it's a really good class to become very familiar with.
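For instance, here's a quick sketch of a few of those methods (the path is invented):
path = '/tmp/uploads/382582941917841df.xml'

File.extname(path)        # => ".xml"
File.basename(path)       # => "382582941917841df.xml"
File.basename(path, '.*') # => "382582941917841df"
File.dirname(path)        # => "/tmp/uploads"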
It's also important to understand that an improperly written regexp, such as /.\.dat/i, can be the source of a lot of pain. Consider these:
'foo.xml.dat'[/.\.dat/] # => "l.dat"
'foo.database.20010101.csv'[/.\.dat/] # => "o.dat"
Are the files really "data" files?
Why is the character in front of the delimiting . important or necessary?
Do you really want to slow your code with unanchored regexp patterns when a method such as extname will be faster and easier to maintain?
Those are things to consider when writing code.
Rather than using nil to indicate the label when there is no match, consider using another symbol like :unknown.
Then you can do:
labels = {
  :data_file => /.\.dat/i,
  :metadata => /.\.xml/i,
  :text_location => /.\.txt/i,
  :unknown => /.*/
}
label = labels.find {|key, value| filename =~ value}.first
This relies on Ruby hashes preserving insertion order, so the catch-all /.*/ entry must come last.
I read a file using Ruby and use .split to split the lines.
Example.txt
1
2
3
line1, line2, line3 = @line.to_s.split("\n", 3)
#actual
line1 => ["1
line2 => ", "2
line3 => ", "3"]
#what I expect
line1=1
line2=2
line3=3
How can I get what I expected?
Edit: it's just one newline; I couldn't enter a literal newline in my question. To be more specific:
Example.txt
first_line\nsecond_line\nthird_line
File.open('Example.txt', 'r') do |f1|
  @line = f1.readlines
  f1.close
end
line1, line2, line3 = @line.to_s.split("\n", 3)
#actual
line1 => ["first_line
line2 => ", "second_line
line3 => ", "third_line"]
#what I expect
line1=first_line
line2=second_line
line3=third_line
You can't split using '\n' if you're trying to split on line ends; you MUST use "\n".
Strings using '\n' do not interpret \n as a line ending; instead, they treat it as a literal backslash followed by "n":
'\n' # => "\\n"
"\n" # => "\n"
The question isn't at all clear, nor is the input file example, given the little example code presented. However, guessing at what you want from the desired result...
If the input is a file called 'example.txt' looking like:
1
2
3
You can read it numerous ways:
File.read('example.txt').split("\n")
# => ["1", "2", "3"]
Or:
File.readlines('example.txt').map(&:chomp)
# => ["1", "2", "3"]
Either of those work; however, they set a very bad precedent. Reading a file into memory all at once is called "slurping", and it can crash your code, or drag it and the machine it's running on to a crawl, if the file is larger than the available memory. Even if it fits into memory, loading a huge file can cause pauses as memory is allocated and reallocated. So, don't do that.
Instead, read the file line-by-line and process it that way if at all possible:
File.foreach('example.txt') do |line|
  puts line
end
# >> 1
# >> 2
# >> 3
Don't do this:
File.open('Example.txt', 'r') do |f1|
  @line = f1.readlines
  f1.close
end
Ruby will automatically close a file opened like this:
File.open('Example.txt', 'r') do |f1|
  ...
end
There is no need to use close inside the block.
Depends a bit on what exactly you're expecting (and what @line is). If you are looking for numbers you can use:
line1, line2, line3 = @line.to_s.scan(/\d/)
If you want other characters you can use some other regular expression.
I want to grab only columns 46 to 245 of the first line of source.txt and write them to output.txt.
source_file.each { |line|
  File.open(output_file, "a+") { |f|
    f.print ???
  }
}
Bonus: I also need to keep a count of the number of characters in this range, as some may be whitespace. i.e. 38 characters and the rest whitespace.
Example:
source_file: (first line only, columns 45 to 245): 13287912721981239854 + 180 blank columns
output_file: 13287912721981239854
count = 20 characters
Update: appending [46..245].delete(' ').size gives me the desired count.
If I am understanding what you are asking correctly, there's no reason to grab the whole file when you only want the first line. If this isn't what you're asking for, then you need to specify what you're trying to pull out of the source file more clearly.
This should grab the data you need:
output_line = source_file.gets[45..244]
If you write:
source_file.each { |line|
  File.open(output_file, "a+") { |f|
    f.print ???
  }
}
You will open, then close, your output file for each line read from the input file. That is the wrong way to do it, even if you only want to read one line of input.
Instead try something like one of these:
File.open(output_file, 'a') do |fo|
  File.open('path/to/input_file') do |fi|
    fo.puts fi.readline[46..245]
  end
end
This uses IO.readline, which reads a single line from the file. The block falls through afterwards, causing both the input and output files to be closed automatically. Also, it opens the output file as 'a' which is append-mode only. 'a+' is wrong unless you intend to append and read, which is rarely done. From the documentation:
"a+" Read-write, starts at end of file if file exists,
otherwise creates a new file for reading and
writing
Or:
File.open(output_file, 'a') do |fo|
  File.foreach('path/to/input_file') do |li|
    fo.puts li[46..245]
    break
  end
end
foreach is used most often when we're reading a file line-by-line. It's the mainstay for reading files in a scalable manner. It wants to loop over the file inside the block, which is why break is there, to break out of that loop.
Or:
File.foreach('path/to/input_file') do |li|
  File.write(output_file, li[46..245], -1, :mode => 'a')
  break
end
File.write is useful when you have a blob of text or binary, and want to write it in one chunk, then move on. The -1 tells Ruby to move to the end of the file. :mode => 'a' overrides the default mode which would normally truncate an existing file.
Maybe this will do the job:
File.open("source.txt") do |f|
  line = f.readline # read only the first line; no loop, so no break is needed
  columns = line.split
  File.open("output.txt", "w") do |out|
    columns[46, (245 - 46 + 1)].each do |column|
      out.puts column
    end
  end
end
I have used 245 - 46 + 1 to indicate the number of columns we are interested in. I have also assumed that columns are separated by whitespace. If that is not the case, you will need to change the delimiter of split.
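For example, if the columns were comma-separated instead (a hypothetical variation), the split would become:
columns = line.split(',')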
I'm trying to use a list of hundreds of common misspellings to clean some input before searching for duplicates.
It's a time-critical process, so I'm hoping that there's a quicker way than having hundreds of regexes (or one with a hundred branches).
Is there an efficient way to perform hundreds of text substitutions in Ruby?
An alternative approach, if your input data is whitespace-separated words, would simply be to build a hash table of {error => correction} pairs.
Hash table lookup is fast, so if you can bend your input data to this format, it will almost certainly be fast enough.
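A minimal sketch of that idea (the hash entries are invented examples):
CORRECTIONS = {
  'abandonned' => 'abandoned',
  'abotu'      => 'about',
  'teh'        => 'the'
}

def clean(text)
  # Hash#fetch with a default leaves correctly spelled words untouched
  text.split.map { |word| CORRECTIONS.fetch(word, word) }.join(' ')
end

clean('teh abandonned mine') # => "the abandoned mine"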
I'm happy to say I just found "RegexpTrie", which is a usable replacement for Perl's Regexp::Assemble, and for the code below that needed it.
Install it, and give it a try:
require 'regexp_trie'
foo = %w(miss misses missouri mississippi)
RegexpTrie.union(foo)
# => /miss(?:(?:es|ouri|issippi))?/
RegexpTrie.union(foo, option: Regexp::IGNORECASE)
# => /miss(?:(?:es|ouri|issippi))?/i
Here's a comparison of the outputs. The comments inside the array are from Regexp::Assemble, and the trailing output is from RegexpTrie:
require 'regexp_trie'
[
  'how now brown cow', # /(?:[chn]ow|brown)/
  'the rain in spain stays mainly on the plain', # /(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)/
  'jackdaws love my giant sphinx of quartz', # /(?:jackdaws|quartz|sphinx|giant|love|my|of)/
  'fu foo bar foobar', # /(?:f(?:oo(?:bar)?|u)|bar)/
  'ms miss misses missouri mississippi' # /m(?:iss(?:(?:issipp|our)i|es)?|s)/
].each do |s|
  puts "%-43s # /%s/" % [s, RegexpTrie.union(s.split).source]
end
# >> how now brown cow # /(?:how|now|brown|cow)/
# >> the rain in spain stays mainly on the plain # /(?:the|rain|in|s(?:pain|tays)|mainly|on|plain)/
# >> jackdaws love my giant sphinx of quartz # /(?:jackdaws|love|my|giant|sphinx|of|quartz)/
# >> fu foo bar foobar # /(?:f(?:oo(?:bar)?|u)|bar)/
# >> ms miss misses missouri mississippi # /m(?:iss(?:(?:es|ouri|issippi))?|s)/
Regarding how to use the Wikipedia link and misspelled words:
require 'nokogiri'
require 'open-uri'
require 'regexp_trie'
URL = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines'
doc = Nokogiri::HTML(open(URL))
corrections = doc.at('div#mw-content-text pre').text.lines[1..-1].map { |s|
  a, b = s.chomp.split('->', 2)
  [a, b.split(/,\s+/)]
}.to_h
# {"abandonned"=>["abandoned"],
# "aberation"=>["aberration"],
# "abilityes"=>["abilities"],
# "abilties"=>["abilities"],
# "abilty"=>["ability"],
# "abondon"=>["abandon"],
# "abbout"=>["about"],
# "abotu"=>["about"],
# "abouta"=>["about a"],
# ...
# }
misspelled_words_regex = /\b(?:#{RegexpTrie.union(corrections.keys, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:a(?:b(?:andonned|eration|il(?:ityes|t(?:ies|y))|o(?:ndon(?:(?:ed|ing|s))?|tu|ut(?:it|the|a)...
At this point you could use gsub(misspelled_words_regex, corrections); however, the values in corrections contain arrays, because multiple words or phrases could have been offered as replacements for a misspelled word. You'll have to decide which of the choices to use.
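One possibility is to blindly take the first suggestion. A sketch, assuming the corrections hash and misspelled_words_regex built above (note that it doesn't try to preserve capitalization):
fixed = text.gsub(misspelled_words_regex) do |word|
  suggestions = corrections[word.downcase]
  suggestions ? suggestions.first : word
end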
Ruby is missing a very useful module found in Perl, called Regexp::Assemble. Python has hachoir-regex which appears to do the same sort of thing.
Regexp::Assemble creates a very efficient regular expression, based on lists of words and simple expressions. It's really remarkable ... or ... diabolical?
Check out the example for the module; it's extremely simple to use in its basic form:
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])
Notice how it's combining the patterns. It'd do the same with regular words, only it would be even more efficient because common strings will be combined:
use Regexp::Assemble;
my $lorem = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.';
my $ra = Regexp::Assemble->new('flags' => 'i');
$lorem =~ s/[^a-zA-Z ]+//g;
$ra->add(split(' ', lc($lorem)));
print $ra->anchor_word(1)->as_string, "\n";
Which outputs:
\b(?:a(?:dipisicing|liqua|met)|(?:consectetu|tempo)r|do(?:lor(?:emagna)?)?|e(?:(?:li)?t|iusmod)|i(?:ncididunt|psum)|l(?:abore|orem)|s(?:ed|it)|ut)\b
This code ignores case and honors word boundaries.
I'd recommend writing a little Perl app that can take a list of words and use that module to output the stringified version of the regex pattern. You should be able to import that pattern into Ruby. That would let you very quickly find misspelled words. You could even have it output the pattern to a YAML file, then load that file into your Ruby code. Periodically parse the misspelled word pages, run the output through the Perl code, and your Ruby code would have an updating pattern.
You could use that pattern against a chunk of text just to see whether it contains any misspelled words. If it does, break the text down into sentences or words and check against the regex again. Don't immediately test individual words, because most words will be spelled correctly. It's almost like a binary search against your text: test the whole thing; if there's a hit, break it into smaller blocks to narrow the search until you've found the individual misspellings. How you break down the chunks depends on the amount of incoming text. A regex pattern can test an entire text block and return nil or an index value, just as it can test individual words, so you gain a lot of speed processing big chunks of text.
Then, if you know you have a misspelled word you can do a hash lookup for the correct spelling. It would be a big hash, but the task of sifting out the good vs. bad spellings is what will take the longest. The lookup would be extremely fast.
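Here's a rough sketch of that narrow-then-look-up flow, assuming a pattern and corrections hash like the ones built elsewhere on this page:
def misspellings_in(text, pattern)
  return [] unless text =~ pattern # test the whole chunk first; usually a miss
  text.scan(/[[:alpha:]]+/).select { |word| word =~ pattern }
end
Each hit can then be fixed with a hash lookup, e.g. corrections[word.downcase].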
Here's some example code:
get_words.rb
#!/usr/bin/env ruby
require 'open-uri'
require 'nokogiri'
require 'yaml'
words = {}
['0-9', *('A'..'Z').to_a].each do |l|
  begin
    print "Reading #{l}... "
    html = open("http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/#{l}").read
    puts 'ok'
  rescue Exception => e
    puts "got \"#{e}\""
    next
  end

  doc = Nokogiri::HTML(html)
  doc.search('div#bodyContent > ul > li').each do |n|
    n.content =~ /^(\w+) \s+ \(([^)]+)/x
    words[$1] = $2
  end
end

File.open('wordlist.yaml', 'w') do |wordfile|
  wordfile.puts words.to_yaml
end
regex_assemble.pl
#!/usr/bin/env perl
use Regexp::Assemble;
use YAML;
use warnings;
use strict;
my $ra = Regexp::Assemble->new('flags' => 'i');
my %words = %{YAML::LoadFile('wordlist.yaml')};
$ra->add(map{ lc($_) } keys(%words));
print $ra->chomp(1)->anchor_word(1)->as_string, "\n";
Run the first, then run the second piping its output to a file to capture the emitted regex.
And more examples of words and the generated output:
'how now brown cow' => /\b(?:[chn]ow|brown)\b/
'the rain in spain stays mainly on the plain' => /\b(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)\b/
'jackdaws love my giant sphinx of quartz' => /\b(?:jackdaws|quartz|sphinx|giant|love|my|of)\b/
'fu foo bar foobar' => /\b(?:f(?:oo(?:bar)?|u)|bar)\b/
'ms miss misses missouri mississippi' => /\bm(?:iss(?:(?:issipp|our)i|es)?|s)\b/
Ruby's Regexp.union is nowhere close to the sophistication of Regexp::Assemble. After capturing the list of misspelled words, there are 4225 words, consisting of 41,817 characters. After running Perl's Regexp::Assemble against that list, a 30,954 character regex was generated. I'd say that's efficient.
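For comparison, Regexp.union simply alternates the literal strings without factoring out common prefixes:
Regexp.union(%w[ms miss misses missouri mississippi])
# => /ms|miss|misses|missouri|mississippi/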
Try it the other way around. Rather than correcting misspellings and checking for duplicates on the result, drop everything to a soundalike format (like Metaphone, or Soundex), and check for duplicates in that format.
Now, I don't know which way is likely to be faster - on the one hand, you've got hundreds of regexes, each of which will fail to match almost instantly and return. On the other, you've got 30-odd potential regex replacements, one or two of which will definitely match for every word.
Now, metaphone is pretty fast - there's really not a lot to the algorithm - so I can only suggest that you try it out and measure if either is fast enough for your use.
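If you want to experiment with the soundalike idea, here's a minimal sketch using the text gem (an assumption on my part; any Metaphone or Soundex implementation would do):
require 'text' # gem install text

words = %w[smith smyth smithe jones]
words.group_by { |w| Text::Metaphone.metaphone(w) }
# words that share a key are duplicate candidates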