ngram a database file in Ruby - ruby

I am trying to ngram my database file. It works when I ngram a parsed string, but I do not know how to do the same for my database file.
I have the following code so far:
(hopefully I am on the right track)
require 'ngram'
require 'sqlite3'
ngram = NGram.new({
:size => 2,
:word_separator => " ",
:padchar => "_"
})
p ngram.parse('test')
# => ["__", "_t", "te", "es", "st", "t_", "__"]
p ngram.parse('test phrase')
db = SQLite3::Database.new("sample.db") #opens db
#ngram sample.db
Help is very much appreciated!

From the github code of ngram gem's parse method:
def parse(phrase)
words = phrase.split(@separator)
if words.length == 1
process(phrase)
else
words.map { |w| process(w) }
end
end
So, it's expecting a string object so that it can call String#split on it. That's why it works with your first example where you pass a string as an argument to the ngram.parse method.
I am not exactly sure what you want to accomplish here, but as long as you pass a string to the ngram.parse method, it would work. Or, at least, pass an argument that responds to the split method.
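Since parse works on one string at a time, one way to apply it to a database is to fetch the rows and parse each text value in turn. The following is only a minimal sketch: the rows are hard-coded in place of what a query such as db.execute would return (an array of row arrays), the table and column names in the comment are hypothetical, and the bigrams helper is a plain-Ruby stand-in for ngram.parse so the example runs without any gems.

```ruby
# Stand-in for the rows a query such as
#   db.execute("SELECT body FROM documents")
# might return (table and column names are hypothetical).
rows = [["test"], ["phrase"]]

# Hypothetical helper standing in for ngram.parse on a single word:
# plain character bigrams, without the gem's padding characters.
def bigrams(text)
  text.each_char.each_cons(2).map(&:join)
end

# Each row is an array of column values; parse the first column of each row.
ngrams_per_row = rows.map { |(body)| bigrams(body.to_s) }
p ngrams_per_row
```

With the real gem, the call inside the block would presumably be ngram.parse(body.to_s) instead of the stand-in helper.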

Related

How to pass method arguments use as Hash path?

E.G.
def do_the_thing(file_to_load, hash_path)
file = File.read(file)
data = JSON.parse(file, { symbolize_names: true })
data[sections.to_sym]
end
do_the_thing(file_I_want, '[:foo][:bar][0]')
Tried a few methods but failed so far.
Thanks for any help in advance :)
Assuming you mistyped the parameter names...
Let's assume our file is:
// test.json
{
"foo": {
"bar": ["foobar"]
}
}
Recommended solution
Does your param really need to be a string?
If your code can be more flexible and pass the arguments as ordinary Ruby values, you can use the Hash#dig method:
require 'json'
def do_the_thing(file, *hash_path)
file = File.read(file)
data = JSON.parse(file, symbolize_names: true)
data.dig(*hash_path)
end
do_the_thing('test.json', :foo, :bar, 0)
You should get
"foobar"
It should work fine.
Read the rest of the answer if that doesn't answer your question.
Alternative solution (using the same argument)
If you REALLY need to pass that argument as a string, you can convert it into the arguments the first solution expects. It won't be short or elegant code, but it will work:
require 'json'
BRACKET_REGEX = /(\[[^\[]*\])/.freeze
# Converts the literal string to its corresponding value
def treat_type(param)
# Remove the remaining brackets from the string
# You could do this step directly on the regex if you want to
param = param[1..-2]
case param[0]
# Checks if it is a string
when '\''
param[1..-2]
# Checks if it is a symbol
when ':'
param[1..-1].to_sym
else
begin
Integer(param)
rescue ArgumentError
param
end
end
end
# Converts your param to the accepted pattern of 'dig' method
def string_to_args(param)
# Scan method will break the match results of the regex into an array
param.scan(BRACKET_REGEX).flatten.map { |match| treat_type(match) }
end
def do_the_thing(file, hash_path)
hash_path = string_to_args(hash_path)
file = File.read(file)
data = JSON.parse(file, symbolize_names: true)
data.dig(*hash_path)
end
so:
do_the_thing('test.json', '[:foo][:bar][0]')
returns
"foobar"
This solution, though, is open to bugs when the "hash_path" does not follow an acceptable pattern, and handling those cases might make the code even longer.
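To illustrate what handling malformed paths could look like, here is a minimal sketch of a stricter parser. It is a variation, not the code above: the SEGMENT pattern and the parse_hash_path name are made up for this example, and it only accepts symbol, single-quoted string, and non-negative integer segments, raising ArgumentError for anything else.

```ruby
# Matches one bracketed segment: [:sym], ['str'], or [123].
# (SEGMENT and parse_hash_path are hypothetical names for this sketch.)
SEGMENT = /\[(?::(\w+)|'([^']*)'|(\d+))\]/

def parse_hash_path(path)
  # Reject anything that is not purely a sequence of well-formed segments.
  unless path.match?(/\A(?:#{SEGMENT.source})+\z/)
    raise ArgumentError, "malformed hash path: #{path.inspect}"
  end

  # Exactly one of the three capture groups is non-nil for each segment.
  path.scan(SEGMENT).map do |sym, str, int|
    if sym
      sym.to_sym
    elsif str
      str
    else
      int.to_i
    end
  end
end

data = { foo: { bar: ["foobar"] } }
p data.dig(*parse_hash_path("[:foo][:bar][0]"))
```

Validating the whole string first means a typo such as a missing bracket fails loudly instead of silently producing a shorter path.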
Shortest solution (Not safe)
You can use Kernel#eval, but I strongly discourage it for security reasons: it executes whatever string it is given. Read the documentation and understand the danger before using it.
require 'json'
def do_the_thing(file, hash_path)
file = File.read(file)
data = JSON.parse(file, symbolize_names: true)
eval("data#{hash_path}")
end
do_the_thing('test.json', '[:foo][:bar][0]')
If all you need is to load the JSON data into a Ruby object, either of the following would work. You can return the parsed data and index into it at the call site:
def do_the_thing(file_to_load)
file = File.read(file_to_load)
JSON.parse(file, symbolize_names: true)
end
do_the_thing(file_I_want)[:foo][:bar][0]
Or use Hash#dig:
def do_the_thing(file_to_load, sections)
file = File.read(file_to_load)
data = JSON.parse(file, { symbolize_names: true })
data.dig(*sections)
end
do_the_thing(file_I_want, [:foo, :bar, 0])

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files then output them into CSV files to list out the proper rows and columns.
I was able to do so by processing one file at a time by defining the filename, and specifically output them into a defined output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write into multiple files and make their name the same as the XML filenames without hard coding it. I tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
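One possible way to deal with it, sketched under assumptions: loop over the array one path at a time and derive each output name from the input name with File.basename. The csv_path_for helper is a made-up name, the temporary folders stand in for H:/input and H:/output, and the file contents are placeholders rather than real XML-to-CSV conversion.

```ruby
require 'tmpdir'
require 'fileutils'

# Build the output path for one input file: same base name, .csv extension.
# (csv_path_for is a hypothetical helper for this sketch.)
def csv_path_for(xml_path, output_folder)
  File.join(output_folder, File.basename(xml_path, '.xml') + '.csv')
end

Dir.mktmpdir do |root|
  # Temporary stand-ins for the real input and output folders.
  input_folder  = File.join(root, 'input')
  output_folder = File.join(root, 'output')
  FileUtils.mkdir_p([input_folder, output_folder])
  %w[a.xml b.xml].each { |name| File.write(File.join(input_folder, name), '<root/>') }

  # Dir[] returns an Array of paths, so read each path individually.
  Dir[File.join(input_folder, '*.xml')].sort_by { |f| File.mtime(f) }.each do |xml_path|
    xml = File.read(xml_path)
    # Placeholder output; the real conversion would go here.
    File.write(csv_path_for(xml_path, output_folder), "read #{xml.bytesize} bytes\n")
  end

  puts Dir.children(output_folder).sort.inspect
end
```

Because Dir[] already returns full paths, there is no need to prepend the input folder again before reading.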
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub, as they may not do what you think they do. Both accept a regular expression, but will also take a string, which they escape before beginning their scan.
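The difference between a string pattern and a regular expression pattern is easiest to see with a character that is special in regular expressions. A small sketch (the version string is just an illustrative value):

```ruby
version = '1.2.3'

# A String pattern is matched literally: only real dots are replaced.
literal = version.gsub('.', '-')

# A Regexp pattern is interpreted: /./ matches every character.
pattern = version.gsub(/./, '-')

# sub stops after the first match, gsub keeps going to the end.
first_only = version.sub('.', '-')

p literal, pattern, first_only
```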
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
xml_str = File.read(filename)
xml_str.gsub!('record:','') # remove the record: namespace
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml', '.csv')
CSV.open(csv_filename, 'wb' ) do |row|
row << ['name', 'street_address', 'postal_code', 'age']
row << [
doc.xpath('//name').text,
doc.xpath('//Street_Address').text,
doc.xpath('//Postal_Code').text,
doc.xpath('//Age').text,
]
end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

Extract url params in ruby

I would like to extract parameters from a URL. I have the following path pattern:
pattern = "/foo/:foo_id/bar/:bar_id"
And example url:
url = "/foo/1/bar/2"
I would like to get {foo_id: 1, bar_id: 2}. I tried to convert pattern into something like this:
"\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)"
I failed at the first step, when I tried to escape the slashes in the pattern:
formatted = pattern.gsub("/", "\/")
Do you know how to fix this gsub? Or maybe you know a better way to do this.
EDIT:
It is plain Ruby. I am not using RoR.
You only need to escape slashes in a Regexp literal, e.g. /foo\/bar/. When defining a Regexp from a string it's not necessary: Regexp.new("foo/bar") produces the same Regexp as /foo\/bar/.
As to your larger problem, here's how I'd solve it, which I'm guessing is pretty much how you'd been planning to solve it:
PATTERN_PART_MATCH = /:(\w+)/
PATTERN_PART_REPLACE = '(?<\1>.+?)'
def pattern_to_regexp(pattern)
expr = Regexp.escape(pattern) # just in case
.gsub(PATTERN_PART_MATCH, PATTERN_PART_REPLACE)
Regexp.new(expr)
end
pattern = "/foo/:foo_id/bar/:bar_id"
expr = pattern_to_regexp(pattern)
# => /\/foo\/(?<foo_id>.+?)\/bar\/(?<bar_id>.+?)/
str = "/foo/1/bar/2"
expr.match(str)
# => #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
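To get from the MatchData to the hash asked for in the question, the names and captures can be zipped together. The sketch below is a variation on the approach above rather than the answer's exact code: it anchors the pattern with \A and \z and uses [^/]+ so a parameter cannot span a slash, and extract_params is a made-up helper name. Converting all-digit values to integers is an assumption based on the desired output.

```ruby
# Variation on the approach above: anchored, and a parameter stops at a slash.
def pattern_to_regexp(pattern)
  body = Regexp.escape(pattern).gsub(/:(\w+)/, '(?<\1>[^/]+)')
  Regexp.new("\\A#{body}\\z")
end

# Hypothetical helper: returns a symbol-keyed hash, or nil when the URL
# does not match the pattern.
def extract_params(pattern, url)
  match = pattern_to_regexp(pattern).match(url)
  return nil unless match

  match.names.zip(match.captures).map do |name, value|
    # Turn purely numeric values into integers, leave everything else as-is.
    [name.to_sym, value.match?(/\A\d+\z/) ? value.to_i : value]
  end.to_h
end

p extract_params("/foo/:foo_id/bar/:bar_id", "/foo/1/bar/2")
```

Anchoring matters here: with an unanchored lazy group, a multi-digit final segment would only have its first character captured.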
Try this:
regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
matches = "/foo/1/bar/2".match(regex)
Hash[matches.names.zip(matches[1..-1])]
IRB output:
2.3.1 :032 > regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
=> /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
2.3.1 :033 > matches = "/foo/1/bar/2".match(regex)
=> #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
2.3.1 :034 > Hash[matches.names.zip(matches[1..-1])]
=> {"foo_id"=>"1", "bar_id"=>"2"}
I'd advise reading this article on how Rack parses query params. The above works for the example you gave, but does not extend to other params.
http://codefol.io/posts/How-Does-Rack-Parse-Query-Params-With-parse-nested-query
This might help you, the foo id and bar id will be dynamic.
require 'json'
#url to scan
url = "/foo/1/bar/2"
#scanning ids from url
id = url.scan(/\d/)
#gsub method to replacing values from url
url_with_id = url.gsub(url, "{foo_id: #{id[0]}, bar_id: #{id[1]}}")
#output
=> "{foo_id: 1, bar_id: 2}"
If you want to convert the string to a hash:
url_hash = eval(url_with_id)
=>{:foo_id=>1, :bar_id=>2}

How do I pass a hash from commandline?

I have a ruby script that has a hash.
Example:
animal_sound = { 'dog' => 'bark', 'cat' => 'meow' }
I want to add 'snake' => 'hiss'
Example:
myscript.rb --addsound "'snake' => 'hiss'"
Then in my script have it add it to animal_sound.
Example:
animal_sound.merge! 'snake' => 'hiss'
=> {"dog"=>"bark", "cat"=>"meow", "snake"=>"hiss"}
Is there a way to do this?
Here is the whole script:
#!/usr/bin/env ruby
require 'rubygems'
require 'micro-optparse'
options = Parser.new do |p|
p.option :addsound, "add sound"
end.process!
animal_sound = { 'dog' => 'bark', 'cat' => 'meow' }
if options[:add_sound]
newsound = options[:add_sound]
animal_sound.merge! newsound
end
puts animal_sound
When I run my script I get:
$ bin/myscript.rb --addsound "'snake' => 'hiss'"
bin/myscript.rb:14:in `merge!': can't convert true into Hash (TypeError)
from bin/myscript.rb:14:in `<main>'
SOLVED:
Using PSkocik's solution I got the script to work using animal, sound = options[:addsound].split(' => '); animal_sound[animal] = sound
I also used Simone Carletti's idea to simplify the CLI command. FYI it also works if I want to pass in hash format, like myscript.rb --addsound "'snake' => 'hiss'". Of course the split has to be changed back to split(' => '). I like the simpler CLI using the :.
Example:
myscript.rb --addsound snake:hiss
Final Code:
#!/usr/bin/env ruby
require 'rubygems'
require 'micro-optparse'
options = Parser.new do |p|
p.option :addsound, "add sound", default: ""
end.process!
animal_sound = { 'dog' => 'bark', 'cat' => 'meow' }
unless options[:addsound].empty?
animal, sound = options[:addsound].split(':')
animal_sound[animal] = sound
end
puts animal_sound
Command line:
$ bin/myscript.rb --addsound snake:hiss
{"dog"=>"bark", "cat"=>"meow", "snake"=>"hiss"}
I never could get the merge to work.
Each post was helpful. Thanks.
It's a good idea to keep the CLI interface detached from the underlying implementation. In fact, you may decide to switch the script in the future from Ruby to another language, and you don't really want to change the way the code is invoked.
My suggestion is to pass a serialized value, for example
myscript.rb --addsound snake:hiss
In the code, simply decompose the content and merge it.
if options[:addsound]
animal, sound = options[:addsound].split(":")
animal_sound.merge!(animal => sound)
end
p.option :addsound, "add sound"
^ this makes it a flag (true or false)
What you want is to make it into a switch whose value is the next argument:
p.option :addsound, "add sound", default: ""
^ this makes it a switch, the string value will be assigned to options[:addsound]
newsound = options[:addsound]
^ Here you need to drop the underscore and parse the string into a hash.
Eval is evil.
For example, you could split it on ' => ' and forget about quoting:
newsound = [ options[:addsound].split(' => ') ].to_h #and then merge it
(Passing the argument like so --addsound snake:hiss and then splitting on ':' instead of ' => ' is another good option.)
^splitting on ' => ' should yield a two-member array. Here I put it into another array (arrays of two-member arrays are convertible to hashes) to make it convertible into a hash.
Or you can do without merging and constructing another hash entirely:
animal, sound = options[:addsound].split(' => ')
animal_sound[animal] = sound
In regards to your error
Notice the line if options[:add_sound]. That basically evaluates to if true. You are getting your error because you are setting newsound to true, and trying to merge a Boolean into a hash. To my knowledge, the .merge only works like so: hash1.merge(hash2).
Passing command line argument
Rather than passing the argument "'snake' => 'hiss'", I suggest making this a comma-delimited list, like so: "snake,hiss". From there, in your if options[:add_sound] block, you can split the string into an array, using a comma as the separator. Finally, rather than using .merge, you can add the key and value as you normally would for any hash in Ruby: animal_sound[arr[0]] = arr[1].
Mind you, this method will work best with a single key:value pair. I am sure you can submit multiple pairs, but you would need to (by this method) split into more arrays by an additional character(like / maybe).
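As a rough sketch of the multiple-pair idea, assuming "/" separates pairs and "," separates key from value (the parse_sounds name and the "cow,moo" sample are made up for illustration):

```ruby
# Parse a string like "snake,hiss/cow,moo" into a hash of animal => sound.
# (parse_sounds is a hypothetical helper for this sketch.)
def parse_sounds(arg)
  # Split into pairs on "/", then split each pair once on ",".
  arg.split('/').map { |pair| pair.split(',', 2) }.to_h
end

animal_sound = { 'dog' => 'bark', 'cat' => 'meow' }
animal_sound.merge!(parse_sounds('snake,hiss/cow,moo'))
p animal_sound
```

Note that to_h raises if any pair lacks a comma, so real input would need validating first.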

Taking multiple lines from a file and creating hash

I'm taking a file, reading in its contents, and creating a hash based on newlines. I've been able to make a hash based on the contents of each line, but how can I create a hash based on the content of everything before the next blank line? Below is what I have so far.
Input:
Title 49th parallel
URL http://artsweb.bham.ac.uk/
Domain artsweb.bham.ac.uk
Title ABAA booknet
URL http://abaa.org/
Domain abaa.org
Code:
File.readlines('A.cfg').each do |line|
unless line.strip.empty?
hash = Hash[*line.strip.split("\t")]
puts hash
end
puts "\n" if line.strip.empty?
end
Outputs:
{"Title"=>"49th parallel"}
{"URL"=>"http://artsweb.bham.ac.uk/"}
{"Domain"=>"artsweb.bham.ac.uk"}
{"Title"=>"ABAA booknet"}
{"URL"=>"http://abaa.org/"}
{"Domain"=>"abaa.org"}
Desired Output:
{"Title"=>"49th parallel", "URL"=>"http://artsweb.bham.ac.uk/", "Domain"=>"artsweb.bham.ac.uk"}
{"Title"=>"ABAA booknet", "URL"=>"http://abaa.org/", "Domain"=>"abaa.org"}
Modifying your existing code, this does what you want:
hash = {}
File.readlines('A.cfg').each do |line|
if line.strip.empty?
puts hash unless hash.empty?
hash = {}
puts "\n"
else
hash.merge!(Hash[*line.strip.split("\t")])
end
end
puts hash
You can likely simplify that depending on what you're actually doing with the data.
open('A.cfg', &:read)
.strip.split(/#$/{2,}/)
.map{|s| Hash[s.scan(/^(\S+)\s+(\S+)/)]}
gives
[
{
"Title" => "49th",
"URL" => "http://artsweb.bham.ac.uk/",
"Domain" => "artsweb.bham.ac.uk"
},
{
"Title" => "ABAA",
"URL" => "http://abaa.org/",
"Domain" => "abaa.org"
}
]
Read the whole content of the file using read:
contents = ""
File.open('A.cfg') do |file|
contents = file.read
end
And then split the contents on two newline characters:
contents.split("\n\n")
And lastly, create a function pretty similar to what you already have to parse those chunks.
Please note that if you are working on Windows, you may need to split on a different sequence because of the carriage return character.
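A minimal sketch of those steps, assuming keys and values are separated by a tab as in the question's code. The contents string is inlined sample data with Windows-style line endings in place of the real file, and the split pattern accepts both \n and \r\n:

```ruby
# Inlined sample standing in for the file contents, with \r\n line endings.
contents = "Title\t49th parallel\r\nURL\thttp://artsweb.bham.ac.uk/\r\n\r\n" \
           "Title\tABAA booknet\r\nURL\thttp://abaa.org/\r\n"

# Two or more consecutive line breaks (with or without \r) separate records.
records = contents.strip.split(/(?:\r?\n){2,}/).map do |chunk|
  # Each line is "key<TAB>value"; split once so values may contain tabs.
  chunk.lines.map { |line| line.chomp.split("\t", 2) }.to_h
end

p records
```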

Resources