Problem with initializing a hash in Ruby

I have a text file from which I want to create a Hash for faster access. My text file has this format (space-delimited):
author title date popularity
I want to create a hash in which the author is the key and the rest of the line is the value, as an array:
created_hash["briggs"] = ["Manup", "Jun,2007", 10]
Thanks in advance.

require 'date'

created_hash = File.foreach('test.txt', mode: 'rt', encoding: 'UTF-8').
  reduce({}) { |hsh, l|
    name, title, date, pop = l.split
    hsh.tap { |hsh| hsh[name] = [title, Date.parse(date), pop.to_i] }
  }
I threw some type conversion code in there, just for fun. If you don't want that, the loop body becomes even simpler:
k, *v = l.split
hsh.tap {|hsh| hsh[k] = v }
You can also use readlines instead of foreach. Note that IO#readlines reads the entire file into an array first, so you need enough memory to hold both the entire array and the entire hash. (Of course, the array becomes eligible for garbage collection as soon as the loop finishes.)
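For example, a minimal sketch of the readlines variant (assuming the same space-delimited test.txt as above):

created_hash = File.readlines('test.txt').reduce({}) do |hsh, l|
  k, *v = l.split # the first word is the key, the rest is the value array
  hsh[k] = v
  hsh
end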

Just loop through each line of the file, use the first space-delimited item as the hash key and the rest as the hash value. Pretty much exactly as you described.
created_hash = {}

file_contents.each_line do |line|
  data = line.split(' ')
  created_hash[data[0]] = data.drop(1)
end

Related

Ruby - Extra punctuation in file when using regex and csv class to write to a file

I'm using a regex to grab parameters from an HTML file.
I've tested the regexp and it seems to be fine; it appears that the CSV conversion is what's causing the issue, but I'm not sure.
Here is what I have:
require 'csv'

mechanics_file = File.read(filename)
mechanics = mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/)
id_file = File.read(filename)
id = id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/)
puts id.zip(mechanics)

CSV.open('csvfile.csv', 'w') do |csv|
  id.zip(mechanics) { |row| csv << row }
end
The puts output looks like this:
2073
Acting
2689
Action / Movement Programming
But the contents of the csv look like this:
"[""2073""]","[""Acting""]"
"[""2689""]","[""Action / Movement Programming""]"
How do I get rid of all of the extra quotes and brackets? Am I doing something wrong in the process of writing to a csv?
This is my first project in ruby so I would appreciate a child-friendly explanation :) Thanks in advance!
String#scan returns an Array of Arrays (bold emphasis mine):
scan(pattern) → array
Both forms iterate through str, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block. If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
a = "cruel world"
# […]
a.scan(/(...)/) #=> [["cru"], ["el "], ["wor"]]
So, id looks like this:
id == [['2073'], ['2689']]
and mechanics looks like this:
mechanics == [['Acting'], ['Action / Movement Programming']]
id.zip(mechanics) then looks like this:
id.zip(mechanics) == [[['2073'], ['Acting']], [['2689'], ['Action / Movement Programming']]]
Which means that in your loop, each row looks like this:
row == [['2073'], ['Acting']]
row == [['2689'], ['Action / Movement Programming']]
CSV#<< expects as its argument an Array of Strings, or of things that can be converted to Strings. You are passing it an Array of Arrays, which it will happily convert to an Array of Strings for you by calling Array#to_s on each element, and that looks like this:
[['2073'], ['Acting']].map(&:to_s) == [ '["2073"]', '["Acting"]' ]
[['2689'], ['Action / Movement Programming']].map(&:to_s) == [ '["2689"]', '["Action / Movement Programming"]' ]
Lastly, " is the string delimiter in CSV, and needs to be escaped by doubling it, so what actually gets written to the CSV file is this:
"[""2073""]", "[""Acting""]"
"[""2689""]", "[""Action / Movement Programming""]"
The simplest way to correct this would be to flatten the return values of the scans (and maybe also convert the IDs to Integers, assuming that they are, in fact, Integers):
mechanics_file = File.read(filename)
mechanics = mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/).flatten

id_file = File.read(filename)
id = id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/).flatten.map(&:to_i)

CSV.open('csvfile.csv', 'w') do |csv|
  id.zip(mechanics) { |row| csv << row }
end
Another suggestion would be to forgo the Regexps completely and use an HTML parser to parse the HTML.
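For instance, here is a minimal sketch using the Nokogiri gem. The CSS selectors are assumptions reverse-engineered from the regexes above, so adjust them to the actual markup:

require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(File.read(filename))

# Assumed markup: <input name="propertyids[]" value="..."> tags for the IDs
# and <td width="70%"> cells for the mechanics.
ids       = doc.css('input[name="propertyids[]"]').map { |node| node['value'].to_i }
mechanics = doc.css('td[width="70%"]').map { |node| node.text.strip }

CSV.open('csvfile.csv', 'w') do |csv|
  ids.zip(mechanics) { |row| csv << row }
end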

How to print elements with the same XPath under the same column (same header)

I'm trying to parse an XML file with REXML in Ruby.
What I want is to print all values, with the corresponding element name as a header. The issue I have is that some nodes have child elements that appear repeatedly and have the same XPath, so I want those elements printed in the same column. For the small sample below, the desired output for the elements of Node_XX would be:
Output I'm looking for:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
What I have so far is the code below, but I don't know how to make the repeated elements print in the same column.
Thanks in advance for any help.
Code I have so far:
#!/usr/bin/env ruby
require 'rexml/document'
include REXML

xmldoc = Document.new(File.new("input.xml"))

arr_H_Xpath  = [] # Array to store each XPath only once (no repeated XPaths)
arr_H_Values = [] # Array for headers (each child element's name)
arr_Values   = [] # Values of each child element.

xmldoc.elements.each("//Node_XYZ") { |element|
  element.each_recursive do |child|
    # Check if the element has text and its XPath is not yet stored in arr_H_Xpath.
    if (child.has_text? && child.text =~ /^[[:alnum:]]/) && !arr_H_Xpath.include?(child.xpath.gsub(/\[.\]/, ""))
      arr_H_Xpath << child.xpath.gsub(/\[.\]/, "")     # Remove the [..] from repeated XPaths
      arr_H_Values << child.xpath.gsub(/\/\w.*\//, "") # Keep only the child element's name, to use as a header
      arr_Values << child.text
    end
    print arr_H_Values.join + "|"
    arr_H_Values.clear
  end
  puts arr_Values.join("|")
}
The input.xml is:
<TopNode>
  <NodeX>
    <Node_XX>
      <RepCD_valm>
        <RepVal>MTSJ</RepVal>
      </RepCD_valm>
      <RepCD_yur>
        <Yur>
          <YurVal>AB01-J</YurVal>
        </Yur>
      </RepCD_yur>
      <CodesDif>
        <CD_Ranges>
          <CD_val>45</CD_val>
          <HJY_val>01</HJY_val>
          <CD_Sub>
            <CD_SubA>87</CD_SubA>
            <CD_SubB>12</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
      <CodesDif>
        <CD_Ranges>
          <CD_val>34</CD_val>
          <HJY_val>11</HJY_val>
          <CD_Sub>
            <CD_SubA>43</CD_SubA>
            <CD_SubB>62</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
    </Node_XX>
    <Node_XY>
      ....
      ....
      ....
    </Node_XY>
  </NodeX>
</TopNode>
Here's one way to solve your problem. It is probably a little unusual, but I was experimenting. :)
First, I chose a data structure that can store the headers as keys and multiple values per key, to represent the additional row(s) of data: a Multimap (e.g. from the multimap gem). It is like a Hash that can hold several values under the same key.
With the multimap, you can store the elements as key-value pairs:
require 'nokogiri' # this answer queries the document with Nokogiri rather than REXML
require 'multimap' # Multimap comes from the multimap gem

doc = Nokogiri::XML(File.read('input.xml'))
data = Multimap.new

doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] = elem.inner_text
end
The content of data is:
{"RepVal"=>["MTSJ"],
"YurVal"=>["AB01-J"],
"CD_val"=>["45", "34"],
"HJY_val"=>["01", "11"],
"CD_SubA"=>["87", "43"],
"CD_SubB"=>["12", "62"]}
As you can see, this was a simple way to collect all the information you need to create your table. Now it is just a matter of transforming it to your pipe-delimited format. For this, or any delimited format, I recommend using CSV:
require 'csv'

out = CSV.generate(col_sep: "|") do |csv|
  columns = data.keys.to_a.uniq
  csv << columns
  while !data.values.empty?
    csv << columns.map { |col| data[col].shift }
  end
end
The output is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
Explanation:
CSV.generate creates a string. If you wanted to create an output file directly, use CSV.open instead. See the CSV class for more information. I added the col_sep option to delimit with a pipe character instead of the default of a comma.
Getting a list of columns would just be the keys if data were a Hash. But since it is a Multimap, which repeats key names, I have to call .to_a.uniq on it. Then I add the headers to the output using csv << columns.
In order to create the second row (and any subsequent rows), we slice off the first value for each key of data. That's what data[col].shift does: it actually removes the first value from each values array in data. The loop keeps going as long as there are more values (more rows).
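If you'd rather not depend on the multimap gem, a plain Hash whose values are Arrays works the same way; a rough equivalent sketch (doc is the parsed document from above):

require 'csv'

# A Hash that creates a fresh Array for each new key.
data = Hash.new { |h, k| h[k] = [] }
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] << elem.inner_text
end

out = CSV.generate(col_sep: "|") do |csv|
  csv << data.keys
  # Shift one value per column per row until every column is exhausted.
  csv << data.keys.map { |col| data[col].shift } until data.values.all?(&:empty?)
end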

Finding maximum CSV field sizes for large file using Ruby

I'm trying to determine maximum field sizes for a large CSV file (~5GB) with over 300 fields that I want to dump into a MySQL table. The CSV file schema I have for the file gives incorrect maximum field lengths, so I run into errors on the table import. I'm running Ruby 2.0 on Windows.
I'm using an array to store the maximum field lengths according to the field's index (or column) location, i.e. ignoring the field's actual name in the header. I've tried fancier things like using hashes, inject, zip, etc., but it seems that a simple array works fastest here.
field_lengths[0] = Max length of first field
field_lengths[1] = Max length of second field
etc.
The file is too large to slurp at once or parse column-wise using CSV. So, I open the CSV file and use CSV#foreach to parse each line (ignoring the header using the :headers => true option). For each row, I loop through the parsed array of field values and compare each field's length with the current maximum stored in the field_lengths array. I realize there are much easier ways to do this with smaller files. This method works OK for larger files, but I still haven't been able to make it to the end of my particular file.
To get around not being able to finish the file, I currently define a number of lines to read including the header (=n), and break once I've reached the n-th line. In the example below, I read 101 lines from the CSV file. (1 header row + 100 actual data rows). I'm not sure how many total rows are in the file since the process hasn't finished.
require 'csv'
require 'pp'

data_file = 'big_file.csv'

# We're only reading the first 101 lines in this example
n = 101

field_lengths = []

File.open(data_file) do |f|
  CSV.foreach(f, :headers => true, :header_converters => :symbol) do |csv_row|
    break if $. > n
    csv_row.fields.each_with_index do |a, i|
      field_lengths[i] ||= a.to_s.length
      field_lengths[i] = a.to_s.length if field_lengths[i] < a.to_s.length
    end
  end
end

pp field_lengths
IO#read can read a certain number of bytes, but if I parse the file by bytes the records may get split. Does anyone have alternate suggestions for parsing the CSV file, by splitting it up into smaller files? O'Reilly's Ruby Cookbook (Lucas Carlson & Leonard Richardson, 2006, 1st ed.) suggests breaking a large file into chunks (as below), but I'm not sure how to extend that to this example, particularly dealing with the line breaks, etc.
class File
  def each_chunk(chunk_size = 1024)
    yield read(chunk_size) until eof?
  end
end

open("bigfile.txt") do |f|
  f.each_chunk(15) { |chunk| puts chunk }
end
You're using CSV.foreach incorrectly; it takes a filename string, not a File object:
field_lengths = {}

CSV.foreach(data_file, :headers => true, :header_converters => :symbol) do |csv_row|
  csv_row.each do |k, v|
    field_lengths[k] = [field_lengths[k] || 0, v.to_s.length].max # to_s guards against nil fields
  end
end

pp field_lengths
Given a CSV::Table in a variable called csv, you can change it to be "by column" and use collect or map... or inject. There are lots of approaches.
e.g.
csv.by_col!
field_lengths = csv.map { |col| col.map { |r| r.is_a?(String) ? r.to_s.length : r.map { |v| v.to_s.length }.max }.max }
The first map iterates through the columns.
The second map iterates over each column's two entries: the header (a String) and the array of row values.
The third map iterates through the row values, converting each value to a string and returning its length.
The inner max is the maximum value-string length.
The outer max is the maximum of the header's length and the maximum value length.
At the end, you may wish to return the csv to col_or_row mode,
e.g.
csv.by_col_or_row!
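For illustration, hypothetical usage on a tiny table (the sample data here is made up):

require 'csv'

csv = CSV.parse("name,city\nalice,london\nbob,ny", headers: true)
csv.by_col!
field_lengths = csv.map { |col| col.map { |r| r.is_a?(String) ? r.to_s.length : r.map { |v| v.to_s.length }.max }.max }
field_lengths # => [5, 6] ("alice" is 5 characters, "london" is 6)
csv.by_col_or_row!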

Copy the lines of a file into a hashmap in Ruby

I have a file with multiple lines. In each line there are two words and a number, separated by commas - for example, a, b, 1. It means that string a and string b have 1 as their key. I wrote the piece of code below:
File.open(ARGV[0], 'r') do |f1|
  while line = f1.gets
    puts line
  end
end
I'm looking for an idea of how to split and copy the words and the number in such a way that the first two words are stored in the hashmap under the last number as their key.
Does this work for you?
hash = {}

File.readlines(ARGV[0]).each do |line|
  var = line.chomp.gsub(' ', '').split(',') # chomp drops the newline so the key is '1', not "1\n"
  hash[var[2]] = var[0], var[1]
end
This would give:
hash['1'] = ['a','b']
I don't know if you want to store the number as an integer or a string; if it's an integer you're looking for, just use var[2].to_i before storing.
I modified your code a little bit; I think it's shorter this way. If I'm in any way wrong, do let me know.

How do I convert a Ruby string with brackets to an array?

I would like to convert the following string into an array/nested array:
str = "[[this, is],[a, nested],[array]]"
newarray = # this is what I need help with!
newarray.inspect # => [['this','is'],['a','nested'],['array']]
You'll get what you want with YAML.
But there is a little problem with your string: YAML expects a space after the comma. So we need this:
str = "[[this, is], [a, nested], [array]]"
Code:

require 'yaml'

str = "[[this, is],[a, nested],[array]]"

### Transform your string into a valid YAML string
str.gsub!(/(\,)(\S)/, "\\1 \\2")

YAML::load(str)
# => [["this", "is"], ["a", "nested"], ["array"]]
You could also treat it as almost-JSON. If the strings really are only letters, like in your example, then this will work:
require 'json'

JSON.parse(str.gsub(/([a-z]+)/, '"\1"'))
If they could have arbitrary characters (other than [ ] , ), you'd need a little more:
JSON.parse("[[this, is],[a, nested],[array]]".gsub(/, /,",").gsub(/([^\[\]\,]+)/,'"\1"'))
For a laugh:
ary = eval("[[this, is],[a, nested],[array]]".gsub(/(\w+?)/, "'\\1'"))
=> [["this", "is"], ["a", "nested"], ["array"]]
Disclaimer: You definitely shouldn't do this, as eval is a terrible idea, but it is fast and has the useful side effect of throwing an exception if your nested arrays aren't valid.
Looks like a basic parsing task. Generally, the approach you'll want to take is to create a recursive function with the following general algorithm:
base case (input doesn't begin with '[') return the input
recursive case:
split the input on ',' (you will need to find commas only at this level)
for each sub string call this method again with the sub string
return array containing the results from this recursive method
The only slightly tricky part here is splitting the input on a single ','. You could write a separate function for this that scans through the string and keeps a count of open brackets minus closed brackets seen so far, and only splits on commas when the count is zero. A sketch of this approach follows below.
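Here's a minimal sketch of that algorithm; the names parse_nested and split_top_level are made up for illustration:

def split_top_level(str)
  parts, depth, current = [], 0, ''
  str.each_char do |c|
    depth += 1 if c == '['
    depth -= 1 if c == ']'
    if c == ',' && depth.zero?
      parts << current # a comma at depth zero separates top-level items
      current = ''
    else
      current << c
    end
  end
  parts << current
end

def parse_nested(input)
  input = input.strip
  return input unless input.start_with?('[') # base case: a bare word
  inner = input[1..-2]                       # drop the outer brackets
  split_top_level(inner).map { |part| parse_nested(part) }
end

parse_nested("[[this, is],[a, nested],[array]]")
# => [["this", "is"], ["a", "nested"], ["array"]]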
Make a recursive function that takes the string and an integer offset, and "reads" out an array. That is, have it return the array or string it has read, plus an integer offset pointing just past it. For example:
s = "[[this, is],[a, nested],[array]]"
yourFunc(s, 1) # returns ['this', 'is'] and 11.
yourFunc(s, 2) # returns 'this' and 6.
Then you can call it with another function that provides an offset of 0, and makes sure that the finishing offset is the length of the string.
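A rough sketch of that offset-based approach; read_value is a hypothetical stand-in for yourFunc, and it assumes the input contains only words, commas, spaces, and brackets:

def read_value(s, i)
  if s[i] == '['
    arr = []
    i += 1 # step past '['
    until s[i] == ']'
      value, i = read_value(s, i)
      arr << value
      i += 1 while s[i] == ',' || s[i] == ' ' # skip separators
    end
    [arr, i + 1] # step past ']'
  else
    j = i
    j += 1 while s[j] && s[j] !~ /[\[\],]/ # read up to the next delimiter
    [s[i...j].strip, j]
  end
end

s = "[[this, is],[a, nested],[array]]"
read_value(s, 1) # => [["this", "is"], 11]
read_value(s, 2) # => ["this", 6]
value, offset = read_value(s, 0)
offset == s.length # => true, the check suggested above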
