How to print elements with the same XPath under the same column (same header) - Ruby

I'm trying to parse an XML file with REXML on Ruby.
What I want is to print all values, with each element's name as a column header. The issue is that
some nodes have repeated child elements that share the same XPath, so I want those elements
printed in the same column. For the small sample below, the desired output
for the elements of Node_XX would be:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
What I have so far is the code below, but I don't know how to make the repeated
elements print in the same column.
Thanks in advance for any help.
Code I have so far:
#!/usr/bin/env ruby
require 'rexml/document'
include REXML

xmldoc = Document.new(File.new("input.xml"))

arr_H_Xpath = []  # Stores each XPath only once (repeats removed)
arr_H_Values = [] # Headers (each child element's name)
arr_Values = []   # Values of each child element
xmldoc.elements.each("//Node_XYZ") { |element|
  element.each_recursive do |child|
    # Check that the element has text and that its XPath is not already stored in arr_H_Xpath
    if (child.has_text? && child.text =~ /^[[:alnum:]]/) && !arr_H_Xpath.include?(child.xpath.gsub(/\[.\]/, ""))
      arr_H_Xpath << child.xpath.gsub(/\[.\]/, "")     # Remove the [..] index so repeated XPaths compare equal
      arr_H_Values << child.xpath.gsub(/\/\w.*\//, "") # Keep only the child element's name, to use as a header
      arr_Values << child.text
    end
    print arr_H_Values.join("|") + "|"
    arr_H_Values.clear
  end
  puts arr_Values.join("|")
}
The input.xml is:
<TopNode>
  <NodeX>
    <Node_XX>
      <RepCD_valm>
        <RepVal>MTSJ</RepVal>
      </RepCD_valm>
      <RepCD_yur>
        <Yur>
          <YurVal>AB01-J</YurVal>
        </Yur>
      </RepCD_yur>
      <CodesDif>
        <CD_Ranges>
          <CD_val>45</CD_val>
          <HJY_val>01</HJY_val>
          <CD_Sub>
            <CD_SubA>87</CD_SubA>
            <CD_SubB>12</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
      <CodesDif>
        <CD_Ranges>
          <CD_val>34</CD_val>
          <HJY_val>11</HJY_val>
          <CD_Sub>
            <CD_SubA>43</CD_SubA>
            <CD_SubB>62</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
    </Node_XX>
    <Node_XY>
      ....
      ....
      ....
    </Node_XY>
  </NodeX>
</TopNode>

Here's one way to solve your problem. It is probably a little unusual, but I was experimenting. :)
First, I chose a data structure that can store the headers as keys and multiple values per key to represent the additional row(s) of data: a Multimap (from the multimap gem). It is like a Hash that can hold several values under the same key.
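For illustration, here is the behavior relied on below (a minimal sketch, assuming the multimap gem):
require 'multimap' # gem install multimap

m = Multimap.new
m["CD_val"] = "45" # assigning again appends rather than overwrites
m["CD_val"] = "34"
m["CD_val"] #=> ["45", "34"]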
With the multimap, you can store the elements as key-value pairs:
require 'nokogiri' # note: this answer queries with Nokogiri rather than REXML
require 'multimap'

doc = Nokogiri::XML(File.read('input.xml')) # assuming the question's input.xml

data = Multimap.new
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] = elem.inner_text
end
The content of data is:
{"RepVal"=>["MTSJ"],
"YurVal"=>["AB01-J"],
"CD_val"=>["45", "34"],
"HJY_val"=>["01", "11"],
"CD_SubA"=>["87", "43"],
"CD_SubB"=>["12", "62"]}
As you can see, this was a simple way to collect all the information you need to create your table. Now it is just a matter of transforming it to your pipe-delimited format. For this, or any delimited format, I recommend using CSV:
require 'csv'

out = CSV.generate(col_sep: "|") do |csv|
  columns = data.keys.to_a.uniq
  csv << columns
  until data.values.empty?
    csv << columns.map { |col| data[col].shift }
  end
end
The output is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
Explanation:
CSV.generate creates a string. If you want to write an output file directly, use CSV.open instead; see the CSV class documentation for more information. I added the col_sep option to delimit with a pipe character instead of the default comma.
Getting the list of columns would just be the keys if data were a Hash. But since it is a Multimap, which repeats key names, I have to call .to_a.uniq on them. Then I add them to the output with csv << columns.
To create the second row (and any subsequent rows), we slice down and take the first value for each key of data. That is what data[col].shift does: it removes the first value from each key's list of values. The loop keeps going as long as there are values left (more rows).
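For what it's worth, if you would rather not depend on the multimap gem, a plain Hash with Array values gives the same result (a minimal sketch under the same assumptions about input.xml):
require 'csv'
require 'nokogiri'

doc = Nokogiri::XML(File.read('input.xml'))

# A Hash whose default value is a fresh Array, so values accumulate per key
data = Hash.new { |h, k| h[k] = [] }
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] << elem.inner_text
end

rows = data.values.map(&:length).max
out = CSV.generate(col_sep: "|") do |csv|
  csv << data.keys
  rows.times { |i| csv << data.values.map { |vals| vals[i] } }
end
puts out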

Related

Ruby - Extra punctuation in file when using regex and csv class to write to a file

I'm using regex to grab parameters from an html file.
I've tested the regexp and it seems to be fine- it appears that the csv conversion is what's causing the issue, but I'm not sure.
Here is what I have:
mechanics_file = File.read(filename)
mechanics = mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/)
id_file = File.read(filename)
id = id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/)
puts id.zip(mechanics)
CSV.open('csvfile.csv', 'w') do |csv|
  id.zip(mechanics) { |row| csv << row }
end
The puts output looks like this:
2073
Acting
2689
Action / Movement Programming
But the contents of the csv look like this:
"[""2073""]","[""Acting""]"
"[""2689""]","[""Action / Movement Programming""]"
How do I get rid of all of the extra quotes and brackets? Am I doing something wrong in the process of writing to a csv?
This is my first project in ruby so I would appreciate a child-friendly explanation :) Thanks in advance!
String#scan returns an Array of Arrays when the pattern contains groups. From the documentation:
scan(pattern) → array
Both forms iterate through str, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block. If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
a = "cruel world"
# […]
a.scan(/(...)/) #=> [["cru"], ["el "], ["wor"]]
So, id looks like this:
id == [['2073'], ['2689']]
and mechanics looks like this:
mechanics == [['Acting'], ['Action / Movement Programming']]
id.zip(mechanics) then looks like this:
id.zip(mechanics) == [[['2073'], ['Acting']], [['2689'], ['Action / Movement Programming']]]
Which means that in your loop, each row looks like this:
row == [['2073'], ['Acting']]
row == [['2689'], ['Action / Movement Programming']]
CSV#<< expects an Array of Strings, or things that can be converted to Strings as an argument. You are passing it an Array of Arrays, which it will happily convert to an Array of Strings for you by calling Array#to_s on each element, and that looks like this:
[['2073'], ['Acting']].map(&:to_s) == [ '["2073"]', '["Acting"]' ]
[['2689'], ['Action / Movement Programming']].map(&:to_s) == [ '["2689"]', '["Action / Movement Programming"]' ]
Lastly, " is the quote character in CSV and needs to be escaped by doubling it, so what actually gets written to the CSV file is this:
"[""2073""]", "[""Acting""]"
"[""2689""]", "[""Action / Movement Programming""]"
The simplest way to correct this would be to flatten the return values of the scans (and maybe also convert the IDs to Integers, assuming that they are, in fact, Integers):
mechanics_file = File.read(filename)
mechanics = mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/).flatten
id_file = File.read(filename)
id = id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/).flatten.map(&:to_i)
CSV.open('csvfile.csv', 'w') do |csv|
  id.zip(mechanics) { |row| csv << row }
end
Another suggestion would be to forgo the Regexps completely and use an HTML parser to parse the HTML.
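For example, a sketch with Nokogiri; the selectors here are guesses based on the fragments visible in the regexes, since the actual HTML isn't shown, so adjust them to the real markup:
require 'nokogiri'
require 'csv'

html = Nokogiri::HTML(File.read(filename))

# Hypothetical selectors: adapt to the real document structure
ids       = html.css('input[name="propertyids[]"]').map { |node| node['value'] }
mechanics = html.css('td[width="70%"]').map { |td| td.text.strip }

CSV.open('csvfile.csv', 'w') do |csv|
  ids.zip(mechanics) { |row| csv << row }
end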

Extra column when scanning JSON into CSV using .map, sorted order is lost

I am writing a script to convert JSON data to an ordered CSV spreadsheet.
The JSON data itself does not necessarily contain all keys (some fields in the spreadsheet should say "NA").
Typical JSON data looks like this:
json = {"ReferringUrl":"N","PubEndDate":"2010/05/30","ItmId":"347628959","ParentItemId":"46999"}
I have a list of the keys found in each column of the spreadsheet:
keys = ["ReferringUrl", "PubEndDate", "ItmId", "ParentItemId", "OtherKey", "Etc"]
My thought was that I could iterate through each line of JSON like this:
parsed = JSON.parse(json)
result = (0..keys.length).map{ |i| parsed[keys[i]] || 'NA'} #add values associated with keys to an array, using NA if no value is present
CSV.open('file.csv', 'wb') do |csv|
csv << keys #create headings on spreadsheet
csv << result #load data associated with headings into the next line
end
Ideally, this would create a CSV file with the proper information in the proper order in a spreadsheet. However, what happens is the result data comes in completely out of order, and contains an extra column that I don't know what to do with.
Looking at the actual data, since there are actually about 100 keys and most of the fields contain NA, it is very difficult to determine what is happening.
Any advice?
The extra column comes from 0..keys.length, which includes the end of the range. The last value of result is going to be parsed[keys[keys.length]], i.e. parsed[nil], which is nil, so the || 'NA' turns it into an extra 'NA' column. You can avoid that entirely by mapping keys directly:
result = keys.map { |key| parsed.fetch(key, 'NA') }
As for the random order of the values, I suspect you aren't giving us all of the relevant information, because I tested your code and the result came out in the same order as keys.
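Here is the fix in isolation, using the question's sample data (with the JSON as a string, since JSON.parse expects one):
require 'json'

json = '{"ReferringUrl":"N","PubEndDate":"2010/05/30","ItmId":"347628959","ParentItemId":"46999"}'
keys = ["ReferringUrl", "PubEndDate", "ItmId", "ParentItemId", "OtherKey", "Etc"]

parsed = JSON.parse(json)
result = keys.map { |key| parsed.fetch(key, 'NA') }
#=> ["N", "2010/05/30", "347628959", "46999", "NA", "NA"]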
Range has two possible notations:
..
and
...
... is exclusive, meaning the range (A...B) does not include B.
Change to
result = (0...keys.length).map{ |i| parsed[keys[i]] || 'NA'} #add values associated with keys to an array, using NA if no value is present
And see if that prevents the last value in that range from evaluating to nil.
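A quick way to see the difference:
(0..3).to_a  #=> [0, 1, 2, 3]
(0...3).to_a #=> [0, 1, 2]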

Finding maximum CSV field sizes for large file using Ruby

I'm trying to determine maximum field sizes for a large CSV file (~5GB) with over 300 fields that I want to dump into a MySQL table. The CSV file schema I have for the file gives incorrect maximum field lengths, so I run into errors on the table import. I'm running Ruby 2.0 on Windows.
I'm using an array to store the maximum field lengths according to the field's index (or column) location, i.e. ignoring the fields actual name in the header. I've tried fancier things like using hashes, inject, and zip, etc, but it seems that a simple array works fastest here.
field_lengths[0] = Max length of first field
field_lengths[1] = Max length of second field
etc.
The file is too large to slurp at once or parse column-wise using CSV. So, I open the CSV file and use CSV#foreach to parse each line (ignoring the header using the :headers => true option). For each row, I loop through the parsed array of field values and compare the field's length with the current maximum length stored in the field_length array. I realize there are much easier ways to do this with smaller files. This method works OK for larger files, but I still haven't been able to make it to the end of my particular file using this method.
To get around not being able to finish the file, I currently define a number of lines to read including the header (=n), and break once I've reached the n-th line. In the example below, I read 101 lines from the CSV file. (1 header row + 100 actual data rows). I'm not sure how many total rows are in the file since the process hasn't finished.
require 'csv'
require 'pp'

data_file = 'big_file.csv'

# We're only reading the first 101 lines in this example
n = 101
field_lengths = []

File.open(data_file) do |f|
  CSV.foreach(f, :headers => true, :header_converters => :symbol) do |csv_row|
    break if $. > n
    csv_row.fields.each_with_index do |a, i|
      field_lengths[i] ||= a.to_s.length
      field_lengths[i] = a.to_s.length if field_lengths[i] < a.to_s.length
    end
  end
end

pp field_lengths
IO#read can read a certain number of bytes, but if I parse the file by bytes the records may get split. Does anyone have alternate suggestions for parsing the CSV file, by splitting it up into smaller files? O'Reilly's Ruby Cookbook (Lucas Carlson & Leonard Richardson, 2006, 1st ed), suggests breaking a large file into chunks (as below), but I'm not sure how to extend it to this example, particularly dealing with the line breaks etc.
class File
  def each_chunk(chunk_size = 1024)
    yield read(chunk_size) until eof?
  end
end

open("bigfile.txt") do |f|
  f.each_chunk(15) { |chunk| puts chunk }
end
You're using CSV.foreach wrong; it takes a string with the filename, not a File object:
field_lengths = {}

CSV.foreach(data_file, :headers => true, :header_converters => :symbol) do |csv_row|
  csv_row.each do |k, v|
    # to_s guards against nil for empty fields
    field_lengths[k] = [field_lengths[k] || 0, v.to_s.length].max
  end
end

pp field_lengths
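As for the aside about reading in fixed-size chunks without splitting records: a minimal sketch that carries the partial trailing line of each chunk into the next round (note it does not handle newlines embedded inside quoted CSV fields):
class File
  def each_complete_line(chunk_size = 1024 * 1024)
    buffer = ''
    until eof?
      buffer << read(chunk_size)
      lines = buffer.split("\n", -1)
      buffer = lines.pop # keep the (possibly partial) last piece for the next round
      lines.each { |line| yield line }
    end
    yield buffer unless buffer.empty?
  end
end

File.open('big_file.csv') do |f|
  f.each_complete_line { |line| puts line }
end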
Given a CSV::Table in a variable called csv, you can change it to be "by column" and use collect or map... or inject. There are lots of approaches.
e.g.
csv.by_col!
field_lengths = csv.map { |col| col.map { |r| r.is_a?(String) ? r.to_s.length : r.map { |v| v.to_s.length }.max }.max }
The first map iterates through the columns.
The second map iterates through the header (a String) and then the array of row values.
The third map iterates through the rows, converting each value to a string and returning its length.
The first max is the maximum value-string length.
The second max is the maximum of the header's length and the maximum value length.
At the end, you may wish to return the csv to col_or_row
e.g.
csv.by_col_or_row!
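For a concrete run on a tiny table (a sketch using CSV.parse instead of the real file):
require 'csv'

csv = CSV.parse("a,b\nxx,y\nz,wwww", headers: true)
csv.by_col!
field_lengths = csv.map { |col| col.map { |r| r.is_a?(String) ? r.to_s.length : r.map { |v| v.to_s.length }.max }.max }
field_lengths #=> [2, 4]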

Problem with initializing a hash in ruby

I have a text file from which I want to create a Hash for faster access. My text file is of format (space delimited)
author title date popularity
I want to create a hash in which author is the key and the remaining is the value as an array.
created_hash["briggs"] = ["Manup", "Jun,2007", 10]
Thanks in advance.
require 'date'

created_hash = File.foreach('test.txt', mode: 'rt', encoding: 'UTF-8').
  reduce({}) { |hsh, l|
    name, title, date, pop = l.split
    hsh.tap { |hsh| hsh[name] = [title, Date.parse(date), pop.to_i] }
  }
I threw some type conversion code in there, just for fun. If you don't want that, the loop body becomes even simpler:
k, *v = l.split
hsh.tap {|hsh| hsh[k] = v }
You can also use readlines instead of foreach. Note that IO#readlines reads the entire file into an array first. So, you need enough memory to hold both the entire array and the entire hash. (Of course, the array will be eligible for garbage collection as soon as the loop finishes.)
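For example, the readlines version without the type conversions might look like this:
created_hash = {}
File.readlines('test.txt').each do |l|
  k, *v = l.split
  created_hash[k] = v
end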
Just loop through each line of the file, use the first space-delimited item as the hash key and the rest as the hash value. Pretty much exactly as you described.
created_hash = {}
file_contents.each_line do |line|
  data = line.split(' ')
  created_hash[data[0]] = data.drop(1)
end

Read from a file into an array and stop if a ":" is found in ruby

How can I, in Ruby, read strings from a file into an array, keeping from each line only the part before a certain marker such as ":"?
Any help would be much appreciated =)
For example:
10.199.198.10:111 test/testing/testing (EST-08532522)
10.199.198.12:111 test/testing/testing (EST-08532522)
10.199.198.13:111 test/testing/testing (EST-08532522)
Should only read the following and be contained in the array:
10.199.198.10
10.199.198.12
10.199.198.13
This is a rather trivial problem, using String#split:
results = open('a.txt').map { |line| line.split(':')[0] }
p results
Output:
["10.199.198.10", "10.199.198.12", "10.199.198.13"]
String#split breaks a string at the specified delimiter and returns an array; so line.split(':')[0] takes the first element of that generated array.
In the event that there is a line without a : in it, String#split will return an array with a single element that is the whole line. So if you need to do a little more error checking, you could write something like this:
results = []
open('a.txt').each do |line|
  results << line.split(':')[0] if line.include? ':'
end
p results
which will only add split lines to the results array if the line has a : character in it.
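As an aside, File.foreach reads line by line and closes the file for you (plain open leaves that to the garbage collector), so the simple version can also be written as:
results = File.foreach('a.txt').map { |line| line.split(':').first }
p results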
