Ruby - Extra punctuation in file when using regex and csv class to write to a file - ruby

I'm using regex to grab parameters from an html file.
I've tested the regexp and it seems to be fine. It appears that the CSV conversion is what's causing the issue, but I'm not sure.
Here is what I have:
mechanics_file= File.read(filename)
mechanics= mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/)
id_file= File.read(filename)
id=id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/)
puts id.zip(mechanics)
CSV.open('csvfile.csv', 'w') do |csv|
id.zip(mechanics) { |row| csv << row }
end
The puts output looks like this:
2073
Acting
2689
Action / Movement Programming
But the contents of the csv look like this:
"[""2073""]","[""Acting""]"
"[""2689""]","[""Action / Movement Programming""]"
How do I get rid of all of the extra quotes and brackets? Am I doing something wrong in the process of writing to a csv?
This is my first project in ruby so I would appreciate a child-friendly explanation :) Thanks in advance!

String#scan returns an Array of Arrays (bold emphasis mine):
scan(pattern) → array
Both forms iterate through str, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block. If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
a = "cruel world"
# […]
a.scan(/(...)/) #=> [["cru"], ["el "], ["wor"]]
So, id looks like this:
id == [['2073'], ['2689']]
and mechanics looks like this:
mechanics == [['Acting'], ['Action / Movement Programming']]
id.zip(mechanics) then looks like this:
id.zip(mechanics) == [[['2073'], ['Acting']], [['2689'], ['Action / Movement Programming']]]
Which means that in your loop, each row looks like this:
row == [['2073'], ['Acting']]
row == [['2689'], ['Action / Movement Programming']]
CSV#<< expects an Array of Strings, or things that can be converted to Strings as an argument. You are passing it an Array of Arrays, which it will happily convert to an Array of Strings for you by calling Array#to_s on each element, and that looks like this:
[['2073'], ['Acting']].map(&:to_s) == [ '["2073"]', '["Acting"]' ]
[['2689'], ['Action / Movement Programming']].map(&:to_s) == [ '["2689"]', '["Action / Movement Programming"]' ]
Lastly, " is the quote character in CSV: any field that contains it gets wrapped in quotes, and each " inside the field is escaped by doubling it, so what actually gets written to the CSV file is this:
"[""2073""]","[""Acting""]"
"[""2689""]","[""Action / Movement Programming""]"
The simplest way to correct this, would be to flatten the return values of the scans (and maybe also convert the IDs to Integers, assuming that they are, in fact, Integers):
mechanics_file = File.read(filename)
mechanics = mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/).flatten
id_file = File.read(filename)
id = id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/).flatten.map(&:to_i)
CSV.open('csvfile.csv', 'w') do |csv|
id.zip(mechanics) { |row| csv << row }
end
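The difference between a nested row and a flattened row can be checked without touching any files. This is a minimal sketch using CSV.generate_line; the variable names are only for illustration and are not part of the original code:

```ruby
require 'csv'

# One row as produced by zipping two String#scan results that used capture groups.
nested = [['2073'], ['Acting']]
# The same row after flattening: a plain Array of Strings.
flat = nested.flatten

# Each inner Array is converted with to_s, then quoted because it contains ".
nested_line = CSV.generate_line(nested)
# Plain strings need no quoting at all.
flat_line = CSV.generate_line(flat)

puts nested_line  # "[""2073""]","[""Acting""]"
puts flat_line    # 2073,Acting
```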
Another suggestion would be to forgo the Regexps completely and use an HTML parser to parse the HTML.

Related

Parse CSV file with headers when the headers are part way down the page

I have a CSV file that, as a spreadsheet, looks like this:
I want to parse the spreadsheet with the headers at row 19. Those headers won't always start at row 19, so my question is: is there a simple way to parse this spreadsheet and specify which row holds the headers, say by using the "Date" string to identify the header row?
Right now, I'm doing this:
CSV.foreach(params['logbook'].tempfile, headers: true) do |row|
Flight.create(row.to_hash)
end
but obviously that won't work because it doesn't get the right headers.
I feel like there should be a simple solution to this since it's pretty common to have CSV files in this format.
Let's first create the csv file that would be produced from the spreadsheet.
csv =<<-_
N211E,C172,2004,Cessna,172R,airplane,airplane
C-GPGT,C172,1976,Cessna,172M,airplane,airplane
N17AV,P28A,1983,Piper,PA-28-181,airplane,airplane
N4508X,P28A,1975,Piper,PA-28-181,airplane,airplane
,,,,,,
Flights Table,,,,,,
Date,AircraftID,From,To,Route,TimeOut,TimeIn
2017-07-27,N17AV,KHPN,KHPN,KHPN KHPN,17:26,18:08
2017-07-27,N17AV,KHSE,KFFA,,16:29,17:25
2017-07-27,N17AV,W41,KHPN,,21:45,23:53
_
FName = 'test.csv'
File.write(FName, csv)
#=> 395
We only want the part of the string that begins with "Date,". The easiest option is probably to first extract the relevant text. If the file is not humongous, we can slurp it into a string and then remove the unwanted bit.
str = File.read(FName).gsub(/\A.+?(?=^Date,)/m, '')
#=> "Date,AircraftID,From,To,Route,TimeOut,TimeIn\n2017-07-27,N17AV,
# KHPN,KHPN,KHPN KHPN,17:26,18:08\n2017-07-27,N17AV,KHSE,KFFA,,16:29,
# 17:25\n2017-07-27,N17AV,W41,KHPN,,21:45,23:53\n"
The regular expression that is gsub's first argument could be written in free-spacing mode, which makes it self-documenting:
/
\A # match the beginning of the string
.+? # match any number of characters, lazily
(?=^Date,) # match "Date," at the beginning of a line in a positive lookahead
/mx # multi-line and free-spacing regex definition modes
Now that we have the part of the file we want in the string str, we can use CSV::parse to create the CSV::Table object:
csv_tbl = CSV.parse(str, headers: true)
#=> #<CSV::Table mode:col_or_row row_count:4>
The option :headers => true is documented in CSV::new.
Here are a couple of examples of how csv_tbl can be used.
csv_tbl.each { |row| p row }
#=> #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHPN"\
# "To":"KHPN" "Route":"KHPN KHPN" "TimeOut":"17:26" "TimeIn":"18:08">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHSE"\
# "To":"KFFA" "Route":nil "TimeOut":"16:29" "TimeIn":"17:25">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"W41"\
# "To":"KHPN" "Route":nil "TimeOut":"21:45" "TimeIn":"23:53">
(I've used the character '\' to signify that the string continues on the following line, so that readers would not have to scroll horizontally to read the lines.)
csv_tbl.each { |row| p row["From"] }
# "KHPN"
# "KHSE"
# "W41"
Readers who want to know more about how Ruby's CSV class is used may wish to read Darko Gjorgjievski's piece, "A Guide to the Ruby CSV Library, Part 1 and Part 2".
You can use the smarter_csv gem for this. Parse the file once to determine how many rows you need to skip to get to the header row you want, and then use the skip_lines option:
header_offset = <code to determine number of lines above the header>
SmarterCSV.process(params['logbook'].tempfile, skip_lines: header_offset)
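One way to compute that offset with plain Ruby is to look for the first line that starts with the header's first column name. The sketch below assumes the header row always begins with "Date,"; header_offset and parse_from_header are hypothetical helper names, and the parsing step uses the standard CSV class rather than smarter_csv:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: zero-based index of the first line starting with `marker`,
# which is also the number of lines above the header. Returns nil if none is found.
def header_offset(path, marker)
  File.foreach(path).find_index { |line| line.start_with?(marker) }
end

# Hypothetical helper: drop everything above the header row and parse the rest.
def parse_from_header(path, marker)
  offset = header_offset(path, marker)
  raise ArgumentError, "no line starts with #{marker.inspect}" if offset.nil?
  CSV.parse(File.foreach(path).drop(offset).join, headers: true)
end

# Shortened sample data in the same shape as the file in the question.
sample = <<~CSV_TEXT
  N211E,C172,2004,Cessna,172R,airplane,airplane
  ,,,,,,
  Flights Table,,,,,,
  Date,AircraftID,From,To,Route,TimeOut,TimeIn
  2017-07-27,N17AV,KHPN,KHPN,KHPN KHPN,17:26,18:08
  2017-07-27,N17AV,KHSE,KFFA,,16:29,17:25
CSV_TEXT

Tempfile.create('logbook') do |file|
  file.write(sample)
  file.flush
  puts header_offset(file.path, 'Date,')
  parse_from_header(file.path, 'Date,').each { |row| p row.to_h }
end
```

The file is read twice here, once to find the header and once to parse, which keeps the code short at the cost of some extra I/O.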
From this format, I think the easiest way is to detect an empty line that comes before the header line. That would also work under changes to the header text. In terms of CSV, that would mean a whole line that has only empty cell items.

Ruby script which can replace a string in a binary file to a different, but same length string?

I would like to write a Ruby script (repl.rb) which can replace a string in a binary file (string is defined by a regex) to a different, but same length string.
It works like a filter, outputs to STDOUT, which can be redirected (ruby repl.rb data.bin > data2.bin), regex and replacement can be hardcoded. My approach is:
#!/usr/bin/ruby
fn = ARGV[0]
regex = /\-\-[0-9a-z]{32,32}\-\-/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"
blk_size = 1024
File.open(fn, "rb") {|f|
while not f.eof?
data = f.read(blk_size)
data.gsub!(regex, replacement)
print data
end
}
My problem is that the string can be positioned in the file so that it straddles a block boundary used when reading the binary file. For example, when blk_size=1024 and the first occurrence of the string begins at byte position 1000, I will not find it in the "data" variable. The same happens in the next read cycle. Should I process the whole file twice with different block sizes to avoid this worst-case scenario, or is there another approach?
I would posit that a tool like sed might be a better choice for this. That said, here's an idea: Read block 1 and block 2 and join them into a single string, then perform the replacement on the combined string. Split them apart again and print block 1. Then read block 3 and join block 2 and 3 and perform the replacement as above. Split them again and print block 2. Repeat until the end of the file. I haven't tested it, but it ought to look something like this:
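File.open(fn, "rb") do |f|
  # Start with the first block ("" if the file is empty).
  last_block = f.read(blk_size) || ""
  until f.eof?
    # Join the previous block with the next one so a match spanning the boundary is visible.
    data = (last_block + f.read(blk_size)).gsub(regex, replacement)
    # The replacement has the same length, so the first blk_size bytes are the finished previous block.
    print data.slice!(0, blk_size)
    last_block = data
  end
  print last_block.gsub(regex, replacement)
end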
There's probably a nontrivial performance penalty for doing it this way, but it could be acceptable depending on your use case.
Maybe a cheeky
f.pos = f.pos - replacement.size
at the end of the while loop, just before reading the next chunk.
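Another way to handle a match that straddles a block boundary is to hold back the tail of each chunk and prepend it to the next read, so that nothing is printed twice. The following is only a sketch under a few assumptions: every match has the same byte length as the replacement, the replacement contains no backslash sequences, and matches do not sit directly next to each other. replace_streaming is a hypothetical name, and StringIO stands in for the real file and STDOUT:

```ruby
require 'stringio'

# Hypothetical helper: copy `input` to `output`, replacing same-length matches.
# The last (replacement length - 1) bytes are held back after every read, so a
# match cut in half by a block boundary is seen whole on the next pass.
def replace_streaming(input, output, regex, replacement, blk_size: 1024)
  keep = replacement.bytesize - 1
  buffer = String.new # empty binary (ASCII-8BIT) string
  while (chunk = input.read(blk_size))
    buffer << chunk
    buffer.gsub!(regex, replacement)
    # Emit everything except the tail that could still be the start of a match.
    output.write(buffer.slice!(0, buffer.bytesize - keep)) if buffer.bytesize > keep
  end
  output.write(buffer)
end

# The pattern starts at byte 1000, so it crosses the 1024-byte block boundary.
input = StringIO.new(("x" * 1000 + "--" + "a" * 32 + "--" + "y" * 100).b)
output = StringIO.new(String.new)
replace_streaming(input, output, /--[0-9a-z]{32}--/, "--0ca2765b4fd186d6fc7c0ce385f0e9d9--")
puts output.string[1000, 36]
```

With a real file, the same method could be called as File.open(fn, "rb") { |f| replace_streaming(f, $stdout, regex, replacement) }.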

How to print elements with same Xpath below same column (same header)

I'm trying to parse an XML file with REXML on Ruby.
What I want is to print all values, with the corresponding element names as headers. The issue I have
is that some nodes have child elements that are repeated and share the same XPath, so I want to print
those elements in the same column. For the small sample below, the desired output
for the elements of Node_XX would be:
Output I'm looking for:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
What I have so far is the code below, but I don't know how to get repeated
elements printed in the same column.
Thanks in advance for any help.
Code I have so far:
#!/usr/bin/env ruby
require 'rexml/document'
include REXML
xmldoc = Document.new File.new("input.xml")
arr_H_Xpath = [] # Array to store each XPath only once (no repeated XPaths)
arr_H_Values = [] # Array for headers (each child element's name)
arr_Values = [] # Values of each child element.
xmldoc.elements.each("//Node_XYZ") {|element|
element.each_recursive do |child|
if (child.has_text? && child.text =~ /^[[:alnum:]]/) && !arr_H_Xpath.include?(child.xpath.gsub(/\[.\]/,"")) # Check if element has text and Xpath is stored in arr_H_Xpath.
arr_H_Xpath << child.xpath.gsub(/\[.\]/,"") #Remove the [..] for repeated XPaths
arr_H_Values << child.xpath.gsub(/\/\w.*\//,"") #Get only name of child element to use it as header
arr_Values << child.text
end
print arr_H_Values + "|"
arr_H_Values.clear
end
puts arr_Values.join("|")
}
The input.xml is:
<TopNode>
<NodeX>
<Node_XX>
<RepCD_valm>
<RepVal>MTSJ</RepVal>
</RepCD_valm>
<RepCD_yur>
<Yur>
<YurVal>AB01-J</YurVal>
</Yur>
</RepCD_yur>
<CodesDif>
<CD_Ranges>
<CD_val>45</CD_val>
<HJY_val>01</HJY_val>
<CD_Sub>
<CD_SubA>87</CD_SubA>
<CD_SubB>12</CD_SubB>
</CD_Sub>
</CD_Ranges>
</CodesDif>
<CodesDif>
<CD_Ranges>
<CD_val>34</CD_val>
<HJY_val>11</HJY_val>
<CD_Sub>
<CD_SubA>43</CD_SubA>
<CD_SubB>62</CD_SubB>
</CD_Sub>
</CD_Ranges>
</CodesDif>
</Node_XX>
<Node_XY>
....
....
....
</Node_XY>
</NodeX>
</TopNode>
Here's one way to solve your problem. It is probably a little unusual, but I was experimenting. :)
First, I chose a data structure that can store the headers as keys and multiple values per key to represent the additional row(s) of data: a MultiMap. It is like a hash that can hold several values under the same key.
With the multimap, you can store the elements as key-value pairs:
data = Multimap.new
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
data[elem.name] = elem.inner_text
end
The content of data is:
{"RepVal"=>["MTSJ"],
"YurVal"=>["AB01-J"],
"CD_val"=>["45", "34"],
"HJY_val"=>["01", "11"],
"CD_SubA"=>["87", "43"],
"CD_SubB"=>["12", "62"]}
As you can see, this was a simple way to collect all the information you need to create your table. Now it is just a matter of transforming it to your pipe-delimited format. For this, or any delimited format, I recommend using CSV:
out = CSV.generate(col_sep: "|") do |csv|
columns = data.keys.to_a.uniq
csv << columns
while !data.values.empty? do
csv << columns.map { |col| data[col].shift }
end
end
The output is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
Explanation:
CSV.generate creates a string. If you wanted to create an output file directly, use CSV.open instead. See the CSV class for more information. I added the col_sep option to delimit with a pipe character instead of the default of a comma.
Getting a list of columns would just be the keys if data was a hash. But since it is a Multimap which will repeat key names, I have to call .to_a.uniq on it. Then I add them to the output using csv << columns.
In order to create the second row (and any subsequent rows), we slice down and get the first value for each key of data. That's what the data[col].shift does: it actually removes the first value from each value in data. The loop is in place to keep going as long as there are more values (more rows).
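The same idea can be sketched with only REXML and a plain Hash whose default value is an empty Array, which avoids the extra Multimap dependency. This is an illustration rather than the answer's method: it assumes each column is identified by the element name alone (not the full XPath), and it treats every element without child elements as a value:

```ruby
require 'csv'
require 'rexml/document'

xml = <<~XML
  <Node_XX>
    <RepCD_valm><RepVal>MTSJ</RepVal></RepCD_valm>
    <RepCD_yur><Yur><YurVal>AB01-J</YurVal></Yur></RepCD_yur>
    <CodesDif><CD_Ranges>
      <CD_val>45</CD_val><HJY_val>01</HJY_val>
      <CD_Sub><CD_SubA>87</CD_SubA><CD_SubB>12</CD_SubB></CD_Sub>
    </CD_Ranges></CodesDif>
    <CodesDif><CD_Ranges>
      <CD_val>34</CD_val><HJY_val>11</HJY_val>
      <CD_Sub><CD_SubA>43</CD_SubA><CD_SubB>62</CD_SubB></CD_Sub>
    </CD_Ranges></CodesDif>
  </Node_XX>
XML

doc = REXML::Document.new(xml)

# Element name => all values found under that name, in document order.
columns = Hash.new { |hash, key| hash[key] = [] }
doc.root.each_recursive do |elem|
  columns[elem.name] << elem.text unless elem.has_elements?
end

# The longest column decides how many data rows are written; shorter columns
# are padded with nil, which CSV writes as an empty field.
row_count = columns.values.map(&:size).max
out = CSV.generate(col_sep: '|') do |csv|
  csv << columns.keys
  row_count.times do |i|
    csv << columns.keys.map { |name| columns[name][i] }
  end
end

puts out
```

Indexing into the arrays instead of calling shift leaves columns intact, so it can still be inspected after the table has been generated.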

Extract a single line string having "foo: XXXX"

I have a file with one or more key:value lines, and I want to pull a key:value out if key=foo. How can I do this?
I can get as far as this:
if File.exist?('/file_name')
content = open('/file_name').grep(/foo:??/)
I am unsure about the grep portion, and also once I get the content, how do I extract the value?
People like to slurp the files into memory, which, if the file will always be small, is a reasonable solution. However, slurping isn't scalable, and the practice can lead to excessive CPU and I/O waits as content is read.
Instead, because you could have multiple hits in a file, and you're comparing the content line-by-line, read it line-by-line. Line I/O is very fast and avoids the scalability problems. Ruby's File.foreach is the way to go:
File.foreach('path/to/file') do |li|
puts $1 if li[/foo:\s*(\w+)/]
end
Because there are no samples of actual key/value pairs, we're shooting in the dark for valid regex patterns, but this is the basis for how I'd solve the problem.
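To return the values instead of printing them, the same line-by-line approach can be wrapped in a small method. This is a sketch that assumes each pair sits on its own line in the form key: value; values_for is a hypothetical name and the sample content is made up:

```ruby
require 'tempfile'

# Hypothetical helper: all values for `key`, read line by line without slurping.
def values_for(path, key)
  # Optional whitespace around the key and value; the value is whatever follows the colon.
  pattern = /\A\s*#{Regexp.escape(key)}\s*:\s*(.*?)\s*\z/
  File.foreach(path).filter_map { |line| line[pattern, 1] }
end

Tempfile.create('settings') do |file|
  file.write("bar: 1\nfoo: XXXX\nbaz: 2\n")
  file.flush
  p values_for(file.path, 'foo')
end
```

filter_map needs Ruby 2.7 or later; on older versions, map followed by compact does the same job.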
Try this:
IO.readlines('key_values.txt').find_all{|line| line.match('key1')}
I would recommend reading the file into an array and selecting only the lines you need:
regex = /\A\s?key\s?:/
results = File.readlines('file').inject([]) do |f,l|
l =~ regex ? f << "key = %s" % l.sub(regex, '') : f
end
This will detect lines starting with key: and add them to results in the form key = value,
where value is the portion that comes after key:
So if you have a file like this:
key:1
foo
key:2
bar
key:3
you'll get results like this:
key = 1
key = 2
key = 3
Does that make sense?
value = File.open('/file_name').read.match("key:(.*)").captures[0] rescue nil
File.read('file_name')[/foo: (.*)/, 1]
#=> XXXX

Read from a file into an array and stop if a ":" is found in ruby

How can I in Ruby read a string from a file into an array and only read and save in the array until I get a certain marker such as ":" and stop reading?
Any help would be much appreciated =)
For example:
10.199.198.10:111 test/testing/testing (EST-08532522)
10.199.198.12:111 test/testing/testing (EST-08532522)
10.199.198.13:111 test/testing/testing (EST-08532522)
Should only read the following and be contained in the array:
10.199.198.10
10.199.198.12
10.199.198.13
This is a rather trivial problem, using String#split:
results = open('a.txt').map { |line| line.split(':')[0] }
p results
Output:
["10.199.198.10", "10.199.198.12", "10.199.198.13"]
String#split breaks a string at the specified delimiter and returns an array; so line.split(':')[0] takes the first element of that generated array.
In the event that there is a line without a : in it, String#split will return an array with a single element that is the whole line. So if you need to do a little more error checking, you could write something like this:
results = []
open('a.txt').each do |line|
results << line.split(':')[0] if line.include? ':'
end
p results
which will only add split lines to the results array if the line has a : character in it.
