Parse multiline fixed-width text files in Ruby

I am trying to parse a multi-line fixed-width file in Ruby, but I can't extract the information I need. Parsing works fine when each record sits on a single line, for example:
Name LastName DOB
John Doe 01/01/2001
Jane Doe 01/02/2002
The challenge comes when the file has a structure like the one below:
This message needs to be AccountId: 7854639
parsed in a single key Phone: 823972839563
of the json that I want to produce Email: test#test.com
The multiline text always sits at the same column coordinates, and its content is dynamic. I'm not sure how to parse it and map it into, for example, a single JSON value.

Here's a simplistic, un-golfed approach:
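# "\n" must be double-quoted; '\n' in single quotes is a literal backslash followed by n.
freeform_text = str.split("\n").map do |s|
  m = s.match(/^(.*)\s+(.*):(.*)$/)
  # m is nil when a line has no "Key: value" part, so check m itself.
  m ? m[1].strip : ''
end.join(' ')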
# Produces:
# "This message needs to be parsed in a single key of the json that I want to produce"
There are other, more-idiomatic approaches, but this gives you a hint of the direction to take.

str = "This message needs to be AccountId: 7854639
parsed in a single key Phone: 823972839563
of the json that I want to produce Email: test#test.com"
p str.scan(/([^\s]+:[^\n]+)/).flatten
See Ruby demo.
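Since the question says the free text always sits at the same column coordinates, slicing by column is another option. The following is only a sketch under assumptions: the 40-column width, the padded sample rows and the "Message" key are all made up for illustration and would need to match the real file.

```ruby
require 'json'

# Hypothetical layout: the free text occupies the first 40 columns and the
# "Key: value" pair starts at column 40. Adjust to the real file.
text_width = 40

# Rebuild the question's sample with explicit padding so the columns line up.
rows = [
  ['This message needs to be', 'AccountId: 7854639'],
  ['parsed in a single key', 'Phone: 823972839563'],
  ['of the json that I want to produce', 'Email: test#test.com']
]
str = rows.map { |left, right| left.ljust(text_width) + right }.join("\n")

message_parts = []
record = {}

str.each_line(chomp: true) do |line|
  left = line[0, text_width].to_s.strip     # free-text column
  right = line[text_width..-1].to_s.strip   # "Key: value" column
  message_parts << left unless left.empty?
  key, value = right.split(':', 2)
  record[key.strip] = value.strip if key && value
end

# "Message" is a hypothetical key name for the joined free text.
record['Message'] = message_parts.join(' ')
puts JSON.pretty_generate(record)
```

Slicing by position avoids guessing where the free text ends, which matters if the text itself ever contains a colon.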

Related

How do I create an XPath query to extract a substring of text?

I am trying to create an XPath expression that returns only the order number instead of the whole line. Please see the attached screenshot.
What you want is the substring-after() function -
fn:substring-after(string1,string2)
Returns the remainder of string1 after string2 occurs in it
Example: substring-after('12/10','/')
Result: '10'
For your situation -
substring-after(string(//p[contains(text(), "Your order # is")]), ": ")
To test this, I modified the DOM on this page to include an "Order Number: ####" string.
You could also use your normal XPath selector to get the complete text, "Your order # is: 123456", and then run a regex on that string, as described in "Get numbers from string with regex".
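The regex route might look like this in Ruby. It is a minimal sketch that assumes the selector has already returned the full text as a plain string; the variable names are illustrative.

```ruby
# Assumed: this is the text your XPath selector returned for the paragraph.
text = 'Your order # is: 123456'

# Capture the run of digits that follows the colon (group 1).
order_number = text[/:\s*(\d+)/, 1]
puts order_number
```

`String#[]` with a regex and a group index returns nil when nothing matches, so the result should be checked before use.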

Regular expression to fetch the value from a given string

I have the following string:
a=<record><FPR_AGENT_CODE>990042833</FPR_AGENT_CODE><FPR_AGENT_LABELCODE>CIF Code :</FPR_AGENT_LABELCODE><FPR_AGENT_LABELNAME>CIF Name :</FPR_AGENT_LABELNAME>
I need to get the value from:
<FPR_AGENT_CODE>990042833</FPR_AGENT_CODE>
to
"FPR_AGENT_CODE 990042833 FPR_AGENT_CODE"
How can I write the regular expression for this? I tried using the one given below, but it's not working.
puts a[/<.*>.*<\/.*>/]
You can use scan with the following regex:
/<([^>]+)>(\d+)<\/\1>/
Sample code:
a="<record><FPR_AGENT_CODE>990042833</FPR_AGENT_CODE><FPR_AGENT_LABELCODE>CIF Code :</FPR_AGENT_LABELCODE><FPR_AGENT_LABELNAME>CIF Name :</FPR_AGENT_LABELNAME><FPR_AGENT_NAME>Mr Kamal Kishore</FPR_AGENT_NAME><FPR_BANK_BRANCH_NAME>STATE BANK OF INDIA KHOUR</FPR_BANK_BRANCH_NAME><FPR_BRANCH_ADDRESS>"
puts a.scan(/<([^>]+)>(\d+)<\/\1>/)
Output:
FPR_AGENT_CODE
990042833
The regex <([^>]+)>(\d+)<\/\1> matches an opening tag in angle brackets (capturing the tag name into group 1), then a sequence of 1 or more digits (\d+, group 2), and then the closing tag, whose name must equal group 1 because of the \1 backreference.
If you need to get multiple values, you can use:
puts a.scan(/<([^>]+\b)[^<>]*>(.*?)<\/\1>/)
See another demo, output:
FPR_AGENT_CODE
990042833
FPR_AGENT_LABELCODE
CIF Code :
FPR_AGENT_LABELNAME
CIF Name :
FPR_AGENT_NAME
Mr Kamal Kishore
FPR_BANK_BRANCH_NAME
STATE BANK OF INDIA KHOUR
For multiline input, either use the m option (so that . also matches newlines), or replace (.*?) with ([^<]*).
puts a.scan(/<([^>]+\b)[^<>]*>(.*?)<\/\1>/m)
Or
puts a.scan(/<([^>]+\b)[^<>]*>([^<]*)<\/\1>/)
See another demo
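To get from the scan result to the exact string the question asks for, the captured pairs can be post-processed. This is a sketch on a shortened copy of the sample string; the variable names are illustrative.

```ruby
# Shortened copy of the sample input from the question.
a = '<record><FPR_AGENT_CODE>990042833</FPR_AGENT_CODE>' \
    '<FPR_AGENT_LABELCODE>CIF Code :</FPR_AGENT_LABELCODE>'

# Each match yields a [tag_name, value] pair.
pairs = a.scan(/<([^>]+)>([^<]*)<\/\1>/)

# Look values up by tag name.
by_tag = pairs.to_h

# Build the "TAG value TAG" form requested in the question.
formatted = pairs.map { |tag, value| "#{tag} #{value} #{tag}" }
puts formatted.first
```

For anything beyond flat, well-formed snippets like this one, a real XML parser is usually the safer choice.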

extracting strings out of one long string in Ruby

I have this really long string and I would like to extract specific strings out of it in a list form.
the string:
[#<User id: 1, login: "test", hash ... ]
I would like to extract everything that appears in between login: " and ", so in this case it would be the word test. This string can be indefinitely long but the pattern will be the same. How can I go about extracting the words out in a list form?
thanks!
string.scan(/login: "(.*?)",/)
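Because the pattern contains a capture group, scan returns an array of one-element arrays, so a flatten is needed to get a plain list. A small sketch, using a made-up two-user string since the original one is elided:

```ruby
# Hypothetical input in the same shape as the question's string.
string = '[#<User id: 1, login: "test", hash: "x">, ' \
         '#<User id: 2, login: "demo", hash: "y">]'

# scan yields [["test"], ["demo"]]; flatten turns that into a flat list.
logins = string.scan(/login: "(.*?)"/).flatten
p logins
```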

regex for extracting string from ical data with possible linebreaks

I need to match some iCal data with a regex so that I can replace the summary with the description value for each event, and I'm stuck.
sample data set:
...
SUMMARY: Hello how are you doi
ng? Hope everything is fine?
DESCRIPTION: This is a description.
This: is still the description;
...
Linebreaks are intended. As are the ":" and ";" characters in the value.
I now need to extract the SUMMARY and the DESCRIPTION values.
My first try was something like this:
summary = text.match /(?<=SUMMARY:).+(?=\n[A-Z]+:)/m
Here is a link to the rubular example (without the lookbehind, seems rubular isn't able to do that)
It works for summary as expected but not for Description.
The problem is that your lookahead requires \n[A-Z]+: after the match, but in your case the description is followed by the end of the string.
So the solution is to use an alternation in the lookahead that accepts either the next property or the end of the string:
DESCRIPTION:.+(?=\n[A-Z]+:|$)
See it on rubular
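A runnable sketch of that idea, with two deliberate changes from the pattern above: the quantifier is lazy (.+?) so the match stops at the first following property, and \z is used instead of $ so that only the true end of the string counts. The lambda name is illustrative.

```ruby
text = "SUMMARY: Hello how are you doi\nng? Hope everything is fine?\n" \
       "DESCRIPTION: This is a description.\nThis: is still the description;"

# Build a pattern for a given property name: everything after "NAME:" up to
# either the next "\nUPPERCASE:" line or the end of the string.
value_pattern = lambda do |name|
  /(?<=#{Regexp.escape(name)}:).+?(?=\n[A-Z]+:|\z)/m
end

summary = text[value_pattern.call('SUMMARY')].strip
description = text[value_pattern.call('DESCRIPTION')].strip

p summary
p description
```

This still relies on continuation lines not starting with an all-uppercase word followed by a colon, which is an assumption about the data rather than a guarantee.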
This works ok for me:
text = <<EOS
SUMMARY: Hello how are you doi
ng? Hope everything is fine?
DESCRIPTION: This is a description.
This: is still the description;
DATE: this gets selected too :(
EOS
summary = text.match /(?<=SUMMARY:)(?:.+?(?=[A-Z]+:)|.+?$)/m
p summary[0]
# " Hello how are you doi\nng? Hope everything is fine?\n"
description = text.match /(?<=DESCRIPTION:)(?:.+?(?=[A-Z]+:)|.+?$)/m
p description[0]
# " This is a description.\nThis: is still the description;"
Your example data does not conform to RFC 5545, section 3.1 Content Lines:
a long line can be split between any two characters by inserting a
CRLF immediately followed by a single linear white-space character
(i.e., SPACE or HTAB).
...
SUMMARY: Hello how are you doi
 ng? Hope everything is fine?
DESCRIPTION: This is a description.
 This: is still the description;
...
is a correct example: each continuation line starts with a single space.
Unfolding is accomplished by removing the CRLF and the linear
white-space character that immediately follows... parsing a content
line, folded lines MUST first be unfolded
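# Unfold first: drop each line break that is followed by a single space or tab.
# Use gsub rather than gsub!, because gsub! returns nil when nothing was replaced.
data = File.read("ical-data").gsub(/\r?\n[ \t]/, '')
# After unfolding, every property sits on one line.
hash = data.scan(/^(SUMMARY|DESCRIPTION):(.+)/).to_h
puts "Description:", hash["DESCRIPTION"]
puts "Summary:", hash["SUMMARY"]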
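The same unfold-then-scan idea can be tried without a file. This sketch uses an in-memory string with CRLF line endings and properly folded lines; the sample data and variable names are assumptions for illustration.

```ruby
# Folded content lines: CRLF followed by one space marks a continuation.
raw = "SUMMARY: Hello how are you doi\r\n ng? Hope everything is fine?\r\n" \
      "DESCRIPTION: This is a description.\r\n This: is still the description;\r\n"

# Unfold: remove each line break together with the single space or tab after it.
unfolded = raw.gsub(/\r?\n[ \t]/, '')

# Every property is now on one line; skip leading blanks and stop before CR/LF.
props = unfolded.scan(/^(SUMMARY|DESCRIPTION):[ \t]*([^\r\n]*)/).to_h

puts props['SUMMARY']
puts props['DESCRIPTION']
```

Note that unfolding joins "doi" and "ng?" back into "doing?", which is exactly what the folding rule intends.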

Ruby: How can I process a CSV file with "bad commas"?

I need to process a CSV file from FedEx.com containing shipping history. Unfortunately FedEx doesn't seem to actually test its CSV files as it doesn't quote strings that have commas in them.
For instance, a company name might be "Dog Widgets, Inc." but the CSV doesn't quote that string, so any CSV parser thinks that comma before "Inc." is the start of a new field.
Is there any way I can reliably parse those rows using Ruby?
The only differentiating characteristic I can find is that commas that are part of a string have a space after them, while commas that separate fields do not. I have no clue how that helps me parse this, but it is something I noticed.
You can use a negative lookahead:
>> "foo,bar,baz,pop, blah,foobar".split(/,(?![ \t])/)
=> ["foo", "bar", "baz", "pop, blah", "foobar"]
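Applied to several rows, the lookahead split might look like the sketch below. The sample rows are invented, and the -1 limit is an addition so that trailing empty fields are kept, which a plain split would drop.

```ruby
# Hypothetical rows in the style described in the question.
data = "Dog Widgets, Inc.,12345,Delivered\nAcme Corp,67890,"

# Split only on commas that are not followed by a space or tab.
rows = data.each_line(chomp: true).map do |line|
  line.split(/,(?![ \t])/, -1)
end

rows.each { |row| p row }
```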
Well, here's an idea: You could replace each instance of comma-followed-by-a-space with a unique character, then parse the CSV as usual, then go through the resulting rows and reverse the replace.
Perhaps something along these lines..
Use gsub to change the ', ' to something else:
ruby-1.9.2-p0 > "foo,bar,baz,pop, blah,foobar".gsub(/,\ /,'| ').split(',')
[
[0] "foo",
[1] "bar",
[2] "baz",
[3] "pop| blah",
[4] "foobar"
]
and then turn the | back into a comma afterwards.
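The full round trip, including the restore step, could be sketched as follows. It assumes the standard csv library is available and uses the ASCII unit separator as a placeholder on the assumption that it never occurs in the data; the lambda name is illustrative.

```ruby
require 'csv'

# Assumption: this character never appears in the real data.
placeholder = "\u001F"

parse_loose_row = lambda do |line|
  # Hide the "in-field" commas (those followed by a space) from the parser.
  protected_line = line.gsub(', ', "#{placeholder} ")
  # Parse normally, then put the commas back; empty fields come back as nil.
  CSV.parse_line(protected_line).map { |field| field&.gsub(placeholder, ',') }
end

row = parse_loose_row.call('foo,bar,baz,pop, blah,foobar')
p row
```

Going through the CSV parser rather than a bare split keeps proper handling of any fields that do happen to be quoted.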
If you are lucky enough to have only one field like that, you can parse the leading fields off the start and the trailing fields off the end, and assume whatever is left is the offending field. In Python (no habla Ruby) this would look something like:
fields = line.split(',') # doesn't work if some fields are quoted
fields = fields[:5] + [','.join(fields[5:-3])] + fields[-3:]
Whatever you do, you should at a minimum be able to determine the number of offending commas, and that should give you something (a sanity check if nothing else).
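A rough Ruby equivalent of the Python snippet, with the sanity check included. The lambda name and the field counts in the usage line are hypothetical; like the Python version, it does not handle quoted fields.

```ruby
# Rejoin the middle pieces when exactly one field may contain stray commas.
# leading/trailing are the numbers of well-behaved fields on either side.
split_with_one_loose_field = lambda do |line, leading, trailing|
  parts = line.split(',', -1)
  extra_commas = parts.length - (leading + trailing + 1)
  raise ArgumentError, 'too few fields' if extra_commas.negative?

  head = parts.first(leading)
  tail = parts.last(trailing)
  middle = parts[leading...(parts.length - trailing)].join(',')
  [head + [middle] + tail, extra_commas]
end

fields, extra = split_with_one_loose_field.call('a,b,Dog Widgets, Inc.,x,y', 2, 2)
p fields
puts "offending commas: #{extra}"
```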
