Split complex string with a regex - ruby

I have a string:
(3592, -1, 7, N'SUNWopensp-root', N'1.5,REV=10.0.3.2004.12.15.14.19', N'Sun Microsystems, Inc.', N'The OpenJade Group''s SGML and XML parsing tools - platfowrm independent files, / filesystem', N'SunPackage', abc, 83)
I need to split this on commas, but NOT the ones within N' ... ' substrings.
I managed to extract all the content of N' ... ' strings with this:
N\'(.*?)(?:\',|\)|\'\))
But that does not split on commas "3592, -1, 7" and the like, while I cannot split on commas separately because that breaks up N' ... ' strings with commas. The ultimate goal is having all fields split on commas EXCEPT the ones within N' ... ' strings (i.e. N'.. , ..' should be a complete field too).

given_string.scan(/(?:(?:N'.*?')|[^,])+/)
gives:
[
  "(3592",
  " -1",
  " 7",
  " N'SUNWopensp-root'",
  " N'1.5,REV=10.0.3.2004.12.15.14.19'",
  " N'Sun Microsystems, Inc.'",
  " N'The OpenJade Group''s SGML and XML parsing tools - platfowrm independent files",
  " / filesystem'",
  " N'SunPackage'",
  " abc",
  " 83)"
]
This looks unusual because the fields keep their spaces and parentheses, and the doubled apostrophe in Group''s closes the N'...' field early, so that field is split at its inner comma. The question does not say how these cases should be handled, so this is what I give. If this is not exactly what you want, blame the sloppiness of the question.
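A possible variant of the scan above, sketched under the assumption that a doubled apostrophe ('') is the only escape used inside N'...' literals. The names row, field_re and fields are made up for this illustration; the parentheses and surrounding spaces are stripped separately.

```ruby
# Sample row from the question.
row = "(3592, -1, 7, N'SUNWopensp-root', N'1.5,REV=10.0.3.2004.12.15.14.19', " \
      "N'Sun Microsystems, Inc.', N'The OpenJade Group''s SGML and XML parsing " \
      "tools - platfowrm independent files, / filesystem', N'SunPackage', abc, 83)"

# A field is a run of N'...' literals (where '' is an escaped apostrophe)
# and/or any characters other than a comma.
field_re = /(?:N'(?:[^']|'')*'|[^,])+/

# Drop the outer parentheses, scan for fields, trim surrounding spaces.
inner  = row.sub(/\A\(/, "").sub(/\)\z/, "")
fields = inner.scan(field_re).map(&:strip)

puts fields
```

The N'...' alternative is tried first at every position, so commas inside a literal are consumed before the plain [^,] alternative can stop at them.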

Since that is close to CSV format, here's one way to parse it.
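require 'csv'
# remove the parens and the N prefixes
csv = str.gsub(/^\(|\)$/, "").gsub(/, N/, ",")
# options are passed as keyword arguments (a positional hash fails on Ruby 3)
CSV.parse_line(csv, quote_char: "'")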
Output:
[
"3592",
" -1",
" 7",
"SUNWopensp-root",
"1.5,REV=10.0.3.2004.12.15.14.19",
"Sun Microsystems, Inc.",
"The OpenJade Group's SGML and XML parsing tools - platfowrm independent files, / filesystem",
"SunPackage",
" abc",
" 83"
]
Note: This is the only solution that handles the doubled apostrophe correctly.

You already extracted the N' fields. Now you can gsub each of them into a placeholder such as X, then split by comma and substitute the X's back with your N' fields. It's not the cleanest solution, but it works.
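A minimal sketch of that placeholder idea, assuming the token __Q<n>__ never occurs in the real data and that '' is the only escape inside N'...'. The row here is a shortened, made-up version of the one in the question, and the names quoted, masked and fields are illustrative only.

```ruby
# Shortened sample row in the same shape as the question's data.
row = "(3592, -1, N'1.5,REV=10', N'Sun Microsystems, Inc.', " \
      "N'Group''s tools, / filesystem', abc, 83)"

# 1. Replace every N'...' literal with a numbered placeholder and remember it.
quoted = []
masked = row.gsub(/N'(?:[^']|'')*'/) do |literal|
  quoted << literal
  "__Q#{quoted.size - 1}__"
end

# 2. Split the masked row on commas, then 3. put the literals back.
fields = masked.delete_prefix("(").delete_suffix(")").split(",").map do |field|
  field.strip.sub(/\A__Q(\d+)__\z/) { quoted[Regexp.last_match(1).to_i] }
end

puts fields
```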


Parse multiline text with pattern

here is a little example:
02-09-17 1:01 PM - Some User (Add comments)
Hello,
How are you?
Regards,
02-09-17 3:29 PM - Another User (Add comments)
Hey,
Thanks, all is fine.
Some another text here.
02-09-17 4:30 AM - Just a User (Add comments)
some text
with
multiline
I want to parse and process these three comments. What is the best way to do this?
I tried a regex like this - http://www.rubular.com/r/k1CHJ1STTD but have problems with the /m flag. Without the multiline flag, the regex can't catch the "body" of the comment.
Also tried to split by regex:
text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/)
=> ["",
"02-09-17 1:01 PM - Some User (Add comments)",
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n",
"02-09-17 3:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 4:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n",
"02-09-17 5:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 6:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n"]
But this is not a convenient solution.
Ideally I want to get regex captures with three or two group matches, for example:
1. 02-09-17 1:01 PM
2. Some User (Add comments)
3. Hello,
How are you?
Regards,
for each comment, or, Array of comments:
[['02-09-17 1:01 PM - Some User (Add comments) Hello,
How are you?
Regards,'],[...]]
Any ideas? Thanks.
You can keep it simple using two splits (one for the whole string and one for each block):
text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }
You can also use the scan method, but it's a little more fiddly:
text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)
slice_before might be easier to understand than a huge scan, and it has the advantage of keeping the pattern (split removes it)
data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block|
  time, user = block.shift.strip.split(' - ')
  [time, user, block.join.strip]
end
p data
# [["02-09-17 1:01 PM",
# "Some User (Add comments)",
# "Hello,\n\nHow are you?\n\nRegards,"],
# ["02-09-17 3:29 PM",
# "Another User (Add comments)",
# "Hey,\n\nThanks, all is fine.\n\nSome another text here."],
# ["02-09-17 4:30 AM",
# "Just a User (Add comments)",
# "some text\nwith\nmultiline"]]
You can use this regular expression:
(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$)
(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) matches the first group, the date and time. The date must consist of three two-digit numbers separated by dashes, followed by the time with AM/PM.
(.*?)\r?\n((?:.|\r?\n)+?) matches the username up to the first line break (\r?\n) as the second group. After that, anything, including line breaks, is matched as the third group, the comment.
On its own this won't work, because it would treat everything from the beginning of the first comment up to the end of the file as one comment. The pattern therefore has to stop at the next date/time header. Simply repeating the date/time pattern after the non-greedy comment would consume the next header as part of the current match, so the next match could not start there and every second comment would be skipped. To circumvent this, you can use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). It checks that the next date/time header follows, but does not include it in the match. The last comment must then end at the end of the string, $.
You need to use the global flag /g but mustn't use the multi-line flag /m, because the comment spans multiple lines and, with /m, $ would also match at the end of every line.
Here is a live example: https://regex101.com/r/o63GQE/2
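The pattern above is written for a PCRE-style engine. A rough Ruby adaptation might look like the sketch below. Two details differ and are my adjustments, not part of the answer: in Ruby, $ matches at the end of every line, so \z is used for the end of the input, and Ruby's /m flag makes "." match line breaks, so the user name is matched with [^\r\n]*. The names header, comment_re and comments are illustrative.

```ruby
# Two comments in the question's format (blank lines separate the parts).
text = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)

  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)

  Hey,

  Thanks, all is fine.
TEXT

header = /\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} [AP]M/

# Groups: 1 = date/time, 2 = user, 3 = body. The lookahead stops the lazy
# body right before the next header line, or at the end of the input (\z).
comment_re = /^(#{header}) - ([^\r\n]*)\r?\n(.*?)(?=\r?\n#{header} - |\z)/m

comments = text.scan(comment_re).map { |time, user, body| [time, user, body.strip] }

p comments
```

This assumes no comment body contains a line that itself looks like a date/time header.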

combine pipe delimited lines of text in multiple lines with vbscript?

I've got a text file of output that looks essentially like this:
SMITHERSON, SMITH|00012345|15-Jan-1999|000885340
619649339|29-Sep-2015 00:09:30|Black|JOHNERSON, JOHN
00067890|02-Dec-1996|000490365|620094551
29-Sep-2015 23:06:01|Green|DAVISON, DAVE|00086543|06-Jun-2001|000938585
226438332|28-Sep-2015 00:12:12|Yellow
There are seven pieces of data per record. They are always in the correct order, but unfortunately they run together and wrap onto different lines. Each line ends with a carriage return + line feed, and there is no pipe delimiter at the end of a line. An individual piece of data is never split over multiple lines. I'm having a hard time explaining, so here's another example:
DATA 1|DATA 2|DATA 3
DATA 4
DATA 5|DATA 6|DATA 7
DATA 1|DATA 2|DATA 3|DATA 4
DATA 5|DATA 6|DATA 7
etc...
They will have spaces between them, but each piece of data will always stay on its own line.
And I'm trying to turn it into this:
SMITHERSON, SMITH|00012345|15-Jan-1999|000885340|619649339|29-Sep-2015 00:09:30|Black
JOHNERSON, JOHN|00067890|02-Dec-1996|000490365|620094551|29-Sep-2015 23:06:01|Green
DAVISON, DAVE|00086543|06-Jun-2001|000938585|226438332|28-Sep-2015 00:12:12|Yellow
DATA 1|DATA 2|DATA 3|DATA 4|DATA 5|DATA 6|DATA 7
DATA 1|DATA 2|DATA 3|DATA 4|DATA 5|DATA 6|DATA 7
etc.
Seven pieces of data per line, still separated by the '|' so that another piece of software can read them correctly.
I am spending about one hour every day correcting the text files by hand, so I've been trying to find an example I can work from to do this for a while but have not had any luck wrapping my head around this.
This code works on your sample text; I have not tested it on big files.
It will replace line feeds with the delimiter, then convert the entire file into one big array:
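Set fso = CreateObject("Scripting.FileSystemObject")
Set input = fso.OpenTextFile("input.txt", 1)          ' 1 = ForReading
Set output = fso.OpenTextFile("output.txt", 2, True)  ' 2 = ForWriting, create if missing
Dim data: data = input.ReadAll
input.Close()
data = Replace(data, vbCrLf, "|")   ' line breaks become delimiters
data = Split(data, "|")
' Stop at the last complete group of seven; a trailing line break
' in the file leaves one extra empty element at the end of the array.
For i = 0 To UBound(data) - 6 Step 7
    output.WriteLine data(i) & "|" & data(i+1) & "|" & data(i+2) & "|" & data(i+3) & "|" & data(i+4) & "|" & data(i+5) & "|" & data(i+6)
Next
output.Close()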
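For comparison only, the same regrouping idea (treat line breaks as delimiters, then emit the fields seven at a time) can be sketched in Ruby. This is not a replacement for the VBScript above; the sample string and the names raw, fields and records are assumptions made for the illustration.

```ruby
# First lines of the question's sample, with CR+LF line endings.
raw = "SMITHERSON, SMITH|00012345|15-Jan-1999|000885340\r\n" \
      "619649339|29-Sep-2015 00:09:30|Black|JOHNERSON, JOHN\r\n" \
      "00067890|02-Dec-1996|000490365|620094551\r\n" \
      "29-Sep-2015 23:06:01|Green\r\n"

# Line breaks and pipes both act as field separators.
fields = raw.split(/\r?\n|\|/)

# Regroup into records of seven fields and join each with pipes again.
records = fields.each_slice(7).map { |record| record.join("|") }

puts records
```

String#split drops trailing empty strings, so the final line break does not produce an extra empty field here.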
Untested, but something like this might do it. (Essentially it copies input to output as a stream, but newlines in the input are converted to pipe characters and every seventh pipe in the output is converted to a newline)
Set fs = CreateObject("Scripting.FileSystemObject")
Set f = fs.OpenTextFile("D:\data\thefile.txt", 1)
Set o = fs.OpenTextFile("D:\data\combined.txt", 2, True)
pipecount = 0
Do While f.AtEndOfFile <> True
    If f.AtEndOfLine = True Then
        c = f.Read(2) ' Skip the CR+LF
        c = "|" ' and pretend we got a pipe character
    Else
        c = f.Read(1)
    End If
    If c = "|" Then
        pipecount = pipecount + 1
        If pipecount = 7 Then
            pipecount = 0
            o.WriteLine()
        Else
            o.Write("|")
        End If
    Else
        o.Write(c)
    End If
Loop ' a Do While block is closed with Loop, not End While
f.Close()
o.Close()

Is there a SnakeYaml DumperOptions setting to avoid double-spacing output?

I seem to see double-spaced output when parsing/dumping a simple YAML file with a pipe-text field.
The test is:
public void yamlTest()
{
DumperOptions printOptions = new DumperOptions();
printOptions.setLineBreak(DumperOptions.LineBreak.UNIX);
Yaml y = new Yaml(printOptions);
String input = "foo: |\n" +
" line 1\n" +
" line 2\n";
Object parsedObject = y.load(new StringReader(input));
String output = y.dump(parsedObject);
System.out.println(output);
}
and the output is:
{foo: 'line 1

line 2

'}
Note the extra space between line 1 and line 2, and after line 2 before the end of the string.
This test was run on Mac OS X 10.6, java version "1.6.0_29".
Thanks!
Mark
In the original string you use literal style, which is indicated by the '|' character. When you dump the text, single-quoted style is used, in which a single line break is folded into a space rather than kept as '\n'. That is why each line break is written with an extra empty line.
Try to set different styles in DumperOptions:
// other styles: FOLDED, DOUBLE_QUOTED
printOptions.setDefaultScalarStyle(DumperOptions.ScalarStyle.LITERAL);
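The folding rule is part of YAML itself, so it can be observed with any conforming parser. The sketch below uses Ruby's standard yaml library (Psych) rather than SnakeYAML, purely to illustrate how the three inputs are read; the variable names are made up.

```ruby
require "yaml"

# Single-quoted style: a single line break inside the scalar is folded into a space.
folded = YAML.load("foo: 'line 1\n  line 2'")

# Single-quoted style: an empty line inside the scalar stands for a real newline.
kept = YAML.load("foo: 'line 1\n\n  line 2'")

# Literal style (|): line breaks are kept exactly as written.
literal = YAML.load("foo: |\n  line 1\n  line 2\n")

p folded["foo"]
p kept["foo"]
p literal["foo"]
```

This is why a dumper that chooses single-quoted style has to emit an empty line for every newline in the value.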

Make level the printed symbols in the output

Consider this Ruby code:
puts "*****"
puts " *"
puts " "
puts "*****"
puts " *"
My Output is like this:
*****
*
*****
*
Why the heck doesn't a whitespace character take up the same space as a * character in SciTE?
I've tried it in Eclipse with Java and it works just fine.
Proportional fonts have characters of varying widths, ruining space-based alignment.
Switch to a monospace font (e.g., Courier) so all characters are the same size and it'll work.
To make it work in SciTE, add the following line to your SciTEUser.properties file:
style.errorlist.32=$(font.monospace)

Strange behavior splitting arrays with Ruby (v1.9.2)

I am trying to handle an array with Ruby v1.9.2, but it shows some strange behavior.
It is best explained with examples:
CASE 1 TEST
#test1 = "image/bmp, image/gif, image/jpg".split(',')
Debug #test1:
---
- image/bmp # why this?!
- " image/gif"
- " image/jpg"
CASE 2 TEST
#test2 = ", image/bmp, image/gif, image/jpg".split(',')
Debug #test2:
---
- "" # why this?!
- " image/bmp"
- " image/gif"
- " image/jpg"
WHAT I NEED
Note: I could use the CASE 2 TEST result, but I would like to do this properly.
Debug that I would like to have:
---
- " image/bmp"
- " image/gif"
- " image/jpg"
In the test case 1 there is no space before "image/bmp" in the result because there is no space before "image/bmp" in the original string.
In the test case 2 there is an empty string at the beginning because the string starts with a comma, and for every separator in the string there is a string in the resulting array, containing what comes before that separator (which in this case means the empty string).
If you want the result you've shown, you could just add a space (but no comma) before "image/bmp" in the source string. Alternatively you could split by /, */ and then add one space before each string with map. Though frankly I don't get why you want a space before each string.
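The behavior described above, and the split-then-pad alternative, can be sketched like this. The variable names are illustrative only.

```ruby
# A leading separator yields a leading empty string.
with_leading_comma = ", image/bmp, image/gif, image/jpg".split(",")

# Splitting on a comma plus any following spaces removes the padding.
types = "image/bmp, image/gif, image/jpg".split(/, */)

# Re-add exactly one space in front of every entry, if that is really wanted.
padded = types.map { |type| " #{type}" }

p with_leading_comma
p types
p padded
```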
>> ", image/bmp, image/gif, image/jpg".split(/\s*,\s*/).select{|x| x!=""}
=> ["image/bmp", "image/gif", "image/jpg"]
