Remove line break if line does not start with KEYWORD - ruby

I have a flat file with lines that look like
KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....
How do I go about removing the linebreak so that
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
turns into
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
This is in a HP-UNIX environment and I can move the file to another system (windows box with powershell and ruby installed).

I don't know what tools you are using, but you can use this regex to match every \n (or maybe \r) that isn't followed by KEYWORD, then replace it with a space and you're done.
Regex: \r(?!KEYWORD) (with the global modifier)
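If the file does end up on a box with Ruby, the same negative-lookahead idea carries straight over to gsub (a minimal sketch, assuming \n line endings rather than \r):

```ruby
# Replace every newline that is NOT followed by "KEYWORD" with a space,
# which rejoins records that were wrapped onto a second line.
text = "KEYWORD|A|1|REALLY\nLONG TAIL\nKEYWORD|B|2|OK"
joined = text.gsub(/\n(?!KEYWORD)/, ' ')
puts joined
```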

Ruby's Array has a nice method called slice_before that it inherits from Enumerable, which comes to the rescue here:
require 'pp'
text = 'KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....'
pp text.split("\n").slice_before(/^KEYWORD/).map{ |a| a.join(' ') }
=> ["KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING",
"KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING",
"KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE",
"KEYWORD|....."]
This code just splits your text on line breaks, then uses slice_before to break the resulting array into sub-arrays, one for each block of text starting with /^KEYWORD/. Then it walks through the resulting sub-arrays, joining them with a single space. Any line that wasn't pre-split will be left alone. Ones that were broken are rejoined.
For real use you'd probably want to replace pp with a regular puts.
As for moving the code to Windows with Ruby, why? Install Ruby on HP-Unix and run it there. It's a more natural fit.
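For a file-to-file run the same pipeline looks roughly like this (a self-contained sketch that writes its own sample data to a temp file; in practice you'd point File.readlines at your real extract and write wherever you like):

```ruby
require 'tempfile'

# Rejoin wrapped records: group lines by the KEYWORD that starts each record,
# then emit each group as a single space-joined line.
Tempfile.create('flatfile') do |f|
  f.write("KEYWORD|A|1|REALLY\nLONG TAIL\nKEYWORD|B|2|OK\n")
  f.flush
  fixed = File.readlines(f.path, chomp: true)
              .slice_before(/^KEYWORD/)
              .map { |chunk| chunk.join(' ') }
  puts fixed
end
```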

This short awk one-liner should do the job (the NR>1 guard avoids a leading blank line, and the END block restores the final newline):
awk 'NR>1 && /^KEYWORD/{print ""}{printf "%s",$0}END{print ""}' file

This might work for you (GNU sed):
sed ':a;$!{N;/\n.*|/!{s/\n/ /;ba}};P;D' file
Keep two lines in the pattern space and, if the second line doesn't contain a |, replace the newline with a space; repeat until it does or the end of the file is reached.
This assumes the last field is the one that overflows; otherwise key off the KEYWORD, like so:
sed ':a;$!{N;/\nKEYWORD/!{s/\n/ /;ba}};P;D' file

Powershell way:
[System.IO.File]::ReadAllText( "c:\myfile.txt" ) -replace "`r`n(?!KEYWORD)", ' '

You can use sed or awk (preferred) for this:
sed -n 's|\r||g;$!{1{x;d};H};${H;x;s|\n\(KEYWORD\)|\r\1|g;s|\n||g;s|\r|\n|g;p}' file.txt
awk 'BEGIN{ORS="";}NR==1{print;next;}/^KEYWORD/{print"\n";print;next;}{print;}' file.txt
Note: Write each command (sed, awk) in one line

Related

extract data between similar patterns

I am trying to use sed to print the contents between two patterns including the first one. I was using this answer as a source.
My file looks like this:
>item_1
abcabcabacabcabcabcabcabacabcabcabcabcabacabcabc
>item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
>item_3
cdecde
>item_4
defdefdefdefdefdefdef
I want it to start searching from item_2 (inclusive) and finish at the next occurring > (exclusive). So my code is sed -n '/item_2/,/>/{/>/!p;}'.
The result wanted is:
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
but I get it without item_2.
Any ideas?
Using awk, split input by >s and print part(s) matching item_2.
$ awk 'BEGIN{RS=">";ORS=""} /item_2/' file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
I would go for the awk method suggested by oguz for its simplicity. Now if you are interested in a sed way, out of curiosity, you could fix what you have already tried with a minor change :
sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file
The empty regex // recalls the previous regex, which is handy here to avoid duplicating /item_2/. But keep in mind that // is dynamic: it recalls the latest regex evaluated at runtime, which is not necessarily the closest regex to its left (although it often is). Depending on the program flow (branching, address ranges), the content of the same // can change, and... actually, here we have an interesting example! (And I'm not saying that because it's my baby ^^)
On a line where /^>item_2/ matches, the s/.// command is executed and the latest regex before // becomes /./, so the following address range is equivalent to /./,/>/.
On a line where /^>item_2/ does not match, the latest regex before // is /^>item_2/ so the range is equivalent to /^>item_2/,/>/.
To avoid confusion here as the effect of // changes during execution, it's important to note that an address range evaluates only its left side when not triggered and only its right side when triggered.
This might work for you (GNU sed):
sed -n ':a;/^>item_2/{s/.//;:b;p;n;/^>/!bb;ba}' file
Turn off implicit printing -n.
If a line begins with >item_2, remove the first character, print the line, and fetch the next line.
If that line does not begin with a >, repeat the last two instructions.
Otherwise, repeat the whole set of instructions.
If there will always be only one line following >item_2, then:
sed '/^>item_2/!d;s/.//;n' file
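For what it's worth, the RS=">" trick from the awk answer has a near-direct Ruby analogue (a sketch over inline sample data; like the awk version it matches item_2 as a prefix, so an item_20 record would match too):

```ruby
text = ">item_1\nabc\n>item_2\nbcd\nbcd2\n>item_3\ncde\n"
# Split the input into ">"-delimited records and keep the one for item_2;
# splitting on /^>/ drops the leading ">" from each record automatically.
record = text.split(/^>/).find { |r| r.start_with?('item_2') }
print record
```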

How can I get only specific strings (by condition) from a file?

I have a huge text file with strings in a special format. How can I quickly create another file containing only the strings that match my condition?
for example, file contents:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET
http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithzombie.jpg"
and I only need the strings with "myRule" and "cat"?
I think it should be perl, or bash, but it doesn't matter.
Thanks a lot, sorry for noob question.
Is it correct that each entry is two lines long? If so, you can use sed:
sed -n '/myRule/ {N }; /myRule.*cat/ {p}'
the first rule appends the next line to the pattern space when myRule matches
the second rule tries to match myRule followed by cat in the pattern space; if found, it prints the pattern space
If your file is truly huge, to the extent that it won't fit in memory (although files up to a few gigabytes are fine on modern computer systems), then the only way is to either change the record separator or read the lines in pairs.
This shows the first way, and assumes that the second line of every pair ends with a double quote followed by a newline:
perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt
and this is the second:
perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt
When given your sample data as input, both methods produce this output:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
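If neither sed nor perl appeals, the same pair-reading idea is a few lines of Ruby (a sketch with the sample data inlined; it keeps the same two-lines-per-entry assumption):

```ruby
# Read the log two lines at a time; keep an entry only when the pair,
# taken together, matches both /myRule/ and /cat/.
lines = [
  '[2/Nov/2015][rule="myRule"]"GET',
  'http://uselesssotialnetwork.com/picturewithcat.jpg"',
  '[2/Nov/2015][rule="mySecondRule"]"GET',
  'http://anotheruselesssotialnetwork.com/picturewithdog.jpg"'
]
lines.each_slice(2) do |entry|
  text = entry.join("\n")
  puts text if text =~ /myRule/ && text =~ /cat/
end
```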

gsub issue with awk (gawk)

I need to search a text file for a string, and make a replacement that includes a number that increments with each match.
The string to be "found" could be a single character, or a word, or a phrase.
The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I placed the awk script in a file named "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I use awk like this:
awk -f cmd.awk data.txt
In this case, the output is as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:
/i/ {sub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it doesn't include the "i" in "time" or "their".
So, I tried "gsub" instead of "sub" like:
/i/ {gsub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The desired output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: The number won't always begin with "1" so I might use awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the output:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the number in the replacement will not always be inside parenthesis. And the replacement will not always include the matched string (actually it would be quite rare).
The other problem I am having with this is...
I want to use an awk-variable (not environment variable) for the "search string", so I can specify it on the awk command line.
For example:
1) I placed the awk script in a file named "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the output:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ expression?
awk version:
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
gensub() sounds ideal here, since it allows you to replace the Nth match: a plausible solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, lacking perl's "s///e" evaluation feature, and its stateful regex /g modifier (as used by Steve) the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:
BEGIN {
if (j=="") j=1
if (a=="") a="f"
}
match($0,a) {
str=$0; newstr=""
do {
newstr=newstr substr(str,1,RSTART-1) # head
mm=substr(str,RSTART,RLENGTH) # extract match
sub(a,a"("j++")",mm) # replace
newstr=newstr mm
str=substr(str,RSTART+RLENGTH) # tail
} while (match(str,a))
$0=newstr str
}
{print}
This uses match() as an expression instead of a // pattern so that you can use a variable. (You could also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line.
gawk supports \y, which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word; just take care to add extra escapes on a Unix command line (I'm not quite sure what Windows might require or permit).
Limited gensub() version
As referenced above:
match($0,a) {
idx=1; str=$0
do {
prev=str
str=gensub(a,a"(" j ")",idx++,prev)
} while (str!=prev && j++)
$0=str
}
The problems here are:
if you replace substring "i" with substring "k" or "k(1)" then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.
if you replace substring "i" with substring "ii" or "ii(i)" then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match)
Dealing with both conditions robustly is not worth the code.
I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.
To include a count of the letter i beginning at 26, try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
This could also be a shell var:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To include a count of specific words, add word boundaries (i.e. \b) around the words, try:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.
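As an aside, Ruby's gsub with a block is another way around awk's lack of a per-match hook: the block runs once per match, so the counter advances correctly both within and across lines (a sketch using the question's data and starting offset):

```ruby
j = 26
text = "Now is the time\nfor all good men\nto come to the\naid of their party.\n"
# The block is evaluated once per match, so j increments on every "i".
out = text.gsub(/i/) { "i(#{j += 1})" }
puts out
```

A variable pattern works the same way: build it with Regexp.new(a) and compute the replacement inside the block.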

in bash, remove punctuation between pattern matches?

I am struggling with a conversion of a data file to csv when there is punctuation in the title field.
I have a bash script that obtains the file and processes it, and it almost works. What gets me is when there are commas in a free text title field, which then create extra fields.
I have tried some sed examples to replace between patterns but I have not gotten any of them to work. What I want to do is work between two patterns and replace commas with either nothing or perhaps a semicolon.
Taking this string:
name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,
Replacing with this:
name:A100040,title:Oatmeal is better with raisins dates and sugar,current_balance:50000,
I should probably use "title:" and ",current_" to denote the start and end of the block where I want to make the change to avoid situations like this:
name:A100040,title:Re-title current periodicals, recent books,current_balance:50000,
So far I have not gotten the substitution to match. In this case I am using !! to make the change obvious:
teststring="name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,"
echo $teststring |sed '/title:/,/current_/s/,/!!/g'
name:A100040!!title:Oatmeal is better with raisins!! dates!! and sugar!!current_balance:50000!!
Any help appreciated.
This is one way which could undoubtedly be refined:
perl -ple 'm/(.*?)(title:.*?)(current_balance:.*)/; $save = $part = $2; $part =~ s/,/!!/g; s/$save/$part/'
First, using sed or awk to parse CSV is almost always the wrong thing to do, because they have no notion of quoted fields containing the delimiter. That said, it seems like a better approach would be to quote the fields so that your output would be:
name:"A100040",title:"Oatmeal ... , dates, and sugar",current_balance:50000
Using sed you can try: (this is fragile)
sed 's/:\([^:]*\),\([^,:]*\)/:"\1",\2/g'
If you insist on trying to parse the csv with "standard" tools and you consider perl to be standard, you could try:
perl -pe '1 while s/,([^,:]*),/ $1,/g'
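If Ruby happens to be installed alongside perl, the block form of sub keeps the edit confined to the title field, using the question's own "title:" and ",current_" delimiters (a sketch, with the same fragility if a title ever contains ",current_"):

```ruby
line = 'name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,'
# Lazily capture everything between "title:" and the first ",current_",
# then strip the commas from just that captured field.
fixed = line.sub(/title:(.*?),current_/) { "title:#{$1.delete(',')},current_" }
puts fixed
```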

Removing two consecutive line breaks

My file has a lot of line breaks, like this:
This is a line.
This is another line.
I would like to remove these, but only in cases where the first line ends with }, e.g.:
\macro{This is a line.}
This is another line.
That should become:
\macro{This is a line.}This is another line.
How can I remove the line breaks in this situation?
This is what I figured out:
$ sed -n '/}$/{h;:a;n;/^$\|}$/{H;$!ba};H;g;s#}\n*#}#g};p' input.txt
The idea behind it is:
Accumulate all consecutive empty lines and lines ending with '}'
Substitute }\n* with }
Last line needs special consideration.
You can just use an editor that supports regular expressions and do a replace in your file. Replace:
}$\n\n
with
}
If you need to do it programmatically, the same principle applies (i.e. using regex for string replacement) but the actual answer will obviously depend on language/environment.
This might work for you:
sed '$!N;s/}\n$/}/;P;D' file
if there is white space involved, try:
sed '$!N;s/}\s*\n\s*$/}/;P;D' file
or more formally:
sed '$!N;s/}[[:space:]]*\n[[:space:]]*$/}/;P;D' file
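The editor-regex recipe (}$\n\n replaced by }) also maps onto a one-line Ruby gsub if you need to do it programmatically (a sketch):

```ruby
text = "\\macro{This is a line.}\n\nThis is another line.\n"
# Delete the two consecutive line breaks that follow a closing brace.
puts text.gsub(/\}\n\n/, '}')
```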
