gsub issue with awk (gawk) - windows

I need to search a text file for a string, and make a replacement that includes a number that increments with each match.
The string to be "found" could be a single character, or a word, or a phrase.
The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I placed the awk script in a file named "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I use awk like this:
awk -f cmd.awk data.txt
In this case, the output is as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:
/i/ {sub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it doesn't include the "i" in "time" or "their".
So, I tried "gsub" instead of "sub" like:
/i/ {gsub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The desired output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: The number won't always begin with "1" so I might use awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the output:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the number in the replacement will not always be inside parentheses. And the replacement will not always include the matched string (actually it would be quite rare).
The other problem I am having with this is...
I want to use an awk-variable (not environment variable) for the "search string", so I can specify it on the awk command line.
For example:
1) I placed the awk script in a file named "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the output:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ expression?

awk version:
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
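This wasn't spelled out above, so as a sketch (recreating the sample data.txt inline): FS=i splits every line at each occurrence of the search string, and OFS=i rejoins the fields, so prefixing fields 2 through NF with the counter drops it in right after every match.

```shell
# FS=i makes every "i" a field boundary; assigning to $i rebuilds the record
# with OFS=i, so each field after the first gets "(" ++k ")" prepended, i.e.
# the counter lands immediately after each match. k persists across lines.
printf 'Now is the time\nfor all good men\nto come to the\naid of their party.\n' |
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
```

Lines with no match have NF==1, so the loop body never runs and they pass through untouched.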

gensub() sounds ideal here: it allows you to replace the Nth match, so one apparent solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work, though, if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, lacking perl's "s///e" evaluation feature, and its stateful regex /g modifier (as used by Steve) the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:
BEGIN {
    if (j=="") j=1
    if (a=="") a="f"
}
match($0,a) {
    str=$0; newstr=""
    do {
        newstr=newstr substr(str,1,RSTART-1)  # head
        mm=substr(str,RSTART,RLENGTH)         # extract match
        sub(a,a"("j++")",mm)                  # replace
        newstr=newstr mm
        str=substr(str,RSTART+RLENGTH)        # tail
    } while (match(str,a))
    $0=newstr str
}
{print}
This uses match() as an expression instead of a // pattern so you can use a variable. (You can also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line.
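For example, with the chunking script above saved as cmd.awk and the sample data recreated (file names as in the question; note this script uses j++, so you pass the first number you actually want, not one less):

```shell
# Write the script from the answer to cmd.awk, recreate the sample data.txt,
# then supply both the search string (a) and the starting count (j) with -v.
cat > cmd.awk <<'EOF'
BEGIN {
    if (j=="") j=1
    if (a=="") a="f"
}
match($0,a) {
    str=$0; newstr=""
    do {
        newstr=newstr substr(str,1,RSTART-1)  # head
        mm=substr(str,RSTART,RLENGTH)         # the match itself
        sub(a,a"("j++")",mm)                  # replace inside the match
        newstr=newstr mm
        str=substr(str,RSTART+RLENGTH)        # tail
    } while (match(str,a))
    $0=newstr str
}
{print}
EOF
printf 'Now is the time\nfor all good men\nto come to the\naid of their party.\n' > data.txt
awk -f cmd.awk -v a=i -v j=27 data.txt
# first line printed: Now i(27)s the ti(28)me
```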
gawk supports \y which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word; just take care to add extra escapes from a unix command line (I'm not quite sure what Windows might require or permit).
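As an illustration (not from the original answer), here is the doubled-escape gotcha with a whole-word variable match; the [&] replacement is just a marker showing what matched:

```shell
# gawk-specific: \< and \> anchor word start/end. The backslashes must be
# doubled because -v values undergo an extra round of escape processing
# before the regex engine ever sees them.
printf 'the theme of their thesis\n' |
gawk -v a='\\<the\\>' '{ gsub(a, "[&]") } 1'
# only the standalone word matches: [the] theme of their thesis
```

With single backslashes, gawk warns about the unknown escape \< and the variable ends up holding plain <the>, which matches nothing here.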
Limited gensub() version
As referenced above:
match($0,a) {
    idx=1; str=$0
    do {
        prev=str
        str=gensub(a,a"(" j ")",idx++,prev)
    } while (str!=prev && j++)
    $0=str
}
The problems here are:
if you replace substring "i" with substring "k" or "k(1)" then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.
if you replace substring "i" with substring "ii" or "ii(i)" then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match)
Dealing with both conditions robustly is not worth the code.

I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.
To include a count of the letter i beginning at 26, try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
This could also be a shell var:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To include a count of specific words, add word boundaries (i.e. \b) around the words; try:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.

Related

extract data between similar patterns

I am trying to use sed to print the contents between two patterns including the first one. I was using this answer as a source.
My file looks like this:
>item_1
abcabcabacabcabcabcabcabacabcabcabcabcabacabcabc
>item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
>item_3
cdecde
>item_4
defdefdefdefdefdefdef
I want it to start searching from item_2 (inclusive) and finish at the next occurring > (exclusive). So my code is sed -n '/item_2/,/>/{/>/!p;}'.
The result wanted is:
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
but I get it without item_2.
Any ideas?
Using awk, split input by >s and print part(s) matching item_2.
$ awk 'BEGIN{RS=">";ORS=""} /item_2/' file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
I would go for the awk method suggested by oguz for its simplicity. Now, if you are interested in a sed way, out of curiosity, you could fix what you have already tried with a minor change:
sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file
The empty regex // recalls the previous regex, which is handy here to avoid duplicating /item_2/. But keep in mind that // is actually dynamic: it recalls the latest regex evaluated at runtime, which is not necessarily the closest regex on its left (although it often is). Depending on the program flow (branching, address ranges), the content of the same // can change and... actually here we have an interesting example! (and I'm not saying that because it's my baby ^^)
On a line where /^>item_2/ matches, the s/.// command is executed and the latest regex before // becomes /./, so the following address range is equivalent to /./,/>/.
On a line where /^>item_2/ does not match, the latest regex before // is /^>item_2/ so the range is equivalent to /^>item_2/,/>/.
To avoid confusion here as the effect of // changes during execution, it's important to note that an address range evaluates only its left side when not triggered and only its right side when triggered.
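To make the two cases above concrete, here is the command run on a reduced version of the sample file (shortened sequences, same structure):

```shell
# On ">item_2" the s/.// fires, so the // that opens the range is /./ and the
# range starts; on every other line // is still /^>item_2/ and nothing starts.
printf '>item_1\nabc\n>item_2\nbcd\n>item_3\ncde\n' |
sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }'
# prints "item_2" then "bcd"
```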
This might work for you (GNU sed):
sed -n ':a;/^>item_2/{s/.//;:b;p;n;/^>/!bb;ba}' file
Turn off implicit printing -n.
If a line begins with >item_2, remove the first character, print the line, and fetch the next line.
If that line does not begin with a >, repeat the last two instructions.
Otherwise, repeat the whole set of instructions.
If there will always be only one line following >item_2, then:
sed '/^>item_2/!d;s/.//;n' file

awk - skip lines of subdomains if domain already matched

Let's assume there is an already ordered list of domains like:
tld.aa.
tld.aa.do.notshowup.0
tld.aa.do.notshowup.0.1
tld.aa.do.notshowup.0.1.1
tld.aa.do.notshowup.too
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.xxxxx.donotshowup
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
which later acts as a blacklist.
Per a specific requirement, all lines with a trailing '.' indicate that all deeper subdomains of that specific domain should not appear in the blacklist itself... so the desired output of the example above would/should be:
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
I currently run this in a loop (pure bash + heavy use of bash builtins to speed things up)... but as the list grows it now takes quite long to process around 562k entries.
Shouldn't it be easy for awk (or maybe sed) to do this? Any help is really appreciated (I already tried some things in awk but somehow couldn't get it to display what I want...).
Thank you!
If the . lines always come before the lines to ignore, this awk should do:
$ awk '{for (i in a) if (index($0,i) == 1) next}/\.$/{a[$0]=1}1' file
tld.aa.
tld.bb.showup
tld.aaaaa.showup
tld.xxxxx.
tld.yougettheidea.dontyou
tld.yougettheidea.dontyou.thankyou
/\.$/{a[$0]=1} adds lines with trailing dot to an array.
{for (i in a) if (index($0,i) == 1) next} searches for the current line in one of these indexed entries and skips further processing if found (next).
If the file is sorted alphabetically and no subdomains end with a dot, you don't even need an array, as @Corentin Limier suggests:
awk 'a{if (index($0,a) == 1) next}/\.$/{a=$0}1' file
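A quick check of that single-variable version on a shortened sample (it relies on the same assumption: every "domain." line precedes its subdomains):

```shell
# a holds the most recent trailing-dot domain; any later line that starts
# with that prefix is skipped via next.
printf 'tld.aa.\ntld.aa.do.notshowup.0\ntld.bb.showup\ntld.xxxxx.\ntld.xxxxx.donotshowup\n' |
awk 'a{if (index($0,a) == 1) next}/\.$/{a=$0}1'
# prints: tld.aa. / tld.bb.showup / tld.xxxxx.
```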

How can I get only special strings (by condition) from a file?

I have a huge text file with strings in a special format. How can I quickly create another file containing only the strings that match my condition?
for example, file contents:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET
http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithzombie.jpg"
and I only need the strings with "myRule" and "cat"?
I think it should be perl, or bash, but it doesn't matter.
Thanks a lot, sorry for noob question.
Is it correct that each entry is two lines long? Then you can use sed:
sed -n '/myRule/ {N }; /myRule.*cat/ {p}'
The first rule appends the next line to the pattern space when myRule matches.
The second rule tries to match myRule followed by cat in the pattern space; if found, it prints the pattern space.
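A sketch of those two rules on a trimmed sample (hostnames invented for brevity):

```shell
# When myRule matches, N pulls the URL line into the pattern space, so the
# second address can test both lines at once for myRule followed by cat.
printf '[rule="myRule"]"GET\nhttp://example.com/cat.jpg"\n[rule="myRule"]"GET\nhttp://example.com/dog.jpg"\n' |
sed -n '/myRule/ {N}; /myRule.*cat/ {p}'
# only the cat entry (both of its lines) is printed
```

Note that in GNU sed the . in /myRule.*cat/ also matches the embedded newline, which is what lets the pattern span both lines.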
If your file is truly huge, to the extent that it won't fit in memory (although files up to a few gigabytes are fine on modern computer systems), then the only way is to either change the record separator or read the lines in pairs.
This shows the first way, and assumes that the second line of every pair ends with a double quote followed by a newline:
perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt
and this is the second
perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt
When given your sample data as input, both methods produce this output
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"

Need a guide to basic command-line awk syntax

I have read several awk tutorials and seen a number of questions and answers on here and the problem is that I'm seeing a LOT of variety in how people do their awk 1-liners and it has really overcomplicated it in my mind.
So I see things like this:
awk '/pattern/ { print }'
awk '/pattern/ { print $0 }'
awk '/pattern/ { print($0) }'
awk '/pattern/ { print($0); }'
awk 'BEGIN { print }'
awk '/pattern/ BEGIN { print };'
Sometimes I get errors and sometimes not but because I'm seeing so many different phrasings I'm really having trouble fixing syntax errors because I can't figure out what's allowed and what isn't.
Can someone explain this? Does print require parens or not? Are semi-colons required or not? Is BEGIN required or not? What happens when you start an awk script with a /pattern/, and/or just pass it the name of a function like print on its own?
One at a time:
Can someone explain this?
Yes.
Does print require parens or not?
print, like return, is a builtin, not a function, and as such does not use parens at all. When you see print("foo"), the parens are associated with the string "foo"; they are NOT in any way part of the print command, despite how it looks. It might be clearer (but still not useful in this case) to write it as print ("foo").
Are semi-colons required or not?
Not when the statements are on separate lines. As in shell, semicolons are required to separate statements that occur on a single line.
Is BEGIN required or not?
No. Note that BEGIN is a keyword that represents the condition that exists before the first input file is opened for reading so BEGIN{print} will just print a blank line since nothing has been read to print. Also /pattern/ BEGIN is nonsense and should produce a syntax error.
What happens when you start an awk script with a /pattern/, and/or just pass it the name of a function like print on its own?
An awk script is made up of condition { <action> } sections, with the default condition being TRUE and the default action being print $0. So awk '/pattern/' means: if the regexp "pattern" exists in the current record, invoke the default action, which is to print that record; and awk '{ print }' means the default condition of TRUE applies, so execute the specified action and print the current record. Note also that print by default prints the current record, so print $0 is synonymous with just print.
If you are considering starting to use awk, get the book Effective Awk Programming by Arnold Robbins and at least read the first chapter or two.
Function calls require (). Statements do not (but appear to allow them).
print and printf are statements, so they do not require () (but support it: "The entire list of items may be optionally enclosed in parentheses.")
From print we also find out that
The simple statement ‘print’ with no items is equivalent to ‘print $0’: it prints the entire current record.
So we now know that the first three statements are identical.
From Actions we find out that.
An action consists of one or more awk statements, enclosed in curly braces (‘{…}’).
and that
The statements are separated by newlines or semicolons.
Which tells us that the semicolon is a "separator" and not a terminator so we don't need one at the end of an action so we now know the fourth is also identical.
BEGIN is a special pattern and that
[a] BEGIN rule is executed once only, before the first input record is read.
So the fifth is different because it operates once at the start and not on every line.
And the last is a syntax error because it has two patterns next to each other without an intervening action or separator.
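The equivalences above can be verified mechanically; the first four programs, plus the bare pattern, all print exactly the same lines:

```shell
# Each variant prints the lines matching /foo/; the bodies differ only
# cosmetically, and the bare pattern falls back to the default action.
for prog in '/foo/ { print }' '/foo/ { print $0 }' '/foo/ { print($0) }' '/foo/ { print($0); }' '/foo/'
do
    printf 'foo\nbar\nfoobar\n' | awk "$prog"
done
# each iteration prints "foo" and "foobar"
```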
All of those awk commands (except the last 2) can be shortened to:
awk '/pattern/' file
since printing the record is the default action in awk.
Semicolon is optional just before }.
You cannot place BEGIN after /pattern/

Remove line break if line does not start with KEYWORD

I have a flat file with lines that look like
KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....
How do I go about removing the linebreak so that
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
turns into
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
This is in a HP-UNIX environment and I can move the file to another system (windows box with powershell and ruby installed).
I don't know what tools you are using, but you can use this regex to match every \n (or maybe \r) that isn't followed by KEYWORD, so you can replace it with a space and you would have it.
Regex: \r(?!KEYWORD) (With global modifier)
Ruby's Array has a nice method called slice_before that it inherits from Enumerable, which comes to the rescue here:
require 'pp'
text = 'KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....'
pp text.split("\n").slice_before(/^KEYWORD/).map{ |a| a.join(' ') }
=> ["KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING",
"KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING",
"KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE",
"KEYWORD|....."]
This code just splits your text on line breaks, then uses slice_before to break the resulting array into sub-arrays, one for each block of text starting with /^KEYWORD/. Then it walks through the resulting sub-arrays, joining them with a single space. Any line that wasn't pre-split will be left alone. Ones that were broken are rejoined.
For real use you'd probably want to replace pp with a regular puts.
As for moving the code to Windows with Ruby, why? Install Ruby on HP-Unix and run it there. It's a more natural fit.
This short awk one-liner should do the job:
awk '/^KEYWORD/ && NR>1 {print ""} {printf "%s",$0} END{print ""}' file
(Using printf "%s",$0 keeps % characters in the data from being treated as format specifiers; the NR>1 guard avoids a leading blank line, and the END rule supplies the final newline.)
This might work for you (GNU sed):
sed ':a;$!{N;/\n.*|/!{s/\n/ /;ba}};P;D' file
Keep two lines in the pattern space and, if the second line doesn't contain a |, replace the newline with a space; repeat until it does or the end of the file is reached.
This assumes the last field is the one that overflows; otherwise match on the KEYWORD instead:
sed ':a;$!{N;/\nKEYWORD/!{s/\n/ /;ba}};P;D' file
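As a sketch, the KEYWORD variant applied to a minimal sample (field values invented for the demo):

```shell
# N appends the next input line; while that line does not start a KEYWORD
# record, the embedded newline becomes a space and the loop repeats.
# P prints up to the first newline, D restarts with the remainder.
printf 'KEYWORD|a|1\nKEYWORD|b|REALLY\nLONG TAIL\nKEYWORD|c\n' |
sed ':a;$!{N;/\nKEYWORD/!{s/\n/ /;ba}};P;D'
# the broken record is rejoined: KEYWORD|b|REALLY LONG TAIL
```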
Powershell way:
[System.IO.File]::ReadAllText( "c:\myfile.txt" ) -replace "`r`n(?!KEYWORD)", ' '
You can use sed or awk (preferred) for this:
sed -n 's|\r||g;$!{1{x;d};H};${H;x;s|\n\(KEYWORD\)|\r\1|g;s|\n||g;s|\r|\n|g;p}' file.txt
awk 'BEGIN{ORS="";}NR==1{print;next;}/^KEYWORD/{print"\n";print;next;}{print;}' file.txt
Note: write each command (sed, awk) on a single line.
