Finding and saving the last occurrence of a string using awk - bash

I need to find the last occurrence of a string in a plain text file (no delimiters or columns) and save its line number and the entire line in variables for later use in my script.
Then I need to check if there is an occurrence of a second string after the line we just found.
I'm unsure of how to do this; I'm a scrub at bash. I'm not sure how to save the results of awk in a variable, and I'm not sure of the logic I'd need to find the last occurrence of a string. Any advice/guidance would be amazing.

# Remember last line on which we saw "string_to_match", and the line itself
/string_to_match/ { last1 = NR; line=$0 }
# Remember last line on which we saw "second_string"
/second_string/ { last2 = NR }
# At the end of the file, if last2 was after last1, print it.
END { if (last2 > last1) print last2 }
Basically, just process each line in turn, and every time you find the first string update the last1 and line variables.
Similarly, every time you see the second string update the last2 variable.
When you reach the end of the file, last1 will be the last line on which you saw the first string. At that point you can check whether the second string was seen after that point. You can also do whatever processing you need using last1 and line.
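To get those values back into the shell, one option (just a sketch; file.txt and the shell variable names are illustrative) is to have awk print them in its END block and read the output in bash:
# print the last matching line number and the line itself, then read them into variables
read -r last_line_no last_line < <(
    awk '/string_to_match/ { last1 = NR; line = $0 }
         END { print last1, line }' file.txt
)
echo "last match on line $last_line_no: $last_line"
# check whether second_string occurs anywhere after that line
if awk -v n="$last_line_no" 'NR > n && /second_string/ { found = 1; exit }
                             END { exit !found }' file.txt
then
    echo "second_string appears after line $last_line_no"
fi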

Related

Using awk to format text

I'm having a hard time understanding how to achieve what I want using awk, and after searching for quite some time I couldn't find the solution I'm looking for.
I have an input text that looks like this:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line
I want to properly format the weird lines between ' (' and ')'. The expected output is as follows:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Looking on Stack Overflow I found this :
How to select lines between two marker patterns which may occur multiple times with awk/sed
So what I'm using now is echo $text | awk '/ \(/{flag=1;next}/\)/{flag=0}flag'
which almost works, except it filters out the non-matching lines. Here's the output produced by that command:
(Element 4)
(Element 1, span 1 to Element 5, span 4)
Does anyone know how to do this? I'm open to any suggestion, including not using awk if you know better.
Bonus points if you teach me how to remove syntax highlighting from the code blocks in my question :)
Thanks a billion times
Edit: OK, so I accepted @EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using @aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific use case.
I strongly suggest reading EdMorton's explanation; the last paragraph made my day. If anyone passing by has good resources regarding awk/sed they can share, feel free to do so in the comments.
Here's how I would do it with GNU sed :
s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}
Which, for those who don't speak gibberish, means :
remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now starts with an opening bracket. If that's the case, do the following :
mark this spot as the label l, which denotes the start of a loop
add a line from the input to the pattern space
test if you now have a closing bracket in your pattern space
if so, jump to the label e
(if not) jump to the label l
mark this spot as the label e, which denotes the end of the code
remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)
This can probably be refined, but it does the trick :
$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}
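For example, with history expansion out of the way the shorter script can be dropped into the same kind of pipeline (just a sketch; the heredoc stands in for part of the sample input):
$ set +H   # disable history expansion, as suggested above
$ sed 's/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}' <<'EOF'
Another line
(
Element 4
)
EOF
Another line
(Element 4)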
sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.
The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:
$ cat tst.awk
BEGIN {
RS="[(][^)]+[)]" # a regexp for the string you want to isolate in RT
ORS="" # disable appending of newlines so we print as-is
}
{
gsub(/\n[[:blank:]]+$/,"\n") # remove any blanks before RT at the start of each line
sub(/\(\s+/,"(",RT) # remove spaces after ( in RT
sub(/\s+\)/,")",RT) # remove spaces before ) in RT
gsub(/\s+/," ",RT) # compress each chain of spaces to one blank char in RT
print $0 RT # print the result
}
$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...
It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:
awk '/^ *\( *$/,/^ *\) *$/ {
sub(/^ */, "");
sub(/ *$/, "");
if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
if ($0 ~ /\)/) {
sub(/\( /, "(", hold)
sub(/ \)/, ")", hold)
print hold
hold = ""
}
next
}
{ print }' data
The variable hold is initially empty.
The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.
The file data is copy'n'paste from the data in the question.
Output:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
The 'Another Line' (with capital L) has a trailing blank because the data in the question does.
With awk
$ cat fmt.awk
function rem_wsp(s) { # remove white spaces
gsub(/[\t ]/, "", s)
return s
}
function beg() {return rem_wsp($0)=="("}
function end() {return rem_wsp($0)==")"}
function dump_block() {
print "(" block ")"
}
beg() {
in_block = 1
next
}
end() {
dump_block()
in_block = block = ""
next
}
in_block {
if (length(block)>0) sep = " "; else sep = ""
block = block sep $0
next
}
{
print
}
END {
if (in_block) dump_block()
}
Usage:
$ awk -f fmt.awk file.dat

Edit fields in csv files using bash

I have a bunch of csv files that need "cleaning".
Specifically, there is a column that contains timestamp values, however some lines have a value of '1' instead.
What I wish to do is replace those 1's with the last valid (timestamp) value, i.e. replace the value of the i-th line with that of line i-1.
I provide a sample of the file
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042, 1,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
So in this example, the 1 must be replaced with 20/07/2015 09:40:00. I tried it using awk but couldn't nail it.
Assuming no commas in the other fields, an awk program like this should work:
BEGIN { FS = OFS = "," }
$3!=1 { prev = $3 }
$3==1 { $3 = prev }
{ print }
Warning: this is untested code.
The first line sets the field separator to a comma, for both input and output. The second line saves the timestamp of every row that has a timestamp in the third field. The third line writes the most recently saved timestamp to every row that doesn't have a timestamp in the third field. And the fourth line writes every input line, whether modified or not, to the output.
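To run it on your files, here is a sketch (fix.awk and the output naming are illustrative; write to a new file rather than overwriting in place):
for f in *.csv; do
    awk -f fix.awk "$f" > "${f%.csv}_clean.csv"     # cleaned copy alongside the original
done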
Let me know how you get on.

awk script: removing line previous to pattern match and after, until a blank line

I began learning awk yesterday in an attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool to access/manipulate lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on its own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick
things
things
things
Desired output:
It is tasty!
stuff
something
stuff
things
things
things
My attempt:
{
valid=1; #boolean variable to keep track if x is valid and should be printed
if ($x ~ /foo/){ #x is valid unless it contains foo
valid=0; #invalidate x so that it doesn't get printed at the end
next;
}
if ($0 ~ /bar/){ #if the current line contains bar
valid = 0; #x is invalid (don't print the previous line)
while (NF == 0){ #don't print until we reach an empty line
next;
}
}
if (valid == 1){ #x was a valid line
print x;
}
x=$0; #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem, but I'm interested in learning how this would be done):
Ability to remove n lines before pattern match
Option to include/disclude the blank line in output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
if (!skip && !empty) {
print last
}
last = $0
empty = 0
}
function set_skip_empty(n) {
skip = n
last = $0
empty = NR <= 0
}
BEGIN { set_skip_empty(0) }
END { show_last() ; }
/foo/ { next; }
/bar/ { set_skip_empty(1) ; next }
/^ *$/ { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/{ if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either
ignored or output, depending on other events, such as the occurrence of foo and bar.
The empty variable keeps track of whether or not the last variable is really
a blank line, or simply empty from inception (e.g., at BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N number of lines as desired.
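One rough sketch of that idea (untested, and not a drop-in edit of the functions above; N and the variable names are illustrative) keeps a rolling buffer of the last N lines and only prints a line once a later bar can no longer swallow it:
awk -v N=1 '
/foo/ { next }
/bar/ { nbuf = 0; skip = 1; next }       # discard the buffered lines, start skipping
skip  { if (/^ *$/) skip = 0; next }     # keep skipping until the blank line (dropped too)
{
    if (nbuf == N) {                     # buffer full: the oldest line is safe to print
        print buf[1]
        for (i = 2; i <= N; i++) buf[i-1] = buf[i]
        nbuf = N - 1
    }
    buf[++nbuf] = $0
}
END { for (i = 1; i <= nbuf; i++) print buf[i] }
' file
With N=1 this matches the behaviour asked for in the question; raising N drops that many lines before each bar. To keep the terminating blank line in the output, print it in the skip rule before clearing skip.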
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but lines with blanks or tabs *do* have a length).
function show_last() {
if (!skip && length(last) > 0) {
print last
}
last = $0
}
will result in no blank lines of output.
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff
things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
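For instance, with N=2 the bar alternative becomes \n([^\n]*\n){2}[^\n]*bar.*, i.e. (a sketch, untested; it assumes your awk supports {N} regexp intervals, as GNU awk does):
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n([^\n]*\n){2}[^\n]*bar.*/,"") }1' file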
This awk should work without storing the full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff
things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[idx]; idx-=1; flag = 1; next }
{ line[++idx] = $0 }
END {
for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff
things
things
things
If line contains foo skip it.
If flag is enabled and line is not blank skip it.
If flag is enabled and line is blank disable the flag.
If the line contains bar, delete the previously cached line, step the counter back by one, enable the flag and skip it.
Store every line that makes it through in an array indexed by an incrementing counter.
In the END block print the lines.
Side Notes:
To remove n lines before a pattern match, you can create a loop: starting from the current line number, use a reverse for loop to remove lines from your temporary cache (the array), then subtract n from your self-defined counter variable (a sketch follows after these notes).
To include or exclude blank lines you can use the NF variable. For a typical line, NF is set to the number of fields based on your field separator; for blank lines it is 0. For example, if you change the line above the END block to NF { line[++idx] = $0 } in the answer above, you will see that all blank lines are left out of the output.
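Here is a sketch of that reverse loop grafted onto the answer above (untested; n is passed in with -v and the value is illustrative):
awk -v n=2 '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ {
    for (i = idx; i > idx - n && i > 0; i--)   # walk backwards over the cached lines
        delete line[i]
    idx = (idx > n) ? idx - n : 0              # pull the counter back by n
    flag = 1
    next
}
{ line[++idx] = $0 }
END { for (x = 1; x <= idx; x++) print line[x] }
' file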

Get random text block from file using bash

What is the simplest way of reading a random block of characters from a text file using bash?
A block is a set of characters which begins with X and ends with X, where X is a character sequence; usually it will be "\n\n".
We can assume that file has short lines, less than 200 characters each.
Blocks don't have more than 20 lines.
I have seen threads like "get a random line" or "get text from between two tokens", but they're not exactly what I need.
I can write a simple program in C that will count how many blocks are in the file, get a random number from that range and then search for the block with this ID, but there must be an easier way.
Example:
X = "\n\n"
File: (the .'s are not in the file; I used them to mark the "empty" lines at the beginning and end of the block)
.
first line
second line and some other text
fourth line
sixth line
seventh line, more textęęę
.
Running the script for the first time, output:
fourth line
Running the script for the second time, output:
first line
second line and some other text
Yours faithfully,
user2420535
To get a uniformly random block from a file of blank-line-separated blocks in one pass,
awk -v RS='\n\n' '
BEGIN { srand(); }
rand() < 1.0/NR { s=$0; }
END { print s; }
' file
This is a simple case of Reservoir Sampling: record i replaces the saved block with probability 1/i, which leaves every block equally likely to be the one printed in the END block.

Ruby String mismatch in while comparing text from a text file

I am having a problem while reading a text file with readline and trying to compare the first line with a string. I want to compare the first line of the text file with a string and then go on to the next step, but I can't get it to work. Here is my code:
doc = File.open("example.txt", "r")
line1 = doc.readline
if line1 == "sukanta"
line2 = doc.readline
line3 = doc.readline
line4 = doc.readline
end
My example.txt file contains:
sukanta
Software engineer
label2
server:107.108.9.190
Please give me a solution. When I try to get the string length with line1.length, it is not showing the exact number I expect.
I got the answer. It was a silly mistake; I should use "sukanta\n" to compare.
When I am using readline to read each line, I have to take each line in its place sequentially; I can't break the order. While I am using a loop like
doc = File.open("example.txt", "r")
doc.each_line do |lines|
puts lines
end
I am getting the whole text one line at a time and can't separate each line from the others; I need to break the order. How do I do that?
I suspect you are not taking into account that a line ends with $/ ("\n" on UNIX). So you probably intended
line1 == "sukanta\n"
or
line1.chomp == "sukanta"
and you are not including $/ when you count the length (which is one or two characters less than the correct length depending on the OS).

Resources