How to cut a range of lines defined by variables - shell

I have this python crawler output
[+] Site to crawl: http://www.example.com
[+] Start time: 2020-05-24 07:21:27.169033
[+] Output file: www.example.com.crawler
[+] Crawling
[-] http://www.example.com
[-] http://www.example.com/
[-] http://www.example.com/icons/ubuntu-logo.png
[-] http://www.example.com/manual
[i] 404 Not Found
[+] Total urls crawled: 4
[+] Directories found:
[-] http://www.example.com/icons/
[+] Total directories: 1
[+] Directory with indexing
I want to cut the lines between "Crawling" and "Total urls crawled" using awk or any other tool. Basically, I want to assign the NR of the first keyword, "Crawling", to one variable and the NR of the second delimiter, "Total urls crawled", to another, and then cut the range between the two. I tried something like this:
awk 'NR>$(Crawling) && NR<$(urls)' file.txt
but nothing really worked. The best I got was a cut from the Crawling+1 line to the end of the file, which isn't really helpful. So how do I do this, and how do I cut a range of lines with awk using variables?
awk

If I understood your requirement correctly, you want to pass the search strings to awk as variables. If so, try the following.
awk -v crawl="Crawling" -v url="Total urls crawled" '
$0 ~ url{
found=""
next
}
$0 ~ crawl{
found=1
next
}
found
' Input_file
Explanation: the same program again, annotated line by line.
awk -v crawl="Crawling" -v url="Total urls crawled" ' ##Starting awk program and setting crawl and url values of variables here.
$0 ~ url{ ##Checking if line is matched to url variable then do following.
found="" ##Nullify the variable found here.
next ##next will skip further statements from here.
}
$0 ~ crawl{ ##Checking if line is matched to crawl variable then do following.
found=1 ##Setting found value to 1 here.
next ##next will skip further statements from here.
}
found ##Checking condition if found is SET(NOT NULL) then print current line.
' Input_file ##Mentioning Input_file name here.
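For the approach the question literally describes (capture the two line numbers, then print what lies between them), here is a minimal sketch. It assumes a POSIX shell and a regular file that can be read more than once; the function name lines_between and the awk variables s and e are made up for illustration. Note that inside single quotes the shell never expands $(Crawling); awk reads it as a field reference instead, which is why the original attempt misbehaved. Passing the values with -v avoids that.

```shell
# Hypothetical helper: print the lines strictly between the first line
# matching "Crawling" and the first line matching "Total urls crawled".
lines_between() {
    file=$1
    # First pass: record the line numbers (NR) of the two markers.
    start=$(awk '/Crawling/ { print NR; exit }' "$file")
    end=$(awk '/Total urls crawled/ { print NR; exit }' "$file")
    # Give up if either marker is missing, so awk never compares against "".
    [ -n "$start" ] && [ -n "$end" ] || return 1
    # Second pass: -v copies the shell values into the awk variables s and e.
    awk -v s="$start" -v e="$end" 'NR > s && NR < e' "$file"
}

# Demo on a shortened copy of the crawler output from the question.
sample=$(mktemp)
printf '%s\n' \
    '[+] Output file: www.example.com.crawler' \
    '[+] Crawling' \
    '[-] http://www.example.com' \
    '[i] 404 Not Found' \
    '[+] Total urls crawled: 4' \
    '[+] Directories found:' > "$sample"
lines_between "$sample"
rm -f "$sample"
```

On the shortened sample this prints the "[-] http://www.example.com" and "[i] 404 Not Found" lines. The flag-based program above does the same job in a single pass and also works on piped input, so this version is mainly useful when the line numbers themselves are needed elsewhere in the script.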

The clause "...or any other tool" prompts me to point out that a scripting language could be used in command-line mode for this. Here's how it could be done using Ruby, where 't' is the name of the file that contains the text from which the specified lines are to be extracted. The following would be entered in the shell.
ruby -W0 -e 'puts STDIN.readlines.select { |line| true if line.match?(/\bCrawling\b/)..line.match?(/\bTotal urls crawled\b/) }[1..-2]' < t
displays the following:
[" [-] http://www.example.com",
" [-] http://www.example.com/",
" [-] http://www.example.com/icons/ubuntu-logo.png",
" [-] http://www.example.com/manual",
" [i] 404 Not Found"]
The following operations are performed.
STDIN.readlines and < t reads the lines of t into an array
select selects the lines for which its block calculation returns true
[1..-2] extracts all but the first and last of the selected lines
select's block calculation,
true if line.match?(/\bCrawling\b/)..line.match?(/\bTotal urls crawled\b/)
employs the flip-flop operator. The block returns nil (treated as false by Ruby) until a line that matches /\bCrawling\b/ is read, namely, "[+] Crawling". The block then returns true, and continues to return true until it encounters a line matching /\bTotal urls crawled\b/, namely, "[+] Total urls crawled: 4". The block returns true for that line as well, but returns nil again for each subsequent line unless and until it encounters another line that matches /\bCrawling\b/, in which case the process repeats. Hence, "flip-flop".
"-W0" in the command line suppresses warning messages. Without it one may see the warning, "flip-flop is deprecated" (depending on the version of Ruby being used). After a decision was made to deprecate the (rarely-used) flip-flop operator, Rubyists took to the streets with pitchforks and torches in protest. The Ruby monks saw the error of their ways and reversed their decision.

Related

Can the regex matching pattern for awk be placed above the opening brace of the action line, or must it be on the same line?

I'm studying awk pretty fiercely to write a git diffn implementation which will show line numbers for git diff, and I want confirmation on whether or not this Wikipedia page on awk is wrong [Update: I've now fixed this part of that Wikipedia page, but this is what it used to say]:
(pattern)
{
print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)
}
Output may be sent to a file:
(pattern)
{
print "expression" > "file name"
}
or through a pipe:
(pattern)
{
print "expression" | "command"
}
Notice (pattern) is above the opening brace. I'm pretty sure this is wrong but need to know for certain before editing the page. What I think that page should look like is this:
/regex_pattern/ {
print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)
}
Output may be sent to a file:
/regex_pattern/ {
print "expression" > "file name"
}
or through a pipe:
/regex_pattern/ {
print "expression" | "command"
}
Here's a test to "prove" it. I'm on Linux Ubuntu 18.04.
1. test_awk.sh
gawk \
'
BEGIN
{
print "START OF AWK PROGRAM"
}
'
Test and error output:
$ echo -e "hey1\nhello\nhey2" | ./test_awk.sh
gawk: cmd. line:3: BEGIN blocks must have an action part
But with this:
2. test_awk.sh
gawk \
'
BEGIN {
print "START OF AWK PROGRAM"
}
'
It works fine!:
$ echo -e "hey1\nhello\nhey2" | ./test_awk.sh
START OF AWK PROGRAM
Another example (fails to provide expected output):
3. test_awk.sh
gawk \
'
/hey/
{
print $0
}
'
Erroneous output:
$ echo -e "hey1\nhello\nhey2" | ./test_awk.sh
hey1
hey1
hello
hey2
hey2
But like this:
4. test_awk.sh
gawk \
'
/hey/ {
print $0
}
'
It works as expected:
$ echo -e "hey1\nhello\nhey2" | ./test_awk.sh
hey1
hey2
Updates: after solving this problem, I just added these sections below:
Learning material:
In the process of working on this problem, I just spent several hours and created these examples: https://github.com/ElectricRCAircraftGuy/eRCaGuy_hello_world/tree/master/awk. These examples, comments, and links would prove useful to anyone getting started learning awk/gawk.
Related:
git diff with line numbers and proper code alignment/indentation
"BEGIN blocks must have an action part" error in awk script
The whole point of me learning awk at all in the first place was to write git diffn. I just got it done: Git diff with line numbers (Git log with line numbers)
I agree with you that the Wikipedia page is wrong. It's right in the awk manual:
A pattern-action statement has the form
pattern { action }
A missing { action } means print the line; a missing pattern always matches. Pattern-action statements are separated by newlines or semicolons.
...
Statements are terminated by semicolons, newlines or right braces.
This is from the man page for the default awk on my Mac. The same information is in the GNU awk manual; it's just buried a little deeper. And the POSIX specification of awk states:
An awk program is composed of pairs of the form:
pattern { action }
Either the pattern or the action (including the enclosing brace characters) can be omitted.
A missing pattern shall match any record of input, and a missing action shall be equivalent to:
{ print }
You can see in your examples that statements can be separated with newlines instead of semicolons. When you have
/regex/
{ ...
}
it's equivalent to /regex/; {...}, which in turn is equivalent to /regex/ {print $0} {...}, and that is exactly the behavior your test showed.
Note that BEGIN and END are special patterns that require an explicit action, because a default action of {print $0} makes no sense for BEGIN: no record has been read yet. That's why the opening curly brace must be on the same line as the pattern. This may have been chosen for convenience, but it is all consistent.
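The equivalence claimed above can be checked directly. This is only a sketch, assuming a POSIX shell and any reasonably standard awk; the function names newline_form and semicolon_form are invented for the demo.

```shell
# Pattern on its own line: awk reads this as two separate rules.
newline_form() {
    awk '/hey/
{ print $0 }'
}

# The same two rules, separated by a semicolon instead of a newline.
semicolon_form() {
    awk '/hey/; { print $0 }'
}

printf 'hey1\nhello\nhey2\n' | newline_form
printf 'hey1\nhello\nhey2\n' | semicolon_form
```

Both calls print hey1, hey1, hello, hey2, hey2: the bare /hey/ rule prints matching lines with the default action, and the pattern-less block then prints every line.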

bash, using awk command for printing lines characters if condition

Before starting to explain my issue I have to say that it's the first time I'm using bash and the awk command.
I have a file containing a lot of lines and I am interested in printing some of these lines if certain characters of the line satisfy a condition. I already have a simple method which is working but I intend to try with awk to see if it can be faster. The command I'm trying was inspired by a colleague at work but I don't fully understand it.
My file looks like :
# 15247.479
1 23775U 96005A 18088.90328565 -.00000293 +00000-0 +00000-0 0 9992
2 23775 014.2616 019.1859 0018427 174.9850 255.8427 00.99889926081074
# 15250.479
1 23775U 96005A 18088.35358271 -.00000295 +00000-0 +00000-0 0 9990
2 23775 014.2614 019.1913 0018425 174.9634 058.1812 00.99890136081067
The 4th field is a date, and I want to print the lines starting with 1 and 2 if that number is greater than startDate and less than endDate.
I am trying with :
< $file awk ' BEGIN {ok=0}
{date=substring($0,19,10) if ($date>='$firstTime' && $date<= '$lastTime' ) {print; ok=1} else ok=0;next}
{if (ok) print}'
This returns a syntax error but I fear it is not the only problem. I don't really understand what the $0 in substring refers to.
Thanks everyone for the help !
Per the question about $0:
Awk is a language built for processing tables and has language features specific to both filtering and manipulating tabular data. One language feature is automatic field splitting.
If you see a $ in front of a variable or constant, it is referring to a "field." When awk sees $field_number being used in a variable context, awk splits the current record buffer based upon what is in the FS variable and allows you to work on that just as you would any other variable -- just that the backing store for that variable is the record buffer.
$0 is a special field referring to the whole of the record buffer. There are some interesting notes in the awk documentation about the side effects on $0 of assigning $field_number variables, FS and OFS that are worth an in depth read.
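To make the field mechanics concrete, here is a small sketch, assuming a POSIX shell and the default FS and OFS; the function name show_fields is hypothetical. It prints the whole record, one field, and then the record as rebuilt after a field assignment.

```shell
# Hypothetical helper: show $0, $4, and the effect of assigning to a field.
show_fields() {
    awk '{
        print "record: " $0
        print "field 4: " $4
        $2 = "X"                  # assigning to a field rebuilds $0 using OFS
        print "rebuilt: " $0
    }'
}

# One of the data lines from the question, shortened to five fields.
printf '%s\n' '1 23775U 96005A 18088.90328565 -.00000293' | show_fields
```

After the assignment, $0 is reassembled from the fields joined by OFS (a single space by default), so any runs of whitespace in the original record are collapsed. That is one of the side effects the documentation describes.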
Here is my answer to your application:
(1) First, setting LC_ALL=C may help with speed. I'm using ll/ul for the lower and upper limits; the reason will become apparent later. Specifying them as variables outside the script helps readability. It is good practice to properly quote shell variables.
(2) It is good practice to use BEGIN { ... }, as you did in your attempt, to formally initialize variables. If using gawk, we can use LINT = 1 to test things like this.
(3) /^#/ is probably the simplest (and fastest) pattern for our reset. We use next because we never want to apply the limits to this line and we never want to see this line in our output (even if ll = ul = "").
(4) It is surprisingly easy to make a mistake on limits. Implement limits consistently one way, and our readers will thank us. We remember to check corner cases where ll and/or ul are blank. One corner case is where we have already triggered our limits and we are waiting for /^#/ -- we don't want to rescan the limits again while ok.
(5) The default action of a pattern is to print.
(6) Remembering to quote our filename variable will save us someday when we inevitably encounter the stray "$file" with spaces in the name.
LC_ALL=C awk -v ll="$firstTime" -v ul="$lastTime" ' # (1)
BEGIN { ok = 0 } # (2)
/^#/ { ok = 0; next } # (3)
!ok { ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul) } # (4)
ok # <- print if ok # (5)
' "$file" # (6)
You're missing a ; between the variable assignment and if. And instead of concatenating shell variables, assign them to awk variables. There's no need to initialize ok=0, uninitialized variables are automatically treated as falsey. And if you want to access a field of the input, use $n where n is the field number, rather than substr().
You need to set ok=0 when you get to the next line beginning with #, otherwise you'll just keep printing the rest of the file.
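awk -v firstTime="$firstTime" -v lastTime="$lastTime" '
# Only set the flag here; printing in this rule as well would print the line twice.
# The bounds are inclusive, as in the attempt from the question.
NF > 3 && $4 >= firstTime && $4 <= lastTime { ok = 1 }
$1 == "#" { ok = 0 }
ok { print }' "$file"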
This answer is based upon my original, but takes into account new information that @clem sent us in a comment: the line we need to test always immediately follows the line matching /^#/. Therefore, when we match in this new solution, we immediately do a getline to grab the next line, and set ok based upon that next line's data. We now only check against limits on the line following our match, and we do not check against limits on lines where we shouldn't.
LC_ALL=C awk -v ll="$firstTime" -v ul="$lastTime" '
BEGIN { ok = 0 }
/^#/ {
getline
ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul)
}
ok # <- print if ok
' "$file"
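As a worked run, the getline version can be wrapped in a function and fed the sample data from the question. This is a sketch under assumptions: a POSIX shell, input on stdin instead of "$file", and a hypothetical function name filter_by_date whose two arguments stand in for $firstTime and $lastTime.

```shell
# Hypothetical wrapper: filter_by_date FIRST LAST < file
filter_by_date() {
    LC_ALL=C awk -v ll="$1" -v ul="$2" '
        /^#/ {
            getline    # move on to the "1 ..." line, which carries the date in $4
            ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul)
        }
        ok             # print while the current block is accepted
    '
}

printf '%s\n' \
    '# 15247.479' \
    '1 23775U 96005A 18088.90328565 -.00000293 +00000-0 +00000-0 0 9992' \
    '2 23775 014.2616 019.1859 0018427 174.9850 255.8427 00.99889926081074' \
    '# 15250.479' \
    '1 23775U 96005A 18088.35358271 -.00000295 +00000-0 +00000-0 0 9990' \
    '2 23775 014.2614 019.1913 0018425 174.9634 058.1812 00.99890136081067' |
filter_by_date 18088.5 18089
```

With the limits 18088.5 and 18089, only the first block (date 18088.90328565) is printed; the second block's date, 18088.35358271, falls below the lower limit.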

Using gawk in a batch file I am having trouble reformatting lines from format A to format B

I have a compiler which produces output like:
>>> Warning <code> "c:\some\file\path\somefile.h" Line <num>(x,y): warning comment
For example:
>>> Warning 100 "c:\some\file\path\somefile.h" Line 10(5,7): you are missing a (
>>> Warning 101 "c:\some\file\path\file with space.h" Line 20(8,12): unexpected char a
I need to get this into the following format (for MSVS2013):
<filename-without-quotes>(<line>,<column>) : <error|warning> <code>: <comment>
e.g. using the first example from above:
c:\some\file\path\somefile.h(10,5): warning 100: you are missing a (
I have had a good go at it and I can just about get the first example working, but the second example screwed me over because I had not figured on filenames with spaces (who does that!!? >.< ). Here is my awk (gawk) code:
gawk -F"[(^), ]" '$2 == "Warning" {gsub("<",""^); gsub("\"",""); start=$4"("$6","$7"^) : "$2" "$3":"; $1=$2=$3=$4=$5=$6=$7=$8=$9=""; print start $0;}' "Filename_with_build_output.txt"
gawk -F"[(^), ]" '$2 == "Error" {gsub("<",""^); gsub("\"",""); start=$4"("$6","$7"^) : "$2" "$3":"; $1=$2=$3=$4=$5=$6=$7=$8=$9=""; print start $0;}' "Filename_with_build_output.txt"
OK, so point 1 is: it's a mess. I will break it down to explain what I am doing. First, note that the input is a file, an error log generated by my build, which I simply pass into awk. Also note that the occasional "^" before a round bracket is there because this is within a batch file IF statement, so I have to escape any ")" (except for one of them... I don't know why!). So, the breakdown:
-F"[(^), ]" - This is to split the line by "(" or ")" or "," or " ", which is possibly an issue when we think about files with spaces :(
'$2 == "Warning" {...} - Any line where the 2nd parameter is "Warning". I tried using IGNORECASE=1 but I could not get that to work. Also I could not get an or expression for "Warning" or "Error", so I simply repeat the entire awk line with both!
gsub("<",""^); gsub("\"",""); - this is to remove '<' and '"' (double quotes) because MSVS does not want the filename with quotes around it... and it can't seem to handle "<". Again issues here if I want to get the filename with spaces?
start=$4"("$6","$7"^) : "$2" "$3":"; - this part basically shuffles the various parameters into the correct order with the various format strings inserted.
$1=$2=$3=$4=$5=$6=$7=$8=$9=""; - hmm... here I Wanted to print the 10th parameter and every thing after that, one trick (could not get others to work) was to set params 1-9 to "" and then later I will print $0.
print start $0; - final part, this just prints the string "start" that I built up earlier followed by everything after the 9th parameter (see previous point).
So, this works for the first example, although it's still a bit rubbish because I get the following (missing the "(" at the end because "(" is a split char):
c:\some\file\path\somefile.h(10,5): warning 100: you are missing a
And for the one with filename with a space I get (you can see the filename is all broken and some parameters are in the wrong place):
RCU(Line,20) : warning 101: : unexpected char a
So, multiple issues here:
How can I extract the filename between the quotes, yet still remove the quotes
How can I get at the individual numbers in Line 10(5,7):, if I split on brackets and comma I can get to them, but then I lose real bracket/commas from the comment at the end.
Can I more efficiently print out the 10th element and all elements after that (instead of $1=$2=...$9="")
How can I make this into one line such that $2 == "Warning" OR "Error"
Sorry for long question - but my awk line is getting very complicated!
IMHO, it is better not to get yourself tied up in reg-ex and fancy FS values if they don't provide real value or are in other ways really needed. Just "cut and paste" as needed. Put the following in a file,
{
sub(/^>>> /,"")
warn=$1 " " $2; $1=$2=""
sub(/^[[:space:]][[:space:]]*/,"",$0)
fname=$0
sub(" Line.*$","",fname)
gsub("\"","",fname);
msg=$0
sub(/^.*:/,"",msg)
print fname ":\t" warn ":\t"msg
}
Then, per @EdMorton's most excellent comments, run it
awk -f awkscript dat.txt > dat.out
output
c:\some\file\path\somefile.h: Warning 100: you are missing a (
c:\some\file\path\file with space.h: Warning 101: unexpected char a
Note that I have used tab-separated fields. If you want spaces or other chars, just replace the \t chars with " " or whatever you need.
As so many crave the one-liner solution, here it is
awk '{sub(/^>>> /,"");warn=$1 " " $2; $1=$2="";sub(/^[[:space:]][[:space:]]*/,"",$0);fname=$0;sub(" Line.*$","",fname);gsub("\"","",fname);msg=$0;sub(/^.*:/,"",msg);print fname ":\t" warn ":\t"msg}' dat.txt
IHTH
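The script above keeps the filename, the warning code and the comment but drops the line and column. A possible way to get the full MSVS-style line is to split on the double quote, so that spaces inside the filename stop mattering. This is a sketch, not the answerer's method: it assumes a POSIX shell rather than a batch file, input on stdin, exactly one quoted filename per line, and the spacing of the worked example in the question. The function name to_msvs and the variable names are invented.

```shell
# Hypothetical helper: reformat ">>> Warning|Error ..." lines read from stdin.
to_msvs() {
    awk -F'"' '
    /^>>> (Warning|Error) / {
        split($1, head, " ")                  # head[2] = kind, head[3] = code
        # Everything after the closing quote, e.g. " Line 10(5,7): comment"
        rest = substr($0, length($1) + length($2) + 3)
        sub(/^ *Line */, "", rest)            # now "10(5,7): comment"
        line = rest
        sub(/\(.*/, "", line)                 # keep the text before the first "("
        col = substr(rest, index(rest, "(") + 1)
        sub(/,.*/, "", col)                   # keep the text before the first ","
        msg = substr(rest, index(rest, "): ") + 3)
        printf "%s(%s,%s): %s %s: %s\n", $2, line, col, tolower(head[2]), head[3], msg
    }'
}

printf '%s\n' \
    '>>> Warning 100 "c:\some\file\path\somefile.h" Line 10(5,7): you are missing a (' \
    '>>> Warning 101 "c:\some\file\path\file with space.h" Line 20(8,12): unexpected char a' |
to_msvs
```

Because the comment is taken with substr() instead of field splitting, brackets, commas and quotes inside it survive, and a single alternation in the pattern covers both Warning and Error.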

Matching strings with start and end characters

I just started learning awk and am still getting used to it in the bash terminal. If I wanted to write an expression that matches strings starting with de and ending with ed, how would I go about it?
Was thinking of something like:
echo -e "deed\ndeath\ndone\ndeindustrialized" |awk '/^de.ed$/'
where I match the start and the end in the awk command, but it doesn't print anything. I'd appreciate some help.
It should produce:
deed
deindustrialized
I just started today and would like to know.
The awk part should be:
... | awk '/^de.*ed$/'
deed
deindustrialized
. matches any character and * means that the preceding item will be matched zero or more times.
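To see why the original pattern printed nothing, the two patterns can be run side by side. A minimal sketch, assuming a POSIX shell (printf is used instead of echo -e for portability); the function names are made up.

```shell
# Hypothetical helpers wrapping the two patterns discussed above.
one_char_between() { awk '/^de.ed$/'; }    # exactly one character between de and ed
any_run_between()  { awk '/^de.*ed$/'; }   # zero or more characters between de and ed

printf 'deed\ndeath\ndone\ndeindustrialized\n' | one_char_between   # prints nothing
printf 'deed\ndeath\ndone\ndeindustrialized\n' | any_run_between    # prints deed and deindustrialized
```

The first pattern only matches five-character words such as "dexed", so neither "deed" (four characters) nor "deindustrialized" qualifies.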
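Alternatively, with awk (note that this prints the first and last input lines, whatever they contain, rather than testing the pattern; it only happens to give the expected output for this sample):
echo -e "deed\ndeath\ndone\ndeindustrialized" | awk 'NR==1;END{print}'
Here is the explanation.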
awk '
NR==1; ###Checking the NR(Number of line) value is 1, if yes then print the current line(awk works on method of pattern/action, if a condition is TRUE then do actions, in case of NO action do default action which is print of current line).
END{print}' ###In END section now, so it will print the last line of Input_file.

How to suppress awk output when using FS?

I'm having trouble understanding the difference in awk's behavior between passing the field separator on the command line and setting FS=":" in a script.
I thought using -F to supply the field separator and using the FS variable in awk were the same thing.
But if I do:
awk -F: '{}' /etc/passwd
it prints nothing as expected because of the empty action ({}).
And also, if I put on a script:
#!/usr/bin/awk -f
{}
no lines are printed to the output, as expected.
Now, if I put the field separator in the script, as in:
#!/usr/bin/awk -f
FS=":";
{}
it prints all lines again!
So, I want to suppress the default print action in a script, because I am going to do some computation in the action and print the result later in an END section. How can I do that without awk printing all the lines first?
FS=":";
This line doesn't do what you think it does. It isn't an instruction. Awk doesn't work like most interpreted languages: a file is not a list of instructions. At the top level, you have items of the form CONDITION ACTION, where the condition is an expression and the action is a brace-delimited block containing statements (instructions). The action is executed for each record (line) that satisfies the condition. Both the condition and the action are optional.
An assignment is an expression, so FS=":" is an expression, and there is no corresponding action. When the action is omitted, it defaults to “print the current record”. As for {} on the next line, it's an action that does nothing, with no condition (which means “always”); writing nothing has the same effect.
To execute code when the script starts, you need a BEGIN block. You can think of BEGIN as a special condition which matches once at the beginning of the script before any input is read.
BEGIN {FS = ":"}
> /foo/ { } # match foo, do nothing - empty action
>
> /foo/ # match foo, print the record - omitted action
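Putting it together for the stated goal (compute in the main action, print only at the end), here is a sketch. It assumes a POSIX shell and passwd-style input on stdin; the function name sum_uids and the choice to sum the third field are purely illustrative.

```shell
# Hypothetical helper: count records and sum field 3, printing once at the end.
sum_uids() {
    awk '
        BEGIN { FS = ":" }          # runs once, before any input is read
        { n++; total += $3 }        # per-record work; nothing is printed here
        END { print n, total }      # the only output: one summary line
    '
}

printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' |
sum_uids
```

For the two sample records this prints "2 1". Since every rule has an explicit action, the default print never fires.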
