printing lines based on pattern matching in multiple fields using awk - bash

Suppose I have a html input like
<li>this is a html input line</li>
I want to filter, from a file, all such input lines that begin with <li> and end with </li>. My idea was to search for the pattern <li> in the first field and the pattern </li> in the last field using the awk command below:
awk '$1 ~ /\<li\>/ ; $NF ~ /\</li\>/ {print $0}'
but looks like there is no provision to match two fields at a time or I am making some syntax mistakes. Could you please help me here?
PS: I am working on a Solaris SunOS machine.

There's a lot going wrong in your script on Solaris:
awk '$1 ~ /\<li\>/ ; $NF ~ /\</li\>/ {print $0}'
The default awk on Solaris (and so the one we have to assume you are using, since you didn't state otherwise) is old, broken awk, which must never be used. On Solaris use /usr/xpg4/bin/awk. There's also nawk, but it has fewer POSIX features (e.g. no support for character classes).
\<...\> are gawk-specific word boundaries. There is no awk on Solaris that would recognize those. If you were just trying to get literal characters then there's no need to escape them as they are not regexp metacharacters.
If you want to test for condition 1 and condition 2 you put && between them, not ; which is just the statement terminator in lieu of a newline.
The default action given a true condition is {print $0} so you don't need to explicitly write that code.
/ is the awk regexp delimiter so you do need to escape that in mid-regexp.
The default field separator is white space so in your posted sample input $1 and $NF will be <li>this and line</li>, not <li> and </li>.
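To see the concrete difference, here is a quick check with a contrived two-field line: with ; you have two separate pattern-action rules, so a line matching both conditions prints twice, while && prints it once.
$ echo 'a a' | awk '$1=="a"; $NF=="a"'
a a
a a
$ echo 'a a' | awk '$1=="a" && $NF=="a"'
a a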
So if you DID for some reason compare multiple fields you could do:
awk '($1 ~ /^<li>.*/) && ($NF ~ /.*<\/li>$/)'
but this is probably what you really want:
awk '/^<li>.*<\/li>$/'
in which case you could just use grep:
grep '^<li>.*</li>$'
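For example, with the sample line from the question:
$ echo '<li>this is a html input line</li>' | grep '^<li>.*</li>$'
<li>this is a html input line</li>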

Why not just use a regex to match the start and end of the line like
awk '/^[[:space:]]*<li>.*<\/li>[[:space:]]*$/ {print}'
though in general, if you're trying to process HTML, you'll be better off using a tool that's really designed to handle that.
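A quick check with a hypothetical indented line shows the leading whitespace being tolerated:
$ printf '  <li>indented item</li>\n' | awk '/^[[:space:]]*<li>.*<\/li>[[:space:]]*$/ {print}'
  <li>indented item</li>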

Related

Unable to remove last field of a CSV file

I have a CSV file containing data like the lines below, and I need to get all fields as-is except the last one.
"one","two","this has comment section1"
"one","two","this has comment section2 and ( anything ) can come here ( ok!!!"
gawk 'BEGIN {FS=",";OFS=","}{sub(FS $NF, x)}1'
gives this error:
fatal: Unmatched ( or \(:
I know that removing '(' from the second line solves the problem, but I cannot remove anything from the comment section.
With any awk you could try:
awk 'BEGIN{FS=",";OFS=","}{$NF="";sub(/,$/,"")}1' Input_file
Or with GNU awk try:
awk 'BEGIN{FS=",";OFS=","}NF{--NF};1' Input_file
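Applied to the two sample lines, either command prints:
$ awk 'BEGIN{FS=",";OFS=","}{$NF="";sub(/,$/,"")}1' Input_file
"one","two"
"one","two"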
Since you mention that everything can come here, you might also have a line that looks like:
"one","two","comment with a , comma"
So it is a bit hard to just use the <comma>-character as a field separator.
The following two posts are very handy here:
What's the most robust way to efficiently parse CSV using awk?
[U&L] How to delete the last column of a file in Linux (Note: this is only for GNU awk)
Since you work with GNU awk, you can thus do any of the following:
$ awk -v FPAT='[^,]*|"[^"]+"' -v OFS="," 'NF{NF--}1'
$ awk 'BEGIN{FPAT="[^,]*|\"[^\"]+\"";OFS=","}NF{NF--}1'
$ awk 'BEGIN{FPAT="[^,]*|\042[^\042]+\042";OFS=","}NF{NF--}1'
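For example, with the embedded-comma line from above, FPAT keeps the quoted field together so that dropping the last field removes the whole comment:
$ echo '"one","two","comment with a , comma"' | awk -v FPAT='[^,]*|"[^"]+"' -v OFS="," 'NF{NF--}1'
"one","two"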
Why is your command failing: awk's sub(ere, repl, in) function assumes that its first argument, ere, is an extended regular expression, hence the parenthesis has a special meaning. If you want to replace fields which are known and unique, you should not use sub, but just redefine the field:
$ awk '{$NF=""}1'
If you want to replace a string matching a field, you can do this (here n holds the field number; note that index() takes the haystack string first, and substr() takes the string, the start position, and the length):
s = $n; while (i = index($0, s)) { $0 = substr($0, 1, i-1) "repl" substr($0, i+length(s)) }

An error with asterisk in if statement

I'm having a problem with the following code:
nawk -F "," '{if($2<=2)&&($9!=45)&&($11==2348*)) print $2}' abc12* | wc -l
The error is in ($11==2348*). I tried to put this number in variable x and do ($11==$x*).
If you mean a regex match, change it to
$ awk -F, '$2<=2 && $9!=45 && $11~/^2348/ {c++; print $2} END{print c}' abc12*
Note that you can incorporate the line count in the script as well, instead of piping to wc -l.
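For instance, on a made-up 11-field line (the values here are invented just to satisfy the three conditions):
$ echo '1,2,a,a,a,a,a,a,44,a,23481' | awk -F, '$2<=2 && $9!=45 && $11~/^2348/ {c++; print $2} END{print c}'
2
1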
If you want an equality check, $11=="2348*" would do. That will check that the field is literally 2348*, with no special meaning for *.
Looks like you intend to use regexp?
$11==2348*
should give you a syntax error as
2348*
is an incomplete multiplication.
For a regular expression match you would have to use
$11 ~ /2348*/
if you intend to match zero or more "8"s, or
$11 ~ /2348.*/ or maybe $11 ~ /2348[0-9]*/
if the intent is to allow any characters, or only digits, after "2348".
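A quick check with made-up values shows why the distinction matters; since 8* also matches zero "8"s, even 234 matches the first pattern:
$ printf '234\n2348\n23489\n' | awk '/2348*/'
234
2348
23489
$ printf '234\n2348\n23489\n' | awk '/2348.*/'
2348
23489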
You also have one more ")" than expected: if you count them, you have 7, so ($11==2348*)) should actually be ($11==2348*). The problem with the asterisk described above remains, though.

limit text files to a certain word length, but keep complete sentences

I have a corpus of text files that I need to copy, but limiting each file to roughly the same word length, while maintaining complete sentences. Treating any punctuation within {.?!} as a sentence boundary is acceptable. I could do this with Python, but I am trying to learn bash, so suggestions are welcome. The approach I have been considering is to overshoot my target word length by a few words and then trim the result to the last sentence boundary.
I am familiar with head and wc, but I can't come up with a way to combine the two. The man page for head does not indicate a way to use word counts, and the man page for wc does not indicate a way to split the file.
Context:
I am working on a text classification task with machine learning (using Weka, for the record). I want to make sure that text length (which varies widely in my data) is not influencing the outcomes too much. To do this, I am trying to normalize my text lengths before I perform feature extraction.
Let's consider this test file:
$ cat file
Do I exist? I program. Therefore, I am!
Suppose that we want to truncate this file to complete sentences totaling 20 characters or fewer:
$ awk -v n=20 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist?
If we want 30 characters or fewer:
$ awk -v n=30 -v RS='[.?!]' '{if (length(s $0 RT)>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.
How it works
-v n=20
This sets the awk variable n to the max length that we want (not counting the file's final newline character).
-v RS='[.?!]'
This sets the awk record separator, RS, to any of the three characters that you mentioned.
if (length(s $0 RT)>n) exit; else s=s $0 RT
For each record in the file (a record being a sentence), we test to see if adding it to s would make the output too long. If it makes the output too long, then we exit. If not, we add it to s.
In awk, $0 represents the complete record and RT is the record separator that awk found at the end of the record.
END{print s;}
Before we exit, this prints the string s.
Alternative 1: Truncating based on number of words
Suppose instead that we want to truncate based on the number of words. If we want, for example, 6 words:
$ awk -v n=6 -v RS='[[:space:]]+' 'NR>n{exit;} {printf "%s%s",$0,RT;} END{print"";}' file
Do I exist? I program. Therefore,
The difference is that we now use whitespace as the record separator. In this way, each record is a word, and we keep printing words until we reach the limit.
Alternative 2: Whole sentences but limited number of words
$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file
Do I exist? I program.
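We can confirm that the output stays within the six-word limit:
$ awk -v n=6 -v RS='[.?!]' '{c+=NF; if (c>n) exit; else s=s $0 RT;} END{print s;}' file | wc -w
5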
Mac OSX
The above sets the record separator, RS, to a regular expression, and reads the matched separator back through RT; both are GNU awk (gawk) extensions. The OSX man page for awk does not say whether these features are supported. #bebop, however, reports that the above code runs successfully on OSX after installing gawk from MacPorts.

I need to be able to print the largest record value from a txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1, or the first field) in order to increment this unique record number and add more records to the file. I can't seem to figure this one out.
First, I sort the file by the first field in descending order and then perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.
largest=$(sort -t, -k1,1 -nr file | cut -d"," -f1 | head -1)
You need spaces, and quotes
awk -F, 'NR==1{print $1}'
The command is awk; you need a space after it so bash parses your command line properly. Otherwise it thinks the whole thing is the name of the command, which is what the error message is telling you.
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma: -F,
What follows in quotes is the program you want awk to interpret. It says to match a line with the pattern NR==1; NR is special, it is the record number, so you want it to match the first record. Following that is the action you want awk to take when that pattern matches, {print $1}, which says to print the first (comma-separated) field of the line.
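Putting the two steps together (assuming a numeric, descending sort on the first field for the sorting step you described):
$ sort -t, -k1,1 -nr student_records.txt | awk -F, 'NR==1{print $1}'
86345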
A better way to accomplish this would be to use awk to find the largest record for you rather than sorting first; this gives you a solution that is linear in the number of records. You just want the max, so there is no need to do the extra work of sorting the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
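Captured into the variable from your question, with the sample data this gives:
$ largest_record=$(awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt)
$ echo "$largest_record"
86345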
For this and other awk "one-liners", look here.

How not to get expanded variables in AWK

Good day,
I was wondering how not to get expanded variables in AWK.
Variable to pass: achi
But, when I try with:
awk -F, -v var1="achi" '$(NF-1)~var1' file
It just does not work: it prints all lines that contain achi.
I'd appreciate some insight on how to do it properly.
Input
achi, francia
nachi, peru
universidad achi, japon
achito, suecia
Expected Output
achi, francia
You seem to be trying to test equivalence with the pattern matching operator ~. The proper operator to test equivalence is ==.
awk -F, -v var1="achi" '$(NF-1)==var1' file
If you are expecting more fields, you should take into account that your values are separated by a comma and a space; this can be handled by using ", " as the field separator.
awk -F", " -v var1="achi" '$(NF-1)==var1'
