Regex not working as field separator on awk - bash

I have this text file foo.txt which contains words mixed with punctuation marks.
What I want to do is filter out every punctuation mark using awk, so I used a regex as the field separator, like this: awk -F '[^a-zA-Z]+' '{ print $0 }' foo.txt. The problem I'm facing is that the text stays just like the original; nothing is filtered.
Does anyone know why this happens?
Input
¿Hello? How... are foo you?'
Bye ,, hehe '" .lol
Result Expected
Hello How are foo you
Bye hehe lol
P.S.
I know I can achieve the same result using sed with something like sed 's/[[:punct:]]//g' foo.txt or sed s/[^A-Za-z]/" "/g foo.txt, but I want to know why the awk command is not working. I've already investigated everywhere and I can't find an answer; I'm not going to be able to sleep.

If you want to know where to find the rules behind this, I would like to point to the POSIX standard for awk.
However, you have to piece the answer together from two locations:
DESCRIPTION
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non- <blank> non- <newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable or the -F sepstring option. The awk utility shall denote the first field in a record $1, the second $2, and so on. The symbol $0 shall refer to the entire record; setting any other field causes the re-evaluation of $0. Assigning to $0 shall reset the values of all other fields and the NF built-in variable.
Variables and Special Variables
References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS. Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.
It is a bit awkward to find the rule for recomputing $0 when new fields are introduced, but this is essentially the rule.
Furthermore, the statement print $0 prints the entire record. So according to the above, you first need to force $0 to be recomputed, as shown in the answer of oguzismail.
So changing the field separator can be done in the following way:
awk 'BEGIN{FS="oldFS"; OFS="newFS"}{$1=$1}1' <file>
Remark: you do not need to guard against empty lines with NF{$1=$1}; a plain {$1=$1} just assigns an empty first field on an empty line and does not introduce an extra OFS.
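Applied to the original question, a minimal sketch (note that a line which starts or ends with punctuation gets an empty first or last field, so a stray leading or trailing blank may appear):
awk -F '[^a-zA-Z]+' '{$1=$1} 1' foo.txt
Here $1=$1 forces $0 to be rebuilt with the default OFS (a single space), which the original { print $0 } never did.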

How to use file chunks based on characters instead of lines for grep?

I am trying to parse log files of the form below:
---
metadata1=2
data1=2,data3=5
END
---
metadata2=1
metadata1=4
data9=2,data3=2, data0=4
END
Each section between the --- and END is an entry. I want to select the entire entry that contains a field such as data1. I was able to solve it with the following command, but it is painfully slow.
pcregrep -M '(?s)[\-].*data1.*END' temp.txt
What am I doing wrong here?
Parsing this file with pcregrep might be challenging. pcregrep does not have the ability to break the file into logical records, so the pattern that was specified will try to find matches by combining multiple records together, sometimes including unmatched records in the output.
For example, if the input is "--- data=a END --- data1=a END", then the above command will select both records, as it will form a match between the initial '---' and the trailing 'END'.
For this kind of input, consider using awk. It has the ability to read input with a custom record separator (RS), which makes it easy to split the input into records and apply the pattern to each one. If you prefer, you can use Perl or Python instead.
Using awk's RS to create records, it is possible to apply the pattern test to every record:
awk -v RS='END\n' '/data1/ { print $0 }' < log1
awk -v RS='END\n' '/data1/ { print NR, $0 }' < log1
The second command includes the record number in the output, in case that is useful.
While awk is not as fast as pcregrep, in this case it will not have trouble processing large input sets.
I would use awk:
awk 'BEGIN{RS=ORS="END\n"}/\ydata1/' file
Explanation:
awk works based on input records. By default such a record is a line of input, but this behaviour can be changed by setting the record separator (and output record separator for the output).
By setting them to END\n, we can search whole records of your input.
The regular expression /\ydata1/ searches those records for the presence of the term data1; \y matches a word boundary (a GNU awk extension), to prevent it from matching metadata1.
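Run against the sample log above (with GNU awk, which provides \y), this should print only the first entry:
---
metadata1=2
data1=2,data3=5
END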

Extract a substring (value of an HTML node tag) in a bash/zsh script

I'm trying to extract a tag value of an HTML node that I already have in a variable.
I'm currently using Zsh but I'm trying to make it work in Bash as well.
The current variable has the value:
<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>
and I would like to get the value of data-count (in this case 0, but could be any length integer).
I have tried using cut, sed and variable expansion as explained in this question but I haven't managed to adapt the regexes, or maybe it has to be done differently for Zsh.
There is no reason why sed would not work in this situation. For your specific case, I would do something like this:
sed 's/.*data-count="\([0-9]*\)".*/\1/g' file_name.txt
Basically, it just states that sed is looking for a pattern that contains data-count=, then saves everything within the parentheses \(...\) into \1, which is subsequently printed in place of the match (the full line, due to the .*).
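Since the HTML is already in a shell variable rather than a file, you can feed it to sed on stdin; a quick sketch, assuming your variable is called html:
echo "$html" | sed 's/.*data-count="\([0-9]*\)".*/\1/'
This should print 0 for the example element.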
Could you please try the following.
awk 'match($0,/data-count=[^ ]*/){print substr($0,RSTART+12,RLENGTH-13)}' Input_file
Explanation: The match function of awk is used with the regex data-count=[^ ]*, which matches everything from data-count up to the next space. If a match is found, the built-in variables RSTART and RLENGTH are set. The current line's substring is then printed based on those values to get only the value of data-count.
With sed, could you please try the following.
sed 's/.*data-count=\"\([^"]*\).*/\1/' Input_file
Explanation: This uses sed's group referencing: everything after data-count=\" up to the next double quote is captured into the first group, and the s (substitution) command then replaces the whole line with \1, the captured value.
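To capture the result into a shell variable (the name count is just for illustration):
count=$(sed 's/.*data-count=\"\([^"]*\).*/\1/' <<< "$html")
echo "$count" should then print 0.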
As was said before, to be on the safe side and handle any syntactically valid HTML tag, a parser would be strongly advised. But if you know in advance what the general format of your HTML element will look like, the following hack might come in handy:
Assume that your variable is called "html"
html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
First adapt it a bit:
htmlx="tag ${html%??}"
This will add the string tag in front and remove the final />
Now make an associative array:
declare -A fields
fields=( ${=$(tr = ' ' <<<$htmlx)} )
The tr turns the equal signs into spaces and the ${=...} expansion flag handles word splitting. You can now access the values of your attributes by, say,
echo $fields[data-count]
Note that this still has the surrounding double quotes. You can easily remove them by
echo ${${fields[data-count]%?}#?}
Of course, once you do this hack, you have access to all attributes in the same way.
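For example, the other attributes can be read the same way:
echo ${${fields[data-more]%?}#?}
echo ${${fields[fill]%?}#?}
These should print none and #ffedf0 respectively.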

awk and sed command Special Character in matching pattern for range [duplicate]

NOTE: I am a noob at bash scripts and the awk command - please excuse any dumb mistakes I make.
I am unable to substitute shell variables into my awk pattern. I am trying to scan through a file, find the first occurrence of a specific string in the file, and print each line that succeeds it, in order, until it hits an empty string/line.
I don't know the string I am searching for in advance, and I would like to substitute in that variable.
When I run this with the string directly specified (e.g. "<main>:"), it works perfectly. I've already searched on how awk patterns work and how to substitute in variables. I've tried using the -v flag for awk, and directly using the shell variable - nothing works.
funcName="<${2}>:"
awk=`awk -v FN="$funcName" '/FN/,/^$/' "$ofile"`
rfile=search.txt
echo -e "$awk" > "$rfile"
The error is just that nothing prints. I want to print all the lines between my desired string and the next empty line.
Could you please try the following; I haven't tested it because there are no clear samples, but it should work.
funcName="<${2}>:"
awk_result=$(awk -v FN="$funcName" 'index($0,FN){found=1} found; /^$/{found=""}' "$ofile")
rfile=search.txt
echo -e "$awk_result" > "$rfile"
Things fixed in OP's attempt:
NEVER give a variable the same name as a binary or a keyword, so the awk variable has been renamed to awk_result.
Use of backticks is deprecated now, so always capture command output with var=$(...your commands...); this is fixed here for the awk_result variable.
Now for the awk code fix: I have used the index function, which checks whether the value of variable FN is present in a line; if so, a flag variable is set to TRUE, lines are printed while it is set, and it is reset when an empty line is reached, as per the OP's request.
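For illustration, suppose $2 is main, so funcName becomes <main>:, and "$ofile" is a hypothetical disassembly-style listing like:
0000000000001129 <main>:
  push   %rbp
  mov    %rsp,%rbp

0000000000001140 <helper>:
  ret
The awk command then prints the <main>: line and everything following it up to (and including) the first empty line:
0000000000001129 <main>:
  push   %rbp
  mov    %rsp,%rbp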

Need a guide to basic command-line awk syntax

I have read several awk tutorials and seen a number of questions and answers on here, and the problem is that I'm seeing a LOT of variety in how people write their awk one-liners, which has really overcomplicated it in my mind.
So I see things like this:
awk '/pattern/ { print }'
awk '/pattern/ { print $0 }'
awk '/pattern/ { print($0) }'
awk '/pattern/ { print($0); }'
awk 'BEGIN { print }'
awk '/pattern/ BEGIN { print };'
Sometimes I get errors and sometimes not but because I'm seeing so many different phrasings I'm really having trouble fixing syntax errors because I can't figure out what's allowed and what isn't.
Can someone explain this? Does print require parens or not? Are semi-colons required or not? Is BEGIN required or not? What happens when you start an awk script with a /pattern/, and/or just pass it the name of a function like print on its own?
One at a time:
Can someone explain this?
Yes.
Does print require parens or not?
print, like return, is a builtin, not a function, and as such does not use parens at all. When you see print("foo") the parens are associated with the string "foo"; they are NOT in any way part of the print command, despite how it looks. It might be clearer (but still not useful in this case) to write it as print ("foo").
Are semi-colons required or not?
Not when the statements are on separate lines. Like in shell, semicolons are required to separate statements that occur on a single line.
Is BEGIN required or not?
No. Note that BEGIN is a keyword that represents the condition that exists before the first input file is opened for reading so BEGIN{print} will just print a blank line since nothing has been read to print. Also /pattern/ BEGIN is nonsense and should produce a syntax error.
What happens when you start an awk script with a /pattern/, and/or just pass it the name of a function like print on its own?
An awk script is made up of condition { <action> } sections, with the default condition being TRUE and the default action being print $0. So awk '/pattern/' means: if the regexp "pattern" matches the current record, then invoke the default action, which is to print that record; and awk '{ print }' means the default condition of TRUE applies, so execute the specified action and print the current record. Note also that print by default prints the current record, so print $0 is synonymous with just print.
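A small sketch pulling those pieces together (file is a placeholder name): BEGIN runs before any input is read, the two statements in the middle action share a line so they are separated by a semicolon, and END runs after the last record:
awk 'BEGIN { n = 0 } /foo/ { n++; print } END { print n " matching lines" }' file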
If you are considering starting to use awk, get the book Effective Awk Programming by Arnold Robbins and at least read the first chapter or 2.
Function calls require (). Statements do not (but appear to allow them).
print and printf are statements, so they do not require () (but support it: "The entire list of items may be optionally enclosed in parentheses.")
From print we also find out that
The simple statement ‘print’ with no items is equivalent to ‘print $0’: it prints the entire current record.
So we now know that the first three statements are identical.
From Actions we find out that
An action consists of one or more awk statements, enclosed in curly braces (‘{…}’).
and that
The statements are separated by newlines or semicolons.
This tells us that the semicolon is a "separator" and not a terminator, so we don't need one at the end of an action, and we now know the fourth is also identical.
BEGIN is a special pattern and that
[a] BEGIN rule is executed once only, before the first input record is read.
So the fifth is different because it operates once at the start and not on every line.
And the last is a syntax error because it has two patterns next to each other without an intervening action or separator.
All of those awk commands (except the last 2) can be shortened to:
awk '/pattern/' file
since print (i.e. print $0) is the default action in awk.
Semicolon is optional just before }.
You cannot place BEGIN after /pattern/

extract string conditionally from variable column sized text file

From a text file with a variable number of columns per row (tab-delimited), I would like to extract values matching a specific condition.
The text file looks like:
S1=dhs Sb=skf S3=ghw QS=ghr
S1=dhf QS=thg S3=eiq
QS=bhf S3=ruq Gq=qpq GW=tut
Sb=ruw QS=ooe Gq=qfj GW=uvd
I would like to have a result like:
QS=ghr
QS=thg
QS=bhf
QS=ooe
Please excuse my naive question but I am a beginner trying to learn some basic bash scripting technique for text manipulation.
Thanks in advance!
You could use awk,
awk '{for(i=1;i<=NF;i++){if($i~/^QS=/){print $i}}}' file
This awk command iterates through the fields and checks for a column that starts with the string QS=. If it finds one, the corresponding column is printed.
Through grep,
grep -oP '(^|\t)\KQS=\S*' file
-o parameter means only matching. So it prints only the characters which are matched.
-P this enables the Perl-regex mode.
(^|\t) matches the start of a line or a tab character.
\K discards the previously matched tab or start of the line boundary.
QS= Now it matches the QS= string.
\S* Matches zero or more non-space characters.
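Run against the sample file above, both commands should print:
QS=ghr
QS=thg
QS=bhf
QS=ooe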
