programatically creating regex with sed for use in gawk

programatically creating regex with sed for use in gawk - bash

I am trying to create a big regex from many options in a file, to be used in gawk. The goal is to find matches in lines.txt which match ANY of the options in regex.txt
File of lines to be searched
echo -n "dog
cat
bobcat" > lines.txt
File of regular expressions which will be combined into a big regex
echo -n "dog
cat" > regex.txt
I know the structure of what I am trying to do, but when I use sed to insert positional matching characters into the regex I get a trailing |.
This is what I currently have
rgx=$(cat "regex.txt" | sed 's#^#\\\\<#' | tr '\n' '|')
gawk -v regex=$rgx 'BEGIN {IGNORECASE = 1} {print gsub(regex,"")}' lines.txt
Current output from gawk is
1
1
7
Desired output from gawk is
1
1
0
Please help

It makes no sense to also use sed when you're using awk. It sounds like you want something like:
gawk '
BEGIN { IGNORECASE = 1 }
NR == FNR {
regex = (NR>1 ? regex "|" : "") "\\<" $0 "\\>"
next
}
{ print gsub(regex,"") }
' regex.txt lines.txt
1
1
0

#Stef's comment gets me to the desired output.
My sed was inserting a newline at the end, which was getting replaced by the | and this was causing the unexpected behavior.
So the working regex is rgx=$(cat "regex.txt" | sed 's#^#\\\\<#' | tr '\n' '|' | sed 's#|$##')
But, as Ed Morton's accepted answer shows, this can be done more elegantly using only gawk. I clearly need to learn more about awk!

Related

awk match by variable with dot in it

I have a script that will iterate over a file containing domains (google.com, youtube.com, etc). The purpose of the script is to check how many times each domain is included in the 12th column of a tab seperated value file.
while read domain; do
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt
However awk seems to be interpretating the dots in the domains as a special character. The following error message is shown:
awk: syntax error at source line 1
context is
$12 ~ >>> google. <<< com
awk: bailing out at source line 1
I am a beginner in bash so any help would be greatly appreciated!

When you write:
domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt
the $domain is outside of any quotes:
awk -F '\t' '$12 == '$domain' ' data.txt
< > < >
start end start end
and so exposed to the shell for interpretation first and THEN it becomes part of the body of the awk script before awk sees it. So what awk sees is:
awk -F '\t' '$12 == google.com' data.txt
and google.com is not a valid symbol (e.g. variable or function) name nor string nor number. What you MEANT to do was:
awk -F '\t' '$12 == "'"$domain"'"' data.txt
so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:
awk -F '\t' '$12 == "google.com"' data.txt
which is fine as now "google.com" is a string, not a symbol BUT you should never allow shell variables to expand to become part of an awk script as there are other caveats so what you should really have done is:
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt
See How do I use shell variables in an awk script? for more information.
By the way, even after fixing the above problem do not do this:
while read domain; do
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt
as it'll be immensely slow and contains insidious bugs (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). Do something like this instead (untested):
awk -F'\t' '
NR==FNR {
cnt[$1] = 0
next
}
$12 in cnt {
cnt[$12]++
}
END {
for ( dom in cnt ) {
print dom, cnt[dom]
}
}
' domains.txt data.txt
That will be far more efficient, robust, and portable than calling awk inside a shell read loop.
See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.

awk -F '\t' '$12 == '$domain'' data.txt | wc -l
The single quotes are building an awk program. They are not something visible to awk. So awk sees this:
$12 == google.com
Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.
awk -F '\t' '$12 == "'"$domain"'"' data.txt
The quotes jammed together like that are a little confusing, but it's just this:
'....' stuff to send to awk. Single quotes are for the shell.
'..."...' a double quote inside the awk program for awk to see
'...'"..." stuff in double quotes _outside_ the awk program for the shell
We can combine those like this:
'..."'"$var"'"...'
That's a bunch of literal awk code ending in a double-quote, followed by the expansion of the shell parameter var, which is double-quoted as usual in the shell for safety, followed by more literal awk code starting with a double quotes. So the end result is a string passed to awk that includes the value of the var inside double quotes.
But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:
awk -v domain="$domain" '$12 == domain' data.txt
Since the domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error with your domains; if you hadn't, though, awk would have treated them as empty and been looking for lines whose twelfth field was likewise blank.)

Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:
cut -f12 data.txt | sort | uniq -c

This should give the count of how many lines of the input has "google.com" in $12
{m,g}awk -v __="${domain}" '
BEGIN { _*=\
( _ ="\t[^\t]*")*gsub(".",(_)_,_)*sub(".","",_)*\
gsub("[.:&=/-]","[&]",__)*sub("[[][^[]+$",__"\t?",_)*(\
FS=_ } { _+=NF } END { print _-NR }'

Using sed command in shell script for substring and replace position to need

I’m dealing data on text file and I can’t find a way with sed to select a substring at a fixed position and replace it.
This is what I have:
X|001200000000000000000098765432|1234567890|TQ
This is what I need:
‘X’,’00000098765432’,’1234567890’,’TQ’
The following code in sed gives the substring I need (00000098765432) but not overwrites position to need
echo “ X|001200000000000000000098765432|1234567890|TQ” | sed “s/
*//g;s/|/‘,’/g;s/^/‘/;s/$/‘/“
Could you help me?

Rather than sed, I would use awk for this.
echo "X|001200000000000000000098765432|1234567890|TQ" | awk 'BEGIN {FS="|";OFS=","} {print $1,substr($2,17,14),$3,$4}'
Gives output:
X,00000098765432,1234567890,TQ
Here is how it works:
FS = Field separator (in the input)
OFS = Output field separator (the way you want output to be delimited)
BEGIN -> think of it as the place where configurations are set. It runs only one time. So you are saying you want output to be comma delimited and input is pipe delimited.
substr($2,17,14) -> Take $2 (i.e. second field - awk begins counting from 1 - and then apply substring on it. 17 means the beginning character position and 14 means the number of characters from that position onwards)
In my opinion, this is much more readable and maintainable than sed version you have.

If you want to put the quotes in, I'd still use awk.
$: awk -F'|' 'BEGIN{q="\047"} {print q $1 q","q substr($2,17,14) q","q $3 q","q $4 q"\n"}' <<< "X|001200000000000000000098765432|1234567890|TQ"
'X','00000098765432','1234567890','TQ'
If you just want to use sed, note that you say above you want to remove 16 characters, but you are actually only removing 14.
$: sed -E "s/^(.)[|].{14}([^|]+)[|]([^|]+)[|]([^|]+)/'\1','\2','\3','\4'/" <<< "X|0012000000000000000098765432|1234567890|TQ"
'X','00000098765432','1234567890','TQ'

Using sed
$ sed "s/|\(0[0-9]\{15\}\)\?/','/g;s/^\|$/'/g" input_file
'X','00000098765432','1234567890','TQ'

Using any POSIX awk:
$ echo 'X|001200000000000000000098765432|1234567890|TQ' |
awk -F'|' -v OFS="','" -v q="'" '{sub(/.{16}/,"",$2); print q $0 q}'
'X','00000098765432','1234567890','TQ'

not as elegant as I hoped for, but it gets the job done :
'X','00000098765432','1234567890','TQ'
# gawk profile, created Mon May 9 21:19:17 2022
# BEGIN rule(s)
'BEGIN {
1 _ = sprintf("%*s", (__ = +2)^++__+--__*++__,__--)
1 gsub(".", "[0-9]", _)
1 sub("$", "$", _)
1 FS = "[|]"
1 OFS = "\47,\47"
}
# Rule(s)
1 (NF *= NF == __*__) * sub(_, "|&", $__) * \
sub("^.*[|]", "", $__) * sub(".+", "\47&\47") }'
Tested and confirmed working on gnu gawk 5.1.1, mawk 1.3.4, mawk 1.9.9.6, and macosx nawk
— The 4Chan Teller

awk -v del1="\047" \
-v del2="," \
-v start="3" \
-v len="17" \
'{
gsub(substr($0,start+1,len),"");
gsub(/[\|]/,del1 del2 del1);
print del1$0del1
}' input_file
'X',00000098765432','1234567890','TQ'

Bash + sed/awk/cut to delete nth character

I trying to delete 6,7 and 8th character for each line.
Below is the file containing text format.
Actual output..
#cat test
18:40:12,172.16.70.217,UP
18:42:15,172.16.70.218,DOWN
Expecting below, after formatting.
#cat test
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
Even I tried with below , no luck
#awk -F ":" '{print $1":"$2","$3}' test
18:40,12,172.16.70.217,UP
#sed 's/^\(.\{7\}\).\(.*\)/\1\2/' test { Here I can remove only one character }
18:40:1,172.16.70.217,UP
Even with cut also failed
#cut -d ":" -f1,2,3 test
18:40:12,172.16.70.217,UP
Need to delete character in each line like 6th , 7th , 8th
Suggestion please

With GNU cut you can use the --complement switch to remove characters 6 to 8:
cut --complement -c6-8 file
Otherwise, you can just select the rest of the characters yourself:
cut -c1-5,9- file
i.e. characters 1 to 5, then 9 to the end of each line.
With awk you could use substrings:
awk '{ print substr($0, 1, 5) substr($0, 9) }' file
Or you could write a regular expression, but the result will be more complex.
For example, to remove the last three characters from the first comma-separated field:
awk -F, -v OFS=, '{ sub(/...$/, "", $1) } 1' file
Or, using sed with a capture group:
sed -E 's/(.{5}).{3}/\1/' file
Capture the first 5 characters and use them in the replacement, dropping the next 3.

it's a structured text, why count the chars if you can describe them?
$ awk '{sub(":..,",",")}1' file
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
remove the seconds.

The solutions below are generic and assume no knowledge of any format. They just delete character 6,7 and 8 of any line.
sed:
sed 's/.//8;s/.//7;s/.//6' <file> # from high to low
sed 's/.//6;s/.//6;s/.//6' <file> # from low to high (subtract 1)
sed 's/\(.....\).../\1/' <file>
sed 's/\(.{5}\).../\1/' <file>
s/BRE/replacement/n :: substitute nth occurrence of BRE with replacement
awk:
awk 'BEGIN{OFS=FS=""}{$6=$7=$8="";print $0}' <file>
awk -F "" '{OFS=$6=$7=$8="";print}' <file>
awk -F "" '{OFS=$6=$7=$8=""}1' <file>
This is 3 times the same, removing the field separator FS let awk assume a field to be a character. We empty field 6,7 and 8, and reprint the line with an output field separator OFS which is empty.
cut:
cut -c -5,9- <file>
cut --complement -c 6-8 <file>

Just for fun, perl, where you can assign to a substring
perl -pe 'substr($_,5,3)=""' file

With awk :
echo "18:40:12,172.16.70.217,UP" | awk '{ $0 = ( substr($0,1,5) substr($0,9) ) ; print $0}'
Regards!

If you are running on bash, you can use the string manipulation functionality of it instead of having to call awk, sed, cut or whatever binary:
while read STRING
do
echo ${STRING:0:5}${STRING:9}
done < myfile.txt
${STRING:0:5} represents the first five characters of your string, ${STRING:9} represents the 9th character and all remaining characters until the end of the line. This way you cut out characters 6,7 and 8 ...

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but the comma delimiters are missing then:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
trying to replace the " " with ","
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution

I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10

cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here and input filename is called data.txt

Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.

Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
i need a way to use cat ,grep or awk to give the following output:
test1
My2
Hi3
How can i achieve this in a single command? something like
cat file.txt | grep ... | awk ...
Note that its always a string followed by a number in the original text file.

sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt

paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.

awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question requires the use of cat and grep. Both of those programs are unnecessary to achieve your stated aims. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.

If your input file is always 1 number then 1 string, and you only want the strings, all you have to do is take every other line.
If you only want the odd lines, you can do awk 'NR % 2' file.txt
If you want the evens, this becomes awk 'NR % 2==0' data

Here is the answer:
cat file.txt | awk 'BEGIN { lno = 0 } { val=$0; if (lno % 2 == 1) {printf "%s\n", $0} else {printf "%s", $0}; ++lno}'

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

programatically creating regex with sed for use in gawk - bash

It makes no sense to also use sed when you're using awk. It sounds like you want something like: gawk ' BEGIN { IGNORECASE = 1 } NR == FNR { regex = (NR>1 ? regex "|" : "") "\\<" $0 "\\>" next } { print gsub(regex,"") } ' regex.txt lines.txt 1 1 0

Related

awk match by variable with dot in it

Using sed command in shell script for substring and replace position to need

Bash + sed/awk/cut to delete nth character

Shell command for inserting a newline every nth element of a huge line of comma separated strings

Join lines based on pattern

Categories

Resources