AWK Finding a way to print lines containing a word from a comma separated string - bash

I want to write a bash script that only prints lines that, in their second column, contain a word from a delimiter-separated string (semicolons in this example). Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable. That means that "ab" won't match, and neither will "abcdef". It seems so simple, yet after trying for many hours I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03

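Grep is a good choice here, as long as the words cannot also appear as whole words in other columns (grep -w matches anywhere on the line):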
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt

You may use this awk:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234
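Wrapping both the word list and the field in the separator is what forces a whole-word match: ";abc;" occurs in ";abc;def;ghi;jkl;", while ";ab;" and ";abcdef;" do not. Here is a minimal sketch applying the same idea to the sample from the question's EDIT; the function name filter_by_field2 is made up for illustration.

```shell
# Hypothetical helper: print the lines read from stdin whose 2nd ';'-separated
# field is exactly one of the words in the ';'-separated list given as $1.
filter_by_field2() {
    awk -F';' -v s=";$1;" 'index(s, FS $2 FS)'
}

result=$(printf '%s\n' \
    '1;UNH;buy;344.74' \
    '2;PG;sell;138.60' \
    '3;MSFT;sell;237.64' \
    '4;TSLA;sell;707.03' | filter_by_field2 'PG;TSLA')

printf '%s\n' "$result"
```

Because index() does a plain substring search, regex metacharacters in the words need no escaping.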


How to print keys from all key-value pairs

Text file looks like this:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
How can I extract keys so that:
key11|key12|key13
key21|key22|key23
I have tried this, unsuccessfully:
awk '{ gsub(/[^[|]=]+=/,"") }1' file.txt
It gives back the original data:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
Since you tagged bash
while IFS='=|' read -ra words; do
n=${#words[@]}
for ((i=1; i<n; i+=2)); do
unset words[i]
done
( IFS='|'; echo "${words[*]}" )
done < file
gawk
This can be done by awk, by setting FS and OFS :
kent$ awk -F'=[^|]*' -v OFS="" '$1=$1' file
key11|key12|key13
key21|key22|key23
or safer: awk -F.... '{$1=$1}1' file
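The '{$1=$1}1' form is safer because a bare $1=$1 acts as the pattern: the line is printed only when the assigned value is true, so a first field that is empty or 0 suppresses the line. Here is a small sketch of the difference; the second input line, with a key of 0, is invented for the demonstration.

```shell
input='key11=val1|key12=val2
0=val1|key22=val2'

# Assignment used as the pattern: printed only when $1 is non-empty and non-zero.
as_pattern=$(printf '%s\n' "$input" | awk -F'=[^|]*' -v OFS="" '$1=$1')

# Assignment inside an action, followed by the always-true pattern 1:
# every line is rebuilt and printed.
as_action=$(printf '%s\n' "$input" | awk -F'=[^|]*' -v OFS="" '{$1=$1}1')

printf 'pattern form:\n%s\n' "$as_pattern"
printf 'action form:\n%s\n' "$as_action"
```

The pattern form drops the line whose key is 0, while the action form keeps both lines.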
substitution (by sed for example):
kent$ sed 's/=[^|]*//g' file
key11|key12|key13
key21|key22|key23
Here's one solution
echo "key11=val1|key12=val2|key13=val3" \
| awk -F'[=|]' '{
for (i=1;i<=NF;i+=2){
printf("%s%s", $i, (i<(NF-1))?"|":"")
}
print""
}'
output
key11|key12|key13
It should also work by passing in the filename as an argument to awk, i.e.
awk -F'[=|]' '{for (i=1;i<=NF;i+=2){printf("%s%s", $i, (i<(NF-1))?"|":"") }print""}' file1 [file_more_as_will_fit]
Discussion
We use a bracket expression as the value of FS (the field separator), so each = and | character marks the beginning of a new field.
-F'[=|]'
Because we know we want to start with field1 for output and skip every other field, we use
for (i=1;i<=NF;i+=2)
printf formats the output as defined by the format string '%s%s'. There are a zillion options available for printf format strings, but you only need the value of $i (the field selected by the loop, which holds the key) and whether to print a | char or not.
printf("%s%s", $i ...)
And we use awk's ternary operator, which evaluates what element number is being processed (i<..). As long as it is not the 2nd to last field, the | char is emitted.
(i<(NF-1))?"|":""
IHTH
sed
I did this with sed:
sed -r 's/([[:alnum:]]*)=[[:alnum:]]*/\1/g' < file.txt
I tested it and got:
key11|key12|key13
key21|key22|key23
s/<pattern>/<subst>/ means "replace <pattern> by <subst>", and with the g in the end it will do it for every pattern found in the line.
The [[:alnum:]]* is equivalent to [0-9a-zA-Z]*, and means any number of letters or digits.
The first pattern between parentheses corresponds to \1 in the substitution, the second to \2, and so on.
So, it will match every "key=value" and replace it by "key".
awk -F'[=|]' '{print $1,$3,$5}' OFS="|" file
key11|key12|key13
key21|key22|key23

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the output of fields ignoring the escaped commas. Here those will be fields 2 or 3 like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since first gsub is performed on the whole record (i.e $0), awk is forced to recompute fields. But the second one is performed on only second field (i.e $2), so it will not affect other fields. See: Changing Fields.
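The field recomputation can be observed directly by printing NF before and after the substitution. A minimal sketch, using the first sample line from the question:

```shell
line='john,science\,social,football,florence_school'

# Without the substitution the escaped comma still splits the record.
before=$(printf '%s\n' "$line" | awk -F',' '{ print NF }')

# gsub() on $0 replaces "\," with a newline, and awk re-splits the record.
after=$(printf '%s\n' "$line" | awk -F',' '{ gsub(/\\,/, "\n"); print NF }')

printf 'fields before: %s, after: %s\n' "$before" "$after"
```

The count drops from 5 to 4 because "science\,social" is now a single field.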
To be able to extract multiple fields with properly escaped commas you need to gsub \ns in all fields with a for loop as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
You could replace the \, sequences by another character that won't appear in your text, split the text around the remaining commas then replace the chosen character by commas :
sed $'s/\\\,/\31/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\31' ','
In this case I use the ASCII control character $'\31', which I'm pretty sure your input won't contain. (Bash reads \31 as octal, which is End of Medium; the Unit Separator, decimal 31, would be $'\037'.)
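Here is a sketch of the same sentinel technique reduced to one field, assuming the Unit Separator (octal 037) never occurs in the data. The variable names are illustrative only.

```shell
sep=$(printf '\037')   # ASCII Unit Separator, written in octal
line='james,painting,tennis\,ping_pong\,chess,highmount_school'

# 1. hide every "\," behind the sentinel, 2. split on the remaining commas,
# 3. turn the sentinel back into a plain comma.
sports=$(printf '%s\n' "$line" |
    sed "s/\\\\,/$sep/g" |
    awk -F, '{ print $3 }' |
    tr "$sep" ',')

printf '%s\n' "$sports"
```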
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[@]}"
echo "list_of_sports : ${list_of_sports[@]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than a solution using awk.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character, then use that character to iterate over the elements of the second and third fields.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace escaped commas by newlines, then swap newlines and commas. Remove every line (within the pattern space) that contains no comma, delete the remaining newlines, and delete empty lines.
Using Perl. Change the \, to some control char say \x01 and then replace it again with ,
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
$col=$col FS $(col+1)
for (i=col+1; i<NF; i++) {
$i=$(i+1)
}
NF--
}
This might get used thusly:
{
for (col=1; col<=NF; col++) {
while ($col ~ /\\$/) {  # loop: the joined field may itself end in a backslash
joincol(col)
}
}
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub to remove the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

print first 3 characters and / rest of the string with stars

I have input like this:
John:boofoo
I want to keep only the first 3 characters of the string after the colon and print the rest of it as stars.
The output will be like this
John:boo***
This is my command:
awk -F ":" '{print $1,$2 ":***"}'
I want to use only print command if possible. Thanks
With GNU sed:
echo 'John:boofoo' | sed -E 's/(:...).*/\1***/'
Output:
John:boo***
With GNU awk for gensub():
$ awk 'BEGIN{FS=OFS=":"} {print $1, substr($2,1,3) gensub(/./,"*","g",substr($2,4))}' file
John:boo***
With any awk:
awk 'BEGIN{FS=OFS=":"} {tl=substr($2,4); gsub(/./,"*",tl); print $1, substr($2,1,3) tl}' file
John:boo***
Could you please try the following. It keeps the first 3 characters of the 2nd field as they are and prints one star for each remaining character.
awk '
BEGIN{
FS=OFS=":"
}
{
stars=""
val=substr($2,1,3)
for(i=4;i<=length($2);i++){
stars=stars"*"
}
$2=val stars
}
1
' Input_file
Output will be as follows.
John:boo***
Explanation: the same code with comments.
awk '
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=":" ##Setting FS and OFS value as : here.
} ##Closing block of BEGIN section here.
{ ##Here starts main block of awk program.
stars="" ##Nullifying variable stars here.
val=substr($2,1,3) ##Creating variable val whose value is 1st 3 letters of 2nd field.
for(i=4;i<=length($2);i++){ ##Starting a for loop from 4 (because we need the 4th through the last character of the 2nd field) till length of 2nd field.
stars=stars"*" ##Keep concatenating stars variable to its own value with *.
}
$2=val stars ##Assigning value of variable val and stars to 2nd field here.
}
1 ##Mentioning 1 here to print edited/non-edited lines for Input_file here.
' Input_file ##Mentioning Input_file name here.
Or even with good old sed
$ echo "John:boofoo" | sed 's/...$/***/'
Output:
John:boo***
(note: this just replaces the last 3 characters of any string with "***", so if you need to key off the ':', see the GNU sed answer from Cyrus.)
Another awk variant:
awk -F ":" '{print $1 FS substr($2, 1, 3) "***"}' <<< 'John:boofoo'
John:boo***
Since we have the tags awk, bash and sed: for completeness sake here is a bash only solution:
INPUT="John:boofoo"
printf "%s:%s\n" "${INPUT%%:*}" "$(TMP1=${INPUT#*:};TMP2=${TMP1:3}; echo "${TMP1:0:3}${TMP2//?/*}")"
It passes two arguments to printf after the format string, both quoted so that the stars are not subject to filename expansion. The first one is INPUT with everything from the first : onward stripped. Let's break down the second argument $(TMP1=${INPUT#*:};TMP2=${TMP1:3}; echo "${TMP1:0:3}${TMP2//?/*}"):
$(...) the string is run as a bash command and its output is substituted as the last argument to printf
TMP1=${INPUT#*:}; remove everything up to and including the :, store the string in TMP1.
TMP2=${TMP1:3}; get all characters of TMP1 from offset 3 to the end and store them in TMP2.
echo "${TMP1:0:3}${TMP2//?/*}" output the temporary strings: the first three chars from TMP1 unmodified and all chars from TMP2 as *
the output of the last echo is the last argument to printf
Here is the bash -x output:
+ INPUT=John:boofoo
++ TMP1=boofoo
++ TMP2=foo
++ echo 'boo***'
+ printf '%s:%s\n' John 'boo***'
John:boo***
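The steps traced above can be wrapped in a function so that no temporary values leak into the caller. This is a sketch under the assumption that bash is the shell (it relies on ${var:offset} and ${var//pattern/repl}); the name mask_tail is hypothetical.

```shell
# Hypothetical helper: keep the first three characters after the first ':'
# and replace every following character with '*'.
mask_tail() {
    local input=$1
    local name=${input%%:*}    # text before the first ':'
    local value=${input#*:}    # text after the first ':'
    local rest=${value:3}      # everything after the first three characters
    printf '%s:%s%s\n' "$name" "${value:0:3}" "${rest//?/*}"
}

masked=$(mask_tail 'John:boofoo')
short=$(mask_tail 'Al:ab')     # fewer than three characters: nothing to mask

printf '%s\n%s\n' "$masked" "$short"
```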
Another sed : replace all chars after the third by *
sed -E ':A;s/([^:]*:...)(.*)[^*]([*]*)/\1\2\3*/;tA'
Some more awk
awk 'BEGIN{FS=OFS=":"}{s=sprintf("%0*d",length(substr($2,4)),0); gsub(/0/,"*",s);print $1,substr($2,1,3) s}' infile
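You can use the %* form of printf, which takes the field width from an argument. Printing the value 0 zero-padded to that width yields a run of zeros, which gsub then turns into stars. Note that when nothing follows the first three characters the width is 0, yet printf still prints a single "0", so one stray star is emitted.
More readable: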
awk 'BEGIN{
FS=OFS=":"
}
{
s=sprintf("%0*d",length(substr($2,4)),0);
gsub(/0/,"*",s);
print $1,substr($2,1,3) s
}
' infile
Test Results:
$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
$ cat f
John:boofoo
$ awk 'BEGIN{FS=OFS=":"}{s=sprintf("%0*d",length(substr($2,4)),0); gsub(/0/,"*",s);print $1,substr($2,1,3) s}' f
John:boo***
Another pure Bash, using the builtin regular expression predicate.
input="John:boofoo"
if [[ $input =~ ^([^:]*:...)(.*)$ ]]; then
printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]//?/*}"
else
echo >&2 "String doesn't match pattern"
fi
We split the string in two parts: the first part being everything up to (and including) the three chars found after the first colon (stored in ${BASH_REMATCH[1]}), the second part being the remaining part of string (stored in ${BASH_REMATCH[2]}). If the string doesn't match this pattern, we just insult the user.
We then print the first part unchanged, and the second part with every character replaced with *.

Iterative replacement of substrings in bash

I'm trying to write a simple script to make several replacements in a big text file. I have a "map" file that contains the records to be searched and replaced, one pair per line, separated by a space, and an "input" file where the changes need to be made. The example files and the script I wrote are below.
Map file
new_0 old_0
new_1 old_1
new_2 old_2
new_3 old_3
new_4 old_4
Input file
itsa(old_0)single(old_2)string(old_1)with(old_5)ocurrences(old_4)ofthe(old_3)records
Script
#!/bin/bash
while read -r mapline ; do
mapf1=`awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline"`
mapf2=`awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline"`
for line in $(cat "input") ; do
if [[ "${line}" == *"${mapf2}"* ]] ; then
sed "s/${mapf2}/${mapf1}/g" <<< "${line}"
fi
done < "input"
done < "map"
The thing is that the searches and replaces are made correctly, but I can't find a way to save the output of each iteration and work over it in the next. So, my output looks like this:
itsa(new_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(new_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(new_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(new_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(new_4)ofthe(old_3)records
Yet, the desired output would look like this:
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
Can anyone shed some light on these dark waters? Thanks in advance!
Improving the existing script
Improvements:
Use "$(...)" instead of backticks. It nests cleanly and is easier to read, and the surrounding double quotes preserve whitespace.
Don't execute sed for each line. sed already loops over all lines and is faster than a loop in bash.
The adapted script:
text="$(< input)"
while read -r mapline; do
mapf1="$(awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline")"
mapf2="$(awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline")"
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"
done < "map"
echo "$text"
The variable $text contains the complete input file and is modified in each iteration. The output of this script is the file after all replacements were done.
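Here is a variation on the adapted script, assuming bash: read -r can split each map line into two variables directly, and the ${text//old/new} expansion performs the replacement in the shell itself, so no awk or sed process is started per map line. Unlike with sed, the search text is taken literally here. The sample data is a shortened version of the question's files.

```shell
text='itsa(old_0)single(old_2)string(old_1)'

# Each map line is "<new> <old>"; read splits it on whitespace.
while read -r new old; do
    text=${text//"$old"/$new}   # replace every occurrence of $old with $new
done <<'EOF'
new_0 old_0
new_1 old_1
new_2 old_2
EOF

printf '%s\n' "$text"
```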
Alternative approach
Convert the map file into a pattern for sed and execute sed just once using that pattern.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
sed "$pattern" input
The first command is the conversion step. The file
new_0 old_0
new_1 old_1
...
will result in the pattern
s/old_0/new_0/g
s/old_1/new_1/g
...
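A self-contained sketch of the conversion step with the map inlined rather than read from a file. It assumes the map entries contain no / and no regex metacharacters, since they are pasted into the sed script unescaped.

```shell
map='new_0 old_0
new_1 old_1'

# Turn every "<new> <old>" line into an "s/<old>/<new>/g" command.
script=$(printf '%s\n' "$map" | sed 's#\(.*\) \(.*\)#s/\2/\1/g#')

# Run all generated commands in a single sed invocation.
result=$(printf '%s\n' 'a(old_0)b(old_1)c(old_5)' | sed "$script")

printf '%s\n---\n%s\n' "$script" "$result"
```

Because the commands run in order on the same line, an entry such as old_1 would also rewrite the start of old_10; anchoring the pattern on the surrounding parentheses is one way to avoid that.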
It is possible in GNU Awk as follows,
awk 'FNR==NR{hash[$2]=$1; next} \
{for (i=1; i<=NF; i++)\
{for(key in hash) \
{if (match ($i,key)) {$i=sprintf("(%s)",hash[key]); break}}}print}' \
map-file FS='[()]' OFS= input-file
produces an output as,
itsa(new_0)single(new_2)string(new_1)withold_5ocurrences(new_4)ofthe(new_3)records
Another in Gnu awk, using split and ternary operator(s):
$ awk '
NR==FNR { a[$2]=$1; next }
{
n=split($0,b,"[()]")
for(i=1;i<=n;i++)
printf "%s%s",(i%2 ? b[i] : (b[i] in a? "(" a[b[i]] ")":"")),(i==n?ORS:"")
}' map foo
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
First the map is read into a hash. When processing the file, each record is split on ( and ). Every other element (i%2==0) is a candidate key. While printing, the ternary operator tests whether that element is in a; if it is, the replacement is output in parentheses, otherwise nothing is output.

Bash split string according to string

In python, I would do something simple like sRet = sOut.split('Word')
In bash, scrounged from other answers, I have the following two methods that are insufficient in my case, but may be useful to someone in the future:
sOut="I want this Point to matter"
1) sRet=( $sOut )
2) IFS="Point " read -r -a sRet <<< ${sOut}
echo ${sRet[-1]}
I want returned: "to matter"
(1) gives: "matter"
(2) gives: "er"
The first only splits by spaces, the second splits by the last character, in this case it would be 't'.
How do I split by a full string, as I would in python?
sOut="I want this Point to matter"
s="Point "
[[ $sOut =~ $s(.*) ]] && echo ${BASH_REMATCH[1]}
Output:
to matter
IFS is a set of single-character delimiters, not a multi-character string, so you will need another tool. I'd suggest awk in this case:
$ awk -F 'Point' '{print $NF}' <<< "$sOut"
to matter
You can replace 'Point' with a variable holding the delimiter (keep in mind that a multi-character -F value is treated as a regular expression). You can also change which part of the split you get back. The variable $NF means "the last element". You can also use $1 for the first element, $2 for the second, and so on.
You can use awk for splitting the string:
text="I want this Point to matter"
s='Point'
awk -v s="$s" -v text="$text" 'BEGIN {split(text, a, "[[:blank:]]*" s "[[:blank:]]*");
for (i in a) print a[i]}'
I want this
to matter
To get only the last match:
awk -v s="$s" -v text="$text" 'BEGIN {n=split(text, a, "[[:blank:]]*" s "[[:blank:]]*"); print a[n]}'
to matter
Or:
awk -v s="$s" 'BEGIN{FS="[[:blank:]]*" s "[[:blank:]]*"} {print $NF}' <<< "$text"
to matter
IFS, on the other hand, doesn't work with a multi-character string: IFS='Point' will split the input on each of the characters P, o, i, n, t.
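When only the text before or after the delimiter is needed, plain parameter expansion can split on a full string without awk. A minimal sketch; the variable names delim, before and after are chosen here for illustration.

```shell
sOut="I want this Point to matter"
delim="Point "

before=${sOut%%"$delim"*}   # drop the first delimiter and everything after it
after=${sOut#*"$delim"}     # drop everything up to and including the first delimiter

printf '[%s]\n[%s]\n' "$before" "$after"
```

If the delimiter does not occur in the string, both expansions return the string unchanged.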
sDelim="Point"
sRet1=$(awk -F ${sDelim} '{print $1}' <<< ${sOut})
sRet2=$(awk -F ${sDelim} '{print $NF}' <<< ${sOut})
Given all the other excellent answers, I prefer this one most for the following reasons:
1) It's short and sweet
2) Everything is fairly explicit when wanting to use variables
3) Any elements can be selected: 1,2,.. from the beginning, NF, NF-1,.. from the end
4) if sDelim is not actually in sOut, the script doesn't freak out
Thanks mainly to @bishop for leading me to this
You could use the parenthesis feature of sed to retrieve
the string that is matched.
The below code:
sOut="I want this point to matter"
s="point "
echo $sOut | sed "s/.*$s\(.*\)/\1/"
would give me:
to matter
as output.
