Iterative replacement of substrings in bash

I'm trying to write a simple script to make several replacements in a big text file. I have a "map" file which contains the records to be searched and replaced, one per line, separated by a space, and an "input" file where I need the changes to be made. The example files and the script I wrote are below.
Map file
new_0 old_0
new_1 old_1
new_2 old_2
new_3 old_3
new_4 old_4
Input file
itsa(old_0)single(old_2)string(old_1)with(old_5)ocurrences(old_4)ofthe(old_3)records
Script
#!/bin/bash
while read -r mapline ; do
mapf1=`awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline"`
mapf2=`awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline"`
for line in $(cat "input") ; do
if [[ "${line}" == *"${mapf2}"* ]] ; then
sed "s/${mapf2}/${mapf1}/g" <<< "${line}"
fi
done < "input"
done < "map"
The thing is that the searches and replacements are made correctly, but I can't find a way to save the output of each iteration and work on it in the next. So my output looks like this:
itsa(new_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(new_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(new_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(new_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(new_4)ofthe(old_3)records
Yet, the desired output would look like this:
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
Can anyone shed some light on these dark waters? Thanks in advance!

Improving the existing script
Improvements:
Use "$()" instead of ``. It supports whitespace and is easier to read.
Don't execute sed for each line. sed already loops over all lines and is faster than a loop in bash.
The adapted script:
text="$(< input)"
while read -r mapline; do
mapf1="$(awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline")"
mapf2="$(awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline")"
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"
done < "map"
echo "$text"
The variable $text holds the complete input file and is modified in each iteration. The output of this script is the file after all replacements have been applied.
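One caveat: the map entries are interpolated into the sed expression, so any sed metacharacters in them (., *, /, &, ...) must be escaped first. A minimal sketch of such escaping (these extra sed calls are an assumption, not part of the original answer):
mapf2="$(sed 's/[]\/$*.^[]/\\&/g' <<< "$mapf2")"   # escape BRE metacharacters in the search string
mapf1="$(sed 's/[\/&]/\\&/g' <<< "$mapf1")"        # escape \, / and & in the replacement string
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"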
Alternative approach
Convert the map file into a pattern for sed and execute sed just once using that pattern.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
sed "$pattern" input
The first command is the conversion step. The file
new_0 old_0
new_1 old_1
...
will result in the pattern
s/old_0/new_0/g
s/old_1/new_1/g
...
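Equivalently, the generated sed script can be fed straight to a second sed without going through a shell variable. A sketch assuming GNU sed, whose -f - reads the script from standard input:
sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map | sed -f - input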

It is possible in GNU Awk as follows:
awk 'FNR==NR {hash[$2]=$1; next}
{for (i=1; i<=NF; i++)
{for (key in hash)
{if (match($i, key)) {$i=sprintf("(%s)", hash[key]); break}}} print}' \
map-file FS='[()]' OFS= input-file
produces this output (FS='[()]' takes effect only from the second file onward, and with OFS empty the parentheses around unmatched tokens are dropped when the record is rebuilt):
itsa(new_0)single(new_2)string(new_1)withold_5ocurrences(new_4)ofthe(new_3)records

Another in GNU awk, using split and ternary operator(s):
$ awk '
NR==FNR { a[$2]=$1; next }
{
n=split($0,b,"[()]")
for(i=1;i<=n;i++)
printf "%s%s",(i%2 ? b[i] : (b[i] in a? "(" a[b[i]] ")":"")),(i==n?ORS:"")
}' map foo
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
First you read the map into a hash. When processing the file, split each record on ( and ); for the sample line this gives b[1]="itsa", b[2]="old_0", b[3]="single", and so on, so every even-indexed piece (i%2==0) is a candidate key. While printfing, the ternary operators test whether that piece is found in a; when there is a match, the replacement is output parenthesized.

Related

AWK Finding a way to print lines containing a word from a comma separated string

I want to write a bash script that only prints lines that, on their second column, contain a word from a comma separated string. Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable in their second column. That means that "ab" won't match, and neither will "abcdef". It seems so simple, yet after trying for many, many hours I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03
Grep is a good choice here:
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
Note that -w matches the words anywhere on the line, not only in the second column; the awk answers below restrict the match to $2.
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt
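Applied to the edited sample from the question, the same script prints exactly the requested lines (a usage sketch; trades.txt is a hypothetical name for a file holding the four sample records):
filter="PG;TSLA"
awk -F ';' -v w="$filter" '
BEGIN { n = split(w, a, /;/); for (i=1; i<=n; i++) words[a[i]]=1 }
$2 in words
' trades.txt
2;PG;sell;138.60
4;TSLA;sell;707.03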
You may use this awk, which wraps both the word list and the second field in ; delimiters so that index() can only match whole words:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234

How to print keys from all key-value pairs

Text file looks like this:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
How can I extract keys so that:
key11|key12|key13
key21|key22|key23
I have tried, unsuccessfully:
awk '{ gsub(/[^[|]=]+=/,"") }1' file.txt
which just gives back the original data:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
Since you tagged bash:
while IFS='=|' read -ra words; do
n=${#words[@]}
for ((i=1; i<n; i+=2)); do
unset words[i]
done
( IFS='|'; echo "${words[*]}" )   # subshell keeps the IFS change local
done < file
gawk
This can be done in awk by setting FS and OFS:
kent$ awk -F'=[^|]*' -v OFS="" '$1=$1' file
key11|key12|key13
key21|key22|key23
or safer: awk -F.... '{$1=$1}1' file
Substitution (with sed, for example):
kent$ sed 's/=[^|]*//g' file
key11|key12|key13
key21|key22|key23
Here's one solution
echo "key11=val1|key12=val2|key13=val3" \
| awk -F'[=|]' '{
for (i=1;i<=NF;i+=2){
printf("%s%s", $i, (i<(NF-1))?"|":"")
}
print""
}'
output
key11|key12|key13
It should also work by passing in the filename as an argument to awk, i.e.
awk -F'[=|]' '{for (i=1;i<=NF;i+=2){printf("%s%s", $i, (i<(NF-1))?"|":"") }print""}' file1 [file_more_as_will_fit]
Discussion
We use a bracket expression for FS (Field Separator), so each = and | character marks the beginning of a new field.
-F'[=|]'
Because we know we want to start with field1 for output and skip every other field, we use
for (i=1;i<=NF;i+=2)
printf formats the output as defined by the format string '%s%s'. There are a zillion options available for printf format strings, but here you only need the value of $i (the looping value that selects the key) and whether to print a | character or not.
printf("%s%s", $i ...)
And we use awk's ternary operator, which checks which field number is being processed (i<..). As long as it is not the last key field (the second-to-last field overall), a | character is emitted.
(i<(NF-1))?"|":""
IHTH
sed
I did this with sed:
sed -r 's/([[:alnum:]]*)=[[:alnum:]]*/\1/g' < file.txt
Tested on the input above, it gives:
key11|key12|key13
key21|key22|key23
s/<pattern>/<subst>/ means "replace <pattern> by <subst>", and with the g at the end it will do it for every match found in the line.
The [[:alnum:]]* is equivalent to [0-9a-zA-Z]*, and means any number of letters or digits.
The first pattern between parentheses will correspond to \1 in the substitution, the second to \2, and so on.
So, it will match every "key=value" and replace it by "key".
awk -F'[=|]' '{print $1,$3,$5}' OFS="|" file
key11|key12|key13
key21|key22|key23

trying to remove certain characters from inside of double quotes but leave them intact as delimiters

Remove characters (semicolons) from inside of a quoted string, but keep them intact as delimiters. How do I get sed etc. to do this?
My input file
"1234";"ABCDE;";"9999"
"2344;";"PQRST"; "3456;"
My output file needs to be cleaned up to look like
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
As seen above the semicolons need to be retained as delimiters but need to be removed from the quoted parts. Would anyone be able to let me know? Thanks.
I am actually doing some Hive programming, and my Hive scripts are ready and running successfully (as tested on smaller sample data sets). Now those same scripts are giving me errors, since these new big data sets are not clean, so I am trying to clean them up (and learning sed etc. along the way).
regards,
Rahul
I think you would be better off with a CSV parser.
If you have gawk (FPAT was added in gawk 4.0), you can use the FPAT variable, which describes what a field looks like rather than what separates fields. Try:
gawk 'BEGIN { FPAT="([^; ]+)|(\"[^\"]+\")"; OFS=";" } { for (i=1;i<=NF;i++) gsub(/;/, "", $i) }1' file
Results:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
If, for whatever reason, you cannot easily upgrade your distro, here's a solution using Perl and the CPAN module Text::CSV:
perl -MText::CSV -nle '
BEGIN { $csv = Text::CSV->new({ sep_char => ";", allow_whitespace => 1 }) }
$csv->parse($_) or die;
print join(";", map { s/;//g; s/^|$/"/g; $_ } $csv->fields())
' file
Results:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
Suppose that you don't have a record like this:
";";";";";"
You can break your task into these steps:
cat input
"1234";"ABCDE;";"9999"
"2344;";"PQRST"; "3456;"
sed -r 's#"\s*;\s*"#|#g'
"1234|ABCDE;|9999"
"2344;|PQRST|3456;"
sed -r 's#[";]##g'
1234|ABCDE|9999
2344|PQRST|3456
sed -r 's#[^|]+#"&"#g'
"1234"|"ABCDE"|"9999"
"2344"|"PQRST"|"3456"
sed -r 's#\|#;#g'
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
Put all commands into one:
sed -r 's#"\s*;\s*"#|#g;s#[";]##g;s#[^|]+#"&"#g;s#\|#;#g' input
sed
kent$ sed -r 's/;"(;"|$)/"\1/g' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
awk
One-liner, longer version:
kent$ awk -F'"' -v OFS='"' '{for(i=1;i<=NF;i++)if($i~/\S+;$/){sub(/;$/,"",$i)}}7' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
One-liner, shorter, but leaving trash (a stray ") on the last line:
kent$ awk -v RS='"' -v ORS='"' '/\S+;$/{sub(/;$/,"")}7' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
"
Here is one way to do it with awk (FS="" makes every character its own field, a gawk extension also supported by BWK awk):
awk '
{for (i=1;i<=NF;i++) {
if ($i=="\"") f=!f
if ($i==";" && f) $i=x
printf "%s", $i}
} {print ""}
' FS="" file
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
This tests whether a ; sits between a pair of " characters; if it does, it is removed.
To also remove space between fields, use this:
awk '
{for (i=1;i<=NF;i++) {
if ($i=="\"") f=!f
if ($i==";" && f) $i=x
if ($i==" " && !f) $i=x
printf "%s", $i}
} {print ""}
' FS="" file

for loop and if statements in awk

I am a biologist who is starting to have to learn some elementary scripting skills to deal with large DNA sequence data sets, so please go easy on me. I am doing this all in bash. I have a file with my data formatted like this:
CLocus_58919_Sample_25_Locus_33235_Allele_0
TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
CLocus_58919_Sample_9_Locus_54109_Allele_0
TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
What I need is to do is loop through this file and write all the sequences from the same sample into their own file. Just to be clear, these sequences come from samples 25 and 9. So my idea was to use awk to reformat my file in the following way:
CLocus_58919_Sample_25_Locus_33235_Allele_0_TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
CLocus_58919_Sample_9_Locus_54109_Allele_0_TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
then pipe this into another awk if statement to say "if sample=$i then write out that entire line to a file named sample.$i" Here is my code so far:
#!/bin/bash
a=`ls /scratch/tkchafin/data/raw | wc -l`;
b=1;
c=$((a-b));
mkdir /scratch/tkchafin/data/phylogenetics
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ '{if($4==$i) print}' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done;
I understand this is not working because $i is inside single quotes, so bash is not expanding it. I know awk has a -v option for passing external variables to it, but I don't know how I would apply that in this case. I tried to move the for loop inside the awk statement, but this does not produce the desired result either. Any help would be much appreciated.
You can have awk write directly to the desired output file, without a shell loop (in awk, > fn truncates the file only on its first use in the run and appends afterwards, so all records for a sample end up in the same file):
awk -F_ '(NR % 2) == 1 { line1 = $0; fn="/scratch/tkchafin/data/phylogenetics/sample."$4; }
(NR % 2) == 0 { print line1"_"$0 > fn; }' "$1"
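If the input contains many distinct samples, some awk implementations run out of open file descriptors. A sketch of the same idea with close() (the close() call and the switch to >> are additions, not part of the answer above, and assume the output directory starts out empty):
awk -F_ '(NR % 2) == 1 { line1 = $0; fn="/scratch/tkchafin/data/phylogenetics/sample."$4; }
(NR % 2) == 0 { print line1"_"$0 >> fn; close(fn); }' "$1"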
But to show how you would use -v in your version (note that this rereads and reformats the whole input once per sample), it would be:
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ -v i=$i '$4 == i' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done;

Explode to Array

I put together this shell script to do two things:
Change the delimiters in a data file ('::' to ',' in this case)
Select the columns I want and append them to a new file
It works, but I want a better way to do this. I specifically want to find an alternative method for exploding each line into an array. Using command line arguments doesn't seem like the way to go. ANY COMMENTS ARE WELCOME.
# Takes :: separated file as 1st parameters
SOURCE=$1
# create csv target file
TARGET=${SOURCE/dat/csv}
touch $TARGET
echo "#userId,itemId" > $TARGET
IFS=","
while read LINE
do
# Replaces all matches of :: with a ,
CSV_LINE=${LINE//::/,}
set -- $CSV_LINE
echo "$1,$2" >> $TARGET
done < $SOURCE
Instead of set, you can use an array:
arr=($CSV_LINE)
echo "${arr[0]},${arr[1]}"
The following would print columns 1 and 2 from infile.dat. Replace $1, $2 with
a comma-separated list of the numbered columns you do want.
awk 'BEGIN { FS="::"; OFS="," } { print $1, $2 }' infile.dat > infile.csv
Perl probably has a one-liner to do it.
Awk can probably do it easily too.
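For instance, a minimal Perl sketch (an assumption, not from the answers here) that autosplits each line on :: and prints the first two columns comma-joined:
perl -F'::' -lane 'print join(",", @F[0,1])' infile.dat > infile.csv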
My first reaction is a combination of awk and sed:
Sed to convert the delimiters
Awk to process specific columns
cat inputfile | sed -e 's/::/,/g' | awk -F, -v OFS=, '{print $1, $2}'
# Or, to avoid a UUOC award (and prolong the life of your keyboard by a few characters):
sed -e 's/::/,/g' inputfile | awk -F, -v OFS=, '{print $1, $2}'
awk is indeed the right tool for the job here; it's a simple one-liner.
$ cat test.in
a::b::c
d::e::f
g::h::i
$ awk -F:: -v OFS=, '{$1=$1;print;print $2,$3 >> "altfile"}' test.in
a,b,c
d,e,f
g,h,i
$ cat altfile
b,c
e,f
h,i
$
