Replace multiple patterns, but not with the same string - bash

Is it possible to change multiple patterns to different values in the same command?
Let's say I have
A B C D ABC
and I want to change every A to 1, every B to 2, and every C to 3
so the output will be
1 2 3 D 123
Since I have 3 patterns to change, I would like to avoid substituting them separately.
I thought there would be something like
sed -r s/'(A|B|C)'/(1|2|3)/
but of course this just replaces A or B or C with (1|2|3).
I should just mention that my real patterns are more complicated than that...
thank you!

Easy in sed:
sed 's/WORD1/NEW_WORD1/g;s/WORD2/NEW_WORD2/g;s/WORD3/NEW_WORD3/g'
You can separate multiple commands on the same line with a ;
Update
Probably this was too easy. NeronLeVelu pointed out that the above command can lead to unwanted results, because the second substitution might touch the results of the first substitution (and so on).
If you care about this, you can avoid the side effect with the t command. The t command branches to the end of the script, but only if a substitution has happened since the current line was read. Note that this also means the later substitutions are skipped entirely for any line on which an earlier one matched:
sed 's/WORD1/NEW_WORD1/g;t;s/WORD2/NEW_WORD2/g;t;s/WORD3/NEW_WORD3/g'
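To see both behaviours concretely, here is a small sketch using a made-up mapping (A to B, then B to C) that is not from the question. The function names chained and guarded are hypothetical, and the commands are passed as separate -e options, which should behave like the ; form shown above.

```shell
# Hypothetical mapping for illustration only: A -> B, then B -> C.

# Plain chaining: the second substitution also rewrites the B produced by the first.
chained() {
    sed -e 's/A/B/g' -e 's/B/C/g'
}

# With t: once a substitution has succeeded on a line, sed jumps to the end of
# the script, so the remaining substitutions are skipped for that line.
guarded() {
    sed -e 's/A/B/g' -e t -e 's/B/C/g'
}

printf 'A B\n' | chained   # prints: C C
printf 'A B\n' | guarded   # prints: B B
```

In the guarded run the original B is left alone as well, so t gives "first matching rule per line", not "each rule applied to untouched text only". If a line can contain several of the patterns, a single-pass approach is probably the safer choice.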

Easy in Perl:
perl -pe '%h = (A => 1, B => 2, C => 3); s/(A|B|C)/$h{$1}/g'
If you use more complex patterns, put the more specific ones before the more general ones in the alternative list. Sorting by length might be enough:
perl -pe 'BEGIN { %h = (A => 1, AA => 2, AAA => 3);
$re = join "|", sort { length $b <=> length $a } keys %h; }
s/($re)/$h{$1}/g'
To add word or line boundaries, change the pattern to
/\b($re)\b/
or
/^($re)$/
respectively.

This will work if your "words" don't contain RE metachars (. * ? etc.):
$ cat file
there is the problem when the foo is closed
$ cat tst.awk
BEGIN {
split("the a foo bar",tmp)
for (i=1;i in tmp;i+=2) {
old = (i>1 ? old "|" : "\\<(") tmp[i]
map[tmp[i]] = tmp[i+1]
}
old = old ")\\>"
}
{
head = ""
tail = $0
while ( match(tail,old) ) {
head = head substr(tail,1,RSTART-1) map[substr(tail,RSTART,RLENGTH)]
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}
$ awk -f tst.awk file
there is a problem when a bar is closed
The above obviously maps "the" to "a" and "foo" to "bar" and uses GNU awk for word boundaries.
If your "words" do contain RE metachars etc. then you need a string-based solution using index() instead of an RE based one using match() (note that sed ONLY supports REs, not strings).
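As a rough sketch of such a string-based approach (not the answerer's code), the following scans each line once and compares literal substrings, so metacharacters need no escaping and replaced text is never rescanned. The mapping ("a.b" to X, "*" to Y) and the function name literal_replace are assumptions made for the example; keys have to be listed longest first if one is a prefix of another.

```shell
# Single-pass literal replacement in POSIX awk; no regular expressions involved.
literal_replace() {
    awk '
    BEGIN {
        # Hypothetical table: key/value pairs, longest keys first.
        n = split("a.b X * Y", kv, " ")
        for (i = 1; i <= n; i += 2) {
            key[(i + 1) / 2] = kv[i]
            val[(i + 1) / 2] = kv[i + 1]
        }
        nkeys = n / 2
    }
    {
        out = ""; pos = 1; len = length($0)
        while (pos <= len) {
            hit = 0
            for (k = 1; k <= nkeys; k++) {
                # Plain string comparison at the current position.
                if (substr($0, pos, length(key[k])) == key[k]) {
                    out = out val[k]
                    pos += length(key[k])   # skip past the matched text
                    hit = 1
                    break
                }
            }
            if (!hit) {                     # no key starts here: copy one character
                out = out substr($0, pos, 1)
                pos++
            }
        }
        print out
    }'
}

printf 'a.b axb 2*3\n' | literal_replace   # prints: X axb 2Y3
```

Because the scan only moves forward, the "." in "a.b" does not match the "x" in "axb", and the output of one rule cannot be picked up by another.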

Replace with a callback function in JavaScript,
similar to the Perl solution by choroba:
var i = 'abcd'
var r = {ab: "cd", cd: "ab"}
var o = i.replace(/ab|cd/g, (...args) => r[args[0]])
o == 'cdab'
This can be optimized with capture groups like /(ab)|(cd)/g
and by checking args[i] for undefined values.

Related

Bash sed deleting lines with words existing in another pattern

I've got console output, something like:
SECTION/foo
SECTION/fo1
SECTION/fo3
Foo = N
Fo1 = N
Fo2 = N
Fo3 = N
Bar = Y
as an output, I want to have:
Foo = N
Fo1 = N
Fo3 = N
Any (simple) solution?
Thanks in advance!
Using awk you can do:
awk -F' *[/=] *' '$1 == "SECTION" {a[tolower($2)]} tolower($1) in a' file
Foo = N
Fo1 = N
Fo3 = N
Description:
We split each line using the custom field separator ' *[/=] *', which means / or = surrounded by zero or more spaces on each side.
When the first field is SECTION, we store the lowercased second field as a key in array a.
Later, when the lowercased first field is found in array a, we print the line (the default action).
Perl to the rescue!
perl -ne ' $h{ ucfirst $1 } = 1 if m(SECTION/(.*));
print if /(.*) = / && $h{$1};
' < input
A hash table is created from lines containing SECTION/. If the line contains = and its left hand side is stored in the hash, it gets printed.
This might work for you (GNU sed):
sed -nr '/SECTION/H;s/.*/&\n&/;G;s/\n.*/\L&/;/\n(.*) .*\n.*\/\1/P' file
Collect all SECTION lines in the hold space (HS). Double the line and delimit by a newline. Append the collected lines from the HS and convert everything from the first newline to the end to lowercase. Using a backreference match the variable to the section suffix and if so print only the first line i.e. the original line unadulterated.
N.B. the -n invokes the grep-like nature of sed and the -r reduces the number of backslashes needed to write a regexp.
awk '$1 ~ /Foo|Fo1|Fo3/' file
Foo = N
Fo1 = N
Fo3 = N

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ASCII file "table.txt" which contains lines of whitespace-separated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2; } }
{ line=$0; for(i in subs)
{ gsub(i,subs[i],line); }
print line;
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Can anyone explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task, where I have a large file containing ASCII-encoded phonetic symbols which I want to convert into LaTeX code. The ASCII encoding of the symbols contains {$,&,-,%,[a-z],[0-9],...}.)
Any comments and suggestions are welcome!
PS:
Of course, in this application, for a substitution table.txt:
aa ab
a 1
the original string "aa" should be converted into "ab" and not "1b". That means a string produced by applying a rule must be left untouched.
How can I account for that?
The order of the loop for (i in subs) is undefined by default.
In GNU awk (4.0 and later) you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders of the gawk manual for details about that.
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
Also note the caveats etc. with getline listed on http://awk.info/?tip/getline and consider not using it manually but instead using NR==FNR{...; next} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
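The effect of the rule order can be reproduced without any table at all. The sketch below hard-codes the three rules from the question in two different orders; the function names short_first and long_first are made up for the illustration.

```shell
# "aa" is applied before "aaa": the longer rule never gets a chance to match.
short_first() {
    awk '{ gsub("aa", "2"); gsub("a", "1"); gsub("aaa", "3"); print }'
}

# Longest rule first: "aaa" is consumed whole.
long_first() {
    awk '{ gsub("aaa", "3"); gsub("aa", "2"); gsub("a", "1"); print }'
}

echo aaa | short_first   # prints: 21
echo aaa | long_first    # prints: 3
```

The first order is one that an unordered for (i in subs) loop is free to pick, which is how the 21 from the question comes about.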
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
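BEGIN {
    n = 0
    # read the rules in file order; "> 0" stops at EOF and also on a read error
    while ((getline < file) > 0) { subs[n] = $1; repl[n++] = $2 }
}
{
    # apply the rules in table order (targets are still treated as regexps)
    for (i = 0; i < n; i++) gsub(subs[i], repl[i])
    print
}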
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;
my %subs;
BEGIN{
open my $f, '<', 'table.txt' or die "table.txt:$!";
while(<$f>) {
my ($k,$v) = split;
$subs{$k}=$v;
}
}
while(<>) {
while(my($k, $v) = each %subs) {
s/\b$k\b/$v/g;
}
print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
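If the table is not already ordered that way, a longest-target-first ordering can be produced with standard tools. This is only a sketch: the function name longest_first is hypothetical, the table is assumed to have the target in the first whitespace-separated field, and the -s (stable sort) option is assumed to be available, as it is in GNU and BSD sort.

```shell
# Reorder a "target replacement" table on stdin so that longer targets come first.
longest_first() {
    awk '{ print length($1) "\t" $0 }' |   # prefix each rule with its target length
        sort -s -k1,1nr |                  # numeric, descending, stable for ties
        cut -f2-                           # drop the length prefix again
}

printf 'a 1\naaa 3\naa 2\n' | longest_first
# prints:
# aaa 3
# aa 2
# a 1
```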
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner, which it then compiles and runs using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
s/((\\\\)*)\\?"/\1\\"/g;
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$1"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. POSIX-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, GNU make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without a text extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:
%: %.exe
	@$(<D)/$(<F)
%.exe: %.txt
	@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
	sed -r \
	's/((\\\\)*)(\\?)$$/\1\3\3/; #\
	s/((\\\\)*)\\?"/\1\\"/g; #\
	s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
	"$<"; \
	printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
	} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
while ( sstart = index(tail,old) ) {
$0 = $0 substr(tail,1,sstart-1) new
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
Adding some RE metacharacters to table.txt shows that they are treated just like every other character. This also shows how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
delete news
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
charPos = 0
while ( sstart = index(tail,old) ) {
charPos += sstart
news[charPos] = new
$0 = $0 substr(tail,1,sstart-1) RS
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
numChars = split($0, olds, "")
$0 = ""
for (charPos=1; charPos <= numChars; charPos++) {
$0 = $0 (charPos in news ? news[charPos] : olds[charPos])
}
print
}
.
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Deleting characters from a column if they appear fewer than 20 times

I have a CSV file with two columns:
cat # c a t
dog # d o g
bat # b a t
To simplify communication, I've used English letters for this example, but I'm dealing with CJK in UTF-8.
I would like to delete any character appearing in the second column, which appears on fewer than 20 lines within the first column (characters could be anything from numbers, letters, to Chinese characters, and punctuation, but not spaces).
For e.g., if "o" appears on 15 lines in the first column, all appearances of "o" are deleted from the second column. If "a" appears on 35 lines in the first column, no change is made.
The first column must not be changed.
I don't need to count multiple appearances of a letter on a single line. For e.g. "robot" has 2 o's, but this detail is not important, only that "robot" has an "o", so that is counted as one line.
How can I delete the characters that appear less than 20 times?
Here is a script using awk. Change the variable num to your frequency cutoff; characters that appear on num or fewer lines of the first column are removed, so use num=19 for "fewer than 20". I've set it to 1 to show how it works against a small sample file. Note how f is still deleted even though it shows up three times on a single line. Also, passing the same input file twice is not a typo.
awk -v num=1 '
BEGIN { OFS=FS="#" }
FNR==NR{
split($1,a,"")
for (x in a)
if(a[x] != " " && !c[a[x]]++)
l[a[x]]++
delete c
next
}
!flag++{
for (x in l)
if (l[x] <= num)
cclass = cclass x
}
{
gsub("["cclass"]", " " , $2)
}1' ./infile.csv ./infile.csv
Sample Input
$ cat ./infile
fff # f f f
cat # c a t
dog # d o g
bat # b a t
Output
$ ./delchar.sh
fff #
cat # a t
dog #
bat # a t
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
open my $IN, '<:utf8', $ARGV[0] or die $!;
my %chars;
while (<$IN>) {
    chomp;
    my @cols = split /#/;
    my %linechars;
    undef @linechars{ split //, $cols[0] };
    $chars{$_}++ for keys %linechars;
}
seek $IN, 0, 0;
my @remove = grep $chars{$_} < 20, keys %chars;
my $remove_reg = '[' . join(q{}, @remove) . ']';
warn $remove_reg;
while (<$IN>) {
    my @cols = split /#/;
    $cols[1] =~ s/$remove_reg//g;
    print join '#', @cols;
}
I am not sure how whitespace should be handled, so you might need to adjust the script.
The answer is:
cut -d " " -f $column $file | sed -e 's/\.//g' -e 's/\,//g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
where $file is your text file and $column is the column whose frequencies you want to count. It prints each value with its frequency, most frequent first.
You can then loop over the results whose leading count is greater than your threshold and grep for them in the whole lines.

What is an efficient way to replace list of strings with another list in Unix file?

Suppose I have two lists of strings (list A and list B) with the exact same number of entries, N, in each list, and I want to replace all occurrences of the nth element of A with the nth element of B in a file in Unix (ideally using Bash scripting).
What's the most efficient way to do this?
An inefficient way would be to make N calls to "sed s/stringA/stringB/g".
This will do it in one pass. It reads listA and listB into awk arrays, then for each line of the input, it examines each word, and if the word is found in listA, the word is replaced by the corresponding word in listB.
awk '
FILENAME == ARGV[1] { listA[$1] = FNR; next }
FILENAME == ARGV[2] { listB[FNR] = $1; next }
{
for (i = 1; i <= NF; i++) {
if ($i in listA) {
$i = listB[listA[$i]]
}
}
print
}
' listA listB filename > filename.new
mv filename.new filename
I'm assuming the strings in listA do not contain whitespace (awk's default field separator)
Make one call to sed that writes the sed script, and another to use it? If your lists are in files listA and listB, then:
paste -d : listA listB | sed 's/\([^:]*\):\([^:]*\)/s%\1%\2%g/' > sed.script
sed -f sed.script files.to.be.mapped.*
I'm making some sweeping assumptions about 'words' not containing either colon or percent symbols, but you can adapt around that. Some versions of sed have upper bounds on the number of commands that can be specified; if that's a problem because your word lists are big enough, then you may have to split the generated sed script into separate files which are applied - or change to use something without the limit (Perl, for example).
Another item to be aware of is sequence of changes. If you want to swap two words, you need to craft your word lists carefully. In general, if you map (1) wordA to wordB and (2) wordB to wordC, it matters whether the sed script does mapping (1) before or after mapping (2).
The script shown is not careful about word boundaries; you can make it careful about them in various ways, depending on the version of sed you are using and your criteria for what constitutes a word.
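To make the generated script visible, here is a small self-contained run of the same idea. The word lists, the temporary directory and the function name make_sed_script are made up for the example, and a g flag is used on the generated commands on the assumption that every occurrence on a line should be replaced.

```shell
# Build a sed script from two parallel word lists (one word per line).
# Assumes the words contain no ":", no "%" and no regex metacharacters.
make_sed_script() {
    paste -d : "$1" "$2" | sed 's/\([^:]*\):\([^:]*\)/s%\1%\2%g/'
}

# Demo with hypothetical lists in a temporary directory.
tmp=$(mktemp -d)
printf 'cat\ndog\n' > "$tmp/listA"
printf 'feline\ncanine\n' > "$tmp/listB"

make_sed_script "$tmp/listA" "$tmp/listB" > "$tmp/sed.script"
cat "$tmp/sed.script"
# prints:
# s%cat%feline%g
# s%dog%canine%g

printf 'cat and dog\n' | sed -f "$tmp/sed.script"   # prints: feline and canine
rm -rf "$tmp"
```

The generated commands still run one after another on each line, so the ordering caveat about mapping wordA to wordB and wordB to wordC applies unchanged.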
I needed to do something similar, and I wound up generating sed commands based on a map file:
$ cat file.map
abc => 123
def => 456
ghi => 789
$ cat stuff.txt
abc jdy kdt
kdb def gbk
qng pbf ghi
non non non
try one abc
$ sed `cat file.map | awk '{print "-e s/"$1"/"$3"/"}'`<<<"`cat stuff.txt`"
123 jdy kdt
kdb 456 gbk
qng pbf 789
non non non
try one 123
Make sure your shell supports as many parameters to sed as you have in your map.
This is fairly straightforward with Tcl:
set fA [open listA r]
set fB [open listB r]
set fin [open input.file r]
set fout [open output.file w]
# read listA and listB and create the mapping of corresponding lines
while {[gets $fA strA] != -1} {
set strB [gets $fB]
lappend map $strA $strB
}
# apply the mapping to the input file
puts $fout [string map $map [read $fin]]
# if the file is large, do it line by line instead
#while {[gets $fin line] != -1} {
# puts $fout [string map $map $line]
#}
close $fA
close $fB
close $fin
close $fout
file rename output.file input.file
You can do this in bash. Get your lists into arrays.
listA=(a b c)
listB=(d e f)
data=$(<file)
echo "${data//${listA[2]}/${listB[2]}}" #change the 3rd element. Redirect to file where necessary
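The line above changes a single element. A sketch of extending it to every pair is a loop over the array indices; the function name replace_all is hypothetical, and this needs bash itself, since arrays and ${var//pattern/replacement} are not available in plain POSIX sh.

```shell
#!/usr/bin/env bash
# Requires bash: arrays and ${var//pattern/replacement} are not POSIX sh.
listA=(a b c)
listB=(d e f)

# Apply every listA[i] -> listB[i] pair to the string given as $1.
replace_all() {
    local data=$1 i
    for i in "${!listA[@]}"; do
        # Quoting the pattern makes bash match it literally rather than as a glob.
        data=${data//"${listA[i]}"/"${listB[i]}"}
    done
    printf '%s\n' "$data"
}

replace_all "a cab"   # prints: d fde
```

The pairs are applied one after another, so this is N substitutions in memory and not a single pass; if a replacement from listB also appears in listA, a later pair would rewrite it again.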

Resources