Deleting characters from a column if they appear fewer than 20 times - bash

I have a CSV file with two columns:
cat # c a t
dog # d o g
bat # b a t
To simplify communication, I've used English letters for this example, but I'm dealing with CJK in UTF-8.
I would like to delete, from the second column, any character that appears on fewer than 20 lines within the first column (characters could be anything from numbers and letters to Chinese characters and punctuation, but not spaces).
For example, if "o" appears on 15 lines in the first column, all appearances of "o" are deleted from the second column. If "a" appears on 35 lines in the first column, no change is made.
The first column must not be changed.
I don't need to count multiple appearances of a letter on a single line. For example, "robot" has two o's, but this detail is not important, only that "robot" has an "o", so that is counted as one line.
How can I delete the characters that appear less than 20 times?

Here is a script using awk. Change the var num to be your frequency cutoff point. I've set it to 1 to show how it works against a small sample file. Note how f is still deleted even though it shows up three times on a single line. Also, passing the same input file twice is not a typo.
awk -v num=1 '
BEGIN { OFS = FS = "#" }
FNR == NR {                            # first pass: count, for each character, the number of lines of column 1 containing it
    split($1, a, "")
    for (x in a)
        if (a[x] != " " && !c[a[x]]++) # count each character at most once per line
            l[a[x]]++
    delete c                           # reset the per-line "seen" set
    next
}
!flag++ {                              # second pass, first line only: build the class of characters to delete
    for (x in l)
        if (l[x] <= num)
            cclass = cclass x
}
{
    gsub("[" cclass "]", " ", $2)      # blank out those characters in column 2
}1' ./infile.csv ./infile.csv
Sample Input
$ cat ./infile.csv
fff # f f f
cat # c a t
dog # d o g
bat # b a t
Output
$ ./delchar.sh
fff #
cat # a t
dog #
bat # a t

Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
open my $IN, '<:utf8', $ARGV[0] or die $!;

# First pass: count on how many lines of the first column each character appears.
my %chars;
while (<$IN>) {
    chomp;
    my @cols = split /#/;
    my %linechars;
    undef @linechars{ split //, $cols[0] };   # unique characters of this line's first column
    $chars{$_}++ for keys %linechars;
}

# Second pass: strip the rare characters from the second column.
seek $IN, 0, 0;
my @remove = grep $chars{$_} < 20, keys %chars;
my $remove_reg = '[' . join(q{}, @remove) . ']';
warn $remove_reg;
while (<$IN>) {
    my @cols = split /#/;
    $cols[1] =~ s/$remove_reg//g;
    print join '#', @cols;
}
I am not sure how whitespace should be handled, so you might need to adjust the script.

The answer is:
cut -d " " -f "$column" "$file" | sed -e 's/\.//g' -e 's/\,//g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
where $file is your text file and $column is the column whose frequencies you need to look at. It gives you the list of values with their frequencies.
Then you can loop over those results whose count reaches your threshold and grep for the whole lines, as sketched below.
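One way that loop might look (a sketch only; $column, $file, and the threshold value are placeholders to set yourself, and the final grep simply pulls out the whole lines containing each sufficiently frequent value):
threshold=20
cut -d " " -f "$column" "$file" | sed -e 's/\.//g' -e 's/\,//g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr |
while read -r count value; do
    if [ "$count" -ge "$threshold" ]; then
        grep -F -- "$value" "$file"    # print the whole lines containing this value
    fi
done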

Parsing unstructured text file with grep

I am trying to analyze this IDS log file from MIT, found here.
Summarized attack: 41.084031
IDnum Date StartTime Duration Destination Attackname insider? manual? console?success? aDump? oDump iDumpBSM? SysLogs FSListing StealthyNew? Category OS
41.08403103/29/1999 08:18:35 00:04:07 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:19:37 00:01:56 209.154.098.104ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:29:27 00:00:43 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:40:14 00:24:26 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
I am trying to write commands that do two things:
First, parse through the entire file and determine the amount of distinct "summarized attacks" that begin with 4x.xxxxx. I have accomplished this with:
grep -o -E "Summarized attack: 4". It returns 80.
Second, for each of the "Summarized Attacks" found by the above command, parse the table and determine the amount of IDnum rows, and return the total amount of rows (i.e., attacks) across all "Summarized attack" finds. I would imagine that number is somewhere around 200.
However, I am struggling to get the individual number of IDs, i.e., that are in the IDnum column of this text file.
Since it is a text file with technically no structure, how can I parse this .txt file as if it had a tabular structure to retrieve the total entries in the IDnum column, for each Summarized attack that follows the above grep command's search text?
Desired output would be a count of all IDnum's for the Summarized attacks found by the above command. I don't know the count, but I would imagine an integer output, similar to the return of 80 for grep -o -E "Summarized attack: 4". The output would be <int> where <int> is the # of "attacks" as defined by rows in the IDnum column across all 80 of the found "Summarized attacks" by the above grep command.
If another command other than grep is better suited, that is OK.
To count matches you can use grep -c:
grep -cE '(^Summarized.attack:.4[0-9]\.[0-9]+$)'
You can use the colon as a delimiter for cut -d
(if you loop over the results, the leading whitespace does not matter):
grep -oE '(^Summarized.attack:.4[0-9]\.[0-9]+$)' | cut -d: -f2
Example loop:
file="path/to/master-listfile-condensed.txt"
for var in $(grep -oE '(^Summarized.attack:.4[0-9]\.[0-9]+$)' "$file" | cut -d: -f2)
do
printf "Summarized attacks: %s: %s\n" $var \
$(grep -cE "(^.${var}[0-9]+/[0-9]{2}/[0-9]{4})" "$file")
done
^ start of line
$ end of line
. any character (here, a single space)
\. a literal dot (escaped)
[0-9] a single digit
+ one or more occurrences
{4} exactly four occurrences
Assuming you have more than one "Summarized attack:" in your input file this may be what you're looking for:
$ cat tst.awk
/^Summarized attack:/ {
prt()
atk = ($3 ~ /^4/ ? $3 : 0)
cnt = 0
}
atk { cnt++ }
END {
prt()
print "TOTAL", tot
}
function prt() {
if ( atk ) {
cnt -= 2
print atk, cnt
}
tot += cnt
}
$ awk -f tst.awk file
For your first part, grep -c "Summarized attack: 4" or, treating the pattern as a fixed string, grep -cF "Summarized attack: 4" is sufficient.
If I understand your second part, for each of those blocks, you want to add up the attack rows and print a grand total. You can do that with
gawk '/^Summarized attack: 4/ { on=1; next} /^ 4[0-9.]*/ { if (on) ++ids; next} /^ IDnum/ {next} /^ *$/ {next} { on=0} END {print ids;}' < master-listfile-condensed.txt
The first statement says: search (/.../) for every line that begins with (^) "Summarized attack: 4", and upon finding it, turn on the "on" flag and go to the next line. The second statement says: if this is an attack record (i.e. it begins with 4 followed by a string of digits and dots), check the flag; if it is on, count it. Basically, we want the flag to be on while we are inside a stanza of target attack records. The next two statements say: for every line that starts with " IDnum" or is all whitespace (sometimes blank lines are inserted), go to the next line; this is needed to counteract the following statement, which says that if the line matches none of the previous patterns, turn the "on" flag off. This prevents us from counting attacks outside the target stanzas. Finally, END means: at the end, print the grand total. I get 757, which is pretty far outside your estimate, but I think it is correct.
But a far easier way, assuming the Summarized timestamp is always repeated in the IDnum at least to the first significant digit, would be to use
grep -Ec '^ 4' master-listfile-condensed.txt
That means count all the lines that begin with space-4. In this case it gives us the correct result.

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier, product) pairs, and I must show the number of products for every supplier along with their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
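If you want the exact two-lines-per-supplier layout from the question, one option (just a sketch) is to post-process the datamash output with awk:
datamash -s -t ':' groupby 1 count 2 unique 2 < file |
awk -F: '{ gsub(/,/, " ", $3); print $1 ": " $2; print $1 ": " $3 }'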
The following pipeline should do the job:
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the subsequent processing
the cryptic sed joins the lines that share a common supplier
awk counts the items per supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a newline in between; this creates a multiline pattern space
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *; it matches only if the line after the newline starts with the same supplier
ta tests if the previous s command was successful, and, if this is the case, transfers control to the line labelled a (i.e. go to step 1);
P prints the leading part of the multiline pattern space, up to (but not including) the embedded newline;
D deletes the leading part of the multiline pattern space, up to and including the embedded newline.
This should be close to the awk-only code I was referring to:
< your_input_file awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}'

Bash sed deleting lines with words existing in another pattern

I've got console output, something like:
SECTION/foo
SECTION/fo1
SECTION/fo3
Foo = N
Fo1 = N
Fo2 = N
Fo3 = N
Bar = Y
as an output, I want to have:
Foo = N
Fo1 = N
Fo3 = N
Any (simple) solution?
Thanks in advance!
Using awk you can do:
awk -F' *[/=] *' '$1 == "SECTION" {a[tolower($2)]} tolower($1) in a' file
Foo = N
Fo1 = N
Fo3 = N
Description:
We split each line using the custom field separator ' *[/=] *', which means / or = surrounded by zero or more spaces on each side.
When the first field is SECTION, we store the lowercased second field as a key in array a.
Later, when the lowercased first field is found in array a, the line is printed (the default action).
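A quick way to see how that separator splits the two kinds of lines (purely illustrative):
$ echo 'SECTION/foo' | awk -F' *[/=] *' '{ print $1 "|" $2 }'
SECTION|foo
$ echo 'Foo = N' | awk -F' *[/=] *' '{ print $1 "|" $2 }'
Foo|N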
Perl to the rescue!
perl -ne ' $h{ ucfirst $1 } = 1 if m(SECTION/(.*));
print if /(.*) = / && $h{$1};
' < input
A hash table is created from lines containing SECTION/. If the line contains = and its left hand side is stored in the hash, it gets printed.
This might work for you (GNU sed):
sed -nr '/SECTION/H;s/.*/&\n&/;G;s/\n.*/\L&/;/\n(.*) .*\n.*\/\1/P' file
Collect all SECTION lines in the hold space (HS). Double the line and delimit by a newline. Append the collected lines from the HS and convert everything from the first newline to the end to lowercase. Using a backreference match the variable to the section suffix and if so print only the first line i.e. the original line unadulterated.
N.B. the -n invokes the grep-like nature of sed and the -r reduces the number of backslashes needed to write a regexp.
awk '$1 ~ /Foo|Fo1|Fo3/' file
Foo = N
Fo1 = N
Fo3 = N

What is the fastest way to delete lines in a file which have no match in a second file?

I have two files, wordlist.txt and text.txt.
The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should show this:
你
你们
"我" is not found in this list because it is never found in text.txt.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in bash to check each line in wordlist.txt and see if it is in text.txt using grep:
a=1
while read line
do
c=`grep -c $line text.txt`
if [ "$c" -ge 1 ]
then
echo $line >> wordsfound.txt
echo "Found" $a
fi
echo "Not found" $a
a=`expr $a + 1`
done < wordlist.txt
Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:
As the files contain CJK letters, they can be thought of as using a giant alphabet of about 8,000 letters, so nearly every word shares characters. For example:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might perhaps check "我" first, and upon finding that it is not present, would avoid checking every subsequent word in wordlist.txt that also contains "我". If there are about 8,000 unique characters found in wordlist.txt, then the script should not need to check so many lines.
What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?
I grabbed the text of War and Peace from the Gutenberg project and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change that with:
perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt
On my computer, it takes just over a second to run.
use strict;
use warnings;
use utf8::all;
use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';
GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"     => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}
And here's my timing:
[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt
real 0m1.081s
user 0m1.076s
sys 0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt
Just use comm
http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html
comm -1 wordlist.txt text.txt
This might work for you:
tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt
Basically, create a new word list from text.txt and grep it against wordlist.txt file.
N.B. You may want to use the software you used to build the original wordlist.txt. In which case all you need is:
yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt
Use grep with fixed-strings (-F) semantics, this will be fastest. Similarly, if you want to write it in Perl, use the index function instead of regex.
sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt
I'm surprised that there are already four answers, but no one posted this yet. People just don't know their toolbox anymore.
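If the goal is the list of matching words themselves rather than the matching lines of text.txt, a variation (assuming GNU grep) would be:
grep -oF -f wordlist-unique.txt text.txt | sort -u > wordsfound.txt
Note that grep -o reports non-overlapping matches, so a short word that only ever occurs inside a longer matched word can be missed.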
I would probably use Perl:
use strict;
my @aWordList = ();
open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");
while (my $sWord = <WORDLIST>)
{
    chomp($sWord);
    push(@aWordList, $sWord);
}
close(WORDLIST);
open(TEXT, "< text.txt") || die("Can't open text.txt");
while (my $sText = <TEXT>)
{
    foreach my $sWord (@aWordList)
    {
        if ($sText =~ /$sWord/)
        {
            print("$sWord\n");
        }
    }
}
close(TEXT);
This won't be too slow, but if you could let us know the size of the files you're dealing with, I could have a go at writing something much more clever with hash tables.
Quite sure not the fastest solution, but at least a working one (I hope).
This solution needs Ruby 1.9; the text files are expected to be UTF-8.
#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')
new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}
#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}
Can you provide a bigger example to benchmark the different methods on? (Perhaps some test files to download?)
Below is a benchmark with four methods.
#encoding: utf-8
require 'benchmark'
N = 10_000 #Number of Test loops
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')
def solution_count
  new_wordlist = []
  $wordlist.each{|word|
    new_wordlist << word if $txt.count(word) > 0
  }
  new_wordlist.sort
end

#Faster than count, it can stop after the first hit
def solution_include
  new_wordlist = []
  $wordlist.each{|word|
    new_wordlist << word if $txt.include?(word)
  }
  new_wordlist.sort
end

def solution_combine()
  #get biggest word size
  max = 0
  $wordlist.each{|word| max = word.size if word.size > max }
  #Build list of all letter combinations from the text
  words_in_txt = []
  0.upto($txt.size){|i|
    1.upto(max){|l|
      words_in_txt << $txt[i,l]
    }
  }
  (words_in_txt & $wordlist).sort
end
#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
# abcdef contains def.
# if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
  new_wordlist = []
  txt = $txt.dup #avoid manipulating the data source for other methods
  #We must start with the big words.
  #If we start with the small ones, we destroy the long words.
  $wordlist.sort_by{|x| x.size }.reverse.each{|word|
    new_wordlist << word if txt.gsub!(word,'')
  }
  #Now we must add words which were already part of longer words
  new_wordlist.dup.each{|neww|
    $wordlist.each{|word|
      new_wordlist << word if word != neww and neww.include?(word)
    }
  }
  new_wordlist.sort
end
#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  #~ f << solution_include.join("\n")
  f << solution_combine.join("\n")
}
#Check the different results
if solution_count != solution_include
  puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
  puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
  puts "Difference solution_combine <> solution_include"
end
#Benchmark the solutions
Benchmark.bmbm(10) {|b|
  b.report('count')   { N.times { solution_count } }
  b.report('include') { N.times { solution_include } }
  b.report('gsub')    { N.times { solution_gsub } }    #wrong results
  b.report('combine') { N.times { solution_combine } } #wrong results
} #Benchmark
I think the solution_gsub variant is not correct; see the comment in the method definition. If CJK data would still allow this approach, please give me feedback.
That variant is the slowest in my test, but perhaps it will improve with bigger examples, and perhaps it can be tuned a bit.
The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.
First TXR Lisp solution ( http://www.nongnu.org/txr ):
(defvar tg-hash (hash)) ;; tg == "trigraph"
(unless (= (len *args*) 2)
(put-line `arguments required: <wordfile> <textfile>`)
(exit nil))
(defvar wordfile [*args* 0])
(defvar textfile [*args* 1])
(mapcar (lambda (line)
(dotimes (i (len line))
(push line [tg-hash [line i..(succ i)]])
(push line [tg-hash [line i..(ssucc i)]])
(push line [tg-hash [line i..(sssucc i)]])))
(file-get-lines textfile))
(mapcar (lambda (word)
(if (< (len word) 4)
(if [tg-hash word]
(put-line word))
(if (find word [tg-hash [word 0..3]]
(op search-str #2 #1))
(put-line word))))
(file-get-lines wordfile))
The strategy here is to reduce the corpus of words to a hash table which is indexed on the individual characters, digraphs and trigraphs occurring in the lines, associating these fragments with the lines. Then when we process the word list, this reduces the search effort.
Firstly, if the word is short, three characters or less (probably common in Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.
If the word is longer than three characters, we can try to get a match for the first three characters. That gives us a list of lines which contain a match for the trigraph. We can search those lines exhaustively to see which ones of them match the word. I suspect that this will greatly reduce the number of lines that have to be searched.
I would need your data, or something representative thereof, to be able to see what the behavior is like.
Sample run:
$ txr words.tl words.txt text.txt
water
fire
earth
the
$ cat words.txt
water
fire
earth
the
it
$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth
(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)
The use of lazy lists means that we do not store the entire list of 300,000 words, for instance. Although we are using the Lisp mapcar function, the list is being generated on the fly and because we don't keep the reference to the head of the list, it is eligible for garbage collection.
Unfortunately we do have to keep the text corpus in memory because the hash table associates lines.
If that's a problem, the solution could be reversed. Scan all the words, and then process the text corpus lazily, tagging those words which occur. Then eliminate the rest. I will post such a solution also.
new file newlist.txt
for each word in wordlist.txt:
    check if word is in text.txt (I would use grep, if you're willing to use bash)
    if yes:
        append it to newlist.txt (probably echo word >> newlist.txt)
    if no:
        next word
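A rough bash rendering of that pseudocode (a sketch only; it has the same per-word grep cost as the loop in the question):
> newlist.txt                        # start with an empty output file
while IFS= read -r word; do
    if grep -qF -- "$word" text.txt; then
        echo "$word" >> newlist.txt
    fi
done < wordlist.txt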
Simplest way with a bash script:
Preprocess first with tr and sort to format it as one word per line and remove duplicated lines.
Then do this:
cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;
That's the list of words you want...
Try this:
cat wordlist.txt | while read line
do
    if [[ $(grep -wc "$line" text.txt) -gt 0 ]]
    then
        echo "$line"
    fi
done
Whatever you do, if you use grep you must use -w to match a whole word. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you'll get a wrong match.
If the files are VERY big, and this loop takes too much time to run, you can convert text.txt to a list of words (easy with awk), and use comm to find the words that are in both lists, as sketched below.
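A sketch of that comm approach (it assumes text.txt can be broken into words on whitespace, which may not hold for CJK text):
awk '{ for (i = 1; i <= NF; i++) print $i }' text.txt | sort -u > textwords.txt
sort -u wordlist.txt > wordlist-sorted.txt
comm -12 wordlist-sorted.txt textwords.txt > wordsfound.txt
comm -12 suppresses the lines unique to either file, leaving only the words present in both.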
This solution is in Perl, maintains your original semantics and uses the optimization you suggested.
#!/usr/bin/perl
@list = split("\n", `sort < ./wordlist.txt | uniq`);
$size = scalar(@list);
for ($i = 0; $i < $size; ++$i) { $list[$i] = quotemeta($list[$i]); }
for ($i = 0; $i < $size; ++$i) {
    my $j = $i + 1;
    while ($list[$j] =~ /^$list[$i]/) {
        ++$j;
    }
    $skip[$i] = ($j - $i - 1);
}
open IN, "<./text.txt" or die;
@text = (<IN>);
close IN;
foreach $c (@text) {
    for ($i = 0; $i < $size; ++$i) {
        if ($c =~ /$list[$i]/) {
            $found{$list[$i]} = 1;
            last;
        }
        else {
            $i += $skip[$i];
        }
    }
}
open OUT, ">wordsfound.txt" or die;
while (my ($key, $value) = each(%found)) {
    print OUT "$key\n";
}
close OUT;
exit;
Use parallel processing to speed up the processing.
1) sort & uniq on wordlist.txt, then split it into several files (X).
Do some testing; X should equal the number of your computer's cores.
split -d -l <lines-per-chunk> wordlist.txt
2) use xargs -P X -n 1 script.sh x00 > output-x00.txt to process the files in parallel:
find ./splitted_files_dir -type f -name "x*" -print | xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE
3) cat output* > output.txt to concatenate the output files.
This will speed up the processing enough, and you can use tools that you already understand, which eases the maintenance cost.
The script is almost identical to the one you used in the first place, except that it reads its words from the split chunk and greps them in text.txt.
script.sh
FILE=$1
OUTPUTFILE="output-${FILE}.txt"
TEXTFILE="text.txt"
a=1
while read line
do
    c=`grep -c "$line" "${TEXTFILE}"`
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> "${OUTPUTFILE}"
        echo "Found" $a
    fi
    echo "Not found" $a
    a=`expr $a + 1`
done < "${FILE}"

What is an efficient way to replace list of strings with another list in Unix file?

Suppose I have two lists of strings (list A and list B) with the exact same number of entries, N, in each list, and I want to replace all occurrences of the nth element of A with the nth element of B in a file in Unix (ideally using Bash scripting).
What's the most efficient way to do this?
An inefficient way would be to make N calls to "sed s/stringA/stringB/g".
This will do it in one pass. It reads listA and listB into awk arrays, then for each line of the input, it examines each word; if the word is found in listA, the word is replaced by the corresponding word in listB.
awk '
FILENAME == ARGV[1] { listA[$1] = FNR; next }
FILENAME == ARGV[2] { listB[FNR] = $1; next }
{
    for (i = 1; i <= NF; i++) {
        if ($i in listA) {
            $i = listB[listA[$i]]
        }
    }
    print
}
' listA listB filename > filename.new
mv filename.new filename
I'm assuming the strings in listA do not contain whitespace (awk's default field separator)
Make one call to sed that writes the sed script, and another to use it? If your lists are in files listA and listB, then:
paste -d : listA listB | sed 's/\([^:]*\):\([^:]*\)/s%\1%\2%/' > sed.script
sed -f sed.script files.to.be.mapped.*
I'm making some sweeping assumptions about 'words' not containing either colon or percent symbols, but you can adapt around that. Some versions of sed have upper bounds on the number of commands that can be specified; if that's a problem because your word lists are big enough, then you may have to split the generated sed script into separate files which are applied - or change to use something without the limit (Perl, for example).
Another item to be aware of is sequence of changes. If you want to swap two words, you need to craft your word lists carefully. In general, if you map (1) wordA to wordB and (2) wordB to wordC, it matters whether the sed script does mapping (1) before or after mapping (2).
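For instance, if listA is (cat dog) and listB is (dog cat), a naive two-command script clobbers its own earlier output (a trivial illustration; the usual workaround is to map through temporary tokens that occur nowhere in the data):
$ echo 'cat dog' | sed -e 's/cat/dog/g' -e 's/dog/cat/g'
cat cat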
The script shown is not careful about word boundaries; you can make it careful about them in various ways, depending on the version of sed you are using and your criteria for what constitutes a word.
I needed to do something similar, and I wound up generating sed commands based on a map file:
$ cat file.map
abc => 123
def => 456
ghi => 789
$ cat stuff.txt
abc jdy kdt
kdb def gbk
qng pbf ghi
non non non
try one abc
$ sed `cat file.map | awk '{print "-e s/"$1"/"$3"/"}'`<<<"`cat stuff.txt`"
123 jdy kdt
kdb 456 gbk
qng pbf 789
non non non
try one 123
Make sure your shell supports as many parameters to sed as you have in your map.
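If the map outgrows what the command line can hold, one workaround (a sketch, with the same sweeping assumption that the words contain no / characters) is to write the commands to a script file and apply it with sed -f:
awk '{ print "s/" $1 "/" $3 "/" }' file.map > file.sed
sed -f file.sed stuff.txt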
This is fairly straightforward with Tcl:
set fA [open listA r]
set fB [open listB r]
set fin [open input.file r]
set fout [open output.file w]
# read listA and listB and create the mapping of corresponding lines
while {[gets $fA strA] != -1} {
    set strB [gets $fB]
    lappend map $strA $strB
}
# apply the mapping to the input file
puts $fout [string map $map [read $fin]]
# if the file is large, do it line by line instead
#while {[gets $fin line] != -1} {
# puts $fout [string map $map $line]
#}
close $fA
close $fB
close $fin
close $fout
file rename output.file input.file
You can do this in bash. Get your lists into arrays.
listA=(a b c)
listB=(d e f)
data=$(<file)
echo "${data//${listA[2]}/${listB[2]}}" #change the 3rd element. Redirect to file where necessary
