shell script subtract fields from pairs of lines - bash

Suppose I have the following file:
stub-foo-start: 10
stub-foo-stop: 15
stub-bar-start: 3
stub-bar-stop: 7
stub-car-start: 21
stub-car-stop: 51
# ...
# EOF at the end
with the goal of writing a script which would append to it like so:
stub-foo-start: 10
stub-foo-stop: 15
stub-bar-start: 3
stub-bar-stop: 7
stub-car-start: 21
stub-car-stop: 51
# ...
# appended:
stub-foo: 5 # 5 = stop(15) - start(10)
stub-bar: 4 # and so on...
stub-car: 30
# ...
# new EOF
The format is exactly this sequential pairing of start and stop tags (stop being the closing one) and no nesting in between.
What is the recommended approach to writing such a script using awk and/or sed? Mostly, what I've tried is grepping lines and storing them in a variable, but that seemed to overcomplicate things and trail off.
Any advice or helpful links welcome. (Most tutorials I found on shell scripting were illustrative at best)

A naive implementation in plain bash
#!/bin/bash
while read -r start && read -r stop; do
printf '%s: %d\n' "${start%-*}" $(( ${stop##*:} - ${start##*:} ))
done < file
This assumes pairs are contiguous and there are no interlaced or nested pairs.
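The loop above only prints the differences. A minimal sketch of the appending step the question asks for, under the same assumption that the file holds nothing but contiguous start/stop pairs (no comment or blank lines). The function name append_durations is hypothetical, not from the original answers:

```shell
#!/bin/bash
# Sketch: compute stop - start for each contiguous pair and append the results.
# append_durations is an illustrative name; it assumes odd lines are "start"
# lines and even lines are the matching "stop" lines.
append_durations() {
    local file=$1 tmp
    tmp=$(mktemp) || return 1
    # Collect the results in a temporary file first, so awk never reads
    # lines that were appended during the same run.
    awk -F': *' '
        NR % 2 == 1 { name = $1; sub(/-start$/, "", name); start = $2; next }
        { print name ": " ($2 - start) }
    ' "$file" > "$tmp" && cat "$tmp" >> "$file"
    rm -f "$tmp"
}
```

It would be called as append_durations file. Writing to a temporary file before appending is a design choice: appending while awk is still reading the same file risks processing the new lines too.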

Using GNU awk:
awk -F '[ -]' '{ map[$1"-"$2][$3]=$4;print } END { for (i in map) { print i": "(map[i]["stop:"]-map[i]["start:"])" // ("map[i]["stop:"]"-"map[i]["start:"]")" } }' file
Explanation:
awk -F '[ -]' '{ # Set the field delimiter to space or "-"
map[$1"-"$2][$3]=$4; # Create a two-dimensional array indexed by the name (first and second fields, e.g. "stub-foo") and the tag (third field, "start:" or "stop:"), with the fourth field as the value
print # Print the line
}
END { for (i in map) { # Loop through the names (the order of "for in" is unspecified unless PROCINFO["sorted_in"] is set)
print i": "(map[i]["stop:"]-map[i]["start:"])" // ("map[i]["stop:"]"-"map[i]["start:"]")" # Print the difference in the required format
}
}' file
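The GNU awk answer above relies on arrays of arrays, which other awks lack, and its output order is not guaranteed. A sketch of a portable variant that keys on the name and remembers the order in which names first appear. It assumes every line looks like name-start: N or name-stop: N; durations_by_name is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: portable awk (no arrays of arrays), output in first-seen order.
# durations_by_name is an illustrative name; it reads the files given as
# arguments, or stdin when there are none.
durations_by_name() {
    awk -F': *' '
        /-start:/ {
            name = $1; sub(/-start$/, "", name)
            if (!(name in start)) order[++n] = name   # remember first appearance
            start[name] = $2
        }
        /-stop:/ {
            name = $1; sub(/-stop$/, "", name)
            stop[name] = $2
        }
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in stop)
                    print order[i] ": " (stop[order[i]] - start[order[i]])
        }
    ' "$@"
}
```

Because the values are stored per name, this version also tolerates pairs that are not adjacent, at the cost of holding all names in memory.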

Related

Removing beginnings sequences in fasta from a list with size

I want to remove specific sequence in the list with IDs and extract sequence from large fasta file.
input test.fasta file:
>GHAT8X
MKFNDIRNDGHEDCFNNIIFASKLSSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANAIAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLAPEYI
>GHAMNO
MRLIGCCLETENPVLVFEYVEYGTLADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVAKLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQRAVD
>GHAXM6
MYSCLGAIKNSGKEDKEKCIMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVLLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVELAVKCVSES
seqid_len.txt file:
GHAT8X 25
GHAMNO 26
GHAXM6 20
Expected output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
MRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVL
LTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVEL
AVKCVSES
I tried:
sed 's/_/|/g' seqid_len.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fasta | seqkit subseq -r ${line##[a-z]* }:-1 ; done
Only getting GHAT8X 25 and GHAMNO 26 sequence out. However, renaming the header does not work.
Any correction on this or any python solution would be really helpful.
Have a great weekend.
Thanks
Would you please try the following:
#!/bin/bash
awk 'NR==FNR {a[">" $1] = $2 + 0; next} # create an array which maps the header to the starting position of the sequence
$0 in a { # the header matches an array index
start = a[$0] # get the starting position
print # print the header
getline # read the sequence line
print substr($0, start) # print the sequence by removing the beginnings
}
' seqid_len.txt test.fasta | fold -w 60 # wrap the output within 60 columns
Output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
IMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLV
LLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVE
LAVKCVSES
You'll see that the 3rd sequence starts with IMR..., shifted one column compared with your expected MRN... If the 3rd one is correct and the 1st and 2nd sequences should be fixed instead, change the calculation $2 + 0 to $2 + 1.
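The getline approach above assumes each sequence sits on a single line. A sketch for FASTA files whose sequences are wrapped over several lines: it joins the lines of a record before trimming. The function name trim_fasta is hypothetical, and like the answer it assumes the header line is exactly ">" plus the ID:

```shell
#!/bin/bash
# Sketch: trim the beginning of selected FASTA records, joining wrapped lines.
# trim_fasta is an illustrative name. $1 = "ID start" list, $2 = FASTA file.
trim_fasta() {
    awk '
        # Print the pending record, if its header was in the ID list.
        function emit() {
            if (hdr != "") { print hdr; print substr(seq, start) }
        }
        NR == FNR { a[">" $1] = $2 + 0; next }   # header -> 1-based start column
        /^>/ {
            emit(); hdr = ""; seq = ""
            if ($0 in a) { hdr = $0; start = a[$0] }
            next
        }
        hdr != "" { seq = seq $0 }               # accumulate wrapped sequence lines
        END { emit() }
    ' "$1" "$2" | fold -w 60
}
```

It would be called as trim_fasta seqid_len.txt test.fasta. Note that fold also wraps headers longer than 60 columns, as in the original pipeline.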

Saving lines between "start"s and "end"s to different arrays

I have a file that consists of a bunch of things but what I need are numbers between start and end strings: For example :
ghghgh
start
23
34
22
12
end
ghbd
wodkkh
234
start
14
56
74
end
So, I need two arrays here one containing 23,34,22,12 and one containing 14,56,74. What's the best command to use?
If I only had one start and one end I would be able to use mapfile and awk to obtain that array, but there's many start and ends in the file and I need to save all the arrays.
You can do it with sed.
sed -n '/start/{:a;N;/end/!ba;s/\n/, /g;s/, [^,][a-z][^,]*//Ig;s/start, //p}'
The code will iterate through all chunks between 'start' and 'end' lines.
It will remove all items with non-digit symbols and output each "array" on separate line.
Here is output from your data sample:
23, 34, 22, 12
14, 56, 74
You need to implement a small state machine - switching between in block and out of block:
awk '/end/{block = 0; print a; a = ""} (block) {a = a " " $0} /start/{block = 1}'
If at end, leave block, print and empty the accumulator. If in block, accumulate current line. If at start, mark that we're inside a block.
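The same in-block/out-of-block state machine can be sketched in bash itself, filling the arrays directly without temporary files or eval. This assumes bash 4.1 or later (for printf -v with an array subscript) and that the marker lines are exactly start and end. The names read_blocks, nblocks and arrN are illustrative:

```shell
#!/bin/bash
# Sketch: fill arr1, arr2, ... with the lines between "start" and "end".
# read_blocks and nblocks are illustrative names; nblocks ends up holding
# the number of blocks found.
read_blocks() {
    local line in_block=0 i=0
    nblocks=0
    while IFS= read -r line; do
        if [[ $line == start ]]; then
            in_block=1; i=0; nblocks=$((nblocks + 1))
        elif [[ $line == end ]]; then
            in_block=0
        elif (( in_block )); then
            # printf -v assigns to a variable named at run time, here arrN[i]
            printf -v "arr${nblocks}[$i]" '%s' "$line"
            i=$((i + 1))
        fi
    done < "$1"
}
```

After read_blocks file, the values are available as "${arr1[@]}", "${arr2[@]}" and so on. Arrays with the same names that already exist are not cleared first, so this is best run in a fresh shell or script.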
You can tell awk to change the output file every time a new sequence starts
awk '/start/{i++;f=1;next} /end/{f=0} f{print > "arr"i}' file
For the example file, this will create files: arr1, arr2. Then you can create separated arrays with the lines of these files:
for i in arr*; do readarray -t "$i" < "$i"; done
note: I have assumed that all lines between matching patterns are numeric and acceptable as in the example.
If you trust your input files enough for an eval:
$ cat tst.sh
eval $(
awk '
f {
if ( /end/ ) {
print "declare arr" ++cnt "=(" vals " )"
vals = ""
f = 0
}
else {
vals = vals OFS $0
}
}
/start/ { f = 1 }
' "$1"
)
printf "arr1:%s\n" "${arr1[@]}"
printf "arr2:%s\n" "${arr2[@]}"
$ ./tst.sh file
arr1:23
arr1:34
arr1:22
arr1:12
arr2:14
arr2:56
arr2:74
Check the quoting and all other shell gotchas...

Find nth row using AWK and assign them to a variable

Okay, I have two files: one is baseline and the other is a generated report. I have to validate a specific string in both the files match, it is not just a single word see example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criteria here is to find unique number such as "56633223223" and retrieve above 1 line and below 3 lines, i can do that on both the basefile and the report, and then compare if they match. In whole i need shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
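If the record and its word count are both needed, as in the question's pseudo-code, read -a can split each line into an array whose length is the word count. A sketch under that assumption; check_actlist is a hypothetical name:

```shell
#!/bin/bash
# Sketch: report each line's words and flag lines that do not have exactly 3.
# check_actlist is an illustrative name. $1 = the list file.
check_actlist() {
    local -a words
    local n=0
    while read -r -a words; do
        n=$((n + 1))
        if (( ${#words[@]} == 3 )); then
            echo "line $n: ${words[*]}"
        else
            echo "line $n: expected 3 words, got ${#words[@]}" >&2
        fi
    done < "$1"
}
```

Good lines go to stdout and complaints go to stderr, so the two can be logged separately.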
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tell us what you are trying to do AS A WHOLE, with sample input and expected output for your whole process, then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now let's take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
.
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new", and the file "actlist" said the 2 files had to be identical from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
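For comparison, the "N lines before, M lines after" idea can also be sketched with grep's context options. This is a rough alternative, not the answer's method: it assumes a grep that supports -B, -A and -m (GNU and BSD grep do; they are not POSIX) and an actlist with exactly three fields per line. compare_context is a hypothetical name:

```shell
#!/bin/bash
# Sketch: compare the context around each key in two files.
# compare_context is an illustrative name. $1 = actlist, $2 = old, $3 = new.
compare_context() {
    local actlist=$1 old=$2 new=$3 key before after a b
    while read -r key before after; do
        # -F -x: match the key as a fixed string on a line of its own;
        # -m1: stop after the first hit (trailing context is still printed).
        a=$(grep -F -x -m1 -B "$before" -A "$after" -- "$key" "$old")
        b=$(grep -F -x -m1 -B "$before" -A "$after" -- "$key" "$new")
        if [[ -z $a || -z $b ]]; then
            echo "$key missing from one of the files"
        elif [[ $a == "$b" ]]; then
            echo "$key match"
        else
            echo "$key mismatch"
        fi
    done < "$actlist"
}
```

It reads both files once per key, so the single-pass awk script above scales better; the sketch only reports whether the blocks differ, not where.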
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # These 3 fields will be redirected to a file which can be used for further processing.
This is as per the requirement. There are many other ways of doing this, but awk was asked for.

Bash: Sum fields of a line

I have a file with the following format:
a 1 2 3 4
b 7 8
c 120
I want it to be parsed into:
a 10
b 15
c 120
I know this can be easily done with awk, but I'm not familiar with the syntax and can't get it to work for me.
Thanks for any help
ok simple awk primer:
awk '{ for (i=2;i<=NF;i++) { total+=$i }; print $1,total; total=0 }' file
NF is an internal variable that is reset on each line and is equal to the number of fields on that line so
for (i=2;i<=NF;i++) starts a for loop starting at 2
total+=$i means the var total has the value of the i'th field added to it. and is performed for each iteration of the loop above.
print $1,total prints the 1st field followed by the contents of OFS variable (space by default) then the total for that line.
total=0 resets the totals var ready for the next iteration.
all of the above is done on each line of input.
For more info see grymoires intro here
Start from column two and add them:
awk '{tot=0; for(i=2;i<=NF;i++) tot+=$i; print $1, tot;}' file
A pure bash solution:
$ while read f1 f2
> do
> echo $f1 $((${f2// /+}))
> done < file
On running it, got:
a 10
b 15
c 120
The first field is read into variable f1 and the rest of the fields go into f2. In variable f2, each space is replaced with + and the resulting expression is evaluated.
Here's a tricky way to use a subshell, positional parameters and IFS. Works with various amounts of whitespace between the fields.
while read label numbers; do
echo $label $(set -- $numbers; IFS=+; bc <<< "$*")
done < filename
This works because the shell expands "$*" into a single string of the positional parameters joined by the first char of $IFS (documentation)
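A small sketch that isolates the "$*" and IFS behaviour, so it can be tried on its own. The helper names join_by and sum_args are hypothetical; the arithmetic is done with $(( )) instead of bc, which assumes integer input:

```shell
#!/bin/bash
# Sketch: join arguments with a separator via "$*" and IFS, then sum them.
# join_by and sum_args are illustrative names.
join_by() {
    local IFS=$1   # only the first character of IFS is used as the joiner
    shift
    printf '%s\n' "$*"
}

sum_args() {
    # Build "1+2+3" and let shell arithmetic evaluate it (integers only).
    echo $(( $(join_by + "$@") ))
}
```

For example, join_by + 1 2 3 4 prints 1+2+3+4, and sum_args 7 8 prints 15. Making IFS local keeps the change from leaking into the rest of the script, which is the same job the subshell does in the answer.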

Finding and replacing many words

I frequently need to make many replacements within files. To solve this problem, I have created two files old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.
All of my files use UTF-8 and make use of various languages.
I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the words at that line in input.txt with the corresponding words from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:
On line 6, the sed command does not know where the $number ends.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well, however, I use this script a lot and it takes many hours to finish. So I offer a bounty for a solution which can complete these replacements much quicker. A solution in BASH, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using other software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
On line 6, the sed command does not know where the $number ends.
Try quoting the variable with double quotes
linefromnewwords=$(sed -n "$number"p newwords.txt)
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Do this instead:
number=`expr $number + 1`
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
awk does not see shell variables: inside the single-quoted awk program, $linefromoldwords is not expanded by the shell. Variables used in awk need to be defined in the awk program itself (for example in its BEGIN block) or passed in from the shell with the -v option.
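A sketch of passing the two words into awk with -v. To sidestep regex metacharacters in the words, it replaces literal text with index and substr instead of gsub; that is a swapped-in technique, not what the question's script attempted. The function name replace_word is hypothetical:

```shell
#!/bin/bash
# Sketch: replace every literal occurrence of $1 with $2 on stdin.
# replace_word is an illustrative name. Note that awk -v still interprets
# backslash escapes in the values it is given.
replace_word() {
    awk -v old="$1" -v new="$2" '
        old == "" { print; next }        # nothing to search for
        {
            out = ""; rest = $0
            while ((i = index(rest, old)) > 0) {
                out  = out substr(rest, 1, i - 1) new
                rest = substr(rest, i + length(old))
            }
            print out rest
        }'
}
```

For example, replace_word 19 A < input.txt > output.txt. Calling it once per word pair still rereads the input for every pair, which is what makes the loop approach slow on long word lists.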
Here is a solution in bash that would do what you need.
Bash Solution:
#!/bin/bash
while read -r sub && read -r rep <&3; do
sed -i "s/ $sub / $rep /g" main.file
done <old.text 3<new.text
This solution reads one line at a time from substitution file and replacement file and performs in-line sed substitution.
Why not to
paste -d/ oldwords.txt newwords.txt |\
sed -e 's#/# / #' -e 's#^#s/ #' -e 's#$# /g#' >/tmp/$$.sed
sed -f /tmp/$$.sed original >changed
rm /tmp/$$.sed
?
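The generated sed script above breaks if a word contains /, & or a regex metacharacter. A sketch that escapes both word lists before pasting them together. It assumes bash (for process substitution) and basic regular expressions in sed; esc_pat, esc_rep and make_sed_script are hypothetical names:

```shell
#!/bin/bash
# Sketch: build "s/old/new/g" lines from two word lists, escaping specials.
# esc_pat, esc_rep and make_sed_script are illustrative names.
esc_pat() { sed 's|[][\\/.^$*]|\\&|g' "$1"; }   # BRE metacharacters and "/"
esc_rep() { sed 's|[\\/&]|\\&|g' "$1"; }        # "\", "/" and "&" in replacements

make_sed_script() {
    # $1 = file of old words, $2 = file of new words, one per line
    paste -d '/' <(esc_pat "$1") <(esc_rep "$2") | sed 's|^|s/|; s|$|/g|'
}
```

The result would be used as make_sed_script old.text new.text > words.sed followed by sed -f words.sed input.txt. Unlike the answer, this sketch does not add the surrounding spaces, so it also replaces matches inside longer words.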
I love this kind of questions, so here is my answer:
First, for the sake of simplicity, why not use a single file holding both source and translation? I mean: (filename changeThis)
hello=Bye dudes
the morNing=next Afternoon
first=last
Then you can define a proper separator in the script. (file replaceWords.sh)
#!/bin/bash
SEP=${1}
REPLACE=${2}
FILE=${3}
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
sed -i "s/${origin}/${dest}/gI" $FILE
done < $REPLACE
Take this example (file changeMe)
Hello, this is me.
I will be there at first time in the morning
Call it with
$ bash replaceWords.sh = changeThis changeMe
And you will get
Bye dudes, this is me.
I will be there at last time in next Afternoon
Take note of the two "i"s used with sed: "-i" means edit the source file in place, and "I" in the s// command means ignore case (a GNU extension; check your sed implementation).
Of course note that a bash while loop is horrendously slower than a python or similar scripting language. Depending on your needs you can do a nested while, one on the source file and one inside looping the translations (changes). Echoing all to stdout for pipe flexibility.
#!/bin/bash
SEP=${1}
TRANSLATION=${2}
FILE=${3}
while read line
do
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
line=$(echo $line | sed "s/${origin}/${dest}/gI")
done < $TRANSLATION
echo $line
done < $FILE
This Python 2 script forms the old words into a single regular expression then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct. This distinctness is enforced by surrounding the word in r'\b' which is the regular expression word boundary.
Input is from the command line (there is a commented-out alternative I used for development in IDLE). Output is to stdout.
The main text is scanned only once in this solution. With the input from Jaypals answer, the output is the same.
#!/bin/env python
import sys, re
def replacer(match):
    global new
    return new[match.lastindex-1]
if __name__ == '__main__':
    fname_old, fname_new, fname_txt = sys.argv[1:4]
    #fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()
    with file(fname_old) as f:
        # Form regular expression that matches old words, grouped in order
        old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
                               for word in f.read().strip().split()) + ')'
    with file(fname_new) as f:
        # Ordered list of replacement words
        new = [word for word in f.read().strip().split()]
    with file(fname_txt) as f:
        # input text
        txt = f.read()
    # Output the new text
    print( re.subn(old, replacer, txt)[0] )
I just did some stats on a ~100K byte text file:
Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
Top 10 distinct word occurrences in text: 2664 = 15.57%
The text was 250 paragraphs of lorem ipsum generated from here. I just took the ten most frequently occurring words and replaced them with the strings ONE to TEN in order.
The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.
The Python selection will replace words followed by a newline character or by punctuation as well as by any whitespace (including tabs etc).
Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages, and the technique shown in the Python could be replicated in Ruby or Perl.
Paddy.
A general perl solution that I have found to work well for replacing the keys in a map with their associated values is this:
my %map = (
19 => 'A',
20 => 'B',
);
my $key_regex = '(' . join('|', keys %map) . ')';
while (<>) {
s/$key_regex/$map{$1}/g;
print $_;
}
You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash-lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
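The same one-pass, one-lookup-per-word idea can be sketched in awk, loading the two word lists into a map first. This version replaces whole whitespace-separated fields only, and awk rebuilds any line it changes with single spaces between fields. The function name replace_words is hypothetical:

```shell
#!/bin/bash
# Sketch: one pass over the input, one array lookup per field.
# replace_words is an illustrative name.
# $1 = old words, $2 = new words (one per line), $3 = input text.
replace_words() {
    awk '
        FNR == 1   { nfile++ }                  # count input files (portable ARGIND stand-in)
        nfile == 1 { old[FNR] = $0; next }      # remember old word per line number
        nfile == 2 { if (FNR in old) map[old[FNR]] = $0; next }
        {
            for (i = 1; i <= NF; i++)
                if ($i in map) $i = map[$i]
            print
        }
    ' "$1" "$2" "$3"
}
```

The FNR == 1 counter assumes neither word list is empty. Words attached to punctuation (for example "19,") are not matched, because the lookup is by exact field.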
At line 6, the sed command does not know where the $number ends.
linefromnewwords=$(sed -n "${number}p" newwords.txt)
The braces mark where the variable name ends, so ${number}p works. The double quotes are needed because the shell does not expand variables inside single quotes.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval=evil).
number=$((number + 1))
In general, I would recommend using one file with
s/ ni3 / nǐ /g
s/ nei3 / neǐ /g
and so on, one sed-command per line, which is imho better to take care about - sort it alphabetically, and use it with:
sed -f translate.sed input > output
So you can always easily compare the mappings.
s/\bni3\b/nǐ/g
might be preferred over blanks as explicit delimiters, because \b (word boundary) also matches at the start/end of a line and next to punctuation characters.
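A tiny sketch contrasting the two styles on a line where the word touches the line edges and punctuation. It assumes GNU sed, since \b is a GNU extension; with_blanks and with_boundaries are hypothetical names:

```shell
#!/bin/bash
# Sketch: blank-delimited versus word-boundary substitution (GNU sed assumed).
# with_blanks and with_boundaries are illustrative names; both filter stdin.
with_blanks()     { sed 's/ ni3 / nǐ /g'; }     # needs a blank on both sides
with_boundaries() { sed 's/\bni3\b/nǐ/g'; }     # matches at line edges and punctuation
```

Feeding the line "ni3, ni3" through with_blanks leaves it unchanged, because neither occurrence has a blank on both sides, while with_boundaries replaces both occurrences.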
This should reduce the time by some means as this avoids unnecessary loops.
Merge two input files:
Lets assume you have two input files, old.text containing all substitutions and new.text containing all replacements.
We will create a new text file which will act as a sed script to your main file using the following awk one-liner:
awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat old.text
19
20
[jaypal:~/Temp] cat new.text
A
B
[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat merge.text
s/ 19 / A /g
s/ 20 / B /g
Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.
Using merged file as sed script:
Once your merged file has been created, we will use -f option of sed utility.
sed -f merge.text input_file
[jaypal:~/Temp] cat input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
[jaypal:~/Temp] sed -f merge.text input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
You can redirect this into another file using the > operator.
This might work for you:
paste {old,new}words.txt |
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' |
sed -i -f - text.txt
Here is a Python 2 script that should be both space and time efficient:
import sys
import codecs
import re
sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
               (line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))
regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))
for line in codecs.open("input.txt", "r", "utf-8"):
    result = regexp.sub(lambda match:sub[match.group(0)], line)
    sys.stdout.write(result.encode("utf-8"))
Here it is in action:
$ cat old.txt
19
20
$ cat new.txt
A
B
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$
EDIT: Hat tip to @Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combined your input word lists into one list: each line containing the map of old and new words.
#!/usr/bin/env perl
# usage:
# replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt
use strict;
use warnings;
sub read_words {
my $file = shift;
open my $fh, "<$file" or die "Error reading file: $file; $!\n";
my @words = <$fh>;
chomp @words;
close $fh;
return \@words;
}
sub word_map {
my ($old_words, $new_words) = @_;
if (scalar @$old_words != scalar @$new_words) {
warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
}
my $list_size = scalar @$old_words;
$list_size = scalar @$new_words if $list_size > scalar @$new_words;
my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;
return \%map;
}
sub build_regex {
my $words = shift;
my $pattern = join "|", sort { length $b <=> length $a } @$words;
return qr/$pattern/;
}
my $old_words = read_words(shift);
my $new_words = read_words(shift);
my $word_map = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);
my $input_file = shift;
open my $input, "<$input_file" or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
s/($old_pattern)/$word_map->{$&}/g;
print;
}
close $input;
__END__
Old words file:
$ cat old.txt
19
20
New words file:
$ cat new.txt
A
B
Input file:
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
Create output:
$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
I'm not sure why most of the previous posters insist on using regular-expressions to solve this task, I think this will be faster than most (if not the fastest method).
use warnings;
use strict;
open (my $fh_o, '<', "old.txt");
open (my $fh_n, '<', "new.txt");
my @hay = <>;
my @old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my @new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;
my %r;
@r{@old} = @new;
print defined $r{$_} ? $r{$_} : $_ for split (
/(\s+)/, "@hay"
);
Use: perl script.pl /file/to/modify, result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)
I believe that this perl script, although not using fancy sed or awk thingies, does the job fairly quick...
I did take the liberty to use another format of old_word to new_word:
the csv format. if it is too complicated to do it let me know and I'll add a script that takes your old.txt,new.txt and builds the csv file.
take it on a run and let me know!
by the way - if any of you perl gurus here can suggest a more perlish way to do something I do here I will love to read the comment:
#! /usr/bin/perl
# getting the user's input
if ($#ARGV == 1)
{
$LUT_file = shift;
$file = shift;
$outfile = $file . ".out.txt";
}
elsif ($#ARGV == 2)
{
$LUT_file = shift;
$file = shift;
$outfile = shift;
}
else { &usage; }
# opening the relevant files
open LUT, "<",$LUT_file or die "can't open $LUT_file for reading!\n : $!";
open FILE,"<",$file or die "can't open $file for reading!\n : $!";
open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";
# getting the lines from the text to be changed and changing them
%word_LUT = ();
WORD_EXT:while (<LUT>)
{
$_ =~ m/(\w+),(\w+)/;
$word_LUT{ $1 } = $2 ;
}
close LUT;
OUTER:while ($line = <FILE>)
{
@words = split(/\s+/,$line);
for( $i = 0; $i <= $#words; $i++)
{
if ( exists ($word_LUT { $words[$i] }) )
{
$words[$i] = $word_LUT { $words[$i] };
}
}
$newline = join(' ',@words);
print "old line - $line\nnewline - $newline\n\n";
print OUT $newline . "\n";
}
# now we have all the signals needed in the swav array, build the file.
close OUT;close FILE;
# Sub Routines
#
#
sub usage(){
print "\n\nreplacer.pl Usage:\n";
print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
print "<LUT file> - a LookUp Table of words, from the old word to the new one.
\t\t\twith the following csv format:
\t\t\told word,new word\n";
print "<Input file> - the input file\n";
print "<out file> - out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";
exit;
}

Resources