Add variable content to a new column in file - bash

I don't have much experience using UNIX and I wonder how to do this:
I have a bash variable with this content:
82 195 9 53
Current file looks like:
A
B
C
D
I want to add a new column to a file with that numbers, like this:
A 82
B 195
C 9
D 53
Hope you can help me. Thanks in advance.

Or simply use paste with a space as the delimiter, e.g. with your example file content in file:
var="82 195 9 53"
paste -d ' ' file <(printf "%s\n" $var)
(note: $var is use unquoted in the process-substitution with printf)
Result
A 82
B 195
C 9
D 53
Note for a general POSIX shell solution, you would simply pipe the output of printf to paste instead of using the bash-only process substitution, e.g.
printf "%s\n" $var | paste -d ' ' file -

With bash and an array:
numbers="82 195 9 53"
array=($numbers)
declare -i c=0 # declare with integer flag
while read -r line; do
echo "$line ${array[$c]}"
c=c+1
done < file
Output:
A 82
B 195
C 9
D 53

One idea using awk:
x='82 195 9 53'
awk -v x="${x}" 'BEGIN { split(x,arr) } { print $0,arr[FNR] }' file.txt
Where:
-v x="${x}" - pass OS variable "${x}" is as awk variable x
split(x,arr) - split awk variable x into array arr[] (default delimiter is space); this will give us arr[1]=82, arr[2]=195, arr[3]=9 and arr[4]=53
This generates:
A 82
B 195
C 9
D 53
The question has been tagged with windows-subsystem-for-linux.
If the input file has Windows/DOS line endings (\r\n) the proposed awk solution may generate incorrect results, eg:
82 # each line
195 # appears
9 # to begin
53 # with a space
In this scenario OP has a couple options:
before calling awk run dos2unix file.txt to convert to unix line engines (\n) or ...
change the awk/BEGIN block to BEGIN { RS="\r\n"; split(x,arr) }

With mapfile aka readarray which is a bash4+ feature.
#!/usr/bin/env bash
##: The variable with numbers
variable='82 195 9 53'
##: Save the the variable into an array named numbers
mapfile -t numbers <<< "${variable// /$'\n'}"
##: Save the content of the file into an array named file_content
mapfile -t file_content < file.txt
##: Loop through the indices of both the arrays and print them side-by-side
for i in "${!file_content[#]}"; do
printf '%s %d\n' "${file_content[i]}" "${numbers[i]}"
done

Related

Is there a command for substituting a set of characters by a set of strings?

I'm would like to substitute a set of edit: single byte characters with a set of literal strings in a stream, without any constraint on the line size.
#!/bin/bash
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
chars_to_strings $'\a\b\t\v' '<bell>' '<backspace>' '<horizontal-tab>' '<vertical-tab>'
The expected output would be:
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>...
I can think of a bash function that would do that, something like:
chars_to_strings() {
local delim buffer
while true
do
delim=''
IFS='' read -r -d '.' -n 4096 buffer && (( ${#buffer} != 4096 )) && delim='.'
if [[ -n "${delim:+_}" ]] || [[ -n "${buffer:+_}" ]]
then
# Do the replacements in "$buffer"
# ...
printf "%s%s" "$buffer" "$delim"
else
break
fi
done
}
But I'm looking for a more efficient way, any thoughts?
Since you seem to be okay with using ANSI C quoting via $'...' strings, then maybe use sed?
sed $'s/\a/<bell>/g; s/\b/<backspace>/g; s/\t/<horizontal-tab>/g; s/\v/<vertical-tab>/g'
Or, via separate commands:
sed -e $'s/\a/<bell>/g' \
-e $'s/\b/<backspace>/g' \
-e $'s/\t/<horizontal-tab>/g' \
-e $'s/\v/<vertical-tab>/g'
Or, using awk, which replaces newline characters too (by customizing the Output Record Separator, i.e., the ORS variable):
$ printf '\a,\b,\t,\v\n' | awk -vORS='<newline>' '
{
gsub(/\a/, "<bell>")
gsub(/\b/, "<backspace>")
gsub(/\t/, "<horizontal-tab>")
gsub(/\v/, "<vertical-tab>")
print $0
}
'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><newline>
For a simple one-liner with reasonable portability, try Perl.
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
perl -pe 's/\a/<bell>/g;
s/\b/<backspace>/g;s/\t/<horizontal-tab>/g;s/\v/<vertical-tab>/g'
Perl internally does some intelligent optimizations so it's not encumbered by lines which are longer than its input buffer or whatever.
Perl by itself is not POSIX, of course; but it can be expected to be installed on any even remotely modern platform (short of perhaps embedded systems etc).
Assuming the overall objective is to provide the ability to process a stream of data in real time without having to wait for a EOL/End-of-buffer occurrence to trigger processing ...
A few items:
continue to use the while/read -n loop to read a chunk of data from the incoming stream and store in buffer variable
push the conversion code into something that's better suited to string manipulation (ie, something other than bash); for sake of discussion we'll choose awk
within the while/read -n loop printf "%s\n" "${buffer}" and pipe the output from the while loop into awk; NOTE: the key item is to introduce an explicit \n into the stream so as to trigger awk processing for each new 'line' of input; OP can decide if this additional \n must be distinguished from a \n occurring in the original stream of data
awk then parses each line of input as per the replacement logic, making sure to append anything leftover to the front of the next line of input (ie, for when the while/read -n breaks an item in the 'middle')
General idea:
chars_to_strings() {
while read -r -n 15 buffer # using '15' for demo purposes otherwise replace with '4096' or whatever OP wants
do
printf "%s\n" "${buffer}"
done | awk '{print NR,FNR,length($0)}' # replace 'print ...' with OP's replacement logic
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1 # add some delay to data being streamed to chars_to_strings()
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
A variation on this idea using a named pipe:
mkfifo /tmp/pipeX
sleep infinity > /tmp/pipeX # keep pipe open so awk does not exit
awk '{print NR,FNR,length($0)}' < /tmp/pipeX &
chars_to_strings() {
while read -r -n 15 buffer
do
printf "%s\n" "${buffer}"
done > /tmp/pipeX
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
# kill background 'awk' and/or 'sleep infinity' when no longer needed
don't waste FS/OFS - use the built-in variables to take 2 out of the 5 needed :
echo $' \t abc xyz \t \a \n\n ' |
mawk 'gsub(/\7/, "<bell>", $!(NF = NF)) + gsub(/\10/,"<bs>") +\
gsub(/\11/,"<h-tab>")^_' OFS='<v-tab>' FS='\13' ORS='<newline>'
<h-tab> abc xyz <h-tab> <bell> <newline><newline> <newline>
To have NO constraint on the line length you could do something like this with GNU awk:
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(foo,bar)
print
}'
That will read and process the input 100 chars at a time no matter which chars are present, whether it has newlines or not, and even if the input was one multi-terabyte line.
Replace gsub(foo,bar) with whatever substitution(s) you have in mind, e.g.:
$ printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(/\a/,"<bell>")
gsub(/\b/,"<backspace>")
gsub(/\t/,"<horizontal-tab>")
gsub(/\v/,"<vertical-tab>")
print
}'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab>
and of course it'd be trivial to pass a list of old and new strings to awk rather than hardcoding them, you'd just have to sanitize any regexp or backreference metachars before calling gsub().

assign two different variables from same line and loop for every line [duplicate]

This question already has answers here:
How to read variables from file, with multiple variables per line?
(2 answers)
Closed last month.
I am trying to assign variables obtained by awk, from a 2 columned txt file.
To a command, which includes every two value as two variables in it.
For example, the file I use is;
foo.txt
10 20
33 40
65 78
my command is aiming to print ;
end=20 start=10
end=40 start=33
end=78 start=65
Basically, I want to iterate the code for every line, and for output, there will be two variables from the two columns of the input file.
I am not an awk expert (I am trying my best), what I could have done so far is this fusion;
while read -r line ; do awk '{ second_variable=$2 ; first_variable=$1 ; }' ; echo "end=$first_name start=$second_name"; done <foo.txt
but it only gives this output;
end= start=
only one time without any variable. I would appreciate any suggestion. Thank you.
In bash you only need while, read and printf:
while read -r start end
do printf 'end=%d start=%d\n' "$end" "$start"
done < foo.txt
end=20 start=10
end=40 start=33
end=78 start=65
With awk, you could do:
awk '{print "end=" $2, "start=" $1}' foo.txt
end=20 start=10
end=40 start=33
end=78 start=65
With sed you'd use regular expressions:
sed -E 's/([0-9]+) ([0-9]+)/end=\2 start=\1/' foo.txt
end=20 start=10
end=40 start=33
end=78 start=65
Just in Bash:
while read -r end start; do echo "end=$end start=$start"; done <foo.txt
What about using xargs?
xargs -n2 sh -c 'echo end=$1 start=$2' sh < file.txt
Demo
xargs -n2 sh -c 'echo end=$1 start=$2' sh <<INPUT
10 20
33 40
65 78
INPUT
Output
end=10 start=20
end=33 start=40
end=65 start=78

Bash compute the letter suffix from the split command (i.e. integer into base 26 with letters)

The split command produces by default a file suffix of the form "aa" "ab" ... "by" "bz"...
However in a script, I need to recover this suffix, starting from the file number as an integer (without globbing).
I wrote the following code, but maybe bash wizards here have a more concise solution?
alph="abcdefghijklmnopqrstuvwxyz"
for j in {0..100}; do
# Convert j to the split suffix (aa ab ac ...)
first=$(( j / 26 ))
sec=$(( j % 26 ))
echo "${alph:$first:1}${alph:$sec:1}"
done
Alternatively, I could use bc with the obase variable, but it only outputs one number in case j<26.
bc <<< 'obase=26; 5'
# 05
bc <<< 'obase=26; 31'
# 01 05
Use this Perl one-liner and specify the file numbers (0-indexed) as arguments, for example:
perl -le 'print for ("aa".."zz")[#ARGV]' 0 25 26
Output:
aa
az
ba
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
#ARGV : array of the command-line arguments.
From top of my head, depending on 97 beeing ASCII a:
printf "\x$(printf %x $((97+j/26)))\x$(printf %x $((97+j%26)))\n"
printf "\\$(printf %o $((97+j/26)))\\$(printf %o $((97+j%26)))\n"
awk "BEGIN{ printf \"%c%c\\n\", $((97+j/26)), $((97+j%26))}" <&-
printf %x $((97+j/26)) $((97+j%26)) | xxd -r -p
You could also just write without temporary variables:
echo "${alph:j/26:1}${alph:j%26:1}"
In my use case, I do want to generate the full list
awk should be fast:
awk 'BEGIN{ for (i=0;i<=100;++i) printf "%c%c\n", 97+i/26, 97+i%26}' <&-

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and cleanup the output a little. It seems overly messy and i'm wondering if the script can be trimmed down a bit.
The input file contains of pairs of lines -- names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
another awk
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
for the empty line record delimiter
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
[[ $line ]] || continue # skip blank lines
[[ -z $name ]] && { name=$line; continue; } # first non-blank line becomes name
number=$line # second one becomes number
if (( number >= 80 && number < 200 )); then
name=${name%%-*} # prune everything after first "-"
printf '%s %s\n' "$name" "$number" # emit our output
fi
name=; number= # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
perl -nle's/-.*//; $n=<>; print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With gnu sed
sed -E '
N
/\n[8-9][0-9]$/bA
/\n1[0-9]{2}$/!d
:A
s/([^-]*).*\n([0-9]+$)/\1 \2/
' infile

Finding and replacing many words

I frequently need to make many replacements within files. To solve this problem, I have created two files old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.
All of my files use UTF-8 and make use of various languages.
I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the words at that line in input.txt with the corresponding words from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:
On line 6, the sed command does not know where the $number ends.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well, however, I use this script a lot and it takes many hours to finish. So I offer a bounty for a solution which can complete these replacements much quicker. A solution in BASH, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using other software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
One line 6, the sed command does not know where the $number ends.
Try quoting the variable with double quotes
linefromnewwords=$(sed -n "$number"p newwords.txt)
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Do this instead:
number=`expr $number + 1`
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
awk won't take variables outside its scope. User defined variables in awk needs to be either defined when they are used or predefined in the awk's BEGIN statement. You can include shell variables by using -v option.
Here is a solution in bash that would do what you need.
Bash Solution:
#!/bin/bash
while read -r sub && read -r rep <&3; do
sed -i "s/ "$sub" / "$rep" /g" main.file
done <old.text 3<new.text
This solution reads one line at a time from substitution file and replacement file and performs in-line sed substitution.
Why not to
paste -d/ oldwords.txt newwords.txt |\
sed -e 's#/# / #' -e 's#^#s/ #' -e 's#$# /g#' >/tmp/$$.sed
sed -f /tmp/$$.sed original >changed
rm /tmp/$$.sed
?
I love this kind of questions, so here is my answer:
First for the shake of simplicity, Why not use only a file with source and translation. I mean: (filename changeThis)
hello=Bye dudes
the morNing=next Afternoon
first=last
Then you can define a proper separator in the script. (file replaceWords.sh)
#!/bin/bash
SEP=${1}
REPLACE=${2}
FILE=${3}
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
sed -i "s/${origin}/${dest}/gI" $FILE
done < $REPLACE
Take this example (file changeMe)
Hello, this is me.
I will be there at first time in the morning
Call it with
$ bash replaceWords.sh = changeThis changeMe
And you will get
Bye dudes, this is me.
I will be there at last time in next Afternoon
Take note of the "i" amusement with sed. "-i" means replace in source file, and "I" in s// command means ignore case -a GNU extension, check your sed implementation-
Of course note that a bash while loop is horrendously slower than a python or similar scripting language. Depending on your needs you can do a nested while, one on the source file and one inside looping the translations (changes). Echoing all to stdout for pipe flexibility.
#!/bin/bash
SEP=${1}
TRANSLATION=${2}
FILE=${3}
while read line
do
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
line=$(echo $line | sed "s/${origin}/${dest}/gI")
done < $TRANSLATION
echo $line
done < $FILE
This Python 2 script forms the old words into a single regular expression then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct. This distinctness is enforced by surrounding the word in r'\b' which is the regular expression word boundary.
Input is from the commandline (their is a commented alternative I used for development in idle). Output is to stdout
The main text is scanned only once in this solution. With the input from Jaypals answer, the output is the same.
#!/bin/env python
import sys, re
def replacer(match):
global new
return new[match.lastindex-1]
if __name__ == '__main__':
fname_old, fname_new, fname_txt = sys.argv[1:4]
#fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()
with file(fname_old) as f:
# Form regular expression that matches old words, grouped in order
old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
for word in f.read().strip().split()) + ')'
with file(fname_new) as f:
# Ordered list of replacement words
new = [word for word in f.read().strip().split()]
with file(fname_txt) as f:
# input text
txt = f.read()
# Output the new text
print( re.subn(old, replacer, txt)[0] )
I just did some stats on a ~100K byte text file:
Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
Top 10 distinct word occurences in text: 2664 = 15.57%
The text was 250 paragraphs of lorum ipsum generated from here I just took the ten most frequently occuring words and replaced them with the strings ONE to TEN in order.
The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.
The Python selection will replace words followed by a newline character or by punctuation as well as by any whitespace (including tabs etc).
Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages and the technique shown in the Python coukld be replicated in Ruby or Perl.
Paddy.
A general perl solution that I have found to work well for replacing the keys in a map with their associated values is this:
my %map = (
19 => 'A',
20 => 'B',
);
my $key_regex = '(' . join('|', keys %map) . ')';
while (<>) {
s/$key_regex/$map{$1}/g;
print $_;
}
You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash-lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
At line 6, the sed command does not know where the $number ends.
linefromnewwords=$(sed -n '${number}p' newwords.txt)
I'm not sure about the quoting, but ${number}p will work - maybe "${number}p"
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval=evil).
number=$((number + 1))
In general, I would recommend using one file with
s/ ni3 / nǐ /g
s/ nei3 / neǐ /g
and so on, one sed-command per line, which is imho better to take care about - sort it alphabetically, and use it with:
sed -f translate.sed input > output
So you can always easily compare the mappings.
s/\bni3\b/nǐ/g
might be prefered over blanks as explicit delimiters, because \b:=word boundary matches start/end of line and punctuation characters.
This should reduce the time by some means as this avoids unnecessary loops.
Merge two input files:
Lets assume you have two input files, old.text containing all substitutions and new.text containing all replacements.
We will create a new text file which will act as a sed script to your main file using the following awk one-liner:
awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat old.text
19
20
[jaypal:~/Temp] cat new.text
A
B
[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat merge.text
s/ 19 / A /g
s/ 20 / B /g
Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.
Using merged file as sed script:
Once your merged file has been created, we will use -f option of sed utility.
sed -f merge.text input_file
[jaypal:~/Temp] cat input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
[jaypal:~/Temp] sed -f merge.text input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
You can redirect this into another file using the > operator.
This might work for you:
paste {old,new}words.txt |
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' |
sed -i -f - text.txt
Here is a Python 2 script that should be both space and time efficient:
import sys
import codecs
import re
sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
(line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))
regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))
for line in codecs.open("input.txt", "r", "utf-8"):
result = regexp.sub(lambda match:sub[match.group(0)], line)
sys.stdout.write(result.encode("utf-8"))
Here it is in action:
$ cat old.txt
19
20
$ cat new.txt
A
B
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$
EDIT: Hat tip to #Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combined your input word lists into one list: each line containing the map of old and new words.
#!/usr/bin/env perl
# usage:
# replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt
use strict;
use warnings;
sub read_words {
my $file = shift;
open my $fh, "<$file" or die "Error reading file: $file; $!\n";
my #words = <$fh>;
chomp #words;
close $fh;
return \#words;
}
sub word_map {
my ($old_words, $new_words) = #_;
if (scalar #$old_words != scalar #$new_words) {
warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
}
my $list_size = scalar #$old_words;
$list_size = scalar #$new_words if $list_size > scalar #$new_words;
my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;
return \%map;
}
sub build_regex {
my $words = shift;
my $pattern = join "|", sort { length $b <=> length $a } #$words;
return qr/$pattern/;
}
my $old_words = read_words(shift);
my $new_words = read_words(shift);
my $word_map = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);
my $input_file = shift;
open my $input, "<$input_file" or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
s/($old_pattern)/$word_map->{$&}/g;
print;
}
close $input;
__END__
Old words file:
$ cat old.txt
19
20
New words file:
$ cat new.txt
A
B
Input file:
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
Create output:
$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
I'm not sure why most of the previous posters insist on using regular-expressions to solve this task, I think this will be faster than most (if not the fastest method).
use warnings;
use strict;
open (my $fh_o, '<', "old.txt");
open (my $fh_n, '<', "new.txt");
my #hay = <>;
my #old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my #new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;
my %r;
; #r{#old} = #new;
print defined $r{$_} ? $r{$_} : $_ for split (
/(\s+)/, "#hay"
);
Use: perl script.pl /file/to/modify, result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)
I believe that this perl script, although not using fancy sed or awk thingies, does the job fairly quick...
I did take the liberty to use another format of old_word to new_word:
the csv format. if it is too complicated to do it let me know and I'll add a script that takes your old.txt,new.txt and builds the csv file.
take it on a run and let me know!
by the way - if any of you perl gurus here can suggest a more perlish way to do something I do here I will love to read the comment:
#! /usr/bin/perl
# getting the user's input
if ($#ARGV == 1)
{
$LUT_file = shift;
$file = shift;
$outfile = $file . ".out.txt";
}
elsif ($#ARGV == 2)
{
$LUT_file = shift;
$file = shift;
$outfile = shift;
}
else { &usage; }
# opening the relevant files
open LUT, "<",$LUT_file or die "can't open $signal_LUT_file for reading!\n : $!";
open FILE,"<",$file or die "can't open $file for reading!\n : $!";
open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";
# getting the lines from the text to be changed and changing them
%word_LUT = ();
WORD_EXT:while (<LUT>)
{
$_ =~ m/(\w+),(\w+)/;
$word_LUT{ $1 } = $2 ;
}
close LUT;
OUTER:while ($line = <FILE>)
{
#words = split(/\s+/,$line);
for( $i = 0; $i <= $#words; $i++)
{
if ( exists ($word_LUT { $words[$i] }) )
{
$words[$i] = $word_LUT { $words[$i] };
}
}
$newline = join(' ',#words);
print "old line - $line\nnewline - $newline\n\n";
print OUT $newline . "\n";
}
# now we have all the signals needed in the swav array, build the file.
close OUT;close FILE;
# Sub Routines
#
#
sub usage(){
print "\n\n\replacer.pl Usage:\n";
print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
print "<LUT file> - a LookUp Table of words, from the old word to the new one.
\t\t\twith the following csv format:
\t\t\told word,new word\n";
print "<Input file> - the input file\n";
print "<out file> - out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";
exit;
}

Resources