Iterate two arrays simultaneously in bash and sed

I need help with iterating over two arrays simultaneously in bash and using sed to replace a word in a file in each directory with the corresponding entry from the second array. There are two lists:
List1 contains the names of the directories; each directory contains a parameter file with the same name, "surfnet.par".
List1.txt
3IT6_1
3IT6_3
3IT6_6
3IT6_9
3IT6_11
3IT6_12
3IT6_19
3IT6_23
3IT6_54
3IT6_62
List2 contains the numbers corresponding to each directory; each number has to replace a specific word (a single occurrence) in that directory's "surfnet.par".
List2.txt
11351
11357
11371
11384
11350
11373
11383
11365
11377
11382
To make it clearer:
I want to replace the word "Resnum" in "surfnet.par" in directory "3IT6_1" of List1 with "11351" from List2; likewise, replace the same word in "surfnet.par" of 3IT6_3 with 11357, of 3IT6_6 with 11371, of 3IT6_9 with 11384, and so on.
I have tried reading the lists into arrays and then using a for loop to replace the word, but failed: it took the first value of List2 and replaced it in all the "surfnet.par" files in the different directories. The script I have been using is below:
#!/bin/bash
declare -a dir
declare -a res
dir=(`cat "List1.txt" `)
res=(`cat 'List2.txt'`)
for i in "${dir[@]}"
do
echo $i
cd $i
sed -e "s/Resnum/${res[0]}/g" surfnet.par > surfnet2.par
cd ..
done
I would appreciate it very much if any of you could help me fix this code and point out the modifications that need to be made. In case my code doesn't make any sense, please provide a solution using bash, awk, sed or perl.

I think you may have been a whole lot closer to a solution than you thought. This is one situation in bash where making use of a C-style loop and iterating on an index comes in very handy. The following slight changes to your code should work; give it a try (note: I added a check on cd and used the starting directory as current to enable the use of absolute paths):
#!/bin/bash
declare -a dir
declare -a res
dir=( $(<List1.txt) )
res=( $(<List2.txt) )
current="$PWD"
for ((i = 0; i < ${#dir[@]}; i++))
do
cd "$current/${dir[i]}" || {
echo "failed to change to ${dir[i]}"
continue
}
printf "%3d %-8s Resnum -> %s\n" $i ${dir[i]} ${res[i]}
sed -e "s/Resnum/${res[i]}/g" surfnet.par > surfnet2.par
done
Example Use
Tested with your ListX.txt files with cd and sed calls commented out.
$ bash resnum.sh
0 3IT6_1 Resnum -> 11351
1 3IT6_3 Resnum -> 11357
2 3IT6_6 Resnum -> 11371
3 3IT6_9 Resnum -> 11384
4 3IT6_11 Resnum -> 11350
5 3IT6_12 Resnum -> 11373
6 3IT6_19 Resnum -> 11383
7 3IT6_23 Resnum -> 11365
8 3IT6_54 Resnum -> 11377
9 3IT6_62 Resnum -> 11382
Note: in bash, for indexed arrays the $ on the index variable is not required (e.g. ${dir[$i]} works the same as ${dir[i]}); the subscript is evaluated as if it were enclosed in ((..)), as in the loop declaration.
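A quick way to see this (a throwaway example, not part of the script above):
arr=(a b c)
i=1
echo "${arr[$i]}" "${arr[i]}"   # prints: b b -- both forms give the same element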
Note2: you should probably add a validation that both values are available at the top of the loop before calling cd to change to the desired directory:
## validate both values available
[ -z "${dir[i]}" -o -z "${res[i]}" ] && {
echo "'${dir[i]}' or '${res[i]}' missing."
continue
}

If you don't like typing too much, you can do this:
while read d s; do sed 's/target/'"$s"'/g' "$d"/f.txt > "$d"/f2.txt; done < <(paste list1 list2)
Replace target with your search word, and f.txt, f2.txt, list1 and list2 with the file names you actually use; it should be clear which is which.
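Spelled out with the file names from the question (List1.txt, List2.txt, surfnet.par and the word Resnum), it would read something like this (untested sketch):
while read d s; do sed 's/Resnum/'"$s"'/g' "$d"/surfnet.par > "$d"/surfnet2.par; done < <(paste List1.txt List2.txt)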

Your question bears a Perl tag, so I assume that Perl solutions are acceptable.
Your question isn't very clear, but I think this program should help you:
use strict;
use warnings;
use v5.10.1;
use autodie;
my @dirs = slurp('Listdirs204.txt');
my @res = slurp('LastHetatmRes.txt');
die "File sizes don't match" unless @dirs == @res;
for my $i ( 0 .. $#dirs ) {
my ($dir, $res) = ($dirs[$i], $res[$i]);
my $file = "$dir/surfnet.par";
my @lines = slurp($file);
s/Resnum/$res/g for @lines;
open my $fh, '>', $file;
print $fh "$_\n" for @lines;
close $fh;
}
sub slurp {
open my $fh, '<', shift;
my @lines = <$fh>;
chomp @lines;
@lines;
}

Related

Script to pick random directory in bash

I have a directory full of directories containing exam subjects I would like to work on randomly to simulate the real exam.
They are classified by difficulty level:
0-0, 0-1 .. 1-0, 1-1 .. 2-0, 2-1 ..
I am trying to write a shell script allowing me to pick one subject (directory) randomly based on the parameter I pass when executing the script (0, 1, 2 ..).
I can't quite figure it, here is my progress so far:
ls | find . -name "1$~" | sort -r | head -n 1
What am I missing here?
There's no need for any external commands (ls, find, sort, head) for this at all:
#!/usr/bin/env bash
set -o nullglob # make globs expand to nothing, not themselves, when no matches found
dirs=( "$1"*/ ) # list directories starting with $1 into an array
# Validate that our glob actually had at least one match
(( ${#dirs[@]} )) || { printf 'No directories start with %q at all\n' "$1" >&2; exit 1; }
idx=$(( RANDOM % ${#dirs[#]} )) # pick a random index into our array
echo "${dirs[$idx]}" # and look up what's at that index

Shell script: segregate multiple files

I have this in my local directory ~/Report:
Rep_{ReportType}_{Date}_{Seq}.csv
Rep_0001_20150102_0.csv
Rep_0001_20150102_1.csv
Rep_0102_20150102_0.csv
Rep_0503_20150102_0.csv
Rep_0503_20150102_0.csv
Using shell-script,
How do I get multiple files from a local directory with a fixed batch size?
How do I segregate/group the files together by report type (0001 files are grouped together, 0102 grouped together, 0503 grouped together, etc.)
I will generate a sequence file (using forqlift) for EACH group/report type. The output would be Report0001.seq, Report0102.seq, Report0503.seq (3 sequence files). In which I will save to a different directory.
Note: In sequence files, the key is the filename of csv (Rep_0001_20150102.csv), and the value is the content of the file. It is stored as [String, BytesWritable].
This is my code:
1 reportTypes=(0001 0102 8902)
2
3 # collect all files matching expression into an array
4 filesWithDir=(~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-1].csv)
5
6 # take only the first hundred
7 filesWithDir=( "${filesWithDir[@]:0:100}" )
8
9 # files="${filesWithDir[@]##*/}" #### commented out since forqlift cannot create sequence file without the path/to/file
10 # echo ${files[#]}
11
12 shopt -s nullglob
13
14 # Line 21 is commented out since it has a bug. It collects files in
15 # current directory when it should be filtering the "files array" created
16 # in line 7
17
18
19 for i in ${reportTypes[@]}; do
20 printf -v val '%04d' "$i"
21 # files=("Rep_${val}_"*.csv)
# solution to BUG: (filter files array)
groupFiles=( $( for j in ${filesWithDir[@]} ; do echo $j ; done | grep ${val} ) )
22
23 # Generate sequence file for EACH Report Type
24 forqlift create --file="Report${val}.seq" "${groupFiles[@]}"
25 done
(Note: The sequence file output should be in current directory, not in ~/Report)
It's easy to take only a subset of an array:
# collect all files matching expression into an array
files=( ~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].csv )
# take only the first hundred
files=( "${files[#]:0:100}" )
The second part is trickier: Bash has associative arrays ("maps"), but the only legal values which can be stored in arrays are strings -- not other arrays -- so you can't store a list of filenames as a value associated with a single entry (without serializing the array to and from a string -- a moderately tricky thing to do safely, since file paths in UNIX can contain any character other than NUL, newlines included).
It's better, then, to just generate the array as you need it.
shopt -s nullglob # allow a glob to expand to zero arguments
for ((i=1; i<=1000; i++)); do
printf -v val '%04d' "$i" # pad digits: 12 -> 0012
files=( "Rep_${val}_"*.csv ) # collect files that match
## emit NUL-separated list of files, if any were found
#(( ${#files[@]} )) && printf '%s\0' "${files[@]}" >"Reports.$val.txt"
# Create a sequence file with forqlift
forqlift create --file="Reports-${val}.seq" "${files[#]}"
done
If you really don't want to do that, then we can put something together that uses namevars for redirection:
#!/bin/bash
# This only works with bash 4.3
re='^Rep_([[:digit:]]{4})_[[:digit:]]{8}\.csv$'
counter=0
for f in *; do
[[ $f =~ $re ]] || continue # skip files not matching regex
if ((++counter > 100)); then break; fi # stop after 100 files
group=${BASH_REMATCH[1]} # retrieve first regex group
declare -g -a "array${group}" # declare an array
declare -n group_arr="array${group}" # redirect group_arr to that array
group_arr+=( "$f" ) # append to the array
done
for varname in "${!array@}"; do
declare -n group_arr="$varname"
## NUL-delimited form
#printf '%s\0' "${group_arr[@]}" \
# >"collection${varname#array}" # write to files named collection0001, etc.
# forqlift sequence file form
forqlift create --file="Reports-${varname#array}.seq" "${group_arr[#]}"
done
I would move away from shell scripts and start to look towards perl.
#!/usr/bin/env perl
use strict;
use warnings;
my %groups;
while ( my $filename = glob ( "~/Reports/Rep_*.csv" ) ) {
my ( $group, $id ) = ( $filename =~ m,/Rep_(\d{4})_(\d{8})\.csv$, );
next unless $group; #undefined means it didn't match;
#anything past 100 in a group is discarded:
if ( @{$groups{$group}} < 100 ) {
push ( @{$groups{$group}}, $filename );
}
}
foreach my $group ( keys %groups ) {
print "$group contains:\n";
print join ("\n", @{$groups{$group}});
}
Another alternative is to cobble some bash commands together with a regexp.
See the implementation below.
# Explanation:
# ls -p = List all files and directories in local directory by path
# grep -v / = ignore subdirectories
# grep "^Rep_\d{4}_\d{8}\.csv$" = Look for files matching your regexp
# tail -100 = get 100 results
for file in $(ls -p | grep -v / | grep "^Rep_\d{4}_\d{8}\.csv$" | tail -100);
do echo $file;
# Use reg exp to extract the desired sequence
re="^Rep_([[:digit:]]{4})_([[:digit:]]{8}).csv$";
if [[ $name =~ $re ]]; then
sequence = ${BASH_REMATCH[1};
# Didn't end up using date, but in case you want it
# date = ${BASH_REMATCH[2]};
# Just in case the sequence file doesn't exist
if [ ! -f "$sequence" ] ; then
touch "$sequence"
fi
# Output/Concat your filename to the sequence file, which you can
# read in later to do whatever administrative tasks you wish to do
# to them
echo "$file" >> "$sequence"
fi
done;

Iterative and conditional deleting of lines in a file

Intro
I have a file named data.dat with the following structure:
1: 67: 1 :s
1: 315: 1 :s
1: 648: 1 :ns
1: 799: 1 :s
1: 809: 1 :s
1: 997: 1 :ns
2: 32: 1 :s
Algorithm
The algorithm that I'm looking for is:
Generate a random number between 1 and number of lines in this file.
Delete that line if the fourth column is "s".
Otherwise generate another random number and repeat this until the number of lines reaches a certain value.
Technical Concepts
Though the technical concepts are irrelevant to this algorithm, I will try to explain the problem. The data show the connectivity table of a network. This algorithm allows us to run it over different initial conditions and study general properties of these networks. In particular, because of the randomness of the bond deletion, any common behavior among these networks can be interpreted as a fundamental law.
Update: Another good reason to produce a random number at each step is that after removing each line, it's possible that the s/ns property of the remaining lines changes.
Code
Here is the code I have until now:
#!/bin/bash
# bash in OSX
While ((#there is at least 1 s in the fourth column)); do
LEN=$(grep -c "." data.dat) # number of lines
RAND=$((RANDOM%${LEN}+1)) # generating random number
if [[awk -F, "NR==$RAND" 'data.dat' | cut -d ':' -f 4- == "s"]]; then
sed '$RANDd' data.txt
else
#go back and produce another random
done
exit
I try to find the fourth column with awk -F, "NR==$RAND" 'data.dat' | cut -d ':' -f 4- and to delete the line with sed '$RANDd' data.txt.
Questions
How should I check that there are s pairs in my file?
I am not sure if the condition in the if is correct.
Also, I don't know how to make the loop, after else, go back and generate another random number.
Thank you,
I really appreciate your help.
Personally, I would recommend against doing this in bash unless you have absolutely no choice.
Here's another way you could do it in Perl (quite similar in functionality to Alex's answer but a bit simpler):
use strict;
use warnings;
my $filename = shift;
open my $fh, "<", $filename or die "could not open $filename: $!";
chomp (my @lines = <$fh>);
my $sample = 0;
my $max_samples = 10;
while ($sample++ < $max_samples) {
my $line_no = int rand @lines;
my $line = $lines[$line_no];
if ($line =~ /:s\s*$/) {
splice @lines, $line_no, 1;
}
}
print "$_\n" for #lines;
Usage: perl script.pl data.dat
Read the file into the array @lines. Pick a random line from the array and if it ends with :s (followed by any number of spaces), remove it. Print the remaining lines at the end.
This does what you want but I should warn you that relying on built-in random number generators in any language is not a good way to arrive at statistically significant conclusions. If you need high-quality random numbers, you should consider using a module such as Math::Random::MT::Perl to generate them, rather than the built-in rand.
#!/usr/bin/env perl
# usage: $ excise.pl < data.dat > smaller_data.dat
my $sampleLimit = 10; # sample up to ten lines before printing output
my $dataRef;
my $flagRef;
while (<>) {
chomp;
push (@{$dataRef}, $_);
push (@{$flagRef}, 1);
}
my $lineCount = scalar @{$dataRef};
my $sampleIndex = 0;
while ($sampleIndex < $sampleLimit) {
my $sampleLineIndex = int(rand($lineCount));
my @sampleElems = split(":", $dataRef->[$sampleLineIndex]);
if ($sampleElems[3] =~ /^\s*s\s*$/) {
$flagRef->[$sampleLineIndex] = 0;
}
$sampleIndex++;
}
# print data.dat to standard output, minus any sampled lines that had an 's' in them
foreach my $lineIndex (0..(scalar @{$dataRef} - 1)) {
if ($flagRef->[$lineIndex] == 1) {
print STDOUT $dataRef->[$lineIndex]."\n";
}
}
NumLine=$( grep -c "" data.dat )
while [ ${NumLine} -gt ${TargetLine} ]
do
# echo "Line at start: ${NumLine}"
RndLine=$(( ( ${RANDOM} % ${NumLine} ) + 1 ))
RndValue="$( echo " ${RANDOM}" | sed 's/.*\(.\{6\}\)$/\1/' )"
sed "${RndLine} {
s/^\([^:]*:\)[^:]*\(:.*:ns$\)/\1${RndValue}\2/
t
d
}" data.dat > /tmp/data.dat
mv /tmp/data.dat data.dat
NumLine=$( grep -c "" data.dat )
#cat data.dat
#echo "- Next Iteration -------"
done
Tested on AIX (so not GNU sed). Under Linux, use the --posix option for sed, and in this case you can use -i in place of the temporary file + redirection + move.
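Following that note, under GNU sed the whole temporary-file dance could be collapsed into an in-place edit, roughly like this (a sketch based on the note above, untested):
sed --posix -i "${RndLine} {
s/^\([^:]*:\)[^:]*\(:.*:ns$\)/\1${RndValue}\2/
t
d
}" data.dat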
Don't forget that RANDOM is NOT truly random, so a study of network behavior based on a value that is not really random may not reflect reality but only a specific case.

bash shell script two variables in for loop

I am new to shell scripting, so kindly bear with me if my doubt is too silly.
I have png images in 2 different directories and an executable which takes an images from each directory and processes them to generate a new image.
I am looking for a for loop construct which can take two variables simultaneously. This is possible in C, C++, etc., but how do I accomplish something like the following? The code is obviously wrong.
#!/bin/sh
im1_dir=~/prev1/*.png
im2_dir=~/prev3/*.png
index=0
for i,j in $im1_dir $im2_dir # i iterates in im1_dir and j iterates in im2_dir
do
run_black.sh $i $j
done
thanks!
If you are depending on the two directories to match up based on a locale sorted order (like your attempt), then an array should work.
im1_files=(~/prev1/*.png)
im2_files=(~/prev3/*.png)
for ((i=0;i<${#im1_files[@]};i++)); do
run_black.sh "${im1_files[i]}" "${im2_files[i]}"
done
Here are a few additional ways to do what you're looking for with notes about the pros and cons.
The following only works with filenames that do not include newlines. It pairs the files in lockstep. It uses an extra file descriptor to read from the first list. If im1_dir contains more files, the loop will stop when im2_dir runs out. If im2_dir contains more files, file1 will be empty for all unmatched file2. Of course if they contain the same number of files, there's no problem.
#!/bin/bash
im1_dir=(~/prev1/*.png)
im2_dir=(~/prev3/*.png)
exec 3< <(printf '%s\n' "${im1_dir[@]}")
while IFS=$'\n' read -r -u 3 file1; read -r file2
do
run_black "$file1" "$file2"
done < <(printf '%s\n' "${im2_dir[@]}")
exec 3<&-
You can make the behavior consistent so that the loop stops with only non-empty matched files no matter which list is longer by replacing the semicolon with a double ampersand like so:
while IFS=$'\n' read -r -u 3 file1 && read -r file2
This version uses a for loop instead of a while loop. This one stops when the shorter of the two lists run out.
#!/bin/bash
im1_dir=(~/prev1/*.png)
im2_dir=(~/prev3/*.png)
for ((i = 0; i < ${#im1_dir[@]} && i < ${#im2_dir[@]}; i++))
do
run_black "${im1_dir[i]}" "${im2_dir[i]}"
done
This version is similar to the one immediately above, but if one of the lists runs out it wraps around to reuse the items until the other one runs out. It's very ugly and you could do the same thing another way more simply.
#!/bin/bash
im1_dir=(~/prev1/*.png)
im2_dir=(~/prev3/*.png)
for ((i = 0, j = 0,
n1 = ${#im1_dir[@]},
n2 = ${#im2_dir[@]},
s = n1 >= n2 ? n1 : n2,
is = 0, js = 0;
is < s && js < s;
i++, is = i, i %= n1,
j++, js = j, j %= n2))
do
run_black "${im1_dir[i]}" "${im2_dir[i]}"
done
This version only uses an array for the inner loop (second directory). It will only execute as many times as there are files in the first directory.
#!/bin/bash
im1_dir=~/prev1/*.png
im2_dir=(~/prev3/*.png)
for file1 in $im1_dir
do
run_black "$file1" "${im2_dir[i++]}"
done
If you don't mind going off the beaten path (bash), the Tool Command Language (TCL) has such a loop construct:
#!/usr/bin/env tclsh
set list1 [glob dir1/*]
set list2 [glob dir2/*]
foreach item1 $list1 item2 $list2 {
exec command_name $item1 $item2
}
Basically, the loop reads: for each item1 taken from list1, and item2 taken from list2. You can then replace command_name with your own command.
This might be another way to use two variables in the same loop. But you need to know the total number of files (or, the number of times you want to run the loop) in the directory to use it as the value of iteration i.
Get the number of files in the directory:
ls /path/*.png | wc -l
Now run the loop:
im1_dir=(~/prev1/*.png)
im2_dir=(~/prev3/*.png)
for ((i = 0; i < 4; i++)); do run_black.sh ${im1_dir[i]} ${im2_dir[i]}; done
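To avoid hard-coding the 4, the count obtained above can be captured into a variable and used as the loop bound (a sketch; it assumes the file names contain no newlines):
n=$(ls ~/prev1/*.png | wc -l)
for ((i = 0; i < n; i++)); do run_black.sh "${im1_dir[i]}" "${im2_dir[i]}"; done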
For more help please see this discussion.
I had this problem in a similar situation where I wanted a top and a bottom range simultaneously. Here was my solution; it's not particularly efficient, but it's easy and clean and not at all complicated with icky BASH arrays and all that nonsense.
SEQBOT=$(seq 0 5 $((PEAKTIME-5)))
SEQTOP=$(seq 5 5 $((PEAKTIME-0)))
IDXBOT=0
IDXTOP=0
for bot in $SEQBOT; do
IDXTOP=0
for top in $SEQTOP; do
if [ "$IDXBOT" -eq "$IDXTOP" ]; then
echo $bot $top
fi
IDXTOP=$((IDXTOP + 1))
done
IDXBOT=$((IDXBOT + 1))
done
It is very simple: you can use two for loops for this problem.
#!/bin/bash
index=0
for i in ~/prev1/*.png
do
for j in ~/prev3/*.png
do
run_black.sh $i $j
done
done
The accepted answer can be further simplified using the ${!array[@]} syntax to iterate over the array's indexes:
a=(x y z); b=(q w e); for i in ${!a[@]}; do echo ${a[i]}-${b[i]}; done
Another solution. The two lists with filenames are pasted into one.
paste <(ls --quote-name ~/prev1/*.png) <(ls --quote-name ~/prev3/*.png) | \
while read args ; do
run_black $args
done

Finding and replacing many words

I frequently need to make many replacements within files. To solve this problem, I have created two files old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.
All of my files use UTF-8 and make use of various languages.
I have built this script, which I hoped could do the replacement. It reads old.text one line at a time and replaces the word from that line in input.txt with the corresponding word from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:
On line 6, the sed command does not know where the $number ends.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well; however, I use this script a lot and it takes many hours to finish. So I offer a bounty for a solution which can complete these replacements much more quickly. A solution in BASH, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using other software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
On line 6, the sed command does not know where the $number ends.
Try quoting the variable with double quotes
linefromnewwords=$(sed -n "$number"p newwords.txt)
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Do this instead:
number=`expr $number + 1`
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
awk won't take variables from outside its scope. User-defined variables in awk need to be either defined when they are used or predefined in awk's BEGIN statement. You can pass shell variables in by using the -v option.
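Applied to the awk line from the question, that would look roughly like this (a sketch reusing the question's variable names):
awk -v old="$linefromoldwords" -v new="$linefromnewwords" '{ gsub(old, new); print }' input.txt >> output.txt
gsub() then treats old as a dynamic regular expression, which is what the original /$linefromoldwords/ was trying to achieve.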
Here is a solution in bash that would do what you need.
Bash Solution:
#!/bin/bash
while read -r sub && read -r rep <&3; do
sed -i "s/ "$sub" / "$rep" /g" main.file
done <old.text 3<new.text
This solution reads one line at a time from substitution file and replacement file and performs in-line sed substitution.
Why not
paste -d/ oldwords.txt newwords.txt |\
sed -e 's#/# / #' -e 's#^#s/ #' -e 's#$# /g#' >/tmp/$$.sed
sed -f /tmp/$$.sed original >changed
rm /tmp/$$.sed
?
I love this kind of question, so here is my answer:
First, for the sake of simplicity, why not use a single file with the source and the translation? I mean (filename changeThis):
hello=Bye dudes
the morNing=next Afternoon
first=last
Then you can define a proper separator in the script. (file replaceWords.sh)
#!/bin/bash
SEP=${1}
REPLACE=${2}
FILE=${3}
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
sed -i "s/${origin}/${dest}/gI" $FILE
done < $REPLACE
Take this example (file changeMe)
Hello, this is me.
I will be there at first time in the morning
Call it with
$ bash replaceWords.sh = changeThis changeMe
And you will get
Bye dudes, this is me.
I will be there at last time in next Afternoon
Take note of the two uses of "i" with sed: "-i" means replace in the source file, and "I" in the s// command means ignore case (a GNU extension; check your sed implementation).
Of course, note that a bash while loop is horrendously slower than Python or a similar scripting language. Depending on your needs you can do a nested while: one over the source file and an inner one looping over the translations (changes), echoing everything to stdout for pipe flexibility.
#!/bin/bash
SEP=${1}
TRANSLATION=${2}
FILE=${3}
while read line
do
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
line=$(echo $line | sed "s/${origin}/${dest}/gI")
done < $TRANSLATION
echo $line
done < $FILE
This Python 2 script forms the old words into a single regular expression then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct. This distinctness is enforced by surrounding the word in r'\b' which is the regular expression word boundary.
Input is from the command line (there is a commented alternative I used for development in IDLE). Output is to stdout.
The main text is scanned only once in this solution. With the input from Jaypal's answer, the output is the same.
#!/usr/bin/env python
import sys, re
def replacer(match):
global new
return new[match.lastindex-1]
if __name__ == '__main__':
fname_old, fname_new, fname_txt = sys.argv[1:4]
#fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()
with file(fname_old) as f:
# Form regular expression that matches old words, grouped in order
old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
for word in f.read().strip().split()) + ')'
with file(fname_new) as f:
# Ordered list of replacement words
new = [word for word in f.read().strip().split()]
with file(fname_txt) as f:
# input text
txt = f.read()
# Output the new text
print( re.subn(old, replacer, txt)[0] )
I just did some stats on a ~100K byte text file:
Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
Top 10 distinct word occurrences in text: 2664 = 15.57%
The text was 250 paragraphs of lorem ipsum generated from here. I just took the ten most frequently occurring words and replaced them with the strings ONE to TEN in order.
The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.
The Python solution will replace words followed by a newline character or by punctuation, as well as by any whitespace (including tabs etc).
Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages, and the technique shown in the Python script could be replicated in Ruby or Perl.
Paddy.
A general perl solution that I have found to work well for replacing the keys in a map with their associated values is this:
my %map = (
19 => 'A',
20 => 'B',
);
my $key_regex = '(' . join('|', keys %map) . ')';
while (<>) {
s/$key_regex/$map{$1}/g;
print $_;
}
You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash-lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
At line 6, the sed command does not know where the $number ends.
linefromnewwords=$(sed -n "${number}p" newwords.txt)
The braces are optional, but the double quotes matter: inside single quotes $number would not be expanded.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval=evil).
number=$((number + 1))
In general, I would recommend using one file with
s/ ni3 / nǐ /g
s/ nei3 / neǐ /g
and so on, one sed command per line, which is IMHO easier to take care of - sort it alphabetically, and use it with:
sed -f translate.sed input > output
So you can always easily compare the mappings.
s/\bni3\b/nǐ/g
might be preferred over blanks as explicit delimiters, because \b (word boundary) also matches at the start/end of a line and at punctuation characters.
This should reduce the running time somewhat, as it avoids unnecessary loops.
Merge two input files:
Lets assume you have two input files, old.text containing all substitutions and new.text containing all replacements.
We will create a new text file which will act as a sed script to your main file using the following awk one-liner:
awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat old.text
19
20
[jaypal:~/Temp] cat new.text
A
B
[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat merge.text
s/ 19 / A /g
s/ 20 / B /g
Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.
Using merged file as sed script:
Once your merged file has been created, we will use -f option of sed utility.
sed -f merge.text input_file
[jaypal:~/Temp] cat input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
[jaypal:~/Temp] sed -f merge.text input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
You can redirect this into another file using the > operator.
This might work for you:
paste {old,new}words.txt |
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' |
sed -i -f - text.txt
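For instance, with the 19/A and 20/B word lists used in other answers on this page, the middle sed would emit a script along these lines, which the final sed -i -f - then applies to text.txt:
s!\<19\>!A!g
s!\<20\>!B!g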
Here is a Python 2 script that should be both space and time efficient:
import sys
import codecs
import re
sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
(line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))
regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))
for line in codecs.open("input.txt", "r", "utf-8"):
result = regexp.sub(lambda match:sub[match.group(0)], line)
sys.stdout.write(result.encode("utf-8"))
Here it is in action:
$ cat old.txt
19
20
$ cat new.txt
A
B
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$
EDIT: Hat tip to @Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combine your input word lists into one list, with each line containing the mapping from old word to new word.
#!/usr/bin/env perl
# usage:
# replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt
use strict;
use warnings;
sub read_words {
my $file = shift;
open my $fh, "<$file" or die "Error reading file: $file; $!\n";
my @words = <$fh>;
chomp @words;
close $fh;
return \@words;
}
sub word_map {
my ($old_words, $new_words) = @_;
if (scalar @$old_words != scalar @$new_words) {
warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
}
my $list_size = scalar @$old_words;
$list_size = scalar @$new_words if $list_size > scalar @$new_words;
my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;
return \%map;
}
sub build_regex {
my $words = shift;
my $pattern = join "|", sort { length $b <=> length $a } @$words;
return qr/$pattern/;
}
my $old_words = read_words(shift);
my $new_words = read_words(shift);
my $word_map = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);
my $input_file = shift;
open my $input, "<$input_file" or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
s/($old_pattern)/$word_map->{$&}/g;
print;
}
close $input;
__END__
Old words file:
$ cat old.txt
19
20
New words file:
$ cat new.txt
A
B
Input file:
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
Create output:
$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
I'm not sure why most of the previous posters insist on using regular expressions to solve this task; I think this will be faster than most (if not the fastest) methods.
use warnings;
use strict;
open (my $fh_o, '<', "old.txt");
open (my $fh_n, '<', "new.txt");
my @hay = <>;
my @old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my @new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;
my %r;
@r{@old} = @new;
print defined $r{$_} ? $r{$_} : $_ for split (
/(\s+)/, "@hay"
);
Use: perl script.pl /file/to/modify, result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)
I believe that this Perl script, although not using fancy sed or awk thingies, does the job fairly quickly...
I did take the liberty of using another format for old_word to new_word: the CSV format. If it is too complicated to produce, let me know and I'll add a script that takes your old.txt and new.txt and builds the CSV file.
Take it for a run and let me know!
By the way, if any of you Perl gurus can suggest a more Perlish way to do something I do here, I would love to read the comment:
#! /usr/bin/perl
# getting the user's input
if ($#ARGV == 1)
{
$LUT_file = shift;
$file = shift;
$outfile = $file . ".out.txt";
}
elsif ($#ARGV == 2)
{
$LUT_file = shift;
$file = shift;
$outfile = shift;
}
else { &usage; }
# opening the relevant files
open LUT, "<",$LUT_file or die "can't open $LUT_file for reading!\n : $!";
open FILE,"<",$file or die "can't open $file for reading!\n : $!";
open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";
# getting the lines from the text to be changed and changing them
%word_LUT = ();
WORD_EXT:while (<LUT>)
{
$_ =~ m/(\w+),(\w+)/;
$word_LUT{ $1 } = $2 ;
}
close LUT;
OUTER:while ($line = <FILE>)
{
@words = split(/\s+/,$line);
for( $i = 0; $i <= $#words; $i++)
{
if ( exists ($word_LUT { $words[$i] }) )
{
$words[$i] = $word_LUT { $words[$i] };
}
}
$newline = join(' ',@words);
print "old line - $line\nnewline - $newline\n\n";
print OUT $newline . "\n";
}
# now we have all the signals needed in the swav array, build the file.
close OUT;close FILE;
# Sub Routines
#
#
sub usage(){
print "\n\n\replacer.pl Usage:\n";
print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
print "<LUT file> - a LookUp Table of words, from the old word to the new one.
\t\t\twith the following csv format:
\t\t\told word,new word\n";
print "<Input file> - the input file\n";
print "<out file> - out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";
exit;
}
