Iterative and conditional deleting of lines in a file

Iterative and conditional deleting of lines in a file - macos

Intro
I have a file named data.dat with the following structure:
1: 67: 1 :s
1: 315: 1 :s
1: 648: 1 :ns
1: 799: 1 :s
1: 809: 1 :s
1: 997: 1 :ns
2: 32: 1 :s
Algorithm
The algorithm that I'm looking for is:
Generate a random number between 1 and number of lines in this file.
Delete that line if the fourth column is "s".
Otherwise generate another random number and repeat this until the number of lines reaches to a certain value.
Technical Concepts
Though technical concepts are irrelevant to this algorithm, but I try to explain the problem. The data shows connectivity table of a network. This algorithm allows us to run it over different initial conditions and study general properties of these networks. Especially, because of randomness property of deleting bonds, any common behavior among these networks can be interpreted as a fundamental law.
Update: Another good reason to produce a random number in each step is that after removing each line, it's possible that property of being s/ns of remaining lines can be changed.
Code
Here is the code I have until now:
#!/bin/bash
# bash in OSX
While ((#there is at least 1 s in the fourth column)); do
LEN=$(grep -c "." data.dat) # number of lines
RAND=$((RANDOM%${LEN}+1)) # generating random number
if [[awk -F, "NR==$RAND" 'data.dat' | cut -d ':' -f 4- == "s"]]; then
sed '$RANDd' data.txt
else
#go back and produce another random
done
exit
I try to find the fourth column with awk -F, "NR==$RAND" 'data.dat' | cut -d ':' -f 4- and deleting the line by sed '$RANDd' data.txt.
Questions
How should I check that there is s pairs in my file?
I am not sure if the condition in if is correct.
Also, I don't know how to force loop after else to go back to generate another random number.
Thank you,
I really appreciate your help.

Personally, I would recommend against doing this in bash unless you have absolutely no choice.
Here's another way you could do it in Perl (quite similar in functionality to Alex's answer but a bit simpler):
use strict;
use warnings;
my $filename = shift;
open my $fh, "<", $filename or die "could not open $filename: $!";
chomp (my #lines = <$fh>);
my $sample = 0;
my $max_samples = 10;
while ($sample++ < $max_samples) {
my $line_no = int rand #lines;
my $line = $lines[$line_no];
if ($line =~ /:s\s*$/) {
splice #lines, $line_no, 1;
}
}
print "$_\n" for #lines;
Usage: perl script.pl data.dat
Read the file into the array #lines. Pick a random line from the array and if it ends with :s (followed by any number of spaces), remove it. Print the remaining lines at the end.
This does what you want but I should warn you that relying on built-in random number generators in any language is not a good way to arrive at statistically significant conclusions. If you need high-quality random numbers, you should consider using a module such as Math::Random::MT::Perl to generate them, rather than the built-in rand.

#!/usr/bin/env perl
# usage: $ excise.pl < data.dat > smaller_data.dat
my $sampleLimit = 10; # sample up to ten lines before printing output
my $dataRef;
my $flagRef;
while (<>) {
chomp;
push (#{$dataRef}, $_);
push (#{$flagRef}, 1);
}
my $lineCount = scalar #elems;
my $sampleIndex = 0;
while ($sampleIndex < $sampleLimit) {
my $sampleLineIndex = int(rand($lineCount));
my #sampleElems = split("\t", $dataRef->[$sampleLineIndex];
if ($sampleElems[3] == "s") {
$flagRef->[$sampleLineIndex] = 0;
}
$sampleIndex++;
}
# print data.dat to standard output, minus any sampled lines that had an 's' in them
foreach my $lineIndex (0..(scalar #{$dataRef} - 1)) {
if ($flagRef->[$lineIndex] == 1) {
print STDOUT $dataRef->[$lineIndex]."\n";
}
}

NumLine=$( grep -c "" data.dat )
while [ ${NumLine} -gt ${TargetLine} ]
do
# echo "Line at start: ${NumLine}"
RndLine=$(( ( ${RANDOM} % ${NumLine} ) + 1 ))
RndValue="$( echo " ${RANDOM}" | sed 's/.*\(.\{6\}\)$/\1/' )"
sed "${RndLine} {
s/^\([^:]*:\)[^:]*\(:.*:ns$\)/\1${RndValue}\2/
t
d
}" data.dat > /tmp/data.dat
mv /tmp/data.dat data.dat
NumLine=$( grep -c "" data.dat )
#cat data.dat
#echo "- Next Iteration -------"
done
tested on AIX (so not a GNU sed). Under Linux, use --posix for sed option and you can use a -i in place of temporary file + redirection + move in this case
Dont't forget that RANDOM is NOT a real RANDOM so study on network behavior based on not random value could not reflect a reality bu a specific case

Related

BASH Script to find a string in a file by position, match, then modify that position and insert if it exists

I have several lines in a file (input.in) that may look like this (asterisks are not literal; added for emphasis):
200928,121546,00002,**0000004015K**,**0000000641}**,00102020
200928,121546,00002,**0000000227B**,**0000000970R**,84839923
200928,121546,00003,**0000001197A**,**0000000227B**,93877763
I need to be able to find the value of the last character in the forth and fifth element (or look at the position 31 and 43) to determine what the actual number should be and if it's positive or negative. The result should look like the following after modifications:
200928,121546,00002,-00000040152,-00000006410,00102020
200928,121546,00002,00000002272,-00000009709,84839923
200928,121546,00003,00000011971,00000002272,93877763
{ABCDEFGHI correspond to all positive field and subs are 0123456789
}JKLMNOPQR correspond to all negative field and subs are 0123456789
I'm able to get all the positive number conversions working correctly but I am having problems with the negative conversions.
My code looks sorta like this for getting the positive switches (This is a "packed field" conversion btw):
sed -i -E "s/^(.{$a})\{/\10/" input.in
This is for the { positive case where the sub will be 0.
Where $a is introduced by a for a in 30 42 do loop. I have no issues identifying and updating the last char for that string but I can't figure out how to only flip the negative values if the corresponding character is found. I was thinking something like looking at the entire group of 11 (4th and 5th element) and if the last char in that group is }JKLMNOPQR, insert - at the first position and replace }JKLMNOPQR with 0123456789. respectively. Stuck here though. Of course the objective is to update the file with the changes after subs have been completed.
Code sample:
input="input.in"
for a in 30 42
do
while IFS= read -r line
do
echo "${line:$a:1} found, converting"
edbvalue=${line:$a:1}
case $edbvalue in
{)
echo -n -e "{ being replaced with 0\n"
sed -i -E "s/^(.{$a})\{/\10/" input.in
;;
A)
echo -n -e "A being replaced with 1\n"
sed -i -E "s/^(.{$a})A/\11/" input.in
;;
.
.
.
R)
echo -n -e "R being replaced with 9\n"
sed -i -E "s/^(.{$a})R/\19/" input.in
;;
*)
echo -n -e "no conversion needed\n"
;;
esac
done < "$input"
done

Rewriting the input file repeatedly is horrendously inefficient. You want to perform all the replacements in one go.
sed is rather hard to read once you start doing nontrivial things, so I would recommend switching to Awk (or a proper modern scripting language like Python if you want to invest more into this).
awk -F , 'BEGIN { OFS=FS
pos = "{ABCDEFGHI"; neg = "}JKLMNOPQR";
for (i=0; i<10; ++i) { p[substr(pos, i+1, 1)] = i; n[substr(neg, i+1, 1)] = i }
}
{ for (i=4; i<=5; i++) {
where = length($i)
what = substr($i, where, 1)
if (what ~ "^[" pos "]$") sign = ""
else if (what ~ "^[" neg "]$") sign = "-"
else print "Error: field " i " " $i " malformed" >"/dev/stderr"
$i = sign substr($i, 1, where-1) (sign ? n[what] : p[what])
}
}1' input.in
Demo: https://ideone.com/z8wK0V
This isn't entirely obvious, but here's a quick breakdown.
In the BEGIN block, we create two associative arrays, such that
p["{"] = 0, n["}"] = 0
p["A"] = 1, n["J"] = 1
p["B"] = 2, n["K"] = 2
p["C"] = 3, n["L"] = 3
p["D"] = 4, n["M"] = 4
p["E"] = 5, n["N"] = 5
p["F"] = 6, n["O"] = 6
p["G"] = 7, n["P"] = 7
p["H"] = 8, n["Q"] = 8
p["I"] = 9, n["R"] = 9
(We also set OFS to FS so that Awk will print the output comma-separated, like it reads the input.)
Down in the main block, we loop over fields 4 and 5, extracting the last character and mapping it to the corresponding entry from the correct one of the two arrays, and add a sign if warranted.
This simply writes to standard output; save to a new file and move it back over the original input file, or if you have GNU Awk, explore its -i inplace option.
If you really wanted to do this in sed, it offers a rather convenient y/{ABCDEFGHI/0123456789/ but picking apart the fields and then reassembling the line when you are done is not going to be pleasant.

How to select a specific percentage of lines?

Goodmorning !
I have a file.csv with 140 lines and 26 columns. I need to sort the lines in according the values in column 23. This is an exemple :
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller2,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-1088,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller3,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-2159,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
To sort the lines according the values of the column 23, I do this :
awk -F "%*," '$23 > 4' myfikle.csv
The result :
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
In my example, I use the value of 4% in column 23, the goal being to retrieve all the rows with their value in % which increases significantly in column 23. The problem is that I can't base myself on the 4% value because it is only representative of the current table. So I have to find another way to retrieve the rows that have a high value in column 23.
I have to sort the Controllers in descending order according to the percentage in column 23, I prefer to process the first 10% of the sorted lines to make sure I have the controllers with a large percentage.
The goal is to be able to vary the percentage according to the number of lines in the table.
Do you have any tips for that ?
Thanks ! :)

I could have sworn that this question was a duplicate, but so far I couldn't find a similar question.
Whether your file is sorted or not does not really matter. From any file you can extract the NUMBER first lines with head -n NUMBER. There is no built-in way to specify the number percentually, but you can compute that PERCENT% of your file's lines are NUMBER lines.
percentualHead() {
percent="$1"
file="$2"
linesTotal="$(wc -l < "$file")"
(( lines = linesTotal * percent / 100 ))
head -n "$lines" "$file"
}
or shorter but less readable
percentualHead() {
head -n "$(( "$(wc -l < "$2")" * "$1" / 100 ))" "$2"
}
Calling percentualHead 10 yourFile will print the first 10% of lines from yourFile to stdout.
Note that percentualHead only works with files because the file has to be read twice. It does not work with FIFOs, <(), or pipes.

If you want to use standard tools, you'll need to read the file twice. But if you're content to use perl, you can simply do:
perl -e 'my #sorted = sort <>; print #sorted[0..$#sorted * .10]' input-file

Here is one for GNU awk to get the top p% from the file but they are outputed in the order of appearance:
$ awk -F, -v p=0.5 ' # 50 % of top $23 records
NR==FNR { # first run
a[NR]=$23 # hash precentages to a, NR as key
next
}
FNR==1 { # second run, at beginning
n=asorti(a,a,"#val_num_desc") # sort percentages to descending order
for(i=1;i<=n*p;i++) # get only the top p %
b[a[i]] # hash their NRs to b
}
(FNR in b) # top p % BUT not in order
' file file | cut -d, -f 23 # file processed twice, cut 23rd for demo
45.04%
19.18%
Commenting this in a bit.

Having SUM issues with a bash script

I'm trying to write a script to pull the integers out of 4 files that store temperature readings from 4 industrial freezers, this is a hobby script it generates the general readouts I wanted, however when I try to generate a SUM of the temperature readings I get the following printout into the file and my goal is to print the end SUM only not the individual numbers printed out in a vertical format
Any help would be greatly appreciated;here's my code
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM I don't need the already listed numbers, just the SUM total

Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
print sum"/"count"="sum/count # print stuff
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk file
70.01/4=17.5025
It might work in the winter too.

The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).

Finding and replacing many words

I frequently need to make many replacements within files. To solve this problem, I have created two files old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.
All of my files use UTF-8 and make use of various languages.
I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the words at that line in input.txt with the corresponding words from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:
On line 6, the sed command does not know where the $number ends.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well, however, I use this script a lot and it takes many hours to finish. So I offer a bounty for a solution which can complete these replacements much quicker. A solution in BASH, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using other software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.

One line 6, the sed command does not know where the $number ends.
Try quoting the variable with double quotes
linefromnewwords=$(sed -n "$number"p newwords.txt)
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Do this instead:
number=`expr $number + 1`
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
awk won't take variables outside its scope. User defined variables in awk needs to be either defined when they are used or predefined in the awk's BEGIN statement. You can include shell variables by using -v option.
Here is a solution in bash that would do what you need.
Bash Solution:
#!/bin/bash
while read -r sub && read -r rep <&3; do
sed -i "s/ "$sub" / "$rep" /g" main.file
done <old.text 3<new.text
This solution reads one line at a time from substitution file and replacement file and performs in-line sed substitution.

Why not to
paste -d/ oldwords.txt newwords.txt |\
sed -e 's#/# / #' -e 's#^#s/ #' -e 's#$# /g#' >/tmp/$$.sed
sed -f /tmp/$$.sed original >changed
rm /tmp/$$.sed
?

I love this kind of questions, so here is my answer:
First for the shake of simplicity, Why not use only a file with source and translation. I mean: (filename changeThis)
hello=Bye dudes
the morNing=next Afternoon
first=last
Then you can define a proper separator in the script. (file replaceWords.sh)
#!/bin/bash
SEP=${1}
REPLACE=${2}
FILE=${3}
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
sed -i "s/${origin}/${dest}/gI" $FILE
done < $REPLACE
Take this example (file changeMe)
Hello, this is me.
I will be there at first time in the morning
Call it with
$ bash replaceWords.sh = changeThis changeMe
And you will get
Bye dudes, this is me.
I will be there at last time in next Afternoon
Take note of the "i" amusement with sed. "-i" means replace in source file, and "I" in s// command means ignore case -a GNU extension, check your sed implementation-
Of course note that a bash while loop is horrendously slower than a python or similar scripting language. Depending on your needs you can do a nested while, one on the source file and one inside looping the translations (changes). Echoing all to stdout for pipe flexibility.
#!/bin/bash
SEP=${1}
TRANSLATION=${2}
FILE=${3}
while read line
do
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
line=$(echo $line | sed "s/${origin}/${dest}/gI")
done < $TRANSLATION
echo $line
done < $FILE

This Python 2 script forms the old words into a single regular expression then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct. This distinctness is enforced by surrounding the word in r'\b' which is the regular expression word boundary.
Input is from the commandline (their is a commented alternative I used for development in idle). Output is to stdout
The main text is scanned only once in this solution. With the input from Jaypals answer, the output is the same.
#!/bin/env python
import sys, re
def replacer(match):
global new
return new[match.lastindex-1]
if __name__ == '__main__':
fname_old, fname_new, fname_txt = sys.argv[1:4]
#fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()
with file(fname_old) as f:
# Form regular expression that matches old words, grouped in order
old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
for word in f.read().strip().split()) + ')'
with file(fname_new) as f:
# Ordered list of replacement words
new = [word for word in f.read().strip().split()]
with file(fname_txt) as f:
# input text
txt = f.read()
# Output the new text
print( re.subn(old, replacer, txt)[0] )
I just did some stats on a ~100K byte text file:
Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
Top 10 distinct word occurences in text: 2664 = 15.57%
The text was 250 paragraphs of lorum ipsum generated from here I just took the ten most frequently occuring words and replaced them with the strings ONE to TEN in order.
The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.
The Python selection will replace words followed by a newline character or by punctuation as well as by any whitespace (including tabs etc).
Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages and the technique shown in the Python coukld be replicated in Ruby or Perl.
Paddy.

A general perl solution that I have found to work well for replacing the keys in a map with their associated values is this:
my %map = (
19 => 'A',
20 => 'B',
);
my $key_regex = '(' . join('|', keys %map) . ')';
while (<>) {
s/$key_regex/$map{$1}/g;
print $_;
}
You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash-lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.

At line 6, the sed command does not know where the $number ends.
linefromnewwords=$(sed -n '${number}p' newwords.txt)
I'm not sure about the quoting, but ${number}p will work - maybe "${number}p"
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval=evil).
number=$((number + 1))
In general, I would recommend using one file with
s/ ni3 / nǐ /g
s/ nei3 / neǐ /g
and so on, one sed-command per line, which is imho better to take care about - sort it alphabetically, and use it with:
sed -f translate.sed input > output
So you can always easily compare the mappings.
s/\bni3\b/nǐ/g
might be prefered over blanks as explicit delimiters, because \b:=word boundary matches start/end of line and punctuation characters.

This should reduce the time by some means as this avoids unnecessary loops.
Merge two input files:
Lets assume you have two input files, old.text containing all substitutions and new.text containing all replacements.
We will create a new text file which will act as a sed script to your main file using the following awk one-liner:
awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat old.text
19
20
[jaypal:~/Temp] cat new.text
A
B
[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat merge.text
s/ 19 / A /g
s/ 20 / B /g
Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.
Using merged file as sed script:
Once your merged file has been created, we will use -f option of sed utility.
sed -f merge.text input_file
[jaypal:~/Temp] cat input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
[jaypal:~/Temp] sed -f merge.text input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
You can redirect this into another file using the > operator.

This might work for you:
paste {old,new}words.txt |
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' |
sed -i -f - text.txt

Here is a Python 2 script that should be both space and time efficient:
import sys
import codecs
import re
sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
(line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))
regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))
for line in codecs.open("input.txt", "r", "utf-8"):
result = regexp.sub(lambda match:sub[match.group(0)], line)
sys.stdout.write(result.encode("utf-8"))
Here it is in action:
$ cat old.txt
19
20
$ cat new.txt
A
B
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$
EDIT: Hat tip to #Paddy3118 for whitespace handling.

Here's a solution in Perl. It can be simplified if you combined your input word lists into one list: each line containing the map of old and new words.
#!/usr/bin/env perl
# usage:
# replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt
use strict;
use warnings;
sub read_words {
my $file = shift;
open my $fh, "<$file" or die "Error reading file: $file; $!\n";
my #words = <$fh>;
chomp #words;
close $fh;
return \#words;
}
sub word_map {
my ($old_words, $new_words) = #_;
if (scalar #$old_words != scalar #$new_words) {
warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
}
my $list_size = scalar #$old_words;
$list_size = scalar #$new_words if $list_size > scalar #$new_words;
my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;
return \%map;
}
sub build_regex {
my $words = shift;
my $pattern = join "|", sort { length $b <=> length $a } #$words;
return qr/$pattern/;
}
my $old_words = read_words(shift);
my $new_words = read_words(shift);
my $word_map = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);
my $input_file = shift;
open my $input, "<$input_file" or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
s/($old_pattern)/$word_map->{$&}/g;
print;
}
close $input;
__END__
Old words file:
$ cat old.txt
19
20
New words file:
$ cat new.txt
A
B
Input file:
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
Create output:
$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf

I'm not sure why most of the previous posters insist on using regular-expressions to solve this task, I think this will be faster than most (if not the fastest method).
use warnings;
use strict;
open (my $fh_o, '<', "old.txt");
open (my $fh_n, '<', "new.txt");
my #hay = <>;
my #old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my #new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;
my %r;
; #r{#old} = #new;
print defined $r{$_} ? $r{$_} : $_ for split (
/(\s+)/, "#hay"
);
Use: perl script.pl /file/to/modify, result is printed to stdout.

EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)
I believe that this perl script, although not using fancy sed or awk thingies, does the job fairly quick...
I did take the liberty to use another format of old_word to new_word:
the csv format. if it is too complicated to do it let me know and I'll add a script that takes your old.txt,new.txt and builds the csv file.
take it on a run and let me know!
by the way - if any of you perl gurus here can suggest a more perlish way to do something I do here I will love to read the comment:
#! /usr/bin/perl
# getting the user's input
if ($#ARGV == 1)
{
$LUT_file = shift;
$file = shift;
$outfile = $file . ".out.txt";
}
elsif ($#ARGV == 2)
{
$LUT_file = shift;
$file = shift;
$outfile = shift;
}
else { &usage; }
# opening the relevant files
open LUT, "<",$LUT_file or die "can't open $signal_LUT_file for reading!\n : $!";
open FILE,"<",$file or die "can't open $file for reading!\n : $!";
open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";
# getting the lines from the text to be changed and changing them
%word_LUT = ();
WORD_EXT:while (<LUT>)
{
$_ =~ m/(\w+),(\w+)/;
$word_LUT{ $1 } = $2 ;
}
close LUT;
OUTER:while ($line = <FILE>)
{
#words = split(/\s+/,$line);
for( $i = 0; $i <= $#words; $i++)
{
if ( exists ($word_LUT { $words[$i] }) )
{
$words[$i] = $word_LUT { $words[$i] };
}
}
$newline = join(' ',#words);
print "old line - $line\nnewline - $newline\n\n";
print OUT $newline . "\n";
}
# now we have all the signals needed in the swav array, build the file.
close OUT;close FILE;
# Sub Routines
#
#
sub usage(){
print "\n\n\replacer.pl Usage:\n";
print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
print "<LUT file> - a LookUp Table of words, from the old word to the new one.
\t\t\twith the following csv format:
\t\t\told word,new word\n";
print "<Input file> - the input file\n";
print "<out file> - out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";
exit;
}

What's an easy way to read random line from a file?

What's an easy way to read random line from a file in a shell script?

You can use shuf:
shuf -n 1 $FILE
There is also a utility called rl. In Debian it's in the randomize-lines package that does exactly what you want, though not available in all distros. On its home page it actually recommends the use of shuf instead (which didn't exist when it was created, I believe). shuf is part of the GNU coreutils, rl is not.
rl -c 1 $FILE

Another alternative:
head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1

sort --random-sort $FILE | head -n 1
(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)

This is simple.
cat file.txt | shuf -n 1
Granted this is just a tad slower than the "shuf -n 1 file.txt" on its own.

perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:
perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

using a bash script:
#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc - l < ${FILE})
# generate random number in range 0-NUM
let X=${RANDOM} % ${NUM} + 1
# extract X-th line
sed -n ${X}p ${FILE}

Single bash line:
sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt
Slight problem: duplicate filename.

Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])
Usage:
python randline.py file_to_get_random_line_from

Another way using 'awk'
awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name

A solution that also works on MacOSX, and should also works on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:
N is the number of random lines you want
NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2
--> save line numbers written in file1 and then print corresponding line in file2
jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in range (1, number_of_line_in_file) with jot. The process substitution <() will make it look like a file for the interpreter, so file1 in previous example.

#!/bin/bash
IFS=$'\n' wordsArray=($(<$1))
numWords=${#wordsArray[#]}
sizeOfNumWords=${#numWords}
while [ True ]
do
for ((i=0; i<$sizeOfNumWords; i++))
do
let ranNumArray[$i]=$(( ( $RANDOM % 10 ) + 1 ))-1
ranNumStr="$ranNumStr${ranNumArray[$i]}"
done
if [ $ranNumStr -le $numWords ]
then
break
fi
ranNumStr=""
done
noLeadZeroStr=$((10#$ranNumStr))
echo ${wordsArray[$noLeadZeroStr]}

Here is what I discovery since my Mac OS doesn't use all the easy answers. I used the jot command to generate a number since the $RANDOM variable solutions seems not to be very random in my test. When testing my solution I had a wide variance in the solutions provided in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.

Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by #AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
Caveat about rand()
"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."
-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio