Expand Grouped Data using shell or perl - performance

I have a log file that is grouping http requests in 5 minute increments based on a unique set of characteristics. Format is as follows:
beginTime endTime platform hostname osVersion os requestType httpStatus nbInstances
So a sample log line could be:
1423983600 1423983900 platform1 test01 8.1 win createAcct 200 15
This indicates that in that 5-minute timeframe there were 15 requests with this unique attribute set. What I would like to do is take this and generate 15 identical lines in an output file.
Right now I have a very simple script that gets the job done but is probably not very efficient:
#!/bin/bash
file=$1
count=0
cat $file | while read line
do
string=`echo $line | awk '{print $1,$2,$3,$4,$5,$6,$7,$8}'`
nbInst=`echo $line | awk '{print $9}'`
while [[ $count -lt $nbInst ]]
do
echo "$string" >> test_data.log
count=`expr $count + 1`
done
count=0
done
Any ideas on a faster solution in bash or perl? Thanks.

As mentioned in the comments - it seems unusual that you need to de-coalesce your events to process and index.
However this should do what you're asking:
#!/usr/bin/perl
use strict;
use warnings;
#uses DATA segment from below as file. You'll probably want either STDIN
#or open a file handle.
while (<DATA>) {
#separate line on whitespace
my @line = split;
#grab the last element of the line (pop returns the value, and removes
#from the list)
for ( 1 .. pop(@line) ) {
print join( " ", @line ), "\n";
}
}
__DATA__
1423983600 1423983900 platform1 test01 8.1 win createAcct 200 15
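For comparison, here is a minimal Python sketch of the same idea: split each line, treat the last field as the repeat count, and print the remaining fields that many times. The function name expand_grouped is made up for illustration, and it assumes the count is always the last whitespace-separated field.

```python
def expand_grouped(lines):
    """Yield each record repeated nbInstances times, without the count field."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *attrs, count = fields  # the last field is assumed to be the count
        record = " ".join(attrs)
        for _ in range(int(count)):
            yield record


if __name__ == "__main__":
    sample = ["1423983600 1423983900 platform1 test01 8.1 win createAcct 200 15"]
    for out in expand_grouped(sample):
        print(out)
```

To process a real log, pass an open file handle (or sys.stdin) instead of the sample list.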

Related

Is there a command for substituting a set of characters by a set of strings?

I would like to substitute a set of single-byte characters with a set of literal strings in a stream, without any constraint on the line size.
#!/bin/bash
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
chars_to_strings $'\a\b\t\v' '<bell>' '<backspace>' '<horizontal-tab>' '<vertical-tab>'
The expected output would be:
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>...
I can think of a bash function that would do that, something like:
chars_to_strings() {
local delim buffer
while true
do
delim=''
IFS='' read -r -d '.' -n 4096 buffer && (( ${#buffer} != 4096 )) && delim='.'
if [[ -n "${delim:+_}" ]] || [[ -n "${buffer:+_}" ]]
then
# Do the replacements in "$buffer"
# ...
printf "%s%s" "$buffer" "$delim"
else
break
fi
done
}
But I'm looking for a more efficient way, any thoughts?
Since you seem to be okay with using ANSI C quoting via $'...' strings, then maybe use sed?
sed $'s/\a/<bell>/g; s/\b/<backspace>/g; s/\t/<horizontal-tab>/g; s/\v/<vertical-tab>/g'
Or, via separate commands:
sed -e $'s/\a/<bell>/g' \
-e $'s/\b/<backspace>/g' \
-e $'s/\t/<horizontal-tab>/g' \
-e $'s/\v/<vertical-tab>/g'
Or, using awk, which replaces newline characters too (by customizing the Output Record Separator, i.e., the ORS variable):
$ printf '\a,\b,\t,\v\n' | awk -vORS='<newline>' '
{
gsub(/\a/, "<bell>")
gsub(/\b/, "<backspace>")
gsub(/\t/, "<horizontal-tab>")
gsub(/\v/, "<vertical-tab>")
print $0
}
'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><newline>
For a simple one-liner with reasonable portability, try Perl.
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
perl -pe 's/\a/<bell>/g;
s/\b/<backspace>/g;s/\t/<horizontal-tab>/g;s/\v/<vertical-tab>/g'
Perl internally does some intelligent optimizations so it's not encumbered by lines which are longer than its input buffer or whatever.
Perl by itself is not POSIX, of course; but it can be expected to be installed on any even remotely modern platform (short of perhaps embedded systems etc).
Assuming the overall objective is to provide the ability to process a stream of data in real time without having to wait for an EOL/End-of-buffer occurrence to trigger processing ...
A few items:
continue to use the while/read -n loop to read a chunk of data from the incoming stream and store in buffer variable
push the conversion code into something that's better suited to string manipulation (ie, something other than bash); for sake of discussion we'll choose awk
within the while/read -n loop printf "%s\n" "${buffer}" and pipe the output from the while loop into awk; NOTE: the key item is to introduce an explicit \n into the stream so as to trigger awk processing for each new 'line' of input; OP can decide if this additional \n must be distinguished from a \n occurring in the original stream of data
awk then parses each line of input as per the replacement logic, making sure to append anything leftover to the front of the next line of input (ie, for when the while/read -n breaks an item in the 'middle')
General idea:
chars_to_strings() {
while read -r -n 15 buffer # using '15' for demo purposes otherwise replace with '4096' or whatever OP wants
do
printf "%s\n" "${buffer}"
done | awk '{print NR,FNR,length($0)}' # replace 'print ...' with OP's replacement logic
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1 # add some delay to data being streamed to chars_to_strings()
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
A variation on this idea using a named pipe:
mkfifo /tmp/pipeX
sleep infinity > /tmp/pipeX & # keep pipe open so awk does not exit
awk '{print NR,FNR,length($0)}' < /tmp/pipeX &
chars_to_strings() {
while read -r -n 15 buffer
do
printf "%s\n" "${buffer}"
done > /tmp/pipeX
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
# kill background 'awk' and/or 'sleep infinity' when no longer needed
don't waste FS/OFS - use the built-in variables to take 2 out of the 5 needed :
echo $' \t abc xyz \t \a \n\n ' |
mawk 'gsub(/\7/, "<bell>", $!(NF = NF)) + gsub(/\10/,"<bs>") +\
gsub(/\11/,"<h-tab>")^_' OFS='<v-tab>' FS='\13' ORS='<newline>'
<h-tab> abc xyz <h-tab> <bell> <newline><newline> <newline>
To have NO constraint on the line length you could do something like this with GNU awk:
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(foo,bar)
print
}'
That will read and process the input 100 chars at a time no matter which chars are present, whether it has newlines or not, and even if the input was one multi-terabyte line.
Replace gsub(foo,bar) with whatever substitution(s) you have in mind, e.g.:
$ printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(/\a/,"<bell>")
gsub(/\b/,"<backspace>")
gsub(/\t/,"<horizontal-tab>")
gsub(/\v/,"<vertical-tab>")
print
}'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab>
and of course it'd be trivial to pass a list of old and new strings to awk rather than hardcoding them, you'd just have to sanitize any regexp or backreference metachars before calling gsub().
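If a scripting language other than awk is acceptable, the same chunk-at-a-time approach with the old/new pairs passed in as data could look roughly like the Python sketch below. Because each search item is a single character, a chunk boundary can never split a match, and str.translate does no regexp processing, so nothing needs sanitizing. The names REPLACEMENTS and chars_to_strings and the default chunk size are assumptions for illustration only.

```python
import io

# Hypothetical mapping; extend it with whatever characters are needed.
REPLACEMENTS = {
    "\a": "<bell>",
    "\b": "<backspace>",
    "\t": "<horizontal-tab>",
    "\v": "<vertical-tab>",
}


def chars_to_strings(src, dst, replacements=REPLACEMENTS, chunk_size=4096):
    """Copy src to dst, replacing single characters by literal strings."""
    table = str.maketrans(replacements)
    while True:
        chunk = src.read(chunk_size)  # reads at most chunk_size characters
        if not chunk:
            break  # end of stream
        dst.write(chunk.translate(table))


if __name__ == "__main__":
    out = io.StringIO()
    chars_to_strings(io.StringIO("\a,\b,\t,\v" * 2), out, chunk_size=3)
    print(out.getvalue())
```

For a real stream, pass sys.stdin and sys.stdout as src and dst; newlines are not special here, so line length does not matter.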

Awk replacement for a slow while loop for processing a large data set

I have a shell script which reads a mobile number from a file and compares with another file to determine which region the number belongs to. I have to check if the series is the first 4 digits or the first 5 digits of the number which gives me the operator and the region which the number belongs to.
If I am performing this task for about 500k numbers then the execution time for this script is about 2 hours on a machine with 24 cores and 64 GB of RAM which is a shame, hence I am here to find a better method to perform the same task more efficiently.
My code is -
while read num
do
num4=$(echo $num | awk '{print substr($0,0,4)}'); #First 4 digits
num5=$(echo $num | awk '{print substr($0,0,5)}'); #First 5 digits
taco=$(grep -w $num5 db.txt | wc -l); #Count if found in file db.txt
taca=$(grep -w $num4 db.txt | wc -l); #Count if found in file db.txt
if [ "$taco" -eq 0 ]; then
if [ "$taca" -eq 0 ]; then
grep "$num4" mobile_db.txt >> operator_temp.txt #If not found in db.txt file then check in different file
else
grep -w $num4 db.txt >> operator_temp.txt #Copy the content found to a file
fi
else
grep -w $num5 db.txt >> operator_temp.txt #Copy the content found to a file
fi
done < output_num_temp.txt
Sample of db.txt file is-
7101 Idea UttarPradesh(West)
7107 RelianceMobile MadhyaPradesh
7108 RelianceMobile Gujarat
7110 RelianceMobile Rajasthan
73027 Airtel UttarPradesh(West)
73028 Airtel UttarPradesh(West)
73029 Airtel UttarPradesh(West)
91210 Airtel AndhraPradesh
91211 Airtel AndhraPradesh
91212 Airtel AndhraPradesh
This is also the sample for mobile_db.txt and this is the expected output that will be stored in the file operator_temp.txt
It SOUNDS like what you need is:
awk '
FILENAME == ARGV[1] { db[$1] = $0; next }
FILENAME == ARGV[2] { mo[$1] = $0; next }
{ num5 = substr($1,1,5) }
num5 in db { print db[num5]; next }
{ num4 = substr($1,1,4) }
num4 in mo { print mo[num4]; next }
num4 in db { print db[num4] }
' db.txt mobile_db.txt output_num_temp.txt > operator_temp.txt
but without clearer requirements and more info on your input files and concise, testable sample input and expected output in your question it's just a guess.
As long as your 2 db files aren't absolutely massive the above should run in a few seconds instead of hours. See https://unix.stackexchange.com/q/169716/133219 for a discussion of some of the many reasons why you should not use a shell loop to manipulate text.
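The underlying idea (load each lookup file into a hash table once, then do constant-time prefix lookups per number) can also be sketched in Python. This is only an illustration under assumptions: it follows the order of the shell loop in the question (5-digit series in db.txt, then 4-digit series in db.txt, then mobile_db.txt), it matches the first field exactly rather than grepping anywhere in the line, and the function names are invented.

```python
def load_table(lines):
    """Map the first field (the number series) to the whole line."""
    table = {}
    for line in lines:
        fields = line.split()
        if fields:
            table[fields[0]] = line.rstrip("\n")
    return table


def lookup(number, db, mobile_db):
    """Return the matching line for a number, or None if nothing matches."""
    for width in (5, 4):  # try the 5-digit series first, then the 4-digit one
        hit = db.get(number[:width])
        if hit is not None:
            return hit
    return mobile_db.get(number[:4])  # fall back to the second file
```

Typical use would be to call load_table on open("db.txt") and open("mobile_db.txt") once, then call lookup for each line of output_num_temp.txt.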

unix find the difference from a file row wise

I have some data like
[09359]0000.365604| =>SttSasph_Hmbm_bSPO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365687| =>Hmbm_bSPO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365879| =>SttSasph_Hmbm_quOuO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365890| =>Hmbm_quOuO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
[09359]0001.625300| db_HOt_POPPWon_Wd: aspuQQOnt POPPWon WD WP 1016,59
[09359]0002.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
Every Line starts with a process number (Which can change) in square brackets
Then Seconds after the module (0001) in this case
Then MicroSeconds after the fullstop.
Then a Pipe to terminate.
Rest part can be ignored
What I need to calculate is:
Convert the seconds into microseconds.
Add the microseconds to the converted microseconds (from step 1).
Find the difference in microseconds between consecutive lines, e.g. line2-line1, line3-line2, line4-line3 and so on.
Print the result to a separate file.
I tried to use this logic, but it didn't work.
May I get suggestions for an optimised way to do it, or
improvements to my existing logic?
sec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f1)
msec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f2)
$f_msec=$((sec * 1000000 + msec)) > final_difference_file
If you are comfortable with awk, then you can use this script:
script.awk
BEGIN{ FS="[\\[\\]\\|]+" }
{ printf("[%s]%011.6f|%s\n", $2,$3-prev,$4)
prev = $3 }
Use it like this: awk -f script.awk yourfile
The first line sets up field splitting on the brackets and the pipe (the backslashes are needed to escape symbols that are regexp metacharacters). The second line prints the fields and calculates the time difference. The last line stores the current time for the calculation on the next line.
This can also be done with a bash script. Since bash lacks floating point arithmetic, we have to gather seconds and microseconds separately (or call an external tool like bc for each line):
script.sh
IFS='|[].'
factor=1000000
prev=0
while read -r dummy pid secs msecs text;
do
# 10# forces base 10 so that zero-padded values such as 0008 are not read as octal
msecs=$(( 10#$secs * factor + 10#$msecs ))
timediff=$(( $msecs - $prev ))
prev=$msecs
secs=$(( $timediff / $factor ))
msecs=$(( $timediff - $secs * $factor ))
printf "[%s]%04d.%06d|%s\n" "$pid" "$secs" "$msecs" "$text"
done
Use it like this (the script reads standard input): bash script.sh < yourfile
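The arithmetic itself (seconds times 1,000,000 plus microseconds, then the difference to the previous line) can be checked with a small Python sketch. It assumes the microsecond part always has six digits, as in the sample, and it reports 0 for the first line instead of the first timestamp itself, which is a choice made here and not something the question specifies. LINE_RE and microsecond_deltas are illustrative names.

```python
import re

# Matches "[pid]seconds.microseconds|rest" as in the sample log lines.
LINE_RE = re.compile(r"^\[(\d+)\](\d+)\.(\d{6})\|(.*)$")


def microsecond_deltas(lines):
    """Yield (pid, delta_in_microseconds, rest) for each parseable line."""
    prev = None
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # ignore lines that do not follow the format
        pid, secs, usecs, rest = m.groups()
        # int() reads zero-padded strings such as "0008" as decimal
        total = int(secs) * 1_000_000 + int(usecs)
        delta = 0 if prev is None else total - prev
        prev = total
        yield pid, delta, rest
```

For the first two sample lines this gives a delta of 365687 - 365604 = 83 microseconds.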

Sorting and printing a file in bash UNIX

I have a file with a bunch of paths that look like so:
7 /usr/file1564
7 /usr/file2212
6 /usr/file3542
I am trying to use sort to pull out and print the path(s) with the most occurrences. Here is what I have so far:
cat temp| sort | uniq -c | sort -rk1 > temp
I am unsure how to only print the highest occurrences. I also want my output to be printed like this:
7 1564
7 2212
7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (with the highest number, since you're doing a reverse numeric sort immediately prior), pipe through head -n1.
To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.
To filter for only the values with the highest number, allowing there to be more than one:
{
read -r firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
while read -r num name; do
[[ $num = $firstnum ]] || break
printf '%s\t%s\n' "$num" "$name"
done
} < temp
If you want to avoid sort and you are allowed to use awk, then you can do this:
awk '{
if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else
if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
temp
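The same single-pass "keep the current maximum and everything tied with it" logic, written as a Python sketch. Unlike the awk version it does not rely on fixed character positions; it takes the trailing digits of the path instead, which is an assumption about the file naming. top_paths is a made-up name.

```python
import re


def top_paths(lines):
    """Return [(count, file_number), ...] for the paths with the highest count."""
    best = None
    winners = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        count = int(parts[0])
        digits = re.search(r"(\d+)\s*$", parts[1])  # trailing number of the path
        number = digits.group(1) if digits else parts[1].strip()
        if best is None or count > best:
            best, winners = count, [(count, number)]  # new maximum, start over
        elif count == best:
            winners.append((count, number))  # tie with the current maximum
    return winners
```

The input does not need to be sorted, since every line is compared against the running maximum.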

Finding and replacing many words

I frequently need to make many replacements within files. To solve this problem, I have created two files old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.
All of my files use UTF-8 and make use of various languages.
I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the words at that line in input.txt with the corresponding words from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:
On line 6, the sed command does not know where the $number ends.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well, however, I use this script a lot and it takes many hours to finish. So I offer a bounty for a solution which can complete these replacements much quicker. A solution in BASH, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using other software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
On line 6, the sed command does not know where the $number ends.
Try quoting the variable with double quotes
linefromnewwords=$(sed -n "$number"p newwords.txt)
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Do this instead:
number=`expr $number + 1`
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
awk cannot see shell variables. Variables used inside an awk program must either be assigned within the program (for example in a BEGIN block) or passed in from the shell with the -v option.
Here is a solution in bash that would do what you need.
Bash Solution:
#!/bin/bash
while read -r sub && read -r rep <&3; do
sed -i "s/ $sub / $rep /g" main.file
done <old.text 3<new.text
This solution reads one line at a time from the substitution file and the replacement file and performs an in-place sed substitution.
Why not to
paste -d/ oldwords.txt newwords.txt |\
sed -e 's#/# / #' -e 's#^#s/ #' -e 's#$# /g#' >/tmp/$$.sed
sed -f /tmp/$$.sed original >changed
rm /tmp/$$.sed
?
I love this kind of question, so here is my answer:
First, for the sake of simplicity, why not use a single file holding both source and translation? For example (file name changeThis):
hello=Bye dudes
the morNing=next Afternoon
first=last
Then you can define a proper separator in the script. (file replaceWords.sh)
#!/bin/bash
SEP=${1}
REPLACE=${2}
FILE=${3}
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
sed -i "s/${origin}/${dest}/gI" $FILE
done < $REPLACE
Take this example (file changeMe)
Hello, this is me.
I will be there at first time in the morning
Call it with
$ bash replaceWords.sh = changeThis changeMe
And you will get
Bye dudes, this is me.
I will be there at last time in next Afternoon
Take note of the two "i" flags used with sed: "-i" means edit the source file in place, and "I" in the s// command means ignore case (a GNU extension; check your sed implementation).
Of course, note that a bash while loop is far slower than Python or a similar scripting language. Depending on your needs you can use a nested while loop, the outer one over the source file and the inner one over the translations (changes), echoing everything to stdout for pipe flexibility.
#!/bin/bash
SEP=${1}
TRANSLATION=${2}
FILE=${3}
while read line
do
while read transline
do
origin=${transline%%${SEP}*}
dest=${transline##*${SEP}}
line=$(echo $line | sed "s/${origin}/${dest}/gI")
done < $TRANSLATION
echo $line
done < $FILE
This Python 2 script forms the old words into a single regular expression then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct. This distinctness is enforced by surrounding the word in r'\b' which is the regular expression word boundary.
Input is from the command line (there is a commented-out alternative I used for development in IDLE). Output is to stdout.
The main text is scanned only once in this solution. With the input from Jaypal's answer, the output is the same.
#!/bin/env python
import sys, re
def replacer(match):
global new
return new[match.lastindex-1]
if __name__ == '__main__':
fname_old, fname_new, fname_txt = sys.argv[1:4]
#fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()
with file(fname_old) as f:
# Form regular expression that matches old words, grouped in order
old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
for word in f.read().strip().split()) + ')'
with file(fname_new) as f:
# Ordered list of replacement words
new = [word for word in f.read().strip().split()]
with file(fname_txt) as f:
# input text
txt = f.read()
# Output the new text
print( re.subn(old, replacer, txt)[0] )
I just did some stats on a ~100K byte text file:
Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
Top 10 distinct word occurrences in text: 2664 = 15.57%
The text was 250 paragraphs of generated lorem ipsum. I just took the ten most frequently occurring words and replaced them with the strings ONE to TEN in order.
The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.
The Python solution will replace words followed by a newline character or by punctuation as well as by any whitespace (including tabs etc).
Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages and the technique shown in the Python script could be replicated in Ruby or Perl.
Paddy.
A general perl solution that I have found to work well for replacing the keys in a map with their associated values is this:
my %map = (
19 => 'A',
20 => 'B',
);
my $key_regex = '(' . join('|', keys %map) . ')';
while (<>) {
s/$key_regex/$map{$1}/g;
print $_;
}
You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash-lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
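The step of reading the two files into the map, followed by the single pass with one hash lookup per match, might look like this in Python 3. It is a sketch under assumptions: one word per line in each file, keys escaped so they are matched literally, and (as in the Perl above) no word-boundary check, so "19" would also match inside "119". The names read_words and build_replacer are invented for illustration.

```python
import re


def read_words(path):
    """Read one word per line from a UTF-8 file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def build_replacer(old_words, new_words):
    """Return a function that replaces every old word in one pass over the text."""
    mapping = {old: new for old, new in zip(old_words, new_words) if old}
    if not mapping:
        return lambda text: text  # nothing to replace
    # Longest alternatives first so that "abc" wins over "ab" at the same position.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in alternatives))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)
```

A typical call would be replace = build_replacer(read_words("old.text"), read_words("new.text")), then replace(line) for each input line.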
At line 6, the sed command does not know where the $number ends.
linefromnewwords=$(sed -n "${number}p" newwords.txt)
The braces in ${number}p mark where the variable name ends. The expression must be in double quotes, because single quotes prevent the shell from expanding it.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval=evil).
number=$((number + 1))
In general, I would recommend using one file with
s/ ni3 / nǐ /g
s/ nei3 / neǐ /g
and so on, one sed-command per line, which is imho better to take care about - sort it alphabetically, and use it with:
sed -f translate.sed input > output
So you can always easily compare the mappings.
s/\bni3\b/nǐ/g
might be preferred over blanks as explicit delimiters, because \b (word boundary) also matches at the start/end of a line and next to punctuation characters.
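The difference between the two delimiter styles is easy to see in a tiny experiment. The sketch below uses Python's re module purely as a stand-in to illustrate the matching behaviour; sed's \b is a GNU extension and may treat non-ASCII text differently. The sample text is made up.

```python
import re

text = "ni3 hao3, ni3!"

# Blanks as delimiters: nothing matches, because no occurrence
# has a space on both sides.
with_blanks = re.sub(r" ni3 ", " nǐ ", text)

# Word boundaries: both occurrences match, at the start of the
# line and directly before the punctuation.
with_boundaries = re.sub(r"\bni3\b", "nǐ", text)

print(with_blanks)
print(with_boundaries)
```

The first print shows the text unchanged, while the second shows both occurrences of ni3 replaced.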
This should reduce the time by some means as this avoids unnecessary loops.
Merge two input files:
Let's assume you have two input files, old.text containing all substitutions and new.text containing all replacements.
We will create a new text file which will act as a sed script to your main file using the following awk one-liner:
awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat old.text
19
20
[jaypal:~/Temp] cat new.text
A
B
[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat merge.text
s/ 19 / A /g
s/ 20 / B /g
Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.
Using merged file as sed script:
Once your merged file has been created, we will use -f option of sed utility.
sed -f merge.text input_file
[jaypal:~/Temp] cat input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
[jaypal:~/Temp] sed -f merge.text input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
You can redirect this into another file using the > operator.
This might work for you:
paste {old,new}words.txt |
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' |
sed -i -f - text.txt
Here is a Python 2 script that should be both space and time efficient:
import sys
import codecs
import re
sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
(line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))
regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))
for line in codecs.open("input.txt", "r", "utf-8"):
result = regexp.sub(lambda match:sub[match.group(0)], line)
sys.stdout.write(result.encode("utf-8"))
Here it is in action:
$ cat old.txt
19
20
$ cat new.txt
A
B
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$
EDIT: Hat tip to @Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combined your input word lists into one list: each line containing the map of old and new words.
#!/usr/bin/env perl
# usage:
# replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt
use strict;
use warnings;
sub read_words {
my $file = shift;
open my $fh, "<$file" or die "Error reading file: $file; $!\n";
my @words = <$fh>;
chomp @words;
close $fh;
return \@words;
}
sub word_map {
my ($old_words, $new_words) = @_;
if (scalar @$old_words != scalar @$new_words) {
warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
}
my $list_size = scalar @$old_words;
$list_size = scalar @$new_words if $list_size > scalar @$new_words;
my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;
return \%map;
}
sub build_regex {
my $words = shift;
my $pattern = join "|", sort { length $b <=> length $a } @$words;
return qr/$pattern/;
}
my $old_words = read_words(shift);
my $new_words = read_words(shift);
my $word_map = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);
my $input_file = shift;
open my $input, "<$input_file" or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
s/($old_pattern)/$word_map->{$&}/g;
print;
}
close $input;
__END__
Old words file:
$ cat old.txt
19
20
New words file:
$ cat new.txt
A
B
Input file:
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
Create output:
$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
I'm not sure why most of the previous posters insist on using regular-expressions to solve this task, I think this will be faster than most (if not the fastest method).
use warnings;
use strict;
open (my $fh_o, '<', "old.txt");
open (my $fh_n, '<', "new.txt");
my @hay = <>;
my @old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my @new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;
my %r;
@r{@old} = @new;
print defined $r{$_} ? $r{$_} : $_ for split (
/(\s+)/, "@hay"
);
Use: perl script.pl /file/to/modify, result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)
I believe that this perl script, although not using fancy sed or awk thingies, does the job fairly quick...
I did take the liberty to use another format of old_word to new_word:
the csv format. if it is too complicated to do it let me know and I'll add a script that takes your old.txt,new.txt and builds the csv file.
take it on a run and let me know!
by the way - if any of you perl gurus here can suggest a more perlish way to do something I do here I will love to read the comment:
#! /usr/bin/perl
# getting the user's input
if ($#ARGV == 1)
{
$LUT_file = shift;
$file = shift;
$outfile = $file . ".out.txt";
}
elsif ($#ARGV == 2)
{
$LUT_file = shift;
$file = shift;
$outfile = shift;
}
else { &usage; }
# opening the relevant files
open LUT, "<",$LUT_file or die "can't open $LUT_file for reading!\n : $!";
open FILE,"<",$file or die "can't open $file for reading!\n : $!";
open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";
# getting the lines from the text to be changed and changing them
%word_LUT = ();
WORD_EXT:while (<LUT>)
{
$_ =~ m/(\w+),(\w+)/;
$word_LUT{ $1 } = $2 ;
}
close LUT;
OUTER:while ($line = <FILE>)
{
@words = split(/\s+/,$line);
for( $i = 0; $i <= $#words; $i++)
{
if ( exists ($word_LUT { $words[$i] }) )
{
$words[$i] = $word_LUT { $words[$i] };
}
}
$newline = join(' ',@words);
print "old line - $line\nnewline - $newline\n\n";
print OUT $newline . "\n";
}
# now we have all the signals needed in the swav array, build the file.
close OUT;close FILE;
# Sub Routines
#
#
sub usage(){
print "\n\nreplacer.pl Usage:\n";
print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
print "<LUT file> - a LookUp Table of words, from the old word to the new one.
\t\t\twith the following csv format:
\t\t\told word,new word\n";
print "<Input file> - the input file\n";
print "<out file> - out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";
exit;
}

Resources