Make the cat command operate recursively, looping through a directory - bash

I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
    echo $file1
    file2="${file1}.end"
    file3="${file1}.snip"
    sed -e '16,$d' $file1 > $file2
    sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the world's simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is to stitch file2 to the previous file3. The directory listing makes this easier to understand: the files are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code; in fact it would probably be best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about that needs further explanation, please let me know.

"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16{close(f); f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or when we are reading the 16th line of whatever file we are on, FNR==16, we close the output file we have been writing to and update f to be the name of the current file with .new added to the end.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this was a shell script, we would use >>. This is not a shell script. This is awk.)
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
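If you then want the rewritten files to replace the originals, a minimal follow-up sketch (assuming you are happy to overwrite the .ascii files once you have checked the .new output) could be:
for f in "$destination"*.ascii.new; do
    mv -- "$f" "${f%.new}"    # overwrite the original .ascii file with its rewritten version
done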

Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping on the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion with substring removal, say into d1 and d2 for date1 and date2, in:
stuff_20090413T235945_20090414T235944_end
      |--d1--|        |--d2--|
Then you simply subtract 1 from d1 into, say, date0 or d0, construct the previous filename out of d0 and d1 using _snip instead of _end, test for the existence of that previous _snip filename, and, if it exists, paste your info from the current _end file onto the previous _snip file, e.g.
#!/bin/bash

for i in *end; do               ## find all _end files
    d1="${i#*stuff_}"           ## isolate first date in filename
    d1="${d1%%T*}"
    d2="${i%T*}"                ## isolate second date
    d2="${d2##*_}"
    d0=$((d1 - 1))              ## subtract 1 from first date to get the previous snip date
    prev="${i/$d1/$d0}"         ## create previous 'snip' filename
    prev="${prev/$d2/$d1}"
    prev="${prev%end}snip"
    if [ -f "$prev" ]           ## test that prev snip file exists
    then
        printf "paste to : %s\n" "$prev"
        printf "    from : %s\n\n" "$i"
    fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
from : stuff_20090414T235945_20090415T235944_end
paste to : stuff_20090414T235945_20090415T235944_snip
from : stuff_20090415T235945_20090416T235944_end
paste to : stuff_20090415T235945_20090416T235944_snip
from : stuff_20090416T235945_20090417T235944_end
paste to : stuff_20090416T235945_20090417T235944_snip
from : stuff_20090417T235945_20090418T235944_end
paste to : stuff_20090417T235945_20090418T235944_snip
from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
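To actually transfer the lines rather than just report the pairing, the two printf statements can be replaced with an append; a minimal sketch (this modifies the _snip files in place, so test it on copies first):
cat "$i" >> "$prev"    ## append the 15 lines held in the current _end file to the previous _snip file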
Let me know if you have questions.

You could store the previous $file3 value in a variable (and skip the very first iteration, where it is still empty, with an -n check):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in "$destination"*.ascii
do
    echo "$file1"
    file2="${file1}.end"
    file3="${file1}.snip"
    sed -e '16,$d' "$file1" > "$file2"
    sed -e '1,15d' "$file1" > "$file3"
    if [ -n "$prev" ]; then
        cat "$prev" "$file2" > "${prev%.snip}.joined"    # one possible per-pair output name
    fi
    prev=$file3
done

Related

Split large csv file into multiple files and keep header in each part

How do I split a large csv file (1GB) into multiple files (say one part with 1000 rows, a 2nd part with 10000 rows, a 3rd part with 100000 rows, etc.) and preserve the header in each part?
How can I turn this:
h1 h2
a aa
b bb
c cc
.
.
12483720 rows
into
h1 h2
a aa
b bb
.
.
.
1000 rows
And
h1 h2
x xx
y yy
.
.
.
10000 rows
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 {            # first record:
    n=1000         # set first output file size and
    h=$0           # store the header
}
{
    print > n      # output to file
}
NR==n+c {          # once target NR has been reached. close(n) goes here if needed
    n*=10          # grow target magnitude
    c=NR-1         # set the correction factor.
    print h > n    # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?
awk -v l=1000 '(NR==1){
    header=$0
    c=sprintf("%0.5d",c+1)
    file=FILENAME; sub(/csv$/,c".csv",file)    # open the first output file as well
    print header > file
    next
}
(n==l) {
    c=sprintf("%0.5d",c+1)
    close(file); file=FILENAME; sub(/csv$/,c".csv",file)
    print header > file
    n=0; l*=10
}
{print $0 > file; n++}' file.csv
This works in the following way:
(NR==1){...}: If the record/line is the first line, save that line as the header and open the first output file, so the first batch of lines has a file to go to.
(n==l){...}: Every time we wrote the requested amount of records/lines, we need to start writing to a new file. This happens every time n==l and we perform the following actions:
c=sprintf("%0.5d",c+1): increase the counter by one, and print it as 000xx
close(file): close the file you just wrote to.
file=FILENAME; sub(/csv$/,c".csv",file): define the new filename
print header > file: open the file and write the header to that file.
n=0: reset the current record count
l*=10: increase the maximum record count for the next file
{print $0 > file; n++}: write the entries to the file and increment the record count
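As a quick usage sketch: assuming the input is named file.csv, the generated files would be named file.00001.csv, file.00002.csv, and so on, each starting with the header line, so a sanity check could be:
wc -l file.*.csv    # each part should show its target row count plus one header line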
Hacky, but this utilizes the split utility, which does most of the heavy lifting for splitting the files. Then, since the split files have a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and move that file back to the original filename.
# Use `split` utility to split the file csv, with 5000 lines per file,
# adding numerical suffixes, and adding the additional suffix '.split' to help
# identify the files.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv
# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'
# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'
# Walk the files that do not have headers. For each one, cat the header from
# file with header, with rest of body, output to tmp.txt, then mv tmp.txt to
# original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
mv tmp.txt $f
done
Here's a first approach:
#!/bin/bash
head -1 $1 >header
split $1 y
for f in y*; do
cp header h$f
cat $f >>h$f
done
rm -f header
rm -f y*
The following bash solution should work nicely:
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}
We have a first iteration level which produces the number of rows we're going to write for each successive file. It generates the file names and write the header to them. It is an infinite loop because we don't check how many lines the input has and therefore don't know beforehand how many files we're going to write to, so we'll have to break out of this loop to end it.
Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this loop we try to read a line from the input file. If it works we write it to the current output file, if it doesn't (we've reached the end of the input) we break out of two levels of loop.
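For example, assuming the snippet above is saved as split_rows.sh (the name is just for illustration), it would be run with the CSV on standard input:
bash split_rows.sh < file.csv    # produces file_with_1000_rows, file_with_10000_rows, ...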

Pattern match in a column in bash

I want to grab the rows containing Subject01, Subject02, ... Subject50 from the text file and separate them into individual files. Below is the code I tried, and the output files it produces are empty. Can anyone tell me what I am doing wrong?
Subject01/path/here 4
Subject01/path/here 1
Subject02/path/here 3
Subject03/path/here 5
Subject03/path/here 6
...
so one of the output files would be in the format below:
Subject03/path/here 5
Subject03/path/here 6
Here is the code I tried; it fails.
#!/bin/sh
subject=Subject
for i in {01..50}
do
awk '{ if ($1 == "${subject}${i}") { print } }' output-0 > output-0-sub-$i
done
You can simply use grep for that:
for f in {01..10}; do
    grep "Subject$f" inputFile.txt >> output-0-sub-$f
    if [[ ! -s output-0-sub-${f} ]] ; then
        rm output-0-sub-$f
    fi
done
The if condition is checking if the file is empty and if so, it is deleted.
You could also add the -f flag to check if the file exists, but it depends on how your script works.
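If you would rather fix the original awk approach: the underlying problem is that shell variables are not expanded inside single quotes, and passing them in with awk's -v option is one way around that. A sketch (note that the original == test would not have matched anyway, because $1 is the whole Subject01/path/here field, so a prefix test is used here):
subject=Subject
for i in {01..50}; do
    awk -v pat="${subject}${i}" 'index($1, pat) == 1' output-0 > "output-0-sub-$i"
done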

while loops in parallel with input from split files

I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is, I don't know how to tell each while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But in neither case was I able to tell which file should be the input to the loop, so that parallel would understand "take each of those xa* files and feed each one to an individual loop in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence accessions, reads it line-by-line and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of each sequence it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799: I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has all the accessions of the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for every DNA sequence accession I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
    #find hits
    hits=$(grep $line alnout_file | cut -f 2)
    completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
    printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
    printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
    #find hits and calculate the frequence (%) of the taxonomy in the alignment output
    # ex.: Bacillus_subtilis 33
    freqHits=$(grep "${hits[@]}" $TAXONOMY | \
        cut -f 2 | \
        awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
        sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
        sort -k2 -hr)
    # print frequence of each taxonomy in the databank
    freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
    #print cols with taxonomy and calculations
    paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12 h running on one processor to do this for all 45000 DNA sequence accessions. So, I would like to split input_file and do it on all the processors I have (14), because it would reduce the time spent on this.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
    while read line
    do
        #find hits
        hits=$(grep $line alnout_file | cut -f 2)
        completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
        #find hits and calculate the frequence (%) of the taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "${hits[@]}" $TAXONOMY | \
            cut -f 2 | \
            awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
            sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
            sort -k2 -hr)
        # print frequence of each taxonomy in the databank
        freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
        #print cols with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
    done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE                      # create a single queue
cat "$1" > PIPELINE &                # supply it with records
{   declare -i cnt=0 max=14
    while (( ++cnt <= max ))         # spawn loop creates worker jobs
    do  printf -v fn "%02d" $cnt
        while read -r line           # each work loop reads common stdin...
        do  echo "$fn:[$line]"
            sleep 1
        done >$fn.log 2>&1 &         # these run in background in parallel
    done                             # this one exits
} < PIPELINE                         # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash

# This function will be called for each of the 14 files
DoOne(){
    # Pick up parameters
    job=$1
    file=$2
    # Count lines in specified file
    lines=$(wc -l < "$file")
    echo "Job No: $job, file: $file, lines: $lines"
}

# Make the function above known to processes spawned by GNU Parallel
export -f DoOne

# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep commands in a shell loop is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
    for b in {A..Z} {0..9} ; do
        for c in {A..Z} {0..9} ; do
            echo "${a}${b}${c}"
        done
    done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric characters, and if I nest 3 loops of that I get 36^3 or 46,656 lines, which roughly matches your file sizes. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a, finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
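Applied to your data, a minimal sketch of that idea (assuming taxonomy.file.tsv is tab-separated with the accession in column 1 and the taxonomy string in column 2, as in your head output, and that input_file has one accession per line) would be to load the taxonomy into an array once and look each accession up, instead of grepping the whole file for every line:
awk -F'\t' 'FNR==NR { tax[$1] = $2; next }    # first file: build an accession -> taxonomy lookup
            { print $1 "\t" tax[$1] }         # second file: print the taxonomy for each accession
' taxonomy.file.tsv input_file
This only replaces the repeated greps against $TAXONOMY; the percentage calculations would still need to be layered on top.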
I hope this small script helps you out:
function process {
    while read line; do
        echo "$line"
    done < "$1"
}

function loop {
    file=$1
    chunks=$2
    dir=`mktemp -d`
    cd $dir
    split -n l/$chunks $file
    for i in *; do
        process "$i" &
    done
    wait            # let the background chunks finish before cleaning up
    rm -rf $dir
}

loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead this uses native bash spawning of background processes (&):
function loop () {
    while IFS= read -r -d $'\n'
    do
        # YOUR BIG STUFF
    done < "${1}"
}

arr_files=(./xa*)
for i in "${arr_files[@]}"
do
    loop "${i}" &
done
wait

Should I use a for loop to process text files line by line?

So I have two text files
FILE1: 1-40 names
FILE2: 1-40 names
Now what I want the program (in the terminal) to do is go through each name, incrementing by ONE in each file, so that the first name from FILE1 runs with the first line from FILE2, and the 20th name from FILE1 runs with the 20th line from FILE2.
BUT I DON'T WANT IT TO run the first name of FILE1 against all of the names listed in FILE2, and repeat that over and over again.
Should I do a for loop?
I was thinking of doing something like:
for f in (cat FILE1); do
flirt -in $f -ref (cat FILE2);
done
I'm doing this using BASH.
Yes, you can do it quite easily, but it will require reading from two different file descriptors at once. You can simply redirect one of the files into the next available file descriptor and use it to feed your read loop, e.g.
while read f1var && read -u 3 f2var; do
    echo "f1var: $f1var -- f2var: $f2var"
done <file1.txt 3<file2.txt
This will read line-by-line from each file, reading a line from file1.txt on the standard file descriptor into f1var and a line from file2.txt on fd 3 into f2var.
A short example might help:
Example Input Files
$ cat f1.txt
a
b
c
$ cat f2.txt
d
e
f
Example Use
$ while read f1var && read -u 3 f2var; do \
echo "f1var: $f1var -- f2var: $f2var"; \
done <f1.txt 3<f2.txt
f1var: a -- f2var: d
f1var: b -- f2var: e
f1var: c -- f2var: f
Using paste as an alternative
The paste utility also provides a simple alternative for combining files line-by-line, e.g.:
$ paste f1.txt f2.txt
a d
b e
c f
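If you also want to run a command per pair of lines (as with flirt in the question), the paste output can feed a single read loop; a sketch, assuming the names contain no embedded whitespace:
paste f1.txt f2.txt | while read -r f1var f2var; do
    echo "f1var: $f1var -- f2var: $f2var"    # replace with e.g. flirt -in "$f1var" -ref "$f2var"
done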
In Bash, you might make use of arrays:
echo "Alice
> Bob
> Claire" > file-1
echo "Anton
Bärbel
Charlie" > file-2
n1=($(cat file-1))
n2=($(cat file-2))
for n in {0..2}; do echo ${n1[$n]} ${n2[$n]} ; done
Alice Anton
Bob Bärbel
Claire Charlie
Getting familiar with join and nl (number lines) can't be wrong, so here is a different approach:
nl -w 1 file-1 > file1
nl -w 1 file-2 > file2
join -1 1 -2 1 file1 file2 | sed -r 's/^[0-9]+ //'
nl will put a large number of blanks in front of small line numbers if we don't tell it -w 1.
We join the files by matching line number and remove the line number afterwards with sed.
Paste is of course much more elegant. Didn't know about this.

Finding and replacing many words

I frequently need to make many replacements within files. To solve this problem, I have created two files old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.
All of my files use UTF-8 and make use of various languages.
I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the words at that line in input.txt with the corresponding words from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
    echo $linefromoldwords
    linefromnewwords=$(sed -n '$numberp' new.text)
    awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
    number=$number+1
    echo $number
done < old.text
However, my solution does not work well. When I run the script:
On line 6, the sed command does not know where the $number ends.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well, however, I use this script a lot and it takes many hours to finish. So I offer a bounty for a solution which can complete these replacements much quicker. A solution in BASH, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using other software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
On line 6, the sed command does not know where the $number ends.
Try quoting the variable with double quotes
linefromnewwords=$(sed -n "$number"p newwords.txt)
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Do this instead:
number=`expr $number + 1`
The line with awk does not appear to be doing anything more than copying the input.txt exactly as is to output.txt.
awk won't take variables from outside its scope. User-defined variables in awk need to be either defined where they are used or predefined in awk's BEGIN statement. You can include shell variables by using the -v option.
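For example, a sketch of that -v approach applied to your loop (filenames as in your script; note that the old word is still treated as a regular expression by gsub):
linefromnewwords=$(sed -n "${number}p" new.text)
awk -v old="$linefromoldwords" -v new="$linefromnewwords" '{ gsub(old, new); print }' input.txt >> output.txt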
Here is a solution in bash that would do what you need.
Bash Solution:
#!/bin/bash
while read -r sub && read -r rep <&3; do
    sed -i "s/ "$sub" / "$rep" /g" main.file
done <old.text 3<new.text
This solution reads one line at a time from substitution file and replacement file and performs in-line sed substitution.
Why not
paste -d/ oldwords.txt newwords.txt |\
sed -e 's#/# / #' -e 's#^#s/ #' -e 's#$# /g#' >/tmp/$$.sed
sed -f /tmp/$$.sed original >changed
rm /tmp/$$.sed
?
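To see what this builds: with oldwords.txt containing 19 and 20 and newwords.txt containing A and B (the sample data used in other answers here), the generated /tmp/$$.sed script would contain:
s/ 19 / A /g
s/ 20 / B /g
and the second sed then applies all of those substitutions to original in a single pass.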
I love this kind of questions, so here is my answer:
First, for the sake of simplicity, why not use a single file with both the source and the translation? I mean: (filename changeThis)
hello=Bye dudes
the morNing=next Afternoon
first=last
Then you can define a proper separator in the script. (file replaceWords.sh)
#!/bin/bash
SEP=${1}
REPLACE=${2}
FILE=${3}

while read transline
do
    origin=${transline%%${SEP}*}
    dest=${transline##*${SEP}}
    sed -i "s/${origin}/${dest}/gI" $FILE
done < $REPLACE
Take this example (file changeMe)
Hello, this is me.
I will be there at first time in the morning
Call it with
$ bash replaceWords.sh = changeThis changeMe
And you will get
Bye dudes, this is me.
I will be there at last time in next Afternoon
Take note of the two different uses of "i" with sed: "-i" means replace in the source file, and "I" in the s// command means ignore case (a GNU extension; check your sed implementation).
Of course, note that a bash while loop is horrendously slower than Python or a similar scripting language. Depending on your needs you can do a nested while, one over the source file and one inside it looping over the translations (changes), echoing everything to stdout for pipe flexibility.
#!/bin/bash
SEP=${1}
TRANSLATION=${2}
FILE=${3}

while read line
do
    while read transline
    do
        origin=${transline%%${SEP}*}
        dest=${transline##*${SEP}}
        line=$(echo $line | sed "s/${origin}/${dest}/gI")
    done < $TRANSLATION
    echo $line
done < $FILE
This Python 2 script forms the old words into a single regular expression then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct. This distinctness is enforced by surrounding the word in r'\b' which is the regular expression word boundary.
Input is from the command line (there is a commented alternative I used for development in IDLE). Output is to stdout.
The main text is scanned only once in this solution. With the input from Jaypal's answer, the output is the same.
#!/bin/env python
import sys, re

def replacer(match):
    global new
    return new[match.lastindex-1]

if __name__ == '__main__':
    fname_old, fname_new, fname_txt = sys.argv[1:4]
    #fname_old, fname_new, fname_txt = 'oldwords.txt oldwordreplacements.txt oldwordreplacer.txt'.split()
    with file(fname_old) as f:
        # Form regular expression that matches old words, grouped in order
        old = '(?:' + '|'.join(r'\b(%s)\b' % re.escape(word)
                               for word in f.read().strip().split()) + ')'
    with file(fname_new) as f:
        # Ordered list of replacement words
        new = [word for word in f.read().strip().split()]
    with file(fname_txt) as f:
        # input text
        txt = f.read()
    # Output the new text
    print( re.subn(old, replacer, txt)[0] )
I just did some stats on a ~100K byte text file:
Total characters in text: 116413
Total words in text: 17114
Total distinct words in text: 209
Top 10 distinct word occurrences in text: 2664 = 15.57%
The text was 250 paragraphs of lorem ipsum from an online generator. I took the ten most frequently occurring words and replaced them with the strings ONE to TEN, in order.
The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.
The Python solution will replace words followed by a newline character or by punctuation, as well as by any whitespace (including tabs etc).
Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages, and the technique shown in the Python could be replicated in Ruby or Perl.
Paddy.
A general perl solution that I have found to work well for replacing the keys in a map with their associated values is this:
my %map = (
    19 => 'A',
    20 => 'B',
);

my $key_regex = '(' . join('|', keys %map) . ')';

while (<>) {
    s/$key_regex/$map{$1}/g;
    print $_;
}
You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash-lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
At line 6, the sed command does not know where the $number ends.
linefromnewwords=$(sed -n "${number}p" newwords.txt)
The ${number} has to be expanded by the shell, so it needs double quotes (or no quotes at all); inside single quotes it is passed to sed literally.
The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval=evil).
number=$((number + 1))
In general, I would recommend using one file with
s/ ni3 / nǐ /g
s/ nei3 / neǐ /g
and so on, one sed command per line, which is imho easier to maintain - sort it alphabetically, and use it with:
sed -f translate.sed input > output
So you can always easily compare the mappings.
s/\bni3\b/nǐ/g
might be prefered over blanks as explicit delimiters, because \b:=word boundary matches start/end of line and punctuation characters.
This should reduce the time somewhat, as it avoids unnecessary loops.
Merge two input files:
Lets assume you have two input files, old.text containing all substitutions and new.text containing all replacements.
We will create a new text file which will act as a sed script to your main file using the following awk one-liner:
awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat old.text
19
20
[jaypal:~/Temp] cat new.text
A
B
[jaypal:~/Temp] awk '{ printf "s/ "$0" /"; getline <"new.text"; print " "$0" /g" }' old.text > merge.text
[jaypal:~/Temp] cat merge.text
s/ 19 / A /g
s/ 20 / B /g
Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.
Using merged file as sed script:
Once your merged file has been created, we will use the -f option of the sed utility.
sed -f merge.text input_file
[jaypal:~/Temp] cat input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
[jaypal:~/Temp] sed -f merge.text input_file
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
You can redirect this into another file using the > operator.
This might work for you:
paste {old,new}words.txt |
sed 's,\(\w*\)\s*\(\w*\),s!\\<\1\\>!\2!g,' |
sed -i -f - text.txt
Here is a Python 2 script that should be both space and time efficient:
import sys
import codecs
import re

sub = dict(zip((line.strip() for line in codecs.open("old.txt", "r", "utf-8")),
               (line.strip() for line in codecs.open("new.txt", "r", "utf-8"))))
regexp = re.compile('|'.join(map(lambda item:r"\b" + re.escape(item) + r"\b", sub)))
for line in codecs.open("input.txt", "r", "utf-8"):
    result = regexp.sub(lambda match:sub[match.group(0)], line)
    sys.stdout.write(result.encode("utf-8"))
Here it is in action:
$ cat old.txt
19
20
$ cat new.txt
A
B
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
$ python convert.py
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
$
EDIT: Hat tip to @Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combined your input word lists into one list: each line containing the map of old and new words.
#!/usr/bin/env perl
# usage:
# replace.pl OLD.txt NEW.txt INPUT.txt >> OUTPUT.txt

use strict;
use warnings;

sub read_words {
    my $file = shift;
    open my $fh, "<$file" or die "Error reading file: $file; $!\n";
    my @words = <$fh>;
    chomp @words;
    close $fh;
    return \@words;
}

sub word_map {
    my ($old_words, $new_words) = @_;
    if (scalar @$old_words != scalar @$new_words) {
        warn "Old and new word lists are not equal in size; using the smaller of the two sizes ...\n";
    }
    my $list_size = scalar @$old_words;
    $list_size = scalar @$new_words if $list_size > scalar @$new_words;
    my %map = map { $old_words->[$_] => $new_words->[$_] } 0 .. $list_size - 1;
    return \%map;
}

sub build_regex {
    my $words = shift;
    my $pattern = join "|", sort { length $b <=> length $a } @$words;
    return qr/$pattern/;
}

my $old_words   = read_words(shift);
my $new_words   = read_words(shift);
my $word_map    = word_map($old_words, $new_words);
my $old_pattern = build_regex($old_words);

my $input_file = shift;
open my $input, "<$input_file" or die "Error reading input file: $input_file; $!\n";
while (<$input>) {
    s/($old_pattern)/$word_map->{$&}/g;
    print;
}
close $input;

__END__
Old words file:
$ cat old.txt
19
20
New words file:
$ cat new.txt
A
B
Input file:
$ cat input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
19 adads
19 adfasf
20 aaaadsf
Create output:
$ perl replace.pl old.txt new.txt input.txt
12 adsflljl
12 hgfahld
12 ash;al
13 a;jfda
13 asldfj
15 ;aljdf
16 a;dlfj
A adads
A adfasf
B aaaadsf
I'm not sure why most of the previous posters insist on using regular expressions to solve this task; I think this will be faster than most, if not the fastest, method.
use warnings;
use strict;

open (my $fh_o, '<', "old.txt");
open (my $fh_n, '<', "new.txt");

my @hay = <>;
my @old = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_o>;
my @new = map {s/^\s*(.*?)\s*$/$1/; $_} <$fh_n>;

my %r;
@r{@old} = @new;

print defined $r{$_} ? $r{$_} : $_ for split (
    /(\s+)/, "@hay"
);
Use: perl script.pl /file/to/modify, result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)
I believe that this Perl script, although not using fancy sed or awk thingies, does the job fairly quickly...
I did take the liberty of using another format for old_word to new_word: the csv format. If it is too complicated to do, let me know and I'll add a script that takes your old.txt and new.txt and builds the csv file.
Take it for a run and let me know!
By the way, if any of you Perl gurus here can suggest a more Perlish way to do something I do here, I would love to read the comment:
#! /usr/bin/perl
# getting the user's input
if ($#ARGV == 1)
{
$LUT_file = shift;
$file = shift;
$outfile = $file . ".out.txt";
}
elsif ($#ARGV == 2)
{
$LUT_file = shift;
$file = shift;
$outfile = shift;
}
else { &usage; }
# opening the relevant files
open LUT, "<",$LUT_file or die "can't open $LUT_file for reading!\n : $!";
open FILE,"<",$file or die "can't open $file for reading!\n : $!";
open OUT,">",$outfile or die "can't open $outfile for writing\n :$!";
# getting the lines from the text to be changed and changing them
%word_LUT = ();
WORD_EXT:while (<LUT>)
{
$_ =~ m/(\w+),(\w+)/;
$word_LUT{ $1 } = $2 ;
}
close LUT;
OUTER:while ($line = <FILE>)
{
@words = split(/\s+/,$line);
for( $i = 0; $i <= $#words; $i++)
{
if ( exists ($word_LUT { $words[$i] }) )
{
$words[$i] = $word_LUT { $words[$i] };
}
}
$newline = join(' ',@words);
print "old line - $line\nnewline - $newline\n\n";
print OUT $newline . "\n";
}
# now we have all the signals needed in the swav array, build the file.
close OUT;close FILE;
# Sub Routines
#
#
sub usage(){
print "\n\nreplacer.pl Usage:\n";
print "replacer.pl <LUT file> <Input file> [<out file>]\n\n";
print "<LUT file> - a LookUp Table of words, from the old word to the new one.
\t\t\twith the following csv format:
\t\t\told word,new word\n";
print "<Input file> - the input file\n";
print "<out file> - out file is optional. \nif not entered the default output file will be: <Input file>.out.txt\n\n";
exit;
}
