Replace string in the first file it is found - bash

I have a bunch of files named like this:
chapter1.tex
chapter2.tex
chapter3.tex
...
chapter 10.tex
chapter 11.tex
etc.
I am trying to use sed to find and replace the first instance of AAAAAA with ZZZZZZ within all of the files.
sed -i "0,/AAAAAA/s//ZZZZZZ/" chapter*.tex
I tried the above command, but there are two problems:
It finds and replaces the first instance of AAAAAA within each file. I want only the first instance among all the files.
I suspect, like many Bash tools, it doesn't sort my files in numeric order. E.g. if I type ls, then chapter10.tex is listed before chapter2.tex. It is critical that the files are searched in chapter order.
How can I use Bash tools to find and replace the first instance among a large list of files, so that only the first instance in the first matching file is replaced, while also respecting the file order (chapter1.tex is first, chapter10.tex is tenth)?

With the complete GNU toolchest you don't need a loop.
printf '%s\0' chapter*.tex \
| sort -zV \
| xargs -0 grep -FlZ 'AAAAAA' \
| head -zn1 \
| xargs -0r sed -i '0,/AAAAAA/s//ZZZZZZ/'
printf emits the filenames NUL-delimited (safe for names with spaces), sort -zV version-sorts them so chapter2.tex precedes chapter10.tex, grep -FlZ lists the files containing the fixed string (NUL-terminated), head -zn1 keeps only the first such file, and xargs -0r runs sed only if a file was found, replacing just the first instance in that file.

Here is a bash-loop-based solution that will work with filenames such as chapter 10.tex, i.e. filenames with spaces etc:
while IFS= read -r -d '' file; do
  if grep -q 'AAAAAA' "$file"; then
    echo "changing $file"
    sed -i '0,/AAAAAA/s//ZZZZZZ/' "$file"
    break
  fi
done < <(printf '%s\0' chapter*.tex | sort -z -V)
This is assuming both sed and sort are from GNU utils.
If you have gnu awk 4.1+, which supports in-place editing (-i inplace), then you can replace grep + sed with a single awk:
while IFS= read -r -d '' file; do
  awk -i inplace '!n {n=sub(/AAAAAA/, "ZZZZZZ")} 1;
    END {exit !n}' "$file" && break
done < <(printf '%s\0' chapter*.tex | sort -z -V)

This might work for you (GNU sed and grep):
grep -ns 'AAAAAA' chapter{1..9999}.tex | head -1 |
sed -nE 's#([^:]*):([^:]*):.*#sed -i "\2s/AAAAAA/ZZZZZZ/" \1#e'
Use grep and bash's braces expansion to identify the one possible matching file and line number and build a sed script to update that file at that line number.
N.B. Brace expansion generates the filenames in the correct order, and the -s command line option for grep suppresses the error messages for non-existent files.
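For example, brace expansion counts numerically, regardless of which files actually exist:
$ echo chapter{9..11}.tex
chapter9.tex chapter10.tex chapter11.tex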
Alternative using GNU parallel:
grep -sno 'AAAAAA' chapter{1..9999}.tex | head -1 |
parallel --colsep : sed -i '{2}s/{3}/ZZZZZZ/' {1}

Update
I stand on the backs of giants, lol.
Kudos to @potong for a great sorting solution with brace expansion! That means this whole thing can be reduced to a single-process one-liner:
sed -i '0,/^AAA/{ /^AAA/{ s/AAA/ZZZ/; h; } }; ${ x; /./{x;q;}; x; }' chapter\ {[0-9],[0-9][0-9]}.tex
Edit
As pointed out, the original solution below would process and change the first occurrence in every file, and does not correct the file order. @anubhava already provided an excellent, elegant sorting solution on which I will not try to improve.
while IFS= read -r -d '' file; do lst+=( "$file" ); done < <(printf '%s\0' chapter*.tex | sort -z -V)
This creates a list of the filenames in proper order which can be passed to a single call of sed to process them en masse.
To apply that to ordering to a sed-based solution and only hit the first occurrence in any file -
sed -i '0,/^AAA/{ /^AAA/{ s/AAA/ZZZ/; h; } }; ${ x; /./{x;q;}; x; }' "${lst[@]}"
This will look through each file and change the first occurrence it finds in that file, holding the line where it first finds it. On the last line of each file it exchanges the current line for the hold buffer and checks to see if after the swap there is anything in the pattern buffer. If there is not, it swaps it back and continues. If there is, it swaps it back and quits, skipping all subsequent files.
While somewhat complicated, this does not spawn processes for each file.
Original
Use a double condition -
sed -i '0,/AAAAAA/{ /AAAAAA/s/AAAAAA/ZZZZZZ/ }' chapter*.tex
To see the same general logic in action:
$: cat a.tex b.tex
111
AAA
BBB
AAA
222
111
AAA
BBB
AAA
222
$: sed -i '0,/^AAA/{ /^AAA/s/AAA/ZZZ/; }' *.tex
$: cat a.tex b.tex
111
ZZZ
BBB
AAA
222
111
ZZZ
BBB
AAA
222
0,/^AAA/ is right, as it ranges from the start of the file to the first occurrence of the target string.
{ opens a block, in which we can use a second search to make sure it only affects the targeted string.
Inside the block, /^AAA/s/AAA/ZZZ/; substitutes the AAA string, ignoring all the records before it. } closes the block. All records after it are untouched.

Related

How to delete a line (matching a pattern) from a text file? [duplicate]

How would I use sed to delete all lines in a text file that contain a specific string?
To remove the line and print the output to standard out:
sed '/pattern to match/d' ./infile
To directly modify the file – does not work with BSD sed:
sed -i '/pattern to match/d' ./infile
Same, but for BSD sed (Mac OS X and FreeBSD) – does not work with GNU sed:
sed -i '' '/pattern to match/d' ./infile
To directly modify the file (and create a backup) – works with BSD and GNU sed:
sed -i.bak '/pattern to match/d' ./infile
There are many other ways to delete lines with specific string besides sed:
AWK
awk '!/pattern/' file > temp && mv temp file
Ruby (1.9+)
ruby -i.bak -ne 'print if not /test/' file
Perl
perl -ni.bak -e "print unless /pattern/" file
Shell (bash 3.2 and later)
while read -r line
do
[[ ! $line =~ pattern ]] && echo "$line"
done <file > o
mv o file
GNU grep
grep -v "pattern" file > temp && mv temp file
And of course sed (printing the inverse is faster than actual deletion):
sed -n '/pattern/!p' file
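To apply the inverse-print form back to the file itself, a minimal sketch (file.tmp is a placeholder name):
sed -n '/pattern/!p' file > file.tmp && mv file.tmp file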
You can use sed to delete lines in place in a file. However, it seems to be much slower than using grep for the inverse into a second file and then moving the second file over the original.
e.g.
sed -i '/pattern/d' filename
or
grep -v "pattern" filename > filename2; mv filename2 filename
The first command takes 3 times longer on my machine anyway.
The easy way to do it, with GNU sed:
sed --in-place '/some string here/d' yourfile
You may consider using ex (which is a standard Unix command-based editor):
ex +g/match/d -cwq file
where:
+ executes given Ex command (man ex), same as -c which executes wq (write and quit)
g/match/d - Ex command to delete lines with given match, see: Power of g
The above example is a POSIX-compliant method for in-place editing a file as per this post at Unix.SE and POSIX specifications for ex.
The difference with sed is that:
sed is a Stream EDitor, not a file editor (BashFAQ), unless you enjoy unportable code, I/O overhead and some other bad side effects. Basically, some parameters (such as in-place/-i) are non-standard FreeBSD extensions and may not be available on other operating systems.
I was struggling with this on Mac. Plus, I needed to do it using variable replacement.
So I used:
sed -i '' "/$pattern/d" "$file"
where $file is the file where deletion is needed and $pattern is the pattern to be matched for deletion.
I picked the '' from this comment.
The thing to note here is the use of double quotes in "/$pattern/d". Variables won't work when we use single quotes.
You can also use this:
grep -v 'pattern' filename
Here -v will print only the lines that don't match your pattern (that is, it inverts the match).
To get an in-place-like result with grep you can do this:
echo "$(grep -v "pattern" filename)" >filename
I have made a small benchmark with a file which contains approximately 345 000 lines. The way with grep seems to be around 15 times faster than the sed method in this case.
I have tried both with and without the setting LC_ALL=C; it does not seem to change the timings significantly. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.
Here are the commands and the timings:
time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt
real 0m0.711s
user 0m0.179s
sys 0m0.530s
time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt
real 0m0.105s
user 0m0.088s
sys 0m0.016s
time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )
real 0m0.046s
user 0m0.014s
sys 0m0.019s
To delete matching lines from all files that contain the match:
grep -rl 'text_to_search' . | xargs sed -i '/text_to_search/d'
SED:
'/James\|John/d'
-n '/James\|John/!p'
AWK:
'!/James|John/'
/James|John/ {next;} {print}
GREP:
-v 'James\|John'
perl -i -nle'/regexp/||print' file1 file2 file3
perl -i.bk -nle'/regexp/||print' file1 file2 file3
The first command edits the file(s) in place (-i).
The second command does the same thing but keeps a copy or backup of the original file(s) by adding .bk to the file names (.bk can be changed to anything).
You can also delete a range of lines in a file.
For example to delete stored procedures in a SQL file.
sed '/CREATE PROCEDURE.*/,/END ;/d' sqllines.sql
This will remove all lines between CREATE PROCEDURE and END ;.
I have cleaned up many sql files with this sed command.
echo -e "/thing_to_delete\ndd\033:x\n" | vim file_to_edit.txt
Just in case someone wants to do it for exact matches of strings, you can use the -w flag in grep - w for whole. That is, for example if you want to delete the lines that have number 11, but keep the lines with number 111:
-bash-4.1$ head file
1
11
111
-bash-4.1$ grep -v "11" file
1
-bash-4.1$ grep -w -v "11" file
1
111
It also works with the -f flag if you want to exclude several exact patterns at once. If "blacklist" is a file with several patterns on each line that you want to delete from "file":
grep -w -v -f blacklist file
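For instance, with the three-line file from above and a hypothetical blacklist holding 1 and 11:
-bash-4.1$ printf '%s\n' 1 11 > blacklist
-bash-4.1$ grep -w -v -f blacklist file
111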
To show the treated text in the console:
cat filename | sed '/text to remove/d'
To save the treated text into a file:
cat filename | sed '/text to remove/d' > newfile
To append the treated text to an existing file:
cat filename | sed '/text to remove/d' >> newfile
To treat already treated text, in this case to remove more lines of what has been removed:
cat filename | sed '/text to remove/d' | sed '/remove this too/d' | more
The | more will show the text in chunks of one page at a time.
Curiously enough, the accepted answer does not actually answer the question directly. The question asks about using sed to replace a string, but the answer seems to presuppose knowledge of how to convert an arbitrary string into a regex.
Many programming language libraries have a function to perform such a transformation, e.g.
python: re.escape(STRING)
ruby: Regexp.escape(STRING)
java: Pattern.quote(STRING)
But how to do it on the command line?
Since this is a sed-oriented question, one approach would be to use sed itself:
sed 's/\([\[/({.*+^$?]\)/\\\1/g'
So given an arbitrary string $STRING we could write something like:
re=$(sed 's/\([\[/({.*+^$?]\)/\\\1/g' <<< "$STRING")
sed "/$re/d" FILE
or as a one-liner:
sed "/$(sed 's/\([\[/({.*+^$?]\)/\\\1/g' <<< "$STRING")/d"
with variations as described elsewhere on this page.
cat filename | grep -v "pattern" > filename.1
mv filename.1 filename
You can use good old ed to edit a file in a similar fashion to the answer that uses ex. The big difference in this case is that ed takes its commands via standard input, not as command line arguments like ex can. When using it in a script, the usual way to accommodate this is to use printf to pipe commands to it:
printf "%s\n" "g/pattern/d" w | ed -s filename
or with a heredoc:
ed -s filename <<EOF
g/pattern/d
w
EOF
This solution is for doing the same operation on multiple files.
for file in *.txt; do grep -v "Matching Text" "$file" > temp_file.txt; mv temp_file.txt "$file"; done
I found most of the answers not useful for me. If you use vim, I found this very easy and straightforward:
:g/<pattern>/d
Source

Delete the last 3 lines of my txt with bash? [duplicate]

I want to remove some n lines from the end of a file. Can this be done using sed?
For example, to remove lines from 2 to 4, I can use
$ sed '2,4d' file
But I don't know the line numbers. I can delete the last line using
$ sed '$d' file
but I want to know the way to remove n lines from the end. Please let me know how to do that using sed or some other method.
I don't know about sed, but it can be done with head:
head -n -2 myfile.txt
If hardcoding n is an option, you can use sequential calls to sed. For instance, to delete the last three lines, delete the last one line thrice:
sed '$d' file | sed '$d' | sed '$d'
From the sed one-liners:
# delete the last 10 lines of a file
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2
Seems to be what you are looking for.
A funny & simple sed and tac solution:
n=4
tac file.txt | sed "1,$n{d}" | tac
NOTE
Double quotes " are needed for the shell to evaluate the $n variable in the sed command. With single quotes, no interpolation is performed.
tac is cat reversed, see man 1 tac.
The {} in sed are there to separate $n and d (without them, the shell would try to interpolate a non-existent $nd variable).
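Equivalently, you can brace the variable name on the shell side, which avoids the need for {} in the sed script:
n=4
tac file.txt | sed "1,${n}d" | tac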
Use sed, but let the shell do the math, with the goal being to use the d command by giving a range (to remove the last 23 lines):
sed -i "$(($(wc -l < file)-22)),\$d" file
To explain the removal of the last 23 lines, from inside out:
$(wc -l < file)
Gives the number of lines of the file: say 2196
We want to remove the last 23 lines, so for left side or range:
$((2196-22))
Gives: 2174
Thus the original sed after shell interpretation is:
sed -i '2174,$d' file
With -i doing inplace edit, file is now 2173 lines!
If you want to save the result to a new file instead, drop -i and redirect:
sed '2174,$d' file > outputfile
You could use head for this.
Use
$ head --lines=-N file > new_file
where N is the number of lines you want to remove from the file.
The contents of the original file minus the last N lines are now in new_file
Just for completeness I would like to add my solution.
I ended up doing this with the standard ed:
ed -s sometextfile <<< $'-2,$d\nwq'
This deletes the last 3 lines ($-2 through $, since ed's current line starts at the last line) using in-place editing (although it does use a temporary file in /tmp!)
To truncate very large files truly in place, we have the truncate command.
It doesn't know about lines, but tail + wc can convert lines to bytes:
file=bigone.log
lines=3
truncate -s -$(tail -$lines $file | wc -c) $file
There is an obvious race condition if the file is written at the same time.
In this case it may be better to use head: it counts bytes from the beginning of the file (mind the disk IO), so we will always truncate on a line boundary (possibly keeping more lines than expected if the file is actively written):
truncate -s $(head -n -$lines $file | wc -c) $file
Handy one-liner if you fail a login attempt by putting your password in place of the username:
truncate -s $(head -n -5 /var/log/secure | wc -c) /var/log/secure
This might work for you (GNU sed):
sed ':a;$!N;1,4ba;P;$d;D' file
Most of the above answers seem to require GNU commands/extensions:
$ head -n -2 myfile.txt
-2: Badly formed number
For a slightly more portable solution:
perl -ne 'push(@fifo,$_);print shift(@fifo) if @fifo > 10;'
OR
perl -ne 'push(@buf,$_);END{print @buf[0 .. $#buf-10]}'
OR
awk '{buf[NR-1]=$0;}END{ for ( i=0; i < (NR-10); i++){ print buf[i];} }'
Where "10" is "n".
With the answers here you'd have already learnt that sed is not the best tool for this application.
However I do think there is a way to do this using sed; the idea is to append N lines to hold space until you are able to read without hitting EOF. When EOF is hit, print the contents of hold space and quit.
sed -e '$!{N;N;N;N;N;N;H;}' -e x
The sed command above will omit the last 5 lines.
It can be done in 3 steps:
a) Count the number of lines in the file you want to edit:
n=`wc -l < myfile`
b) Subtract from that number the number of lines to delete:
x=$((n-3))
c) Tell sed to delete from that line number ($x) to the end:
sed "$x,\$d" myfile
You can get the total count of lines with wc -l <file> and use
head -n <total lines - lines to remove> <file>
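A minimal sketch of that, assuming the file is named file and you want to drop the last 3 lines:
total=$(wc -l < file)
head -n $((total - 3)) file > file.new && mv file.new file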
Try the following command, substituting the number of lines to remove for n (tail -r, which reverses a file, is available on BSD/macOS):
tail -r file_name | sed '1,nd' | tail -r
This will remove the last 3 lines from file:
for i in $(seq 1 3); do sed -i '$d' file; done;
I prefer this solution:
head -$(gcalctool -s $(cat file | wc -l)-N) file
where N is the number of lines to remove.
A POSIX version:
sed -n ':pre
1,4 {N;b pre
}
:cycle
$!{P;N;D;b cycle
}' YourFile
To delete the last 4 lines:
$ nl -b a file | sort -k1,1nr | sed '1, 4 d' | sort -k1,1n | sed 's/^ *[0-9]*\t//'
I came up with this, where n is the number of lines you want to delete (note wc -l < file, so that wc emits only the count and not the filename too):
count=$(wc -l < file)
lines=$(expr "$count" - n)
head -n "$lines" file > temp.txt
mv temp.txt file
rm -f temp.txt
It's a little roundabout, but I think it's easy to follow.
Count up the number of lines in the main file
Subtract the number of lines you want to remove from the count
Print out the number of lines you want to keep and store in a temp file
Replace the main file with the temp file
Remove the temp file
For deleting the last N lines of a file, you can use the same concept of
$ sed '2,4d' file
You can use a combo with the tail command to reverse the file: if N is 5,
$ tail -r file | sed '1,5d' | tail -r > file.tmp && mv file.tmp file
And this way also runs where the head -n -5 file command doesn't run (like on a Mac!).
#!/bin/sh
echo 'Enter the file name: '
read filename
echo 'Enter the number of lines from the end that need to be deleted:'
read n
# Subtract 1 from n to get the offset from the last line
m=`expr $n - 1`
# Calculate the length of the file
len=`wc -l < $filename`
# Calculate the first line that must be deleted
lennew=`expr $len - $m`
sed "$lennew,$ d" $filename
A solution similar to https://stackoverflow.com/a/24298204/1221137 but with in-place editing and no hardcoded number of lines:
n=4
seq $n | xargs -i sed -i -e '$d' my_file
In docker, this worked for me:
head --lines=-N file_path > file_path
Be careful with this form though: in most shells, redirecting to the same file truncates it before head reads it, so it is safer to write to a temporary file and move it back.
Say you have several lines:
$ cat <<EOF > 20lines.txt
> 1
> 2
> 3
[snip]
> 18
> 19
> 20
> EOF
Then you can grab:
# leave last 15 out
$ head -n5 20lines.txt
1
2
3
4
5
# skip first 14
$ tail -n +15 20lines.txt
15
16
17
18
19
20
POSIX-compliant solution using ex / vi, in the vein of @Michel's solution above.
@Michel's ed example uses "not-POSIX" Here-Strings.
Increment the $-1 to remove n lines up to the EOF ($), or just feed it the lines you want to (d)elete. You could use ex to count line numbers or do any other Unix stuff.
Given the file:
cat > sometextfile <<EOF
one
two
three
four
five
EOF
Executing:
ex -s sometextfile <<'EOF'
$-1,$d
%p
wq!
EOF
Returns:
one
two
three
This uses POSIX Here-Docs so it is really easy to modify - especially using set -o vi with a POSIX /bin/sh.
While on the subject, the "ex personality" of "vim" should be fine, but YMMV.
This will remove the last 10 lines:
sed -n -e :a -e '1,10!{P;N;D;};N;ba'

Removing lines from multiple files with sed command

So, disclaimer: I am pretty new to using bash and zsh, so there is a chance the answer is really simple. Nonetheless, I checked previous postings and couldn't find anything. (edit: I have tried this in both bash and zsh shells; same problem.)
I have a directory with many files and am trying to remove the first line from each file.
So say the directory contains: file1.txt file2.txt file3.txt ... etc.
I am using the sed command (non-GNU):
sed -i -e "1d" *.txt
For some reason, this is only removing the first line of the first file. I thought that *.txt would affect all files matching the pattern in the directory. Strangely, it is creating duplicates of the files with -e appended, but both the duplicate and the original are the same.
I tried this with other commands (e.g. ls *.txt) and it works fine. Is there something about sed I am missing?
Thank you in advance.
Different versions of sed on different operating systems support different parameters.
OpenBSD (5.4) sed
The -i flag is unavailable. You can use the following /bin/sh syntax:
for i in *.txt
do
f=`mktemp -p .`
sed -e "1d" "${i}" > "${f}" && mv -- "${f}" "${i}"
done
FreeBSD (11-CURRENT) sed
The -i flag requires an extension, even if it's empty. Thus it must be written as sed -i "" -e "1d" *.txt
GNU sed
The -i flag takes an optional backup suffix, which must be attached directly (e.g. -i.bak); a separate following argument is parsed as an option or script instead, so a bare -i means in-place modification with no backup. With a suffix such as ".bak", it renames the original with ".bak" appended and then writes the modified content under the original file's name.
There might be other variations on other platforms, but those are the three I have at hand.
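For reference, the two GNU spellings side by side (file.txt is a placeholder):
sed -i '1d' file.txt # in-place, no backup
sed -i.bak '1d' file.txt # in-place, original kept as file.txt.bak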
Use it without -e!
For one file use:
sed -i '1d' filename
For all files use:
sed -i '1d' *.txt
or
files=/path/to/files/*.extension ; for var in $files ; do sed -i '1d' $var ; done
For me, on Ubuntu and Debian-based systems, this method works 100%, but I'm not sure about other platforms, so here is another method:
replace the first line with an empty pattern, then remove the empty lines (two commands):
for files in $(ls /path/to/files/*.txt); do sed -i "s/$(head -1 "$files")//g" "$files" ; sed -i '/^$/d' "$files" ; done
Note: if your files contain a slash '/', it will give an error, so in that case the sed command should look like this ( sed -i "s[$(head -1 "$files")[[g" )
Hope that's what you're looking for :)
The issue here is that the line number isn't reset when sed opens a new file, so 1 only matches the first line of the first file.
One solution is to use a shell loop, calling sed once for each file. Gumnos' answer shows how to do this in the most widely compatible way, although if you have a version of sed supporting the -i flag, you could do this instead:
for i in *.txt; do
sed -i.bak '1d' "$i"
done
It is possible to avoid creating the backup file by passing an empty suffix but personally, I don't think it's such a bad thing. One day you'll be grateful for it!
It appears that you're not working with GNU tools but if you were, I would recommend using GNU awk for this task. The variable FNR is useful here, as it keeps track of the record number for each file individually, allowing you to do this:
gawk -i inplace 'FNR>1' *.txt
Using the inplace extension, this allows you to remove the first line from each of your files, by only printing the lines where FNR is greater than 1.
Testing it out:
$ seq 5 > file1
$ seq 5 > file2
$ gawk -i inplace 'FNR>1' file1 file2
$ cat file1
2
3
4
5
$ cat file2
2
3
4
5
The last argument you are passing to sed is the problem.
Try something like this:
var=(`find *txt`)
for file in "${var[@]}"
do
  sed -i -e 1d "$file"
done
This did the trick for me.

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
  old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
  new=$(echo "$i" | awk -F'|' '{print $2}')
  sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly-formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
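To see the script that the first sed generates from the sample replacement_list above:
$ sed -e 's/^/s|/; s/$/|g/' replacement_list
s|old|new|g
s|tobereplaced|replacement|g
s|(stuffiwant).*(too)|\1\2|g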
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and a probably not that widely known MySQL command line utility, replace. replace, being optimized for string replacements, was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.
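A sketch of typical usage with the sample pairs from the question (check your installed version's man page; availability and exact options vary):
replace old new tobereplaced replacement < file > out # stdin to stdout
replace old new tobereplaced replacement -- file # edit the named file in place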
More on sed. You can utilize multiple cores with sed, by splitting your replacements into #cpus groups and then pipe them through sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has a UTF-8 locale set up, then it also boosts performance to place LANG=C in front of the commands:
$ LANG=C sed ...
You can cut down unnecessary awk invocations and use BASH to break name-value pairs:
while IFS='|' read -r old new; do
  # echo "$old :: $new"
  sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' enables read to populate the name and value in two different shell variables, old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
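For example, using # as the delimiter instead (assuming # never occurs in the pairs):
sed -i "s#$old#$new#g" file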
Here is what I would try:
store your sed search-replace pairs in a Bash array;
build your sed command based on this array using parameter expansion;
run the command.
patterns=(
  old new
  tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of patterns
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # don't need to loop on the replacement…
  search=${patterns[i]}
  replace=${patterns[i+1]} # … here we get the replacement part
  sedArgs+=( -e "s/$search/$replace/g" )
done
sed "${sedArgs[@]}" file
This results in the following command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this:
pattern=''
while read -r i
do
  old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
  new=$(echo "$i" | awk -F'|' '{print $2}')
  pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list
sed -r "$pattern" -i file
Note the redirection from replacement_list rather than a pipe from cat: a pipe would run the loop in a subshell, and $pattern would be empty by the time sed runs.
This will run the sed command only once on the file with all the replacements. You may also want to replace awk with cut. cut may be more optimized than awk, though I am not sure about that.
old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them each one by one. The 1 at the end means that the line is printed.
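A quick sanity check with a one-pair list (hypothetical contents):
$ printf 'old|new\n' > replacement_list
$ echo 'the old way' > file
$ awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
the new way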
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
This one is more for the fun of coding it in sed. It may be worth timing, since it starts only one sed, used recursively.
For POSIX sed (so --posix with GNU sed).
Explanation:
copy the replacement list in front of the file content, with delimiters (³ for the lines and -End- for the list), for easier sed handling (it is hard to use \n in a character class in POSIX sed)
place all lines in the buffer (adding the line delimiter for the replacement list, and -End- before it)
if this is -End-³, remove the line and go to the final print
replace each first pattern (group 1) found in the text by the second pattern (group 2)
if found, restart (t again)
remove the first line
restart the process (t again). t is needed because b does not reset the test, so the next t would always be true.
Thanks to @miku above;
I have a 100MB file with a list of 80k replacement-strings.
I tried various combinations of seds, sequential and parallel, but didn't see runtimes shorter than about 20 hours.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 mins, a lot less than 20 hours!
Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
To expand on chthonicdaemon's solution:
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$@"
prefixing lines with -+/ allows empty "plus" values, and protects leading whitespace from buggy text editors
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/;;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
this regex format is compatible with sed and perl
Since miku mentioned mysql replace:
replacing fixed strings with regex is non-trivial, since you must escape all regex chars, but you also must handle backslash escapes ...
A naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n

How to loop sed to get a variable

I have a CSV file of 15000 rows. From the list I want to delete the unwanted products/manufacturers. I have a list with manufacturers and the source CSV file.
I found that sed would be appropriate, but I'm stuck on the loop.
while read line
do
unwanted = $
sed "|"$unwanted|d" /home/arno/pixtmp/pixtmp.csv >/home/arno/pixtmp/pix-clean.c$
done < /home/bankey/shopimport/unwanted.txt
Any help is appreciated.
Inputfile:
CONSUMABLES;Inktpatronen voor printer;Inkt voor printer;B0137790;HP;Pakket 2 inktpatronen No339 - Zwart + Papier Goodway - 80 g/m² - A4 - 500 vel;Dit pakket van 2 inktpatronen nr 339 zijn ontworpen voor uw HP printer en leveren afdrukken van kwaliteit.;47.19;6.99;47.19;http://pan8.fotovista.com/dev/8/5/32150358/l_32150358.jpg;in stock;0.2;0.11201;9.99;;C9504EE;0;;
I'd use sed in two steps:
Create the sed script from the unwanted information.
Apply the created script to the data file.
That might be:
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed 's%.*%/,&,/d%' $unwanted > sed.script
sed -f sed.script $datafile > $cleaned
rm -f sed.script
The first invocation of sed simply replaces the contents of each line describing unwanted records with a sed command that will delete it as a comma-separated field in the middle of a data line. If you have to handle unwanted fields at the beginning or the end too, then you have to work harder. You also have to work harder if there might be embedded slashes, commas, quotes etc. The second invocation of sed applies the script created by the first to the data file, generating the cleaned file.
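For instance, if unwanted.txt contained the single line HP, sed.script would hold /,HP,/d. Since the sample input above is actually semicolon-separated, you would likely want semicolons as the field delimiters instead (an adjustment I am inferring from the sample line):
sed 's%.*%/;&;/d%' $unwanted > sed.script # yields /;HP;/d, matching HP as a ;-delimited field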
You can improve it by ensuring the script file name is unique, and by trapping the script file if the process is interrupted:
tmp=$(mktemp /tmp/script.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15 # EXIT, HUP, INT, QUIT, PIPE, TERM
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed 's%.*%/,&,/d%' $unwanted > $tmp
sed -f $tmp $datafile > $cleaned
rm -f $tmp
trap 0 # Cancel the exit trap
With GNU sed, but not with Mac OS X (BSD) sed, you could avoid the intermediate file thus:
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed 's%.*%/,&,/d%' $unwanted |
sed -f - $datafile > $cleaned
This tells the second sed to read its script from standard input. If you have bash version 4.x (not standard on Mac OS X), you could use process substitution instead:
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed -f <(sed 's%.*%/,&,/d%' $unwanted) $datafile > $cleaned
You have to make sure that each loop cycle takes the output file from the previous cycle as the input file, otherwise you'll keep overwriting the output file with the content of the original file minus the last unwanted record.
If your sed command supports inline editing (option -i) you can do this:
cp /home/arno/pixtmp/pixtmp.csv /home/arno/pixtmp/pix-clean.csv
while read line; do
  sed -i "/$line/d" /home/arno/pixtmp/pix-clean.csv
done < /home/bankey/shopimport/unwanted.txt
Otherwise you have to handle the temporary file yourself:
cp /home/arno/pixtmp/pixtmp.csv /home/arno/pixtmp/pix-clean.csv
while read line; do
  sed "/$line/d" /home/arno/pixtmp/pix-clean.csv >/home/arno/pixtmp/pix-clean.c$
  mv -f /home/arno/pixtmp/pix-clean.c$ /home/arno/pixtmp/pix-clean.csv
done < /home/bankey/shopimport/unwanted.txt
sed is less suited to this than awk. For example, assuming your input file and your list of undesired terms are space-delimited, you could simply do:
awk 'NR==FNR { a[$0]++ } NR != FNR && !a[$1]' undesired input
This will print out the 'input' file, omitting any line in which the first column matches a line in the file 'undesired'.
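A small illustration (hypothetical contents):
$ printf '%s\n' foo > undesired
$ printf '%s\n' 'foo 1' 'bar 2' > input
$ awk 'NR==FNR { a[$0]++ } NR != FNR && !a[$1]' undesired input
bar 2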
