How to remove duplicate lines in a file? - bash

I understand that the general approach is to use something like
$ sort file1.txt | uniq > file2.txt
But I was wondering if there was a way to do this without needing separate source and destination files, even if it means it can't be a one-liner.

Simply use the -o and -u options of sort:
sort -o file -u file
You don't even need to pipe into another command such as uniq.
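One caveat worth showing: sort -u also reorders the file, so it only fits when the original line order does not matter. A minimal sketch, assuming a throwaway file called fruits.txt (a made-up name) in a scratch directory:

```shell
# Demo in a scratch directory; fruits.txt is a hypothetical example file.
workdir=$(mktemp -d)
printf '%s\n' pear apple pear banana apple > "$workdir/fruits.txt"

# -u drops duplicate lines, -o writes the result back over the input file.
# POSIX allows the -o output file to be one of the input files.
sort -u -o "$workdir/fruits.txt" "$workdir/fruits.txt"

cat "$workdir/fruits.txt"
```

The file now holds apple, banana and pear, each once, in sorted order rather than in order of first appearance.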

With GNU awk for "inplace" editing:
awk -i inplace '!seen[$0]++' file1.txt
As with all tools (except ed which requires the whole file to be read into memory first) that support "inplace" editing (sed -i, perl -i, ruby -i, etc.) this uses a temp file behind the scenes.
With any awk you can do the following with no temp files used but about twice the memory used instead:
awk '!seen[$0]++{a[++n]=$0} END{for (i=1;i<=n;i++) print a[i] > FILENAME}' file
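For anyone unsure why the '!seen[$0]++' pattern works, here is a small sketch that runs it on standard input instead of a file (the sample lines are made up):

```shell
# seen[$0]++ evaluates to how many times the line was seen before (0 the first time),
# so !seen[$0]++ is true only for the first occurrence and awk prints just that one.
deduped=$(printf '%s\n' b a b c a | awk '!seen[$0]++')
printf '%s\n' "$deduped"
```

Unlike sort -u, this keeps the lines in their original order.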

With Perl's -i:
perl -i -lne 'print unless $seen{$_}++' original.file
-i changes the file "in place";
-n reads the input line by line, running the code for each line;
-l removes newlines from input and adds them to print;
The %seen hash idiom is described in perlfaq4.

A common idiom is:
temp=$(mktemp)
some_pipeline < original.file > "$temp" && mv "$temp" original.file
The && is important: if the pipeline fails, then the original file won't be overwritten with (perhaps) garbage.
The Linux moreutils package contains a program, sponge, that encapsulates this:
some_pipeline < original.file | sponge original.file
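If moreutils is not installed, the temp-file idiom can be wrapped in a small function. This is only a sketch: dedupe_in_place is a hypothetical helper name, and the awk de-duplication stands in for whatever pipeline you need.

```shell
# dedupe_in_place is a hypothetical helper name, not a standard command.
dedupe_in_place() {
    target=$1
    # Create the temp file next to the target so that mv is a rename on the same filesystem.
    tmp=$(mktemp "$target.XXXXXX") || return 1
    if awk '!seen[$0]++' "$target" > "$tmp"; then
        mv "$tmp" "$target"
    else
        # The pipeline failed: leave the original untouched and clean up.
        rm -f "$tmp"
        return 1
    fi
}

workdir=$(mktemp -d)
printf '%s\n' b a b c a > "$workdir/demo.txt"
dedupe_in_place "$workdir/demo.txt"
cat "$workdir/demo.txt"
```

Note that the replaced file takes on the ownership and permissions of the temp file (mktemp creates it readable and writable by the owner only), so restore them afterwards if that matters.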

Related

awk command inside for loop to read and write multiple files [duplicate]

I am learning awk and I would like to know if there is an option to write changes to file, similar to sed where I would use -i option to save modifications to a file.
I do understand that I could use redirection to write changes. However is there an option in awk to do that?
In GNU Awk 4.1.0 (released 2013) and later, it has the option of "inplace" file editing:
[...] The "inplace" extension, built using the new facility, can be used to simulate the GNU "sed -i" feature. [...]
Example usage:
$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3
To keep the backup:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }
> { print }' file1 file2 file3
Unless you have GNU awk 4.1.0 or later...
You won't have such an option as sed's -i option so instead do:
$ awk '{print $0}' file > tmp && mv tmp file
Note: -i is not magic; sed also creates a temporary file, it just handles it for you.
As of GNU awk 4.1.0...
GNU awk added this functionality in version 4.1.0 (released 10/05/2013). It is not as straightforward as just giving the -i option, as described in the release notes:
The new -i option (from xgawk) is used for loading awk library files. This differs from -f in that the first non-option argument
is treated as a script.
You need to use the bundled inplace.awk include file to invoke the extension properly like so:
$ cat file
123 abc
456 def
789 hij
$ gawk -i inplace '{print $1}' file
$ cat file
123
456
789
The variable INPLACE_SUFFIX can be used to specify the extension for a backup file:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{print $1}' file
$ cat file
123
456
789
$ cat file.bak
123 abc
456 def
789 hij
I am happy this feature has been added but to me, the implementation isn't very awkish as the power comes from the conciseness of the language and -i inplace is 8 characters too long i.m.o.
Here is a link to the manual for the official word.
just a little hack that works
echo "$(awk '{awk code}' file)" > file
@sudo_O has the right answer.
This can't work:
someprocess < file > file
The shell performs the redirections before handing control over to someprocess. The > redirection truncates the file to zero size. Therefore, by the time someprocess is launched and tries to read from the file, there is no data left for it to read.
An alternative is to use sponge:
awk '{print $0}' your_file | sponge your_file
Where you replace '{print $0}' by your awk script and your_file by the name of the file you want to edit in place.
sponge soaks up all of its input before writing it to the file.
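The truncation described above is easy to observe. A minimal sketch, using tr as a stand-in for someprocess and a made-up file data.txt:

```shell
workdir=$(mktemp -d)
printf '%s\n' one two three > "$workdir/data.txt"

# The shell opens data.txt for writing (truncating it) before tr starts,
# so tr finds nothing to read and the file ends up empty.
tr 'a-z' 'A-Z' < "$workdir/data.txt" > "$workdir/data.txt"
size_after=$(wc -c < "$workdir/data.txt")
echo "bytes left: $size_after"

# Writing to a temporary file first avoids the problem.
printf '%s\n' one two three > "$workdir/data.txt"
tr 'a-z' 'A-Z' < "$workdir/data.txt" > "$workdir/data.tmp" && mv "$workdir/data.tmp" "$workdir/data.txt"
cat "$workdir/data.txt"
```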
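The following won't work, because the unquoted command substitution is word-split and the line breaks are lost:
echo $(awk '{awk code}' file) > file
This should work, since the quotes preserve the line breaks:
echo "$(awk '{awk code}' file)" > file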
In case you want an awk-only solution that creates no temporary file and works with awk versions that lack gawk's inplace extension:
awk '{a[b++]=$0} END {for(c=0;c<b;c++)print a[c]>ARGV[1]}' file

How to remove consecutive repeating characters from every line?

I have the below lines in a file
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;;;;
Acanthocephala;;;;;;;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Polymorphus;;
and I want to remove the repeating semi-colon characters from all lines to look like below (note- there are repeating semi-colons in the middle of some of the above lines too)
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;
Acanthocephala;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Polymorphus;
I would appreciate if someone could kindly share a bash one-liner to accomplish this.
You can use tr with "squeeze":
tr -s ';' < infile
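A quick sketch of what the squeeze option does, fed from standard input (tr never takes file names as arguments, hence the redirect above). The second sample line is made up:

```shell
# -s replaces every run of the listed character with a single occurrence.
squeezed=$(printf '%s\n' 'Acanthocephala;;;;;;;' 'a;b;;c;;;d;' | tr -s ';')
printf '%s\n' "$squeezed"
```

This prints "Acanthocephala;" and "a;b;c;d;".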
perl -p -e 's/;+/;/g' myfile # writes output to stdout
or
perl -p -i -e 's/;+/;/g' myfile # does an in-place edit
If you want to edit the file itself:
printf "%s\n" 'g/;;/s/;\{2,\}/;/g' w | ed -s foo.txt
If you want to pipe a modified copy of the file to something else and leave the original unchanged:
sed 's/;\{2,\}/;/g' foo.txt | whatever
These replace runs of 2 or more semicolons with single ones.
This can be solved easily with substitutions.
Here is an awk solution that plays with the FS and OFS variables:
awk -F';+' -v OFS=';' '$1=$1' file
or
awk -F';+' -v OFS=';' '($1=$1)||1' file
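The two variants differ when the first field is empty (or 0): the value of the assignment $1=$1 is then false, so the shorter form silently drops the line. A small sketch with a made-up input line that starts with semicolons:

```shell
line=';;a;;b'

# $1 is empty here, so the pattern $1=$1 is false and nothing is printed.
first=$(printf '%s\n' "$line" | awk -F';+' -v OFS=';' '$1=$1')

# ||1 makes the pattern always true, so the rebuilt line is printed.
second=$(printf '%s\n' "$line" | awk -F';+' -v OFS=';' '($1=$1)||1')

printf 'first=[%s] second=[%s]\n' "$first" "$second"
```

This prints first=[] second=[;a;b]. None of the sample lines in the question start with a semicolon, so both forms give the same result there.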
Here's a sed version of alaniwi's answer:
sed 's/;\+/;/g' myfile # Write output to stdout
or
sed -i 's/;\+/;/g' myfile # Edit the file in-place

How to delete a line (matching a pattern) from a text file? [duplicate]

How would I use sed to delete all lines in a text file that contain a specific string?
To remove the line and print the output to standard out:
sed '/pattern to match/d' ./infile
To directly modify the file – does not work with BSD sed:
sed -i '/pattern to match/d' ./infile
Same, but for BSD sed (Mac OS X and FreeBSD) – does not work with GNU sed:
sed -i '' '/pattern to match/d' ./infile
To directly modify the file (and create a backup) – works with BSD and GNU sed:
sed -i.bak '/pattern to match/d' ./infile
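If the backup is not wanted, the portable -i.bak form can simply be followed by a cleanup step. A sketch, assuming a scratch file named infile with made-up contents:

```shell
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma > "$workdir/infile"

# The suffix is attached to -i with no space, which both GNU and BSD sed accept.
sed -i.bak '/beta/d' "$workdir/infile"

cat "$workdir/infile"        # edited file: alpha, gamma
cat "$workdir/infile.bak"    # untouched original

# Remove the backup once the result has been checked.
rm "$workdir/infile.bak"
```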
There are many other ways to delete lines with specific string besides sed:
AWK
awk '!/pattern/' file > temp && mv temp file
Ruby (1.9+)
ruby -i.bak -ne 'print if not /test/' file
Perl
perl -ni.bak -e "print unless /pattern/" file
Shell (bash 3.2 and later)
while IFS= read -r line
do
[[ ! $line =~ pattern ]] && printf '%s\n' "$line"
done <file > o
mv o file
GNU grep
grep -v "pattern" file > temp && mv temp file
And of course sed (printing the inverse is faster than actual deletion):
sed -n '/pattern/!p' file
You can use sed to replace lines in place in a file. However, it seems to be much slower than using grep for the inverse into a second file and then moving the second file over the original.
e.g.
sed -i '/pattern/d' filename
or
grep -v "pattern" filename > filename2; mv filename2 filename
The first command takes 3 times longer on my machine anyway.
The easy way to do it, with GNU sed:
sed --in-place '/some string here/d' yourfile
You may consider using ex (which is a standard Unix command-based editor):
ex +g/match/d -cwq file
where:
+ executes given Ex command (man ex), same as -c which executes wq (write and quit)
g/match/d - Ex command to delete lines with given match, see: Power of g
The above example is a POSIX-compliant method for in-place editing a file as per this post at Unix.SE and POSIX specifications for ex.
The difference with sed is that:
sed is a Stream EDitor, not a file editor. (BashFAQ)
Using it to edit files in place brings unportable code, I/O overhead and some other bad side effects. Some parameters (such as in-place/-i) are non-standard FreeBSD extensions and may not be available on other operating systems.
I was struggling with this on Mac. Plus, I needed to do it using variable replacement.
So I used:
sed -i '' "/$pattern/d" "$file"
where $file is the file where deletion is needed and $pattern is the pattern to be matched for deletion.
I picked the '' from this comment.
The thing to note here is the use of double quotes in "/$pattern/d". The variable is not expanded inside single quotes.
You can also use this:
grep -v 'pattern' filename
Here -v inverts the match, so only lines that do not match the pattern are printed.
To get an in-place-like result with grep you can do this:
echo "$(grep -v "pattern" filename)" >filename
I have made a small benchmark with a file which contains approximately 345 000 lines. The way with grep seems to be around 15 times faster than the sed method in this case.
I have tried both with and without the setting LC_ALL=C; it does not seem to change the timings significantly. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.
Here are the commands and the timings:
time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt
real 0m0.711s
user 0m0.179s
sys 0m0.530s
time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt
real 0m0.105s
user 0m0.088s
sys 0m0.016s
time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )
real 0m0.046s
user 0m0.014s
sys 0m0.019s
Delete matching lines from all files that contain the match:
grep -rl 'text_to_search' . | xargs sed -i '/text_to_search/d'
SED:
sed '/James\|John/d' file
sed -n '/James\|John/!p' file
AWK:
awk '!/James|John/' file
awk '/James|John/ {next;} {print}' file
GREP:
grep -v 'James\|John' file
perl -i -nle'/regexp/||print' file1 file2 file3
perl -i.bk -nle'/regexp/||print' file1 file2 file3
The first command edits the file(s) inplace (-i).
The second command does the same thing but keeps a copy or backup of the original file(s) by adding .bk to the file names (.bk can be changed to anything).
You can also delete a range of lines in a file.
For example to delete stored procedures in a SQL file.
sed '/CREATE PROCEDURE.*/,/END ;/d' sqllines.sql
This will remove all lines from CREATE PROCEDURE through END ;, inclusive.
I have cleaned up many SQL files with this sed command.
echo -e "/thing_to_delete\ndd\033:x\n" | vim file_to_edit.txt
Just in case someone wants to do it for exact matches of strings, you can use the -w flag in grep - w for whole. That is, for example if you want to delete the lines that have number 11, but keep the lines with number 111:
-bash-4.1$ head file
1
11
111
-bash-4.1$ grep -v "11" file
1
-bash-4.1$ grep -w -v "11" file
1
111
It also works with the -f flag if you want to exclude several exact patterns at once. If "blacklist" is a file with one pattern per line that you want to delete from "file":
grep -w -v -f blacklist file
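A sketch of the blacklist variant, with made-up contents for both files:

```shell
workdir=$(mktemp -d)
printf '%s\n' 1 11 111 22 '22 apples' 221 > "$workdir/file"
printf '%s\n' 11 22 > "$workdir/blacklist"

# -f reads one pattern per line, -w requires whole-word matches, -v inverts the selection.
kept=$(grep -w -v -f "$workdir/blacklist" "$workdir/file")
printf '%s\n' "$kept"
```

This keeps 1, 111 and 221. Note that -w matches a whole word anywhere in the line, so "22 apples" is dropped too; if only lines that consist of exactly the pattern should go, -x (whole-line match) is the option to use instead.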
to show the treated text in console
cat filename | sed '/text to remove/d'
to save treated text into a file
cat filename | sed '/text to remove/d' > newfile
to append treated text to an existing file
cat filename | sed '/text to remove/d' >> newfile
to filter already treated text further, in this case removing more lines on top of what has already been removed
cat filename | sed '/text to remove/d' | sed '/remove this too/d' | more
the | more will show text in chunks of one page at a time.
Curiously enough, the accepted answer does not actually answer the question directly. The question asks about using sed to delete lines containing a specific string, but the answer seems to presuppose knowledge of how to convert an arbitrary string into a regex.
Many programming language libraries have a function to perform such a transformation, e.g.
python: re.escape(STRING)
ruby: Regexp.escape(STRING)
java: Pattern.quote(STRING)
But how to do it on the command line?
Since this is a sed-oriented question, one approach would be to use sed itself:
sed 's/\([\[/({.*+^$?]\)/\\\1/g'
So given an arbitrary string $STRING we could write something like:
re=$(sed 's/\([\[/({.*+^$?]\)/\\\1/g' <<< "$STRING")
sed "/$re/d" FILE
or as a one-liner:
sed "/$(sed 's/\([\[/({.*+^$?]\)/\\\1/g' <<< "$STRING")/d"
with variations as described elsewhere on this page.
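One thing to watch with this approach: sed uses basic regular expressions by default, where a backslash in front of ( or { turns it into an operator (and GNU sed does the same for + and ?), so escaping those characters can break the pattern. The sketch below is a variation, not the answer's original command: it escapes only the characters that are special in a basic regular expression, plus the / delimiter. The sample file and string are made up.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'keep me' 'price is $5.00 (approx)' 'price is $5X00 (approx)' > "$workdir/lines.txt"

string='$5.00 (approx)'

# Escape ] [ \ . * ^ $ and / with a backslash. The s command uses | as its
# delimiter so that the / inside the bracket expression needs no special care.
re=$(printf '%s\n' "$string" | sed 's|[][\.*^$/]|\\&|g')

# The escaped text is now safe to use as a literal match in a /.../ address.
result=$(sed "/$re/d" "$workdir/lines.txt")
printf '%s\n' "$result"
```

Only the line containing the literal text "$5.00 (approx)" is deleted; the "$5X00" line survives because the dot no longer matches any character.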
cat filename | grep -v "pattern" > filename.1
mv filename.1 filename
You can use good old ed to edit a file in a similar fashion to the answer that uses ex. The big difference in this case is that ed takes its commands via standard input, not as command line arguments like ex can. When using it in a script, the usual way to accommodate this is to use printf to pipe commands to it:
printf "%s\n" "g/pattern/d" w | ed -s filename
or with a heredoc:
ed -s filename <<EOF
g/pattern/d
w
EOF
This solution is for doing the same operation on multiple files.
for file in *.txt; do grep -v "Matching Text" "$file" > temp_file.txt; mv temp_file.txt "$file"; done
I found most of the answers not useful for me. If you use vim, I found this very easy and straightforward:
:g/<pattern>/d
Source

Removing lines from multiple files with sed command

So, disclaimer: I am pretty new to using bash and zsh, so there is a chance the answer is really simple. Nonetheless. I checked previous postings and couldn't find anything. (edit: I have tried this in both bash and zsh shells- same problem.)
I have a directory with many files and am trying to remove the first line from each file.
So say the directory contains: file1.txt file2.txt file3.txt ... etc.
I am using the sed command (non-GNU):
sed -i -e "1d" *.txt
For some reason, this is only removing the first line of the first file. I thought that the *.txt would affect all files matching the pattern in directory. Strangely, it is creating the file duplicates with -e appended, but both the duplicate and original are the same.
I tried this with other commands (e.g. ls *.txt) and it works fine. Is there something about sed I am missing?
Thank you in advance.
Different versions of sed in differing operating systems support various parameters.
OpenBSD (5.4) sed
The -i flag is unavailable. You can use the following /bin/sh syntax:
for i in *.txt
do
f=`mktemp -p .`
sed -e "1d" "${i}" > "${f}" && mv -- "${f}" "${i}"
done
FreeBSD (11-CURRENT) sed
The -i flag requires an extension, even if it's empty. Thus it must be written as sed -i "" -e "1d" *.txt
GNU sed
This looks to see if the argument following -i is another option (or possibly a command). If so, it assumes an in-place modification. If it appears to be a file extension such as ".bak", it will rename the original with the ".bak" and then modify it into the original file's name.
There might be other variations on other platforms, but those are the three I have at hand.
Use it without -e!
For one file use:
sed -i '1d' filename
For all files use:
sed -i '1d' *.txt
or
files=/path/to/files/*.extension ; for var in $files ; do sed -i '1d' "$var" ; done
I use Ubuntu and Debian-based systems, and this method works reliably for me, but I'm not sure about other platforms, so here is another method:
Replace the first line with an empty pattern, then remove empty lines (two commands):
for files in /path/to/files/*.txt; do sed -i "s/$(head -1 "$files")//g" "$files" ; sed -i '/^$/d' "$files" ; done
Note: if your files contain a slash '/', this will give an error, so in that case the sed command should look like this: sed -i "s[$(head -1 "$files")[[g". Be aware that this method also removes any other occurrence of that text and every blank line in the file.
Hope that's what you're looking for :)
The issue here is that the line number isn't reset when sed opens a new file, so 1 only matches the first line of the first file.
One solution is to use a shell loop, calling sed once for each file. Gumnos' answer shows how to do this in the most widely compatible way, although if you have a version of sed supporting the -i flag, you could do this instead:
for i in *.txt; do
sed -i.bak '1d' "$i"
done
It is possible to avoid creating the backup file by passing an empty suffix but personally, I don't think it's such a bad thing. One day you'll be grateful for it!
It appears that you're not working with GNU tools but if you were, I would recommend using GNU awk for this task. The variable FNR is useful here, as it keeps track of the record number for each file individually, allowing you to do this:
gawk -i inplace 'FNR>1' *.txt
Using the inplace extension, this allows you to remove the first line from each of your files, by only printing the lines where FNR is greater than 1.
Testing it out:
$ seq 5 > file1
$ seq 5 > file2
$ gawk -i inplace 'FNR>1' file1 file2
$ cat file1
2
3
4
5
$ cat file2
2
3
4
5
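The difference between FNR and NR can also be seen with any awk, without the inplace extension. A small sketch with two made-up files, printing to standard output only:

```shell
workdir=$(mktemp -d)
printf '%s\n' a1 a2 a3 > "$workdir/a.txt"
printf '%s\n' b1 b2 b3 > "$workdir/b.txt"

# NR keeps counting across files, so only the very first line overall is dropped.
nr_out=$(awk 'NR>1' "$workdir/a.txt" "$workdir/b.txt")

# FNR restarts at 1 for every file, so each file loses its first line.
fnr_out=$(awk 'FNR>1' "$workdir/a.txt" "$workdir/b.txt")

printf 'NR>1:  %s\n' "$(printf '%s' "$nr_out" | tr '\n' ' ')"
printf 'FNR>1: %s\n' "$(printf '%s' "$fnr_out" | tr '\n' ' ')"
```

With NR>1 the line b1 is still printed; with FNR>1 both a1 and b1 are gone.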
The last argument you are passing to sed is the problem.
Try something like this:
var=(`find *txt`)
for file in "${var[@]}"
do
sed -i -e 1d "$file"
done
done
This did the trick for me.

awk execute same command on different files one by one

Hi, I have 30 txt files in a directory, each containing 4 columns.
How can I execute the same command on each file one by one and direct the output to a different file?
The command I am using is below, but it is being applied to all the files and gives a single output. All I want is to process each file one by one and direct each output to a new file.
start=$1
patterns=''
for i in $(seq -43 -14); do
patterns="$patterns /cygdrive/c/test/kpi/SIGTRAN_Load_$(exec date '+%Y%m%d' --date="-${i} days ${start}")*"; done
cat /cygdrive/c/test/kpi/*$patterns | sed -e "s/\t/,/g" -e "s/ /,/g"| awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2> /cygdrive/c/test/kpi/SIGTRAN_Load.csv
Something like this:
for fileName in /path/to/files/foo*.txt
do
mangleFile "$fileName"
done
will mangle a list of files you give via globbing. If you want to generate the file name patterns as in your example, you can do it like this:
for i in $(seq -43 -14)
do
for fileName in /cygdrive/c/test/kpi/SIGTRAN_Load_"$(exec date '+%Y%m%d' --date="-${i} days ${start}")"*
do
mangleFile "$fileName"
done
done
This way the code stays much more readable, even if shorter solutions may exist.
The mangleFile of course then will be the awk call or whatever you would like to do with each file.
Use the following idiom:
for file in *
do
./your_shell_script_containing_the_above.sh "$file" > some_unique_id
done
You need to run a loop on all the matching files:
for i in /cygdrive/c/test/kpi/*$patterns; do
tr '\t ' ',,' < "$i" | awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2 > "/cygdrive/c/test/kpi/SIGTRAN_Load-$(basename "$i").csv"
done
done
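To see the per-file pattern in isolation, here is a self-contained sketch. The directory layout, file names and sample rows are all made up; only the awk program (maximum of column 4 per value of column 3) comes from the question.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/in" "$workdir/out"
printf '%s\n' 'x,y,linkA,5' 'x,y,linkA,9' 'x,y,linkB,3' > "$workdir/in/day1.txt"
printf '%s\n' 'x,y,linkA,2' > "$workdir/in/day2.txt"

for path in "$workdir"/in/*.txt; do
    # Derive the output name from the input name, without directory or extension.
    name=$(basename "$path" .txt)
    awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a) print i FS a[i]}' "$path" |
        sort -t, -k1,1 > "$workdir/out/$name.csv"
done

cat "$workdir/out/day1.csv"
cat "$workdir/out/day2.csv"
```

Each input file gets its own output file, so day1.csv holds linkA,9 and linkB,3 while day2.csv holds linkA,2.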
PS: I haven't tried much to refactor your piped commands that can probably be shortened too.

Resources