How to find duplicate lines in a file? - sorting

I have an input file with the following data:
line1
line2
line3
begin
line5
line6
line7
end
line9
line1
line3
I am trying to find all the duplicate lines. I tried
sort filename | uniq -c
but it does not seem to be working for me.
It gives me:
1 begin
1 end
1 line1
1 line1
1 line2
1 line3
1 line3
1 line5
1 line6
1 line7
1 line9
This question may seem like a duplicate of Find duplicate lines in a file and count how many time each line was duplicated?
but the nature of the input data is different.
Please suggest.

use this:
sort filename | uniq -d
man uniq
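A minimal check on the sample data above (a sketch; the trailing-whitespace cleanup at the end is a guess at why uniq -c reported every line with a count of 1):

```shell
# rebuild the sample input from the question
printf '%s\n' line1 line2 line3 begin line5 line6 line7 end line9 line1 line3 > input.txt

# -d prints one copy of each line that occurs more than once
sort input.txt | uniq -d

# if uniq -c shows the same line twice with count 1, the two copies
# probably differ in trailing whitespace; strip it before comparing
sed 's/[[:space:]]*$//' input.txt | sort | uniq -d
```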

try
sort -u file
or
awk '!a[$0]++' file
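The difference between the two: sort -u reorders the output, while the awk one-liner keeps the first occurrence of each line in the original input order. A small sketch:

```shell
printf '%s\n' line3 line1 line3 line1 line2 > demo.txt

# sorted unique lines
sort -u demo.txt

# unique lines, first occurrences kept in input order:
# a[$0]++ is 0 (false) the first time a line is seen,
# so !a[$0]++ is true exactly once per distinct line
awk '!a[$0]++' demo.txt
```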

You'll have to modify the standard de-dupe code just a tiny bit to account for this.
If you want a unique copy of each duplicate, then it's very much the same idea:
{m,g}awk 'NF~ __[$_]++' FS='^$'
{m,g}awk '__[$_]++==!_'
If you want every copy printed for duplicates, then whenever the condition first yields true, print 2 copies of it, and print further matches along the way.
Usually it's far faster to de-dupe first and then sort, instead of the other way around.
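In plain POSIX awk the same two ideas read roughly as follows (a sketch; seen[] plays the role of __[] above):

```shell
printf '%s\n' line1 line2 line1 line3 line1 line3 > dups.txt

# one copy of each duplicated line: the condition fires
# only on the 2nd occurrence
awk 'seen[$0]++ == 1' dups.txt

# every occurrence from the 2nd onward
awk 'seen[$0]++' dups.txt
```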

Related

Move all lines between header and footer into new file in unix

I have file records like below, header, data & footer records.
I need to move only the data part to another file. The new file should only contain the lines between Header2 and Footer1.
I have tried
head -n 30 filename | tail -n 10 > newfile
but that won't work, as the data record counts may vary.
Example records from the source file:
Header1
Header2
Header3
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
Footer1
Footer2
Footer3
Output file should have:
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
There are different ways.
grep:
grep -v -E "Header|Footer" source.txt
awk:
awk '! /Header.|Footer./ { print }' source.txt
You can replace the "Header" and "Footer" values with whatever you use to identify those lines.
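If you want only the block between the markers rather than every non-header line, a flag-based awk sketch also works. It assumes the literal marker lines Header3 and Footer1 from the example; adjust the patterns to your real markers:

```shell
cat > source.txt <<'EOF'
Header1
Header2
Header3
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
Footer1
Footer2
Footer3
EOF

# clear the flag at Footer1, print while the flag is set,
# and set the flag after seeing Header3 (both markers excluded)
awk '/^Footer1$/{f=0} f; /^Header3$/{f=1}' source.txt > newfile
cat newfile
```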

Merging Two, Nearly Similar Text Files

Suppose we have ~/file1:
line1
line2
line3
...and ~/file2:
line1
lineNEW
line3
Notice that these two files are nearly identical, except that line2 differs from lineNEW.
Question: How can I merge these two files to produce one that reads as follows:
line1
line2
lineNEW
line3
That is, how can I merge the two files so that all unique lines are captured (without overlap) into a third file? Note that the order of the lines doesn't matter (as long as all unique lines are being captured).
awk '{
print
getline line < second
if ($0 != line) print line
}' second=file2 file1
will do it
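A self-contained run of that script (a sketch with the two files created inline):

```shell
printf '%s\n' line1 line2 line3 > file1
printf '%s\n' line1 lineNEW line3 > file2

# walk file1 and file2 in lockstep: always print file1's line,
# and also print file2's line whenever the two differ
awk '{
  print
  getline line < second
  if ($0 != line) print line
}' second=file2 file1
```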
Consider the command below. It is more robust, since it also works for files where a new line has been added instead of replaced (see f1 and f2 below).
First, I executed it using your files. I split the command over two lines so that it fits nicely in the "code block":
$ (awk '{ print NR, $0 }' file1; awk '{ print NR, $0 }' file2) |\
sort -k 2 | uniq -f 1 | sort | cut -d " " -f 2-
It produces your expected output:
line1
line2
lineNEW
line3
I also used these two extra files to test it:
f1:
line1 stuff after a tab
line2 line2
line3
line4
line5
line6
f2:
line1 stuff after a tab
lineNEW
line2 line2
line3
line4
line5
line6
Here is the command:
$ (awk '{ print NR, $0 }' f1; awk '{ print NR, $0 }' f2) |\
sort -k 2 | uniq -f 1 | sort | cut -d " " -f 2-
It produces this output:
line1 stuff after a tab
line2 line2
lineNEW
line3
line4
line5
line6
When you do not care about the order, just sort them:
cat ~/file1 ~/file2 | sort -u > ~/file3

awk or sed - delete specific lines

I've got this
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10
line11
line12
line13
line14
line15
I want this
line1
line3
line5
line6
line8
line10
line11
line13
line15
As you can see, the lines to be deleted alternate at offsets of x+2 and then x+3, where x is the previous deleted line number.
I tried this with awk, but it is not the right way:
awk '(NR)%2||(NR)%3' file > filev1
Any ideas why?
If I decipher your requirements correctly, then
awk 'NR % 5 != 2 && NR % 5 != 4' file
should do.
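You can check it quickly against generated input (a sketch; seq -f numbers the lines line1 through line15):

```shell
# lines 2,4,7,9,12,14 have NR%5 equal to 2 or 4 and are dropped
seq -f 'line%g' 15 | awk 'NR % 5 != 2 && NR % 5 != 4'
```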
Based on Wintermute's logic :)
awk 'NR%5!~/^[24]$/' file
line1
line3
line5
line6
line8
line10
line11
line13
line15
or
awk 'NR%5~/^[013]$/' file
How it works:
We can see from your lines that the ones marked with * should be removed and the others kept.
line1
line2*
line3
line4*
line5
line6
line7*
line8
line9*
line10
line11
line12*
line13
line14*
line15
By grouping the data into blocks of 5 lines with NR%5,
we see that the lines to delete are numbers 2 and 4 in every group.
In NR%5!~/^[24]$/, the NR%5 divides the data into groups of 5,
and the /^[24]$/ part tells awk not to keep positions 2 or 4.
The ^ and $ are important so that lines like 12 or 47 are not deleted too,
since 12 contains a 2. So we need to anchor it: ^2$ and ^4$.
Using GNU sed, you can do the following command:
sed '2~5d;4~5d' test.txt
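The GNU-only first~step address matches line first and then every step-th line after it, so 2~5 deletes lines 2, 7, 12, … and 4~5 deletes lines 4, 9, 14, …. A quick check that it agrees with the awk answers (requires GNU sed):

```shell
seq -f 'line%g' 15 > test.txt
sed '2~5d;4~5d' test.txt
```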

replacing word in shell script or sed

I am a newbie, but would like to create a script which does the following.
Suppose I have a file of the form
This is line1
This is line2
This is line3

This is line4
This is line5
This is line6
I would like to replace it in the form
\textbf{This is line1}
This is line2
This is line3

\textbf{This is line4}
This is line5
This is line6
That is, at the start of each paragraph I would like to add the text \textbf{ and end that line with }. Is there a way to search for double ends of lines (blank lines)? I am having trouble creating such a script with sed. Thank you!
Using awk you can write something like
$ awk '!f{ $0 = "\\textbf{"$0"}"; f++} 1; /^$/{f=0}' input
\textbf{This is line1}
This is line2
This is line3

\textbf{This is line4}
This is line5
This is line6
What does it do?
!f{ $0 = "\\textbf{"$0"}"; f++}
!f True if the value of f is 0. For the first line, since f is not yet set, this evaluates true. If it's true, awk performs the action part {}
$0 = "\\textbf{"$0"}" adds \textbf{ and } to the line
f++ increments the value of f so that the action part is not entered again, unless f is reset to zero
1 always True. Since the action part is missing, awk performs the default action and prints the entire line
/^$/ Pattern matches an empty line
{f=0} If the line is empty, set f=0 so that the next line is modified by the first action part to include the changes
An approach using sed
sed '/^$/{N;s/^\(\n\)\(.*\)/\1\\textbf{\2}/};1{s/\(.*\)/\\textbf{\1}/}' my_file
Find each line that contains only a newline character, append the next line to it, and modify that appended line: /^$/{N;s/^\(\n\)\(.*\)/\1\\textbf{\2}/}
In other words, mark the line below each blank line and wrap it.
Find the first line in the file and do the same: 1{s/\(.*\)/\\textbf{\1}/}
Just use awk's paragraph mode:
$ awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"} {$1="\\textbf{"$1"}"} 1' file
\textbf{This is line1}
This is line2
This is line3

\textbf{This is line4}
This is line5
This is line6
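How the paragraph mode works: RS="" makes awk read blank-line-separated paragraphs as single records, and FS="\n" makes each line of a paragraph a field, so $1 is the paragraph's first line. A self-contained check (the input is rebuilt with the blank line between paragraphs):

```shell
printf '%s\n' 'This is line1' 'This is line2' 'This is line3' '' \
              'This is line4' 'This is line5' 'This is line6' > file

# wrap the first field (= first line) of every paragraph
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"} {$1="\\textbf{"$1"}"} 1' file
```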

bash, sed, awk: extracting lines within a range

How can I get sed to extract the lines between two patterns, write that data to a file, and then extract the lines between the next range and write that text to another file? For example given the following input:
pattern_a
line1
line2
line3
pattern_b
pattern_a
line4
line5
line6
pattern_b
I want line1, line2, and line3 to appear in one file and line4, line5, and line6 to appear in another file. I can't see a way of doing this without using a loop and maintaining some state between iterations, where the state tells you where sed must start searching for the start pattern (pattern_a) again.
For example, in bash-like pseudocode:
while not done
if [[ first ]]; then
sed -n -e '/pattern_a/,/pattern_b/p' > $filename
else
sed -n -e '$linenumber,/pattern_b/p' > $filename
fi
linenumber = last_matched_line
filename = new_filename
Is there a nifty way of doing this using sed? Or would awk be better?
How about this:
awk '/pattern_a/{f=1;c+=1;next}/pattern_b/{f=0;next}f{print > "outfile_"c}' input_file
This will create an outfile_c (outfile_1, outfile_2, …) for every range.
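A self-contained run (a sketch; parentheses around the redirection target are added for portability to awks that parse print > "outfile_" c ambiguously):

```shell
cat > input_file <<'EOF'
pattern_a
line1
line2
line3
pattern_b
pattern_a
line4
line5
line6
pattern_b
EOF

# bump the counter and raise the flag at pattern_a, drop the flag at
# pattern_b, and while the flag is up append to the current outfile
awk '/pattern_a/{f=1;c+=1;next} /pattern_b/{f=0;next} f{print > ("outfile_" c)}' input_file

cat outfile_1
cat outfile_2
```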
