I need a script that sorts a text file and removes the duplicates.
Most, if not all, of the examples out there use the sort file1 | uniq > file2 approach.
In the man page for sort, though, there is a -u option that does this at the time of sorting.
Is there a reason to use one over the other? Maybe availability of the -u option? Or memory/speed concerns?
They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.
$ cat example
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping
but
$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
I'm not sure that it's about availability. Most systems I've ever seen have both sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001, and its sort has the -u option.
Technically, using a pipe (|) runs each command in its own process, so sort | uniq is going to be slightly more resource intensive than sort -u: it requests an extra PID from the OS and copies all the data through the pipe.
If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.
To see how it works, look at sort's source in the coreutils package and find the code below this comment:
/* If uniquified output is turned on, output only the first of
an identical series of lines. */
Although I believe sort -u should be faster, the performance gains are really going to be minimal unless you're running sort | uniq on huge files, where uniq has to read through the entire sorted output again.
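If you want to check the difference on your own data, a quick (unscientific) comparison is easy to run; big.txt here is just a placeholder for whatever large file you have:
$ time sort -u big.txt > /dev/null
$ time sort big.txt | uniq > /dev/null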
One difference is that 'uniq -c' can count (and print) the number of occurrences of each line. You lose this ability when you use 'sort -u' to sort and deduplicate in one step.
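For example (the exact spacing of the counts may differ between implementations):
$ printf 'a\nb\na\na\n' | sort | uniq -c
      3 a
      1 b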
They should be functionally equivalent, and sort -u should be more efficient.
I'm guessing the examples you're looking at simply didn't consider (or didn't have) "sort -u" as an option.
Does uniq sort?
It does not. At least on Ubuntu 18.04 and CentOS 6, uniq just removes consecutive duplicates.
You can simply conduct a mini experiment.
Let the file sample.txt be:
a
a
a
b
b
b
a
a
a
b
b
b
cat sample.txt | uniq will output:
a
b
a
b
while cat sample.txt | sort -u will output:
a
b
sort | uniq may be functionally equivalent to sort -u.
Related
How does sort work? I have this file:
/test# cat foobar
html/lib/ORM/aaa.php
html/lib/ORMBase/ormbase_aaa.php
html/lib/ORM/zzz.php
html/lib/ORMBase/ormbase_zzz.php
And this is the output of sort:
/test# cat foobar | sort
html/lib/ORM/aaa.php
html/lib/ORMBase/ormbase_aaa.php
html/lib/ORMBase/ormbase_zzz.php
html/lib/ORM/zzz.php
I tried a lot of options: -f, -i, -t/... and I don't get it. I want to understand why sort thinks this is sorted.
NB: It works fine with this other sample:
/test# cat foobar2
a/a/a
a/ab/a
a/ab/b
a/a/ab
a/abc/a
/test# cat foobar2 | sort
a/a/a
a/a/ab
a/ab/a
a/ab/b
a/abc/a
sort tries to be clever with regard to localization. It ignores some non-alphanumeric characters like / and so on. The man page has a short sentence on that:
* WARNING * The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
So, to fix your issue:
$ cat foobar | LC_ALL=C sort
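With the C locale the comparison is done on raw bytes, so the slash sorts before the letters and the ORM/ entries group together:
$ LC_ALL=C sort foobar
html/lib/ORM/aaa.php
html/lib/ORM/zzz.php
html/lib/ORMBase/ormbase_aaa.php
html/lib/ORMBase/ormbase_zzz.php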
I have a list of files in a folder.
The names are:
1-a
100-a
2-b
20-b
3-x
and I want to sort them like
1-a
2-b
3-x
20-b
100-a
The files are always a number, followed by a dash, followed by anything.
I tried ls combined with col and sort and it works, but I wanted to know if there's a simpler solution.
Forgot to mention: This is bash running on a Mac OS X.
Some ls implementations (GNU coreutils' ls is one of them) support the -v option ("natural sort of (version) numbers within text"):
% ls -v
1-a 2-b 3-x 20-b 100-a
or:
% ls -v1
1-a
2-b
3-x
20-b
100-a
Use sort and tell it how to split the fields.
sort -s -t- -k1,1n -k2 filenames.txt
The -t tells sort to treat - as the field separator in input items. -k1,1n instructs sort to sort on the first field numerically; -k2 uses the remaining fields as the second key in case the first fields are equal. -s keeps the sort stable (although you could omit it, since the entire input string is being used in one field or another).
(Note: I'm assuming the file names do not contain newlines, so that something like ls > filenames.txt is guaranteed to produce a file with one name per line. You could also use ls | sort ... in that case.)
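For example, feeding the names from the question through it (using printf here just to build the same list):
$ printf '1-a\n100-a\n2-b\n20-b\n3-x\n' | sort -s -t- -k1,1n -k2
1-a
2-b
3-x
20-b
100-a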
I have two files with two single-column lists:
//file1 - full list of unique values
AAA
BBB
CCC
//file2
AAA
AAA
BBB
BBB
//So the result here would be:
CCC
I need to generate a list of values from file1 that have no matches in file2. I have to use a bash script (preferably without special tools like awk) or a DOS batch file.
Thank you.
Method 1
Looks like a job for grep's -v flag.
grep -v -F -f listtocheck uniques
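With the sample files from the question (file2 supplying the patterns via -f, file1 being the list to filter), that would be roughly:
$ grep -v -F -f file2 file1
CCC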
Method 2
A variation on Drake Clarris's solution (which can be extended to checking against several files, something grep can't do unless they are first merged) would be:
(
sort < file_to_check | uniq
cat reference_file reference_file
) | sort | uniq -u
By doing this, any word in file_to_check will appear only once in the combined output of the subshell in parentheses. Words in reference_file will be output at least twice, and words appearing in both files will be output at least three times: once from the first file, and at least twice from the two copies of the second file.
All that remains is to isolate the words we want, those that appear only once, which is exactly what sort | uniq -u does.
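Applied to the question's sample files (file1 as file_to_check, file2 as reference_file), a run might look like:
$ ( sort < file1 | uniq; cat file2 file2 ) | sort | uniq -u
CCC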
Optimization I
If reference_file contains a lot of duplicates, it might be worthwhile to run the heavier
sort < reference_file | uniq
sort < reference_file | uniq
instead of cat reference_file reference_file, in order to produce smaller output and put less load on the final sort.
Optimization II
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and, in case of repeated checks against different files, we could reuse the same sorted reference file again and again without re-sorting it); therefore:
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
Optimization III
Finally, in case of very long runs of identical lines in one file, which may be the case with some logging systems for example, it may also be worthwhile to run uniq twice: once to get rid of the runs (ahem) and once more after sorting to deduplicate, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1
For a Windows CMD solution (commonly referred to as DOS, but not really):
It should be as simple as
findstr /vlxg:"file2" "file1"
but there is a findstr bug that results in possible missing matches when there are multiple literal search strings.
If a case insensitive search is acceptable, then adding the /I option circumvents the bug.
findstr /vlixg:"file2" "file1"
If you are not restricted to native Windows commands then you can download a utility like grep for Windows. The Gnu utilities for Windows are a good source. Then you could use Isemi's solution on both Windows and 'nix.
It is also easy to write a VBScript or JScript solution for Windows.
cat file1 file2 | sort | uniq -u
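With the question's sample files this prints just the value missing from file2 (uniq -u keeps only lines that occur exactly once in the combined, sorted input):
$ cat file1 file2 | sort | uniq -u
CCC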
I have a file with floats with exponents and I want to sort them. AFAIK 'sort -g' is what I need. But it seems like it sorts floats throwing away all the exponents. So the output looks like this (which is not what I wanted):
$ cat file.txt | sort -g
8.387280091e-05
8.391373668e-05
8.461754562e-07
8.547354437e-05
8.831553093e-06
8.936111118e-05
8.959458896e-07
This brings me to two questions:
Why doesn't 'sort -g' work as I expect it to?
How can I sort my file using bash commands?
The problem is that in some countries the locale settings use , as the decimal separator instead of . at the system level, which confuses sort -g. Check by typing locale in a terminal. There should be an entry
LC_NUMERIC=en_US.UTF-8
If the value is anything else, change it to the above by editing the locale file
sudo gedit /etc/default/locale
That's it. You can also temporarily use this value by doing
LC_ALL=C sort -g file.dat
LC_ALL=C is shorter to type in the terminal, but putting it in the locale file might not be preferable, as it could alter some other system-wide behavior such as the time format.
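For reference, with the same file.txt as above, the C-locale run should produce the expected ordering:
$ LC_ALL=C sort -g file.txt
8.461754562e-07
8.959458896e-07
8.831553093e-06
8.387280091e-05
8.391373668e-05
8.547354437e-05
8.936111118e-05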
Here's a neat trick:
$ sort -te -k2,2n -k1,1n test.txt
8.461754562e-07
8.959458896e-07
8.831553093e-06
8.387280091e-05
8.391373668e-05
8.547354437e-05
8.936111118e-05
The -te splits each number into two fields at the e that separates the mantissa from the exponent. The -k2,2n says to sort numerically by the exponent first, then -k1,1n sorts numerically by the mantissa.
Works with all versions of the sort command.
Your method is absolutely correct
cat file.txt | sort -g
If the above code is not working, then try this:
sed 's/\./0000000000000/g' file.txt | sort -g | sed 's/0000000000000/\./g'
Convert '.' to '0000000000000', sort, and then substitute '.' back in. I chose '0000000000000' as the replacement so as to avoid it colliding with anything already present in the input numbers.
You can adjust the padding string on your own.
Is there a simple way to remove duplicate contents from a large textfile? It would be great to be able to detect duplicate sentences (as separated by "."), or even better, to find duplicates of sentence fragments (such as 4-word pieces of text).
Removing duplicate words is easy enough, as other people have pointed out. Anything more complicated than that, and you're into Natural Language Processing territory. Bash isn't the best tool for that -- you need a slightly more elegant weapon for a civilized age.
Personally, I recommend Python and its NLTK (Natural Language Toolkit). Before you dive into that, it's probably worth reading up a little bit on NLP so that you know what you actually need to do. For example, the "4-word pieces of text" are known as 4-grams (n-grams in the generic case) in the literature. The toolkit will help you find those, and more.
Of course, there are probably alternatives to Python/NLTK, but I'm not familiar with any.
Remove duplicate phrases while keeping the original order:
nl -w 8 "$infile" | sort -k2 -u | sort -n | cut -f2
The first stage of the pipeline prepends a line number to every line to record the original order. The second stage sorts the original data with the unique switch set.
The third restores the original order (sorting numerically on the first column). The final cut removes the line-number column.
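A small demonstration (assuming GNU sort, where -u keeps the first of each group of equal keys, so the first occurrence of a duplicate survives):
$ printf 'foo\nbar\nfoo\nbaz\n' | nl -w 8 | sort -k2 -u | sort -n | cut -f2
foo
bar
baz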
You can remove duplicate lines (which have to be exactly equal) with uniq if you sort your textfile first.
$ cat foo.txt
foo
bar
quux
foo
baz
bar
$ sort foo.txt
bar
bar
baz
foo
foo
quux
$ sort foo.txt | uniq
bar
baz
foo
quux
Apart from that, there's no simple way of doing what you want. (How will you even split sentences?)
You can use grep with backreferences.
If you write grep "\([[:alpha:]]*\)[[:space:]]*\1" -o <filename> it will match any two identical words following one another. I.e. if the file content is this is the the test file , it will output the the.
(Explanation: [[:alpha:]] matches any character a-z or A-Z; the asterisk * after it means it may appear any number of times; \(\) is used for grouping so the match can be backreferenced later; [[:space:]]* then matches any number of spaces and tabs; and finally \1 matches the exact sequence that was captured inside the \(\) brackets.)
Likewise, if you want to match a group of 4 words that is repeated two times in a row, the expression will look like grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\)[[:space:]]*\1" -o <filename> - it will match e.g. a b c d a b c d.
Now we need to allow an arbitrary character sequence in between the matches. In theory this should be done by inserting .* just before the backreference, i.e. grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\).*\1" -o <filename>, but this doesn't seem to work for me - it matches just about any string and ignores the backreference (probably because [[:alpha:]]* can match the empty string, so the captured group can be empty and \1 then matches trivially).
The short answer is that there's no easy method. In general, any solution needs to first decide how to split the input document into chunks (sentences, sets of 4 words each, etc.) and then compare them to find duplicates. If it's important that the ordering of the non-duplicate elements be the same in the output as it was in the input, then this only complicates matters further.
The simplest bash-friendly solution would be to split the input into lines based on whatever criteria you choose (e.g. split on each ., although doing this quote-safely is a bit tricky), then use standard duplicate detection mechanisms (e.g. | uniq -c | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}'), and then, for each resulting line, remove the text from the input.
Presuming that you had a file that was properly split into one "sentence" per line, then
uniq -c lines_of_input_file | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}' | while IFS= read -r match ; do sed -i '' -e 's/'"$match"'//g' input_file ; done
Might be sufficient. Of course it will break horribly if the $match contains any data which sed interprets as a pattern. Another mechanism should be employed to perform the actual replacement if this is an issue for you.
Note: If you're using GNU sed the -E switch above should be changed to -r
I just created a script in Python that does pretty much what I wanted originally:
import string
import sys

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub)

if len(sys.argv) != 2:
    sys.exit("Usage: find_duplicate_fragments.py some_textfile.txt")

file=sys.argv[1]
infile=open(file,"r")
text=infile.read()
text=text.replace('\n','') # remove newlines
table = string.maketrans("","")
text=text.translate(table, string.punctuation) # remove punctuation characters
text=text.translate(table, string.digits) # remove numbers
text=text.upper() # to uppercase
while text.find("  ")>-1:
    text=text.replace("  "," ") # strip double-spaces
spaces=list(find_all(text," ")) # find all spaces
# scan through the whole text in packets of four words
# and check for multiple appearances.
for i in range(0,len(spaces)-4):
    searchfor=text[spaces[i]+1:spaces[i+4]]
    duplist=list(find_all(text[spaces[i+4]:len(text)],searchfor))
    if len(duplist)>0:
        print len(duplist),': ',searchfor
BTW: I'm a python newbie, so any hints on better python practise are welcome!