cygwin uniq not working - sorting

Given the following sorted file (myfile.txt):
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
http://quarterly.mayo.edu/photolisting/default.cfm?summary=&displaymode=&reporting_unit_code=
When I try:
uniq -c myfile.txt
I get:
21 http://quarterly.mayo.edu/photoListing/default.cfm?summary=&displaymode=&reporting_unit_code=
1 http://quarterly.mayo.edu/photolisting/default.cfm?summary=&displaymode=&reporting_unit_code=
Which I guess might indicate a hidden character or something - but when I try:
uniq -u myfile.txt
I get the expected:
http://quarterly.mayo.edu/photolisting/default.cfm?summary=&displaymode=&reporting_unit_code=
Is this a bona fide inconsistency, or am I missing something?
Thanks,
Al

uniq -u only prints unique lines. Your myfile.txt apparently has 21 identical lines followed by one unique line. uniq -u prints only that one unique line.
uniq myfile.txt should print two lines, the first corresponding to the 21 identical lines and the second corresponding to the final non-matching line.
For example:
$ ( echo foo ; echo foo ; echo bar ) | uniq -c
2 foo
1 bar
$ ( echo foo ; echo foo ; echo bar ) | uniq -u
bar
$
As for why uniq -c is producing 2 lines of output rather than 1, it's because your last line is different from the preceding 21 lines. You have photoListing (uppercase L) on lines 1..21 and photolisting (lowercase l) on line 22.
(My first thought was that you probably had some hidden characters in the file; since you're on Cygwin, inconsistent line endings are the most likely culprit. To see the hidden characters:
uniq -c myfile.txt | cat -A
But it turns out that's not the problem.)
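If the two spellings should be counted as the same URL, one option is to compare case-insensitively with uniq -i. The following is a minimal sketch; the function name count_ignoring_case and the shortened sample paths are made up for illustration and stand in for the lines of myfile.txt:

```shell
# Count adjacent duplicate lines, ignoring differences in letter case (uniq -i).
count_ignoring_case() {
  uniq -c -i
}

# Illustrative stand-ins for the URLs in the question.
printf '%s\n' /photoListing/default.cfm /photoListing/default.cfm /photolisting/default.cfm |
  count_ignoring_case
```

With -i the three sample lines collapse into a single group; without it, the lowercase variant is counted separately, as in the question.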

Related

Sort file numerically and preserve blank lines between entries in Bash

I am currently struggling at sorting data. I searched online and never saw any topics mentioning my issue...
I have files with unordered data like:
1
blank line
3
blank line
2
The values are separated by blank lines. When I use my script, it does sort the data, but the blank lines end up on top and the values at the bottom, like:
blank line
blank line
1
2
3
I would like to have an output like:
1
blank line
2
blank line
3
which preserves the structure of the input.
The command I use is: sort -nk1 filename > newfile
How can I preserve the blank lines in the right places?
Remove the empty lines, sort and add the empty lines again:
grep . filename | sort -nk1 | sed 's/$/\n/' > newfile
You can combine grep and sed
sort -nk1 filename | sed -n '/./ s/$/\n/p' > newfile
When you don't have an empty line after each data-line, you need to add some marker temporarily
tr '\n' '\r' < filename |
sed -r 's/([^\r]+)\r\r/\1\a\r/g;s/\r/\n/g' |
sort -nk1 | sed 's/\a/\n/g' > newfile
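The sed commands above rely on GNU sed interpreting \n in the replacement text. A hedged, more portable sketch of the first approach (drop the blank lines, sort, then put a blank line back after each value) uses awk for the last step; the function name sort_keep_blanks is hypothetical:

```shell
# Sort the non-empty lines numerically, then print a blank line after each one.
# The output always has exactly one blank line after every value,
# regardless of how the input was spaced.
sort_keep_blanks() {
  grep . "$1" | sort -n -k1 | awk '{ print; print "" }'
}

# Small demonstration on a temporary file.
tmp=$(mktemp)
printf '1\n\n3\n\n2\n\n' > "$tmp"
sort_keep_blanks "$tmp"
rm -f "$tmp"
```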

bash sort / uniq -c: how to use tab instead of space as delimiter in output?

I have a file strings.txt listing strings, which I am processing like this:
sort strings.txt | uniq -c | sort -n > uniq.counts
So the resulting file uniq.counts will list the unique strings sorted in ascending order of their counts, something like this:
1 some string with spaces
5 some-other,string
25 most;frequent:string
Note that strings in strings.txt may contain spaces, commas, semicolons and other separators, except for the tab. How can I get uniq.counts to be in this format:
1<tab>some string with spaces
5<tab>some-other,string
25<tab>most;frequent:string
You can do:
sort strings.txt | uniq -c | sort -n | sed -E 's/^ *//; s/ /\t/' > uniq.counts
sed first removes all leading spaces at the beginning of the line (before the count) and then replaces the space after the count with a tab character.
You can simply pipe the output of the final sort to sed before writing to uniq.counts, e.g. add:
| sed -e 's/^ *\([0-9][0-9]*\) /\1\t/' > uniq.counts
The full expression would be:
$ sort strings.txt | uniq -c | sort -n | \
sed -e 's/^ *\([0-9][0-9]*\) /\1\t/' > uniq.counts
(line continuation included for clarity; the leading " *" skips the padding that uniq -c puts before the count, and \t in the replacement requires GNU sed)
With GNU sed:
sort strings.txt | uniq -c | sort -n | sed -r 's/^ *([0-9]+) /\1\t/' > uniq.counts
Output to uniq.counts:
1<tab>some string with spaces
5<tab>some-other,string
25<tab>most;frequent:string
If you want to edit your file "in place" use sed's option -i.
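For sed implementations that do not understand \t in the replacement, the same reformatting can be sketched with awk. This is only an illustration under that assumption; the function name count_with_tabs is made up:

```shell
# Count duplicate lines and emit "count<TAB>string", sorted by ascending count.
count_with_tabs() {
  sort "$1" | uniq -c | sort -n |
    awk '{
      count = $1                   # the count is the first field
      sub(/^ *[0-9]+ /, "")        # strip padding, count, and the one separating space
      print count "\t" $0          # the rest of the line is kept verbatim
    }'
}

# Small demonstration on a temporary file.
tmp=$(mktemp)
printf 'some string with spaces\nsome-other,string\nsome-other,string\n' > "$tmp"
count_with_tabs "$tmp"
rm -f "$tmp"
```

Because sub() edits the whole line rather than rebuilding it from fields, runs of spaces inside the strings are left as they are.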

How do I check if the number of lines in a set of files is not equal to a certain number? (Bash)

I want to detect which of my files are corrupt; by corrupt I mean that the file does not have 102 lines in it. I want the for loop that I'm writing to output an error message giving me the file names of the corrupt files. I have files named ethane1.log ethane2.log ethane3.log ... ethane10201.log .
for j in {1..10201}
do
if [ ! (grep 'C 2- C 5' ethane$j.log | cut -c 22- | tail -n +2 | awk '{for (i=1;i<=NF;i++) print $i}'; done | wc -l) == 102]
then echo "Ethane$j.log is corrupt."
fi
done
When the file is not corrupt, the input:
grep 'C 2- C 5' ethane$j.log | cut -c 22- | tail -n +2 | awk '{for (i=1;i<=NF;i++) print $i}'; done | wc -l
returns:
102
Or else it is another number.
The only thing is, I'm not sure of the syntax for the if construct (how to create a variable from the 102 output of wc -l, and then how to check whether it is equal to 102 or not).
A sample output would be:
Ethane100.log is corrupt.
Ethane2010.log is corrupt.
Ethane10201.log is corrupt.
To count lines, use wc -l:
wc -l ethane*.log | grep -v '^ *102 ' | head -n-1
grep -v removes matching lines
^ matches the start of a line
space* matches any number of spaces (0 or more)
head removes some trailing lines
-n-1 removes the last line (the total)
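Using gawk:
gawk 'ENDFILE{if(FNR!=102)print FNR,FILENAME}' ethane*.log
At the end of each file, this checks whether the number of lines in that file (FNR) differs from 102 and, if so, prints the line count and the filename. ENDFILE is a gawk extension; FNR is used rather than NR because NR keeps counting across files.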
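As for the if syntax the question asks about, a loop along these lines captures the count in a variable with $( ... ) and compares it with -ne. It is a sketch that assumes the plain line count of each file is what matters; the function name report_corrupt is hypothetical, and the grep/cut/awk pipeline from the question could be substituted for wc -l < "$f":

```shell
# Print a message for every file whose line count differs from the expected value.
# Usage: report_corrupt EXPECTED FILE...
report_corrupt() {
  expected=$1
  shift
  for f in "$@"; do
    # $(( ... )) strips the padding some wc implementations put around the number.
    n=$(( $(wc -l < "$f") ))
    if [ "$n" -ne "$expected" ]; then
      echo "$f is corrupt."
    fi
  done
}

# Example call (file names taken from the question):
#   report_corrupt 102 ethane*.log
```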

How to sort groups of lines?

In the following example, there are 3 elements that have to be sorted:
"[aaa]" and the 4 lines (always 4) below it form a single unit.
"[kkk]" and the 4 lines (always 4) below it form a single unit.
"[zzz]" and the 4 lines (always 4) below it form a single unit.
Only groups of lines following this pattern should be sorted; anything before "[aaa]" and after the 4th line of "[zzz]" must be left intact.
from:
This sentence and everything above it should not be sorted.
[zzz]
some
random
text
here
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
And neither should this one and everything below it.
to:
This sentence and everything above it should not be sorted.
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
[zzz]
some
random
text
here
And neither should this one and everything below it.
Maybe not the fastest :) [1] but it will do what you want, I believe:
for line in $(grep -n '^\[.*\]$' sections.txt |
sort -k2 -t: |
cut -f1 -d:); do
tail -n +$line sections.txt | head -n 5
done
Here's a better one:
for pos in $(grep -b '^\[.*\]$' sections.txt |
sort -k2 -t: |
cut -f1 -d:); do
tail -c +$((pos+1)) sections.txt | head -n 5
done
[1] The first one is something like O(N^2) in the number of lines in the file, since it has to read all the way to the section for each section. The second one, which can seek immediately to the right character position, should be closer to O(N log N).
[2] This takes you at your word that there are always exactly five lines in each section (header plus four following), hence head -n 5. However, it would be really easy to replace that with something which read up to but not including the next line starting with a '[', in case that ever turns out to be necessary.
Preserving start and end requires a bit more work:
# Find all the sections
mapfile indices < <(grep -b '^\[.*\]$' sections.txt)
# Output the prefix
head -c+${indices[0]%%:*} sections.txt
# Output sections, as above
for pos in $(printf %s "${indices[@]}" |
sort -k2 -t: |
cut -f1 -d:); do
tail -c +$((pos+1)) sections.txt | head -n 5
done
# Output the suffix
tail -c+$((1+${indices[-1]%%:*})) sections.txt | tail -n+6
You might want to make a function out of that, or a script file, changing sections.txt to $1 throughout.
Assuming that other lines do not contain a [ in them:
header=`grep -n 'This sentence and everything above it should not be sorted.' sortme.txt | cut -d: -f1`
footer=`grep -n 'And neither should this one and everything below it.' sortme.txt | cut -d: -f1`
head -n $header sortme.txt #print header
head -n $(( footer - 1 )) sortme.txt | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
#cat sortme.txt | head -n $(( footer - 1 )) | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
tail -n +$footer sortme.txt #print footer
Serves the purpose.
Note that the main sort work is done by the 4th command only. The other lines are there to preserve the header & footer.
I am also assuming that, between header & first "[section]" there are no other lines.
This might work for you (GNU sed & sort):
sed -i.bak '/^\[/!b;N;N;N;N;s/\n/UnIqUeStRiNg/g;w sort_file' file
sort -o sort_file sort_file
sed -i -e '/^\[/!b;R sort_file' -e 'd' file
sed -i 's/UnIqUeStRiNg/\n/g' file
Sorted file will be in file and original file in file.bak.
This will present all lines beginning with [ and following 4 lines, in sorted order.
UnIqUeStRiNg can be any unique string not containing a newline, e.g. \x00
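Another way to sketch the same idea is to tag every line with a sort key and let sort do the rest (a decorate/sort/undecorate pass). This is not taken from the answers above; it assumes the groups are contiguous, that only group headers match ^\[.*\]$, and that no header contains a tab. The function name sort_sections is made up:

```shell
# Sort "[header]" + 4-line groups by header, leaving the text before the first
# group and after the last group untouched.
sort_sections() {
  tab=$(printf '\t')
  awk -v OFS='\t' '
    BEGIN { zone = 0 }                                   # 0 = prefix, 1 = groups, 2 = suffix
    /^\[.*\]$/ { zone = 1; key = $0; left = 5 }          # header plus the 4 lines below it
    left > 0   { print 1, key, NR, $0; left--; next }    # line that belongs to a group
    { if (zone == 1) zone = 2; print zone, "", NR, $0 }  # prefix or suffix line
  ' "$1" |
    LC_ALL=C sort -t "$tab" -k1,1n -k2,2 -k3,3n |        # by zone, header, original line number
    cut -f4-                                             # drop the three key columns
}

# Example call: sort_sections sections.txt > sorted.txt
```

Sorting on the original line number as the last key keeps the lines inside each group, and the prefix and suffix, in their original order.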

Is there a shell command to pick the n-th line?

Is there a shell command to pick the n-th line of a string?
Example:
line1
line2
line3
pick line 2.
UPDATE: Thank you so far. With your help, I came up with this solution for a string:
Pick the 2nd line:
echo -e "1\n2\n3" | head -2 | tail -1
$ head -n N filename | tail -n 1
where N is your line number. But it's a little inefficient, launching 2 processes.
Alternatively sed can do this. To print the 4th line:
$ sed -n 4p filename
This forum answer details 3 different methods for sed
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
Using gawk:
gawk -v n=3 'n==NR { print; exit }' a.txt
head -4 a.txt | tail -1
This prints the 4th line of a.txt.
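The sed approach can be wrapped in a small helper that takes the line number as a parameter and stops reading once that line has been printed. A minimal sketch; the function name nth_line is made up for illustration:

```shell
# Print line N of the given file, or of standard input when no file is given,
# and quit as soon as that line has been printed.
nth_line() {
  n=$1
  shift
  sed -n "${n}{p;q;}" "$@"
}

# Pick the 2nd line of a multi-line string.
printf 'line1\nline2\nline3\n' | nth_line 2
```

This is the single-process equivalent of the head/tail pipeline, and the q command avoids scanning the rest of a large file.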
