How can I keep only the non-repeated lines in a file? - bash

What I want to do is simply keep the lines which are not repeated in a huge file like this:
..
a
b
b
c
d
d
..
The desired output is then:
..
a
c
..
Many thanks in advance.

uniq has the -u option:
-u, --unique only print unique lines
Example:
$ printf 'a\nb\nb\nc\nd\nd\n' | uniq -u
a
c
If your data is not sorted, sort it first:
$ printf 'd\na\nb\nb\nc\nd\n' | sort | uniq -u
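which again prints:
a
c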
To preserve the original order:
$ cat foo
d
c
b
b
a
d
$ grep -Fx -f <(sort foo | uniq -u) foo
c
a
This greps the file for the unique lines obtained by the aforementioned uniq, with -F treating them as fixed strings and -x matching whole lines only, so one line cannot falsely match as a substring or regex inside another. I can imagine, though, that if your file is really huge then it will take a long time.
The same without the somewhat ugly process substitution:
$ sort foo | uniq -u | grep -Fx -f- foo
c
a

This awk should work to list only the lines that are not repeated in the file:
awk 'seen[$0]++{dup[$0]} END {for (i in seen) if (!(i in dup)) print i}' file
a
c
Just remember that the original order of the lines may change due to the hashing of array keys in awk.
EDIT: To preserve the original order:
awk '$0 in seen{dup[$0]; next}
{seen[$0]++; a[++n]=$0}
END {for (i=1; i<=n; i++) if (!(a[i] in dup)) print a[i]}' file
a
c
This is a job that is tailor-made for awk: it doesn't require multiple processes, pipes, or process substitution, and it will be more efficient for bigger files.
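If storing every line is too much for a truly huge file, a two-pass variant (a sketch; it reads the file twice but keeps only the counts in memory) also preserves the original order:
$ awk 'NR==FNR {count[$0]++; next} count[$0]==1' file file
a
c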

When your file is sorted, it's simple. Note that plain uniq would keep one copy of each repeated line; uniq -u is what drops the repeated lines entirely:
uniq -u file.txt > file2.txt
mv file2.txt file.txt
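The difference between the two, for the record:
$ printf 'a\nb\nb\nc\n' | uniq
a
b
c
$ printf 'a\nb\nb\nc\n' | uniq -u
a
c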

Related

how to compare total in unix

I have a file simple.txt with contents as below:
a b
c d
c d
I want to check which pair, 'a b' or 'c d', has the maximum occurrence. I have written this code, which gives me the individual occurrence counts of each word:
cat simple.txt | tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c |
grep -E -i "\<a\>|\<b\>|\<c\>|\<d\>"
1 a
1 b
2 c
2 d
How can I total the result of this output? Or can I write different code?
If we can assume that each pair of letters is a complete line, one way to handle this would be to sort the lines, use the uniq utility to get a count of each unique line, and then reverse-sort numerically to put the highest count first:
sort simple.txt | uniq -c | sort -rn
You may want to get rid of the empty lines using egrep:
egrep '\w' simple.txt | sort | uniq -c | sort -rn
Which should give you:
2 c d
1 a b
$ sort file |
uniq -c |
sort -nr > >(read -r count pair; echo "max count $count is for pair $pair")
sort, count, sort numerically in descending order, then read the first line and print the result.
or all the above in one GNU awk script (asorti with "@val_num_asc" orders the keys by their counts, so the last one is the maximum):
$ awk '{c[$0]++}
END{n=asorti(c, ci, "@val_num_asc"); k=ci[n];
print "max count is " c[k] " for pair " k}' file
With a single GNU awk command:
awk 'BEGIN{ PROCINFO["sorted_in"] = "#val_num_desc" }
NF{ a[$0]++ }
END{ for (i in a) { print "The pair with max occurrence is:", i; break } }' file
The output:
The pair with max occurrence is: c d
To get the pair that occurs most frequently:
$ sort <simple.txt | uniq -c | sort -nr | awk '{print "The pair with max occurrence is",$2,$3; exit}'
The pair with max occurrence is c d
This can be done entirely by awk and without any need for pipelines:
$ awk '{a[$0]++} END{for (x in a) if (a[x]>(max+0)) {max=a[x]; line=x}; print "The pair with max occurrence is",line}' simple.txt
The pair with max occurrence is c d
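To answer the literal "how can I total the result" part: you can sum the count column of your original pipeline with one more awk stage (a sketch; here 1+1+2+2 gives 6):
$ cat simple.txt | tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c |
grep -E -i "\<a\>|\<b\>|\<c\>|\<d\>" | awk '{s += $1} END {print s}'
6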

Delete rows that match pattern by id

I have the following file containing n rows:
>name.1_i4_xyz_n
>name.1_i1_xyz_n
>name.1_i1_xyz_n
>name.1_i1_xyz_m
>name.1_i2_xyz_n
>name.1_i2_xyz_m
>name.1_i7_xyz_m
>name.1_i4_xyz_n
...
I want to delete the rows that end with m. In the example the output would be:
>name.1_i4_xyz_n
>name.1_i4_xyz_n
...
Note that I've deleted i2 as it has two records and one of them ends with m. Same with i1.
Any help? I want to keep it simple and do it with just one line of code.
This is what I have so far:
$ grep "i._.*." < input.txt | sort -k 2 -t "_" | cut -d'_' -f1,2,4
>name.1_i1_m
>name.1_i1_n
>name.1_i1_n
>name.1_i2_m
>name.1_i2_n
>name.1_i4_n
>name.1_i4_n
>name.1_i7_m
...
To delete rows that end with m:
$ grep -v m$ file
>name.1_i4_xyz_n
>name.1_i1_xyz_n
>name.1_i1_xyz_n
>name.1_i2_xyz_n
>name.1_i4_xyz_n
Another solution that handles the ids, using awk and two passes over the file:
$ awk 'BEGIN { FS="_" } # set delimiter
NR==FNR { # on the first run
if($0~/m$/) # if it ends in an m
d[$2] # make a del array entry of that index
next
}
($2 in d==0)' file file # on the second run don't print if index in del array
>name.1_i4_xyz_n
>name.1_i4_xyz_n
One-liner version:
$ awk 'BEGIN{FS="_"}NR==FNR{if($0~/m$/)d[$2];next}($2 in d==0)' file file
If the i... part does not appear in any other column, you can use
grep -vFf <(grep -E 'm$' file | cut -d _ -f 2) file
The part inside <() filters out all i... that have a row ending with m. In your example: i1, i2, and i7.
The outer grep takes a list of literal search strings (inside the <()) and prints only the lines not containing any of the search strings.
You can use awk as this:
awk -F_ '{if(/m$/) a[$2]; else rows[++n]=$0}
END{for (i=1; i<=n; i++) {split(rows[i], b, FS); if (!(b[2] in a)) print}}' file
>name.1_i4_xyz_n
>name.1_i4_xyz_n
Another awk proposal; note that it hardcodes the id i4, so it only fits this exact example:
awk '/_i4/ && !/_m$/' file
>name.1_i4_xyz_n
>name.1_i4_xyz_n

How do I read a file into a matrix in bash?

I have a text file like this
A;green;3
B;blue;2
A;red;4
C;red;2
C;blue;3
B;green;3
I have to write a script that, if started with parameter "B", gives me the color of the row with the biggest number (among the rows starting with B). In this case it would be the last line, so the output would be "green".
How do I separate the elements by ";"s and newlines and store them in a matrix so I can work with them? Do I even need to do that, or is there an easier solution?
Thanks in advance!
awk + sort solution:
awk -v param="B" -F';' '$1==param{ print $2; exit }' <(sort -t';' -k1,1 -k3nr file.txt)
The output:
green
Or, in addition to William Pursell's answer, to extract only the color value:
awk -F';' '/^B/ && $3>m{ m=$3; c=$2 }END{ print c }' file.txt
green
Via bash script:
get_max_color.sh script:
#!/bin/bash
awk -F';' -v p="$1" '$0~"^"p && $3>m{ m=$3; c=$2 }END{ print c }' "$2"
Usage:
bash get_max_color.sh B "file.txt"
green
You just need to filter out the appropriate lines and store the one with the max value seen (m has to be updated on each hit, or the comparison never tightens). The obvious solution is:
awk '/^B/ && $3 > m {m=$3; s=$0} END {print s}' FS=\; input
To use a parameter, do
awk "/^$1/"' && $3 > m {m=$3; s=$0} END {print s}' FS=\; input
though passing the parameter with -v, as in the script above, is more robust than splicing "$1" into the program text.
A non-awk solution, possibly less elegant and slower than the already proposed solutions:
sort -r -t\; -k1,1 -k3 file | uniq -w1 | grep "B" | cut -f2 -d\;
The reverse sort puts each key's highest-numbered row first, and uniq -w1 then keeps only the first line for each distinct leading character.
awk to the rescue!
I have probably not understood what you want to achieve, but
awk -v key="$c" -F\; 'm[$1]<$3{m[$1]=$3; c[$1]=$2} END{print c[key]}' file
will pick the highest-numbered color from the file for the key.
A poor usage pattern (it rescans the file once per key):
$ for c in A B C;
do
echo $c "->" $(awk -v key="$c" -F\; 'm[$1]<$3 {m[$1]=$3; c[$1]=$2}
END {print c[key]}' file);
done;
A -> red
B -> green
C -> blue
You can probably implement the rest of the script in awk and do this process once.
Or, perhaps you want an associative array; that can be done as below:
$ declare -A colors;
while IFS=\; read k c _ ;
do
colors[$k]=$c;
done < <(sort -t\; -k1,1 -k3nr file | uniq -w1)
$ echo ${colors[A]}
red
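To answer the "do I even need a matrix" part: no, a plain while/read loop with three variables is enough. A minimal pure-bash sketch (the script name maxcolor.sh and the argument order are my own choices):
#!/bin/bash
# usage: ./maxcolor.sh B file.txt
key=$1
max=-1
color=
while IFS=';' read -r k c n; do
    # keep the color of the highest number seen for the requested key
    if [[ $k == "$key" ]] && (( n > max )); then
        max=$n
        color=$c
    fi
done < "$2"
echo "$color"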

How to use grep -c to count occurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from certain cities there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time-consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is an output with every result (just the values) given by grep in a column, so I can copy it directly into a spreadsheet. E.g.:
132
407
523
Thanks in advance.
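The loop prints 0s because the single quotes around '$p' keep the shell from expanding the variable, so grep searches for the literal string $p. Double quotes fix it (a minimal correction of the posted script):
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
    grep -c "$p" file.csv
done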
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of the cities (I assume the file is structured, since it's a CSV file).
For example, to see which shells are used how often on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accepted the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file

Scripts for listing all the distinct characters in a text file

E.g.
Given a file input.txt, which has the following content:
He likes cats, really?
the output would be like:
H
e
l
i
k
s
c
a
t
,
r
y
?
Note that the order of the characters in the output does not matter.
One way: use grep -o . to put each character on its own line and sort -u to remove duplicates:
$ grep -o . file | sort -u
Or a solution that doesn't require sort -u or multiple commands, written purely in awk:
$ awk '{for(i=1;i<=NF;i++)if(!a[$i]++)print $i}' FS="" file
How about:
echo "He likes cats, really?" | fold -w1 | sort -u
An awk way:
awk '{$1=$1}1' FS="" OFS="\n" file | sort -u
You can use GNU sed as follows (& stands for the whole match, which is more portable than the \0 some versions accept):
sed 's/./&\n/g' input.txt | sort -u
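And a pure-bash variant, for completeness (a sketch; fine for small inputs, slow on big files):
declare -A seen
while IFS= read -r -n1 ch; do
    # read -n1 returns an empty ch at each newline; skip those
    if [[ -n $ch && -z ${seen[$ch]} ]]; then
        seen[$ch]=1
        printf '%s\n' "$ch"
    fi
done < input.txt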
