Count of unique words from one column of a file in shell - bash

I was trying to find out the count of unique words in one column of a file, and the words themselves, using a shell script. Here's what I was doing. The input file (filename: gnc.txt, one record per line) contains:
Male,Tyrus,Seattle
Male,Sam,Seattle
Male,Meha,Seattle
Male,John,Seattle
Male,Sam,Beijing
Male,Meha,Paris
Male,Meha,Berlin
As a first step I found the number of unique names, which is 4, using the shell command below.
awk -F\, '{ if(!a[$2]) cnt++;a[$2]++;next}END{ print cnt }' gnc.txt
As a next step I want to get the list of unique names: i.e. Tyrus, Sam, Meha and John
Can someone help me alter the above command to do this?

Using this awk:
awk -F, '{c[$2]++} END{for (i in c) print i, c[i]}' file
Tyrus 1
Sam 2
John 1
Meha 3
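If you only want the names themselves, without the counts, a small variation of the same idea should also work; this one prints each name the first time it is seen, so it keeps the input order:
awk -F, '!seen[$2]++ {print $2}' file
Tyrus
Sam
Meha
John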

You can also use this:
cut -d',' -f2 file | sort | uniq -c
1 John
3 Meha
2 Sam
1 Tyrus
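If you only need the list of distinct names, or just their count, the same pipeline can be trimmed down:
cut -d',' -f2 file | sort -u
cut -d',' -f2 file | sort -u | wc -l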

Related

How to export original unique values using awk

This command works great for concatenating duplicates and giving only unique values:
awk '!x[$0]++' filewithdupes > newfile
However, I want to keep the original unique values.
Example:
If I have this simple set of values in a CSV column:
1
1
2
2
3
The command above outputs this:
1
2
3
But I want:
3
How can I modify this command to keep the original unique value? Or is there a command better suited to what I'm trying to do?
You may use this awk to print the records that have only one occurrence:
awk '{x[$0]++} END{for (i in x) if (x[i] == 1) print i}' filewithdupes
3
If your file is already sorted, as in the example, the simplest is
$ uniq -u file
3
Otherwise, use a double-scan algorithm:
$ awk 'NR==FNR{a[$1]++; next} a[$1]==1' file{,}
3
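In case the file{,} at the end looks cryptic: it is just bash brace expansion, which passes the same file name twice so that awk reads the file once per pass:
$ echo file{,}
file file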
Could you please try the following.
awk 'FNR==NR{a[$0]++;next} a[$0]==1' Input_file Input_file

Counting the number of names in a category in a .csv with bash

I would like to count the number of students in a .csv file depending on the category
Category 1 is the name, Category 2 is the country, Category 3 is the city
The .csv file looks like this:
michael_s;jpa;NYC
john_d;chn;TXS
jim_h;usa;POP
I have tried this in my .sh script, but it didn't work:
sort -k3 -t; students.csv
Edit:
I am trying to make a bash script that counts students by city, and that can also count a single city just by executing the script like this:
cat students.csv | ./script.sh NYC
The terminal will only display the students from NYC
If I've understood you correctly, something like this?
cut -d";" -f3 mike.txt | sort | uniq -c
(Sorry, incorrect solution first time - updated now)
To count only one city:
cut -d";" -f3 mike.txt | grep "NYC" | wc -l
Depending on the size of the file, how often you'll be doing this, etc., it may be sensible to look at other solutions, e.g. awk. But this solution will work just fine.
The reason for the error message "sort: multi-character tab 'students.csv'" is that you haven't given the -t option a separator character. If you add a semicolon after -t, the sort will work as expected:
sort -k3 -t';' students.csv
There is always awk:
$ awk -F\; 'a[$1]++==0{c++}END{print c}' file
3
Once you describe your requirements more thoroughly (you count the names but sort with -k3; please update the OP), we can help you better.
Edited to match your update:
$ awk -F\; -v col=3 -v val=NYC '
(length(val) && $col==val) || length(val)==0 && a[$col]++==0 {
c++
}
END { print c }
' file
1
If you set -v val= to the value you are looking for and -v col= to the column number, it counts the occurrences of val in col. If you set col but not val, it counts the distinct values in col.
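To get the exact cat students.csv | ./script.sh NYC usage from your edit, here is a minimal wrapper sketch (assuming the semicolon-separated format above; the script reads stdin and takes an optional city as its first argument):
#!/bin/bash
# script.sh - count students per city, or only for the city given as $1
city="$1"
awk -F';' -v val="$city" '
  val != "" && $3 == val { c++ }            # count only the requested city
  val == ""              { cities[$3]++ }   # no argument: tally every city
  END {
    if (val != "") { print val ": " c+0 }
    else           { for (x in cities) print x ": " cities[x] }
  }'
With the sample data, cat students.csv | ./script.sh NYC would print NYC: 1.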

Print lines where first column matches, second column different

In a text file, how do I print out only the lines where the first column is duplicate but 2nd column is different? I want to reconcile these differences. Possibly using awk/sed/bash?
Input:
Jon AAA
Jon BBB
Ellen CCC
Ellen CCC
Output:
Jon AAA
Jon BBB
Note that the real file is not sorted.
Thanks for any help.
This should do it (I broke the one-liner into three lines for readability):
awk '!($1 in a) {a[$1]=$2;next}
$1 in a && $2!=a[$1]{p[$1 FS $2];p[$1 FS a[$1]]}
END{for(x in p)print x}' file
The first line saves $2 into array a, keyed by $1, the first time that $1 is seen.
The second line: for a $1 that already exists with a different $2, it puts both lines into array p, so the same $1,$2 combination won't be printed multiple times.
The END block prints the indices of array p.
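If you also want the output in the original file order, a two-pass variant (just a sketch that reads the file twice) could look like this:
awk '# first pass: remember the first $2 seen for each $1, and mark keys that later appear with a different $2
     NR==FNR { if (!($1 in first)) first[$1]=$2; else if ($2!=first[$1]) dup[$1]; next }
     # second pass: print every line whose key was marked
     $1 in dup' file file
Jon AAA
Jon BBB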
sort file | uniq -u
Will only print the unique lines.
This might work for you:
sort file | uniq -u | rev | uniq -Df1 | rev
This sorts the file, removes fully duplicate lines, reverses each line, keeps only the lines whose key (the original first field, now at the end of the reversed line) is duplicated, and then reverses the lines back to their original form.
This will drop duplicate lines and lines with singleton keys.
Just a normal unique sort should work
awk '!a[$0]++' test

Linux bash grouping

I have this file:
count,name
1,B1
1,B1
1,B3
1,B3
1,B2
1,B2
1,B2
and I routinely have to get counts of the totals per group. The first number is always 1; the only important thing is the group. I wrote a Java program to do this for me. The output would be:
B1: 2
B2: 3
B3: 2
The format is not important, just the counters per group name.
I was wondering, can this be done in bash? awk? sed?
Well, it is very simple to solve with sort and uniq:
$ sort file | uniq -c
2 1,B1
3 1,B2
2 1,B3
Then, if you need the proper formatting, you may use cut to strip the first column, and awk to print the result:
$ cut -d ',' -f 2 file | sort | uniq -c | awk '{printf "%s: %d\n", $2, $1}'
B1: 2
B2: 3
B3: 2
With awk, I would write
awk -F, 'NR>1 {n[$2]++} END {OFS=":";for (x in n) print x, n[x]}' file
assuming you actually have a header line in the file.
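Since you asked about plain bash: with bash 4+ the grouping can also be done with an associative array. This is just a sketch (assuming the data is in a file called file), and for large inputs the sort/uniq or awk approaches above will be faster:
#!/bin/bash
declare -A count
{
    read -r header                                  # skip the "count,name" header line
    while IFS=, read -r _ name; do                  # split each data line on the comma
        count[$name]=$(( ${count[$name]:-0} + 1 ))  # tally by group name
    done
} < file
for group in "${!count[@]}"; do
    printf '%s: %d\n' "$group" "${count[$group]}"
done
The output order of an associative array is not guaranteed, so pipe the result through sort if that matters.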

extracting values from text file using awk

I have 100 text files which look like this:
File title
4
Realization number
variable 2 name
variable 3 name
variable 4 name
1 3452 4538 325.5
The first number on the 7th line (1) is the realization number, which SHOULD relate to the file name, i.e. the first file is called file1.txt and has realization number 1 (as shown above). The second file is called file2.txt and should have realization number 2 on the 7th line, file3.txt should have realization number 3 on the 7th line, and so on...
Unfortunately every file has realization=1, where they should be incremented according to the file name.
I want to extract variables 2, 3 and 4 from the 7th line (3452, 4538 and 325.5) in each of the files and append them to a summary file called summary.txt.
I know how to extract the information from 1 file:
awk 'NR==7,NR==7{print $2, $3, $4}' file1.txt
Which, correctly gives me:
3452 4538 325.5
My first problem is that this command doesn't seem to give the same results when run from a bash script on multiple files.
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR=7,NR==7{print $2, $3, $4}' File$((i)).txt
done
I get multiple lines being printed to the screen when I use the above script.
Secondly, I would like to output those values to the summary file along with the CORRECT preceding realization number, i.e. I want a file that looks like this:
1 3452 4538 325.5
2 4582 6853 158.2
...
100 4865 3589 15.15
Thanks for any help!
You can simplify some things and get the result you're after:
#!/bin/bash
for ((i=1;i<=100;i++))
do
echo $i $(awk 'NR==7{print $2, $3, $4}' File$i.txt)
done
You really don't want to assign NR=7 (as you did), and you don't need to repeat the NR==7,NR==7 range either. You also don't need the $((i)) notation when $i is sufficient.
If all the files are exactly 7 lines long, you can do it all in one awk command (instead of 100 of them):
awk 'NR%7==0 { print ++i, $2, $3, $4}' Files*.txt
Notice that you have only one = in your bash script (NR=7 instead of NR==7). Do all the files have exactly 7 lines? If you are only interested in the 7th line, then:
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR==7{print $2, $3, $4}' File$((i)).txt
done
Since your realization number starts from 1, you can simply add it using the nl command.
For example, if your bash script is called s.sh then:
./s.sh | nl > summary.txt
will get you the result with the expected lines in summary.txt
Here's one way using awk:
awk 'FNR==7 { print ++i, $2, $3, $4 > "summary.txt" }' $(ls -v file*)
The -v flag simply sorts the glob by version numbers. If your version of ls doesn't support this flag, try: ls file* | sort -V instead.
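If you would rather take the number from each file's name than rely on the order in which the shell expands the glob, one alternative (a sketch, assuming names like file1.txt ... file100.txt with no other digits in them) is to strip the non-digits out of awk's FILENAME variable:
awk 'FNR==7 { n = FILENAME; gsub(/[^0-9]/, "", n); print n, $2, $3, $4 }' file*.txt > summary.txt
Here the order of the files no longer matters, because the realization number is derived from the file name itself.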
