check if column has more than one value in unix [duplicate] - shell

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, it's like the first column has colors, and I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this on the Linux command line, probably using a bash script, sed, awk or something like that.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, for the first column I want to know how many Red, Blue, Green etc. coloured objects there are.

You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique values, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
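If instead you want a count of each unique value, as asked in the update, uniq -c gives you that directly:
cut -f 1 input_file | sort | uniq -c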

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
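For the two sample rows shown in the question, this prints one value and its count per line (the order is not guaranteed):
Blue 1
Red 1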

You can use awk, sort & uniq to do this; for example, to list all the unique values in the first column:
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of unique values, you can pipe the unique list into wc -l.
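For example, to count how many distinct values appear in the first column of the same test.txt:
awk < test.txt '{print $1}' | sort | uniq | wc -l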

Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold

# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
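For example, with the two sample rows from the question saved in a tab-separated file (the name input.tsv here is just for illustration), counting the distinct values in the first column:
COLUMN=1
INPUT_FILE=input.tsv
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
# prints 2 (Red and Blue)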

Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides a synopsis of each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed, tr, wc, cut, sort, uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
# number of columns = number of tabs in the first line + 1;
# add 2 because the loop below uses a strict less-than test
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2))
for ((i = 1; i < cols; i++))
do
    echo "Column $i ::"
    cut -f "$i" < "$FILE" | sort | uniq -c
    echo
done
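Example run, assuming the script is saved as colsummary.sh (a name chosen only for this example) and the two sample rows from the question are in test.tsv:
./colsummary.sh test.tsv
Column 1 ::
      1 Blue
      1 Red
(columns 2 to 4 follow in the same format)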

This script outputs, for each column of a given file, every unique value together with its count. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to the script.
Code
#!/bin/bash
awk -F'\t' '
NR == 1 {
    for (fi = 1; fi <= NF; fi++)
        fname[fi] = $fi
}
NR != 1 {
    for (fi = 1; fi <= NF; fi++)
        arr[fname[fi]][$fi]++    # arrays of arrays: requires GNU awk 4+
}
END {
    for (fi = 1; fi <= NF; fi++) {
        out = fname[fi]
        for (item in arr[fname[fi]])
            out = out "\t" item "_" arr[fname[fi]][item]
        print out
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

Related

how to use cut command -f flag as reverse

This is a text file called a.txt
ok.google.com
abc.google.com
I want to select every subdomain separately
cat a.txt | cut -d "." -f1 (it selects ok from the left side)
cat a.txt | cut -d "." -f2 (it selects google from the left side)
Is there any way to get the result from the right side?
cat a.txt | cut (so it can select com from the right side)
There could be a few ways to do this; one way I can think of right now is a rev + cut + rev solution, which reverses the input with the rev command, then sets the field separator to . and prints the fields as if counting from the left (though they are actually reversed because of rev), and then passes this output to rev again to restore the original order.
rev Input_file | cut -d'.' -f 1 | rev
You can use awk to print the last field:
awk -F. '{print $NF}' a.txt
-F. sets the field separator to "."
$NF is the last field
And you can give your file directly as an argument, so you can avoid the famous "Useless use of cat"
For other fields, but counting from the last, you can use expressions as suggested in the comment by @sundeep or described in the user's guide under
4.3 Nonconstant Field Numbers. For example, to get the domain before the TLD, you can subtract 1 from the number of fields NF:
awk -F. '{ print $(NF-1) }' a.txt
You might use sed with a quantifier for the grouped value repeated till the end of the string.
( Start group
\.[^[:space:].]+ Match 1 dot and 1+ occurrences of any char except a space or dot
){1} Close the group followed by a quantifier
$ End of string
Example
sed -E 's/(\.[^[:space:].]+){1}$//' file
Output
ok.google
abc.google
If the quantifier is {2} the output will be
ok
abc
Depending on what you want to do with the values, you could use bash to split your domain into an array of its components:
#!/bin/bash
IFS=. read -ra comps <<< "ok.google.com"
echo "${comps[-2]}"
# or for bash < 4.2
echo "${comps[${#comps[#]}-2]}"
google

awk to do group by sum of column

I have this csv file and I am trying to write a shell script to calculate the sum of a column after doing a group by on it. The column number is 11 (STATUS).
My script is
awk -F, 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' $f > $parentdir/outputfile.csv;
The expected output is
COMMITTED 2
but the actual output is just 2.
It prints only the count and not the grouped value. If I delete the other columns and run the same query it works fine, but not with the sample data below.
FILE NAME;SEQUENCE NR;TRANSACTION ID;RUN NUMBER;START EDITCREATION;END EDITCREATION;END COMMIT;EDIT DURATION;COMMIT DURATION;HAS DEPENDENCY;STATUS;DETAILS
Buldhana_Refinesource_FG_IW_ETS_000001.xml;1;4a032127-b20d-4fa8-9f4d-7f2999c0c08f;1;20180831130210345;20180831130429638;20180831130722406;140;173;false;COMMITTED;
Buldhana_Refinesource_FG_IW_ETS_000001.xml;2;e4043fc0-3b0a-46ec-b409-748f98ce98ad;1;20180831130722724;20180831130947144;20180831131216693;145;150;false;COMMITTED;
Change the FS to ; in your script:
awk -F';' 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' file
COMMITTED 2
You're using the wrong field separator. Use
awk -F\;
The ; must be escaped (or quoted) so the shell does not interpret it. Apart from that, your approach seems OK.
Besides awk, you may also use
tail -n +2 $f | cut -f11 -d\; | sort | uniq -c
or
datamash --header-in -t \; -g 11 count 11 < $f
to do the same thing.

Counting the number of names in a category in a .csv with bash

I would like to count the number of students in a .csv file depending on the category
Category 1 is the name, Category 2 is the country, Category 3 is the city
The .csv file looks like this:
michael_s;jpa;NYC
john_d;chn;TXS
jim_h;usa;POP
I have tried this in my .sh script but it didn't work:
sort -k3 -t; students.csv
edit:
I am trying to make a bash script that counts students by city, and that can also report just one city when the script is executed like this:
cat students.csv | ./script.sh NYC
so that the terminal only displays the students from NYC.
If I've understood you correctly, something like this?
cut -d";" -f3 mike.txt | sort | uniq -c
(Sorry, incorrect solution first time - updated now)
To count only one city:
cut -d";" -f3 mike.txt | grep "NYC" | wc -l
Depending on the size of the file, how often you'll be doing this, etc., it may be sensible to look at other solutions, e.g. awk. But this solution will work just fine.
The reason for the error message "sort: multi-character tab 'students.csv'" is you haven't given the -t option the separator character. If you add a semicolon after -t, the sort will work as expected:
sort -k3 -t';' students.csv
There is always awk:
$ awk -F\; 'a[$1]++==0{c++}END{print c}' file
3
Once you describe your requirements more thoroughly (count the names, but sort -k3? Please update the OP), we can help you better.
Edited to match your update:
$ awk -F\; -v col=3 -v val=NYC '
(length(val) && $col==val) || length(val)==0 && a[$col]++==0 {
c++
}
END { print c }
' file
1
If you set -v val= to the value you are looking for and -v col= to the column number, it counts the occurrences of val in col. If you set col but not val, it counts the distinct values in col.
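If you also want the interface from the edit (cat students.csv | ./script.sh NYC), a minimal sketch of such a wrapper could look like the following; the name script.sh and the semicolon-separated three-column layout are taken from the question, everything else is just one possible way to do it:
#!/bin/bash
# script.sh - reads the semicolon-separated student list on stdin.
# With a city argument: print only the students from that city.
# With no argument: print a count of students per city.
city="$1"
if [ -n "$city" ]; then
    awk -F';' -v c="$city" '$3 == c'
else
    cut -d';' -f3 | sort | uniq -c
fi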

How to use grep -c to count occurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from a certain city there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time-consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is an output with every result (just the values) given by grep in a column, so I can copy it directly to a spreadsheet. E.g.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of the cities (I assume the file is structured, as it's a CSV file).
For example, to see which shells are used how often on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accept the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file

how awk takes the result of a unix command as a parameter?

Say there is an input file with tab-delimited fields, where the first field is an integer
1 abc
1 def
1 ghi
1 lalala
1 heyhey
2 ahb
2 bbh
3 chch
3 chchch
3 oiohho
3 nonon
3 halal
3 whatever
First, I need to compute the counts of the unique values in the first field, which will be:
5 for 1, 2 for 2, and 6 for 3
Then I need to find the max of these counts; in this case, it's 6.
Now I need to pass "6" to another awk script as a parameter.
I know I can use the command below to get a list of counts:
cut -f1 input.txt | sort | uniq -c | awk -F ' ' '{print $1}' | sort
but how do I get the first count number and pass it to the next awk command as a parameter, not as an input file?
This is nothing very specific to awk.
Either a program can read from stdin, in which case you can pass the input with a pipe:
prg1 | prg2
or the program expects its input as a parameter, in which case you use
prg2 $(prg1)
Note that in both cases prg1 runs before prg2.
Some programs allow both possibilities, though a huge amount of data is rarely passed as an argument.
This AWK script replaces your whole pipeline:
awk -v parameter="$(awk '{a[$1]++} END {for (i in a) {if (a[i] > max) {max = a[i]}}; print max}' inputfile)" '{print parameter}' otherfile
where '{print parameter}' is a stand-in for your other AWK script and "otherfile" is the input for that script.
Note: it is extremely likely that the two AWK scripts could be combined into one, which would be less of a hack than the approach outlined in your question (awk feeding awk).
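For instance, a minimal sketch of such a combined script, where the second block is again only a stand-in for whatever the real script does with otherfile:
awk 'NR == FNR { if (++a[$1] > max) max = a[$1]; next }
     { print max }' inputfile otherfile
The NR == FNR block runs only while the first file is read and computes the maximum count; the second block then processes otherfile with max already available.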
You can use the shell's $() command substitution:
awk -f script -v num=$(cut -f1 input.txt | sort | uniq -c | awk -F ' ' '{print $1}' | sort -n | tail -1) < input_file
(I added sort -n | tail -1 at the end to ensure that only the single largest count is used.)
