Getting unique values from a column in a CSV file [duplicate] - shell

I have the following input:
no,zadrar,MENTOR,rossana#xt.com,AGRATE
no,mittalsu,MENTOR,rossana#xt.com,GREATER NOIDA
no,abousamr,CADENCE,selim#xt.com,CROLLES
no,lokinsks,MENTOR,sergey#xt.com,CROLLES
no,billys,MENTOR,billy#xt.com,CROLLES
no,basiles1,CADENCE,stephane#xt.com,CASTELLETTO
no,cesaris1,CADENCE,stephane#xt.com,CROLLES
I want to get only the lines where column 4 is unique:
no,abousamr,CADENCE,selim#xt.com,CROLLES
no,lokinsks,MENTOR,sergey#xt.com,CROLLES
no,billys,MENTOR,billy#xt.com,CROLLES
I tried with:
awk -F"," '{print $4}' $vendor.csv | sort | uniq -u
But I get:
selim#xt.com
sergey#xt.com
billy#xt.com

You can simply use the options provided by the sort command:
sort -u -t, -k4,4 file.csv
As you can see in the man page, option -u stands for "unique", -t sets the field delimiter, and -k selects the sort key (here, field 4 only). Note that this keeps one line for every distinct key, so one of the rossana#xt.com lines still survives; if you want only the keys that occur exactly once, use the awk approach below.

Could you please try the following (it reads the Input_file twice):
awk -F',' 'FNR==NR{a[$4]++;next} a[$4]==1' Input_file Input_file
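If reading the file twice is a concern, here is a minimal single-pass sketch; it assumes the whole file fits in memory and, unlike the two-pass version, does not preserve the original line order:
awk -F',' '{cnt[$4]++; line[$4]=$0} END{for (k in cnt) if (cnt[k]==1) print line[k]}' Input_file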

I need to filter only duplicated lines from many files using bash [duplicate]

I have the following three files
filea:
a
bc
cde

fileb:
a
bc
cde
frtdff

filec:
a
bc
cddeeer
erer34
I am able to filter out the lines duplicated across all three files.
I am using the following command
ls file* | wc -l
which returns 3. Then, I am launching
sort file* | uniq --count --repeated | awk '{ if ($1 == 3) { print $2} }'
The last command returns precisely what I need, but only as long as no more files starting with "file" are created.
Since thousands of files may be created while a script is running, I need to get the exact file count from this command
n=`ls file* | wc -l`
sort file* | uniq --count --repeated | awk '{ if ($1 == $n) { print $2} }'
Unfortunately, the variable n is not expanded inside the awk command: the single quotes prevent the shell from substituting it, and awk treats $n as a field reference instead.
My issue is that I am not able to use the value of the shell variable n as the comparison value inside the if condition of the awk command.
You can use:
awk '!line[$0]++' file*
This prints any given line only once, even if it appears in several files or multiple times within the same file.
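That said, the variable problem itself is solved by awk's -v option, which passes a shell value in as an awk variable. A sketch of the original pipeline with the file count passed in (note that print $2 works here only because the lines contain no spaces):
n=$(ls file* | wc -l)
sort file* | uniq --count --repeated | awk -v n="$n" '$1 == n { print $2 }'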

How to iterate through a line and check the needed part? [duplicate]

I have this line
Username:x:120:101:somethingsomething
and I need to get the '101' part after the third ':'. How can I do that?
Do I use grep or sed?
cut -d':' -f4 /etc/passwd
Or with awk, working on just the string:
mstr="Username:x:120:101:somethingsomething"; awk -F: '{print $4}' <<< "$mstr"

How do I get the total number of distinct values in a column in a CSV?

I have a CSV file named test.csv. It looks like this:
1,Color
1,Width
2,Color
2,Height
I want to find out how many distinct values are in the first column. The shell script should return 2 in this case.
I tried running sort -u -t, -k2,2 test.csv, which I saw on another question, but it printed out far more info than I need.
How do I write a shell script that prints the number of distinct values in the first column of test.csv?
Using awk you can do:
awk -F, '!seen[$1]++{c++} END{print c}' file
2
This awk command uses $1 as the key of the array seen. The expression !seen[$1]++ is true only the first time a key is encountered, so the counter c is incremented once per distinct key, and the total is printed in the END block.
Or
cut -d, -f1 file | sort -u | wc -l
Use cut to extract the first column, then sort -u to keep the unique values, then wc -l to count them.
List the first column of the CSV, sort it, keep the unique values, then take the count:
awk -F, '{print $1}' test.csv | sort -u | wc -l
To ignore the header row:
awk -F, 'NR>1{print $1}' test.csv | sort -u | wc -l
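Combining the two ideas above, a sketch that both skips the header and counts the distinct values in a single awk pass:
awk -F, 'NR>1 && !seen[$1]++ {c++} END {print c}' test.csv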

Unique entry set in the first column of all csv files under directory [duplicate]

I have a set of comma-separated files in a directory. There are no headers, and unfortunately the rows are not even all the same length.
I want to find the unique entry in the first column across all files.
What's the quickest way of doing it in shell programming?
awk -F "," '{print $1}' *.txt | uniq
seems to only deduplicate entries within each file, because uniq only collapses adjacent duplicate lines. I want unique entries across all the files.
Shortest is still awk (this prints the whole row):
awk -F, '!a[$1]++' *.txt
To get just the first field:
awk -F, '!a[$1]++ {print $1}' *.txt
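Alternatively, if you don't need the input order preserved, sorting makes deduplication trivial; a sketch using cut, which accepts multiple files directly:
cut -d, -f1 *.txt | sort -u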

Sorting with unix tools and multiple columns

I am looking for the easiest way to solve this problem. I have a huge data set that I cannot load into Excel, in this format:
This is a sentence|10
This is another sentence|5
This is the last sentence|20
What I want to do is sort this from least to greatest based on the number.
cat MyDataSet.txt | tr "|" "\t" | ???
I am not sure what the best way to do this is. I was thinking about using awk to switch the columns and then do a sort, but I was having trouble doing it.
Help me out, please.
sort -t'|' -k2,2n dataset.txt
Should do it: -t sets the field separator, and -k2,2n selects the second field as a numeric sort key.
You usually don't need cat to send the file to a filter. That said, you can use the sort filter.
sort -t "|" -k 2 -n MyDataSet.txt
This sorts the MyDataSet.txt file using the | character as field separator and sorting numerically according to the second field (the number).
Have you tried sort -n? Note that without -t and -k the numeric key is the start of each line, which is not a number here, so this is not reliable in general; the keyed version further down is.
$ sort -n inputFile
This is another sentence|5
This is a sentence|10
This is the last sentence|20
You could switch the columns with awk too:
$ awk -F"|" '{print $2"|"$1}' inputFile
10|This is a sentence
5|This is another sentence
20|This is the last sentence
Combining awk and sort:
$ awk -F"|" '{print $2"|"$1}' inputFile | sort -n
5|This is another sentence
10|This is a sentence
20|This is the last sentence
Per the comments, if you have numbers in the sentence itself, key the sort on the second field:
$ sort -n -t"|" -k2 inputFile
This is another sentence|5
This is a sentence|10
This is the last sentence|20
this is a sentence with a number in it 2|22
And of course you could redirect the output to a new file:
$ awk -F"|" '{print $2"|"$1}' inputFile | sort -n > outFile
Try this sort command:
sort -n -t '|' -k2 file.txt
Sort numerically, set the separator, and take the second field as the key:
sort -n -t'|' -k2 dataset.txt
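One subtlety worth knowing: -k2 makes the sort key run from field 2 to the end of the line, whereas -k2,2 restricts it to field 2 alone. On this data the two behave identically, but limiting the key is the safer habit:
sort -t'|' -k2,2n dataset.txt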
