uniq -u with specific column - shell

I want to print only those lines whose value in a particular column appears exactly once. In the example below, val2 and val3 appear only once.
Input
val1,1
val2,2
val1,3
val3,4
Output
val2,2
val3,4
uniq -u does not seem to have an option for specifying a column. I also tried sort -t, -k1,1 -u, but that keeps one row per key instead of discarding the keys that appear more than once.

awk -F, '{c[$1]++; t[$1]=$0} END {for(k in c) {if (c[k]==1) print t[k]}}'
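Here c[$1] counts how often each first field occurs and t[$1] remembers the last full line for it; the END loop prints only the keys counted exactly once, though in awk's arbitrary hash order. If you need the matching lines in their original order and the input is a regular file rather than a pipe, a two-pass sketch (input.csv stands for whatever file holds the data) counts in the first pass and prints in the second:

awk -F, 'NR==FNR {c[$1]++; next} c[$1]==1' input.csv input.csv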

Sounds like a problem for awk. Assume that the command that produces
val1,1
val2,2
val1,3
val3,2
is called foo; then pipe it into awk like so:
foo | awk -F, '$2 == 2 {print}'

Related

sort and uniq a file with a specific column and only keep 1st value from given file

I have a sample file:
$ cat a.csv
a,1,c
b,1,d
d,3,a
s,2,c
a,3,s
Required
a,1,c
s,2,c
a,3,s
It must remove all the other rows with a duplicate value in the second column and keep only the first value.
After sort and uniq it should give:
a,1,c
s,2,c
a,3,s
I tried sort -k2 -n a.csv but it gave me this result:
a,1,c
a,3,s
b,1,d
d,3,a
s,2,c
When I tried sort -k2 -n a.csv | uniq -d I got an empty result.
$ sort -t, -u -k2,2 a.csv
a,1,c
s,2,c
d,3,a
-t, to specify , as delimiter
-u to get only unique entries
-k2,2 use second column as criteria for sorting
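Note that the output above keeps d,3,a rather than the required a,3,s for key 3; which record survives a sort -u within a key group is implementation-dependent. To keep the first record per key in the original file order deterministically, a one-pass awk sketch (for this sample it prints a,1,c, d,3,a and s,2,c):

awk -F, '!seen[$2]++' a.csv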
Another in awk:
$ awk -F, '{if(!($2 in a)||$0<a[$2])a[$2]=$0}END{for(i in a)print a[i]}' file
Output (in awk default order):
a,1,c
s,2,c
a,3,s
Explained:
$ awk -F, '                                # fields comma-separated
{
    if(!($2 in a) || $0<a[$2])             # if $2 unseen or record < stored record
        a[$2]=$0                           # store it to a hash
}
END {                                      # after processing the file
    # PROCINFO["sorted_in"]="@ind_num_asc" # sort output on $2 if using GNU awk
    for(i in a)                            # iterate all stored instances in a
        print a[i]                         # and output
}' file
The output order will be the awk default, i.e. it may appear random. If you want the output sorted, pipe it to sort, or, if you are using GNU awk, uncomment the PROCINFO["sorted_in"]="@ind_num_asc" line in the explained version (or add it to the one-liner).
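If you are not on GNU awk, a portable sketch is to leave the awk output unsorted and sort it afterwards on the second field:

awk -F, '{if(!($2 in a)||$0<a[$2])a[$2]=$0} END{for(i in a)print a[i]}' file | sort -t, -k2,2n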

Awk to (random) sample a file by id-uniques criteria

I'm learning awk to read a big file whose format is similar to this MasterFile:
Beth|4.00|0|
Dan|3.75|0|
Kathy|4.00|10|
Mark|5.00|20|
Mary|5.50|22|
Susie|4.25|18|
Jise|5.62|0|
Mark|5.60|23.3|
Mary|8.50|42|
Susie|8.75|8.8|
Jise|3.62|0.8|
Beth|3.21|10|
Dan|8.39|20|
I would like to take a random sample of K unique values from the first column, where I choose K.
What I have done is the following: I select the unique values from the first column and save them as IDfile.txt. Then I take K random values from that file and match them against the MasterFile. I mean:
awk -F\| 'BEGIN{srand()}{print rand() " " $0}' IDfile | sort -n | tail -n K| awk -F'[[:blank:]|]+' 'BEGIN{OFS="|"}{$1="";sub(/\|/,"")}'1>tmp | awk -F\| 'NR==FNR{a[$1];next} {for (i in a) if(index($0,i)) print $0}' tmp MasterFile
But the output has repeated values. The result I'd like to get is (assuming K=3):
Beth|4.00|0|
Mark|5.60|23.3|
Mary|5.50|22|
I know that my code is far from efficient (or nice) and I'm open to suggestions.
Thanks!
This is one of the right ways to do it:
$ sort -t'|' -u -k1,1 file | shuf -n3
Mark|5.00|20|
Kathy|4.00|10|
Jise|5.62|0|
Change -n3 to whatever number of unique entries you need.
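shuf is part of GNU coreutils, so it may be missing on e.g. a stock macOS. A rough equivalent sketch of the same idea decorates each line with a random key, sorts on it, and strips it again:

sort -t'|' -u -k1,1 file | awk 'BEGIN{srand()} {print rand() "\t" $0}' | sort -n | head -n 3 | cut -f2-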

awk or shell command to count occurrence of value in 1st column based on values in 4th column

I have a large file with records like below:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to find the number of people (names in column 1) that have both apple and oranges. The command should use as little memory as possible and should be fast. Any help appreciated!
Output:
awk/sed file => 2 (jon and tom)
Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks to see if the name also (already) has an entry for oranges and prints it if it has; likewise and symmetrically, if it encounters an orange entry for the first time for a given name, it checks to see if the name also has an entry for apple and prints it if it has.
As noted by Sundeep in a comment, it could use the in operator:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in within the END loop.
Note that all these solutions could be embedded in a script that would accept data from standard input (a pipe or a redirected file), since they have no need to read the input file twice. You'd replace data with "$@" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
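A minimal sketch of such a wrapper (the script name and file layout are hypothetical):

#!/bin/sh
# Print names that have both apple and oranges.
# Reads the files named as arguments, or standard input if none are given.
awk -F, '$4 == "apple"   { apple[$1]++ }
         $4 == "oranges" { orange[$1]++ }
         END { for (name in apple) if (name in orange) print name }' "$@"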
With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice.
In the first pass, it adds the key to an array if the last field is apple (-F, sets , as the input field separator).
In the second pass, it checks whether the last field is oranges and the first field is a key of array a.
To print only the number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk, for details on two-file processing and awk idioms.
I did a workaround using only the grep and comm commands.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > "names.having.both.apple&orange"
comm -12 shows only the common names between the 2 files. (The output file name must be quoted; an unquoted & would send comm to the background and try to run orange as a command.)
The solution from Jonathan also worked.
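The same idea also works without temporary files, assuming bash for process substitution; anchoring the patterns to the end of the line and deduplicating with sort -u guards against partial matches and repeated fruit entries per name (a sketch):

comm -12 <(grep ',apple$' file | cut -d, -f1 | sort -u) \
         <(grep ',oranges$' file | cut -d, -f1 | sort -u)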
For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an input file where lines are not ordered by the 1st column and where each person can have two or more repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like the below should work for you, though without any sample data I can't say for sure (try to always include sample data in questions on SO):
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 checks whether the 3rd field, given that delimiter, equals 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could modify it like the following, which additionally tells awk to print the whole line when the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can also do it with bash only:
cat x | while read -r y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. The IFS (input field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If they are equal, the first field of the array is printed. Remember that array indices start counting at zero.
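A somewhat cleaner sketch of the same loop lets read do the splitting itself; this assumes bash and exactly three tab-separated columns:

while IFS=$'\t' read -r col1 _ col3; do
    [ "$col3" = "8" ] && printf '%s\n' "$col1"
done < file1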

bash sort on multiple fields and deduplicating

I want to sort data like the content below on the first field and then on the date in the third field, then keep only the latest entry for each ID (field 1), irrespective of the second field.
id1,description1,2013/11/20
id2,description2,2013/06/11
id2,description3,2012/10/28
id2,description4,2011/12/04
id3,description5,2014/02/09
id3,description6,2013/12/05
id4,description7,2013/12/05
id5,description8,2013/08/14
So the expected output will be
id1,description1,2013/11/20
id2,description2,2013/06/11
id3,description5,2014/02/09
id4,description7,2013/12/05
id5,description8,2013/08/14
Thanks, Jomon
You can use this awk:
> cat file
id1,description1,2013/11/20
id1,description1,2013/11/19
id2,description2,2013/06/11
id2,description3,2012/10/28
id2,description4,2011/12/04
id3,description5,2014/02/09
id3,description6,2013/12/05
id4,description7,2013/12/05
id5,description8,2013/08/14
> sort -t, -k1,1 -k3,3r file | awk -F, '!a[$1]++'
id1,description1,2013/11/20
id2,description2,2013/06/11
id3,description5,2014/02/09
id4,description7,2013/12/05
id5,description8,2013/08/14
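The !a[$1]++ pattern is a common awk idiom: a[$1] is 0 (false) the first time a key appears, so the negated test succeeds and the default action prints the line; the post-increment then marks the key as seen. A spelled-out equivalent sketch:

sort -t, -k1,1 -k3,3r file | awk -F, '{ if (a[$1] == 0) print; a[$1]++ }'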
Call sort twice; the first time, sort by the date. On the second call, sort uniquely on the first field, but do so stably so that items with the same id remain sorted by date.
sort -t, -k3,3r data.txt | sort -t, -su -k1,1
Try this:
cat file | sort -r -u | awk -F, '{if(map[$1] == ""){print $0; map[$1]="printed"}}'
Explanation:
I use sort -r -u to sort in reverse order (well, could not be more simple), so for each id the latest date comes first; without -r the earliest date would survive instead.
And I use awk to store in a map whether the first column's item was already printed.
If not (map[$1] == ""), I print the line and store "printed" in map[$1] (so next time it won't be equal to "" for the current value of $1). Pipe the result through sort once more if you need the ids in ascending order again.
