Why does uniq -c output a space instead of \t? - shell

I run uniq -c on a text file.
Its output looks like this:
123(space)first word(tab)other things
2(space)second word(tab)other things
....
I need to extract the count (like 123 and 2 above), but I can't figure out how, because if I split the line by spaces it ends up like ['123', 'first', 'word(tab)other', 'things'].
I want to know why it doesn't output a tab instead.
And how do I extract the count in shell? (I finally extracted it with Python, WTF.)
Update: Sorry, I didn't describe my question correctly. I don't want to sum the counts; I just want to replace the (space) after the count with a (tab), without affecting the spaces inside the words, because I still need the data after it. Like this:
123(tab)first word(tab)other things
2(tab)second word(tab)other things

Try this:
uniq -c | sed -r 's/^( *[^ ]+) +/\1\t/'
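The capture group keeps the leading padding and the count, and only the run of spaces after the count becomes a tab. A quick check with a made-up line shaped like uniq -c output:
printf '    123 first word\tother things\n' | sed -r 's/^( *[^ ]+) +/\1\t/'
# gives:    123<TAB>first word<TAB>other things   (leading padding is kept)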

Try:
uniq -c text.file | sed -e 's/ *//' -e 's/ /\t/'
That will remove the spaces prior to the line count, and then replace only the first space with a tab.
To replace all spaces with tabs, use tr:
uniq -c text.file | tr ' ' '\t'
To additionally squeeze each run of the resulting tabs into a single tab, add -s:
uniq -c text.file | tr -s ' ' '\t'
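A quick demonstration of the caveat with a made-up line shaped like the question's output: the space inside the words becomes a tab too, which is exactly what the update above wants to avoid (the sed answers keep it intact):
printf '      2 second word\tother things\n' | tr -s ' ' '\t'
# gives: <TAB>2<TAB>second<TAB>word<TAB>other<TAB>things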

You can sum all the numbers using awk:
awk '{s+=$1}END{print s}'

$ cat <file> | uniq -c | awk -F" " '{sum += $1} END {print sum}'

One possible solution to getting tabs after counts is to write a uniq -c-like script that formats exactly how you want. Here's a quick attempt (that seems to pass my minute or so of testing):
awk '
(NR == 1) || ($0 != lastLine) {
    if (NR != 1) {
        printf("%d\t%s\n", count, lastLine);
    }
    lastLine = $0;
    count = 1;
    next;
}
{
    count++;
}
END {
    printf("%d\t%s\n", count, lastLine);
}
' yourFile.txt
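A quick sanity check, assuming the program between the quotes is saved as tabcount.awk (a made-up name) and fed already-sorted input:
printf 'a\na\nb\n' | awk -f tabcount.awk
# 2<TAB>a
# 1<TAB>b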

Another solution. This is equivalent to the earlier sed solution, but it does use awk as requested / tagged!
cat yourFile.txt \
  | uniq -c \
  | awk '{
      match($0, /^ *[^ ]* /);
      printf("%s\t%s\n", $1, substr($0, RLENGTH + 1));
    }'

Based on William Pursell's answer: if you like Perl-compatible regular expressions (PCRE), a more elegant and modern way might be
perl -pe 's/ *(\d+) /$1\t/'
Options are to execute (-e) and print (-p).
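For example, on a made-up line shaped like the question's output:
printf '    123 first word\tother things\n' | perl -pe 's/ *(\d+) /$1\t/'
# gives: 123<TAB>first word<TAB>other things   (the leading padding is dropped here)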

Related

Extracting unique columns from a file into a comma separated list with a particular order

I have a .csv file with these values
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
and I want to make a file with comma-separated rows in sorted order, except that "negative" always comes at the end.
So I want
["brand","positive","product","negative"]
I was not able to automate this process, so what I did was:
awk -F ',' '{print $1}' file.csv | sort | uniq -c > file2.txt
awk '{if(NR>1) printf ", ";printf("\"%s\"",$0)} END {print ""}' file2.txt > file3.txt
I get "brand","negative","positive","product"
Then I manually move "negative" to the end, and also add [ at the front and ] at the back, to get
["brand","positive","product","negative"]
Is there a way to make it more efficient and automate the process?
Another solution, with easy-to-understand steps:
$ awk -F, '{print ($1=="negative"?1:0) "\t\"" $1 "\""}' file | # mark negatives
sort | cut -f2 | uniq | # sort, cut, uniq
paste -sd, | sed 's/^/[/;s/$/]/' # serialize, add brackets
["brand","positive","product","negative"]
Here is a single gnu awk command to make it work:
awk -F, '{
    a[$1] = ($1 == "negative" ? "~" : "") $1
}
END {
    n = asort(a)
    printf "["
    for (i = 1; i <= n; i++) {
        sub(/^~/, "", a[i])
        printf "\"%s\"%s", a[i], (i < n ? ", " : "]\n")
    }
}' file.csv
["brand", "positive", "product", "negative"]
There are lots of ways to approach this. Do you really want the result as what looks like a JSON array, with square brackets and quotation marks around the column names? If so, then jq is probably a good tool to use to generate it. Something like this will do it all as a single jq program:
jq -csR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)' file.csv
Which outputs this:
["brand","positive","product","negative"]
If you just want the headings separated by commas in a line without the other punctuation, suitable for heading up a CSV file, you can use more traditional text-manipulation commands:
cut -d, -f1 file.csv |
sed 's/negative/zzz&/' |
sort -u |
sed 's/zzz//' |
paste -d, -s -
Or you can slightly modify the jq command by adding the -r flag and another pipe at the end:
jq -csrR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)|
join(",")' file.csv
Either of which outputs this:
brand,positive,product,negative
Using a Perl one-liner:
$ cat unique.txt
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
$ perl -F, -lane ' { $x=$F[0];$x=~s/^(negative)/z\1/g;$rating{$x}++ } END {$q="\x22";$y=join("$q,$q",sort keys %rating) ; $y=~s/${q}z/$q/g; print "[$q$y$q]" }' unique.txt
["brand","positive","product","negative"]
$
This worked for me:
cut -d, -f1 file.csv | sort -u | sed "/^negative/d" | tr '\n' ',' | sed -e 's/^/["/' -e 's/,/","/g' -e 's/$/negative"]/'

uniq sort parsing

I have one file with fields separated by ";", like this:
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
I would like to identify all non-unique IP addresses (3rd column). With the example above, the extract should be:
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
Sorted by IP address (3rd column), using simple commands like cat, uniq, sort and awk (not Perl, not Python, only shell).
Any idea?
$ awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file|sort -t";" -k3
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
awk keeps all lines whose IP ($3) is duplicated
sort sorts the result by IP
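If the NR==FNR two-pass idiom is unfamiliar, here is the same program spelled out with comments (identical logic, only expanded):
awk -F';' '
NR==FNR {       # first read of the file: FNR (per-file line number) equals NR
    a[$3]++     # count how often each IP (3rd field) occurs
    next        # skip the second block during the counting pass
}
a[$3]>1         # second read: print lines whose IP was counted more than once
' file file | sort -t";" -k3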
You can also try this solution using grep, cut, sort, uniq, and a casual process substitution in the middle.
grep -f <(cut -d ';' -f3 file | sort | uniq -d) file | sort -t ';' -k3
It is not really elegant (I actually prefer the awk answer given above), but I think it is worth sharing, since it accomplishes what you want.
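With the sample data above (assuming it is in a file named file), the inner command of the process substitution expands to just the duplicated IPs, which grep -f then uses as its patterns:
cut -d ';' -f3 file | sort | uniq -d
# 10.10.10.10
# 10.10.13.11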
Here is another awk-assisted pipeline:
$ awk -F';' '{print $0 "\t" $3}' file | sort -sk2 | uniq -Df1 | cut -f1
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
Single pass, no special caching; it also keeps the original order (stable sort). It assumes a tab doesn't appear in the fields.
This is very similar to Kent's answer, but with a single pass through the file. The tradeoff is memory: you need to store the lines to keep. This uses GNU awk for the PROCINFO variable.
awk -F';' '
{count[$3]++; lines[$3] = lines[$3] $0 ORS}
END {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (key in count)
        if (count[key] > 1)
            printf "%s", lines[key]
}
' file
The equivalent Perl:
perl -F';' -lane '
    $count{$F[2]}++; push @{$lines{$F[2]}}, $_
} END {
    print join $/, @{$lines{$_}}
        for sort grep {$count{$_} > 1} keys %count
' file
awk + sort + uniq + cut:
$ awk -F ';' '{print $0,$3}' <file> | sort -k2 | uniq -D -f1 | cut -d' ' -f1
sort + awk
$ sort -t';' -k3,3 <file> | awk -F ';' '($3==k){c++;b=b"\n"$0}($3!=k){if (c>1) print b;c=1;k=$3;b=$0}END{if(c>1)print b}'
awk
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{for (i in k) if(k[i]>1) for(j=1;j<=k[i];j++) print b[i"_"j] }' <file>
This buffers the full file (as sort does) and keeps track of how many times each key k appears. At the end, if a key appears more than once, the full set is printed.
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
If you want it sorted:
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{ n=asorti(k,l);
for (i=1; i<=n; i++) if(k[l[i]]>1) for(j=1;j<=k[l[i]];j++) print b[l[i]"_"j] }' <file>

awk command fails with command substitution

Running this command fails:
$(printf "awk '{%sprint}'" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')) file.txt
It gives the following error:
awk: '
awk: ^ invalid char ''' in expression
However, if I run the substituted command, I get this:
awk '{gsub("ACB",1);gsub("ASW",2);gsub("BEB",3);gsub("CDX",4);gsub("CEU",5);gsub("CHB",6);gsub("CHS",7);gsub("CLM",8);gsub("ESN",9);gsub("FIN",10);gsub("GBR",11);gsub("GIH",12);gsub("GWD",13);gsub("IBS",14);gsub("ITU",15);gsub("JPT",16);gsub("KHV",17);gsub("LWK",18);gsub("MSL",19);gsub("MXL",20);gsub("PEL",21);gsub("PJL",22);gsub("PUR",23);gsub("STU",24);gsub("TSI",25);gsub("YRI",26);print}'
which I can run like so:
awk '{gsub("ACB",1);gsub("ASW",2);gsub("BEB",3);gsub("CDX",4);gsub("CEU",5);gsub("CHB",6);gsub("CHS",7);gsub("CLM",8);gsub("ESN",9);gsub("FIN",10);gsub("GBR",11);gsub("GIH",12);gsub("GWD",13);gsub("IBS",14);gsub("ITU",15);gsub("JPT",16);gsub("KHV",17);gsub("LWK",18);gsub("MSL",19);gsub("MXL",20);gsub("PEL",21);gsub("PJL",22);gsub("PUR",23);gsub("STU",24);gsub("TSI",25);gsub("YRI",26);print}' file.txt
And it works perfectly. What am I doing wrong?
@ChrisLear gave me a working solution, but I still don't quite understand what it is doing. Here's the working code:
$(printf "awk {%sprint}" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')) file.txt
The single quotes around {%sprint} are removed. Why do those single quotes break the command substitution?
Edit: changed backticks to $(...) notation. Also added the solution I don't understand.
Try removing the quotes from the command being generated.
`printf "awk {%sprint}" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')` file.txt
For an explanation, see the accepted answer at Why does command substitution change how quoted arguments work?
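The short version: quotes that come out of an expansion (a command substitution or a variable) are not re-parsed as shell syntax; after word splitting they reach the command as literal characters. A minimal illustration (the variable name is made up):
cmd="awk '{print}'"
printf 'x\n' | $cmd            # awk gets the literal argument '{print}', quotes included,
                               # hence "invalid char ''' in expression"
printf 'x\n' | awk '{print}'   # here the shell removes the quotes before awk runs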
It looks like you're trying to take a bunch of unique 2nd fields from a file starting at line 2 and map those to numbers based on their alphabetic ordering, then apply the change to the same file. If so then with GNU awk for sorted_in and inplace editing that'd be:
awk -i inplace '
NR==FNR {
    if (NR>1) {
        map[$2]
    }
    next
}
FNR==1 {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (str in map) {
        map[str] = ++i
    }
}
{
    $2 = map[$2]
    print
}
' file.txt
If that's not what you need then edit your question to show concise, testable sample input and expected output.

How can I use bash to split only some elements of a text file?

I'm trying to figure out how to turn a .txt file (myGeneFile.txt) of IDs and genes that looks like this:
Probe Set ID Gene Symbol
1007_s_at DDR1 /// MIR4640
1053_at RFC2
117_at HSPA6
121_at PAX8
1255_g_at GUCA1A
1294_at MIR5193 /// UBA7
into this:
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA
First I tried doing this:
cat myGeneFile.txt | tail -n +2 | awk '{split($2,a,"///"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5];}' > test.txt
(i.e., I removed the top (header) line of the file, tried splitting the second field on the delimiter ///, and then printed any genes that might appear)
Then, I tried doing this:
cat myGeneFile.txt | tail -n +2 | awk '{print $2}' | grep -o -E '\w+' > test.txt
(literally listing out all of the words in the second column)
I got the same output in both cases: a long list of just the first gene in each row (so MIR4640 and UBA7 were missing).
Any ideas?
EDIT: Thanks @CodeGnome for your help. I ended up using that code and modifying it, because I discovered that my file had between 1 and 30 different gene names on each row. So I used:
awk 'NR == 1 {next}
     {
       sub("///", "")
       print $2
     }
     {
       for (i=3; i<=30; i++)
         if ($i) {print $i}
     }' myGeneFile.txt > test2.txt
@GlenJackson also had a solution that worked really well:
awk 'NR>1 {for (i=2; i<=NF; i++) if ($i != "///") print $i}' file
My awk take:
awk 'NR>1 {for (i=2; i<=NF; i++) if ($i != "///") print $i}' file
or sed
sed '
1d # delete the header
s/[[:blank:]]\+/ /g # squeeze whitespace
s/^[^ ]\+ // # remove the 1st word
s| ///||g # delete all "///" words
s/ /\n/g # replace spaces with newlines
' file
Use Conditional Print Statements Inside an AWK Action
The following gives the desired output by removing unwanted characters with sub(), and then using multiple print statements to create the line breaks. The second print statement is conditional, and only triggers when the third field isn't empty; this avoids creating extraneous empty lines in the output.
$ awk 'NR == 1 {next}
       {
         sub("///", "")
         print $2
         if ($3) {print $3}
       }' myGeneFile.txt
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA7
This will work:
tail -n+2 tmp | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_'
Here's what is happening:
tail -n+2 Strip off the header
sed -E 's/ +/ /' Condense the whitespace
cut -d' ' -f2- Use cut to select all fields but the first, using a single space as the delimiter
sed 's_ */// *_\n_' Convert all /// (and any surrounding whitespace) to a newline
You don't need the initial cat, it's usually better to simply pass the input file as an argument to the first command. If you want the file name in a place that is easy to change, this is a better option as it avoids the additional process (and I find it easier to change the file if it's at the end):
(tail -n+2 | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_') < tmp
Given the existing input and the modified requirement (from the comment on Morgen's answer), the following should do what you want (for any number of gene columns):
awk 'NR > 1 {
    p=0
    for (i = 2; i <= NF; i++) {
        if ($i == "///") {
            p=1
            continue
        }
        printf "%s%s\n", p ? "\n" : "", $i
    }
}' input.txt
Your criteria for selecting which strings to output are not entirely clear, but here's another command that at least produces your expected output:
tail -n +2 myGeneFile.txt | grep -oE '\<[A-Z][A-Z0-9]*\>'
It basically just 1) skips the first line and 2) finds all other words (delimited by non-word characters and/or start/end of line) that consist entirely of uppercase letters or digits, with the first being a letter.
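Run against the sample myGeneFile.txt above, that should yield DDR1, MIR4640, RFC2, HSPA6, PAX8, GUCA1A, MIR5193 and UBA7, one per line, matching the expected output.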

finding pattern in a file

I have a txt file of 500 rows and one column.
The column in each row looks somewhat like this (as an example I am pasting two rows):
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB,chr22:49368010-49368760_NM_152247_CPT1B,chr22:49368010-49368760_NM_152253_CHKB
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB
What I want to extract from each row are the values starting with NM_ or NR_,
like
row 1 has NR_021492 NM_005198 NM_152247 NM_152253
row 2 has NR_021492 NM_005198
...
in a tab-delimited file.
Any suggestions for a bash command line?
Try:
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g'
Presuming GNU sed.
So
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g' your_file > tab_delimited_file
EDIT: Updated to not leave a trailing tab character on each row.
EDIT 2: Updated again to work for any chr-then-number sequence.
grep "NM" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NM_/'
grep "NR" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NR_/'
cat file|sed s/$.*!(NR)//;
Use a regular expression to remove everything before the NR
awk -F '[,:_-]' '{
    for (i=1; i<NF; i++)
        if ($i == "NR" || $i == "NM")
            printf("%s_%s ", $i, $(i+1))
    print ""
}'
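The program reads standard input as written. A usage sketch, assuming the two sample rows are saved in rows.txt (a made-up name), piping through tr since the question asks for tab-delimited output:
awk -F '[,:_-]' '{
    for (i=1; i<NF; i++)
        if ($i == "NR" || $i == "NM")
            printf("%s_%s ", $i, $(i+1))
    print ""
}' rows.txt | tr ' ' '\t'
# NR_021492<TAB>NM_005198<TAB>NM_152247<TAB>NM_152253<TAB>
# NR_021492<TAB>NM_005198<TAB>
(note the trailing separator on each line)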
This will also work, but will print each match on its own line: egrep -o 'N[RM]_[0-9]+'
