Awk is overwriting letters when printing reversed order, why? - bash

I'm currently using awk to replicate the function uniq -c with commas as delimiters.
This gives correct output:
$ cut --delimiter=, -s -f2 wordlist.csv | awk '{ cnts[$0] += 1 } END { for (v in cnts) print cnts[v], v}' OFS="," | head
2,laecherlichen
111,doctrine
1,cremonas
1,embedding
1,conincks
2,similiter
1,mitgesellen
1,hysnelement
1,geringem
1,aquarian
However, if I reverse the awk command print cnts[v], v into print v, cnts[v], I get a messed up output:
$ cut --delimiter=, -s -f2 wordlist.csv | awk '{ cnts[$0] += 1 } END { for (v in cnts) print v, cnts[v]}' OFS="," | head
,2echerlichen
,111rine
,1emonas
,1bedding
,1nincks
,2militer
,1tgesellen
,1snelement
,1ringem
,1uarian
I'm confused by this output, because I'm expecting something like word,1 as output. What is the problem?

Most likely you have DOS line feed characters i.e. \r before end of line \n. You can use RS variable in awk to ignore this:
cut --delimiter=, -s -f2 wordlist.csv | awk -v RS='\r|\n' '{
cnts[$0] += 1 } END { for (v in cnts) print cnts[v], v}' OFS="," | head
However if you show your csv file I believe even cut and head can be removed from above commands.
PS: Thanks to #Bammar you can also run:
dos2unix file.csv
to convert your csv file to unix compatible file.

Related

Extracting unique columns from a file into a comma separated list with a particular order

I have a .csv file with these values
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
and I want to make a file with comma separated rows in sorted order except "negative" always is at the end.
So I want
["brand","positive","product","negative"]
I was not able to automate this process so what I did was
awk -F ',' '{print $1}' file.csv | sort | uniq -c > file2.txt
awk '{if(NR>1) printf ", ";printf("\"%s\"",$0)} END {print ""}' file2.txt > file3.txt
I get "brand","negative","positive","product"
Then I manually move "negative" to the end and also append [ and ] to front and back to get
["brand","positive","product","negative"]
Is there a way to make it more efficient and automate the process?
another solution with easy to understand steps
$ awk -F, '{print ($1=="negative"?1:0) "\t\"" $1 "\""}' file | # mark negatives
sort | cut -f2 | uniq | # sort, cut, uniq
paste -sd, | sed 's/^/[/;s/$/]/' # serialize, add brackets
["brand","positive","product","negative"]
Here is a single gnu awk command to make it work:
awk -F, '{
a[$1] = ($1 == "negative" ? "~" : "") $1
}
END {
n = asort(a)
printf "["
for (i = 1; i <= n; i++) {
sub(/^~/, "", a[i])
printf "\"%s\"%s", a[i], (i < n ? ", " : "]\n")
}
}' file.csv
["brand", "positive", "product", "negative"]
There are lots of ways to approach this. Do you really want the result as what looks like a JSON array, with square brackets and quotation marks around the column names? If so, then jq is probably a good tool to use to generate it. Something like this will do it all as a single jq program:
jq -csR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)' file.csv
Which outputs this:
["brand","positive","product","negative"]
If you just want the headings separated by commas in a line without the other punctuation, suitable for heading up a CSV file, you can use more traditional text-manipulation commands:
cut -d, -f1 file.csv |
sed 's/negative/zzz&/' |
sort -u |
sed 's/zzz//' |
paste -d, -s -
Or you can slightly modify the jq command by adding the -r flag and another pipe at the end:
jq -csrR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)|
join(",")' file.csv
Either of which outputs this:
brand,positive,product,negative
Using Perl one-liner
$ cat unique.txt
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
$ perl -F, -lane ' { $x=$F[0];$x=~s/^(negative)/z\1/g;$rating{$x}++ } END {$q="\x22";$y=join("$q,$q",sort keys %rating) ; $y=~s/${q}z/$q/g; print "[$q$y$q]" }' unique.txt
["brand","positive","product","negative"]
$
This worked for me:
cut -d, -f1 file.csv | sort -u | sed "/^negative/d" | tr '\n' ',' | sed -e 's/^/["/' -e 's/,/","/g' -e 's/$/negative"]/'

Print common values in columns using bash

I have file with two columns
apple apple
ball cat
cat hat
dog delta
I need to extract values that are common in two columns (occur in both columns) like
apple apple
cat cat
There is no ordering in items in each column.
Could you please try following and let me know if this helps you.
awk '
{
col1[$1]++;
col2[$2]++;
}
END{
for(i in col1){
if(col2[i]){
while(++count<=(col1[i]+col2[i])){
printf("%s%s",i,count==(col1[i]+col2[i])?ORS:OFS)}
count=""}
}
}' Input_file
NOTE: It will print the values if found in both the columns exactly number of times they are occurring in both the columns too.
$ awk '{a[$1];b[$2]} END{for(k in a) if(k in b) print k}' file
apple
cat
to print the values twice change to print k,k
with sort/join
$ join <(cut -d' ' -f1 file | sort) <(cut -d' ' -f2 file | sort)
apple
cat
perhaps,
$ function f() { cut -d' ' -f"$1" file | sort; }; join <(f 1) <(f 2)
Assuming I can use unix commands:
cut -d' ' -f2 fil | egrep `cut -d' ' -f1 < fil | paste -sd'|'` -
Basically what this does is this:
The second cut command collects all the words in the first column. The paste command joins them with a pipe (i.e. dog|cat|apple).
The first cut command takes the second column of words in the list and pipes them into a regexp-enabled egrep command.
Here is the closest I could get. Maybe you could loop through whole file and print when it reaches another occurrence.
Code
cat file.txt | gawk '$1==$2 {print $1,"=",$2}'
or
gawk '$1==$2 {print $1,"=",$2}' file.txt

uniq sort parsing

I have one file with field separated by ";", like this:
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
I would like to identify all non-unique IP addresses (3rd column). With the example the extract should be:
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
Sorted by IP address (3rd column).
Ssing simple commands like cat, uniq, sort, awk (not Perl, not Python, only shell).
Any idea?
$ awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file|sort -t";" -k3
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
awk picks all duplicated ($3) lines
sort sorts by ip
You can also try this solution using grep, cut, sort, uniq, and a casual process substitution in the middle.
grep -f <(cut -d ';' -f3 file | sort | uniq -d) file | sort -t ';' -k3
It is not really elegant (I actually prefer the awk answer given above), but I think worth sharing, since it accomplishes what you want.
here is another awk assisted pipeline
$ awk -F';' '{print $0 "\t" $3}' file | sort -sk2 | uniq -Df1 | cut -f1
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
single pass, so special caching; also keeps the original order (stable sorting). Assumes tab doesn't appear in the fields.
This is very similar to Kent's answer, but with a single pass through the file. The tradeoff is memory: you need to store the lines to keep. This uses GNU awk for the PROCINFO variable.
awk -F';' '
{count[$3]++; lines[$3] = lines[$3] $0 ORS}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (key in count)
if (count[key] > 1)
printf "%s", lines[key]
}
' file
The equivalent perl
perl -F';' -lane '
$count{$F[2]}++; push #{$lines{$F[2]}}, $_
} END {
print join $/, #{$lines{$_}}
for sort grep {$count{$_} > 1} keys %count
' file
awk + sort + uniq + cut:
$ awk -F ';' '{print $0,$3}' <file> | sort -k2 | uniq -D -f1 | cut -d' ' -f1
sort + awk
$ sort -t';' -k3,3 | awk -F ';' '($3==k){c++;b=b"\n"$0}($3!=k){if (c>1) print b;c=1;k=$3;b=$0}END{if(c>1)print b}
awk
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{for (i in k) if(k[i]>1) for(j=1;j<=k[i];j++) print b[i"_"j] } <file>
This buffers the full file (same as sort does) and keeps track how many times a key k is appearing. At the end, if the key appears more then ones, print the full set.
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
If you want it sorted :
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{ asorti(k,l);
for (i in l) if(k[l[i]]>1) for(j=1;j<=k[l[i]];j++) print b[l[i]"_"j] } <file>

awk command fails with command substitution

Running this command fails:
$(printf "awk '{%sprint}'" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')) file.txt
It gives the following error:
awk: '
awk: ^ invalid char ''' in expression
However, if I run the substituted command, I get this:
awk '{gsub("ACB",1);gsub("ASW",2);gsub("BEB",3);gsub("CDX",4);gsub("CEU",5);gsub("CHB",6);gsub("CHS",7);gsub("CLM",8);gsub("ESN",9);gsub("FIN",10);gsub("GBR",11);gsub("GIH",12);gsub("GWD",13);gsub("IBS",14);gsub("ITU",15);gsub("JPT",16);gsub("KHV",17);gsub("LWK",18);gsub("MSL",19);gsub("MXL",20);gsub("PEL",21);gsub("PJL",22);gsub("PUR",23);gsub("STU",24);gsub("TSI",25);gsub("YRI",26);print}'
which I can run like so:
awk '{gsub("ACB",1);gsub("ASW",2);gsub("BEB",3);gsub("CDX",4);gsub("CEU",5);gsub("CHB",6);gsub("CHS",7);gsub("CLM",8);gsub("ESN",9);gsub("FIN",10);gsub("GBR",11);gsub("GIH",12);gsub("GWD",13);gsub("IBS",14);gsub("ITU",15);gsub("JPT",16);gsub("KHV",17);gsub("LWK",18);gsub("MSL",19);gsub("MXL",20);gsub("PEL",21);gsub("PJL",22);gsub("PUR",23);gsub("STU",24);gsub("TSI",25);gsub("YRI",26);print}' file.txt
And it works perfectly. What am I doing wrong?
#ChrisLear gave me a working solution, but I still don't quite understand what the command solution is doing. Here's the working code:
$(printf "awk {%sprint}" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')) file.txt
The single quotes around {%sprint} are removed. Why do those single quotes break the command substitution?
edit: changed backtick to $(...) notation. Also added solution I don't understand.
Try removing the quotes from the command being generated.
`printf "awk {%sprint}" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')` file.txt
For an explanation, see the accepted answer at Why does command substitution change how quoted arguments work?
It looks like you're trying to take a bunch of unique 2nd fields from a file starting at line 2 and map those to numbers based on their alphabetic ordering, then apply the change to the same file. If so then with GNU awk for sorted_in and inplace editing that'd be:
awk -i inplace '
NR==FNR {
if (NR>1) {
map[$2]
}
next
}
FNR==1 {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (str in map) {
map[str] = ++i
}
}
{
$2 = map[$2]
print
}
' file.txt
If that's not what you need then edit your question to show concise, testable sample input and expected output.

Why uniq -c output with space instead of \t?

I use uniq -c some text file.
Its output like this:
123(space)first word(tab)other things
2(space)second word(tab)other things
....
So I need extract total number(like 123 and 2 above), but I can't figure out how to, because if I split this line by space, it will like this ['123', 'first', 'word(tab)other', 'things'].
I want to know why doesn't it output with tab?
And how to extract total number in shell? ( I finally extract it with python, WTF)
Update: Sorry, I didn't describe my question correctly. I didn't want to sum the total number, I just want to replace (space) with (tab), but it doesn't effect the space in words, because I still need the data after. Just like this:
123(tab)first word(tab)other things
2(tab)second word(tab)other things
Try this:
uniq -c | sed -r 's/^( *[^ ]+) +/\1\t/'
Try:
uniq -c text.file | sed -e 's/ *//' -e 's/ /\t/'
That will remove the spaces prior to the line count, and then replace only the first space with a tab.
To replace all spaces with tabs, use tr:
uniq -c text.file | tr ' ' '\t'
To replace all continuous runs of tabs with a single tab, use -s:
uniq -c text.file | tr -s ' ' '\t'
You can sum all the numbers using awk:
awk '{s+=$1}END{print s}'
$ cat <file> | uniq -c | awk -F" " '{sum += $1} END {print sum}'
One possible solution to getting tabs after counts is to write a uniq -c-like script that formats exactly how you want. Here's a quick attempt (that seems to pass my minute or so of testing):
awk '
(NR == 1) || ($0 != lastLine) {
if (NR != 1) {
printf("%d\t%s\n", count, lastLine);
}
lastLine = $0;
count = 1;
next;
}
{
count++;
}
END {
printf("%d\t%s\n", count, lastLine);
}
' yourFile.txt
Another solution. This is equivalent to the earlier sed solution, but it does use awk as requested / tagged!
cat yourFile.txt \
| uniq -c \
| awk '{
match($0, /^ *[^ ]* /);
printf("%s\t%s\n", $1, substr($0, RLENGTH + 1));
}'
Based on William Pursell answer , if you like Perl compatible regular expressions (PCRE) maybe a more elegant and modern way would be
perl -pe 's/ *(\d+) /$1\t/'
Options are to execute (-e) and print (-p).

Resources