Switching the format of this output? - bash

I have this script written to print the distribution of words in one or more files:
cat "$@" | tr -cs '[:alpha:]' '\n' |
tr '[:upper:]' '[:lower:]' | sort |
uniq -c | sort -n
Which gives me an output such as:
1 the
4 orange
17 cat
However, I would like to change it so that the word is listed first, not the number (I'm assuming sort would be involved so that it's alphabetical), like so:
cat 17
orange 4
the 1
Is there just a simple option I would need to switch this? Or is it something more complicated?

Pipe the output to
awk '{print $2, $1}'
or you can use awk for the complete task:
{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
usage:
awk -f wordfreq.awk input
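If only the column order needs to change, the first suggestion can be folded into the original pipeline. The following is a minimal sketch rather than the only way to do it: the function name `wordfreq` is made up for illustration, and the `NF == 2` guard is an added assumption to skip the empty "word" that tr can produce when the input starts with a non-letter. Because sort already orders the words before uniq -c counts them, dropping the final sort -n leaves the output alphabetical.

```shell
# wordfreq: hypothetical helper; prints "word count", sorted by word.
# Reads the files given as arguments, or stdin when there are none.
wordfreq() {
    cat "$@" |
        tr -cs '[:alpha:]' '\n' |       # one word per line
        tr '[:upper:]' '[:lower:]' |    # fold case
        sort |                          # group identical words (alphabetical)
        uniq -c |                       # prefix each word with its count
        awk 'NF == 2 { print $2, $1 }'  # swap columns, skip blank entries
}
```

It would be called as `wordfreq file1 file2`, or at the end of a pipe with no arguments.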

Related

how to use values with sed in shell scripting?

I am trying to write a shell script for an alphametic puzzle.
It takes five parameters, like this:
$ alphametic 5790813 BEAR RARE ERE RHYME
and should print
ABEHMRY -> 5790813
I tried this:
#!/bin/bash
echo "$2 $3 $4 $5" | sed 's/ //g ' | sed 's/./&\n/g' | sort -n | sed '/^$/d' | uniq -i > testing
paste -sd '' testing > testing2
sed "s|^\(.*\)$|\1 -> ${1}|" testing2
but I get an error from the last command (sed), and I don't know what the problem is.
Another approach:
chars=$(printf '%s' "${@:2}" | fold -w1 | sort -u | paste -sd '')
echo "$chars -> $1"
sort's -n doesn't make sense here: these are letters, not numbers.
One idea using awk for the whole thing:
arg1="$1"
shift
others="$*"                                    # remaining arguments, joined by spaces
awk -v arg1="${arg1}" -v others="${others}" '
BEGIN { n=split(others,arr,"")                 # split into an array of single characters
        for (i=1;i<=n;i++)                     # loop through indices of arr[] array
            letters[arr[i]]                    # assign characters as indices of letters[] array; eliminates duplicates
        delete letters[" "]                    # delete array index for "<space>"
        PROCINFO["sorted_in"]="@ind_str_asc"   # sort array by index
        for (i in letters)                     # loop through indices
            printf "%s", i                     # print index to stdout
        printf " -> %s\n", arg1                # finish off line with final string
      }
'
NOTE: requires GNU awk for the PROCINFO["sorted_in"] (to sort the indices of the letters[] array)
This generates:
ABEHMRY -> 5790813

Extracting unique columns from a file into a comma separated list with a particular order

I have a .csv file with these values
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
and I want to produce a comma-separated list of the unique first-column values in sorted order, except that "negative" always comes last.
So I want
["brand","positive","product","negative"]
I was not able to automate this process so what I did was
awk -F ',' '{print $1}' file.csv | sort | uniq -c > file2.txt
awk '{if(NR>1) printf ", ";printf("\"%s\"",$0)} END {print ""}' file2.txt > file3.txt
I get "brand","negative","positive","product"
Then I manually move "negative" to the end and also append [ and ] to front and back to get
["brand","positive","product","negative"]
Is there a way to make it more efficient and automate the process?
Another solution, with easy-to-understand steps:
$ awk -F, '{print ($1=="negative"?1:0) "\t\"" $1 "\""}' file | # mark negatives
sort | cut -f2 | uniq | # sort, cut, uniq
paste -sd, | sed 's/^/[/;s/$/]/' # serialize, add brackets
["brand","positive","product","negative"]
Here is a single GNU awk command that does it:
awk -F, '{
a[$1] = ($1 == "negative" ? "~" : "") $1
}
END {
n = asort(a)
printf "["
for (i = 1; i <= n; i++) {
sub(/^~/, "", a[i])
printf "\"%s\"%s", a[i], (i < n ? ", " : "]\n")
}
}' file.csv
["brand", "positive", "product", "negative"]
There are lots of ways to approach this. Do you really want the result as what looks like a JSON array, with square brackets and quotation marks around the column names? If so, then jq is probably a good tool to use to generate it. Something like this will do it all as a single jq program:
jq -csR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)' file.csv
Which outputs this:
["brand","positive","product","negative"]
If you just want the headings separated by commas in a line without the other punctuation, suitable for heading up a CSV file, you can use more traditional text-manipulation commands:
cut -d, -f1 file.csv |
sed 's/negative/zzz&/' |
sort -u |
sed 's/zzz//' |
paste -d, -s -
Or you can slightly modify the jq command by adding the -r flag and another pipe at the end:
jq -csrR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)|
join(",")' file.csv
Either of which outputs this:
brand,positive,product,negative
Using Perl one-liner
$ cat unique.txt
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
$ perl -F, -lane ' { $x=$F[0];$x=~s/^(negative)/z\1/g;$rating{$x}++ } END {$q="\x22";$y=join("$q,$q",sort keys %rating) ; $y=~s/${q}z/$q/g; print "[$q$y$q]" }' unique.txt
["brand","positive","product","negative"]
$
This worked for me:
cut -d, -f1 file.csv | sort -u | sed "/^negative/d" | tr '\n' ',' | sed -e 's/^/["/' -e 's/,/","/g' -e 's/$/negative"]/'
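The one-liner above always appends "negative", even when the input contains no such row. A possible variant that adds it only when it is present can be sketched with POSIX tools. The function name `first_columns` is hypothetical, and the sketch assumes the first field never contains a comma or a double quote:

```shell
# first_columns FILE: hypothetical helper; prints the unique first-column
# values as ["a","b",...], with "negative" moved to the end if it occurs.
first_columns() {
    cut -d, -f1 "$1" | sort -u |
        awk '
            $0 == "negative" { neg = 1; next }   # remember it, emit it last
            { out = out sep "\"" $0 "\""; sep = "," }
            END {
                if (neg) out = out sep "\"negative\""
                print "[" out "]"
            }'
}
```

With the sample file from the question, `first_columns file.csv` should print the same bracketed list as the other answers.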

Counting palindromes in a text file

Having followed the thread "BASH Finding palindromes in a .txt file", I can't figure out what I am doing wrong with my script.
#!/bin/bash
search() {
tr -d '[[:punct:][:digit:]#]' \
| sed -E -e '/^(.)\1+$/d' \
| tr -s '[[:space:]]' \
| tr '[[:space:]]' '\n'
}
search "$1"
paste <(search <"$1") <(search < "$1" | rev) \
| awk '$1 == $2 && (length($1) >=3) { print $1 }' \
| sort | uniq -c
All I'm getting from this script is the whole text file as output. I want to output only palindromes of length >= 3 and count them, such as
425 did
120 non
etc. My text file is called sample.txt, and every time I run the script with: cat sample.txt | source palindrome I get the message 'bash: : No such file or directory'.
Using awk and sed
awk 'function palindrome(str) {len=length(str); for(k=1; k<=len/2+len%2; k++) { if(substr(str,k,1)!=substr(str,len+1-k,1)) return 0 } return 1 } {for(i=1; i<=NF; i++) {if(length($i)>=3){ gsub(/[^a-zA-Z]/,"",$i); if(length($i)>=3) {$i=tolower($i); if(palindrome($i)) arr[$i]++ }} } } END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
Tested on a 1.2GB file; execution time was ~4m 40s (i5-6440HQ @ 2.60GHz/4 cores/16GB)
Explanation :
awk '
function palindrome(str)                    # Function to check for a palindrome
{
    len=length(str);
    for(k=1; k<=len/2+len%2; k++)
    {
        if(substr(str,k,1)!=substr(str,len+1-k,1))
            return 0
    }
    return 1
}
{
    for(i=1; i<=NF; i++)                    # For each field in a record
    {
        if(length($i)>=3)                   # if length>=3
        {
            gsub(/[^a-zA-Z]/,"",$i);        # remove non-alpha characters from it
            if(length($i)>=3)               # Check length again after removal
            {
                $i=tolower($i);             # Convert to lowercase
                if(palindrome($i))          # Check if it is a palindrome
                    arr[$i]++               # and store it in array
            }
        }
    }
}
END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
sed -E '/^[0-9]+ (.)\1+$/d' : From the final result, find the strings composed of a single repeated character, like AAA or BBB, and remove them.
Old Answer (Before EDIT)
You can try below steps if you want to :
Step 1 : Pre-processing
Remove all unnecessary chars and store the result in temp file
tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
tr -dc 'a-zA-Z\n\t ' This will remove all except letters,\n,\t, space
tr ' ' '\n' This will convert space to \n to separate each word in newlines
Step-2: Processing
grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
grep -wof temp <(rev temp) This will give you all palindromes
-w : Select only those lines containing matches that form whole words.
For example : level won't match with levelAAA
-o : Print only the matched group
-f : To use each string in temp file as pattern to search in <(rev temp)
sed -E -e '/^(.)\1+$/d': This will remove words formed of same letters like AAA, BBBBB
awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' : This will filter words having length>=3 and counts their frequency and finally prints the result
Example :
Input File :
$ cat file
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
Output:
$ tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
$ grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
3 dad
3 kayak
3 bob
Just a quick Perl alternative:
perl -0nE 'for( /(\w{3,})/g ){ $a{$_}++ if $_ eq reverse($_)}
END {say "$_ $a{$_}" for keys %a}'
in Perl, $_ should be read as "it".
for( /(\w{3,})/g ) ... for all relevant words (may need some work to reject false positives like "12a21")
if $_ eq reverse($_) ... if it is palindrome
END {say "$_ $a{$_}" for...} ... tell us all the its and its number
\thanks{sokowi,batMan}
Running the Script
The script expects that the file is given as an argument. The script does not read stdin.
Remove the line search "$1" in the middle of the script. It is not part of the linked answer.
Make the script executable using chmod u+x path/to/palindrome.
Call the script using path/to/palindrome path/to/sample.txt. If all the files are in the current working directory, then the command is
./palindrome sample.txt
Alternative Script
Sometimes the linked script works and sometimes it doesn't. I haven't found out why. However, I wrote an alternative script which does the same and is also a bit cleaner:
#! /bin/bash
grep -Po '\w{3,}' "$1" | grep -Evw '(.)\1*' | sort > tmp-words
grep -Fwf <(rev tmp-words) tmp-words | uniq -c
rm tmp-words
Save the script, make it executable, and call it with a file as its first argument.
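Note that matching a word list against its reversed copy with grep -Fwf also reports pairs such as "was" and "saw", which are reverses of each other but not palindromes. A stricter per-word check can be sketched as follows. This is an illustration under assumptions, not a replacement for the answers above: the function name `count_palindromes` is made up, only letters count as word characters, and case is folded before comparing.

```shell
# count_palindromes FILE: hypothetical helper; counts words of 3+ letters
# that read the same backwards, ignoring case and runs of a single letter.
count_palindromes() {
    tr -cs '[:alpha:]' '\n' < "$1" |
        tr '[:upper:]' '[:lower:]' |
        awk 'length($0) >= 3 {
                 rev = ""; mixed = 0
                 for (k = length($0); k >= 1; k--) {
                     c = substr($0, k, 1)
                     rev = rev c
                     if (c != substr($0, 1, 1)) mixed = 1   # not "aaa"-like
                 }
                 if (mixed && rev == $0) print
             }' |
        sort | uniq -c
}
```

Called as `count_palindromes sample.txt`, it prints one "count word" line per palindrome, in the format produced by uniq -c.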

bash awk first 1st column and 3rd column with everything after

I am working on the following bash script:
# contents of dbfake file
1 100% file 1
2 99% file name 2
3 100% file name 3
#!/bin/bash
# cat out data
cat dbfake |
# select lines containing 100%
grep 100% |
# print the first and third columns
awk '{print $1, $3}' |
# echo out id and file name and log
xargs -rI % sh -c '{ echo %; echo "%" >> "fake.log"; }'
exit 0
This script works ok, but how do I print everything in column $3 and then all columns after?
You can use cut instead of awk in this case:
cut -f1,3- -d ' '
awk '{ $2 = ""; print }' # remove col 2
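On the sample data, the cut variant can be checked with a short sketch like the one below. The function name `select_complete` is hypothetical, and the sketch assumes the columns are separated by exactly one space, since cut treats every single space as a delimiter.

```shell
# select_complete FILE: hypothetical helper; keeps lines containing 100%
# and prints column 1 followed by column 3 and everything after it.
select_complete() {
    grep -F '100%' "$1" | cut -d ' ' -f1,3-
}
```

For the dbfake contents shown in the question, `select_complete dbfake` should print "1 file 1" and "3 file name 3".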
If you don't mind a little whitespace:
awk '{ $2="" }1'
But that is a useless use of cat (UUOC) and of grep:
< dbfake awk '/100%/ { $2="" }1' | ...
If you'd like to trim that whitespace:
< dbfake awk '/100%/ { $2=""; sub(FS "+", FS) }1' | ...
For fun, here's another way using GNU sed:
< dbfake sed -r '/100%/s/^(\S+)\s+\S+(.*)/\1\2/' | ...
All you need is:
awk 'sub(/.*100% /,"")' dbfake | tee "fake.log"
Others responded in various ways, but I want to point out that using xargs to multiplex output is a rather bad idea.
Instead, why don't you:
awk '$2=="100%" { sub("100%[[:space:]]*",""); print; print >>"fake.log"}' dbfake
That's all. You don't need grep, you don't need multiple pipes, and definitely you don't need to fork shell for every line you're outputting.
You could do awk ...; print}' | tee fake.log, but there is not much point in forking tee, if awk can handle it as well.

get Nth line in file after parsing another file

I have a large file that looks like this:
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
Here the 1st column (foo, bar, qux, etc.) holds file names and the 2nd column (43, 51, 34, etc.) holds line numbers. I want to print the Nth line (given by the 2nd column) of each file (given by the 1st column).
How can I automate this in a Unix shell?
The file is actually generated during compilation, and I want to print the line of code that each warning refers to.
-Thanks,
while IFS=: read -r name line rest
do
    head -n "$line" "$name" | tail -n 1
done < input.txt
while IFS=: read file line message; do
echo "$file:$line - $message:"
sed -n "${line}p" "$file"
done <yourfilehere
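Both loops assume that every record names a readable file and a numeric line. A slightly more defensive variant might look like the sketch below. The function name `print_warning_lines` is made up, and silently skipping malformed records is an assumption about what is wanted:

```shell
# print_warning_lines LIST: hypothetical helper; LIST holds
# file:line:message records, one per line.
print_warning_lines() {
    while IFS=: read -r file line rest; do
        [ -f "$file" ] || continue              # skip names that are not files
        case $line in
            ''|*[!0-9]*) continue ;;            # skip non-numeric line numbers
        esac
        printf '%s:%s\n' "$file" "$line"
        sed -n "${line}{p;q;}" "$file"          # print that line, then stop reading
    done < "$1"
}
```

The `q` after `p` lets sed stop as soon as the requested line has been printed, which helps when the source files are large.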
awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above prints the 4th line of your file. You can change the number as required.
Just in awk, but probably with worse performance than the answers by @kev or @MarkReed.
However, it does process each file just once. Requires GNU awk.
gawk -F: '
BEGIN {OFS=FS}
{
    files[$1] = 1
    lines[$1] = lines[$1] " " $2
    msgs[$1, $2] = $3
}
END {
    for (file in files) {
        split(lines[file], l, " ")
        n = asort(l)
        count = 0
        for (i=1; i<=n; i++) {
            # read forward until line l[i] has been read; count is the
            # number of lines consumed from this file so far
            while (count < l[i] && (getline line < file) > 0)
                count++
            print file, l[i], msgs[file, l[i]]
            print line
        }
        close(file)
    }
}
'
This might work for you:
sed 's/^\([^:]*\):\([^:]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh
sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n: no output by default
-r: extended regular expressions (simplifies using the parens)
3{...;p}: on line 3, run the commands in braces and print at the end
s: substitute foo:bar:baz with foo bar
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=${fnUln%% *}   # file name: everything before the first space
ln=${fnUln##* }   # line number: everything after the last space
sed -n "${ln}p" "$fn"

Resources