I want to count the occurrences of each word in a text file and display them in descending order of frequency.
So far I have:
cat sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr
This is mostly giving me satisfying output, except that it also includes special characters such as commas, full stops, exclamation marks and hyphens.
How can I modify the existing command so that it does not include the special characters mentioned above?
You can use tr with a string of the characters you wish to delete.
Example:
$ echo "abc, def. ghi! boss-man" | tr -d ',.!'
abc def ghi boss-man
Or use a POSIX character class, bearing in mind that boss-man, for example, would become bossman:
$ echo "abc, def. ghi! boss-man" | tr -d '[:punct:]'
abc def ghi bossman
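To fold that into the pipeline from the question, you could strip the punctuation up front and drop any empty lines the deleted characters leave behind (a sketch, assuming the same sample.txt):
tr -d '[:punct:]' < sample.txt | tr ' ' '\n' | grep . | sort | uniq -c | sort -nr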
Side note: You can have a lot more control and speed by using awk for this:
$ echo "one two one! one. oneone
two two three two-one three" |
awk 'BEGIN{RS="[^[:alpha:]]"}
/[[:alpha:]]/ {seen[$1]++}
END{for (e in seen) print seen[e], e}' |
sort -k1,1nr -k2,2
4 one
4 two
2 three
1 oneone
How about first extracting words with grep:
grep -o "\w\+" sample.txt | sort | uniq -c | sort -nr
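A quick check of that pipeline on a throwaway string rather than the question's file (the exact spacing of the uniq -c counts may differ):
$ printf 'one two, two. one! one\n' | grep -o "\w\+" | sort | uniq -c | sort -nr
      3 one
      2 two
Note that \w matches letters, digits and underscore, so a hyphenated word like boss-man would be split into boss and man.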
I have a big txt file which I want to edit in a pipeline. But at some point in the pipeline I also want to store the number of lines in a variable, $nol. I just want to see the syntax for setting a variable in the middle of a pipeline, something like:
cat ${!#} | tr ' ' '\n'| grep . ; $nol=wc -l | sort | uniq -c ...
The part after the second pipe is obviously very wrong, but how can I do something like that in bash?
One solution is:
nol=$(cat ${!#} | tr ' ' '\n'| grep . | wc -l)
and then run the whole pipeline again from the start,
but I don't want the script to do the same work twice, because I have more pipes than shown here.
I mustn't use awk or sed...
You can use tee to write the intermediate result to a file which you use later:
tempfile="xyz"
tr ' ' '\n' < "${!#}" | grep '.' | tee "$tempfile" | sort | uniq -c ...
nol=$(wc -l < "$tempfile")
Or you can use it the other way around:
nol=$(tr ' ' '\n' < "${!#}" | grep '.' \
    | tee >(sort | uniq -c ... > /dev/tty) | wc -l)
You can set a variable in a particular link of a pipeline, but that's not very useful since only that particular link will be affected by it.
I recommend simply using a temporary file.
set -e
trap 'rm -f "$tmpf"' EXIT
tmpf=$(mktemp)
cat "${!#}" | tr ' ' '\n' | grep . | sort > "$tmpf"
nol="$(wc -l < "$tmpf")"
< "$tmpf" uniq -c ...
You can avoid the temporary file with tee and a named pipe, but it probably won't perform much better (it may even perform worse).
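For completeness, here is a minimal sketch of that tee-plus-named-pipe variant; the fifo name countpipe and the output file results are placeholders, not anything from the question:
mkfifo countpipe
{ tr ' ' '\n' < "${!#}" | grep . | sort | tee countpipe | uniq -c > results; } &
nol=$(wc -l < countpipe)   # counts the copy tee sends through the fifo
wait
rm countpipe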
UPDATE:
Took a minute but I got it...
cat ${!#} | tr ' ' '\n'| tee >(nol=$(wc -l)) | sort | uniq -c ...
PREVIOUS:
The only way I can think of to do this is to store the intermediate output in variables and call it back later. You would not execute the commands more than once; you would just store the output in variables along the way.
aCommand=($(cat ${!#} | tr ' ' '\n'));sLineCount=$(echo ${#aCommand[@]});echo ${aCommand[@]} | sort | uniq -c ...
aCommand will store the results of the first set of commands in an array
sLineCount will count the elements (lines) in the array
then the final echo prints the array elements and continues the commands from there.
Looks to me like you're asking how to avoid stepping through your file twice, just to get both word and line count.
Bash lets you read several variables at once, and wc can produce all the numbers you need in a single pass.
NAME
wc -- word, line, character, and byte count
So to start...
read lines words chars < <( wc < "${!#}" )
This populates the three variables based on input generated from process substitution.
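A quick sanity check with a throwaway string (not from the question):
$ read lines words chars < <(printf 'a b\nc\n' | wc)
$ echo "$lines $words $chars"
2 3 6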
But your question includes another partial command line which I think you intend as:
nol=$( sort -u ${!#} | wc -l )
This is markedly different from the word count of your first command line, so you can't use a single wc instance to generate both. Instead, one option might be to put your functionality into a script that does both functions at once:
read words uniques < <(
awk '
{
words += NF
for (i=1; i<=NF; i++) { unique[$i] }
}
END {
print words,length(unique)
}
' ${!#}
)
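A quick check of that idea on a throwaway line (not from the question), compressed into a one-liner; length(array) works in GNU awk, which the script above already assumes:
$ read words uniques < <(awk '{words += NF; for (i = 1; i <= NF; i++) unique[$i]} END {print words, length(unique)}' <<< 'one two two three')
$ echo "$words $uniques"
4 3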
I have two strings:
l1='a1 a2 b1 b2 c1 c2'
l2='a1 b3 c1'
And I want to check if each element of string l2 exists in l1, and then remove it from l1.
Is it possible to do that without a for loop?
You can do this:
l1=$(comm -23 <(echo "$l1" | tr ' ' '\n' | sort) <(echo "$l2" | tr ' ' '\n' | sort) | tr '\n' ' ')
The comm command compares sorted lines and outputs three columns: the lines unique to the first input, the lines unique to the second input, and the lines common to both. The -23 option suppresses the second and third columns, so it reports only the lines that are unique to the first input.
Since it requires the input to be sorted lines, I first pipe the variables to tr to put each word on its own line, and then sort to sort it. <(...) is a common shell extension called process substitution that allows a command to be used where a filename is expected; it's available in bash and zsh, for example (see here for a table that lists which shells have it).
At the end I use tr again to translate the newlines back to spaces.
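For example, with the values from the question (note that the result comes back as a sorted, space-separated list with a trailing space):
$ l1='a1 a2 b1 b2 c1 c2'
$ l2='a1 b3 c1'
$ l1=$(comm -23 <(echo "$l1" | tr ' ' '\n' | sort) <(echo "$l2" | tr ' ' '\n' | sort) | tr '\n' ' ')
$ echo "$l1"
a2 b1 b2 c2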
If you don't have process substitution, you can emulate it with named pipes:
mkfifo p1
echo "$l1" | tr ' ' '\n' | sort > p1 &
mkfifo p2
echo "$l2" | tr ' ' '\n' | sort > p2 &
l1=$(comm -23 p1 p2 | tr '\n' ' ')
rm p1 p2
I have two files and I need to print only the words (not complete lines) that are in the first file but not in the second. I have tried wdiff, but it prints complete lines, which is not useful here.
Sample of file:
وكان مكنيل وقتها رئيس رابطة مؤرخي أمريكا ـ
كما فهمت - من شاهد الحادثة. ثم يصف كيف قدم
مكنيل الرجلين الخصمين, فكانت له صرامته, إذ
حدد عشرين دقيقة فقط لكل منهما أن يقدم رأيه
وحجته, ثم وقت للرد, ثم يجيبان عن أسئلة قليلة
من القاعة, والمناقشة في وقت محدد.
Make two files that contain each word on its own line, and sort them. Then use comm:
$ cat fileA
وكان مكنيل وقتها رئيس رابطة مؤرخي أمريكا ـ
كما فهمت - من شاهد الحادثة. ثم يصف كيف قدم
$ cat fileB
وقتها رئيس رابطة أمريكا ـ
كما فهمت - من شاهد يصف كيف قدم
$ tr ' ' '\n' < fileA | sort > fileA-sorted
$ tr ' ' '\n' < fileB | sort > fileB-sorted
$ comm -23 fileA-sorted fileB-sorted
الحادثة.
ثم
مكنيل
مؤرخي
وكان
$
This can also be written on a single line in bash:
comm -23 <(tr ' ' '\n' < fileA | sort) <(tr ' ' '\n' < fileB | sort)
This is not an answer, just a comment that was too long to post as one. I'm sorry - I don't yet know the etiquette in this case, so please let me know if there's a better way to do this.
I thought both the approaches given in other answers were interesting, but was concerned that the grep version would require m * n comparisons, where m and n are the numbers of words in each file respectively.
I'm running bash on OSX and ran the following smoke test to compare:
Grab two random selections of 10K words from my dictionary:
gsort -R /usr/share/dict/words | head -n 10000 > words1
gsort -R /usr/share/dict/words | head -n 10000 > words2
Compare the running time for each solution:
Using comm:
time comm -23 <(tr ' ' '\n' < words1 | sort) <(tr ' ' '\n' < words2 | sort)
Result:
real 0m0.143s
user 0m0.225s
sys 0m0.018s
Using grep:
time grep -wf <(tr ' ' '\n' < words1) <(tr ' ' '\n' < words2)
Result:
real 1m25.988s
user 1m25.925s
sys 0m0.063s
I'm not sure about memory complexity. I'd be interested in any criticism of this analysis, or commentary on how to evaluate which solution is better.
You can avoid sorting (especially if the input files are pretty huge) by using grep:
grep -wf <(tr ' ' '\n' < file1) <(tr ' ' '\n' < file2)
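Note that, as written, this prints the words of file2 that also occur in file1. If what you actually want are the words of file1 that are missing from file2 (as in the question), you would invert the match and swap the two inputs; a sketch:
grep -vwf <(tr ' ' '\n' < file2) <(tr ' ' '\n' < file1)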
The given file is in the below format.
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
I need to take out the duplicates and count them (duplicates being categorized by fields 1, 2, 5 and 14). Then I need to insert into a database the entire first-occurrence record of each duplicate, with the duplicate count tagged on in another column. For this I cut out the four mentioned fields, sort them, find the duplicates using uniq -d, and use -c for the counts. Coming back after all that sorting-out of duplicates and their counts, I need the output to be in the form below.
3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
Here 3 is the number of repeated duplicates for fields 1, 2, 5 and 14, and the rest of the fields can come from any of the duplicate rows.
This way the duplicates should be removed from the original file and shown in the above format.
The remaining lines in the original file will be unique ones; they go through as they are...
What I have done is..
awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c
but this still needs a way to point back to the original file to get the lines for the duplicate occurrences...
Let me not confuse things... this needs a different point of view, and my brain is clinging to my approach. I need a cigar...
Any thoughts?
sort has an option -k
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
uniq has an option -f
-f, --skip-fields=N
avoid comparing the first N fields
so you can combine sort and uniq with field numbers (work out the NUM values and test this command yourself, please):
awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
Using awk's associative arrays is a handy way to find unique/duplicate rows:
awk '
BEGIN {FS = OFS = ","}
{
key = $1 FS $2 FS $5 FS $14
if (key in count)
count[key]++
else {
count[key] = 1
line[key] = $0
}
}
END {for (key in count) print count[key], line[key]}
' filename
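Since OFS is a comma, the END block already prints each line in the count,original-line shape the question asks for; if you also want the output ordered by count, you could, for example, append | sort -t, -k1,1nr to the command above.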
SYNTAX :
awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],","count[i] > file}}' renewstatus_2012-04-19.txt
Calculation:
sym@localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> TOTAL NUMBER OF UNIQUE (count = 1) ENTRIES
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
10614 ------> SUM OF DUPLICATE ENTRIES MULTIPLIED BY THEIR COUNTS
sym@localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNT OF THE ORIGINAL FILE, WHICH MATCHES (124275 + 10614) = 134889 EXACTLY