number of unique words in a document - bash

I have a very large txt file (500GiB), and I want to get the number of its unique words. I tried this, but it seems very slow because of the sort:
grep -o -E '\w+' temp | sort -u -f | wc -l
Is there any better way of doing this?

awk to the rescue!
$ awk -v RS=" " '{a[$0]++} END{for(k in a) sum++; print sum}' file
UPDATE:
It's probably better to do the preprocessing with tr and let awk do the counting economically. You may want to delimit the words with spaces or newlines.
For example:
$ tr ':;,?!\"' ' ' < file | tr -s ' ' '\n' | awk '!a[$0]++{c++} END{print c}'

You can rely on awk's default behavior to split lines into words by runs of whitespace, and use its associative arrays:
awk '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ } END { print length(a) }' file
Update: As @rici points out in a comment, whitespace-separated tokens may include punctuation (other than _) and other characters, and are thus not necessarily the same as grep's \w+ construct. @4ae1e1 therefore suggests using a field separator along the lines of '[^[:alnum:]_]'. Note that this will result in each component of a hyphenated word being counted separately; similarly, ' separates words.
awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) { a[tolower($i)]++ } }
END { print length(a) - ("" in a) }' file
Associative array a counts the occurrences of each distinct word in the input, converted to lowercase first so as to ignore differences in case; if you do NOT want to ignore case differences, simply remove the tolower() call.
CAVEAT: It seems that Mawk and BSD Awk aren't locale-aware, so tolower() won't work properly with non-ASCII characters.
Once all words have been processed, the number of elements of a equals the number of unique words.
NOTE: The POSIX-compliant reformulation of print length(a) is: for (k in a) ++count; print count
The above will work with GNU Awk, Mawk (1.3.4+), and BSD Awk, even though it isn't strictly POSIX-compliant (POSIX defines the length function only for strings, not arrays).
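Putting that together, a fully POSIX-leaning variant of the command above might look like this (a sketch; the k != "" test plays the same role as the - ("" in a) correction):
awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ }
  END { count = 0; for (k in a) if (k != "") ++count; print count }' file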

An important feature of sorting is that it is locale-aware, and therefore much more expensive in any locale other than C. Since you don't really care about the order here, you might as well tell sort to ignore the locale by using LC_ALL=C sort -u -f. If your locale is set to something else, that will probably cut your execution time in half.
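For example, the pipeline from the question with the C locale forced for the sort step (a sketch; temp is the file name used in the question):
grep -o -E '\w+' temp | LC_ALL=C sort -u -f | wc -l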
The original version of this answer suggested that you should only do this if you don't care about non-ASCII characters. However, if you are using GNU coreutils, it turns out that none of this will work in UTF-8 locales anyway. While GNU sort will do a locale-aware string comparison in any locale (using the strxfrm standard library function), sort -f only does locale-aware case-folding in single-byte locales. GNU uniq -i has the same problem. And tr only translates single-byte characters (by design, as far as I know); in theory [:alpha:] is locale-aware, but only for characters representable as single bytes.
In short, if you want to use sort -u -f, you might as well specify the C locale. That is no less broken for non-English letters, but at least the breakage doesn't waste time.
GNU Awk's tolower() function does apparently work in multibyte locales. So check out one of the awk answers if you need this to work in a UTF-8 locale.
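One way to combine the two ideas, assuming GNU Awk is available (a sketch): let gawk do the locale-aware lowercasing, then deduplicate with a byte-wise C-locale sort:
grep -o -E '\w+' temp | gawk '{ print tolower($0) }' | LC_ALL=C sort -u | wc -l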

Related

How many times has the letter "N" or its repeats (e.g. "NNNNN") been found in a text file?

I am given a file.txt (text file) with a string of data. Example contents:
abcabccabbabNababbababaaaNNcacbba
abacabababaaNNNbacabaaccabbacacab
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
aaababababababacacacacccbababNbNa
abababbacababaaacccc
I need to find the number of distinct runs of "N" (one or more consecutive Ns) present in the file, using Unix commands.
I am unsure what commands to use, even after trying a range of different ones.
$ grep -E -c "(N)+" file.txt
The expected output is 6.
One way:
$ sed 's/[^N]\{1,\}/\n/g' file.txt | grep -c N
6
How it works:
Replace all sequences of one or more non-N characters in the input with a newline.
This turns strings like abcabccabbabNababbababaaaNNcacbba into
N
NN
Count the number of lines with at least one N (ignoring the empty lines).
Regular-expression free alternative:
$ tr -sc N ' ' < file.txt | wc -w
6
Uses tr to replace all runs of non-N characters with a single space, and counts the remaining words (which are the N sequences). Might not even need the -s option.
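For instance, dropping -s still gives the same count here, since wc -w already treats any run of blanks as a single separator (a quick check against the sample file):
$ tr -c N ' ' < file.txt | wc -w
6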
Using GNU awk (well, just tested with gawk, mawk, busybox awk and awk version 20121220 and it seemed to work with all of them):
$ gawk -v RS="^$" -F"N+" '{print NF-1}' file
6
It reads in the whole file as a single record, uses regex N+ as field separator and outputs the field count minus one. For other awks:
$ awk -v RS="" -F"N+" '{c+=NF-1}END{print c}' file
It reads in empty-line-separated blocks of records, counts and sums the fields.
Here is an awk that should work on most systems.
awk -F'N+' '{a+=NF-1} END {print a}' file
6
It splits each line by one or more Ns and then counts the number of fields minus one per line.
If you have a text file and you want to count the number of times a sequence of Ns appears, you can do:
awk '{a+=gsub(/N+/,"")}END{print a}' file
This, however, will count sequences that are split over multiple lines as separate sequences. Example:
abcNNN
NNefg
If you want this to be counted as a single sequence, you should do:
awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}' file

Sed command to replace numbers between space and :

I have a file with records like the one below:
FIRST 1: SECOND 2: THREE 4: FIVE 255: SIX 255
I want to remove values between space and :
FIRST:SECOND:THREE:FIVE:SIX
I tried this code:
awk -F '[[:space:]]*,:*' '{$1=$1}1' OFS=, file
Tried on GNU awk:
awk -F' [0-9]*(: *|$)' -vOFS=':' '{print $1,$2,$3,$4,$5}' file
Tried on GNU sed:
sed -E 's/\s+[0-9]+(:|$)\s*/\1/g' file
Explanation of the awk:
The field separator regex matches a space, followed by digits ([0-9]*), followed by either a literal : and optional spaces (: *) or the end of the line ($). Because this is passed with -F, awk treats each such match as the field separator (FS), so $1, $2, ... hold everything else, i.e. FIRST, SECOND and so on. To make the output look right, those fields then need to be joined with :, which is what setting the output field separator with -vOFS=':' does.
You can also add [[:digit:]] with a trailing asterisk to the field separator, and set OFS to the empty string (leave nothing after OFS=, just the space before the file name):
$ awk -F '[[:space:]][[:digit:]]*' '{$1=$1}1' OFS= file
FIRST:SECOND:THREE:FIVE:SIX
To get the output we want in idiomatic awk, we make the input field separator (with -F) contain all the stuff we want to eliminate (anchored with :), and make the output field separator (OFS) what we want it replaced with. The catch is that this won't eliminate the space and numbers at the end of the line, so we need to do something more. GNU's implementation of awk allows a regular expression for the input record separator (RS), but we could just do a simple sub() with POSIX-compliant awk as well. Finally, force recalculation via $1=$1; the side effects of this pattern/statement are that the record is rebuilt, doing the FS/RS substitution for us, and that non-blank lines take the default action, which is to print.
gawk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: -v RS='[[:space:]]*[[:digit:]]*\n' '$1=$1' file
Or:
awk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: '{ sub(/[[:space:]]*[[:digit:]]*$/, "") } $1=$1' file
A sed implementation is fun but probably slower (because current versions of awk have better regex implementations).
sed 's/[[:space:]]*[[:digit:]]*:[[:space:]]/:/g; s/[[:space:]]*[[:digit:]]*[[:space:]]*$//' file
Or if POSIX character classes are not available...
sed 's/[\t ]*[0-9]*:[\t ]/:/g; s/[\t ]*[0-9]*[\t ]*$//' file
Something tells me that your "FIRST, SECOND, THIRD..." might be more complicated, and might contain digits... in this case, you might want to experiment with replacing * with + for awk or with \+ for sed.

How can I determine the number of fields in a CSV, from the shell?

I have a well-formed CSV file, which may or may not have a header line; and may or may not have quoted data. I want to determine the number of columns in it, using the shell.
Now, if I can be sure there are no quoted commas in the file, the following seems to work:
x=$(tail -1 00-45-19-tester-trace.csv | grep -o , | wc -l); echo $((x + 1))
but what if I can't make that assumption? That is, what if I can't assume a comma is always a field separator? How do I do it then?
If it helps, you're allowed to make the assumption that there are no quoted quotes (i.e. no \" sequences within quoted strings); but better not to make that one either.
If you cannot make any optimistic assumptions about the data, then there won't be a simple solution in Bash. It's not trivial to parse a general CSV format with possible embedded newlines and embedded separators. You're better off not writing that in bash, but using an existing proper CSV parser. For example, Python has one built into its standard library.
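For example, a minimal sketch that calls Python's csv module from the shell and counts the fields of the first record (assuming that record is representative):
python3 -c 'import csv, sys; print(len(next(csv.reader(open(sys.argv[1], newline="")))))' input.csv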
If you can assume that there are no embedded newlines and no embedded separators, then it's simple to split on commas using awk:
awk -F, '{ print NF; exit }' input.csv
-F, tells awk to use comma as the field separator, and the automatic NF variable is the number of fields on the current line.
If you want to allow embedded separators, but you can assume no embedded double quotes, then you can eliminate the embedded separators with a simple filter, before piping to the same awk as earlier:
head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk ...
Note that both of these examples use the first line to decide the number of fields. If the input has a header line, this should work quite well, as the header should not contain embedded newlines.
Count the fields in the first row, then verify all rows have the same number:
CNT=$(head -n1 hhdata.csv | awk -F ',' '{print NF}')
cat hhdata.csv | awk -F ',' '{print NF}' | grep -vx "$CNT"
Doesn't cope with embedded commas but will highlight if they exist
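A single-awk variant of the same check (a sketch): remember the field count of the first row and report any row whose count differs. It has the same limitation with embedded commas:
awk -F, 'NR==1{n=NF} NF!=n{print NR": "NF" fields"}' hhdata.csv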
If the file has no double quotes, then use the command below:
awk -F"," '{ print NF }' filename| sort -u
If the file has every column enclosed in double quotes, then use the command below:
awk -F, '{gsub(/"[^"]*"/,x);print NF}' filename | sort -u

apply logic condition to decimal values in awk [duplicate]

I want gawk to parse number using comma , as the decimal point character.
So I set LC_NUMERIC to fr_FR.utf-8 but it does not work:
echo 123,2 | LC_NUMERIC=fr_FR.utf-8 gawk '{printf ("%.2f\n", $1 + 0) }'
123.00
The solution is to specify option --posix or export POSIXLY_CORRECT=1 but in this case the GNU awk extensions are not available, for example delete or the gensub function:
echo 123,2 | LC_NUMERIC=fr_FR.utf-8 gawk --posix '{printf ("%.2f\n", $1 + 0) }'
123,20
Is it possible to have gawk parse numbers with , as the decimal point without specifying the POSIX option?
The option you are looking for is:
--use-lc-numeric
This forces gawk to use the locale's decimal point character when parsing input data. Although the POSIX standard requires this behavior, and gawk does so when --posix is in effect, the default is to follow traditional behavior and use a period as the decimal point, even in locales where the period is not the decimal point character. This option overrides the default behavior, without the full draconian strictness of the --posix option.
Demo:
$ echo 123,2 | LC_NUMERIC=fr_FR.utf-8 awk --use-lc-numeric '{printf "%.2f\n",$1}'
123,20
Notes: printf is a statement, not a function, so the parentheses are not required, and I'm not sure why you are adding zero here.

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why):
ram,doc,doc,
shayam,eng,eng,
I am using cut command
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shyam,eng
That means cut can print a field only once. I need to print the same field twice or n times.
Why do I need this ? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
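To see the first step in isolation (a quick check against the a.csv from the question):
$ sed 's/,.*,/,/' a.csv
ram,doc
shaym,eng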
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available). The sed in the second line turns each line of seq's output into a literal &, so $rep becomes a string of n ampersands, and each & in the final sed replacement then expands to the whole match (the trailing ,field):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of adding all the columns to awk, I just used (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
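Applied to the a.csv from the question, this keeps every column and duplicates the second one in place (a quick sketch, not the exact output asked for there):
$ awk -F , -v OFS=, '$2=$2","$2' a.csv
ram,33,33,professional,doc
shaym,23,23,salaried,eng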
