Counting words from a mixed-language document - utf-8

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.

You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

I thought about a quick hack since Chinese characters are 3 bytes long in UTF8:
(pseudocode)
for each character:
if character (byte) begins with 1:
add 1 to total chinese chars
if it is a space:
add 1 to total "normal" words
if it is a newline:
break
Then take total chinese chars / 3 + total words to get the sum for each line. This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters.) A space between the two languages would be needed to give the correct count.

Related

Does it exist some kind of sorting convention?

Does it exist some established convention of sorting lines (characters)? Some convention which should play the similar role as PCRE for regular expressions.
For example, if you try to sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and Vim (:%sort), the result will be the same (see below). However, I'm not sure it will be the same with another editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted based on their numeric value. While this used to only be applied to ASCII characters, this has also been adopted by unicode encodings as well. http://www.asciitable.com/
If no preference is given to the contrary, this is the de facto standard for sorting characters. Save for the actual alphabetical characters, the ordering is somewhat arbitrary.
There are two main ways of sorting character strings:
Lexicographic: numeric value of either the codepoint values or the code unit values or the serialized code unit values (bytes). For some character encodings, they would all be the same. The algorithm is very simple but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, it's called the CLDR. Also, in applying sorting for Unicode, sorting can respect grapheme clusters. A grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing (applied as extensions of the previous glyph) marks.
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
Sometimes, people read a format into strings, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These are very specialized sorts that need to be applied where users expect. Note: The ISO 8601 date format for the Julian calendar sorts correctly regardless of method (for all? character encodings).

Better algorithm for shortening English words

I have some unique codes that are generated from strings (ex: website host names) in various independent components of my application.
These codes are meant to be used by machines only so i would like to keep them as short as possible.
The below algorithm would be applied to every word in the string. The output words would be concatenated with a dash to generate the unique code.
The current algorithm I have used:
- Skip word if length is less than 6
- Leave first character as is
- Remove every wowel in the word from the second character onwards
architectural digest eu => archtctrl-dgst-eu
arizona foothills magazine => arzn-fthlls-mgzn
Is there a better way to shorten an English word leaving it as recognisable as possible to a human reader?
The output should be deterministic and produce the same shortened version whenever it is run on the same input.
A good algorithm should also minimise the number of clashes for similarly spelt words.
I have some unique codes that are generated from strings
I am afraid that is not true. There are many English words that will reduce to the same 'code word' when stripped of their vowels. For example, 'leaving' -> 'living' Given, this is fairly rare, it could still cause issues.
How important is it that these 'code words' remain human-readable if as you say, they are meant to be used by machines only? If its not that important, I'd suggest looking into some simpler compression algorithms like Huffman Coding or LZW Compression. Then if the user needs to see the translation of the code word, just uncompress it.
If you must keep it human-readable, I'm not sure that there is much more you can do to shorten it. You could take a look at specific latin + greek roots, and determine if you can shorten those any more by hand, and then just substitute those out automatically.
Alternatively, you could turn to a phonetic approach. Automatically search the pronunciation of the word, and then see if that is any shorter (or itself can be compressed, taking 'cee' to 'C', or 'kay' to 'K'). This would be much more time and CPU intensive, but its still an option if you really, really need short but yet readable codes.
What you're generating sounds like what's called a "slug". There are many libraries to handle this for blogs or site generators that should suit your purposes. Here's a usage example from a Python library called slugify:
txt = "___This is a test ---"
r = slugify(txt)
self.assertEqual(r, "this-is-a-test")
Slug libraries generally work like this:
replacing non-ascii linguistic characters via a mapping (ex: 影師嗎 -> ying-shi-ma)
replace accented latin letters with ascii equivalents via a mapping (ex: C'est déjà l'été. -> c-est-deja-l-ete)
remove beginning and trailing spaces/punctuation
convert remaining spaces and punctuation to dashes, collapsing multiple dashes in a row to a single dash
If you want to make slugs shorter you could remove vowels or, more simply, use a maximum length.

Detecting "noise" in text extracted from documents

I am working on retrieving the readable content (i.e. text) from PDF documents, most of which are scientific journal articles.
I am using the Poppler text utilities to convert the PDF to text format.
The text is extracted nicely, but unfortunately so are other components of the articles (e.g. numerical tables), which cannot be rendered properly in plain text.
For example, I might get the following output in the middle of the article:
Character distributions random Hmax
1 2 3 4
Organization c) (of characters over species
A
B
A 0 0 0 + C
B + + + +
C + + + + A
B 4+
H Character distributions nonrandom Hobs
Entropy
3+ 2+ 1+
(diversity of characters over species
My question is: how would I identify such "noise" and differentiate it from normal blocks of text? Are there any existing algorithms? I am working in Ruby, but code in any language will help.
You could use a Naive Bayes Classifier to model valid vs. non-valid lines.
Here's an article on one in Ruby; there's a good implementation in Python's nltk.
To set it up you would need to give it examples, for example by filling one file with good lines and one with bad ones. This is the same model used by spam filters.
One trick for this use case is that many basic Naive Bayes Classifiers word using a word-occurrence model for features, whereas here it's not the vocabulary that's significant. You may with to use line length, percent spaces (rounded to 5% or 10% intervals), or percent of various punctuation marks (rounded but with higher precision). Hopefully your classifier will learn that "lines with no periods and 30% spaces are bad" or "lines with no punctuation where every word begins with a capital letter are bad".
Based on just your examples above, though, you could probably reject any line with too high a ratio of spaces or those completely lacking in sentence punctuation such as commas and periods.

Is there a good two way hash to convert an email address to a predictable, readable, unix username?

We are working with a number of unix based filesystems, all of which share a similar set of restrictions on that certain characters can't be used in the username fields. One of those restrictions is no "#" , "_", or "." in the names. Being unix there are a number of other restrictions.
So the question is if there is a good known algorithm that can take an email address and turn that into a predictable unix filename. We would need to reverse this at some point to get the email.
I've considered doing thing like "."->"DOT", "#"->"AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by being able to map the #xyz.com part of the email to a special char or something. Each implementation would only have at most 3 domains it would need to support. I'm hoping someone has found a solution without a huge number of tradeoffs.
UPDATE:
-The two target filesystems are AFS and NFS.
-Base64 doesn't work as it has not compatible characters. "/"
-Readable is preferable.
Seems like the best answer would be to replace the #xyz.com domain to a single non-standard character, and then have a function that could shrink the first part of a name to something that fits in the username length restrictions of the various filesystems. But what is a good function for that?
You could try a modified version of the URL percent (%) encoding scheme used on for URIs.
If the percent symbol isn't allowed on your particular filesystem(s), simply replace it with a different, allowed character (and remember to encode any occurrences of that character properly).
Using this method:
mail.address#server.com
Would become:
mail%2Eaddress%40server%2Ecom
Or, if you had to substitute (for example), the letter a instead of the % symbol:
ma61ila2Ea61ddressa40servera2Ecom
Not exactly humanly-readable perhaps, but easily enough processed through an encoding algorithm. For the best space efficiency, your escape character should be a character allowed by the filesystem, yet one that is not likely to appear frequently in an address.
This encoding scheme has the advantage that there is no size increase for most normal characters. The string length will ONLY go up for characters not supported by the filesystem.
Check out base64. Encoding and decoding is well defined.
I'd prefer this over rolling my own format any day.
Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?
Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out/test what is acceptable in a username? If you could find three 'special' characters that you could use just to do a replace on '#', '.', '_' you would be good to go. (Is that comprehensive? if not you would need to make sure you know all of them otherwise you could clash.) I searched a bit trying to find whether there was a POSIX standard, but wasn't able to find anything, so that's why I think if you can just test what's valid that would be the most direct route.
With even one special character, you could do URL encoding, either with '%' if it's available, or whatever you choose if not, say '!", then { '#'->'!40", '_'->'!5F', '.'-> '!2E' }. (The spec [RFC1738] http://www.rfc-editor.org/rfc/rfc1738.txt) defines the characters as US-ASCII so you can just find a table, e.g. in wikipedia's ASCII article and look up the correct hex digits there.) Or, you could just do your own simple mapping since you don't need the whole ASCII set, you could just do a map with two characters per escaped character and have, say, '!a','!u','!p' for at, underscore, period.
If you have two special characters, say, '%', and '!', you could delimit text that represents the character, say, %at!, &us!, and '&pd!'. (This is pretty much html-style encoding, but instead of '&' and ';' you are using the available ones, and you're making up your own mnemonics.) Another idea is that you could use runs of a symbol to determine the translated character, where each new character flops which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assume '%' and '!', with period being 1, underscore 2, and at-sign being three, 'mickey._sample_#fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations but this one is easy to code.
If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then really I think the Base64 answer sounds about right. Really once we're getting to anything other than a simple replacement (and even that) it's already getting hard to type if that's the goal. But if you really need to try to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking use '0' as your escape character, so now '0' becomes '00', '#' becomes '01', '.' becomes '02', and '_' becomes '03'. So now, 'mickey01._sample_#fake.out'would become 'mickey0010203sample0301fake02out'. Not beautiful but it should work; since we escaped any raw 0's, just always make sure you define a mapping for whatever you choose as your escape char and you should be fine..
That's all I can think of atm. :) Definitely if there's no need for these usernames to be readable in the raw it seems like apparently Base64 won't work, since it can produce slashes. Heck, ok, just the 2-digit US-ASCII hex value for each character and you're done...] is a good way to go; there's lots of nice debugged, heavily field-tested code out there for it and it solves your problem quite handily. :)
Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two steps encoding logic whereby the email is
first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form, stored in a shorter array of bytes
then this array of bytes is encoded using a Base64-like algorithm
The idea is to minimize the size of the binary representation, so that the expansion associated with the storage inefficiency of the encoding -which can only store roughly 6 bits (and probably a bit less) per character-, doesn't cause the encoded string to be too long.
Without getting overly sophisticated for the compression nor the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily half the size, but the encoding, say Base32, would grow the binary form size by 8/5.
Efforts in improving the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets) and this may help making the output more human-readable and also more broadly safe on various flavors of file systems. For example whereby a Base64 seems optimal. space-wise, using only uppercase letter (base 26) may ensure portability of the underlying scheme to file systems where the file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input key (email addresses here).
Ideas for compression:
LZ seems like a good choice, 'though one may consider primin its initial buffer with common patterns found in email addresses (example ".com" or even "a.com", "b.com" etc.). This initial buffer would ensure several instances of "citations" per compressed email address, hence a better compression ratio overall). To further squeeze a few bytes, maybe LZH or other LZ-variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the string we have to compress (email address instances) are themselves very short and would not benefit from say a 512 bytes buffer. (Shorter buffer sizes allow shorter codes for the citations)
Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equal (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three charcters, allowed by all "flavors" of the targeted file systems may be a stretch.
Never the less, Base64 and its 4 output characters per 3 payload bytes ratio provide what is probably the barely achievable upper limit of storage efficiency [for an acceptable character set].
At the lower end of this efficiency, is maybe an ASCII representation of the Hexadeciamal values of the bytes in the array. This format with a doubling of the payload bytes may be acceptable, length-wise, and is interesting because of its simplicity (there is a direct and simple relation between each nibble (4 bits) in the input and characters in the encoded string.
Base32 whereby A thru Z encode 0 thru 25 and 0 thru 5 encode 26 thru 31, respectively, essentially variation of Base64 with an 8 output characters per 5 payload bytes ratio may be a very viable compromise.

Password validation

I need to validate a password with the following requirements:
1. Be at least seven characters long
2. Contain at least one letter (a-z or A-Z)
3 Contain at least one number (0-9)
4 Contain at least one symbol (#, $, %, etc)
Can anyone give me the correct expression?
/.{7,}/
/[a-zA-Z]/
/[0-9]/
/[-!##$%^...]/
For a single regex, the most straightforward way to check all of the requirements would be with lookaheads:
/(?=.*[a-zA-Z])(?=.*\d)(?=.*[^a-zA-Z0-9\s]).{7,}/
Breaking it down:
.{7,} - at least seven characters
(?=.*[a-zA-Z]) - a letter must occur somewhere after the start of the string
(?=.*\d) - ditto 2, except a digit
(?=.*[^a-zA-Z0-9\s]) - ditto 2, except something not a letter, digit, or whitespace
However, you might choose to simply utilize multiple separate regex matches to keep things even more readable - chances are you aren't validating a ton of passwords at once, so performance isn't really a huge requirement.

Resources