How to count by formula the length of the uppercase letters of a cell in Apache OpenOffice Calc? - char

The following formula converts the characters to upper case and then calculates the number of all upper case letters.
tesT // A1
=LEN(SUBSTITUTE(UPPER(A1);"";"")) // A2: 4 instead of 1

The UPPER function converts lower case to upper case, so it won't work this way. Sadly, i fear there's no solution for Apache OpenOffice. However, you could do this with LibreOffice Calc, using the REGEX function:
=LEN(REGEX(A1;"[a-z0-9]";"";"g"))
removes all lower-case characters and numbers, assuming there are no other text content (special characters or something like that). However, a bullet-proof solution would require writing a custom function.

Related

How is the soundex code for burroughs is B622 instead of B620?

Steps of Soundex Algorithm
Site For Checking Soundex code
If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
Soundex Algo
Dropping vowels and letter 'h' and 'w' Burroughs will convert to its numeric form as B6622 Applying adjacent same number rule will give code B62 .Since we need three numbers we append zero at the end.
So finally code should be B620!??
Yes, the Soundex code for Burroughs is B620.
Confirmed by manually working through the "alternate" version of the American Soundex algorithm listed on Wikipedia.
https://en.wikipedia.org/wiki/Soundex
Burroughs:
Burrougs
1u66ou22
1u6ou2
162
B62
B620
Also verified with another soundex calculator.
http://resources.rootsweb.ancestry.com/cgi-bin/soundexconverter
The original Soundex function checker web site you used does seem to be wrong - it is not applying the "h/w" part of the algorithm you highlighted in your question. That web site gives the same value "B622" for "Burroughs" and "Burrouges", which it should not.

Simple compression algorithm - how to avoid marker-strings?

I was trying to come up with a string compression algorithm for plaintext, e.g.
AAAAAAAABB -> A#8BB
where n symbols y are written out like
y#n
The problem is: what if I need to compress the string "A#8" ? That would confuse the decompression algorithm into thinking that the original input was "AAAAAAAA" instead of just "A#8".
How can I solve this problem? I was thinking of using a "marker" character instead of the #, but what if I wanted the algorithm to work with binary data? There is no marker character that can be used in that case I suppose
A simple solution is escaping: you could represent each # in the source by ##.
Everytime you encounter an #, you look one character ahead and find either a number (repeat previous character) or another # (its literally #).
A variant would be encoding each # as ##1, which would fit nicely into your current scheme and allows encoding n consecutive # as ##n.

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
I thought about a quick hack since Chinese characters are 3 bytes long in UTF8:
(pseudocode)
for each character:
if character (byte) begins with 1:
add 1 to total chinese chars
if it is a space:
add 1 to total "normal" words
if it is a newline:
break
Then take total chinese chars / 3 + total words to get the sum for each line. This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters.) A space between the two languages would be needed to give the correct count.

decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this:
the PDFs can be displayed correctly in Acrobat Reader
select all and copy the test via Acrobat Reader
and paste in a text editor
will show that the content are encoded
so, examples are:
13579 -> 3579;
hello -> jgnnq
it's basically an offset (maybe swap) of ASCII characters.
The question is how can I find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset is changed. All I know is some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.
Thank you!
edit: thanks for the feedback. I'd try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside string?
You need to brute-force it.
If those patterns are simple like +2 character code like in your examples (which is +2 char codes)
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement like this to check against knowns words
>>> text='jgnnq'
>>> knowns=['hello', '13579']
>>>
>>> for i in range(-5,+5): #check -5 to +5 char code range
... rot=''.join(chr(ord(j)+i) for j in text)
... for x in knowns:
... if x in rot:
... print rot
...
hello
Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?
If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.
Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.
The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude them in your calculation, or better find a chart with them in it. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.
Edit - saw comment about character swapping
In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.
Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.
Hmmm, thats a tough one.
The only thing I can suggest is using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.
But I cannot see a solution that will decode everything for you with the scenario you describe.
Why don't you paste some sample input and we can have ago at decoding it.
It's only possible then you have a lot of examples (examples count stops then: possible to get all the combinations or just an linear values dependency or idea of the scenario).
also this question : How would I reverse engineer a cryptographic algorithm? have some advices.
Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

Is there any logic behind ASCII codes' ordering?

I was teaching C to my younger brother studying engineering. I was explaining him how different data-types are actually stored in the memory. I explained him the logistics behind having signed/unsigned numbers and floating point bit in decimal numbers. While I was telling him about char type in C, I also took him through the ASCII code system and also how char is also stored as 1 byte number.
He asked me why 'A' has been given ASCII code 65 and not anything else? Similarly why 'a' is given the code 97 specifically? Why is there a gap of 6 ASCII codes between the range of capital letters and small letters? I had no idea of this. Can you help me understand this, since this has created a great curiosity to me as well. I've never found any book so far that has discussed this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) have the binary prefix 110000:
0 is 110000
1 is 110001
2 is 110010
etc.
So if you wipe out the prefix (the first two '1's), you end up with the digit in binary coded decimal.
Capital letters have the binary prefix 1000000:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing, if you remove the prefix (the first '1'), you end up with alphabet-indexed characters (A is 1, Z is 26, etc).
Lowercase letters have the binary prefix 1100000:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (100000) to a capital letter, you have the lowercase version.
This chart shows it quite well from wikipedia: Notice the two columns of control 2 of upper 2 of lower, and then gaps filled in with misc.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is very detailed history and description of ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
first 30 characters are "nonprintable" - used for text formatting
then they continue with printable characters, roughly in order they are placed on keyboard. Check your keyboard:
space,
upper case sign on number caps: !, ", #, ...,
numbers
signs usually placed at the end of keyboard row with numbers - upper case
capital letters, alphabetically
signs usually placed at the end of keyboard rows with letters - upper case
small letters, alphabetically
signs usually placed at the end of keyboard rows with letters - lower case
The distance between A and a is 32. That's quite round number, isn't it?
The gap of 6 characters between capital letters and small letters is because (32 - 26) = 6. (Note: there are 26 letters in the English alphabet).
If you look at the binary representations for 'a' and 'A', you'll see that they only differ by 1 bit, which is pretty useful (turning upper case to lower case or vice-versa is just a matter of flipping a bit). Why start there specifically, I have no idea.
'A' is 0x41 in hexidecimal.
'a' is 0x61 in hexidecimal.
'0' thru '9' is 0x30 - 0x39 in hexidecimal.
So at least it is easy to remember the numbers for A, a and 0-9. I have no idea about the symbols. See The Wikipedia article on ASCII Ordering.
Wikipedia:
The code itself was structured so that
most control codes were together, and
all graphic codes were together. The
first two columns (32 positions) were
reserved for control characters.[14]
The "space" character had to come
before graphics to make sorting
algorithms easy, so it became position
0x20.[15] The committee decided it was
important to support upper case
64-character alphabets, and chose to
structure ASCII so it could easily be
reduced to a usable 64-character set
of graphic codes.[16] Lower case
letters were therefore not interleaved
with upper case. To keep options open
for lower case letters and other
graphics, the special and numeric
codes were placed before the letters,
and the letter 'A' was placed in
position 0x41 to match the draft of
the corresponding British
standard.[17] The digits 0–9 were
placed so they correspond to values in
binary prefixed with 011, making
conversion with binary-coded decimal
straightforward.

Resources