Simple-to-enter Unicode character that would sort after Z in most cases?

As you probably know, most symbols sort before the alphabetical letters.
I am looking for a character that is easy to enter from the keyboard and that would be sorted after "z" by most sort implementations.
If it is also an ASCII character, so much the better :)
Any ideas?

On a Mac, these are the only characters I can type using a US keyboard (with and without shift and option modifiers) that sort below Z and z:
Ω (option+z)
π (option+p)
µ (option+m)
 (shift+option+k)
It seems like omega and then pi are the best options for cross-platform compatibility.

On Windows, none of these options work because they all sort before A.
A solution I ended up using is an Arabic character:
ٴ This character comes after z in Windows

A tilde '~' is ASCII code 126.
It comes after all the standard English-usage characters and would therefore out-sort a 'Z' of either case.
It would not out-sort other special characters, however, and ASCII or Unicode sequencing is not sufficient to cover international sorts in every context.
Example: internationalization in JavaScript
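A byte/code-point sort illustrates the tilde's position; a quick Python sketch (locale-aware sorts such as Explorer's may order differently):

```python
# '~' is code point 126, above both 'Z' (90) and 'z' (122), so a plain
# code-point sort (like LC_ALL=C) puts it last among ASCII names:
names = ['zoo', '~misc', 'Archive', 'notes']
print(sorted(names))  # ['Archive', 'notes', 'zoo', '~misc']
```
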

Xi "Ξ" works nicely!
On Mac: Ctrl+Cmd+Space, then type "xi".

The answers provided here that worked for me:
Ξ _Greek capital letter XI (per tonystar's answer);
π _Greek small letter PI (per DaveC's answer);
Ω _Greek capital letter OMEGA (per DaveC's answer);
µ _international symbol for the micrometre, also known as the micron (per DaveC's answer);
ٴ _Arabic letter ٴ(unidentified) (per degenerate's answer);
ﻩ _Arabic letter HEH isolated form (per Bytee's answer);
Notes:
using macOS 10.14.2.
a tilde ~ always displays before numbers in an ascending sort.
In macOS Numbers (spreadsheet) app, sort (ascending) displays as follows:
0
9
a
z
µ
Ξ
π
Ω
ٴ
ﻩ
Perhaps worth mentioning that the last two Arabic letters, ٴ (unidentified) and ﻩ (HEH), are difficult to edit in Numbers (they do not behave as expected).
In macOS Finder, sort (ascending) displays as follows:
ٴ (appears as a narrow 'blank' space at the beginning of the file name)
0
9
a
z
µ
Ξ
π
Ω
ﻩ (appears at the end of the file name in display, at the beginning during edit)

Late to the party, but I was tearing my hair out trying to find a character that sorts last and doesn't tweak my OCD either. I finally found that this Arabic character "ﻩ" sorts after z. Putting one on either side of the folder name like so...
ﻩ Odds & Ends ﻩ
...looks rather pretty to me, so maybe it'll work for you all too!

If you want something invisible, you can use the no-break space character (code 160, just outside 7-bit ASCII):
Windows: ALT+0160 (only works with the numpad)
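By code point the claim holds for any plain byte-wise sort; a quick Python sketch (whether a given file manager actually sorts this way is another matter):

```python
# No-break space is U+00A0 (160), above 'z' (122), so a plain code-point
# sort places a name starting with it after everything ASCII:
names = ['zebra', '\u00a0pinned', 'apple']
print(sorted(names))  # ['apple', 'zebra', '\xa0pinned']
```
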

I'm trying to do this for my Amazon wish lists. None of the suggestions here have worked (I have tried Ω, Ξ, and ~).
I ended up using zzz_

It looks like a sort with LC_ALL=C sorts by ASCII value, so {|}~ and DEL will come after z.
% echo $'a\nA\n1\n#\n~' | LC_ALL=C sort
#
1
A
a
~
This appears to be the default when LC_ALL is not set with mac/BSD sort, but it must be set explicitly for GNU sort.

Related

Escape sequences \033[01;36m\] vs. \033[1;36m\] in PS1 in .bashrc: why the zero?

I've just compared the $PS1 prompts in .bashrc on two of my Debian machines:
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;36m\]\u\[\033[0;90m\]#\[\033[0;32m\]\h\[\033[0;90m\]:\[\033[01;34m\]\w\[\033[0;90m\]\$\[\033[0m\] '
PS1='${debian_chroot:+($debian_chroot)}\[\033[1;36m\]\u\[\033[0;37m\]#\[\033[0;32m\]\h\[\033[0;37m\]:\[\033[01;34m\]\w\[\033[0;37m\]\$\[\033[0m\] '
As you can see, the first sequence says \033[01;, whereas the second has \033[1; in the same position. Do both mean the same thing (bold, I guess), or do they mean something different? Any idea why the zero appeared or disappeared? I have no recollection of having introduced or removed this zero myself. A Web search returns numerous occurrences both with and without the zero.
"ANSI" numeric parameters are all decimal integers (see ECMA-48, section 5.4.1 Parameter Representation). In section 5.4.2, it explains
A parameter string consists of one or more parameter sub-strings, each of which represents a number
in decimal notation.
A leading zero makes no difference. Someone noticed the unnecessary character and trimmed it.
The ESC[#;#m escape sets the console font color. I've seen many subtle variations in escape-sequence implementations, so I'm not surprised. Regardless, I think both should be interpreted the same way.
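To see that the leading zero is insignificant, one can parse the parameters as ECMA-48 prescribes; a small Python sketch (sgr_params is a hypothetical helper, not a library function):

```python
# Hypothetical helper: parse the numeric parameters of an SGR sequence
# such as "\033[01;36m". ECMA-48 defines them as plain decimal integers,
# so a leading zero changes nothing.
def sgr_params(seq):
    assert seq.startswith('\033[') and seq.endswith('m')
    return [int(p) for p in seq[2:-1].split(';')]

print(sgr_params('\033[01;36m'))  # [1, 36]
print(sgr_params('\033[1;36m'))   # [1, 36] - the very same rendition
```
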

How to count by formula the length of the uppercase letters of a cell in Apache OpenOffice Calc?

The following formula converts every character to upper case first and then counts all of them, so it does not do what I want:
tesT // A1
=LEN(SUBSTITUTE(UPPER(A1);"";"")) // A2: 4 instead of 1
The UPPER function converts lower case to upper case first, so counting afterwards cannot work this way. Sadly, I fear there is no formula-only solution for Apache OpenOffice. However, you could do this with LibreOffice Calc, using the REGEX function:
=LEN(REGEX(A1;"[a-z0-9]";"";"g"))
This removes all lower-case letters and digits before counting, assuming there is no other text content (special characters or the like). A bullet-proof solution would require writing a custom function.
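For comparison, outside the spreadsheet the same count is a one-liner; a Python sketch counting the characters for which str.isupper() is true:

```python
# Count the upper-case letters in a string directly.
def count_upper(s):
    return sum(1 for c in s if c.isupper())

print(count_upper('tesT'))  # 1
```
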

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
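The regex approach mentioned above can be sketched in Python; the character range is an assumption covering only the basic CJK Unified Ideographs block, and proper segmentation should follow UAX #29:

```python
import re

# Count every CJK ideograph as one unit and every Latin run as one word.
# The \u4e00-\u9fff range is an assumption (basic CJK Unified Ideographs
# only), not full Unicode word segmentation.
pattern = re.compile(r'[\u4e00-\u9fff]|[A-Za-z]+')

text = 'this is just an example\n这只是个例子'
print(len(pattern.findall(text)))  # 11 (5 Latin words + 6 ideographs)
```
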
I thought about a quick hack, since Chinese characters are 3 bytes long in UTF-8. In Python 3:
def rough_count(line):
    # every byte of a multi-byte UTF-8 sequence has its high bit set,
    # and Chinese characters take 3 bytes each in UTF-8
    cjk_bytes = sum(1 for b in line.encode('utf-8') if b & 0x80)
    spaces = line.count(' ')  # spaces approximate the "normal" word count
    return cjk_bytes // 3 + spaces
This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters). A space between the two languages would be needed to give the correct count.

decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this:
the PDFs can be displayed correctly in Acrobat Reader
selecting all and copying the text via Acrobat Reader
and pasting it into a text editor
will show that the content is encoded
so, examples are:
13579 -> 3579;
hello -> jgnnq
it's basically an offset (maybe swap) of ASCII characters.
The question is how I can find the offset automatically when I have access to only a few samples. I cannot be sure that the encoding offset never changes. All I know is that some text will usually (if not always) show up inside the PDF, e.g. "Name:", "Summary:", "Total:".
Thank you!
edit: thanks for the feedback. I'd try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside string?
You need to brute-force it.
If the pattern is a simple character-code offset, like the +2 in your examples:
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement something like this (Python 3) to check candidate offsets against known words:
>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 5):  # check char-code offsets from -5 to +4
...     rot = ''.join(chr(ord(j) + i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print(rot)
...
hello
Is the PDF going to contain symbolic content (like math or proofs) or natural-language text (English, French, etc.)?
If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.
Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.
The linked letter-frequencies page on Wikipedia covers only letters, so you may want to exclude all other characters from your calculation, or better, find a chart that includes them. You may also want to transform the entire resulting text into lower case or upper case (your preference) to treat letters the same regardless of case.
Edit - saw comment about character swapping
In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.
Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.
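Either way, the brute-force scoring loop can be sketched in Python. The frequency values below are approximate English letter frequencies, and the penalty for non-letters is my own heuristic, not a standard one:

```python
# Approximate English letter frequencies (percent).
FREQ = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.2, 'x': 0.2,
        'q': 0.1, 'z': 0.1}

def score(text):
    s = 0.0
    for c in text.lower():
        if c in FREQ:
            s += FREQ[c]
        elif c != ' ':
            s -= 20.0  # real English text is almost entirely letters
    return s

def best_shift(ciphertext):
    # try every possible shift and keep the highest-scoring plaintext
    best = max((score(''.join(chr((ord(c) - k) % 128) for c in ciphertext)), k)
               for k in range(128))
    k = best[1]
    return k, ''.join(chr((ord(c) - k) % 128) for c in ciphertext)

print(best_shift('jgnnq"yqtnf'))  # (2, 'hello world')
```

Longer samples make the scoring far more reliable; eleven characters is close to the minimum for this heuristic to work.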
Hmmm, that's a tough one.
The only thing I can suggest is that using a dictionary (along with some substitution-cipher algorithms) may help in decoding some of the text.
But I cannot see a solution that will decode everything in the scenario you describe.
Why don't you paste some sample input, and we can have a go at decoding it?
It is only possible when you have a lot of examples (enough to cover all the combinations, to establish a simple linear dependency, or at least to get an idea of the scenario).
Also, this question has some advice: How would I reverse engineer a cryptographic algorithm?
Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

Is there any logic behind ASCII codes' ordering?

I was teaching C to my younger brother, who is studying engineering, and explaining how the different data types are actually stored in memory. I explained the logic behind signed/unsigned numbers and the representation of floating-point numbers. While telling him about the char type in C, I also took him through the ASCII code system and how a char is stored as a 1-byte number.
He asked me why 'A' was given ASCII code 65 and not anything else, and similarly why 'a' was given code 97. Why is there a gap of 6 ASCII codes between the range of capital letters and the range of small letters? I had no idea. Can you help me understand this? It has made me quite curious as well, and I've never found a book that discusses this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) start at binary 110000:
0 is 110000
1 is 110001
2 is 110010
etc.
So if you wipe out the prefix (the first two '1's), you end up with the digit in binary coded decimal.
Capital letters start at binary 1000000:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing, if you remove the prefix (the first '1'), you end up with alphabet-indexed characters (A is 1, Z is 26, etc).
Lowercase letters start at binary 1100000:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (100000) to a capital letter, you have the lowercase version.
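This layout is what makes case conversion and digit parsing cheap bit operations; for instance, in Python:

```python
# Case differs by a single bit (0b100000 = 32), and a digit's low
# nibble is its binary-coded-decimal value:
print(bin(ord('A')))              # 0b1000001
print(bin(ord('a')))              # 0b1100001
print(chr(ord('A') | 0b100000))   # 'a' - setting bit 5 lowercases
print(chr(ord('a') & ~0b100000))  # 'A' - clearing bit 5 uppercases
print(ord('7') & 0b1111)          # 7 - strip the 011 prefix from a digit
```
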
The ASCII chart from Wikipedia shows this quite well: notice the two columns of control codes, two columns of upper case, and two of lower case, with the gaps filled in with miscellaneous symbols.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is very detailed history and description of ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
the first 32 characters are "nonprintable" control codes - used for text formatting and device control
then come the printable characters, roughly in the order they are placed on a keyboard. Check your keyboard:
space,
the shifted symbols on the number keys: !, ", #, ...,
numbers,
symbols usually placed at the end of the number row - shifted,
capital letters, alphabetically,
symbols usually placed at the end of the letter rows - shifted,
small letters, alphabetically,
symbols usually placed at the end of the letter rows - unshifted
The distance between A and a is 32. That's quite a round number, isn't it?
The gap of 6 characters between the capital and the small letters follows from that: 32 - 26 = 6 (there are 26 letters in the English alphabet, but the chosen case offset is a full 32 code points).
If you look at the binary representations for 'a' and 'A', you'll see that they only differ by 1 bit, which is pretty useful (turning upper case to lower case or vice-versa is just a matter of flipping a bit). Why start there specifically, I have no idea.
'A' is 0x41 in hexadecimal.
'a' is 0x61 in hexadecimal.
'0' through '9' is 0x30 - 0x39 in hexadecimal.
So at least it is easy to remember the numbers for A, a and 0-9. I have no idea about the symbols. See The Wikipedia article on ASCII Ordering.
Wikipedia:
The code itself was structured so that most control codes were together, and all graphic codes were together. The first two columns (32 positions) were reserved for control characters.[14] The "space" character had to come before graphics to make sorting algorithms easy, so it became position 0x20.[15] The committee decided it was important to support upper-case 64-character alphabets, and chose to structure ASCII so it could easily be reduced to a usable 64-character set of graphic codes.[16] Lower-case letters were therefore not interleaved with upper case. To keep options open for lower-case letters and other graphics, the special and numeric codes were placed before the letters, and the letter 'A' was placed in position 0x41 to match the draft of the corresponding British standard.[17] The digits 0-9 were placed so they correspond to values in binary prefixed with 011, making conversion with binary-coded decimal straightforward.
