Load a text file containing both numbers and letters (2)

This question is related to Load a text file containing both numbers and letters, but I ask for the opposite. My file contains lines like
blabla<tab>1
blabla<tab>2
...
Do I need fscanf in this case as well?
EDIT: the answer to the question mentioned seems to treat individual characters as columns. In my case, I have strings of different lengths.

I found the textread function. It solved the problem.
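For anyone comparing outside MATLAB, a minimal sketch of the same tab-separated parsing in Python (the file name and the integer conversion of the second column are assumptions about the data):

labels, values = [], []
with open("data.txt") as fh:              # hypothetical file of <string><tab><number> lines
    for line in fh:
        name, num = line.rstrip("\n").split("\t")
        labels.append(name)               # the variable-length strings
        values.append(int(num))           # assumes the second column holds integers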

In a trie for inserting and searching normal paths, are ascii 1-31 worth considering?

I am working on a trie data structure that inserts and searches normal paths.
A path can contain any Unicode character, so to represent it completely in UTF-8 the child array in each trie node would need slots for all 256 byte values.
But I am also concerned about the space and insertion time the trie takes.
The paths my trie handles would only rarely contain a non-ASCII byte (i.e., 128-255), so I simply put in an if condition to reject paths containing bytes above 127. I don't think bytes 1-31 are relevant either, although I am unsure about this. Since chars 1-31 are things like carriage return, ESC, etc., can I simply continue the loop without inserting them? That is, is it possible in a real scenario to encounter paths that differ only because of chars 1-31?
Answering this old question: on macOS, ASCII 13 (carriage return) does appear in paths, because custom folder icons are stored in a file whose name ends in a carriage return, so it may show up in many paths. Thanks to @EricPostpischil, who pointed this out in the comments.
All the other characters in the 1-31 range appear only rarely in paths.
Also, the default macOS file system is case-insensitive, so distinguishing lowercase from uppercase is generally unnecessary as well. (A sketch of the resulting insert routine is below.)
PS:
Although this question may seem opinion-based, it actually isn't, because it can be answered quite concisely: it asks for the frequency with which characters appear in paths on macOS. (Sorry for the confusing title; I was a noob at the time, and changing it now would make the existing comments absurd.)
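Putting the answer together, a minimal sketch in Python (the node layout and names are my own illustration) of an insert that rejects non-ASCII bytes and skips control characters other than carriage return:

class TrieNode:
    def __init__(self):
        self.children = [None] * 128      # ASCII-only fan-out; bytes 128-255 are rejected below
        self.terminal = False             # marks the end of a complete path

def insert_path(root, path):
    node = root
    for byte in path.encode("utf-8"):
        if byte > 127:
            return False                  # reject paths containing non-ASCII bytes
        if 1 <= byte <= 31 and byte != 13:
            continue                      # skip control chars, but keep CR (macOS "Icon\r" files)
        if node.children[byte] is None:
            node.children[byte] = TrieNode()
        node = node.children[byte]
    node.terminal = True
    return True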

Algorithm/Hashing/Creative Way To Map Beyond 2 Alphanumeric Characters Combinations

I have a system that is confined to two alphanumeric characters. Some simple math shows that two characters drawn from 0-9 and a-z give 36² = 1,296 combinations. Lowercase letters cannot be distinguished from uppercase, and special characters (including a blank character) cannot be used.
Is there any creative mapping, perhaps to an external reference, that could take this two-character field significantly beyond 1,296 combinations?
Examples of identifiers would be 00, OO, AZ, Z4, etc.
Thanks!
I'm afraid not, no more than you could get a 3-bit number to represent more than 8 different values. If you're interested in the details you can look up information theory or Kolmogorov complexity. Essentially, with only 1,296 combinations you can only label 1,296 possible pieces of information.
As an example, consider having 1,297 things. The two-character combinations would be used up by the first 1,296, so whatever combination you assigned to the next one would have to repeat one used earlier.
Shor also has some good material on this, and the implications of that sort of thing form the basis for a lot of file compression systems.
You could maybe squeeze out one more combination if you cheat and allow a 'null' value to represent a different possibility, but that's not really in the spirit of the question.
If you are restricted to two characters taken from an alphabet of 36, then you are limited to 36² distinct symbols, that's it.
More context is required to find workarounds, like stealing bits elsewhere, using symbols in pairs, breaking the case limitation, exploiting the history of transactions...
The precise meaning of "a system that is confined to two alphanumeric characters" needs to be known before a workaround can be suggested. Is it a space constraint? Do you need the restriction to 2 chars for efficiency? Does it need to work with other code that accepts or generates 2-char indexes?
If you have up to 1,295 identifiers that are used often and some others that occur only occasionally, you could reserve one identifier, e.g. "ZZ", to indicate that another identifier follows. So "00" through to "ZY" would be 1,295 simple 2-char identifiers, and "ZZ00" through to "ZZZZ" would be a further 1,296 combined 4-char identifiers. (Or "ZZ0000" through to "ZZZZZZ" for a further 1296*1296 identifiers ...)
This could work for space constraints. For efficiency, it depends on whether the additional check to see if the identifier is "ZZ" is too expensive or not.
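A minimal sketch of that escape scheme in Python (encode/decode and the integer mapping are illustrative names, not part of the original system):

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n):
    # 0 .. 1294 map to the simple identifiers "00" .. "ZY"
    if n < 36 * 36 - 1:
        return ALPHABET[n // 36] + ALPHABET[n % 36]
    # the next 1296 values get the "ZZ" escape prefix: "ZZ00" .. "ZZZZ"
    n -= 36 * 36 - 1
    if n < 36 * 36:
        return "ZZ" + ALPHABET[n // 36] + ALPHABET[n % 36]
    raise ValueError("out of range for this sketch")

def decode(s):
    if s.startswith("ZZ") and len(s) == 4:
        return 36 * 36 - 1 + ALPHABET.index(s[2]) * 36 + ALPHABET.index(s[3])
    return ALPHABET.index(s[0]) * 36 + ALPHABET.index(s[1])

assert decode(encode(1294)) == 1294      # "ZY", the last simple identifier
assert encode(1295) == "ZZ00"            # the first escaped identifier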

Decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this: the PDFs display correctly in Acrobat Reader, but if you select all, copy the text via Acrobat Reader, and paste it into a text editor, you will see that the content is encoded.
so, examples are:
13579 -> 3579;
hello -> jgnnq
it's basically an offset (maybe swap) of ASCII characters.
The question is how I can find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset changes between files. All I know is that some text will usually (if not always) show up inside the PDF, e.g. "Name:", "Summary:", "Total:".
Thank you!
edit: thanks for the feedback. I'll try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside a string?
You need to brute-force it.
If the pattern is a simple character-code offset, as in your examples (which are both +2):
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement something like this to check against known words:
>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 6):   # check offsets from -5 to +5
...     rot = ''.join(chr(ord(c) + i) for c in text)
...     for known in knowns:
...         if known in rot:
...             print(rot)
...
hello
Is the PDF going to contain symbolic text (like math or proofs) or natural-language text (English, French, etc.)?
If the latter, you can use a frequency chart for letters (plus digraphs, trigraphs, and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start, and more specifically letter frequencies.
Then, if you're sure it's a Caesar shift, you can grab the first 1,000 characters or so and shift them forward by increasing amounts, up to (I would guess) 127 or so. Take the resulting texts and calculate how closely their letter frequencies match the average ones you found above.
The linked letter-frequencies page on Wikipedia covers only letters, so you may want to exclude non-letter characters from your calculation, or better, find a chart that includes them. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) so letters are treated the same regardless of case.
Edit - saw comment about character swapping
In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.
Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.
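A rough sketch in Python of that shift-and-score approach (the frequency table holds approximate English letter frequencies in percent; score and crack are illustrative names, not from any library):

ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.1, 'z': 0.07,
}

def score(text):
    # higher means the letter mix looks more like average English
    return sum(ENGLISH_FREQ.get(c, 0.0) for c in text.lower())

def crack(ciphertext, max_shift=127):
    # try every shift and keep the one whose output scores most English-like
    candidates = {k: ''.join(chr((ord(c) + k) % 128) for c in ciphertext)
                  for k in range(-max_shift, max_shift + 1)}
    best = max(candidates, key=lambda k: score(candidates[k]))
    return best, candidates[best]

print(crack('jgnnq'))   # recovers (-2, 'hello') even on this tiny sample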
Hmmm, that's a tough one.
The only thing I can suggest is that using a dictionary (along with some substitution-cipher algorithms) may help in decoding some of the text.
But I cannot see a solution that will decode everything for you in the scenario you describe.
Why don't you paste some sample input, and we can have a go at decoding it?
It's only possible when you have a lot of examples (enough samples to cover all the combinations, to reveal a simple linear dependency between values, or at least to give an idea of the scenario).
Also, this question, How would I reverse engineer a cryptographic algorithm?, has some advice.
Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) to programmatically extract the text you need.
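PDF Clown is a Java/.NET library; the same idea in Python might look like this (swapping in the pypdf package, with a made-up file name):

from pypdf import PdfReader

reader = PdfReader("encoded.pdf")        # hypothetical input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)
# if the library decodes the fonts properly, no cipher-breaking is needed
print(text)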

How to find "equivalent" texts?

I want to find (not generate) two text strings such that, after removing all non-letters and uppercasing, one string can be translated to the other by simple substitution.
The motivation for this comes from a project I know of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent.
This ends up as two parts: find the longest such strings in a corpus, and get that corpus.
The first part seems to me to be amenable to some sort of attack with a B-tree keyed off the string after a substitution that makes the sequence of first occurrences sequential:
HELLOWORLDTHISISIT
1233454637819a9a98
A little optimization based on knowing the maximum value and length of the string at each depth of the tree, and the rest is just coding.
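A quick sketch of that first-occurrence signature in Python (signature is my name for it): each distinct letter is replaced by the index of its first appearance, so two strings are substitution-equivalent exactly when their signatures match.

def signature(text):
    order = {}
    out = []
    for ch in text:
        if ch not in order:
            order[ch] = len(order) + 1      # next unused index
        out.append(format(order[ch], 'x'))  # hex digit, matching 'a' = 10 above
    return ''.join(out)

assert signature("HELLOWORLDTHISISIT") == "1233454637819a9a98"
# note: single hex digits only cover 15 distinct letters; return the list of
# indexes instead of a string if you need all 26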
The other part would be quite a bit more involved: how to get a large corpus of text to search? Some kind of internet spider would seem to be the ideal approach, as it would have access to the largest amount of text, but how do you strip it down to just the text?
The question is: any ideas on how to do this better?
Edit: the cipher that was being used is an insanely basic 26-letter substitution cipher.
p.s. this is more a thought experiment than a probable real project for me.
There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:
>>> import math
>>> math.log(math.factorial(26), 2)
88.38195332701626
The entropy of English text is something like 2 bits per character at least. So it seems to me you can't reasonably expect to find passages of more than about 88/2 ≈ 44 characters, say 45-50 at the outside, that are accidentally equivalent under substitution.
For the large corpus, there's Project Gutenberg and Wikipedia, for a start. You can download a dump of all of the English Wikipedia's XML files from their website.
I think you're asking a bit much in wanting a substitution that is also "coherent"; figuring out which texts are coherent is an AI problem in itself. Also, the longer your text is, the harder it will be to find a "coherent" result, quickly approaching the point where you need a "key" as long as the text you are encrypting, which defeats the purpose of encrypting it at all.

How to elegantly compute the anagram signature of a word in ruby?

Arising out of this question, I'm looking for an elegant (Ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run-length encode repeated letters. So, for example, "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding it as "4impp4s".
I'm relatively new to Ruby, and though I could hack something together, I'm sure this is a one-liner for somebody with more experience of Ruby. I'd be interested to see people's approaches and to improve my Ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word that are actual English words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
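A sketch of that lookup (in Python rather than Ruby, using the plain sorted word as the signature and leaving the run-length compaction aside):

from collections import defaultdict

def signature(word):
    return ''.join(sorted(word.lower()))

def build_index(words):
    index = defaultdict(list)             # signature -> all words sharing it
    for w in words:
        index[signature(w)].append(w)
    return index

index = build_index(["listen", "silent", "enlist", "google"])
print(index[signature("tinsel")])         # -> ['listen', 'silent', 'enlist']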
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison, it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter (sketched below).
---Irma Vep
EDIT: the other poster found join which I missed. Nice.
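A sketch of the 5-bits-per-letter packing mentioned above, in Python (pack5/unpack5 are illustrative names; letters map to 1-26 so a leading letter never packs to a leading zero):

def pack5(word):
    n = 0
    for ch in word.lower():
        n = (n << 5) | (ord(ch) - ord('a') + 1)   # each letter takes 5 bits
    return n

def unpack5(n):
    out = []
    while n:
        out.append(chr((n & 31) - 1 + ord('a')))
        n >>= 5
    return ''.join(reversed(out))

assert unpack5(pack5("mississippi")) == "mississippi"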
