Asterisk is character 42 in ASCII

I just realized that the asterisk (*) is character 42 (0x2A hex, 052 octal).
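A quick sanity check, for example in Python:

print(ord("*"))        # 42
print(hex(ord("*")))   # 0x2a
print(oct(ord("*")))   # 0o52, i.e. octal 052
print(chr(42))         # *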
Is this related to the significance of the number 42 in The Hitchhiker's Guide to the Galaxy? Considering also how the asterisk is traditionally associated with a star (1, 2, 3).
1: A "star diagram" will often be exemplified with a diagram vaguely resembling a five or six edges asterisk.
2: The search algorithm A* is commonly called "A star" and often appears in code under the name astar.
3: Everybody uses an asterisk to represent Kleene's star operator.

There is a long story behind the asterisk (*) being ASCII 42, and plenty of trivia attached to it:
It's a wildcard character.
There are 42 dots on a pair of dice.
The sentence "the meaning of life, the universe, and everything" has 42 letters/symbols.
and so on...
Anyway, Douglas Adams explained this now culturally entrenched "Answer to the Great Question of Life, the Universe, and Everything" as just a number he came up with while sitting in his garden. And that's just fine with me. The number is unimportant; it's the effect of the joke that matters.

Related

In BCPL what does "of" do?

I am trying to understand some ancient code from a DEC PDP10 written in BCPL. A sample of the code is as follows:
test scanner()=S.DOTNAME then
$( word1:=checklook.up(scan.info,S.SFUNC,"unknown Special function [:s]")
   D7 of temp:=P1 of word1
   scanner()
$) or D7 of temp:=SF.ACTION
What do the "D7 of temp" and "P1 of word1" constructs do in this case?
The unstoppable Martin Richards is continuing to add features to the BCPL language(a), despite the fact that so few people are aware of it(b). Only seven or so questions are tagged bcpl on Stack Overflow but don't get me wrong: I liked this language and I have fond memories of using it back in the '80s.
Some of the things added since the last time I used it are the sub-field operators SLCT and OF. As per the manual on Martin's own site:
An expression of the form K OF E accesses a field of consecutive bits in memory. K must be a manifest constant equal to SLCT length:shift:offset and E must yield a pointer, p say.
The field is contained entirely in the word at position p + offset. It has a bit length of length and is shift bits from the right hand end of the word. A length of zero is interpreted as the longest length possible consistent with shift and the word length of the implementation.
Hence it's a more fine-grained way of accessing parts of memory than just the ! "dereference entire word" operator in that it allows you to get at specific bits within a word.
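For anyone more comfortable reading Python than BCPL, here is a rough model of what a length:shift:offset selector does; the names select and memory are invented for the sketch, and a 32-bit word size is assumed (real BCPL implementations vary):

WORD_BITS = 32  # assumed word size for the sketch

def select(length, shift, offset, memory, p):
    # Models "K OF E" where K = SLCT length:shift:offset and E yields p:
    # the field lives in the word at p + offset, is `length` bits wide and
    # sits `shift` bits from the right-hand end of that word.
    word = memory[p + offset]
    if length == 0:
        # length 0 means "as long as possible given shift and the word size"
        length = WORD_BITS - shift
    return (word >> shift) & ((1 << length) - 1)

# Invented layout: word 1 of a record holds a 7-bit field, 3 bits from the right.
memory = [0, 0b1010101011]
print(select(7, 3, 1, memory, 0))   # -> 85 (0b1010101)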
(a) Including, apparently, a version for the Raspberry Pi, which may finally give me an excuse to break out all those spare Pis I have lying around, and educate the kids about the "good old days".
(b) It was used for at least one MC6809 embedded system I worked on, and formed a non-trivial part of AmigaDOS many moons ago.

What exactly is an n-gram?

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct:
Sentence: "I live in NY."
word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"
character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"
When you have this array of n-gram-parts, you drop the duplicate ones and add a counter for each part giving the frequency:
word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]
Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose a value between 7 and 12 - but without stating why.
What is a good nGram value and how should I take it into account when using a tool like LingPipe?
Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html
Usually a picture is worth a thousand words.
Source: http://recognize-speech.com/language-model/n-gram-model/comparison
N-grams are simply all combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo and ox. You may also count the word boundary – that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.
You can do the same on the word level. As an example, the hello, world! text contains the following word-level bigrams: # hello, hello world, world #.
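If it helps to see it in code, here is a small Python sketch (the # padding and the function names are just illustrative) that reproduces the examples above and also counts frequencies the way the question describes:

import re
from collections import Counter

def char_ngrams(word, n, pad="#"):
    # Character-level n-grams of one word, including boundary markers.
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_ngrams(text, n, pad="#"):
    # Word-level n-grams of a text, boundary markers added, punctuation dropped.
    tokens = [pad] + re.findall(r"\w+", text) + [pad]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("fox", 2))            # ['#f', 'fo', 'ox', 'x#']
print(word_ngrams("hello, world!", 2))  # ['# hello', 'hello world', 'world #']
print(Counter(word_ngrams("I live in NY.", 2)))   # n-gram -> frequency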
The basic point of n-grams is that they capture the language structure from the statistical point of view, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
An n-gram is an n-tuple or group of n words or characters (grams, as in pieces of grammar) that follow one another. So with n = 3 for the words from your sentence you would get "# I live", "I live in", "live in NY", "in NY #". This is used to build an index of how often words follow one another. You can use this in a Markov chain to create something that resembles language. As you populate a mapping of the distributions of word groups or character groups, you can recombine them; the longer the n-gram, the closer the output will be to natural language.
Too high of a number, and your output will be a word for word copy of the original, too low of a number, and the output will be too messy.
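As a rough illustration of that Markov chain idea (word-level, order 1, with invented names and a toy corpus, nothing production-worthy):

import random
from collections import defaultdict

def build_chain(text, order=1):
    # Map each run of `order` words to the words observed to follow it.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, start, length=12):
    # Random walk; drawing from the successor lists reproduces their frequencies.
    out = list(start)
    for _ in range(length):
        successors = chain.get(tuple(out[-len(start):]))
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "I live in NY . I live in a flat . I work in NY ."
print(generate(build_chain(corpus, order=1), start=("I",)))

With a corpus this small, raising order to 2 or 3 already makes the output a near-verbatim copy of the input, which is exactly the trade-off described above.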

Converting PI digits into text strings

It's kind of interesting that pi's decimal representation never ends and never settles into a permanently repeating pattern, which suggests that pi may well contain every possible combination of digits.
This guy calculated 5 trillion (5×10^12) digits of pi :D
http://www.numberworld.org/misc_runs/pi-5t/details.html
From the internet: "Converted into ASCII text, somewhere in that infinite string of digits is the name of every person you will ever love, the date, time and manner of your death, and the answers to all the great questions of the universe."
I'm wondering whether somebody has already converted and analyzed the resulting string for known sequences of letters (words/sentences)?
Check out this page: http://pi.nersc.gov/.
It allows you to search for both character strings and hexadecimal sequences. Note that this search engine has only indexed about the first 4 billion decimal digits of pi, and uses a formula to compute arbitrarily positioned binary or hexadecimal digits beyond those indexed.
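Nobody has shown what the "convert to text" step might look like, so here is a toy Python sketch; the two-digits-per-letter encoding and the tiny word list are arbitrary choices for illustration, not what the search engine above does:

# First 50 decimal digits of pi after the 3, just to have something to play with.
PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def digits_to_text(digits):
    # Arbitrary encoding: each pair of decimal digits becomes a letter a-z.
    return "".join(chr(ord("a") + int(digits[i:i + 2]) % 26)
                   for i in range(0, len(digits) - 1, 2))

text = digits_to_text(PI_DIGITS)
print(text)

for word in ["cat", "dog", "pie", "on"]:
    pos = text.find(word)
    if pos >= 0:
        print(word, "found at offset", pos)
    else:
        print(word, "not found in this short sample")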
The idea that Pi contains everything ever written is a nice one, but if it's correct, it also contains an infinite amount of false statements about everything. For example, if Pi contains a list of all the people you will ever love, then it also contains lists that look like lists of people you will love but are really just mixes of names arranged in a pattern that makes them look legitimate.
Following the same idea, the date, time, and manner of your death could also be "falsified". For example, let's say you are a man named Jason Delara, and you die at the age of 83 at 11:35 PM in your sleep. In Pi somewhere it can say in ASCII text "Jason Delara will die at age 83, 11:35 PM, passed in his sleep." It would also say somewhere else that "Jason Delara will die at age 35, at 6:00 AM, passed in a car accident." There could be an "infinite" amount of these false predictions.
There's also the fact that, following the idea above, all but one of the answers to any of the great questions of the universe contained in those digits are wrong, even if many of them make sense. I've thought about this a lot, and I wondered, "What if some part of the digits states which facts are correct and which are not?" The answer is: then there is also an infinite number of false lists in the digits claiming to do the same as the real list. In short, it would be pointless to convert Pi to ASCII text to try to figure everything out.
I know I'm a little late to the party, but I wrote this for anybody who comes here looking for the answers to the universe in an endless, non-repeating decimal.
It is massively convenient that pi is an irrational number whose digits we're still computing: if you can't find what you want in the sequence so far, then by definition it just happens to come later on.
As for it containing hidden information - if you create any random sequence long enough, you'll be able to create simple words from the resulting output.
Conspiracy theorists just love to see patterns where there are none. They forget the other noise and are endlessly fascinated by mere coincidences.
I would just like to provide further context to this question. Yes, the point is that pi goes on infinitely. That means there are endless possibilities for sentence structure and letter combinations. This means every single combination of letters will happen, and is happening, somewhere in pi. So technically, everything in pi could apply to everything in the observable world around us.

What is a sensible algorithm to generate the letters in Letterpress iOS game?

I've been playing Letterpress a lot lately.
The aim of the game is pretty much to score as many blue tiles as possible by making words out of the letters on the board. When you play your word, letters composing the word will turn blue unless the letter was surrounded by red tiles.
A regular Letterpress board is a 5×5 grid of 25 letter tiles.
I realised that the letters on the board must be generated with some sort of rules, otherwise some boards would be really hard to play. The only rule I could think of is that there must be a certain number of vowels; I wonder if there are other rules in place.
Additionally, I wonder whether this is anything similar to how Boggle dice are generated.
I decided to hack together a solution based on user166390's suggestion. The frequencies you see are for the English language, taken from Wikipedia. Running the program a few times and just eyeballing the results, they look pretty playable to me. I can generally find a few four- or five-letter words at least, and I'm not even very good at the game! Anyway, here's the code:
#!/usr/bin/env python
# Python 2; the frequencies are the relative letter frequencies (%) of English
# text for a-z, taken from Wikipedia.
from random import random
from bisect import bisect_left

letters = [c for c in "abcdefghijklmnopqrstuvwxyz"]
frequencies = [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
               0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
               6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074]

# Roulette-wheel selection: pick a random point on the cumulative distribution
# and find the letter it lands on with a binary search.
cumulative_frequencies = [sum(frequencies[0:i+1]) for i in xrange(len(frequencies))]

for i in xrange(5):
    line = ""
    for j in xrange(5):
        line += letters[bisect_left(cumulative_frequencies, random() * cumulative_frequencies[-1])] + " "
    print line
The idea is, for each letter to be generated, use the roulette wheel algorithm to choose it randomly with probability proportional to the frequencies given.
I've heard Loren Brichter, the dev, talk about it, but I can't for the life of me remember where. I think it was on Guy English and Rene Ritchie's Debug podcast, but I'm not sure. I remember a few things:
He guarantees at least a certain number of vowels.
The consonants are generated separately from the vowels. This implies separate letter distributions.
He did his own analysis of the dictionary behind the game to come up with the letter distribution.
If a Q is chosen, an I is guaranteed, so a word is always possible with the Q.
I play a lot. I have never had a game end for any reason but all letters being used. I don't know if it's guaranteed a word is always possible with every letter, but it sure seems it's practically true even if not enforced.
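Purely as a sketch of how those remembered constraints could sit on top of the frequency-weighted sampling shown earlier; the vowel minimum, the uniform draws and the Q handling below are guesses for illustration, not the game's real rules:

import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
MIN_VOWELS = 5          # invented threshold, not the game's real number
BOARD_SIZE = 25

def generate_board():
    n_vowels = random.randint(MIN_VOWELS, 9)
    # Vowels and consonants drawn separately; uniform here, but each pool
    # could use its own weighted distribution as in the answer above.
    tiles = random.choices(VOWELS, k=n_vowels)
    tiles += random.choices(CONSONANTS, k=BOARD_SIZE - n_vowels)
    if "q" in tiles and "i" not in tiles:
        # Swap one non-q tile for an i so a word with the q stays possible.
        for idx, t in enumerate(tiles):
            if t != "q":
                tiles[idx] = "i"
                break
    random.shuffle(tiles)
    return [tiles[i:i + 5] for i in range(0, BOARD_SIZE, 5)]

for row in generate_board():
    print(" ".join(row))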

find some sentences

I'd like to find a good way to extract some (let's say two) sentences from a text. Which would be better: a regexp or the split method? Your ideas?
As requested by Jeremy Stein, here are some examples.
Examples:
Input:
The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small difference. If we were just creating comments for an Article we’d have an integer field called article_id in the model to store the foreign key, but in this case we’re going to need something more abstract.
First two sentences:
The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small difference.
Input:
Mr. T is one mean dude. I'd hate to get in a fight with him.
First two sentences:
Mr. T is one mean dude. I'd hate to get in a fight with him.
Input:
The D.C. Sniper was executed by lethal injection at a Virginia prison. Death was pronounced at 9:11 p.m. ET.
First two sentences:
The D.C. Sniper was executed by lethal injection at a Virginia prison. Death was pronounced at 9:11 p.m. ET.
Input:
In her concluding remarks, the opposing attorney said that "...in this and so many other instances, two wrongs won’t make a right." The jury seemed to agree.
First two sentences:
In her concluding remarks, the opposing attorney said that "...in this and so many other instances, two wrongs won’t make a right." The jury seemed to agree.
Guys, as you can see, it's not so easy to pick out two sentences from a text. :(
As you've noticed, sentence tokenization is a bit trickier than it might first seem, so you may as well take advantage of existing solutions. The Punkt sentence tokenization algorithm is popular in NLP, and there is a good implementation in the Python Natural Language Toolkit (NLTK), whose usage is described here. They also describe another approach here.
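For example, with NLTK installed it comes down to a couple of lines (a minimal sketch; the pretrained Punkt model has to be downloaded once):

import nltk

nltk.download("punkt")   # one-time download of the pretrained Punkt models

text = ("The D.C. Sniper was executed by lethal injection at a Virginia prison. "
        "Death was pronounced at 9:11 p.m. ET.")

print(nltk.sent_tokenize(text))
# Punkt has learned common abbreviations, so "D.C." and "p.m." generally
# don't get mistaken for sentence boundaries.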
There are probably other implementations around, or you could read the original paper describing the Punkt algorithm: Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
You can also read another Stack Overflow question about sentence tokenizing here.
your_string = "First sentence. Second sentence. Third sentence"
sentences = your_string.split(".")
=> ["First sentence", " Second sentence", " Third sentence"]
No need to make simple code complicated.
Edit: Now that you've clarified that the real input is more complex than your initial example, you should disregard this answer as it doesn't consider edge cases. An initial look at NLP should show you what you're getting into, though.
Some of the edge cases that I've seen in the past to be a bit complicated are:
Dates: Some regions use dd.mm.yyyy
Quotes: While he was sighing — "Whatever, do it. Now. And by the way...". This was enough.
Units: He was going at 138 km. while driving on the freeway.
If you plan to parse these texts you should stay away from splits or regular expressions.
This will usually match sentences.
/\S(?:(?![.?!]+\s).)*[.?!]+(?=\s|$)/m
For your example of two sentences, take the first two matches.
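That pattern is written as a Ruby literal; the same idea in Python would look roughly like this (re.DOTALL stands in for Ruby's /m flag):

import re

SENTENCE = re.compile(r"\S(?:(?![.?!]+\s).)*[.?!]+(?=\s|$)", re.DOTALL)

text = "The first sentence. The second sentence. And the third one ends here."
print(SENTENCE.findall(text)[:2])
# ['The first sentence.', 'The second sentence.']
# Abbreviations such as "Mr." will still be treated as sentence ends, though.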
irb(main):005:0> a = "The first sentence. The second sentence. And the third"
irb(main):006:0> a.split(".")[0...2]
=> ["The first sentence", " The second sentence"]
irb(main):007:0>
EDIT: here's how you handle the "This is a sentence ...... and another . And yet another ..." case:
irb(main):001:0> a = "This is the first sentence ....... And the second. Let's not forget the third"
=> "This is the first sentence ....... And the second. Let's not forget the thir
d"
irb(main):002:0> a.split(/\.+/)
=> ["This is the first sentence ", " And the second", " Let's not forget the thi rd"]
And you can apply the same range operator (0...2) to extract the first two.
You will find tips and software links on the sentence boundary detection Wikipedia page.
If you know which sentences you're searching for, a regex should do well searching for
((YOUR SENTENCE HERE)|(YOUR OTHER SENTENCE)){1}
Split would probably use up quite a lot of memory, as it also keeps the things you don't need (the whole text that isn't your sentence), whereas a regex only keeps the sentence you searched for (if it finds it, of course).
If you're segmenting a piece of text into sentences, then what you want to do is begin by determining which punctuation marks can separate sentences. In general, this is !, ? and . (but if all you care about is a . for the texts you're processing, then just go with that).
Now since these can appear inside quotations, or as parts of abbreviations, what you want to do is find each occurrence of these punctuation marks and run some sort of machine learning classifier to determine whether that occurrence ends a sentence, or whether it does something else. This involves training data and a properly constructed classifier, and it won't be 100% accurate, because there's probably no way to be 100% accurate.
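To make the shape of that approach concrete, here is a bare-bones Python sketch; the feature set and the hand-written rule standing in for a trained classifier are invented for illustration and only good enough for the toy example at the end:

import re

ABBREVIATIONS = {"mr", "mrs", "dr", "st", "a.m", "p.m"}   # toy list

def features(text, i):
    # Describe the punctuation run starting at index i.
    prev = re.findall(r"[\w.]+$", text[:i])
    nxt = text[i + 1:i + 3].strip()[:1]
    return {
        "prev_is_abbrev": bool(prev) and prev[0].lower().rstrip(".") in ABBREVIATIONS,
        "next_is_upper": nxt.isupper(),
        "next_is_empty": nxt == "",
    }

def is_boundary(f):
    # Stand-in for the trained classifier.
    return not f["prev_is_abbrev"] and (f["next_is_upper"] or f["next_is_empty"])

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]+", text):
        if is_boundary(features(text, m.start())):
            sentences.append(text[start:m.end()].strip())
            start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Mr. T is one mean dude. I'd hate to get in a fight with him."))
# ['Mr. T is one mean dude.', "I'd hate to get in a fight with him."]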
I suggest looking in the literature for sentence segmentation techniques, and have a look at the various natural language processing toolkits that are out there. I haven't really found one for Ruby yet, but I happen to like OpenNLP (which is in Java).

Resources