Related
I'm reading a note about the definition of algorithm, it has two requirements that I don't know what's the differences between them
Definiteness: Every instruction should be clear and unambiguous. (I found a source with exactly the same statement)
From the resource I have there are 5 requirements: Input, Output, Definiteness, Finiteness, Effectiveness. I can understand the other 4 except the Definiteness. Can anyone provide some better definition if the above is not precise?
From the above I only suspect that there are at least two subtleties should be considered...
For conclusion from answers below: definiteness = defined(clear) + only_one(unambiguous).
Algorithm should be clear and unambiguous. Each of its steps (or phases), and their inputs/outputs should be clear and must lead to only one meaning.
For example, if one step is to add two integers, we must define both “integers” as well as the “add” operation: we cannot for example use the same symbol to mean addition in one place and multiplication somewhere else.
If presented to an educated human, the text should allow him to simulate execution by hand in exactly the way you had in mind (same steps taken, same results obtained).
When you don't quite understand the definition of a term provided by some author, it's often helpful to look for other definitions of it. I especially like the one for "definite" from wiktionary.org:
Free from any doubt.
In this context, clear becomes understandable, and unambiguous becomes with a single meaning.
It just means that instructions in an algorithm should have one and only one interpretation. Moreover, the interpretation should be obvious.
A statement like "Repeat steps 1 to 4 a few times" does not fit the criteria as "few times" can mean different number of tries to different people.
On the other hand, a statement like "Repeat steps 1 to 4 until x is equal to y" where x and y are some parameters in the algorithm is indeed clear and unambiguous.
I have a large, rather complicated procedural content generation lua project. One thing I want to be able to do, for debugging purposes, is use a random seed so that I can re-run the system & get the same results.
To the end, I print out the seed at the start of a run. The problem is, I still get completely different results each time I run it. Assuming the seed doesn't change anywhere else, this shouldn't be possible, right?
My question is, what other ways are there to influence the output of lua's math.random()? I've searched through all the code in the project, and there's only one place where I call math.randomseed(), and I do that before I do anything else. I don't use the time or date for any calculations, so that wouldn't be influencing the results... What else could I be missing?
Updated on 2/22/16 monkey patching math.random & math.randomseed has, oftentimes (but not always) output the same sequence of random numbers. But still not the same results – so I guess the real question is now: what behavior in lua is indeterminate, and could result in different output when the same code is run in sequence? Noting where it diverges, when it does, is helping me narrow it down, but I still haven't found it. (this code does NOT use coroutines, so I don't think it's a threading / race condition issue)
randomseed is using srandom/srand function, which "sets its argument as the seed for a new sequence of pseudo-random integers to be returned by random()".
I can offer several possible explanations:
you think you call randomseed, but you do not (random will initialize the sequence for you in this case).
you think you call randomseed once, but you call it multiple times (or some other part of the code calls randomseed as well, possibly at different times in your sequence).
some other part of the code calls random (some number of times), which generates different results for your part of the code.
there is nothing wrong with the generated sequence, but you are misinterpreting the results.
your version of Lua has a bug in srandom/random processing.
there is something wrong with srandom or random function in your system.
Having some information about your version of Lua and your system (in addition to the small example demonstrating the issue) would help in figuring out what's causing this.
Updated on 2016/2/22: It should be fairly easy to check; monkeypatch both math.randomseed and math.random and log all the calls and the values returned by the functions for two subsequent runs. Compare the results. If the results differ, you should be able to isolate why they differ and reproduce on a smaller example. You can also look at where the functions are called from using debug.traceback.
Correct, as stated in the documentation, 'equal seeds produce equal sequences of numbers.'
Immediately after setting the seed to a known constant value, output a call to rand - if this varies across runs, you know something is seriously wrong (corrupt library download, whack install, gamma ray hit your drive, etc).
Assuming that the first value matches across runs, add another output midway through the code. From there, you can use a binary search to zero in on where things go wrong (I.E. first half or second half of the code block in question).
While you can & should use some intuition to find the error as you go, keep in mind that if intuition alone was enough, you would have already found it, thus a bit of systematic elimination is warranted.
Revision to cover comment regarding array order:
If possible, use debugging tools. This SO post on detecting when the value of a Lua variable changes might help.
In the absence of tools, here's one way to roll your own for this problem:
A full debugging dump of any sizable array quickly becomes a mess that makes it tough to spot changes. Instead, I'd use a few extra variables & a test function to keep things concise.
Make two deep copies of the array. Let's call them debug01 & debug02 & call the original array original. Next, deliberately swap the order of two elements in debug02.
Next, build a function to compare two arrays & test if their elements match up & return / print the index of the first mismatch if they do not. Immediately after initializing the arrays, test them to ensure:
original & debug01 match
original & debug02 do not match
original & debug02 mismatch where you changed them
I cannot stress enough the insanity of using an unverified (and thus, potentially bugged) test function to track down bugs.
Once you've verified the function works, you can again use a binary search to zero in on where things go off the rails. As before, balance the use of a systematic search with your intuition.
I'm getting involved with ZPL (a little bit) since a few days, so I'm sorry if the questions will look stupid.
I've got to build a bar code 128 and I finally realized: I got to make it as shorter as possible.
My main question is: is it possible to switch to subset C and then back to B for just 2 digits? I read the documentation and subset C will ready digits from 00 to 99, so in theory it should work, practically, will it be worth it?
Basically when I translate a bar code with Zebra designer, and print it to a file, it doesn't bother to switch to subset C for just a couple of digits.
This is the text I need to see in the bar code:
AB1C234D567890123456
By the documentation I read, I would build something like this:
FD>:AB1C>523>64D>5567890123456
Instead Zebra Designer does:
FD>:AB1C234D>5567890123456
So the other question is, will the bar code be the same length? Actually, will mine be shorter? [I don't have a printer with me at the moment]
Last question:
Let's say I don't want to spend much time scripting this up, will the following work ok, or will it make the bar code larger?
AB1C>523>64D>556>578>590>512>534>556
So I can just build a very simple script which checks two chars per time, if they're both numbers, then add >5 to the string.
Thank you :)
Ah, some nice loose terminology. Do you mean couple="exactly 2" or couple="a few"?
Changing from one subset to another takes one code element, so for exactly 2 digits, you'd need one element to change and one to represent the 2 digits in subset C. On the other hand, staying with your original subset would take 2 elements - so no, it's not worth the change.
Further, if you were to change to C for 2 digits and then back to your original, the change would actually be costly - C(12)B = 3 elements whereas 12 would only be 2.
If you repeat the exercise for 4 digits, then switching to C would generate C(12)(34) = 3 elements against 4 to stay with what you have; or C(12)(34)B = 4 elements if you switch and change back, or 4 elements if you stick - so no gain.
With 6 or more successive numerics, then you gain regardless of whether or not you switch back.
So overall,
2-digit terminal : No difference
2-digit other : code is longer
4-digit terminal : code is shorter
4-digit other : no difference
more than 4 digits : code is shorter.
And an ODD number of digits would need to be output in code A or B for the first digit and then the above table applies to the remainder.
This may not be the answer you're looking for, but specifying A (Automatic Mode) as the final parameter to the ^BC command will make the printer do this for you.
Example:
^XA
^FO100,100
^BY3
^BCN,100,N,N,A
^FD0123456789^FS
^XZ
Does Reed-Solomon error correction work in an instance where there is a dropped byte (or multiple dropped bytes)? For example, let's say it's a (12,8) Reed Solomon code, so theoretically it should be able to correct 2 errors (or 4 erasures if the position is known). But, what happens if only 11 (or 10) bytes are received and one doesn't know which byte(s) were dropped? Will Reed-Solomon error correction work?
Thanks,
Ben
RS decoding for erasures requires the position of the symbols "dropped" or lost. The kind of error you're talking about is due to phase distortion.
You can make it work by simply cycling through the possible positions where the character might be missing and letting it try to correct your result, so let's say you received 10 characters:
1234567890
Have it correct the following values:
??1234567890
?1?234567890
?12?34567890
:
1??234567890
1?2?34567890
:
1234567890??
Each attempt will probably give you some result, most of which are not the one you want. But I would expect that there should be exactly one result with the minimal number of additional modifications, and that should be the one you want to use as the most likely to be correct answer.
For example, if you correct the first three numbers of the example above, you might get the following result:
v
361274567890
917234567890
312734569897
: ^ ^
For the first and third case, you have additional corrections made beyond filling in the two blanks (marked with v and ^), whereas in the second case you have only the missing positions filled in and the other characters match the uncorrected input. Therefore, I would choose answer 2 as the most likely to be correct one.
Clearly, the chances that this works depend on whether there are other errors. Unfortunately I'm not able to give you a rigorous set of conditions under which this method will work for sure.
.
Another thing you can do if your message is long enough is to use an interleaving technique to basically have multiple orthogonal RS codes cover your data. That way, if one fails, you might be able to recover with another one. This method is for example used on compact discs (CDs), where it is called CIRC.
No, Reed-Solomon can't automatically correct instances where there are missing bits, because just like most other FEC algorithms, it was only designed to correct bit-flips. If you know the position of the missing bits, you can pad your received signal at those positions so that RS can then work normally.
However, if you don't know the position, you will need to use another algorithm that supports bit-insertion or bit-deletion such as Marker Codes and Watermark Codes.
Also note that RS can be not only used for erasures but also to process noisy bits using Forney syndrome.
I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take groups messages like this:
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.
The temperature at P1 is 40F.
and puts out something in the form of a printf():
"The temperature at P%d is %dF.", Int1, Int2"
{(1,35), (1, 40), (3, 35), (1,40)}
The algorithm needs to be generic enough to recognize almost any data load in message groups.
I tried searching for this kind of technology, but I don't even know the correct terms to search for.
I think you might be overlooking and missed fscanf() and sscanf(). Which are the opposite of fprintf() and sprintf().
Overview:
A naïve!! algorithm keeps track of the frequency of words in a per-column manner, where one can assume that each line can be separated into columns with a delimiter.
Example input:
The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon
Frequencies:
Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}
We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).
The next step is to iterate through lines which generated these frequency lists, so let's take the first example.
The: meets some hand-wavy criteria and the algorithm decides it must be static.
dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
over: same deal as #1; it's static, so leave as is.
the: same deal as #1; it's static, so leave as is.
moon: same deal as #1; it's static, so leave as is.
Thus, just from going over the first line we can put together the following regular expression:
/The ([a-z]+?) jumps over the moon/
Considerations:
Obviously one can choose to scan part or the whole document for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.
False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.
The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.
Thanks for all the great suggestions.
Chris, is right. I am looking for a generic solution for normalizing any kind of text. The solution of the problem boils down to dynmamically finding patterns in two or more similar strings.
Almost like predicting the next element in a set, based on the previous two:
1: Everest is 30000 feet high
2: K2 is 28000 feet high
=> What is the pattern?
=> Answer:
[name] is [number] feet high
Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.
I thought about creating some high level semantic hashes to represent the patterns in the message strings.
I would use a tokenizer and give each of the tokens types a specific "weight".
Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
I was hoping, that I didn't have to reinvent the wheel and could reuse something that is already out there.
Klaus
It depends on what you are trying to do, if your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too..
You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.
Producing something like this, even just to recognize numbers, gets really hairy. For example is "123.456" one number or two? How about this "123,456"? Is "35F" a decimal number and an 'F' or is it the hex value 0x35F? You're going to have to build something that will parse in the way you need. You can do this with regular expressions, or you can do it with sscanf, or you can do it some other way, but you're going to have to write something custom.
However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):
my #vals = ();
while (defined(my $line = <>))
{
if ($line =~ /The temperature at P(\d*) is (\d*)F./)
{
push(#vals, "($1,$2)");
}
}
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < #vals; $i++)
{
print $vals[$i];
if ($i < #vals - 1)
{
print ",";
}
}
print "}\n";
The output from this isL
The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}
You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.
I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.
I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked or another pattern should be added.
After around an hour of work, I succeeded in finding the ~20 patterns to match 10000+ lines.
In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file removing any matched line, it leaves "only":
Logger stopped.
Logger started.
Which you can then match with "Logger (.+).".
You can then refine the patterns and find new ones to match your whole log.
#John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for it. The *scanf family can't do that on its own, it can only be of help once the patterns have been recognised in the first place.
#Derek Park: Well, even a strong AI couldn't be sure it had the right answer.
Perhaps some compression-like mechanism could be used:
Find large, frequent substrings
Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])
Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!
#Anders
Well, even a strong AI couldn't be sure it had the right answer.
I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).
Of course, it doesn't really matter, since we don't have strong AI. :)
http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree-interface for just flipping through pages.
Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.
The key issue of course is the complexity of your GRAMMAR. To use any kind of log-parser as the term is commonly used, you need to know exactly what you're scanning for, you can write a BNF for it. Many years ago I took a course based on Aho-and-Ullman's "Dragon Book", and the thoroughly understood LALR technology can give you optimal speed, provided of course that you have that CFG.
On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.