Lua string to number parsing speed optimization - performance

I am trying to make a speedtest using Lua as one of the languages and I just wanted some advice on how I could make my code a bit faster if possible. It is important that I do my own speedtest, since I am looking at very specific parameters.
The code is reading from a file which looks something like this, but the numbers are randomly generated and range from 1 zu 1 000 000. There are between 100 and 10 000 numbers in one list:
type
(123,124,364,5867,...)
type
(14224,234646,5686,...)
...
The type is meant for another language, so it can be ignored. I just put this here so you know why I am not parsing every line. This is my Lua code:
incr = 1
for line in io.lines(arg[1]) do
incr = incr +1
if incr % 3 == 0 then
line:gsub('([%d]+),?',function(n)tonumber(n)end)
end
end
Now, the code works and does exactly what I want it to do. This is not about getting it to work, this is simply about speed. I need ideas and advice to make the code work at optimal speed.
Thanks in advance for any answers.

IMHO, this tonumber() benchmarking is rather strange. Most of CPU time would be spent on other tasks (regexp parsing, file reading, ...).
Instead of converting to number and ignoring result it would be more logical to calculate sum of all the numbers in input file:
local gmatch, s = string.gmatch, 0
for line in io.lines(arg[1]) do
for n in gmatch(line, '%d+') do
s = s + n -- converting string to number is automatic here
end
end
print(s)

Related

A memory efficient way for a randomized single pass over a set of indices

I have a big file (about 1GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen, my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing)
To do this I'm writing a sequence of 32 bit integers to memory as a background process while other code is running, like the following:
from struct import pack
with open('./FILE', 'rb+', buffering=0) as f:
f.seek(0)
counter = 1
while counter < SIZE+1:
f.write(pack('>i', counter))
counter+=1
Then after I do some other stuff it's very easy to see if we missed a write since there will be a gap instead of the sequential increasing sequence. This works well enough. My problem is some data corruption cases might only be caught with random I/O (not sequential like this) based on how we track changes to files
So what I need is a method for performing a single pass of random I/O over my 1GB file, but I can't really store this in memory since 1GB ~= 250 million 4-byte integers. Considered chunking up the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job that would be awesome. Like this:
from struct import pack
def rand_index_generator:
generator = RAND_INDEX(1, MAX+1, NO REPLACEMENT)
counter = 0
while counter < MAX:
counter+=1
yield generator.next_index()
with open('./FILE', 'rb+', buffering=0) as f:
counter = 1
for index in rand_index_generator:
f.seek(4*index)
f.write(pack('>i', counter))
counter+=1
I need it:
Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later
Is there a way to do this in Python 2.7?
Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:
def rand_index_generator(a,b):
ctr=0
while True:
yield (ctr%b)
ctr+=a
Then, initialize it with your index size, b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.
index_gen = rand_index_generator(1934919251, 2**28)
Then each time you want the new index you use index_gen.next() and this is guaranteed to iterate over numbers between [0,2^28-1] in a semi-randomish manner depending on your choice of 'a'
There's really no point in picking an a value larger than your index size, since the mod gets rid of the remainder anyways. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed which is what I care about for simulating this write workload.

Random number generation from 1 to 7

I was going through Google Interview Questions. to implement the random number generation from 1 to 7.
I did write a simple code, I would like to understand if in the interview this question asked to me and if I write the below code is it Acceptable or not?
import time
def generate_rand():
ret = str(time.time()) # time in second like, 12345.1234
ret = int(ret[-1])
if ret == 0 or ret == 1:
return 1
elif ret > 7:
ret = ret - 7
return ret
return ret
while 1:
print(generate_rand())
time.sleep(1) # Just to see the output in the STDOUT
(Since the question seems to ask for analysis of issues in the code and not a solution, I am not providing one. )
The answer is unacceptable because:
You need to wait for a second for each random number. Many applications need a few hundred at a time. (If the sleep is just for convenience, note that even a microsecond granularity will not yield true random numbers as the last microsecond will be monotonically increasing until 10us are reached. You may get more than a few calls done in a span of 10us and there will be a set of monotonically increasing pseudo-random numbers).
Random numbers have uniform distribution. Each element should have the same probability in theory. In this case, you skew 1 more (twice the probability for 0, 1) and 7 more (thrice the probability for 7, 8, 9) compared to the others in the range 2-6.
Typically answers to this sort of a question will try to get a large range of numbers and distribute the ranges evenly from 1-7. For example, the above method would have worked fine if u had wanted randomness from 1-5 as 10 is evenly divisible by 5. Note that this will only solve (2) above.
For (1), there are other sources of randomness, such as /dev/random on a Linux OS.
You haven't really specified the constraints of the problem you're trying to solve, but if it's from a collection of interview questions it seems likely that it might be something like this.
In any case, the answer shown would not be acceptable for the following reasons:
The distribution of the results is not uniform, even if the samples you read from time.time() are uniform.
The results from time.time() will probably not be uniform. The result depends on the time at which you make the call, and if your calls are not uniformly distributed in time then the results will probably not be uniformly distributed either. In the worst case, if you're trying to randomise an array on a very fast processor then you might complete the entire operation before the time changes, so the whole array would be filled with the same value. Or at least large chunks of it would be.
The changes to the random value are highly predictable and can be inferred from the speed at which your program runs. In the very-fast-computer case you'll get a bunch of x followed by a bunch of x+1, but even if the computer is much slower or the clock is more precise, you're likely to get aliasing patterns which behave in a similarly predictable way.
Since you take the time value in decimal, it's likely that the least significant digit doesn't visit all possible values uniformly. It's most likely a conversion from binary to some arbitrary number of decimal digits, and the distribution of the least significant digit can be quite uneven when that happens.
The code should be much simpler. It's a complicated solution with many special cases, which reflects a piecemeal approach to the problem rather than an understanding of the relevant principles. An ideal solution would make the behaviour self-evident without having to consider each case individually.
The last one would probably end the interview, I'm afraid. Perhaps not if you could tell a good story about how you got there.
You need to understand the pigeonhole principle to begin to develop a solution. It looks like you're reducing the time to its least significant decimal digit for possible values 0 to 9. Legal results are 1 to 7. If you have seven pigeonholes and ten pigeons then you can start by putting your first seven pigeons into one hole each, but then you have three pigeons left. There's nowhere that you can put the remaining three pigeons (provided you only use whole pigeons) such that every hole has the same number of pigeons.
The problem is that if you pick a pigeon at random and ask what hole it's in, the answer is more likely to be a hole with two pigeons than a hole with one. This is what's called "non-uniform", and it causes all sorts of problems, depending on what you need your random numbers for.
You would either need to figure out how to ensure that all holes are filled equally, or you would have to come up with an explanation for why it doesn't matter.
Typically the "doesn't matter" answer is that each hole has either a million or a million and one pigeons in it, and for the scale of problem you're working with the bias would be undetectable.
Using the same general architecture you've created, I would do something like this:
import time
def generate_rand():
ret = str(time.time()) # time in second like, 12345.1234
ret = ret % 8 # will return pseudorandom numbers 0-7
if ret == 0:
return 1 # or you could also return the result of another call to generate_rand()
return ret
while 1:
print(generate_rand())
time.sleep(1)

Attempt to "go back" without goto statement

The code examples are gonna be in Lua, but the question is rather general - it's just an example.
for k=0,100 do
::again::
local X = math.random(100)
if X <= 30
then
-- do something
else
goto again
end
end
This code generates 100 pseudorandom numbers between 0-30. It should do it between 0-100, but doesn't let the loop go on if any of them is larger than 30.
I try to do this task without goto statement.
for k=0,100 do
local X = 100 -- may be put behind "for", in some cases, the matter is that we need an 'X' variable
while X >= 30 do --IMPORTANT! it's the opposite operation of the "if" condition above!
X = math.random(100)
end
-- do the same "something" as in the condition above
end
Instead, this program runs the random number generation until I get a desired value. In general, I put all the codes here that was between the main loop and the condition in the first example.
Theoretically, it does the same as the first example, only without gotos. However, I'm not sure in it.
Main question: are these program codes equal? They do the same? If yes, which is the faster (=more optimized)? If no, what's the difference?
It is bad practice to use Goto. Please see http://xkcd.com/292/
Anyway, I'm not much into Lua, but this looks simple enough;
For your first code: What you are doing is starting a loop to repeat 100 times. In the loop you make a random number between 0 and 100. If this number is less than or equal to 30, you do something with it. If this number is greater than 30, you actually throw it away and get another random number. This continues until you have 100 random numbers which will ALL be less than or equal to thirty.
The second code says: Start a loop from 0 to 100. Then you set X to be 100. Then you start another loop with this condition: As long as X is greater than 30, keep randomizing X. Only when X is less than 30 will your code exit and perform some action. When it has performed that action 100 times, the program ends.
Sure, both codes do the same thing, but the first one uses a goto - which is bad practice regardless of efficiency.
The second code uses loops, but is still not efficient - there are 2 levels of loops - and one is based on psuedo-random generation which can be extremely inefficient (maybe the CPU generates only numbers between 30-100 for a trillion iterations?) Then things get very slow. But this is also true for you're first piece of code - it has a 'loop' that is based on psuedo-random number generation.
TLDR; strictly speaking about efficiency, I do not see one of those being more efficient than the other. I could be wrong but it seems the same things is going on.
you can directly use math.random(lower, upper)
for k=0,100 do
local X = math.random(0, 30)
end
even faster.
As I see this pieces of code do the same, but using goto always isn't the best choice (in any programming language). For lua see details here

Why is lua's string pattern matching doing this?

I have an external application which monitors CPU and GPU temperatures...
I am using Lua with the alien extension to grab these values (via GetWindowText) and to do some pattern matching on these values, effectively extracting the temperature digits out of the string, which by default shows up as something like CPU 67.875 °C...
But perhaps I have the wrong idea on how patterns work in LUA (since they don't appear to be exactly like regex)?
The pattern I am using is [%d]+[.%d+]* which should match any number between 0 and 100.0, correct?
Yet oddly enough, I am getting incredibly strange output when values reach around 56.5 degrees (see link).
Why is this happening?
And how can I extract the correct floating point values (as a string) between 0 and 100 in the format of XYY.ZZZ, where X is not optional, Y is optional, and . is optional unless Z exists?
You're seeing the effect of accumulated rounding errors because 0.16 cannot be precisely represented in floating point. The code below performs better:
local n = 0
while n < 10000 do
local s = tostring(n/100)
local t = s:match("[%d]+[.%d+]*")
print(t)
n = n + 16
end
Now, to your question, try the simpler pattern below:
s="CPU 67.875 °C"
print(s:match("CPU +(.-) +"))

Generating random number in a given range in Fortran 77

I am a beginner trying to do some engineering experiments using fortran 77. I am using Force 2.0 compiler and editor. I have the following queries:
How can I generate a random number between a specified range, e.g. if I need to generate a single random number between 3.0 and 10.0, how can I do that?
How can I use the data from a text file to be called in calculations in my program. e.g I have temperature, pressure and humidity values (hourly values for a day, so total 24 values in each text file).
Do I also need to define in the program how many values are there in the text file?
Knuth has released into the public domain sources in both C and FORTRAN for the pseudo-random number generator described in section 3.6 of The Art of Computer Programming.
2nd question:
If your file, for example, looks like:
hour temperature pressure humidity
00 15 101325 60
01 15 101325 60
... 24 of them, for each hour one
this simple program will read it:
implicit none
integer hour, temp, hum
real p
character(80) junkline
open(unit=1, file='name_of_file.dat', status='old')
rewind(1)
read(1,*)junkline
do 10 i=1,24
read(1,*)hour,temp,p,hum
C do something here ...
10 end
close(1)
end
(the indent is a little screwed up, but I don't know how to set it right in this weird environment)
My advice: read up on data types (INTEGER, REAL, CHARACTER), arrays (DIMENSION), input/output (READ, WRITE, OPEN, CLOSE, REWIND), and loops (DO, FOR), and you'll be doing useful stuff in no time.
I never did anything with random numbers, so I cannot help you there, but I think there are some intrinsic functions in fortran for that. I'll check it out, and report tomorrow. As for the 3rd question, I'm not sure what you ment (you don't know how many lines of data you'll be having in a file ? or ?)
You'll want to check your compiler manual for the specific random number generator function, but chances are it generates random numbers between 0 and 1. This is easy to handle - you just scale the interval to be the proper width, then shift it to match the proper starting point: i.e. to map r in [0, 1] to s in [a, b], use s = r*(b-a) + a, where r is the value you got from your random number generator and s is a random value in the range you want.
Idigas's answer covers your second question well - read in data using formatted input, then use them as you would any other variable.
For your third question, you will need to define how many lines there are in the text file only if you want to do something with all of them - if you're looking at reading the line, processing it, then moving on, you can get by without knowing the number of lines ahead of time. However, if you are looking to store all the values in the file (e.g. having arrays of temperature, humidity, and pressure so you can compute vapor pressure statistics), you'll need to set up storage somehow. Typically in FORTRAN 77, this is done by pre-allocating an array of a size larger than you think you'll need, but this can quickly become problematic. Is there any chance of switching to Fortran 90? The updated version has much better facilities for dealing with standardized dynamic memory allocation, not to mention many other advantages. I would strongly recommend using F90 if at all possible - you will make your life much easier.
Another option, depending on the type of processing you're doing, would be to investigate algorithms that use only single passes through data, so you won't need to store everything to compute things like means and standard deviations, for example.
This subroutine generate a random number in fortran 77 between 0 and ifin
where i is the seed; some great number such as 746397923
subroutine rnd001(xi,i,ifin)
integer*4 i,ifin
real*8 xi
i=i*54891
xi=i*2.328306e-10+0.5D00
xi=xi*ifin
return
end
You may modifies in order to take a certain range.

Resources