Fastest String Filtering Algorithm - algorithm

I have 5,000,000 unordered strings formatted this way (Name.Name.Day-Month-Year 24hrTime):
"John.Howard.12-11-2020 13:14"
"Diane.Barry.29-07-2020 20:50"
"Joseph.Ferns.08-05-2020 08:02"
"Joseph.Ferns.02-03-2020 05:09"
"Josephine.Fernie.01-01-2020 07:20"
"Alex.Alexander.06-06-2020 10:10"
"Howard.Jennings.07-07-2020 13:17"
"Hannah.Johnson.08-08-2020 00:49"
...
What is the fastest way to find all strings having a time t between some n and m? (i.e. fastest way to remove all strings whose time < n || m < time)
This filtering will be done multiple times with different ranges. Time ranges must always be on the same day and the starting time is always earlier than the end time.
In Java, here's my current approach, given two time strings m and n and a list of 5 million strings:
// n is the start of the range and m is the end, both formatted "HH:mm"
List<String> finalSolution = new ArrayList<>();
String[] startArr = n.split(":");
String[] endArr = m.split(":");
int startMinutes = Integer.parseInt(startArr[0]) * 60 + Integer.parseInt(startArr[1]);
int endMinutes = Integer.parseInt(endArr[0]) * 60 + Integer.parseInt(endArr[1]);
for (String combinedString : listOf5Million) {
    // String.split takes a regex, so the dot has to be escaped
    String[] arr = combinedString.split("\\.");
    String[] subArr = arr[2].split(" ");
    String[] timeArr = subArr[1].split(":");
    int minutes = Integer.parseInt(timeArr[0]) * 60 + Integer.parseInt(timeArr[1]);
    // compare minutes since midnight; testing hour and minute separately
    // would wrongly reject e.g. 13:05 for the range 12:30-14:10
    if (minutes >= startMinutes && minutes <= endMinutes) {
        finalSolution.add(combinedString);
    }
}
Java's my native language but any other languages work too. Better/faster logic is what I am after

Some example in Python using index for every minute:
from pprint import pprint
from itertools import groupby
big_list = [
"John.Howard.12:14",
"Diane.Barry.13:50",
"xxxDiane.Barryxxx.13:50", # <-- added a name in the same HH:MM
"Joseph.Ferns.08:02",
"Joseph.Ferns.05:09",
"Josephine.Fernie.07:20",
"Alex.Alexander.10:10",
"Howard.Jennings.12:17",
"Hannah.Johnson.00:49",
]
# 1. sort the list by time HH:MM
big_list = sorted(big_list, key=lambda k: k[-5:])
# the list is now:
# ['Hannah.Johnson.00:49',
# 'Joseph.Ferns.05:09',
# 'Josephine.Fernie.07:20',
# 'Joseph.Ferns.08:02',
# 'Alex.Alexander.10:10',
# 'John.Howard.12:14',
# 'Howard.Jennings.12:17',
# 'Diane.Barry.13:50',
# 'xxxDiane.Barryxxx.13:50']
# 2. create an index (for every minute in a day)
index = {}
times = []
for i, item in enumerate(big_list):
    times.append(int(item[-5:-3]) * 60 + int(item[-2:]))
last = 0
cnt = 0
for v, g in groupby(times):
    for i in range(last, v):
        index[i] = [cnt, cnt]
    s = sum(1 for _ in g)
    index[v] = [cnt, cnt + s]
    cnt += s
    last = v + 1
for i in range(last, 60 * 24):
    index[i] = [cnt, cnt]
# 3. you can now do a fast query using the index
def find_all_strings(n, m):
    n = int(n[-5:-3]) * 60 + int(n[-2:])
    m = int(m[-5:-3]) * 60 + int(m[-2:])
    return big_list[index[n][0] : index[m][1]]
print(find_all_strings("00:10", "00:30")) # []
print(find_all_strings("00:30", "00:50")) # ['Hannah.Johnson.00:49']
print(find_all_strings("12:00", "13:55")) # ['John.Howard.12:14', 'Howard.Jennings.12:17', 'Diane.Barry.13:50', 'xxxDiane.Barryxxx.13:50']
print(find_all_strings("13:00", "13:55")) # ['Diane.Barry.13:50', 'xxxDiane.Barryxxx.13:50']
print(find_all_strings("15:00", "23:00")) # []
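The same sort-then-slice idea also works without a per-minute index, using binary search from Python's standard bisect module, and it can be keyed by day so that the date part of the original format is respected. This is a minimal sketch, assuming lines look like 'First.Last.DD-MM-YYYY HH:MM' with no surrounding quotes; build_index and strings_between_bisect are hypothetical names chosen for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def to_mins(hr_mn):
    """Convert 'HH:MM' to minutes since midnight."""
    hr, mn = hr_mn.split(':')
    return int(hr) * 60 + int(mn)


def build_index(lines):
    """Group lines by day; per day keep two parallel, time-sorted lists."""
    grouped = defaultdict(list)
    for s in lines:
        name_day, time = s.split()
        day = name_day.rsplit('.', 1)[1]
        grouped[day].append((to_mins(time), s))
    index = {}
    for day, pairs in grouped.items():
        pairs.sort()
        index[day] = ([t for t, _ in pairs], [s for _, s in pairs])
    return index


def strings_between_bisect(index, day, start, finish):
    """Return all strings on `day` with start <= time <= finish."""
    if day not in index:
        return []
    minutes, lines = index[day]
    lo = bisect_left(minutes, to_mins(start))    # first time >= start
    hi = bisect_right(minutes, to_mins(finish))  # one past the last time <= finish
    return lines[lo:hi]


sample = [
    "Hannah.Johnson.08-08-2020 00:49",
    "Josephine.Fernie.08-08-2020 07:20",
    "Alex.Alexander.08-08-2020 10:10",
    "Hannah.Johnson.08-08-2020 09:49",
    "John.Howard.12-11-2020 13:14",
]
idx = build_index(sample)
print(strings_between_bisect(idx, "08-08-2020", "09:49", "10:10"))
```

Each query costs two binary searches plus the size of the returned slice, so the 5 million strings are only parsed once, when the index is built.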

Since the data will be searched many times, I first parse the strings to make repeated searches easy; see by_date.
I use binary search to find the first string of a particular day then iterate through increasing times collecting appropriate strings in variable filtered of function strings_between.
# -*- coding: utf-8 -*-
"""
https://stackoverflow.com/questions/67562250/fastest-string-filtering-algorithm
Created on Tue May 18 09:20:11 2021
@author: Paddy3118
"""
strings = """\
John.Howard.12-11-2020 13:14
Diane.Barry.29-07-2020 20:50
Joseph.Ferns.08-05-2020 08:02
Joseph.Ferns.02-03-2020 05:09
Josephine.Fernie.01-01-2020 07:20
Alex.Alexander.06-06-2020 10:10
Howard.Jennings.07-07-2020 13:17
Hannah.Johnson.08-08-2020 00:49
Josephine.Fernie.08-08-2020 07:20
Alex.Alexander.08-08-2020 10:10
Howard.Jennings.08-08-2020 13:17
Hannah.Johnson.08-08-2020 09:49\
"""
## First parse the date information once for all future range calcs
def to_mins(hr_mn='00:00'):
    hr, mn = hr_mn.split(':')
    return int(hr) * 60 + int(mn)
by_date = dict()  # Keys are individual days, values are time-sorted
for s in strings.split('\n'):
    name_day, time = s.strip().split()
    name, day = name_day.rsplit('.', 1)
    minutes = to_mins(time)
    if day not in by_date:
        by_date[day] = [(minutes, s)]
    else:
        by_date[day].append((minutes, s))
for day_info in by_date.values():
    day_info.sort()
## Now rely on dict search for day then binary + linear search within day.
def _bisect_left(a, x):
    """Return the index where to insert item x in list a, assuming a is sorted.
    The return value i is such that all e in a[:i] have e < x, and all e in
    a[i:] have e >= x. So if x already appears in the list, a.insert(x) will
    insert just before the leftmost x already there.
    'a' is a list of tuples whose first item is assumed sorted and searched upon.
    """
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo+hi)//2
        # Use __lt__ to match the logic in list.sort() and in heapq
        if a[mid][0] < x: lo = mid+1
        else: hi = mid
    return lo
def strings_between(day="01-01-2020", start="00:00", finish="23:59"):
    global by_date
    if day not in by_date:
        return []
    day_data = by_date[day]
    start, finish = to_mins(start), to_mins(finish)
    from_index = _bisect_left(day_data, start)
    filtered = []
    for time, s in day_data[from_index:]:
        if time <= finish:
            filtered.append(s)
        else:
            break
    return filtered
## Example data
assert by_date == {
'12-11-2020': [(794, 'John.Howard.12-11-2020 13:14')],
'29-07-2020': [(1250, 'Diane.Barry.29-07-2020 20:50')],
'08-05-2020': [(482, 'Joseph.Ferns.08-05-2020 08:02')],
'02-03-2020': [(309, 'Joseph.Ferns.02-03-2020 05:09')],
'01-01-2020': [(440, 'Josephine.Fernie.01-01-2020 07:20')],
'06-06-2020': [(610, 'Alex.Alexander.06-06-2020 10:10')],
'07-07-2020': [(797, 'Howard.Jennings.07-07-2020 13:17')],
'08-08-2020': [(49, 'Hannah.Johnson.08-08-2020 00:49'),
(440, 'Josephine.Fernie.08-08-2020 07:20'),
(589, 'Hannah.Johnson.08-08-2020 09:49'),
(610, 'Alex.Alexander.08-08-2020 10:10'),
(797, 'Howard.Jennings.08-08-2020 13:17')]}
## Example queries from command line
"""
In [7]: strings_between('08-08-2020')
Out[7]:
['Hannah.Johnson.08-08-2020 00:49',
'Josephine.Fernie.08-08-2020 07:20',
'Hannah.Johnson.08-08-2020 09:49',
'Alex.Alexander.08-08-2020 10:10',
'Howard.Jennings.08-08-2020 13:17']
In [8]: strings_between('08-08-2020', '09:30', '24:00')
Out[8]:
['Hannah.Johnson.08-08-2020 09:49',
'Alex.Alexander.08-08-2020 10:10',
'Howard.Jennings.08-08-2020 13:17']
In [9]: strings_between('08-08-2020', '09:49', '10:10')
Out[9]: ['Hannah.Johnson.08-08-2020 09:49', 'Alex.Alexander.08-08-2020 10:10']
In [10]:
"""

As @Paddy3118 already pointed out, binary search is probably the way to go.
(if your data is on disk): Load input data and sort by date/time.
With i0 being the start index of the result set and i1 being the end index of the result set (both obtained from binary search): enumerate resulting entries.
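The steps above can be sketched in Python with the standard bisect and datetime modules. This is a minimal sketch, not the Lisp code of this answer: it assumes lines look like 'First.Last.DD-MM-YYYY HH:MM' with no surrounding quotes, and parse_key, load_sorted and query are hypothetical names.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def parse_key(line):
    """Extract the timestamp from 'First.Last.DD-MM-YYYY HH:MM'."""
    stamp = line.rsplit('.', 1)[1]
    return datetime.strptime(stamp, "%d-%m-%Y %H:%M")


def load_sorted(lines):
    """Sort once by date/time; keep the keys in a parallel list for bisect."""
    entries = sorted(lines, key=parse_key)
    keys = [parse_key(s) for s in entries]
    return keys, entries


def query(keys, entries, start, end):
    """Find i0 and i1 by binary search, then enumerate entries[i0:i1]."""
    i0 = bisect_left(keys, start)   # first entry with time >= start
    i1 = bisect_right(keys, end)    # one past the last entry with time <= end
    return entries[i0:i1]


lines = [
    "John.Howard.12-11-2020 13:14",
    "Diane.Barry.29-07-2020 20:50",
    "Joseph.Ferns.08-05-2020 08:02",
    "Joseph.Ferns.02-03-2020 05:09",
    "Josephine.Fernie.01-01-2020 07:20",
    "Alex.Alexander.06-06-2020 10:10",
    "Howard.Jennings.07-07-2020 13:17",
    "Hannah.Johnson.08-08-2020 00:49",
]
keys, entries = load_sorted(lines)
print(query(keys, entries, datetime(2020, 8, 8, 0, 0), datetime(2020, 8, 8, 23, 59)))
```

Returning the pair (i0, i1) instead of the slice would avoid copying large result sets; the slice is used here only to keep the example short.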
The code I used (in Lisp) is shown at the end of this answer. It is not optimized in the slightest (I guess the loading and initial sorting could be made much faster with some optimization effort).
This is what my interactive session looked like (including timing information, for my foo.txt input file containing 5 million entries).
rlwrap sbcl --dynamic-space-size 2048
This is SBCL 2.1.1.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at http://www.sbcl.org/.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
(ql:quickload :cl-ppcre)
To load "cl-ppcre":
Load 1 ASDF system:
cl-ppcre
; Loading "cl-ppcre"
..
(:CL-PPCRE)
(load "fivemillion.lisp")
T
(time (defparameter data (load-input-for-queries "foo.txt")))
"sorting..."
Evaluation took:
32.091 seconds of real time
32.090620 seconds of total run time (31.386722 user, 0.703898 system)
[ Run times consist of 2.641 seconds GC time, and 29.450 seconds non-GC time. ]
100.00% CPU
15 lambdas converted
115,308,171,684 processor cycles
6,088,198,752 bytes consed
DATA
(time (defparameter output (query-interval data '(2018 1 1) '(2018 1 2))))
Evaluation took:
0.000 seconds of real time
0.000111 seconds of total run time (0.000109 user, 0.000002 system)
100.00% CPU
395,172 processor cycles
65,536 bytes consed
OUTPUT
(time (defparameter output (query-interval data '(2018 1 1) '(2018 1 2 8))))
Evaluation took:
0.000 seconds of real time
0.000113 seconds of total run time (0.000110 user, 0.000003 system)
100.00% CPU
399,420 processor cycles
65,536 bytes consed
OUTPUT
(time (defparameter output (query-interval data '(2018 1 1) '(2019 1 1))))
Evaluation took:
0.020 seconds of real time
0.022469 seconds of total run time (0.022469 user, 0.000000 system)
110.00% CPU
80,800,092 processor cycles
15,958,016 bytes consed
OUTPUT
So, while the load and sort time (done once) is nothing to write home about (but could be optimized), the (query-interval ...) calls are pretty fast. The bigger the result set of the query, the longer the list the function returns (more conses, more run time). I could have been more clever and just return the start and end indices of the result set and leave the collecting of the entries to the caller.
Here is the source code, which also includes the code for generating the test data sets I used:
(defun random-uppercase-character ()
(code-char (+ (char-code #\A) (random 26))))
(defun random-lowercase-character ()
(code-char (+ (char-code #\a) (random 26))))
(defun random-name-part (nchars)
(with-output-to-string (stream)
(write-char (random-uppercase-character) stream)
(loop repeat (- nchars 1) do
(write-char (random-lowercase-character) stream))))
(defun random-day-of-month ()
"Assumes every month has 31 days, because it does not matter
for this exercise."
(+ 1 (random 31)))
(defun random-month-of-year ()
(+ 1 (random 12)))
(defun random-year ()
"Some year between 2017 and 2022"
(+ 2017 (random 5)))
(defun random-hour-of-day ()
(random 24))
(defun random-minute-of-hour ()
(random 60))
(defun random-entry (stream)
(format stream "\"~a.~a.~d-~d-~d ~d:~d\"~%"
(random-name-part 10)
(random-name-part 10)
(random-day-of-month)
(random-month-of-year)
(random-year)
(random-hour-of-day)
(random-minute-of-hour)))
(defun generate-input (entry-count file-name)
(with-open-file (stream
file-name
:direction :output
:if-exists :supersede)
(loop repeat entry-count do
(random-entry stream))))
(defparameter *line-scanner*
(ppcre:create-scanner
"\"(\\w+).(\\w+).(\\d+)-(\\d+)-(\\d+)\\s(\\d+):(\\d+)\""))
;; 0 1 2 3 4 5 6
;; fname lname day month year hour minute
(defun decompose-line (line)
(let ((parts (nth-value
1
(ppcre:scan-to-strings
*line-scanner*
line))))
(make-array 7 :initial-contents
(list (aref parts 0)
(aref parts 1)
(parse-integer (aref parts 2))
(parse-integer (aref parts 3))
(parse-integer (aref parts 4))
(parse-integer (aref parts 5))
(parse-integer (aref parts 6))))))
(defconstant +fname-index+ 0)
(defconstant +lname-index+ 1)
(defconstant +day-index+ 2)
(defconstant +month-index+ 3)
(defconstant +year-index+ 4)
(defconstant +hour-index+ 5)
(defconstant +minute-index+ 6)
(defvar *compare-<-criteria*
(make-array 5 :initial-contents
(list +year-index+
+month-index+
+day-index+
+hour-index+
+minute-index+)))
(defun compare-< (dl1 dl2)
(labels ((comp (i)
(if (= i 5)
nil
(let ((index (aref *compare-<-criteria* i)))
(let ((v1 (aref dl1 index))
(v2 (aref dl2 index)))
(cond
((< v1 v2) t)
((= v1 v2) (comp (+ i 1)))
(t nil)))))))
(comp 0)))
(defun time-stamp-to-index (hours minutes)
(+ minutes (* 60 hours)))
(defun load-input-for-queries (file-name)
(let* ((decomposed-line-list
(with-open-file (stream file-name :direction :input)
(loop for line = (read-line stream nil nil)
while line
collect (decompose-line line))))
(number-of-lines (length decomposed-line-list))
(decomposed-line-array (make-array number-of-lines
:initial-contents
decomposed-line-list)))
(print "sorting...") (terpri)
(sort decomposed-line-array #'compare-<)))
(defun unify-date-list (date)
(let ((date-length (length date)))
(loop
for i below 5
collecting (if (> date-length i) (nth i date) 0))))
(defun decomposed-line-date<date-list (decomposed-line date-list)
(labels ((comp (i)
(if (= i 5)
nil
(let ((index (aref *compare-<-criteria* i)))
(let ((v1 (aref decomposed-line index))
(v2 (nth i date-list)))
(cond
((< v1 v2) t)
((= v1 v2) (comp (+ i 1)))
(t nil)))))))
(comp 0)))
(defun index-before (data key predicate
&key (left 0) (right (length data)))
(if (and (< left right) (> (- right left) 1))
(if (funcall predicate (aref data left) key)
(let ((mid (+ left (floor (- right left) 2))))
(if (funcall predicate (aref data mid) key)
(index-before data key predicate
:left mid
:right right)
(index-before data key predicate
:left left
:right mid)))
left)
right))
(defun query-interval (data start-date end-date)
"start-date and end-date are given as lists of the form:
'(year month day hour minute) or shorter versions e.g.
'(year month day hour), omitting trailing values which will be
appropriately defaulted."
(let ((d0 (unify-date-list start-date))
(d1 (unify-date-list end-date)))
(let* ((start-index (index-before
data
d0
#'decomposed-line-date<date-list))
(end-index (index-before
data
d1
#'decomposed-line-date<date-list
:left (cond
((< start-index 0) 0)
((>= start-index (length data))
(length data))
(t start-index)))))
(loop for i from start-index below end-index
collecting (aref data i)))))

Related

How do I include the upper limit of `rand` in Clojure?

In Clojure's rand function, the lower limit 0 is included, but the upper limit is not.
How to define a rand function equivalent which includes the upper limit?
Edit (simple explanation): Because rand generates a random double between 0 and 1 by generating an integer and dividing by another integer, you can implement a version with inclusive upper bound as
(defn rand-with-nbits [nbits]
(let [m (bit-shift-left 1 nbits)]
(/ (double (rand-int m))
(double (dec m)))))
(apply max (repeatedly 1000 #(rand-with-nbits 3)))
;; => 1.0
In the implementation, 53 bits are used, that is (rand-with-nbits 53).
Longer answer:
Take a look at the source code of rand. It internally calls Math.random, which in turn calls Random.nextDouble, that has the following implementation:
private static final double DOUBLE_UNIT = 0x1.0p-53; // 1.0 / (1L << 53)
...
public double nextDouble() {
return (((long)(next(26)) << 27) + next(27)) * DOUBLE_UNIT;
}
The above code essentially produces a random integer between 0 and 2^53-1 and divides it by 2^53. To get an inclusive upper bound, you would instead have to divide it by 2^53-1. The above code calls the next method which is protected, so to make our own tweaked implementation, we use proxy:
(defn make-inclusive-rand [lbits rbits]
(let [denom (double
(dec ; <-- `dec` makes upper bound inclusive
(bit-shift-left
1 (+ lbits rbits))))
g (proxy [java.util.Random] []
(nextDouble []
(/ (+ (bit-shift-left
(proxy-super next lbits) rbits)
(proxy-super next rbits))
denom)))]
(fn [] (.nextDouble g))))
(def my-rand (make-inclusive-rand 26 27))
It is unlikely that you will ever generate the upper bound:
(apply max (repeatedly 10000000 my-rand))
;; => 0.9999999980774417
However, if the underlying integers are generated with fewer bits, you will see that it works:
(def rand4 (make-inclusive-rand 2 2))
(apply max (repeatedly 100 rand4))
;; => 1.0
By the way, the above implementation is not thread-safe.
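The divide-by-the-largest-draw idea is not specific to the JVM. As an illustration only, here is a minimal Python sketch built on the standard library's random.getrandbits; rand_inclusive is a hypothetical name, and this is not how Clojure or Java implement rand.

```python
import random


def rand_inclusive(nbits=53, rng=random):
    """Return a float in [0.0, 1.0], both ends included.

    Draws an integer in [0, 2**nbits - 1] and divides by the largest
    possible draw, so the maximum draw maps to exactly 1.0.
    """
    top = (1 << nbits) - 1
    return rng.getrandbits(nbits) / top


# With only a few bits, the upper bound shows up quickly.
values = {rand_inclusive(2) for _ in range(1000)}
print(sorted(values))
```

With the default 53 bits the upper bound is just as unlikely to appear as in the Clojure version above.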

Why is Racket implementation so much faster than MIT Scheme?

The following code uses the Euclidean algorithm to calculate gcd(a,b) and integers s, t such that sa+tb=gcd(a,b) (for a Discrete Mathematics course). I coded it in C, and perhaps this will clearly illustrate the algorithm.
gcd.c:
#include <stdio.h>
int gcd_st(int m, int n, int *s, int *t) {
    int a, b, res, tmp;
    a = m>n?m:n;
    b = m>n?n:m;
    if(!b) {
        *s = 1;
        *t = 0;
        return a;
    }
    res = gcd_st(b, a%b, s, t);
    tmp = *t;
    *t = *s - *t*(a/b);
    *s = tmp;
    return res;
}
int main() {
    int st[2];
    for(int i=0; i<100000000; i++)
        gcd_st(42, 56, st, st+1);
    for(int i=0; i<100000000; i++)
        gcd_st(273, 110, st, st+1);
    int res = gcd_st(42, 56, st, st+1);
    printf("%d %d %d\n", res, st[0], st[1]);
    res = gcd_st(273, 110, st, st+1);
    printf("%d %d %d\n", res, st[0], st[1]);
}
Just for fun, I decided to code it in Scheme (Lisp) as well. At first, I tested it on MIT Scheme's implementation, and then using Racket's implementation.
gcd.scm (without first two lines); gcd.rkt (including first two lines):
#!/usr/bin/racket
#lang racket/base
(define (gcd_st m n)
(let ((a (max m n)) (b (min m n)))
(if (= b 0) (list a 1 0)
(let ((res (gcd_st b (remainder a b))))
(let ((val (list-ref res 0))
(s (list-ref res 1))
(t (list-ref res 2)))
(list val t (- s (* t (quotient a b)))))))))
(define (loop n fn)
(if (= n 0) 0
(loop (- n 1) fn)))
(loop 100000000 (lambda () (gcd_st 42 56)))
(loop 100000000 (lambda () (gcd_st 273 110)))
(display "a b: (gcd s t)\n42 56: ")
(display (gcd_st 42 56))
(display "\n273 110: ")
(display (gcd_st 273 110))
(display "\n")
Both programs run 10^8 iterations on two sample cases and produce the same output. However, the two Scheme implementations (which share the same code/algorithm) differ greatly in performance. The Racket implementation is also a great deal quicker than the C implementation, which in turn is much faster than the MIT-Scheme implementation.
The time difference is so drastic I thought maybe Racket was optimizing out the entire loop, since the result is never used, but the time still does seem to scale linearly with loop iterations. Is it possible that it is doing some introspection and optimizing out some of the code in the loop?
$ time ./gcd.rkt # Racket
0
0
a b: (gcd s t)
42 56: (14 1 -1)
273 110: (1 27 -67)
real 0m0.590s
user 0m0.565s
sys 0m0.023s
$ time scheme --quiet <gcd.scm # MIT-Scheme
a b: (gcd s t)
42 56: (14 1 -1)
273 110: (1 27 -67)
real 0m59.250s
user 0m58.886s
sys 0m0.129s
$ time ./gcd.out # C
14 1 -1
1 27 -67
real 0m7.987s
user 0m7.967s
sys 0m0.000s
Why is the Racket implementation so much quicker?
=====
Update: If anyone's wondering, here are the results using the corrected loop function taking the answer into account:
loop:
(define (loop n fn)
(fn)
(if (= n 1) 0
(loop (- n 1) fn)))
Racket (still slightly outperforms C, even including its setup time):
real 0m7.544s
user 0m7.472s
sys 0m0.050s
MIT Scheme
real 9m59.392s
user 9m57.568s
sys 0m0.113s
The question still holds about the large difference between the Scheme implementations (still large), however. I'll ask this separately to ignore confusion with the previous error.
You are not actually invoking your thunk that calls the computation within your implementation of loop. This is why it's so much faster than the C implementation. You're not actually computing anything.
I'm not sure why exactly MIT Scheme is so slow for this. Just counting down from 100 million seems like it should be lightning fast like it is in Racket.
To actually compute the gcd redundantly, throw away the result, and measure the time, implement loop like this:
(define (loop n fn)
(if (= n 0) 0
(begin
(fn)
(loop (- n 1) fn))))
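The difference between the two loop definitions is easy to observe by counting how often the thunk runs. The following is a small Python sketch of the same situation, not the Scheme code itself; loop_broken, loop_fixed and the calls counter are hypothetical names, and gcd_st mirrors the algorithm from the question.

```python
def gcd_st(m, n):
    """Extended Euclid: return (g, s, t) with s*max(m, n) + t*min(m, n) == g."""
    a, b = max(m, n), min(m, n)
    if b == 0:
        return a, 1, 0
    g, s, t = gcd_st(b, a % b)
    return g, t, s - t * (a // b)


def loop_broken(n, fn):
    """Counts down but never calls fn, like the loop in the question."""
    while n > 0:
        n -= 1
    return 0


def loop_fixed(n, fn):
    """Calls fn once per iteration and discards the result."""
    while n > 0:
        fn()
        n -= 1
    return 0


calls = [0]  # mutable counter so the thunk can record each invocation


def thunk():
    calls[0] += 1
    return gcd_st(273, 110)


loop_broken(1000, thunk)
print(calls[0])   # 0: the thunk was never invoked
loop_fixed(1000, thunk)
print(calls[0])   # 1000: one invocation per iteration
print(gcd_st(42, 56), gcd_st(273, 110))
```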

Iterative tree calculation in scheme

I'm trying to implement a function defined as such:
f(n) = n if n < 4
f(n) = f(n - 1) + 2f(n - 2) + 3f(n - 3) + 4f(n - 4) if n >= 4
The iterative way to do this would be to start at the bottom until I hit n, so if n = 6:
f(4) = (3) + 2(2) + 3(1) + 4(0) | 10
f(5) = f(4) + 2(3) + 3(2) + 4(1) | 10 + 16 = 26
f(6) = f(5) + 2f(4) + 3(3) + 4(2) | 26 + 2(10) + 17 = 63
Implementation attempt:
; m1...m4 | The results of the previous calculations (eg. f(n-1), f(n-2), etc.)
; result | The result thus far
; counter | The current iteration of the loop--starts at 4 and ends at n
(define (fourf-iter n)
(cond [(< n 4) n]
[else
(define (helper m1 m2 m3 m4 result counter)
(cond [(= counter n) result]
[(helper result m1 m2 m3 (+ result m1 (* 2 m2) (* 3 m3) (* 4 m4)) (+ counter 1))]))
(helper 3 2 1 0 10 4)]))
Several problems:
The returned result is one iteration less than what it's supposed to be, because the actual calculations don't take place until the recursive call
Instead of using the defined algorithm to calculate f(4), I'm just putting it right in there that f(4) = 10
Ideally I want to start result at 0 and counter at 3 so that the calculations are applied to m1 through m4 (and so that f(4) will actually be calculated out instead of being preset), but then 0 gets used for m1 in the next iteration when it should be the result of f(4) instead (10)
tl;dr either the result calculation is delayed, or the result itself is delayed. How can I write this properly?
I think the appropriately "Scheme-ish" way to write a function that's defined recursively like that is to use memoization. If a function f is memoized, then when you call f(4) first it looks up 4 in a key-value table and if it finds it, returns the stored value. Otherwise, it simply calculates normally and then stores whatever it calculates in the table. Therefore, f will never evaluate the same computation twice. This is similar to the pattern of making an array of size n and filling in values starting from 0, building up a solution for n. That method is called dynamic programming, and memoization and dynamic programming are really different ways of looking at the same optimization strategy - avoiding computing the same thing twice. Here's a simple Racket function memo that takes a function and returns a memoized version of it:
(define (memo f)
(let ([table (make-hash)])
(lambda args
(hash-ref! table
args
(thunk (apply f args))))))
Now, we can write your function f recursively without having to worry about the performance problems of ever calculating the same result twice, thus going from an exponential time algorithm down to a linear one while keeping the implementation straightforward:
(define f
(memo
(lambda (n)
(if (< n 4)
n
(+ (f (- n 1))
(* 2 (f (- n 2)))
(* 3 (f (- n 3)))
(* 4 (f (- n 4))))))))
Note that as long as the function f exists, it will keep in memory a table containing the result of every time it's ever been called.
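For comparison, the same memoization pattern can be sketched in Python, where the standard library's functools.lru_cache plays the role of the memo helper. This is an illustration of the technique rather than a translation of the Racket code.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache: every computed f(n) is kept
def f(n):
    """f(n) = n for n < 4, else f(n-1) + 2f(n-2) + 3f(n-3) + 4f(n-4)."""
    if n < 4:
        return n
    return f(n - 1) + 2 * f(n - 2) + 3 * f(n - 3) + 4 * f(n - 4)


print([f(n) for n in range(7)])  # [0, 1, 2, 3, 10, 26, 63]
```

As with the Racket version, the cache lives as long as the function does.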
If you want a properly tail-recursive solution, your best approach is probably to use the named let construct. If you do (let name ([id val] ...) body ...) then calling (name val ...) anywhere in body ... will jump back to the beginning of the let with the new values val ... for the bindings. An example:
(define (display-n string n)
(let loop ([i 0])
(when (< i n)
(display string)
(loop (add1 i)))))
Using this makes a tail-recursive solution for your problem much less wordy than defining a helper function and calling it:
(define (f n)
(if (< n 4)
n
(let loop ([a 3] [b 2] [c 1] [d 0] [i 4])
(if (<= i n)
(loop (fn+1 a b c d) a b c (add1 i))
a))))
(define (fn+1 a b c d)
(+ a (* 2 b) (* 3 c) (* 4 d)))
This version of the function keeps track of four values for f, then uses them to compute the next value and ditches the oldest value. This builds up a solution while only keeping four values in memory, and it doesn't keep a huge table stored between calls. The fn+1 helper function is for combining the four previous results of the function into the next result, it's just there for readability. This might be a function to use if you want to optimize for memory usage. Using the memoized version has two advantages however:
The memoized version is much easier to understand, the recursive logic is preserved.
The memoized version stores results between calls, so if you call f(10) and then f(4), the second call will only be a table lookup in constant time because calling f(10) stored all the results for calling f with n from 0 to 10.
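The rolling-window version can be sketched in Python as a plain loop over the four most recent values. f_iter is a hypothetical name; the starting values are f(3), f(2), f(1) and f(0) from the definition in the question.

```python
def f_iter(n):
    """Iteratively compute f(n) while keeping only four previous values."""
    if n < 4:
        return n
    a, b, c, d = 3, 2, 1, 0  # f(3), f(2), f(1), f(0)
    for _ in range(4, n + 1):
        # the new value becomes the newest; the oldest (d) is dropped
        a, b, c, d = a + 2 * b + 3 * c + 4 * d, a, b, c
    return a


print([f_iter(n) for n in range(7)])  # [0, 1, 2, 3, 10, 26, 63]
```

Because the loop starts from the base values, f(4) is computed by the recurrence instead of being hard-coded, which addresses the second problem listed in the question.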

How to count number of digits?

(CountDigits n) takes a positive integer n, and returns the number of digits it contains. e.g.,
(CountDigits 1) → 1
(CountDigits 10) → 2
(CountDigits 100) → 3
(CountDigits 1000) → 4
(CountDigits 65536) → 5
I think I'm supposed to use the remainder of the number and something else, but other than that I'm really lost. What I tried first was dividing the number by 10 and then seeing if the result was less than 1. If it was, then the number has 1 digit. If not, divide by 100, and so on. But I'm not really sure how to extend that to any number, so I scrapped that idea.
(define (num-digits number digit)
(if (= number digit 0)
1
Stumbled across this and had to provide the log-based answer:
(define (length n)
(+ 1 (floor (/ (log n) (log 10))))
)
Edit for clarity: This is an O(1) solution that doesn't use recursion. For example, given
(define (fact n)
(cond
[(= n 1) 1]
[else (* n (fact (- n 1)))]
)
)
(define (length n)
(+ 1 (floor (/ (log n) (log 10))))
)
Running (time (length (fact 10000))) produces
cpu time: 78 real time: 79 gc time: 47
35660.0
Indicating that 10000! produces an answer consisting of 35660 digits.
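One caveat worth checking before relying on the log formula: it works on floating-point values, so the quotient of the two logarithms may land slightly below a whole number at exact powers of ten, which would make the result one too small. The Python sketch below puts the log-based count next to an exact count so the two can be compared; digits_by_log and digits_exact are hypothetical names.

```python
import math


def digits_by_log(n):
    """Log-based count, as in the answer above; depends on float rounding."""
    return 1 + math.floor(math.log(n) / math.log(10))


def digits_exact(n):
    """Exact count for a positive integer via its decimal representation.

    Very large integers may hit Python's default int-to-str digit limit.
    """
    return len(str(n))


for n in (1, 10, 100, 1000, 65536, 10**15):
    print(n, digits_by_log(n), digits_exact(n))
```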
After some discussion in the comments, we figured out how to take a number n with x digits and to get a number with x-1 digits: divide by 10 (using integer division, i.e., we ignore the remainder). We can check whether a number only has one digit by checking whether it's less than 10. Now we just need a way to express the total number of digits in a number as a (recursive) function. There are two cases:
(base case) a number n less than 10 has 1 digit. So CountDigits(n) = 1.
(recursive case) a number n of 10 or more has CountDigits(n) = 1+CountDigits(n/10).
Now it's just a matter of coding this up. This sounds like homework, so I don't want to give everything away. You'll still need to figure out how to write the condition "n < 10" in Scheme, as well as "n/10" (just the quotient part), but the general structure is:
(define (CountDigits n) ; 1
(if [n is less than 10] ; 2
1 ; 3
(+ 1 (CountDigits [n divided by 10])))) ; 4
An explanation of those lines, one at a time:
(define (CountDigits n) begins the definition of a function called CountDigits that's called like (CountDigits n).
In Racket, if is used to evaluate one expression, called the test, or the condition, and then to evaluate and return the value of one of the two remaining expressions. (if test X Y) evaluates test, and if test produces true, then X is evaluated and the result is returned, but otherwise Y is evaluated and the result is returned.
1 is the value that you want to return when n is less than 10 (the base case above).
1+CountDigits(n/10) is the value that you want to return otherwise, and in Racket (and Scheme, and Lisp in general) it's written as (+ 1 (CountDigits [n divided by 10])).
It is a good idea to familiarize yourself with the style of the Racket documentation, so I will point you to the appropriate chapter: 3.2.2 Generic Numerics. The functions you'll need should be in there, and the documentation should provide enough examples for you to figure out how to write the missing bits.
I know this is old but for future reference to anyone who finds this personally I'd write it like this:
(define (count-digits n acc)
(if (< n 10)
(+ acc 1)
(count-digits (/ n 10) (+ acc 1))))
The difference is that this one is tail-recursive and is essentially equivalent to an iterative function (internally, Racket's iterative forms actually exploit this fact).
Using trace illustrates the difference:
(count-digits-taylor 5000000)
>(count-digits-taylor 5000000)
> (count-digits-taylor 500000)
> >(count-digits-taylor 50000)
> > (count-digits-taylor 5000)
> > >(count-digits-taylor 500)
> > > (count-digits-taylor 50)
> > > >(count-digits-taylor 5)
< < < <1
< < < 2
< < <3
< < 4
< <5
< 6
<7
7
(count-digits 5000000 0)
>(count-digits 5000000 0)
>(count-digits 500000 1)
>(count-digits 50000 2)
>(count-digits 5000 3)
>(count-digits 500 4)
>(count-digits 50 5)
>(count-digits 5 6)
<7
7
For this exercise this doesn't matter much, but it's a good style to learn. And of course since the original post asks for a function called CountDigits which only takes one argument (n) you'd just add:
(define (CountDigits n)
(count-digits n 0))
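To make the "equivalent to an iterative function" point concrete, here is a Python sketch of the accumulator-passing version next to the loop it corresponds to. Python does not eliminate tail calls, so the loop is the form one would normally write there; count_digits_acc and count_digits_loop are hypothetical names, and integer division (//) is used where the Racket code divides with /.

```python
def count_digits_acc(n, acc=0):
    """Accumulator-passing version: the recursive call is the last action."""
    if n < 10:
        return acc + 1
    return count_digits_acc(n // 10, acc + 1)


def count_digits_loop(n):
    """The same state (n, acc), updated in a loop instead of by a tail call."""
    acc = 0
    while n >= 10:
        n, acc = n // 10, acc + 1
    return acc + 1


print(count_digits_acc(5000000), count_digits_loop(5000000))  # 7 7
```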

Generating integers in ascending order using a set of prime numbers

I have a set of prime numbers and I have to generate integers using only those prime factors in increasing order.
For example, if the set is p = {2, 5} then my integers should be 1, 2, 4, 5, 8, 10, 16, 20, 25, …
Is there any efficient algorithm to solve this problem?
Removing a number and reinserting all its multiples (by the primes in the set) into a priority queue is wrong (in the sense of the question) - i.e. it produces correct sequence but inefficiently so.
It is inefficient in two ways - first, it overproduces the sequence; second, each PriorityQueue operation incurs extra cost (the operations remove_top and insert are not usually both O(1), certainly not in any list- or tree-based PriorityQueue implementation).
The efficient O(n) algorithm maintains pointers back into the sequence itself as it is being produced, to find and append the next number in O(1) time. In pseudocode:
return array h where
h[0]=1; n=0; ps=[2,3,5, ... ]; // base primes
is=[0 for each p in ps]; // indices back into h
xs=[p for each p in ps] // next multiples: xs[k]==ps[k]*h[is[k]]
repeat:
h[++n] := minimum xs
for each ref (i,x,p) in (is,xs,ps):
if( x==h[n] )
{ x := p*h[++i]; } // advance the minimal multiple/pointer
For each minimal multiple it advances its pointer, while at the same time calculating its next multiple value. This too effectively implements a PriorityQueue but with crucial distinctions - it is before the end point, not after; it doesn't create any additional storage except for the sequence itself; and its size is constant (just k numbers, for k base primes) whereas the size of past-the-end PriorityQueue grows as we progress along the sequence (in the case of Hamming sequence, based on set of 3 primes, as n^(2/3), for n numbers of the sequence).
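The pseudocode above translates fairly directly into Python. This is a minimal sketch; smooth_numbers, idx and nxt are hypothetical names standing in for the pseudocode's h, is and xs, and count is assumed to be at least 1.

```python
def smooth_numbers(primes, count):
    """Return the first `count` numbers whose prime factors all lie in `primes`."""
    h = [1]
    idx = [0] * len(primes)   # idx[k]: position in h of the next value to multiply by primes[k]
    nxt = list(primes)        # invariant: nxt[k] == primes[k] * h[idx[k]]
    while len(h) < count:
        m = min(nxt)
        h.append(m)
        for k, p in enumerate(primes):
            if nxt[k] == m:   # advance every pointer that produced m, so no duplicates arise
                idx[k] += 1
                nxt[k] = p * h[idx[k]]
    return h


print(smooth_numbers([2, 5], 9))      # [1, 2, 4, 5, 8, 10, 16, 20, 25]
print(smooth_numbers([2, 3, 5], 10))  # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12]
```

Each new element costs one pass over the k base primes, and the only storage besides the output is the two k-element lists.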
The classic Hamming sequence in Haskell is essentially the same algorithm:
h = 1 : map (2*) h `union` map (3*) h `union` map (5*) h
union a@(x:xs) b@(y:ys) = case compare x y of
    LT -> x : union xs b
    EQ -> x : union xs ys
    GT -> y : union a ys
We can generate the smooth numbers for arbitrary base primes using the foldi function (see Wikipedia) to fold lists in a tree-like fashion for efficiency, creating a fixed sized tree of comparisons:
smooth base_primes = h where      -- strictly increasing base_primes NB!
    h = 1 : foldi g [] [map (p*) h | p <- base_primes]
    g (x:xs) ys = x : union xs ys
foldi f z [] = z
foldi f z (x:xs) = f x (foldi f z (pairs f xs))
pairs f (x:y:t) = f x y : pairs f t
pairs f t = t
It is also possible to directly calculate a slice of the Hamming sequence around its nth member in O(n^(2/3)) time, by direct enumeration of the triples and assessing their values through logarithms, logval(i,j,k) = i*log 2 + j*log 3 + k*log 5. This Ideone.com test entry calculates the 1 billionth Hamming number in 0.05 seconds, down from 1.12 (2016-08-18: main speedup due to usage of Int instead of the default Integer where possible, even on 32-bit; additional 20% thanks to the tweak suggested by @GordonBGood, bringing band size complexity down to O(n^(1/3))).
This is discussed some more in this answer where we also find its full attribution:
slice hi w = (c, sortBy (compare `on` fst) b) where  -- hi is a top log2 value
    lb5=logBase 2 5 ; lb3=logBase 2 3                -- w<1 (NB!) is (log2 width)
    (Sum c, b) = fold                                -- total count, the band
        [ ( Sum (i+1),                               -- total triples w/this j,k
            [ (r,(i,j,k)) | frac < w ] )             -- store it, if inside the band
          | k <- [ 0 .. floor ( hi   /lb5) ], let p = fromIntegral k*lb5,
            j <- [ 0 .. floor ((hi-p)/lb3) ], let q = fromIntegral j*lb3 + p,
            let (i,frac) = pr (hi-q) ; r = hi - frac -- r = i + q
        ]    -- (sum . map fst &&& concat . map snd)
    pr = properFraction
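As a rough illustration of the band enumeration, here is a Python sketch of the same counting idea. It is not the author's code: the names `slice_band` and `nth_from_band` are hypothetical, the caller is assumed to supply a suitable `hi` (the estimate of log2 of the nth number is not shown here), and `w` is assumed to be at most 1 so that each (j, k) pair contributes at most one triple to the band.

```python
import math

LB3 = math.log2(3)
LB5 = math.log2(5)

def slice_band(hi, w):
    """Count triples (i, j, k) with i + j*LB3 + k*LB5 <= hi and collect the
    ones whose log2 value lies within w of hi, sorted by that value."""
    count, band = 0, []
    for k in range(math.floor(hi / LB5) + 1):
        p = k * LB5
        for j in range(math.floor((hi - p) / LB3) + 1):
            q = j * LB3 + p
            i = math.floor(hi - q)       # largest exponent of 2 that still fits
            if i < 0:                    # guard against floating-point round-off
                continue
            frac = (hi - q) - i
            count += i + 1               # exponents 0..i all fit for this (j, k)
            if frac < w:                 # top triple is inside the band: keep it
                band.append((hi - frac, (i, j, k)))
    band.sort()
    return count, band

def nth_from_band(n, hi, w):
    """Return the nth Hamming number (1-based), if it falls inside the band."""
    count, band = slice_band(hi, w)
    pos = len(band) - 1 - (count - n)    # the band's top entry is number `count`
    if not 0 <= pos < len(band):
        raise ValueError("n falls outside the band; adjust hi or w")
    i, j, k = band[pos][1]
    return 2**i * 3**j * 5**k            # exact integer value of the chosen triple
```

For example, with `hi = math.log2(37)` and `w = 1.0` the count covers every Hamming number up to 36, and the band holds those above 18.5. Only the band is sorted, which is where the savings over generating the whole sequence come from.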
This can be generalized for k base primes as well, probably running in O(n^((k-1)/k)) time.
(See this SO entry for an important later development. Also, this answer is interesting, and there is another related answer.)
The basic idea is that 1 is a member of the set, and for each member n of the set, 2n and 5n are also members. Thus, you begin by outputting 1 and pushing 2 and 5 onto a priority queue. Then you repeatedly pop the front item of the priority queue, output it if it differs from the previous output, and push 2 times and 5 times that number onto the priority queue.
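A minimal Python sketch of this priority-queue idea, using the standard `heapq` module, might look like the following. The function name `smooth_via_heap` is illustrative, and as a small variation the queue is seeded with 1 and children are pushed only for values that were not duplicates, which produces the same output.

```python
import heapq

def smooth_via_heap(factors, count):
    """Return the first `count` products of powers of the given factors."""
    heap = [1]                       # min-heap seeded with the first member
    out = []
    prev = 0                         # last value output, used to skip duplicates
    while len(out) < count:
        x = heapq.heappop(heap)      # smallest pending candidate
        if x == prev:
            continue                 # already output (e.g. 10 = 2*5 = 5*2)
        out.append(x)
        prev = x
        for f in factors:
            heapq.heappush(heap, x * f)
    return out
```

Unlike the pointer-based variant, the heap here grows as the sequence is produced, since each output pushes one candidate per factor but pops only one.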
Search for "Hamming number" or "regular number", or see OEIS sequence A003592, to learn more.
----- ADDED LATER -----
I decided to spend a few minutes on my lunch hour to write a program to implement the algorithm described above, using the Scheme programming language. First, here is a library implementation of priority queues using the pairing heap algorithm:
(define pq-empty '())
(define pq-empty? null?)
(define (pq-first pq)
(if (null? pq)
(error 'pq-first "can't extract minimum from null queue")
(car pq)))
(define (pq-merge lt? p1 p2)
(cond ((null? p1) p2)
((null? p2) p1)
((lt? (car p2) (car p1))
(cons (car p2) (cons p1 (cdr p2))))
(else (cons (car p1) (cons p2 (cdr p1))))))
(define (pq-insert lt? x pq)
(pq-merge lt? (list x) pq))
(define (pq-merge-pairs lt? ps)
(cond ((null? ps) '())
((null? (cdr ps)) (car ps))
(else (pq-merge lt? (pq-merge lt? (car ps) (cadr ps))
(pq-merge-pairs lt? (cddr ps))))))
(define (pq-rest lt? pq)
(if (null? pq)
(error 'pq-rest "can't delete minimum from null queue")
(pq-merge-pairs lt? (cdr pq))))
Now for the algorithm. Function f takes two parameters: ps, the list of numbers that generate the set, and n, the number of items to output from the head of the sequence. The algorithm is slightly changed; the priority queue is initialized by pushing 1, then the extraction steps start. Variable p is the previous output value (initially 0), pq is the priority queue, and xs is the output list, which is accumulated in reverse order. Here's the code:
(define (f ps n)
(let loop ((n n) (p 0) (pq (pq-insert < 1 pq-empty)) (xs (list)))
(cond ((zero? n) (reverse xs))
((= (pq-first pq) p) (loop n p (pq-rest < pq) xs))
(else (loop (- n 1) (pq-first pq) (update < pq ps)
(cons (pq-first pq) xs))))))
For those not familiar with Scheme, loop is a locally-defined function that is called recursively, and cond is the head of an if-else chain; in this case, there are three cond clauses, each clause with a predicate and consequent, with the consequent evaluated for the first clause for which the predicate is true. The predicate (zero? n) terminates the recursion and returns the output list in the proper order. The predicate (= (pq-first pq) p) indicates that the current head of the priority queue has been output previously, so it is skipped by recurring with the rest of the priority queue after the first item. Finally, the else predicate, which is always true, identifies a new number to be output, so it decrements the counter, saves the current head of the priority queue as the new previous value, updates the priority queue to add the new children of the current number, and inserts the current head of the priority queue into the accumulating output.
Since it is non-trivial to update the priority queue to add the new children of the current number, that operation is extracted to a separate function:
(define (update lt? pq ps)
(let loop ((ps ps) (pq pq))
(if (null? ps) (pq-rest lt? pq)
(loop (cdr ps) (pq-insert lt? (* (pq-first pq) (car ps)) pq)))))
The function loops over the elements of the ps list, inserting the product of the head of the priority queue and each element into the priority queue in turn; the if returns the updated priority queue, minus its old head, when the ps list is exhausted. The recursive step strips the head of the ps list with cdr and continues with the remaining elements.
Here are two examples of the algorithm:
> (f '(2 5) 20)
(1 2 4 5 8 10 16 20 25 32 40 50 64 80 100 125 128 160 200 250)
> (f '(2 3 5) 20)
(1 2 3 4 5 6 8 9 10 12 15 16 18 20 24 25 27 30 32 36)
You can run the program at http://ideone.com/sA1nn.
This 2-dimensional exploring algorithm is not exact, but works for the first 25 integers, then mixes up 625 and 512.
n = 0
exp_before_5 = 2
while true
i = 0
do
output 2^(n-exp_before_5*i) * 5^Max(0, n-exp_before_5*(i+1))
i <- i + 1
loop while n-exp_before_5*(i+1) >= 0
n <- n + 1
end while
Based on user448810's answer, here's a solution that uses heaps and vectors from the STL.
Now, heaps normally output the largest value, so we store the negative of the numbers as a workaround (since a>b <==> -a<-b).
#include <vector>
#include <iostream>
#include <algorithm>
int main()
{
std::vector<int> primes;
primes.push_back(2);
primes.push_back(5);//Our prime numbers that we get to use
std::vector<int> heap;//the heap that is going to store our possible values
heap.push_back(-1);
std::vector<int> outputs;
outputs.push_back(1);
while(outputs.size() < 10)
{
std::pop_heap(heap.begin(), heap.end());
int nValue = -*heap.rbegin();//Get current smallest number
heap.pop_back();
if(nValue != *outputs.rbegin())//Is it a repeat?
{
outputs.push_back(nValue);
}
for(unsigned int i = 0; i < primes.size(); i++)
{
heap.push_back(-nValue * primes[i]);//add new values
std::push_heap(heap.begin(), heap.end());
}
}
//output our answer
for(unsigned int i = 0; i < outputs.size(); i++)
{
std::cout << outputs[i] << " ";
}
std::cout << std::endl;
}
Output:
1 2 4 5 8 10 16 20 25 32

Resources