Slowdowns of performance of ets select

Slowdowns of performance of ets select - performance

I did some tests around performance of selection from ets tables and noted weird behaviour. For example we have a simple ets table (without any specific options) which stores key/value - a random string and a number:
:ets.new(:table, [:named_table])
for _i <- 1..2000 do
:ets.insert(:table, {:crypto.strong_rand_bytes(10)
|> Base.url_encode64
|> binary_part(0, 10), 100})
end
and one entry with known key:
:ets.insert(:table, {"test_string", 200})
Now there is simple stupid benchmark function which tries to select test_string from the ets table multiple times and to measure time of each selection:
test_fn = fn() ->
Enum.map(Enum.to_list(1..10_000), fn(x) ->
:timer.tc(fn() ->
:ets.select(:table, [{{:'$1', :'$2'},
[{:'==', :'$1', "test_string"}],
[:'$_']}])
end)
end) |> Enum.unzip
end
Now If I take a look at maximum time with Enum.max(timings) it will return a value which is approximately 10x times greater than almost of all other selections. So, for example:
iex(1)> {timings, _result} = test_fn.()
....
....
....
iex(2)> Enum.max(timings)
896
iex(3)> Enum.sum(timings) / length(timings)
96.8845
We may see here that maximum value is almost 10x times greater than average value.
What's happening here? Is it somehow related to GC, time for memory allocation or something like this? Do you have any ideas why selection from an ets table may give such slowdowns sometimes or how to profile this.
UPD.
here is the graph of timings distribution:

match_spec, the 2nd argument of the select/2 is making it slower.
According to an answer on this question
Erlang: ets select and match performance
In trivial non-specific use-cases, select is just a lot of work around match.
In non-trivial more common use-cases, select will give you what you really want a lot quicker.
Also, if you are working with tables with set or ordered_set type, to get a value based on a key, use lookup/2 instead, as it is lot faster.
On my pc, following code
def lookup() do
{timings, _} = Enum.map(Enum.to_list(1..10_000), fn(_x) ->
:timer.tc(fn() ->
:ets.lookup(:table, "test_string")
end)
end) |> Enum.unzip
IO.puts Enum.max(timings)
IO.puts Enum.sum(timings) / length(timings)
end
printed
0
0.0
While yours printed
16000
157.9
In case you are interested, here you can find the NIF C code for ets:select.
https://github.com/erlang/otp/blob/9d1b3bb0db87cf95cb821af01189f6d6be072f79/erts/emulator/beam/erl_db.c

Related

Designing an algorithm to check combinations

I´m having serious performance issues with a job that is running everyday and I think i cannot improve the algorithm; so I´m gonnga explain you what is the problem to solve and the algorithm we have, and maybe you have some other ideas to solve the problem better.
So the problem we have to solve is:
There is a set of Rules, ~ 120.000 Rules.
Every rule has a set of combinations of Codes. Codes are basically strings. So we have ~8 combinations per rule. Example of a combination: TTAAT;ZZUHH;GGZZU;WWOOF;SSJJW;FFFOLL
There is a set of Objects, ~800 objects.
Every object has a set of ~200 codes.
We have to check for every Rule, if there is at least one Combination of Codes that is fully contained in the Objects. It means =>
loop in Rules
Loop in Combinations of the rule
Loop in Objects
every code of the combination found in the Object? => create relationship rule/object and continue with the next object
end of loop
end of loop
end of loop
For example, if we have the Rule with this combination of two codes: HHGGT; ZZUUF
And let´s say we have an object with this codes: HHGGT; DHZZU; OIJUH; ZHGTF; HHGGT; JUHZT; ZZUUF; TGRFE; UHZGT; FCDXS
Then we create a relationship between the Object and the Rule because every code of the combination of the rule is contained in the codes of the object => this is what the algorithm has to do.
As you can see this is quite expensive, because we need 120.000 x 8 x 800 = 750 millions of times in the worst-case scenario.
This is a simplified scenario of the real problem; actually what we do in the loops is a little bit more complicated, that´s why we have to reduce this somehow.
I tried to think in a solution but I don´t have any ideas!
Do you see something wrong here?
Best regards and thank you for the time :)

Something like this might work better if I'm understanding correctly (this is in python):
RULES = [
['abc', 'def',],
['aaa', 'sfd',],
['xyy', 'eff',]]
OBJECTS = [
('rrr', 'abc', 'www', 'def'),
('pqs', 'llq', 'aaa', 'sdr'),
('xyy', 'hjk', 'fed', 'eff'),
('pnn', 'rrr', 'mmm', 'qsq')
]
MapOfCodesToObjects = {}
for obj in OBJECTS:
for code in obj:
if (code in MapOfCodesToObjects):
MapOfCodesToObjects[code].add(obj)
else:
MapOfCodesToObjects[code] = set({obj})
RELATIONS = []
for rule in RULES:
if (len(rule) == 0):
continue
if (rule[0] in MapOfCodesToObjects):
ValidObjects = MapOfCodesToObjects[rule[0]]
else:
continue
for i in range(1, len(rule)):
if (rule[i] in MapOfCodesToObjects):
codeObjects = MapOfCodesToObjects[rule[i]]
else:
ValidObjects = set()
break
ValidObjects = ValidObjects.intersection(codeObjects)
if (len(ValidObjects) == 0):
break
for vo in ValidObjects:
RELATIONS.append((rule, vo))
for R in RELATIONS:
print(R)
First you build a map of codes to objects. If there are nObj objects and nCodePerObj codes on average per object, this takes O(nObj*nCodePerObj * log(nObj*nCodePerObj).
Next you iterate through the rules and look up each code in each rule in the map you built. There is a relation if a certain object occurs for every code in the rule, i.e. if it is in the set intersection of the objects for every code in the rule. Since hash lookups have O(1) time complexity on average, and set intersection has time complexity O(min of the lengths of the 2 sets), this will take O(nRule * nCodePerRule * nObjectsPerCode), (note that is nObjectsPerCode, not nCodePerObj, the performance gets worse when one code is included in many objects).

SAS: What's the optimal way to find the sum of a column by another column?

I want to find out the best way to perform a group-by in SAS so I can perform some benchmarks. The simplest two ways I can think of is Proc SQL and Proc means. Here is the example in proc sql
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
I think there are ways to make this run fast
use sasfile command to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to proc sql nor proc means, so if there are faster ways then I would love to know about it!!!
My set up code is as below
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
function RandBetween(min, max);
return (min + floor((1 + max - min) * rand("uniform")));
endsub;
run;
options cmplib=work.myfunc;
data RandInt;
do i = 1 to 250000000;
id = RandBetween(1, 2500000);
val = rand("uniform");
output;
end;
drop i;
run;
My SAS comparison macros are as below
%macro sasbench(dosql = N); %macro _; %mend;
%if &dosql. = Y %then %do;
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
%end;
proc means data=randint sum noprint;
var val ;
class id;
output out = summmeans(drop=_type_ _freq_) sum = /autoname;
run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
modify randint;
INDEX CREATE id / nomiss;
run;
%sasbench();

sasfile is only a benefit if the entire data set can fit into session ram limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs / different techniques on the same sasfile.
An index on id would help if the data was unsorted by id. When the data set is presorted by id the id column metadata will have sortedby flag set which a procedure can use for its own internal optimization, however there is no guarantee. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
The fastest way is direct addressing, but requires enough ram to handle the largest id value as an array index:
array ids(250000000) _temporary_
ids(id) + value
The next fastest way is probably hand coded array based hashing:
search SAS conference proceedings for papers by Paul Dorfman
The next fastest hash way is probably the hash component object with key suminc.
DATA Step was edited to align with the comments
data demo_data;
do rownum = 1 to 1000;
id = ceil(100*ranuni(123)); * NOTE: 100 different groups, disordered;
value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration individual values integers from 1..1000;
output;
end;
run;
data _null_;
if 0 then set demo_data(keep=id value); %* prep pdv ;
length total 8; %* prep keysum variable ;
call missing (total); %* prevent warnings ;
declare hash ids (ordered:'a', suminc:'value', keysum:'total'); %* ordered ensures keys will be sorted ascending upon output ;
ids.defineKey('id');
*ids.defineData('id'); % * not having a defineData is an implicit way of adding only the keys as data, only data + keysum variables are .output;
ids.defineDone();
* read all records and touch each hash key in order to perform tacit total+value summation;
do until (end);
set demo_data end=end;
if ids.find() ne 0 then ids.add();
end;
ids.output(dataset:'sum_value_over_id'); * save the summation of each key combination;
stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to be always a natural number, and group size is known to be < 10group size limit you could numerically encode the count by using a suminc of value + 10-group size limit and numerically decode count by processing the output data with count = (total - int(total)) * 10group size limit.
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
by id;
data sum_value_over_id_v2(keep=id total);
do until (last.id);
set foo;
by id;
total = sum(total, value);
end;
run;
You will likely find that I/O is largest component of performance.

The best answer varies dramatically by the application. In your example, PROC SQL at least on my machine significantly outperforms PROC MEANS, but there are plenty of cases where it will not do so. It's able to in this case because it's building hash tables behind the scenes, more than likely, which are quite fast - a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or similar to the fastest in simple cases, both because it's multithreaded (as opposed to data step being single threaded) and because it's got a fast hash table backend.
PROC MEANS is also usually going to be competitive, the case you show in the example is almost a worst case for it since it's got a huge number of class variables so I think it may be creating a temporary table on disk. It's also multithreaded. Reduce the class variable categories to 2500 instead of 2,500,000 and you get PROC MEANS a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above, and sometimes not, again depending on the data. Here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings is substantial (having more code to maintain is worse, typically). PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth it to go this route, especially if the data step is helpful. Of course, the hash table method is limited to fitting the results in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance; and when possible try to be self consistent with other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed, unless you're doing more complicated things than you present above. Summation is actually more work than I/O in many cases. Don't overcomplicate it, ultimately: programmer hours and difficulty of QA is something that should trump basic performance, unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.

If you assume the data is sorted then this is another solution
data sum_value_over_id_v2(keep=id total);
set a.randint(keep=id val);
by id;
total + val;
if last.id then do;
output;
total = 0;
end;
drop val;
run;

Julia: Parallel for loop over partitions iterator

So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = #parallel (+) for p in partitions(1:n)
int(is_valid(p))
end
println(valid_num)
This would use the #parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = #parallel (+) for i=1:200000000
Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running over night and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!

The below code gives 511, the number of partitions of size 2 of a set of 10.
using Iterators
s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p)==2
valid_num = #parallel (+) for i = 1:30
sum(map(is_valid, takenth(chain(1:29,drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p)==2
valid_num = #parallel (+) for i = 1:30
foldl((x,y)->(x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next
immutable TakeEvery{Itr}
itr::Itr
start::Any
value::Any
flag::Bool
skip::Int64
end
function take_every(itr, offset, skip)
value, state = Nothing, start(itr)
for i = 1:(offset+1)
if done(itr, state)
return TakeEvery(itr, state, value, false, skip)
end
value, state = next(itr, state)
end
if done(itr, state)
TakeEvery(itr, state, value, true, skip)
else
TakeEvery(itr, state, value, false, skip)
end
end
function start{Itr}(itr::TakeEvery{Itr})
itr.value, itr.start, itr.flag
end
function next{Itr}(itr::TakeEvery{Itr}, state)
value, state_, flag = state
for i=1:itr.skip
if done(itr.itr, state_)
return state[1], (value, state_, false)
end
value, state_ = next(itr.itr, state_)
end
if done(itr.itr, state_)
state[1], (value, state_, !flag)
else
state[1], (value, state_, false)
end
end
function done{Itr}(itr::TakeEvery{Itr}, state)
done(itr.itr, state[2]) && !state[3]
end

One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter,state,n)
i = n
arr = Array[]
while !done(iter,state) && (i>0)
a,state = next(iter,state)
push!(arr,a)
i = i-1
end
return arr, state
end
function get_part(npart,npar)
valid_num = 0
p = partitions(1:npart)
s = start(p)
while !done(p,s)
arr,s = my_take(p,s,npar)
valid_num += #parallel (+) for a in arr
length(a)
end
end
return valid_num
end
valid_num = #time get_part(10,30)
I was going to use the take() method to realize up to npar items from the iterator but take() appears to be deprecated so I've included my own implementation which I've called my_take(). The getPart() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.

Operating in parallel on a large constant datastructure in Julia

I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state:State, datum::Vector{String})
...
score::Vector{Float64} #Size of score = 10000
I have what amounts to a nested loop.
The outer loop typically would runs for 500 iterations
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
score_total = zeros(10100)
for datum in data
score_total+=MyScoringOperation(datum)
end
end
On one computer, on a small test case of 3000 (rather than 50,000) this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia 3.9 installed (and can get 3 more easily, and then can get hundreds more at the next scale).
I have basic experience with #parallel, however it seems like it is spending a lot of time copying the constant (It more or less hang on the smaller testing case)
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
score_total = #parallel(+) for datum in data
MyScoringOperation(state, datum)
end
state = update(state, score_total)
end
My understanding of the way this implementation works with #parallel is that it:
For Each ii:
partitions data into a chuck for each worker
sends that chuck to each worker
works all process there chunks
main procedure sums the results as they arrive.
I would like to remove step 2,
so that instead of sending a chunk of data to each worker,
I just send a range of indexes to each worker, and they look it up from their own copy of data. or even better, only giving each only their own chunk, and having them reuse it each time (saving on a lot of RAM).
Profiling backs up my belief about the functioning of #parellel.
For a similarly scoped problem (with even smaller data),
the non-parallel version runs in 0.09seconds,
and the parallel runs in
And the profiler shows almost all the time is spent 185 seconds.
Profiler shows almost 100% of this is spend interacting with network IO.

This should get you started:
function get_chunks(data::Vector, nchunks::Int)
base_len, remainder = divrem(length(data),nchunks)
chunk_len = fill(base_len,nchunks)
chunk_len[1:remainder]+=1 #remained will always be less than nchunks
function _it()
for ii in 1:nchunks
chunk_start = sum(chunk_len[1:ii-1])+1
chunk_end = chunk_start + chunk_len[ii] -1
chunk = data[chunk_start: chunk_end]
produce(chunk)
end
end
Task(_it)
end
function r_chunk_data(data::Vector)
all_chuncks = get_chunks(data, nworkers()) |> collect;
remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chuncks[ii]) for (ii,pid) in enumerate(workers())]
#Have to add the type annotation sas otherwise it thinks that, RemoteRef(pid) might return a RemoteValue
end
function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
total = nothing
#TODO: consider strongly wrapping total in a lock, when in 0.4, so that it is garenteed safe
#sync for rr in rem_results
function gather(rr)
res=fetch(rr)
if total===nothing
total=res
else
total=red_acc(total,res)
end
end
#async gather(rr)
end
total
end
function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
rem_results = map(r_chunks) do rchunk
function do_mapred()
#assert r_chunk.where==myid()
#pipe r_chunk |> fetch |> map(map_fun,_) |> reduce(red_acc, _)
end
remotecall(r_chunk.where,do_mapred)
end
#pipe rem_results|> convert(Vector{RemoteRef},_) |> fetch_reduce(red_acc, _)
end
rchunk_data breaks the data into chunks, (defined by get_chunks method) and sends those chunks each to a different worker, where they are stored in RemoteRefs.
The RemoteRefs are references to memory on your other proccesses(and potentially computers), that
prechunked_map_reduce does a variation on a kind of map reduce to have each worker first run map_fun on each of it's chucks elements, then reduce over all the elements in its chuck using red_acc (a reduction accumulator function). Finally each worker returns there result which is then combined by reducing them all together using red_acc this time using the fetch_reduce so that we can add the first ones completed first.
fetch_reduce is a nonblocking fetch and reduce operation. I believe it has no raceconditions, though this maybe because of a implementation detail in #async and #sync. When julia 0.4 comes out, it is easy enough to put a lock in to make it obviously have no race conditions.
This code isn't really battle hardened. I don;t believe the
You also might want to look at making the chuck size tunable, so that you can seen more data to faster workers (if some have better network or faster cpus)
You need to reexpress your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100),eye(100)])[:] for _ in 1:3000] #480Mb
chunk_data(:data, data)
#time prechunked_mapreduce(:data, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
#time reduce(+,map(mean,data))
took ~0.06 seconds.

Speed dating algorithm

I work in a consulting organization and am most of the time at customer locations. Because of that I rarely meet my colleagues. To get to know each other better we are going to arrange a dinner party. There will be many small tables so people can have a chat. In order to talk to as many different people as possible during the party, everybody has to switch tables at some interval, say every hour.
How do I write a program that creates the table switching schedule? Just to give you some numbers; in this case there will be around 40 people and there can be at most 8 people at each table. But, the algorithm needs to be generic of course

heres an idea
first work from the perspective of the first person .. lets call him X
X has to meet all the other people in the room, so we should divide the remaining people into n groups ( where n = #_of_people/capacity_per_table ) and make him sit with one of these groups per iteration
Now that X has been taken care of, we will consider the next person Y
WLOG Y be a person X had to sit with in the first iteration itself.. so we already know Y's table group for that time-frame.. we should then divide the remaining people into groups such that each group sits with Y for every consecutive iteration.. and for each iteration X's group and Y's group have no person in common
.. I guess, if you keep doing something like this, you will get an optimal solution (if one exists)
Alternatively you could crowd source the problem by giving each person a card where they could write down the names of all the people they got dine with.. and at the end of event, present some kind of prize to the person with the most names in their card

This sounds like an application for genetic algorithm:
Select a random permutation of the 40 guests - this is one seating arrangement
Repeat the random permutation N time (n is how many times you are to switch seats in the night)
Combine the permutations together - this is the chromosome for one organism
Repeat for how ever many organisms you want to breed in one generation
The fitness score is the number of people each person got to see in one night (or alternatively - the inverse of the number of people they did not see)
Breed, mutate and introduce new organisms using the normal method and repeat until you get a satisfactory answer
You can add in any other factors you like into the fitness, such as male/female ratio and so on without greatly changing the underlying method.

Why not imitate real world?
class Person {
void doPeriodically() {
do {
newTable = random (numberOfTables);
} while (tableBusy(newTable))
switchTable (newTable)
}
}
Oh, and note that there is a similar algorithm for finding a mating partner and it's rumored to be effective for those 99% of people who don't spend all of their free time answering programming questions...

Perfect Table Plan

You might want to have a look at combinatorial design theory.

Intuitively I don't think you can do better than a perfect shuffle, but it's beyond my pre-coffee cognition to prove it.

This one was very funny! :D
I tried different method but the logic suggested by adi92 (card + prize) is the one that works better than any other I tried.
It works like this:
a guy arrives and examines all the tables
for each table with free seats he counts how many people he has to meet yet, then choose the one with more unknown people
if two tables have an equal number of unknown people then the guy will choose the one with more free seats, so that there is more probability to meet more new people
at each turn the order of the people taking seats is random (this avoid possible infinite loops), this is a "demo" of the working algorithm in python:
import random
class Person(object):
def __init__(self, name):
self.name = name
self.known_people = dict()
def meets(self, a_guy, propagation = True):
"self meets a_guy, and a_guy meets self"
if a_guy not in self.known_people:
self.known_people[a_guy] = 1
else:
self.known_people[a_guy] += 1
if propagation: a_guy.meets(self, False)
def points(self, table):
"Calculates how many new guys self will meet at table"
return len([p for p in table if p not in self.known_people])
def chooses(self, tables, n_seats):
"Calculate what is the best table to sit at, and return it"
points = 0
free_seats = 0
ret = random.choice([t for t in tables if len(t)<n_seats])
for table in tables:
tmp_p = self.points(table)
tmp_s = n_seats - len(table)
if tmp_s == 0: continue
if tmp_p > points or (tmp_p == points and tmp_s > free_seats):
ret = table
points = tmp_p
free_seats = tmp_s
return ret
def __str__(self):
return self.name
def __repr__(self):
return self.name
def Switcher(n_seats, people):
"""calculate how many tables and what switches you need
assuming each table has n_seats seats"""
n_people = len(people)
n_tables = n_people/n_seats
switches = []
while not all(len(g.known_people) == n_people-1 for g in people):
tables = [[] for t in xrange(n_tables)]
random.shuffle(people) # need to change "starter"
for the_guy in people:
table = the_guy.chooses(tables, n_seats)
tables.remove(table)
for guy in table:
the_guy.meets(guy)
table += [the_guy]
tables += [table]
switches += [tables]
return switches
lst_people = [Person('Hallis'),
Person('adi92'),
Person('ilya n.'),
Person('m_oLogin'),
Person('Andrea'),
Person('1800 INFORMATION'),
Person('starblue'),
Person('regularfry')]
s = Switcher(4, lst_people)
print "You need %d tables and %d turns" % (len(s[0]), len(s))
turn = 1
for tables in s:
print 'Turn #%d' % turn
turn += 1
tbl = 1
for table in tables:
print ' Table #%d - '%tbl, table
tbl += 1
print '\n'
This will output something like:
You need 2 tables and 3 turns
Turn #1
Table #1 - [1800 INFORMATION, Hallis, m_oLogin, Andrea]
Table #2 - [adi92, starblue, ilya n., regularfry]
Turn #2
Table #1 - [regularfry, starblue, Hallis, m_oLogin]
Table #2 - [adi92, 1800 INFORMATION, Andrea, ilya n.]
Turn #3
Table #1 - [m_oLogin, Hallis, adi92, ilya n.]
Table #2 - [Andrea, regularfry, starblue, 1800 INFORMATION]
Because of the random it won't always come with the minimum number of switch, especially with larger sets of people. You should then run it a couple of times and get the result with less turns (so you do not stress all the people at the party :P ), and it is an easy thing to code :P
PS:
Yes, you can save the prize money :P

You can also take look at stable matching problem. The solution to this problem involves using max-flow algorithm. http://en.wikipedia.org/wiki/Stable_marriage_problem

I wouldn't bother with genetic algorithms. Instead, I would do the following, which is a slight refinement on repeated perfect shuffles.
While (there are two people who haven't met):
Consider the graph where each node is a guest and edge (A, B) exists if A and B have NOT sat at the same table. Find all the connected components of this graph. If there are any connected components of size < tablesize, schedule those connected components at tables. Note that even this is actually an instance of a hard problem known as Bin packing, but first fit decreasing will probably be fine, which can be accomplished by sorting the connected components in order of biggest to smallest, and then putting them each of them in turn at the first table where they fit.
Perform a random permutation of the remaining elements. (In other words, seat the remaining people randomly, which at first will be everyone.)
Increment counter indicating number of rounds.
Repeat the above for a while until the number of rounds seems to converge.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio