Algorithm to find optimal groups

A device contains an array of locations, some of which contain values that we want to read periodically.
Our list of locations that we want to read periodically also specifies how often we want to read them. It is permitted to read a value more frequently than specified, but not less frequently.
A single read operation can read a contiguous sequence of locations from the array, so it is possible to return a group of multiple values from one read operation. The maximum number of contiguous locations that can be read in a single operation is M.
The goal is to group locations so as to minimize the time-averaged number of read operations. In the event that there is more than one way to do this, the tie-breaker is to minimize the time-averaged number of locations read.
(Bonus points are awarded if the algorithm to do this allows incremental changes to the list of locations - i.e. adding or removing one location to/from the list doesn't require the groupings to be recalculated from scratch!)
I'll try to clarify this with some examples where M=6.
The following diagram shows the array of locations. The numbers represent the desired read period for that location.
| 1 | 1 | | | 1 | | | | | | 5 | | 2 |
\-------------------/ \-----------/
group A group B
In this first example group A is read every second and group B every 2 seconds. Note that the location that should be read every 5 seconds is actually read every 2 seconds - which is fine.
| 1 | | | | | 1 | 1 | | 1 |
\-----------------------/\----------/
group A group B (non-optimal!)
This example shows a failure of my initial simple-minded algorithm, which was to fill up the first group until full and then start another. The following grouping is better: although the number of group reads per second is the same, the number of locations read in those groups is smaller:
| 1 | | | | | 1 | 1 | | 1 |
\---/ \---------------/
group A group B (optimal)
Finally, an example where three groups is better than two:
| 5 | | | | | 1 | 1 | | | | | 5 |
\-----------------------/\----------------------/
group A group B (non-optimal)
This solution requires two group reads per second. A better solution is as follows:
| 5 | | | | | 1 | 1 | | | | | 5 |
\---/ \-------/ \---/
group A group B group C
This requires two reads every 5 seconds (groups A and C) plus one every second (group B): 1.4 group reads per second.
Edit: (There is an even better solution to this example if you allow reads to be non-periodic. On the 1st second, read both groups of the first solution. On seconds 2, 3, 4 and 5 read group B of the second solution. Repeat. This results in 1.2 group reads per second. But I'm going to disallow this because it would make the code responsible for scheduling the reads much more complicated.)
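To make the objective concrete, the cost of a grouping can be computed like this (a sketch in Python; the (span, fastest_period) representation of a group is my own):

```python
def reads_per_second(groups):
    """Primary objective: time-averaged number of read operations.

    groups: list of (span, fastest_period) pairs, where span is the number
    of contiguous locations covered by the read (span <= M) and
    fastest_period is the smallest desired period among the values inside
    it. Each group must be read at the period of its fastest member.
    """
    return sum(1.0 / period for _, period in groups)

def locations_per_second(groups):
    """Tie-breaker: time-averaged number of locations read."""
    return sum(span / period for span, period in groups)
```

For the third example, the two-group solution `[(6, 1), (6, 1)]` costs 2.0 reads per second, while the three-group solution `[(1, 5), (2, 1), (1, 5)]` costs 1/5 + 1 + 1/5 = 1.4, matching the hand calculation.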
I looked up clustering algorithms but this isn't a clustering problem. I also found Algorithm to allocate a list of numbers to N groups under certain condition, which pointed to the 'Bin packing' problem, but I don't think this is it either.
By the way, sorry for the vague title. I can't think of a concise description, or even relevant search keywords!
New examples added 28 September 2010:
This is like the previous example, but all items updating at the same rate. Now two groups is better than three:
| 1 | | | | | 1 | 1 | | | | | 1 |
\-----------------------/\----------------------/
group A group B (optimal)
I've started trying to see how iterative improvements might be implemented. Suppose a grouping algorithm came up with:
| 1 | | | | | 1 | 1 | | | | | 1 | 1 | | | | | 1 |
\---/ \-------/ \-------/ \---/
group A group B group C group D (non-optimal)
\-----------------------/\----------------------/\----------------------/
group A group B group C (optimal)
This can be improved to three adjacent groups each of 6. Rex suggested (comment below) that I could try combining triplets into pairs. But in this case I would have to combine a quartet into a triplet, because there is no legal intermediate arrangement in which A+B+C (or B+C+D) can be rearranged into a pair leaving D as it is.
I originally thought that this was an indication that in the general case there is no guarantee that a new valid solution can be created from an existing valid solution by making a local modification. This would have meant that algorithms such as simulated annealing, genetic algorithms, etc, could not be used to try to refine a suboptimal solution.
But Rex pointed out (comment below) that you can always split an existing group into two. Despite the fact that this always increases the cost function, all that means is that the solution needs to get out of its local minimum in order to reach the global minimum.

This problem has the same property of instability on addition of new items that similar NP-complete problems do, so I assume it is one also. Since I suspect that you want something that works reasonably well instead of a proof of why it's hard, I'll focus on an algorithm to give an approximate solution.
I would solve this problem by converting this into a graph where bins were valued at 1/N if they had to be read N times per second, and blur the graph with a width of M (e.g. 6), peaked at the original. (For 6, I might use weighting (1/6 1/5 1/4 1/3 1/2 1 1/2 1/3 1/4 1/5 1/6).) Then throw bins at all the local maxima (sort pairs by distance apart and cover close pairs of maxima first if you can). Now you'll have most of your most important values covered. Then catch any missing groups by extending the existing reads, or by adding new reads if necessary. Depending on the structure you may want to add some refinement by shifting locations between reads, but if you're lucky that won't even be necessary.
Since this is essentially a local algorithm, if you keep track of the blurred graph, you can fairly easily add new items and re-do the peak-covering locally (and the refinement locally).
Just to see how this would work on your data, the two-group case would look like (multiplying by 60 so I don't have to keep track of fractional weights)
60 30 20 15 12 10 00 00 00 <- contribution from left-most location
10 12 15 20 30 60 30 20 15 <- second
00 10 12 15 20 30 60 30 20 <- third
00 00 00 10 12 15 20 30 60 <- rightmost
--------------------------
70 52 47 60 74 B5 B0 80 95 (using "B" to represent 11)
^^ ^^ ^^ Local maxima
------------- -------
dist=6 dist=4
|===========| <- Hit closely-spaced peaks first
|==| <- Then remaining
So we're done, and the solution is optimal.
For the three group example, weighting "5" as "1/5" and multiplying everything by 300 so again there are no fractions,
060 030 020 015 012 010 000 000 000 000 000 000 <- from 5 on left
050 060 075 100 150 300 150 100 075 060 050 000 <- 1 on left
000 050 060 075 100 150 300 150 100 075 060 050 <- on right
000 000 000 000 000 000 010 012 015 020 030 060 <- 5 on right
-----------------------------------------------
110 140 155 190 262 460 460 262 190 155 140 110
|=======| <- only one peak, grab it
=== === <- missed some, so pick them back up
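The blurring step and the peak detection can be sketched as follows (my own rendering of the idea; the weight at distance d is 1/(d+1), dropping to zero at distance M or more):

```python
def blurred_graph(periods, length, m=6):
    """Build the blurred importance graph.

    periods maps location index -> desired read period; each location
    contributes 1/period at its own position, weighted by 1/(distance+1)
    at nearby positions, out to distance m-1.
    """
    graph = [0.0] * length
    for pos, period in periods.items():
        for d in range(-(m - 1), m):
            if 0 <= pos + d < length:
                graph[pos + d] += (1.0 / period) / (abs(d) + 1)
    return graph

def local_maxima(graph):
    """Indices of local maxima (flat ties resolved to the rightmost)."""
    n = len(graph)
    return [i for i in range(n)
            if (i == 0 or graph[i] >= graph[i - 1])
            and (i == n - 1 or graph[i] > graph[i + 1])]
```

On the two-group example (period-1 locations at positions 0, 5, 6 and 8, array length 9), scaling the graph by 60 reproduces the table above, and local_maxima finds the peaks at positions 0, 5 and 8.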


Knapsack or similar with no values and with limits as to which items can be assigned where?

Say I have a number of weights which I need to spread out across a finite number of knapsacks so that each knapsack has as even a distribution of weight as possible. The catch is that each weight can only be put into the first k bags, where k varies per weight.
For example, a weight might only be able to be inserted into bags up to bag 4, i.e. bags 1 through 4. Another might have a limit of 5. The goal, as previously stated, is to attempt an even spread across all bags, with the number of bags set by the weight with the highest limit.
Is there a name for this problem, and what algorithms exist?
EDIT: To help visualise, say I have 4 weights:
+----------+--------+-----------+
| Weight # | Weight | Bag Limit |
+----------+--------+-----------+
| 1 | 2 | 2 |
| 2 | 3 | 3 |
| 3 | 1 | 1 |
| 4 | 2 | 4 |
+----------+--------+-----------+
A solution to the problem might look like this
| 1 | | | | | | |
| 2 | | 3 | | 2 | | |
|___| |___| |___| |___|
Bag 1 Bag 2 Bag 3 Bag 4
Weights 3 and 1 were placed into Bag 1
Weight 2 was placed into Bag 2
Weight 4 was placed into Bag 3
Here, the load is spread as evenly as possible, and the problem is solved (although perhaps not optimally, as I did this in my head)
Hopefully this might clear up what I'm trying to solve.
I'd describe this problem as bin packing with side constraints -- a lot of NP-hard problems don't have good names because there are so many of them. I would expect the LP-based methods for packing variable-sized bins, which decompose the problem into (1) a packing problem over whole bins and (2) a knapsack problem within a bin to generate candidate bins, to carry over reasonably well.
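Absent a named algorithm, a reasonable first cut is a greedy heuristic: handle the most constrained weights first and always drop into the least-loaded bag that is still allowed. This is a sketch, not an exact method (greedy balancing can be suboptimal):

```python
def spread(weights):
    """weights: list of (weight, bag_limit) pairs; a weight with limit k
    may only go into bags 1..k. Returns the per-bag contents."""
    n_bags = max(limit for _, limit in weights)
    bags = [[] for _ in range(n_bags)]
    loads = [0] * n_bags
    # Most constrained first: smallest bag limit, then heaviest weight.
    for w, limit in sorted(weights, key=lambda x: (x[1], -x[0])):
        i = min(range(limit), key=loads.__getitem__)  # least-loaded legal bag
        bags[i].append(w)
        loads[i] += w
    return bags
```

On the four-weight example above this yields bag loads of 1, 2, 3 and 2, no worse than the hand solution.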

Pseudocode: Recursively process start/stop times in list

Here's a rather nebulous question.
I have a list of start/stop times from script executions, which may include nested script calls.
| script | start | stop | duration | time executing |
| ------ | ----- | ---- | -------------- | ----------------------------------- |
| A | 1 | 8 | 7 i.e. (8-1) | 3 i.e. ((8-1) - (6-2)) |
| ->B | 2 | 6 | 4 i.e. (6-2) | 3 i.e. ((6-2) - (5-4)) |
| ->->C | 4 | 5 | 1 i.e. (5-4) | 1 |
| D | 9 | 10 | 1 i.e. (10-9) | 1 |
| E | 11 | 12 | 1 i.e. (12-11) | 1 |
| F | 9 | 16 | 7 i.e. (16-9) | 5 i.e. ((16-9) - (14-13) - (16-15)) |
| ->G | 13 | 14 | 1 i.e. (14-13) | 1 i.e. (14-13) |
| ->H | 15 | 16 | 1 i.e. (16-15) | 1 i.e. (16-15) |
Duration is the total time spent in a script.
Time executing is the time spent in the script itself, but not in its subscripts.
So A calls B and B calls C. C takes 1 tick, B takes 4 but time executing is just 3, and A takes 7 ticks, but time executing is 3.
F calls G and then H, so takes 7 ticks but time executing is only 5.
What I'm trying to wrap my ('flu-ridden) head around is a pseudo-code algorithm for stepping or recursing through the list of times in order to generate the "time executing" value for each row.
Any help for this problem (or cure for common cold) gratefully received. :-)
If all time points are distinct, then script execution timespans are related to each other by an ordered tree: Given any pair of script execution timespans, either one strictly contains the other, or they don't overlap at all. This enables an easy recovery of parent-child relationships, if you wanted to do that.
But if you just care about execution times, we don't even need that! :) There's a pretty simple algorithm that just sorts the starting and ending times and walks through the resulting array of "events", maintaining a stack of open "frames":
Create an array of (time, scriptID) pairs, and insert the start time and end time of each script into it (i.e., insert two pairs per script into the same array).
Sort the array by time.
Create a stack of integer triples, and push a single (0, 0, 0) entry on it. (This is just a dummy entry to simplify later code.) Also create an array seen[] with a boolean flag per script ID, all initially set to false.
Iterate through the sorted array of (time, scriptID) pairs:
Whenever you see a (time, scriptID) pair for a script ID that you have not seen before, that script is starting.
Set seen[scriptID] = true.
Push the triple (time, scriptID, 0) onto the stack. The final component, initially 0, will be used to accumulate the total duration spent in this script's "descendant" scripts.
Whenever you see a time for a script ID that you have seen before (because seen[scriptID] == true), that script is ending.
Pop the top (time, scriptID, descendantDuration) triple from the stack (note that the scriptID in this triple should match the scriptID in the pair at the current index of the array; if not, then somehow you have "intersecting" script timespans that could not correspond to any sequence of nested script runs).
The duration for this script ID is (as you already knew) time - startTime[scriptID].
Its execution time is duration - descendantDuration.
Record the time spent in this script and its descendants by adding its duration to the new top-of-stack's descendantDuration (i.e., third) field.
That's all! For n script executions this will take O(n log n) time, because the sorting step takes that long (iterating over the array and performing the stack operations take just O(n)). Space usage is O(n).
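The steps above translate almost directly into Python (a sketch assuming all time points are distinct, as stated):

```python
def executing_times(spans):
    """spans: dict scriptID -> (start, stop), with all time points distinct.
    Returns dict scriptID -> time executing (duration minus descendants')."""
    events = sorted((t, sid) for sid, ts in spans.items() for t in ts)
    stack = [[0, None, 0]]            # dummy (startTime, scriptID, descendantDuration)
    seen, result = set(), {}
    for time, sid in events:
        if sid not in seen:           # first occurrence: the script is starting
            seen.add(sid)
            stack.append([time, sid, 0])
        else:                         # second occurrence: the script is ending
            start, top_sid, desc = stack.pop()
            assert top_sid == sid, "timespans intersect without nesting"
            duration = time - start
            result[sid] = duration - desc
            stack[-1][2] += duration  # charge the whole duration to the parent
    return result
```

For the A/B/C rows of the table, `executing_times({'A': (1, 8), 'B': (2, 6), 'C': (4, 5)})` returns `{'A': 3, 'B': 3, 'C': 1}`.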

What algorithm to use to format an EDIFACT file?

I work with EDIFACT messages and have developed lots of tools to help me parse and extract the relevant information out of the raw file format.
Something I have always struggled with is presenting the raw EDIFACT. I typically just copy the message into Microsoft Word, do a find and replace for the segment separator and view the contents line by line.
I have always wanted to display the EDIFACT file in its hierarchy format but can not for the life of me work out a method to do this.
Below is a small extract of a raw EDIFACT message.
The left side shows how I get the data (not including line numbers), the right side shows how I want it to be displayed based on a customers specification.
01. UNA -UNA
02. UNB -UNB
03. UNH -UNH
04. BGM -BGM
05. DTM - | DTM
06. DTM - | DTM
07. DTM - | DTM
08. NAD - | NAD
09. NAD - | NAD
10. NAD - | NAD
11. GIS - | GIS
12. LIN - | | LIN
13. LOC - | | | LOC
14. LOC - | | | LOC
15. LOC - | | | LOC
16. RFF - | | | RFF
17. QTY - | | | QTY
18. QTY - | | | QTY
19. RFF - | | | | RFF
20. DTM - | | | | | DTM
21. SCC - | | | SCC
22. QTY - | | | | QTY
23. DTM - | | | | | DTM
24. DTM - | | | | | DTM
25. SCC - | | | SCC
26. QTY - | | | | QTY
27. DTM - | | | | | DTM
28. DTM - | | | | | DTM
29. SCC - | | | SCC
30. QTY - | | | | QTY
31. DTM - | | | | | DTM
32. QTY - | | | | QTY
33. DTM - | | | | | DTM
34. SCC - | | | SCC
35. QTY - | | | | QTY
36. DTM - | | | | | DTM
37. NAD - | | | NAD
38. CTA - | | | | CTA
39. COM - | | | | | COM
40. SCC - | | | | SCC
41. QTY - | | | | | QTY
42. UNT -UNT
43. UNZ -UNZ
You can see that the data is tree based, and it is described by a specification that is sent to me. One specification for the above EDIFACT message is as follows:
Tag St Max Lvl
0000 1 UNA C 1 0 SERVICE STRING ADVICE
0000 2 UNB M 1 0 INTERCHANGE HEADER
0010 3 UNH M 1 0 MESSAGE HEADER
0020 4 BGM M 1 0 BEGINNING OF MESSAGE
0030 5 DTM M 10 1 DATE/TIME/PERIOD
0040 6 FTX C 5 1 FREE TEXT
0080 SG2 C 99 1 NAD
0090 7 NAD M 1 1 NAME AND ADDRESS
0190 SG6 C 9999 1 GIS-SG7-SG12
0200 8 GIS M 1 1 GENERAL INDICATOR
0210 SG7 C 1 2 NAD
0220 9 NAD M 1 2 NAME AND ADDRESS
0370 SG12 C 9999 2 LIN-LOC-FTX-SG13-SG15-SG17-SG22
0380 10 LIN M 1 2 LINE ITEM
0450 11 LOC C 999 3 PLACE/LOCATION IDENTIFICATION
0470 12 FTX C 5 3 FREE TEXT
0480 SG13 C 10 3 RFF
0490 13 RFF M 1 3 REFERENCE
0540 SG15 C 10 3 QTY-SG16
0550 14 QTY M 1 3 QUANTITY
0570 SG16 C 10 4 RFF-DTM
0580 15 RFF M 1 4 REFERENCE
0590 16 DTM C 1 5 DATE/TIME/PERIOD
0600 SG17 C 999 3 SCC-SG18
0610 17 SCC M 1 3 SCHEDULING CONDITIONS
0620 SG18 C 999 4 QTY-DTM
0630 18 QTY M 1 4 QUANTITY
0640 19 DTM C 2 5 DATE/TIME/PERIOD
0760 SG22 C 999 3 NAD-SG24-SG27
0770 20 NAD M 1 3 NAME AND ADDRESS
0830 SG24 C 5 4 CTA-COM
0840 21 CTA M 1 4 CONTACT INFORMATION
0850 22 COM C 5 5 COMMUNICATION CONTACT
0920 SG27 M 999 4 SCC-SG28
0940 SG28 M 999 5 QTY
0950 24 QTY M 1 5 QUANTITY
1030 25 UNT M 1 0 MESSAGE TRAILER
0000 26 UNZ M 1 0 INTERCHANGE TRAILER
The important columns are Tag, St (M = Mandatory, C = Conditional), Max (the maximum number of times it can repeat) and Lvl (how deep in the tree it is).
Tags that start with SG denote a loop (a repeating group).
The problem I face is that the format is very flexible, where it can have conditional segments, conditional loops, repeated segments. Trying to think of a method that can handle all this has been my issue.
Starting from the top of the above specification, you can immediately see that when you come to the DTM tag, it can be repeated up to a maximum of 10 times. In the sample EDIFACT message it only appears 3 times, on lines 5, 6 and 7. Following the specification, FTX may appear but does not in my sample message; then there is an SG2 tag, which means the following NAD tag may repeat up to 99 times.
Moving slightly ahead, inside the LIN tag (which is under the SG12 group, which can repeat up to 9999 times and in many cases does repeat a number of times), it comes to the first QTY tag.
According to the specification, this segment can have a conditional group (SG16: an RFF and a DTM) under it. In my sample you can see QTY segments on lines 17 and 18, but only the one on line 18 has this conditional group (lines 19 and 20) under it.
Similar things start happening when you look into the SCC segments too.
What I have in mind is to enter that specification into some sort of file format, then run the raw EDIFACT message against the rules of this specification, so that the output is hierarchy based and it's easy to see at a glance what data relates to which segment, with a way to check whether the EDIFACT message is valid.
What I have trouble with, is the actual algorithm or process to do that conversion.
I have tried naive approaches, like going line by line but then it gets messy when I am trying to work out if the current line is in a group, or a repeat or something else.
I have tried a recursive approach: splitting the entire EDIFACT by the largest group (the SG12 LIN group), then recursively processing each of those splits and building the output. This has been my best approach yet, but it's still far from working, with many false readings due to my logic not being right.
I basically need to be able to pick a segment of the message, and determine where in the hierarchy it should be and display it.
I am at a loss on how I can solve this. I am sure there is a nice simple method of doing this, but I just cannot work it out.
Any assistance would be most appreciated.
Slight update.
I have converted the specification into an XML file following the hierarchy of said specification. This XML file now contains all the groups and various attributes related to each tag. Now I have a start on what the EDIFACT needs to conform to.
If I go through it on paper (and in my head) I can build the output I'm after with a bit of forward thinking, so my new idea is to "scan ahead" in the EDIFACT file and build a probability-based result.
A bit like how a chess AI looks ahead a few moves.
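For illustration, once the specification is a tree, the indented layout can be produced by a small recursive-descent walk over the segment stream, treating each SG group like a grammar rule that repeats while its first (trigger) segment is next. This is a hypothetical sketch with a toy spec of my own; a real implementation also needs mandatory/conditional validation:

```python
def trigger(node):
    """The first concrete tag a node can start with."""
    return node[1] if node[0] == 'seg' else trigger(node[2][0])

def render(spec, segs, pos=0, depth=0, out=None):
    """Match segs against spec, emitting one indented line per segment.

    spec is a list of nodes: ('seg', tag, max_repeat) for a segment,
    ('grp', max_repeat, children) for an SG group. Returns the position
    of the first unmatched segment; out collects the display lines.
    """
    if out is None:
        out = []
    for node in spec:
        if node[0] == 'seg':
            _, tag, max_rep = node
            n = 0
            while n < max_rep and pos < len(segs) and segs[pos] == tag:
                out.append('| ' * depth + tag)
                pos, n = pos + 1, n + 1
        else:                 # a group repeats while its trigger segment is next
            _, max_rep, children = node
            n = 0
            while n < max_rep and pos < len(segs) and segs[pos] == trigger(children[0]):
                pos = render(children, segs, pos, depth + 1, out)
                n += 1
    return pos
```

Running it on a toy spec such as `[('seg', 'UNH', 1), ('seg', 'BGM', 1), ('grp', 99, [('seg', 'NAD', 1)]), ('grp', 9999, [('seg', 'LIN', 1), ('seg', 'LOC', 999)])]` against the segments `UNH BGM NAD NAD LIN LOC LOC` reproduces the pipe-indented view, with the NAD and LIN/LOC lines one level deep.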
Most of the things you want I can help you with (and have done them). But this is not easily done on a small piece of paper with no interaction.
So if you want more information, just contact me. (No, this is not a commercial thing.)

Turned-on bits counter

Suppose I have a black box with 3 inputs (each input is 1 bit) and 2 bits output.
The black box counts the number of turned-on input bits.
Using only such black boxes, one needs to implement a counter of the turned-on bits in a 7-bit input. The implementation should use the minimum possible number of black boxes.
//This is a job interview question
You're making a binary adder. Try this...
Two black boxes for input with one input remaining:
7 6 5 4 3 2 1
| | | | | | |
------- ------- |
| | | | |
| H L | | H L | |
------- ------- |
| | | | |
Take the two low outputs and the remaining input (1) and feed them to another black box:
L L 1
| | |
-------
| |
| C L |
-------
| |
The low output from this black box will be the low bit of the result. The high output is the carry bit. Feed this carry bit along with the high bits from the first two black boxes into the fourth black box:
H H C L
| | | |
------- |
| | |
| H M | |
------- |
| | |
The result should be the number of "on" bits in the input expressed in binary by the High, Middle and Low bits.
Suppose that each BB outputs a 2-bit binary count 00, 01, 10, or 11, when 0, 1, 2, or 3 of its inputs are on. Also suppose that the desired ultimate output O₄O₂O₁ is a 3-bit binary count 000 ... 111, when 0, 1, ... 7 of the 7 input bits i₁...i₇ are on. For problems like this in general, you can write a boolean expression for what the BB does and a boolean expression for the desired output and then synthesize the output. In this particular case, however, try the obvious approach of putting i₁, i₂, i₃ into a first box B₁, and i₄, i₅, i₆ into a second box B₂, and i₇ into one input of a third box B₃. Looking at this it's clear that if you run the units outputs from B₁ and B₂ into the other two inputs of B₃ then the units output from B₃ is equal to the desired value O₁. You can get the sum of the twos outputs from B₁, B₂, B₃ via a box B₄, and this sum is equal to the desired values O₄O₂.
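Both answers describe the same circuit; here is a quick simulation (my own naming) that checks it exhaustively over all 128 possible inputs:

```python
def bb(a, b, c):
    """The black box: 2-bit count of set bits among three 1-bit inputs."""
    s = a + b + c
    return s >> 1, s & 1              # (high, low)

def popcount7(bits):
    """Count the set bits of 7 inputs using four black boxes."""
    b1h, b1l = bb(bits[0], bits[1], bits[2])
    b2h, b2l = bb(bits[3], bits[4], bits[5])
    b3h, o1 = bb(b1l, b2l, bits[6])   # units outputs plus the leftover input
    o4, o2 = bb(b1h, b2h, b3h)        # sum of the twos outputs
    return 4 * o4 + 2 * o2 + o1

# Exhaustive check over all 7-bit inputs:
assert all(popcount7([(n >> i) & 1 for i in range(7)]) == bin(n).count('1')
           for n in range(128))
```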

Ways to speed up my bash script?

Yes, I know it's a lot faster than doing things by hand. But is there any way to speed this script up? Multi-threading or something? I'm new to Unix and this is my first script =). Open to suggestions or any changes. The script seems to pause a lot, randomly, on certain generated domains.
#!/bin/bash
for domain in $(pwgen -1A0B 2 10000); do
    if whois "$domain.com" | grep -Eq '^No match|^NOT FOUND|^Not fo|AVAILABLE|^No Data Fou|has not been regi|No entri'; then
        echo "$domain.com : available"
    else
        echo "$domain.com"
    fi
done
Before splitting and distributing the work, a warning: this does not seem useful. You are asking pwgen to build 10,000 lines of two characters between a and z, but there are only echo $((26*26)) -> 676 possibilities (in fact, as pwgen tries to build pronounceable words, only 625 possibilities).
pwgen -1A0B 2 10000 | sort | uniq -c | sort -n | tail
27 ju
27 mu
27 vs
27 xt
27 zx
28 df
28 sy
28 zc
29 dp
29 zd
So with this command, you will do the same lookup up to 29 times.
Running pwgen -1A0B 2 10000 ten times, to print how many different combinations are proposed and which combinations were proposed the most and the fewest times:
for ((i=10;i--;)); do
echo $(
(
(
pwgen -1A0B 2 10000 |
sort |
uniq -c |
sort -n |
tee /dev/fd/6 |
wc -l >/dev/fd/7
) 6>&1 | (
head -n1
tail -n1
)
) 7>&1
)
done
6 bd 625 31 bn
3 bj 625 29 sq
6 je 625 30 ey
4 ac 625 30 sz
5 ds 625 29 wf
4 xw 625 28 qb
4 jj 625 30 pa
6 io 625 29 sg
4 vw 625 30 kb
5 fz 625 31 os
This prints:
| | | | |
| | | | \- max proposed pattern
| | | \---- number of times max proposed pattern was issued
| | \-------- number of different combinations proposed
| \----------- min proposed pattern
\-------------- number of times min proposed pattern was issued
Create a file with desired domain names first. Call this domains.lst:
pwgen -1A0B 2 10000 > domains.lst
Then create smaller files out of this:
split --lines=100 domains.lst domains.lst.
Then create a script which takes a file name, processes that file using whois, and writes its results to a corresponding output file (e.g. input.out).
Create another script that uses & to start the above script in the background for all small chunks. Merge the outputs after all background tasks finish.
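If you would rather move off the shell entirely, the same split-and-parallelize idea is straightforward in Python with a thread pool (a sketch; the availability patterns mirror the script above, and the lookup parameter exists only so the network call can be stubbed out for testing):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

PATTERNS = ('No match', 'NOT FOUND', 'Not fo', 'AVAILABLE',
            'No Data Fou', 'has not been regi', 'No entri')

def whois_lookup(domain):
    """Shell out to whois and return its stdout."""
    return subprocess.run(['whois', domain],
                          capture_output=True, text=True).stdout

def check(domain, lookup=whois_lookup):
    """One result line per domain (note: unlike grep's ^, not anchored)."""
    text = lookup(domain)
    return f'{domain} : available' if any(p in text for p in PATTERNS) else domain

def check_all(domains, workers=20, lookup=whois_lookup):
    # whois is I/O-bound, so threads overlap the network waits nicely
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: check(d, lookup), domains))
```

Deduplicating the domain list first (as noted above, there are only 625 distinct outputs) saves far more time than any parallelism.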
