Generating random words in MIPS - smips

I want to generate random words in MIPS. I know how to generate random numbers, I just want a random word from a word bank. I have tried this but I have no idea how to print them.
.data
### WORD BANK ###
a0: .asciiz "computer"
b1: .asciiz "processor"
c2: .asciiz "motherboard"
d3: .asciiz "graphics"
e4: .asciiz "network"
f5: .asciiz "ethernet"
g6: .asciiz "memory"
h7: .asciiz "microsoft"
i8: .asciiz "linux"
j9: .asciiz "transistor"
k10: .asciiz "antidisestablishmentarianism"
l11: .asciiz "protocol"
m12: .asciiz "instruction"
word: .word a0,b1,c2,d3,e4,f5,g6,h7,i8,j9,k10,l11,m12
.text
la $t0, word    # $t0 = base address of the table of word addresses
How may I choose a random word from a given list?

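If you have a randomly generated number n anywhere in the range of int, turn it into an index by taking the remainder of n divided by the word bank size (in this case, 13). If your RNG accepts an upper bound, set the bound to the word bank size instead. Then use the index to load the string from memory: multiply it by 4 (each entry in the word table is a 4-byte address), add it to the table's base address, load the string's address with lw, and print it with the print-string syscall (syscall 4 in SPIM/MARS).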
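The remainder step can be sketched in Java (not MIPS), as a minimal illustration only. The class name `WordPicker` and the method `pick` are hypothetical. `Math.floorMod` is used so that a negative n still maps to a valid index; in MIPS the equivalent is a division followed by reading the remainder from the HI register.

```java
public class WordPicker {
    // The same 13 words as the MIPS word bank above.
    static final String[] WORD_BANK = {
        "computer", "processor", "motherboard", "graphics", "network",
        "ethernet", "memory", "microsoft", "linux", "transistor",
        "antidisestablishmentarianism", "protocol", "instruction"
    };

    // Maps any int, including negative values, to a word in the bank.
    static String pick(int n) {
        int index = Math.floorMod(n, WORD_BANK.length); // always 0..12
        return WORD_BANK[index];
    }

    public static void main(String[] args) {
        System.out.println(pick(13)); // computer
        System.out.println(pick(-1)); // instruction
    }
}
```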

You could put the words in an array, generate a random index inside a loop, and print the array element at that index.
This is Java code rather than MIPS, but I hope it helps.
import java.util.Random;
// main class
public class Test1
{
    public static void main(String[] args)
    {
        // array of names
        String[] wordBank = {"luca", "serena", "giuseppe", "nicole", "eleonora", "elena", "matteo"};
        // create the generator once, outside the loop
        Random dice = new Random();
        // print nine randomly chosen names
        for (int i = 1; i < 10; ++i)
        {
            // bound by the array length so that every word can be chosen
            int dice2 = dice.nextInt(wordBank.length);
            System.out.println(wordBank[dice2]);
        }
    }
}

Related

Ignoring non integers in an unknown length of data input

I am new to the C language and would like help understanding my mistake.
I want to write a program that counts the two-digit numbers in a row of integers and chars; for example, in " 21c sdhhj 32 fhddhf234 45" there are 3 two-digit numbers. I set termination conditions for my loop (a failed scanf %d or EOF) and still get an infinite loop. I understand that a failed scanf of an integer should return 0, or -1 at EOF, so why do I get an infinite loop? Thank you in advance! :)
void read(int blue[],int red[],int couple[])
{
int vote=0,rcount=0,bcount=0;
int ok=-2;
while (ok!=EOF)
{
ok=scanf("%d",&vote);
if (ok==0)
continue;
if (vote<TOTAL&&vote>0)
{
rcount=vote%10;
bcount=vote/10;
if (rcount==bcount)
continue;
couple[vote]++;
red[rcount]++;
blue[bcount]++;
}
ok=0;
}
}
I want to scan and store them as long as they are smaller than TOTAL (99) until the input is over.

Differentiation between integer and character

I have just started learning C++ and have come across its various data types. I also learnt how the computer stores values when the data type is specified. One doubt that occurred to me while learning about the char data type was how the computer differentiates between integers and characters.
I learnt that the character data type uses 8 bits to store a character and that the computer can store a character in its memory location by following ASCII encoding rules. However, I didn't understand how the computer knows whether the byte 01000001 represents the letter 'A' or the integer 65. Is there any special bit assigned for this purpose?
When we do
int a = 65
or
char ch = 'A'
and then check the memory address, we will see the value 01000001 in both cases, as expected.
At the application layer we choose whether to treat it as a character or as an integer:
printf("%d", ch)
will print 65
Characters are represented as integers inside the computer. Hence the data type "char" is simply a small integer type whose values are a subset of those of "int".
Refer to the following page, which should clear up any remaining ambiguity:
Data Types Detail
The computer itself does not remember or set any bits to distinguish chars from ints. Instead it's the compiler which maintains that information and generates proper machine code which operates on data appropriately.
You can even override and 'mislead' the compiler if you want. For example, you can cast a char pointer to a void pointer and then to an int pointer, and then try to read the location referred to as an int. In C++ a reinterpret_cast does the same in one step. If there were an actual bit used, such operations would not be possible.
Adding more details in response to comment:
Hi, really what you should ask is: who will retrieve the values? Imagine that you write the contents of memory to a file and send it over the Internet. If the receiver "knows" that it's receiving chars, then there is no need to encode the identity of chars. But if the receiver could receive either chars or ints, then it would need identifying bits. In the same way, when you compile a program, the compiler knows what's stored where, so there is no need to 'figure out' anything at run time. How a value is encoded as bits is decided by a standard for each kind of data, for example ASCII or Unicode for characters and IEEE 754 for floating-point numbers.
You have asked a simple yet profound question. :-)
Answers and an example or two are below.
(see edit2, at bottom, for a longer example that tries to illustrate what happens when you interpret a single memory location's bit patterns in different ways).
The "profound" aspect of it lies in the astounding variety of character encodings that exist. There are many - I wager more than you believe there could possibly be. :-)
This is a worthwhile read: http://www.joelonsoftware.com/articles/Unicode.html
full title: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
As for your first question: "how did the computer differentiate between integers and characters":
The computer doesn't (for better or worse).
The meaning of bit patterns is interpreted by whatever reads them.
Consider this example bit pattern (8 bits, or one byte):
01000001b = 41x = 65d (binary, hex & decimal respectively).
If that bit pattern is based on ASCII it will represent an uppercase A.
If that bit pattern is EBCDIC it will represent a "non-breaking space" character (at least according to the EBCDIC chart on Wikipedia; most of the others I looked at don't say what 65d means in EBCDIC).
(Just for trivia's sake, in EBCDIC, 'A' would be represented with a different bit pattern entirely: C1x or 193d.)
If you read that bit pattern as an integer (perhaps a short), it may indicate you have 65 dollars in a bank account (or euros, or something else; just as with the character set, your bit pattern won't have anything in it to tell you what currency it is).
If that bit pattern is part of a 24-bit pixel encoding for your display (3 bytes for RGB), perhaps 'blue' in RGB encoding, it may indicate your pixel is roughly 25% blue (e.g. 65/255 is about 25.5%); 0% would be no blue at all, 100% would be as blue as possible.
So, yeah, there are lots of variations on how bits can be interpreted. It is up to your program to keep track of that.
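The interpretations listed above can be sketched in a few lines of Java. This is only an illustration: the class `BitPatternViews` and its method names are hypothetical, and the "blue channel" reading assumes the byte is one channel of a 24-bit RGB pixel as described above.

```java
public class BitPatternViews {
    // Reads the bit pattern as an unsigned number (0..255).
    static int asInteger(byte pattern) {
        return pattern & 0xff;
    }

    // Reads the same bit pattern as an ASCII character.
    static char asCharacter(byte pattern) {
        return (char) (pattern & 0xff);
    }

    // Reads the same bit pattern as the intensity of one color channel, in percent.
    static double asBluePercent(byte pattern) {
        return 100.0 * (pattern & 0xff) / 255;
    }

    public static void main(String[] args) {
        byte pattern = 0x41; // the bits 01000001
        System.out.println(asInteger(pattern));     // 65
        System.out.println(asCharacter(pattern));   // A
        System.out.println(asBluePercent(pattern)); // about 25.49
    }
}
```

The byte never changes; only the code that reads it decides what it means.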
edit: it is common to add metadata to track that, so if you are dealing with currencies you may have one byte for currency type and other bytes for the quantity of a given currency. Currency type would have to be encoded as well; there are different ways to do that, which is something a C++ enum attempts to solve in a space-efficient way: http://www.cprogramming.com/tutorial/enum.html
As for 8 bits (one byte) per character, that is a fair assumption when you're starting out. But it isn't always true. Lots of languages will use 2+ bytes for each character when you get into Unicode.
However... ASCII is very common and it fits into a single byte (8 bits).
If you are handling simple English text (A-Z, 0-9 and so on), that may be enough for you.
Spend some time browsing here and look at ASCII, EBCDIC and others:
http://www.lookuptables.com/
If you're running on Linux or something similar, hexdump can be your friend.
Try the following
$ hexdump -C myfile.dat
Whatever operating system you're using, you will want to find a hexdump utility you can use to see what is really in your data files.
You mentioned C++; I think it would be an interesting exercise to write a byte-dumper utility for any "thing": just a short program that takes a void* pointer and the number of bytes it points to, and then prints out that many bytes worth of values.
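A rough Java analogue of such a byte dumper might look like the sketch below. Java has no void* pointer, so this version assumes the data has already been placed in a byte array; the class name `ByteDumper` and the method `dump` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ByteDumper {
    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so bytes above 0x7f are not printed as negative values.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] text = "AB!".getBytes(StandardCharsets.US_ASCII);
        System.out.println(dump(text)); // 41 42 21
    }
}
```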
Good luck with your studies! :-)
Edit 2: I added a small research program... I don't know how to illustrate the idea more concisely (seems easier in C than C++).
Anyway...
In this example program, I have two character pointers that are referencing memory used by an integer.
The actual code (see 'example program', way below) is messier with casting, but this illustrates the basic idea:
unsigned short a; // reserve 2 bytes of memory to store our 'unsigned short' integer.
char *c1 = &a; // point to first byte at a's memory location.
char *c2 = c1 + 1; // point to next byte at a's memory location.
Note how 'c1' and 'c2' both share the memory that is also used by 'a'.
Walking through the output...
The sizeof lines tell you how many bytes each type uses.
The ===== Message Here ===== lines are like a comment printed out by the dump() function.
The important thing about the dump() function is that it is using the bit patterns in the memory location for 'a'.
dump() doesn't change those bit patterns, it just retrieves them and displays them via cout.
In the first run, before calling dump I assign the following bit pattern to a:
a = (0x41<<8) + 0x42;
This left-shifts 0x41 8 bits and adds 0x42 to it.
The resulting bit pattern is 0x4142 (which is 16706 decimal, or 01000001 01000010 binary).
One of the bytes will be 0x41, the other will hold 0x42.
Next it calls the dump() method:
dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
Note that in this run, on my VirtualBox Ubuntu machine, the address of a was 0x6021b8,
which nicely matches the addresses held by c1 and c2 (c2 being one byte further along).
Then I modify the bit pattern in 'a'...
a += 1; dump(); // why did this find a 'C' instead of 'B'?
a += 5; dump(); // why did this find an 'H' instead of 'C' ?
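The answer to both questions is byte order. On a little-endian machine (x86 is one) the low-order byte of a is stored first, so c1 sees the byte that arithmetic on a changes. The Java sketch below reproduces this with `java.nio.ByteBuffer`, which lets the byte order be chosen explicitly; the class `ByteOrderDemo` and the method `bytesOf` are hypothetical names, and this is an illustration, not a translation of the C++ program.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    // Returns the two bytes of value in the order they would sit in memory.
    static byte[] bytesOf(short value, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort(value).array();
    }

    // Shows the two bytes as characters, first byte in memory first.
    static String asChars(short value, ByteOrder order) {
        byte[] bytes = bytesOf(value, order);
        return (char) bytes[0] + " " + (char) bytes[1];
    }

    public static void main(String[] args) {
        short a = (short) ((0x41 << 8) + 0x42);                    // 0x4142
        System.out.println(asChars(a, ByteOrder.LITTLE_ENDIAN));   // B A
        a += 1;                                                    // 0x4143
        System.out.println(asChars(a, ByteOrder.LITTLE_ENDIAN));   // C A
        a += 5;                                                    // 0x4148
        System.out.println(asChars(a, ByteOrder.LITTLE_ENDIAN));   // H A
        System.out.println(asChars(a, ByteOrder.BIG_ENDIAN));      // A H
    }
}
```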
As you dig deeper into C++ (and maybe C ) you will want to be able to draw memory maps like this (more or less):
=== begin memory map ===
                  +-------+-------+
unsigned short a  : byte0 : byte1 :                  holds 2 bytes worth of bit patterns.
                  +-------+-------+-------+-------+
char * c1         : byte0 : byte1 : byte2 : byte3 :  holds address of a
                  +-------+-------+-------+-------+
char * c2         : byte0 : byte1 : byte2 : byte3 :  holds address of a + 1
                  +-------+-------+-------+-------+
=== end memory map ===
Here is what it looks like when it runs; I encourage you to walk through the C++ code
in one window and tie each piece of output back to the C++ expression that generated it.
Note how sometimes we do simple math to add a number to a (e.g. "a +=1" followed by "a += 5").
Note the impact that has on the characters that dump() extracts from memory location 'a'.
=== begin run ===
$ clear; g++ memfun.cpp
$ ./a.out
sizeof char =1, unsigned char =1
sizeof short=2, unsigned short=2
sizeof int =4, unsigned int =4
sizeof long =8, unsigned long =8
===== In ASCII, 0x41 is 'A' and 0x42 is 'B' =====
a=16706(dec), 0x4142 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=B
c2=A
in hex, c1=42
in hex, c2=41
===== after a+= 1 =====
a=16707(dec), 0x4143 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=C
c2=A
in hex, c1=43
in hex, c2=41
===== after a+= 5 =====
a=16712(dec), 0x4148 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=H
c2=A
in hex, c1=48
in hex, c2=41
===== In ASCII, 0x58 is 'X' and 0x42 is 'Y' =====
a=22617(dec), 0x5859 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Y
c2=X
in hex, c1=59
in hex, c2=58
===== In ASCII, 0x59 is 'Y' and 0x5A is 'Z' =====
a=22874(dec), 0x595a (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Z
c2=Y
in hex, c1=5a
in hex, c2=59
Done.
$
=== end run ===
=== begin example program ===
#include <iostream>
#include <string>
using namespace std;
// define some global variables
unsigned short a; // declare 2 bytes in memory, as per sizeof()s below.
char *c1 = (char *)&a; // point c1 to start of memory belonging to a (1st byte).
char * c2 = c1 + 1; // point c2 to next piece of memory belonging to a (2nd byte).
void dump(const char *msg) {
// so the important thing about dump() is that
// we are working with bit patterns in memory we
// do not own, and it is memory we did not set (at least
// not here in dump(), the caller is manipulating the bit
// patterns for the 2 bytes in location 'a').
cout << "===== " << msg << " =====\n";
cout << "a=" << dec << a << "(dec), 0x" << hex << a << dec << " (address of a: " << &a << ")\n";
cout << "c1=" << (void *)c1 << " (should be the same as 'address of a')\n";
cout << "c2=" << (void *)c2 << " (should be just 1 more than 'address of a')\n";
cout << "c1=" << (char)(*c1) << "\n";
cout << "c2=" << (char)(*c2) << "\n";
cout << "in hex, c1=" << hex << ((int)(*c1)) << dec << "\n";
cout << "in hex, c2=" << hex << (int)(*c2) << dec << "\n";
}
int main() {
cout << "sizeof char =" << sizeof( char ) << ", unsigned char =" << sizeof( unsigned char ) << "\n";
cout << "sizeof short=" << sizeof( short ) << ", unsigned short=" << sizeof( unsigned short ) << "\n";
cout << "sizeof int =" << sizeof( int ) << ", unsigned int =" << sizeof( unsigned int ) << "\n";
cout << "sizeof long =" << sizeof( long ) << ", unsigned long =" << sizeof( unsigned long ) << "\n";
// this logic changes the bit pattern in a then calls dump() to interpret that bit pattern.
a = (0x41<<8) + 0x42; dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
a+= 1; dump( "after a+= 1" );
a+= 5; dump( "after a+= 5" );
a = (0x58<<8) + 0x59; dump( "In ASCII, 0x58 is 'X' and 0x42 is 'Y'" );
a = (0x59<<8) + 0x5A; dump( "In ASCII, 0x59 is 'Y' and 0x5A is 'Z'" );
cout << "Done.\n";
}
=== end example program ===
int is an integer, a number that has no digits after the decimal point. It can be positive or negative. Internally, integers are stored as binary numbers. On most computers, integers are 32-bit binary numbers, but this size can vary from one computer to another. When calculations are done with integers, anything after the decimal point is lost. So if you divided 2 by 3, the result is 0, not 0.6666.
char is a data type that is intended for holding characters, as in alphanumeric strings. Whether plain char is signed or unsigned depends on the compiler, even though most character data for which it is used is unsigned. In C++ sizeof(char) is always 1, and that one byte is eight bits on nearly all current machines. The plot thickens considerably on machines that support wide characters (e.g., Unicode) or multiple-byte encoding schemes for strings. But in general char is one byte.

How do I get random numbers in Elm 0.13 without a signal?

I'm making a game where I need to draw random lines on the screen. Now it seems like Random needs a signal to work in 0.13 (and we are forced to work in 0.13). So how do I obtain those random numbers?
I started from the game skeleton provided at the elm-lang website and got to this:
type UserInput = { space : Bool, keys : [KeyCode] }
type Input = { timeDelta : Float, userInput : UserInput }
userInput : Signal UserInput
userInput = lift2 UserInput Keyboard.space Keyboard.keysDown
framesPerSecond = 30
delta : Signal Float
delta = lift (\t -> t / framesPerSecond) (Time.fps framesPerSecond)
input : Signal Input
input = Signal.sampleOn delta (Signal.lift2 Input delta userInput)
gameState : Signal GameState
gameState = Signal.foldp stepGame defaultGame input
stepGame : Input -> GameState -> GameState
stepGame i g =
if g.state == Start then *Get random floats*
Now in stepGame, I want to draw random lines. The problem is that I can only get random floats by providing a signal in 0.13. I have the Input signal close to the step function, but when I change the signature to, let's say,
stepGame : Signal Input -> GameState -> GameState
it doesn't compile. So how do I get a signal into that function to generate some random numbers? I can't seem to find the solution, and it's driving me crazy.
There are two ways to do this. It really depends on whether the amount of random numbers you need is static or not.
Static number of random numbers
Extend your input with random numbers from Random.floatList:
type Input = { timeDelta : Float, userInput : UserInput, randoms : [Float] }
staticNoOfFloats = 42
input : Signal Input
input = Signal.sampleOn delta (Signal.lift3 Input delta userInput (Random.floatList (always staticNoOfFloats <~ delta)))
Dynamic number of random numbers
Use a community library (also outlined in this SO answer) called generator. You can use a random seed by using Random.range in much the same way as outlined above. The library is a pure pseudo-random number generator based on generating a random number and a new Generator that will produce the next random number.
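The idea behind such a pure generator can be sketched in Java. This is not the API of the generator library; the names `Gen`, `next` and `value` are hypothetical, and the linear congruential constants are the ones documented for java.util.Random. Each step returns a new generator instead of mutating state, which is what lets it be threaded through a step function like stepGame.

```java
public class PureGenerator {
    static final long MASK = (1L << 48) - 1;

    // Immutable generator state; nothing here is ever mutated.
    static final class Gen {
        final long seed;

        Gen(long seed) {
            this.seed = seed & MASK;
        }

        // Returns the generator that follows this one (one linear congruential step).
        Gen next() {
            return new Gen(seed * 0x5DEECE66DL + 0xBL);
        }

        // A float in [0, 1) taken from the top 24 bits of the 48-bit state.
        float value() {
            return (seed >>> 24) / (float) (1 << 24);
        }
    }

    public static void main(String[] args) {
        // The seed would come from outside, e.g. one random number sampled from a signal.
        Gen g = new Gen(42L);
        for (int i = 0; i < 3; i++) {
            g = g.next();                   // keep the new generator in the game state
            System.out.println(g.value());
        }
    }
}
```

Because the same seed always yields the same sequence, the step function stays pure while still drawing as many numbers as the current state requires.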
Why not use Random.floatList in the dynamic case?
Usually if you need a dynamic number of random numbers that number is dependent on the current state of the program. Because that state is captured inside the foldp, where you're also doing your updating based on those random numbers, this makes it impossible to use a "signal function", i.e. something of type Signal a -> Signal b.

Is it possible to convert any base to any base (range 2 to 46)

I know it is simple and possible to convert any base to any base: first convert the number to decimal, then convert the decimal value to the other base. I have done this before for the range 2 to 36, but never for 2 to 46.
I don't understand what to use after 36, because base 36 already uses up 'z' (the ten decimal digits 0-9 followed by the 26 letters of the alphabet).
Please explain what happens after 36.
Every base has a purpose. Usually we do base conversion to make complex computations simpler.
Here are some most popular bases used and their representation.
2 (binary numeral system): used internally by nearly all computers, is base two. The two digits are 0 and 1, expressed from switches displaying OFF and ON respectively.
8 (octal system): is occasionally used in computing. The eight digits are 0–7.
10 (decimal system): the most used system of numbers in the world, is used in arithmetic. Its ten digits are 0–9.
12 (duodecimal or dozenal system): is often used due to divisibility by 2, 3, 4 and 6. It was traditionally used as part of quantities expressed in dozens and grosses.
16 (hexadecimal system): is often used in computing. The sixteen digits are 0–9 followed by A–F.
60 (sexagesimal system): originated in ancient Sumeria and passed to the Babylonians. It is still used as the basis of our modern circular coordinate system (degrees, minutes, and seconds) and time measuring (minutes and hours).
64 (Base 64): is also occasionally used in computing, using as digits A–Z, a–z, 0–9, plus two more characters, often + and /.
256 (bytes): is used internally by computers, actually grouping eight binary digits together. For reading by humans, bytes are usually shown in hexadecimal.
The octal, hexadecimal and base-64 systems are often used in computing because of their ease as shorthand for binary. For example, every hexadecimal digit has an equivalent 4 digit binary number.
Radices are usually natural numbers. However, other positional systems are possible, e.g. golden ratio base (whose radix is a non-integer algebraic number), and negative base (whose radix is negative).
Your doubt is whether we can still convert between bases once the base exceeds 36
(# of letters + # of digits = 26 + 10 = 36).
Take base 64 as an example.
It uses A–Z (upper case, 26), a–z (lower case, 26), 0–9 (10), plus 2 more characters. This way the constraint of 36 is resolved.
As we have 26 + 26 + 10 + 2 = 64 symbols in base 64, we can represent any number in base 64. Higher bases similarly bring in further symbols for their digits.
Source: http://en.wikipedia.org/wiki/Radix
The symbols you use for digits are arbitrary. For example base64 encoding uses 'A' to represent the zero valued digit and '0' represents the digit with the value 52. In base64 the digits go through the alphabet A-Z, then the lower case alphabet a-z, then the traditional digits 0-9, and then usually '+' and '/'.
One base 60 system used these symbols:
So the symbols used are arbitrary. There's nothing that 'happens' after 36 except what you say happens for your system.
With number systems, you are allowed to play god.
Playing god
What you need to understand is, that symbols are completely arbitrary. There is no god-given rule for "what comes after 36". You are free to define whatever you like.
To encode numbers with a certain base, all you need is the following:
base-many distinct symbols
a total order on the symbols
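Those two requirements are all a converter needs, as the Java sketch below tries to show for bases up to 46. The 46-symbol alphabet (0-9, then A-Z, then a-j) is an arbitrary choice made for this example, not an established convention, and the class `CustomDigits` with its methods `parse` and `format` is hypothetical.

```java
public class CustomDigits {
    // The symbol table: a symbol's position in this string is its digit value.
    // 10 digits + 26 upper-case letters + 10 lower-case letters = 46 symbols.
    static final String DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghij";

    // Parses text written in the given base (2..46) into a non-negative long.
    static long parse(String text, int base) {
        long value = 0;
        for (char c : text.toCharArray()) {
            int digit = DIGITS.indexOf(c);
            if (digit < 0 || digit >= base) {
                throw new IllegalArgumentException("Invalid digit '" + c + "' for base " + base);
            }
            value = value * base + digit;
        }
        return value;
    }

    // Writes a non-negative long in the given base (2..46).
    static String format(long value, int base) {
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(DIGITS.charAt((int) (value % base)));
            value /= base;
        }
        return sb.reverse().toString(); // digits were produced least significant first
    }

    public static void main(String[] args) {
        System.out.println(format(45, 46));  // j
        System.out.println(format(46, 46));  // 10
        System.out.println(parse("a", 46));  // 36
    }
}
```

Swapping in a different DIGITS string changes how numbers look but not the arithmetic.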
An arbitrary example
Naturally, there's an infinite amount of possibilities to create such a symbol table for a certain base:
Θ
ェ
す
)
0
・
_
o
や
ι
You could use this table to encode numbers in base 10, with Θ being the zero element, ェ being the one, etc.
Conventions
Of course, your peers would not be too happy if you started using the above symbol table. Because the symbols are arbitrary, we need conventions. 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 is a convention, as are the symbols we use for hexadecimal, binary, etc. It is generally agreed upon which symbol table we use for which base; that is why we can read the numbers someone else writes down.
The important thing to remember is that all numbers are symbolic of a value. Thus if you wanted to do that, you could just make a list containing the values at each position. After base 36, you simply run out of characters you can make a logical sequence out of. For example, if you used the Cambodian alphabet with 70-odd characters, you could do base 80.
Here is the complete code I have written; I hope it helps. It handles bases 2 to 36 and non-negative values that fit in an int. Going beyond 36 means extending the two digit-mapping methods with whatever extra symbols you choose.
import java.util.Scanner;
/*
 * author : roottraveller, nov 4th 2017
 */
public class BaseXtoBaseYConversion {
    BaseXtoBaseYConversion() {
    }

    // Converts inputNumber from inputBase to outputBase by going through an int value.
    public static String convertBaseXtoBaseY(String inputNumber, final int inputBase, final int outputBase) {
        checkBase(inputBase);
        checkBase(outputBase);
        int decimal = baseXToDecimal(inputNumber, inputBase);
        return decimalToBaseY(decimal, outputBase);
    }

    private static void checkBase(int base) {
        if (base < 2 || base > 36) {
            throw new IllegalArgumentException("Base must be between 2 and 36: " + base);
        }
    }

    // Maps a digit character to its value: '0'-'9' -> 0-9, letters (either case) -> 10-35.
    private static int baseXNumeric(char input) {
        if (input >= '0' && input <= '9') {
            return input - '0';
        } else if (input >= 'a' && input <= 'z') {
            return (input - 'a') + 10;
        } else if (input >= 'A' && input <= 'Z') {
            return (input - 'A') + 10;
        } else {
            throw new IllegalArgumentException("Not a digit: " + input);
        }
    }

    public static int baseXToDecimal(String input, final int base) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("Empty input");
        }
        int decimalValue = 0;
        // Horner's rule, most significant digit first; avoids floating-point Math.pow.
        for (int index = 0; index < input.length(); index++) {
            int digit = baseXNumeric(input.charAt(index));
            if (digit >= base) {
                throw new IllegalArgumentException("Digit " + input.charAt(index) + " is not valid in base " + base);
            }
            decimalValue = decimalValue * base + digit;
        }
        return decimalValue;
    }

    // Maps a digit value to its character: 0-9 -> '0'-'9', 10-35 -> 'a'-'z'.
    private static char baseYCharacter(int input) {
        if (input >= 0 && input <= 9) {
            return (char) ('0' + input);
        } else {
            return (char) ('a' + (input - 10));
        }
    }

    public static String decimalToBaseY(int input, int base) {
        if (input == 0) {
            return "0"; // the loop below would otherwise return an empty string
        }
        String result = "";
        while (input > 0) {
            int remainder = input % base;
            input = input / base;
            result = baseYCharacter(remainder) + result; // Important, notice the reverse order here
        }
        return result;
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        System.out.println("Enter : number baseX baseY");
        while (scanner.hasNext()) { // stop cleanly at end of input
            String inputNumber = scanner.next();
            int inputBase = scanner.nextInt();
            int outputBase = scanner.nextInt();
            String outputNumber = convertBaseXtoBaseY(inputNumber, inputBase, outputBase);
            System.out.println("Result = " + outputNumber);
        }
    }
}

Finding a pattern in a set

What algorithms could I use to determine common characters in a set of strings?
To keep the example simple, I only care about runs of 2 or more characters that show up in 2 or more of the samples. For instance:
0000abcde0000
0000abcd00000
000abc0000000
00abc000de000
I'd like to know:
00 was used in 1,2,3,4
000 was used in 1,2,3,4
0000 was used in 1,2,3
00000 was used in 2,3
ab was used in 1,2,3,4
abc was used in 1,2,3,4
abcd was used in 1,2
bc was used in 1,2,3,4
bcd was used in 1,2
cd was used in 1,2
de was used in 1,4
I'm assuming that this is not homework. (If it is, you're on your own re plagiarism! ;-)
Below is a quick-and-dirty solution. The time complexity is O(m**2 * n) where m is the average string length and n is the size of the array of strings.
An instance of Occurrence keeps the set of indices which contain a given string. The commonOccurrences routine scans a string array, calling captureOccurrences for each non-null string. The captureOccurrences routine puts the current index into an Occurrence for each possible substring of the string it is given. Finally, commonOccurrences forms the result set by picking only those Occurrences that have at least two indices.
Note that your example data has many more common substrings than you identified in the question. For example, "00ab" occurs in each of the input strings. An additional filter to select interesting strings based on content (e.g. all digits, all alphabetic, etc.) is -- as they say -- left as an exercise for the reader. ;-)
QUICK AND DIRTY JAVA SOURCE:
package com.stackoverflow.answers;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
public class CommonSubstringFinder {
public static final int MINIMUM_SUBSTRING_LENGTH = 2;
public static class Occurrence implements Comparable<Occurrence> {
private final String value;
private final Set<Integer> indices;
public Occurrence(String value) {
this.value = value == null ? "" : value;
indices = new TreeSet<Integer>();
}
public String getValue() {
return value;
}
public Set<Integer> getIndices() {
return Collections.unmodifiableSet(indices);
}
public void occur(int index) {
indices.add(index);
}
public String toString() {
StringBuilder result = new StringBuilder();
result.append('"').append(value).append('"');
String separator = ": ";
for (Integer i : indices) {
result.append(separator).append(i);
separator = ",";
}
return result.toString();
}
public int compareTo(Occurrence that) {
return this.value.compareTo(that.value);
}
}
public static Set<Occurrence> commonOccurrences(String[] strings) {
Map<String,Occurrence> work = new HashMap<String,Occurrence>();
if (strings != null) {
int index = 0;
for (String string : strings) {
if (string != null) {
captureOccurrences(index, work, string);
}
++index;
}
}
Set<Occurrence> result = new TreeSet<Occurrence>();
for (Occurrence occurrence : work.values()) {
if (occurrence.indices.size() > 1) {
result.add(occurrence);
}
}
return result;
}
private static void captureOccurrences(int index, Map<String,Occurrence> work, String string) {
final int maxLength = string.length();
for (int i = 0; i < maxLength; ++i) {
// j is the exclusive end index for substring(), so it must be allowed to reach maxLength
for (int j = i + MINIMUM_SUBSTRING_LENGTH; j <= maxLength; ++j) {
String partial = string.substring(i, j);
Occurrence current = work.get(partial);
if (current == null) {
current = new Occurrence(partial);
work.put(partial, current);
}
current.occur(index);
}
}
}
private static final String[] TEST_DATA = {
"0000abcde0000",
"0000abcd00000",
"000abc0000000",
"00abc000de000",
};
public static void main(String[] args) {
Set<Occurrence> found = commonOccurrences(TEST_DATA);
for (Occurrence occurrence : found) {
System.out.println(occurrence);
}
}
}
SAMPLE OUTPUT (one Occurrence per line; indices are zero-based):
"00": 0,1,2,3
"000": 0,1,2,3
"0000": 0,1,2
"00000": 1,2
"0000a": 0,1
"0000ab": 0,1
"0000abc": 0,1
"0000abcd": 0,1
"000a": 0,1,2
"000ab": 0,1,2
"000abc": 0,1,2
"000abcd": 0,1
"00a": 0,1,2,3
"00ab": 0,1,2,3
"00abc": 0,1,2,3
"00abc0": 2,3
"00abc00": 2,3
"00abc000": 2,3
"00abcd": 0,1
"0a": 0,1,2,3
"0ab": 0,1,2,3
"0abc": 0,1,2,3
"0abc0": 2,3
"0abc00": 2,3
"0abc000": 2,3
"0abcd": 0,1
"ab": 0,1,2,3
"abc": 0,1,2,3
"abc0": 2,3
"abc00": 2,3
"abc000": 2,3
"abcd": 0,1
"bc": 0,1,2,3
"bc0": 2,3
"bc00": 2,3
"bc000": 2,3
"bcd": 0,1
"c0": 2,3
"c00": 2,3
"c000": 2,3
"cd": 0,1
"de": 0,3
"de0": 0,3
"de00": 0,3
"de000": 0,3
"e0": 0,3
"e00": 0,3
"e000": 0,3
This is most probably an NP-hard problem. It looks similar to multiple sequence alignment, which is. Basically, you could adapt multidimensional Smith-Waterman (= local sequence alignment) for your needs. There might be a more efficient algorithm, though.
Build a tree where the path through the tree is the letter sequence. Have each node contain a "set" that the string references are added to in passing (or just keep a count). Then keep track of N locations in the word, where N is the longest sequence you care about (e.g., start a new handle at each char, walk all handles down at each step, and abort each handle after N steps).
This would work better with a small, finite and dense alphabet (DNA was the first place I thought to use it).
Edit: If you know in advance the patterns you care about, the above can be altered to work by building the tree ahead of time and then only checking whether you are still on the tree rather than extending it.
an example
input
abc
abd
abde
acc
bde
the tree
a : 4
  b : 3
    c : 1
    d : 2
      e : 1
  c : 1
    c : 1
b : 4
  d : 3
    e : 2
  c : 1
c : 3
  c : 1
d : 3
  e : 2
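A counting tree of this kind could be sketched in Java as follows. This is one possible reading of the description above, keeping a count per node rather than a set of string references; the class `SubstringTrie`, its `Node` type and the parameter `maxLen` (the N above) are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

public class SubstringTrie {
    // One node per distinct substring: how often it occurred, plus children by next character.
    static final class Node {
        int count;
        final Map<Character, Node> children = new TreeMap<>();
    }

    // Walks a fresh path from every start position, at most maxLen characters deep.
    static Node build(String[] words, int maxLen) {
        Node root = new Node();
        for (String word : words) {
            for (int start = 0; start < word.length(); start++) {
                Node node = root;
                int end = Math.min(word.length(), start + maxLen);
                for (int i = start; i < end; i++) {
                    node = node.children.computeIfAbsent(word.charAt(i), k -> new Node());
                    node.count++;
                }
            }
        }
        return root;
    }

    // Returns how many times the substring occurred, or 0 if it is not in the tree.
    static int count(Node root, String substring) {
        Node node = root;
        for (char c : substring.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        Node root = build(new String[] {"abc", "abd", "abde", "acc", "bde"}, 4);
        System.out.println(count(root, "ab")); // 3
        System.out.println(count(root, "bd")); // 3
        System.out.println(count(root, "c"));  // 3
    }
}
```

The counts are occurrences, not distinct strings, which is why "c" scores 3 for the input above ("acc" contributes two).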
Look up "Suffix Trees" on the web. Or pick up "Algorithms on Strings, Trees and Sequences" by Dan Gusfield. I don't have the book with me to verify, but the wikipedia page on suffix trees says that page 205 contains a solution for your problem: "finding the longest substrings common to at least k strings in a set".
Do you know the "values" you need to search for ahead of time? Or do you need code to parse the strings, and give you stats like you posted?
Using the Boyer-Moore algorithm is a very quick way to tell if substrings exist (and even locate them), if you know what you are looking for ahead of time.
You can use an analysis of a distance matrix. Any diagonal movement (no cost change) is an exact match.
You may find a suffix array simpler and more efficient than a suffix tree, depending on how frequent common substrings are in your data -- if they're common enough, you'll need the more sophisticated suffix-array construction algorithm. (The naive method is to just use your library sort function.)
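The naive construction mentioned above can be sketched in Java for a single string. The class `NaiveSuffixArray` and the method `build` are hypothetical names. Common substrings then show up as shared prefixes of neighbouring suffixes in the sorted order; handling a whole set of strings would need more bookkeeping than is shown here.

```java
import java.util.Arrays;

public class NaiveSuffixArray {
    // Returns the start positions of all suffixes of text, ordered by suffix.
    // Naive: relies on the library sort and compares whole suffixes.
    static Integer[] build(String text) {
        Integer[] order = new Integer[text.length()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        return order;
    }

    public static void main(String[] args) {
        // Sorted suffixes of "banana": a, ana, anana, banana, na, nana
        System.out.println(Arrays.toString(build("banana"))); // [5, 3, 1, 0, 4, 2]
    }
}
```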
