How to find all characters in a string whose appearance is greater than 2 - algorithm

I have a question about an algorithm:
How can I efficiently find all characters in a string that appear more than a given number of times, say 2?
Regards.

Counting sort is extremely efficient for one-byte encodings; two-byte encodings are the borderline case. For wider encodings it is less efficient, but the counting array can be replaced with a hash table.
EDIT: By the way, that is too general a solution; doing only the counting phase and outputting results on the fly is more than enough.

s = "......" # your string
h=Hash.new(0)
s.each_char {|c| h[c]+=1 }
h.select {|char,count| count>2}

var word = "......";
var chars = word.GroupBy(w => w).Where(g => g.Count() > 2).Select(g => new { character = g.Key, count = g.Count() });

Couldn't resist trying this out.
Keep an internal array of 256 counters, one for each (ASCII) character.
Loop once over the input string.
Increment the count for given character using the ordinal value of the character as a direct access into the internal array.
Delphi implementation
Type
TCharCounter = class(TObject)
private
FCounts: array[0..255] of Integer; // Integer rather than byte, so counts above 255 do not overflow
public
constructor Create(const Value: string);
function Count(const AChar: Char): Integer;
end;
{ TCharCounter }
constructor TCharCounter.Create(const Value: string);
var
I: Integer;
begin
inherited Create;
for I := 1 to Length(Value) do
Inc(FCounts[Ord(Value[I])]);
end;
function TCharCounter.Count(const AChar: Char): Integer;
begin
Result := FCounts[Ord(AChar)];
end;

I would sort the string, then just walk through it and keep a running tally for each letter. The walk is just O(n), so the whole thing is as efficient as your sort.
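A minimal sketch of the sort-then-walk idea, in Python. The function name `frequent_chars_sorted` and the `threshold` parameter are made up for illustration; the sort dominates the cost at O(n log n).

```python
def frequent_chars_sorted(text, threshold=2):
    """Return characters occurring more than `threshold` times (hypothetical helper)."""
    result = []
    run_char, run_length = None, 0
    # After sorting, equal characters sit next to each other in runs.
    for ch in sorted(text):
        if ch == run_char:
            run_length += 1
        else:
            run_char, run_length = ch, 1
        # Record the character once, at the moment its run passes the threshold.
        if run_length == threshold + 1:
            result.append(ch)
    return result


print(frequent_chars_sorted("mississippi"))
```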

The easier way is to use an array, occurrence[256], initialized to all zeros, and for every character c in the string do occurrence[(unsigned char)c]++.
Then you just scan occurrence to find the characters whose count satisfies your criterion.
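The counting-array approach could look roughly like this in Python. It is only a sketch: the name `frequent_chars` is invented here, and it assumes the text fits a single-byte encoding (Latin-1), so that 256 counters are enough.

```python
def frequent_chars(text, threshold=2):
    """Return characters occurring more than `threshold` times (hypothetical helper)."""
    # One counter per possible byte value; assumes a single-byte encoding.
    occurrence = [0] * 256
    for byte in text.encode("latin-1"):
        occurrence[byte] += 1
    # Scan the table once and keep the codes whose count is above the threshold.
    return [chr(code) for code, count in enumerate(occurrence) if count > threshold]


print(frequent_chars("mississippi"))
```

For wider encodings the list can be swapped for a dictionary keyed by character, at the cost of hashing each character.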

Related

Reading a text file and constructing a matrix from it

I need to construct a matrix from a text file; the numbers of rows and columns are given in the first line of the file. Here is an example to make it clearer.
4 3
1 2 3
5 6 7
9 10 8
1 11 13
Where m=4 (number of rows) and n=3 (number of columns)
This is an example of a text file. Is something like this even possible?
Program Feb;
const
max=100;
type
Matrix=array[1..max,1..max] of integer;
var datoteka:text;
m,n:integer;
counter:integer;
begin
assign(datoteka,'datoteka.txt');
reset(datoteka);
while not eoln(datoteka) do
begin
read(datoteka, m);
read(datoteka, n);
end;
repeat
read eoln(n)
until eof(datoteka)
write (m,n);
end.
My code isn't much help, because I don't know how to write it.
First, have a look at the code I wrote to do the task, and then look at my explanation below.
program Matrixtest;
uses
sysutils;
var
NoOfCols,
NoOfRows : Integer;
Source : TextFile;
Matrix : array of array of integer;
FileName : String;
Row,
Col : Integer; // for-loop iterators to access a single cell of the matrix
Value : Integer;
begin
// First, construct the name of the file defining the matrix
// This assumes that the file is in the same folder as this app
FileName := ExtractFilePath(ParamStr(0)) + 'MatrixDef.Txt';
writeln(FileName); // echo it back to the screen so we can see it
// Next, open the file
Assign(Source, FileName);
Reset(Source);
read(Source, NoOfRows, NoOfCols);
writeln('Cols: ', NoOfCols, ' Rows: ', NoOfRows);
SetLength(Matrix, NoOfRows, NoOfCols); // first dimension is rows, second is columns
readln(source); // move to next line in file
// Next, read the array data
for Row := 1 to NoOfRows do begin
for Col := 1 to NoOfCols do begin
read(Source, Value);
Matrix[Row - 1, Col - 1] := Value;
end;
end;
// Display the array contents
for Row := 1 to NoOfRows do begin
for Col := 1 to NoOfCols do begin
writeln('Row: ', Row, ' Col: ', Col, ' contents: ', Matrix[Row - 1, Col - 1]);
end;
end;
Close(Source); // We're done with the file, so close it to release OS resources
readln; // this waits until you press Enter, so you can read what's been displayed
end.
In your program, you can use a two-dimensional array to represent your matrix. Free Pascal supports multi-dimensional arrays; see https://wiki.lazarus.freepascal.org/Multidimensional_arrays for more information.
This is a complex task, so it helps to know how to do more basic things like reading an array of a size known at compile-time from a text file.
The wrinkle in this task is that you are supposed to read the dimensions (numbers of rows and columns) of the matrix at run-time from the file which contains the matrix's contents.
One inefficient way to do this would be to declare the matrix array with huge dimensions, larger than anything you would expect in practice, using the type of array declaration in the Wiki page linked above.
A better way is to use dynamic arrays, whose dimensions you can set at run-time. To use this, you need to know:
How to declare a dynamic array in Free Pascal
How to set the dimensions of the array at run-time, once you've picked them up from your matrix-definition file (hint: SetLength is the way to do this)
The fact that a Free Pascal dynamic array is zero-based
The easiest way of managing zero-based arrays is to write your code (in terms of Row and Column variables) as if the matrix were declared as array[1..NoOfRows, 1..NoOfColumns] and subtract one from the array indexes only when you actually access the array, as in:
Row := 3;
Column := 4;
Value := Matrix[Row - 1, Column - 1];
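For comparison, here is the same idea as a short Python sketch: read the dimensions from the first line, then size the matrix at run time. The function name `read_matrix` is hypothetical, and an in-memory `io.StringIO` stands in for the text file so the example is self-contained.

```python
import io


def read_matrix(source):
    """Read 'rows cols' from the first line, then that many rows of integers."""
    rows, cols = map(int, source.readline().split())
    matrix = []
    for _ in range(rows):
        values = [int(token) for token in source.readline().split()]
        if len(values) != cols:
            raise ValueError("row does not have the declared number of columns")
        matrix.append(values)
    return matrix


# Stand-in for the text file from the question.
sample = io.StringIO("4 3\n1 2 3\n5 6 7\n9 10 8\n1 11 13\n")
matrix = read_matrix(sample)

# Think in 1-based rows and columns, subtract one only when indexing.
row, column = 3, 2
print(matrix[row - 1][column - 1])
```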

Not sure when to use ':' or '='

I am getting this error when compiling my code: "cars2.pp(3,8) Fatal: Syntax error, "=" expected but ":" found"
Here's my code:
program vehInfo;
type
wheels: array [1 .. 6] of integer;
purchaseYear: array [1919 .. 2051] of integer;
style = (sports, SUV, minivan, motorcycle, sedan, exotic);
pwrSrc = (electric, hybrid, gas, diesel);
vehicle = record
wheel : wheels;
buyDate : purchaseYear;
styles : style;
source : pwrSrc;
end;
var
myVehicle: vehicle;
listOfCars: file of vehicle;
begin
assign(listOfCars, 'hwkcarsfile.txt');
reset(listOfCars);
read(listOfCars, myVehicle);
writeln('wheel type: ' , myVehicle.wheel);
writeln('year purchased: ' , myVehicle.buyDate);
writeln('style: ' , myVehicle.styles);
writeln('power source: ' , myVehicle.source)
close(listOfCars);
end.
I am new to Pascal, any help would be appreciated, thank you.
It is quite simple: type declarations use =, while variable declarations (and record fields) use :.
So:
type
wheels = 1..6; // not an array, but a subrange type!
purchaseYear = 1919..2051; // not an array, but a subrange type!
style = (sports, SUV, minivan, motorcycle, sedan, exotic);
pwrSrc = (electric, hybrid, gas, diesel);
vehicle = record
wheel: wheels; { a field of a record is a variable }
buyDate: purchaseYear;
styles: style;
source: pwrSrc;
end;
...
var
myVehicle: vehicle;
listOfCars: file of vehicle;
Subrange types are ordinal types (in this case, both are integers), but within the given range. Any value outside the range is illegal. You don't want to have arrays of numbers, you only want the number of wheels and the year (a number too) the vehicle was purchased. You don't need 133 different dates, do you?

Transpose a string in Pascal

I'm very new to Pascal and still learning. I have to write code that:
Takes input of a string
Split the string into two characters each (Snippet)
Use the Snippet to get an index from an array
Transpose the Snippet to a certain value
If Index + Transpose is larger than the length of the Array, return nothing
If not, append the transposed Snippet to a result string
Return the transposed string
I can only write steps 1 through 3; the rest is still a blur to me. Any help is appreciated.
(And I also want to improve it without many for loops. Any thoughts?)
program TransposeString;
var
melody : Array[1..24] of String[2] = ('c.', 'c#', 'd.', 'd#', 'e.', 'f.', 'f#', 'g.', 'g#', 'a.', 'a#', 'b.', 'C.', 'C#', 'D.', 'D#', 'E.', 'F.', 'F#', 'G.', 'G#', 'A.', 'A#', 'B.');
songstring, transposedstring : String;
transposevalue : byte;
function Transpose(song : String; transposevalue : byte): String;
var
songsnippet : String[2];
iter_song, iter_index, index : byte;
begin
for iter_song := 1 to length(song) do
begin
if iter_song mod 2 = 0 then continue;
songsnippet := song[iter_song] + song[iter_song + 1]; //Split the string into 2 characters each
for iter_index := 1 to 24 do
begin
if melody[iter_index] = songsnippet then
begin
index := iter_index; //Get Index
break;
end;
end;
//Check Transpose + Index
//Transpose Snippet
//Append Snippet to Result String
end;
end;
begin
readln(songstring);
readln(transposevalue);
transposedstring := transpose(songstring, transposevalue);
writeln(transposedstring);
end.
As a starter for you to work from, rather than just spoon-feeding an answer:
You have the index of the snippet (note) in index. Assuming the notes are in order, you need to return the note that sits transposevalue positions above it in the array, so
result := result + melody[index + transposevalue];
You need to check the length of the array before trying to read from it, otherwise it'll crash (step 5). This is just an if statement.
I wouldn't worry too much about for loops - 2 deep nesting isn't that bad. If you wanted to split it out a bit then GetTransposedNote(const note:string): string; could be split out as a new function.
Things you may want to think about are:
What if you can't find the note in the array?
Do you want to be case-sensitive?
What if the input string has an odd number of characters?
You are most of the way there already, though.
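To make the remaining steps concrete, here is one possible shape of the whole routine as a Python sketch rather than Pascal. The choices for the open questions are assumptions, not requirements from the question: an unknown note or a transposition past the end of the array returns an empty string, matching is case-sensitive, and a trailing odd character is ignored.

```python
MELODY = ['c.', 'c#', 'd.', 'd#', 'e.', 'f.', 'f#', 'g.', 'g#', 'a.', 'a#', 'b.',
          'C.', 'C#', 'D.', 'D#', 'E.', 'F.', 'F#', 'G.', 'G#', 'A.', 'A#', 'B.']


def transpose(song, transpose_value):
    """Shift every two-character note up by transpose_value positions."""
    result = []
    # Step through the song two characters at a time; a trailing odd
    # character is ignored (assumption).
    for start in range(0, len(song) - 1, 2):
        snippet = song[start:start + 2]
        if snippet not in MELODY:
            return ""  # unknown note: return nothing (assumption)
        index = MELODY.index(snippet)
        if index + transpose_value >= len(MELODY):
            return ""  # would run past the end of the array (step 5)
        result.append(MELODY[index + transpose_value])
    return "".join(result)


print(transpose("c.d.", 2))
```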

Reversing encryption in delphi

It was not me who wrote this code, it was the previous programmer. However, I noticed he didn't provide a decryption algorithm, rendering the encryption useless.
How can I decrypt this?
function Encrypt(jstr: String): String;
var
I: Integer;
A: Real;
begin
if Length(jstr) = 0 Then begin
Result := '';
Exit;
end;
A := 0;
for I := 0 To Length(jstr) do
A := A + (Ord(jstr[I]) * Pos(jstr[I],jstr)) / 33;
Result := FormatFloat('0000000000.0000000000',A);
if Pos(',',Result) > 0 then begin
Insert('.',Result,Pos(',',Result));
Delete(Result,Pos(',',Result),1);
end;
end;
Thanks!
It looks like a one-way hash and hence is not reversible. For example, if the string is very big, the result is still the string representation of a float.
That function cannot be reversed. Since it takes input of arbitrary length and returns output of finite length, simple information theory tells you the futility of attempting to write a general inverse. Even for shorter input strings it seems to me that different input strings can result in the same encrypted string.
Even as a hash this function seems very brittle to me due to the bizarre use of floating point code. If I were you I would replace this function with something more fit for purpose.
Finally, I recommend that you undertake a review of all code produced by this developer. The low quality of this code and algorithm suggests to me that everything that this developer touched is liable to have defects.
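The claim that different inputs can produce the same output is easy to check with a rough Python port of the function. This is a sketch under assumptions: it loops over positions 1 to Length (the original starts at 0, which is outside the string), and it approximates FormatFloat with a zero-padded fixed-point format.

```python
def encrypt(text):
    """Approximate port of the Delphi Encrypt function (illustrative only)."""
    if not text:
        return ""
    total = 0.0
    for ch in text:
        # Pos() returns the 1-based position of the first occurrence.
        first_position = text.index(ch) + 1
        total += ord(ch) * first_position / 33
    # Ten integer digits, a dot, ten decimals: 21 characters in total.
    return "{:021.10f}".format(total)


# 97*1 + 98*2 and 99*1 + 97*2 both sum to 293, so these two inputs collide.
print(encrypt("ab") == encrypt("ca"))
```

Since two different inputs map to the same output, no decryption routine can tell which one was the original.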

How to find all brotherhood strings?

I have a string, and another text file which contains a list of strings.
We call 2 strings "brotherhood strings" when they're exactly the same after sorting alphabetically.
For example, "abc" and "cba" will be sorted into "abc" and "abc", so the original two are brotherhood. But "abc" and "aaa" are not.
So, is there an efficient way to pick out all brotherhood strings from the text file, according to the one string provided?
For example, we have "abc" and a text file which writes like this:
abc
cba
acb
lalala
then "abc", "cba", "acb" are the answers.
Of course, "sort & compare" is a reasonable approach, but by "efficient" I mean: is there a way to determine whether a candidate string is a brotherhood string of the original after processing it in a single pass?
That would be the most efficient way, I think. After all, you cannot give the answer without at least reading the candidate strings, and sorting usually needs more than one pass over the candidate string. So a hash table might be a good solution, but I have no idea which hash function to choose.
Most efficient algorithm I can think of:
Set up a hash table for the original string. Let each letter be the key, and the number of times the letter appears in the string be the value. Call this hash table inputStringTable
Parse the input string, and each time you see a character, increment the value of the hash entry by one
for each string in the file
create a new hash table. Call this one brotherStringTable.
for each character in the string, add one to its entry in brotherStringTable. If brotherStringTable[character] > inputStringTable[character], this string is not a brother (one character shows up too many times)
once string is parsed, compare each inputStringTable value with the corresponding brotherStringTable value. If one is different, then this string is not a brother string. If all match, then the string is a brother string.
This will be O(nk), where n is the length of the input string (any strings longer than the input string can be discarded immediately) and k is the number of strings in the file. Any sort based algorithm will be O(nk lg n), so in certain cases, this algorithm is faster than a sort based algorithm.
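The steps above might be sketched in Python as follows. The function names are invented for illustration, and `collections.Counter` plays the role of the hash tables. Because the lengths are compared first, "no character occurs too often" already implies that all counts match, so the final table comparison can be skipped.

```python
from collections import Counter


def is_brother(candidate, input_table, input_length):
    """Check one candidate against the precomputed counts of the input string."""
    if len(candidate) != input_length:
        return False  # different lengths can never match
    brother_table = Counter()
    for ch in candidate:
        brother_table[ch] += 1
        # Stop as soon as a character shows up too many times.
        if brother_table[ch] > input_table[ch]:
            return False
    return True


def find_brothers(original, candidates):
    input_table = Counter(original)
    return [c for c in candidates if is_brother(c, input_table, len(original))]


print(find_brothers("abc", ["abc", "cba", "acb", "lalala"]))
```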
Sorting each string, then comparing it, works out to something like O(N*(k+log S)), where N is the number of strings, k is the search key length, and S is the average string length.
It seems like counting the occurrences of each character might be a possible way to go here (assuming the strings are of a reasonable length). That gives you O(k+N*S). Whether that's actually faster than the sort & compare is obviously going to depend on the values of k, N, and S.
I think that in practice, the cache-thrashing effect of re-writing all the strings in the sorting case will kill performance, compared to any algorithm that doesn't modify the strings...
Iterate, sort, compare. That shouldn't be too hard, right?
Let's assume your alphabet is from 'a' to 'z' and you can index an array based on the characters. Then, for each element in a 26 element array, you store the number of times that letter appears in the input string.
Then you go through the set of strings you're searching, and iterate through the characters in each string. You can decrement the count associated with each letter in (a copy of) the array of counts from the key string.
If you finish your loop through the candidate string without having to stop, and you have seen the same number of characters as there were in the input string, it's a match.
This allows you to skip the sorts in favor of a constant-time array copy and a single iteration through each string.
EDIT: Upon further reflection, this is effectively sorting the characters of the first string using a bucket sort.
I think what will help you is the test if two strings are anagrams. Here is how you can do it. I am assuming the string can contain 256 ascii characters for now.
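#include <stdbool.h>
#include <string.h>

#define NUM_ALPHABETS 256
int alphabets[NUM_ALPHABETS];

bool isAnagram(const char *src, const char *dest) {
    size_t len1 = strlen(src);
    size_t len2 = strlen(dest);
    size_t i;
    if (len1 != len2)
        return false;
    memset(alphabets, 0, sizeof(alphabets));
    /* cast to unsigned char so bytes above 127 never give a negative index */
    for (i = 0; i < len1; i++)
        alphabets[(unsigned char)src[i]]++;
    for (i = 0; i < len2; i++) {
        /* a count below zero means dest has a character src lacks */
        if (--alphabets[(unsigned char)dest[i]] < 0)
            return false;
    }
    return true;
}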
This will run in O(mn) if you have 'm' strings in the file of average length 'n'
Sort your query string
Iterate through the Collection, doing the following:
Sort current string
Compare against query string
If it matches, this is a "brotherhood" match, save it/index/whatever you want
That's pretty much it. If you're doing lots of searching, presorting all of your collection will make the routine a lot faster (at the cost of extra memory). If you are doing this even more, you could pre-sort and save a dictionary (or some hashed collection) based off the first character, etc, to find matches much faster.
It's fairly obvious that each brotherhood string will have the same histogram of letters as the original. It is trivial to construct such a histogram, and fairly efficient to test whether the input string has the same histogram as the test string ( you have to increment or decrement counters for twice the length of the input string ).
The steps would be:
construct histogram of test string ( zero an array int histogram[128] and increment position for each character in test string )
for each input string
for each character in input string c, test whether histogram[c] is zero. If it is, it is a non-match and restore the histogram.
decrement histogram[c]
to restore the histogram, traverse the input string back to its start incrementing rather than decrementing
At most, it requires two increments/decrements of an array for each character in the input.
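A small Python sketch of this decrement-and-restore scheme follows. The function names are hypothetical, it assumes 7-bit ASCII input to match the histogram[128] array, and it adds an explicit length check so that a shorter candidate is not accepted by accident.

```python
def make_histogram(test_string):
    """Count each character of the test string in a 128-entry table."""
    histogram = [0] * 128
    for ch in test_string:
        histogram[ord(ch)] += 1
    return histogram


def matches(histogram, test_length, candidate):
    """Consume the histogram with candidate's characters, then restore it."""
    if len(candidate) != test_length:
        return False
    consumed = 0
    ok = True
    for ch in candidate:
        code = ord(ch)
        if code >= 128 or histogram[code] == 0:
            ok = False  # character missing or used too often
            break
        histogram[code] -= 1
        consumed += 1
    # Restore: re-increment exactly the characters that were consumed.
    for ch in candidate[:consumed]:
        histogram[ord(ch)] += 1
    return ok


histogram = make_histogram("abc")
print([w for w in ["abc", "cba", "acb", "lalala"] if matches(histogram, 3, w)])
```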
The most efficient answer will depend on the contents of the file. Any algorithm we come up with will have complexity proportional to N (number of words in file) and L (average length of the strings) and possibly V (variety in the length of strings)
If this were a real world situation, I would start with KISS and not try to overcomplicate it. Checking the length of the target string is simple but could help avoid lots of nlogn sort operations.
target = sort_characters("target string")
count = 0
foreach (word in inputfile){
if target.len == word.len && target == sort_characters(word){
count++
}
}
I would recommend:
for each string in text file :
compare size with "source string" (size of brotherhood strings should be equal)
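compare hashes; the hash must not depend on character order, so hash the sorted string or use something order-independent (a CRC or default framework hash of the unsorted string will differ between brotherhood strings)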
in case of equality, do a finer comparison with the sorted strings.
It's not the fastest algorithm but it will work for any alphabet/encoding.
Here's another method, which works if you have a relatively small set of possible "letters" in the strings, or good support for large integers. Basically consists of writing a position-independent hash function...
Assign a different prime number for each letter:
prime['a']=2;
prime['b']=3;
prime['c']=5;
Write a function that runs through a string, repeatedly multiplying the prime associated with each letter into a running product
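long long key(const char *string)
{
    long long product=1;
    /* multiply in the prime for every character, including the first one */
    for (; *string; string++) {
        product *= prime[(unsigned char)*string];
    }
    return product;
}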
As long as the product does not overflow, this function returns an integer that is unique for any multiset of letters, independent of the order in which they appear in the string. Once you've got the value for the "key", you can go through the list of strings to match, and perform the same operation.
Time complexity of this is O(N), of course. You can even re-generate the (sorted) search string by factoring the key. The disadvantage, of course, is that the keys do get large pretty quickly if you have a large alphabet.
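In a language with arbitrary-precision integers the overflow concern goes away, at the price of slower multiplication for long strings. A minimal Python sketch, assuming lowercase 'a' to 'z' only; the helper `first_primes` and the table `PRIME` are made up for this example.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for 26 of them)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# One distinct prime per letter: a -> 2, b -> 3, c -> 5, ...
PRIME = dict(zip("abcdefghijklmnopqrstuvwxyz", first_primes(26)))


def key(word):
    """Order-independent key: the product of the primes of all letters."""
    product = 1
    for ch in word:
        product *= PRIME[ch]  # raises KeyError outside a-z (assumption)
    return product


target = key("abc")
print([w for w in ["abc", "cba", "acb", "aaa"] if key(w) == target])
```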
Here's an implementation. It creates a dict of the letters of the master, and a string version of the same, since string comparisons are done at C speed. When creating a dict of the letters in a trial string, it checks against the master dict in order to fail at the first possible moment: if it finds a letter not in the original, or more of that letter than the original, it fails. You could replace the strings with integer-based hashes (as per one answer regarding base 26) if that proves quicker. Currently the hash for comparison looks like a3b1c2 for abacca (letters are sorted so that the order of appearance does not matter).
This should work out O(N log( min(M,K) )) for N strings of length M and a reference string of length K, and requires the minimum number of lookups of the trial string.
master = "abc"
wordset = "def cba accb aepojpaohge abd bac ajghe aegage abc".split()

def dictmaster(s):
    # count how often each letter occurs in the master string
    charmap = {}
    for char in s:
        charmap[char] = charmap.get(char, 0) + 1
    return charmap

def dicttrial(s, mastermap):
    # count the letters of the trial string, failing as early as possible
    trialmap = {}
    for char in s:
        if char not in mastermap:
            return False
        trialmap[char] = trialmap.get(char, 0) + 1
        # more incidences than in the master: cannot be a match
        if trialmap[char] > mastermap[char]:
            return False
    return trialmap

def dicttostring(counts):
    if counts is False:
        return False
    # sort the letters so that equal counts always give the same string
    return "".join(char + str(counts[char]) for char in sorted(counts))

def testtrial(s, master, mastermap, masterhashstring):
    if len(master) != len(s):
        return False
    trialhashstring = dicttostring(dicttrial(s, mastermap))
    return trialhashstring is not False and trialhashstring == masterhashstring

mastermap = dictmaster(master)
masterhashstring = dicttostring(mastermap)
for word in wordset:
    if testtrial(word, master, mastermap, masterhashstring):
        print(word + "\n")
