Find a Global Atom from a partial string - winapi

I can create a global atom using GlobalAddAtom, and I can find that atom again using GlobalFindAtom if I already know the string associated with it. But is there a way to find all atoms whose associated string matches a given partial string?
For example, let's say I have an atom whose string is "Hello, World!" How can I later find that atom by searching for just "Hello"?

Unfortunately, the behavior you're describing is not possible with atom tables. Atom tables in Windows are essentially hash tables, and the mapping process handles strings in their entirety, not by parts.
Of course, it almost sounds like it would be possible, as quoted from the MSDN documentation:
Applications can also use local atom tables to save time when searching for a particular string. To perform a search, an application need only place the search string in the atom table and compare the resulting atom with the atoms in the relevant structures. Comparing atoms is typically faster than comparing strings.
However, they are referring to exact matches. This limitation probably seems dated compared to the resources available to software today. But atoms have been around as far back as Win16, and in those days this facility gave applications a way to manage string data effectively in minimal memory. Atoms are still used now to manage window class names, and they still provide decent benefits by reducing the footprint of multiple stored copies of strings.
If you need to store string data efficiently and to be able to scan by partial starting matches, a Suffix Tree is likely to meet or exceed your needs.
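For what it's worth, a simpler cousin of that idea is to keep the strings in a sorted structure, where a "starts with" query becomes a range scan. A minimal sketch in Java (nothing to do with atom tables; all names here are mine):

import java.util.TreeSet;

class PrefixIndex {
    private final TreeSet<String> strings = new TreeSet<>();

    void add(String s) { strings.add(s); }

    // Everything in [prefix, prefix + '\uFFFF') starts with the prefix.
    Iterable<String> startingWith(String prefix) {
        return strings.subSet(prefix, prefix + Character.MAX_VALUE);
    }
}

Here startingWith("Hello") would return "Hello, World!" among its results.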

It actually can be done, but only by scanning them all. In LINQPad 5 this completes in 0.025 seconds on my machine, so it is quite fast. Here is an example implementation:
void Main()
{
    const string atomPrefix = "Hello";
    const int bufferSize = 1024;
    // String atoms live in the range 0xC000-0xFFFF; lower values are integer atoms.
    const ushort smallestAtomIndex = 0xC000;
    var buffer = new StringBuilder(bufferSize);
    var results = new List<string>();
    for (ushort atomIndex = smallestAtomIndex; atomIndex < ushort.MaxValue; atomIndex++)
    {
        var resultLength = GlobalGetAtomName(atomIndex, buffer, bufferSize);
        // A zero result means there is no atom with this value.
        if (resultLength != 0 && buffer.ToString().StartsWith(atomPrefix))
        {
            results.Add($"{buffer} - {atomIndex}");
        }
        buffer.Clear();
    }
    results.Dump();
}

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
public static extern uint GlobalGetAtomName(ushort atom, StringBuilder buffer, int size);

Related

What's the best way to compress multiple values into a deserializable value?

I'm implementing an openpeeps.com library for Flutter, in which users can create their own peeps to use as avatars within our product.
One of the reasons behind using peeps as avatar is that (in theory) it can be easily stored as a single value within a database.
A Peep within my library consists of up to 6 PeepAtoms:
class Peep {
  final PeepAtom head;
  final PeepAtom face;
  final PeepAtom facialHair;
  final PeepAtom? accessories;
  final PeepAtom? body;
  final PeepAtom? pose;
}
A PeepAtom is currently just a name identifying the underlying image file required to build a Peep:
class PeepAtom {
  final String name;
}
How to get a hash?
What I'd like to do now is get a single value from a Peep (an int or a string) which I can store in a database. When I retrieve the data, I'd like to deconstruct the value into the unique atoms so I can render the appropriate atom images to display the Peep. While I'm not really looking to optimize for storage size, it would be nice if the byte size were small.
Since I don't normally work with this kind of thing, I don't know what the best option is. These are my (naïve) ideas:
do a Peep.toJson and convert the output to base64. Likely inefficient due to a bunch of unnecessary characters.
do a PeepAtom.hashCode for each field within a Peep and upload those. As an array that would be 64 bits = 8 bytes * 6 (atoms). That's pretty OK, but it's not a single value.
since there are only a limited number of atoms in each category (fewer than 100), I could use bit shifts and ^ to put this into one int. However, I think this would not really work, because I'd need unique identifiers, and since I'm code-generating the PeepAtoms, that would likely be quite complex.
Any better ideas/algorithms?
I'm not sure what you mean by "quite complex". It looks quite simple to pack your atoms into a double.
Note that this is in no way a "hash". A hash is a lossy operation. I presume that you want to recover the original data.
Based on your description, you need seven bits for each atom: they can range over 0..98 (since you said "less than 100"). A double has 52 bits of mantissa. Your six atoms need 42 bits, so they fit easily. For atoms that can be null, just assign a special unused 7-bit value, like 127.
Now just use multiply and add to combine them. Use modulo and divide to pull them back out. E.g.:
// head, face, ... here stand for the 7-bit indices (0..127), not the atom objects
double val = head;
val = val * 128 + face;
val = val * 128 + facialHair;
...
To extract:
int pose = (val % 128).toInt();
val = (val / 128).floorToDouble();
int body = (val % 128).toInt();
val = (val / 128).floorToDouble();
...
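For completeness, here is the same multiply/divide scheme as a self-contained sketch. The question is Dart, but the arithmetic carries over unchanged; I've used Java and a long here (42 bits fit a long just as easily as a double's mantissa). The 7-bit width and the 127 null sentinel are taken from the answer above:

class PeepCodec {
    static final int NULL_ATOM = 127; // sentinel for the nullable atoms

    // atoms must hold six values in 0..127,
    // e.g. {head, face, facialHair, accessories, body, pose}
    static long pack(int[] atoms) {
        long val = 0;
        for (int a : atoms) val = val * 128 + a;
        return val;
    }

    static int[] unpack(long val) {
        int[] atoms = new int[6];
        for (int i = 5; i >= 0; i--) { // the last value packed comes out first
            atoms[i] = (int) (val % 128);
            val /= 128;
        }
        return atoms;
    }
}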

Algorithm - Implement two functions that assign/release unique id's from a pool

I am trying to find a good solution for this question -
Implement two functions that assign/release unique id's from a pool. Memory usage should be minimized and the assign/release should be fast, even under high contention.
alloc() returns available ID
release(id) releases previously assigned ID
The first thought was to maintain a map of IDs and their availability (as a boolean). Something like this:
Map<Integer, Boolean> availabilityMap = new HashMap<>();

public Integer alloc() {
    for (Map.Entry<Integer, Boolean> es : availabilityMap.entrySet()) {
        if (!es.getValue()) {             // false = not yet assigned
            availabilityMap.put(es.getKey(), true);
            return es.getKey();
        }
    }
    return null;                          // pool exhausted
}

public void release(Integer id) {
    availabilityMap.put(id, false);
}
However, this is not ideal for multiple threads, and it does not satisfy "Memory usage should be minimized and the assign/release should be fast, even under high contention."
What would be a good way to optimize both memory usage and speed?
For memory usage, I think the map should be replaced with some other data structure, but I am not sure which one. Something like a bitmap or bit set? How can I maintain IDs and availability in that case?
For concurrency I will have to use locks, but I am not sure how to effectively handle contention. Maybe put the available IDs in separate chunks so that each of them can be accessed independently? Any good suggestions?
First of all, you do not want to run over the entire map in order to find an available ID.
Instead, you can maintain two sets of IDs: the first for available IDs, and the second for allocated IDs.
That makes allocation/release pretty easy and fast.
Also, you can use ConcurrentMap-backed sets for both containers; it will reduce the contention.
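One way to read that suggestion, sketched with sets backed by ConcurrentHashMap (class and field names are mine, not from the answer):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class TwoSetIdPool {
    private final Set<Integer> free = ConcurrentHashMap.newKeySet();
    private final Set<Integer> used = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger(1);

    int alloc() {
        for (Integer id : free) {        // try to reuse any free ID
            if (free.remove(id)) {       // remove() decides the race atomically
                used.add(id);
                return id;
            }
        }
        int id = next.getAndIncrement(); // otherwise mint a fresh one
        used.add(id);
        return id;
    }

    void release(int id) {
        if (used.remove(id)) free.add(id);
    }
}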
Edit: Changed bottom sentinel, fixed a bug
First, don't iterate the entire map to find an available ID. You should only need constant time to do it.
What you could do to make it fast is to do this:
Create an int index = 1; for your counter. This is technically the number of IDs generated + 1, and it is always > 0.
Create an ArrayDeque<Integer> free = new ArrayDeque<>(); to hold the free IDs. Guaranteed constant-time access.
When you allocate an ID, if the free ID deque is empty, just return the counter and increment it (i.e. return index++;). Otherwise, pop its head and return that.
When you release an ID, push it onto the free deque.
Remember to synchronize your methods.
This guarantees O(1) allocation and release, and it also keeps memory allocation quite low (literally one boxed Integer per release). Although it's synchronized, it's fast enough that it shouldn't be a problem.
An implementation might look like this:
import java.util.ArrayDeque;

public class IDPool {
    int index = 1;
    ArrayDeque<Integer> free = new ArrayDeque<>();

    public synchronized int acquire() {
        if (free.isEmpty()) return index++;   // no released IDs: mint a fresh one
        else return free.pop();               // otherwise reuse the most recently freed
    }

    public synchronized void release(int id) {
        free.push(id);
    }
}
Additionally, if you want to ensure the free ID list is unique (as you should for anything important) as well as persistent, you can do the following:
Use a HashMap<Integer id, Integer prev> to hold all generated IDs. Remember, it doesn't need to be ordered or even iterated.
This is technically going to be a stack encoded inside a hash map.
Highly efficient implementations of this exist.
In reality, any unordered int -> int map will do here.
Track the top ID of the free list. Reserve 1 as the "list is empty" sentinel and 0 as the "in use" marker; real IDs start at 2, so neither collides, and you never need a boxed null. Initially, this is just int top = 1;
When allocating an ID, if there are free IDs (i.e. top >= 2), do the following:
Set the new top to the old top's value in the free map.
Set the old top's value in the map to 0, marking it used.
Return the old top.
When releasing an old ID, do this instead:
If the ID is already in the free list, return early, so we don't corrupt it.
Set the ID's value in the map to the old top.
Set the new top to the ID, since it's now the head of the free list.
The optimized implementation would end up looking like this:
import java.util.HashMap;

public class IDPool {
    int index = 2;                    // next never-used ID (IDs start at 2)
    int top = 1;                      // 1 = "free list is empty" sentinel
    HashMap<Integer, Integer> pool = new HashMap<>();

    public synchronized int acquire() {
        int id = top;
        if (id == 1) return index++;  // free list empty: mint a fresh ID
        top = pool.replace(id, 0);    // pop the list; 0 marks "in use"
        return id;
    }

    public synchronized void release(int id) {
        if (pool.getOrDefault(id, 0) != 0) return; // already free: don't corrupt the list
        pool.put(id, top);
        top = id;
    }
}
If need be, you could use a growable integer array instead of the hash map (its storage is contiguous) and realize significant performance gains. Matter of fact, that is how I'd likely implement it. It'd just require a minor amount of bit twiddling, because I'd keep the array's size rounded up to the next power of 2.
Yeah... I had to write a similar pool in JavaScript because I needed moderately fast IDs in Node.js for potentially high-frequency, long-lived IPC communication.
The good thing about this is that it generally avoids allocations (worst case being once per acquired ID when none are released), and it's very amenable to later optimization where necessary.
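A sketch of that array-backed variant, under the same conventions as the map version above (0 marks "in use", 1 is the empty sentinel; the growth policy and names are mine):

import java.util.Arrays;

public class ArrayIDPool {
    private int index = 2;             // next never-used ID
    private int top = 1;               // 1 = "free list is empty" sentinel
    private int[] pool = new int[16];  // pool[id] = next free ID, or 0 if in use

    public synchronized int acquire() {
        int id = top;
        if (id == 1) {
            ensureCapacity(index + 1);
            pool[index] = 0;           // mark used
            return index++;
        }
        top = pool[id];
        pool[id] = 0;                  // mark used
        return id;
    }

    public synchronized void release(int id) {
        if (id < 2 || id >= index || pool[id] != 0) return; // unknown or already free
        pool[id] = top;
        top = id;
    }

    private void ensureCapacity(int needed) {
        if (needed <= pool.length) return;
        int size = pool.length;
        while (size < needed) size <<= 1; // keep the size a power of two
        pool = Arrays.copyOf(pool, size);
    }
}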

Data Structure for phone book [duplicate]

Possible Duplicate:
storing 1 million phone numbers
How do you design a data structure for a phone address book with 3 fields:
name, phone number, address?
One must be able to search this phone book on any of the 3 fields.
A hash table wouldn't work, because all three fields would have to hash to the same value, which I think is impossible. I thought about tries and other data structures too, but couldn't come up with a proper answer.
You should use a trie data structure for implementing the phone book. A trie is an ordered tree data structure that uses strings as keys. Unlike a binary tree, a node in a trie does not store its own key; its position in the tree defines the key associated with it.
You could accomplish this with a single hash table or other type of associative array (if you wanted to). For each person, just put three keys in the table (name, address, phone), all pointing to the same record.
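That idea in Java, as a sketch (the record and names are mine):

import java.util.HashMap;
import java.util.Map;

class PhoneBookDemo {
    record Entry(String name, String phone, String address) {}

    public static void main(String[] args) {
        // One table, three keys per record, all pointing at the same object.
        Map<String, Entry> book = new HashMap<>();
        Entry e = new Entry("Alice", "555-0100", "1 Main St");
        book.put(e.name(), e);
        book.put(e.phone(), e);
        book.put(e.address(), e);
        System.out.println(book.get("555-0100") == book.get("Alice")); // true
    }
}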
I think a combination of a trie (each phone book entry is one leaf) and two skip lists (one for each name and address) could turn out to be effective.
Just assign each node one set of pointers to move along the name axis, and one set of pointers to move along the address axis (that is, to traverse the skip lists).
You can't exactly sort something three ways at the same time, nor can you feasibly build a single hash table that allows lookup with only a third of the key.
What you probably want to do is basically what databases do:
Store one (possibly unsorted) master list of all your records.
For each column you want to be able to search on, build some kind of lookup structure which returns a pointer/index into the master list.
So, for example, you build a flat array of {name, phone, address} structs in whatever order you want, and then for each row, put a (phone -> row#) mapping into a hash table. Non-unique columns could hash to a list of row numbers, or you could put them in a binary tree where duplicate keys aren't an issue.
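A sketch of that layout in Java (names are mine; each index maps a column value to row numbers, with lists because some columns can repeat):

import java.util.*;

class IndexedPhoneBook {
    record Row(String name, String phone, String address) {}

    private final List<Row> rows = new ArrayList<>();                 // master list
    private final Map<String, List<Integer>> byName = new HashMap<>();
    private final Map<String, List<Integer>> byPhone = new HashMap<>();
    private final Map<String, List<Integer>> byAddress = new HashMap<>();

    void add(Row r) {
        int rowNum = rows.size();
        rows.add(r);
        byName.computeIfAbsent(r.name(), k -> new ArrayList<>()).add(rowNum);
        byPhone.computeIfAbsent(r.phone(), k -> new ArrayList<>()).add(rowNum);
        byAddress.computeIfAbsent(r.address(), k -> new ArrayList<>()).add(rowNum);
    }

    List<Row> findByName(String name) {
        List<Row> out = new ArrayList<>();
        for (int i : byName.getOrDefault(name, List.of())) out.add(rows.get(i));
        return out;
    }
    // findByPhone / findByAddress are identical, just against the other indexes.
}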
As far as space requirements go, you basically end up storing every element twice, so your space requirement will at least double. On top of that, you've got the overhead of the data structures themselves; keeping three hash tables loaded at ~70% capacity, your storage requirements increase by at least 2.4 times.
You can do away with one of these auxiliary lookup structures by keeping your main table sorted on one of the columns, so you can search on it directly in O(logN). However, this makes inserting/deleting rows very expensive (O(N)), but if your data is fairly static, this isn't much of an issue. And if this is the case, sorted arrays would be the most space-efficient choice for your auxiliary lookups as well.
In a phone book, the telephone number should be unique and the address is unique, but names can be duplicated.
So perhaps you can use a hash table combined with a linked list to approach this.
You can use any one (or a combination) of name, address, and phone number as the hash key. If you simply use the name as the hash key, then a linked list is needed to store the duplicated entries.
In this approach, a search based on the hash key is O(1), but a search based on either of the other two fields is O(n).
C, C++, or C#?
Use a list of classes:
public class PhoneBook
{
    public string name;
    public string phoneNumber;
    public string address;
}
place this in a list and you have a phone book
In C, I think a struct is the best option.
#include <stdlib.h> /* malloc */
#include <string.h> /* strcmp, used below */

typedef struct _Contact Contact;
struct _Contact
{
    char* name;
    char* number;
    char* address;
};

Contact* add_new_contact( char* name, char* number, char* address )
{
    Contact* c = (Contact*) malloc( sizeof( Contact ) );
    c->name = name;
    c->number = number;
    c->address = address;
    return c;
}

Contact* phone_book[ 20 ]; /* An array of Contacts */
Use the standard string functions (<string.h>, or if using a C++ compiler, <cstring>), or something like GLib, for searching the names, numbers, etc.
Here's a simple example:
/* Note: an array parameter decays to a pointer, so sizeof(phone_book) would
   not give the number of entries here; pass the count explicitly. */
Contact* search_for_number( Contact* phone_book[], size_t count, const char* number )
{
    size_t i;
    for( i = 0; i < count; i++ )
    {
        if ( strcmp( phone_book[i]->number, number ) == 0 ) return phone_book[i];
    }
    return NULL;
}
Alternatively
You may be able to use linked lists, but since the C standard library doesn't provide a linked list, you'd either need to implement one yourself or use a third-party library.
I suggest using the GList doubly linked list in GLib.

Efficient data structure/algorithm for transliteration based word lookup

I'm looking for an efficient data structure/algorithm for storing and searching transliteration-based word lookups (like Google does: http://www.google.com/transliterate/, though I'm not trying to use the Google transliteration API). Unfortunately, the natural language I'm working on doesn't have any soundex implemented, so I'm on my own.
For an open source project, I'm currently using plain arrays to store the word list and dynamically generating regular expressions (based on user input) to match them. It works fine, but regular expressions are more powerful, and more resource-intensive, than I need. For example, I'm afraid this solution will drain too much battery if I try to port it to handheld devices, as searching over thousands of words with regular expressions is very costly.
There must be a better way to accomplish this for complex languages; how does the Pinyin input method work, for example? Any suggestion on where to start?
Thanks in advance.
Edit: If I understand correctly, this is what's suggested by @Dialecticus:
I want to transliterate from Language1, which has 3 characters a, b, c, to Language2, which has 6 characters p, q, r, x, y, z. Because the two languages differ in the number of characters they possess, and in their phones, it is often not possible to define a one-to-one mapping.
Let's assume, phonetically, that this is our associative array/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a valid word list in plain arrays for Language2:
...
px
qy
...
If the user types ac, the possible combinations become px, py, pz, qx, qy, qz after transliteration step 1. In step 2 we have to do another search in the valid word list and eliminate all of them except px and qy.
What I'm doing currently is not that different from the above approach. Instead of making possible combinations using the transliteration table, I'm building a regular expression [pq][xyz] and matching that with my valid word list, which provides the output px and qy.
I'm eager to know if there is any better method than that.
From what I understand, you have an input string S in an alphabet (let's call it A1) and you want to convert it to the string S' which is its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1, S'2, ..., S'n] of output strings which might potentially be equivalent to S.
One approach that comes to mind is, for each word in the list of valid words in A2, to generate the list of strings in A1 that map to it. Using the example in your edit, we have
px->ac
qy->ac
pr->ab
(I have added an extra valid word pr for clarity)
Now that we know what possible series of input symbols will always map to a valid word, we can use our table to build a Trie.
Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the Trie to the current node.
Thus for our example (b leads to the word spelled ab, i.e. pr; c leads to the words spelled ac, i.e. px and qy), the Trie would look something like this:

Root (empty)
    |
    | a
    V
+---Node (empty)---+
| b                | c
|                  |
V                  V
Node (pr)       Node (px,qy)
Starting at the root node, as symbols are consumed transitions are made from the current node to its child marked with the symbol consumed until we have read the entire string. If at any point no transition is defined for that symbol, the entered string does not exist in our trie and thus does not map to a valid word in our target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.
Apart from the initial cost of building the trie (the trie can be shipped pre-built if we never want the list of valid words to change), this takes O(n) on the length of the input to find a list of mapping valid words.
Using a Trie also provides the advantage that you can use it to find the list of all valid words that can be generated by adding more symbols to the end of the input - i.e. a prefix match. For example, if fed the input symbol 'a', we can use the trie to find all valid words that can begin with 'a' ('px', 'qy', 'pr'). But doing that is not as fast as finding the exact match.
Here's a quick hack at a solution (in Java):
import java.util.*;

class TrieNode{
    // child nodes - size of array depends on your alphabet size,
    // here we are only using the lowercase English characters 'a'-'z'
    TrieNode[] next = new TrieNode[26];
    List<String> words;

    public TrieNode(){
        words = new ArrayList<String>();
    }
}

class Trie{
    private TrieNode root = null;

    public void addWord(String sourceLanguage, String targetLanguage){
        root = add(root, sourceLanguage.toCharArray(), 0, targetLanguage);
    }

    private static int convertToIndex(char c){ // you need to change this for your alphabet
        return (c - 'a');
    }

    private TrieNode add(TrieNode cur, char[] s, int pos, String targ){
        if (cur == null){
            cur = new TrieNode();
        }
        if (s.length == pos){
            cur.words.add(targ);
        }
        else{
            cur.next[convertToIndex(s[pos])] = add(cur.next[convertToIndex(s[pos])], s, pos + 1, targ);
        }
        return cur;
    }

    public List<String> findMatches(String text){
        return find(root, text.toCharArray(), 0);
    }

    private List<String> find(TrieNode cur, char[] s, int pos){
        if (cur == null) return new ArrayList<String>();
        else if (pos == s.length){
            return cur.words;
        }
        else{
            return find(cur.next[convertToIndex(s[pos])], s, pos + 1);
        }
    }
}

class MyMiniTransliterator{
    public static void main(String args[]){
        Trie t = new Trie();
        t.addWord("ac", "px");
        t.addWord("ac", "qy");
        t.addWord("ab", "pr");
        System.out.println(t.findMatches("ac")); // prints [px, qy]
        System.out.println(t.findMatches("ab")); // prints [pr]
        System.out.println(t.findMatches("ba")); // prints empty list since this does not match anything
    }
}
This is a very simple trie, with no compression or speedups, and it only works on lowercase English characters for the input language. But it can easily be modified for other character sets.
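The prefix matching mentioned a few paragraphs up isn't implemented in this hack; one possible extension of the Trie class (method names are mine) walks to the node for the symbols typed so far and then collects the words of every node in that subtree:

public List<String> findPrefixMatches(String text) {
    List<String> out = new ArrayList<String>();
    TrieNode cur = root;
    for (char c : text.toCharArray()) {
        if (cur == null) return out;       // prefix not present at all
        cur = cur.next[convertToIndex(c)];
    }
    collect(cur, out);
    return out;
}

private void collect(TrieNode cur, List<String> out) {
    if (cur == null) return;
    out.addAll(cur.words);                 // words whose source spelling ends here
    for (TrieNode child : cur.next) collect(child, out);
}

With the words from main, findPrefixMatches("a") returns [pr, px, qy].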
I would build the transliterated sentence one symbol at a time, instead of one word at a time. For most languages it is possible to transliterate every symbol independently of the other symbols in the word. You can still have exceptions, as whole words that have to be transliterated as complete words, but a transliteration table of symbols plus exceptions will surely be smaller than a transliteration table of all existing words.
The best structure for the transliteration table is some sort of associative array, probably utilizing hash tables. In C++ there's std::unordered_map, and in C# you would use a Dictionary.
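In Java, the equivalent is a HashMap. A minimal per-symbol expansion sketch, using the a/b/c table from the question rather than real language data:

import java.util.*;

class SymbolTransliterator {
    static final Map<Character, String[]> TABLE = Map.of(
            'a', new String[]{"p", "q"},
            'b', new String[]{"r"},
            'c', new String[]{"x", "y", "z"});

    // Expand the input one symbol at a time into every candidate spelling.
    static List<String> expand(String input) {
        List<String> results = new ArrayList<>(List.of(""));
        for (char c : input.toCharArray()) {
            List<String> next = new ArrayList<>();
            String[] options = TABLE.getOrDefault(c, new String[]{String.valueOf(c)});
            for (String prefix : results)
                for (String option : options)
                    next.add(prefix + option);
            results = next;
        }
        return results; // expand("ac") -> [px, py, pz, qx, qy, qz]
    }
}

The candidates would then be filtered against the valid word list, as in the question's step 2.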

Alternatives to traditional algorithms for encoding UUIDS (e.g. Base32, Base62)

We need to convert huge numbers of UUIDs into XML-compatible strings. If we use a Base32 algorithm (which maps each 5 bits to one of 32 characters) this leads to 26-char strings; if we use a Base62 algorithm (which iteratively divides the 128-bit integer by 62 and records the modulus as one of 62 characters) this leads to 22-char strings. While Base62 returns shorter strings, it is much more CPU-intensive, so we are stuck with Base32 (standard Base64 is not an option because of XML).
Do you know any other encoding algorithms that could help us here? Are there variants of Base32-like bit-pattern encoding algorithms that can be used with bases that are not powers of 2? Or are there hybrid algorithms which combine approaches of the first with approaches of the second algorithm? We would like to reduce the strings to fewer than 26 chars if possible.
You mentioned 62, which suggests that you are limiting your alphabet to A-Z (uppercase and lowercase) and the digits 0-9. Why not add another couple of XML-compatible characters to that list, such as +, ., ~ or !, to bring the number up to 64? You'll then be able to do bit-shifting rather than division, which should make the algorithm as fast as the Base32 one while reducing your string sizes.
Edit: Given the restriction that the characters must also be usable in other, as-yet-unspecified languages, you might care to escape some of your characters to represent options 63 and 64. If you use, for example, _ as an escape character, you could have _1 and _2 represent them. The figures in the original question say UUIDs are 128 bits, so this Base64 gives 22 characters when nothing is escaped and, with up to 4 items escaped, stays within your 26 characters.
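To make the bit-shifting idea concrete, here is a sketch that encodes a UUID's 128 bits with a custom 64-character alphabet (the last two characters, '.' and '~', are my choice of XML-safe fillers, not the answer's; escaping is not shown). 22 characters cover 132 bits, so the final character carries only 2 significant bits:

import java.util.UUID;

class Base64Uuid {
    private static final char[] ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.~".toCharArray();

    static String encode(UUID uuid) {
        long hi = uuid.getMostSignificantBits();
        long lo = uuid.getLeastSignificantBits();
        StringBuilder sb = new StringBuilder(22);
        for (int i = 0; i < 22; i++) {
            sb.append(ALPHABET[(int) (hi >>> 58)]); // take the top 6 bits
            hi = (hi << 6) | (lo >>> 58);           // shift the 128-bit pair left by 6
            lo <<= 6;
        }
        return sb.toString();
    }
}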
Wikipedia describes two versions of Base64 that are usable in XML names: http://en.wikipedia.org/wiki/Base64#XML. I wrote the following Java to produce URL-safe Base64 UUIDs (call theObjectReturned.toString() to get it as a GUID string).
I've seen other Base64 code for Java that is supposed to be very fast and could easily be modified to do the XML-safe variants:
http://iharder.sourceforge.net/current/java/base64/
The code follows; save it in a file called UUIDUtil.java.
import java.nio.ByteBuffer;
import java.util.Calendar;
import java.util.UUID;
import org.apache.commons.codec.binary.Base64; // Apache Commons Codec

public class UUIDUtil{
    public static UUID combUUID(){
        UUID srcUUID = UUID.randomUUID();
        java.sql.Timestamp ts = new java.sql.Timestamp(Calendar.getInstance().getTime().getTime());
        long upper16OfLowerUUID = UUIDUtil.zeroLower48BitsOfLong( srcUUID.getLeastSignificantBits() );
        long lower48Time = UUIDUtil.zeroUpper16BitsOfLong( ts.getTime() );
        long lowerLongForNewUUID = upper16OfLowerUUID | lower48Time;
        return new UUID( srcUUID.getMostSignificantBits(), lowerLongForNewUUID );
    }

    public static String base64URLSafeOfUUIDObject( UUID uuid ){
        byte[] bytes = ByteBuffer.allocate(16).putLong(0, uuid.getLeastSignificantBits()).putLong(8, uuid.getMostSignificantBits()).array();
        return Base64.encodeBase64URLSafeString( bytes );
    }

    public static String base64URLSafeOfUUIDString( String uuidString ){
        UUID uuid = UUID.fromString( uuidString );
        return UUIDUtil.base64URLSafeOfUUIDObject( uuid );
    }

    private static long zeroLower48BitsOfLong( long longVar ){
        long upper16BitMask = -281474976710656L; // 0xFFFF000000000000
        return longVar & upper16BitMask;
    }

    private static long zeroUpper16BitsOfLong( long longVar ){
        long lower48BitMask = 281474976710656L - 1L; // 0x0000FFFFFFFFFFFF
        return longVar & lower48BitMask;
    }
}
