associate multiple strings to only one - ruby

I'm trying to make an algorithm that easily simplifies and groups synonyms (with mismatches, capitals, acronims, etc) into only one. I supose there should exist a standard way to build such a structure that, looking for a string with possible mismatches, if the string exists in the structure, it returns a normalized string key. In short, sometimes the same concept could be written in several ways, but I only want to keep the concept.
For instance: Supose I want to normalize or simplify the appearances of
"General Director", "General Manager", "G, Dtor", "Gen Dir", ...
into
"GEN_DIR"
and keep only this result for further reference.
By the way, I suppose that building a Hash with key/value pairs like
hash["General Director"]="GEN_DIR"
hash["General Manager"]="GEN_DIR"
hash["G, Dtor"]="GEN_DIR"
hash["G, Dir"]="GEN_DIR"
could be a solution, but I suspect that there are more elegant or adequate solutions to that.
I would also need the way to persist this associative structure easily without any database because it should grow as I find more mismatches of the same word or sentence. A possible approach I think is to define this structure by means of a DSL, but I'm open to suggestions.

Well, there is no rule, at least a clear one.
My aim is to scrap from web some "structured" data that sometimes is incorrectly or incompletely typed. Some fields are descriptions and can be left as is. But some fields are suposedly to be "sets" but aren's correctly typed (as in my example). As a human can read that, he immediatelly knows what it means and can associate that with its meaning.
But I would like to automate as much as possible the process of reducing those possible mismatches to only one "string" (or symbol) before, for instance, saving it into a database. So, what I would need is a kindof hash or dictionary, as sawa correctly stated, that I can use to lookup any of such dirty strings to get the normalized string or symbol.
Also, of course, it would be desirable a way to make this hash (or whatelse it could be) to learn from new mismatches in some way and add a new association automatically (possibly it could be based on a distance measure between mismatched string and normalized string that, if lower than X, a new association is built). The whole association (i.e, hash) should grow as new mismatches and concepts arise and, though, it should be kept anywhere (possibly in an xml file, or something like what Mori answered below) for future uses.
Any new Idea?

Related

What is the essential difference between Document and Collectiction in YAML syntax?

Warning: This question is a more philosophical question than practical, but I find it well as to be asked and answered in practical contexts (forums like StackOverflow here, instead of the SoftwareEngineering stack-exchange website), due to the native development in the actual use de-facto of YAML and the way the way it's specification has evolved and features have been added to it over time. Let's ask:
As opposed to formats/languages/protocols such as JSON, the YAML format allows you (according to this link, that seems pretty official, or at least accurate and reliable source to understand the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marking ("---").
If so, it's hard to ignore the fact that the concept/model/idea of 'Document' in YAML, is no longer an external definition, or "meta"-directive that helps the human/parser to organize multiple/distincted documents along each other (similar to the way file-systems defining the concept of "file" to organize different files, but each file in itself - does not necessarily recognize that it's a file, or that it's being part of a file system that wraps it, by definition, AFAIK.
However, when YAML allows for a multi-Document YAML files, that gather collections of Documents in a single YAML file (and perhaps in a way that is similar/analogous to HTTP Pipelining approach of HTTP protocol), the concept/model/idea/goal of Document receives a new, wider definition/character de-facto, as a part of the YAML grammar and it's produces, and not just of the YAML specification as an assistive concept or format description that helps to describe the specification.
If so, being a Document part of the language itself, what is the added value of this data-structure, compared to the existing, familiar and well-used good old data-structure of Collection (array of items)?
I'm asking it, because I've seen in this link (here) some snippet (in the second example), which describes a YAML sequence that is actually a collection of logs. For some reason, the author of the example, chose to prefer to present each log as a separate "Document" (separated with three-dashes), gathered together in the same YAML sequence/file, instead of writing a file that has a "Collection" of logs represented with the data-type of array. Why did he choose to do this? Is his choice fit, correct, ideal?
I can speculate that the added value of the distinction between a Document and a Collection become relevant when using more advanced features of the YAML grammar, such as Anchors, Tags, References. I guess every Document provide a guarantee that all these identifiers will be a unique set, and there is no collision or duplicates among them. Am I right? And if so, is this the only advantage, or maybe there are any more justifications for the existence of these two pretty-similar data structures?
My best for now, is to see Document as a "meta"-Collection, that is more strict, and lack of high-level logic, or as two different layers of collection schemes. Is it correct, accurate way of view?
And even if I am right, why in the above example (of the logs document from the link), when there's no use and not imply or expected to use duplications or collisions or even identifiers/anchors or compound structures at all - the author is still choosing to represent the collection's items as separate documents? Is this just not so successful selection of an example? Or maybe I'm missing something, and this is a redundancy in the specification, or an evolving syntactic-sugar due to practical needs?
Because the example was written on a website that looks serious with official information written by professionals who dealt with the essence of the language and its definition, theory and philosophy behind (as opposed to practical uses in the wild), and also in light of other provided examples I have seen in it and the added value of them being meticulous, I prefer not to assume that the example is just simply imperfect/meticulous/fit, and that there may be a good reason to choose to write it this way over another, in the specific case exampled.
First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: Scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars which get ! instead. These non-specific tags are later resolved to specific tags, thereby defining to which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class
public class Config {
String name;
int count;
}
Assume the YAML is
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that not the content of the node defines how it will be deserialized, but the path to the node.
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.)
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceeding or succeeding documents. Consequentially, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design primary focuses on sending and receiving data over networks. Of course, nowadays, YAML is primarily used as configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about end-cases like a string-tagged Document starts with a folded-string, making even its following "---" and "..." just a characters of the global string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.

Most elegant way to represent lots of fields

So, I have this list of attributes that I would like to represent in my Go code. That is, I'm looking for some data structure that can convey all these.
Mind like a diamond
Knows what's best
Shoes that cut
Eyes that burn like cigarettes
The right allocations
Fast
Thorough
Sharp as a tack
Playing with her jewellry
Putting up her hair
Touring the facilities
Picking up slack
Short skirt
Long jacket
Gets up early
Stays up late
Uninterrupted prosperity
Uses a machete to cut through red tape
Fingernails that shine like justice
Voice that is dark like tinted glass
Smooth liquidation
Right dividends
Wants a car with a cupholder arm rest
Wants a car that will get her there
Changing her name from Kitty to Karen
Trading her MG for a white Chrysler LeBaron
They're not going to change any time soon, but I will want to do the following things with them:
Read them from and write them to a JSON file.
Associate a boolean with each attribute (bonus points for a way to associate any one arbitrary type).
Easily display these strings to a human.
Easily count how many are "set" (That is, a function that, when called on someone who gets up early, has the right dividends, and is sharp as a tack, but nothing else, returns 3).
Should be as strongly typed as possible (It should be impossible, or at least difficult, to ask for an attribute that doesn't exist. You should be able to see all the attributes by looking at the code).
Preserving this order would be a plus, but not strictly necessary.
The approaches I've thought of so far are:
A struct with all of these as fields, but the count function is inelegant to write, and the 'nice' string name of the fields has to be stored elsewhere. Even if that can be solved with tags, the count problem remains.
A map with constant keys (or an array with enumerated values). Counting is a simple loop, and the keys can be the nice string names. However, this creates the problem of asking for keys that don't exist, and you have to hide the dictionary behind a function to stop users from trying to add new keys.
Is there something I'm missing here, or will I have to compromise?
I'd use a uint64 as a bitset (or an actual bitset type, of which none are built in but some are open source). It meets your requirements in the following ways:
JSON storage as "010101" string or base64 (11 chars).
Easy display: just define an array of the strings in the same order they are stored in the bitset.
Easily count how many are "set" - use "popcount" (open source implementations are easy to come by).
Order is naturally preserved--the first attribute will always be bit 0, etc.
If you have more than 64 attributes, you can use math/big.
You can use map[int]interface{} to store the value of any length inside the interface with integer as key or you can also use slice of interface for looping inside the data like this []interface{}. Later on you can easily type assert the interface to get underlying value which is string in your case.
Since it will be slice of interface or map of interface you can easily marshal it to convert it into json and save into file which can be read or write easily.

Is it possible to get items from DynamoDB where the primary key ends with a given string?

Is it possible, using the AWS Ruby SDK (or just DynamoDB in general), to get an item or items from a table that uses a primary key only, and where that primary key ends with a certain string?
I haven't come across anything in the docs that explicitly answers this question, either in the ruby ddb docs or the general docs for ddb. I'm not saying the question is not answered, but if it is, I can't find it.
If it is possible, could someone provide an example for ruby or link to the docs where an example exists?
Although #Ryan is correct and this can be done with query, just bear in mind that you're doing a "full-table-scan" here. That might be OK for a one-time job but probably not the best practice for a routine task (and of course not as a part of your API calls).
If your use-case involves quickly finding objects based on their suffix in a specific field, consider extracting this suffix (assuming it's a fixed-size suffix) as another field and have a secondary index on that one. If you want to query arbitrary length suffixes, I would create a lookup table and update it with possible suffixes (or some of them, to save some calls, and then filter when querying).
It looks like you would want to use the Query method on the SDK to find the items your looking for. It seems that "EndsWith" is not available as a comparison operator in the SDK though. So you would need to use CONTAINS and then check your results locally.
This should lead to the best performance, letting DynamoDb do the initial heavy lifting and then further pruning the results once you receive them.
http://docs.aws.amazon.com/sdkforruby/api/Aws/DynamoDB/Client.html#query-instance_method

Can Groups be used to emulate the "class" or "struct" data structures from other languages

Is there a data structure within LiveCode that can be used as a "holder" for associated data, letting me handle it collectively? I come from a Java / Javascript / C background so I am looking for a Class or Struct sort of data structure.
I've found examples of Groups, which seem to have some of this functionality, but it feels a bit like I'm bending the language to meet my needs.
As a specific example, suppose I had an image field on my screen that would randomly display an image and, when pressed, play an associated sound clip. I'd expect to create a list of "structures" that contained the path to the image and the path to the associated sound clip, and use that data to populate the image field and to decide what sound clip to play.
Would a Group be the correct structure to use in this case? Or am I approaching this in a way that isn't really fitting with the way LiveCode works?
It takes a little getting used to, but the xTalk world is much simpler and more open than any ordinary procedural language. So much of what you once had to manage is no longer required.
So when splash21 said that you could store all your image and sound references in a custom property, he was really saying that the LiveCode environment contains intrinsic, high level functionality that makes these sorts of things instantly accessible, and the only thing required of you is to call for them, and they simply work.
The only way to appreciate this is to make a few simple programs, to really see what is possible. Make your application. Everything you mentioned can be accomplished with perhaps a dozen lines of code in a single handler. I recommend that you join the LiveCode use list and forums. The community is vibrant and eager to help, frequently with full blown solutions to specific problems, but more importantly, as guides and mentors to new users
Craig Newman
Arrays in LiveCode are actually associative arrays (like hash maps). A key is associated with a value. The value might be as well an array.
Chapter 5.5.7 of the User's Guide says
Array elements may contain nested or sub-elements, making them multi-dimensional.
This type of array is ideal for processing hierarchical data structures such as trees or
XML. To access a sub-element, simply declare it using an additional set of square
brackets.
put "ABC" into myVariable["myKeyName"][“aChildElement”]
see also
How to store pictures in a stack?
Dave- I'm hoping to get a struct-like container implemented in the near future. Meanwhile you can, as splash21 mentioned, use custom properties (or better yet, custom property sets) to do what you want. This will give you a pseudo-struct for each object and you can implement the file and sound specifications into the properties. And if you use that in conjunction with a behavior object you'll end up very close to a real inheritable class formation.

How to uniquly identify an two objects in same page having same url

I Have two objects in same page but with different locations(tabs), I want to verify those objects each a part ...
i cant uniquely any of objects because the have same properties.
These objects clearly are unique to a point because they have completely different text, this means that you will be able to create an object to match only one of them. My suggestion would be to look for the object by using its text property, one of them will always have "Top Ranking" the other you wil need to turn into a regular expression for the text and will be something "Participants (\d+)".
I am assuming that this next answer is unlikely to be possible so saved it for after the answer you are likely to use but the best solution would of course be to get someone with access to give these elements ids for you to search for. This will in the long term be much easier for you to maintain and not using text will allow this test to run in any language.
Manaysah, do these objects have different indexes? Use the object spy and determine which index they have, the ordinal identifier index may be a solution to your problem. You could also try adding an innertext object property if possible, using a wildcard for the number inside the () as it appears dynamic.
try using xpath for the objects...xpath will definitely be different

Resources