Is there a better name for "Last In, Random Out?" - random

So we've all heard of LIFO, FIFO, FILO, and LILO. (Last in, first out, etc.) What about LIRO? Last in, random out. Is that a common collection management pattern?
This occurs more commonly in nature, right? For example: water homeostasis. If you drink too much water, the kidneys find a way to get rid of the excess. Except it's not the last water molecule that gets pushed out, or the first. Whatever water molecule happens to be passing through the kidneys at the time is excreted.
I've been thinking about how to articulate this better in the context of computing. Perhaps the analogy is irrelevant. In the case of water homeostasis, all molecules are exactly the same (i.e., the values are the same even if the referenced object is not).
Update:
Upon further discussion, a colleague recommended that "Any In, Random Out" would be more pertinent.
In some of the responses, it was suggested that the input has no effect on the output. I don't think this is entirely true. Consider the following collection:
[3, 7, 3, 7, 7]
Even if the ordinality is random, the output is not. For example, the collection could not yield 5, 8, or 3,000,000. The input not only affects the range of eligible output values, it could (as in my analogy) trigger the output.

LIFO, FIFO, FILO, and LILO are so named to describe the relationship between input and output.
In your case, there is no relationship between input and output, so copying that naming schema doesn't make sense.
Your output logic is simply called: Random.
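For what it's worth, here is a minimal Ruby sketch of a collection with "random out" behaviour, using the [3, 7, 3, 7, 7] example from the question. The method name take_random! is made up for illustration. It shows both points: the order of removal is random, but only values that were put in can ever come out.

```ruby
# Remove and return one element chosen uniformly at random (nil if empty).
# take_random! is a hypothetical helper name, not a standard method.
def take_random!(items, rng = Random.new)
  return nil if items.empty?
  index = rng.rand(items.size)
  # Swap the chosen element with the last one so removal is O(1).
  items[index], items[-1] = items[-1], items[index]
  items.pop
end

bag = [3, 7, 3, 7, 7]
drawn = Array.new(5) { take_random!(bag) }

# The order of `drawn` varies from run to run, but its contents do not.
puts drawn.sort.inspect   # prints [3, 3, 7, 7, 7]
```

Ruby arrays also offer sample (pick without removing) and shuffle, which may be enough if you never need to remove items one at a time.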


word2vec window size at sentence boundaries

I am using word2vec (and doc2vec) to get embeddings for sentences, but I want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'dates')) before it eventually gets to the correct ('apple', ('banana', 'carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than to 'zucchini',
since there are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled in inverse proportion to the distance from the target word to the context word. This also causes an issue, making nearby words seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
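To put rough numbers on that, here is a small Ruby sketch of the arithmetic. It assumes, as described above, that the effective window is drawn uniformly from 1 to the configured window, and it looks only at the worst case (the first or last word of a sentence, which needs the widest window to reach every other word). The method name is made up for illustration, and this is not gensim code.

```ruby
# Worst-case probability that one random draw of the effective window is too
# small to reach every other word in a sentence.
# Assumption: the effective window is uniform on 1..window.
def prob_window_truncates(sentence_length, window)
  # A window of (sentence_length - 1) reaches from one end of the sentence to
  # the other, so only draws in 1..(sentence_length - 2) leave words out.
  too_small = [sentence_length - 2, 0].max
  [Rational(too_small, window), Rational(1)].min
end

# A 26-word sentence with window = 1000:
puts prob_window_truncates(26, 1000).to_f        # prints 0.024
# The same sentence with a window a million times the sentence length:
puts prob_window_truncates(26, 26_000_000).to_f
```

So with window=1000 a few percent of visits would still favour nearer words, while the million-times-larger window makes that case rarer than one visit in a million.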
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs, but if for some reason it weren't, another way to essentially cancel out any nearness effects would be to ensure the word order within each of your corpus's items is reshuffled each time the item is accessed as training data. That ensures any nearness advantages will be mixed evenly across all words, especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even just a one-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)
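The interleaving itself is easy to reproduce. Below is a simplified Ruby sketch that uses single-level [position, site] identifiers (real Logoot identifiers are variable-length lists of such pairs). The inserted words and the positions picked by sites B and C are hypothetical, and show just one possible outcome of the random choice between (20,A) and (25,A).

```ruby
# Each entry is [identifier, word], with identifier = [position, site].
base   = [[[20, "A"], "I"], [[25, "A"], "conquered"]]

# Two disconnected sites each insert two words into the same gap.
site_b = [[[21, "B"], "came"], [[23, "B"], "home"]]
site_c = [[[22, "C"], "saw"],  [[24, "C"], "Rome"]]

# Merging is just sorting by identifier (position first, then site).
merged = (base + site_b + site_c).sort_by { |id, _word| id }
text = merged.map { |_id, word| word }.join(" ")

puts text   # prints I came saw home Rome conquered
```

Every replica ends up with the same text, so convergence holds, but neither "came home" nor "saw Rome" stays contiguous.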

How to decide to convert to categorical variable or keep it numeric?

This might be a basic or trivial question and might be straightforward. Still I would like to ask this to clear my doubt once and for all.
Take the example of Passenger Class in the famous Titanic data. Functionally it is indeed categorical data, so it makes perfect sense to convert it to a categorical variable. Algorithms, as per my understanding, then tend to see a pattern specific to that class. But at the same time, if you treat it as a numeric variable, it might also denote a range for a decision tree, say passengers in between first class and second class.
It looks like both are correct and both will affect the machine learning algorithm's outputs in different ways.
Which one is appropriate, and is there an extensive discussion about it anywhere? Should we use such ambiguous variables as numeric as well as keeping a copy as a categorical variable, which might prove to be a technique to uncover more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as, "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class like "people who were given the drug versus those who weren't" - you can very easily compare the impacts of being in another class (e.g. 2 or 3) to being in the base class, 1, which again would be left out.
The problem with mapping categorical notions to numerical is that some algorithms (e.g. neural networks) will interpret the value itself as having a meaning, i.e. you would get different results if you assign values 1,2,3 to passenger classes than, for example 0,1,2 or 3,2,1. The correspondence between the passenger classes and numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the lower the number, the "better" the class is; however, it's still hard to interpret it as "the first class is twice as good as second class", unless you define some measure of "goodness" that makes the relation between the numbers "1" and "2" sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.

Save/Restore Ruby's Random

I'm trying to create a game that always runs the same way given the same seed. That means that random events, whatever they may be, will always be the same for two players using the same seed.
However, given the user's ability to save and load the game, Ruby's Random would reset every time the save loaded, making the whole principle void if two players save and load at different points.
The only solution I have imagined for this is, whenever a save file is loaded, to generate the same number of points as before, and thus getting Ruby's Random to the same state as it was before load. However, to do that I'd need to extend it so a counter is updated every time a random number is generated.
Does anyone know how to do that or has a better way to restore the state of Ruby's Random?
PS: I cannot use an instance of Random (Random.new) and Marshal it. I have to use Ruby's default.
Sounds like Marshal.dump/Marshal.load may be exactly what you want. The Random class documentation explicitly states "Random objects can be marshaled, allowing sequences to be saved and resumed."
You may still have problems with synchronization across games, since different user-based decisions can take you through different logic paths and thus use the sequence of random numbers in entirely different ways.
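A minimal sketch of the save/resume round trip is below. It assumes the game can route its random draws through its own Random instance, which the PS in the question rules out for the asker's case, so treat it as an illustration of the documented behaviour rather than a drop-in fix.

```ruby
rng = Random.new(1234)          # seeded generator owned by the game
3.times { rng.rand(100) }       # some gameplay consumes numbers

saved = Marshal.dump(rng)       # a String that can go into the save file

# What the player would see if they simply kept playing:
expected = Array.new(5) { rng.rand(100) }

# Later, when the save is loaded:
restored = Marshal.load(saved)
resumed = Array.new(5) { restored.rand(100) }

puts resumed == expected        # prints true
```

The dumped string is binary data, so write it with File.binwrite and read it back with File.binread.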
I'd suggest maybe saving the 'current' data to a file when the user decides to save (or when the program closes) depending on what you prefer.
This can be done using the File class in ruby.
This would mean you'd need to keep track of turns and pass that along with the save data. Or you could loop through the data in the file and find out how many turns have occurred that way I suppose.
So you'd have something like:
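# Read the saved game data back from the file at `path`.
def load_game(path)
  data = File.read(path)
  # What you do below here depends on how you decide to store the data in save_game.
  data
end

# Write the current game data (including the turn count) to the file at `path`.
def save_game(path, data)
  File.write(path, data)
end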
I haven't really tried the above code, so it could have bad syntax or such. It's mainly just the concept I'm trying to get across.
Hopefully that helps?
There are many generators that compute each random number in the sequence from the previous value alone, so if you used one of those you need only save the last random number as part of the state of the game. An example is a basic linear congruential generator, which has the form:
z(n+1) = (a * z(n) + b) mod c
where a, b and c are typically large (known) constants, and z(0) is the seed.
An arguably better one is the so-called "multiply-with-carry" method.
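As a rough sketch of the linear congruential idea in Ruby: the default constants below are a commonly cited parameter set (from Numerical Recipes) with c = 2**32, and the method name lcg_next is made up for illustration. It is fine for reproducible game events but not for anything security-related.

```ruby
# One step of z(n+1) = (a * z(n) + b) mod c.
# The defaults are an assumed, commonly cited parameter set.
def lcg_next(z, a: 1_664_525, b: 1_013_904_223, c: 2**32)
  (a * z + b) % c
end

seed = 42
z1 = lcg_next(seed)
z2 = lcg_next(z1)

# Saving the game only needs the latest value...
saved_state = z2

# ...and loading continues the sequence exactly where it stopped.
z3_after_load = lcg_next(saved_state)
z3_uninterrupted = lcg_next(lcg_next(lcg_next(seed)))

puts z3_after_load == z3_uninterrupted   # prints true
```

To get a number in a range, reduce the state afterwards (for example z % 6 + 1 for a die roll) and keep the full state for the next step.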

Minimizing duplicates while picking items from multiple sets of arbitrary items

Since I am unsure how to phrase the question I will illustrate it with an example that is very similar to what I am trying to achieve.
I am looking for a way to optimize the amount of time it takes to perform the following task.
Suppose I have three sets of numbers labeled "A", "B", and "C", each set containing an arbitrary number of integers.
I receive a stack of orders that ask for a "package" of numbers, each order asking for a particular combination of integers, one from each set. So an order might look like "A3, B8, C1", which means I will need to grab a 3 from set A, an 8 from set B, and a 1 from set C.
The task is simple: grab an order, look at the numbers, then go collect them and put them together into a "package".
It takes a while for me to collect the numbers, and oftentimes an order comes in asking for the same numbers as a previous order, so I decide to store all of the packages for later retrieval; this way, the amount of time it takes me to process a duplicate order is dramatically reduced, since I don't have to go and collect the same numbers again.
The amount of time it takes to collect a number is quite long, but not as long as examining each package one by one, if I have a lot of orders that day.
So for example if I have the following sets of numbers and orders
set A: [1, 2, 3]
set B: [4, 5, 6, 12, 18]
set C: [7, 8]
Order 1: A1, B6, C7
Order 2: A3, B5, C8
Order 3: A1, B6, C7
I would put together packages for orders 1 and 2, but then I notice that order 3 is a duplicate order so I can choose to just take the package I put together for the first order and finish this last order quickly.
The goal is to optimize the amount of time taken to process a stack of orders. Currently I have come up with two methods, but perhaps there are more ways to do things:
1. Gather the numbers for each order, regardless of whether it's a duplicate or not. I end up with a lot of packages in the end, and for extreme cases where someone places a bulk order for 50 identical packages, it's clearly a waste of time.
2. Check whether the package already exists in a cache, perhaps using some sort of hashing method on the orders.
Any ideas?
There is not much detail given about how you fetch the data to compose packages etc. This makes it hard to come up with different solutions to your problem. For example, maybe existing packages could lead you to the data you need to compose new packages, although they differ in one way or another. For this there are actually dedicated hashing methods available like Locality Sensitive Hashing.
Given the two approaches you came up with, it sounds very natural to go for route 2. Hashing on the indices sounds trivial (first order is easily identified by the number 167, or string "167", right?) and therefore you would have no real drawback from using a hash. The only concern might be memory constraints, as you need to keep old packages around. There are also common methods out there to define which packages to keep in the (hashed) cache and which ones to throw away.
Without knowing the exact timings it is not possible to be definitive, but it looks to me as if your idea 2, using some sort of hash table to store previous orders, is the way to go.
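A minimal Ruby sketch of that hash-table idea follows, using the three example orders from the question. collect_package stands in for the slow gathering step, and all of the names are hypothetical.

```ruby
# Placeholder for the slow step of gathering one number from each set.
def collect_package(order)
  order.dup
end

# Process orders in sequence, reusing a cached package for any duplicate.
# Returns the packages plus how many had to be collected from scratch.
def process_orders(orders)
  cache = {}
  collected = 0
  packages = orders.map do |order|
    # Arrays hash by content, so [1, 6, 7] finds an earlier [1, 6, 7].
    cache.fetch(order) do
      collected += 1
      cache[order] = collect_package(order)
    end
  end
  [packages, collected]
end

# Orders as [A, B, C] picks: A1 B6 C7, A3 B5 C8, A1 B6 C7.
orders = [[1, 6, 7], [3, 5, 8], [1, 6, 7]]
packages, collected = process_orders(orders)

puts collected   # prints 2
```

If memory becomes a concern, the plain Hash could be swapped for a bounded cache that evicts old packages.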
