Is this question number 10,000,000 on Stack Overflow? [closed] - algorithm

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
Well, the title pretty much says it.
(Q1) Is this question number 10,000,000 on Stack Overflow?
But, since it is probably not (though I made an 'attempt'):
(Q2) Which question is number 10,000,000 on Stack Overflow?
I realize that this is not exactly a programming question yet, so I will try to turn it into one:
(Q3) How can one determine which is question number N on Stack Overflow without administrator access to the Stack Overflow database?
I came up with all these while visiting Stack Overflow a few hours ago. The question counter caught my attention, because it resembled a price tag (all those 9s). I admit I did not give it too much thought. So maybe there is an obvious answer, but I can't see it right now.
What I've 'tried'
Obviously, the total number of posted questions T is one parameter to consider (i.e. the counter at the top right of the questions page). However, it is not directly related to any particular question. I also suspect that it is just an approximation of the real value, since I've seen it go up and down a few times. So it might be precisely evaluated only once every few minutes. Another cause of this behavior could be that questions can also be deleted; I might have caught the system during 'cleaning' time. In consequence, there might be more than one "question number N". But there must be a 'first' one. And a 'last' one, to be determined once the questions posted before it stop getting deleted, or become unlikely to be deleted after that moment.
Second 'clue' I could think about was the question URL. It has this format:
https://stackoverflow.com/questions/<id>/<title>
The id part and T seem to be correlated. The difference between the two values is not constant, though earlier today it was around 22,107,000 and increasing at a relatively slow rate. This might be because the id is an aggregate counter (I suspect it counts both questions and answers, possibly more). So unless one has an overall view of how the number of questions and/or the number of answers evolved over time, this is not the way to go.
Due to the distributed setup of the whole system and due to the fact that the above parameters are hard to track accurately (at least by an average user), it seems hard to come up with answers to any of the questions above (if they are well defined in the first place). All these indicate that there might be no exact answers, but rather a probabilistic discussion should take place. Thus an attempt to make this question number 10,000,000 was hopeless in the first place.
Though I can see no practical use for all these (at least not from the point of view of an average user), chances are that I am not the only one wondering about it. Any other ideas?
It appears my question was a bit unclear for some of the users (though I did receive an appropriate answer, addressing the problem I had in mind, before it got put on hold). So I will make an attempt to rephrase it.
In the image below you can see the question counter (labeled (1), above referred to as T). The value of this counter is changed by some logic. Ideally such a counter would increase by 1 each time a new question meets all the requirements and would never decrease. However, this one goes down from time to time. So, a first question would be: what is the logic that changes the value of the counter?
Additionally, I would like to know which question turned the value of this counter from 9,999,999 into 10,000,000. Due to the nature of the logic that modifies the value of T, this change could have taken place multiple times. To simplify things, I am curious about which question did it first. That question is the last step needed for the banner message labeled (2) to be legit, and it is what I referred to above as 'question number 10,000,000'.
If there is a way to determine this question, an acceptable answer would come in the form of a URL pointing to it and a proof that this is indeed the question I am looking for. Any possibly related ideas are also welcome (since the logic that modifies the value of T might not be deterministic after all).

http://stackoverflow.com/questions/4
shows me question number 4.
IDs 1 through 3 were deleted. If you check the auto-generated URL for ID 1, http://stackoverflow.com/questions/1/where-oh-where-did-the-joel-data-go/, you'll recognize that these were questions from Joel. The page itself says: "This question was removed from Stack Overflow for reasons of moderation." That's fairly self-explanatory (I'd guess some dev tests, or useless clutter).
Well, I also noticed that if I use a non-existent ID, it redirects me to the "nearest" ID, meaning the existing ID with the smallest difference from the given URL parameter is shown. For example, ID 341888 will redirect you to ID 341743: http://stackoverflow.com/questions/341743/c-string-that-can-be-null/
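The "nearest ID" behaviour described above can be sketched as a binary search over a sorted list of surviving IDs. This is purely illustrative: the `nearest_id` function and the ID list are hypothetical, and Stack Overflow's actual routing logic is not public.

```python
import bisect

def nearest_id(existing_ids, requested_id):
    """Return the surviving ID closest to requested_id.

    existing_ids must be sorted ascending; ties go to the lower ID.
    """
    pos = bisect.bisect_left(existing_ids, requested_id)
    if pos == 0:
        return existing_ids[0]
    if pos == len(existing_ids):
        return existing_ids[-1]
    before, after = existing_ids[pos - 1], existing_ids[pos]
    return before if requested_id - before <= after - requested_id else after

# The example from this answer: 341888 falls between two surviving IDs.
ids = [341743, 342100]
print(nearest_id(ids, 341888))  # -> 341743
```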
Regarding the above facts, I THINK it's not possible to determine which question is (or even was) number 10,000,000, as a lot of them get deleted for moderation reasons, etc.
I hope this helps.


Reverse engineer an algorithm

I have an algorithm that uses time, and other variables to create keys that are 30 characters in length.
The FULL results look like this: d58fj27sdn2348j43dsh2376scnxcs
Because it uses time, and other variables, it will always change.
All you see is the first 6: d58fj2
So basically, if you monitor my results, each time you would see different ones, and only the first 6:
d58fj2
kfd834
n367c6
9vh364
vc822c
c8u23n
df8m23
jmsd83
QUESTION: Would you ever be able to reverse engineer and figure out the algorithm calculating them? REMEMBER, you NEVER see the full result, just the first 6 digits out of 30.
QUESTION 2: To those saying it's possible, how many keys would you need in order to do that? (And I still mean, just the first 6 digits out of the 30)
Thanks
I'm willing to risk the downvotes, but somehow this quickly started smelling like school (home)work.
The question itself - "Is this reverse-engineerable? REMEMBER, you never see the full result" - is suspicious enough; if you can see the full result, so can I. Whether you store it locally so I can take my time inspecting it, or whether it goes through the wire so I have to hunt it down, is another matter - having to use Wireshark or not, I can still see what's being transmitted to and from the app.
Remember, at some point WEP used to be "unbreakable", while now a lot of low-end laptops can crack it easily.
The second question however - "how many samples would you need to see to figure it out" - sounds like one of those dumb, impractical teacher questions. Some people guess their friends' passwords on the first try; some take a few weeks... The number of tries, unfortunately, isn't the deciding factor in reverse engineering. Only having the time to try them all is; which is why people use expensive locks on their doors - not because they're unbreakable, but because breaking them takes more than a few seconds, which increases the chances that the neighbours will see suspicious activity.
But asking the crowd "how many keys would you need to see to crack this algorithm you know nothing about" leads nowhere, as it's merely a defensive move that provides no guarantees; the author of the algorithm knows very well how many samples one needs to break it using statistical analysis (in the case of WEP, that's anywhere from 5,000 to 50,000 to 200,000 IVs). Some keys break down with 5k; some hardly break even with 200k...
Answering your questions in more detail with academic proof requires more info from your side; much more than the ambiguous "can you do it, and if yes, how long would it take?" question which is what it currently is.

Which data structure to choose for scenario A and B as described below [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I found the below in a question bank and I'm looking for some help with it.
For each of the following situations, select the best data structure and justify your selection.
The data structures should be selected from the following possibilities: unordered list, ordered array, heap, hash table, binary search tree.
(a) (4 points) Your structure needs to store a potentially very large number of records, with the data being added as it arrives. You need to be able to retrieve a record by its primary key, and these keys are random with respect to the order in which the data arrives. Records also may be deleted at random times, and all modifications to the data need to be completed just after they are submitted by the users. You have no idea how large the dataset could be, but the data structure implementation needs to be ready in a few weeks. While you are designing the program, the actual programming is going to be done by a co-op student.
For the answer, I thought BST would be the best choice.
Since the size is not clear, hashtable is not a good choice.
Since there is a matter of deletion, heap is not acceptable either.
Is my reasoning correct?
(b) (4 points) You are managing data for the inventory of a large warehouse store. New items (with new product keys) are added and deleted from the inventory system every week, but this is done while stores are closed for 12 consecutive hours.
Quantities of items are changed frequently: incremented as they are stocked, and decremented as they are sold. Stocking and selling items requires the item to be retrieved from the system using its product key.
It is also important that the system be robust, well-tested, and have predictable behaviour. Delays in retrieving an item are not acceptable, since it could cause problems for the sales staff. The system will potentially be used for a long time, though largely it is only the front end that is likely to be modified.
For this part I thought heapsort, but I have no idea how to justify my answer.
Could you please help me?
(a) needs fast insertion and deletion and you need retrieval based on key. Thus I'd go with a hashtable or a binary search tree. However, since the size is not known in advance and there's that deadline constraint, I'd say the binary search tree is the best alternative.
(b) You have enough time to process data after insertion/deletion (the 12-hour closing window), but retrieval must be fast and predictable. An ordered array should do the trick: binary search gives a guaranteed O(log n) lookup, with none of the rehashing or rebalancing surprises of the other structures.
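To illustrate the ordered-array choice for (b), here is a minimal sketch, assuming the inventory is rebuilt as sorted parallel arrays during the overnight window; the product keys and quantities below are invented:

```python
import bisect

# Hypothetical inventory: rebuilt (and re-sorted) during the nightly
# 12-hour window, then only quantities change during the day.
keys = [101, 205, 333, 480, 912]   # product keys, kept sorted
quantities = [40, 12, 7, 55, 3]    # quantity on hand for each key

def find_index(key):
    """Binary search: guaranteed O(log n), fully predictable latency."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return i
    raise KeyError(key)

def sell(key, n=1):
    """Decrement stock when an item is sold."""
    quantities[find_index(key)] -= n

sell(333)
print(quantities[find_index(333)])  # -> 6
```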

How can I distinguish between two different users who live near to each other? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
How can I distinguish between two different users, like two neighbours who live at the same address and go to the same office, but who have different driving patterns and different office schedules? I want to find out the probability that two persons behave more or less alike. Depending on the resolution of the map, I want to figure out where they are and how often they are there. Can I turn each driver's pattern into some kind of signature from which their identity can be traced?
I assume, by the way that you asked your question, that you haven't had any plausible ideas yet. So I'll make an answer which is purely based on an idea that you might like to try out.
I initially thought of suggesting something along the line of word-similarity metrics, but because order is not necessarily important here, maybe it's worth trying something simpler to start. In fact, if I ever find myself considering something complex when developing a model, I take a step back and try to simplify. It's quicker to code, and you don't get so attached to something that's a dead end.
So, how about histograms? If you divide up time and space into larger blocks, you can increment a value in the relevant location for each time interval. You get a 2D histogram of a person's location. You can use basic anti-aliasing to make the histograms more representative.
From there, it's down to histogram comparison. You could implement something really basic using only 1D strips: say, sum a similarity measure over each of the vertical and horizontal strips. Linear histogram comparison is super easy, just a few lines of code in a language like C. Good enough for a proof of concept. If it feels like you're on the right track, then start looking for trickier ideas...
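The histogram idea above can be sketched in a few lines. Everything here is an assumption for illustration: the grid size, the normalized coordinates, and the intersection-style similarity measure.

```python
# Hypothetical sketch: bin (x, y) positions sampled over time into a coarse
# 2D histogram per driver, then compare two drivers strip by strip.
GRID = 16  # spatial resolution; each cell is one "block" of the map

def location_histogram(positions):
    """positions: iterable of (x, y) in [0, 1). Returns normalized GRID x GRID counts."""
    h = [[0.0] * GRID for _ in range(GRID)]
    for x, y in positions:
        h[int(y * GRID)][int(x * GRID)] += 1
    total = sum(map(sum, h)) or 1
    return [[c / total for c in row] for row in h]

def strip_similarity(h1, h2):
    """Histogram intersection, summed over horizontal strips (rows).

    Returns a value in [0, 1]; 1 means identical location distributions.
    """
    return sum(min(a, b) for r1, r2 in zip(h1, h2) for a, b in zip(r1, r2))

a = location_histogram([(0.1, 0.1), (0.1, 0.1), (0.8, 0.5)])
b = location_histogram([(0.1, 0.1), (0.8, 0.5), (0.8, 0.5)])
print(strip_similarity(a, b))  # overlap between the two drivers' patterns
```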
The next thing I'd do is further stratify my data, using days of the week and statutory holidays... Maybe even stratify further using seasonal variables. I've found it pretty effective for forecasting electricity load, which is as much about social patterns as it is about weather. The trends become much more distinct when you separate an influencing variable.
So, after stratification you get a stack of 2D 'slices', and your signature becomes a kind of 3D volume. I see nothing wrong with representing the entire planet as a grid, whether your squares represent 100 m or 1 km. It's easy to store this sparsely and prune out anything that's outside some number of standard deviations. You might choose only the most major events of the day and end up with a handful of locations.
You can then focus on the comparison metric. Maybe some kind of image-based gradient or cluster analysis. I'm sure there's loads of really great stuff out there. These are just the kinds of starting points I pick, having done no research.
If you need to add some temporal information to introduce separation between people with very similar lives, you can maybe build some lags into the system... Such as "where they were an hour ago". At that point (or possibly before), you probably want to switch from my over-simplified approach of averaging out a person's daily activities, and instead use something like classification trees. This kind of thing is very easy and rapid to develop with a tool like MATLAB or R.

Why do so many things run in 'human observable time'? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I've studied complexity theory and I come from a solid programming background and it's always seemed odd that so many things seem to run in times that are intrinsic to humans. I'm wondering if anyone has any ideas as to why this is?
I'm generally speaking of times in the range of 1 second to 1 hour. If you consider how narrow that span of time is compared to the billions of operations per second a computer can handle, it seems odd that such a large number of things fall into that category.
A few examples:
Encoding video: 20 minutes
Checking for updates: 5 seconds
Starting a computer: 45 seconds
You get the idea...
Don't you think most things should fall into one of two categories: instantaneous / millions of years?
Probably because it signifies the cut-off where people consider further optimization not worth the effort.
And clearly, having a computer that takes millions of years to boot wouldn't be very useful (or maybe it would, but you just wouldn't know yet, because it's still booting :P).
Given that computers are tools, and tools are meant to be setup, used, and have their results analyzed by humans (mostly), it makes sense that the majority of operations would be created in a way that didn't take longer than the lifespan of a typical human.
I would argue that most single operations are effectively "instantaneous" (in that they run in less than perceptible time), but are rarely used as single operations. Humans are capable of creating complexity, and given that many computational operations intrinsically balance speed against some other factor (quality, memory usage, etc.), it actually makes sense that many operations are designed in a way where that balance places them into "times that are intrinsic to humans". However, I'd personally word that as "a time that is assumed to be acceptable to a human user, given the result generated."

Designing a twenty questions algorithm

I am interested in writing a twenty questions algorithm similar to what akinator and, to a lesser extent, 20q.net uses. The latter seems to focus more on objects, explicitly telling you not to think of persons or places. One could say that akinator is more general, allowing you to think of literally anything, including abstractions such as "my brother".
The problem with this is that I don't know what algorithm these sites use, but from what I read they seem to be using a probabilistic approach in which questions are given a certain fitness based on how many times they have led to correct guesses. This SO question presents several techniques, but rather vaguely, and I would be interested in more details.
So, what could be an accurate and efficient algorithm for playing twenty questions?
I am interested in details regarding:
What question to ask next.
How to make the best guess at the end of the 20 questions.
How to insert a new object and a new question into the database.
How to query (1, 2) and update (3) the database efficiently.
I realize this may not be easy and I'm not asking for code or a 2000 words presentation. Just a few sentences about each operation and the underlying data structures should be enough to get me started.
Update, 10+ years later
I'm now hosting a (WIP, but functional) implementation here: https://twentyq.evobyte.org/ with the code here: https://github.com/evobyte-apps/open-20-questions. It's based on the same rough idea listed below.
Well, over three years later, I did it (although I didn't work full time on it). I hosted a crude implementation at http://twentyquestions.azurewebsites.net/ if anyone is interested (please don't teach it too much wrong stuff yet!).
It wasn't that hard, but I would say it's the non-intuitive kind of not hard that you don't immediately think of. My methods include some trivial fitness-based ranking, ideas from reinforcement learning and a round-robin method of scheduling new questions to be asked. All of this is implemented on a normalized relational database.
My basic ideas follow. If anyone is interested, I will share code as well, just contact me. I plan on making it open source eventually, but once I have done a bit more testing and reworking. So, my ideas:
an Entities table that holds the characters and objects played;
a Questions table that holds the questions, which are also submitted by users;
an EntityQuestions table holds entity-question relations. This holds the number of times each answer was given for each question in relation to each entity (well, those for which the question was asked, anyway). It also has a Fitness field, used for ranking questions from "more general" down to "more specific";
a GameEntities table is used for ranking the entities according to the answers given so far for each on-going game. An answer of A to a question Q pushes up all the entities for which the majority answer to question Q is A;
The first question asked is picked from those with the highest sum of fitnesses across the EntityQuestions table;
Each next question is picked from those with the highest fitness associated with the currently top entries in the GameEntities table. Questions for which the expected answer is Yes are favored even before the fitness, because these have more chances of consolidating the current top ranked entity;
If the system is quite sure of the answer even before all 20 questions have been asked, it will start asking questions not associated with its answer, so as to learn more about that entity. This is done in a round-robin fashion from the global questions pool right now. Discussion: is round-robin fine, or should it be fully random?
Premature answers are also given under certain conditions and probabilities;
Guesses are given based on the rankings in GameEntities. This allows the system to account for lies as well, because it never eliminates any possibility, just decreases its likeliness of being the answer;
After each game, the fitness and answers statistics are updated accordingly: fitness values for entity-question associations decrease if the game was lost, and increase otherwise.
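The GameEntities ranking described above can be roughly sketched as follows. The table stand-ins (`majority_answer`, `scores`) and all the data are invented for illustration; the real implementation sits on a relational database.

```python
# majority_answer[(entity, question)] stands in for the EntityQuestions
# table: the most common recorded answer for that pair. scores plays the
# role of the per-game GameEntities ranking.
majority_answer = {
    ("cat", "does it fly?"): "no",
    ("eagle", "does it fly?"): "yes",
    ("cat", "is it a pet?"): "yes",
    ("eagle", "is it a pet?"): "no",
}
entities = ["cat", "eagle"]
scores = {e: 0.0 for e in entities}

def record_answer(question, answer, boost=1.0):
    """Push up every entity whose majority answer to `question` matches."""
    for e in entities:
        if majority_answer.get((e, question)) == answer:
            scores[e] += boost

record_answer("does it fly?", "no")
record_answer("is it a pet?", "yes")
print(max(scores, key=scores.get))  # -> cat
```

Note that a mismatched answer never eliminates an entity outright; it simply fails to boost it, which is what lets the system tolerate lies.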
I can provide more details if anyone is interested. I am also open to collaborating on improving the algorithms and implementation.
This is a very interesting question. Unfortunately I don't have a full answer, let me just write down the ideas I could come up with in 10 minutes:
If you are able to halve the set of available answers on each question, you can distinguish between 2^20 ~ 1 million "objects". Your set is probably going to be larger, so it's right to assume that sometimes you have to make a guess.
You want to maximize utility. Some objects are chosen more often than others. If you want to make good guesses you have to take into consideration the weight of each object (= the probability of that object being picked) when creating the tree.
If you trust your users a little bit, you can gain knowledge from their answers. This also means that you cannot use a static tree to ask questions, because then you'll get the same answers to the same questions... and you'll learn nothing new when you encounter the same object again.
If a single question is not able to divide the set into two halves, you could combine questions to get better results, e.g.: "is the object green or blue?", "is it green or does it have a round shape?"
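The halving idea above can be sketched as a greedy search for the most balanced question; all the objects, questions, and answers below are made up for illustration:

```python
# answers[question][object] -> True/False (hypothetical training data)
candidates = {"apple", "frog", "emerald", "sky"}
answers = {
    "is it green?": {"apple": True, "frog": True, "emerald": True, "sky": False},
    "is it alive?": {"apple": False, "frog": True, "emerald": False, "sky": False},
    "is it edible?": {"apple": True, "frog": True, "emerald": False, "sky": False},
}

def best_question(candidates, answers):
    """Return the question that splits the candidate set most evenly."""
    def imbalance(q):
        yes = sum(answers[q][obj] for obj in candidates)
        return abs(2 * yes - len(candidates))  # 0 means a perfect halving
    return min(answers, key=imbalance)

print(best_question(candidates, answers))  # -> is it edible?
```

Each perfect halving removes one bit of uncertainty, which is where the 2^20 ~ 1 million figure comes from.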
I am trying to write a Python implementation using a naïve Bayesian network for learning, with minimization of the expected entropy after the question has been answered as the criterion for selecting a question (with an epsilon chance of selecting a random question in order to learn more about that question), following the ideas in http://lists.canonical.org/pipermail/kragen-tol/2010-March/000912.html. I have put what I got so far on GitHub.
Preferably choose questions with low remaining entropy expectation. (For putting together something quickly, I stole from ε-greedy multi-armed bandit learning and use: With probability 1–ε: Ask the question with the lowest remaining entropy expectation. With probability ε: Ask any random question. However, this approach seems far from optimal.)
Since my approach is a Bayesian network, I obtain the probabilities of the objects and can ask for the most probable object.
A new object is added as a new column to the probability matrix, with a low a priori probability, and with the answers to the questions taken as given where available, or as guessed by the Bayes network where not. (I expect this second part would work much better if I added Bayes network structure learning instead of just using naive Bayes.)
Similarly, a new question is a new row in the matrix. If it comes from user input, probably only very few answer probabilities are known, the rest needs to be guessed. (In general, if you can get objects by asking for properties, you can obtain properties by asking if given objects have them or not, and the transformation between these is essentially Bayes' theorem and breaks down to transposition in the easiest case. The guessing quality should improve again once the network has an appropriate structure.)
(This is a problem, since I calculate lots of probabilities. My goal is to do it using database-oriented sparse tensor calculations optimized for working with weighted directed acyclic graphs.)
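The expected-entropy criterion described in this answer can be sketched as follows. This is not the author's actual code; the objects, questions, and conditional probabilities are invented for illustration.

```python
import math

# posterior over objects, and P(answer = yes | object) per question
posterior = {"dog": 0.5, "car": 0.5}
p_yes = {
    "is it alive?": {"dog": 0.95, "car": 0.05},
    "is it big?": {"dog": 0.5, "car": 0.6},
}

def entropy(dist):
    """Shannon entropy in bits of a normalized distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy(question):
    """Expected posterior entropy after hearing the answer to `question`."""
    result = 0.0
    for ans_is_yes in (True, False):
        joint = {
            o: posterior[o] * (p_yes[question][o] if ans_is_yes
                               else 1 - p_yes[question][o])
            for o in posterior
        }
        z = sum(joint.values())  # probability of getting this answer
        if z > 0:
            result += z * entropy({o: p / z for o, p in joint.items()})
    return result

# Ask the question that most reduces uncertainty about the object.
print(min(p_yes, key=expected_entropy))  # -> is it alive?
```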
It would be interesting to see how good a decision tree based algorithm would serve you. The trick here is purely in the learning/sorting of the tree. I'd like to note that this is stuff I remember from AI class and student work in the AI working group and should be taken with a semi-large grain (or nugget) of salt.
To answer the questions:
You just walk the tree :)
This is a big downside of decision trees. You'd only have one guess that can be attached to the end nodes of the tree at depth 20 (or earlier, if the tree is still sparse).
There are whole books dedicated to this topic. As far as I remember from AI class, you try to minimize entropy at all times, so you want to ask questions that ideally divide the set of remaining objects into two sets of equal size. I'm afraid you'd have to look this up in AI books.
Decision trees are highly efficient during the query phase, as you literally walk the tree and follow the 'yes' or 'no' branch at each node. Update efficiency depends on the learning algorithm applied. You might be able to do this offline as in a nightly batched update or something like that.
