I am trying to solve the problem of matching a human generated gesture with a known gesture. The human generated gesture will be represented by a sequence of points that will need to be interpolated into a path and compared to the existing path. The image below shows what I am trying to compare
Can you please help point me in the right direction with resources or concepts that I can read into to build an algorithm to match these two paths? I have no experience in doing this before so any insights will be appreciated.
Receiving input
Measure input on some interval. Every xx milliseconds, measure the coordinates of the user's hand/finger/stylus.
Storing patterns and input
Patterns (expected input)
Modify the pattern. It's currently a continuous "function," but measuring input as such is difficult. Use discrete points at some interval. This interval can be very short, depending on how accurate you require gestures to be. In fact, it should be very short; the more points to compare against, the better (I'll explain this a little better in the next section).
Input (received from user)
When input is measured, the input-measurement interval needs to be short enough that each received consecutive pair of input points is close enough to compare to the expected points.
Imagine that the user performs some gesture very quickly (and completes it in the time your input-reader reads only three frames). The pattern and input cannot be reliably compared:
To avoid this, your input-reader must have a relatively short interval. However, this probably isn't a huge concern, since most hardware can read even the fastest human gestures.
Back to patterns: they should always be detailed enough to include more points than any possible input. More expected points allow for better accuracy. If a user moves slowly, the input will have more points; if they move quickly, the input will have fewer.
Consider this: completing a single gesture gives you half as many input frames as the pattern includes. The user has moved at a "normal" speed, so, to simplify the algorithm, you can "dumb down" your pattern by a factor of 2, then compare input coordinates to pattern coordinates directly.
This method is easier than the alternative that comes to mind (see next section).
Pattern "density" (coordinate frequency)
If you have a small number of expected points, you'll have to make approximations to match input.
Here's an "extreme" example, but it proves the concept. Given this pattern and input:
Point 3r can't be reliably compared with point 2 or point 3, so you'd have to use some function of points 2, 3, and 3r to determine whether 3r is on the correct path. Now consider the same input, but where the pattern has higher density:
Now you don't have to compromise, since 3r falls essentially right on the gesture's pattern. A slight reduction in the pattern's density still lets it match the input quite well.
Positioning
Relative positioning
Instead of comparing absolute positions (such as on a touchscreen), you probably want the gesture to be allowed anywhere in some plane of space. To that end, you must relate the start point of the input to some coordinate system.
Normalization
To be user-friendly, allow gestures to be done in a range of "sizes". You don't want to compare raw data, because chances are the size of the plane of the input doesn't match the size of the plane of the pattern.
Normalize the input in the x- and y-direction to match the size of your pattern. Do not maintain aspect ratio.
Relate the input to a coordinate system, as per previous bullet
Find the largest horizontal and vertical distance between any two input points (call them RecMaxH and RecMaxV)
Find the largest horizontal and vertical distance between any two pattern points (call them ExpMaxH and ExpMaxV)
Multiply all input points' x-coordinates by ExpMaxH/RecMaxH
Multiply all input points' y-coordinates by ExpMaxV/RecMaxV
You now have two more-similar sets of points that can be compared. Normalization can be much more detailed than this; for instance, you could normalize sets of 3 points at a time to get incredibly similar images (but you would probably have to do this for each pattern, then compare the sum of all differences to find the most likely matching pattern).
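To make the normalization steps concrete, here's a minimal sketch in Python (purely illustrative: it assumes points are plain (x, y) tuples, the pattern is stored the same way, and the variable names mirror RecMaxH/RecMaxV and ExpMaxH/ExpMaxV above):

def normalize_input(points, pattern):
    # Relate the input to a coordinate system: treat the first input point as the origin.
    x0, y0 = points[0]
    rel = [(x - x0, y - y0) for x, y in points]
    # Largest horizontal/vertical distances in the input (RecMaxH, RecMaxV);
    # fall back to 1.0 to avoid dividing by zero on degenerate input
    rec_max_h = max(x for x, _ in rel) - min(x for x, _ in rel) or 1.0
    rec_max_v = max(y for _, y in rel) - min(y for _, y in rel) or 1.0
    # Largest horizontal/vertical distances in the pattern (ExpMaxH, ExpMaxV)
    exp_max_h = max(x for x, _ in pattern) - min(x for x, _ in pattern) or 1.0
    exp_max_v = max(y for _, y in pattern) - min(y for _, y in pattern) or 1.0
    # Scale x and y independently (aspect ratio intentionally not preserved)
    return [(x * exp_max_h / rec_max_h, y * exp_max_v / rec_max_v) for x, y in rel]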
I suggest storing all gestures' patterns as graphs of the same size; that reduces computation when measuring the closeness of input to possible pattern matches.
When to measure input
User-driven
Imagine a button that, when clicked/activated, causes your program to begin measuring inputs. This would be similar to Google's Voice Search, which doesn't constantly record and search; instead, you say "Ok Jarvis" or click the handy microphone icon and begin speaking your query.
Benefits:
Simplifies algorithm
Prevents user from unintentionally triggering an event. Imagine if every word you spoke was analyzed and sent to Google as part of a search query. Sometimes you just don't mean to do anything.
Drawbacks:
Less user-friendly. User must go out of his/her way to trigger recording for gestures.
If you're writing, for instance, a gesture-search (ridiculous example), this is probably the better method to implement. Nobody wants every move they make interpreted as an action in your application. However, if you're writing a Kinect-style or gesture-based game, you probably want to be constantly recording and looking for gestures.
Constant
Your program constantly records gesture coordinates at the specified interval (this could be reduced to "records if there's movement, otherwise doesn't store coordinates"). You must make a decision: how many "frames" will you record until deciding that the currently-stored motion is not a recognized gesture?
Store coordinates in a buffer: a queue 1.5 or 2 (to be cautious) times as long as the largest number of frames you're willing to record.
Once you determine that there exists in this buffer a sequence of frames that match a pattern, execute that gesture's result, and clear the queue.
If there's the possibility that the next gesture is an "option" for the most-recent gesture, record the application state as "currently waiting on option for ____ gesture," and wait for the option to appear.
If it's determined that the first x frames in the buffer cannot possibly match a pattern (because of their sequence or positioning), remove them from the queue.
Benefits:
Allows for more dynamic handling of gestures
User input recognized automatically
Drawbacks:
More complex algorithm
Heavier computation
If you're writing a game that runs based on real-time input, this is probably the right choice.
Algorithm
If you're using user-driven recognition:
Record all input in the allowed timeframe (or until the user signifies that they're done)
To evaluate the input, reduce the density of your pattern to match that of the input
Relate the input to a coordinate system
Normalize input
Use a method of function comparison (looseness of this calculation is up to you: standard deviation, variance, total difference in values, etc.), and choose the least-different possibility (a rough sketch follows after this list).
If no possibility is similar enough to meet your required threshold (you must decide this), don't accept the input.
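Putting those steps together, a minimal sketch in Python (all names here are made up for illustration; it assumes the input has already been related to a coordinate system and normalized, and uses total point-to-point distance as the comparison function):

import math

def best_match(input_pts, patterns, threshold):
    # patterns: {gesture_name: list of (x, y) pattern points}, denser than any input
    best_name, best_score = None, float('inf')
    for name, pattern in patterns.items():
        # reduce the density of the pattern to match that of the input
        step = max(1, len(pattern) // len(input_pts))
        reduced = pattern[::step][:len(input_pts)]
        # crude "function comparison": total point-to-point distance
        score = sum(math.dist(a, b) for a, b in zip(input_pts, reduced))
        if score < best_score:
            best_name, best_score = name, score
    # reject anything that doesn't meet your required threshold
    return best_name if best_score <= threshold else None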
If you're using constant measuring:
In your buffer, treat the sequence of max_sequence_size (you decide) beginning at every multiple of frame_multiples (you decide) as a possible gesture. For instance, if all of my possible gestures are at most 20 frames long, and I believe a new gesture could be starting every 5 frames (and I won't lose any critical data in those 5 frames), I'll compare each portion of the buffer to all possible gestures (portions from 0-19, 5-24, 10-29, etc.). Computation gets heavier as frame_multiples decreases. For perfect measurement, frame_multiples is 1 (but this is likely not reasonable).
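A rough sketch of that buffer handling in Python (max_sequence_size and frame_multiples as above; recognize() stands in for whatever matcher you use, e.g. the user-driven sketch earlier):

from collections import deque

MAX_SEQUENCE_SIZE = 20   # longest gesture, in frames (you decide)
FRAME_MULTIPLES = 5      # a new gesture may start every 5 frames (you decide)

buffer = deque(maxlen=2 * MAX_SEQUENCE_SIZE)   # 1.5-2x the longest gesture, to be cautious

def on_frame(point, recognize):
    buffer.append(point)
    # treat each window starting at a multiple of FRAME_MULTIPLES as a possible gesture
    for start in range(0, len(buffer) - MAX_SEQUENCE_SIZE + 1, FRAME_MULTIPLES):
        window = list(buffer)[start:start + MAX_SEQUENCE_SIZE]
        hit = recognize(window)
        if hit is not None:
            buffer.clear()   # gesture consumed; execute its result and start fresh
            return hit
    return None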
Hope you've enjoyed reading this answer as much as I enjoyed writing it. I've never done this before, but you've piqued my interest in a way that doesn't often happen. Please edit and improve my answer! If there's a portion that seems incomplete, add to it. I'm very curious in (particularly, more-experienced) responses and criticism.
I am using word2vec (and doc2vec) to get embeddings for sentences, but I want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana', 'carrot', 'dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass', 'copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'dates')) before it eventually gets to the correct ('apple', ('banana', 'carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than 'zucchini',
since there are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled in inverse proportion to the distance from the target word to the context word. This also causes an issue, making nearby words seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs. If for some reason it isn't, another way to essentially cancel out any nearness effects could be to ensure that each corpus item's word order is re-shuffled between the times it's accessed as training data. That ensures any nearness advantages will be mixed evenly across all words – especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even a one-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the co-occurrences would be sampled in the right proportions even with small windows.)
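For what it's worth, here's a minimal gensim sketch of the giant-window plus one-time-shuffle idea (parameter names as in gensim 4.x; older versions use size instead of vector_size, and all the values are illustrative):

import random
from gensim.models import Word2Vec

sentences = [
    ['apple', 'banana', 'carrot', 'dates', 'elderberry', 'zucchini'],
    ['aluminium', 'brass', 'copper', 'zinc'],
]

# one-time in-place shuffle of each sentence, so no word systematically
# sits nearer to one neighbour than another
for s in sentences:
    random.shuffle(s)

model = Word2Vec(
    sentences,
    vector_size=100,
    window=1_000_000,   # vastly larger than any sentence
    min_count=1,
    sg=1,               # skip-gram; CBOW (sg=0) behaves the same way here
)
print(model.wv.similarity('apple', 'zucchini'))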
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)
I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One is marked by red rectangles, whereas the other, which has much larger amplitudes (this is coherent noise), is marked by a black rectangle. I am interested in the first type. The second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using "'corrcoef' and moving the window, I get the last plot which is the correct one.
There maybe a problem of normalization when using 'xcorr', but I don't know.
I can think of two ways to speed things up.
1) Make your template 1024 elements long. Suddenly, the correlation can be done using the FFT, which is significantly faster than a naive DFT or element-by-element multiplication at every position (a rough sketch follows after these suggestions).
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
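To make point 1 concrete, here's roughly what the FFT-based sliding correlation looks like. I'm sketching it in Python/NumPy for brevity, but the same few lines translate to MATLAB; it reproduces the per-window corrcoef values in one shot rather than in a loop:

import numpy as np
from scipy.signal import fftconvolve

def sliding_corr(signal, template):
    # Pearson correlation of the template against every window of the signal,
    # computed with FFT-based convolutions instead of an explicit loop.
    x = np.asarray(signal, dtype=float)
    t = np.asarray(template, dtype=float)
    m = len(t)
    tz = t - t.mean()
    # numerator: windowed dot product with the zero-mean template
    num = fftconvolve(x, tz[::-1], mode='valid')
    # windowed sums of x and x^2 give each window's variance
    ones = np.ones(m)
    s1 = fftconvolve(x, ones, mode='valid')
    s2 = fftconvolve(x * x, ones, mode='valid')
    var = np.maximum(s2 - s1 ** 2 / m, 0)
    denom = np.sqrt(var) * np.sqrt((tz ** 2).sum()) + 1e-12   # avoid division by zero
    return num / denom

# e.g. matches = np.flatnonzero(sliding_corr(long_signal, template) > 0.9)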
You might find this link interesting reading.
Let us know if any of the above helps!
Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason to use any solution other than xcorr. A few technical comments:
xcorr assumes that the mean was removed from the two cross-correlated signals. Furthermore, by default it does not scale the signals' standard deviations. Both of these issues can be solved by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal should produce results equivalent with your sliding window method.
xcorr's output is ordered according to lags, which can be obtained as a second output argument ([c,lags] = xcorr(...)). Always plot xcorr results with plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses the Discrete Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (its name stands for correlation coefficient; no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.
iOS has an issue recording through some USB audio devices. It cannot be reliably reproduced (happens every 1 in ~2000-3000 records in batches and silently disappears), and we currently manually check our audio for any recording issues. It results in small numbers of samples (1-20) being shifted by a small number that sounds like a sort of 'crackle'.
They look like this:
closer:
closer:
another, single sample error elsewhere in the same audio file:
The question is, how can these be detected algorithmically (assuming direct access to samples) whilst not triggering false positives on high-frequency audio with waveforms like this:
Bonus points: after determining as many errors as possible, how can the audio be 'fixed'?
Dirty audio file - pictured
Another dirty audio file
Clean audio with valid high frequency - pictured
More bonus points: what could be causing this issue in the iOS USB audio drivers/hardware (assuming it is there).
I do not think there is an out of the box solution to find the disturbances, but here is one (non standard) way of tackling the problem. Using this, I could find most intervals and I only got a small number of false positives, but the algorithm could certainly use some fine tuning.
My idea is to find the start and end point of the deviating samples. The first step should be to make these points stand out more clearly. This can be done by taking the logarithm of the data and taking the differences between consecutive values.
In MATLAB I load the data (in this example I use dirty-sample-other.wav)
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
and use the following code:
logdata = log(1+data);
difflogdata = diff(logdata);
So instead of this plot of the original data:
we get:
where the intervals we are looking for stand out as a positive and negative spike. For example zooming in on the largest positive value in the plot of logarithm differences we get the following two figures. One for the original data:
and one for the difference of logarithms:
This plot could help with finding the areas manually, but ideally we want to find them using an algorithm. The way I did this was to take a moving window of size 6, compute the mean value of the window (of all points except the minimum value), and compare this to the maximum value. If the maximum point is the only point that is above the mean value and at least twice as large as the mean, it is counted as a positive extreme value.
I then used a threshold on the counts: at least half of the windows moving over the value should detect it as an extreme value in order for it to be accepted.
Multiplying all points by (-1), the algorithm is then run again to detect the minimum values.
Marking the positive extremes with "o" and negative extremes with "*" we get the following two plots. One for the differences of logarithms:
and one for the original data:
Zooming in on the left part of the figure showing the logarithmic differences we can see that most extreme values are found:
It seems like most intervals are found and there are only a small number of false positives. For example running the algorithm on 'clean-highfreq.wav' I only find one positive and one negative extreme value.
Single values that are falsely classified as extreme values could perhaps be weeded out by matching start and end points. And if you want to replace the lost data, you could use some kind of interpolation using the surrounding data points; perhaps even a linear interpolation will be good enough.
Here is the MATLAB-code I used:
function test20()
clc
clear all
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
logdata = log(1+data);
difflogdata = diff(logdata);
figure,plot(data),hold on,plot(data,'.')
figure,plot(difflogdata),hold on,plot(difflogdata,'.')
figure,plot(data),hold on,plot(data,'.'),xlim([68000,68200])
figure,plot(difflogdata),hold on,plot(difflogdata,'.'),xlim([68000,68200])
k = 6;                              % window parameter used by findPoints
myData = difflogdata;
myPoints = findPoints(myData,k);    % positive extremes
myData2 = -difflogdata;
myPoints2 = findPoints(myData2,k);  % negative extremes (flip the sign and reuse)
figure
plotterFunction(difflogdata,myPoints>=k,'or')
hold on
plotterFunction(difflogdata,myPoints2>=k,'*r')
figure
plotterFunction(data,myPoints>=k,'or')
hold on
plotterFunction(data,myPoints2>=k,'*r')
end
function myPoints = findPoints(myData,k)
% Slide a window of k+1 samples over the data and count, for each sample,
% how many windows flag it as the single outstanding maximum.
iterationVector = k+1:length(myData);
myPoints = zeros(size(myData));
for i = iterationVector
    subVector = myData(i-k:i);                                  % current window
    meanSubVector = mean(subVector(subVector>min(subVector)));  % mean of all points except the minimum
    [maxSubVector, maxIndex] = max(subVector);
    if (sum(subVector>meanSubVector) == 1 && maxSubVector>2*meanSubVector)
        % only the maximum lies above the mean, and by at least a factor of two
        myPoints(i-k-1+maxIndex) = myPoints(i-k-1+maxIndex) +1;
    end
end
end
function plotterFunction(allPoints,extremeIndices,markerType)
% Overlay the detected extreme points (markerType) on top of the full signal.
extremePoints = NaN(size(allPoints));
extremePoints(extremeIndices) = allPoints(extremeIndices);
plot(extremePoints,markerType,'MarkerSize',15),
hold on
plot(allPoints,'.')
plot(allPoints)
end
Edit - comments on recovering the original data
Here is a slightly zoomed out view of figure three above: (the disturbance is between 6.8 and 6.82)
When I examine the values, your theory about the data being mirrored to negative values does not seem to fit the pattern exactly. But in any case, my thought about just removing the differences is certainly not correct. Since the surrounding points do not seem to be altered by the disturbance, I would probably go back to the original idea of not trusting the points within the affected region and instead using some sort of interpolation using the surrounding data. It seems like a simple linear interpolation would be a quite good approximation in most cases.
To answer the question of why it happens -
A USB audio device and host are not clock synchronous - that is to say that the host cannot accurately recover the relationship between the host's local clock and the word-clock of the ADC/DAC on the audio interface. Various techniques do exist for clock-recovery with various degrees of effectiveness. To add to the problem, the bus clock is likely to be unrelated to either of the two audio clocks.
Whilst you might imagine this not to be too much of a concern for audio receive - audio capture callbacks could happen when there is data - audio interfaces are usually bi-directional and the host will be rendering audio at regular intervals, which the other end is potentially consuming at a slightly different rate.
In-between are several sets of buffers, which can over- or under-run, which is what looks to be happening here; the interval between it happening certainly seems about right.
You might find that changing USB audio device to one built around a different chip-set (or, simply a different local oscillator) helps.
As an aside, both IEEE 1394 audio and MPEG transport streams have the same clock-recovery requirement. Both of them solve the problem by embedding a local clock reference packet into the serial bitstream in a very predictable way, which allows accurate clock recovery on the other end.
I think the following algorithm can be applied to samples in order to determine a potential false positive:
First, scan for a high amount of high-frequency content, either by FFT'ing the sound block by block (256 values, maybe), or by counting the consecutive samples above and below zero. The latter should keep track of the maximum run of consecutive samples above zero, the maximum run below zero, the number of small transitions around zero, and the current volume of the block (0..1 as Audacity displays it). Then treat the block as dominated by high frequency if the maximum run is below 5 (at a 44100 Hz sampling rate, with zeroes being consecutive while outstanding samples are single, a run of 5 corresponds to 4410 Hz, which is pretty high), or if the sum of the small transitions' lengths is above a certain value depending on the maximum run (I believe a first approximation would be 3*5*block size/distance between two maximums, which roughly equates to the period of the loudest FFT frequency). Also, this should be measured both above and below the threshold, as we can end up with an erroneous peak, which will likely show up as a difference between the main tempo measured on below-zero versus above-zero maximums, and also in the std-dev of the peaks. If high frequency is dominant, this block is eligible only for zero-value testing, and a special means to repair the data will be needed. If high frequency is merely significant, that is, there is a dominant low frequency detected, we can search for peaks bigger than 3.0*high-frequency volume, as well as for abnormal zeroes in this block.
Also, your gaps seem to be either large excursions or plain zero, with the large excursions being single-sample errors and the zero errors ranging from 1-20 samples. So, if there is a zero range with values under 0.02 in absolute value, which is directly surrounded by values of 0.15 (a variable to be fine-tuned) or higher absolute value AND of the same sign, count this as an error. Single values that stand out can be detected by calculating 2.0*(current sample) - (previous sample) - (next sample); if that is above a certain threshold (0.1 + high-frequency volume, or 3.0*high-frequency volume, whichever is bigger), count this as an error and average it out.
What to do with zero gaps found - we can copy values from 1 period backwards and 1 period forwards (averaging), where "period" is that of the most significant frequency in the block's FFT. If the "period" is smaller than the gap (say we've detected a gap of zeroes in a high-pitched part of the sound), use two or more periods, so the source data will all be valid (in this case, no averaging can be done, as it's possible that the signal 2 periods forward from the gap and 2 periods back will be in counterphase). If there is more than one frequency of about equal amplitude, we can simply sample those frequencies with the correct phases, cutting the rest of the less significant frequencies altogether.
The outstanding samples should, IMO, just be replaced by an average of the 2-4 surrounding samples, as there only ever seems to be a single such sample at a time in your sound files.
The discrete wavelet transform (DWT) may be the solution to your problem.
An FFT calculation is not very useful in your case, since it's an average representation of the relative frequency content over the entire duration of the signal, and thus momentary changes are impossible to detect. The discrete short-time Fourier transform (STFT) tries to tackle this by computing the DFT for short consecutive time-blocks of the signal, the length of which is determined by the length (and shape) of a window. But since the resolution of the DFT depends on the data/block length, there is a trade-off between resolution in frequency OR in time, and finding this magical fixed window size can be tricky!
What you want is a time-frequency analysis method with good time resolution for high-frequency events, and good frequency resolution for low-frequency events... Enter the discrete wavelet transform!
There are numerous wavelet transforms for different applications and, as you might expect, it's computationally heavy. The DWT may not be a practical solution to your problem, but it's worth considering. Good luck with your problem. Some Friday-evening reading:
http://klapetek.cz/wdwt.html
http://etd.lib.fsu.edu/theses/available/etd-11242003-185039/unrestricted/09_ds_chapter2.pdf
http://en.wikipedia.org/wiki/Wavelet_transform
http://en.wikipedia.org/wiki/Discrete_wavelet_transform
You can try the following super-simple approach (maybe it's enough):
Take each point in your wave-form and subtract its predecessor (look at the changes from one point to the next).
Look at the distribution of these changes and find their standard deviation.
If any given difference is beyond X times this standard deviation (either above or below), flag it as a problem.
Determine the best value for X by playing with it and seeing how well it performs.
Most "problems" should come as a pair of two differences beyond your cutoff, one going up, and one going back down.
To stick with the super-simple approach, you can then fix the data by just interpolating linearly between the last good point before your problem-section and the first good point after. (Make sure you don't just delete the points as this will influence (raise) the pitch of your audio.)
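A rough NumPy sketch of that approach (the cutoff X and everything else here is a guess to be tuned against your files; samples is assumed to be a 1-D NumPy array):

import numpy as np

def find_and_fix(samples, x_cutoff=6.0):
    d = np.diff(samples)
    # flag any sample touched by a step larger than x_cutoff standard deviations
    bad_step = np.abs(d) > x_cutoff * d.std()
    bad = np.zeros(len(samples), dtype=bool)
    bad[:-1] |= bad_step
    bad[1:] |= bad_step
    good = ~bad
    # interpolate linearly across the flagged region instead of deleting it,
    # so the length (and therefore the pitch) of the audio is unchanged
    fixed = samples.astype(float)
    fixed[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(good), samples[good])
    return fixed, bad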
Short Version:
I'm looking for a technique to keep nearly-sorted data in nearly-sorted order over time, despite the values changing slightly.
Here's the scenario:
In the world of 3D graphics, it is often beneficial to order your objects from front-to-back before drawing. As your scene changes or your view of the scene changes, this data may require re-sorting, however it will usually be very close to the sorted order (i.e. it won't change very much between frames). It's also not critical that the data be exactly in sorted order. The worst thing that will happen is that a polygon will be rendered and then completely hidden. It's a small performance hit, but not the end of the world.
With this in mind, is it possible to sort the data once ahead of time and then apply a minimal patch to the data once per frame to ensure that the data stays mostly sorted? In this scenario, the data would be considered mostly sorted if most of the objects were in ascending order. That is, 1 object that is 10 steps away from its proper location is much better (10x better) than 10 objects that are 1 step away from their proper location.
It's also worth noting that the data could continue to be patched on a semi regular basis, as the data is typically rendered 30 times per second (or so). As long as the calculation was efficient, it could continue to be done over time until the changes stop and the list was completely sorted.
Existing Idea:
My knee jerk reaction to this problem is:
Apply an n log n sort to the data when it is loaded, and on large changes (which I can track pretty easily).
When the data starts changing slowly (e.g. when the scene is rotated), apply a single (linear) pass of some sort on the data to swap backwards neighbors and try to maintain sort order (I think this is basically shell sort - maybe there is a better algorithm to use for this single pass).
Keep doing a single pass of the partial sort each frame until the changes stop and the data is completely sorted
Go back to step 2 and wait for more changes.
There are a variety of sorts that run in O(n) time if the input is mostly sorted, and O(n log n) if the data is not sorted. It sounds like you can use that pretty easily. Timsort is one such sort and, I believe, is the default sort now in both python and java. Smoothsort is another one that is fairly easy to implement yourself.
From your description it sounds like the sort order changes without you changing the data itself. E.g. you change the camera, so the sort order should change, even though you have not modified any polygons.
If so, you can't detect sort order changes directly when they happen. If you could, I would create buckets for the list of polygons, and resort buckets when 'enough' polygons in that bucket have been touched.
But I'm betting your system doesn't work that way. The sort is determined by the view port. In that case polygons at the front of the sort matter much more than ones at the end.
So I'd segment the poly list into fifths or something like that, front to back, so that the first fifth is the part closest to the camera. I'd completely sort the first segment every frame. I'd divide the second segment into sub-segments - say 5 again - and sort each sub-segment every frame, such that every 5 frames the second fifth is completely sorted. Segment the third through fifth segments into 15 sub-segments and do those every 5 frames each, such that the rest get sorted completely every 75 frames. At 60 fps you'd have the display list completely re-sorted a little more than once per second.
The nice thing about prioritizing the front of the list is:
1. Polys at the front are going to tend to be larger on the screen, and will fail the depth test more often. Bad orders at the end of the list will more often than not just not matter.
2. The front of the list is more susceptible to sort changes due to camera changes.
Also, choose those segment ranges with a little overlap, so that polygons can migrate to their correct segment within 2 sorts.
#OP: Thinking about it a little more. You are probably more concerned with having the sorting cost stay bounded - instead of exploding with scene complexity. Especially since a very complex scene should - surprisingly - be less susceptible to bad sorts (because generally the polys get smaller).
You could define a fixed amount of sorting you are willing to do per frame. Use say 50% of the budget for as much of the front of the list as you can afford, 25% of the budget to sort the next region and 25% to spend equally on the rest.
Say you budget 1000 polys sorted per frame, and you have 10000 polys in the scene. Sort the first 500 polys every frame. Sort 250 polys every tenth frame for the next region. So 501-750 on frame 1, 751-1000 on frame 2 etc. And then divide the rest of the list into 250 frame segments and sort them round robin for however many frames you need to.
This keeps the sorting cost fixed as the scene gets more or less complex, and it is easy to tune: you just adjust the sorting budget to what you can afford.
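As a sketch of that fixed-budget, round-robin scheme (Python; the region sizes are just the illustrative numbers from above, and depth is whatever sort key you use):

def budgeted_sort(polys, depth, frame, front=500, slice_size=250, mid_slices=10):
    # front region: fully sorted every frame
    polys[:front] = sorted(polys[:front], key=depth)

    # middle region: one slice per frame, cycling so it's fully sorted every mid_slices frames
    lo = front + (frame % mid_slices) * slice_size
    polys[lo:lo + slice_size] = sorted(polys[lo:lo + slice_size], key=depth)

    # tail: one slice per frame, round-robin over whatever remains
    rest = front + mid_slices * slice_size
    n_slices = max(1, -(-(len(polys) - rest) // slice_size))   # ceiling division
    lo = rest + (frame % n_slices) * slice_size
    polys[lo:lo + slice_size] = sorted(polys[lo:lo + slice_size], key=depth)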
I'll suggest a solution that borrows from a number of others here. Of course we start with a full sort of the objects on initialisation.
What I would do is always perform, say, 10 linear-time runs over your objects for every frame (with early termination if you find out that your objects are already completely sorted). Each run can be, say, one pass of bubble sort with a shell sort-style gap over the whole array: for all i from 0 to n-gap-1, compare A[i] and A[i+gap], and exchange them if they are not sorted. You can use a fixed sequence of gaps, or maybe better, let it vary between frames; either way, if you do sufficiently many frames where the objects do not change, you'll have a fully sorted sequence. You could even mix different types of sub-algorithms to do your runs, as long as each iteration improves the 'sortedness'.
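One such run might look like this (a single gapped bubble pass, in Python; returning whether anything moved lets you detect the fully sorted case and stop early):

def gapped_pass(objects, gap, key):
    # one pass of bubble sort with a shell-sort-style gap; returns True if anything moved
    swapped = False
    for i in range(len(objects) - gap):
        if key(objects[i]) > key(objects[i + gap]):
            objects[i], objects[i + gap] = objects[i + gap], objects[i]
            swapped = True
    return swapped

# per frame, for example:
# for gap in (7, 5, 3, 1, 1, 1, 1, 1, 1, 1):
#     if not gapped_pass(scene, gap, key=depth) and gap == 1:
#         break   # nothing moved on a gap-1 pass: already fully sorted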
You can add Rafael Baptista's idea of prioritizing the front of the scene easily by doing one extra run on the front segment, or choosing to divide the gap by two for the front half, or something like that.
It doesn't work out as neatly as the problem you've supposed because all you have to do is turn the camera 90 degrees and the basis for being sorted is on a different axis entirely. (X and Y axis are independent, for example -- looking down the X axis will cause the sort order to not rely on the X axis, and looking down the Y axis will cause the sort order to not rely on the Y axis.) Even a 5 degree turn can cause far away "close" (as far as Z-order is concerned) things to be suddenly "far".
Let's be honest -- generating the draw calls for the objects is normally going to take much more time than sorting them, especially if you have an optimized sorting algorithm for your scenario and your game is of modern visual complexity.
Sorting can be practically O(n), especially with histogram-based algorithms or radix-style algorithms. (Yes, radix sort applies to integers, so you'd have to scale your world coordinates to integers, but normally that's more than good enough unless you have a gigantic world.)
That being said, since you're already doing O(n) ops for everything you're drawing, resorting per frame isn't going to be a huge problem, especially with both high and low level optimization.
Another common way of addressing this issue is with a scene graph, but for your purposes it ends up essentially being a re-sort per frame. However, you can build frustum culling, shadow culling, and level of detail calculations into the scene graph traversal.
If you're looking for approximations, instead of doing a z-distance sort do a true distance sort and update the sort order more often for close by objects and less often for further objects (depending on distance the camera has traveled). This can work because if you're further away from an object, moving doesn't cause the angle to the viewer to change as often which, in turn, means the old sorting data is more likely to be valid. I'm not a fan of this because I like algorithms which allow my game to teleport across the map without any issues. (Mind you, streaming assets from disk becomes the real issue for teleporting.)
Shell sort is good for lists with few unique values and some scenarios that "need short code and do not use the call stack".
In your case, you need something called an adaptive sort, which means an algorithm that "takes advantage of existing order in its input".
If your space is tight, you can just use Straight Insertion Sort, which is adaptive and in place.
Otherwise you can try Timsort and Smoothsort as #RunningWild suggested, they are both adaptive sort algorithms.
Background: I have video clips and audio tracks that I want to sync with said videos.
From the video clips, I'll extract a reference audio track.
I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.
I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious. So I want to be able to determine these locations programmatically.
Example:
     0         1         2
     012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)
Output:
[(2, 6), (5, 9),      # part1
 (13, 17), (14, 18)]  # part2
My idea is, starting from the beginning:
Fingerprint 2 large chunks* of audio and see if they match:
If yes: move on to the next chunk
If not:
Go down both tracks looking for the first non-silent portion of each
Offset the target to match the original
Go back to the beginning of the loop
# * chunk size determined by heuristics and modifiable
The main problem here is that sound matching and fingerprinting are fuzzy and relatively expensive operations.
Ideally I want to do them as few times as possible. Ideas?
Sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something you can quickly understand and that just works. If you're willing to go with something more complex, see here for a very good reference.
That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.
Choose some number of samples as a chunk size and march through your reference audio data at a regular interval. Calculate the zero-crossing measure for each chunk (you likely want a logarithm, or a fast approximation thereof, of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.
Then march through your actual audio data at a much finer step. (It probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measures for the entire chunk -- just subtract out the zero-crossings no longer in the chunk and add in the new ones that are. (You'll still need to compute the logarithm or approximation thereof.)
Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at -all- chunks. In fact, we don't want to since we're far more likely to get false positives.
If the chunk matches well enough, see if it matches all the way out to silence.
The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgive a strict window of approximation. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time -- essentially two binary searches through a search structure.
The disadvantage to all of this is it may require some tweaking to get right and isn't a proven method.
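Here's a small sketch of the chunk measure itself, the log of a zero-crossing count per chunk of the reference track (Python/NumPy; chunk size and step are up to you):

import numpy as np

def zc_measure(chunk):
    # log(1 + zero-crossing count): a cheap, roughly frequency-like measure of a chunk
    crossings = np.count_nonzero(np.signbit(chunk[:-1]) != np.signbit(chunk[1:]))
    return np.log1p(crossings)

def reference_chunks(samples, chunk_size, step):
    # (start_position_in_samples, measure) pairs for the reference track
    return [(i, zc_measure(samples[i:i + chunk_size]))
            for i in range(0, len(samples) - chunk_size + 1, step)]

When marching the target track you'd keep a running crossing count, subtracting the crossings that leave the window and adding the ones that enter, rather than recomputing each chunk from scratch.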
If you can reliably distinguish silence from non-silence as you suggest and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before:
ref: --part1part2--
syn: ---part1---part2----
If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize "pa" matches "pa" and "rt" matches "rt" but for the third chunk it must recognize the silence in syn and adapt the chunk size to compare "1" to "1" instead of "1p" to "1-".
For more complicated edits, you might be able to adapt a weighted shortest-edit-distance algorithm where removing silence has zero cost.
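To illustrate that last suggestion, here's a toy weighted edit distance over the character-level example from the question, where dropping a silence symbol costs nothing; for real audio you'd run it over chunk fingerprints rather than characters:

def silence_tolerant_distance(ref, syn, silence='-'):
    # classic edit-distance DP, except that dropping a silence symbol is free
    def drop_cost(sym):
        return 0 if sym == silence else 1

    n, m = len(ref), len(syn)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + drop_cost(ref[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + drop_cost(syn[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + drop_cost(ref[i - 1]),           # skip a ref symbol
                          d[i][j - 1] + drop_cost(syn[j - 1]),           # skip a syn symbol
                          d[i - 1][j - 1] + (ref[i - 1] != syn[j - 1]))  # match / substitute
    return d[n][m]

print(silence_tolerant_distance('--part1------part2------',
                                '-----part1----part2-----'))   # 0: only silence differs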