What is TimeHash and how does it work?

I am writing the report for my final-year project.
I used the timehash algorithm in my project, but unlike geohash, I couldn't find a clear explanation of timehash and how it works.

From my GitHub project timehash:
timehash is an algorithm (with multiple reference implementations) for calculating variable precision sliding windows of time. When performing aggregations and correlations on large-scale data sets, the ability to convert precise time values into 'malleable intervals' allows for many novel analytics.
Using sliding windows of time is a common practice in data analysis but prior to the timehash algorithm it was more of an art than a science.
convert epoch milliseconds into an interval of time, depicted by an ASCII character 'hash' (a 'timehash')
timehash values are well suited to referencing time intervals in key-value stores (e.g. HBase, Accumulo, Redis)
The creation of a compound key of space and time (e.g. geohash_timehash) is a powerful primitive for understanding geotemporal patterns
https://github.com/abeusher/timehash
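For intuition, here is a toy Python sketch of the underlying idea: repeatedly bisect an epoch range and emit one character per split, so nearby timestamps share hash prefixes and longer hashes mean finer intervals. The alphabet, bounds, and function name are illustrative assumptions; see the repository above for the real algorithm and its alphabet.

    # Toy timehash-style encoding, NOT the library's actual API or alphabet.
    ALPHABET = "01"  # hypothetical 1-bit-per-character alphabet for clarity

    def toy_timehash(epoch_seconds, precision, lo=0.0, hi=2.0 ** 32):
        chars = []
        for _ in range(precision):      # each character halves the interval
            mid = (lo + hi) / 2.0
            if epoch_seconds >= mid:
                chars.append(ALPHABET[1])
                lo = mid
            else:
                chars.append(ALPHABET[0])
                hi = mid
        return "".join(chars)

    # Nearby timestamps share a long common prefix, which is what makes
    # range scans over geohash_timehash keys in a key-value store cheap.
    print(toy_timehash(1600000000, 20))
    print(toy_timehash(1600000001, 20))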

What type of algorithm should I use for forecasting with only very little historic data?

The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household during the next 24 hours, with a time resolution of a few minutes for the next three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historic data, since I want the algorithm to be usable in different settings from the beginning. I only have very basic input to begin with, such as the assumed yearly heat demand, the current outside temperature, and the time. So it will be quite general and imprecise at the beginning, but it should learn from its errors over time.
The algorithm is asked to be implemented in Matlab if possible.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time by learning from and adapting to incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is quite general, at least once its generalizations to ensemble filters etc. are taken into account (it is also easy to implement in MATLAB).
https://en.wikipedia.org/wiki/Kalman_filter
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario you could start with a simple prediction from past values and then add daily rhythms, influences of outside temperature, and so on. The more (correct) information you put into your model a priori, the better your model should be at prediction.
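As a concrete starting point, here is a minimal scalar Kalman filter sketch. The question asks for MATLAB, but the predict/update equations are language-independent; the model structure and every number below are illustrative assumptions, not a validated heat-demand model.

    def kalman_step(x, P, z, temp, a=0.9, b=-0.05, Q=0.1, R=0.5):
        """One predict/update cycle for a scalar heat-demand state.
        x, P: previous estimate and its variance; z: new observation;
        temp: outside temperature (b < 0: colder means more demand)."""
        # Predict: demand persists (factor a) plus a temperature effect
        x_pred = a * x + b * temp
        P_pred = a * a * P + Q
        # Update: blend the prediction with the new observation
        K = P_pred / (P_pred + R)              # Kalman gain
        return x_pred + K * (z - x_pred), (1.0 - K) * P_pred

    # Start vague (large variance P); observations sharpen the estimate.
    x, P = 10.0, 100.0
    for z, temp in [(12.0, 5.0), (11.5, 4.0), (13.0, 2.0)]:
        x, P = kalman_step(x, P, z, temp)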
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
If you go for Python, I suggest using OpenTURNS, which provides several data-assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html

Bring Word2Vec models efficiently into Production Service

This is kind of a long shot, but I am hoping that someone has been in a similar situation, as I am looking for advice on how to efficiently bring a set of large word2vec models into a production environment.
We have a range of trained w2v models with a dimensionality of 300. Due to the underlying data (a huge corpus with POS-tagged words; specialized vocabularies of up to 1 million words), these models became quite large, and we are currently looking into effective ways to expose them to our users without paying too high a price in infrastructure.
Besides trying to better control the vocabulary size, obviously, dimensionality reduction on the feature vectors would be an option. Is anyone aware of publications around that, particularly on how this would affect model quality, and how to best measure this?
Another option is to pre-calculate the top X most similar words for each vocabulary word and provide a lookup table. With the model being that big, this is currently also very inefficient. Are there any known heuristics that could be used to reduce the number of necessary distance calculations from n × (n−1) to a lower number?
Thank you very much!
There are pre-indexing techniques for similarity search in high-dimensional spaces which can speed up nearest-neighbor discovery, but usually at a cost in absolute accuracy. (They also need more memory for the index.)
An example is the ANNOY library. The gensim project includes a demo notebook showing its use with Word2Vec.
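For illustration, here is a minimal sketch using the ANNOY library directly; the vocabulary and vectors are random placeholders standing in for a real 300-dimensional model.

    import numpy as np
    from annoy import AnnoyIndex  # pip install annoy

    dim = 300
    words = ["king", "queen", "apple"]     # hypothetical vocabulary
    vectors = np.random.rand(len(words), dim).astype("float32")

    index = AnnoyIndex(dim, "angular")     # angular distance ~ cosine
    for i, vec in enumerate(vectors):
        index.add_item(i, vec.tolist())
    index.build(50)                        # more trees: slower build, better recall

    # Approximate top-2 neighbors of "king": fast, but not guaranteed exact
    print(index.get_nns_by_item(words.index("king"), 2))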
I once did some experiments using just 16-bit (rather than 32-bit) floats in a Word2Vec model. It saved memory in the idle state, and nearest-neighbor top-N results were nearly unchanged. But, perhaps because some behind-the-scenes up-conversion to 32-bit floats was still occurring during the one-against-all distance-calculations, speed of operations was actually reduced. (And this suggests that each distance-calculation may have caused a temporary memory expansion offsetting any idle-state savings.) So it's not a quick fix, but further research here – perhaps involving finding/implementing the right routines for float16 array operations – could maybe mean 50% model-size savings and equivalent or even better speed.
For many applications, discarding the least-frequent words doesn't hurt much – or even, when done before training, can improve the quality of the remaining vectors. As many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can discard the tail-end of the array to save memory, or limit most_similar() searches to the first-N entries to speed calculations.
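With gensim this is essentially a one-liner; a sketch, assuming a KeyedVectors file saved earlier (the path and query word are hypothetical; API per gensim 4.x):

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load("vectors.kv")   # hypothetical path
    # Vectors are stored most-to-least frequent, so restrict_vocab=100000
    # limits the one-against-all sweep to the 100k most frequent words.
    results = kv.most_similar("example", topn=10, restrict_vocab=100000)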
Once you've minimized the vocabulary size, you want to be sure the full set is in RAM, and no swapping is triggered during the (typical) full-sweep distance-calculations. If you need multiple processes to serve answers from the same vector set, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding gensim Word2Vec loading time.
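A minimal sketch of that memory-mapped loading (same hypothetical file as above): each process maps the same on-disk array instead of loading its own copy.

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load("vectors.kv", mmap="r")  # read-only, shared pages
    kv.most_similar("example")  # first call pages the vectors into RAM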
Finally, while precomputing top-N neighbors for a larger vocabulary is both time-consuming and memory-intensive, if your pattern of access is such that some tokens are checked far more than others, a cache of the N most-recently or M most-frequently requested top-N could improve perceived performance a lot – making only less-frequently-requested neighbor-lists require the full distance calculations to every other token.
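A sketch of such a cache, here simply wrapping the lookup in Python's functools.lru_cache (cache size, path, and word are illustrative):

    from functools import lru_cache
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load("vectors.kv")       # hypothetical path

    @lru_cache(maxsize=10000)                  # keep the 10k hottest tokens
    def top_neighbors(word, topn=10):
        # tuple() makes the result immutable, safe to share from the cache
        return tuple(kv.most_similar(word, topn=topn))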

Classifying Multivariate Time Series

I am currently working on a time series with 430 attributes and approx. 80k instances. I would like to binary-classify each instance (not the whole time series). Everything I found about classifying time series talked about labeling the whole thing.
Is it possible to classify each instance with something like a SVM completely disregarding the sequential nature of the data or would that only result in a really bad classifier?
Which other options are there which classify each instance but still look at the data as a time series?
If the data is labeled, you may have luck by concatenating the attributes together, so each instance becomes a single long time series, and by applying the so-called shapelet transform. This results in a vector of values for each time series, which can be fed into an SVM, a random forest, or any other classifier. Picking the right shapelets may allow you to focus on a single attribute when classifying instances.
If it is not labeled, you may first try unsupervised shapelets to explore your data, and proceed with the aforementioned shapelet transform afterwards.
It certainly depends on the data within the 430 attributes, their data types, and especially the problem you want to solve.
In time series analysis, you usually want to exploit the dependencies between neighboring points, i.e., how they change in time. The examples you find in books usually deal with a single function f(t): Time -> Real. If I understand correctly, you want to focus just on the dependencies among the 430 attributes (vertical dependencies) and disregard the horizontal (temporal) dependencies.
If I were you, I would first train multiple classifiers (SVM, maximum entropy model, multi-layer perceptron, random forest, probabilistic neural network, ...) and compare their prediction performance on your problem.
For training, you can start by feeding all 430 attributes as features to a maxent classifier (it can easily handle millions of features).
You also need to perform N-fold cross-validation to check that the classifiers are not overfitted. Then pick the best one that solves your problem "good enough".
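For illustration, a minimal scikit-learn sketch of this train-and-compare workflow; the data is a random placeholder for the real 80k x 430 matrix, and LogisticRegression stands in for the maxent classifier (they are the same model family):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = np.random.rand(1000, 430)              # placeholder attributes
    y = np.random.randint(0, 2, size=1000)     # placeholder binary labels

    models = {
        "svm": make_pipeline(StandardScaler(), SVC()),
        "maxent": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "random forest": RandomForestClassifier(n_estimators=200),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
        print(name, scores.mean(), "+/-", scores.std())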
Other ideas if this approach does not perform well:
include features from t-1, t-2...
perform feature selection by trying different subsets of features
derive new time series such as moving averages, wavelet spectra, etc. and use them as new features (see the sketch after this list)
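A sketch of such derived features with pandas (column names are made up; repeat per attribute as needed):

    import pandas as pd

    df = pd.DataFrame({"attr1": range(10)})
    df["attr1_lag1"] = df["attr1"].shift(1)                 # value at t-1
    df["attr1_lag2"] = df["attr1"].shift(2)                 # value at t-2
    df["attr1_ma5"] = df["attr1"].rolling(window=5).mean()  # moving average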
A nice implementation of a maxent classifier can be found in OpenNLP.

What are the differences between genetic algorithms and evolution strategies?

I've read a couple of introductory sections of books as well as a few papers on both topics, and it looks to me that these two methods are pretty much exactly the same. That said, I haven't had the time to actually deeply research the topics yet, so I might be wrong.
What are the distinctions between genetic algorithms and evolution strategies? What makes them different, and where are they similar?
In evolution strategies, individuals are coded as vectors of real numbers. On reproduction, parents are selected randomly, and the fittest offspring are selected and inserted into the next generation. ES individuals are self-adapting: the step size or "mutation strength" is encoded in the individual, so good parameters get to the next generation by selecting good individuals.
In genetic algorithms, individuals are coded as integers. Selection is done by picking parents proportionally to their fitness, so individuals must be evaluated before the first selection is done. Genetic operators work on the bit level (e.g. cutting a bit string into multiple pieces and interchanging them with the pieces of the other parent, or flipping single bits).
That's the theory. In practice, it is sometimes hard to distinguish between the two evolutionary algorithms, and you may need to create hybrid algorithms (e.g. integer (bit-string) individuals that encode the parameters of the genetic operators).
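To make the contrast concrete, here is a toy Python sketch of the two operator styles described above; all parameter values are illustrative:

    import random

    # ES-style: a real vector carries its own mutation strength (sigma).
    # Self-adaptation: sigma is perturbed first, then used on the values.
    def es_mutate(values, sigma, tau=0.2):
        sigma = sigma * (2.0 ** random.gauss(0.0, tau))  # log-normal step change
        values = [v + random.gauss(0.0, sigma) for v in values]
        return values, sigma

    # GA-style: bit strings with one-point crossover and bit flips.
    def ga_crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:], b[:cut] + a[cut:]

    def ga_mutate(bits, p=0.01):
        return [bit ^ 1 if random.random() < p else bit for bit in bits]

    child, new_sigma = es_mutate([0.5, -1.2], sigma=0.3)
    c1, c2 = ga_crossover([0, 1, 1, 0, 1], [1, 0, 0, 1, 0])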
Just stumbled on this thread when researching Evolution Strategies (ES).
As Paul noticed before, the encoding is not really the difference here, as it is an implementation detail of specific algorithms, although real-valued encoding does seem more common in ES.
To answer the question, we first need to do a small step back and look at internals of an ES algorithm.
In ES there is a concept of endogenous and exogenous parameters of the evolution. Endogenous parameters are associated with individuals and are therefore evolved together with them; exogenous parameters are provided from "outside" (e.g. set constant by the developer, or set by a function/policy depending on the iteration number).
The individual k therefore consists of two parts:
y(k) - a set of object parameters (e.g. a vector of real/int values) which denote the individual's genotype
s(k) - a set of strategy parameters (e.g. again a vector of real/int values) which can, for example, control statistical properties of the mutation
Those two vectors are selected, mutated, and recombined together.
The main difference between GA and ES is that in a classic GA there is no distinction between types of algorithm parameters; in fact, all the parameters are set from "outside", so in ES terms they are exogenous.
There are also other, minor differences: e.g. in ES the selection policy is usually one and the same, while in GA there are multiple different approaches which can be interchanged.
You can find a more detailed explanation here (see Chapter 3): Evolution strategies. A comprehensive introduction
In most newer textbooks on GA, real-valued coding is introduced as an alternative to the integer one, i.e. individuals can be coded as vectors of real numbers. This is called continuous-parameter GA (see e.g. Haupt & Haupt, "Practical Genetic Algorithms", J. Wiley & Sons, 1998). So this is practically identical to the real-number coding of ES.
With respect to parent selection, there are many different strategies published for GAs. I don't know them all, but I assume selection among all individuals (not only the best) has been used for some applications.
The main difference seems to be that a genetic algorithm represents a solution using a sequence of integers, whereas an evolution strategy uses a sequence of real numbers -- reference: http://en.wikipedia.org/wiki/Evolutionary_algorithm#
As the Wikipedia source (http://en.wikipedia.org/wiki/Genetic_algorithm) and @Vaughn Cato said, the difference between the two techniques lies in the implementation: ES uses real numbers and GA uses integers.
However, in practice I think you could use either integers or real numbers in the formulation of your problem and in your program; it depends on you. For instance, for protein folding you can say the set of dihedral angles forms a vector. This is a vector of real numbers, but the entries are labeled by integers, so I think you can formulate your problem and write your program based on integer arithmetic. It is just an idea.

Where is binary search used in practice?

Every programmer is taught that binary search is a good, fast way to search an ordered list of data. There are many toy textbook examples of using binary search, but what about in real programming: where is binary search actually used in real-life programs?
Binary search is used everywhere. Take any sorted collection from any language library (Java, .NET, the C++ STL and so on) and it will use (or offer the option to use) binary search to find values. While it is true that you rarely have to implement it yourself, you still have to understand the principles behind it to take advantage of it.
Binary search can be used to access ordered data quickly when memory space is tight. Suppose you want to store a set of 100,000 32-bit integers in a searchable, ordered data structure, but you are not going to change the set often. You can trivially store the integers in a sorted array of 400,000 bytes and use binary search to access it fast. But if you put them into, say, a B-tree, a red-black tree, or some other "more dynamic" data structure, you start to incur memory overhead. To illustrate, storing the integers in any kind of tree with left-child and right-child pointers would make you consume at least 1,200,000 bytes of memory (assuming a 32-bit memory architecture). Sure, there are optimizations you can do, but that's how it works in general.
Because it is very slow to update an ordered array (doing insertions or deletions), binary search is not useful when the array changes often.
Here are some practical examples where I have used binary search (a reference implementation follows the list):
Implementing a "switch() ... case:" construct in a virtual machine where the case labels are individual integers. If you have 100 cases, you can find the correct entry in 6 to 7 steps using binary search, whereas a sequence of conditional branches takes on average 50 comparisons.
Doing fast substring lookup using suffix arrays, which contain all the suffixes of the set of searchable strings in lexicographic order (I wanted to conserve memory and keep the implementation simple).
Finding numerical solutions to an equation (when you are lazy and do not mind implementing Newton's method).
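For reference, the textbook version underlying all of these examples, sketched in Python:

    def binary_search(a, target):
        """Index of target in sorted list a, or -1 if absent."""
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if a[mid] == target:
                return mid
            elif a[mid] < target:
                lo = mid + 1               # discard the lower half
            else:
                hi = mid - 1               # discard the upper half
        return -1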
Every programmer needs to know how to use binary search when debugging. When you have a program and you know that a bug becomes visible at a particular point during its execution, you can use binary search to pinpoint the place where it actually happens. This can be much faster than single-stepping through large parts of the code.
Binary search is a good and fast way!
Before the arrival of the STL, the .NET framework, etc., you could rather often bump into situations where you needed to roll your own customized collection classes. Whenever a sorted array was a feasible way of storing the data, binary search was the way of locating entries in that array.
I'm quite sure binary search is in widespread use today as well, although it is taken care of "under the hood" by the library for your convenience.
I've implemented binary searches in BTree implementations.
The BTree search algorithm was used to find the next node block to read but, within the 4K block itself (which contained a number of keys depending on the key size), binary search was used to find either the record number (for a leaf node) or the next block (for a non-leaf node).
Blindingly fast compared to sequential search since, as with balanced binary trees, you remove half the remaining search space with every check.
I once implemented it (without even knowing that this was indeed binary search) for a GUI control showing two-dimensional data in a graph. Clicking with the mouse should set the data cursor to the point with the closest x value. With large numbers of points (several thousand; this was way back when x86 CPUs were only beginning to exceed 100 MHz), the linear search from the start that I was doing was not really usable interactively. After some thinking it occurred to me that I could approach this in a divide-and-conquer fashion. It took me some time to get it working under all edge cases.
It was only some time later that I learned that this is indeed a fundamental CS algorithm...
One example is the STL set. The underlying data structure is a balanced binary search tree which supports look-up, insertion, and deletion in O(log n) thanks to binary search.
Another example is an integer division algorithm that runs in logarithmic time.
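That one can be sketched as a binary search on the quotient; a toy Python version for non-negative operands (it runs in O(log dividend) iterations, counting multiplication as O(1)):

    def divide(dividend, divisor):
        lo, hi = 0, dividend               # the quotient lies in [0, dividend]
        while lo < hi:
            mid = (lo + hi + 1) // 2       # bias up so the loop terminates
            if mid * divisor <= dividend:
                lo = mid                   # mid is still a feasible quotient
            else:
                hi = mid - 1
        return lo                          # floor(dividend / divisor)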
We still use it heavily in our code to search thousands of ACLs many thousands of times a second. It's useful because the ACLs are static once they come in from a file, and we can suffer the expense of growing the array as we add to it at boot-up. It's blazingly fast once it's running, too.
When you can search a 255-element array in at most 7 compares/jumps (511 in 8, 1023 in 9, etc.), you can see that binary search is about as fast as you can get.
Binary search is also used in a great many 3D games and applications: space is divided into a tree structure, and binary search is used to retrieve which subdivisions to display according to the 3D position and the camera.
One of its first great showcases was Doom, where binary space partitioning (BSP) trees and the associated search sped up the rendering.
Answering your question with a hands-on example.
The R programming language has a package called data.table. It is known as a C-implemented, high-performance extension for data transformation with a concise syntax, and it uses binary search. Even without binary search it scales better than its competitors.
You can find benchmarks vs. Python pandas and vs. R dplyr in the project wiki (grouping 2E9 rows of random-order data).
There is also a nice benchmark against databases and big-data tools (benchm-databases).
In a recent data.table version (1.9.6), binary search was extended, and it can now be used as an index on any atomic column.
I just found a nice summary with which I totally agree - see.
Anyone doing R comparisons should use data.table instead of data.frame, all the more so for benchmarks. data.table is the best data structure/query language I have found in my career. It's leading the way in the R world and, in my view, among all the data-focused languages.
So yes, binary search is being used, and the world is a much better place thanks to it.
Binary search can be used to debug with Git. It's called git bisect.
Amongst other places, I have an interpreter with a table of command names and a pointer to the function to interpret that command. There are about 60 commands. It would not be incredibly onerous to use a linear search - but I use a binary search.
Semiconductor test programs used for measuring digital timing or analog levels make extensive use of binary search. Automatic Test Equipment (ATE) from Advantest, Teradyne, Verigy and the like can be thought of as truth table blasters, applying input logic and verifying output states of a digital part.
Think of a simple gate, with the input logic changing at time = 0 of each cycle and the output transitioning X ns after the input logic changes. If you strobe the output before T = X, the logic does not match the expected value; strobe later than T = X, and it does. Binary search is used to find the threshold between the latest time at which the logic does not match and the earliest time at which it does. (A Teradyne FLEX system resolves timing to 39 ps; other testers are comparable.) That is a simple way to measure transition time. The same technique can be used to solve for setup time, hold time, operable power-supply levels, power supply vs. delay, etc.
Any kind of microprocessor, memory, FPGA, logic, and many analog mixed signal circuits use binary search in test and characterization.
I had a program that iterated through a collection to perform some calculations. I thought this was inefficient, so I sorted the collection and then used a single binary search to find an item of interest. I returned this item and its matching neighbours. I had, in effect, filtered the collection.
Doing this was actually slower than iterating the entire collection and fishing out the matching items.
I continued to add items to the collection, knowing that the sort-and-search performance would eventually catch up with iteration. It took a collection of about 600 objects for the speeds to be identical; with 1000 objects the search had a clear performance benefit.
I would also consider the type of data you are working with, the duplicates and spread. This will have an effect on the sorting and searching.
My answer is to try both methods and time them.
It's the basis for hg bisect.
Binary search is useful for adjusting font size so that text fits a text box of fixed dimensions.
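A sketch of how that search might look; the fits() callback is a hypothetical stand-in for whatever text-measurement call your GUI toolkit provides, assumed monotone in the font size:

    def largest_fitting_size(text, fits, lo=1, hi=200):
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if fits(text, mid):            # does the text fit at size mid?
                lo = mid                   # it fits; try a larger size
            else:
                hi = mid - 1               # too big; go smaller
        return lo

    # Fake measurement for demonstration: width grows linearly with size
    print(largest_fitting_size("hello", lambda t, s: len(t) * s <= 300))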
Finding the roots of an equation is probably one of those very easy things you want to do with a very easy algorithm like binary search.
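For example, the bisection method is exactly a binary search on a sign change; a minimal sketch, assuming f(lo) and f(hi) have opposite signs:

    def bisect_root(f, lo, hi, tol=1e-9):
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if f(lo) * f(mid) <= 0:        # sign change: root is in [lo, mid]
                hi = mid
            else:                          # otherwise it is in [mid, hi]
                lo = mid
        return (lo + hi) / 2.0

    # Usage: the root of x^2 - 2 on [0, 2] approximates sqrt(2)
    print(bisect_root(lambda x: x * x - 2.0, 0.0, 2.0))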
Delphi users can enjoy binary search when searching for a string in a sorted TStringList.
I believe that the .NET SortedDictionary uses a binary tree behind the scenes (much like the STL map), so binary search is used to access elements in the SortedDictionary.
Python's list.sort() method uses Timsort, which (AFAIK) uses binary search (binary insertion sort) to locate the insertion positions of elements.
Binary search offers a feature that many ready-made map/dictionary implementations don't: finding non-exact matches.
For example, I've used binary search to implement geotagging of photos based on GPS logs: put all GPS waypoints in an array sorted by timestamp, and use binary search to identify the waypoint that lies closest in time to each photo's timestamp.
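With Python's standard bisect module, that nearest-match lookup takes only a few lines (names and data here are illustrative):

    import bisect

    def closest_index(times, target):
        """Index of the sorted timestamp closest to target."""
        i = bisect.bisect_left(times, target)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        return min(candidates, key=lambda j: abs(times[j] - target))

    waypoint_times = [100, 200, 300]           # sorted GPS log timestamps
    print(closest_index(waypoint_times, 260))  # -> 2 (300 is closest)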
If you have a set of elements to find in an array, you can either search for each of them linearly, or sort the array once and then use binary search with the same comparison predicate. The latter is much faster.
