Node.js server-side 2D search performance

I plan to store and update user locations (lat, lon, userId) via WebSocket on a Node.js server.
My goal is to broadcast the user locations to every user, as fast as possible
(like the position of each taxi in the mytaxi app).
My concerns / problems:
server-side performance with lots of simultaneous users
pushing the data back (I only need to know about users in my region)
-> a 2D search (get the users whose lat/lon falls inside a bounding box)
Questions:
What's the best storage solution (MongoDB vs. JS array / object storage)?
Is read/write on a DB faster than searching an in-memory array?
Is there a 2D-optimized JavaScript search solution?
My approach:
I would go for two arrays (arr1 sorted by lat, arr2 sorted by lon),
search each via divide and conquer (binary search), then intersect the matching ids and output the result (see the sketch below).
Is there a better way to do this?
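Roughly what I have in mind, as a minimal TypeScript sketch (all names are illustrative, not a final design):

interface UserPos { id: string; lat: number; lon: number; }

// Binary search: first index whose key is >= value (the divide-and-conquer part).
function lowerBound(arr: UserPos[], key: (u: UserPos) => number, value: number): number {
  let lo = 0, hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (key(arr[mid]) < value) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// arrByLat is sorted by lat, arrByLon by lon; intersect the two range scans on id.
function usersInBox(arrByLat: UserPos[], arrByLon: UserPos[],
                    minLat: number, maxLat: number,
                    minLon: number, maxLon: number): UserPos[] {
  const lonIds = new Set<string>();
  for (let i = lowerBound(arrByLon, u => u.lon, minLon);
       i < arrByLon.length && arrByLon[i].lon <= maxLon; i++) {
    lonIds.add(arrByLon[i].id);
  }
  const out: UserPos[] = [];
  for (let i = lowerBound(arrByLat, u => u.lat, minLat);
       i < arrByLat.length && arrByLat[i].lat <= maxLat; i++) {
    if (lonIds.has(arrByLat[i].id)) out.push(arrByLat[i]);
  }
  return out;
}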
Thanks in advance,
Alex

You can use Redis's pub/sub channels: create a channel for each bounding box, subscribe users to the relevant channels according to their location reports, and then push messages to entire channels.
This strategy may be naive for big data, though.
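A rough sketch of that idea with the node "redis" client (v4 API); the grid-cell channel naming here is just an illustration, a geohash library would do the same job:

import { createClient } from 'redis';

const pub = createClient();
const sub = createClient();
await pub.connect();
await sub.connect();

// Map a position to a coarse grid cell (~1 km at 0.01 degrees).
const cell = (lat: number, lon: number) =>
  `geo:${Math.floor(lat / 0.01)}:${Math.floor(lon / 0.01)}`;

// When a user reports a location, publish it to that cell's channel...
async function reportLocation(userId: string, lat: number, lon: number) {
  await pub.publish(cell(lat, lon), JSON.stringify({ userId, lat, lon }));
}

// ...and subscribe each user to the 3x3 block of cells covering their region.
async function watchRegion(lat: number, lon: number, onUpdate: (msg: string) => void) {
  for (const dLat of [-0.01, 0, 0.01])
    for (const dLon of [-0.01, 0, 0.01])
      await sub.subscribe(cell(lat + dLat, lon + dLon), onUpdate);
}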

Related

Fuzzy matching a list of coordinates (Golang)

I'm trying to create a simple tool which will allow a user to specify two places around Seattle.
I'm working with the WSDOT traffic data set. An example of the output can be found here: https://gist.github.com/jaxxstorm/0ab818b300f65cf3a46cc01dbc35bf60
What I'd like to be able to do is specify two locations like:
Bellevue
Seattle
and then lookup all traffic times for those locations.
I'm considering doing a reverse geocode like this answer, but I want it to be "fuzzy" in that I don't want people to have to specify exact locations. I also suspect the processing time for this might be long, as I'd have to loop through the list and reverse-look-up all the coordinates, which could take a while.
Are there any better alternatives for processing this data in this way? I'm writing the tool in Go.
You have two problems for each set of points (start and end):
Convert locations to lat/lon
Fuzzy match the lat/lon to the traffic data (which contains lat/lon)
The location-to-lat/lon conversion is pretty straightforward using a geocoding API like the one available from Google.
To match lat/lon fuzzily, you could either truncate the coordinates and store the result as a hash key (so that you're storing approximate matches) and look data up that way, or you could do a radius calculation and pick the results within that radius (this requires some math involving the radius of the earth, which you can look up easily enough; it can be done in SQL if your data is in a database, for example).
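Both strategies in a short sketch (shown in TypeScript rather than Go, purely for illustration; the constants are assumptions):

const R = 6371e3; // mean earth radius in metres

// Strategy 1: truncate coordinates to a grid key and bucket points by key.
// Two decimal places is roughly a 1.1 km cell.
const gridKey = (lat: number, lon: number, places = 2) =>
  `${lat.toFixed(places)},${lon.toFixed(places)}`;

// Strategy 2: haversine great-circle distance for a radius check.
function haversineMetres(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1), dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
            Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Keep traffic rows within 5 km of a geocoded point.
const withinRadius = (rows: { lat: number; lon: number }[], p: { lat: number; lon: number }) =>
  rows.filter(r => haversineMetres(p.lat, p.lon, r.lat, r.lon) < 5000);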

Calculate distance between list of origin-destination using google api

I have a requirement to calculate the distances between a list of origins and destinations.
Say the list has o1-d1, o2-d2, o3-d3, etc.
Is there a way to send them all at once to the Google API and get all the results, instead of iterating over the list and requesting each o-d pair separately?
Thanks
I'm afraid that the answer is NO.
For now the API is designed to give you only this format of response, and the way to go is parsing through it [1].
[1] https://developers.google.com/maps/documentation/javascript/distancematrix#distance_matrix_parsing_the_results
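That said, the DistanceMatrixService does accept arrays of origins and destinations, so you can at least batch the request; you just get the full origins x destinations matrix back and have to parse through it, picking out the diagonal for the o1-d1, o2-d2, ... pairs. A hedged sketch (assumes the Maps JavaScript API is loaded):

declare const google: any; // provided by the Maps JavaScript API script tag

function pairwiseDistances(origins: string[], destinations: string[],
                           done: (metres: number[]) => void): void {
  const service = new google.maps.DistanceMatrixService();
  service.getDistanceMatrix(
    { origins, destinations, travelMode: google.maps.TravelMode.DRIVING },
    (response: any, status: string) => {
      if (status !== 'OK') return done([]);
      // rows[i].elements[j] is the result for origins[i] -> destinations[j];
      // the diagonal holds the pairs we actually asked for.
      done(response.rows.map((row: any, i: number) => row.elements[i].distance.value));
    });
}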

F# Immutable data structures for high frequency real-time streaming data

We are at the beginning of an F# project involving real-time and historical analysis of streaming data. The data is contained in a C# object (see below) and is sent as part of a standard .NET event. In real time, the number of events we receive typically varies greatly, from less than 1/sec to upwards of around 800 events per second per instrument, so the stream can be very bursty. A typical day might accumulate 5 million rows/elements per instrument.
A generic version of the C# event's data structure looks like this:
public enum MyType { type0 = 0, type1 = 1 }

public class dataObj
{
    public int myInt = 0;
    public double myDouble;
    public string myString;
    public DateTime myDateTime;
    public MyType type;
    public object myObj = null;
}
We plan to use this data structure in f# in two ways:
Historical analysis using supervised and unsupervised machine learning (CRFs, clustering models, etc.)
Real-time classification of data streams using the above models
The data structure needs to be able to grow as we add more events. This rules out plain arrays ('T[]) because they cannot be resized, though they could still work for the historical analysis. The data structure also needs to be able to quickly access recent data, and ideally needs to be able to jump to data x points back. This rules out the F# list ('T list) because of its linear lookup time and because there is no random access to elements, just "forward-only" traversal.
According to this post, Set<T> may be a good choice...
> " ...Vanilla Set<'a> does a more than adequate job. I'd prefer a 'Set' over a 'List' so you always have O(lg n) access to the largest and smallest items, allowing you to ordered your set by insert date/time for efficient access to the newest and oldest items..."
EDIT: Yin Zhu's response gave me some additional clarity into exactly what I was asking. I have edited the remainder of the post to reflect this. Also, the previous version of this question was muddied by the introduction of requirements for historical analysis; I have omitted them.
Here is a breakdown of the steps of the real-time process:
A realtime event is received
This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a Set<T>, or some other structure?
A subset of the elements is either extracted or somehow iterated over for the purpose of feature generation. This would either be the last n rows/elements of the data structure (i.e. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (i.e. all the events in the last 10 min). Ideally, we want a structure that allows us to do this efficiently. In particular, a data structure that allows for random access to the nth element without iterating through all the other elements is of value.
Features for the model are generated and sent to a model for evaluation.
We may prune the data structure of older data to improve performance.
So the question is: what is the best data structure for storing the real-time streaming events that we will use to generate features?
You should consider FSharpx.Collections.Vector. Vector<'T> will give you array-like features, including indexed O(log32 n) look-up and update, which is within spitting distance of O(1), as well as appending new elements to the end of your sequence. There is another implementation of Vector which can be used from F#, Solid Vector. It is very well documented and some functions perform up to 4x faster at large scale (element count > 10K). Both implementations perform very well up to and possibly beyond 1M elements.
In his answer, Jack Fox suggests using either the FSharpx.Collections Vector<'T> or the Solid Vector<'t> by Greg Rosenbaum (https://github.com/GregRos/Solid). I thought I might give back a bit to the community by providing instructions on how to get up and running with each of them.
Using the FSharpx.Collections.Vector<'T>
The process is pretty straightforward:
Download the FSharpx.Core NuGet package using either the Package Manager Console or Manage NuGet Packages for Solution; both are found under Visual Studio -> Tools -> Library Package Manager.
If you're using it in an F# script file, add #r "FSharpx.Core.dll". You may need to use a full path.
Usage:
open FSharpx.Collections
let ListOfTuples = [ (1, true, 3.0); (2, false, 1.5) ]
let vector = ListOfTuples |> Vector.ofSeq

printfn "Last %A" vector.Last
printfn "Unconj %A" vector.Unconj
printfn "Item(0) %A" vector.[0]
printfn "Item(1) %A" vector.[1]
printfn "TryInitial %A" vector.TryInitial
printfn "TryUnconj %A" vector.TryUnconj
Using the Solid.Vector<'T>
Getting set up to use the Solid Vector<'t> is a bit more involved. But the Solid version has a lot more handy functionality and, as Jack pointed out, a number of performance benefits. It also has a lot of useful documentation.
You will need to download the Visual Studio solution from https://github.com/GregRos/Solid
Once you have downloaded it, you will need to build it, as there is no ready-to-use pre-built dll.
If you're like me, you may run into a number of missing dependencies that prevent the solution from being built. In my case, they were all related to the NUnit testing framework (I use a different one). Just work through downloading/adding each of the dependencies until the solution builds.
Once that is done and the solution is built, you will have a shiny new Solid.dll in the Solid/Solid/bin folder. This is where I went wrong: that is the core dll and it is only enough for C# usage. If you only include a reference to Solid.dll, you will be able to create a Vector<'T> in F#, but funky things will happen from then on.
To use this data structure in F# you will need to reference both Solid.dll and Solid.FSharp.dll, which is found in the \Solid\SolidFS\obj\Debug\ folder. You will only need one open statement: open Solid
Here is some code showing usage in a F# script file:
#r "Solid.dll"
#r "Solid.FSharp.dll" // don't forget this reference
open Solid
let ListOfTuples2 = [(1,true,3.0);(2,false,1.5)]
let SolidVector = ListOfTuples2 |> Vector.ofSeq
printfn "%A" SolidVector.Last
printfn "%A" SolidVector.First
printfn "%A" (SolidVector.[0])
printfn "%A" (SolidVector.[1])
printfn "Count %A" SolidVector.Count
let test2 = vector { for i in {0 .. 100} -> i }
Suppose your dataObj contains a unique ID field; then any set data structure would be fine for the job. Immutable data structures are primarily used for functional-style code or persistence. If you don't need those two properties, you can use HashSet<T> or SortedSet<T> from the .NET collection library.
Some stream-specific optimization may be useful, e.g. keeping a fixed-size Queue<T> for the most recent data objects in the stream and storing older objects in the heavier-weight set; a sketch of that hybrid follows. I would suggest benchmarking before switching to such a hybrid data structure.
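The hybrid idea in a language-agnostic sketch (TypeScript here purely for illustration; all names are made up): a fixed-size ring buffer keeps the newest events cheap to index, and evicted items spill into the larger historical store.

class RecentBuffer<T> {
  private buf: T[] = [];
  private start = 0;            // index of the oldest element once full
  readonly history: T[] = [];   // stand-in for the heavier-weight set

  constructor(private capacity: number) {}

  push(item: T): void {
    if (this.buf.length < this.capacity) {
      this.buf.push(item);
    } else {
      this.history.push(this.buf[this.start]); // evict oldest into history
      this.buf[this.start] = item;
      this.start = (this.start + 1) % this.capacity;
    }
  }

  // O(1) access to the n-th most recent element (0 = newest).
  recent(n: number): T {
    return this.buf[(this.start + this.buf.length - 1 - n) % this.buf.length];
  }
}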
Edit:
After reading your requirements more carefully, I found that what you want is a queue with user-accessible indexing or a backward enumerator. With such a data structure, your feature-extraction operations (e.g. average, sum, etc.) cost O(n). If you want to do some of these operations in O(log n), you can use more advanced data structures, e.g. interval trees or skip lists. However, you will have to implement these data structures yourself, as you need to store meta information in the tree nodes which the standard collection APIs don't expose.
> This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a Set, a Queue, or some other structure?
Difficult to say without more information.
If your data are coming in with timestamps in ascending order (i.e. they are never out of order) then you can just use some kind of queue or extensible array.
If your data can come in out of order and you need them reordered then you want a priority queue or indexed collection instead.
> ...to upwards of around 800 events per second
Those are extremely tame performance requirements for insertion rate.
> A subset of the elements is either extracted or somehow iterated over for the purpose of feature generation. This would either be the last n rows/elements of the data structure (i.e. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (i.e. all the events in the last 10 min). Ideally, we want a structure that allows us to do this efficiently. In particular, a data structure that allows for random access to the nth element without iterating through all the other elements is of value.
If you only ever want elements near the beginning why do you want random access? Do you really want random access by index or do you actually want random access by some other key like time?
From what you've said, I would suggest using an ordinary F# Map keyed on index, maintained by a MailboxProcessor that can append a new event and retrieve an object that allows all events to be indexed, i.e. wrap the Map in an object that provides its own Item property and an implementation of IEnumerable<_>. On my machine that simple solution takes 50 lines of code and can handle around 500,000 events per second.

Smart sorting by function of geo and int

I'm thinking about ways to solve the following task.
We are developing a service (website) which has some objects. Each object has a geo field (lat and long). The objects are spread across roughly 200-300 cities, and the total number of objects is in the thousands to tens of thousands.
Each object also has a creation date.
We need to search objects, sorted by a function of distance and freshness.
E.g. we have two close cities, A and B. A user from city A logs in, and he should see objects from city A first and then, on some later pages, objects from city B (because objects from A are closer).
But if there is an object from A which was added about a year ago, and an object from B which was added today, then B's object should be displayed before A's.
So, for people from city A, we can create a special field with a relevance index like 100 * distance + age_in_days,
and then sort by this field and we will get the data in the order we need.
The problem is that such a relevance index will not work for people from all the other places.
In my example I used a linear function, but it's just an example; we will need to fit the correct function.
The site will run on our own servers, so we can use almost any database or any other software (I was planning to use MongoDB).
I have the following ideas:
Recalculate the relevance index every day and keep it with the object, like:
{
    fields : ...,
    relindex : {
        cityA : 100,
        cityB : 120
    }
}
And if a user belongs to cityA, then sort by relindex.cityA (a query sketch follows below).
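For example, the query side of that idea might look like this (a sketch with the Node "mongodb" driver; assumes the per-city relindex fields above and an index on relindex.cityA):

import { Db } from 'mongodb';

async function pageForCityA(db: Db, pageNo: number, pageSize = 20) {
  return db.collection('objects')
    .find({})
    .sort({ 'relindex.cityA': 1 })  // lower value = closer and fresher
    .skip(pageNo * pageSize)
    .limit(pageSize)
    .toArray();
}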
Disadvantages:
Recurrent updates of all objects, but I don't think that's a huge problem.
A huge Mongo index: if we have about 300 cities, then each object will have 300 indexed fields.
Hard to add new cities.
Use a 3D spatial index: (lat, long, freshness). But I don't know whether any database supports 3D geospatial indexes.
Group close objects into clusters and search only within a cluster rather than the whole base. But I'm not sure that this is OK.
I think there are four possible solutions:
1) Use a 3D index: lat, lon, time.
2) Distance is more important: use a geo index and select the nearest objects. If an object is too old, discard it and increase the allowed distance. Stop once you have enough objects (see the sketch below).
3) Time is more important: index by time and discard the objects which are too far away.
4) Approximate distance: choose some important points (centres of cities or centres of clusters of objects) and calculate the distances from these important points up front. The query will first find the nearest important point and then use the index to find the data.
Alternatively, you can create clusters from your objects and then calculate the distance in the query. The point here is that the number of clusters is limited.
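A hedged sketch of option 2 with the Node "mongodb" driver; the collection and field names are illustrative and a 2dsphere index on "loc" is assumed:

import { MongoClient } from 'mongodb';

async function freshNearby(uri: string, lon: number, lat: number, wanted = 20) {
  const client = await MongoClient.connect(uri);
  const objects = client.db('app').collection('objects');
  const maxAgeMs = 30 * 24 * 3600 * 1000; // "too old" cutoff: 30 days (an assumption)
  const results: any[] = [];
  let minRadius = 0, radius = 5_000; // metres

  // Widen the search ring until we have enough fresh objects (capped at ~320 km).
  while (results.length < wanted && radius <= 320_000) {
    const batch = await objects.find({
      loc: { $nearSphere: { $geometry: { type: 'Point', coordinates: [lon, lat] },
                            $minDistance: minRadius,
                            $maxDistance: radius } },
      createdAt: { $gte: new Date(Date.now() - maxAgeMs) },
    }).limit(wanted - results.length).toArray();
    results.push(...batch);
    minRadius = radius;
    radius *= 2;
  }
  await client.close();
  return results;
}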

MongoDB geospatial query

I use Mongo's "$near" query; it works as expected and saves me a lot of time.
Now I need to perform something more complicated. Imagine we have a collection of "checkins" (to use foursquare notation) that contains geospatial information (nothing unusual: just lat and lng) plus a time. Given the checkins of two people, how do I find the checkins where they were near each other? I mean, e.g.: "on 1/23/12 you were 100 meters away from each other".
The easiest solution is to select all the checkins of the first user and find the nearest checkin for each of them on the framework side (I use Ruby). But is that the most efficient solution?
Do you have better ideas? Maybe I need some kind of special index?
Best,
Roman
The MongoDB GeoSpatial indexes provide two types of queries: $near and $within. The $near query returns all points in the database that are within a certain range of a requested point, while the $within query lists all points in the database that are inside of a particular area (box, circle, or arbitrary polygon).
MongoDB does not currently provide a query that will return all points that are within a certain distance of any member of another set of points, which is what you seem to want.
You could conceivably use the point data from user1 to build a polygon describing the "area of interest" and then use the $within query to see if there were any checkins by other people inside of that area. If you use a compound index on location & date, you could even restrict the query to folks who were inside of that area on a particular day.
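For instance (a hedged sketch with the Node "mongodb" driver; field names are made up, and a geo index on "loc" plus "date" in a compound index is assumed): take one of user1's checkins, build a circle around it, and look for other users' checkins inside that circle on the same day.

import { MongoClient } from 'mongodb';

async function nearbyCheckins(uri: string,
                              checkin: { loc: [number, number]; date: Date; userId: string }) {
  const client = await MongoClient.connect(uri);
  const checkins = client.db('app').collection('checkins');

  const dayStart = new Date(checkin.date); dayStart.setUTCHours(0, 0, 0, 0);
  const dayEnd = new Date(dayStart.getTime() + 24 * 3600 * 1000);
  const radius = 100 / 6378137; // ~100 m, expressed in radians for $centerSphere

  const nearby = await checkins.find({
    loc: { $geoWithin: { $centerSphere: [checkin.loc, radius] } },
    date: { $gte: dayStart, $lt: dayEnd },
    userId: { $ne: checkin.userId }, // exclude the user's own checkins
  }).toArray();

  await client.close();
  return nearby;
}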
References:
http://docs.mongodb.org/manual/core/indexes/#geospatial-indexes
http://docs.mongodb.org/manual/reference/operators/#geospatial
