When running the Degree Centrality algorithm on a PGX server over a graph built from 4 months of data, how can the degree be calculated per day? As far as I understand, the algorithm computes the degree over the whole 4 months. Can a specific argument be defined for this, or do I have to use other algorithms inside PGX for this purpose?
I should mention that my data consists of bank transactions between cards.
Thanks
I'm new to graph databases and I'm studying how to create a good data model.
I have to manage 10 million "Contacts" and I would like to filter them by "gender". I created a POC and everything works, but I can't figure out whether the best solution is to save the gender as its own vertex:
or as a field on the contact vertex:
I know that each edge will impact the data size, but I can't find any reference on the performance difference between these two ways of modeling the data.
Do you know the right approach?
In this use case, I would put gender as a property on the vertex and add an index on that property to get your answer. While having gender as a separate vertex is more correct from a theoretical perspective, it has a few practical issues that lead me to suggest the second approach.
The first model you suggest will introduce a supernode into your graph. A supernode is a node with a disproportionately high number of incident edges. The gender property has low selectivity (Male/Female/Unknown), so each Gender vertex will have a branching factor in the millions. That level of branching will likely cause all sorts of performance problems and result in slow queries. Denormalizing the gender onto the contact vertex and adding an index should resolve most of these issues. The only issue likely to remain is the amount of time it will take to return the 3-5 million records you will probably get back.
In the first approach, answering the question "What is a person's gender?" would require traversing from the contact vertex across an edge to the gender vertex, which would be slower than just pulling back the contact vertex. Assuming this is a query you will run frequently, that is a consideration you should take into account.
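To make the two options concrete, here is a hedged sketch (in Python, via the gremlinpython client) of how the recommended model would be queried, assuming a TinkerPop-compatible store; the label and property names (Contact, gender, HAS_GENDER) and the server address are illustrative assumptions, not something from the question.

```python
# Sketch only: assumes a TinkerPop-compatible graph server running locally
# and the gremlinpython client installed (pip install gremlinpython).
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Option 2 (recommended): gender stored as a property on each Contact vertex.
# With an index on 'gender', the store can answer this filter from the index.
female_contacts = g.V().hasLabel('Contact').has('gender', 'F').limit(10).toList()

# Option 1 (the supernode model) would instead traverse through a shared
# Gender vertex, fanning out over millions of edges, e.g.:
# g.V().hasLabel('Gender').has('value', 'F').in_('HAS_GENDER').limit(10).toList()

print(female_contacts)
conn.close()
```

The point of the property-plus-index variant is that the filter never has to walk the millions of edges attached to a shared Gender vertex.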
As a newbie in Machine Learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and they just SEEM different due to the noise.
In addition, not all of them have the same length. So although Trajectory A is not the same as Trajectory B, it may be part of Trajectory B. I wish to capture this property after the clustering as well.
I have only a little knowledge of k-means clustering and fuzzy c-means clustering. How should I choose between the two? Or should I adopt other methods?
Is there any method that takes this "belongingness" into consideration?
(e.g. After the clustering, I have 3 clusters A, B and C. One particular trajectory X belongs to cluster A. A shorter trajectory Y, although it is not clustered in A, is identified as part of trajectory X.)
=================== UPDATE ======================
The aforementioned trajectories are pedestrian trajectories. They can be presented either as a series of (x, y) points or as a series of step vectors (length, direction). The representation form is under my control.
It might be a little late but I am also working on the same problem.
I suggest you take a look at TRACLUS, an algorithm created by Jae-Gil Lee, Jiawei Han and Kyu-Young Whang, published at SIGMOD '07.
http://web.engr.illinois.edu/~hanj/pdf/sigmod07_jglee.pdf
This is so far the best approach I have seen for clustering trajectories because:
It can discover common sub-trajectories.
It focuses on segments instead of points (so it filters out noise and outliers).
It works on trajectories of different lengths.
Basically it is a two-phase approach:
Phase one - Partition: Divide trajectories into segments. This is done using MDL optimization with complexity O(n), where n is the number of points in a given trajectory. Here the input is a set of trajectories and the output is a set of segments.
Complexity: O(n), where n is the number of points in a trajectory
Input: Set of trajectories.
Output: Set D of segments.
Phase two - Group: This phase discovers the clusters using a version of density-based clustering, as in DBSCAN. The input in this phase is the set of segments obtained from phase one, plus some parameters defining what constitutes a neighborhood and the minimum number of lines that can constitute a cluster. The output is a set of clusters. Clustering is done over segments. They define their own distance measure made of 3 components: parallel distance, perpendicular distance and angular distance (a sketch of this measure appears after this description). This phase has a complexity of O(n log n), where n is the number of segments.
Complexity: O(n log n), where n is the number of segments in set D
Input: Set D of segments, parameter E that sets the neighborhood threshold, and parameter MinLns that is the minimum number of lines.
Output: Set C of clusters, where each cluster is a set of segments (the clustered trajectories).
Finally, they calculate a representative trajectory for each cluster, which is nothing other than the discovered common sub-trajectory of that cluster.
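To make the grouping phase's distance measure more tangible, here is a rough sketch of the three components (perpendicular, parallel and angular distance) between two 2D segments, loosely following the definitions in the paper; the function and variable names are mine, and this is a simplified illustration rather than the authors' code.

```python
import numpy as np

def _project(p, s, e):
    """Project point p onto the infinite line through s->e; return (point, t)."""
    d = e - s
    t = np.dot(p - s, d) / np.dot(d, d)
    return s + t * d, t

def traclus_segment_distance(l1, l2, w_perp=1.0, w_par=1.0, w_theta=1.0):
    """Weighted sum of perpendicular, parallel and angular distance between
    two 2D segments l1=(s1, e1), l2=(s2, e2), approximating the TRACLUS paper."""
    s1, e1 = map(np.asarray, l1)
    s2, e2 = map(np.asarray, l2)
    # treat the longer segment as l1, as the paper does
    if np.linalg.norm(e1 - s1) < np.linalg.norm(e2 - s2):
        s1, e1, s2, e2 = s2, e2, s1, e1

    p1, t1 = _project(s2, s1, e1)
    p2, t2 = _project(e2, s1, e1)

    # perpendicular distance: Lehmer mean of the two perpendicular lengths
    lp1, lp2 = np.linalg.norm(s2 - p1), np.linalg.norm(e2 - p2)
    d_perp = 0.0 if lp1 + lp2 == 0 else (lp1 ** 2 + lp2 ** 2) / (lp1 + lp2)

    # parallel distance: smallest distance from a projection to an endpoint of l1
    len1 = np.linalg.norm(e1 - s1)
    d_par = min(min(abs(t1), abs(1 - t1)), min(abs(t2), abs(1 - t2))) * len1

    # angular distance: length of l2 scaled by sin(angle), or full length if > 90 deg
    v1, v2 = e1 - s1, e2 - s2
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    sin_t = np.sqrt(max(0.0, 1.0 - cos_t ** 2))
    d_theta = np.linalg.norm(v2) if cos_t < 0 else np.linalg.norm(v2) * sin_t

    return w_perp * d_perp + w_par * d_par + w_theta * d_theta

# example: two roughly parallel segments
print(traclus_segment_distance(((0, 0), (10, 0)), ((1, 1), (6, 1.5))))
```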
They have pretty cool examples and the paper is very well explained. Once again this is not my algorithm, so don't forget to cite them if you are doing research.
PS: I made some slides based on their work, just for educational purposes:
http://www.slideshare.net/ivansanchez1988/trajectory-clustering-traclus-algorithm
Every clustering algorithm needs a metric. You need to define a distance between your samples. In your case, simple Euclidean distance is not a good idea, especially since the trajectories can have different lengths.
If you define a metric, then you can use any clustering algorithm that allows a custom metric. You probably do not know the correct number of clusters beforehand, so hierarchical clustering is a good option. K-means doesn't allow a custom metric, but there are modifications of K-means that do (like K-medoids).
The hard part is defining the distance between two trajectories (time series). A common approach is DTW (Dynamic Time Warping). To improve performance, you can approximate your trajectories with a smaller number of points (there are many algorithms for that).
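As a hedged illustration of this suggestion, the sketch below computes a pairwise DTW distance matrix over a few made-up trajectories of different lengths and feeds it to SciPy's hierarchical clustering; the toy data and the choice of two clusters are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def dtw(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping over 2D points."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

# toy trajectories of different lengths (made up)
trajectories = [
    [(0, 0), (1, 0), (2, 0), (3, 0)],
    [(0, 0.1), (1.5, 0.1), (3, 0.2)],   # same path, noisier, fewer points
    [(0, 5), (1, 6), (2, 7)],           # a different path
]

# condensed pairwise distance vector for scipy's linkage
n = len(trajectories)
dists = np.array([dtw(trajectories[i], trajectories[j])
                  for i in range(n) for j in range(i + 1, n)])

Z = linkage(dists, method='average')            # hierarchical clustering on a custom metric
labels = fcluster(Z, t=2, criterion='maxclust') # cut the dendrogram into 2 clusters
print(labels)
```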
Neither will work well, because what would a proper mean of a set of trajectories even be?
Have a look at distance-based clustering methods instead, such as hierarchical clustering (for small data sets; but you probably don't have thousands of trajectories) and DBSCAN.
Then you only need to choose an appropriate distance function that allows for, e.g., differences in time and spatial resolution of the trajectories.
Distance functions such as dynamic time warping (DTW) distance can accommodate this.
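A hedged sketch of that combination, DBSCAN over a precomputed DTW distance matrix with scikit-learn; the toy trajectories and the eps/min_samples values are placeholders you would have to tune.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dtw(a, b):
    """Minimal dynamic time warping distance between two 2D trajectories."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

trajectories = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0.2), (2, 0.1)],       # noisy, shorter version of the first path
    [(5, 5), (6, 6), (7, 7)],
]

n = len(trajectories)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(trajectories[i], trajectories[j])

# eps and min_samples are placeholders; a label of -1 marks noise trajectories
labels = DBSCAN(eps=1.5, min_samples=1, metric='precomputed').fit_predict(dist)
print(labels)
```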
This is a good concept and has potential for real-time applications. In my view, one can adopt any clustering method, but you need to select an appropriate dissimilarity measure and then think about the computational complexity.
This paper (http://link.springer.com/chapter/10.1007/978-81-8489-203-1_15) uses the Hausdorff distance and suggests a technique for reducing complexity, and this paper (http://www.cit.iit.bas.bg/CIT_2015/v-15-2/4-5-TCMVS%20A-edited-md-Gotovop.pdf) describes a "Trajectory Clustering Technique Based on Multi-View Similarity".
I have a quadcopter with some sensors, and I want to measure values at a set of points on the map (a 2D problem).
Every measurement takes 30 seconds, and I assume the copter has a constant speed of 60 km/h.
It can fly continuously for 20 minutes, and then it needs to land and charge for an hour.
I would like to write an algorithm that automatically computes flight paths and minimizes the time needed to take all the samples.
I can represent the points as a complete graph (I assume I am flying high enough that there are no obstacles). The time to reach a point is then the cost on the edge, but I also have a cost for visiting each vertex and a limited amount of "fuel". It is some generalization of TSP or VRP, but I am not sure which one.
There are also problem variants with gas stations, but those usually find a path between two points.
Can you name an algorithm that could solve this, or suggest something similar? It is NP-hard, but there could be some nice approximate solutions.
The problem isn't easy to solve because there are also the fuel constraints, and you need to find groups of points. You can use a combination of a brute-force algorithm and a heuristic. For example, a quadtree or a spatial index (Hilbert curve) can reduce the dimensions and the search space. It looks similar to the capacitated vehicle routing problem.
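To give a feel for a capacitated-routing-style heuristic, here is a rough sketch of a greedy approach: the copter repeatedly flies to the nearest unvisited point it can still measure and return home from, and starts a new sortie when the flight-time budget runs out. The speed, measurement time and budget come from the question; the base location and sample points are made up, and this is only an approximation, not an optimal plan.

```python
import math

SPEED_KM_PER_MIN = 60 / 60.0      # 60 km/h
MEASURE_MIN = 0.5                 # 30 seconds per measurement
BUDGET_MIN = 20.0                 # flight time per charge
BASE = (0.0, 0.0)                 # assumed launch/charging point

def travel_min(a, b):
    return math.dist(a, b) / SPEED_KM_PER_MIN

def plan_sorties(points):
    """Greedy nearest-neighbor heuristic: extend the current sortie while the
    copter can still visit the next point, measure, and make it back to base."""
    remaining = list(points)
    sorties = []
    while remaining:
        pos, time_left, sortie = BASE, BUDGET_MIN, []
        while True:
            # points we can still reach, measure at, and return to base from
            feasible = [p for p in remaining
                        if travel_min(pos, p) + MEASURE_MIN + travel_min(p, BASE) <= time_left]
            if not feasible:
                break
            nxt = min(feasible, key=lambda p: travel_min(pos, p))
            time_left -= travel_min(pos, nxt) + MEASURE_MIN
            pos = nxt
            sortie.append(nxt)
            remaining.remove(nxt)
        if not sortie:
            raise ValueError("some point cannot be reached within one charge")
        sorties.append(sortie)
    return sorties

# made-up sample points (km coordinates)
points = [(2, 3), (5, 1), (4, 6), (1, 8), (7, 7)]
for i, s in enumerate(plan_sorties(points), 1):
    print(f"sortie {i}: {s}")
```

A real solver (e.g. a CVRP formulation) would usually beat this greedy plan, but it shows how the fuel constraint naturally splits the tour into charging cycles.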
I have the following scenario:
I want to find a flight between two cities, A and B. There is no direct flight from A to B, so I need to find a connecting flight with the lowest cost.
In addition, the ticket price is not fixed. It depends on when I buy it; for example, the price will be cheaper if I buy early.
Moreover, time affects the flights too; for example, there is only one flight from C to D, on May 31 at 7 AM. If the flight from A arrives at C on May 31 at 8 AM, I miss that connection. For this reason, I represent the cities as vertices of a graph. The edge AB exists if there is a valid flight from A to B, and the weight is the ticket fee.
Is there any idea or suggestion for my problem?
Thanks
I once answered a very similar question, and I am pretty sure the same idea can be used here. The idea is to use a routing algorithm designed for internet routers, which are dynamic and constantly changing: the Distance Vector Routing Protocol.
The suggested implementation is basically a distributed version of the Bellman-Ford algorithm that adjusts itself whenever an edge weight changes, in order to find the new optimal path.
Note that the algorithm has drawbacks, mainly the count-to-infinity problem.
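For reference, here is a minimal (centralized, not distributed) Bellman-Ford sketch; in a distance-vector setting each node would perform the same relaxation using only its neighbours' tables. The graph and prices are made up.

```python
def bellman_ford(num_nodes, edges, source):
    """edges: list of (u, v, weight). Returns shortest distances from source,
    or raises if a negative cycle is reachable."""
    INF = float('inf')
    dist = [INF] * num_nodes
    dist[source] = 0
    for _ in range(num_nodes - 1):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                updated = True
        if not updated:           # early exit once no relaxation happens
            break
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist

# made-up example: 4 cities, weights are ticket prices
edges = [(0, 1, 100), (0, 2, 300), (1, 2, 50), (2, 3, 120), (1, 3, 400)]
print(bellman_ford(4, edges, source=0))   # cheapest cost from city 0 to each city
```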
The usual way to deal with not being in the right place at the right time is to make the nodes represent a specific place at a specific time. Then a flight from C to D that departs on May 30 at 9PM and arrives May 31 at 7AM corresponds to an arc from node C_May30_9PM to D_May31_7AM. You also need arcs that correspond to waiting around, e.g., D_May31_7AM to D_May31_8AM.
I'm not sure there's much to say about purchasing tickets at the level of detail you've described.
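A hedged sketch of this time-expanded construction: each node is a (city, time) pair, flights and zero-cost waiting arcs become edges, and a standard shortest-path algorithm (Dijkstra here) finds the cheapest valid itinerary. The flight schedule and prices below are made up.

```python
import heapq
from collections import defaultdict

# (from_city, depart, to_city, arrive, price) -- made-up schedule, times in hours
flights = [
    ("A", 8, "C", 10, 120),
    ("C", 11, "D", 13, 80),
    ("A", 9, "B", 12, 400),
    ("D", 14, "B", 16, 90),
]

graph = defaultdict(list)      # node = (city, time) -> [(next_node, cost)]
times_at = defaultdict(set)
for f, dep, t, arr, price in flights:
    graph[(f, dep)].append(((t, arr), price))
    times_at[f].add(dep)
    times_at[t].add(arr)
# waiting arcs: stay in the same city until the next relevant time, at zero cost
for city, ts in times_at.items():
    ts = sorted(ts)
    for a, b in zip(ts, ts[1:]):
        graph[(city, a)].append(((city, b), 0))

def cheapest(start_city, start_time, goal_city):
    """Dijkstra over the time-expanded graph; returns the lowest total price."""
    pq = [(0, (start_city, start_time))]
    best = {}
    while pq:
        cost, node = heapq.heappop(pq)
        if node in best:
            continue
        best[node] = cost
        if node[0] == goal_city:
            return cost
        for nxt, price in graph[node]:
            if nxt not in best:
                heapq.heappush(pq, (cost + price, nxt))
    return None

print(cheapest("A", 8, "B"))   # cheapest way to reach B starting from A at 8:00
```

Missed connections never appear as options because an arc only exists when the departure node's time matches the flight's departure time; waiting arcs supply the "sit in the airport" transitions.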
I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.
It seems like k-nearest neighbors or k-means is the best way to do this.
I can't seem to understand how to actually implement either. I've seen some mathematical formulas (such as the one on the k-means Wikipedia page), but they don't really make sense to me.
Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?
K-Nearest Neighbor (aka KNN) is a classification algorithm.
Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.
Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating its nearest neighbors in the classification system. If you determine the classification from the 3 nearest neighbors, it could then be called a 3-nearest-neighbor algorithm.
How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.
A trivial example:
Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.
If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?
Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
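To make this concrete, here is a minimal, hedged sketch of k-nearest-neighbor classification in plain Python; the feature vectors and labels are made up, and in the recommendation setting you would substitute whatever characteristics you extract from posts and users.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label). Classify query by majority vote
    among its k nearest neighbors (Euclidean distance)."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# made-up training data: (feature vector, label)
train = [
    ((1.0, 1.0), "politics"),
    ((1.2, 0.9), "politics"),
    ((5.0, 5.1), "cats"),
    ((4.8, 5.3), "cats"),
    ((5.2, 4.9), "cats"),
]

print(knn_classify(train, (4.9, 5.0), k=3))   # -> "cats"
```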
What you described sounds like a recommender engine, not a clustering algorithm like k-means, which is in essence an unsupervised approach. I can't say exactly what Reddit actually uses, but I found some interesting posts by googling around "recommender + reddit", e.g. "Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed!". Anyway, the k-NN algorithm (described among the top ten data mining algorithms, with pseudo-code on Wikipedia) might be used, or other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
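For the collaborative filtering direction, here is a very small, hedged sketch of user-based filtering over a made-up user/post "like" matrix; real systems add weighting, normalization and scalability tricks on top of this.

```python
import numpy as np

# made-up user x post "like" matrix (1 = liked); rows are users, columns are posts
likes = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

def recommend(user_idx, likes, top_n=2):
    """User-based collaborative filtering: score unseen posts by summing the
    cosine similarity of the users who liked them."""
    norms = np.linalg.norm(likes, axis=1, keepdims=True)
    sims = (likes @ likes.T) / (norms @ norms.T + 1e-9)  # user-user cosine similarity
    scores = sims[user_idx] @ likes                      # weight posts by similar users
    scores[likes[user_idx] > 0] = -np.inf                # drop posts already liked
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0, likes))   # indices of posts to recommend to user 0
```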
k-means clustering in its simplest form groups values around central average values (the means). Suppose you have the following values:
1,2,3,4,6,7,8,9,10,11,12,21,22,33,40
Now if I do k-means clustering, remembering that k-means uses an averaging mechanism that pulls each value toward its nearest cluster center, we might get the following.
cluster-1
1,2,3,4,6,7,8,9
cluster-2
10,11,12
cluster-3
21,22
cluster-4
33
cluster-5
40
Remember, I just made up these cluster centers (clusters 1-5) for illustration.
So the next time you do clustering, the numbers will end up around one of these central means (also known as k centers). The data above is one-dimensional.
When you perform k-means clustering on large, multidimensional data sets (multidimensional data means each sample is an array of values, and you may have millions of samples of the same dimension), you will need something bigger and more scalable. You would first average one array to get a single value, likewise for the other arrays, and then perform the k-means clustering on those values.
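For the one-dimensional example above, here is a minimal sketch using scikit-learn's KMeans; choosing k=5 just mirrors the made-up clusters above.

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 21, 22, 33, 40], dtype=float)
X = values.reshape(-1, 1)            # k-means expects 2D input: one feature per sample

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
for label in range(5):
    members = values[kmeans.labels_ == label]
    center = kmeans.cluster_centers_[label, 0]
    print(f"cluster {label}: center={center:.1f}, members={members.tolist()}")
```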
Read one of my questions Here
Hope this helps.
To do k-nearest neighbours you mostly need a notion of distance and a way of finding the k nearest neighbours of a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbours at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm: to classify a new point p, find its k nearest neighbours and classify p according to the most popular class amongst those k neighbours.
I guess in your case you could provide somebody with a list of similar posts as soon as you decide what nearest means, and then monitor click-through from this and try to learn from that to predict which of those alternatives would be most popular.
If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.
Here is a very simple example of KNN for the MNIST dataset.
Once you are able to calculate the distance between your documents, the same algorithm would work:
http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/