Oracle SQL RDBMS table data to mind map node structure

I have done a lot of searching on this, but I'm not sure what my approach should be, so I'm looking for community advice. I would like to plot the graphical points of dependent data. The data already resides in an RDBMS, and I have written some layered queries to plot the points and adapted them to a node-tree structure in Tableau. I have gotten to the second level, but it is quite convoluted and inefficient to continue down that path.
I am surprised a hierarchical DB would be required for this; it's mostly algebraic logic.
Here's the concept:
Adapted from (and credit to) Data + Science: Node-Link Tree Diagram in Tableau.
I took this data and queried it against a table with the t value, then applied the sigmoid function in Tableau to create the connecting lines.
Here are my results so far.
Note the plot points are based on a 3rd level (not shown) which has 71 data points. This is why the Level 1 y points are 36, 18, -18, 36; Level 2 is just plotted badly due to the algebra.
The goal is to get POSITION_1 & POSITION_2 to plot dynamically so that any changes in the tables automatically update the final result, i.e. a living node tree driven by the RDBMS data.
Is my approach way off kilter, or is there a better way?
BTW, it's not meant to blow up to millions of nodes; probably just 5 levels, somewhere in the hundreds of nodes.
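For what it's worth, a minimal sketch of how the level and a sibling position could be derived straight from Oracle with a hierarchical query, assuming a hypothetical table NODE_TREE(NODE_ID, PARENT_ID, LABEL) that is not in the original post; the sigmoid curves for the connectors would still be computed downstream in Tableau:

-- Hypothetical table: NODE_TREE(NODE_ID, PARENT_ID, LABEL); root rows have PARENT_ID IS NULL.
-- LEVEL gives the depth (x position); ROW_NUMBER() within each parent gives a raw
-- sibling ordering that can be rescaled into POSITION_1 / POSITION_2 for plotting.
SELECT node_id,
       parent_id,
       label,
       LEVEL AS tree_level,                                                        -- depth = x position
       ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY label) AS sibling_order  -- basis for y position
FROM   node_tree
START WITH parent_id IS NULL
CONNECT BY PRIOR node_id = parent_id
ORDER SIBLINGS BY label;

Because the query walks whatever is currently in the table, any inserts or deletes show up in the plot on the next refresh, which is the "living node tree" behaviour described above.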

Related

How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

In a KNN-like algorithm we need to load the model data into a cache for predicting the records.
Here is the example for KNN.
So if the model is a large file, say 1 or 2 GB, we will not be able to load it into the Distributed Cache.
Example:
In order to predict one outcome, we need to find the distance between that single record and all the records in the model result, and find the minimum distance. So we need the model result in our hands, and if it is a large file it cannot be loaded into the Distributed Cache for finding the distance.
One way is to split/partition the model result into several files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome.
How can we partition the file and perform the operation on these partitions?
i.e. record 1 <distance> file1, file2, ..., filen
record 2 <distance> file1, file2, ..., filen
This is what came to my mind.
Is there any other way?
Any pointers would help me.
I think the way you partition the data mainly depends on the data itself.
Given that you have a model with a bunch of rows, and that you want to find the k closest ones to the data in your input, the trivial solution is to compare them one by one. This can be slow because it means going through 1-2 GB of data millions of times (I assume you have a large number of records to classify, otherwise you don't need Hadoop).
That is why you need to prune your model efficiently (your partitioning), so that you compare only those rows that are most likely to be the closest. This is a hard problem and requires knowledge of the data you operate on.
Additional tricks that you can use to squeeze out performance are:
Pre-sorting the input data so that input items which will be compared against the same partition come together. Again, this depends on the data you operate on.
Using random-access indexed files (like Hadoop's MapFile) to find the data faster and cache them.
In the end it may actually be easier to store your model in a Lucene index, so you can achieve the effect of partitioning by looking up the index. Pre-sorting the data is still helpful there.
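The Hadoop plumbing is beyond a quick sketch, but purely to make the pruning idea concrete, here is the same "only compare against the likely-closest partition" strategy expressed in generic SQL, with made-up tables model(cell_x, cell_y, f1, f2, class_label) and input(id, f1, f2), where the cell columns hold a precomputed coarse grid key:

-- Hypothetical tables:
--   model(cell_x, cell_y, f1, f2, class_label)  -- cell_* = FLOOR(feature / 10), precomputed
--   input(id, f1, f2)
-- Each input record is compared only against model rows in its own coarse grid cell
-- instead of the whole model; the k smallest distances per id are then voted on.
-- (A real scheme would also probe adjacent cells so boundary neighbours aren't missed.)
SELECT i.id,
       m.class_label,
       SQRT(POWER(m.f1 - i.f1, 2) + POWER(m.f2 - i.f2, 2)) AS dist
FROM   input i
JOIN   model m
  ON   m.cell_x = FLOOR(i.f1 / 10)   -- 10 = assumed cell size
 AND   m.cell_y = FLOOR(i.f2 / 10)
ORDER BY i.id, dist;

The same bucketing idea carries over to splitting the model file: each partition holds one (or a few) cells, and an input record only needs the partitions covering its own neighbourhood.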

What is the relative performance of 1 geometry column vs 4 decimals in Sql Server 2008?

I need to represent the dimensions of a rectangular surface in a SQL Server 2008 database. I will need to perform queries based on the distance between different points and on the total area of the surface.
Will my performance be better using a geometry datatype or 4 decimal columns? Why?
If the geometry datatype is unnecessary in this situation, what amount of complexity in the geometrical shape would be required for using the geometry datatype to make sense?
I have not used the geometry datatype, and have never had reason to read up on it. Even so, it seems to me that if you're just doing basic arithmetic on a simple geometric object, the mundane old SQL datatypes should be quite efficient, particularly if you toss in some calculated columns for frequently used calculations.
For example:
--DROP TABLE MyTable
CREATE TABLE MyTable
(
     X1 decimal NOT NULL
    ,Y1 decimal NOT NULL
    ,X2 decimal NOT NULL
    ,Y2 decimal NOT NULL
    ,Area     AS abs((X2 - X1) * (Y2 - Y1))                    -- rectangle area
    ,XLength  AS abs(X2 - X1)                                  -- width
    ,YLength  AS abs(Y2 - Y1)                                  -- height
    ,Diagonal AS sqrt(power(X2 - X1, 2) + power(Y2 - Y1, 2))   -- diagonal length
)
INSERT MyTable VALUES (1, 1, 4, 5)
INSERT MyTable VALUES (4, 5, 1, 1)
INSERT MyTable VALUES (0, 0, 3, 3)
SELECT * FROM MyTable
Ugly calculations, but they won't be performed unless and until they are actually referenced (or unless you choose to index them). I have no statistics, but performing the same operations via the geometry datatype probably means accessing rarely used mathematical subroutines, possibly embedded in system CLR assemblies, and I just can't see that being significantly faster than the bare-bones SQL arithmetic routines.
I just took a look in BOL at the geometry datatype. (a) Zounds! (b) Cool! Check out the entries under "Geometry Data Type Method Reference" (online here, but you want to look at the expanded treeview under this entry). If that's the kind of functionality you'll be needing, by all means use the geometry data type, but for simple processing, I'd stick with the knucklescraper datatypes.
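For comparison, a rough sketch of the same sort of rectangle stored as a geometry value (the table and column names here are made up), where STArea(), STLength() and STDistance() replace the hand-rolled computed columns:

-- Hypothetical geometry-based equivalent of MyTable above.
CREATE TABLE MyGeomTable
(
     Id    int IDENTITY PRIMARY KEY
    ,Shape geometry NOT NULL
)
-- The (1,1)-(4,5) rectangle from the decimal example, stored as a polygon.
INSERT MyGeomTable (Shape)
VALUES (geometry::STGeomFromText('POLYGON((1 1, 4 1, 4 5, 1 5, 1 1))', 0))

SELECT Shape.STArea()   AS Area       -- 12
      ,Shape.STLength() AS Perimeter  -- 14
      ,Shape.STExteriorRing().STPointN(1).STDistance(Shape.STExteriorRing().STPointN(3)) AS Diagonal  -- 5
FROM MyGeomTable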
The geometry data types are more complex than simple decimals, so there is bound to be some overhead. But they do provide functions that calculate the distance between two points, and I would assume these have been optimised. The question might be: if you implemented the distance-between-points logic yourself, would this take longer than having the data in the appropriate format in the first place?
As with every DB question, it may come down to the ratio of inserts vs. selects/calculations.
The geometry datatype is spatial and decimal isn't.
Spatial vs. Non-spatial Data
Spatial data includes location, shape, size, and orientation.
For example, consider a particular square:
its center (the intersection of its diagonals) specifies its location
its shape is a square
the length of one of its sides specifies its size
the angle its diagonals make with, say, the x-axis specifies its orientation.
Spatial data includes spatial relationships. For example, the arrangement of ten bowling pins is spatial data.
Non-spatial data (also called attribute or characteristic data) is that information which is independent of all geometric considerations.
For example, a person's height, mass, and age are non-spatial data because they are independent of the person's location.
It's interesting to note that, while mass is non-spatial data, weight is spatial data in the sense that something's weight is very much dependent on its location!
It is possible to ignore the distinction between spatial and non-spatial data. However, there are fundamental differences between them:
spatial data are generally multi-dimensional and autocorrelated.
non-spatial data are generally one-dimensional and independent.
These distinctions put spatial and non-spatial data into different philosophical camps with far-reaching implications for conceptual, processing, and storage issues.
For example, sorting is perhaps the most common and important non-spatial data processing function that is performed.
It is not obvious how to even sort locational data such that all points end up "nearby" their nearest neighbors.
These distinctions justify a separate consideration of spatial and non-spatial data models. This unit limits its attention to the latter unless otherwise specified.
Here's some more if you're interested:
http://www.ncgia.ucsb.edu/giscc/units/u045/u045_f.html
Here's a link I found about benchmarking spatial data warehouses: http://hpc.ac.upc.edu/Talks/dir08/T000327/paper.pdf

Database to store sparse matrix

I have a very large and very sparse matrix, composed of only 0s and 1s. I then basically handle (row-column) pairs. I have at most 10k pairs per row/column.
My needs are the following:
Parallel insertion of (row-column) pairs
Quick retrieval of an entire row or column
Quick querying the existence of a (row-column) pair
A Ruby client if possible
Are there existing databases adapted to these kinds of constraints?
If not, what would get me the best performance:
A SQL database, with a table like this:
row(indexed) | column(indexed) (but the indexes would have to be constantly refreshed)
A NoSQL key-value store, with two tables like this:
row => columns ordered list
column => rows ordered list
(but with parallel insertion of elements to the lists)
Something else
Thanks for your help!
A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem and a graph database would suit your needs.
Graph databases, like Neo4j, are very good for fast traversal of the graph, because retrieving the neighbors of a vertex takes O(number of neighbors of that vertex), so it is not related to the number of vertices in the whole graph. Neo4j is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.
On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to do that analysis once in a while and just make the results available, you could try your luck with a framework for graph processing based on Google Pregel. It's a little bit like Map-Reduce, but aimed toward graph processing. There are already several open source implementations of that paper.
However, if a graph database or graph processing framework does not suit your needs, I recommend taking a look at HBase, which is an open-source, column-oriented data store based on Google's BigTable. Its data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and it does not require you to retrieve the whole row just to check whether a certain pair exists. There are some Ruby libraries for that database, but I imagine it would be safer to use JRuby instead of MRI for interacting with it.
If your matrix is really sparse (i.e. the nodes only have a few interconnections) then you would get reasonably efficient storage from an RDBMS such as Oracle, PostgreSQL or SQL Server. Essentially you would have a table with two fields (row, col) and an index or key each way.
Set up the primary key one way round (depending on whether you mostly query by row or by column) and make another index on the fields the other way round. This will only store data where a connection exists, and it will be proportional to the number of edges in the graph.
The indexes will allow you to efficiently retrieve either a row or a column, and will always be in sync.
If you have 10,000 nodes and 10 connections per node, the database will only have 100,000 entries. 100 edges per node gives 1,000,000 entries, and so on. For sparse connectivity this should be fairly efficient.
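A minimal sketch of that layout in generic SQL (the table and column names are made up):

-- Hypothetical adjacency-list storage for a sparse 0/1 matrix.
-- Only existing (row, col) pairs are stored.
CREATE TABLE cell
(
    row_id int NOT NULL
   ,col_id int NOT NULL
   ,CONSTRAINT pk_cell PRIMARY KEY (row_id, col_id)   -- fast row retrieval and pair lookup
);
-- Covering index the other way round for fast column retrieval.
CREATE INDEX ix_cell_col_row ON cell (col_id, row_id);

-- Entire row 42:
SELECT col_id FROM cell WHERE row_id = 42;
-- Entire column 7:
SELECT row_id FROM cell WHERE col_id = 7;
-- Does pair (42, 7) exist?
SELECT COUNT(*) FROM cell WHERE row_id = 42 AND col_id = 7;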
A back-of-fag-packet estimate
This table will essentially have a row and a column field. If the clustered index goes (row, column, value) then the other covering index would go (column, row, value). If the additions and deletions are random (i.e. not batched by row or column), the I/O will be approximately double that for just the table.
If you batch the inserts by row or column then you will get less I/O on one of the indexes, as the records are physically located together in one of the indexes. If the matrix really is sparse then this adjacency-list representation is by far the most compact way to store it, and it will be much faster than storing it as a 2D array.
A 10,000 x 10,000 matrix with a 64-bit value would take 800 MB plus the row index. Updating one value would require a write of at least 80 KB for each write (writing out the whole row). You could optimise writes by row if your data can be grouped by rows on insert. If the inserts are real-time and random, then you will write out an 80 KB row for each insert.
In practice, these writes would have some efficiency because they would all be written out to a mostly contiguous area, depending on how your NoSQL platform physically stores its data.
I don't know how sparse your connectivity is, but if each node had an average of 100 connections, then you would have 1,000,000 records. This would be approximately 16 bytes per row (int4 row, int4 column, double value) plus a few bytes of overhead for both the clustered table and the covering index. This structure would take around 32 MB plus a little overhead to store.
Updating a single record on a row or column would cause two single disk-block writes (8 KB, in practice a segment) for random access, assuming the inserts aren't row- or column-ordered.
Adding 1 million randomly ordered entries to the array representation would result in approximately 80 GB of writes plus a little overhead. Adding 1 million entries to the adjacency-list representation would result in approximately 32 MB of writes (16 GB in practice, because a whole block is written for each index leaf node), plus a little overhead.
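Spelling out the arithmetic behind those figures (same assumptions as above: a 10,000 x 10,000 matrix, 8-byte values, 16-byte adjacency rows, 8 KB blocks, clustered table plus one covering index):
dense array size                  = 10,000 x 10,000 x 8 bytes  = 800 MB
one full-row write (array)        = 10,000 x 8 bytes           = 80 KB
adjacency-list size               ~ 1,000,000 x 16 bytes x 2   ~ 32 MB  (table + covering index)
1M random inserts, array          ~ 1,000,000 x 80 KB          = 80 GB
1M random inserts, adjacency list ~ 1,000,000 x 8 KB x 2       = 16 GB  (one block write per index)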
For that level of connectivity (10,000 nodes, 100 edges per node) the adjacency list will be more efficient in storage space, and probably in I/O as well. You will get some optimisation from the platform, so some sort of benchmark might be appropriate to see which is faster in practice.

Graph plotting: only keeping most relevant data

In order to save bandwidth, and so as not to have to generate pictures/graphs ourselves, I plan on using Google's charting API:
http://code.google.com/apis/chart/
which works by simply issuing a (potentially long) GET (or a POST); Google then generates and serves the graph itself.
As of now I've got graphs made of about two thousand entries, and I'd like to trim this down to some arbitrary number of entries (e.g. by keeping only 50% of the original entries, or 10% of the original entries).
How can I decide which entries I should keep so as to have my new graph the closest to the original graph?
Is this some kind of curve-fitting problem?
Note: I know I can POST to Google's chart API with up to 16K of data, and this may be enough for my needs, but I'm still curious.
The flot-downsample plugin for the Flot JavaScript graphing library could do what you are looking for, up to a point.
The purpose is to try to retain the visual characteristics of the original line using considerably fewer data points.
The research behind this algorithm is documented in the author's thesis.
Note that it doesn't work for every kind of series, and in my experience it won't give meaningful results when you want a downsampling factor beyond 10.
The problem is that it cuts the series into windows of equal size and then keeps one point per window. Since you may have denser data in some windows than in others, the result is not necessarily optimal. But it's efficient (it runs in linear time).
What you are looking to do is known as downsampling or decimation. Essentially you filter the data and then drop N - 1 out of every N samples (decimation or downsampling by a factor of N). A crude filter is just taking a local moving average. E.g. if you want to decimate by a factor of N = 10, then replace every 10 points with the average of those 10 points.
Note that with the above scheme you may lose some high-frequency detail from your plot (since you are effectively low-pass filtering the data). If it's important to see short-term variability then an alternative approach is to plot every N points as a single vertical bar which represents the range (i.e. min..max) of those N points.
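If the data is sitting in a database anyway, both schemes can be sketched in plain SQL; this assumes a made-up table points(x, y) with x evenly spaced, and a decimation factor of 10:

-- Hypothetical table: points(x, y), x evenly spaced.
-- One output point per bucket of 10 input points: the average for the
-- low-pass/decimation scheme, and min/max for the range-bar scheme.
SELECT FLOOR(x / 10) AS bucket
      ,AVG(y)        AS y_avg   -- averaged point (crude low-pass filter)
      ,MIN(y)        AS y_min   -- bottom of the range bar
      ,MAX(y)        AS y_max   -- top of the range bar
FROM   points
GROUP BY FLOOR(x / 10)
ORDER BY bucket;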
Graph (time-series data) summarization is a very hard problem. It's like deciding, in a text, which "relevant" parts to keep in an automatic summary of it. I suggest you use one of the most respected libraries for finding "patterns of interest" in time-series data, by Eamonn Keogh.

Searching geocoded information by distance

I have a database of addresses, all geocoded.
What is the best way to find all addresses in our database within a certain radius of a given lat, lng?
In other words a user enters (lat, lng) of a location and we return all records from our database that are within 10, 20, 50 ... etc. miles of the given location.
It doesn't have to be very precise.
I'm using MySQL DB as the back end.
There are Spatial extensions available for MySQL 5 - an entry page to the documentation is here:
http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html
There are lots of details of how to accomplish what you are asking, depending upon how your spatial data is represented in the DB.
Another option is to make a function for calculating the distance using the Haversine formula mentioned already. The math behind it can be found here:
www.movable-type.co.uk/scripts/latlong.html
Hopefully this helps.
You didn't mention your database, but in SQL Server 2008 it is as easy as this when you use the geography data type.
This will find all zipcodes within 20 miles of zipcode 10028:
SELECT h.*
FROM zipcodes g
JOIN zipcodes h ON g.zipcode <> h.zipcode
    AND g.zipcode = '10028'
    AND h.zipcode <> '10028'
WHERE g.GeogCol1.STDistance(h.GeogCol1) <= (20 * 1609.344)   -- 20 miles converted to meters
see also here SQL Server 2008 Proximity Search With The Geography Data Type
The SQL Server 2000 version is here: SQL Server Zipcode Latitude/Longitude proximity distance search
This is a typical spatial search problem.
1. What DB are you using? SQL Server 2008, Oracle, ESRI geodatabase, and PostGIS are some spatial DB engines that have this functionality.
2. Otherwise, you will probably want a spatial algorithm library if you want to achieve this. You could code it yourself, but I wouldn't suggest it, because computational geometry is a complicated subject.
If you're using a database which supports spatial types, you can build the query directly, and the database will handle it. PostgreSQL, Oracle, and the latest MS SQL all support this, as do some others.
If not, and precision isn't an issue, you can do a search in a box instead of by radius, as this will be very fast. Otherwise, things get complicated, as the actual conversion from lat-long -> distances needs to happen in a projected space (since the distances change in different areas of the planet), and life gets quite a bit nastier.
I don't remember the equation off the top of my head, but the Haversine formula is what is used to calculate distances between two points on the Earth. You may Google the equation and see if that gives you any ideas. Sorry, I know this isn't much help, but maybe it will give a place to start.
If it doesn't have to be very accurate, and assuming you have an x and y column in your table, then just select all rows in a big bounding rectangle and use Pythagoras (or Haversine) to trim off the results in the corners.
e.g. select * from locations where (x between xpos - 10 miles and xpos + 10 miles) and (y between ypos - 10 miles and ypos + 10 miles).
Remember Pythagoras is sqrt(x_dist^2 + y_dist^2).
It's quick and simple, easy to understand, and doesn't need funny joins.
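A rough sketch of that bounding-box-then-trim idea in MySQL, assuming a made-up table addresses(id, lat, lng) in degrees and a 10-mile radius; the box does the cheap cut with index-friendly BETWEEN predicates, and the Haversine formula trims the corners:

-- Hypothetical table: addresses(id, lat, lng); centre point and radius in miles.
-- 1 degree of latitude ~ 69 miles; a degree of longitude shrinks by cos(latitude).
SET @lat = 40.7766, @lng = -73.9524, @radius = 10;

SELECT id, lat, lng,
       3959 * 2 * ASIN(SQRT(
           POWER(SIN(RADIANS(lat - @lat) / 2), 2) +
           COS(RADIANS(@lat)) * COS(RADIANS(lat)) *
           POWER(SIN(RADIANS(lng - @lng) / 2), 2))) AS distance_miles   -- Haversine, Earth radius ~3959 mi
FROM   addresses
WHERE  lat BETWEEN @lat - (@radius / 69)
               AND @lat + (@radius / 69)                                -- cheap bounding box
  AND  lng BETWEEN @lng - (@radius / (69 * COS(RADIANS(@lat))))
               AND @lng + (@radius / (69 * COS(RADIANS(@lat))))
HAVING distance_miles <= @radius                                        -- exact trim of the corners
ORDER BY distance_miles;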
