Oracle Spatial - spatial index on time-varying data

I am trying to design a data model for geodetic data that performs decently as the data grow over time. However, every idea I have come up with has hit one limitation of Oracle or another.
These are the requirements:
The data to store are 2-dimensional points with latitude and longitude (being able to handle any geometry would be a plus).
New data are added on a monthly basis. This conceptually updates the positions of the points, deletes old points, or creates new ones. New data for new time instants come in batches, and as such are conceptually ordered and labelled, say t1, t3, t4. (It is not a form of asset tracking; it is more of an evolving snapshot of the data.)
SELECTs on current or historical data must execute in real time (e.g. to be depicted on an interactive map). SELECT statements will query data like "return all points belonging to a given region, as of t3", expecting to return an image of the initial data with the changes applied at times t1 and t3.
It is not known in advance what proportion of the original points are left unchanged. So for instance 100% of the points might be altered when receiving a new batch of data, or just 0%.
The data are conceptually tuples of (t, geometry), and the main problem is that you cannot create a spatial index on t and geometry, but only on geometry.
Overall, the point is that indexing geographical data that are conceptually "sliced" by another column is apparently not supported.
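To make the problem concrete, the query the index would ideally support looks roughly like the sketch below (SDO_ANYINTERACT is a standard Oracle Spatial operator; the bind variables, and the table and column names anticipating the snapshot model described later, are illustrative). Only the spatial predicate can be answered by the index; the snapshot filter is an ordinary predicate applied afterwards:

-- Hypothetical "as of" query: the spatial index serves the window predicate,
-- but knows nothing about the snapshot column.
SELECT g.geom
FROM   geocoded_data g
WHERE  SDO_ANYINTERACT(g.geom, :search_window) = 'TRUE'
AND    g.snapshot_id = :requested_snapshot;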
By the way, in spite of appearances, whether all data are saved as complete snapshots or the changes are saved as incremental deltas is not really the point of the whole matter, and makes no real difference.
Below are my failed attempts to solve the problem. If anyone has good ideas on how to efficiently model the data, just skip the remainder of this post (otherwise I would appreciate it if you could elaborate on the options I have tried so far).
First unsuccessful attempt - multi-column spatial index
I would save complete "snapshots" of the data at times t1, t3, t4, and the data would be identified by two columns, the geometry and an identifier of the slice of data:
create table geocoded_data (
  geom        SDO_GEOMETRY,
  snapshot_id number(5,0)
);
Of course, geometries require a spatial index to be operated on efficiently, and the obvious choice would have been a two-column index. This is where the idea fails, because a domain index (such as a spatial one) cannot be built on more than one column:
CREATE INDEX my_index ON geocoded_data(snapshot_id, geom)
INDEXTYPE IS mdsys.spatial_index;
SQL Error: ORA-29851: "cannot build a domain index on more than one column"
Second unsuccessful attempt - adding another dimension to the geometries
Another option would be to model the column snapshot_id as a third dimension embedded in the geom field. However, unless this is documented somewhere, one cannot assume that the resulting index will work properly on such a data structure.
After all, the third dimension would be just a marker with no geometrical significance, potentially hindering the performance of the index.
This option would be conceptually similar to using LRS (Linear Referencing System) measures on points.
And in fact it might be no coincidence that the docs about LRS indexing state:
Do not include the measure dimension in a spatial index, because this causes additional processing overhead and produces no benefit
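For completeness, a minimal sketch of what this attempt would have produced is given below: the snapshot identifier is smuggled in as a third coordinate, the spatial metadata registers three dimensions, and the index is created with sdo_indx_dims=3. All values (table name, SRID, bounds, tolerances) are only illustrative, and whether such an index behaves well on geodetic data is precisely the open question:

-- Register three dimensions for the hypothetical table GEOCODED_DATA_3D.
INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
VALUES ('GEOCODED_DATA_3D', 'GEOM',
        SDO_DIM_ARRAY(
          SDO_DIM_ELEMENT('LON', -180, 180, 0.05),
          SDO_DIM_ELEMENT('LAT',  -90,  90, 0.05),
          SDO_DIM_ELEMENT('SNAPSHOT', 0, 99999, 0.5)),
        4326);

-- A point at (lon 16.37, lat 48.21) belonging to snapshot 3.
INSERT INTO geocoded_data_3d (geom)
VALUES (SDO_GEOMETRY(3001, 4326, SDO_POINT_TYPE(16.37, 48.21, 3), NULL, NULL));

-- Ask the index to cover all three dimensions.
CREATE INDEX geocoded_data_3d_sx ON geocoded_data_3d (geom)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX PARAMETERS ('sdo_indx_dims=3');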
Third unsuccessful attempt - interval partitioning
A third way to go would be partitioning the data according to the column snapshot_id and creating a local spatial index. In this case, hopefully, partition elimination would ensure that only the relevant portion of the spatial index is used, disregarding the data in other snapshots.
The partitioning would be interval partitioning (a new partition would nicely and automatically be created upon receiving a new batch of data). However, this is what I get when I try to create a spatial index on an interval-partitioned table:
SQL Error: ORA-14762: "Domain index creation on interval partitioned tables is not permitted"
That's true, intended and documented: partitioning is generally OK, except that interval partitions are not supported by spatial indexes. Indeed, Using Partitioned Spatial Indexes explicitly says:
Only range partitioning is supported on the underlying table. All other kinds of partitioning are not currently supported for partitioned spatial indexes.
So I should make do with range partitioning. However, I'd rather exclude this option because it would entail some amount of "system" maintenance (creating new partitions manually, or as part of the application logic, which would be awkward).
Ideally I want a new partition for each snapshot, and I'd like the partition to be created automatically whenever a new snapshot is introduced.
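For reference, the range-partitioned variant that is supported would look roughly like the sketch below (partition names and bounds are illustrative, and the spatial metadata for GEOM must be registered beforehand); the last statement is the manual step I would rather avoid:

CREATE TABLE geocoded_data (
  snapshot_id NUMBER(5,0),
  geom        SDO_GEOMETRY
)
PARTITION BY RANGE (snapshot_id) (
  PARTITION p_t1 VALUES LESS THAN (2),
  PARTITION p_t3 VALUES LESS THAN (4),
  PARTITION p_t4 VALUES LESS THAN (5)
);

-- A local spatial index, partitioned along with the table, so that
-- partition pruning on snapshot_id also prunes the index.
CREATE INDEX geocoded_data_sx ON geocoded_data (geom)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX LOCAL;

-- Every new snapshot needs its own partition, added manually or by the application:
ALTER TABLE geocoded_data ADD PARTITION p_t5 VALUES LESS THAN (6);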
Fourth unsuccessful attempt - representing the data in an incremental fashion
The last option, which is the most CPU-intensive, would persist the initial snapshot of the data while saving the new batches in the form of deltas.
However, as a matter of fact, this wouldn't address the fundamental issue at the heart of the problem (the inability to spatially index geometries discriminated by the content of another column).
For instance, when the application had to reconstruct the content of a given portion of the map up to t2, it would have to retrieve all data that are related to that map portion, up to the deltas belonging to t2.
Unfortunately the spatial index would fetch all deltas in the relevant portion of the map, including those that are irrelevant because they were added later than t2. For instance, the index would identify:
point A, OK: part of the initial snapshot
point B, OK: was added on the same map portion by t2
points C, D, E, F: KO!... they were added to the same map portion, but at a later t3, so they will be excluded only by a predicate, not by the index.
On the other hand, even if the index returned only the records we need, this wouldn't be sustainable, because changes add up over time and the cost of returning the current image after, say, 10 years of changes might be outrageous (to get the right image, one would have to combine all deltas introduced after the initial snapshot).
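For reference, the reconstruction query in the delta model would look roughly like the sketch below (table and column names are hypothetical). The spatial index answers only the window predicate; the snapshot_id filter and the folding of the deltas happen afterwards, which is exactly the weakness described above:

-- Rebuild the map portion as of t2 from the initial snapshot plus deltas.
SELECT d.point_id, d.snapshot_id, d.change_type, d.geom
FROM   geocoded_deltas d
WHERE  SDO_ANYINTERACT(d.geom, :map_window) = 'TRUE'  -- served by the spatial index
AND    d.snapshot_id <= :t2                           -- plain filter, applied afterwards
ORDER BY d.point_id, d.snapshot_id;                   -- the application folds the deltas per point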

Related

Elasticsearch Data in Grafana without timestamp

I am wondering if it is possible to have data from elasticsearch indices without timestamp attached to them.
I need a list of two columns as a drop-down. This list is cross-checked against another index to generate maps, but if I zoom in, the graph breaks because the drop-down list exists from time a to b but not from c to d.
My MacGyver solution is to just add the list to the index every few minutes so that, on the graph, the data are reasonably dense. This allows the user to zoom in pretty well into different parts of the graph. But over time this is going to make my index unreasonably large.

Vertica query optimization

I want to optimize a query against a Vertica database. I have a table like this:
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions).
I fetch some data using this query:
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slowly. The query plan shows something like this:
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I run EXPLAIN on:
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan on some part of the table. In other databases I could create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. You would want to create a query-specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
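For example, a hedged sketch of the two suggestions above: refreshing the statistics (ANALYZE_STATISTICS is a standard Vertica function; the table name is taken from the question) and rewriting the IN list as a join against a small lookup table (the table wanted_b is made up for illustration):

-- Collect statistics so the optimizer stops reporting NO STATISTICS.
SELECT ANALYZE_STATISTICS('data');

-- Possible rewrite of the IN list as a join.
SELECT d.b, d.c
FROM   data d
JOIN   wanted_b w ON w.b = d.b
WHERE  d.a = 1;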
Creating good projections is the "secret sauce" of how to make Vertica perform well. Projection design is a bit of an art form, but there are three fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in the queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing inserts to affect only one small ROS container instead of the entire file.
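Putting the first two ideas together for the query in the question, a minimal projection sketch might look like this (the projection name is made up; choose the segmentation key to match your real join patterns, and refresh the projection after creating it):

CREATE PROJECTION data_by_a_b AS
SELECT a, b, c
FROM   data
ORDER BY a, b                      -- matches WHERE a = 1 AND b IN (...)
SEGMENTED BY HASH(a, b) ALL NODES; -- spread rows evenly across the cluster

-- Populate the new projection (runs in the background).
SELECT START_REFRESH();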

Bad performance when writing log data to Cassandra with timeuuid as a column name

Following the pointers in an ebay tech blog and a datastax developers blog, I model some event log data in Cassandra 1.2. As a partition key, I use “ddmmyyhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int,
           rId int, created timeuuid, data map<text, text>,
           PRIMARY KEY ((yymmddhh, bucket), created));
(rId identifies the resource that fired the event.)
(The map holds key-value pairs derived from a JSON document; the keys change, but not much.)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but I can't pinpoint the problem.
When I use the timeuuid as a part of the row key, the performance remains stable on a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain below roughly 0.005 s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the datastax documentation (can't post the link due to reputation limitations, sorry, but it's easy to find) to say that each key creates an additional column -- or does it create one new column per "map"? That would be... hard for me to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
    yymmddhh VARCHAR,
    bucket INT,
    created TIMEUUID,
    rId INT,
    key VARCHAR,
    value VARCHAR,
    PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
);
(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume this is correct.)
This has two drawbacks compared to using a MAP: it requires you to reassemble the map when you query the data (you would get back one row per map entry), and it uses a little more space since C* will insert a few extra columns; but the upside is that there is no problem with collections getting too big.
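To illustrate, a minimal sketch of writing one row per map entry and reading the whole "map" back with this layout (the values and the bucket are made up; now() assumes a CQL version that provides the timeuuid function, otherwise generate the timeuuid client-side):

-- One INSERT per key/value pair instead of one map per event.
INSERT INTO transactions (yymmddhh, bucket, created, rId, key, value)
VALUES ('13052117', 42, now(), 1001, 'status', 'OK');

-- Read back one partition; the client groups rows by (created, rId)
-- to rebuild each event's map.
SELECT created, rId, key, value
FROM   transactions
WHERE  yymmddhh = '13052117' AND bucket = 42;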
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 (or some fixed number of) records; the buckets can be managed using a counter column family.

Database to store sparse matrix

I have a very large and very sparse matrix, composed of only 0s and 1s. I then basically handle (row-column) pairs. I have at most 10k pairs per row/column.
My needs are the following:
Parallel insertion of (row-column) pairs
Quick retrieval of an entire row or column
Quick querying the existence of a (row-column) pair
A Ruby client if possible
Are there existing databases adapted for these kind of constraints?
If not, what would give me the best performance:
A SQL database, with a table like this:
row(indexed) | column(indexed) (but the indexes would have to be constantly refreshed)
A NoSQL key-value store, with two tables like this:
row => columns ordered list
column => rows ordered list
(but with parallel insertion of elements to the lists)
Something else
Thanks for your help!
A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem and a graph database would suit your needs.
Graph databases, like Neo4J, are very good for fast traversal of the graph, because retrieving the neighbors of a vertex takes O(number of neighbors of that vertex), so it is not related to the number of vertices in the whole graph. Neo4J is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.
On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to do that analysis once in a while and just make the results available, you could try your luck with a framework for graph processing based on Google Pregel. It's a little bit like Map-Reduce, but aimed toward graph processing. There are already several open source implementations of that paper.
However, if a graph database or graph processing framework does not suit your needs, I recommend taking a look at HBase, which is an open-source, column-oriented data store based on Google BigTable. Its data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and it does not require you to retrieve the whole row just to check if a certain pair exists. There are some Ruby libraries for that database, but I imagine that it would be safer to use JRuby instead of MRI for interacting with it.
If your matrix is really sparse (i.e. the nodes only have a few interconnections) then you would get reasonably efficient storage from an RDBMS such as Oracle, PostgreSQL or SQL Server. Essentially you would have a table with two fields (row, col) and an index or key each way.
Set up the primary key one way round (depending on whether you mostly query by row or by column) and make another index on the fields the other way round. This will only store data where a connection exists, and it will be proportional to the number of edges in the graph.
The indexes will allow you to efficiently retrieve either a row or column, and will always be in sync.
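A minimal sketch of that layout in generic SQL (names are made up; adapt types and the clustering choice to your RDBMS):

CREATE TABLE matrix_cells (
    row_id INT NOT NULL,
    col_id INT NOT NULL,
    PRIMARY KEY (row_id, col_id)          -- fast retrieval of a whole row
);

CREATE INDEX ix_matrix_cells_col_row
    ON matrix_cells (col_id, row_id);     -- fast retrieval of a whole column

-- Existence check for a single (row, column) pair:
SELECT 1 FROM matrix_cells WHERE row_id = 42 AND col_id = 17;

-- An entire row, or an entire column:
SELECT col_id FROM matrix_cells WHERE row_id = 42;
SELECT row_id FROM matrix_cells WHERE col_id = 17;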
If you have 10,000 nodes and 10 connections per node the database will only have 100,000 entries. 100 edges per node will give 1,000,000 entries, and so on. For sparse connectivity this should be fairly efficient.
A back-of-fag-packet estimate
This table will essentially have a row and a column field. If the clustered index goes (row, column, value) then the other covering index would go (column, row, value). If the additions and deletions were random (i.e. not batched by row or column), the I/O would be approximately double that of the table alone.
If you batched the inserts by row or column then you would get less I/O on one of the indexes as the records are physically located together in one of the indexes. If the matrix really is sparse then this adjacency list representation is by far the most compact way to store it, which will be much faster than storing it as a 2D array.
A 10,000 x 10,000 matrix with a 64 bit value would take 800MB plus the row index. Updating one value would require a write of at least 80k for each write (writing out the whole row). You could optimise writes by rows if your data can be grouped by rows on inserts. If the inserts are realtime and random, then you will write out an 80k row for each insert.
In practice, these writes would have some efficiency because they would all be written out to a mostly contiguous area, depending on how your NoSQL platform physically stores its data.
I don't know how sparse your connectivity is, but if each node had an average of 100 connections, then you would have 1,000,000 records. This would be approximately 16 bytes per row (Int4 row, Int4 column, Double value) plus a few bytes overhead for both the clustered table and covering index. This structure would take around 32MB + a little overhead to store.
Updating a single record on a row or column would cause two single disk block writes (8k, in practice a segment) for random access, assuming the inserts aren't row or column ordered.
Adding 1 million randomly ordered entries to the array representation would result in approximately 80GB of writes + a little overhead. Adding 1m entries to the adjacency list representation would result in approximately 32MB of writes (16GB in practice because the whole block will be written for each index leaf node), plus a little overhead.
For that level of connectivity (10,000 nodes, 100 edges per node) the adjacency list will be more efficient in storage space, and probably in I/O as well. You will get some optimisation from the platform, so some sort of benchmark might be appropriate to see which is faster in practice.

TSql, building indexes before or after data input

Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexes (indices?) up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then build the indexes?
I'd temper af's answer by saying that it would probably be the case that "index first, insert after" would be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting records in the natural order of that index. The reason being that for each insert, the data rows themselves would have to be reordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
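To illustrate the difference the clustering key makes, here is a hypothetical pair of tables in T-SQL (NEWID() yields effectively random GUIDs, while NEWSEQUENTIALID(), which is only valid as a column default, keeps new rows at the end of the index):

-- Random clustering key: inserts land anywhere in the index, causing page splits.
CREATE TABLE dbo.RandomClustered (
    Id      UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    Payload VARCHAR(100) NULL
);

-- Monotonic clustering key: inserts append at the end of the index.
CREATE TABLE dbo.SequentialClustered (
    Id      UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY CLUSTERED,
    Payload VARCHAR(100) NULL
);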
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indexes are in place causes the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indexes afterwards, especially when there is that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
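As a hedged sketch, the "insert first, index after" approach might look like this in T-SQL (the table, index names, and file path are hypothetical):

-- 1. Drop (or disable) the nonclustered indexes before the load.
DROP INDEX IX_BigTable_Col1 ON dbo.BigTable;
DROP INDEX IX_BigTable_Col2 ON dbo.BigTable;

-- 2. Bulk load the data.
BULK INSERT dbo.BigTable
FROM 'C:\import\bigtable.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- 3. Build the indexes once, after all rows are in place.
CREATE NONCLUSTERED INDEX IX_BigTable_Col1 ON dbo.BigTable (Col1);
CREATE NONCLUSTERED INDEX IX_BigTable_Col2 ON dbo.BigTable (Col2);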
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.
