Why Cassandra secondary indexes are so slow on just 350k rows? - performance

I have a column family with the secondary index. The secondary index is basically a binary field, but I'm using a string for it. The field called is_exported and can be 'true' or 'false'. After request all loaded rows are updated with is_exported = 'false'.
I'm polling this column table each ten minutes and exporting new rows as they appear.
But here the problem: I'm seeing that time for this query grows pretty linear with amount of data in column table, and currently it takes from 12 to 20 seconds (!!!) to find 5000 rows. From my understanding, indexed request should not depend on number of rows in CF but from number of rows per one index value (cardinality), as it's just another hidden CF like:
"true" : rowKey1 rowKey2 rowKey3 ...
"false": rowKey1 rowKey2 rowKey3 ...
I'm using Pycassa to query the data, here the code I'm using:
column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name, read_consistency_level=2)
is_exported_expr = create_index_expression('is_exported', 'false')
clause = create_index_clause([is_exported_expr], count = 5000)
column_family.get_indexed_slices(clause)
Am I doing something wrong, but I expect this operation to work MUCH faster.
Any ideas or suggestions?
Some config info:
Cassandra 1.1.0
RandomPartitioner
I have 2 nodes and replication_factor = 2 (each server has a full data copy)
Using AWS EC2, large instances
Software raid0 on ephemeral drives
Thanks in advance!

I don't know the internals of indexing in Cassandra but I'm under the assumption it behaves in a similar fashion to PostgreSQL / MySQL, were indexing boolean, true/false columns is redundant in many scenarios. If cardinality is low (true & false = 2 unique values) and data is distributed quite evenly, e.g. ~50% true and ~50% false, then the database engine will likely perform a full table scan (which doesn't utilize the indexes).
The linear relationship between query execution and data set size would further support that Cassandra is performing a full table (keyspace) scan.

Related

ClickHouse - Inserting more than a hundred entries per query

I do not figure out how to increase the max number of entries per query. I would like to insert a thousand entries per query, and the default value is 100.
According to the doc, the parameter max_partitions_per_insert_block defines the limit of simultaneous entries.
I've tried to modify it from the ClickHouse client, but my insertion still fails :
$ clickhouse-client
my-virtual-machine :) set max_partitions_per_insert_block=1000
*SET* max_partitions_per_insert_block = 1000
Ok.
0 rows in set. Elapsed: 0.001 sec.
Moreover, this is no max_partitions_per_insert_block field in the /etc/clickhouse-server/config.xml file.
After modifying max_partitions_per_insert_block, I've tried to insert my data, but I'm stuck with this error :
infi.clickhouse_orm.database.ServerError: Code: 252, e.displayText() = DB::Exception: Too many partitions for single INSERT block (more than 100). The limit is controlled by 'max_partitions_per_insert_block' setting. Large number of partitions is a common misconception. It will lead to severe negative performance impact, including slow server startup, slow INSERT queries and slow SELECT queries. Recommended total number of partitions for a table is under 1000..10000. Please note, that partitioning is not intended to speed up SELECT queries (ORDER BY key is sufficient to make range queries fast). Partitions are intended for data manipulation (DROP PARTITION, etc). (version 19.5.3.8 (official build))
EDIT: I'm still stuck with this. I cannot even manually set the parameter to the value I want with SET max_partitions_per_insert_block = 1000: the value is changed but goes back to 100 after exiting and reopening clickhouse-client (even with sudo, so it does not look like a permission problem).
I figured it out when reading again the documentation, especially this document. I have recognized in the web profile settings I saw in the system.settings table. I just tried to insert the following in my default's profile, reloaded, and my insert of a thousand entries wen well : <max_partitions_per_insert_block>1000</max_partitions_per_insert_block>
I guess it was obvious for some, but probably not for unexperimented people.
Most likely you should change the partitioning scheme. Each partition generates several files on the file system, which can lead to disruption of the OS. In addition, this may be the cause of long mergers.

Does the number of columns in a Vertica table impact query performance?

We are working with a Vertica 8.1 table containing 500 columns and 100 000 rows.
The following query will take around 1.5 seconds to execute, even when using the vsql client straight on one of the Vertica cluster nodes (to eliminate any network latency issue) :
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource asquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong ? If so, why is it so ?
Thanks by advance
It sounds like your query is running against the (default) superprojection that includes all tables. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve it over time.
I was running Vertica 8.1.0-1, it seems the issue was a Vertica bug in the Vertica planning phase causing a performance degradation. It was solved in versions >= 8.1.1 :
[https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm]
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.

Neo4J Cypher: performance of matching multiple properties and creating relationships

A little context: I'm experimenting with Neo4J (as a newbie, but experienced in other database technologies) for possible use as a master data management system within our business of identity intelligence, in particular looking at building up a graph of places, identity attributes (eg: email addresses, telephone numbers, electoral roll data, etc.) with relationships between these nodes that express something meaningful, for example where an email address has been used, or where a telephone number is registered.
Desired system properties: I would like this system to have some specific properties that are valuble to us:
Fast ingestion of information from a significant number of providers (100+), this precludes lengthy (hours) ETL processes, short ones are ok!
On line at all times, this precludes use of the batch importer, we are most likely to use a fault tolerant cluster, sharding would be good :)
Capacity to eventually ingest ~30G records / year (~1000/second) and retain them, creation and retention of ~100G relationships / year, right now we are ingesting ~1/10 of this load.
Where I'm stuck: I have been experimenting with a single node in Azure, 32GB RAM, 4 cores, with non-local disk, running Debian 8 and Neo4J 3.1.1. This happily ingests and relates back together the UK postal address file (PAF), around 29M records, in a few 10s of minutes using either LOAD CSV or home-brew Java and bolt. I have also ingested but not related a test set of email address data, around 20M records, and now need to build relationships based on matching postcodes, building numbers, and possibly other fields between the two data sets. This is where things get much slower when using Cypher, here's the fastest query I have been able to create thus far:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i WITH e
MATCH(s:SUBBNAME) USING INDEX s:SUBBNAME(SBNA)
WHERE upper(e.Building) = s.SBNA WITH e,s
MATCH(m:MAINFILE)
WHERE trim(split(e.Postcode,' ')[0]) = m.OUTC AND
trim(split(e.Postcode,' ')[1]) = m.INCO AND
right('0000'+e.HouseNo,4) = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Indexes are:
ON :DDSEMAIL(HouseNo) ONLINE
ON :DDSEMAIL(Postcode) ONLINE
ON :DDSEMAIL(Building) ONLINE
ON :MAINFILE(OUTC) ONLINE
ON :MAINFILE(INCO) ONLINE
ON :MAINFILE(BNUM) ONLINE
ON :SUBBNAME(SBNA) ONLINE
Please note that the {list} parameter is being supplied through bolt from a Java client that has already enumerated all the ~20M DDSEMAIL nodes, and is batching into transactions (typically 1000 IDs at a time).
This is taking between 100-200msecs per ID, over a test run of 157000 IDs it took 7.3 hours, indicating a full execution time of ~760 hours or >1 month. The underlying machine appears CPU bound (no significant IO wait time).
Looking at the EXPLAIN for this query, there are no full scans, it's all schema index matching (once I had included the explicit index statement), so I'm not sure where to look for more speed..
(edited to add this PROFILE output):
PROFILE part 1
PROFILE part 2
This shows that the match to both parts of the postcode is filtering a lot of rows (56k), it may be better to re-order these fields to reduce the filter input size.
(end of edit)
As a (very unfair) comparision, I pushed both sets of data from CSV files into a custom Bloom filter written in C#/.NET, which performs similar field reformatting as above then concatenates to generate textual keys, and matches these keys together. This completed convolving all 20M email records against all 29M PAF records in under 5 minutes on a single core of my laptop. It was largely IO bound.
Right now I'm considering using an external application or a user procedure to perform the record matching, and just creating relationships using Cypher, but it feels wrong to avoid a well-written query engine that should be able to do this much, much quicker than it is.
What should I be looking at to improve performance please?
If I recall correctly, the index won't be utilized correctly when there are transformations occurring on the comparison values (such as UPPER() or LOWER() or TRIM()) when they're sourced from another node property. You may need to perform these operations first and alias them, then do the match.
Providing the index hint gets around this, I think, so your match to s.SBNA should be correctly using the index, but if there's an index on any of the matched properties on m:MAINFILE, that may not be using the index.
Test to see if this makes a difference, comparing this query to the older query on a smaller data set:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Also, if you could add a screenshot of a PROFILE or EXPLAIN of the query to your description (after expanding all plan nodes) that may help to see where things could improve.
EDIT
As you mentioned in your description, batching these may be a good idea. APOC Procedures has apoc.periodic.iterate(), which may help here.
Let's see if we can apply that to your query. Try this out:
WITH {list} AS list
CALL apoc.periodic.iterate('
UNWIND {list} as list
RETURN list
', '
WITH {list} as i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
MERGE (e)-[:USED_AT]->(m)
', {batchSize:1000, iterateList:true, params:{list:list}}) YIELD batches, total, committedOperations, failedOperations, failedBatches, errorMessages
RETURN batches, total, committedOperations, failedOperations, failedBatches, errorMessages
We have to sacrifice returning the total number of relationships created, however, as we can't return values from the batched query.

Oracle text index (Context) keep growing

In my application I needed to search through many varchar columns from differents tables.
So I created a materialized view in which I concatenate those columns, since they exceed the 4000 characters I merged them concatenating the columns with the TO_CLOBS(column1) || TO_CLOB(column)... || TO_CLOB(columnN).
The query is complex, so the refresh is complete on demand for the view. We refresh it every 2 minutes.
The CONTEXT index is created with the sync on commit parameter.
The index then is synchronized every two minutes.
But when we run the optimize index it does not defrag the index. So it keeps growing.
In ctx_user_indexes I see how optimize drops the docid count but tokens doesnt shrinks. But when I use the REBUILD parameter in the index optimization it works correctly (drops down number of rows in DR$TEXT_INDEX_IDX$I).
Any idea ?
Thanks, and sorry for my poor english.
By adding a job to decrease the number of rows works.

MongoDB collection used for log data: index or not?

I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows an hour. Each row contains a UNIX timestamp and a JSON string.
Periodically I would like to copy the contents of the collection to a file on S3, creating a file for each hour containing ~400,000 rows (eg. today_10_11.log contains all the rows received between 10am and 11am). I need to do this copy while the collection is receiving inserts.
My question: what is the performance impact of having an index on the timestamp column on the 400,000 hourly inserts verses the additional time it will take to query an hours worth of rows.
The application in question is using written in Ruby running on Heroku and using the MongoHQ plugin.
Mongo indexes the _id field by default, and the ObjectId already starts with a timestamp, so basically, Mongo is already indexing your collection by insertion time for you. So if you're using the Mongo defaults, you don't need to index a second timestamp field (or even add one).
To get the creation time of an object id in ruby:
ruby-1.9.2-p136 :001 > id = BSON::ObjectId.new
=> BSON::ObjectId('4d5205ed0de0696c7b000001')
ruby-1.9.2-p136 :002 > id.generation_time
=> 2011-02-09 03:11:41 UTC
To generate the object ids for a given time:
ruby-1.9.2-p136 :003 > past_id = BSON::ObjectId.from_time(1.week.ago)
=> BSON::ObjectId('4d48cb970000000000000000')
So, for example, if you wanted to load all docs inserted in the past week, you'd simply search for _ids greater than past_id and less than id. So, through the Ruby driver:
collection.find({:_id => {:$gt => past_id, :$lt => id}}).to_a
=> #... a big array of hashes.
You can, of course, also add a separate field for timestamps, and index it, but there's no point in taking that performance hit when Mongo's already doing the necessary work for you with its default _id field.
More information on object ids.
I have an application like yours, and currently it has 150 million log records. At 400k an hour, this DB will get large fast. 400k inserts an hour with indexing on timestamp will be much more worthwhile than doing an unindexed query. I have no problem with inserting tens of millions of records in an hour with indexed timestamp, yet if I do an unindexed query on timestamp it takes a couple of minutes on a 4 server shard (cpu bound). Indexed query comes up instantly. So definitely index it, the write overhead on indexing is not that high and 400k records an hour is not much for mongo.
One thing you do have to look out for is memory size though. At 400k records an hour you are doing 10 million a day. That would consume about 350MB of memory a day to keep that index in memory. So if this goes for a while your index can get larger than memory fast.
Also, if you are truncating records after some time period using remove, I have found that removes create a large amount of IO to disk and it is disk bound.
Certainly on every write you will need to update the index data. If you're going to be doing large queries on the data you will definitely want an index.
Consider storing the timestamp in the _id field instead of a MongoDB ObjectId. As long as you are storing unique timestamps you'll be OK here. _id doesn't have to be an ObjectID, but has an automatic index on _id. This may be your best bet as you won't add an additional index burden.
I'd just use a capped collection, unindexed, with space for, say 600k rows, to allow for slush. Once per hour, dump the collection to a text file, then use grep to filter out rows that aren't from your target date. This doesn't let you leverage the nice bits of the DB, but it means you don't have to ever worry about collection indexes, flushes, or any of that nonsense. The performance-critical bit of it is keeping the collection free for inserts, so if you can do the "hard" bit (filtering by date) outside of the context of the DB, you shouldn't have any appreciable performance impact. 400-600k lines of text is trivial for grep, and likely shouldn't take more than a second or two.
If you don't mind a bit of slush in each log, you can just dump and gzip the collection. You'll get some older data in each dump, but unless you insert over 600k rows between dumps, you should have a continuous series of log snapshots of 600k rows apiece.

Resources