How to create multiple indexes at once in RethinkDB?

In RethinkDB, is it possible to create multiple indexes at once?
Something like (which doesn't work):
r.db('test').table('user').indexCreate('name').indexCreate('email').run(conn, callback)

Index creation is a fairly heavyweight operation because it requires scanning the existing documents to bring the index up to date. It's theoretically possible to allow creating two indexes at the same time so that they both perform this scan in parallel and halve the work, but we don't support that right now.
However, I suspect that's not what you're asking about. If you're just looking for a way to avoid waiting for one index to finish before starting the next, the best approach would be:
table.index_create("foo").run(conn, noreply=True)
# returns immediately
table.index_create("bar").run(conn, noreply=True)
# returns immediately
You can also do any number of writes in a single query by putting them in an array, like so:
r.expr([table.index_create("foo"), table.index_create("bar")]).run(conn)
I can't actually think of why this would be useful for index creation, since writes don't block until the index is ready anyway, but who knows. It's definitely useful for table creation.
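For completeness, here is roughly what the noreply approach could look like end to end in the Python driver, followed by index_wait to block until the indexes are actually usable. This is a minimal sketch; the connection details are illustrative and the driver import varies between versions:

import rethinkdb as r  # older driver versions; newer ones use `from rethinkdb import RethinkDB; r = RethinkDB()`

conn = r.connect('localhost', 28015, db='test')  # illustrative connection details

# Kick off both index builds without waiting for either to finish.
r.table('user').index_create('name').run(conn, noreply=True)
r.table('user').index_create('email').run(conn, noreply=True)

# Later, block until every index on the table is ready to be queried.
r.table('user').index_wait().run(conn)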

Related

How do you perform hitless reindexing in Elasticsearch that avoids races and keeps consistency?

I'm implementing Elasticsearch for a blog where documents can be updated.
I need to perform hitless reindexing in Elasticsearch that avoids races and keeps consistency. (By consistency, I mean that if the application does a write followed by a query, the query should reflect the change even during reindexing.)
The best advice I've been able to find is to use aliases to atomically switch which index the application is using. While a reindexing operation is running, the application writes to both the old index (via a write_alias) and the new index (via a special write_next_version alias), and reads from the old index (via a read_alias). Any races between the concurrent writes from the reindex and from the application are resolved by the document version numbers, as long as the application writes to the old index first and the new index second. When the reindexing is done, you atomically switch the application's read and write aliases to the new index and delete the write_next_version alias.
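As a rough sketch of that final switch-over step using the official Python client (not a drop-in solution; the index and alias names here are illustrative, and older client versions take body= while newer ones accept actions= directly):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Atomically repoint the read/write aliases from the old index to the new one
# and drop the temporary write_next_version alias in the same request.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "blog_v1", "alias": "read_alias"}},
        {"remove": {"index": "blog_v1", "alias": "write_alias"}},
        {"remove": {"index": "blog_v2", "alias": "write_next_version"}},
        {"add": {"index": "blog_v2", "alias": "read_alias"}},
        {"add": {"index": "blog_v2", "alias": "write_alias"}},
    ]
})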
However, there are still races and performance issues.
My application doesn't know a reindex is occurring; the reindex and the alias switching are a separate, long-running process. I could use a HEAD request to find out whether the special write_next_version alias exists and only perform the second write if it does, but that's an extra round trip to the ES servers. There's also still a race between the HEAD request and the step of the reindex process described above that deletes the write_next_version alias. Alternatively, I could just do both writes every time and silently swallow the error for the usually non-existent write_next_version alias. I'd do this via the bulk API if my documents were small, but they are blog entries and can be fairly large.
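For reference, the HEAD check described above maps to something like this in the Python client (again just a sketch; the alias name is the one assumed above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# HEAD /_alias/write_next_version under the hood; returns True if the alias exists.
if es.indices.exists_alias(name="write_next_version"):
    # ... perform the second write against the write_next_version alias ...
    pass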
So should I just write twice every time and swallow the error on the second write? Should I use a HEAD request to determine whether the application needs to perform the second write for consistency? Or is there some better way of doing this?
The general outline of this strategy is shown in this article. This older article also shows how to do it, but it doesn't use aliases, which is not acceptable. There is a related issue on the Elasticsearch GitHub, but it does not address the problem that two writes need to be done in order to maintain consistency, nor the races or performance issues. (They closed the issue...)

Neo4j bulk import and indexing

I'm importing a big dataset (well over 10m nodes) into Neo4j using the neo4j-import tool. After importing my data I run several queries over it. One of those queries performs very badly. I've optimized it (profiling, using relationship types, splitting it up for multi-core support, and so on) as much as I could.
Still it takes too long, so my idea was to tell Neo4j to start at a specific type of node by using the USING INDEX clause. I could then check how my db hits change and possibly make it work. Right now, though, my database doesn't have any indexes.
I had wanted to create indexes once I was done writing all the queries I need, but it seems I need to start using them already.
I'm wondering if I can create those indexes during the bulk import process. That seems to be a good solution to me. How would I do that?
Also, I wonder whether it's possible to write a statement that would create an index for an attribute that exists on every single one of my nodes (let's call it "type").
CREATE INDEX ON :(type);
doesn't work (the label is missing, but I want to omit it).
Indexes are on labels + properties. You need indexes right after your import and before you start trying to optimize queries. Anything your query will use to find a starting point should be indexed (user_id, object_id, etc.), and probably any dates or properties used for range queries as well (modified_on, weight, etc.).
CREATE INDEX ON :Label(property)
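If you want to script the index creation right after neo4j-import finishes, a minimal sketch using the official Python driver might look like this (labels, properties, and connection details are illustrative; the CREATE INDEX ON syntax matches older Neo4j versions such as 3.x):

from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Create one index per label/property combination your queries use as a
# starting point; index population continues in the background after this returns.
with driver.session() as session:
    session.run("CREATE INDEX ON :User(user_id)")
    session.run("CREATE INDEX ON :Object(object_id)")

driver.close()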
Cypher queries are single-threaded, so I have no idea what you mean by multi-core support. What did you read about that, got a link? You can multi-thread Neo4j, but at this point you have to do it manually. See https://maxdemarzi.com/2017/01/06/multi-threading-a-traversal/
Most of the time, queries can be greatly optimized with an index or by expressing them differently. But sometimes you need to redo your model to fit the query. Take a look at https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/ for some hints.

ActiveRecord in batches? after_commit produces O(n) trouble

I'm looking for a good, idiomatic Rails pattern or gem to handle the problem of inefficient after_commit model callbacks. We want to stick with a callback to guarantee data integrity, but we'd like it to run once whether it's for one record or for a whole batch of records wrapped in a transaction.
Here's a use-case:
A Portfolio has many positions.
On Position there is an after_commit hook to re-calculate numbers in reference to its sibling positions in the portfolio.
That's fine for directly editing one position.
However...
The trouble is that we now have an importer for bringing in lots of positions spanning many portfolios in one big INSERT. Each invocation of this callback queries all siblings, and it's invoked once for each sibling, so reads are O(n**2) instead of O(n) and writes are O(n) where they should be O(1).
'Why not just put the callback on the parent portfolio?' Because the parent doesn't necessarily get touched during a relevant update. We can't risk the kind of inconsistent state that could result from leaving a gap like that.
Is there anything out there which can leverage the fact that we're committing all the records at once in a transaction? In principle it shouldn't be too hard to figure out which records changed.
A nice interface would be something like after_batch_commit which might provide a light object with all the changed data or at least the ids of affected rows.
There are lots of unrelated parts of our app that are asking for a solution like this.
One solution could be to insert them all in one SQL statement and then validate them afterwards.
Possible ways of inserting them in a single statement are suggested in this post:
INSERT multiple records using ruby on rails active record
Or you could even build the SQL to insert all the records in one trip to the database.
The code could look something like this:
max_id = Position.maximum(:id)
Position.insert_many(data) # not actual code
faulty_positions = Position.where("id > ?", max_id).reject(&:valid?)
remove_and_or_log_faulty_positions(faulty_positions)
This way you only have to touch the database three times per N entries in your data. If the data sets are large, it might be good to do it in batches, as you mention.

performance issues while processing 2 tables in lockstep based on orderedBy from-to

Title is probably not very clear, so let me explain.
I want to perform an in-process join (Node.js) on 2 tables*, Session and SessionAction (1-N).
Since these tables are rather big (millions of records each), my idea was to get slices based on an orderBy on sessionId (which they both share) and sort of lock-step walk through both tables in batches.
This however proves to be awfully slow. I'm using pseudocode like the following on both tables to get the batches:
table('x').orderBy({index: "sessionId"}).filter(row.sessionId > start && row.sessionId < y)
It seems that even though I'm essentially filtering on the attribute sessionId, which has an index, the query planner is not smart enough to see this, and every query does a complete table scan to do the orderBy before filtering afterwards (or so it seems).
Of course, this is incredibly wasteful, but I don't see another option. E.g.:
Ordering after a filter is not supported by RethinkDB.
Getting a slice of the ordered table doesn't work either, since slice enumeration (i.e. the xth to the yth record), for lack of a better word, doesn't line up between the 2 tables.
Questions:
Is my approach indeed expected to be slow, due to having to do a table scan at each iteration/batch?
If so, how could I design my queries to get it working faster?
*) It's too involved to do using RethinkDB ReQL alone.
filter is never indexed in RethinkDB. (In general, a particular command will only use a secondary index if you pass index as one of its optional arguments.) You can write that query like this to avoid scanning the whole table:
r.table('x').orderBy({index: 'sessionId'}).between(start, y, {index: 'sessionId'})
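A batched version of that in the Python driver might look roughly like this (connection details and batch bounds are illustrative; between with an index does a range seek instead of a table scan, so each batch only reads the rows in its sessionId range):

import rethinkdb as r  # driver import varies between versions

conn = r.connect('localhost', 28015, db='test')

start, end = 0, 10000  # illustrative sessionId bounds for one batch

# Pull the same sessionId range from both tables via the secondary index,
# so the two result sets can be joined in process in lockstep.
sessions = list(r.table('Session')
                 .between(start, end, index='sessionId')
                 .order_by(index='sessionId')
                 .run(conn))
actions = list(r.table('SessionAction')
                .between(start, end, index='sessionId')
                .order_by(index='sessionId')
                .run(conn))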

Postgres tsvector_update_trigger sometimes takes minutes

I have configured free-text search on a table in my Postgres database. Pretty simple stuff, with firstname, lastname and email. This works well and is fast.
I do, however, sometimes experience long delays when inserting a new entry into the table: the insert keeps running for minutes and also generates huge WAL files. (We use the WAL files for replication.)
Is there anything I need to be aware of with my free-text index? Like Postgres maybe randomly restructuring it for performance reasons? My index is currently around 400 MB in size.
Thanks in advance!
Christian
Given the size of the WAL files, I suspect you are right that an index update/rebalancing is causing the issue. However, I have to wonder what else is going on.
I would recommend against storing tsvectors in separate columns. A better way is to build an expression index on to_tsvector()'s output; you can have multiple indexes for multiple languages if you need them. So instead of a trigger that takes, say, a field called description and stores the tsvector in desc_tsvector, I would recommend just doing the following (the two-argument form of to_tsvector is needed because index expressions must be immutable, and GIN is the usual index type for full-text search):
CREATE INDEX mytable_description_tsvector_idx ON mytable USING gin (to_tsvector('english', description));
Now, if you need a consistent search interface across a whole table, there are more elegant ways of doing this using "table methods."
In general the functional index approach has fewer issues associated with it than anything else.
A second thing you should be aware of is partial indexes. If you need to, you can index only the records of interest. For example, if most of my queries only check recent entries, I can index just those rows (the predicate of a partial index must also be immutable, so use a literal cutoff date rather than now()):
CREATE INDEX mytable_description_tsvector_recent_idx ON mytable USING gin (to_tsvector('english', description))
WHERE created_at > '2017-01-01';
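One more thing worth spelling out: for the planner to use an expression index, the query has to repeat the exact same expression, including the text search configuration. A hedged sketch with psycopg2 (the connection string and search term are illustrative):

import psycopg2  # standard PostgreSQL driver for Python

conn = psycopg2.connect("dbname=mydb")  # illustrative connection string
cur = conn.cursor()

# The to_tsvector('english', description) expression matches the index
# definition above, so the GIN index can be used for this search.
cur.execute(
    """
    SELECT *
      FROM mytable
     WHERE to_tsvector('english', description) @@ plainto_tsquery('english', %s)
    """,
    ("christian",),
)
rows = cur.fetchall()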
