Is there a way to use OT or CRDT (or something similar) for relational data? - algorithm

I'm working on a syncing process between offline-first databases and a central server. As a simple example, there are items and departments and an item belongs to a department. Each client can modify any of the entities.
I know for text documents there are algorithms/technology for handling conflicts like OT and CRDT:
https://en.wikipedia.org/wiki/Operational_transformation
https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
Differences between OT and CRDT
But I'm wondering if you can use these for more complex structures like you might have in a database. In my case, let's keep it simple and say you have:
items - id, name, department_id
departments - id, name
Changes in properties like "name" in individual elements are manageable (maybe using a version, delta, timestamp). Deletes are a little trickier, but you might just discard the name change because the element is deleted.
And it's even more tricky when there are relations. What happens when one client moves items to a department and the other deletes the department?
At a certain level, some of these conflicts are similar to those that could happen in text using OT. Someone changes a title and someone else deletes it. Or someone adds an element to a bulleted list and someone else moves the list to a different part of the document.
My question is, can you use OT or CRDT for relational data and if so, how would you do it? If not, are there other similar algorithms or techniques to handle conflicts in relational data?
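One way to picture the per-field idea from the question (a version or timestamp per property, with deletes kept as markers) is a last-writer-wins register per column plus a tombstone per row. The sketch below is only an illustration under stated assumptions: `Row`, `set_field`, `resolve_department`, and the `(logical_time, client_id)` stamp are hypothetical names, not the API of any CRDT library, and reading a reference to a deleted department as `None` is just one possible policy.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

# Assumed stamp format: (logical_time, client_id). Tuple comparison breaks ties.
Stamp = Tuple[int, str]


@dataclass
class Row:
    """One replicated row; every column is a last-writer-wins register."""
    fields: Dict[str, Tuple[Stamp, Any]] = field(default_factory=dict)
    deleted_at: Optional[Stamp] = None  # tombstone; rows are never physically removed

    def set_field(self, name: str, value: Any, stamp: Stamp) -> None:
        current = self.fields.get(name)
        if current is None or stamp > current[0]:
            self.fields[name] = (stamp, value)

    def delete(self, stamp: Stamp) -> None:
        if self.deleted_at is None or stamp > self.deleted_at:
            self.deleted_at = stamp

    def merge(self, other: "Row") -> None:
        """Fold another replica's state into this one (order does not matter)."""
        for name, (stamp, value) in other.fields.items():
            self.set_field(name, value, stamp)
        if other.deleted_at is not None:
            self.delete(other.deleted_at)

    def value(self, name: str) -> Any:
        entry = self.fields.get(name)
        return None if entry is None else entry[1]

    @property
    def is_deleted(self) -> bool:
        return self.deleted_at is not None


def resolve_department(item: Row, departments: Dict[int, Row]) -> Optional[int]:
    """Policy assumption: a reference to a missing or deleted department reads as None."""
    dept_id = item.value("department_id")
    dept = departments.get(dept_id)
    if dept is None or dept.is_deleted:
        return None
    return dept_id


# Client A moves the item to department 2 while client B deletes department 2.
item_a, item_b = Row(), Row()
dept2_a, dept2_b = Row(), Row()
for replica in (item_a, item_b):
    replica.set_field("department_id", 1, (1, "server"))
for replica in (dept2_a, dept2_b):
    replica.set_field("name", "Tools", (1, "server"))

item_a.set_field("department_id", 2, (2, "A"))
dept2_b.delete((2, "B"))

# Exchange state in both directions; both replicas end up identical.
item_a.merge(item_b)
item_b.merge(item_a)
dept2_a.merge(dept2_b)
dept2_b.merge(dept2_a)
```

Both replicas converge on an item that still points at department 2 and a department 2 that is tombstoned. What that dangling reference should mean (hide the item, fall back to a default department, or revive the department) is an application decision that the merge itself does not settle.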

Related

Data-structure to store database record

I want to store employee records. I don't want to use any external libraries or framework. I am trying to build the data structure from scratch.
There will be three fields,
EmployeeName
Age
Salary
We also want to query like,
Get all the salary where EmployeeName = "Bill"
Get all the EmployeeName where salary > 2000
Get all the Salary where age='50'
I am open to using any language but not any built-in package. What is the recommended data structure to achieve this?
I assume that the purpose of this exercise is self-education.
If so, "Where to begin reading SQLite source code?" is a great place to start reading to understand how this kind of software can be built.
If you really want to roll your own, I would suggest storing your data in an array of structs/objects/dictionaries (what they are called will depend on your language), hidden behind an object so that your insert/update/delete methods on the table go through well-defined access functions. Your operations can be implemented inefficiently with grep, filter, etc depending on your language. In addition to the obvious fields, include deleted as a field. That way you can just update that to delete a record, rather than try to modify the table.
To make them more efficient, read through https://cstack.github.io/db_tutorial/parts/part7.html for how to write a b-tree. Then create a b-tree mapping EmployeeName to the list of indexes of records with that name, ditto for age and salary. Now modify the access methods to update the indexes for those fields when you modify the table. Your searches can now go through the b-tree to find the indexes of the records that you want, and then you can look in the table for them.
This is massively simplified compared to what a database gives you, but you're on your way to understanding how databases work. Both in terms of why they scale, and also why they aren't magically fast.
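As a rough illustration of the approach described above, here is a minimal sketch of a table hidden behind access methods, with a `deleted` flag and one index per field. It assumes a sorted list maintained with Python's `bisect` module as a stand-in for the b-tree (same lookup idea, much simpler to write), and `EmployeeTable` and its method names are made up for this example.

```python
import bisect


class EmployeeTable:
    """Records in an append-only list, plus one sorted index per field."""

    def __init__(self):
        self._rows = []        # list of dict records; position = row index
        self._by_name = []     # sorted (name, row_index) pairs
        self._by_age = []      # sorted (age, row_index) pairs
        self._by_salary = []   # sorted (salary, row_index) pairs

    def insert(self, name, age, salary):
        idx = len(self._rows)
        self._rows.append(
            {"name": name, "age": age, "salary": salary, "deleted": False}
        )
        # Keep every index in step with the table.
        bisect.insort(self._by_name, (name, idx))
        bisect.insort(self._by_age, (age, idx))
        bisect.insort(self._by_salary, (salary, idx))
        return idx

    def delete(self, idx):
        # Soft delete: flip the flag and leave the list and indexes untouched.
        self._rows[idx]["deleted"] = True

    def _live(self, row_indexes):
        return [self._rows[i] for i in row_indexes if not self._rows[i]["deleted"]]

    def _equal(self, index, key):
        lo = bisect.bisect_left(index, (key, -1))
        hi = bisect.bisect_right(index, (key, float("inf")))
        return self._live(i for _, i in index[lo:hi])

    def where_name_equals(self, name):
        return self._equal(self._by_name, name)

    def where_age_equals(self, age):
        return self._equal(self._by_age, age)

    def where_salary_greater_than(self, threshold):
        # Skip every entry whose salary is <= threshold.
        lo = bisect.bisect_right(self._by_salary, (threshold, float("inf")))
        return self._live(i for _, i in self._by_salary[lo:])


table = EmployeeTable()
table.insert("Bill", 50, 3000)
table.insert("Ann", 41, 1800)
bills_salaries = [row["salary"] for row in table.where_name_equals("Bill")]
well_paid = [row["name"] for row in table.where_salary_greater_than(2000)]
```

Updates are left out to keep the sketch short; they would need to remove the old index entry and insert the new one, which is exactly the bookkeeping the answer refers to.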

What would be the most appropriate data structure given these requirements?

We are building a Search API in our company for some of our entities (events, leagues and sports), each of which has a name property, and we are having difficulty implementing the business requirements.
TL;DR: What data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
(1) The data structure needs to be sorted so that the following requirements are easier to implement; insertion must therefore not break this property.
(2) The data structure needs to hold information about its entities: the node key (the entity's name property) will be used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
(3) The data structure needs to support deletion by id. Id is also a property of all entities.
(4) It needs to support index search (up to 3 characters) so if someone searches for "aaa" every node with key between "aaaa.." and "aaaz" should appear. (ex. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz", result should be "aaa", "aaab", "aaaab").
(5) We need to search by localized node key.
What we have done so far?
We started our first iteration using built-in red-black tree (SortedSet in C#) and for nodes we had structure that holds the name property of the entity and all related events to that name property. And with one helper method we satisfied business requirements (1), (2) and (4).
As our second iteration we had to support deletion, so we created a map (Dictionary) of entity ids to references to the entity objects put into the SortedSet. We do that because our request for deletion is only by id and we cannot recreate an entity from its id, so at addition we need to create such a map. (Maybe augmentation can help?) With this we secured requirement (3).
Now we need to support (5) however, with every iteration (business requirement we receive) it is getting harder and harder to implement and I almost feel like we need to change our data structure in order to address business criteria better.
What's the problem with the localization?
We can create new SortedSet and re-use the implementation, but this comes with huge trade off. Let me elaborate.
We have 100 of clients, each of which has like 7-8 supported languages. Languages in our system are unique per client, so translations for one customer do not interfere with another (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that we have base languages (global for every client), which are basically default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (let's say English) is the same as the base one. Having said all of that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which in turn introduces massive amounts of duplication.
What I have thought so far?
I am not an expert in data structures myself, but I really want to make this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary tree (could be AVL, Red-Black, 2-3-4 etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This will hopefully solve a lot of the issues and workarounds we had to make so far and, as I said, address future requirements better so implementation is faster and more accurate. However, I am unable to map these business requirements to a data structure in the time frame I have, so without further ado, do you guys have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, and the value is the product data. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
string ProductName;
string ProductId; // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could have a "deleted" flag for those index entries, and then rebuild the structure periodically to remove the deleted items. That way, a deletion just requires a sequential scan. That's less than ideal, but if insertions and deletions are infrequent, quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
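The layout suggested above can be sketched roughly as follows. This is a simplified illustration, not the answerer's implementation: `EntitySearch` and its methods are hypothetical names, each id is assumed to be an integer with a single name, and a sorted list with `bisect` stands in for a tree. Because the primary dictionary remembers the name stored for each id, flagging the index entry on deletion does not need a scan in this variant.

```python
import bisect


class EntitySearch:
    def __init__(self):
        self._by_id = {}    # primary store: entity id -> name
        self._names = []    # secondary index: sorted (lowercased name, id) pairs
        self._dead = set()  # index entries flagged as deleted, removed lazily

    def add(self, entity_id, name):
        if entity_id in self._by_id:
            self.remove(entity_id)  # replace any earlier name for this id
        self._by_id[entity_id] = name
        entry = (name.lower(), entity_id)
        if entry in self._dead:
            self._dead.remove(entry)  # entry is still in the index; revive it
        else:
            bisect.insort(self._names, entry)

    def remove(self, entity_id):
        name = self._by_id.pop(entity_id)
        self._dead.add((name.lower(), entity_id))  # flag only; no index rebuild

    def search_prefix(self, prefix):
        """Return ids of live entities whose name starts with prefix."""
        prefix = prefix.lower()
        # (prefix,) sorts before every (prefix + anything, id) pair.
        position = bisect.bisect_left(self._names, (prefix,))
        ids = []
        while position < len(self._names):
            name, entity_id = self._names[position]
            if not name.startswith(prefix):
                break
            if (name, entity_id) not in self._dead:
                ids.append(entity_id)
            position += 1
        return ids

    def compact(self):
        """Periodic rebuild that physically drops flagged entries."""
        self._names = [e for e in self._names if e not in self._dead]
        self._dead.clear()


search = EntitySearch()
search.add(1, "Football")
search.add(2, "Footvolley")
search.add(3, "Futsal")
before = search.search_prefix("foo")
search.remove(1)
after = search.search_prefix("foo")
```

Per-client or per-locale names would mean several index entries per id; the same flag-and-compact idea applies, with the primary record listing every name that was indexed for that id.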

SQLite database design for music chart tracker

I've been putting together a little SQLite database to track the top 100 songs from the iTunes RSS feed. I've built the script in Bash to do all the hard work and it's finally working, but I'm not sure my database structure is correct. I'm only learning SQL as I go, so I'm looking for feedback on the best way to proceed; I don't want to dig myself into a hole when it comes to building the queries to retrieve data later!
I have 3 tables like so;
artists_table
artist_id - PK
artist_name
songs_table
song_id - PK
artist_id - FK (from the artists table)
charts_table
chart_id - PK
song_id - FK (from the songs table)
position - (chart position 1-100)
date - (date of chart position xxxx-xx-xx)
The artists and songs table seem good to me, got foreign key constraint working...etc but I'm not sure about the charts table, anything obviously wrong with this structure?
I want to track songs/artists/positions over time so I can generate some stats...etc
Thanks,
Initial Response
I ask you about the data, in order to answer your Question, but you keep telling me about the process. No doubt, that is very important to you. And now you wish to ensure that the Record Filing System is correct.
Personally, I never write a line of code until I have the database designed. Partly because I hate to rewrite code (and I love to code). You have the sequence reversed, an unfortunate trend these days. Which means, whatever I give you, you will have to rewrite large chunks of your code.
(b.1) How exactly does it check if the artist[song] already exists ?
(b.2) How do you know that there is NOT more than one occ of a specific artist/song on file ?
Right now, given the details in your Question, let's say that you have incoming, that Pussycat Dolls place 66 on the MTV chart today:
INSERT artist VALUES ( "Pussycat Dolls" ) -- succeeds, intended
INSERT artist VALUES ( "Pussycat Dolls" ) -- succeeds, unintended
INSERT artist VALUES ( "Pussycat Dolls" ) -- succeeds, unintended
Exactly which Pussycat Dolls record placed 66th today ?
When your RFS grows, and you have more fields in artist, eg. birth_date, which of the three records would you like to update ?
Ditto for Song.
How is Chart identified, is it something like US Top 40 ?
(b.1) How exactly does it check if the artist[song] already exists ?
When you execute code, it runs inside the sqLite program. What is the exact SQL string that you pass it ? Let's say you do this:
SELECT $artist_id = artist_id
FROM artist
WHERE artist_name = $artist_name
IF $artist_id = NULL
INSERT artist VALUES ( $artist_name )
Then you are going to have a few surprises when the system goes "live". Hopefully this interaction will eliminate them. Right now you have a few hundred artists.
When you have a few thousand artists, the system will slow down to a snail's pace.
When things go awry, you will have duplicate artists, songs, charts.
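The duplicate problem described above can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the table names `artist_loose` and `artist_strict` are invented for the comparison, and the surrogate `artist_id` from the question is kept only to stay close to the asker's layout. The point is that a declared UNIQUE constraint lets the database itself refuse the second and third insert, and the index behind that constraint also serves the lookup by name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No constraint on the name: nothing stops the same artist being filed repeatedly.
conn.execute(
    "CREATE TABLE artist_loose ("
    " artist_id INTEGER PRIMARY KEY,"
    " artist_name TEXT)"
)

# Declared uniqueness: the database enforces it, and indexes the name for lookups.
conn.execute(
    "CREATE TABLE artist_strict ("
    " artist_id INTEGER PRIMARY KEY,"
    " artist_name TEXT NOT NULL UNIQUE)"
)

for _ in range(3):
    conn.execute(
        "INSERT INTO artist_loose (artist_name) VALUES (?)",
        ("Pussycat Dolls",),
    )
    # OR IGNORE skips the row when the UNIQUE constraint would be violated.
    conn.execute(
        "INSERT OR IGNORE INTO artist_strict (artist_name) VALUES (?)",
        ("Pussycat Dolls",),
    )

loose_count = conn.execute("SELECT COUNT(*) FROM artist_loose").fetchone()[0]
strict_count = conn.execute("SELECT COUNT(*) FROM artist_strict").fetchone()[0]
```

With the constraint in place, the separate SELECT-then-INSERT check in application code is no longer what guarantees uniqueness.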
Record Filing System
Right now, you have a pre-1970's ISAM Record Filing System, with no Relational integrity, power, or speed.
If you would like to understand more about the dangers of an RFS, in today's Relational context, please read this Answer.
Relational Database
As I understand it, you want the integrity, power, and speed of a Relational Database. Here is what you are heading towards. Obviously, it is incomplete, unconfirmed, there are many details missing, many questions remain open. But we have to model the data, only as data (as opposed to what you are going to do with it, the process), and nothing but the data.
This approach will ensure many things:
as the data grows and is added to (in terms of structure, not population), the existing data and code will not change
you will have data and referential integrity
you can obtain each of your stats via a single SELECT command.
you can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form.
A database is a collection of facts about the real world, limited to the subject area of concern. Thus far we don't have facts, we have a recording of an incoming RSS stream. And the recording has no integrity, there is nothing that your code can rely on. This is heading in the direction of facts:
First Draft Music Chart TRD (Obsolete due to progression, see below.)
Response to Comments 1
Currently, I am only tracking one chart, but I see in your model that it also has the ability to track several, that is nice!
Not really. It is a side-effect of Doing Things Properly. The issue here is one of Identification. A Chart Position is not identified by RSS Feed ID, or chart_table.id, plus a PositionNo plus a DateTime. No. A Chart Position is identified as US Top 100/27 Apr 15/1… The side effect is that ChartName is part of the Identifier, and that allows multiple Charts, with no additional coding.
In these dark days of IT, people often write systems for one Country, and implement a StateCode all over the place. And then experience massive problems when they open up to an international customer base. The point is, there is no such thing as a State that does not have a Country, a State exists only in the context of a Country. So the Identifier for State must include a Country Identifier, it is (CountryCode, StateCode). Both Australia and Canada have NT for a StateCode.
If I can explain how I store the data from the rss feed, it might clear things up somewhat.
No, please. This is about the data, and only the data. Please review my previous comments on that issue, and the benefits.
I am away from my main computer at the moment, but I will respond within the next couple of hours if thats ok.
No worries. I will get to it tomorrow.
Your model does make sense to me though,
That is because you know the data values intimately, but you do not understand the data, and when someone lays it out for you, correctly, you experience pleasurable little twitches of recognition.
I don't mind having to recode everything, its a learning curve!
That's because you put the cart before the horse, and coded against data laid out in a spreadsheet, instead of designing the database first and coding against that second.
If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
Response to Comments 2
Just one more quick question.
Fire away, until you are completely satisfied.
In the diagram, would there be any disadvantage to putting the artist table above the song table and making the song table a child of the parent artist instead? As artists can have many songs, but each song can only have 1 artist. Is there any need for the additional table to contain just the artistPK and songPK. Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist?
Notice your attachment to the way you had it organised. I repeat:
A database is a collection of facts about the real world, limited to the subject area of concern.
Facts are logical, not physical. When those facts are organised correctly (Normalised, designed):
You can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form.
When they aren't, you can't. All SQL (not only reports that are envisioned) against the data is limited to the limitations in the model, which boils down to one thing: discrete facts being recorded in logical form, or not.
With the TRD we have progressed to recording facts about the real world, limited only by the scope of the app, and not by the non-discretion of facts.
Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist?
In your working context, at this moment, that is true. But that is not true in the real world that you are recording. If the app or your scope changes, you will have to change great slabs of the db and the app. If you record the facts correctly, as they exist, not as limited to your current app scope, no such change will be necessary when the app or your scope changes (sure, you will have to add objects and code, but not modify existing objects and code).
In the real world, Song and Artist are discrete facts, each can exist independent of the other. Your proposition is false.
Ave Maria existed for 16 centuries before Karen Carpenter recorded it.
And you already understand and accept that an Artist exists without a Song.
Is there any need for the additional table to contain just the artistPK and songPK.
It isn't an "additional table to contain just the artistPK and songPK", it is recording a discrete fact (separate to the independent existence of Artist and Song), that a specific Artist recorded a specific Song. That is the fact that you will count on in the ChartDatePosition.
Your proposition places Song as dependent on, subordinate to, Artist, and that is simply not true. Any and all stats (dreamed of or not) that are based on Song will have to navigate Artist::ArtistSong, then sort or ORDER BY, etc.
artists can have many songs, but each song can only have 1 artist.
That is half-true (true in your current working context, but not true in the real world). The truth is:
Each Artist is independent
Each Song is independent
Each Artist recorded 1-to-n Songs (via ArtistSong)
Each Song was recorded by 1-to-n Artists (via ArtistSong)
For understanding, changing your words above to form correct propositions (as opposed to stating technically correct Predicates):
Artists can have many RecordedSongs
Each RecordedSong can only have 1 Artist
Each RecordedSong can only have 1 Song
So yes, there are disadvantages, significant ones.
Which is why I state, you must divorce yourself from the app, the usage, and model the data, as data, and nothing but data.
Solution 2
I have updated the TRD.
Second Draft Music Chart TRD
Courier means example data; blue indicates a Key (Primary is always first); pipe indicates column separation; slash indicates Alternate Key (only the columns that are not in the PK are shown); green indicates non-key.
I am now giving you the Predicates. These are very important, for many reasons. The main reason here, is that it disambiguates the issues we are discussing.
If you would like more information on Predicates, visit this Answer, scroll down (way down!) to Predicate, and read that section. Also evaluate that TRD and those Predicates against it.
The index on ChartDateSong needs explanation. At first I assumed:
PK ( Chart, Date, Rank )
But then for Integrity purposes, as well as search, we need:
AK ( Chart, Date, ArtistId, SongId )
Which is a much better PK. So I switched them. We do need both. (I don't know about NONsqLite, if it has clustered indices, the AK, not the PK should be clustered.)
PK ( Chart, Date, ArtistId, SongId )
AK ( Chart, Date, Rank )
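The pair of keys above can be tried out in SQLite through Python's `sqlite3` module. The sketch below is an assumption-laden illustration, not the TRD itself: the table and column names (`chart_date_song`, `chart_date`, `rank_no`) and the sample values are invented, and the foreign keys to Artist and Song are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chart_date_song (
        chart      TEXT    NOT NULL,
        chart_date TEXT    NOT NULL,
        artist_id  INTEGER NOT NULL,
        song_id    INTEGER NOT NULL,
        rank_no    INTEGER NOT NULL,
        PRIMARY KEY (chart, chart_date, artist_id, song_id),  -- PK
        UNIQUE (chart, chart_date, rank_no)                   -- AK
    )
    """
)


def try_insert(row):
    """Return True if the row is accepted, False if a key rejects it."""
    try:
        conn.execute("INSERT INTO chart_date_song VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(("US Top 100", "2015-04-27", 1, 10, 1)),  # accepted
    try_insert(("US Top 100", "2015-04-27", 1, 10, 2)),  # PK: same song twice on one chart date
    try_insert(("US Top 100", "2015-04-27", 2, 20, 1)),  # AK: rank 1 is already taken
    try_insert(("US Top 100", "2015-04-28", 2, 20, 1)),  # accepted: different date
]
```

The PK stops the same ArtistSong from appearing twice on one chart date, and the AK stops two entries from sharing a rank, which is why both keys are needed.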
Response to Comments 3
What about the scenario when a song enters the charts with the same song_name as a record in the song_table but is completely unrelated (not a cover, completely original, but just happens to share the same name)
In civilised countries that is called fraud, obtaining benefit by deception, but I will try to think in devilish terms for a moment and answer the question.
Well, if it happens, then you have to cater for it. How does the feed inform you of such an event ? I trust it doesn't. So then your Song Identifier is still the Name.
and instead of a unique song record being created, the existing song_id is added to the artistssongs_table with the artist id, wouldn't this be a problem?
We don't know any better, so it is not a problem. No one watching that feed knows any better either. If and when you receive data informing you of that issue, through whatever channel, and you can specify it, you can change it.
Normally we have an app that allows us to navigate the hierarchies, and to change them, eg. a ReferenceMaintenance app, with an Explorer-type window on the left, and combo dialogues (list of occs on top, plus detail of one occ on the bottom) on the right.
Until then, it is not a form of corruption, because the constraint that prevents such corruption is undefined. You can't be held guilty of breaking a law that hasn't been written yet. Except in rogue states.
Although a song can have the same name, it doesn't necessarily mean it's the same record.
Yes.
Wouldn't it be better to differentiate a song by the artist?
They are differentiated by Artist.
You do appreciate that the fact of a Song, and the fact of an Artist playing a song, are two discrete facts, yes ? Please question any Predicates that do not make perfect sense, those are the propositions that the database supports.
Ave Maria exists as an independent fact, in Song
Karen Carpenter, Celine Dion, and Yours Truly exist as three independent facts, in Artist
Karen Carpenter-Ave Maria, Celine Dion-Ave Maria, and Yours Truly-Ave Maria exist as three discrete facts in ArtistSong.
That is seven separate facts, about one Song, about three Artists.
Response to Comments 4
I do understand it now. The artistsong_table is where the 2 items "meet" and a relationship actually exists and is unique.
Yes. I just wouldn't state it in that way. The term Fact has a technically precise meaning, over and above the English meaning.
A database is a collection of facts about the real world, limited to the subject area of concern.
Perhaps read my Response 3 again, with that understanding of Fact in mind.
Each ArtistSong row is a Fact. That depends on the Fact of an Artist, and the Fact of a Song. It establishes the Fact that that Artist recorded that Song. And that ArtistSong Fact is one that other Facts, lower in the hierarchy, will depend upon.
"Relationship ... actually". I think you mean "instance". The relationship exists between the tables, because I drew a line, and you will implement a Foreign Key Constraint. Perhaps think of Fact as an "instance".
Just to make sure I understand the idea correctly, if I were to add "Genre" into the mix, would I be correct in thinking that a new 'independent' table genre_table would be created and the artistsong_table would inherit its PK as an FK?
Yes. It is a classic Reference or Lookup table, the Relationship will be Non-identifying. I don't know enough about the music brothelry to make any declarations, but as I understand it, Genre applies to a Song; an Artist; and an ArtistSong (they can play a Song in a Genre that is different to the Song.Genre). You have given me one, so I will model that.
The consequence of that is, when you are inserting rows in ArtistSong, you will have to have the Genre. If that is in the feed, well and good, if not, you have a processing issue to deal with. The simple method to overcome that is, implement a Genre "", which indicates to you that you need to determine it from other channels.
It is easy enough to add a classifier (eg. Genre) later, because it is a Non-identifying Relationship. But Identifying items are difficult to add later, because they force the Keys to change. Refer para 3 under my Response 1.
You are probably ready for a Data Model:
Third Draft Music Chart Data Model
It all depends on the relationships (one-to-one, one-to-many, many-to-many) your data is going to have.
The way you implemented your charts table indicates that:
Each chart has only/belongs to one song
A song can have many charts
It is a one-to-many relationship. And if that was what you intended then everything seems fine.
However:
If your charts can have many songs and a song will have only one chart (also a one-to-many relationship but reversed), the song_id column needs to be taken out of the charts table and the songs table needs a chart_id column instead.
If your charts can have many songs and your songs can have many charts as well (many-to-many relationship), then you need a "joint table" which could be something like this:
TABLE: charts_songs, COLUMNS: id, chart_id, song_id, position

Hbase Schema Nested Entity

Does anyone have an example on how to create an Hbase table with a nested entity?
Example
UserName (string)
SSN (string)
+ Books (collection)
The books collection would look like this for example
Books
isbn
title
etc...
I cannot find a single example of how to create a table like this. I see many people talk about it, and how it is a best practice in certain scenarios, but I cannot find an example of how to do it anywhere.
Thanks...
Nested entities aren't an official feature of HBase; it's just a way some people talk about one usage pattern. In this pattern, you use the fact that "columns" in HBase are really just a big map (a bunch of key/value pairs) to let you model a dimension of cardinality inside the row by adding one column per "row" of the nested entity.
Schema-wise, you don't need to do much on the table itself; when you create a table in HBase, you just specify the name & column family (and associated properties), like so (in hbase shell):
hbase:001:0> create 'UsersWithBooks', 'cf1'
Then, it's up to you what you put in it, column wise. You could insert values like:
hbase:002:0> put 'UsersWithBooks', 'userid1234', 'cf1:username', 'my username'
hbase:003:0> put 'UsersWithBooks', 'userid1234', 'cf1:ssn', 'my ssn'
hbase:004:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase:005:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'
The column names are totally up to you, and there's no limit to how many you can have (within reason: see the HBase Reference Guide for more on this). Of course, doing this, you have to do your own legwork re: putting in and getting out values (and you'd probably do it with the java client in a more sophisticated way than I'm doing with these shell commands, they're just for explanatory purposes). And while you can efficiently scan just a portion of the columns in a table by key (using a column pagination filter), you can't do much with the contents of the cells other than pull them and parse them elsewhere.
Why would you do this? Probably just if you wanted atomicity around all the nested rows for one parent row. It's not very common; your best bet is probably to start by modeling them as separate tables, and only move to this approach if you really understand the tradeoffs.
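To make the "own legwork" concrete, here is a small sketch of how the column qualifiers might be built and read back. It deliberately uses a plain Python dict as a stand-in for one HBase row (no HBase client is involved), it serializes each book as JSON rather than the XML-like strings in the shell example, and `put_book`, `get_books`, and `BOOK_PREFIX` are hypothetical helper names.

```python
import json

# Stand-in for a single HBase row: column qualifier -> cell value.
row = {
    "cf1:username": "my username",
    "cf1:ssn": "my ssn",
}

# Every nested book gets its own column, distinguished by this prefix.
BOOK_PREFIX = "cf1:book_id_"


def put_book(row, isbn, title):
    """Add (or overwrite) one nested book as a single column in the parent row."""
    row[BOOK_PREFIX + isbn] = json.dumps({"isbn": isbn, "title": title})


def get_books(row):
    """Collect and parse every column whose qualifier carries the book prefix."""
    return [
        json.loads(value)
        for qualifier, value in sorted(row.items())
        if qualifier.startswith(BOOK_PREFIX)
    ]


put_book(row, "12345", "mary had a little lamb")
put_book(row, "67890", "the importance of being earnest")
titles = [book["title"] for book in get_books(row)]
```

The parsing happens entirely on the client side, which mirrors the point above that HBase itself can only hand the cell contents back.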
There are some limitations to this. First, this technique only works to one level deep: your nested entities can’t themselves have nested entities. You can still have multiple different nested child entities in a single parent, and the column qualifier is their identifying attributes.
Second, it’s not as efficient to access an individual value stored as a nested column qualifier inside a row, as compared to accessing a row in another table, as you learned earlier in the chapter.
Still, there are compelling cases where this kind of schema design is appropriate. If the only way you get at the child entities is via the parent entity, and you’d like to have transactional protection around all children of a parent, this can be the right way to go.

Very slow search of a simple entity relationship

We use CRM 4.0 at our institution and have no plans to upgrade presently as we've spent the last year and a half customising and extending the CRM to work with our processes.
A tiny part of the model is a simple hierarchy: we have a group of learning rooms that has a one-to-many relationship with another entity that describes the courses available for that learning room.
Another entity has a list of all potential and enrolled students who have expressed an interest in whichever course.
That bit's all straightforward and works pretty well and is modelled into 3 custom entities.
Now, we've got an Admin application that reads the rooms and then wants to show the courses for that room, but only where there are enrolled students.
In SQL this is simplified to:
SELECT DISTINCT r.CourseName, r.OtherInformation
FROM Rooms r
INNER JOIN Students S
ON S.CourseId = r.CourseId
WHERE r.RoomId = #RoomId
And this indeed is very close to the eventual SQL that CRM generates.
We use a Crm QueryEntity, a Filter and a LinkEntity to represent this same structure.
The problem now is that the CRM normalizes a customized entity into a Base table, which has the standard CRM entity data that all entities share, and an ExtensionBase table, which has our customisations. To give flattened access to this, it creates a view that merges both tables.
This view is what is used by the Generated SQL.
Now the base tables have indices but the view doesn't.
The problem we have is that all we want to do is return Courses where the inner join is satisfied, it's enough to prove there are entries and CRM makes it SELECT DISTINCT, so we only get one item back for Room.
At first this worked perfectly well, but now we have thousands of queries, it takes well over 30 seconds and of course causes a timeout in anything but SMS.
I'm given to believe that we can create and alter indices on tables in CRM and that's not considered to be an unsupported modification; but what about Views ?
I know that if we alter an entity then its views are recreated, which would of course make us redo our indices when this happens.
Is there any way to hint to CRM4.0 that we want a specific index in place ?
Another source recommends that where you get problems like this, then it's best to bring data closer together, but this isn't something I'd feel comfortable in trying to engineer into our solution.
I had considered putting a new entity in that only has RoomId, CourseId and Enrolment Count in to it, but that smacks of being incredibly hacky too; After all, an index would resolve the need to duplicate this data and have some kind of trigger that updates the data after every student operation.
Lastly, whilst I know we're stuck on CRM4 at the moment, is this the kind of thing that we could expect to have resolved in CRM2011 ? It would certainly add more weight to the upgrading this 5 year old product argument.
Since views are "dynamic" (conceptually, their contents are generated on-the-fly from the base tables every time they are used), they typically can't be indexed. However, SQL Server does support something called an "indexed view". You need to create a unique clustered index on the view, and the query analyzer should be able to use it to speed up your join.
Someone asked a similar question here and I see no conclusive answer. The cited concerns from Microsoft are Referential Integrity (a non-issue here) and Upgrade complications. You mention the unsupported option of adding the view and managing it over upgrades and entity changes. That is an option, as unsupported and hackish as it is, it should work.
FetchXml does have aggregation, but the query execution plan still uses the views. Here is the SQL generated from a simple select count from incident:
'select
top 5000 COUNT(*) as "rowcount"
, MAX("__AggLimitExceededFlag__") as "__AggregateLimitExceeded__" from (select top 50001 case when ROW_NUMBER() over(order by (SELECT 1)) > 50000 then 1 else 0 end as "__AggLimitExceededFlag__" from Incident as "incident0" ...
I don't see a supported solution for your problem.
If you are building an outside admin app and you are hosting CRM 4 on-premise you could go directly to the database for your query bypassing the CRM API. Not supported but would allow you to solve the problem.
I'm going to add this as a potential answer although I don't believe it's a sustainable or indeed valid long-term solution.
After analysing the indexes that CRM had defined automatically, I realised that selecting more information in my query would be enough to fulfil the column requirements of an index, and now the query runs in less than a second.
