Can I sort a bunch of values without retaining the actual content of the strings? Two-key sort, one key on-premises, another in the cloud

What do I want to do
I want to sort a bunch of strings, simple enough.
What are my constraints
The original text that I want to sort is stored on-premises; the cloud has some other "columns" of data which are not on-premises, and for security reasons I cannot take the original text from on-premises to the cloud.
The real constraint is that I cannot have all the data in one place, which makes sorting and paging on values across the on-premises & cloud data difficult.
What I thought of (and where I need help)
Maybe I can take a hash, or extract certain data from the string in some other way, such that the original string cannot be reproduced (which takes care of the security concern) but the extracted value is still enough to sort on.
Example
on-premises data:
[
  {
    "id": 1,
    "name": "abcd"
  }
]
cloud data:
[
  {
    "id": 1,
    "price": "20"
  }
]
I need to sort on both price and name in the above example (imagine 100,000 rows of such data).

What you need to do is store pairs of each string and its corresponding id, e.g. in two lists/arrays (whatever your programming language of choice offers).
Then start sorting the strings, but each time you move a string, move its id the same way.
Alternatively, most programming languages offer constructs which let you make pairs; you then sort those pairs according to the strings, which automatically moves the ids around.
Both ways mean that after sorting, you can still find the id for each string, and with that id you can access the corresponding cloud data as usual.
As an example, the programming language C offers the compound data type construct
struct IdStringPair
{
    int id;
    char* string; /* actually just the address of where the full string is
                     stored, but basically what you probably want to use */
};
Hardly any programming language exists which does not offer something similar.
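For instance, here is a minimal sketch of sorting such pairs with the C standard library's qsort (the sample data is made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct IdStringPair
{
    int id;
    char* string;
};

/* Compare two pairs by their string; qsort moves the ids along for free. */
static int compareByString(const void* a, const void* b)
{
    const struct IdStringPair* pa = a;
    const struct IdStringPair* pb = b;
    return strcmp(pa->string, pb->string);
}

int main(void)
{
    struct IdStringPair pairs[] = { {1, "abcd"}, {2, "aaaa"}, {3, "zz"} };
    size_t count = sizeof pairs / sizeof pairs[0];

    qsort(pairs, count, sizeof pairs[0], compareByString);

    /* The ids now come out in string order; use each id to look up
       the corresponding cloud data as described above. */
    for (size_t i = 0; i < count; i++)
        printf("%d %s\n", pairs[i].id, pairs[i].string);
    return 0;
}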
If, conversely, the data to sort by is in the cloud, then sorting has to take place in the cloud, i.e. by something there that can execute the sorting algorithm. Make sure that you sort the id along with the key. Finding the non-cloud string is then the same as before: whatever you previously did to find the string for an id, do it with the id you got from the cloud-sorted data.
This is the same as the first situation/solution, just mirrored.
The core concept is to always sort the ids along with the key (and any other data), and thereby remove the need to have the data from the other side of the gap between cloud and premises. That applies to all variants of sorting separated data.

Related

GraphQL: idiomatic way to accept an ordered list of sort parameters?

I'm building a GraphQL API. I want to allow users to specify how records should be sorted, using multiple sorts.
What should be possible
If I expose the ability to sort monsters by name, birthdate, and id, users should be able to get the API to sort by any combination of those, including (in SQL terms):
ORDER BY name ASC
ORDER BY name ASC, birthdate DESC, id ASC
... etc
What won't work
A single map, like this:
sorts: [{name: DESC, id: ASC}]
...would not tell me the order in which the sorts should be applied (maps are unordered).
What works OK, but isn't ideal
Currently, I accept a list of maps, like this:
sorts: [{name: DESC}, {id: ASC}]
Each map represents an input object, which has fields like name and id, which are enums with possible values ASC and DESC. I expect only one field to be filled per input object. ~But I don't know a way to enforce that.~
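For reference, the schema behind this looks roughly as follows (a sketch; the type and field names are assumptions):

enum SortDirection {
  ASC
  DESC
}

input MonsterSort {
  name: SortDirection
  birthdate: SortDirection
  id: SortDirection
}

type Monster {
  id: ID!
  name: String
  birthdate: String
}

type Query {
  monsters(sorts: [MonsterSort!]): [Monster!]
}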
This is awkward because:
It would be easy for users to typo their sort parameters as a single map
I can't specify a default (like ASC for id) without having it added to every input object map
Is there a more idiomatic way of accepting an ordered list of sort parameters?
Update
I've now added a user-facing explanatory error when there is more than one key per map. With this change, I think this strategy is OK, but I'm still happy to hear of better ways in the future if they arise.
I think the short answer is "no".
Long answer: You can introspect your types at startup and generate the schema dynamically according to some made-up convention, using custom scalars to represent sorting directives, for example:
monsters(sortBy: [name___ASC, birthdate___DESC, id___ASC]) { name }
but this is also "just a convention". The "list of maps" ("array of objects") model that you listed as non-ideal might be the best option at this point:
# the query
monsters(
  sortBy: [
    {name: asc},
    {birthdate: desc_nulls_last},
    {id: asc}
  ]
) {
  name
}
BUT, irrespective of which way you choose, avoid the temptation to start hacking these things in manually - your server code will become convoluted by this cross-cutting concern, as will your schema.
Instead, I have seen some GraphQL-to-ORM-bridging libraries make use of Directives to control the runtime schema generation (one example of this). That should be much more viable than hand-carving stuff like this.
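As a rough illustration of that directive-driven approach (the directive name and types here are made up, not taken from any particular library):

# A hypothetical directive the schema generator would pick up, adding
# sort arguments for every field marked @sortable.
directive @sortable on FIELD_DEFINITION

type Monster {
  id: ID! @sortable
  name: String @sortable
  birthdate: String @sortable
}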

Redis - how to store my data?

On the Redis site, the "memory optimization" page says that small hashes use far less memory than a few separate keys, so it is better to store a small hash with a few fields instead of several keys. So I thought of making, for example, a users hash and storing the users in its fields as JSON-serialized data. But what if my hash is REALLY big, meaning it has a lot of fields?
Is it better to store the users as a single hash with a lot of fields, or as several small hashes?
I'm asking because the Redis site says that "small" hashes are better than several keys for storing a couple of values, but I don't know if that still applies to really big hashes.
I would say your best solution is creating a key per user, perhaps named by the user's id, and storing the JSON data under it.
We tried storing each user as one hash per user, with a field for each of the user's properties, but we found we never really used the fields individually and in most cases needed most of the data (HGETALL), so we switched to storing JSON - which also helps with preserving data types.
I'd need more detail on what you're storing and how you access it to give more concrete suggestions.
Let's say you have a user like this:
{"ID": "344", "Name": "Blah", "Occupation": "Engineer", "Telephone": [ "550-33...", ...] }
You would serialize the JSON and store it as what Redis calls a String, i.e. you would use the GET and SET commands.
e.g.
SET "user:344" "<SERIALIZED>"
Since "users" is one of your main objects, it is not a small hash.
The gist of the documentation is about hashes with a small number of elements. For example, let's say that in your whole system you have 10 colors, and you want to associate some data with each of them. Instead of doing:
color:blue -> DATA, color:white -> DATA
it is better to use a hash:
colors -> blue -> DATA
colors -> white -> DATA
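In actual Redis commands the difference looks roughly like this (key and value names are made up). Instead of separate top-level keys:
SET color:blue "DATA"
SET color:white "DATA"
you use one small hash, which Redis can keep in a compact encoding:
HSET colors blue "DATA"
HSET colors white "DATA"
HGET colors blue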

CouchDb filter and sort in one view

I'm new to CouchDB.
I have to filter records by date (the date must be between two values) and sort the data by name, date, etc. (it depends on the user's selection in the table).
In MySQL it looks like
SELECT * FROM table WHERE date > "2015-01-01" AND date < "2015-08-01" ORDER BY name/date/email ASC/DESC
I can't figure out if I can use one view for all these issues.
Here is my map example:
function(doc) {
  emit(
    [doc.date, doc.name, doc.email],
    {
      email: doc.email,
      name: doc.name,
      date: doc.date
    }
  );
}
I try to filter the data using startkey and endkey, but I'm not sure how to sort the data this way:
startkey=["2015-01-01"]&endkey=["2015-08-01"]
Can I use one view? Or do I have to create several views, with the key order depending on the current sort field: [doc.date, doc.name, doc.email], [doc.name, doc.date, doc.email], etc.?
Thanks for your help!
As Sebastian said you need to use a list function to do this in Couch.
If you think about it, this is what MySQL is doing. Its query optimizer will pick an index into your table, it will scan a range from that index, load what it needs into memory, and execute query logic.
In Couch the view is your B-tree index, and a list function can implement whatever logic you need. It can be used to spit out HTML instead of JSON, but it can also be used to filter/sort the output of your view, and still spit out JSON in the end. It might not scale very well to millions of documents, but MySQL might not either.
So your options are the ones Sebastian highlighted:
view sorts by date, query selects date range and list function loads everything into memory and sorts by email/etc.
views sort by email/etc, list function filters out everything outside the date range.
Which one you choose depends on your data and architecture.
With option 1 you may skip the list function entirely: get all the necessary data from the view in one go (with include_docs), and sort client side. This is how you'll typically use Couch.
If you need this done server side, you'll need your list function to load every matching document into an array, then sort it and JSON-serialize it. This obviously falls to pieces if there are so many matching documents that they don't fit into memory or take too long to sort.
Option 2 scans through pre-ordered documents and only sends those matching the dates. Done right, this avoids loading everything into memory. On the other hand, it might scan way too many documents, thrashing your disk I/O.
If the date range is "very discriminating" (few documents pass the test) option 1 works best; otherwise (most documents pass) option 2 can be better. Remember that in the time it takes to load a useless document from disk (option 2), you can sort tens of documents in memory, as long as they fit in memory (option 1). Also, the more indexes, the more disk space is used and the more writes are slowed down.
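As a rough sketch of option 1's list function (passing the sort field as a query parameter is an assumption):

// The view is keyed by date, so getRow() only yields the range selected
// via startkey/endkey; the surviving rows are sorted in memory.
function (head, req) {
  var rows = [];
  var row;
  while ((row = getRow())) {
    rows.push(row.value);
  }
  var field = req.query.sort || "name";
  rows.sort(function (a, b) {
    if (a[field] < b[field]) return -1;
    if (a[field] > b[field]) return 1;
    return 0;
  });
  send(JSON.stringify(rows));
}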
You COULD use a list function for that, in two ways:
1.) The Couch view is ordered by date and you sort by e-mail => but please be aware that you'd have to have ALL items in memory to do this sort by e-mail (i.e. you can do this only when your result set is small)
2.) The Couch view is ordered by e-mail and a list function drops everything outside the date range (you can only do that when the overall list is small - so this one is most probably bad)
possibly #1 can help you
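For completeness, a sketch of way 2 (the date query parameters are assumptions):

// The view is keyed by e-mail, so rows already arrive in sort order;
// stream out only those whose date falls inside the requested range.
function (head, req) {
  var row;
  var first = true;
  send("[");
  while ((row = getRow())) {
    var d = row.value.date;
    if (d > req.query.startdate && d < req.query.enddate) {
      send((first ? "" : ",") + JSON.stringify(row.value));
      first = false;
    }
  }
  send("]");
}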

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or having to create multiple CFs just to get basic data, I am curious to know whether it's a good idea to include the above "filters" as columns (compound column segments).
Example:
Row Key          | timeUUID:data | timeUUID:country | timeUUID:source
-----------------+---------------+------------------+-----------------
timeUUID:section | JSON Object   | USA              | example.com
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma: the columns. A compound column name with a timeUUID lets me sort & do a time-based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
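Putting the earlier tweaks together, a CQL sketch of the table (all names here are assumptions, since I haven't seen your schema):

-- Partition by (section, day) so multiple days can be grabbed with IN;
-- cluster by timeuuid descending to keep the most recent data hot.
CREATE TABLE logs (
    section    text,
    day        text,       -- e.g. '20150101' instead of a timeUUID
    event_time timeuuid,
    payload    text,       -- the full JSON object
    country    text,       -- denormalised copies of fields you query on
    source     text,
    PRIMARY KEY ((section, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Grab several days for one section in a single query:
SELECT * FROM logs WHERE section = 'foo' AND day IN ('20150101', '20150102');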

Query core data store based on a transient calculated value

I'm fairly new to the more complex parts of Core Data.
My application has a core data store with 15K rows. There is a single entity.
I need to display a subset of those rows in a table view filtered on a calculated search criteria, and for each row displayed add a value that I calculate in real time but don't store in the entity.
The calculation needs to use a couple of values supplied by the user.
A hypothetical example:
Entity: contains fields "id", "first", and "second"
User inputs: 10 and 20
Search / Filter Criteria: only display records where the entity field "id" is a prime number between the two supplied numbers. (I need to build some sort of complex predicate method here I assume?)
Display: all fields of all records that meet the criteria, along with a derived field (not in the Core Data entity) that is the sum of the "id" field and a random number, so each row in the table view would contain 4 fields:
"id", "first", "second", -calculated value-
From my reading / Googling it seems that a transient property might be the way to go, but I can't work out how to do this given that the search criteria and the resultant property need to calculate based on user input.
Could anyone give me any pointers that will help me implement this code? I'm pretty lost right now, and the examples I can find in books etc. don't match my particular needs well enough for me to adapt them as far as I can tell.
Thanks
Darren.
The first thing you need to do is stop thinking in terms of fields, rows and columns, as none of those structures are actually part of Core Data. In this case that matters because Core Data supports arbitrarily complex fetches but the SQLite store does not. So, if you use a SQLite store, your fetches are restricted to those supported by SQLite.
In this case, predicates aimed at SQLite can't perform complex operations such as calculating whether an attribute value is prime.
The best solution for your first case would be to add a boolean attribute isPrime, and then modify the setter for your id attribute to calculate whether the newly set id value is prime and set isPrime accordingly. That will be stored in the SQLite store and can be fetched against, e.g. isPrime == YES && ((first <= %@) && (second >= %@))
The second case would simply use a transient property for which you would supply a custom getter to calculate its value when the managed object was in memory.
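A sketch of both pieces in an NSManagedObject subclass (all names are assumptions; the attribute is called identifier here because id clashes with the Objective-C type):

// Assumed plain C helper for the primality test.
static BOOL MyIsPrime(NSInteger n)
{
    if (n < 2) return NO;
    for (NSInteger i = 2; i * i <= n; i++)
        if (n % i == 0) return NO;
    return YES;
}

// Custom setter: keep the persistent isPrime attribute in sync so that
// SQLite-backed fetches can filter on it.
- (void)setIdentifier:(NSNumber *)value
{
    [self willChangeValueForKey:@"identifier"];
    [self setPrimitiveValue:value forKey:@"identifier"];
    [self didChangeValueForKey:@"identifier"];
    self.isPrime = @(MyIsPrime(value.integerValue));
}

// Custom getter for the transient property, computed only in memory
// (here: the id plus a random number, as in the question).
- (NSNumber *)derivedValue
{
    return @(self.identifier.integerValue + arc4random_uniform(100));
}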
One often overlooked option is to not use a SQLite store but an XML store instead. If the amount of data is relatively small, e.g. a few thousand text attributes with a total memory footprint of a few dozen megabytes, then an XML store will be super fast and can handle more complex operations.
SQLite is sort of the stunted stepchild in Core Data. It is useful for large data sets and low memory, but with memory becoming ever more plentiful, it's losing its edge. I find myself using it less these days. You should consider whether you really need SQLite in this particular case.
