How do I dig into Active Record association meta info? - ruby

I have a table with data I pull from an external source. It has three different columns that are hierarchical in nature and reference other tables. The foreign keys are NOT constrained however, so data in those fields is not necessarily valid.
I'm trying to write something generic that will print out any records with values that don't reference an existing row in the parent table for a given date.
I've basically got something like this:
klass.where(:date => date).each do |rec|
next if rec.send(parent)
# do stuff with rec
end
Where klass is the model for the table and parent is a symbol of a declared 'belongs_to' association.
This method works, however, there may be tens of thousands of records for the day, but unique values on the key are fewer than 100. The repeated lookups into the parent table are unpleasantly time consuming. What I'd like to do is stash all the keys I've already looked up and only perform the lookup on new keys.
Towards this end, I'd like to be able to retrieve the field name that has the foreign key reference at runtime. Ideally this would work regardless of whether or not it's using the default naming convention.

You don't show your schema, but this needs to be attacked in your database table, not in code, especially if you're dealing with "tens of thousands" of records.
As a first try, I'd have a "last_checked" field which is a DateTime value. At the start of the program's run, I'd capture the maximum value for that field, + 1 second, as my last_ran value. Any record that is older than that time is a candidate for looking at. Each record you look at gets the last_checked field updated to the current DateTime.now value. You can whittle down the list of candidate records without beating up Ruby that way.

Related

Queries in Dynamodb

I have an application written in Nodejs that needs to find ONE row based on a city name (this could just be the table's name, different cities will be categorized as different tables), and a field named "currentJobLoads" which is a number. For example, a user might want to find ONE row with the city name "Chicago" and the lowest currentJobLoads. How can I achieve this in Dynamodb without scan operations(since scan would be slower and can only read so much data before it gets terminated)? Any suggestions would be highly appreciated.
You didn't specify what your current partition key and sort key for the table are, but I'm guessing the currentJobLoads field isn't one of them. So you would need to create a Global Secondary Index on the currentJobLoads field, at which point you will be able to run query operations against that field.

How to get the last "row" in a cassandra's long row

In Cassandra, a row can be very long and store units of time relevant data. For example, one row could look like the following:
RowKey: "weather"
name=2013-01-02:temperature, value=90,
name=2013-01-02:humidity, value=23,
name=2013-01-02:rain, value=false",
name=2013-01-03:temperature, value=91,
name=2013-01-03:humidity, value=24,
name=2013-01-03:rain, value=false",
name=2013-01-04:temperature, value=90,
name=2013-01-04:humidity, value=23,
name=2013-01-04:rain, value=false".
9 columns of 3 days' weather info.
time is a primary key in this row. So the order of this row would be time based.
My question is, is there any way for me to do a query like: what is the last/first day's humidity value in this row? I know I could use a Order By statement in CQL but since this row is already sorted by time, there should be some way to just get the first/last one directly, instead of doing another sort. Or is cassandra optimizing it already with Order By statement under the hood?
Another way I could think of is, store another column in this row called "last_time_stamp" that always updates itself as new data is inserted in. But that would require one more update every time I insert new weather data.
Thanks for any suggestion!:)
Without seeing more of your actual table, I suggest using a timestamp (or timeuuid if there is a possibility for collisions) as the second component in a compound primary key. Using this, you can get the last "row" by selecting ORDER BY t DESC LIMIT 1.
You could also change the clustering order in your schema to order it naturally for "last N" queries.
Please see examples and linked resource in this answer.

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on; how many "index" items (or filters if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CF just to get basic data, I am curious to know if it's a good idea to include these above "filters" as columns (compound column segment)?
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.

Can you sort a GET on a Cassandra column family by the Timestamp value created for each column entry, rather than the column Keys?

Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key. Which obviously provides sorting of a new thread quite easily, espically when say making a query of the latest 20 threads etc.
My problem is that when a new 'post' is made to a thread I want to be able to 'bump' that thread to the front of the 'thread line' which is where the problem comes in, how do I basically make this happen so I can still make queries that can still be selected in the right order without providing any kind of duplicates etc.
The only way I can see this working is if rather than a column family sorting via a TimeUUID I need the column family to sort via the insertion Timestamp, therefore I can use the unique thread IDs for column keys and retrieve these in the order they are inserted or reinserted rather than by TimeUUID? Is this possible or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparitor or otherwise it defaults to bytes?
Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The most similar to what you are doing now would be to create a second columnfamily, ThreadMostRecentPosts, with timeuuid columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)

Enumerate indexes on a Extensible Storage Engine (ESENT) table

Background
I'm writing an adapter for ESE to .NET and LINQ in a Google Code project called eselinq. One important function I can't seem to figure out is how to get a list of indexes defined for a table. I need to be able to list available indexes so the LINQ part can automatically determine when indexes can be used. This will allow much more efficient plans for user queries if appropriate indexes can be found.
There are two related functions for querying index information:
JetGetTableIndexInfo - get index information by tableID
JetGetIndexInfo - get index information by tableName
These only differ in how the related table is specified (name or tableid). It sounds like these would support the function I want but all the info levels seem to require that I already have a certain index to query information for. The only exception is JET_IdxInfoCount, but that only counts how many indexes are present.
JET_IdxInfo with its JET_INDEXLIST sounds plausible but it only lists the columns on a specific index.
Alternatives
I am aware that I could get the index information another way, like annotations on .NET types corresponding to database tables, or by requiring a index mapping be provided ahead of time. I think there's enough introspection implemented to make everything else work out of the box without the user supplying extra information, except for this one function.
Another option may be to examine the system tables to find related index objects, but this is would mean depending on an undocumented interface.
To satisfy this question, I want a supported method of enumerating the indexes (just the name would be sufficient) on a table.
You are correct about JetGetTableIndexInfo and JetGetIndexInfo and JET_IdxInfo. The twist is that the data is returned in a somewhat complex: a temporary table is returned containing a row for the index and then a row for each column in the table. To just get the index names you will need to skip the column rows (the column count is given by the value of the columnidcColumn column in the first row).
For a .NET example of how to decipher this, look at the ManagedEsent project. In the MetaDataHelpers.cs file there is a method called GetIndexInfoFromIndexlist that extracts all the data from the temporary table.

Resources