I have three fields in my MongoDB collection: days (long), startDate (java.util.Date), and endDate (java.util.Date). I want to fetch the records between startDate and (endDate - days), or where (endDate - startDate) <= days.
Can you please let me know how I could achieve this using Spring's mongoTemplate?
I don't want to fetch all the records from the collection and then resolve this on the Java side, since in the future my collection may have millions of records.
Thanks
Jitender
There is no way to do this in the query on the DB side (the endDate minus startDate part). If this is an important feature for your application, I recommend that you alter the schema to maintain, in the document itself, the delta between the two fields in the format you need it in. You can update that field whenever you update endDate (or, if you populate both dates at the same time, you can just compute the field then).
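A minimal sketch of what the query could then look like with Spring's MongoTemplate, assuming two fields maintained on write: cutoffDate (endDate minus days) and slackDays (days minus the number of days between startDate and endDate). Those field names, and the Event class, are made up for the example:

import java.util.Date;
import java.util.List;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class EventQueries {

    // Minimal document class for the sketch; the two extra fields are precomputed on write.
    public static class Event {
        public long days;
        public Date startDate;
        public Date endDate;
        public Date cutoffDate;  // endDate minus 'days' days
        public long slackDays;   // days - daysBetween(startDate, endDate)
    }

    // Matches documents where now is between startDate and cutoffDate,
    // OR slackDays >= 0 (i.e. (endDate - startDate) <= days),
    // all evaluated on the DB side against plain, indexable fields.
    public static List<Event> find(MongoTemplate mongoTemplate) {
        Date now = new Date();
        Criteria inWindow = new Criteria().andOperator(
                Criteria.where("startDate").lte(now),
                Criteria.where("cutoffDate").gte(now));
        Criteria shortEnough = Criteria.where("slackDays").gte(0);
        Query query = new Query(new Criteria().orOperator(inWindow, shortEnough));
        return mongoTemplate.find(query, Event.class);
    }
}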
If you receive this data in bulk from another source, or if you do multi-updates of endDate, then you will probably need another job that runs periodically and computes the delta for the documents where it hasn't been computed yet (you can start by always setting the delta to 99999 and have this job update it to the accurate value once endDate is set).
While you can use the $where clause, it will be a very slow full collection scan, so I would not suggest using it - it's probably better to come up with a more performant alternative, even if it requires altering the schema.
http://docs.mongodb.org/manual/reference/operator/where/
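For completeness, the $where variant could be expressed through a BasicQuery (reusing the Event class from the sketch above); it evaluates JavaScript against every document, so it is exactly the slow full scan mentioned above:

import java.util.List;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.BasicQuery;

public class EventWhereQuery {

    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // (endDate - startDate) <= days, evaluated as JavaScript on the server.
    public static List<EventQueries.Event> findWithWhere(MongoTemplate mongoTemplate) {
        BasicQuery query = new BasicQuery(
                "{ $where: \"(this.endDate - this.startDate) <= this.days * " + MILLIS_PER_DAY + "\" }");
        return mongoTemplate.find(query, EventQueries.Event.class);
    }
}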
I want to scrub(or encrypt) the email information from a few tables which are older than a few years.
I am planning to do this as part of a job. The next time I run the job, how can I omit the rows which are already scrubbed or encrypted?
I am looking for an approach that will have good performance.
"I want to scrub(or encrypt) the email information from a few tables which are older than a few years"
I hope this means you have a date column on these tables which you can use to determine which ones need to be scrubbed. The most efficient way of tackling the job is to track that date in an operational table, recording the most recent date scrubbed.
For example, say you have ten years' worth of data and you need to scrub records which are more than four years old. This would work:
update t23
set email = null
where date_created < add_months(sysdate, -48);
But it seems like you want to batch things up. So build a tracking table, which at its simplest would be
create table tracker (
    last_date_scrubbed date
);
Populate last_date_scrubbed with a really old date, say date '2010-01-01'.
Now you can write a query like this
update t23
set email = null
where date_created
< (select last_date_scrubbed + interval '1' year from tracker);
That will clean all records older than 2011. Increment the date in the tracker table by one year, then run the query again to clean stuff from 2011. Repeat until you get to your target state of cleanliness, at which point you can switch to running the query monthly, with an interval of one month, or whatever suits.
Obviously you should proceduralize this. A procedure is the best way to encapsulate the steps and make sure everything is kept in step. Also you can use the database scheduler to run the procedure.
"there is one downside to this approach. I thought that you want to be free upon choosing which rows to be updated."
I don't see any requirement to track which individual rows have been scrubbed. After all, the end state is that every record older than a certain date has been scrubbed. When I have done jobs like this previously, all anybody wanted to know was, "how many rows have we done so far and how many have we still got to do?", which can be answered by tracking sql%rowcount for each run.
For the best performance, you can add a flag column to your main table, a column like IsEncrypted. Then, every time you run a query for the "not encrypted" rows, you can simply add a WHERE condition on the IsEncrypted column being false to restrict the query to those rows only. There are other ways, though.
EDIT
Another way is to create a logger table. Basically, what this table does is record any additional information you want about a given ID in another table. Create a table called EncryptionLogger; in it you would have at least two columns: EmailTableId and IsEncrypted. Then, in any query, you can simply get the rows WHERE their IDs are NOT IN this table.
In Cassandra, a row can be very long and store time-based data in its columns. For example, one row could look like the following:
RowKey: "weather"
name=2013-01-02:temperature, value=90,
name=2013-01-02:humidity, value=23,
name=2013-01-02:rain, value=false,
name=2013-01-03:temperature, value=91,
name=2013-01-03:humidity, value=24,
name=2013-01-03:rain, value=false,
name=2013-01-04:temperature, value=90,
name=2013-01-04:humidity, value=23,
name=2013-01-04:rain, value=false
9 columns of 3 days' weather info.
Time is part of the column name in this row, so the order of the columns within the row is time-based.
My question is, is there any way for me to do a query like: what is the last/first day's humidity value in this row? I know I could use an ORDER BY statement in CQL, but since this row is already sorted by time, there should be some way to just get the first/last one directly, instead of doing another sort. Or is Cassandra already optimizing this under the hood when I use ORDER BY?
Another way I can think of is to store another column in this row called "last_time_stamp" that is always updated as new data is inserted. But that would require one more update every time I insert new weather data.
Thanks for any suggestion!:)
Without seeing more of your actual table, I suggest using a timestamp (or timeuuid if there is a possibility of collisions) as the second component in a compound primary key. Using this, you can get the last "row" by selecting with ORDER BY t DESC LIMIT 1.
You could also change the clustering order in your schema to order it naturally for "last N" queries.
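A minimal sketch of that approach with the DataStax Java driver; the keyspace, table, and column names are assumptions based on the weather example above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class LatestHumidity {
    public static void main(String[] args) {
        // Assumed schema, created beforehand (e.g. in cqlsh):
        //   CREATE TABLE weather (
        //       sensor      text,
        //       day         timestamp,
        //       temperature int,
        //       humidity    int,
        //       rain        boolean,
        //       PRIMARY KEY (sensor, day)
        //   ) WITH CLUSTERING ORDER BY (day DESC);
        //
        // Rows within the "weather" partition are clustered newest-first,
        // so the latest reading is simply LIMIT 1 with no extra sort.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");  // keyspace name is an assumption

        Row latest = session.execute(
                "SELECT humidity FROM weather WHERE sensor = 'weather' LIMIT 1").one();
        if (latest != null) {
            System.out.println("latest humidity = " + latest.getInt("humidity"));
        }
        cluster.close();
    }
}

With the DESC clustering order the newest day comes back first; with the default ASC order you would add ORDER BY day DESC to the SELECT instead.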
Please see examples and linked resource in this answer.
I have a situation where a table in the database has a date field defined as DATE, where the time is also important (for sorting later).
At first, all times for the date were coming through as 000000, but I updated the code to use timestamp, and when inserting new records it works fine.
Updates, on the other hand, will not change the database if the date is the same (but the time is different). Apparently, when updating, Hibernate doesn't take the time into consideration and the record is not changed (or at least this is what I discovered from my testing).
I can't change the database structure to use timestamp or add a time field.
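For reference, the mapping change described above would look something like this (a sketch assuming annotation-based mapping; the entity and column names are made up) - the column in the database stays DATE, only the Hibernate mapping changes:

import java.util.Date;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Temporal;
import javax.persistence.TemporalType;

@Entity
public class Record {

    @Id
    private Long id;

    // TemporalType.TIMESTAMP maps both the date and the time portion of the
    // java.util.Date, instead of truncating to midnight as TemporalType.DATE does.
    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "CREATED_DATE")
    private Date createdDate;

    // getters and setters omitted for brevity
}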
Any help is really appreciated :)
Thanks
To improve my skills with Hector and Cassandra, I'm trying different methods to query data out of Cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages in reverse chronological order, with the most recently posted message first.
In plain SQL it is possible to use ORDER BY. I know it is possible if you use the OrderPreservingPartitioner, but this partitioner is deprecated and less efficient than the RandomPartitioner. I thought of creating an index on a secondary column with a timestamp as its value, but I can't figure out how to obtain the data. I'm sure that I have to use at least two queries.
My column Family looks like this:
create column family messages
  with comparator = UTF8Type
  and key_validation_class = LongType
  and compression_options =
    {sstable_compression: SnappyCompressor, chunk_length_kb: 64}
  and column_metadata = [
    {column_name: message, validation_class: UTF8Type},
    {column_name: index, validation_class: DateType, index_type: KEYS}
  ];
I'm not sure if I should use DateType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible, I'd like to know how it's done with the CQL syntax and without it.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
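A sketch of what reading the newest messages for one time bucket could look like with Hector; this assumes a different layout from the column family above - row key = a day bucket string, column name = the message timestamp (long), column value = the message text:

import java.util.List;

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class LatestMessages {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);

        // One row per day; columns are sorted by their long (timestamp) names.
        SliceQuery<String, Long, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
        query.setColumnFamily("messages");
        query.setKey("2013-01-02");            // the time bucket (row) to read
        query.setRange(null, null, true, 10);  // reversed = true -> newest columns first

        ColumnSlice<Long, String> slice = query.execute().get();
        List<HColumn<Long, String>> newestFirst = slice.getColumns();
        for (HColumn<Long, String> column : newestFirst) {
            System.out.println(column.getName() + " -> " + column.getValue());
        }
    }
}

The client would repeat this for older buckets (previous days) until it has collected as many messages as it needs.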
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
Hello, I have an app with an Event entity that has a to-many relationship to a Date entity that contains MULTIPLE startDates and endDates for each event.
In my list view I need to sort the events by the next available startDate (or endDate) from the to-many relationship.
First I created a transient property in the Date entity that made all the necessary calculations (comparing to the present date, etc.), but then quickly realized that you cannot sort the fetchedResultsController using a transient property.
I cannot make the calculations at the time the start and end dates are created, because there is more than one startDate and endDate for each event, and which ones to use can only be determined on demand by comparing them to the present date.
Any guidance on which way to go with this would be greatly appreciated.
You might need to go about this backwards.
The simplest solution would be to fetch Date objects that fall into the needed range and then display the Event objects they relate to.
Otherwise, you will have to use a SUBQUERY in your predicate.