Cassandra path for range query with CL=2 - cassandra-2.0

I'm modelling a toy email application in Cassandra. Say I'm storing emails in the emails table:
create table emails (
  recipient_uid int,
  timestamp timestamp,
  content text,
  PRIMARY KEY (recipient_uid, timestamp)
);
Let's say we run Cassandra on a cluster of three nodes N1, N2, and N3, with 'replication_factor': 3 and quorum reads and writes (W=2, R=2).
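For reference, this setup could be expressed in cqlsh roughly as follows (the keyspace name is illustrative; with RF=3, QUORUM reads and writes give W=2, R=2):
```
-- illustrative keyspace; RF=3 across N1, N2, N3
CREATE KEYSPACE mail
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- cqlsh session-level consistency: quorum of 2 out of 3 replicas
CONSISTENCY QUORUM;
```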
Given that W=2, under concurrent operations it is possible for the nodes to diverge and store different sets of emails. For instance, we could start with the following rows on all three nodes:
 recipient_uid | timestamp                | content
---------------+--------------------------+---------
             1 | 2010-02-09 23:49:07+0000 | c1
             1 | 2010-02-04 22:11:16+0000 | c2
Then an INSERT occurs which propagates only to nodes N2 and N3; these two will then have the following rows:
 recipient_uid | timestamp                | content
---------------+--------------------------+---------
             1 | 2010-02-09 23:49:07+0000 | c1
             1 | 2010-02-04 22:11:16+0000 | c2
             1 | 2010-12-13 05:52:46+0000 | c3
Note that the email with content c3 is not yet replicated at N1. Then a client does a SELECT (R=2) at N1:
select * from emails where recipient_uid = 1;
When this happens, node N1 coordinates with the closest other node, say N2, and returns all three rows. (Assuming no speculative executor.)
Question:
What are the internals of this coordination step between N1 and N2? Is there any documentation or can somebody point me to the relevant code chunk?
According to my current understanding, the relevant classes should be roughly:
ReadCallback.java, and
AbstractReadExecutor.java.
But I'm relying mainly on information from ArchitectureInternals, which is incomplete and specific to single-row requests.
To be more precise: my main concern is whether the coordinator N1 is able to selectively pull just the missing row (in this case, the third row) from N2, then merge this row with the first two rows (which it already has) and finally reply to the client. What digests and/or data does N1 exchange with N2 to achieve this, if this is indeed the case?
EDIT: This doc on Read Path also seems relevant, but it also assumes single-row requests, which I expect to be simpler than range queries.
EDIT2: @ArielWeisberg kindly shared this information:
Range queries are described by PartitionRangeReadCommand.
StorageProxy has RangeIterator which uses StorageProxy.getRestrictedRanges() to calculate a covering set of endpoints and ranges.
StorageProxy.RangeCommandIterator is what materializes the results.

Related

How to understand part and partition of ClickHouse?

I see that clickhouse created multiple directories for each partition key.
The documentation says the directory name format is: partition name, minimum data block number, maximum data block number, and chunk level. For example, a directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think partition is split into different parts.
Am I right?
Parts are pieces of a table which store rows. One part = one folder with columns.
Partitions are virtual entities. They don't have a physical representation, but you can say that certain parts belong to the same partition.
SELECT does not care about partitions.
SELECT is not aware of partitioning keys,
BECAUSE each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of those columns in that part.
These minmax_ values are also kept in memory, in the (C++ vector) list of parts.
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 parts, because minmax_A.idx = (1,1) and this select needed (555, 555).
CH does not store partitioning key values. So, for example, toYYYYMM(today()) = 202002, but this 202002 is not stored in the part or anywhere else.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today())).
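To see how parts map to partitions and block ranges, one can also query system.parts (the columns below exist in ClickHouse's system.parts table; X is the example table above):
```
SELECT partition, name, min_block_number, max_block_number, level, active
FROM system.parts
WHERE database = 'default' AND table = 'X';
```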
In my case, I had used groupArray() and arrayEnumerate() for ranking in POPULATE. I thought that POPULATE could run the query over the new data of a partition (in my case: toStartOfDay(Date)); the total sum of the newly inserted data is correct, but the groupArray() function doesn't work correctly.
I think this happens because, when one part is inserted, CH runs groupArray() and the ranking on that part immediately and only then merges the parts within one partition, so I don't get exactly the final result of the groupArray() and arrayEnumerate() functions.
In summary, merging [groupArray(part_1) + groupArray(part_2)] is different from groupArray(Partition), with Partition = part_1 + part_2.
The solution I tried is to insert the new data as a single block, i.e. using groupArray() to reduce the new data to fewer rows than max_insert_block_size=1048576. That gives correct results, but it is hard to insert one day of new data as a single part, because it uses too much memory to query when populating one day of data (almost 150M-200M rows).
But do you have another solution for POPULATE with groupArray() over newly inserted data, such as forcing CH to run POPULATE per partition rather than per part, after merging all the parts into one partition?
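This is not from the thread, but one commonly used ClickHouse pattern for aggregates that must span parts is to store aggregation states with the -State combinator in an AggregatingMergeTree materialized view and merge them only at query time. A minimal sketch with illustrative table and column names (source_table, Key, Value, Date are assumptions, not the poster's schema):
```
-- Keep partial groupArray states per inserted block/part,
-- merge them only when querying.
CREATE MATERIALIZED VIEW mv_ranked
ENGINE = AggregatingMergeTree()
PARTITION BY Day
ORDER BY (Day, Key)
AS SELECT
    toStartOfDay(Date) AS Day,
    Key,
    groupArrayState(Value) AS value_states
FROM source_table
GROUP BY Day, Key;

-- Query time: merge the states accumulated across parts, then rank.
SELECT Day, Key, groupArrayMerge(value_states) AS merged_values
FROM mv_ranked
GROUP BY Day, Key;
```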

Elastic search storing hierarchical data and querying it

Let me break down the problem; it will take some time.
Consider that you have entities A, B, and C in your system.
A is the parent of everything.
B is a child of A.
C can be a child of A or B. Please note there are some more entities like D, E, F which are the same as C, so let's consider only C for the time being.
So basically it's a tree-like structure:
```
A
/ \
/ \
B C(there are similar elements like D, E, F)
|
|
C
```
Now we are using Elasticsearch as a secondary DB to store this. In the primary database the structure is completely different, since A, B, and C have dynamic fields; they are separate tables and we join them to get the data. But from a business perspective this is the design.
Now, when we try to flatten it and store it in ES, for the set below
we have an entity A1 which has 2 children, C1 and B1, and B1 has a further child C2:
    A    B     C
1   A1   null  null
2   A1   null  C1
3   A1   B1    null
4   A1   B1    C2
Now, what can you query?
The user says he wants all columns of A, B, C where the value of column A is A1; after adding some null-removing rules we can give him rows 2, 3, 4.
Now the problem: the user says he wants all As where the value of A is A1, so basically we will return him all rows 1, 2, 3, 4 (or 2, 3, 4), so he will see values like
A
A1
A1
A1
but logically he should see only one value, A1, since that is the only unique value, and ES doesn't have the ability to group things the way we need.
So how did we solve this?
We solved this problem by creating multiple indices and one nested index.
So when we need to group by, we go to the nested index, and the other indices work as flat indices.
So we have different indices, like an index for A and B, and one for A or B and C. But we have more elements, so it led to the creation of 5 indices.
As the data keeps increasing, it is becoming difficult to maintain 5 indices, and indexing them from scratch takes too much time.
So to solve this we started to look for other options and we are testing CrateDB. But in the first place we are still trying to figure out whether there is any way to do this in ES, since we need many ES features such as percolation, watcher, etc. Any clues on that?
Please also note that we need to apply pagination as well; that's why a single nested index will not work.

Spotfire: Using one selection as a range for another datatable

I've searched quite a bit for this and can't find a good solution anywhere to what seems to me like a normal problem for this product.
I've got a data table (in memory) that comes from a rollup table (call it 'Ranges'). Basically like so:
id | name     | f1  | f2  | totals
0  | Channel1 | 450 | 680 | 51
1  | Channel2 | 890 | 990 | 220
...and so on
Which creates a bar chart with Name on the X and Totals on the Y.
I have another table that is an external link to a large (500M+ rows) table. That table (call it 'Actuals') has a column ('Fc') whose values fall between the F1 and F2 values of Ranges.
I need a way for Spotfire Analyst (v7.x) to use the selection in the bar chart for Ranges to trigger this select statement:
SELECT * FROM Actuals WHERE Actuals.Fc between [Ranges].[F1] AND [Ranges].[F2]
But there aren't any relationships (foreign keys) between the two data sources; one is in memory (Ranges) and the other is dynamically loaded.
TLDR: How do I use the selected rows from one visualization as a filter expression for another visualization's data?
My choice for the workaround:
Add a button which says 'Load Selected Data'
This will run the following code, which will store the values of F1 and F2 in a Document Property, which you can then use to filter your Dynamically Loaded table and trigger a refresh (either with the refresh code or by setting it to load automatically).
# Get the rows currently marked in the IL_Ranges table
rowIndexSet = Document.ActiveMarkingSelectionReference.GetSelection(Document.Data.Tables["IL_Ranges"]).AsIndexSet()

# Store F1/F2 of the first selected row in document properties
if rowIndexSet.IsEmpty != True:
    Document.Properties["udF1"] = Document.Data.Tables["IL_Ranges"].Columns["F1"].RowValues.GetFormattedValue(rowIndexSet.First)
    Document.Properties["udF2"] = Document.Data.Tables["IL_Ranges"].Columns["F2"].RowValues.GetFormattedValue(rowIndexSet.First)

# Refresh the dynamically loaded Actuals table so it picks up the new bounds
if Document.Data.Tables.Contains("IL_Actuals") == True:
    myTable = Document.Data.Tables["IL_Actuals"]
    if myTable.IsRefreshable and myTable.NeedsRefresh:
        myTable.Refresh()
This is currently operating on the assumption that you will not allow your user to view multiple ranges at a time, and simply shows the first one selected.
If you DO want to allow them to view multiple ranges, you can run a cursor through your IL_Ranges table to either get the overall Min and Max across the selected rows and limit the Actuals between those, or build a string that essentially says 'Fc between 450 and 680 or Fc between 890 and 990', pass it to a stored procedure as a string which executes the quasi-dynamic statement, and grab the resulting dataset.
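A minimal IronPython sketch of the string-building variant, assuming the same IL_Ranges table as above and a hypothetical document property udFilter that the Actuals query would read:
```
# Sketch only: build an 'Fc between x and y or ...' predicate from all marked rows.
# Table and column names follow the example above; udFilter is hypothetical.
rowIndexSet = Document.ActiveMarkingSelectionReference.GetSelection(Document.Data.Tables["IL_Ranges"]).AsIndexSet()
f1Values = Document.Data.Tables["IL_Ranges"].Columns["F1"].RowValues
f2Values = Document.Data.Tables["IL_Ranges"].Columns["F2"].RowValues

clauses = []
for rowIndex in rowIndexSet:
    clauses.append("Fc between " + f1Values.GetFormattedValue(rowIndex) + " and " + f2Values.GetFormattedValue(rowIndex))

if clauses:
    Document.Properties["udFilter"] = " or ".join(clauses)
```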

Dynamic database design for variable length spatio-temporal data in Oracle (need a schema design)

Currently I am working on a research project where I need to store spatio-temporal data and analyze it efficiently. I give the exact requirements below.
The research is on meteorological data, so the data attributes are temperature, humidity, pressure, wind speed, wind direction, etc. The number of attributes is not known to us in advance; depending on requirements we may need to add more attributes (the table has a dynamic set of attributes with different data types). The data is also captured from various locations, at various heights, over a certain time duration and at a certain time interval.
So, what would be the best way to design a schema for this requirement? We must be able to query the relations efficiently.
The purpose of the project is not only to store the data but also to manipulate it.
Sample data in table format -
location | time | height | pressure | temperature | wind-direction | ...
L1 | 2011-12-18 08:04:02 | 7 | 1009.6 | 28.3 | east | ...
L1 | 2011-12-18 08:04:02 | 15 | 1008.6 | 27.9 | east | ...
L1 | 2011-12-18 08:04:02 | 27 | 1007.4 | 27.4 | east | ...
L1 | 2011-12-18 08:04:04 | 7 | 1010.2 | 28.4 | north-east | ...
L1 | 2011-12-18 08:04:04 | 15 | 1009.4 | 28.2 | north-east | ...
L1 | 2011-12-18 08:04:04 | 27 | 1008.9 | 27.6 | north-east | ...
L2 | 2011-12-18 08:04:02 | ..... so on
Here I need to design a schema for the above sample data where Location is a spatial location that can be implemented using oracle MDSYS.SDO_GEOMETRY type.
The constraints are:
1) The number of attributes (table columns) is unknown during development. At runtime any new attribute (say humidity, refractive index, etc.) can be added, so we can't design an attribute-specific table schema.
    1.1) For this constraint I thought to use a schema like:
        tbl_attributes(attr_id_pk, attr_name, attr_type);
        tbl_data(loc, time, attr_id_fk, value);
    In my design the attribute value must then be of varchar type, and I thought to cast it as required (not a good idea at all).
    But finding relational data with this schema using SQL queries alone is very difficult. For example, I want to find:
        1.1.1) the average pressure for location L1 when the wind direction is east and the temperature is between 27 and 28;
        1.1.2) the locations where pressure is maximum at height 15.
    1.2) I am also thinking of editing the table schema at runtime, which again is not a good idea, I think.
2) We will use a loader application, which will take care of this dynamic insertion depending on the schema (whatever it may be).
3) We need to retrieve statistical data efficiently, as in the examples given above [1.1.*].
I am not completely sure I understand what you mean when you say:
"The no of attributes (table column) is unknown during development. In runtime any new attribute (let say - humidity, refractive index etc) can be added."
First of all, I suppose that this is not really happening at random: i.e. when you get a new bunch of data from the field you know (before importing) that it has an extra dimension or two. Correct?
Also, the fact that in this new data batch you get "refractive index" will not make the older data magically acquire a proper value for this dimension.
Therefore I would go for a classical Object-to-RDBMS mapping where you have:
a header table with things that exist for every measurement: i.e. time and space, possibly the source (i.e. lab, sensor, team which provided the data) and an autogenerated key.
one or more detail tables where the values are defined as proper fields.
Example:
Header
location | time | height | source |Key |
L1 | 2011-12-18 08:04:02 | 7 | team-1 | 002020013 |
L1 | 2011-12-18 08:04:02 | 15 | team-1 | 002020017 |
L1 | 2011-12-18 08:04:02 | 27 | Lab-X | 002020018 |
L1 | 2011-12-18 08:04:04 | 7 | Lab-Y | 002020021 |
L1 | 2011-12-18 08:04:04 | 15 | Lab-X | 002020112 |
Atmospheric data (basic)
Key | pressure | temp | wind-dir |
002020013 | 1009.6 | 28.3 | east |
002020017 | 1019.3 | 29.2 | east |
002020018 | 1011.6 | 26.9 | east |
Light-sensor data
Key | refractive-ind | albedo | Ultraviolet |
002020017 | 79.6 | .37865 | 7.0E-34 |
002020018 | 67.4 | .85955 | 6.5E-34 |
002020021 | 91.6 | .98494 | 8.1E-34 |
In other words: every different set of data will use one or more subtables (these you can add "dynamically", if needed), and you can still create queries by standard means; you will just have to join the subtables (where possible: i.e. if you want to analyze by wind direction AND refractive index, you can, but only for sets of data which have both values), using the reference keys to keep them consistent.
I believe this is more efficient than using text fields with CSV inside, data blobs, or key-value associations.
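A minimal Oracle DDL sketch of the header/detail layout described above (table and column names are illustrative; the identity column assumes Oracle 12c or later, otherwise a sequence would be used):
```
-- Header: one row per measurement event, with an autogenerated key.
CREATE TABLE measurement_header (
  meas_key   NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  location   MDSYS.SDO_GEOMETRY,
  meas_time  TIMESTAMP NOT NULL,
  height     NUMBER,
  source     VARCHAR2(50)
);

-- Detail tables: one per family of attributes, added as new kinds of data appear.
CREATE TABLE atmospheric_data (
  meas_key  NUMBER REFERENCES measurement_header (meas_key),
  pressure  NUMBER,
  temp      NUMBER,
  wind_dir  VARCHAR2(20)
);

CREATE TABLE light_sensor_data (
  meas_key        NUMBER REFERENCES measurement_header (meas_key),
  refractive_ind  NUMBER,
  albedo          NUMBER,
  ultraviolet     NUMBER
);
```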
I would definitely go with 1.2 (edit table schema during runtime), at least to begin with. Any sufficiently advanced configuration is indistinguishable from programming; don't think you can magically avoid making changes to your program.
Don't be scared of alter table. Yes, the upfront costs are higher - you may need a process (not just a program) to ensure your schema stays clean. And there are some potential locking problems (that have solutions). But if you do it right you only have to pay the price once for each change.
With a completely generic solution you will pay a small price with every query. Your queries will be complicated, slower, ugly, and more likely to fail. You can never simply write a query like select avg(value) ...; it may or may not work, depending on how the data is accessed. You can use a PL/SQL function to catch exceptions, or use inline views and hints to force a specific access pattern. Either way, your queries are more complicated and slower, and you have to make sure that everybody understands these problems before they use the data.
And with a generic solution the optimizer will suck because it knows nothing about your data. Oracle can't predict how many rows will be returned by where attr_name = 'temperature' and is_number(value) = 28.4. But it can make a very good guess for where temperature = 28.4. You may have significantly more bad plans (i.e. slow queries) with generic columns.
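To make the contrast concrete, a sketch of the two query shapes: the generic case uses the tbl_data/tbl_attributes layout from the question, while the typed case assumes a hypothetical measurements table with real columns:
```
-- Generic (attribute/value) shape: the optimizer only sees generic varchar values,
-- and TO_NUMBER can fail at runtime if any non-numeric value sneaks in.
SELECT AVG(TO_NUMBER(d.value))
FROM tbl_data d
JOIN tbl_attributes a ON a.attr_id_pk = d.attr_id_fk
WHERE a.attr_name = 'temperature';

-- Typed-column shape (hypothetical measurements table): the optimizer has real
-- column statistics on temperature and the query is trivially correct.
SELECT AVG(temperature)
FROM measurements
WHERE location = 'L1';
```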
Thank you for the quick response and good guidance. I have taken some concepts from both answers and decided to go with a mixed model. I don't know whether I am on the right path or not, and I would like comments on the model. Below I describe the complete conceptual model with a MySQL code snippet.
Conceptual model
For dynamicity (the number of columns is not defined in advance) I have created four tables, as follows (a DDL sketch is given after the list):
geolocation(locid int, name varchar, geometry spatial_type) - to store information about a particular location, possibly defined with a spatial feature.
met_loc_event(loceventid int, locid* int, record_time timestamp, height float) - to identify a particular measurement event at a place, time and height.
metfeatures(featureid int, name varchar, type varchar) - to store feature (i.e. column) details with a data type; the type field helps cast the data as required.
metstore(loceventid* int, featureid* int, value varchar) - to store an atomic value for a feature at a particular event.
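Since the snippet below is MySQL, here is a hedged MySQL DDL sketch of these four tables (column types and sizes are assumptions based on the descriptions above):
```
-- Sketch only; types and sizes are assumptions.
CREATE TABLE geolocation (
  locid     INT PRIMARY KEY,
  name      VARCHAR(100),
  geometry  GEOMETRY
);

CREATE TABLE met_loc_event (
  loceventid   INT PRIMARY KEY,
  locid        INT,
  record_time  TIMESTAMP,
  height       FLOAT,
  FOREIGN KEY (locid) REFERENCES geolocation (locid)
);

CREATE TABLE metfeatures (
  featureid  INT PRIMARY KEY,
  name       VARCHAR(45),
  type       VARCHAR(45)
);

CREATE TABLE metstore (
  loceventid  INT,
  featureid   INT,
  value       VARCHAR(255),
  FOREIGN KEY (loceventid) REFERENCES met_loc_event (loceventid),
  FOREIGN KEY (featureid)  REFERENCES metfeatures (featureid)
);
```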
Up to this point the design is column-oriented, to handle the dynamic nature of the table. But as you suggest, this is not a good design for querying the database (some things, such as arithmetic functions, will not work directly), and it is also not good for performance.
For efficient querying (to avoid too much joining and to avoid casting values at query time), I extended the model with some helper views; I wrote a stored procedure to generate the views from the stored data.
First I create a view for each feature (taking the feature names from the features table, so initially there are as many views as features) with the help of the met_loc_event, metfeatures and metstore tables. These views expose locid, record_time, height, and the value cast according to the feature type.
Next, from these views, I create a row-oriented view named metrelview, which contains all the relational data row-wise, like a normal table. I plan to run queries against this view, so query performance should improve.
This view-generation procedure needs to be executed whenever there is an insert, update or delete on the features table.
Below is the MySQL procedure that I have developed for the view generation
CREATE PROCEDURE `buildModel`()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE fid INTEGER;
    DECLARE fname VARCHAR(45);
    DECLARE ftype VARCHAR(45);
    DECLARE cur_features CURSOR FOR SELECT `featureid`, `name`, `type` FROM `metfeatures`;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    SET @viewAlias = 'v_';
    SET @metRelView = "metrelview";
    SET @stmtCols = "";
    SET @stmtJoin = "";

    START TRANSACTION;
    OPEN cur_features;

    read_loop: LOOP
        FETCH cur_features INTO fid, fname, ftype;
        IF done THEN
            LEAVE read_loop;
        END IF;
        IF fname IS NOT NULL THEN
            -- One view per feature, casting the varchar value to the feature's type
            SET @featureView = CONCAT(@viewAlias, LOWER(fname));
            IF ftype = 'float' THEN
                SET @featureCastStr = "`value`+0.0";
            ELSEIF ftype = 'int' THEN
                SET @featureCastStr = "CAST(`value` AS SIGNED)";
            ELSE
                SET @featureCastStr = "`value`";
            END IF;
            SET @stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", @featureView, "`");
            SET @stmtCreateView = CONCAT("CREATE VIEW `", @featureView, "` AS SELECT le.`loceventid` AS loceventid, le.`locid`, le.`rectime`, le.`height`, ", @featureCastStr, " AS value FROM `metlocevent` le JOIN `metstore` ms ON (le.`loceventid`=ms.`loceventid`) WHERE ms.`featureid`=", fid);
            PREPARE stmt FROM @stmtDeleteView;
            EXECUTE stmt;
            PREPARE stmt FROM @stmtCreateView;
            EXECUTE stmt;
            -- Accumulate the column list and joins for the combined row-oriented view
            SET @stmtCols = CONCAT(@stmtCols, ", ", @featureView, ".`value` AS ", @featureView);
            SET @stmtJoin = CONCAT(@stmtJoin, " ", "LEFT JOIN ", @featureView, " ON (le.`loceventid`=", @featureView, ".`loceventid`)");
        END IF;
    END LOOP;

    -- Rebuild the row-oriented view joining all per-feature views
    SET @stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", @metRelView, "`");
    SET @stmtCreateView = CONCAT("CREATE VIEW `", @metRelView, "` AS SELECT le.`loceventid`, le.`locid`, le.`rectime`, le.`height`", @stmtCols, " FROM `metlocevent` le", @stmtJoin);
    PREPARE stmt FROM @stmtDeleteView;
    EXECUTE stmt;
    PREPARE stmt FROM @stmtCreateView;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;

    CLOSE cur_features;
    COMMIT;
END;
N.B. I tried to call the procedure from triggers on the features table, so that everything would be automated, but as MySQL does not support dynamic queries inside functions or triggers, I can't do it automatically.
I would also like some criticism before I finalize this as the accepted model. I am not a DBA, so if you can help me improve the performance of this model it would be very helpful for me.
This sounds like a homework assignment whose underlying subject is: use-cases for abandoning strict normal-form design principles.
The solution to this conundrum is to develop a three-stage solution. Stage 1 is runtime adaptability using the flexible AttributeType, AttributeValue approach, so that rapidly incoming data can be captured and put somewhere temporarily in a quasi-structured manner. Stage 2 involves the analysis of that runtime data to see where the model must be extended with additional columns and validation tables to accommodate any new attributes. Stage 3 is the importing of the as-yet-unimported data into the revised model, which never relaxes its strict datatyping and declarative referential integrity constraints.
As they say: Life, friends, is a trade-off.

R - Sorting and Sub-setting Maximum Values within Columns

I am trying to iteratively sort data within columns to extract N maximum values.
My data is set up with the first and second columns containing occupation titles and codes, and all of the rest of the columns containing comparative values (in this case location quotients that had to be previously calculated for each city) for those occupations for various cities:
occ_code   city1  ...  city300
occ1       5      ...  7
occ2       20     ...  22
.          .      ...  .
.          .      ...  .
occ800     20     ...  25
For each city I want to sort by the maximum values and select a subset of those maximum values matched with their respective occupation titles and codes. I thought it would be relatively trivial but...
Edit for clarification: I want to end up with a sorted subset of the data for analysis, e.g.:
occ_code  city1
occ200    10
occ90     8
occ20     2
occ95     1.5
At the same time I want to be able to repeat the sort column-wise (I've tried lots of order() calls addressing columns directly, e.g. data[,2]), just to be able to run the same analysis functions over the entire dataset.
I've been messing with plyr for the past 3 days and I feel like the setup of my dataset is just not conducive to how plyr was meant to be used.
I'm not exactly sure what your desired output is according to your example snippet. Here's how you could get a data frame like that for every city using plyr and reshape:
#using the same df from nico's answer
library(reshape)
df.m <- melt(df, id = 1)
a.cities <- cast(df.m, Code ~ . | variable)
library(plyr)
a.cities.max <- aaply(a.cities, 1, function(x) arrange(x, desc(`(all)`))[1:4,])
Now, a.cities.max is an array of data frames, with the 4 largest values for each city in each data frame. To get one of these data frames, you can index it with
a.cities.max$X13
I don't know exactly what you'll be doing with this data, but you might want it back in data frame format.
df.cities.max <- adply(a.cities.max, 1)
One way would be to use order with ddply from the package plyr
> library(plyr)
> d<-data.frame(occu=rep(letters[1:5],2),city=rep(c('A','B'),each=5),val=1:10)
> ddply(d,.(city),function(x) x[order(x$val,decreasing=TRUE)[1:3],])
order can sort on multiple columns if you want that.
This will output the single maximum for each city. Similar results can be obtained using sort or order:
# Generate some fake data
codes <- paste("Code", 1:100, sep="")
values <- matrix(0, ncol=20, nrow=100)
for (i in 1:20) {
  values[, i] <- sample(0:100, 100, replace = TRUE)
}
df <- data.frame(codes, values)
names(df) <- c("Code", paste("City", 1:20, sep=""))
# Now for each city we get the maximum
maxval <- apply(df[2:21], 2, which.max)
# Output the max for each city
print(cbind(paste("City", 1:20), codes[maxval]))
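For completeness, a base-R sketch (no plyr/reshape) that pulls the top N occupations for a city column and can be repeated over all city columns; it assumes the df from the fake-data example above and a hypothetical helper name top_n_for_city:
```
# Top N occupations for a single city column, using base R order();
# df is the fake data frame generated in the answer above.
top_n_for_city <- function(df, city, n = 4) {
  ord <- order(df[[city]], decreasing = TRUE)
  df[ord[1:n], c("Code", city)]
}

top_n_for_city(df, "City1")

# Repeat across all city columns:
lapply(names(df)[-1], function(city) top_n_for_city(df, city, n = 4))
```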
