I am an RDBMS programmer by experience. I am working on a scientific research problem involving genomic data. I was assigned to explore Cassandra since we needed a Big Data, scalable and cheap (free) solution. Setting Cassandra up and loading it with data was seductively trivial and similar to my experience with traditional databases like Oracle and MySQL. My problem is finding a simple strategy for querying the data, since this is a fundamental requirement for any data repository. The data I am working with is mutation datasets, which contain positional information as well as calculated numerical measures about the data. I set up an initial static column family that looks like this:
CREATE TABLE variant (
  chrom text,
  pos int,
  ref text,
  alt text,
  aa text,
  ac int,
  af float,
  afr_af text,
  amr_af text,
  an int,
  asn_af text,
  avgpost text,
  erate text,
  eur_af text,
  ldaf text,
  mutation_id text,
  patient_id int,
  rsq text,
  snpsource text,
  theta text,
  vt text,
  PRIMARY KEY (chrom, pos, ref, alt)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX af_variant_idx ON variant (af);
As you can see, there is a natural primary key of positional data (chrom, pos, ref and alt). This data is not meaningful from a querying point of view. Much more interesting to my clients at the moment is extracting rows with an 'AF' value below a certain threshold. I am using Java RESTful services to interact with this database via the CQL JDBC driver. It quickly became apparent that directly querying this table by AF would not work, since it seems the SELECT statement must identify the row keys you want to look at. I found some confusing discussions on this point, but since there are fewer than 100 distinct AF values, what I decided to do was build a lookup table that looks like this:
CREATE TABLE af_lookup (
  af_id float,
  column1 text,
  column2 text,
  value text,
  PRIMARY KEY (af_id, column1, column2)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
This was meant to be a dynamic table with very wide rows. I populated this table from the data stored in my static column family. The 'AF' value is the row key, and the compound key from the other table is concatenated with '-' (e.g. 1-129-T-G) and stored as a string as a dynamic column name. This worked OK, but I still do not understand how all of these things work together. Dynamic column families seem to only work as advertised using CQL 2, but I really need to use operators like >, <, >=, <=. It seems like this is theoretically possible, but I have not found a solution in the last four weeks of trying a number of different tools (I tried Astyanax as well as the JDBC driver).
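For what it's worth, here is a rough CQL 3 sketch of the same wide-row idea, with AF as a clustering column so that the range operators become available within a partition (the table name, the af_bucket column and the single-bucket trick are illustrative assumptions only, not something I actually have running):

-- sketch only: af is a clustering column, so >, <, >=, <= work inside a partition
-- af_bucket is an assumed, artificial partition key; a single bucket keeps all rows
-- in one partition, which is simple but can become a very hot/large partition
CREATE TABLE variant_by_af (
  af_bucket int,
  af float,
  chrom text,
  pos int,
  ref text,
  alt text,
  PRIMARY KEY (af_bucket, af, chrom, pos, ref, alt)
);

SELECT chrom, pos, ref, alt, af
FROM variant_by_af
WHERE af_bucket = 0 AND af <= 0.05;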
I have two primary problems. The first is the RPC timeout limitation when querying these data, which could produce tens of thousands to millions of records. The second problem is how to present these data in HTML by fetching only the data that has not been presented already (previous/next links), similar to the way OpsCenter displays column family data. This doesn't seem possible given the functional limitation of not being able to use >, <, >=, <=. Based on my experience this is probably a lack of understanding on my part of how this product really works rather than a lack of capability of the product (databases wouldn't be very useful if they were only capable of handling writes well).
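Continuing the illustrative sketch above (the variant_by_af table and its columns are assumptions), paging would then mean remembering the last AF value shown and asking for the next small slice, which also keeps each request well under the RPC timeout:

-- first page (sketch only; names are the assumed ones from the table above)
SELECT chrom, pos, ref, alt, af
FROM variant_by_af
WHERE af_bucket = 0 AND af <= 0.05
LIMIT 20;

-- next page: continue strictly after the last af value of the previous page
SELECT chrom, pos, ref, alt, af
FROM variant_by_af
WHERE af_bucket = 0 AND af > 0.012 AND af <= 0.05
LIMIT 20;

One caveat: the strict > skips any remaining rows that share the boundary AF value, so ties at the page boundary need extra handling.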
Has anyone out there encountered this issue and solved it before? I would really appreciate an example of how to implement a C* solution using Java web services to display a large number of results that have to be paginated through.
You may want to explore and use PlayOrm for Cassandra, as it can solve your problems of timeout limitation and pagination. PlayOrm returns a cursor when you query; as your first page reads in the first 20 results and displays them, the next page can just use the same cursor in your session and it picks up right where it left off without rescanning the first 20 rows again.
Visit http://buffalosw.com/wiki/An-example-to-begin-with-PlayOrm/ for a cursor example and http://buffalosw.com/products/playorm/ for all features and more details about PlayOrm.
Related
Currently I'm generating unique ids for rows in my database using int and auto_increment. These ids are public-facing, so in the URL you can see something like https://example.com/path/1 or https://example.com/path/2.
After talking with another engineer they've advised me that I should use randomly generated ids so that they're not guessable.
How can I generate a unique ID without doing a for loop over the database each time to make sure it's unique? Take Stripe, for example: all of their ids look like price_sdfgsdfg or prod_iisdfgsdfg. What's the best way to generate unique ids like these for rows?
Without knowing which language or database you're using, the simplest way is to use UUIDs.
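As a minimal sketch (MySQL-flavoured, since auto_increment and InnoDB come up below; the table and column names are made up), the database itself can generate the id, and a Stripe-style id is just a prefix plus a random token:

-- illustrative names only
CREATE TABLE things (
    id   CHAR(36) NOT NULL PRIMARY KEY,
    name VARCHAR(100)
);

-- let the database generate the UUID
INSERT INTO things (id, name) VALUES (UUID(), 'example');

-- a Stripe-style id: a short prefix plus part of a random UUID
INSERT INTO things (id, name)
VALUES (CONCAT('prod_', LEFT(REPLACE(UUID(), '-', ''), 14)), 'another example');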
To avoid downloading all existing unique keys from the database and then looping over them, simply try to INSERT INTO whichever table you are using.
If the insert fails (e.g. with an exception), the value is taken; generate another one and continue.
If the insert succeeds, break out of the loop.
This only works when you have a column that is NOT NULL and UNIQUE.
That's how I "know" without looping over the whole database of IDs, or downloading them into local memory, etc.
Using auto_increment won't lead to duplicates, because while the table is in use it is locked and the next available number in the sequence is handed out, which is the beauty of databases.
SQL example (MySQL, SQLite, MariaDB):
CREATE TABLE `my_db`.`my_table` ( `unique_id` INT NOT NULL , UNIQUE (`unique_id`)) ENGINE = InnoDB;
Insert a unique_id:
INSERT INTO `my_table` (`unique_id`) VALUES ('999999999');
Great, we have a row. Now try the same value again:
INSERT INTO `my_table` (`unique_id`) VALUES ('999999999');
Error:
#1062 - Duplicate entry '999999999' for key 'unique_id'
So that value is taken: generate a new one and retry.
If these are public URLs and the content is sensitive, then I definitely do not recommend ints, as someone can trivially guess 1 through 99999999... etc.
In any language, have a look at /dev/urandom.
In shell/bash scripts, I might use uuidgen:
9dccd646-043e-4984-9126-3060b4ced180
In Python, I'll use pandas to derive an id by hashing each row:
import pandas as pd
# assumes df is an existing DataFrame; hash each row and use the hashes as the index
df.set_index(pd.util.hash_pandas_object(df, encoding='utf8'), drop=True, inplace=True)
df.index.rename('hash', inplace=True)
Lastly, UUIDs aren't perfect: they only use the characters a-f and 0-9 (all lowercase, plus hyphens), but they are easy to generate: every language has an implementation.
In JavaScript you may want to check out some secure open source apps, for example Jitsi: https://github.com/jitsi/js-utils/blob/master/random/roomNameGenerator.js where they concatenate random words:
E.g. Satisfied-Global-Architectural-Bitter
I have a column called "status" in PostgreSQL. At first it was "status_id" of type integer. The values were kept on the client, so there was no table on the server called statuses where I'd keep those statuses and then do an inner join with the first table.
I used to send the ids of the statuses from the client (the client had the names). However, at some point I realised I'd better make the server hold those statuses. Not in a separate table, but in the first one, and I want to make them strings. So the initial table will have a status column of type string (varchar, to be more specific). I read it wouldn't be that slow.
In general, is this a good idea? I suppose it is, because doing an inner join (in case I kept statuses in a separate table) each time is expensive, as is sending ids from the client.
1) The only concern I have is that the status column should perhaps be of type char, not varchar. That should make it more efficient, I suppose. Is that so?
2) If the first point is correct, then I'm not sure I'll be able to name all the statuses using exactly the same number of characters, say 5 characters. Some of them might be longer, some shorter. How can I solve this?
UPDATE:
It's not denormalization because I'm talking about one single table. There is not, and has never been, a second table called Statuses with the fields (id, status_name).
What I'm trying to convey is that I could use char(n) for status_name and also add an index on it. Then it should be fast enough. However, it may or may not be possible to name all the statuses with exactly (n) characters, and that's the only concern.
I don't think using char or varchar instead of an integer is a good idea. It is hard to predict how much slower it will be than an integer PK, but this design will be slower, and the impact will be worse when you join larger tables. If you can, use ENUM types instead.
http://www.postgresql.org/docs/9.2/static/datatype-enum.html
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
name | current_mood
------+--------------
Moe | happy
(1 row)
The PostgreSQL varchar and char types are very similar. The internal implementation is the same; char can (paradoxically) be a little bit slower due to the padding with spaces.
I'd go one step further. Never use the outdated data type char(n), unless you know you have to (for compatibility or some rare exotic reason). The type is utterly useless in a modern database. Padding strings with blank characters is nonsense, and if you have to do it, you can do it in a cheaper fashion with rpad() on data retrieval.
SELECT rpad('short', 10) AS char_10_string;
varchar is basically the same as text and allows a length specifier: varchar(n). I generally use just text. If I need to limit the length, I use a CHECK constraint instead.
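A minimal sketch of that approach (the table and column names are just placeholders):

-- plain text plus a CHECK constraint instead of char(n) / varchar(n); names are placeholders
CREATE TABLE my_table (
    my_table_id serial PRIMARY KEY,
    status      text NOT NULL CHECK (length(status) <= 20)
);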
Whenever you can use a simple integer (or enum) instead, that's a bit smaller and faster in every respect. Consider @Pavel's answer for enum.
As for:
because doing inner join (...) each time is expensive
Well, it carries a small cost, but it's generally cheaper than redundantly saving the text representation of the status instead of a much cheaper integer in the main table. That kind of rumor is spread by people having trouble understanding the concept of database normalization. The enum type is a compromise here, for relatively static sets of values.
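For reference, a minimal sketch of the normalized layout being referred to (table and column names are invented for illustration):

-- tiny lookup table plus an integer foreign key in the main table (illustrative names)
CREATE TABLE status (
    status_id   int  PRIMARY KEY,
    status_name text NOT NULL UNIQUE
);

CREATE TABLE main_table (
    main_id   serial PRIMARY KEY,
    status_id int    NOT NULL REFERENCES status
);

-- the join back to the names is cheap because status is a very small table
SELECT m.main_id, s.status_name
FROM main_table m
JOIN status s USING (status_id);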
Hope you're having a nice day. I'm learning how to use functions with SQL*Loader and I have a question about it. Let's say I have this table:
table a
--------------
code
name
dept
birthdate
secret
The data.csv file contains this data:
name
dept
birthdate
and I'm using this control file to load the data with SQL*Loader:
LOAD DATA
INFILE "data.csv"
APPEND INTO TABLE a
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(code "getCode(:name,:dept)", name, dept, birthdate, secret "getSecret(getCode(:name,:dept),birthdate)")
This works like a charm: it gets the values from my getCode and getSecret functions. However, I would like to reference the previously calculated value (from getCode) so I don't have to nest functions in getSecret, like this:
getSecret(getCode(:name,:dept), birthdate)
I've tried to do it like this:
getSecret(:code, birthdate)
but it gets the original value from the file (meaning null) and not the value calculated by the function (I guess because it is done on the fly). So my question is whether there is a way to avoid these nested calls for previously calculated values, so I don't lose performance recalculating the same values over and over again (the real table I'm using is about 10 times bigger and nests a lot of functions for these previously calculated values, so I guess that's hurting performance).
Any help would be appreciated. Thanks!
Update:
Sorry, but I haven't used external tables before (kinda new here). How could I implement this using those tables, considering all the calculated values I need to get from functions I developed? (I tried a trigger, see "SQL Loader, Trigger saturation?", and it killed the database...)
I'm not aware of a way of doing this.
If you switched to using external tables you'd have a lot more freedom for this sort of thing -- common table expressions, leveraging subquery caching, that sort of stuff.
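As a hedged sketch of that route (the external table name, the data_dir directory object and the date mask are assumptions), the point is that getCode() is evaluated once per row in an inline view and its result reused for getSecret():

-- expose the CSV as an external table; a directory object (here data_dir) must already exist
CREATE TABLE a_ext (
    name      VARCHAR2(100),
    dept      VARCHAR2(100),
    birthdate VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY data_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('data.csv')
)
REJECT LIMIT UNLIMITED;

-- getCode() is computed once per row in the inline view and reused for getSecret()
-- (the optimizer may merge the view; the NO_MERGE hint asks it not to)
INSERT INTO a (code, name, dept, birthdate, secret)
SELECT v.code,
       v.name,
       v.dept,
       v.birthdate,
       getSecret(v.code, v.birthdate)
FROM (
    SELECT /*+ NO_MERGE */
           getCode(name, dept)              AS code,
           name,
           dept,
           TO_DATE(birthdate, 'YYYY-MM-DD') AS birthdate  -- date mask is an assumption
    FROM a_ext
) v;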
I am implementing full text search in postgres.
I would like to search all posts in my system. The posts fulltext index is an amalgamation of the post title and post body.
I have two ways of achieving this:
create a tsvector column in the posts table, trigger an update to it.
create a second table (posts_search) with a post_id and tsvector column containing the index data.
create a simple gin index ... (out of the question, because my real-world problem needs data from multiple tables for the index)
What is going to perform better, considering I sometimes need to filter the search down by other attributes in the table (like deleted_at is null and so on)?
Is it a better approach to keep the tsvector column in the same table as the data (side effect: select * now sucks) or in a separate table (side effects: a join is required and index filtering is complicated)?
In my experiments, the typical size of a tsvector column is about 1% of the size of the text field the tsvector was computed from using to_tsvector().
With this in mind, storing the tsvector column in another table should provide a performance benefit. For example, even if you do not use SELECT * (and you shouldn't, really), any seqscan over the original single table will still have to load the pages which contain the original text. If you offload the tsvector field to a separate table, page loading will be up to 100x faster.
In other words, I would favor the second solution of offloading the tsvector field to a separate table. Or, alternatively, offloading posts (the original text) deeper into your table hierarchy (but I guess it is almost the same thing).
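A minimal sketch of that second option, assuming a posts(id, title, body, deleted_at) table (names taken from the question or invented):

-- separate search table holding only the key and the tsvector
CREATE TABLE posts_search (
    post_id  integer  PRIMARY KEY REFERENCES posts (id),
    document tsvector NOT NULL
);

CREATE INDEX posts_search_document_idx ON posts_search USING gin (document);

-- initial population; keeping it current needs a trigger or application logic
INSERT INTO posts_search (post_id, document)
SELECT id, to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
FROM posts;

-- searching joins back to posts for the extra filters (deleted_at etc.)
SELECT p.*
FROM posts p
JOIN posts_search s ON s.post_id = p.id
WHERE s.document @@ plainto_tsquery('english', 'search terms')
  AND p.deleted_at IS NULL;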
Note that for full text search to work, the original text is not necessary. You may even want not to store it in the database at all, or store it in a highly compressed format (and not necessarily easily accessible by SQL routines). It would work as long as something can create the tsvector from the original text, and update it when the text changes.
I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
Please let me know: if the same kind of situation arises and you don't want to keep so many columns in a single table, what other options are there?
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is ensuring most columns have valid default values (null, today, etc.). That will simplify inserts.
Also ensure your code always specifies the columns (i.e. no "select *"). Any kind of future optimization will involve indexes on a subset of the columns.
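A small sketch of both points (the table and column names are made up):

-- illustrative wide table, abbreviated to a few of its ~52 columns
CREATE TABLE wide_table (
    wide_id    int PRIMARY KEY,
    name       text,
    dept       text,
    created_on date DEFAULT CURRENT_DATE,
    notes      text DEFAULT NULL
    -- ... the remaining columns, each with a sensible default
);

-- the insert only names the columns that matter; the rest fall back to their defaults
INSERT INTO wide_table (wide_id, name, dept) VALUES (1, 'Joe', 'R&D');

-- queries list columns explicitly instead of SELECT *
SELECT wide_id, name, dept, created_on FROM wide_table WHERE wide_id = 1;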
One approach we used once is to split your table into two tables. Both of these tables get the primary key of the original table. In the first table you put your most frequently used columns, and in the second table you put the lesser used columns. Generally the first one should be smaller. You can now speed things up in the first table with various indexes. In our design, we even had the first table running on the MEMORY engine (RAM), since we only had read queries. If you need a combination of columns from table1 and table2, you join both tables on the primary key.
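A hedged sketch of that split (the names are invented); both halves share the original primary key and are joined back together when needed:

-- frequently used columns (illustrative names)
CREATE TABLE customer_hot (
    customer_id int PRIMARY KEY,
    name        text,
    status      text
    -- ... the handful of columns most queries touch
);

-- rarely used columns, keyed by the same primary key
CREATE TABLE customer_cold (
    customer_id int PRIMARY KEY REFERENCES customer_hot (customer_id),
    notes       text,
    legacy_ref  text
    -- ... everything else
);

-- when both halves are needed, join on the shared key
SELECT h.name, h.status, c.notes
FROM customer_hot h
JOIN customer_cold c ON c.customer_id = h.customer_id;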
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
Which means that your table is in third normal form (or perhaps BCNF). That being the case it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key, but there may also be dependencies between columns: e.g., there are multiple values of COL42 associated with each value of COL23. A join dependency means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF. And it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking. Because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
1        NAME  Joe
1        AGE   44
1        WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact it behaves a bit like a column-organized DB: when you only need a few columns you pull in only the rows for those items (less I/O) rather than an entire wide row (more I/O).
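As an illustration of the "harder to use" part, reading a few "columns" back out of the IRP table typically means conditional aggregation (a sketch against the IRP_TABLE layout above; the item names come from the example output):

-- turn IRP rows back into columns for one key (sketch only)
SELECT key_col,
       MAX(CASE WHEN item = 'NAME' THEN value END) AS name,
       MAX(CASE WHEN item = 'AGE'  THEN value END) AS age
FROM irp_table
WHERE key_col = 1
  AND item IN ('NAME', 'AGE')   -- only the "columns" this query needs are read
GROUP BY key_col;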