nested Rowkey in Hbase tables - hadoop

I have a weather database with five tables: province, city, station, instantHarvestInfo and dailyHarvestInfo.
The relations between the tables are parent-child:
(province, city): R(1,m)
(city, station): R(1,m)
(station, instantHarvestInfo): R(1,m)
(station, dailyHarvestInfo): R(1,m)
I want to put all of them into one big table in HBase and create a column family for each one, but I don't know how to define my row key. I think I need a nested row key, where each segment of the key relates to one column family and gives me the information for that family. How can I define it?
Please help me.

Hi there.
I guess you are going to store a huge amount of instantHarvestInfo and dailyHarvestInfo for each station.
Since there is a parent-child relationship in your data model, I think you could design the schema as:
-------------------------------------------------------------------------
Row key: province + city + station + timestamp
--------+---------------------+------------------------------------------
Family  | Qualifier           | Value
--------+---------------------+------------------------------------------
        | instantHarvestInfo  | "value of instantInfo"
F       +---------------------+------------------------------------------
        | dailyHarvestInfo    | "value of dailyInfo"
--------+---------------------+------------------------------------------
Note that there is only one column family, because we should always keep the number of column families as small as possible.
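A minimal sketch of how such a composite key could be built, in plain Python with no HBase client involved. The `|` separator, the zero-padded millisecond timestamp, the helper names and the sample province/city/station values are all assumptions for illustration, not part of the original answer. The point is that rows of one station share a common key prefix, so they sit next to each other in sorted key order and can be read with a prefix scan.

```python
def make_row_key(province: str, city: str, station: str, ts_millis: int) -> bytes:
    """Build 'province|city|station|timestamp' (hypothetical layout)."""
    for part in (province, city, station):
        if "|" in part:
            raise ValueError("key parts must not contain the separator")
    # Zero-padding keeps lexicographic byte order equal to numeric time order.
    return f"{province}|{city}|{station}|{ts_millis:013d}".encode("utf-8")


def station_prefix(province: str, city: str, station: str) -> bytes:
    """Prefix shared by every row of one station (usable as a scan prefix)."""
    return f"{province}|{city}|{station}|".encode("utf-8")


# Sorting the keys mimics how a store ordered by row key would lay them out.
keys = sorted([
    make_row_key("Tehran", "Tehran", "S2", 1_700_000_000_000),
    make_row_key("Fars", "Shiraz", "S1", 1_700_000_060_000),
    make_row_key("Fars", "Shiraz", "S1", 1_700_000_000_000),
    make_row_key("Fars", "Shiraz", "S10", 1_700_000_000_000),
])

# All readings of station S1, oldest first; S10 is excluded by the trailing '|'.
prefix = station_prefix("Fars", "Shiraz", "S1")
s1_rows = [k for k in keys if k.startswith(prefix)]
```

The trailing separator in the prefix matters: without it, a scan for station `S1` would also pick up `S10`.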

Related

How does a multi-column index work in oracle?

I'm building a table to manage some articles:
Table
| Company | Store | Sku | ..OtherColumns.. |
| 1 | 1 | 123 | .. |
| 1 | 2 | 345 | .. |
| 3 | 1 | 123 | .. |
Scenario
Most of the time, company, store and sku will all be used to SELECT rows:
SELECT * FROM stock s WHERE s.company = 1 AND s.store = 1 AND s.sku = 123;
But sometimes the company will not be available when accessing the table:
SELECT * FROM stock s WHERE s.store = 1 AND s.sku = 123;
And sometimes all articles will be selected for a store:
SELECT * FROM stock s WHERE s.company = 1 AND s.store = 1;
The Question
How to properly index the table?
I could add three indexes, one for each SELECT, but I think Oracle should be smart enough to reuse other indexes.
Would an Index "Store, Sku, Company" be used if the WHERE-condition has no company?
Would an Index "Company, Store, Sku" be used if the WHERE-condition has no company?
You can think of the index key as conceptually being the 'concatenation' of all of the columns, and generally you need the leading element of that key in the predicate in order to benefit from the index. So for an index on (company, store, sku):
WHERE s.company = 1 AND s.store = 1 AND s.sku = 123;
can potentially benefit from the index
WHERE s.store = 1 AND s.sku = 123;
is unlikely to benefit (but see footnote below)
WHERE s.company = 1 AND s.store = 1;
can potentially benefit from the index.
In all cases I say "potentially" because it is a costing decision by the optimizer. For example, if I have only (say) 2 companies and 2 stores, then a query on company and store could use the index but is probably better off not doing so, because the volume of data to be read is still a large percentage of the table.
In your example, an index on (store, sku, company) might be "good enough" to satisfy all three queries, but that depends on the distribution of the data. You are thinking the right way, though: get as much value from as few indexes as possible.
Footnote: There is a thing called a "skip scan" where we can get value from an index even if you do not specify the leading column(s), but you will typically only see that if the number of distinct values in those leading columns is low.
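The "concatenated key" idea can be sketched with a sorted list of tuples standing in for the index. This is only an illustration in Python, not how Oracle implements its B-tree; the function names and sample rows are made up, and skip scans are not modeled. A predicate on the leading columns becomes a binary-search range seek, while a predicate that omits the leading column has to look at every entry.

```python
from bisect import bisect_left, bisect_right

# Index entries sorted by (company, store, sku), like a composite index key.
index = sorted([(1, 1, 123), (1, 2, 345), (3, 1, 123), (3, 2, 777)])


def seek_prefix(entries, prefix):
    """Range-seek all entries whose leading columns equal `prefix`."""
    lo = bisect_left(entries, prefix)
    # Padding the prefix with +inf sorts after every entry sharing the prefix.
    hi = bisect_right(entries, prefix + (float("inf"),) * (3 - len(prefix)))
    return entries[lo:hi]


def scan_without_leading(entries, store, sku):
    """No leading column in the predicate: every entry must be examined."""
    return [e for e in entries if e[1] == store and e[2] == sku]


company_store = seek_prefix(index, (1, 1))        # WHERE company = 1 AND store = 1
full_key = seek_prefix(index, (3, 1, 123))        # all three columns given
store_sku = scan_without_leading(index, 1, 123)   # WHERE store = 1 AND sku = 123
```

With the columns reordered to (store, sku, company), the third query would become a prefix seek as well, which is the reasoning behind considering that column order.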
First: do you need an index at all? Indexes are not free. If your table is small enough, perhaps you don't need one.
Second: what does the data look like? You have the store column in every scenario, so I can imagine a situation in which filtering on store alone narrows the data enough to be good enough for you.
However, if you want the maximum reasonable performance benefit, you need two indexes:
(store, sku, company)
(store, company)
or
(store, company, sku)
(store, sku)
Would an Index "Store, Sku, Company" be used if the WHERE-condition has no company?
Yes
Would an Index "Company, Store, Sku" be used if the WHERE-condition has no company?
Probably not, but I can imagine scenarios in which it might happen (though not as an index seek, which is the primary purpose of an index).
An index divides the data in column order: rows are grouped and sorted by the first column, then within each group they are grouped and sorted by the second column, and so on.
So when the filter does not use the first column of the index, the DB has to visit all of the "subgroups" anyway.
I recommend reading about indexes in general. Start with https://en.wikipedia.org/wiki/B-tree and try to draw how it behaves on paper, or write a simple program that manages a simplified version. Then read about indexes in databases; any DB would be good enough.

How to design querying multiple tags on analytics database

I would like to store custom tags on each user purchase transaction. For example, if a user bought shoes, the tags are "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags the seller is interested in querying to understand sales.
My idea is that whenever a new tag comes in, I create a new code for it (something like a hash code, but sequential). Codes start with the 26 letters "a" to "z" and then continue "aa", "ab", "ac", ... "zz" and so on. All the tags given for one transaction are then kept in a single column called tag (varchar), separated by "|".
Let us assume mapping is (at application level)
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So when storing the purchase transaction above, the tag column will be tag="|a|z|ay|bc|cq|". The seller can then search for the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. The problem is that I cannot use an index (a sort key in Redshift) for a LIKE pattern that starts with %. How can I solve this, given that I might have 100 million records? I don't want a full table scan.
Is there any solution to this?
Update_1:
I have not used the bridge table (cross-reference table) approach, since I want to GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but wouldn't a bridge table give me two rows? Then my sum() would be doubled.
I got a suggestion like the one below:
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' and trans_id
= tr.trans_id) in the WHERE clause, once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query)
I have not used this either, since I have to combine AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOE" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not used a bit field, since I don't know whether Redshift supports it. Also, assuming my system will have a minimum of 3500 tags and I allocate one bit for each, that results in roughly 437.5 bytes (3500 / 8) per transaction, even though at most 5 tags can be given for a transaction. Is there any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tag column, and indexing those.
so something like this
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching shoes (ay=51) is
minTag <= 51 AND maxTag >= 51 AND tag LIKE '%|ay|%'
And the query for searching shoes (ay=51) AND SIZE_12 (cq=95) is
minTag <= 51 AND maxTag >= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files are being loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the objects using the AWS s3api.
Example below:
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Addidas,Value=AY}]"
capture tags dynamically by sending and as a parameter
2. Load the tags to DynamoDB as a metadata store.
3. Load the data to Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, where a position holds 1 if the purchase is marked with the corresponding tag and 0 if it is not. Every row then has a sequence of 0s and 1s whose length equals the number of tags you have. This sequence is sortable. You would still need to look into the middle of it, but you know exactly which position to look at, so you don't need LIKE, just SUBSTRING. As a further optimization, you can convert the bit mask to an integer value (it will be unique for each sequence) and match on that, but AFAIK Redshift doesn't support that out of the box yet, so you would have to define the rules yourself.
UPD: It looks like the best option here is to keep the tags in a separate table and create an ETL process that unwraps them into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Lookups for orders with a particular tag, and further aggregations over those orders, should then be efficient. There is no silver bullet for optimizing this in a flat table; at least I don't know of one that would not bring a lot of unnecessary complexity compared with the "relational" solution.
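A small sketch of the separate-table layout, using SQLite through Python's `sqlite3` module as a stand-in for Redshift (so distribution and sort keys are not modeled). The table names `orders` and `order_tag`, the `amount` column and the sample rows are assumptions for illustration. It shows one way to express "SHOES" AND ("NIKE" OR "ADIDAS") with one EXISTS per condition, so that each order contributes a single row and SUM() is not inflated when several tags match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE order_tag (
    order_id INTEGER,
    tag_id   TEXT,
    PRIMARY KEY (tag_id, order_id)
);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0), (3, 70.0);
INSERT INTO order_tag VALUES
    (1, 'SHOES'), (1, 'NIKE'), (1, 'SPORTS'),
    (2, 'SHOES'), (2, 'ADIDAS'),
    (3, 'SHIRT'), (3, 'NIKE');
""")

# Each EXISTS is a yes/no test per order, so no order row is duplicated.
total, = conn.execute("""
SELECT SUM(o.amount)
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_tag t
              WHERE t.order_id = o.order_id AND t.tag_id = 'SHOES')
  AND EXISTS (SELECT 1 FROM order_tag t
              WHERE t.order_id = o.order_id
                AND t.tag_id IN ('NIKE', 'ADIDAS'))
""").fetchone()
```

Order 1 carries both SHOES and NIKE but is still counted once; order 3 has NIKE without SHOES and is excluded.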

Cassandra slow get_indexed_slices speed

We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc and special indexed column 'indexTimestamp'.
This column contains time rounded to hours.
So, when we want to get some records, we use get_indexed_slices() with the first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions on host, errorlevel, etc.
When getting records by indexTimestamp alone, everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra works for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records matching the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? There are no more than 250,000 records per indexTimestamp. Isn't it possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine ( Windows 7 ) with 4 CPUs and 4 GBs memory.
You have to bear in mind that Cassandra is very bad at this kind of query. Indexed-column queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query. It is a key-value storage system. To understand that, please have a quick look at http://howfuckedismydatabase.com/
The most basic pattern to help you is bucket rows combined with range slice queries.
Let's say you have the object
user : {
name : "XXXXX"
country : "UK"
city : "London"
postal_code :"N1 2AC"
age : "24"
}
and of course you want to query by city OR by age (AND and OR together would need yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for the age search.
The range query for age EQ 24 would be
get_range_slice(row = "bucket_20_to_25", from = "24_", to = "24`")
As a note, the backtick is the character immediately after the underscore in ASCII, so with a byte-ordered comparator this range gives you effectively all the columns that start with "24_".
This also allows you to query for ages between 21 and 24, for example.
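The bucket-row idea can be simulated in plain Python, with a sorted list standing in for the columns of one wide row. No Cassandra client is used; `range_slice`, `next_prefix` and the sample users are hypothetical names for illustration, and the sketch assumes column names compare in byte order. The upper bound of a prefix range is obtained by incrementing the last character of the prefix.

```python
from bisect import bisect_left

# One wide "bucket" row: column names kept sorted, as a byte-ordered
# comparator would keep them. Names follow the "<age>_<user id>" pattern.
bucket_20_to_25 = sorted([
    ("21_alice", {"age": 21}),
    ("24_bob", {"age": 24}),
    ("24_carol", {"age": 24}),
    ("25_dave", {"age": 25}),
])


def range_slice(row, start, end):
    """Return columns whose name lies in [start, end), via binary search."""
    names = [name for name, _ in row]
    return row[bisect_left(names, start):bisect_left(names, end)]


def next_prefix(prefix):
    """Smallest string greater than every string starting with `prefix`."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


age_24 = range_slice(bucket_20_to_25, "24_", next_prefix("24_"))
age_21_to_24 = range_slice(bucket_20_to_25, "21_", next_prefix("24_"))
```

Both lookups touch a single row and a contiguous run of columns, which is why this layout avoids the filter-after-index pattern from the question.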
hope it was useful

Dynamic database design for variable length spatio-temporal data in Oracle (need a schema design)

I am currently working on a research project in which I need to store spatio-temporal data and analyze it efficiently. The exact requirements are given below.
The research concerns meteorological data, so the data attributes are temperature, humidity, pressure, wind speed, wind direction, etc. The number of attributes is not known in advance; depending on requirements we may need to add more (so the table has a dynamic set of attributes with different data types). The data is captured at various locations, at various heights, over a certain time span and at a certain time interval.
So what would be the best way to design a schema for these requirements? We need to be able to find relations in the data efficiently.
The purpose of the project is not only to store the data but also to manipulate it.
Sample data in table format -
location | time | height | pressure | temperature | wind-direction | ...
L1 | 2011-12-18 08:04:02 | 7 | 1009.6 | 28.3 | east | ...
L1 | 2011-12-18 08:04:02 | 15 | 1008.6 | 27.9 | east | ...
L1 | 2011-12-18 08:04:02 | 27 | 1007.4 | 27.4 | east | ...
L1 | 2011-12-18 08:04:04 | 7 | 1010.2 | 28.4 | north-east | ...
L1 | 2011-12-18 08:04:04 | 15 | 1009.4 | 28.2 | north-east | ...
L1 | 2011-12-18 08:04:04 | 27 | 1008.9 | 27.6 | north-east | ...
L2 | 2011-12-18 08:04:02 | ..... so on
Here I need to design a schema for the above sample data, where location is a spatial location that can be implemented using Oracle's MDSYS.SDO_GEOMETRY type.
Constraints are:
1. The number of attributes (table columns) is unknown during development. At runtime any new attribute (say humidity, refractive index, etc.) can be added, so we can't design an attribute-specific table schema.
    1.1) For this constraint I thought of using a schema like
           tbl_attributes(attr_id_pk, attr_name, attr_type);
           tbl_data(loc, time, attr_id_fk, value);
     In this design the attribute value must be of type varchar, and I thought of casting it as required (not a good idea at all).
     But finding relational data with this schema is very difficult using SQL queries only. For example, I want to find
          1.1.1) the average pressure for location L1 when the wind direction is east and the temperature is between 27 and 28
          1.1.2) the locations where the pressure is at its maximum at height 15.
     1.2) I am also thinking of editing the table schema at runtime, which again is not a good idea, I think.
2. We will use a loader application, which will take care of this dynamic insertion depending on the schema (whatever it may be).
3. We need to retrieve statistical data efficiently, as in the examples given above [1.1.*].
I am not completely sure I understand what you mean when you say that
The no of attributes (table column) is unknown during development. In
runtime any new attribute(let say - humidity, refractive index etc)
can be added.
First of all, I suppose that this does not really happen at random: i.e. when you get a new batch of data from the field, you know (before importing) that it has an extra dimension or two. Correct?
Also, the fact that a new data batch includes "refractive index" will not make the older data magically acquire a proper value for this dimension.
Therefore I would go for a classical object-to-RDBMS mapping where you have:
a header table with the things that exist for every measurement, i.e. time and space, possibly the source (i.e. the lab, sensor or team which provided the data) and an autogenerated key.
one or more detail tables where the values are defined as proper fields.
Example:
Header
location | time | height | source |Key |
L1 | 2011-12-18 08:04:02 | 7 | team-1 | 002020013 |
L1 | 2011-12-18 08:04:02 | 15 | team-1 | 002020017 |
L1 | 2011-12-18 08:04:02 | 27 | Lab-X | 002020018 |
L1 | 2011-12-18 08:04:04 | 7 | Lab-Y | 002020021 |
L1 | 2011-12-18 08:04:04 | 15 | Lab-X | 002020112 |
Atmospheric data (basic)
Key | pressure | temp | wind-dir |
002020013 | 1009.6 | 28.3 | east |
002020017 | 1019.3 | 29.2 | east |
002020018 | 1011.6 | 26.9 | east |
Light-sensor data
Key | refractive-ind | albedo | Ultraviolet |
002020017 | 79.6 | .37865 | 7.0E-34 |
002020018 | 67.4 | .85955 | 6.5E-34 |
002020021 | 91.6 | .98494 | 8.1E-34 |
In other words: every different set of data uses one or more subtables (which you can add "dynamically" if needed), and you can still write queries by standard means. You just have to join the subtables, using the reference keys to keep them consistent. This works wherever the data allows it: if you want to analyze by wind direction AND refractive index, you can, but only for the sets of data which have both values.
I believe this is more efficient than using text fields with CSV inside, data blobs, or key-value associations.
I would definitely go with 1.2 (edit table schema during runtime), at least to begin with. Any sufficiently advanced configuration is indistinguishable from programming; don't think you can magically avoid making changes to your program.
Don't be scared of alter table. Yes, the upfront costs are higher - you may need a process (not just a program) to ensure your schema stays clean. And there are some potential locking problems (that have solutions). But if you do it right you only have to pay the price once for each change.
With a completely generic solution you will pay a small price with every query. Your queries will be complicated, slower, ugly, and more likely to fail. You can never write a query like select avg(value) ..., it may or may not work, depending on how the data is accessed. You can use a PL/SQL function to catch exceptions, or use inline views and hints to force a specific access pattern. Either way, your queries are more complicated and slower, and you have to make sure that everybody understands these problems before they use the data.
And with a generic solution the optimizer will suck because it knows nothing about your data. Oracle can't predict how many rows will be returned by where attr_name = 'temperature' and is_number(value) = 28.4. But it can make a very good guess for where temperature = 28.4. You may have significantly more bad plans (i.e. slow queries) with generic columns.
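To make the difference in query complexity concrete, here is a small comparison using SQLite through Python's `sqlite3` module rather than Oracle, so nothing here says anything about Oracle's optimizer. It is a sketch under simplifying assumptions: the generic table stores `attr_name` directly instead of a foreign key, a `height` column is added, the timestamp column is called `ts`, and the `measurements` table is a made-up typed equivalent. Both queries answer example 1.1.1 from the question (average pressure at L1 when the wind is east and the temperature is between 27 and 28).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generic layout: one row per (event, attribute), value stored as text.
CREATE TABLE tbl_data (loc TEXT, ts TEXT, height INTEGER,
                       attr_name TEXT, value TEXT);
INSERT INTO tbl_data VALUES
    ('L1', '08:04:02', 7,  'pressure', '1009.6'),
    ('L1', '08:04:02', 7,  'temperature', '28.3'),
    ('L1', '08:04:02', 7,  'wind_direction', 'east'),
    ('L1', '08:04:02', 15, 'pressure', '1008.6'),
    ('L1', '08:04:02', 15, 'temperature', '27.9'),
    ('L1', '08:04:02', 15, 'wind_direction', 'east'),
    ('L1', '08:04:02', 27, 'pressure', '1007.4'),
    ('L1', '08:04:02', 27, 'temperature', '27.4'),
    ('L1', '08:04:02', 27, 'wind_direction', 'east');

-- Typed layout: one row per event, one real column per attribute.
CREATE TABLE measurements (loc TEXT, ts TEXT, height INTEGER,
                           pressure REAL, temperature REAL,
                           wind_direction TEXT);
INSERT INTO measurements VALUES
    ('L1', '08:04:02', 7,  1009.6, 28.3, 'east'),
    ('L1', '08:04:02', 15, 1008.6, 27.9, 'east'),
    ('L1', '08:04:02', 27, 1007.4, 27.4, 'east');
""")

# Generic layout: one self-join per attribute, plus casts on every value.
generic_sql = """
SELECT AVG(CAST(p.value AS REAL))
FROM tbl_data p
JOIN tbl_data t ON t.loc = p.loc AND t.ts = p.ts AND t.height = p.height
               AND t.attr_name = 'temperature'
JOIN tbl_data w ON w.loc = p.loc AND w.ts = p.ts AND w.height = p.height
               AND w.attr_name = 'wind_direction'
WHERE p.attr_name = 'pressure' AND p.loc = 'L1'
  AND w.value = 'east'
  AND CAST(t.value AS REAL) BETWEEN 27 AND 28
"""

# Typed layout: the same question is a plain filter and aggregate.
typed_sql = """
SELECT AVG(pressure) FROM measurements
WHERE loc = 'L1' AND wind_direction = 'east'
  AND temperature BETWEEN 27 AND 28
"""

generic_avg, = conn.execute(generic_sql).fetchone()
typed_avg, = conn.execute(typed_sql).fetchone()
```

Both queries select the readings at heights 15 and 27, but the generic version needs a join for every attribute that appears in the question.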
Thank you for the quick response and good guidance. I have taken some concepts from both answers and decided to go with a mixed model. I don't know whether I am on the right path or not, so I would like comments on the model. Below I describe the complete conceptual model, with a MySQL code snippet.
Conceptual model
For dynamicity (the number of columns is not defined in advance) I have created 4 tables as follows:
geolocation(locid int, name varchar, geometry spatial_type) - stores information about a particular location, possibly defined with a spatial feature.
met_loc_event(loceventid int, locid* int, record_time timestamp, height float) - identifies a particular event at a place at a certain height.
metfeatures(featureid int, name varchar, type varchar) - stores feature (i.e. column) details with a data type; the type field helps to cast the data as required.
metstore(loceventid* int, featureid* int, value varchar) - stores an atomic value for a feature at a particular time.
Up to this point the design is column-oriented, to capture the dynamic nature of the table. But as you suggest, this is not a good design for querying the database (some things, like arithmetic functions, will not work). It is also not good for performance.
For efficient querying (to avoid too much joining and to avoid casting values during a query) I extended the model with some helper views, and I wrote a stored procedure to generate the views from the stored data.
First I create a view for each feature (taking the values from the feature table, so the number of feature views initially equals the number of entries) with the help of the met_loc_event, metfeatures and metstore tables. These views hold locid, record_time, height, and the value cast according to the feature type.
Next, from these views, I create a row-oriented view named metrelview, which contains all the relational data row-wise, like a normal table. I plan to run queries against this view, so that query performance will be improved.
This view generation procedure needs to be executed whenever there is an insert, update or delete on the features table.
Below is the MySQL procedure that I have developed for the view generation:
CREATE PROCEDURE `buildModel`()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE fid INTEGER;
    DECLARE fname VARCHAR(45);
    DECLARE ftype VARCHAR(45);
    DECLARE cur_features CURSOR FOR SELECT `featureid`, `name`, `type` FROM `metfeatures`;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    -- User variables (@...) are required because PREPARE only accepts them or literals.
    SET @viewAlias = 'v_';
    SET @metRelView = "metrelview";
    SET @stmtCols = "";
    SET @stmtJoin = "";
    -- Note: DROP VIEW / CREATE VIEW commit implicitly in MySQL, so this
    -- transaction does not make the rebuild atomic.
    START TRANSACTION;
    OPEN cur_features;
    read_loop: LOOP
        FETCH cur_features INTO fid, fname, ftype;
        IF done THEN
            LEAVE read_loop;
        END IF;
        IF fname IS NOT NULL THEN
            -- One view per feature, named v_<feature name>.
            SET @featureView = CONCAT(@viewAlias, LOWER(fname));
            -- Pick the cast expression from the declared feature type.
            IF ftype = 'float' THEN
                SET @featureCastStr = "`value`+0.0";
            ELSEIF ftype = 'int' THEN
                SET @featureCastStr = "CAST(`value` AS SIGNED)";
            ELSE
                SET @featureCastStr = "`value`";
            END IF;
            SET @stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", @featureView, "`");
            SET @stmtCreateView = CONCAT("CREATE VIEW `", @featureView, "` AS SELECT le.`loceventid` AS loceventid, le.`locid`, le.`rectime`, le.`height`, ", @featureCastStr, " AS value FROM `metlocevent` le JOIN `metstore` ms ON (le.`loceventid`=ms.`loceventid`) WHERE ms.`featureid`=", fid);
            PREPARE stmt FROM @stmtDeleteView;
            EXECUTE stmt;
            PREPARE stmt FROM @stmtCreateView;
            EXECUTE stmt;
            -- Accumulate the column list and joins for the combined view.
            SET @stmtCols = CONCAT(@stmtCols, ", ", @featureView, ".`value` AS ", @featureView);
            SET @stmtJoin = CONCAT(@stmtJoin, " ", "LEFT JOIN ", @featureView, " ON (le.`loceventid`=", @featureView, ".`loceventid`)");
        END IF;
    END LOOP;
    -- Rebuild the row-oriented view that joins all feature views.
    SET @stmtDeleteView = CONCAT("DROP VIEW IF EXISTS `", @metRelView, "`");
    SET @stmtCreateView = CONCAT("CREATE VIEW `", @metRelView, "` AS SELECT le.`loceventid`, le.`locid`, le.`rectime`, le.`height`", @stmtCols, " FROM `metlocevent` le", @stmtJoin);
    PREPARE stmt FROM @stmtDeleteView;
    EXECUTE stmt;
    PREPARE stmt FROM @stmtCreateView;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    CLOSE cur_features;
    COMMIT;
END;
N.B. I tried to call the procedure from an event on the features table, so that everything would be automated. But since MySQL does not support dynamic SQL inside a function or trigger, I can't do it automatically.
I would also like criticism before I finalize this as the accepted model. I am not a DBA, so any help on how to improve the performance of the model would be very useful to me.
This sounds like a homework assignment whose underlying subject is: use-cases for abandoning strict normal-form design principles.
The solution to this conundrum is to develop a three-stage solution. Stage 1 is runtime adaptability using the flexible AttributeType, AttributeValue approach, so that rapidly incoming data can be captured and put somewhere temporarily in a quasi-structured manner. Stage 2 involves the analysis of that runtime data to see where the model must be extended with additional columns and validation tables to accommodate any new attributes. Stage 3 is the importing of the as-yet-unimported data into the revised model, which never relaxes its strict datatyping and declarative referential integrity constraints.
As they say: Life, friends, is a trade-off.

Propel: How the "Affected Rows" Returned from doUpdate is defined

In Propel there is the doUpdate function, which returns the number of rows affected by the query.
The question is: if there is no need to update a row (because the value being set is already the same as the field value), will that row be counted as an affected row?
Take for example, I have the following table:
ID | Name | Books
1 | S1oon | Me
2 | S1oon | Me
Let's assume that I write an ORM call equivalent to the following query:
update `new table` set
Books='Me'
where Name='S1oon';
What will doUpdate return? Will it return 0 (because all the Books columns are already Me, hence there is no need to update), or will it be 2 (because there are 2 rows that fulfil the WHERE condition)?
Under the hood, Propel uses PDO's PDOStatement::rowCount() method to return the number of affected rows. So the short answer is that you may well get "2" as you expect here, but the longer answer is that it depends on how the PDO driver implements that function for your specific database. With MySQL in particular, the default is to count only rows whose values actually changed, so an update like this one can report 0 unless the connection is opened with the PDO::MYSQL_ATTR_FOUND_ROWS option.
See the description of rowCount() in the PHP manual for more info.
One other thing to bear in mind is that when Propel calls methods (like save() or delete()) which are expected to return number-of-rows-modified and which may result in more than one row being modified (e.g. if you add a Book and its Author and then cause both to be INSERTed by calling book->save()), you will get the total number of rows modified.
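The count is whatever the database engine reports, which can be checked directly against the engine in use. The sketch below does that for SQLite through Python's `sqlite3` module, not for Propel or PDO; the table name `new_table` is an assumption (the original has a space in it). SQLite counts every row matched by the WHERE clause, even when the new value equals the old one; other engines may count only rows that actually changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE new_table (id INTEGER PRIMARY KEY, name TEXT, books TEXT)"
)
conn.executemany(
    "INSERT INTO new_table VALUES (?, ?, ?)",
    [(1, "S1oon", "Me"), (2, "S1oon", "Me")],
)

# The update sets books to the value it already has in both rows.
cur = conn.execute("UPDATE new_table SET books = 'Me' WHERE name = 'S1oon'")
reported = cur.rowcount  # SQLite reports matched rows, not changed rows
```

Running the equivalent statement against the target database is the most reliable way to learn which convention it follows.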
It will return 2.
