Keeping historic records as well as current overview - performance

I'm using a small collection of web scrapers to get the current GPS location of various devices. I also want to keep historic records. What's the best way of doing this without storing the data twice? For now I have two tables, both looking like this:
Column | Type | Modifiers | Storage | Description
---------+-----------------------------+---------------+----------+-------------
vehicle | character varying(20) | | extended |
course | real | | plain |
speed | real | | plain |
fix | smallint | | plain |
lat | real | | plain |
lon | real | | plain |
time | timestamp without time zone | default now() | plain |
One is named gps and the other gps_log. The function that updates these two does two things: first it performs an INSERT on gps_log, and afterwards an UPDATE OR INSERT (a user-defined function) on gps. However, this looks to me like a pointless case of double storage whose only purpose is easy SELECT access to the current data.
Is there a simple way of only using gps_log and having a function select just the newest entry for each vehicle? Keep in mind that gps_log currently has 1,397,150 rows and grows by roughly 150 rows every 15 minutes, so performance is likely to be an issue.
Using PostgreSQL 8.4 via Perl DBI.

If SELECT performance is paramount, your current solution with redundant storage might not be such a bad idea.
If you get rid of the redundant table, you can help SELECT performance with a multi-column index like:
CREATE INDEX gps_log_vehicle_time ON gps_log (vehicle, time DESC);
Assuming that vehicle is your primary key.
That would make the corresponding query pretty fast:
SELECT *
FROM gps_log
WHERE vehicle = 'foo'
ORDER BY time DESC
LIMIT 1;
To SELECT the last entry for multiple or all rows, use this related technique.
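One common way to do that in PostgreSQL is DISTINCT ON; a minimal sketch against your table, relying on the same (vehicle, time DESC) index:
SELECT DISTINCT ON (vehicle) *
FROM gps_log
ORDER BY vehicle, time DESC;
This returns the newest row per vehicle. With many rows per vehicle it still reads a large part of the index, which is why the redundant gps table can win when SELECT performance is paramount.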
Total storage size would probably grow, though, because the index will be bigger than the redundant table (plus its index) if you have many rows per vehicle.
It might help storage and performance to add a serial column as a surrogate primary key instead of vehicle, especially if you have foreign keys pointing to it.
Aside: don't use time as column name. It's a type name in PostgreSQL and a reserved word in every SQL standard. It is also misleading to name a timestamp column time.

Related

Flink DataSet api with Hbase table input format - read rows multiple times

I am using Flink 1.3.2 with the HBase TableInputFormat from flink-connectors (flink-hbase_2.11), using the DataSet API.
I have an HBase table where the rowkeys are structured as follows:
| RowKey | data |
| 0-someuniqid | data |
| 0-someuniqid | data |
| 2-someuniqid | data |
| 2-someuniqid | data |
| 4-someuniqid | data |
| 5-someuniqid | data |
| 5-someuniqid | data |
| 7-someuniqid | data |
| 8-someuniqid | data |
The rowkey prefix can be 0 to 9 (this is to prevent hotspotting on the HBase nodes). Nothing writes to this table while my test runs.
I have got a job of the form:
tableInputFormat0 = new TableInputFormat("table", 0);
tableInputFormat1 = new TableInputFormat("table", 1);
...
tableInputFormat9 = new TableInputFormat("table", 9);
tableInputFormat0.union(tableInputFormat1).(...).union(tableInputFormat9)
.map(mapFunction())
.rebalance()
.filter(someFilter())
.groupBy(someField())
.reduce(someSumFunction())
.output(new HbaseOutputFormat());
The problem is that when a lot of records are read (around 20 million), the job does not always read the same number of records.
Most of the time it (correctly) reads 20,277,161 rows, but sometimes it reads 20,277,221 or 20,277,171; it is always more, never less. (I get this number via the Flink web dashboard, but the effect also shows in what gets written, i.e. too much data is aggregated by the reduce.)
I cannot make the problem smaller by using a smaller dataset, because it does not happen when running the job against a table of, say, 5 million records. It is hard to identify which records are read multiple times because of the volume.
How can I debug (and solve) this problem?
TableInputFormat is an abstract class and you have to implement a subclass.
I would do two things:
1. Check that each input split is processed just once (this information is written to the JobManager log file).
2. Adapt your input format to count the number of emitted records for each input split (sketched below). The record count and split id should be written to the (TaskManager) log.
This should help to identify whether the problem is due to one (or more) splits being assigned more than once, or due to a bug in the code that processes a split.
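For the second point, here is a rough sketch rather than a drop-in implementation: a TableInputFormat subclass that counts what it emits and logs the count when its split finishes. The tuple type, the prefix-based Scan and the constructor are assumptions about your existing subclass; adapt them to yours.
import java.io.IOException;
import org.apache.flink.addons.hbase.TableInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CountingTableInputFormat extends TableInputFormat<Tuple2<String, String>> {

    private static final Logger LOG = LoggerFactory.getLogger(CountingTableInputFormat.class);

    private final String table;
    private final int prefix;
    private long emitted;   // records emitted for the split currently being read

    public CountingTableInputFormat(String table, int prefix) {
        this.table = table;
        this.prefix = prefix;
    }

    @Override
    protected Scan getScanner() {
        // scan only the rows whose key starts with this prefix, e.g. "3-"
        return new Scan().setRowPrefixFilter(Bytes.toBytes(prefix + "-"));
    }

    @Override
    protected String getTableName() {
        return table;
    }

    @Override
    protected Tuple2<String, String> mapResultToTuple(Result r) {
        emitted++;
        return new Tuple2<>(Bytes.toString(r.getRow()), Bytes.toString(r.value()));
    }

    @Override
    public void close() throws IOException {
        // close() runs once per finished split, so this count is per split and
        // lands in the TaskManager log; compare the per-split sums across runs.
        LOG.info("table={} prefix={} emitted={} records for this split", table, prefix, emitted);
        emitted = 0;
        super.close();
    }
}
If a split's count differs between runs, or a split id shows up twice in the JobManager log, you know which side of the job to dig into.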

Filter after grouping columns in Power BI

I want to accomplish something easy to understand (and maybe easy to do but I can't find a way...).
I have a table which represents the date when a client has bought something.
Let's have this example:
=============================================
Purchase_id | Purchase_date | Client_id
=============================================
1 | 2016/03/02 | 1
---------------------------------------------
2 | 2016/03/02 | 2
---------------------------------------------
3 | 2016/03/11 | 3
---------------------------------------------
I want to create a single number card which shows the average number of purchases per day.
So for this example, the result would be:
Result = 3 purchases / 2 different days = 1.5
I managed to do it by grouping my query by Purchase_date, with a new column holding the number of rows.
It gives me the following query:
==================================
Purchase_date | Number of rows
==================================
2016/03/02 | 2
----------------------------------
2016/03/11 | 1
----------------------------------
Then I put the field Number of rows in a single number card, selecting "Average".
I should specify that I am using DirectQuery with SQL Server.
But the problem is that I want to have a filter on Client_id, and once I do the grouping, I lose this column.
Is there a way to have this Client_id as a parameter?
Maybe even the fact of grouping is not the right solution here.
Thank you in advance.
You can create a measure to calculate this average.
From Power BI's docs:
The calculated results of measures are always changing in response to your interaction with your reports, allowing for fast and dynamic ad-hoc data exploration.
This means that filtering on Client_id will change the measure accordingly.
Here is an easy way of defining this measure:
Result = DISTINCTCOUNT(tableName[Purchase_id]) / DISTINCTCOUNT(tableName[Purchase_date])
With your example data this gives 3 / 2 = 1.5.

Order By any field in Cassandra

I am researching Cassandra as a possible solution for my upcoming project. The more I research, the more I keep hearing that it is a bad idea to sort on fields that were not set up for sorting when the table was created.
Is it possible to sort on any field? If there is a performance impact when sorting on fields that are not part of the clustering key, what is that impact? I need to sort roughly 2 million records in the table.
I keep hearing that it is a bad idea to sort on fields that were not set up for sorting when the table was created.
It's not so much that it's a bad idea. It's just really not possible to make Cassandra sort your data by an arbitrary column. Cassandra requires a query-based modeling approach, and that goes for sort order as well. You have to decide ahead of time the kinds of queries you want Cassandra to support, and the order in which those queries return their data.
Is it possible to sort on any field?
Here's the thing with how Cassandra sorts result sets: it doesn't. Cassandra queries correspond to partition locations, and the data is read off of the disk and returned to you. If the data is read in the same order that it was sorted in on-disk, the result set will be sorted. On the other hand if you try a multi-key query or an index-based query where it has to jump around to different partitions, chances are that it will not be returned in any meaningful order.
But if you plan ahead, you can actually influence the on-disk sort order of your data, and then leverage that order in your queries. This can be done with a modeling mechanism called a "clustering column." Cassandra will allow you to specify multiple clustering columns, but they are only valid within a single partition.
So what does that mean? Take this example from the DataStax documentation.
CREATE TABLE playlists (
    id uuid,
    artist text,
    album text,
    title text,
    song_order int,
    song_id uuid,
    PRIMARY KEY ((id), song_order)
) WITH CLUSTERING ORDER BY (song_order ASC);
With this table definition, I can query a particular playlist by id (the partition key). Within each id, the data will be returned ordered by song_order:
SELECT id, song_order, album, artist, title
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC;
id | song_order | album | artist | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 4 | No One Rides For Free | Fu Manchu | Ojo Rojo
62c36092-82a1-3a00-93d1-46196ee77204 | 3 | Roll Away | Back Door Slam | Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 2 | We Must Obey | Fu Manchu | Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1 | Tres Hombres | ZZ Top | La Grange
In this example, I only need to specify an ORDER BY if I want to switch the sort direction. As the rows are stored in ASCending order, I need to specify DESC to see them in DESCending order. If I were fine with getting the rows back in ASCending order, I wouldn't need to specify ORDER BY at all.
But what if I want to order by artist? Or album? Or both? Since one artist can have many albums (for this example), we'll modify the PRIMARY KEY definition like this:
PRIMARY KEY ((id),artist,album,song_order)
Running the same query above (minus the ORDER BY) produces this output:
SELECT id, song_order, album, artist, title
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
id | song_order | album | artist | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 3 | Roll Away | Back Door Slam | Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 4 | No One Rides For Free | Fu Manchu | Ojo Rojo
62c36092-82a1-3a00-93d1-46196ee77204 | 2 | We Must Obey | Fu Manchu | Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1 | Tres Hombres | ZZ Top | La Grange
Notice that the rows are now ordered by artist, and then album. If we had two songs from the same album, then song_order would be next.
So now you might ask "what if I just want to sort by album, and not artist?" You can sort just by album, but not with this table. You cannot skip clustering keys in your ORDER BY clause. In order to sort only by album (and not artist) you'll need to design a different query table. Sometimes Cassandra data modeling will have you duplicating your data a few times, to be able to serve different queries...and that's ok.
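For example, a second query table for that case could look like the sketch below; the name playlists_by_album is made up, and it assumes song_order stays unique within an album so no rows collide:
CREATE TABLE playlists_by_album (
    id uuid,
    album text,
    artist text,
    title text,
    song_order int,
    song_id uuid,
    PRIMARY KEY ((id), album, song_order)
) WITH CLUSTERING ORDER BY (album ASC, song_order ASC);
Querying it by id then returns the playlist's rows ordered by album, and by song_order within each album.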
For more detail on how to build data models while leveraging clustering order, check out these two articles on PlanetCassandra:
Getting Started With Time Series Data Modeling - Patrick McFadin
We Shall Have Order! - Disclaimer - I am the author

How to optimize massive amounts of data in mysql tables using a join

Perhaps using a JOIN isn't the best option here, but here's the scenario:
I have two tables, one is for houses, the other for objects in that house.
I have 50 houses and 8000 objects.
Lastly, each object will be either black or white (boolean).
Each object must be associated with each house and each object must be either black or white, which means, through my current design, there are going to be 400,000 records (8,000 ones, 8,000 twos all the way up to 50) in the objects table! Not the best for optimization. And my site turned into geriatric snails smoking ganja when I tried to load the query on my webpage. It died.
The table I have for houses looks like this:
==============================
House| Other cols | Other cols
==============================
1 | |
2 | |
3 | |
4 | |
to 50
The table I have for objects looks like this:
============================
House_ID | Object | Color
============================
1 | 1 | 1
1 | 2 | 1
1 | 3 | 0
1 | 4 | 1
1 | 5 | 0
"House_ID" increments to 2 once "Object" reaches 8,000. This incrementing continues until House_ID reaches 50.
There must be a better way to create the association between a house and its objects, where each object still carries that specific house ID but the query is not quite so taxing on the server.
BTW, I'm using an INNER JOIN to combine both tables. I think this might be wrong, but don't know a way around it. Doing SQL queries in phpMyAdmin.
How would I join or set up my table/queries so that it's not so cumbersome?
You probably need to investigate indexing your tables. This is actually a fairly small data set for what you are doing.
If your table names are houses and objects, try:
CREATE INDEX houses_index ON houses (House)
and
CREATE INDEX house_objects_index ON objects (HouseID,Object)
This will make your queries run MUCH faster, if, as I presume, indexes do not already exist.
(You might also want to keep your column names consistent between tables; calling the field House in one table and HouseID in another is, I think, more confusing than calling it HouseID in both places.)
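For reference, here is a sketch of the kind of join those indexes serve, using the column names from your two tables (whether 0 means black is an assumption): fetch all black objects in house 3.
SELECT h.House, o.Object, o.Color
FROM houses AS h
INNER JOIN objects AS o ON o.House_ID = h.House
WHERE h.House = 3
  AND o.Color = 0;
With the indexes in place, MySQL can locate house 3 and its 8,000 object rows directly instead of scanning all 400,000 rows.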

Using HBase for analytics

I'm almost completely new to HBase. I would like to take my current site tracking, which is based on MySQL, and move it to HBase because MySQL simply doesn't scale anymore.
I'm totally lost on the first step...
I need to track different actions of users and need to be able to aggregate them by various aspects (date, the country they come from, the product they performed the action with, etc.).
The way I store it currently is that I have a table with a composite PK over all these aspects (country, date, product, ...), and the rest of the fields are counters for actions. When an action is performed, I insert or update the row, incrementing the action's column by one (ON DUPLICATE KEY UPDATE ...).
*date | *country | *product | visited | liked | put_to_basket | purchased
2011-11-11 | US | 123 | 2 | 1 | 0 | 0
2011-11-11 | GB | 123 | 23 | 10 | 5 | 4
2011-11-12 | GB | 555 | 54 | 0 | 10 | 2
I have a feeling that this is completely against the HBase way, and it also doesn't really scale (with a growing number of keys, inserts get expensive) and isn't really flexible.
How do I track user actions and their attributes effectively in HBase? What should the table(s) look like? Where does MapReduce come into the picture?
Thanks for all suggestions!
Lars George's "HBase: The Definitive Guide" explains, in its introduction chapter, a design very similar to what you want to achieve.
This can be done as follows.
Use a unique row id in HBase:
rowid = date + country + product ---> concatenate these into a single value and use it as the key.
Then have the counters as columns. So when you get an event like
if (event == liked) {
    increment the liked column in HBase by 1 for the corresponding key combination
}
and so on for other cases.
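As a concrete sketch with the plain HBase Java client, the increment is a single atomic call; the table name user_actions and the column family actions are assumptions, not something HBase prescribes:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ActionCounter {
    // Atomically bump the "liked" counter for one date/country/product combination.
    public static void recordLike(String date, String country, String product) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_actions");          // table name is an assumption
        byte[] rowKey = Bytes.toBytes(date + country + product);  // rowid = date + country + product
        table.incrementColumnValue(rowKey,
                Bytes.toBytes("actions"),                         // column family (assumption)
                Bytes.toBytes("liked"),                           // one counter column per action type
                1L);
        table.close();
    }
}
Counters kept this way can then be aggregated across dates, countries or products with a scan or a MapReduce job, which is where MapReduce comes into the picture.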
Hope this helps!!
