How to handle the scenario of the same city with multiple names - algorithm

OK, I have a list of contacts filled in by the respective persons. But persons living in the same city might write different names for it (which I have no control over, since city names may change with a change of government).
For example:
NAME | CITY
John | Banaglore
Amit | Bengaluru
Here both Banaglore and Bengaluru refer to the same city (Bangalore). How can I make sure (maybe programmatically) that my system does not treat these as two different cities but as one while traversing the list?
One solution I could think of is to have a notion of unique IDs attached to each city, but that requires recreating the list, and I would also have to train my contacts in the unique-ID notion.
Any thoughts are appreciated.
Please feel free to route this post to any other stackexchange.com site if you think it does not belong here, or update the tags.

I would recommend creating a table alias_table which maps city aliases to a single common name:
+------------+-----------+
| city_alias | city_name |
+------------+-----------+
| Banaglore  | Bangalore |
| Bengaluru  | Bangalore |
| Bangalore  | Bangalore |
| Mumbai     | Bombay    |
| Bombay     | Bombay    |
+------------+-----------+
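If it helps, here is a minimal sketch of setting that table up in SQL (the DDL itself is my assumption; adjust the types to your schema):

CREATE TABLE alias_table (
    city_alias VARCHAR(100) PRIMARY KEY,  -- each spelling appears exactly once
    city_name  VARCHAR(100) NOT NULL      -- the canonical name it maps to
);

INSERT INTO alias_table (city_alias, city_name) VALUES
    ('Banaglore', 'Bangalore'),
    ('Bengaluru', 'Bangalore'),
    ('Bangalore', 'Bangalore'),
    ('Mumbai',    'Bombay'),
    ('Bombay',    'Bombay');

Note that each canonical name also maps to itself, so already-correct spellings survive the join.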
When you want to do any manipulation of the table in your OP, you can join its CITY column to the city_alias column above as follows:
SELECT *
FROM name_table nt INNER JOIN alias_table at
ON nt.CITY = at.city_alias
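As a refinement (my suggestion, not part of the original recommendation), a LEFT JOIN lets you spot spellings that have no alias entry yet:

SELECT DISTINCT nt.CITY AS unmapped_city
FROM name_table nt LEFT JOIN alias_table at
ON nt.CITY = at.city_alias
WHERE at.city_alias IS NULL;  -- spellings you still need to add to alias_table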

I think the best way is to provide a selection from a list of existing cities and not allow the user to enter one manually.
But if you already have data, it is more reliable to use the alias table proposed by @Tim Biegeleisen.
In addition, some automation could be added: for example, treat two names that differ in at most two letters as candidates for the same city, as long as the difference is not in the first letter, and put the pair into the alias table marked as a candidate for future review.
Here are examples of why the first letter should not be ignored:
Kiev = Kyiv
Lviv != Kiev
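As a sketch of that automation, assuming PostgreSQL with the fuzzystrmatch extension (table and column names follow the OP's example and are otherwise assumptions):

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Pair up distinct spellings that differ in at most two letters,
-- but only when the first letter matches (Kiev = Kyiv, but Lviv != Kiev).
SELECT a.CITY AS candidate_alias, b.CITY AS candidate_name
FROM (SELECT DISTINCT CITY FROM name_table) a
JOIN (SELECT DISTINCT CITY FROM name_table) b
  ON a.CITY < b.CITY                       -- avoid listing each pair twice
 AND left(a.CITY, 1) = left(b.CITY, 1)     -- first letters must match
 AND levenshtein(a.CITY, b.CITY) <= 2;     -- at most two letters differ

The resulting pairs would go into the alias table flagged for human review rather than being merged automatically.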

Related

Cognos Analytics, multiple columns in crosstab but only one row in measure

I have a problem where, in a crosstab with multiple columns, there are multiple rows of measures where I would like to have only one.
The crosstab looks like this:
                      |-----Amount-----|
SITE-----|---PERSON---|----------------|
----------------------|----------------|
SITE1    | James      |       45       |
SITE2    | John       |       34       |
SITE2    | Jones      |       34       |
SITE3    | Jane       |       54       |
----------------------|----------------|
TOTAL-----------------|      167       |
So the first column is the site, the second one people on the site (notice that site2 has two people). The structure is simplified, but you get the point.
What I would like to have is the following structure:
                      |-----Amount-----|
SITE-----|---PERSON---|----------------|
----------------------|----------------|
SITE1    | James      |       45       |
SITE2    | John       |       34       |
SITE2    | Jones      |                |
SITE3    | Jane       |       54       |
----------------------|----------------|
TOTAL-----------------|      133       |
So the measure rows are generated only from the site column, not from site and person columns. This way I can calculate the total amount across sites, not across persons. Currently the duplicate row(s) cause the total value to be higher than it actually is.
Is there a way to achieve this using a crosstab, or do I need to think of some other approach (a second list to show sites and persons) for this use case?
<--------------------EDIT-------------------->
I have mistakenly explained the amount column in my example. I have a table containing sales events, and the amount measure should actually be the number of sales events per site. So what I'm trying to achieve is an answer to the question: for a given type of sales event, list the sites where these sales occurred, the persons working on each site, and the total number of sales events on that site. So basically I'm fetching all the sales events with some filter (type=something). These sales events have a site where they occurred; that site has zero to n employees. So there's one inner join between sales event and site, and an outer join between site and the person table (a sketch of this query follows the sample rows below). The SQL query returns data like this:
sales_event_1|site1|James|type1|subtype2
sales_event_2|site2|John|type1|subtype1
sales_event_2|site2|Jones|type1|subtype1
sales_event_3|site2|John|type1|subtype2
sales_event_3|site2|Jones|type1|subtype2
sales_event_4|site3|Jane|type1|subtype1
...
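A sketch of the query as I've described it (all table and column names here are hypothetical, since I've simplified the schema):

SELECT se.sales_event_id, s.site_name, p.person_name, se.type, se.subtype
FROM sales_event se
INNER JOIN site s ON se.site_id = s.site_id
LEFT OUTER JOIN person p ON p.site_id = s.site_id
WHERE se.type = 'type1';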
So the crosstab structure is the following:
Rows= site|person
Columns= subtype
measure= count (distinct [sales_event_id] for [site])
And crosstab looks something like this:
                      |-----subtype1----|-----subtype2----|-----total----|
SITE-----|---PERSON---|-----------------|-----------------|--------------|
----------------------|-----------------|-----------------|--------------|
SITE1    | James      |       35        |       10        |      45      |
SITE2    | John       |       20        |       14        |      34      |
SITE2    | Jones      |       20        |       14        |      34      |
SITE3    | Jane       |       54        |        0        |      54      |
----------------------|-----------------|-----------------|--------------|
TOTAL-----------------|-----------------|-----------------|     133      |
I hope this helps you guys.
Create a new data item:
total([Sales] for [Site])
Use that as the metric for the crosstab.
Next, click on the metric and set the Group Span property to [Site].
It is good that you understand your data well enough to recognize that you are getting the wrong results. It would help you to know that the term for this is double-counting.
In your case, the grain of the amount fact is at the site level. I'm assuming that person is an attribute in the same dimension (the relational thing, not the thing with members, hierarchies, and levels, although that is built on concepts from the relational thing; read Kimball). Your report is trying to project the query below the grain of the fact, and you get double-counting.
You ought to have determinants defined in your model (if you are using a Framework Manager package) or column dependencies (if you are using a data module). These tell the query engine the fact grains and which objects in a dimension are at which grain, how to aggregate facts in a multi-fact, multi-grain situation, and how to deal with attempts to project a query below the grain of a fact.
Because this would be defined in your model, it would be available to every report that you create and every report that ordinary users create, which is better than trying to build handling for these situations into every report and hoping that your ordinary users know what to do, which they probably won't.
The fact that you don't have determinants set up suggests that your organization's modeller might have let your team down in other ways, for example by not handling role-playing dimensions or disambiguating query paths.

Order By any field in Cassandra

I am researching Cassandra as a possible solution for my upcoming project. The more I research, the more I keep hearing that it is a bad idea to sort on fields that were not set up for sorting when the table was created.
Is it possible to sort on any field? If there is a performance impact for sorting on fields not in the clustering columns, what is that impact? I need to sort roughly 2 million records in the table.
I keep hearing that it is a bad idea to sort on fields that were not set up for sorting when the table was created.
It's not so much that it's a bad idea. It's just really not possible to make Cassandra sort your data by an arbitrary column. Cassandra requires a query-based modeling approach, and that goes for sort order as well. You have to decide ahead of time the kinds of queries you want Cassandra to support, and the order in which those queries return their data.
Is it possible to sort on any field?
Here's the thing with how Cassandra sorts result sets: it doesn't. Cassandra queries correspond to partition locations, and the data is read off of the disk and returned to you. If the data is read in the same order in which it was sorted on disk, the result set will be sorted. On the other hand, if you try a multi-key query or an index-based query where it has to jump around to different partitions, chances are that it will not be returned in any meaningful order.
But if you plan ahead, you can actually influence the on-disk sort order of your data, and then leverage that order in your queries. This can be done with a modeling mechanism called a "clustering column." Cassandra will allow you to specify multiple clustering columns, but they are only valid within a single partition.
So what does that mean? Take this example from the DataStax documentation.
CREATE TABLE playlists (
    id uuid,
    artist text,
    album text,
    title text,
    song_order int,
    song_id uuid,
    PRIMARY KEY ((id), song_order)
) WITH CLUSTERING ORDER BY (song_order ASC);
With this table definition, I can query a particular playlist by id (the partition key). Within each id, the data will be returned ordered by song_order:
SELECT id, song_order, album, artist, title
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC;
id | song_order | album | artist | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 4 | No One Rides For Free | Fu Manchu | Ojo Rojo
62c36092-82a1-3a00-93d1-46196ee77204 | 3 | Roll Away | Back Door Slam | Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 2 | We Must Obey | Fu Manchu | Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1 | Tres Hombres | ZZ Top | La Grange
In this example, I only need to specify an ORDER BY if I want to switch the sort direction. As the rows are stored in ASCending order, I need to specify DESC to see them in DESCending order. If I were fine with getting the rows back in ASCending order, I wouldn't need to specify ORDER BY at all.
But what if I want to order by artist? Or album? Or both? Since one artist can have many albums (for this example), we'll modify the PRIMARY KEY definition like this:
PRIMARY KEY ((id),artist,album,song_order)
Running the same query above (minus the ORDER BY) produces this output:
SELECT id, song_order, album, artist, title
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
id | song_order | album | artist | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 3 | Roll Away | Back Door Slam | Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 4 | No One Rides For Free | Fu Manchu | Ojo Rojo
62c36092-82a1-3a00-93d1-46196ee77204 | 2 | We Must Obey | Fu Manchu | Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1 | Tres Hombres | ZZ Top | La Grange
Notice that the rows are now ordered by artist, and then album. If we had two songs from the same album, then song_order would be next.
So now you might ask, "What if I just want to sort by album, and not artist?" You can sort just by album, but not with this table: you cannot skip clustering keys in your ORDER BY clause. In order to sort only by album (and not artist) you'll need to design a different query table, as sketched below. Sometimes Cassandra data modeling will have you duplicating your data a few times to be able to serve different queries...and that's OK.
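For example, a second query table clustered by album first might look like this (a sketch, reusing the same columns as above):

CREATE TABLE playlists_by_album (
    id uuid,
    artist text,
    album text,
    title text,
    song_order int,
    song_id uuid,
    PRIMARY KEY ((id), album, song_order)
);

Querying this table by id returns rows ordered by album (and then song_order), at the cost of writing each song to both tables.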
For more detail on how to build data models while leveraging clustering order, check out these two articles on PlanetCassandra:
Getting Started With Time Series Data Modeling - Patrick McFadin
We Shall Have Order! (disclaimer: I am the author)

Keeping historic records as well as current overview

I'm using a small collection of web scrapers to get the current GPS location of various devices. I also want to keep historic records. What's the best way of doing this without storing the data twice? For now I have two tables, both looking like this:
 Column  |            Type             |   Modifiers   | Storage  | Description
---------+-----------------------------+---------------+----------+-------------
 vehicle | character varying(20)       |               | extended |
 course  | real                        |               | plain    |
 speed   | real                        |               | plain    |
 fix     | smallint                    |               | plain    |
 lat     | real                        |               | plain    |
 lon     | real                        |               | plain    |
 time    | timestamp without time zone | default now() | plain    |
One is named gps, and the other is named gps_log. The function that updates these does two things: first it performs an INSERT on gps_log, and afterwards it does an UPDATE OR INSERT (a user-defined function) on gps. However, this results in what seems to me a pointless case of double storage, for no purpose other than having easily SELECTable access to the current data.
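For reference, on pre-9.5 PostgreSQL (no ON CONFLICT) such an UPDATE OR INSERT function typically looks something like this; the function name and the abbreviated column list are assumptions:

CREATE OR REPLACE FUNCTION gps_upsert(_vehicle varchar, _lat real, _lon real)
RETURNS void AS $$
BEGIN
    -- try the update first; fall back to insert if the vehicle is new
    UPDATE gps SET lat = _lat, lon = _lon, "time" = now() WHERE vehicle = _vehicle;
    IF NOT FOUND THEN
        INSERT INTO gps (vehicle, lat, lon) VALUES (_vehicle, _lat, _lon);
    END IF;
END;
$$ LANGUAGE plpgsql;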
Is there a simple way of using only gps_log and having a function select only the newest entry for each vehicle? Keep in mind that gps_log currently has 1,397,150 rows, increasing by roughly 150 rows every 15 minutes, so performance is likely to be an issue.
Using PostgreSQL 8.4 via Perl DBI.
If SELECT performance is paramount, your current solution with redundant storage might not be such a bad idea.
If you get rid of the redundant table, you can help SELECT performance with a multi-column index like:
CREATE INDEX gps_log_vehicle_time ON gps_log (vehicle, time DESC);
Assuming that vehicle is your primary key, that would make the corresponding query pretty fast:
SELECT *
FROM gps_log
WHERE vehicle = 'foo'
ORDER BY time DESC
LIMIT 1;
To SELECT the last entry for multiple or all vehicles, use this related technique.
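One common PostgreSQL technique for that (available in 8.4) is DISTINCT ON; a minimal sketch:

SELECT DISTINCT ON (vehicle) *
FROM gps_log
ORDER BY vehicle, time DESC;  -- keeps the newest row per vehicle

The index above supports this reasonably well, though with very many rows per vehicle other techniques can win.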
Total storage size would probably grow, though, because the index will be bigger than the redundant table (+ index) if you have many rows per vehicle.
It might help storage and performance to add a serial column as a surrogate primary key instead of vehicle. Especially if you have foreign keys pointing to it.
Aside: don't use time as column name. It's a type name in PostgreSQL and a reserved word in every SQL standard. It is also misleading to name a timestamp column time.

Google API Map of cities based on database

I'm wondering, is there a possibility to create a dynamic map of cities based on my database?
For example, I have a few cities in my db, e.g.:
1 | Warsaw
2 | Cracov
3 | Poznan
4 | Poznan
5 | Poznan
6 | Warsaw
7 | Berlin
8 | Berlin
etc. This is based on registered users, and as you can see, for example, I have 2 users from Berlin, 2 from Warsaw, 1 from Cracov and 3 from Poznan. Now I'd like to count how many times each unique city name is repeated, and I'd like to make a dynamic map based on this which would look something like the mock-up I've made in a graphic editor.
If a city's count is bigger, then the circle on it is bigger, meaning that more people from that city registered on my page. I'm looking for that kind of solution. It can be different and from a different provider, but it has to be dynamic and change on every refresh; for example, when someone new registers and I refresh, it should add the new city to the map, or if the city already existed, its circle should get bigger.
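The counting part is a simple aggregate; a minimal sketch, assuming a users table with a city column (both names are assumptions):

SELECT city, COUNT(*) AS user_count
FROM users
GROUP BY city
ORDER BY user_count DESC;

Re-running this on each page load gives you the per-city totals to scale the circles with.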
Hope you understand me and you will help me ;)
regards
Darek

Using HBase for analytics

I'm almost completely new to HBase. I would like to take my current site tracking, based on MySQL, and move it to HBase, because MySQL simply doesn't scale anymore.
I'm totally lost in the first step...
I need to track different actions of users and need to be able to aggregate them by some aspects (date, country they come from, product they performed the action with, etc...).
The way I store it currently is that I have a table with a composite PK made of all these aspects (country, date, product, ...), and the rest of the fields are counters for actions. When an action is performed, I insert it into the table, incrementing the action's column by one (ON DUPLICATE KEY UPDATE..., as sketched after the sample rows below).
*date      | *country | *product | visited | liked | put_to_basket | purchased
2011-11-11 | US       | 123      |       2 |     1 |             0 |         0
2011-11-11 | GB       | 123      |      23 |    10 |             5 |         4
2011-11-12 | GB       | 555      |      54 |     0 |            10 |         2
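The current MySQL statement for, say, a visit presumably looks something like this (a sketch; the table name is an assumption):

INSERT INTO actions (date, country, product, visited, liked, put_to_basket, purchased)
VALUES ('2011-11-11', 'US', 123, 1, 0, 0, 0)
ON DUPLICATE KEY UPDATE visited = visited + 1;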
I have a feeling that this is completely against the HBase way, and it also doesn't really scale (with a growing number of keys, inserts get expensive) and isn't really flexible.
How do I track user actions with their attributes effectively in HBase? What should the table(s) look like? Where does MapReduce come into the picture?
Thanks for all suggestions!
Lars George's "HBase: The Definitive Guide" explains, in its introduction chapter, a design very similar to what you want to achieve.
This can be done as follows.
Have the unique row id in HBase as follows:
rowid = date + country + product ---> append these into a single value and use it as the key.
Then have the counters as columns. So when you get an event, do something like this (a sketch against the HBase Java client; the Table handle, row key, and "actions" column family are assumptions):
// Atomically increment the "liked" counter for the corresponding key combination.
if ("liked".equals(event)) {
    table.incrementColumnValue(rowKey, Bytes.toBytes("actions"), Bytes.toBytes("liked"), 1L);
}
and so on for the other cases.
Hope this helps!!
