How to aggregate the number of different events into a histogram in AWS QuickSight?

As I am not sure how to google this question, I am trying my luck here:
My data is essentially a log, represented as a time series with IDs.
I want a histogram that shows how many IDs occur in the time series e.g. [1..5], [5..10], [10..15], [15..20], [>20] times.
How can I achieve that?
To illustrate: think of logging keystrokes on a keyboard as a time series, one character per line entry. I am not interested in the specific count for characters "e" or "n", but rather in how many characters in total fall into each range.
Thanks for your support.
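The logic being asked for here is a two-step aggregation: count events per ID, then bucket those per-ID counts into ranges. Inside QuickSight, a calculated field such as countOver can produce the per-ID count for a histogram visual; if pre-processing the data is an option instead, here is a minimal pandas sketch of the same two-step logic (column names are hypothetical):

```python
import pandas as pd

# Hypothetical key-stroke log: one row per event, tagged with the character/ID.
log = pd.DataFrame({
    "id": ["e", "n", "e", "x", "n", "e"],
    "timestamp": pd.date_range("2020-01-01", periods=6, freq="min"),
})

# Step 1: how many times does each ID occur in the time series?
counts = log.groupby("id").size()

# Step 2: bucket those per-ID counts into ranges, then count IDs per bucket.
buckets = pd.cut(counts, bins=[0, 5, 10, 15, 20, float("inf")],
                 labels=["1-5", "5-10", "10-15", "15-20", ">20"])
print(buckets.value_counts().sort_index())
```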

Related

Use label as metric in Grafana Prometheus source

Hello everyone. I have a Prometheus metric where a label holds the payment amount, and the metric value is the number of payments. How do I get the total amount per day onto the dashboard? I.e., something like value_metric * sum.
As far as I know, there is no way to do that because labels aren't meant to be used in calculations. Labels and their values are essentially the index of Prometheus' NoSQL TSDB, they're used to create relations and join pieces of data together. You wouldn't store values and do math with column names of a relational database, would you?
Another problem is that labels with high cardinality greatly increase database size. Here is an extraction from Prometheus best practices:
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
Though I see that you use somewhat fixed values in labels, maybe a histogram would fit your needs.
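To make that concrete: the usual fix is to export the amount as the metric's value rather than as a label. A minimal sketch with the Python prometheus_client library (the metric names here are hypothetical):

```python
from prometheus_client import Counter, start_http_server

# Hypothetical metrics: the amount goes into the metric VALUE, not into a label.
PAYMENTS_TOTAL = Counter("payments_total", "Number of payments")
PAYMENT_AMOUNT_TOTAL = Counter("payment_amount_total", "Sum of all payment amounts")

def record_payment(amount: float) -> None:
    PAYMENTS_TOTAL.inc()              # one more payment
    PAYMENT_AMOUNT_TOTAL.inc(amount)  # running total of amounts

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
record_payment(49.90)
```

The total per day then becomes a plain PromQL query such as sum(increase(payment_amount_total[1d])), with no label arithmetic involved.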

A column of a CSV file in Google AutoML Tables is recognised as text or categorical instead of numeric, as I would like it to be

I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is imported correctly; it has 2 columns and about 1870 rows, all numeric.
The system recognises only one column as numeric, but not the other.
The problematic column has 5 digits in each row, separated by spaces.
Is there anything I should do in order for the system to properly recognise the data as numeric?
Thanks in advance for your help.
The issue is with the definition of the Numeric data type: the values need to be comparable (greater than, smaller than, equal).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without using strings, and therefore without losing the "information" in those numbers, you have several options.
For example:
Create an array of numbers by wrapping the values of the second column in [ ]. Take into consideration the relative weighted approach of the Array data type in AutoML Tables, as it may affect the "information" extracted from the sequence.
Create additional columns for every entry of the second column, so each one holds a single number and is hence truly numeric.
I would personally go for the second option (see the sketch below).
If you are afraid of losing "information" by splitting the numbers, consider that after training, the model should deduce by itself the importance of the position and other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
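As a minimal sketch of the second option, assuming the problematic column is a string of five space-separated digits (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical input: "col2" holds five space-separated digits per row.
df = pd.DataFrame({"col1": [1.2, 3.4], "col2": ["2 4 7 1 5", "1 5 7 3 2"]})

# Split the problematic column into five truly numeric columns.
digits = df["col2"].str.split(" ", expand=True).astype(int)
digits.columns = [f"col2_{i}" for i in range(digits.shape[1])]

out = pd.concat([df.drop(columns="col2"), digits], axis=1)
out.to_csv("automl_input.csv", index=False)  # re-import this file into AutoML Tables
```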

Seeking appropriate clustering algorithm

I'm analyzing the GDELT dataset and I want to determine thematic clusters. Simplifying considerably, GDELT parses news articles and extracts events. As part of that, it recognizes, let's say, 250 "themes" and tags each "event" it records with a semicolon-separated list (stored in a single column) of all themes identified in the article.
With that preamble, I've extracted, for 2016, a list of approximately 350,000 semi-colon separated theme lists, such as these two:
TAX_FNCACT;TAX_FNCACT_QUEEN;CRISISLEX_T11_UPDATESSYMPATHY;CRISISLEX_CRISISLEXREC;MILITARY;TAX_MILITARY_TITLE;TAX_MILITARY_TITLE_SOLDIER;TAX_FNCACT_SOLDIER;USPEC_POLITICS_GENERAL1;WB_1458_HEALTH_PROMOTION_AND_DISEASE_PREVENTION;WB_1462_WATER_SANITATION_AND_HYGIENE;WB_635_PUBLIC_HEALTH;WB_621_HEALTH_NUTRITION_AND_POPULATION;MARITIME_INCIDENT;MARITIME;MANMADE_DISASTER_IMPLIED;
CRISISLEX_CRISISLEXREC;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTSOFINTEREST_COLLEGE;TAX_FNCACT;TAX_FNCACT_MAN;TAX_ECON_PRICE;SOC_POINTSOFINTEREST_UNIVERSITY;TAX_FNCACT_JUDGES;TAX_FNCACT_CHILD;LEGISLATION;EPU_POLICY;EPU_POLICY_LAW;TAX_FNCACT_CHILDREN;WB_470_EDUCATION;
As you can see, both of these lists contain "TAX_FNCACT" and "CRISISLEX_CRISISLEXREC". Thus, "TAX_FNCACT;CRISISLEX_CRISISLEXREC" is a 2-item cluster. A better understanding of GDELT tells us that it isn't a particularly useful cluster, but it is one nevertheless.
What I'd like to do, ideally, is compose a dictionary of lists. The key is the number of items in the cluster; the value is a list of tuples, pairing each theme cluster with that number of elements with the number of times it appeared. This ideal algorithm would run until it identified the largest cluster.
Does an algorithm already exist that I can use for this purpose and if so, what is it named? If I had to guess, I would imagine we've created something to extract x-item clusters and then I would just loop from 2->? until I don't get any results.
Clustering won't work well here.
What you describe looks rather like frequent itemset mining, where the task is to find frequent combinations of 'items' in lists.
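As a rough illustration of frequent itemset mining over the theme lists, here is a pure-Python, Apriori-style sketch that builds exactly the dictionary described in the question, keyed by itemset size (the function and parameter names are made up; for 350,000 rows a tuned library such as mlxtend's apriori or an FP-growth implementation would be far faster):

```python
from collections import Counter

def frequent_itemsets(rows, min_count=2, max_size=4):
    """Build {itemset_size: [(themes, count), ...]} for theme combinations
    that co-occur in at least `min_count` rows."""
    transactions = [frozenset(filter(None, r.strip().strip(";").split(";")))
                    for r in rows]
    # Level 1: frequent single themes.
    singles = Counter(theme for t in transactions for theme in t)
    current = {frozenset([theme]) for theme, c in singles.items() if c >= min_count}
    result, size = {}, 1
    while current and size <= max_size:
        level = []
        for itemset in current:
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_count:
                level.append((tuple(sorted(itemset)), count))
        if not level:
            break
        result[size] = sorted(level, key=lambda pair: -pair[1])
        # Apriori step: candidates of size k+1 are unions of frequent k-itemsets.
        frequent = {frozenset(themes) for themes, _ in level}
        current = {a | b for a in frequent for b in frequent if len(a | b) == size + 1}
        size += 1
    return result

rows = [
    "TAX_FNCACT;CRISISLEX_CRISISLEXREC;MILITARY;",
    "TAX_FNCACT;CRISISLEX_CRISISLEXREC;EDUCATION;",
]
clusters = frequent_itemsets(rows)
print(clusters[2])  # [(('CRISISLEX_CRISISLEXREC', 'TAX_FNCACT'), 2)]
```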

Fuzzy match of an English sentence with a set of English sentences stored in a database

There are about 1000 records in a database table. A column named title stores the titles of articles. Before inserting a record, I need to check whether an article with a similar title already exists in that table. If so, I will skip the insert.
What's the fastest way to perform this kind of fuzzy matching? Assume all words in the sentences can be found in an English dictionary. If 70% of the words in sentence #1 can be found in sentence #2, we consider them a match. Ideally, the algorithm could pre-compute a value for each sentence so that the value can be stored in the database.
For 1000 records, doing the dumb thing and just iterating over all the records could work (assuming that the strings aren't too long and you aren't getting hit with too many queries). Just pull all of the titles out of your database, and then sort them by their distance to your given string (for example, you could use Levenshtein distance for this metric).
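A hedged sketch of that brute-force approach, using the standard library's difflib.SequenceMatcher as a stand-in for a dedicated edit-distance (Levenshtein) library:

```python
import difflib

def rank_by_similarity(new_title, existing_titles):
    """Sort stored titles by similarity to `new_title`, most similar first."""
    return sorted(
        existing_titles,
        key=lambda t: difflib.SequenceMatcher(None, new_title.lower(), t.lower()).ratio(),
        reverse=True,
    )

titles = ["Fuzzy matching sentences in SQL", "How to bake bread"]
best = rank_by_similarity("Fuzzy match sentences in a database", titles)[0]
print(best)  # "Fuzzy matching sentences in SQL"
```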
A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance wise, but it could mean more work:
http://en.wikipedia.org/wiki/N-gram
You can read up on forward/reverse indexing of token-value storage to get faster search results. I personally prefer reverse indexing, which stores a hash map from token (key) to value (here, the title).
Whenever you write a new article, like a new Stack Overflow question, the tokens in the title are searched against all the titles available.
To rank the results, i.e. to get the fuzzy behaviour, you can sort the titles by how many of the searched tokens they occur in. E.g., if t1, t2, and t3 are the tokens 'what', 'is', and 'love', the title 'what this love is for?' would exist in all three token mappings, so it would be placed topmost.
You can play around with this more. I hope this approach is simple and appealing.
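A minimal sketch of that reverse index in Python (the tokenization and ranking are kept deliberately simple):

```python
import re
from collections import Counter, defaultdict

def tokens(text):
    """Lower-case word tokens with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def build_reverse_index(titles):
    """Map each token (key) to the set of stored titles (values) containing it."""
    index = defaultdict(set)
    for title in titles:
        for token in tokens(title):
            index[token].add(title)
    return index

def search(index, query):
    """Rank stored titles by how many of the query's tokens they contain."""
    hits = Counter()
    for token in tokens(query):
        for title in index.get(token, ()):
            hits[title] += 1
    return hits.most_common()

index = build_reverse_index(["what is love?", "what this love is for?"])
print(search(index, "what is love"))  # both titles contain all three tokens
```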

Matching people based on names, DoB, address, etc

I have two databases that are differently formatted. Each database contains person data such as name, date of birth and address. They are both fairly large, one is ~50,000 entries the other ~1.5 million.
My problem is to compare the entries and find possible matches. Ideally generating some sort of percentage representing how close the data matches. I have considered solutions involving generating multiple indexes or searching based on Levenshtein distance but these both seem sub-optimal. Indexes could easily miss close matches and Levenshtein distance seems too expensive for this amount of data.
Let's try to put a few ideas together. The general situation is too broad, and these will be just guidelines/tips/whatever.
Usually what you'll want is not a true/false match relationship, but a score for each candidate match. That is because you can never be completely sure whether a candidate is really a match.
The scoring is a one-to-many relation: you should be prepared to rank each record of your small DB against several records of the master DB.
Each kind of match should be assigned a weight and a score, to be added up into the overall score for that pair.
You should try to compare fragments as small as possible in order to detect partial matches. Instead of comparing [address], try to compare [city] [state] [street] [number] [apt].
Some fields require special treatment, but this issue is too broad for this answer; just a few tips. Middle initials and name prefixes can add some score, but should be weighted minimally (as they are often skipped). Phone numbers may have variable prefixes and suffixes, so sometimes substring matching is needed. Depending on the data quality, names and surnames may need to be converted to Soundex or similar. Street names are usually normalized, but they may lack prefixes or suffixes.
Be prepared for long runtimes if you need a high quality output.
A percentage threshold is usually set, so that if after partially processing a pair its score is below x out of a maximum of y, the pair is discarded.
If you KNOW that some field MUST match for a pair to be considered a candidate, that usually speeds the whole thing up a lot.
The data structures used for comparing are critical, but I don't feel my particular experience will serve you well here, as I always did this kind of thing on a mainframe: very high-speed disks, a lot of memory, and massive parallelism. I can think through what is relevant for the general situation if you feel that would be useful.
HTH!
PS: Almost a joke: in a big project I managed quite a few years ago, we had the mother's maiden surname in both databases, and we assigned a heavy score to both surnames matching (the individual's and his mother's). Moral: all Smith->Smith are the same person :)
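Pulling the answer's tips together, a hedged sketch of weighted per-field scoring (the field names and weights are hypothetical; in a real run, candidate pairs would first be restricted by a must-match blocking key, such as an identical Soundex of the surname, to avoid 50,000 × 1.5 million comparisons):

```python
import difflib

# Hypothetical per-field weights; in practice these are tuned on labelled pairs.
WEIGHTS = {"surname": 4.0, "first_name": 2.0, "dob": 3.0, "city": 1.0, "street": 1.0}

def field_score(a, b):
    """Similarity of two field values in [0, 1]; 1.0 means identical."""
    if not a or not b:
        return 0.0
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(rec_a, rec_b):
    """Weighted per-field scores added up and normalised to [0, 1]."""
    total = sum(WEIGHTS.values())
    score = sum(w * field_score(rec_a.get(f, ""), rec_b.get(f, ""))
                for f, w in WEIGHTS.items())
    return score / total

a = {"surname": "Smith", "first_name": "John", "dob": "1980-02-01", "city": "Boston"}
b = {"surname": "Smyth", "first_name": "Jon",  "dob": "1980-02-01", "city": "Boston"}
print(round(pair_score(a, b), 2))  # high score: a likely match worth reviewing
```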
You could try using the full-text search feature, if your DBMS supports it? Full-text search builds its own indices and can find similar words.
Would that work for you?
