Datadog distinct-like custom metrics - metrics

Given the following scenario:
A lambda receives an event via SQS
The lambda receives a uuid pointing to an entity.
The lambda may fail with an error
SQS will retry that particular entity several times
The lambda will be called with different entities thousands of times
Right now we monitor a custom error-count metric like myService.errorType.
This gives us the exact number of times an error occurred, independent of any specific entity: if an entity can't be processed 100 times, then the metric value will be 100.
What I'd like to have, though, is a distinct metric based on the UUID.
Example:
entity with id 123 fails 10 times
entity with id 456 succeeds
entity with id 789 fails 20 times
Then I'd like to have a metric with the value of 2 - because the processes failed for two entities only (and not for 30, as it would be reported right now).
While searching for a solution I found the possibility of using tags. But as the docs point out they are not meant for such a use-case:
Tags shouldn’t originate from unbounded sources, such as epoch timestamps, user IDs, or request IDs. Doing so may infinitely increase the number of metrics for your organization and impact your billing.
So are there any other possibilities to achieve my goals?

I've solved it now by verifying the status via code and by adding tags to the metrics:
occurrence:first
occurrence:subsequent
This way I can filter in my dashboard for occurrence:first only.
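A minimal sketch of that first-versus-subsequent check, assuming the set of already-failed entity IDs is kept somewhere that survives between invocations. The in-memory set and the OccurrenceTagger name are illustrative only; in a real Lambda the state would have to live in an external store, which the answer does not specify.

```python
from typing import Set


class OccurrenceTagger:
    """Decides which occurrence tag a failure metric should carry."""

    def __init__(self) -> None:
        # Hypothetical state: IDs of entities that have already failed once.
        self._seen: Set[str] = set()

    def tag_for(self, entity_id: str) -> str:
        if entity_id in self._seen:
            return "occurrence:subsequent"
        self._seen.add(entity_id)
        return "occurrence:first"


# The example from the question: entity 123 fails 10 times, 789 fails 20 times.
tagger = OccurrenceTagger()
failures = ["123"] * 10 + ["789"] * 20
tags = [tagger.tag_for(entity_id) for entity_id in failures]
first_count = tags.count("occurrence:first")
subsequent_count = tags.count("occurrence:subsequent")
```

Because the tag has only two possible values, it does not grow the metric's cardinality the way a per-UUID tag would.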

To make sure things are clear, you have a metric called myService.errorType with a tag entity. This metric is a counter that will increase every time an entity is in error. You will then use this metric query:
sum:myService.errorType{*} by {entity}
When you speak about UUIDs, it seems that the cardinality is small (your example shows 3), which means that in any given hour only a small number of UUIDs will be present. In that case, adding the UUID to the metric tags is not as critical as tagging with a user ID, timestamp, etc., which have a limitless number of values.
I would invite you to add this uuid tag, and check the cardinality in the metric summary page to ensure it works.
Then to get the number of UUIDs affected by errors, you can use something like:
count_not_null(sum:myService.errorType{*} by {uuid})
Finally, as an alternative, if the cardinality of UUID can go through the roof, I would invite you to work with logs or work with Christopher's solution which seems to limit the cardinality increase as well.


Smart pagination algorithm that works with local data cache

This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server side database of products containing roughly 50,000 products (50 MB)
Assume no db type, we interact with it via REST/GraphQL interface
Assume a single product record is < 1kB
Assume a max payload for a resultset of 256kB
Assume max 5MB storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but (network) efficient way to fetch pages from a result set so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this url:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
"totalResults": 2458,
"firstResult": 100,
"pageSize": 100,
"results": [
{"some": "item"},
{"some": "other item"},
// 98 more ...
]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page. Because by the time we request the next page, the result set may have changed (due to changes in the DB), influencing which items are part of the result set. Even a small change can have a big impact: one item removed from the DB, that happened to be on page 0 of the result set, will change what results we will get when requesting all subsequent pages.
Goal
I am looking for a mechanism to make the definition of the result set independent of future database changes, so if someone was looking for shoes and got a result set of 2458 items, he could actually fetch all pages of that result set reliably even if it got influenced by later changes in the DB (I plan to not really delete items, but set a removed flag on them, for this purpose)
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with for now is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
Anyway, I am hoping it is not too broad or theoretical. I have a web-based DB, just no good idea on how to do paging with it. I am looking for answers that help me in a direction to learn, not full solutions.
A versioned DB is the answer for result set consistency.
Each record has a primary id, a modification counter (version number) and a timestamp of modification/creation. Instead of modifying record r in place, you add a new record with the same id, version number + 1 and sysdate as the modification time.
In the fetch response you include the DB request_time (do not use a client timestamp, because client and server clocks may differ). The first page is served normally, but you return sysdate as request_time. Other pages are served differently: the client sends that request_time back and you add a condition like modification_time <= request_time for each versioned table.
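The versioning idea can be sketched in a few lines, assuming an append-only list of record versions and integer server timestamps. The Version fields, the snapshot and page helpers, and the sort by name are all illustrative choices, not part of any particular database API; the removed flag is the one mentioned in the question.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Version:
    id: int
    version: int
    modified_at: int  # server-side timestamp of this version
    name: str
    removed: bool = False


def snapshot(rows: List[Version], request_time: int) -> List[Version]:
    """Latest version of each record as of request_time, without removed ones."""
    latest: Dict[int, Version] = {}
    for row in rows:
        if row.modified_at > request_time:
            continue  # written after the first page was served: ignore
        current = latest.get(row.id)
        if current is None or row.version > current.version:
            latest[row.id] = row
    visible = [row for row in latest.values() if not row.removed]
    return sorted(visible, key=lambda row: row.name)


def page(rows: List[Version], request_time: int, first: int, size: int) -> List[Version]:
    return snapshot(rows, request_time)[first:first + size]


rows = [
    Version(1, 1, 10, "boot"),
    Version(2, 1, 10, "sandal"),
    Version(3, 1, 10, "sneaker"),
    Version(1, 2, 20, "boot", removed=True),  # flagged as removed later
    Version(4, 1, 25, "clog"),                # created later
]
request_time = 15  # returned by the server along with the first page
page_one = [r.name for r in page(rows, request_time, 0, 2)]
page_two = [r.name for r in page(rows, request_time, 2, 2)]
live_page_one = [r.name for r in page(rows, 30, 0, 2)]
```

Every page requested with the same request_time sees the same three records, even though the live data has changed in the meantime.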
You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query. So now the frontend can request something like next_page with the unique ID that it got the first time it made the query. You should still go ahead with your approach of changing the DELETE operation to a removed flag, because it makes sure that none of the entries in the result set is deleted. You can discard the result set of the query from the cache when the frontend reaches the end of the result set, or you can set a time limit on the lifetime of the cache entry.
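A rough sketch of such a server-side cache, assuming the ordered ID list fits in memory and that a fixed lifetime is acceptable. ResultSetCache, its method names and the 600-second default are hypothetical.

```python
import time
import uuid
from typing import Callable, Dict, List, Tuple


class ResultSetCache:
    """Keeps the ordered IDs of a query result under an opaque token."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[int]]] = {}

    def store(self, ids: List[int]) -> str:
        token = uuid.uuid4().hex
        self._entries[token] = (self._clock() + self._ttl, list(ids))
        return token

    def page(self, token: str, first: int, size: int) -> List[int]:
        expires_at, ids = self._entries[token]  # KeyError for unknown tokens
        if self._clock() >= expires_at:
            del self._entries[token]  # lifetime exceeded: discard the entry
            raise KeyError(token)
        return ids[first:first + size]


cache = ResultSetCache()
token = cache.store([3, 4, 6, 9, 12, 15])  # IDs in the query's sort order
first_page = cache.page(token, 0, 4)
second_page = cache.page(token, 4, 4)
```

The IDs returned for a page can then be fetched with a query of the WHERE id IN (...) kind mentioned in the question, and since the list is stored in the query's sort order, custom sorting keeps working.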

JMeter and simulating the real users

I am wondering if there is something I could use to create a simulator using JMeter that would pick the users from my "user list" based on some kind of pattern. In fact, even simpler: imagine I have the users from 0 to N. Some of them are active, some of them are not. I would like to have some simulated users that are active during certain period (say, hour), then they go dormant, others become active etc. So, out of total N users I would have something like X unique active users per hour, Y unique active users per day, Z unique active users per week etc.
I think I could write some kind of generator like this but I am wondering if something already exists - as JMeter plugin or just a library/class that I could use.
The following test elements can help you implement the requested scenario:
Ultimate Thread Group - to control virtual users arrival rate and time to hold the load
Constant Throughput Timer - to control virtual users activity in "requests per minute" which can be converted to "requests per second" or "requests per day" by simple arithmetic calculations
Provide uniqueness of virtual users via:
CSV Data Set Config configuration element or __CSVRead() function - for pre-defined users list
__Random or __RandomString function for dynamic unique parameters.

how to improve Neo4J performance in creating edges?

i'm building a traffic schedule application using Neo4J, NodeJS and GTFS-data; currently, i'm trying to get
things working for the traffic on a single day on the Berlin subway network. these are the grand totals
i've collected so far:
10 routes
211 stops
4096 trips
83322 stoptimes
to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime which denotes the
event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip,
which is a series of stoptimes, they happen on a specific date and time, and they happen on a given
stop for a given route (or 'line') in a transit network. so there's a lot of references here.
the problem i'm running into is the amount of data and the time it takes to build the database. in order
to speed up things, i've already (1) cut down the data to a single day, (2) deleted the database files
and have the server create a fresh one (very effective!), (3) searched a lot to get better queries. alas,
with the figures as given above, it still takes 30~50 minutes to get all the edges of the graph.
these are the indexes i'm building:
CREATE CONSTRAINT ON (n:trip) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);
i'd guess the unique primary keys should be most important.
and here are the queries that take up like 80% of the running time (with 10% that are unrelated to Neo4J,
and 10% needed to feed the node data using plain HTTP post requests):
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);
MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);
MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);
MATCH (stop1:stop)-->(a:stoptime)-[:next]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);
the first query is still in the range of some minutes which i find longish given that there are only
thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that
involve stoptimes take several ten minutes each on my desktop machine.
(i've also calculated whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible:
in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173
subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280,
close enough. there are some faulty? / double / extra stop nodes in the data (211
stops instead of 173), but those are few.)
frankly, if i don't find a way to speed up things at least tenfold (rather more), it'll make little sense to use Neo4J
for this project. just in order to cover the single city of Berlin many, many more stoptimes have to be added,
as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have like
170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).
Update: the above edge creation queries, which i perform one by one, have now been running for over an hour and have not yet finished, meaning that, if things scale in a linear fashion, the time needed to feed the Berlin public transport data for a single day would be something like a week. therefore, the code currently performs several orders of magnitude too slowly to be viable.
Update: @MichaelHunger's solution did work; see my response below.
I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
You should see your issues when you run profiling on your queries in the shell.
Prefix your query with profile and look at the profile output to see whether it uses the index or just a label scan.
Do you use parameters for your insert queries? So that Neo4j can re-use built queries?
For queries like this:
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
It will very probably not use your index.
Can you perhaps point to your datasource? We can convert it into CSV if it isn't and then import even more quickly.
Perhaps we can create a graph gist for your model?
I would rather use:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route);
For your initial import you also don't need create unique as you match every trip only once.
And I'm not sure what your "~label" is good for?
Similar for your other queries.
As the data is public it would be cool to work together on this.
Something I'd love to hear more about is how you plan to express your query use-cases.
I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me on michael at neo4j.org
Also perhaps you want to check out these links:
Tramchester
http://www.thoughtworks.com/de/insights/blog/transforming-travel-and-transport-industry-one-graph-time
http://de.slideshare.net/neo4j/graph-connect-v5
https://www.youtube.com/watch?v=AhvECxOhEX0
London Tube Graph
http://blog.bruggen.com/2013/11/meet-this-tubular-graph.html
http://www.markhneedham.com/blog/2014/03/03/neo4j-2-1-0-m01-load-csv-with-rik-van-bruggens-tube-graph/
http://www.markhneedham.com/blog/2014/02/13/neo4j-value-in-relationships-but-value-in-nodes-too/
detailed solution
i'm happy to report that @MichaelHunger's solution works like a charm. i modified the edge-building queries
from the question with the below shapes that keep to the suggested query outline:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)
MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)
MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)
MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)
MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)
as can be seen, the trick here is to give each participating node a MATCH statement of its own and to
move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only
then take advantage of its indexes.
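the difference between the two query shapes can be pictured with a plain Python analogy. this is only an illustration of comparing every pair versus looking up by key, using the trip and route counts from the question; it makes no claim about how the Neo4J planner works internally, and all names in it are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Dict[str, str]


def join_by_scan(trips: List[Node], routes: List[Node]) -> Tuple[List[Tuple[str, str]], int]:
    """Compare every trip with every route (a cartesian product)."""
    pairs, comparisons = [], 0
    for trip in trips:
        for route in routes:
            comparisons += 1
            if trip["route-id"] == route["id"]:
                pairs.append((trip["id"], route["id"]))
    return pairs, comparisons


def join_by_index(trips: List[Node], routes: List[Node]) -> Tuple[List[Tuple[str, str]], int]:
    """Look trips up by route-id, one lookup per route."""
    index: Dict[str, List[Node]] = defaultdict(list)
    for trip in trips:
        index[trip["route-id"]].append(trip)
    pairs, lookups = [], 0
    for route in routes:
        lookups += 1
        for trip in index.get(route["id"], []):
            pairs.append((trip["id"], route["id"]))
    return pairs, lookups


routes = [{"id": f"r{i}"} for i in range(10)]
trips = [{"id": f"t{i}", "route-id": f"r{i % 10}"} for i in range(4096)]
scan_pairs, comparisons = join_by_scan(trips, routes)
index_pairs, lookups = join_by_index(trips, routes)
```

both functions produce the same 4096 pairs, but the scan needs 40960 comparisons where the keyed version needs 10 lookups; with 83322 stoptimes the gap grows accordingly.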
with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes;
of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries
takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.
right now none of my queries (especially the node CREATE statements and updates for stop distances) use
parametrized queries, which is another potential source for performance gains.
as for the ~label field and also the question why i use dashes in names where underscores would be more
convenient, well, that's a long story about what i perceive as good and practical naming that sometimes clashes
with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more
interesting is the question: why is there a ~label attribute that repeats what the element label says (what
you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), take
advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make it so the labels do
appear in the returned values.
mind you, labels are so central to graph thinking the Neo4J way, but in query results, labels are
conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set,
then that edge will arrive as an empty
object, telling you only that there is something but not what. so i decided to duplicate the
label on each single node and each single edge. not an optimal solution but at least now i get an informative
graph display in the Neo4J browser.
as for how to express query use-cases, that's an active field of research for me right now. i guess it will
all start with a 'field of interest', like 'show all Berlin subway stops', or 'all buses departing within
the next 15 minutes from a bus stop near me'. the data already allows one to see which stops are directly connected
by a subway line, their geographical distance, what services are present and what routes they take. the idea
is to grab the data and present them in novel, usable and beautiful ways. 9292 is quite
close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.

Efficient way to query

My app has a class that saves pictures that users upload. Each object in the class has a city property that holds the name of the city where the picture was taken, and a like property that tracks the number of likes.
I want to be able to send a query that returns one picture per city and each picture should have the highest ranking of likes in the city it belongs to. How can I do that?
One way which I first thought about is doing multiple queries by fetching the most liked picture of a city and save it in an array, and then do the same to other cities.
However, each country has more than one city, thus it's not that efficient.
Parse doesn't support the ordinary operations used in databases. Besides, I tried to use a compound query. Unfortunately, I can't set limit or ordering on the subqueries. Any good solution for this?
It would be easy using group by. Unfortunately, Parse does not support "select distinct" or "group by" features.
As you've suggested you need to fetch for each country all the cities, and for each one get the top most rated photo.
BUT, since Parse has strict limits on the execution time of a request (3 sec for an event listener, 7 sec for a custom function), I suggest you do this in a background job, saving the top-rated photo for each city in a new table. This way you can easily query the db from the client. Background jobs can run for up to 15 minutes before Parse drops them, so you can run that kind of query without timeouts.
Hope it helps
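The core of such a background job is a single pass that keeps the most-liked photo seen so far for each city. Below is a minimal sketch of that step in Python, assuming the photos have already been fetched into memory. The Photo type and its field names are illustrative (likes stands for the like property), and no Parse API calls are shown.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Photo:
    id: str
    city: str
    likes: int


def top_photo_per_city(photos: Iterable[Photo]) -> Dict[str, Photo]:
    """Return the most-liked photo for every city in one pass."""
    best: Dict[str, Photo] = {}
    for photo in photos:
        current = best.get(photo.city)
        if current is None or photo.likes > current.likes:
            best[photo.city] = photo
    return best


photos = [
    Photo("a", "Rome", 12),
    Photo("b", "Rome", 40),
    Photo("c", "Oslo", 7),
    Photo("d", "Oslo", 3),
]
top = top_photo_per_city(photos)
```

The resulting mapping is what the job would write to the new table, one row per city.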

Creating DAX peer measure

The scenario:
We are an insurance brokerage company. Our fact table is claim metrics current table. This table has unique rows for multiple claim sid-s, so that, countrows(claim current) gives the correct count of the number of unique claims. Now, this table also has clientsid and industrysid. The relation between client and industry here is that, 1 industry can have multiple clients, and 1 client can belong to only 1 industry.
Now, let us consider a fact called claimlagdays, which is present in the table at the granularity of claimsid.
Now, one requirement is that, we need to find out "peer" sum(claimlagdays). This, for a particular client, is basically calculated as:
sum(claimlagdays) for the industry of the client being filtered (minus) sum(claimlagdays) for this particular client. Let's call this measure A.
Similar to above, we need to calculate "peer" claim count , which is claimcount for the industry of the client being filtered (minus) claimcount for this particular client.
Let's call this measure B.
In the final calculation, we need to divide A by B, to get the "peer" average lag days.
So basically, the hard part here is this: find the industry of the particular client which is being filtered for, and then, apply this filter to the fact table (claim metrics current) to find out the total claim count/other metric only for this industry. then of course, subtract the client figure from this industry figure to get the "peer" measure. This has to be done for each row, keeping intact any other filters which might be applied in the slicer(date/business unit, etc.)
There are a couple of other static filters which need to be considered, which are present in other tables, such as "Claim Type"(=Indemnity/Medical) and Claim Status(=Closed).
My solution:
For measure B
I tried creating a calculated column, as:
Claim Count_WC_MO_Industry=COUNTROWS(FILTER(FILTER('Claim Metrics Current',RELATED('Claim WC'[WC Claim Type])="Medical" && RELATED('Coverage'[Coverage Code])="WC" && RELATED('Claim Status'[Status Code])="CL"),EARLIER('Claim Metrics Current'[IndustrySID])='Claim Metrics Current'[IndustrySID]))
Then I created the measure
Claim Count - WC MO Peer:=CALCULATE(SUM([Claim Count_WC_MO_Industry])/[Claim - Count])- [Claim - Count WC MO]
{I did a sum because, tabular model doesn't directly allow me to use a calculated column as a measure, without any aggregation. And also, that wouldn't make any sense since tabular model wouldn't understand which row to take}
The second part of the above measure is obviously, the claim count of the particular client, with the above-mentioned filters.
Problem with my solution:
The figures are all wrong. I am not getting a client-wise or year-wise separation of the industry counts or the peer counts. I am only getting a sum of all the industry counts in the measure.
My suspicion is that this is happening because of the sum which is being done. However, I don't really have a choice, do I, as I can't use a calculated column as a measure without some aggregation...
Please let me know if you think the information provided here is not sufficient and if you'd like me to furnish some data (dummy). I would be glad to help.
So assuming that you are filtering for the specific client via a frontend, it sounds like you just want
ClientLagDays :=
CALCULATE (
SUM ( 'Claim Metrics Current'[Lag Days] ),
Static Filters Here
)
Just your base measure of appropriate client lag days, including your static filters.
IndustryLagDays :=
CALCULATE (
[ClientLagDays],
ALL ( 'Claim Metrics Current'[Client] ),
VALUES ( 'Claim Metrics Current'[IndustrySID] )
)
This removes the filter on client but retains the filter on Industry to get the industry-wide total of lag days.
PeerLagDays:=[IndustryLagDays]-[ClientLagDays]
Straightforward enough.
And then repeat for claim counts, and then take [PeerLagDays] / [PeerClaimCount] for your [Average Peer Lag Days].
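To check the measures against known numbers, the same arithmetic can be worked through outside the model. The sketch below uses made-up claims and a hypothetical peer_average_lag_days helper; it mirrors the industry-minus-client logic only and leaves out the static filters and slicers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Claim:
    client: str
    industry: str
    lag_days: int


def peer_average_lag_days(claims: List[Claim], client: str) -> Optional[float]:
    """(industry lag days - client lag days) / (industry claims - client claims)."""
    industries = {c.industry for c in claims if c.client == client}
    in_industry = [c for c in claims if c.industry in industries]
    own = [c for c in in_industry if c.client == client]
    peer_lag = sum(c.lag_days for c in in_industry) - sum(c.lag_days for c in own)
    peer_count = len(in_industry) - len(own)
    return peer_lag / peer_count if peer_count else None


claims = [
    Claim("A", "Retail", 10),
    Claim("A", "Retail", 20),
    Claim("B", "Retail", 30),
    Claim("C", "Retail", 50),
    Claim("C", "Retail", 70),
    Claim("D", "Mining", 100),
]
peer_avg_a = peer_average_lag_days(claims, "A")  # (180 - 30) / (5 - 2)
peer_avg_d = peer_average_lag_days(claims, "D")  # D has no peers in Mining
```

For client A the industry total is 180 lag days over 5 claims, A's own share is 30 over 2 claims, so the peer average is 150 / 3 = 50. A client with no peers has no defined peer average, which the sketch reports as None.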
