I have a system that regularly downloads files and parses them. However, sometimes something might go wrong with the parsing and I have the task to create a Prometheus alert for when a certain file fails. My
initial idea is to create a custom counter alert in Prometheus - something like
processed_files_total and use status as label because if the file fails it has FAILED status and if it succeeds - SUCCESS, so supposedly the alert should look like
increase(processed_files_total{status=FAILED}[24h]) > 0 and I hope that this will alert me in case there is at least 1 file with failed status.
The problem comes from the fact that I also want to have the
exact filename in the alert message and since each file has a unique name I'm almost sure that it is not a good idea to put it as label e.g. filename={filename} - According to Prometheus docs -
Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
is there any other way I can achieve getting the filename from the alert or this is the way to go ?
It's a good question.
I think the correct answer is that the alert should notify you that something failed and the resolution is to go to the app's logs to identify the specific file(s) that failed.
Lightning won't strike you for using the filename as a label value in Prometheus if you really must but, I think, as you are, using an unbounded value should give you pause as to whether you're abusing the tool.
Metrics seem intrinsically (hunch) about monitoring aggregate state (an unusual number of files are failing) rather than specific (why did this one fail); logs and tracing tools help with the specific cases.
Related
I'm working on a Power Automate flow where the flow is supposed to connect to Planner, get the tasks which are due tomorrow and send a message to an MS Teams Channel.
I got the entire flow working except of one thing - getting the names of the person(s) whom the task is assigned.
Here's my current flow:
Getting the list of Tasks from Planner,
Filter those which have a Due Date set,
Filter the ones which have the Due Date tomorrow,
Get the Name of the Task Creator using Get User Profile,
Send the data into an MS Teams Channel.
This all works perfectly fine. However I also need to get the names of the Assigned users. I understand that they come as an array. And when I try to add it to the same Get User Profile it's getting wrapped into an unnecessary "Apply to Each" which breaks everything.
Does anyone have a clue how this can be solved? Basically I need only 1 Assignee as we are not going to assign the same task to more than 1 person.
Any help would be much appreciated!
In order to avoid the loop (which is also a valid approach, long story) then you can use an expression to get the first (in case there are multiple) assigned user.
This is an example where you don't need to loop ...
... you can see I've initialised a (string) variable at the top which will hold the user ID GUID.
... then further down in the Set Assigned To operation, this is the expression I use ...
item()?['_assignments'][0]['userid']
That gets the first user and then the associated userid property. You can then pass that into the Get user profile (V2) task ...
Obviously, you need to adapt that to your flow but I hope that makes sense.
I am trying to push a event towards GA3, mimicking an event done by a browser towards GA. From this Event I want to fill Custom Dimensions(visibile in the user explorer and relate them to a GA ID which has visited the website earlier). Could this be done without influencing website data too much? I want to enrich someone's data from an external source.
So far I cant seem to find the minimum fields which has to be in the event call for this to work. Ive got these so far:
v=1&
_v=j96d&
a=1620641575&
t=event&
_s=1&
sd=24-bit&
sr=2560x1440&
vp=510x1287&
je=0&_u=QACAAEAB~&
jid=&
gjid=&
_u=QACAAEAB~&
cid=GAID&
tid=UA-x&
_gid=GAID&
gtm=gtm&
z=355736517&
uip=1.2.3.4&
ea=x&
el=x&
ec=x&
ni=1&
cd1=GAID&
cd2=Companyx&
dl=https%3A%2F%2Fexample.nl%2F&
ul=nl-nl&
de=UTF-8&
dt=example&
cd3=CEO
So far the Custom dimension fields dont get overwritten with new values. Who knows which is missing or can share a list of neccesary fields and example values?
Ok, a few things:
CD value will be overwritten only if in GA this CD's scope is set to the user-level. Make sure it is.
You need to know the client id of the user. You can confirm that you're having the right CID by using the user explorer in GA interface unless you track it in a CD. It allows filtering by client id.
You want to make this hit non-interactional, otherwise you're inflating the session number since G will generate sessions for normal hits. non-interactional hit would have ni=1 among the params.
Wait. Scope calculations don't happen immediately in real-time. They happen later on. Give it two days and then check the results and re-conduct your experiment.
Use a throwaway/test/lower GA property to experiment. You don't want to affect the production data while not knowing exactly what you do.
There. A good use case for such an activity would be something like updating a life time value of existing users and wanting to enrich the data with it without waiting for all of them to come in. That's useful for targeting, attribution and more.
Thank you.
This is the case. all CD's are user Scoped.
This is the case, we are collecting them.
ni=1 is within the parameters of each event call.
There are so many parameters, which parameters are neccesary?
we are using a test property for this.
We also got he Bot filtering checked out:
Bot filtering
It's hard to test when the User Explorer has a delay of 2 days and we are still not sure which parameters to use and which not. Who could help on the parameter part? My only goal is to update de CD's on the person. Who knows which parameters need to be part of the event call?
I'm not sure if this sort of aggregation is best done after being indexed by elasticsearch or if logstash is a good place to do it.
We are logging information about commands run against a server. Each set of metrics regarding a single command is logged as a single log event, there are multiple 'metric sets' per command. Each metric is of its own document type in ES (currently at least). So we will have multiple events across multiple documents regarding one command run against the server.
Each of these events will have a 'cmdno' field which is a temporary id given to the command we are logging about. Once the command has finished with all events logged, the 'cmdno' may be reused for other commands.
Is it possible to use logstash 'aggregate' plugin to link the events of a single command together using the 'cmdno'? (or any plugin)
All events that pertain to a single command will have the same timestamp + cmdno. I would like to add a UUID to the events as a permanent unique id for that command, so that a single query will give us all events for that single command.
Was thinking along the lines of:
if [cmdno] {
aggregate {
task_id => "%{cmdno}"
code => "map['cmdid'] ||= <some uuid generator>; event['cmdid'] == map['cmdid'] ? event['#timestamp'] == map['<stored timestamp for previous event from the same command>'] : continue"
}
}
Just started learning the ELK stack, not entirely sure as to the programming contructs logstash affords me yet.
I don't know if there is a better way to relate these events, this seemed the most suitable for our needs, if there are more ELK'y methods please let me know, they do need to stay as separate documents of different types though.
Any help much appreciated, let me know if I am missing anything.
Cheers,
Brett
I am monitoring Message Count attribute of JMS Queue using Nagios. For that I am using check_jmx plugin and it gives the output as "JMX OK MessageCount=400". I configured graph for this service, but when click on graph icon it shows no data available. This service not generating any rrd file. How can I configure graph for my message count monitoring service? In graph i want to show message count/hour. Whether I have to use another plugin?
Nagios graphing addons such as PNP4Nagios use the performance data of the plugin output which is everything after the |. Run the plugin on the command line and see if it's outputting performance data, and try different verbosity options by adding -vvv to check_jmx.
More info on performance data
check_jmx usage
As i can't tell from your text, which plugin you are using in order to gernerate the graphs, i myself recommend PNP4Nagios.
Once installed it is working really great.
for your problem:
and it gives the output as "JMX OK MessageCount=400".
if this is the messages / hour, you dont even have to change anything.
if it's not, you might include the code of the plugin in its current version on your nagios to your question or modyfie yourself ( store / grab messagecount and timestamp in order to calculate your messages / hour )
I have datbase with thousands of rows of error logs and their description.This error log is for an application that running 24/7. I want to create a dashboard/UI to view the current common errors happening for prodcution support.
The problem I am having is that even though there are lot of common errors, the error description differs by the transcation ID or user ID or things that are unique for that sigle prcoess.
e.g Error trasaction XYz failed for user 233
e.g 2. Error trasaction XYz failed for user 567
I consider these two erros to be same. So I want to a program that will go through the new error logs and classify them into groups. I am trying to use "edit distance" but its very slow.Since I alraedy have old error logs, i am trying to think of solutions using that information too. Any thoughts?
I'm assuming that the error messages are generated by a program, and so they probably fall into very specific pattern.
That means you don't have to do anything particularly complex. Just parse the error messages: use regular expressions (or maybe something more powerful) to split the messages into tuples. Then group or count or do something with the individual fields. For example, you could do a regex like "Error transaction ([A-Z]*) failed for user ([0-9]*)". You could then make a histogram of the error codes (first capture group) or users (second capture group).
There are other metrics (apart from Levenshtein) which might be more appropriate. Have you considered Cosine Similarity?
SimMetrics is an F/OSS library that offers an extensive collection of similarity algorithms and their corresponding cost functions.