PromQL query to get the maximum of two metrics - max

In a Prometheus timeseries database there are two sets of data (for M1 and M2) collected with following schema
<timestamp>, <M1> <Labels L1, L2, L3>
<timestamp>, <M2> <Labels L1, L2, L3, L4>
Write a PromQL query that creates a new time-series N, that tracks the max(M1, M2) for each time period the query is run.
For example, at each evaluation timestamp, N should contain whichever of M1 and M2 is larger at that timestamp.
I tried using max() but it takes only 1 argument.

Try the following PromQL:
max(metric1 OR metric2)

PromQL allows selecting time series with multiple different names via a regexp filter on the __name__ pseudo-label (see the Prometheus querying docs). For example, the following query selects all the series with names m1 and m2:
{__name__=~"m1|m2"}
So, the following query would select the max value over all the series with names m1 and m2 at each requested timestamp:
max({__name__=~"m1|m2"})
See the following additional info:
/api/v1/query docs
/api/v1/query_range docs
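For example, here is a minimal Python sketch of running that query through /api/v1/query (assuming a Prometheus server at localhost:9090 and the requests library; both are placeholders for your own setup):

import requests

# Hypothetical endpoint; point this at your Prometheus server.
PROM_URL = "http://localhost:9090/api/v1/query"

resp = requests.get(PROM_URL, params={"query": 'max({__name__=~"m1|m2"})'})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])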

Related

Neo4j building initial graph is slow

I am trying to build out a social graph between 100k users. Users can sync other social media platforms or upload their own contacts. Building each relationship takes about 200ms. Currently, I have everything uploaded on a queue so it can run in the background, but ideally, I can complete it within the HTTP request window. I've tried a few things and received a few warnings.
I added an index to the field pn.
I'm getting the warning "This query builds a cartesian product between disconnected patterns." I understand why I am getting this warning, but no relationship exists yet, and that's what I am building in this initial call.
MATCH (p1:Person {userId: "....."}), (p2:Person) WHERE p2.pn = "....." MERGE (p1)-[:REL]->(p2) RETURN p1, p2
Any advice on how to make it faster? Ideally, each relationship creation is around 1-2ms.
You may want to EXPLAIN the query and make sure that NodeIndexSeeks are being used, and not NodeByLabelScan. You also mentioned an index on :Person(pn), but you have a lookup on :Person(userId), so you might be missing an index there, unless that was a typo.
Regarding the cartesian product warning, disregard it; the cartesian product is necessary in order to get the nodes to create the relationship. This should be a 1 x 1 = 1 row operation, so it's only going to be costly if multiple nodes are being matched per side, or if index lookups aren't being used.
If these are part of some batch load operation, then you may want to make your query apply in batches. So if 100 contacts are being loaded by a user, you do NOT want to execute 100 separate queries, each adding a single contact. Instead, pass the list of contacts as a parameter, then UNWIND the list and apply the query once to process the entire batch.
Something like:
UNWIND $batch as row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2})
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
It's usually okay to batch 10k or so entries at a time, though you can adjust that depending on the complexity of the query.
Check out this blog entry for how to apply this approach.
https://dzone.com/articles/tips-for-fast-batch-updates-of-graph-structures-wi
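For reference, here is a minimal sketch of passing such a batch parameter from Python (assuming the official neo4j driver; the URI, credentials, and sample values are placeholders to replace with your own):

from neo4j import GraphDatabase

# Hypothetical URI and credentials; replace with your own.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
UNWIND $batch AS row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2})
MERGE (p1)-[:REL]->(p2)
"""

# e.g. up to ~10k contact pairs per call, as suggested above
batch = [{"p1": "111", "p2": "222"}, {"p1": "111", "p2": "333"}]

with driver.session() as session:
    session.run(cypher, batch=batch)
driver.close()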
You can use the index you created on Person by suggesting a planner hint.
Reference: https://neo4j.com/docs/cypher-manual/current/query-tuning/using/#query-using-index-hint
CREATE INDEX ON :Person(pn);
MATCH (p1:Person {userId: "....."})
WITH p1
MATCH (p2:Person) USING INDEX p2:Person(pn)
WHERE p2.pn = "....."
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2

Optimize Cypher Query for Movie Recommendation on Large Dataset

I'm currently working on movie recommendation using the MovieLens 20m dataset, after reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/. Node Movie connects to Genre with the relationship hasGenre, and Node Movie connects to User with the relationship hasRating. I'm trying to retrieve all movies that are most highly co-rated (ratings > 3.0) with a query movie (e.g. Toy Story) and that share all genres with it. Here's my Cypher query:
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre)
WITH inputGenres, r, o, movie, COLLECT(genre) AS genres
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
However, it seems that my system cannot handle it (16 GB of RAM, 4th-gen Core i7, and an SSD). When I run the query, RAM usage peaks at 97% and then Neo4j shuts down unexpectedly (probably due to the heap size, or else the RAM size).
Is my query correct? I'm a newbie in Neo4j, so I may well have written it incorrectly.
How can I optimize such a query?
How can I tune Neo4j so it can handle a large dataset for this query on my system's spec?
Thanks in advance.
First, your Cypher can be simplified for more efficient planning by only matching what we need and handling the rest in the WHERE clause (so that filtering can possibly be done while matching):
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)
WHERE (r.rating>3 and o.rating>3) AND ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
Now, if you don't mind adding data to the graph to find the data you want, another thing you can do is split the query into tiny bits and "cache" the results. So, for example:
// Cypher 1
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (movie:Movie)
WHERE ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
// Merge so that multiple runs don't create extra copies
MERGE (inputMovie)-[:isLike]->(movie)
// Cypher 2
MATCH (movie:Movie)<-[r:hasRating]-(user)
WHERE r.rating>3
// Merge so that multiple runs don't create extra copies
MERGE (user)-[:reallyLikes]->(movie)
// Cypher 3
MATCH (inputMovie:Movie{movieId: 1})<-[:reallyLikes]-(user)-[:reallyLikes]->(movie:Movie)<-[:isLike]-(inputMovie)
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC

Conditional Mapping in Talend

I have created a simple job in Talend that performs an inner join between the data in two Excel sheets and then dumps the result into an output Excel sheet (the job layout was shown in a screenshot that is not reproduced here).
The mapping used in tMap was also shown in a screenshot (omitted).
However, the additional challenge for me now is that I have to perform this mapping only if the column value in that row is not NULL. E.g. there is a mapping row1.RECID = row2.RECID, but this should only apply if row2.RECID is not NULL.
How do I achieve this in Talend? I have experimented a lot with tMap expressions but can't get it right.
Here is a small sample input and its corresponding expected output.
Suppose my input has the values:
v1, v2, v3, v4
1, A, O, 3
2, B, X, 4
3, C, X, 4
and the lookup has the values:
v1, v2, v3, v4
1, A, O, 3
2, null, X, 4
3, null, C, 4
2, null, X, null
Then the output should be:
v1, v2, v3, v4
1, A, O, 3
2, B, X, 4
2, B, X, 4
Before joining your input flows, you have to reject rows with null values. I have created a mapping based on the given sample data (screenshot omitted).
Try to map the maximum of values from row1, then put row2 with a left outer join.
If you want only the values which are in both row1 and row2, you can add a filter on row2 for that (but I guess that this is not what you want).
Talend does have a more elegant option that will allow the filtering of your data on multiple columns. Use the tSchemaComplianceCheck component where filtering out nulls and empty is as simple as clicking a couple of check boxes. This allows you to use your own schema to check against nulls and empty values and filter them out. The error rows go to a reject flow which you have the option of processing. If you do not wish to capture and process the rejects you can simply ignore them. Your main flow will only have the records that passed the compliance check. Here are some tips on using it:
In the tSchemaComplianceCheck component's Basic Settings view, click Custom Defined and it will show you each column. Make sure Nullable is unchecked, or else it will allow nulls to pass through.
In the Advanced Settings tab, check "Treat all empty string as NULL". This works in conjunction with the prior step to filter out both null and empty values.
In your Excel component, open the Advanced Settings tab and check "Stop reading on encountering empty rows".
Below is a screenshot showing the basic flow and settings (not reproduced here). You would link to a tMap instead of the tLogRow. If I have understood your problem correctly, I think you will find this is the ideal solution in Talend.

Equal distribution of couchdb document keys for parallel processing

I have a CouchDB instance where each document has a unique id (a string). I would like to go over each document in the db and perform some external operation based on its contents (for example, connecting to another web server to fetch specific details). However, instead of iterating over the documents sequentially, is it possible to first get a list of k buckets of document keys, each represented by a starting key and an ending key (the id being the key), and then query all documents in each bucket separately and run the external operation on each bucket's documents in parallel?
I currently use couchdb-python to access my db and views. For example, this is the code I currently use:
for res in db.view("mydbviews/id"):
    doc = db[res.id]
    do_external_operation(doc)  # Time-consuming operation
It would be great if I could do something like 'parallel for' for the above loop.
Assuming that you're only emitting one result per document in the view, running the view with start and end keys along with some Python parallelisation technique should be sufficient here. As @Ved says, the bigger issue is parallel processing rather than generating the subsets of documents. I'd recommend the multiprocessing module, like so:
import multiprocessing
import couchdb

# Hypothetical connection details; point these at your own server and database.
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

def work_on_subset(viewname, key_low, key_high):
    rows = db.view(viewname, startkey=key_low, endkey=key_high)
    for row in rows:
        pass  # Do your work here

viewname = '_design/designname/_view/viewname'
key_list = [('a', 'z'), ('1', '10')]  # Or whatever subsets you want
pool = multiprocessing.Pool(processes=10)  # Or however many workers you want
result = []
for (key_low, key_high) in key_list:
    result.append(pool.apply_async(work_on_subset, args=(viewname, key_low, key_high)))
pool.close()
pool.join()
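If you also need to derive the key ranges (buckets) themselves, one possible approach (my own sketch, not part of the answer above; it assumes fetching all document ids once via _all_docs is acceptable) is to slice the id list into k chunks:

all_ids = [row.id for row in db.view('_all_docs')]  # ids come back sorted by key
k = 10  # number of buckets; pick something close to your worker count
chunk = max(1, len(all_ids) // k)
key_list = [(all_ids[i], all_ids[min(i + chunk, len(all_ids)) - 1])
            for i in range(0, len(all_ids), chunk)]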

assigning IDs to hadoop/PIG output data

I'm working on a Pig script which performs heavy-duty data processing on raw transactions and comes up with various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total count and amount of the transactions).
My expected output should be two data files
1) Rollup data - like account A1 received 50 transactions from country AU.
2) Raw transactions - all 50 of the above transactions for A1.
My Pig script currently creates the output data in the following format:
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
Now the question here is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (e.g. ID 1 for the rollup record used as a foreign key for all 50 associated txns).
I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign non-unique IDs (no need for them to be sequential), or some other way to link this data?
EDIT (after using Enumerate from DataFu)
Here is the Pig script:
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
After running this, I am getting:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to generate ids that contain enough random bits to make the probability of collisions sufficiently small (see the birthday problem: http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits where n = # of ids that need to be generated.
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
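As a rough Python sketch of that rule of thumb (assuming log means log base 2; the numbers here are only an example):

import math
import random
import uuid

n = 1000000                                 # number of ids you expect to generate
bits = int(math.ceil(3 * math.log(n, 2)))   # rule of thumb: 3*log2(n) random bits
random_id = random.getrandbits(bits)

# Or simply use a random UUID, which carries 122 random bits:
uuid_id = uuid.uuid4()
print(random_id, uuid_id)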
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
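To illustrate the composite-key-plus-hash idea outside of Pig, here is a small Python sketch (a hypothetical helper, not the UDF described above; it uses md5 rather than Java's hashCode() so the value is stable across runs, and it is still not guaranteed unique):

import hashlib

def composite_id(account, country):
    # Build the composite key, e.g. "A1-AU", then derive a numeric id from it.
    key = "%s-%s" % (account, country)
    return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:15], 16)

print(composite_id("A1", "AU"))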
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
In case your IDs are numbers and you cannot use UUIDs or other string-based IDs:
There is a library of UDFs by LinkedIn (DataFu) with a very useful UDF, Enumerate. What you can do is group all records into a single bag and pass the bag to Enumerate. Here is the code from the top of my head:
-- register the jar that contains the Enumerate UDF, e.g.:
register /path/to/datafu-0.0.9.jar;
define Enumerate datafu.pig.bags.Enumerate('1');
inpt = load '....' ....;
allGrp = group inpt all;
withIds = foreach allGrp generate flatten(Enumerate(inpt));
