How to use Gremlin to get both node properties and edge names in one query? - go

I have been thrown into a pool of golang / gremlin / Neptune, and am able to get some things to work. Life is good - enough, but I am hoping there is a simple answer (which I have not been able to find) to what seems like a simple question.
I have 'obs' nodes with some properties, two of which are ('type','domain') and ('value','whitehouse.com).
Another set of nodes is 'attack' ('type','group') and ('value','Emotet'), along with other properties.
An observation node can have an edge pointing to one or more attack nodes. (and actually, other types of nodes as well.) These edges have a time-based property - when the observation was seen manifesting a certain type of attack.
I'm working in Go, using gremson to communicate with a Neptune db. In this environment you construct your query as a string and send it down the wire to Neptune, and get something called graphson back.
Thus, I construct this, and send it...
fmt.Sprintf("g.V().hasLabel('obs').has('value','%s').limit(1)", domain)
And I get back properties for a vector, in gremson. Were I using the console, all I would get back would be the id. Go figure.
Then I construct this, and send it...
fmt.Sprintf("g.V().hasLabel('obs').has('value','%s').limit(1).out()", domain)
and I get back the properties of the connected nodes, in graphson. Again, using the console I would only get back ids. No sweat.
What I would LIKE to do is to combine these two queries somehow so that I am not doing what seems to be like two almost identical lookups.
console-wise, assume both queries also have valueMap() or entityMap() tacked on the end. Is there any way to do them as one query?

There are many ways you could write this query. Here are a couple of options
g.V().hasLabel('obs').
has('value','%s').
limit(1).as('a').
out().as('b').
select('a','b')
or using project
g.V().hasLabel('obs').
has('value','%s').
limit(1).
project('a','b').
by().
by(out().fold())
My preference is for the project example as you will get the connected vertices back in a list.

Related

GA3 Event Push Neccesary fields in Request

I am trying to push a event towards GA3, mimicking an event done by a browser towards GA. From this Event I want to fill Custom Dimensions(visibile in the user explorer and relate them to a GA ID which has visited the website earlier). Could this be done without influencing website data too much? I want to enrich someone's data from an external source.
So far I cant seem to find the minimum fields which has to be in the event call for this to work. Ive got these so far:
v=1&
_v=j96d&
a=1620641575&
t=event&
_s=1&
sd=24-bit&
sr=2560x1440&
vp=510x1287&
je=0&_u=QACAAEAB~&
jid=&
gjid=&
_u=QACAAEAB~&
cid=GAID&
tid=UA-x&
_gid=GAID&
gtm=gtm&
z=355736517&
uip=1.2.3.4&
ea=x&
el=x&
ec=x&
ni=1&
cd1=GAID&
cd2=Companyx&
dl=https%3A%2F%2Fexample.nl%2F&
ul=nl-nl&
de=UTF-8&
dt=example&
cd3=CEO
So far the Custom dimension fields dont get overwritten with new values. Who knows which is missing or can share a list of neccesary fields and example values?
Ok, a few things:
CD value will be overwritten only if in GA this CD's scope is set to the user-level. Make sure it is.
You need to know the client id of the user. You can confirm that you're having the right CID by using the user explorer in GA interface unless you track it in a CD. It allows filtering by client id.
You want to make this hit non-interactional, otherwise you're inflating the session number since G will generate sessions for normal hits. non-interactional hit would have ni=1 among the params.
Wait. Scope calculations don't happen immediately in real-time. They happen later on. Give it two days and then check the results and re-conduct your experiment.
Use a throwaway/test/lower GA property to experiment. You don't want to affect the production data while not knowing exactly what you do.
There. A good use case for such an activity would be something like updating a life time value of existing users and wanting to enrich the data with it without waiting for all of them to come in. That's useful for targeting, attribution and more.
Thank you.
This is the case. all CD's are user Scoped.
This is the case, we are collecting them.
ni=1 is within the parameters of each event call.
There are so many parameters, which parameters are neccesary?
we are using a test property for this.
We also got he Bot filtering checked out:
Bot filtering
It's hard to test when the User Explorer has a delay of 2 days and we are still not sure which parameters to use and which not. Who could help on the parameter part? My only goal is to update de CD's on the person. Who knows which parameters need to be part of the event call?

Is there a way to combine a query and a command in CQRS?

I have a project built using CQRS, but I can't figure out how to implement one use case.
The user needs to be able to make a Query which will return a set of data for them to view. However, I also need to save the data they got at the same time.
Is there a way to do this within a Query without violating CQRS' principles? Or would the Query and Command need to be two separate API calls one after another?
In CQRS it is your client who can do both command and queries. This client is not necessary a piece of UI.
It can be an API endpoint handler, which would
receive a query
forward it to the query endpoint
wait for the answer
send an answer to the caller
send a command to store the answer
Is there a way to do this within a Query without violating CQRS' principles?
It depends.
If "save the data" means "make some change to the domain model"... well, that would be pretty weird.
Asking a question should not change the answer. -- Bertrand Meyer
On the other hand, logging/telemetry are pretty normal ways to track the activity of an application, so that should be fine.
There are some realities of a distributed system on an unreliable network that you need to be aware of (what should the behavior be if the telemetry system is not available? What are the consequences of recording queries that don't actually reach the client (because the network is unreliable).
As #VoiceOfUnreason stated, it may be somewhat strange to effect domain changes when querying data.
However, it may be that you could swop that around.
For instance, perhaps one could query a forecast of sorts. We would want to store that forecast. It then seems as though the query results in us having to save the result. This appears to break CQS at some level since each query would result in a change of state.
If we swop that around and first request a forecast via the domain handling and then that produces a result, or even a pointer to the result, then the query would be something you could perform on the data multiple times without "breaking" CQS.

Micro Services and noSQL - Best practice to enrich data in micro service architecture

I want to plan a solution that manages enriched data in my architecture.
To be more clear, I have dozens of micro services.
let's say - Country, Building, Floor, Worker.
All running over a separate NoSql data store.
When I get the data from the worker service I want to present also the floor name (the worker is working on), the building name and country name.
Solution1.
Client will query all microservices.
Problem - multiple requests and making the client be aware of the structure.
I know multiple requests shouldn't bother me but I believe that returning a json describing the entity in one single call is better.
Solution 2.
Create an orchestration that retrieves the data from multiple services.
Problem - if the data (entity names, for example) is not stored in the same document in the DB it is very hard to sort and filter by these fields.
Solution 3.
Before saving the entity, e.g. worker, call all the other services and fill the relative data (Building Name, Country name).
Problem - when the building name is changed, it doesn't reflect in the worker service.
solution 4.
(This is the best one I can come up with).
Create a process that subscribes to a broker and receives all entities change.
For each entity it updates all the relavent entities.
When an entity changes, let's say building name changes, it updates all the documents that hold the building name.
Problem:
Each service has to know what can be updated.
When a trailing update happens it shouldnt update the broker again (recursive update), so this can complicate to the microservices.
solution 5.
Keeping everything normalized. Fileter and sort in ElasticSearch.
Problem: keeping normalized data in ES is too expensive performance-wise
One thing I saw Netflix do (which i like) is create intermediary services for stuff like this. So maybe a new intermediary service that can call the other services to gather all the data then create the unified output with the Country, Building, Floor, Worker.
You can even go one step further and try to come up with a scheme for providing as input which resources you want to include in the output.
So I guess this closely matches your solution 2. I notice that you mention for solution 2 that there are concerns with sorting/filtering in the DB's. I think that if you are using NoSQL then it has to be for a reason, and more often then not the reason is for performance. I think if this was done wrong then yeah you will have problems but if all the appropriate fields that are searchable are properly keyed and indexed (as #Roman Susi mentioned in his bullet points 1 and 2) then I don't see this as being a problem. Yeah this service will only be as fast as the culmination of your other services and data stores, so they have to be fast.
Now you keep your individual microservices as they are, keep the client calling one service, and encapsulate the complexity of merging the data into this new service.
This is the video that I saw this in (https://www.youtube.com/watch?v=StCrm572aEs)... its a long video but very informative.
It is hard to advice on the Solution N level, but certain problems can be avoided by the following advices:
Use globally unique identifiers for entities. For example, by assigning key values some kind of URI.
The global ids also simplify updates, because you track what has actually changed, the name or the entity. (entity has one-to-one relation with global URI)
CAP theorem says you can choose only two from CAP. Do you want a CA architecture? Or CP? Or maybe AP? This will strongly affect the way you distribute data.
For "sort and filter" there is MapReduce approach, which can distribute the load of figuring out those things.
Think carefully about the balance of normalization / denormalization. If your services operate on URIs, you can have a service which turns URIs to labels (names, descriptions, etc), but you do not need to keep the redundant information everywhere and update it. Do not do preliminary optimization, but try to keep data normalized as long as possible. This way, worker may not even need the building name but it's global id. And the microservice looks up the metadata from another microservice.
In other words, minimize the number of keys, shared between services, as part of separation of concerns.
Focus on the underlying model, not the JSON to and from. Right modelling of the data in your system(s) gains you more than saving JSON calls.
As for NoSQL, take a look at Riak database: it has adjustable CAP properties, IIRC. Even if you do not use it as such, reading it's documentation may help to come up with suitable architecture for your distributed microservices system. (Of course, this applies if you have essentially parallel system)
First of all, thanks for your question. It is similar to Main Problem Of Document DBs: how to sort collection by field from another collection? I have my own answer for that so i'll try to comment all your solutions:
Solution 1: It is good if client wants to work with Countries/Building/Floors independently. But, it does not solve problem you mentioned in Solution 2 - sorting 10k workers by building gonna be slow
Solution 2: Similar to Solution 1 if all client wants is a list enriched workers without knowing how to combine it from multiple pieces
Solution 3: As you said, unacceptable because of inconsistent data.
Solution 4: Gonna be working, most of the time. But:
Huge data duplication. If you have 20 entities, you are going to have x20 data.
Large complexity. 20 entities -> 20 different procedures to update related data
High cohesion. All your services must know each other. Data model change will propagate to every service because of update procedures
Questionable eventual consistency. It can be done so data will be consistent after failures but it is not going to be easy
Solution 5: Kind of answer :-)
But - you do not want everything. Keep separated services that serve separated entities and build other services on top of them.
If client wants enriched data - build service that returns enriched data, as in Solution 2.
If client wants to display list of enriched data with filtering and sorting - build a service that provides enriched data with filtering and sorting capability! Likely, implementation of such service will contain ES instance that contains cached and indexed data from lower-level services. Point here is that ES does not have to contain everything or be shared between every service - it is up to you to decide better balance between performance and infrastructure resources.
This is a case where Linked Data can help you.
Basically the Floor attribute for the worker would be an URI (a link) to the floor itself. And Any other linked data should be expressed as URIs as well.
Modeled with some JSON-LD it would look like this:
worker = {
'#id': '/workers/87373',
name: 'John',
floor: {
'#id': '/floors/123'
}
}
floor = {
'#id': '/floor/123',
'level': 12,
building: { '#id': '/buildings/87' }
}
building = {
'#id': '/buildings/87',
name: 'John's home',
city: { '#id': '/cities/908' }
}
This way all the client has to do is append the BASE URL (like api.example.com) to the #id and make a simple GET call.
To remove the extra calls burden from the client (in case it's a slow mobile device), we use the gateway pattern with micro-services. The gateway can expand those links with very little effort and augment the return object. It can also do multiple calls in parallel.
So the gateway will make a GET /floor/123 call and replace the floor object on the worker with the reply.

Segmenting on users who have performed a behaviour not behaving as expected

I want to look at the effect of having performed a specific action sequence at any (tracked) time in the past on user retention and engagement.
The action sequence is that of performing an optional New User Flow.
This is signalled to Google Analytics via sending it appropriate events. That works fine. The events show up in reports as expected.
My problem is what happens to results when I used these events to create segments. I have tried two different ways of creating a segment based on this in Advanced Segmentations, via Conditions (defining the segment via the end event, filtered over users not sessions), and via Sequences (defining start and end events, again filtered over users not sessions).
What I get when I look at various retention/loyalty reports, using either of these segments, is ever so very clearly a result which is doing this segmentation within session, not across uses sessions. So for NUF completers , I am seeing all my loyalty/recency on Session 1, in which people are most likely to do the NUF, if they ever do it at all. This is not what I want. (Mind you it is something that could be really useful in other context, with another event! But not for the new user flow.)
What are my options for getting what I want? I see two possible ways forward:
Using custom dimensions, assigning a custom dimension value in the code when the New User Flow is completed. However I do not know if this will solve the cross-session persistence problem.
Injecting a UserID, which we do not currently do, and (somehow!) using the reports available when you inject a UserID to do this.
Are either of these paths plausible? Is there a better way forward? Is it silly to even try to do this in Google Analytics? I'm way more familiar with App Tracking solutions (e.g. Flurry, Mixpanel, DeltaDNA) which do this as a matter of course, than with Google Analytics, and the fact this is at the very least awkward in Google Analytics is coming a bit of a surprise.
thanks,
Heather

Duplicates when linkswalking riak using ripple

I'm working on a project where I use Riak with Ripple, and I've stumbled on a problem.
For some reason I get duplicates when link-walking a structure of links. When I link walk using curl I don't get the duplicates as far as I can see.
The difference between my curl based link-walk
curl -v http://127.0.0.1:8098/riak/users/2306403e5177b4716da9df93b67300824aa2fd0e/_,projects,0/_,tasks,1
and my ruby ripple/riak-client based link walk
result = Riak::MapReduce.new(self.robject.bucket.client).
add(self.robject.bucket,self.key).
link(Riak::WalkSpec.new({:key => 'projects'})).
link(Riak::WalkSpec.new({:key => 'tasks', :bucket=>'tasks'})).
map("function(v){ if(!JSON.parse(v.values[0].data).completed) {return [v];} else { return [];} }", {:keep => true}).run
is as far as I can tell the map at the end.
However the result of the map/reduce contains several duplicates. I can't wrap my head around why. Now I've settled for removing the duplicates based on the key, but I wish that the riak result wouldn't contain duplicates, since it seems like waste to remove duplicates at the end.
I've tried the following:
Making sure there are no duplicates in the links sets of my ripple objects
Loading the data without the map reduce, but the link walk contains duplicate keys.
Any help is appreciated.
What you're running into here is an interesting side-effect/challenge of Map/Reduce queries.
M/R queries don't have any notion of read quorum values, and they necessarily have to hit every object (within the limitations of input filtering, of course) on every node.
Which means, when N > 1, the queries have to hit every copy of every object.
For example, let's say N=3, as per default. That means, for each written object, there are 3 copies, one each on 3 different nodes.
When you issue a read for an object (let's say with the default quorum value of R=2), the coordinating node (which received the read request from your client) contacts all 3 nodes (and potentially receives 3 different values, 3 different copies of the object).
It then checks to make sure that at least 2 of those copies have the same values (to satisfy the R=2 requirement), returns that agreed-upon value to the requesting client, and discards the other copies.
So, in regular operations (reads/writes, but also link walking), the coordinating node filters out the duplicates for you.
Map/Reduce queries don't have that luxury. They don't really have quorum values associated with them -- they are made to iterate over every (relevant) key and object on all the nodes. And because the M/R code runs on each individual node (close to the data) instead of just on the coordinating node, they can't really filter out any duplicates intrinsically. One of the things they're designed for, for example, is to update (or delete) all of the copies of the objects on all the nodes. So, each Map phase (in your case above) runs on every node, returns the matched 'completed' values for each copy, and ships the results back to the coordinating node to return to the client. And since it's very likely that your N>1, there's going to be duplicates in the result set.
Now, you can probably filter out duplicates explicitly, by writing code in the Reduce phase, to check if there's already a key present and reject duplicates if it is, etc.
But honestly, if I was in your situation, I would just filter out the duplicates in ruby on the client side, rather than mess with the reduce code.
Anyways, I hope that sheds some light on this mystery.

Resources