Which of the following data structures is scalable in Elasticsearch?

I want to build an event analytics system, where I can record and query events that a user has done, for example on a website.
My naive idea of the data model was simply a collection of event documents, each event including the userid, event type, and so on. So I thought something like this:
{ userid: Joe, event: homepage }
{ userid: Mike, event: homepage }
{ userid: Joe, event: productsPage }
{ userId: Joe, event: accountSettings }
{ userId: Joe, event: checkout }
etc
But now I'm struggling to figure out how to do some of the queries I'm most likely to want to run.
For example, I want to say "Give me a list of all users who have visited the homepage AND the products page AND the checkout page"
It seems to me I would need to do this in my application code rather than in Elasticsearch, with something like:
Step 1: select all users who have done 'homepage'
Step 2: select all users who have done 'products page'
Step 3: select all users who have done 'checkout page'
Step 4: build a list of only those users who appear in all 3 lists.
With a userbase of 20 million users, I'd risk pulling huge lists of data into my application.
An alternative would be to have one document per user, so that Joe looks like
{ userid: Joe, event: [ homepage, productsPage, accountSettings, checkout ] }
and so on.
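(For what it's worth, with that per-user shape the AND query I'm after would presumably be something simple like the following - assuming an index called users and event mapped as a keyword / not_analyzed field:)
GET /users/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "event": "homepage" } },
        { "term": { "event": "productsPage" } },
        { "term": { "event": "checkout" } }
      ]
    }
  }
}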
But then that would involve updating this document every time the user did something. Since Elasticsearch writes a new document rather than updating in place, that would mean a horrendous amount of rewriting, given that each user might generate, say, 5000 events in a year, spread across different days. Not to mention the rewriting of the index.
Is there an idiomatic approach I'm missing for organising this data by user - one that can handle regular updates to each user and builds indexes that allow fast querying of the data by multiple criteria, e.g. users who have done eventA AND eventB AND eventC?
Many thanks for all your help!

You can use Kibana to visualize data stored in Elasticsearch.
You can keep the events in the shape you already have:
{ userid: Joe, event: homepage }
{ userid: Mike, event: homepage }
{ userid: Joe, event: productsPage }
{ userId: Joe, event: accountSettings }
{ userId: Joe, event: checkout }
etc
After storing your data in Elasticsearch, you can use Kibana to create visualizations that apply AND filters over the event field.
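If you also need to pull that list of users out of Elasticsearch directly (rather than only visualize it in Kibana), one option - a sketch only, assuming userid and event are indexed as keyword / not_analyzed fields, the index name events is illustrative, and your Elasticsearch version supports pipeline aggregations - is a terms aggregation on userid with a bucket_selector that keeps only the users who produced all three events:
GET /events/_search
{
  "size": 0,
  "query": {
    "terms": { "event": ["homepage", "productsPage", "checkout"] }
  },
  "aggs": {
    "users": {
      "terms": { "field": "userid", "size": 10000 },
      "aggs": {
        "distinct_events": {
          "cardinality": { "field": "event" }
        },
        "did_all_three": {
          "bucket_selector": {
            "buckets_path": { "count": "distinct_events" },
            "script": "params.count == 3"
          }
        }
      }
    }
  }
}
With 20 million users you would still need to page through the user buckets (for example with a composite aggregation), but the AND filtering itself happens inside Elasticsearch instead of in your application.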

Related

Is there any way to limit the form of the query in GraphQL

For example, if there are two types User and Item
type User {
  items: [Item!]!
}
type Item {
  id: ID!
  name: String!
  price: Int!
}
Suppose a user has the PARTNER role.
I want to allow the query to be called only in the form shown below.
query Query1 {
  user {
    items {
      name
    }
  }
}
If the user calls the query in any other form, I want to indicate that they don't have permission.
query Query2 {
  user {
    items {
      id
      name
    }
  }
}
In short: if (Query1 != Query2) throw new Error;
Your question is a bit hard to follow, but a couple of things:
A GraphQL server is stateless - you cannot (and really should not) have a query behave differently based on a previous query. (If there's a mutation in between, sure, but not two queries back to back.)
Access management is normally implemented in your resolvers. You can have the resolver for the item id check whether the user making the query has the right to see it, and return an error if they don't have access.
Note that it can be bad practice to hide the id of objects from queries, as ids are used as keys for caching on the client.
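If you do want to restrict specific fields rather than whole query shapes (bearing in mind the caveat above about hiding id), one common pattern is a field-level auth check. The sketch below uses a custom @hasRole directive - this is not built into GraphQL; it assumes your server wires the directive up (for example with graphql-tools schema directives) or performs the equivalent check in the field's resolver:
directive @hasRole(role: Role!) on FIELD_DEFINITION

enum Role {
  ADMIN
  PARTNER
}

type Item {
  id: ID!
  name: String!
  # only ADMIN users may select price; a PARTNER gets a permission error for this field
  price: Int! @hasRole(role: ADMIN)
}
With this approach the rest of the query still resolves normally; only the restricted field errors out for users without the required role.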

GraphQL Pagination | The very first request

Following the connection-based model for pagination in GraphQL, I have the following simplified schema.
type User {
  id: ID!
  name: String!
}
type UserConnection {
  totalCount: Int
  pageInfo: PageInfo
  edges: [UserEdge]
}
type UserEdge {
  cursor: String
  node: User
}
type PageInfo {
  lastCursor: Int
  hasNextPage: Boolean
}
type Query {
  users(first: Int, after: String): UserConnection
}
Consider the following routes within the SPA front-end:
/users - once the user hits this page, I fetch the first 10 records from the top of the list, and from there I can paginate by reusing a cursor retrieved from the first response.
/user/52 - here I'd like to show the 10 records that start right from the position of user 52.
Problem: what are the possible ways to retrieve a particular subset of records on the very first request? At this point I don't have any cursor, so I can't construct something like
query GetTenUsersAfter52 {
  users(first: 10, after: "????") { # struggling to pass anything as a cursor...
    edges {
      node {
        name
      }
    }
  }
}
What I've already tried (a possible solution): I know that on the back-end the cursor is an encoded value of the record's _id in the DB. So, being on /user/52, I can make an individual request for that particular user, grab its id, compute a cursor on the front-end, and pass it to the back-end in the query above.
But personally I see a couple of disadvantages with this approach:
I'm exposing how my cursor is computed to the front-end, which is bad: if I ever need to change that procedure, I have to change it on both the front-end and the back-end...
I don't want to add another query field for an individual user simply because I need its id to pass to the users query field.
And I don't want to make two API calls for this either...
This is a good example of how Relay-style pagination can be limiting. You'll hit a similar scenario with create mutations, where manually adding a created object into the cache ends up screwing up your pagination because you won't have a cursor for the created object.
As long as you're not actually using Relay client-side, one solution is to just abandon cursors altogether. You can keep your before and after fields, but simply accept the id (or _id, or whatever your PK is) instead of a cursor. This is what I ended up doing on a recent project and it simplified things significantly.
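For example (assuming the schema from the question, with after changed to accept a raw user id instead of an opaque cursor), the very first request for /user/52 becomes trivial:
query GetTenUsersAfter52 {
  users(first: 10, after: "52") {
    edges {
      node {
        id
        name
      }
    }
  }
}
The server then looks up the record whose id (or _id) is "52" and returns the 10 records that follow it.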

Apollo/React mutating two related tables

Say I have two tables, one containing products and the other containing prices.
In GraphQL the query might look like this:
option {
  id
  price {
    id
    optionID
    price
    date
  }
  description
}
I present the user with a single form (in React) where they can enter the product detail and price at the same time.
When they submit the form I need to create an entry in the "product" table and then create a related entry in the "price" table.
I'm very new to GraphQL, and to React for that matter, and am finding it a steep learning curve. I've been following an Apollo tutorial and reading the docs, but so far the solution to this task remains a mystery!
Could someone put me out of my misery and give me, or point me in the direction of, the simplest example of handling the mutations necessary for this?
Long story short, that's something that should actually be handled by your server if you want to optimize for as few requests as possible.
Problem: The issue here is that you have a dependency. You need the product to be created first, and then, using that product's ID, relate it to a new price.
Solution: The best way to implement this on the server is to add another field to your mutation input that lets you supply the details for Price alongside the Product in the same request. This is called a "nested create" on Scaphold.
For example:
// Mutation
mutation CreateProduct ($input: CreateProductInput!) {
  createProduct(input: $input) {
    changedProduct {
      id
      name
      price {
        id
        amount
      }
    }
  }
}
// Variables
{
  "input": {
    "name": "My First Product",
    "price": {
      "amount": 1000
    }
  }
}
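The matching input types could be declared along these lines (the type and field names are illustrative - Scaphold generates its own input types for nested creates):
input CreateProductInput {
  name: String!
  # nested create: the price is created in the same request and related to the new product
  price: CreatePriceInput
}

input CreatePriceInput {
  amount: Int!
}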
Then, on the server, you can parse the price object out of your resolver arguments and create the new price while creating the product, relating the two in one go.
Hope this helps!

How bad would it be to have nested mutations?

I am aware that it would be considered an anti-pattern, but why exactly?
mutation {
  createUser(name: "john doe") {
    addToTeam(teamID: "123") {
      name,
      id
    },
    id
  }
}
Wouldn't it be more convenient than two HTTP calls?
mutation {
  createUser(name: "john doe") {
    id, # we store the ID
  }
}
mutation {
  addToTeam(userID: id, teamID: "123") {
    name,
    id,
  }
}
If you have a relation between Team and User, you could expose this API:
Create user, relate to existing team
mutation {
  createUser(name: "john doe", teamId: "team-id") {
    id
    team {
      id
    }
  }
}
Create new user and new team
mutation {
  createUser(name: "john doe", team: {name: "New team"}) {
    id
    team {
      id
    }
  }
}
This is exactly how the Graphcool API handles this as shown in this blog article. You can find another example in the documentation.
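A schema exposing both variants could look roughly like this (the type and argument names are illustrative; Graphcool's generated schema may differ):
type Mutation {
  # pass teamId to relate to an existing team, or team to create a new one in the same request
  createUser(name: String!, teamId: ID, team: CreateTeamInput): User
}

input CreateTeamInput {
  name: String!
}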
There are two reasons why this is an anti-pattern:
First, there are two atomic operations here; each may involve extra logic around authentication and validation, and each may yield different errors. Mixing them together can add complexity.
For example, say a team can only have 10 people and it has reached its maximum. Should the composed operation fail altogether? Should we create the user but not add them to the team? What would the response look like?
Second, lumping two operations together in this way can leak application logic to the client. One can be tempted to use such mutations to express 'when X happens, Y should also happen'. For instance, when adding a new line to an invoice, the total should update. This should really be one mutation, addLineToInvoice, with the logic residing on the server.
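A hypothetical addLineToInvoice mutation (the field names are purely illustrative) might then look like:
mutation {
  addLineToInvoice(invoiceId: "invoice-1", line: { description: "Widget", amount: 500 }) {
    id
    total # recalculated on the server, not by the client
    lines {
      description
      amount
    }
  }
}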
In a way, the command side of an API is better off being process (or action) centric rather than data centric. If your API calls are focused on data manipulation, you risk loading the client with business logic that should live on the server. You may also lose out on quite a few goodies, like middleware (which is great for cross-cutting concerns such as permissions and logging).

Elasticsearch: document relationship

I'm building an Elasticsearch autocomplete-as-you-type feature.
I'm using features like ngrams and other tools to create the analyzer I need.
Currently I'm breaking my head over how to index the following data.
Let's say I have a Payments type; each document in this type looks like this:
{
  ..elastic meta data..
  paymentId: 123453425342,
  providerAccount: {
    id: 123456,
    firstName: Alex,
    lastName: Web
  },
  consumerAccount: {
    id: 7575757,
    firstName: John,
    lastName: Doe
  },
  amount: 556,
  date: 342523454235345 (some unix timestamp)
}
So basically this document represents not only the payment itself, but also the payment's relationships - the two entities related to it.
A payment always has a provider and a consumer.
I need this data in the payment document because I want to show it in the UI.
By indexing it like this, handling updates to a consumer or provider could become a big pain, because each time one of them changes a property I have to update all the payments that reference that entity.
Another possible solution is to store only the ids of the consumers/providers, query payments first and then issue two more queries to retrieve the needed entity fields, but I'm not sure about this because I'm making ajax requests on every character typed, so performance is a concern.
I have also looked into the parent/child relationship solution, which basically fits my case, but I wasn't able to figure out whether I can also retrieve the parent (consumer/provider) fields while querying the child (payment).
What would you suggest?
Thanks!
Yes, you can retrieve the parent while querying the child by using has_child.
Treating payment as the child and consumer as the parent, you can search all the consumers with:
GET /index_name/consumer/_search
{
  "query": {
    "has_child": {
      "type": "payment",
      "query": {
        // any query on payment
      },
      "inner_hits": {}
    }
  }
}
This fetches all the consumers based on a query on the child, i.e. payment in your case.
inner_hits is what you are looking for: it retrieves the matching children as well. It was introduced in Elasticsearch 1.5.0, so your version needs to be 1.5.0 or later.
You can refer to https://www.elastic.co/blog/elasticsearch-1-5-0-released.
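For completeness, the parent/child relation this relies on has to be declared in the mapping when the index is created - a sketch in the pre-5.x, type-based style the URL above implies, with field names taken from the question:
PUT /index_name
{
  "mappings": {
    "consumer": {
      "properties": {
        "firstName": { "type": "string" },
        "lastName": { "type": "string" }
      }
    },
    "payment": {
      "_parent": { "type": "consumer" },
      "properties": {
        "amount": { "type": "long" },
        "date": { "type": "date" }
      }
    }
  }
}
Each payment is then indexed with a parent parameter pointing at its consumer's id, so it is routed to the same shard as its parent.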
Your problem is not really an issue. I suppose you want to freeze the data once the payment is made, right? So you don't need to update the account data in existing payment documents.
Further: parent/child is easy for updating, but less efficient for querying. For autocomplete, stick with your current mapping!
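(For reference, a typical autocomplete mapping of this kind - a sketch only, your actual analyzer settings will differ - applies an edge_ngram filter at index time and a plain analyzer at search time:)
PUT /payments
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "payment": {
      "properties": {
        "consumerAccount": {
          "properties": {
            "firstName": {
              "type": "string",
              "analyzer": "autocomplete",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}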
