Quick retrieval of the same data across multiple dimensions - caching

Here is my case. I have user fields like name, phone, email, city, country, and other attributes. If I search by name, phone, or email, I need to get the details of all users satisfying that condition in a very short time. As far as I know, to implement this in a key-value cache, I need to store the same data under different keys.
My current implementation plan is as follows.
All of the following info is stored in my cache:
User level complete data
unique_user_id(XXX) : {name:abc, phone:12345, email:abc@gmail.com, ...etc}
unique_user_id(YYY) : {name:def, phone:67891, email:def@gmail.com, ...etc}
Email level info
abc@gmail.com : XXX
def@gmail.com : YYY
Name level info
abc : XXX
def : YYY
Phone level info
12345 : XXX
67891 : YYY
If I search by email, I query the 'Email level info' to get the unique user id, then query the 'User level info' for the user details. Here I have duplicated data for each possible key type. I do not want to duplicate data; I want a single key-value entry per user.
Is there a better option than a key-value cache, where the stored data is minimal and the response time is very low? I am okay with using a database along with hash indexing or something similar. Please suggest appropriate software for this.
I have never used caches, so please bear with my understanding and language.
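For illustration, here is a minimal sketch of the two-step lookup described above using redis-py (the local connection and the key prefixes are assumptions, not a recommendation):

import redis

# Assumes a local Redis instance; key names below are illustrative only.
r = redis.Redis(decode_responses=True)

# User-level complete data, keyed by the unique user id.
r.hset("user:XXX", mapping={"name": "abc", "phone": "12345", "email": "abc@gmail.com"})

# Secondary "index" keys hold only the user id, not a copy of the record.
r.set("email:abc@gmail.com", "XXX")
r.set("name:abc", "XXX")
r.set("phone:12345", "XXX")

def find_by_email(email):
    """Two-step lookup: secondary key -> user id -> full record."""
    user_id = r.get(f"email:{email}")
    return r.hgetall(f"user:{user_id}") if user_id else None

print(find_by_email("abc@gmail.com"))

Note that the secondary keys only duplicate the user id, not the full record, so each user's details are stored exactly once.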

Related

Correct way to model foreign keys to entities that do not (yet) exist?

I'm building a Spring Boot application using Spring Data JPA. I'll give a simplified description of the application that illustrates my problem:
I have a table of Students that has a student_id primary key and various personal information (incl name, etc). This personal data is loaded from an external source and may only be retrieved once the student gives his permission to retrieve it by logging into the application. I thus cannot create a list of all the users that might log in ahead of time. They are created and inserted into the database when the student logs in the first time.
I also load data sets like historical grades into the database ahead of time. These are of the form student_id (foreign key), course_id (foreign key), grades, year (and some other fields). The point is that once a student logs in, their historical grades will be visible. However, the database (initialized as empty by Spring Data JPA) will not let me insert the historical data as it complains that e.g. student_id 1234 (foreign key in the grades table) cannot be found as a primary key in the Students table. Which is true, because I will only create that user 1234 when and if he/she logs in.
I see the following options and don't really like any of them:
disable all constraints on foreign keys for the relevant classes (in which case: How do I tell Spring Data JPA to do that? I googled but couldn't find it) -> Disabling integrity checks sounds like a bad idea though.
Create 'dummy' students, i.e. simply go through the historical data, list all the student_ids and then pre-fill my Student table with entries like student_id = 1234, name = "", address = "", etc. This data would be filled in if/when the student logs in -> This also feels like a 'dirty' solution.
Keep the historical data in .csv files, or another manually created table and have the application load it into the 'real' database only after the student logs in for the first time -> This just sounds like a terrible mess.
Conceptually I'm inclined towards option 1, because I do in fact want to create/save a piece of historical data about a student, despite having no other information about that student. I'm afraid, however, that if I e.g. retrieve all grades for a course_id, those Grade entities will contain links to Student entities that do not in fact exist and this will just result in more errors.
How do you handle such a situation?

Map multiple values to a unique column in Elasticsearch

I want to work with Elasticsearch to process some Whatsapp chats. So I am initially planning the data load.
The problem is that the data exported from WhatsApp doesn't contain a real unique id per user; it only contains the name of the user taken from the contact directory of the device where the chat is exported (i.e. a user can change their number or have two numbers in the same group).
Because of that, I need to create a custom explicit mapping table between the user names and a self-generated unique id, which gets populated in an additional column.
Then, my question is: "How can I implement such kind of explicit mapping in Elasticsearch to generate an additional unique column?". Alternatively, a valid answer could be a totally different approach to the problem.
PS. As I write, I think the solution could be in the ingestion process, like in a python script, but I still want to post the question to understand if this is something that Elasticsearch can do by itself.
yes, do it during the index process
if you had the data that maps the name and the id stored in a separate index, you could do this with an enrich processor when you index the data, to add whichever value you want to the document via a pipeline
also - Elasticsearch doesn't have columns, only fields
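For illustration, a rough ingest-side sketch in Python (the index name, field names, and the hand-maintained name-to-id table are all assumptions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Hand-maintained mapping from exported contact names to stable ids.
name_to_id = {"Alice Phone": "user-001", "Alice Work": "user-001", "Bob": "user-002"}

def actions(messages):
    # messages: dicts with e.g. "name", "text", "timestamp" parsed from the export
    for msg in messages:
        yield {
            "_index": "whatsapp-chats",  # assumed index name
            "_source": {**msg, "user_id": name_to_id.get(msg["name"], "unknown")},
        }

messages = [{"name": "Alice Work", "text": "hi", "timestamp": "2021-01-01T10:00:00"}]
helpers.bulk(es, actions(messages))

The enrich processor mentioned above achieves the same thing server-side via an ingest pipeline, with the name-to-id table kept in a separate index.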

Redis advice- when to group data and when to split it up

Hoping to get some sage advice from those that have been in the trenches, and can help me better understand when to keep data clustered together in Redis, and when to split it up.
I am working on a multi-tenant platform that has GPS data coming in every 3-5 seconds for each asset I track. As data is processed, we store additional information associated with the asset (i.e. is it on time, is it late, etc).
Each asset belongs to one or more tenants. For example, when tracking a family car, that car may exist for the husband and the wife. Each needs to know where it is relative to their needs. For example, the car may be being used by the teenager and is on time for the husband to use it at 3:00 pm, but late for the wife to use it at 2:30 pm.
As an additional requirement, a single tenant may want read access to other tenants, e.g. Dad wants to see the family car and any teenagers' cars. So the hierarchy can start to look something like this:
Super-Tenant
  -- Super-Tenant (Family)
    -- Tenant (Dad)
      -- Vehicle 1
        -- Gps: 123.456,15.123
        -- Status: OnTime
      -- Vehicle 2
        -- Gps: 123.872,15.563
        -- Status: Unused
    -- Tenant (Mom)
      -- Vehicle 1
        -- Gps: 123.456,15.123
        -- Status: Late
      -- Vehicle 2
        -- Gps: 123.872,15.563
        -- Status: Unused
    -- Tenant (Teenager)
      -- Vehicle 1
        -- Gps: 123.456,15.123
        -- Status: Unused
      -- Vehicle 2
        -- Gps: 123.872,15.563
        -- Status: Unused
My question has to do with the best way to store this in Redis.
I can store by tenant - i.e. I can use a key for Dad, then have a collection of all of the vehicles he has access to. Each time a new GPS location comes in (whether for Vehicle 1 or Vehicle 2), I would update the contents of the collection. My concern is that if there are dozens of vehicles, we would be updating his collection way too often.
Or
I can store by tenant, then by vehicle. This means that when Vehicle 1's GPS location comes in I will be updating information across 3 different tenants. Not too bad.
What gives me pause is that I am working on a website for Dad to see all his vehicles. That call is going to come in and ask for all Vehicles under the Tenant of Dad. If I split out the data so it is stored by tenant/vehicle, then I will have to store the fact that Dad has 2 vehicles, then ask Redis for everything in (key1,key2,etc).
If I make it so that everything is stored in a collection under the Dad tenant, then my request to Redis would be much simpler and will be asking for everything under the key Dad.
In reality, each tenant will have between 5 and 100 vehicles, and we have hundreds of tenants.
Based on your experience, what would be your preferred approach (please feel free to offer any not offered here).
From your question it appears you're hoping to store everything you need under one key. Redis doesn't support nested hashes as-is. There are a few recommendations from this answer on how to work around that.
Based on the update cadence of the GPS data, it's best to minimize the total number of writes required. This may increase the number of operations needed to construct a response on read; however, adding read-only replica instances should allow you to scale reads. You can also tune your writes with some pipelining of updates.
From what you have described, it sounds like the updates are limited to the GPS position and status of a vehicle for a user, and the data requested on a read is, for a single user's view, that user's set of vehicle positions and statuses.
I would start with a Tenant stored as a hash with the user's name and fields referencing the vehicles and sessions associated with the user. This is not strictly necessary if you use consistent key-naming conventions, but it is shown as an option in case additional user data needs to be cached.
- Tenant user-1 (Hash)
-- name: Dad (String)
-- vehicles: user-1:vehicles (String) reference to key for set of vehicles assigned.
-- sessions: user-1:sessions (String) reference to key for set of user-vehicle sessions.
The lookup of the vehicles could be done with key formatting if none of the other tenant data is needed. The members would be references to the keys of the vehicles.
- UserVehicles user-1:vehicles (Set)
-- member: vehicle-1 (String)
This would allow lookup of the details for the vehicle. Vehicles would hold their position. You could include a field referencing a vehicle-centric schedule of sessions, similar to the user sessions below. You could also store a vehicle name or other data if it is required for the response.
- Vehicle: vehicle-1 (Hash)
-- gps: "123.456,15.123" (String)
Store the sessions specific to a user in a sorted set. The members would be references to the keys storing session information. The score would be set to a timestamp value allowing range lookups for recent and upcoming sessions for that user.
- Schedule user-1:sessions (Sorted Set)
-- member: user-1:vehicle-1:session-1 (String)
-- score: 1638216000 (Float)
For tenant sessions on a vehicle, you could simply store the status as a string. The alternative shown here would also allow storing scheduled and available times, in case you need to support vehicle-centric views of a schedule. Combining this with a sorted set of vehicle sessions would round this out.
- Session user-1:vehicle-1:session-1 (Hash)
-- status: OnTime (String)
-- scheduled_start: 1638216000 (String) [optional]
-- scheduled_end: 1638216600 (String) [optional]
-- earliest_available: 1638215000 (String) [optional]
If you're not tracking state elsewhere, you could use a hash to store counters for the cache objects, for when it comes time to issue a new id. Read and increment these when adding new cache objects.
- Globals: global (Hash)
-- user: 0
-- vehicle: 0
-- session: 0
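As a rough redis-py sketch of creating the layout above (key names follow the illustrative schema and are not prescriptive):

import redis

r = redis.Redis(decode_responses=True)

# Tenant hash with references to the vehicle set and session sorted set.
r.hset("user-1", mapping={"name": "Dad",
                          "vehicles": "user-1:vehicles",
                          "sessions": "user-1:sessions"})

# Set of vehicles assigned to the tenant.
r.sadd("user-1:vehicles", "vehicle-1")

# Vehicle hash holding the latest position.
r.hset("vehicle-1", mapping={"gps": "123.456,15.123"})

# Session hash plus a sorted-set entry scored by the scheduled start time.
r.hset("user-1:vehicle-1:session-1", mapping={"status": "OnTime",
                                              "scheduled_start": "1638216000"})
r.zadd("user-1:sessions", {"user-1:vehicle-1:session-1": 1638216000})

# Global counters for issuing new ids.
r.hincrby("global", "vehicle", 1)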
For updates you would have: 200k write operations per update cycle.
100k tenant-vehicle pairs (1,000 tenants * 100 vehicles/tenant), each with:
1 HSET vehicle
1 HSET session
Pipelining and tuning the number of requests in the pipeline can improve write performance, but I would anticipate you should be able to complete all writes in <2s.
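For example, a pipelined update cycle could look roughly like this (the batch size and key layout are assumptions, continuing the sketch above):

import redis

r = redis.Redis(decode_responses=True)

# updates: iterable of (vehicle_key, gps_string, session_key, status) from the GPS feed
def apply_updates(updates, batch_size=1000):
    pipe = r.pipeline(transaction=False)
    for vehicle_key, gps, session_key, status in updates:
        pipe.hset(vehicle_key, "gps", gps)
        pipe.hset(session_key, "status", status)
        if len(pipe) >= batch_size:  # tune the batch size to your latency budget
            pipe.execute()
    if len(pipe):
        pipe.execute()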
For a read you would have something like: ~300 operations per user per request.
1 HGETALL user
1 ZRANGESTORE tempUSessions user-sessions LIMIT 200 (find up to 200 sessions in a timeframe for the user)
200 HGETALL session
1 SMEMBERS user-vehicles (find all vehicles for the user)
100 HGET vehicle gps (get vehicle location for all vehicles)
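A rough sketch of that read path with redis-py, using ZRANGEBYSCORE instead of ZRANGESTORE for simplicity (key names follow the layout above):

import time
import redis

r = redis.Redis(decode_responses=True)

def dashboard(user_key):
    user = r.hgetall(user_key)                                   # 1 HGETALL
    now = int(time.time())
    session_keys = r.zrangebyscore(user["sessions"], now - 86400, now + 86400,
                                   start=0, num=200)             # up to 200 sessions
    sessions = [r.hgetall(k) for k in session_keys]              # ~200 HGETALLs
    vehicles = {v: r.hget(v, "gps")                              # ~100 HGETs
                for v in r.smembers(user["vehicles"])}           # 1 SMEMBERS
    return {"user": user, "sessions": sessions, "vehicles": vehicles}

In practice you would likely pipeline the per-session and per-vehicle reads as well.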
Considerations:
A process that periodically removes sessions and their references after they have passed would keep memory from growing unbounded and keep performance consistent.
Adding some scripts to allow for easier updates to the cache when a new user or vehicle is added and for returning the state you described needing for display to a user would round this out.

How to manage/store "created by" in a microservice?

I am building an inventory service. All tables keep track of the owner of each record in a createdBy column, which stores the user id.
The problem is that this service does not hold the user info, so it cannot map the id to a username, which the FE requires to display the data.
Calling the user service to map user ids to usernames on every request does not make sense in terms of decoupling and performance, because a single request can ask for up to 100 records. If I store the username instead of the id, there will be a problem when a user changes their username.
Is there any better way or pattern to solve this problem?
I'd extend the inventory info with the data needed from the user service.
User name is a slowly changing dimension, so most of the time the data is correct (i.e. "safe to cache").
Now we get to what to do when the user info changes - this is, of course, a business decision. In some places it makes sense to keep the original info (for example, what happens when the user is deleted - do we still want to keep the original user name, and whatever other info, of whoever created the item?). If that is not the case, you can use several strategies depending on the required freshness: a daily (or whatever period) job that refreshes the user info from the user service for all users referenced in the inventory; a daily summary of changes published by the user service that the inventory subscribes to; or changes published as they happen that the inventory subscribes to. The technology to use depends on the strategy.
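As a rough sketch of the cache-and-enrich idea (the user-service bulk endpoint, URL, and field names are hypothetical):

import requests

USER_SERVICE_BULK_URL = "http://user-service/api/users/bulk"  # hypothetical endpoint
_username_cache = {}  # user_id -> username, refreshed by a periodic job

def enrich_with_usernames(inventory_records):
    missing = {rec["createdBy"] for rec in inventory_records} - _username_cache.keys()
    if missing:
        # One bulk call for all unknown ids instead of one call per record.
        resp = requests.post(USER_SERVICE_BULK_URL, json={"ids": sorted(missing)})
        _username_cache.update({u["id"]: u["username"] for u in resp.json()})
    for rec in inventory_records:
        rec["createdByName"] = _username_cache.get(rec["createdBy"], "unknown")
    return inventory_records

A periodic job (or a subscription to user-change events) could then clear or rebuild the cache, per the strategies above.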
In my opinion, what you have done so far is correct. Inventory-related data should be the Inventory Service's responsibility, just like user-related data should be the User Service's.
It is the FE's responsibility to fetch from the User Service the user details required to populate the UI (remember, calling the backend for each user is not acceptable at all; a bulk search is more suitable).
What you can do is, when inventory data is fetched from the Inventory Service, publish a message to the User Service saying, in effect: "inventory-related data was fetched for these users, so user-related data for them is likely to be requested soon - you had better cache it."
PS - I'm not an expert in microservices architecture. Please add counterarguments if you have any.

Loading records into Dynamics 365 through ADF

I'm using the Dynamics connector in Azure Data Factory.
TLDR
Does this connector support loading child records which need a parent record key passed in? For example, if I want to create a contact and attach it to a parent account, I upsert a record with a null contactid, a valid parentcustomerid GUID, and parentcustomeridtype set to 1 (or 2), but I get an error.
Long Story
I'm successfully connecting to Dynamics 365 and extracting data (for example, the lead table) into a SQL Server table.
To test that I can transfer data the other way, I am simply loading the data back from the lead table into the lead entity in Dynamics.
I'm getting this error:
Failure happened on 'Sink' side. ErrorCode=DynamicsMissingTargetForMultiTargetLookupField,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=,Source=,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot find the target column for multi-target lookup field: 'ownerid'.
As a test, I removed ownerid from the list of source columns and it loads OK.
This is obviously a foreign key value.
It raises two questions for me:
Specifically with regard to the error message: if I knew which lookup it needed to use, how can I specify which lookup table it should validate against? There are no settings in the ADF connector to allow me to do this.
This is obviously a foreign key value. If I only had the name (or business key) for this row, how can I easily lookup the foreign key value?
How is this normally done through other APIs, e.g. the web API?
Is there an XRMToolbox addin that would help clarify?
I've also read some posts that imply that you can send pre-connected data in an XML document so perhaps that would help also.
EDIT 1
I realised that the lead.ownertypeid field in my source dataset is NULL (that's what was exported). It's also NULL if I browse it in various Xrmtoolbox tools. I tried hard coding it to systemuser (which is what it actually is in the owner table against the actual owner record) but I still get the same error.
I also notice there's a record with the same PK value in the systemuser table.
So the same record is in two tables, but how do I tell the Dynamics connector which one to use? And why does it even care?
EDIT 2
I was getting a similar message for msauto_testdrive for customerid.
I excluded all records with customerid=null, and got the same error.
EDIT 3
This link appears to indicate that I need to set customeridtype to 1 (Account) or 2 (Contact). I did so, but still got the same error.
Also I believe I have the same issue as this guy.
Maybe the ADF connector suffers from the same problem.
At the time of writing, @Arun Vinoth was 100% correct. However, shortly afterwards there was a documentation update (in response to a GitHub issue I raised) that explained how to do it.
I'll document how I did it here.
To populate a contact against a parent account, you need the parent account's GUID. Then you prepare a dataset like this:
SELECT
-- a NULL contactid means this is a new record
CAST(NULL as uniqueidentifier) as contactid,
-- the GUID of the parent account
CAST('A7070AE2-D7A6-EA11-A812-000D3A79983B' as uniqueidentifier) parentcustomerid,
-- customer id is an account
'account' [parentcustomerid@EntityReference],
'Joe' as firstname,
'Bloggs' as lastname
Now you can select from this dataset and load it into the contact entity, applying the usual automapping approach in ADF: create the datasets without schemas and perform the copy activity without mapping columns.
This is an ADF limitation with respect to CDS polymorphic lookups like Customer and Owner. Upvote this ADF idea.
The workaround is to use two temporary source lookup fields (team and user in the case of owner; account and contact in the case of customer) together with a parallel branch in a Microsoft Flow. Read more; you can also download the Flow sample to use.
First, create two temporary lookup fields on the entity that you wish to import Customer lookup data into, pointing to the Account and Contact entities respectively.
Within your ADF pipeline flow, you will then need to map the GUID values for your Account and Contact fields to the respective lookup fields created above. The simplest way of doing this is to have two separate columns within your source dataset – one containing the Account GUIDs to map and the other the Contact GUIDs.
Then, finally, you can put together a Microsoft Flow that performs the appropriate mapping from the temporary fields to the Customer lookup field. First, define the trigger point for when your affected entity record is created (in this case, Contact) and add some parallel branches to check for values in either of these two temporary lookup fields.
Then, if either of these conditions is hit, set up an Update record task to perform a single field update, as indicated below, if the ADF Account Lookup field has data within it.
