Data structure to manage huge data - in-memory indexing

Suppose you have a huge data set of students held in RAM (around 48 GB).
Each student has the following attributes:
1) RollNo
2) Name
3) Address
Now I need to implement three methods:
getStudentByRollNo(int rollno)
getStudentsByName(String name)
getStudentsByAddress(String address)
In what data structure can I keep these students so that these methods return results really fast?
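One common answer (a sketch, not taken from the original thread): store each Student object once and keep one hash index per lookup attribute. Roll numbers are unique, while several students can share a name or an address, so those two indexes map to lists. Every lookup is then an O(1) average-time hash probe:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StudentIndex {
    static class Student {
        final int rollNo;
        final String name;
        final String address;
        Student(int rollNo, String name, String address) {
            this.rollNo = rollNo;
            this.name = name;
            this.address = address;
        }
    }

    // Primary index: roll numbers are unique.
    private final Map<Integer, Student> byRollNo = new HashMap<>();
    // Secondary indexes: names and addresses may repeat, so map to lists.
    private final Map<String, List<Student>> byName = new HashMap<>();
    private final Map<String, List<Student>> byAddress = new HashMap<>();

    public void add(Student s) {
        byRollNo.put(s.rollNo, s);
        byName.computeIfAbsent(s.name, k -> new ArrayList<>()).add(s);
        byAddress.computeIfAbsent(s.address, k -> new ArrayList<>()).add(s);
    }

    public Student getStudentByRollNo(int rollNo) {
        return byRollNo.get(rollNo);
    }

    public List<Student> getStudentsByName(String name) {
        return byName.getOrDefault(name, Collections.emptyList());
    }

    public List<Student> getStudentsByAddress(String address) {
        return byAddress.getOrDefault(address, Collections.emptyList());
    }
}

The extra memory cost is only three references per student, since the Student objects themselves are shared, not copied. Note this only serves exact-match lookups; prefix or partial-match searches would need a sorted structure such as a TreeMap or a trie instead.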

Related

Performance in Elasticsearch

I am just getting started with Elasticsearch.
I have two cases of data in a relational database, and in both cases I want to find the records from the first table as quickly as possible.
Case 1: tables linked 1:n (example: Invoice - Invoice items)
Should I save to Elasticsearch all the rows from the slave table, or the master_id with all the slave data grouped into a single string?
Case 2: tables linked n:1 (example: Invoice - Customer)
Should I save the data as in case 1 into an independent index, or add extra columns to the previous index?
The problem is that sometimes I only need to search for records that contain a specific invoice item, sometimes a specific customer, and sometimes both an invoice item and a customer.
Should I create one index containing all the data, or all 3 variants?
Another problem: is it possible to somehow speed up searching in Elasticsearch when the stored data is e.g. only an EAN (13-digit number) rather than plain text?
Thanks,
Jaroslav
You should denormalize and use a single index for all your data (invoices, items and customers) for the best performance. Elasticsearch does support joins and parent-child relationships, but their performance is nowhere near what you get when all the data is part of a single index, and a quick benchmark test on your data will prove it easily.
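To make the denormalized shape concrete, here is a minimal sketch of one document per invoice with the customer fields and line items embedded, built as a Java Map of the kind you could hand to an Elasticsearch client's index request. All field names and values are illustrative assumptions, not from the question:

import java.util.List;
import java.util.Map;

public class DenormalizedInvoice {
    // One document per invoice: customer data and items are embedded, so a
    // single index answers item-only, customer-only and combined searches.
    static Map<String, Object> buildDocument() {
        return Map.of(
                "invoice_number", "2015-0042",
                "customer_name", "ACME s.r.o.",
                "customer_city", "Praha",
                "items", List.of(
                        Map.of("ean", "8594001234567", "name", "Widget", "qty", 3),
                        Map.of("ean", "8594007654321", "name", "Gadget", "qty", 1)));
    }
}

On the EAN question: map the EAN as a keyword field rather than analyzed text, so searches against it become exact term lookups instead of going through an analyzer.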

Elasticsearch best practice (addresses)

We are going to hold a mass of address data (mass in the eyes of my company: about 150,000 to 500,000 rows per customer).
The address data contains about 5 columns:
Name1
Name2
Street (+ No)
Postcode
City
Maybe later some more fields (like phone, e-mail, etc.)
Is the best approach to assign each customer's pool of addresses to one shard? (A user of the application is assigned to a customer and shares the address pool with all other users of that customer.)
"Jargon wise" give each Customer their own index (with the same mapping). Elasticsearch can query multiple indices with a single query. An index may consist of many shards. For 150 - 500.000 documents, you don't need that many shards. You might be fine with just one, but depending on the amount of queries made, at least check between 1 - 5.

Is it cheaper to check if item exists before accessing it on GAE Datastore?

If I want to load or delete an object from the Datastore, is it better to check whether it exists first? I've read that small operations are free. Does checking for existence save potential read operations on non-existent objects?
// Keys-only count query (a small, free operation) before the real load,
// using the static ObjectifyService.ofy() helper.
if (ofy().load().type(StatusObj.class).filterKey(key).count() != 0) {
    ofy().load().key(key).now();
}
Yes, it saves reads. However, network latency is not negligible, and if the entities do exist your app's response time will almost double, since you make two round trips instead of one.
From: https://cloud.google.com/appengine/docs/python/datastore/entities#Python_Batch_operations
A batch operation for two keys costs two reads, even if one of the keys did not exist. For example, it is more economical to do a keys-only query that retrieves 1000 keys, and then do a fetch on 500 of them, than to do a regular (not keys-only) query for all 1000 directly:
Query returning 1000 keys + fetching 500 entities:
$0.0000007 (base query cost) + $0.0001 (per-key query cost) + $0.00035 (entity fetch cost)
= $0.0004507
Fetching 1000 entities:
$0.0000007 (base query cost) + $0.0007 (per-entity query cost)
= $0.0007007
From the pricing page (https://cloud.google.com/appengine/pricing): Small datastore operations include calls to allocate datastore ids or keys-only queries, and these operations are free.
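A sketch of the pattern the quoted docs recommend, using Objectify; StatusObj is the entity from the question, and the filter is a made-up example:

import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.Key;
import java.util.List;
import java.util.Map;

public class KeysOnlyFetch {
    // Step 1: keys-only query - a small operation, i.e. free.
    // Step 2: batch get - one read per entity actually fetched.
    static Map<Key<StatusObj>, StatusObj> fetchSome() {
        List<Key<StatusObj>> keys = ofy().load().type(StatusObj.class)
                .filter("active", true)   // hypothetical filter
                .keys().list();
        return ofy().load().keys(keys.subList(0, Math.min(500, keys.size())));
    }
}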

Data Transformation for Large data in a file

I am new to Ensemble and need some clarification regarding Data Transformations.
I have 2 schemas as follows,
PatientID,
Patient Name,
Patient Address (combination of door number, Street, District, State)
and another schema as,
PatientID,
Patient Name,
Door Number
Street
District
State
Now there is an incoming text file with thousands of records in the first schema ('|'-separated), like this:
1001|John|220,W Maude Ave,Suisun City, CA
There are thousands of records like this in the input file.
My requirement is to convert them to the second schema (i.e. to split the address apart) and store them in the file like:
1001|John|220|W Maude Ave|Suisun City|CA
One solution I implemented was to loop through each line in the file and replace the commas in the address with '|'.
My question is whether we can do this through DTL, and if yes, how do we loop through thousands of records using DTL?
Also, will DTL be time-consuming, since we need to load the schema and then do the transformations?
Please help.
You can use DTL with any class that inherits from Ens.VirtualDocument or %XML.Adaptor. Ensemble uses the class dictionary to represent the schema, so for basic classes there is no problem: if you extend %XML.Adaptor, Ensemble can represent it. In the case of virtual documents, the object's DocType has to be set.
To do the loop, DTL has a ForEach action.
Yes, DTLs can parse thousands of records. You can do the following:
1) Create a record map to parse the incoming file that has schema 1.
2) Define an intermediate object that maps schema 2 fields to object properties.
3) Create a DTL whose source object is the record map object from 1 above and whose target is the object from 2 above.
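For comparison, the line-by-line rewrite the poster describes in the question is only a few lines in a general-purpose language. A minimal Java sketch (the file names are hypothetical):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AddressSplitter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("patients_in.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("patients_out.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Record layout: PatientID|Name|Address, where the address
                // packs door number, street, district and state with commas.
                String[] fields = line.split("\\|", 3);
                // Split the address on commas (trimming stray spaces) and
                // re-join everything with the pipe delimiter of schema 2.
                String address = String.join("|", fields[2].split(",\\s*"));
                out.write(fields[0] + "|" + fields[1] + "|" + address);
                out.newLine();
            }
        }
    }
}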

Paging, listing and grouping queries with AppFabric Cache

I have read a lot of documents about AppFabric caching, but most of them cover simple scenarios, for example adding a list of cities or shopping cart data to the cache.
What I need is to add product catalog data to the cache.
I have 4 tables:
Product (1 million rows), ProductProperty (25 million rows), Property (100 rows), PropertyOption (300 rows)
I display paged search results by querying the Product and ProductProperty tables with some filters.
I also build a criteria set over the search result set, for example (4 items New Product, 34 items Phone, 26 items Book, etc.).
For that I run one grouping query over the Product table on columns such as IsNew, CategoryId and PriceType,
and another grouping query over the ProductProperty table on the PropertyId and PropertyOptionId columns to find how many items each property value has.
So to display search results I make one query for the results themselves and two more to build the criteria list (with counts).
The search result query takes 0.7 seconds, and the two grouping queries take 1.5 seconds in total.
When I run a load test I reach 7 requests per second, and 10% of requests are dropped by IIS because the database cannot respond in time.
This is why I want to cache Product and property records.
If I follow the steps below (in AppFabric):
Create a named cache.
Create a region for the product catalog data (the product table with 1 million rows and the property table with 25 million rows).
Tag the items for querying and grouping.
Can I query with some tags and get the 1st or 2nd page of results?
Can I query with some tags and get the counts of some grouping results (to display filter options with counts)?
And do I really need 3 servers? Can I build a solution with only one AppFabric server? (And of course I know the risk.)
Do you know of any article or document that explains these scenarios?
Thanks.
Note:
Some additional tests:
I added about 30,000 items to the cache, and its size is 900 MB.
When I run the GetObjectsInRegion method, it takes about 2 minutes: IList<KeyValuePair<string, object>> dataList = this.DataCache.GetObjectsInRegion(region).ToList();
The problem is the conversion to IList; if I use IEnumerable it is very quick. But how can I get paged or grouped results without converting to my own type?
Another test:
I tried getting grouping counts over the 30,000 cached product items, and the grouping results took 4 seconds, e.g. GetObjectsByTag("IsNew").Count() and nearly 50 other queries like that.
There is, unfortunately, no paging API for AppFabric in V1. Any of the bulk APIs, like GetObjectsByTag, are going to perform the query on the server and stream back all the matching cache entries to the client. From there you can obviously use any LINQ operators you want on the IEnumerable (e.g. Skip/Take/Count), but be aware that you're always pulling the full result set back from the server.
I'm personally hoping that AppFabric V2 will provide support via IQueryable instead of IEnumerable which will give the ability to remote the full request to the server so it could page results there before returning to the client much like LINQ2SQL or ADO.NET EF.
For now, one potential solution, depending on the capabilities of your application, is to calculate some kind of paging as you inject the items into the cache. You can build ordered lists of entity keys representing each page and store each list as a single entry in the cache; you can then pull a list out in one request, fetch the items it references individually (in parallel) or in bulk, and join them together with an in-memory LINQ query. If you want to trade CPU for memory, cache the actual lists of full entities rather than IDs and skip the join step.
You would obviously have to come up with some kind of keying mechanism to quickly pull these lists of objects from the cache based on the incoming search criteria. Some kind of keying like this might work:
// Builds a deterministic cache key for a stored page list, so a page can be
// looked up directly from the incoming search criteria.
private static string BuildPageListCacheKey(string entityTypeName, int pageSize, int pageNumber, string sortByPropertyName, string sortDirection)
{
    return string.Format("PageList<{0}>[pageSize={1};pageNumber={2};sortedBy={3};sortDirection={4}]",
        entityTypeName, pageSize, pageNumber, sortByPropertyName, sortDirection);
}
You may want to consider doing this kind of thing in a separate process or worker thread that keeps the cache up to date, rather than doing it on demand and forcing users to wait when a cache entry isn't populated yet.
Whether or not this approach ultimately works for you depends on several factors of your application and data. If it doesn't exactly fit your scenarios maybe it will at least help shift your mind into a different way of thinking about solving the problem.
