elasticsearch mapping for Friend to Friend list - elasticsearch

we have started using elasticsearch in our project, we are storing user data and his friend list as nested object, and nested to nested object storing friend's friend list because we required this data when we are doing global search.
Now we are syncing this data in real time with our database, so is this good to syncing done in real time 50-100 TPS or in future it will create problem.
We need to create complex queries for updating the data because we are managing friend list in 2nd level. so how to create advance scripting in painless, I have checked this in Google but not found anything in detail.
If my approach is wrong of doing this, please let me know.

To answer your first question
so is this good to syncing done in real time 50-100 TPS or in future
it will create problem
As of version ES6.0, Multi level nesting is automatically supported, and detected, resulting in an inner nested query to automatically match the relevant nesting level (and not root) if it exists within another nested query. But there is a caveat, indexing a document with 100 nested fields actually indexes 101 documents as each nested document is indexed as a separate document. To safeguard against ill-defined mappings the number of nested fields that can be defined per index is usually limited to 50 using the index.mapping.nested_fields.limit . This setting allows you to limit the number of field mappings that can be created manually or dynamically, in order to prevent bad documents from causing a mapping explosion. So to answer your question, this is fine, but as your data grows, it becomes more complicated to manage and you risk the danger of a mapping explosion.
To answer your second question
We need to create complex queries for updating the data because we are
managing friend list in 2nd level. so how to create advance scripting
in painless, I have checked this in Google but not found anything in
detail.
You might need to present some context here to be able to understand why your approach is necessary, but basically, in a social profile context, managing a friends list as you are doing is always a bad idea, especially if you anticipate scaling in the future. It may work for smaller use-cases, but it does not work very well when you scale. This is because, the relationships become more sophisticated and you will end up having too many multi nested objects. As mentioned, all factors kept at a constant, you might want to look at a graph database for this kind of a scenario. You could, however, have other reasons to your approach which is why you might want to enumerate your context so we can better advise.
Hope this helps!!

Related

Is it OK to have multiple merge steps in an Excel Power query?

I have data from multiple sources - a combination of Excel (table and non table), csv and, sometimes, even a tsv.
I create queries for each data source and then I am bringing them together one step at a time or, actually, it's two steps: merge and then expand to bring in the fields I want for each data source.
This doesn't feel very efficient and I think that maybe I should be just joining everything together in the Data Model. The problem when I did that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query one I'd created all my relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views, not full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter one having a comprehensive list for Best Practices (you can also just do a search for that term, there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.

Updating nested documents en masse

We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties:- ID, name, taxonomy, and it's relationship to the content. They're indexed as nested objects so that we can aggregate on them etc.
This is where it gets interesting... tags used to be immutable but we have recently changed metadata systems and they may now change - names will be updated, IDs may flux as they move taxonomy etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning it, a few others with tens of thousands of documents are due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a more optimal mapping we could use for our documents knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
Thanks
I've already answered a similar question to your use case of Nested datatype.
Here is the link to the answer of maintaining Parent-Child relation data into ES using Nested datatype.
Try this. Do let me know if this solution helps in solving your problem.

DB candidate as CouchDB/Schema replacement

The idea is to redesign data structure and/or change DB.
I just started to review this project and plan to start optimization from this one.
Currently i have CouchDb with about 80GB of document data, around 30M records.
From that subset for the most of documents properties like id, group_id, location, type can be considered as generic, but unfortunately for now such are even stored with different property naming around the set. Also a lot of deeply nested can be found.
Structure isn't hardly defined, that's why NoSQL db was selected way before some picture was seen.
Data is calculated and populated in DB in a separate Job on powerful cluster. This isn't done too often. From that perspective i can conclude that general write/update performance isn't very important. Also size decrease would be great, but isn't most important. There are only like 1-10 active customers at a time.
Actually read performance with various filtering/grouping etc is most important.
But no heavy summary calculations should be done, this one is already done while population.
This one is a data analytical tool for displaying compare and other reports to quality engineers and data analyst, so they can browse the results, group them or filter from the Web UI.
Now such tasks like searching a subset of document properties for a text isn't possible due to performance.
For sure i've done some initial investigations(like http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf) and it looks Cassandra seems to be good choice among NoSql.
Also it's quite interesting trying to port this data into the new PostgreSQl.
Any ideas would be highly appreciated :-)
Hello please check the following articles:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL json(and jsonb!) capabilities allow to start schema-less, have transactions, indexes, grouping, aggregate functions with very good performance, just from the start. And when ready(and if needed), you can go for the schema, with internal data migration.
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck

Joomla getItems default Pagination

Can anyone tell me if the getItems() function in the model automatically adds the globally set LIMIT before it actions the query (from getListQuery()). Joomla is really struggling, seemingly trying to cache the entire results (over 1 million records here!).
After looking in /libraries/legacy/model/list.php AND /libraries/legacy/model/legacy.php it appears that getItems() does add LIMIT to setQuery using $this->getState('list.limit') before it sends the results to the cache but if this is the case - why is Joomla struggling so much.
So what's going on? How come phpMyAdmin can return the limited results within a second and Joomla just times out?
Many thanks!
If you have one million records, you'll most definitely want to do as Riccardo is suggesting, override and optimize the model.
JModelList runs the query twice, once for the pagination numbers and then for the display query itself. You'll want to carefully inherit from JModellist to avoid the pagination query.
Also, the articles query is notorious for it's joins. You can definitely lose some of that slowdown (doubt you are using the contacts link, for example).
If all articles are visible to public, you can remove the ACL check - that's pretty costly.
There is no DBA from the West or the East who is able to explain why all of those GROUP BY's are needed, either.
Losing those things will help considerably. In fact, building your query from scratch might be best.
It does add the pagination automatically.
Its struggling is most likely due to a large dataset (i.e. 1000+ items returned in the collection) and many lookup fields: the content modules for example join as many as 10 tables, to get author names etc.
This can be a real killer, I had queries running for over one second with a dedicated server and only 3000 content items. One tag cloud component we found could take as long as 45 seconds to return a keywords list. If this is the situation (a lot of records and many joins), your only way out is to further limit the filters in the options to see if you can get some faster results (for example, limiting to articles in the last 3 months can reduce the time needed dramatically).
But if this is not sufficient or not viable, you're left with writing a new optimized query in a new model, which ultimately will bring the best performance optimization of any other optimization. In writing the query, consider leveraging the database specific optimizations, i.e. adding indexes, full-text indexes and only use joins if you really need them.
Also consider that joins must never grow with the number of fields, translations or else.
A constant query is easy for the db engine to optimize and cache, whilst a dynamic query will never be as efficient.

How to optimize "text search" for inverted index and relational database? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
Update 2022-08-12
I re-thought about it and realized I was overcomplicating it. I found the best way to enhance this system is by using good old information retrieval techniques ie using 'location' of a word in a sentence and 'ranking' queries to display best hits. The approach is illustrated in this following picture.
Update 2015-10-15
Back in 2012, I was building a personal online application and actually wanted to re-invent the wheel because am curious by nature, for learning purposes and to enhance my algorithm and architecture skills. I could have used apache lucene and others, however as I mentioned I decided to build my own mini search engine.
Question: So is there really no way to enhance this architecture except by using available services like elasticsearch, lucene and others?
Original question
I am developing a web application, in which users search for specific titles (say for example : book x, book y, etc..) , which data is in a relational database (MySQL).
I am following the principle that each record that was fetched from the db, is cached in memory , so that the app has less calls to the database.
I have developed my own mini search engine , with the following architecture:
This is how it works:
a) User searches a record name
b) The system check what character the query starts with, checks if query there : get record. If not there, adds it and get all matching records from database using two ways:
Either query already there in the Table "Queries" (which is a sort of history table) thus get record based on IDs (Fast performance)
Or, otherwise using Mysql LIKE %% statement to get records/ids (Also then keep the used query by the user in history table Queries along with the ids it maps to).
-->Then It adds records and their ids to the cache and Only the ids to the inverted index map.
c) results are returned to the UI
The system works fine, however I have Two main issues, that i couldn't find a good solution for (been trying for the past month):
First issue:
if you check point (b) , case where no query "history" is found and it has to use the Like %% statement : this process becomes time consuming when the query matches numerous records in the database (instead of one or two):
It will take some time to get records from Mysql (this is why i used INDEXES on the specific columns)
Then time to save query history
Then time to add records/ids to cache and inverted index maps
Second issue:
The application allows users to add themselves new records, that can immediately be used by other users logged in the to application.
However to achieve this, inverted index map and table "queries" have to be updated so that in case any old query matches to the new word. For example if a new record "woodX" is being added, still the old query "wood" does map to it. So in order to re-hook query "wood" to this new record, here is what i am doing now:
new record "woodX" gets added to "records" table
then i run a Like %% statement to see which already existing query in table "queries" does map to this record(for example "wood"), then add this query with the new record id as a new row: [ wood, new id].
Then in memory, update inverted index Map's "wood" key's value (ie the list), by adding the new record Id to this list
--> Thus now if a remote user searches "wood" it will get from memory : wood and woodX
The Issue here is also time consumption. Matching all query histories (in table queries) with the newly added word takes a lot of time (the more matching queries, the more time). Then the in memory update also takes a lot of time.
What i am thinking of doing to fix this time issue, is to return the desired results to the user first , then let the application POST an ajax call with the required data to achieve all these UPDATE tasks. But i am not sure if this is a bad practice or an unprofessional way of doing things?
So for the past month ( a bit more) i tried to think of the best optimization/modification/update for this architecture, but I am not an expert in the document retrieval field (actually its my first mini search engine ever built).
I would appreciate any feedback or guidance on what i should do to be able to achieve this kind of architecture.
Thanks in advance.
PS:
Its a j2ee application using servlets.
I am using MySQL innodb (thus i cannot use full-text search option)
I would strongly recommend Sphinx Search Server, wchich is best optimized in full-text searching. Visit http://sphinxsearch.com/.
It's designed to work with MySQL, so it's an addition to Your current workspace.
I do not pretend to have THE solution but here is my ideas.
First, I though like you for time-consuming queries LIKE%% : I would execute a query limited to a few answers in MySQL, like a dozen, return that to user, and wait to see if user wants more matching records, or launch in background the full-query, depending on you indexation needs for future searches.
More generally, I think that storing everything in memory could lead, one day, to too-much memory consumption. And althrough the search-engine becomes faster and faster when it keeps everything in memory, you'll have to keep all these caches up-to-date when data is added or updated and it will certainly take more and more time.
That's why I think the solution I saw a day in an "open-source forum software" (I couldn't remember its name) is not too bad for text searching in posts : each time a data is inserted, a table named "Words" keeps tracks of every existing word, and another table (let's say "WordsLinks") the link between each word and posts it appears in.
This kind of solution has some drawbacks:
Each Insert, Delete, Update in database is a lot slower
Data selection for search engine must be anticipated : if you choose to keep two letter words you never kept, it is too late for already recorded data, unless you launch a complete data re-processing.
You must take care of DELETE as well as UPDATE and INSERT
But I think there are some big advantages:
Computing time is probably the same than the "memory solution" (eventually), but it is divided in each database Create/Update/Delete, rather than at query time.
Looking for a whole word, or words "starting with" is instantaneous : when indexed, searching in "Words" table is dichotomic. And "WordLinks" table query is very fast either with an index.
Looking for multiple words at the same time could be simple : gather a group of "WordLinks" for each found Word, and execute an intersection on them to keep only "Database Ids" common to all these groups. For example with the words "tree" and "leaf", the first one could give Table records {1, 4, 6}, and the second one could give {1, 3, 6, 9}. So with an intersection it is simple to keep only common parts : {1, 6}.
A "Like %%" in a single-column table is probably faster than a lot of "Like %%" in different fields of different tables. And each database engine handles some cache : "Words" table could be little enough to be kept in memory
I think there is a small risk of performance and memory problems if data becomes huge.
As every search is fast, you can even look for synonyms. For example search "network" if user didn't find anything with "ethernet".
You can apply rules, like splitting camel case words to generate for example the 3 words "wood", "X", "woodX" from "woodX". Each "word" is very lightweight to store and find, so you can do a lot of things.
I think the solution you need could be a blend of methods : for example you can keep lightweight UPDATE, INSERT, DELETE, and launch "Words" and "WordsLinks" feeding from a TRIGGER.
Just for anecdote, I saw a software developped by my company in which it was decided to keep "everything" (!) in memory. It leads us to recommend to our customers to buy servers with 64GB RAM. A little bit expensive. It explains why I am very prudent when I see solutions that could lead, eventually, to memory filling.
I have to say, I don't think your design fits the problem very well. The issues that you see now are consequences of that. And apart from that, your current solution doesn't scale.
Here is a possible solution:
Redesign your database to only contain authoritative data, but no derived data. So all cache entries must vanish from MySQL.
Keep data only for the duration of a request in memory within your application. This makes the design of your application much simpler (think race conditions) and enables you to scale to a sensible number of clients.
Introduce a caching layer. I'd strongly recommend to use an established product, rather than building this yourself. This frees you of all the custom built caching logic in your application and even does the job much better.
You can take a look at Redis or Memcached for the caching layer. I think an LRU strategy should fit here. Depending on how complex your queries become, a dedicated indexed search mechanism like Lucene might make sense as well.
I'm sure this can be implemented in MySQL but it would be a lot less effort to just use an existing search-oriented database such as Elasticsearch. It uses Lucene library to implement the inverted index, has extensive documentation, supports horizontal scaling, fairly simple query language and so forth. I guess it has been quite a lot of work to get this far, and it will be even more work to handle caches, race conditions, bugs, performance issues etc. to make the solution "production grade".

Resources