I'm trying to get the number of elements that match a search query.
The problem is that the search will produce a SizeLimitExceededException from time to time, and I would like to know exactly how many entries match the query, so counting the results returned by the search is not an option.
Any ideas?
Thanks in advance :)
There is a reason the size limit exists: it prevents clients from trawling the directory for object information, counting the number of objects, and so on. Trawling the directory is a) a security risk and b) a performance drain on older legacy server software.
There is also a time-limit that constrains the number of seconds a server may spend on a particular search which may come into play.
If all the entries that match your search filter (and no others) are subordinate to a single object, configure the server to support the numSubordinates attribute. That attribute (where supported) holds the number of objects directly subordinate to the entry in which it appears. This method requires that every entry matching your filter, and nothing else, be stored beneath that one object.
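For what it's worth, here is a minimal Python sketch of that approach, assuming the ldap3 library and placeholder host, credentials and base DN:

from ldap3 import Server, Connection, BASE

server = Server('ldap.example.com')  # hypothetical host
conn = Connection(server, 'cn=reader,dc=example,dc=com', 'secret', auto_bind=True)

# Read the parent entry itself; numSubordinates is an operational attribute,
# so it must be requested explicitly.
conn.search('ou=people,dc=example,dc=com', '(objectClass=*)',
            search_scope=BASE, attributes=['numSubordinates'])
if conn.entries:
    print(conn.entries[0]['numSubordinates'])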
A plugin could be written to provide this functionality; a plugin often has root DN access to the server database, is often not subject to access controls, and as such may be able to count entries.
On recent, professional-quality servers, a DN could be created with the appropriate privileges to count the number of entries. An application could be furnished with this DN and its credentials for the sole purpose of counting the entries.
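As a rough illustration (again Python with ldap3; the privileged DN, password and filter are placeholders), such an application could page through the results and count what comes back:

from ldap3 import Server, Connection, SUBTREE

server = Server('ldap.example.com')
conn = Connection(server, 'cn=counter,dc=example,dc=com', 'secret', auto_bind=True)

# Page through all matching entries without requesting attributes and count them.
entries = conn.extend.standard.paged_search(
    'dc=example,dc=com', '(objectClass=person)',
    search_scope=SUBTREE, paged_size=500, generator=True)
print(sum(1 for _ in entries))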
Related
We are designing an LDAP schema (specifically for OpenDJ) and we primarily need to be able to search on the mail attribute. We don't need to do a substring search as the user would provide the whole email address when they log in.
We already have an index on the mail attribute. However, we are also considering sub-dividing the user directory by the first letter of the email address (so all users with an email address that starts with the letter A would be in an ou=A subdirectory under ou=users). The only value I can see in doing this is that when we search for a user by email, we can limit the baseDN of the search, thus reducing the scope of the search to approximately 1/26 of the entire directory.
My primary question is, does limiting the baseDN of an LDAP search like this provide any improvement on performance if the attribute already has an index? Do indexes take into account the baseDN, or are they indexed over the whole directory?
A secondary question, if I'm allowed: is there any other use for splitting the users directory by first letter (or any other arrangement), other than providing a more specific baseDN when searching?
What you are thinking about seems like premature optimization when you don't even know if you have a performance issue.
Also, indexes and query processing are not standard elements of LDAP; they are implementation details of the technology you are using.
In OpenDJ, an index is configured and maintained for a whole database backend.
The cost of a lookup in the email equality index and returning a single entry is the same whether you have 1 entry or 1 billion entries.
I have more than 20 years of experience with LDAP and directory services, and I've never seen a directory structured by splitting entries on the first letter of an attribute.
I once (and only once) encountered a problem similar to the one you're anticipating -- essentially you've got so many records that searching for a record creates an unacceptable user experience. In my case, there were over a million customers in the directory. What is now a rather old iteration of IBM's Tivoli Directory Server had several bugs that meant searching the directory took minutes to accomplish (indexes or no indexes). No one wants to wait minutes to log in and pay their bill! And we were constrained to using IBM's LDAP server.
In that case, the e-mail address was used as the naming attribute when the account was created, and I never searched the directory. I.e. I am cn=lisa#example.com,ou=customers,o=example within the directory. When I log in with lisa#example.com, the site programmatically builds the bind DN as "cn=" + userInput + ",ou=customers,o=example" and validates the supplied password instead of searching for my account.
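A loose Python sketch of that "build the DN, never search" approach, assuming the ldap3 library and a placeholder host; real code would also escape the RDN value per RFC 4514:

from ldap3 import Server, Connection
from ldap3.core.exceptions import LDAPBindError

def authenticate(user_input, password):
    # NOTE: special DN characters (commas, plus signs, ...) must be escaped
    # in production; omitted here to keep the sketch short.
    bind_dn = 'cn=' + user_input + ',ou=customers,o=example'
    try:
        Connection(Server('ldap.example.com'), bind_dn, password, auto_bind=True)
        return True
    except LDAPBindError:
        return False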
I realize this is possible with the FileNet P8 API; however, I'm looking for a way to find the physical document path within the database. Specifically, there are two levels of subfolders in the FileStore, like FN01\FN13\DocumentID, but I can't find a reference to FN01 or FN13 anywhere.
You will not find the names of the folders anywhere in the FN databases. The folder structure is determined by a hashing function. Here is an excerpt from this page on filestores:
Documents are stored among the directories at the leaf level using a hashing algorithm to evenly distribute files among these leaf directories.
The IBM answer is correct only from a technical standpoint of intended functionality.
If you really, really need to find the document file name and folder location, disable your actual file store(s) by making the file store folder(s) unavailable to the Content Engine. I did that for each file store by simply renaming the root FN#'s to FN#a; for instance, FN3 became FN3a. Once done, I changed the top tree folder back. I used that method so log files would not exceed the tool's maximum output. Any method that leaves a storage location (e.g. drive, share, etc.) accessible and searchable, but renders the individual files unavailable, should produce the same results.
Then, run the Content Engine Consistency Checker. It will provide you with a full list of all files, IDs and locations.
After that, you can match the entries to the OBJECT_ID fields in the database tables. In non-MSSQL databases, the byte ordering is reversed for the first few octets of the UUID. You need to account for that and fix the byte ordering to match the CCC output.
...needs to be byte reversed so that it can be queried upon in Oracle.
When querying on GUIDs, GUIDs are stored in byte-reversed form in Oracle and DB2 (not MS SQL), whereby the first three sections are pair-reversed and the last two are left alone.
Thus, the same applies in reverse. In order to use the output from the Content Consistency Checker to match output to database, one must go through the same byte ordering reversal.
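If it helps, here is a small Python sketch of that pair-reversal, assuming the identifier is a standard UUID/GUID; uuid's "bytes_le" form reverses exactly the first three sections:

import uuid

def fem_guid_to_db_hex(guid_str):
    # Pair-reverse the first three sections, leave the last two alone,
    # to match the raw bytes stored in Oracle/DB2.
    return uuid.UUID(guid_str).bytes_le.hex().upper()

def db_hex_to_fem_guid(raw_hex):
    # Inverse: raw database bytes back to the displayed GUID.
    return str(uuid.UUID(bytes_le=bytes.fromhex(raw_hex))).upper()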
See this IBM Tech Doc and the answer linked below for details:
IBM Technote: https://www.ibm.com/support/pages/node/469173
Stack Answer: https://stackoverflow.com/a/53319983/1854328
More detailed information on the storage mechanisms is located here:
IBM Technote: "How to translate the unique identifier as displayed within FileNet Enterprise Manager so that it matches what is stored in the Oracle and DB2 databases"
I do not suggest using this for anything but catastrophic need, such as rebuilding and rewriting an entire file store that got horrendously corrupted by your predecessor when they destroyed an NTFS (or some similarly nasty situation).
It is a workaround that bypasses the hashing FileNet uses to obfuscate content information from anyone looking at the file system.
Given a Resource such as DeviceObservationReport, a number of fields have cardinality 0..many. In some cases these contain reference(s) to other Resource(s) which may also have cardinality 0..many. I am having considerable difficulty in deciding how to support 'chained' queries over referenced Resources which may be two or three steps 'deep' (for want of a better term).
For example, in a single DeviceObservationReport there may be multiple Observation Resource references. It is entirely probable that a client may wish to perform a query which requests all instances of an Observation with a specific code, which have a timestamp (appliesDate) later than a specific instant in time. The named Search Parameter observation would appear to be the obvious starting point and the Path to the observation is specified as virtualDevice.channel.metric.observation. Given that the virtualDevice, channel, and metric fields have cardinality 0..*, would a 'simple' query to retrieve all DeviceObservationReport instances which contain observations with code TESTCODE and observed later than 14:00 on 10 October 2014 look something like:
../data/DeviceObservationReport?virtualDevice[0].channel[0].metric[0].observation.name=TESTCODE&virtualDevice[0].channel[0].metric[0].observation.date>2014-10-10%2014:00
Secondly, if the client requests that the result set be sorted on date, how would that be expressed in the query? From the various attempts I have made to implement this, support for the query becomes rather more complex at that point, and so far I have not been able to come up with a satisfactory solution.
Firstly, the path for the parameter is the path within the resource, and chaining links between the defined parameter names. So your query would look like this:
../data/DeviceObservationReport?observation.name=TESTCODE&observation.date=>2014-10-10%2014:00
I.e. the search parameters are aliases within the resource. However, the problem with this search is that the parameters are ANDed at the root, not the leaf, which means it finds all device observation reports that have an observation with code TESTCODE and that have an observation with date >DATE. That is subtly different from what you probably want: all device observation reports that have an observation with code TESTCODE and a date >DATE. This will be addressed in the next major release of FHIR.
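Purely for illustration, issuing that chained search from Python with the requests library (the base URL is a placeholder; requests takes care of URL-encoding the '>' prefix on the date):

import requests

params = {
    'observation.name': 'TESTCODE',
    'observation.date': '>2014-10-10 14:00',
}
resp = requests.get('https://fhir.example.org/data/DeviceObservationReport', params=params)
resp.raise_for_status()
print(resp.json())  # the returned bundle of matching reports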
Sorting is tough with a chained query. I ended up extracting the field that I sort by, but not actually sorting by it at search time - I insert the raw matches into a holding table, and then sort by the sort field as I access the secondary table. The principal reason for this is to make paging robust against ongoing changes to the resources.
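A rough sketch of that holding-table idea in Python with sqlite3, purely for illustration; the find_matching_observations() helper and the resources table are assumptions, not part of any FHIR server API:

import sqlite3

db = sqlite3.connect('fhir_index.db')
db.execute('CREATE TEMP TABLE matches (resource_id TEXT, sort_key TEXT)')

# Step 1: store the raw matches, extracting the sort field but not sorting yet.
for resource_id, applies_date in find_matching_observations():  # hypothetical
    db.execute('INSERT INTO matches VALUES (?, ?)', (resource_id, applies_date))

# Step 2: sort on the extracted key only when reading a page back out, so paging
# stays stable even while the underlying resources keep changing.
page = db.execute(
    'SELECT r.content FROM matches m '
    'JOIN resources r ON r.id = m.resource_id '
    'ORDER BY m.sort_key LIMIT 20 OFFSET 0').fetchall()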
This is a general design problem - I want to validate a username field for uniqueness when the user enters the value and tabs out. I do an Ajax validation and get a response from the server. This is all very standard. Now, what if I have a HUGE user database? How do I handle this situation? I want to find out whether a username like "foozbarz" is present among 150 million usernames.
Database queries are out of the question [EDIT] - read the username database once and populate the cache/hash for faster lookups (to clarify Emil Vikström's point)
In-memory databases won't help either
Keep an in-memory hash (or cache/memcache) to store all usernames - usernames can be easily hashed and lookups will be very fast. But there are some problems with this:
a. Size of the hash - can we optimize it so that we can reduce the hash size?
b. Hash/cache refresh frequency (users might get added while we are validating)
Shard the username table based on some criteria (e.g.: A-B in table username_1 and so on) - thanks piotrek for this suggestion
Or, any other better approach ?
Why don't you simply partition the data? If you have or plan to have 150M+ users, I assume you have (or will have) the budget for this. If you are just starting (with 2k users), do it the traditional way with a simple indexed search on the database. When you have so many users that you observe performance issues, and have measured that the database is the cause (and not, e.g., the web server), then you simply add another database. On the first one you keep users with names from a to m, and the rest go on the other. You may choose another criterion, such as a hash, to keep the data balanced. When you need more, you add more databases. But if you don't have that many users right now, I advise you not to do any premature optimization; there are many things that may become a bottleneck with this amount of data.
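A small Python sketch of routing a lookup to a partition, with the shard names purely as placeholders - either by first letter, as suggested above, or by a stable hash:

import hashlib

SHARDS = ['users_a_to_m', 'users_n_to_z']  # hypothetical connection names

def shard_by_letter(username):
    return SHARDS[0] if username[:1].lower() <= 'm' else SHARDS[1]

def shard_by_hash(username):
    digest = hashlib.md5(username.lower().encode('utf-8')).digest()
    return SHARDS[digest[0] % len(SHARDS)]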
You are most likely right about doing some kind of hashing where you store the taken names; obviously, a name that is not in the hash is free.
What you shouldn't do is rely on that validation. There can be a lot of time between the user checking whether a name is free and actually pressing Register.
To be fair, you only have one issue here, and that's whether you REALLY need to worry about reaching 150 million users. Scalability is often an issue, but unless this happens overnight, you can probably swap in a better solution before it does.
Secondly, there's your worry about two users both getting a THIS NAME IS FREE and then one taking it. First of all, the chances of that happening are pretty damn low. Second, the only ways I can think of ‘solving’ this so that a user will never click OK with a validated name and still get a USERNAME TAKEN are to either
a) Remember what each user validated last, store that, and if someone else registers that name in the meantime, use AJAX to change the name field to taken and notify the user. Don't do this. A lot of wasted cycles and really too much effort to implement.
b) Lock usernames as user validates one, for a short period of time. This results in a lot of free usernames coming up as taken when they actually aren't. You probably don't want this either.
The easiest solution is to simply insert the name into the table when the user actually clicks OK, but before doing that, check again whether the name exists. If it does, just send the user back with USERNAME TAKEN. The chances of someone racing someone else for a name are really, really slim, and I doubt anyone will make a big fuss over how your validator (which did its job; the name was free at the point of checking) ‘lied’ to the user.
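A minimal sketch of that register-time check in Python, using sqlite3 purely for illustration and letting the unique key have the final word:

import sqlite3

db = sqlite3.connect('users.db')
db.execute('CREATE TABLE IF NOT EXISTS users (username TEXT PRIMARY KEY)')

def register(username):
    try:
        with db:
            db.execute('INSERT INTO users (username) VALUES (?)', (username,))
        return True                      # name was still free
    except sqlite3.IntegrityError:
        return False                     # someone raced us: USERNAME TAKEN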
Basically your only issue is how you want to store the nicknames.
Your #1 criterion is flawed, because this is exactly what you have a database system for: to store and manage data. Why even have a table with usernames if you're not going to read it?
The first thing to do is to improve the database system by adding an index, preferably a HASH index if your database system supports it. You will have a hard time writing anything that comes near this performance yourself.
If this is not enough, you must start scaling your database, for example by building a clustered database or by partitioning the table into multiple sub-tables.
What I think is a fair thing to do is implement caching in front of the database, but only for single names. Not all usernames will have a collision attempt, so you may cache the small subset where collisions typically happen. A simple algorithm for checking the collision status of USER (sketched in code after the list):
Check if USER exist in your cache. If it does:
Set a "last checked" timestamp for USER inside the cache
You are done and USER is a collision
Check the database for USER. If it does exist:
Add USER to the cache
If the cache is full (all X slots are used), remove the least recently used username from the cache (or the Y least recently used usernames, if you want to minimize cache pruning).
You are done and USER is a collision
If it didn't match the cache or the db, you are done and USER is NOT a collision.
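Here is one way the steps above might look in Python, with an OrderedDict standing in for the cache/memcache and a hypothetical db_username_exists() doing the database lookup:

from collections import OrderedDict

CACHE_SIZE = 100_000
taken_cache = OrderedDict()              # username -> last-checked marker

def is_collision(user):
    if user in taken_cache:
        taken_cache.move_to_end(user)        # refresh "last checked"
        return True                          # cached: USER is a collision
    if db_username_exists(user):             # hypothetical database lookup
        taken_cache[user] = True
        if len(taken_cache) > CACHE_SIZE:
            taken_cache.popitem(last=False)  # evict the least recently used name
        return True
    return False                             # not in cache or DB: not a collision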
You will of course still need a UNIQUE constraint in your database to avoid race conditions.
If you're going the traditional route you could use an appropriate index to improve the database lookup.
You could also try using something like ElasticSearch which has very low latency lookups on large data sets.
If you have 150M+ users, you will have to have in place some function that:
Checks that the user exists, and signals if not found
Verifies the password is correct, and signals if it is not
Retrieves the user's data
You will have this problem regardless and will have to solve it, in all likelihood with something akin to a user query. Even if you rely heavily on sessions, you will still have the problem of "finding session X among many in a 150M+ pool", which is structurally identical to "finding user X among many in a 150M+ pool".
Once you solve the bigger problem, the problem you now have is just its step #1.
So I'd check out a scalable database solution (possibly a NoSQL one), and implement the "availability check" using that.
You might end up with a
retrieveUserData(user, password = None)
which returns the user's data if the user and password are valid and correct. For the availability check, you would send no password and expect a UserNotFound exception if the username is available.
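A loose Python sketch of that interface; the user_store lookup and password check are assumptions rather than a specific NoSQL client:

class UserNotFound(Exception):
    pass

class InvalidPassword(Exception):
    pass

def retrieveUserData(user, password=None):
    record = user_store.get(user)            # hypothetical lookup
    if record is None:
        raise UserNotFound(user)
    if password is not None and not record.check_password(password):
        raise InvalidPassword(user)
    return record

def username_is_available(user):
    try:
        retrieveUserData(user)                # no password: availability check only
        return False
    except UserNotFound:
        return True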
I have a web application that stores projects in the database.
I have decided to use AppFabric Caching to speed up performance.
What would be the best pattern regarding the options below (and on which criteria should I decide):
store each project separately in the cache.
OR store the whole list in the cache (i.e. one key which represents the list of items)?
Many Thanks,
Joseph
It depends. There are a couple of considerations.
If the list is potentially enormous, the content of the single cache key could get very large (though this could be mitigated by enabling local caching). Serializing and de-serializing a large object graph like that will consume time and resources on your client.
You may, however, want to do this anyway, as your application may need to execute a LINQ to Objects query against the list after it has been de-serialized back from the cache.
If the queries you execute against the list are well defined, you could cache multiple flavors of the list under different cache keys - instead of people, you could have PeopleMale, PeopleFemale, PeopleAmerican, PeopleIrish, PeopleFrench etc.
If you do this you could potentially have the same person appearing under multiple cached person lists and you would have to manage this.
For example, say I have a female person with dual American and Irish citizenship. If I edit that person so the gender changes from female to male and the citizenship changes to Dutch, it would be necessary to invalidate four keys: PeopleMale, PeopleFemale, PeopleAmerican, and PeopleIrish.
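A generic Python sketch of that invalidation; cache.delete() here stands in for whatever your caching client provides and is not the AppFabric API:

def on_person_changed(person_before, person_after, cache):
    keys = set()
    for person in (person_before, person_after):
        keys.add('PeopleMale' if person.gender == 'male' else 'PeopleFemale')
        keys.update('People' + c.capitalize() for c in person.citizenships)
    for key in keys:
        cache.delete(key)   # drop every cached list the person left or joined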
The example I've given above could get tricky to manage - whether it's worth it really depends on your exact use case.
In general, where possible, I'd advise you to use cache keys containing lists only for relatively non-volatile reference data (countries, status types, nationalities, etc.).
Hope this helps.