Google translate character limit in a single POST - google-api

I have seen in unofficial comments that there is a 5000 character limit. In a single request I am sending an array of strings to be translated which could exceed this limit in total. It's also unclear if this limit applies to a single item in the array, or the total.
I need to know whether I have to modify my logic to batch up these requests if such a limit is imposed, but I couldn't find relevant information in the docs, including under https://cloud.google.com/translate/quotas

The 5000 character limit refers to all characters in the request, not to each item in the array. It's briefly mentioned here.
It's a performance limit, not an artificial one, so I don't think stacking the strings in arrays will make it any better.
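So if the combined length can exceed 5000 characters, you will need to split into multiple requests. A minimal sketch (TypeScript) of batching the array so each request stays under that budget; translateBatch is a placeholder for whatever POST you already send:

// Split an array of strings into batches whose combined length stays under
// a character budget (5000 here, matching the documented limit).
function chunkByCharBudget(items: string[], budget = 5000): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const text of items) {
    // An item longer than the budget would need its own request (or further
    // splitting), which this sketch does not handle.
    if (current.length > 0 && used + text.length > budget) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(text);
    used += text.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Usage: one POST per batch, e.g.
// for (const batch of chunkByCharBudget(strings)) { await translateBatch(batch); }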


How to implement Elasticsearch pagination with large dataset

Environment
.Net 5
Elasticsearch.Net.Aws 7.1.0
Problem
Even with pagination, Elasticsearch's query API does not support fetching more than 10_000 records by default; that is, if the sum of from and size exceeds 10_000, the API throws an error.
Potential solutions
Increase size
I can increase the index's max_result_window as described here. However, I am expecting a large dataset in production - probably fewer than 10_000_000 records at one time, but for obvious reasons I don't believe that simply increasing the window size is a good idea. My use-case does not require over-the-top performance, but it has to be reasonable for both the end-user and the AWS bill.
What do you think? What leeway do I have regarding the max_result_window setting?
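For reference, raising the window is just an index settings update; here is a TypeScript sketch against the plain HTTP API, with the endpoint and index name as placeholders:

// Raise max_result_window for one index, e.g. to 20_000 (placeholder values).
const esUrl = "http://localhost:9200"; // placeholder endpoint
async function raiseWindow(index: string, size: number): Promise<void> {
  await fetch(`${esUrl}/${index}/_settings`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ index: { max_result_window: size } }),
  });
}
// e.g. raiseWindow("my-index", 20_000);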
Track total hits
I've read about the track_total_hits parameter - it only returns the correct total hit count on each request, but still does not allow records after the 10_000th to be fetched.
Scroll API
I've read about the Scroll API - it is currently being deprecated, so I'd like to avoid it.
Search after
I've read about the search_after parameter - the concept is to define consistent sort criteria and run the exact same query for each page, the only difference being the value of search_after, which for every subsequent search should be the sort value of the last hit returned by the previous search.
As far as I can tell this is the recommended solution, but while it may work for large page sizes, I'm having difficulty understanding how it solves the basic paging case:
Let's say we have 20_000 records total and a page size of 10, hence 2_000 pages. How can I return the last page, containing records 19_991-20_000? Unless I misunderstand, search_after does not help, because I've skipped the earlier pages and don't have the sort value of record number 19_990.
Furthermore, per the docs:
If provided, the from argument must be 0 (default) or -1
This means that I cannot use a combination of both:
Perform one search with "from": "990"
Use the last record's sort value to perform a second search, again using a "from": "990"
Return the results of the second search.
Beyond that I cannot figure out another way to use it. Could you tell me where I'm getting it wrong?
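For completeness, the search_after loop I have in mind looks roughly like this (a TypeScript sketch against the plain HTTP API; the index name, sort field, and tiebreaker field are placeholders):

// Walk forward through all pages; each request reuses the sort values of the
// last hit from the previous page as search_after.
async function* pageThrough(esUrl: string, pageSize = 10) {
  let searchAfter: unknown[] | undefined;
  while (true) {
    const body: Record<string, unknown> = {
      size: pageSize,
      sort: [{ createdAt: "asc" }, { id: "asc" }], // consistent sort plus a unique tiebreaker field
      ...(searchAfter ? { search_after: searchAfter } : {}),
    };
    const res = await fetch(`${esUrl}/my-index/_search`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
    const page = await res.json();
    const hits = page.hits.hits;
    if (hits.length === 0) return;
    yield hits;
    searchAfter = hits[hits.length - 1].sort; // sort values of the last hit
  }
}

This only ever walks forward, which is exactly my problem: to render page 2_000 directly I would still have to issue all 1_999 requests before it just to obtain the last sort value.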

Is there a way to change the Search API facet count to show a total word count instead of the count of matching fragments (documents)?

I'm creating an application using MarkLogic 8 and the Search API. I need to create facets based on MarkLogic-defined collections, but instead of the facet count tallying the number of fragments (documents) that contain occurrences of the keyword searched for, I need the facet count to reflect the total number of times the keyword appears across all documents in the collection.
Right now, I'm using search:search() to process the query and return a search:response element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency() which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in Query Console, but it times out.
cts:element-values(
  QName("http://www.tei-c.org/ns/1.0", "TEI"),
  "", "item-frequency",
  cts:and-query((
    fn:collection("KirchlicheDogmatik/volume4/part3"),
    cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching - in your example it would report on the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.

Parse: limitations of count()

Anyone who's read Parse documentation has stumbled upon this
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why's there such limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display the count of products in each category on the top-level navigation screen. If you run a count query for each of these UI elements, they will not run efficiently on large data sets because MongoDB does not use counting B-trees. Instead, we recommend that you use a separate Parse Object to keep track of counts for each category. Whenever a product gets added or deleted, you can increment or decrement the counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some constraints in. If you only care about the total size of the collection, you can run a count query without any constraints and that one should be pretty fast, as getting the total number of records is a different problem than counting how many of these match an arbitrary list of constraints. This is just the reality of working with database systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
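A rough sketch of that counter approach in Parse Server-style Cloud Code (TypeScript flavoured; the Product and CategoryCount classes are made-up names, and the partial-failure problem described above is not handled):

declare const Parse: any; // Parse is provided as a global in Cloud Code

// Keep a per-category count in a separate CategoryCount object.
Parse.Cloud.afterSave("Product", async (request: any) => {
  if (request.original) return; // only count newly created objects, not updates
  const category = request.object.get("category");
  const query = new Parse.Query("CategoryCount");
  query.equalTo("category", category);
  const counter =
    (await query.first({ useMasterKey: true })) ??
    new Parse.Object("CategoryCount", { category, count: 0 });
  counter.increment("count");
  await counter.save(null, { useMasterKey: true });
});

Parse.Cloud.afterDelete("Product", async (request: any) => {
  const query = new Parse.Query("CategoryCount");
  query.equalTo("category", request.object.get("category"));
  const counter = await query.first({ useMasterKey: true });
  if (counter) {
    counter.increment("count", -1); // decrement on delete
    await counter.save(null, { useMasterKey: true });
  }
});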
The limitation is simply there to stop people from using counts too much; in effect, they're just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.

What is the maximum length of a display name in an email address

According to the question "What is the maximum length of a valid email address?", the maximum length of the address is 254. But I'd like to know what the maximum length of the display name would be:
Display Name <my@examplemailaddress.net>
According to this link https://www.ietf.org/mail-archive/web/ietf-822/current/msg00086.html the size is unlimited, but practically, according to this link https://www.ietf.org/mail-archive/web/ietf-822/current/msg00088.html, the size would be 72 characters. But I believe that answer is a bit outdated? What would be a reasonable limit for today?
If you are asking about the maximum length allowed by the specs (the normative source being RFC 5322 at the time of writing), then, indeed, there is no limit, since folding allows a field of unlimited length (while still respecting the advised 78 [or the larger hard 998] character line limit).
The practical limit is a very subjective question, since "practical" would be the length that is fully displayed by "most" clients and environments, and that is pretty hard to calculate.
I would say the upper limit of practicality is a total length of 78 characters from the "From:" header up to the last ">" character of the email address, since anything longer will probably wrap when displayed in almost all environments; that gives you around 40 characters to work with even for longer email addresses.
Most clients, however, probably expect to display around 20-25 characters under normal circumstances.
These are all displayed characters, not the actual length in bytes of the encoded address, however it is encoded (especially with long UTF-8 sequences).
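For concreteness, the arithmetic behind the "around 40 characters" estimate, as a small TypeScript helper (78 is the advised line limit; the sample address is the one from the question):

// Characters left for the display name on a single 78-character "From:" line.
function displayNameBudget(address: string, lineLimit = 78): number {
  // "From: " label, plus the space and angle brackets around the address, plus the address itself.
  const fixed = "From: ".length + " <>".length + address.length;
  return lineLimit - fixed;
}
// displayNameBudget("my@examplemailaddress.net") -> 44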

Getting the most frequent items without counting every item

I was wondering if there is an algorithm for counting "most frequent items" without having to keep a count of each item. For example, let's say I am a search engine and want to keep track of the 10 most popular searches. What I don't want to do is keep a counter for every query, since there could be too many queries for me to count (and most of them will be singletons). Is there a simple algorithm for this? Maybe something that is probabilistic? Thanks!
Well, if you have a very large number of queries (as a search engine presumably would), then you could just do "sampling" of queries. So you might be getting 1,000 queries per second, but if you only keep a count of one per second, then over a longish period of time you'd get an answer that is relatively close to the "real" answer.
This is how, for example, a "sampling" profiler works. Every n milliseconds it looks at what function is currently being executed. Over a long period of time (several seconds) you get a good idea of the "expensive" functions, because they're the ones that appear in your samples more often.
You still have to do "counting", but by taking periodic samples instead of counting every single query, you put an upper bound on the amount of data you actually have to store (e.g. a maximum of one query per second, etc.).
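A minimal sketch of that sampling idea in TypeScript - record at most one query per second and tally only the samples:

// Tally only sampled queries instead of every query.
const sampleCounts = new Map<string, number>();
let lastSampleMs = 0;

function onQuery(query: string): void {
  const now = Date.now();
  if (now - lastSampleMs < 1000) return; // this second's sample was already taken
  lastSampleMs = now;
  sampleCounts.set(query, (sampleCounts.get(query) ?? 0) + 1);
}

// Top 10 by sampled count; over a long window this approximates the true top 10.
function topK(k = 10): [string, number][] {
  return [...sampleCounts.entries()].sort((a, b) => b[1] - a[1]).slice(0, k);
}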
If you want the most frequent searches at any given time, you don't need endless counters keeping track of every query submitted. Instead, you need an algorithm that measures the number of submissions for any given query divided by a set period of time. This is a pretty simple algorithm. Any search submitted to your search engine, for example the word “cache,” is stored for a fixed period of time called the refresh rate (the length of your refresh rate depends on the kind of traffic your search engine is getting and the number of “top results” you want to keep track of). If the refresh-rate period expires and searches for the word “cache” have not persisted, the query is deleted from memory. If searches for the word “cache” do persist, your algorithm only needs to keep track of the rate at which the word “cache” is being searched. To do this, simply store all searches on a “leaky counter”: every entry is pushed onto the counter with an expiration date, after which the query is deleted. Your active counters are the indicators of your top queries.
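A sketch of that leaky counter in TypeScript; the refresh rate is an arbitrary placeholder you would tune to your traffic:

// Each query gets an entry with an expiry; entries that are not refreshed
// within the refresh window are evicted, so only recurring queries survive.
const REFRESH_MS = 60_000; // placeholder refresh rate
const counters = new Map<string, { count: number; expiresAt: number }>();

function record(query: string): void {
  const now = Date.now();
  const entry = counters.get(query);
  if (entry && entry.expiresAt > now) {
    entry.count += 1;
    entry.expiresAt = now + REFRESH_MS; // the query persisted, keep tracking it
  } else {
    counters.set(query, { count: 1, expiresAt: now + REFRESH_MS });
  }
}

function evictExpired(): void {
  const now = Date.now();
  for (const [query, entry] of counters) {
    if (entry.expiresAt <= now) counters.delete(query); // query did not persist
  }
}

// The surviving counters with the highest counts are your current top queries.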
Storing each and every query would be expensive, yet necessary to ensure the top 10 are actually the top 10. You'll have to cheat.
One idea is to store a table of URLs, hit counters, and timestamps, indexed by count and then by timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should make it into the table because of their faster query rate.
Another idea would be to write a 16-bit (or more) hash function for search queries. Have a 65536-entry table holding counters and URLs. When a search is performed, increment the respective table entry and set the URL if necessary. However, this approach has a major drawback. A spam bot could make repeated queries like "cheap viagra", possibly making legitimate queries increment the spam query counters instead, placing their messages on your main page.
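A sketch of that fixed-size table in TypeScript; hash16 is a toy hash for illustration, and as noted, colliding queries share a counter:

const SLOTS = 65536; // 16-bit hash space
const counts = new Uint32Array(SLOTS);
const queries: (string | undefined)[] = new Array(SLOTS);

// Toy 16-bit string hash, for illustration only.
function hash16(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) & 0xffff;
  }
  return h;
}

function recordQuery(q: string): void {
  const slot = hash16(q);
  counts[slot]++;
  queries[slot] = q; // last writer wins; colliding queries share this slot
}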
You want a cache, of which there are many kinds; see the Wikipedia articles on cache algorithms and on the Aging page replacement algorithm.
