I would like to know how the overall coherence is measured for 'u_mass', 'c_v', 'c_uci', and 'c_npmi' for each set of topics in gensim's CoherenceModel (https://radimrehurek.com/gensim/models/coherencemodel.html).
Is it based on the average of coherence values: (coherence topic 1 + coherence topic 2 + .... + coherence topic n)/n ?
For example, if we have 5 topics, the overall coherence would be (coherence topic 1 + coherence topic 2 + coherence topic 3 + coherence topic 4 + coherence topic 5) divided by 5
Normally, the overall coherence of a model is the average of the per-topic coherence values (so your five-topic example is correct). c_v, c_uci, c_npmi, and u_mass are slightly different coherence metrics, all based on word co-occurrence statistics: c_uci, c_npmi, and c_v use (normalized) pointwise mutual information, while u_mass uses log conditional probabilities of document co-occurrence.
A good survey paper that details all of the above coherence metrics can be found here.
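Here is a minimal sketch of how to check this with gensim's CoherenceModel, using toy placeholder data: get_coherence_per_topic() returns one value per topic, get_coherence() returns the aggregated value, and with the default aggregation the latter is just the arithmetic mean of the former.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy data just to make the sketch self-contained.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, random_state=0)

cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
per_topic = cm.get_coherence_per_topic()  # one coherence value per topic
overall = cm.get_coherence()              # aggregated model-level coherence

print(per_topic)
print(overall, np.mean(per_topic))  # the two numbers should agree
```

The same applies to the window-based measures: pass the tokenized texts (texts=texts) and coherence='c_v', 'c_uci', or 'c_npmi' instead of the bag-of-words corpus.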
I understand that the hot-warm(-cold-frozen-delete) lifecycle is a great tool, but I haven't found much numerical documentation: one of the few documents that gives examples with numbers (rather than just feature descriptions) is this blog post. In the hot-warm example without rollup, it seems to me that the main storage optimization comes from the number of replicas (quick arithmetic check after the list):
one day of data = 86.4 GB
7 hot days = one day of data × 7 days × 2 replicas ≈ 1.2 TB
23 warm days (30 − 7) = one day of data × 23 days × 1 replica ≈ 1.98 TB
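Just to spell out how I read those numbers (trivial arithmetic; the 86.4 GB/day figure comes from the blog post):

```python
day_gb = 86.4                      # one day of data, from the blog post
hot_tb = day_gb * 7 * 2 / 1000     # 7 days on hot nodes, 2 copies -> ~1.2 TB
warm_tb = day_gb * 23 * 1 / 1000   # days 8-30 on warm nodes, 1 copy -> just under 2 TB
print(hot_tb, warm_tb)
```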
There are other resources like this webinar, yet it doesn't distinguish between storage usage and RAM usage. Is there an official document (or a third-party experiment/report) that shows whether and by how much the cold/frozen/"non-searchable snapshot after deletion" phases optimize storage usage? Or is it only about lower RAM usage?
There can't be a single "benchmark" here since ILM is just a tool that allows tuning your hardware configuration according to your data usage patterns.
For example, suppose you have heavy indexing and heavy searching across all of your data. In that case, you don't want to reduce your replica count for the old data, and the gain would come primarily from slightly cheaper "warm" SSD storage. So the difference here would be minimal, or none at all if the separation overhead cancels out that gain.
An opposite example would be storing logs for compliance purposes (lots of writes but minimal reads, mostly of the last 24 hours). Then you probably want to move everything older than a week or so into the "frozen" tier, which uses S3 buckets for storage and is very cheap. Also, those shards don't count towards the cluster shard count with regard to heap usage and stability. In this case, tiered storage can turn out to be orders of magnitude cheaper than a single-tier cluster.
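To make the tiering concrete, here is a rough sketch of what such a policy could look like via the Python Elasticsearch client. The policy name, ages, replica counts, and snapshot repository are made-up examples, not recommendations, and the exact client method signature depends on your client version.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Illustrative ILM policy: roll over daily, drop replicas in warm,
# move to the frozen tier (backed by a snapshot repository) after 30 days,
# and delete after a year.
policy = {
    "phases": {
        "hot": {"actions": {"rollover": {"max_age": "1d"}}},
        "warm": {
            "min_age": "7d",
            "actions": {"allocate": {"number_of_replicas": 0}},
        },
        "frozen": {
            "min_age": "30d",
            "actions": {"searchable_snapshot": {"snapshot_repository": "my-s3-repo"}},
        },
        "delete": {"min_age": "365d", "actions": {"delete": {}}},
    }
}

es.ilm.put_lifecycle(name="logs-example-policy", policy=policy)
```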
We have an application with 150 queues, and now we want to introduce priority messages for all of these queues.
All these queues are durable and persistent in nature.
Is it possible to convert these queues into priority queues? Is there any documentation available for this? Maybe I missed it in the docs.
We plan to introduce 2 priority levels, either 0 or 1 (default). In https://www.rabbitmq.com/priority.html I found that
with more priority levels, more CPU resources are needed.
I couldn't find how much CPU usage will increase, or on which factors it depends (e.g. the number of messages). Are there any stats or studies available?
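For reference, this is roughly how we would declare one of these queues with a priority level, using the Python pika client (queue name and connection details are placeholders):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare a durable queue that supports priorities 0..1 via the
# x-max-priority optional argument. As far as I know, optional arguments
# are fixed at declaration time, so an existing queue cannot simply be
# redeclared with x-max-priority added.
channel.queue_declare(
    queue="orders",  # illustrative queue name
    durable=True,
    arguments={"x-max-priority": 1},
)

# Publish a persistent message with priority 1; messages published without
# a priority property are treated as the lowest priority (0).
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"high priority payload",
    properties=pika.BasicProperties(delivery_mode=2, priority=1),
)

connection.close()
```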
The Consul reference architecture mentions the statement below:
"In any case, there should be high-bandwidth, low-latency (sub 8ms round trip) connectivity between the failure domains."
What happens if the RTT is more than 8ms? What is the maximum allowed RTT between 2 nodes in a cluster?
This limitation primarily applies to latency between Consul servers. Excessive latency between the servers could cause instability with Raft, which could affect the availability of the server cluster.
Clients, however, can operate with higher latency thresholds. HashiCorp is in the process of updating the documentation to clarify this and to list acceptable latency thresholds for client agents.
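In the meantime, one way to see whether server-to-server latency is actually hurting Raft is Consul's autopilot health endpoint, which reports each server's last-contact time. A rough sketch with Python requests (the agent address is a placeholder, and you may need to pass an ACL token header):

```python
import requests

# Query the autopilot health endpoint on any Consul server.
resp = requests.get("http://127.0.0.1:8500/v1/operator/autopilot/health")
resp.raise_for_status()
health = resp.json()

print("Cluster healthy:", health["Healthy"])
for server in health["Servers"]:
    # LastContact is the time since the leader last heard from this server.
    print(server["Name"], server["Healthy"], server["LastContact"])
```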
Unless I'm missing an obvious list that's provided somewhere, there doesn't seem to be a list that gives examples of large-ish Elastic clusters.
To answer this question I'd appreciate it if you could list a solution you know of and some brief details about it. Note that no organisational details need be shared unless these are already public.
Core info
Number of nodes (machines)
GB of index size
GB of source size
Number of documents / items (millions)
When the system was built (year)
Any of the following information would be appreciated as well:
Node layout / GB of memory in each node; number of master nodes (generally smaller), number and layout of data nodes
Ingest and / or query performance (docs per second, queries per second)
Types of CPU - num cores, year of manufacture or actual CPU specifics
Any other relevant information or suggestions of types of additional info to request
And as always, many thanks for this; I hope it helps all of us!
24 m4.2xlarge for data nodes,
separate masters and monitoring cluster
multiple indices (~30 per day), 1-2 TB of data per day
700-1000M documents per day
It has been continuously built, changed, and optimized (since version 1.4)
hundreds of search requests per second, 10-30k documents per second
We have a fairly big Greenplum v4.3 cluster: 18 hosts, each with 3 segment instances, roughly 40 cores, and 60 GB of memory.
The table we are testing against has 30 columns and about 0.1 billion (100 million) rows. The query we are testing has a 3-10 second response time when there is no concurrency pressure. As we increase the number of queries fired in parallel, the latency increases from about 3 seconds on average to around 50 seconds, as expected.
But we've found that regardless of how many queries we fire in parallel, we only get a very low QPS (queries per second), roughly 3-5 queries/sec. We've set max_memory=60G, memory_limit=800MB, and active_statements=100, hoping CPU and memory would be highly utilized, but they are still poorly used, around 30%-40%.
I have a strong feeling that we are not feeding the cluster with parallel work in the right way, even though we hoped to get the most out of CPU and memory utilization. It doesn't work as we expected. Is there anything wrong with the settings, or is there something else I am not aware of?
There might be multiple reasons for such behavior.
Firstly, every Greenplum query uses no more than one processor core per logical segment. Say you have 3 segments on every node and 40 physical cores per node: running two parallel queries will utilize at most 2 × 3 = 6 cores on every node, so you will need about 40 / 3 ≈ 13 parallel queries to utilize all of your CPUs. So, for your number of cores per node, it may be better to create more segments (gpexpand can do this). By the way, are the tables used in the queries compressed?
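To put numbers on the previous paragraph (a trivial sketch using the figures from the question):

```python
cores_per_host = 40
segments_per_host = 3

# Each query uses at most one core per segment, i.e. 3 cores per host.
cores_per_query = segments_per_host
parallel_queries_needed = cores_per_host / cores_per_query
print(parallel_queries_needed)  # ~13 queries in flight to keep every core busy
```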
Secondly, it may be a bad query. If you provide the plan for the query, it may help to understand what is going on. There are some query types in Greenplum for which the master can become a bottleneck.
Finally, it might be some bad OS or block-device settings.
I think the documentation page Managing Resources might help you manage your resources.
You can use resource groups to limit/control your resources, especially the CONCURRENCY attribute (the maximum number of concurrent transactions, including active and idle transactions, that are permitted in the resource group).
Resource queues help limit ACTIVE_STATEMENTS.
Note: ACTIVE_STATEMENTS is the total number of statements currently running; when your queries take around 50 s each and new ones keep arriving, a value of 100 may not work well, and sizing it as something like 5 active statements × 50 s queries may be better.
Also, you need to configure memory/CPU settings so that your queries can proceed.
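A rough sketch of what setting up a resource queue might look like (run via psycopg2 here only to keep all examples in one language; the queue name, limits, and role are placeholders, and the actual values depend on your workload):

```python
import psycopg2

# Connect to the Greenplum master (connection details are placeholders).
conn = psycopg2.connect(host="gp-master", dbname="mydb", user="gpadmin")
conn.autocommit = True
cur = conn.cursor()

# Cap the number of concurrently running statements and the memory they share;
# ACTIVE_STATEMENTS and MEMORY_LIMIT values here are illustrative, not tuned.
cur.execute("""
    CREATE RESOURCE QUEUE adhoc_queue
    WITH (ACTIVE_STATEMENTS = 13, MEMORY_LIMIT = '8GB');
""")

# Assign a role to that queue so its queries are throttled by it.
cur.execute("ALTER ROLE report_user RESOURCE QUEUE adhoc_queue;")

cur.close()
conn.close()
```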