etcd 3.5 db_size growing constantly

I have a 3-node etcd cluster running version 3.5.2. I noticed that each endpoint's db_size is constantly growing, and I have to run compaction and defragmentation manually so that db_size does not reach the limit. I never ran into a similar problem with version 3.2.
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|      ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.201.64.106:2379 | 6af28eee6b8fd63a | 3.5.2   | 18 MB   | true      | false      | 3         | 7509221    | 7509221            |        |
| 10.201.64.107:2379 | 8t2ae31d2c14413e | 3.5.2   | 18 MB   | false     | false      | 3         | 7509221    | 7509221            |        |
| 10.222.82.121:2379 | c6131f42ed372576 | 3.5.2   | 18 MB   | false     | false      | 3         | 7509221    | 7509221            |        |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
I expect the db size not to grow this fast, or else not to have to run the defrag process manually.
What should I do about this issue?
Thanks in advance for your help.

DB Size is expected to grow without compaction and de-fragmentation.
You can set up auto compaction (for example via the --auto-compaction-mode and --auto-compaction-retention server flags) to make sure you don't hit the size limit; defragmentation still has to be triggered explicitly, e.g. with etcdctl defrag.
All in all, 18 MB is quite small; the default size limit is 2 GB. What exactly are you worried about?
Side note: 3.5.2 is not recommended for production; you should use 3.5.3 or above. Reference: https://groups.google.com/g/etcd-dev/c/sad9tgmKU7Y

Related

Clarification YouTube Data API v3 Quotas

I have a Google project created on the Google developer console. This project retrieves information about all content on my channel (livestreams and posted videos) and keeps statistics updated for each one. The processes that gather this information run three times per day. I would like to run them more frequently, but after using a total of about 300 quota units, I receive the dreaded 403 error that says,
The request cannot be completed because you have exceeded your quota
Here is today's usage summary as an example:
+------------+---------------------------------------------+------------------------------------+-------+-----------+--------+
| callDate   | calledBy                                    | serviceName                        | calls | totalCost | failed |
+------------+---------------------------------------------+------------------------------------+-------+-----------+--------+
| 2022-04-14 | /www/htdocs/GetChats/someoneSingsVideos.php | search->listSearch                 |    53 |       106 |      3 |
| 2022-04-14 | /www/htdocs/GetChats/someoneSingsVideos.php | videos->listVideos                 |    50 |        50 |   NULL |
| 2022-04-14 | /www/htdocs/GetChats/lives.php              | liveBroadcasts->listLiveBroadcasts |    10 |        10 |   NULL |
| 2022-04-14 | /www/htdocs/GetChats/videos.php             | search->listSearch                 |    48 |        96 |   NULL |
| 2022-04-14 | /www/htdocs/GetChats/videos.php             | videos->listVideos                 |    48 |        48 |   NULL |
| 2022-04-14 | /www/htdocs/GetChats/someoneSingsVideos.php | --program total--                  |   103 |       156 |      3 |
| 2022-04-14 | /www/htdocs/GetChats/lives.php              | --program total--                  |    10 |        10 |   NULL |
| 2022-04-14 | /www/htdocs/GetChats/videos.php             | --program total--                  |    96 |       144 |   NULL |
| 2022-04-14 | --day total--                               |                                    |   209 |       310 |      3 |
+------------+---------------------------------------------+------------------------------------+-------+-----------+--------+
So in essence, I can get about 50 searches and about 100 list calls done in a day... then I have to wait 24 hours to try again. But when I check my quotas and their usage at the IAM Admin Quotas page, I see a very different result:
This page, intended to clarify API quotas and their usage, just adds questions for me. For example, at face value it looks like there is a 10,000 quota-unit limit per day but I can do 1.8M per minute; clearly both cannot be true. But more to the point, it seems to indicate that I'm way below my allowed limit, yet I still get that error and am blocked until the next day once roughly 300 quota units have been used.
Can anyone help me understand what the real limits are, and maybe help me understand why I must keep my usage so low?

sqlite: wide v. long performance

I'm considering whether I should format a table in my SQLite database in "wide" or "long" format. Examples of both formats are included at the end of the question.
I anticipate that the majority of my requests will be of the form:
SELECT * FROM table
WHERE
    series IN ('series1', 'series100');
or the analog for selecting by columns in wide format.
I also anticipate that there will be a large number of columns, even enough to need to increase the column limit.
Are there any general guidelines for selecting a table layout that will optimize query performance for this sort of case?
(Examples of each)
"Wide" format:
| date | series1 | series2 | ... | seriesN |
| ---------- | ------- | ------- | ---- | ------- |
| "1/1/1900" | 15 | 24 | 43 | 23 |
| "1/2/1900" | 15 | null | null | 23 |
| ... | 15 | null | null | 23 |
| "1/2/2019" | 12 | 12 | 4 | null |
"Long" format:
| date | series | value |
| ---------- | ------- | ----- |
| "1/1/1900" | series1 | 15 |
| "1/2/1900" | series1 | 15 |
| ... | series1 | 43 |
| "1/2/2019" | series1 | 12 |
| "1/1/1900" | series2 | 15 |
| "1/2/1900" | series2 | 15 |
| ... | series2 | 43 |
| "1/2/2019" | series2 | 12 |
| ... | ... | ... |
| "1/1/1900" | seriesN | 15 |
| "1/2/1900" | seriesN | 15 |
| ... | seriesN | 43 |
| "1/2/2019" | seriesN | 12 |
The "long" format is the preferred way to go here, for so many reasons. First, if you use the "wide" format and there is ever a need to add more series, then you would have to add new columns to the database table. While this is not too much of a hassle, in general once you put a schema into production, you want to avoid further schema changes.
Second, the "long" format makes reporting and querying much easier. For example, suppose you wanted to get a count of rows/data points for each series. Then you would only need something like:
SELECT series, COUNT(*) AS cnt
FROM yourTable
GROUP BY series;
To get this report with the "wide" format, you would need a lot more code, and it would be as verbose as your sample data above.
The thing to keep in mind here is that SQL databases are built to operate on sets of records (read: across rows). They can also process things column-wise, but they are not generally set up to do this.
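For completeness, here is a rough end-to-end sketch of the "long" layout and the kind of query from the question. This is only an illustration, not part of the original answer: it assumes the xerial sqlite-jdbc driver is on the classpath, and the table name observations and its columns are made up. The point to note is the composite primary key on (series, date), which lets WHERE series IN (...) use an index instead of scanning the whole table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LongFormatDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:demo.db");
             Statement st = conn.createStatement()) {

            // "Long" layout: one row per (date, series) observation.
            // The composite primary key doubles as an index on (series, date).
            st.executeUpdate("CREATE TABLE IF NOT EXISTS observations ("
                    + " date   TEXT NOT NULL,"
                    + " series TEXT NOT NULL,"
                    + " value  REAL,"
                    + " PRIMARY KEY (series, date))");

            st.executeUpdate("INSERT OR REPLACE INTO observations VALUES ('1900-01-01', 'series1', 15)");
            st.executeUpdate("INSERT OR REPLACE INTO observations VALUES ('1900-01-01', 'series2', 24)");

            // The query from the question, restricted to two series;
            // with the key above this is an index lookup, not a full scan.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT date, series, value FROM observations WHERE series IN (?, ?)")) {
                ps.setString(1, "series1");
                ps.setString(2, "series100");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s %s %.1f%n",
                                rs.getString("date"), rs.getString("series"), rs.getDouble("value"));
                    }
                }
            }
        }
    }
}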

How to share Hazelcast cache over multi-war Tomcats

We have multiple Tomcats, each with multiple .war files (= Spring Boot app) deployed in it.
We now need some distributed caching between app1 on tomcat1 and app1 on tomcat2. It's essential that app2 on tomcat1 (and app2 on tomcat2) cannot see the Hazelcast cache of the other deployed apps.
The following image shows this situation:
Tomcat 1: app1.war, app2.war
Tomcat 2: app1.war, app2.war

app1 on Tomcat 1  <--- shared cache via Hazelcast --->  app1 on Tomcat 2
app2 on Tomcat 1  <--- shared cache via Hazelcast --->  app2 on Tomcat 2
Is this possible with Hazelcast? And if so, how?
Right now I can only find solutions talking about shared web sessions via Hazelcast, but that doesn't seem to be a solution for my case here, or am I wrong?
If your applications must be strictly isolated, then you probably need to use different cluster groups. Cluster groups make it possible for different clusters to coexist on the same network, while being completely unreachable to one another (assuming correct configuration).
If, however, you just need application data to be separate, then you can just make sure that app1 instances use caches with names that do not clash with app2 cache names. This is the simplest implementation.
If you are deploying a sort of multitenant environment where you have security boundaries between the two groups of applications, then going for the cluster group option is better as you can protect clusters with passwords, and applications will be using distinct ports to talk to one another in those groups.
Yes, this is possible.
You can configure the cache name.
Application app1 uses a cache named app1. Application app2 uses a cache named app2.
If you configure it correctly then they won't see each other's data.
If by "essential" you mean a stronger requirement than just preventing accidental misconfiguration, then you need to use role-based security.

How to design templates for clustered nifi

Do we need to think about the underlying cluster while designing NiFi templates?
Here is my simple flow
+-----------------+      +-------------+      +----------+
| READ FROM KAFKA | ---> | MERGE FILES | ---> | PUT HDFS |
+-----------------+      +-------------+      +----------+
I have a 3-node cluster. When the system is running I check the "cluster" menu and see that only the master node is utilizing resources; the other cluster nodes seem idle. The question is: in such a cluster, should I design the template with the cluster in mind, or should NiFi do the load balancing?
I saw that one of my colleagues created remote process groups for each node in the cluster and put a load balancer in front of them within the template. Is that required? (like below)
READ FROM KAFKA ---> LOAD BALANCER ---> REMOTE PROCESS GROUP (NODE 1)
                                   ---> REMOTE PROCESS GROUP (NODE 2)
                                   ---> REMOTE PROCESS GROUP (NODE 3)

(on each node) input port (RPG) ---> MERGE FILES ---> PUT HDFS
And what is the use case for the load balancer apart from remote clusters? Can I use a load balancer to split traffic across several processors to speed up the operation?
Apache NiFi does not do any automatic load balancing or moving of data, so it is up to you to design the data flow in a way that utilizes your cluster. How to do this will depend on the data flow and how the data is being brought into the cluster.
I wrote this article once to try and summarize the approaches:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
In your case with Kafka, you should be able to have the flow run as shown in your first picture (without remote process groups). This is because Kafka is a data source that will allow each node to consume different data.
If ConsumeKafka appears to be running on only one node, there could be a couple of reasons for this...
First, make sure ConsumeKafka is not scheduled for primary node only.
Second, figure out how many partitions you have for your Kafka topic. The Kafka client (used by NiFi) will assign 1 consumer to 1 partition, so if you have only 1 partition then you can only ever have 1 NiFi node consuming from it. Here is an article to further describe this behavior:
http://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
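As a side note on the second point, one way to check the partition count from outside NiFi is the plain Kafka AdminClient. This is only a sketch; the broker address and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                                         .values()
                                         .get("my-topic")
                                         .get();
            // With N partitions, at most N ConsumeKafka tasks across the
            // whole NiFi cluster will actually receive data at once.
            System.out.println("Partitions: " + desc.partitions().size());
        }
    }
}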

national language supported sort in Hive

I don't have much experience with NLS in Hive, and changing the locale in the client Linux shell doesn't affect the result.
Googling also didn't help me resolve it.
Created table in Hive:
create table wojewodztwa (kod STRING, nazwa STRING, miasto_woj STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
loaded data:
LOAD DATA LOCAL INPATH './wojewodztwa.txt' OVERWRITE INTO TABLE wojewodztwa;
contents of file wojewodztwa.txt:
02,dolnośląskie,Wrocław
04,kujawsko-pomorskie,Bydgoszcz i Toruń
06,lubelskie,Lublin
08,lubuskie,Gorzów Wielkopolski i Zielona Góra
10,łódzkie,Łódź
12,małopolskie,Kraków
14,mazowieckie,Warszawa
16,opolskie,Opole
18,podkarpackie,Rzeszów
20,podlaskie,Białystok
22,pomorskie,Gdańsk
24,śląskie,Katowice
26,świętokrzyskie,Kielce
28,warmińsko-mazurskie,Olsztyn
30,wielkopolskie,Poznań
32,zachodniopomorskie,Szczecin
beeline> !connect jdbc:hive2://172.16.45.211:10001 gpadmin changeme org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://172.16.45.211:10001
Connected to: Hive (version 0.11.0-gphd-2.1.1.0)
Driver: Hive (version 0.11.0-gphd-2.1.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://172.16.45.211:10001> select kod,nazwa from wojewodztwa order by nazwa;
+------+----------------------+
| kod  | nazwa                |
+------+----------------------+
| 02   | dolnośląskie         |
| 04   | kujawsko-pomorskie   |
| 06   | lubelskie            |
| 08   | lubuskie             |
| 14   | mazowieckie          |
| 12   | małopolskie          |
| 16   | opolskie             |
| 18   | podkarpackie         |
| 20   | podlaskie            |
| 22   | pomorskie            |
| 28   | warmińsko-mazurskie  |
| 30   | wielkopolskie        |
| 32   | zachodniopomorskie   |
| 10   | łódzkie              |
| 24   | śląskie              |
| 26   | świętokrzyskie       |
+------+----------------------+
16 rows selected (19,702 seconds)
This is not the correct result: all names starting with language-specific characters end up at the end.
Hive does not support collations. Strings will sort according to Java String.compareTo rules.
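To illustrate what that means for the Polish names above, and what a properly collated order would look like, here is a small Java comparison of String.compareTo against java.text.Collator with a Polish locale. Sorting with a collator on the client side, after fetching the rows, is one possible workaround rather than a Hive feature:

import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PolishSortDemo {
    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "dolnośląskie", "łódzkie", "małopolskie", "śląskie", "zachodniopomorskie");

        // What Hive's ORDER BY effectively does: String.compareTo,
        // i.e. UTF-16 code unit order. 'ł' (U+0142) and 'ś' (U+015B)
        // are greater than 'z' (U+007A), so those names sort last.
        names.sort(String::compareTo);
        System.out.println("compareTo order:  " + names);

        // Locale-aware ordering: 'ł' sorts right after 'l', 'ś' after 's', etc.
        Collator polish = Collator.getInstance(new Locale("pl", "PL"));
        names.sort(polish);
        System.out.println("Polish collation: " + names);
    }
}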
