DHT nodes in libtorrent

How can I increase the number of dht nodes?
Current number is about 240:
app['lt_session'].status().dht_nodes
uTorrent, meanwhile, reports about 500.
Here are the settings:
async def lt_session(app):
    ses = app['lt_session'] = lt.session({
        'active_downloads': 50,
    })
    ses.listen_on(6881, 6881)
    # ses.set_max_connections(3)
    ses.add_extension('ut_metadata')
    ses.add_extension('smart_ban')
    ses.add_extension('ut_pex')
    ses.add_extension('metadata_transfer')
    ses.add_dht_router("router.utorrent.com", 6881)
    ses.add_dht_router("router.bittorrent.com", 6881)
    ses.add_dht_router("dht.transmissionbt.com", 6881)
    ses.add_dht_router("dht.aelitis.com", 6881)
    ses.add_dht_router("router.bitcomet.com", 6881)
    ses.start_dht()
    ses.start_lsd()
    app['torrents'] = {}

If you have an "extended routing table" and a routing-table depth of 19 (which I believe is roughly where it ends up given the current size of the network), you end up having 128+64+32+16+(15*8) = 360 nodes in the regular routing table buckets (once it fills up). In addition to this, each level has a replacement bucket of 8 nodes, so add 19*8 = 152.
libtorrent's node count only counts the nodes in the main routing table buckets, the ones that are actually used. If you want to count the replacement nodes too, add dht_node_cache.
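For example, with the session from the question (a small sketch; dht_nodes and dht_node_cache are both fields of the session status object the question already reads):
st = app['lt_session'].status()
total = st.dht_nodes + st.dht_node_cache  # main buckets + replacement buckets
print(total)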
If you want to make sure your routing table fits as many nodes as possible, and is as unrestrictive as possible about which nodes are allowed to be added:
Make sure you set dht_extended_routing_table to true.
If you want to remove restrictions about which nodes are let into the routing table you can set:
dht_restrict_routing_ips to false
dht_restrict_search_ips to false
Also, if you have multiple external IPs, you may want to make sure you listen on all of them, as libtorrent will run a DHT node for each one, for example if you have both IPv4 and IPv6.
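As a rough sketch of how that could look with the session from the question (assuming a libtorrent version where these DHT options are accepted as plain settings keys at session construction; older releases expose the same options on lt.dht_settings() and ses.set_dht_settings() instead):
import libtorrent as lt

ses = lt.session({
    'active_downloads': 50,
    # larger low-numbered buckets (128/64/32/16 instead of 8)
    'dht_extended_routing_table': True,
    # don't reject nodes based on IP restrictions
    'dht_restrict_routing_ips': False,
    'dht_restrict_search_ips': False,
    # listen on both IPv4 and IPv6 so a DHT node runs for each
    'listen_interfaces': '0.0.0.0:6881,[::]:6881',
})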

Related

Nifi Group Content by Given Attributes

I am trying to run a script or a custom processor to group data by given attributes every hour. Queue size is up to 30-40k on a single run and it might go up to 200k depending on the case.
MergeContent does not fit since there is no limit on min-max counts.
RouteOnAttribute does not fit since there are too many combinations.
Solution 1: consume all flow files, group them by the given attributes, create a new flow file per group and push it. Not ideal, but I gave it a try.
While running this I had 33k flow files waiting in the queue.
session.getQueueSize().getObjectCount()
This call returns 10k every time, even though I increased the queue threshold numbers on the output flows.
Solution 2: a better approach is to consume one flow file and then filter the queued flow files matching the provided attributes:
final List<FlowFile> flowFiles = session.get(file -> {
    if (correlationId.equals(Arrays.stream(keys).map(file::getAttribute).collect(Collectors.joining(":"))))
        return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
    return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again, with 33k waiting in the queue I was expecting around 200 new grouped flow files, but 320 were created. It looks like the same issue as above: the filter query does not scan all waiting flow files.
Problems / questions:
Is there a parameter to change so that getObjectCount can report up to 300k?
Is there a way to make the filter see all waiting flow files, either by changing a parameter or by changing the processor?
I tried setting the default queue threshold to 300k in nifi.properties but it didn't help.
In nifi.properties there is a parameter that affects this batching behavior:
nifi.queue.swap.threshold=20000
Here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with script below
3. LogAttribute (disabled) - just to have a queue after the Groovy script
groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
if(!ffList)return
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
With the parameters above, 4 files are generated in the output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
According to the documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains the result: of the 50K incoming files, 20K are kept in memory and the rest are swapped out in batches of 10K.
I don't know how increasing the nifi.queue.swap.threshold property will affect your system's performance and memory consumption, but I set it to 100K on my local NiFi 1.16.3 and it looks fine with many small files, and the first batch grew to 100K as a result.
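For example, in nifi.properties (NiFi needs a restart to pick up the change):
nifi.queue.swap.threshold=100000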

Neo4j cypher query for updating nodes taking a long time

I have a graph with the following:
(:Customer)-[:MATCHES]->(:Customer)
This is not limited to two customers but the depth does not exceed 5. The following also exists:
(:Customer)-[:MATCHES]->(:AuditCustomer)
There is a label :Master that needs to be added to one customer node in a matching set depending on some criteria.
My query therefore needs to find all sets of matching customers where none have the :Master label, and add that label to one node in the set. There are a large number of customer nodes, and doing it all in one go causes the database to become very slow.
I have attempted to do it using apoc.periodic.commit:
CALL apoc.periodic.commit("MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master) WITH c limit {limit}
CALL apoc.path.expand(c, 'MATCHES', '+Customer', 0, -1)
YIELD path UNWIND NODES(path) AS nodes WITH c,nodes
order by nodes:Searchable desc, nodes.createdTimestamp asc
with c, head(collect(distinct nodes)) as collectedNodes
set collectedNodes:Master
return count(collectedNodes)", {limit:100})
However, this still causes the database to become very slow, even if the limit parameter is set to 1. I read that apoc.periodic.commit is a blocking process, so this may be causing the problem. Is there a way to do this that is less resource-intensive, so that the DB can continue to process other transactions while it is running?
The slowest part of the query is the initial match:
MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master) WITH c limit {limit}
This takes around 3.5s and the entire query takes about 4s. There is very little difference between limit 1 and limit 20. If there is a way to rewrite this so it is faster, that might be a better approach.
Also, if it is of any use, the following returns ~70K
MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master)
RETURN COUNT(c)

Multiple node cassandra cluster is really slow

I had a single node cassandra cluster on EC2. I was running my tests on it and it worked great.
But then I had to move this cluster to a VPC, so rather than moving the existing instance, I created a new cluster with two nodes (both seeds) and imported the data from the former cluster using sstableloader.
I thought it was really slow, so I decided to add two more instances (not seeds). It's even slower.
I use consistency level ONE and my replication factor is 1, so I don't quite see why it is so slow.
To give you an idea, I can only do 3 reads per second.
We use the EC2Snitch, but not the AMI recommended by Cassandra (we didn't see that part in the documentation when we installed it).
I haven't run a cleanup yet on the first two nodes after adding the two new ones.
When I request all elements of a column family which contains only a dozen rows, it times out. If I request one element, I get the result after a long time, and with a huge tracing session (~30000 lines...)!
Does anyone know what I can do to make it faster? I don't quite know where to look right now.
My Cassandra version is Cassandra 2.1.3.
Here is my keyspace schema:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': '1'} AND durable_writes = true;
And the options for our column family
CREATE TABLE keyspace_name."CFName" (
// ...
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
I had to run a compaction on my nodes because I had too many tombstones.
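For reference, a manual major compaction of a single table can be kicked off with nodetool, for example for the column family above:
nodetool compact keyspace_name CFName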
Many thanks to the amazing IRC channel on freenode #cassandra.

How to check vnode disabled on a hadoop node

Linking back to this question:
Why not enable virtual node in an Hadoop node?
I'm running a mixed 3-node cluster with 2 Cassandra nodes and 1 analytics node, and I disabled virtual nodes by generating 3 tokens with the utility provided by DataStax Enterprise.
But when I run the 'nodetool status' command, I still see 256 tokens on each node, and when a MapReduce job is created, it creates 257 mappers and takes a very long time to execute a query over a small amount of data.
So my specific questions are:
Is the virtual node setting still not disabled? How can I verify whether it is disabled?
If it is disabled, then why are 257 mappers still created for each job? Is there a different configuration for that?
Thanks much for any help!!
1) It's not disabled. You can tell because it still says 256 tokens in nodetool status.
To disable vnodes you need to make sure that you change the num_tokens setting in cassandra.yaml:
# If you already have a cluster with 1 token per node, and wish to migrate to
# multiple tokens per node, see http://wiki.apache.org/cassandra/Operations
# num_tokens: 256 << Make sure this line is commented out
# initial_token allows you to specify tokens manually. While you can use it with
# vnodes (num_tokens > 1, above) -- in which case you should provide a
# comma-separated list -- it's primarily used when adding nodes to legacy clusters
# that do not have vnodes enabled.
initial_token: << Your generated token goes here
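As an illustration only (these are the evenly spaced tokens for a 3-node ring under the default Murmur3Partitioner, not necessarily the ones the DataStax utility generated for you), each node gets exactly one token:
initial_token: -9223372036854775808    # node 1
initial_token: -3074457345618258603    # node 2
initial_token: 3074457345618258602     # node 3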

Get Hbase region size via API

I am trying to write a balancer tool for HBase which could balance regions across region servers for a table, by region count and/or region size (sum of store file sizes). I could not find any HBase API class which returns the region sizes or related info. I have already checked a few of the classes which can be used to get other table/region info, e.g. org.apache.hadoop.hbase.client.HTable and HBaseAdmin.
Another way this could be implemented is by using one of the Hadoop classes that return the size of directories in the file system; for example, org.apache.hadoop.fs.FileSystem can list the files under a particular HDFS path.
Any suggestions?
I use this to do managed splits of regions, but you could leverage it to load-balance on your own. I also load-balance myself to spread the regions (of a given table) evenly across our nodes so that MR jobs are evenly distributed.
Perhaps the code-snippet below is useful?
final HBaseAdmin admin = new HBaseAdmin(conf);
final ClusterStatus clusterStatus = admin.getClusterStatus();
for (ServerName serverName : clusterStatus.getServers()) {
    final HServerLoad serverLoad = clusterStatus.getLoad(serverName);
    for (Map.Entry<byte[], HServerLoad.RegionLoad> entry : serverLoad.getRegionsLoad().entrySet()) {
        final String region = Bytes.toString(entry.getKey());
        final HServerLoad.RegionLoad regionLoad = entry.getValue();
        long storeFileSize = regionLoad.getStorefileSizeMB();
        // other useful things are available on regionLoad if you need them
    }
}
What's wrong with the default Load Balancer?
From the Wiki:
The balancer is a periodic operation which is run on the master to redistribute regions on the cluster. It is configured via hbase.balancer.period and defaults to 300000 (5 minutes).
If you really want to do it yourself you could indeed use the Hadoop API and, more specifically, the FileStatus class. This class acts as an interface to represent the client-side information for a file.
