Fixing contamination factor for isolation forest - anomaly-detection

I have different data sets (say, with the same ID columns) for which I want to identify anomalies, but each data set has a different percentage of anomalies present. Since I am using isolation forest, I would need to set a different contamination factor for each to increase accuracy. Does anyone have suggestions on how this contamination factor can be set?
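For what it's worth, here is a minimal sketch of what a per-data-set contamination setting can look like with scikit-learn's IsolationForest; the data, the IDs, and the anomaly percentages below are placeholders for illustration, not values from the question.

```python
# Minimal sketch: one IsolationForest per data set, each with its own
# contamination value. The IDs and percentages below are placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
datasets = {
    "id_a": rng.normal(size=(500, 3)),   # stand-in for one data set
    "id_b": rng.normal(size=(800, 3)),   # stand-in for another
}
contamination_by_id = {"id_a": 0.02, "id_b": 0.10}  # assumed anomaly rates

labels = {}
for ds_id, X in datasets.items():
    model = IsolationForest(
        contamination=contamination_by_id[ds_id],
        random_state=0,
    )
    labels[ds_id] = model.fit_predict(X)  # -1 = anomaly, 1 = normal
```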

Related

How to make Apache Ignite scale linearly with an increase in the number of nodes?

I'm running some tests and found that 1 node is faster and produces more results than 2 and 4 nodes. I'm not able to understand why this is happening.
I'm using partition_aware=True and lazy=True while writing and querying data to Ignite.
Here are some of the results I got. They are for a cross join of two 100k-row tables.
Results I got after running some queries
Different result sets for different Ignite topologies are an implicit indicator that your affinity collocation configuration is incorrect. You need to distribute your entries across the cluster in a particular way that allows tables to be joined locally. Make sure that leads and products have the same affinity key column and use it for your join. This concept is called a collocated join; it helps to avoid additional network hops.
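As a rough sketch (not from the original answer) of what collocating the two tables on a shared affinity key could look like through the Python thin client; the schemas and column names are made up, and the WITH parameters should be checked against your Ignite version:

```python
# Illustrative only: both tables are partitioned on the same affinity key
# (company_id, a made-up column), so the join can be resolved node-locally.
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

client.sql("""
    CREATE TABLE IF NOT EXISTS leads (
        id INT, company_id INT, name VARCHAR,
        PRIMARY KEY (id, company_id)
    ) WITH "template=partitioned, affinity_key=company_id"
""")
client.sql("""
    CREATE TABLE IF NOT EXISTS products (
        id INT, company_id INT, title VARCHAR,
        PRIMARY KEY (id, company_id)
    ) WITH "template=partitioned, affinity_key=company_id"
""")

# Joining on the affinity key keeps the join collocated: no extra network hops.
rows = client.sql(
    "SELECT l.name, p.title FROM leads l JOIN products p "
    "ON l.company_id = p.company_id",
    lazy=True,
)
```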
For this particular case it seems you are trying to calculate Levenshtein distance, and the only way to do that is a cross join, which is basically a Cartesian product of the tables. It means that for each row from the left table you'll need to traverse all the records from the right table (there are some possible optimisations, though). The only way to achieve that here is to leverage non-collocated joins, but keep in mind that this implies additional network activity. Here's a rough estimation of how much we actually need.
Assume we want to compute the cross join of tables A and B. Let's also assume that table A contains n rows and table B contains m rows. In that case, for a cluster with k nodes (we are not taking backups into account, as they don't take part in SQL), we can come up with a complexity estimate in terms of network data transfer.
There are n/k rows of table A on every node on average. For every node-local row in A there are approximately m*(k-1)/k rows of B (residing on the other nodes) to fetch through the network. Having k nodes in total, the required network activity is proportional to k * (n/k) * m*(k-1)/k = n*m*(k-1)/k. With a growing number of nodes it will creep up to n*m (the entire dataset squared), which is not really good. Having a smaller number of nodes actually decreases the network load in this scenario.
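A quick back-of-the-envelope check of that estimate (not part of the original answer), plugging in the two 100k-row tables from the question:

```python
# Estimated number of B-rows fetched over the network for a cross join of
# A (n rows) and B (m rows) on a k-node cluster, per the reasoning above.
def network_rows(n, m, k):
    rows_a_per_node = n / k               # A-rows held locally on each node
    remote_b_per_row = m * (k - 1) / k    # B-rows that live on other nodes
    return k * rows_a_per_node * remote_b_per_row   # = n * m * (k - 1) / k

n = m = 100_000   # two 100k-row tables, as in the question
for k in (1, 2, 4, 8, 16):
    print(k, f"{network_rows(n, m, k):.2e}")
# k=1 transfers nothing (everything is local); as k grows, the transfer
# approaches n * m = 1e10 rows, which is why more nodes run slower here.
```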
In a nutshell:
try enabling distributed joins; that will fix the result set size (see the sketch below)
it's difficult to say what's going on without profiling and query execution plans
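A minimal sketch of enabling distributed joins from the Python thin client (pyignite), mentioned above; the exact parameter names should be verified against your client version:

```python
# Illustrative only: ask Ignite to join rows that live on different nodes.
from pyignite import Client

client = Client(partition_aware=True)
client.connect('127.0.0.1', 10800)

rows = client.sql(
    "SELECT l.id, p.id FROM leads l CROSS JOIN products p",
    distributed_joins=True,  # allow non-collocated (distributed) joins
    lazy=True,               # stream results instead of materialising them
)
for row in rows:
    pass  # process each joined row
```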

How to determine the optimal capacity for Quadtree subdivision?

I've created a flocking simulation using the boids algorithm and have integrated a quadtree for optimization. Boids are inserted into the quadtree if the quadtree has not yet reached its boid capacity. If the quadtree has reached its capacity, it subdivides into smaller quadtrees and the remaining boids try to insert themselves into those, recursively.
The performance seems to get better if I increase the capacity from its default of 4 to one capable of holding more boids, like 20, and I was wondering whether there is any rule or methodology for picking the optimal capacity formulaically.
You can view the site live here or the source code here if relevant.
I'd assume it very much depends on your implementation, hardware, and the data characteristics.
Implementation:
An extreme case would be using GPU processing to compare entries. If you support that, having very large nodes, potentially just a single node containing all entries, may be faster than any other solution.
Hardware:
Cache size and Bus speed will play a big role, also depending on how much memory every node and every entry consumes. Accessing a sub-node that is not cached is obviously expensive, so you may want to increase the size of nodes in order to reduce sub-node traversal.
-> Coming back to implementation, storing the whole quadtree in a contiguous segment of memory can be very beneficial.
Data characteristics:
Clustered data: Having strongly clustered data can have an adverse effect on performance because it may cause the tree to become very deep. In this case, increasing node size may help.
Large amounts of data may push you over the threshold where everything no longer fits into the cache. In that case, making nodes larger will save memory, because you will have fewer nodes and everything may fit into the cache again.
In my experience I found that 10-50 entries per node gives the best performance across different datasets.
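In the same spirit, a self-contained sketch (not the asker's code) of picking the capacity empirically: build a minimal point quadtree with a configurable capacity and time neighbourhood queries for several capacities on data shaped like yours. The bounds, point counts, and query sizes below are placeholders.

```python
# Minimal point quadtree with a configurable capacity, plus a crude timing
# loop over several capacities. Illustrative parameters only.
import random
import time

class QuadTree:
    def __init__(self, x, y, w, h, capacity):
        self.x, self.y, self.w, self.h = x, y, w, h   # node bounds
        self.capacity = capacity
        self.points = []
        self.children = None

    def insert(self, px, py):
        # reject points outside this node's (half-open) bounds
        if not (self.x <= px < self.x + self.w and self.y <= py < self.y + self.h):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._subdivide()
        return any(c.insert(px, py) for c in self.children)

    def _subdivide(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [
            QuadTree(self.x,      self.y,      hw, hh, self.capacity),
            QuadTree(self.x + hw, self.y,      hw, hh, self.capacity),
            QuadTree(self.x,      self.y + hh, hw, hh, self.capacity),
            QuadTree(self.x + hw, self.y + hh, hw, hh, self.capacity),
        ]
        for p in self.points:                 # push stored points down
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx, qy, qw, qh, found):
        # skip nodes that do not intersect the query rectangle
        if (qx > self.x + self.w or qx + qw < self.x or
                qy > self.y + self.h or qy + qh < self.y):
            return found
        for px, py in self.points:
            if qx <= px <= qx + qw and qy <= py <= qy + qh:
                found.append((px, py))
        if self.children:
            for c in self.children:
                c.query(qx, qy, qw, qh, found)
        return found

random.seed(1)
points = [(random.random() * 1000, random.random() * 1000) for _ in range(20_000)]
for capacity in (4, 10, 20, 50, 100):
    tree = QuadTree(0, 0, 1000, 1000, capacity)
    for p in points:
        tree.insert(*p)
    start = time.perf_counter()
    for px, py in points[:2000]:
        tree.query(px - 25, py - 25, 50, 50, [])   # 50x50 neighbourhood lookup
    print(capacity, round(time.perf_counter() - start, 4))
```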
If you update your tree a lot, you may want to define separate thresholds to avoid 'flickering' and frequent merging/splitting of nodes, i.e. split nodes with more than 25 entries, but merge them only when they drop below 15 entries.
If you are interested in a quadtree-like structure that avoids degenerated 'deep' quadtrees, have a look at my PH-Tree. It is structured like a quadtree but operates on bit-level, so maximum depth is strictly limited to 64 or 32, depending on how many bits your data has. In practice the depth will rarely exceed 10 levels or so, even for very dense data. Note: A plain PH-Tree is a key-value 'map' in the sense that every coordinate (=key) can only have one entry (=value). That means you need to store lists or sets of entries in case you expect more than one entry for any given coordinate.

Shared memory and population dynamics on a landscape

I would like to parallelize population dynamics for individuals moving on a 2D landscape. The landscape will be divided into cells with each processing core operating on individuals that exist in a specific cell.
The problem is that because the individuals move they travel between cells. Meanwhile the positions of individuals in a given cell (and its neighboring cells) must be known at any point in time in order to determine when pairs of individuals can mate.
In MPI (e.g. Open MPI), it would be necessary to pass the structures of the individuals (in this case, a list of mutations and their locations in a genome) as messages whenever they move to a different cell, which would be very slow.
However, it seems that in OpenMP there is a way for the processing cores to share the memory for the entire list of genomes / individuals (i.e., for all cells). In this case, there would be no need for message passing, and the code could be very efficient.
Is my understanding of OpenMP correct? The nodes on my cluster each contain 32 processing cores. Does this mean I am limited to sharing memory among these 32 cores?
Thank you

Redis GEORADIUS with one ZSET versus a lot of ZSETs of particular size

What will work faster: one big ZSET with geodata where I'll query for a 100m radius with GEORADIUS,
OR
a lot of ZSETs where each ZSET is responsible for a 100m x 100m square covering the whole world, named after these 100m squares like:
left_corner1_49_2440000_28_5010000
left_corner2_49_2450000_28_5010000
.......
and have everything up to 100 meters to the right and below that corner stored inside the set.
So when searching for the nearest point I'll just drop the redundant digits from the GPS coordinates: 49.2440408, 28.5011694 will become
49.2440000, 28.5010000, so I'll know the name of the ZSET from which to get all the exact values with 100-meter precision.
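A small illustration of that truncation scheme; the key format and the 0.001-degree grid step (roughly 100 m of latitude) are assumptions for the sketch, not the exact keys from the question.

```python
# Derive the ZSET key for the ~100 m grid cell containing a coordinate
# by flooring both values to the grid step (illustrative key format).
import math

def cell_key(lat, lon, step=0.001):
    cell_lat = math.floor(lat / step) * step
    cell_lon = math.floor(lon / step) * step
    return f"cell:{cell_lat:.7f}_{cell_lon:.7f}"

print(cell_key(49.2440408, 28.5011694))  # -> cell:49.2440000_28.5010000
```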
Or, to ask it in a more general form: how are the ZSET names stored and accessed in Redis? If I have too many ZSETs, will it impact performance when accessing them?
A precise comparison of these approaches could only be done via a benchmark, and it would be specific to your dataset and configuration. But architecturally speaking, the pros and cons are:
BIG ZSET: less bandwidth and fewer operations (CPU cycles) to execute, no problems at the borders (with many ZSETs you can get duplicates there), and you can get throughput with sharding;
MANY ZSETS: less latency for other operations (while a big ZSET query is running, other commands are waiting), and you can get throughput with sharding AND latency with clustering.
As for the bottom-line question: I have not seen your implementation code, but ZSET names are stored as keys, the same as any other keys you use. This is what the Redis FAQ says about the number of keys:
What is the maximum number of keys a single Redis instance can hold? <...>
Redis can handle up to 2^32 keys, and was tested in practice to handle
at least 250 million keys per instance.
UPDATE:
Look at what Redis docs say about GEORADIUS:
Time complexity: O(N+log(M)) where N is the number of elements inside
the bounding box of the circular area delimited by center and radius
and M is the number of items inside the index.
It means that items outside of your query contribute O(log(M)) to your query cost. That is about 17 hops for 10m items or 21 hops for 1b items, which is quite affordable. The remaining question is whether you will partition the data between nodes.
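For reference, those hop counts come out of the natural logarithm of the sorted-set size:

```python
# Rough hop counts for skiplist-backed sorted sets: ~log(M) per lookup.
import math
print(math.log(10_000_000))     # ~16.1 -> about 17 hops for 10m items
print(math.log(1_000_000_000))  # ~20.7 -> about 21 hops for 1b items
```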

Cost of a query in/dependent of amount of data

Could you please tell me whether the cost of a query depends on the amount of data in the database at that time?
That is, does the cost vary as the amount of data varies?
Thanks,
Savitha
The answer is yes: the data size will influence the query execution plan, which is why you must test your queries with realistic amounts of data (and, if possible, realistic data, as the distribution of the data is also important and will influence the query cost).
Every database management system is different in some respects, and what works well for Oracle, MS SQL, or PostgreSQL may not work well for MySQL, and the other way around. Even storage engines have very important differences which can affect performance dramatically.
Of course, a large amount of data will slow down the process. When you fire a query, it needs to traverse and search the database, and with more data that takes more time. The three main issues you should be concerned with when dealing with very large data sets are buffers, indexes, and joins.
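As a hedged, self-contained illustration of "test with realistic data volumes", here is a sketch using SQLite from the Python standard library; the schema is invented, and planner behaviour differs between database systems:

```python
# Load a realistic volume of data, refresh statistics, and inspect the plan.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)"
)
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
con.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(100_000)],
)
con.execute("ANALYZE")  # update planner statistics for the loaded data

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(total) FROM orders WHERE customer_id = ?",
    (42,),
)
for row in plan:
    print(row)  # shows whether the index or a full table scan is chosen
```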
