I would like to know what kind of internal indexing algorithm MongoDB uses. I have some data to store, and each document (row) has an id, which is probably a unique hash value (e.g. generated by md5() or another hash algorithm). So I would like to understand which hash method I should use to create the id so that MongoDB can index it quickly. :)
Yes, MongoDB uses a B-tree. From the documentation:
An index is a data structure that collects information about the values of the specified fields in the documents of a collection. This data structure is used by Mongo's query optimizer to quickly sort through and order the documents in a collection. Formally speaking, these indexes are implemented as "B-Tree" indexes.
I suggest using MongoDB's ObjectId for the collection _id and not worrying about how to create the _id at all. That is a task for MongoDB, not for the developer. It is better to spend the effort on the schema, indexes, etc.
For MongoDB 3.2+, the default storage engine is WiredTiger, which uses a B+ tree to store data.
WiredTiger maintains a table's data in memory using a data structure called a B-Tree ( B+ Tree to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
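To make the page layout in that quote concrete, here is a minimal sketch of a lookup in a two-level B+ tree. It is not WiredTiger's code; the names `internal_keys`, `leaves`, and `lookup` and the sample data are made up for illustration. The internal page holds only separator keys, and the leaf pages hold keys together with values.

```python
from bisect import bisect_right

# Hypothetical two-level B+ tree: one internal page and three leaf pages.
# The internal page carries only separator keys; leaves carry keys and values.
internal_keys = [20, 40]  # leaf 0: keys < 20, leaf 1: 20..39, leaf 2: >= 40
leaves = [
    {"keys": [5, 12], "values": ["a", "b"]},
    {"keys": [20, 33], "values": ["c", "d"]},
    {"keys": [40, 57], "values": ["e", "f"]},
]


def lookup(key):
    """Descend from the internal page to one leaf, then search that leaf."""
    leaf = leaves[bisect_right(internal_keys, key)]
    i = bisect_right(leaf["keys"], key) - 1
    if i >= 0 and leaf["keys"][i] == key:
        return leaf["values"][i]
    return None  # key is not stored in the tree
```

Each lookup touches one internal page and one leaf, which is why read performance stays predictable as the tree grows.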
WiredTiger can also store data in an LSM tree:
WiredTiger supports Log-Structured Merge Trees, where updates are buffered in small files that fit in cache for fast random updates, then automatically merged into larger files in the background so that read latency approaches that of traditional Btree files. LSM trees automatically create Bloom filters to avoid unnecessary reads from files that cannot contain matching keys.
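A toy version of that idea, purely as a sketch: writes go to a small in-memory table, full tables are flushed as immutable sorted runs, each run gets a Bloom filter, and reads skip runs whose filter rules the key out. The classes `BloomFilter` and `ToyLSM` and all their parameters are hypothetical and have nothing to do with WiredTiger's actual implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # recent writes, buffered in memory
        self.memtable_limit = memtable_limit
        self.runs = []            # immutable sorted runs, newest first
        self.runs_skipped = 0     # reads avoided thanks to Bloom filters

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._add_run(self.memtable)
            self.memtable = {}

    def _add_run(self, data):
        bloom = BloomFilter()
        for key in data:
            bloom.add(key)
        self.runs.insert(0, (bloom, dict(sorted(data.items()))))

    def merge_runs(self):
        """Background compaction: fold all runs into one larger run."""
        merged = {}
        for _, data in reversed(self.runs):  # oldest first, so newer values win
            merged.update(data)
        self.runs = []
        self._add_run(merged)

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in self.runs:        # newest run first
            if not bloom.might_contain(key):
                self.runs_skipped += 1       # this run cannot contain the key
                continue
            if key in data:
                return data[key]
        return None
```

Writes never modify an existing run, which is where the write throughput comes from; the price is that a read may have to consult several runs until compaction merges them.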
LSM tree
Pros
* More space efficient by using append-only writes and periodic compaction
* Better performance with fast-growing data and sequential writes
* No fragmentation penalty because of how SSTable files are written and updated
Cons
* CPU overhead for compaction can meaningfully affect performance and efficiency if not tuned appropriately
* More tuning options increase flexibility but can seem complex to developers and operators
* Read/search performance can be optimized with the use of bloom filters
B+ tree
Pros
* Excellent performance with read-intensive workloads
Cons
* Increased space overhead to deal with fragmentation
* Uses random writes which causes slower create/insert behavior
* Concurrent writes may require locks which slows write performance
* Scaling challenges, especially with >50% write transactions
Which one to choose
If you don't require extreme write throughput, a B-tree is likely to be the better choice: read throughput is better, and high volumes of writes can still be sustained.
If your workload requires high write throughput, LSM is the better choice.
Source:
LSM vs B tree
WiredTiger Doc
Related
I have a requirement to store billions of records (with capacity up to one trillion records) in a database (total size is in terms of petabytes). The records are textual fields with about 5 columns representing transactional information.
I want to be able to query data in the database incredibly quickly, so I was researching Milvus, Apache HBase, and RocksDB. Based on my research, all three are incredibly fast and work well with large amounts of data. All else equal, which of these three is the fastest?
What type of data are you storing in the database?
Milvus is used for vector storage and computation.
If you want to search by the semantics of the text, Milvus is the fastest option.
HBase and RocksDB are both key-value databases.
If you want to search by the key columns, these two would be faster.
I have a use case where an Apache Flink process must integrate near-real-time data streams (events) from multiple sources, but because the different systems lack uniform keys, I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?
There are a few options
Local map
If the surrogate key is never changing, you could just load it in RichMapFunction#open and perform the lookup. That of course means that you will have to adjust the memory settings such that Flink doesn't try to take all memory for its own operations.
Some quick math: assume both keys are strings of length 10. Together they need about 40 bytes of chars in memory (2 bytes per char). With some object overhead, we get to ~50 bytes per entry. With 50M entries, we need 2.5 GB of RAM to store that. Because the hash map will have some overhead, I'd plan with 3 GB of RAM.
So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.
Of course, you need to ensure that different tasks of the same task manager are not loading the same map twice. I'd also choose a format that loads as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
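The back-of-the-envelope estimate above can be redone for other key sizes with a few lines. This is only a sketch: the function name `estimate_lookup_memory_gb` is made up, and the 1.2 hash-map overhead factor is an assumption chosen to match the "plan with 3 GB" figure, not a measured value.

```python
def estimate_lookup_memory_gb(entries, bytes_per_entry, map_overhead=1.2):
    """Return (raw GB, GB including an assumed hash-map overhead factor)."""
    raw_gb = entries * bytes_per_entry / 1e9
    return raw_gb, raw_gb * map_overhead


# 50M entries at ~50 bytes each, as in the estimate above.
raw_gb, planned_gb = estimate_lookup_memory_gb(50_000_000, 50)
print(f"raw: {raw_gb:.1f} GB, planned: {planned_gb:.1f} GB")
# prints "raw: 2.5 GB, planned: 3.0 GB"
```

Real per-entry overhead depends on the JVM and the map implementation, so it is worth measuring with a sample of the actual keys before fixing the memory settings.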
State-based
If memory is an issue or the data is changing, you can also model the lookup data as a map-state. I'd add a second input for that lookup data and use a KeyedCoProcessFunction. Then feed whatever comes from the second input into the map-state. The state should use the RocksDB backend, so that the data effectively resides on disk.
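The control flow of that two-input operator can be simulated in plain Python. This is not the Flink API: `LookupEnricher`, `on_lookup_record`, and `on_event` are hypothetical names, the dict stands in for the map-state, and buffering events that arrive before their mapping is an extra assumption the description above does not cover.

```python
class LookupEnricher:
    """Plain-Python stand-in for a two-input keyed operator with map-state."""

    def __init__(self):
        self.sk_by_source_key = {}  # plays the role of the map-state
        self.pending = {}           # events waiting for their mapping (assumption)

    def on_lookup_record(self, source_key, surrogate_key):
        """Second input: store the mapping and release any buffered events."""
        self.sk_by_source_key[source_key] = surrogate_key
        waiting = self.pending.pop(source_key, [])
        return [(surrogate_key, payload) for payload in waiting]

    def on_event(self, source_key, payload):
        """First input: emit the event tagged with its surrogate key."""
        surrogate_key = self.sk_by_source_key.get(source_key)
        if surrogate_key is None:
            self.pending.setdefault(source_key, []).append(payload)
            return []
        return [(surrogate_key, payload)]
```

In a real job both pieces of state would live in keyed state, so they are checkpointed and spill to disk through the state backend.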
Joining data
A lookup can also be modeled as a join. If you are already using Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix DataStream with Tables.
I want to use a Lucene MMapDirectory as a primary file store. Each file would be stored in a separate document as a byte array in a StoredField. All file properties that should be searchable, like file name, size etc., would be stored in indexable fields in the same document.
My questions would be:
What are the drawbacks of using Lucene directories for storing files, especially with regards to indexing and search performance and memory (RAM) consumption?
If this is not a "no-go", is there a better/faster way of storing files in the directory than as a byte array?
Short Answer
I really love Lucene and consider it to be the best open-source library, but I'm afraid it's not a good decision to use it as a primary file store, due to:
high CPU/memory overhead
slow index/query performance
high HDD utilization and doubled index size
weak recovery capabilities
Long Answer
Under the hood lucene uses the following files to keep all stored fields in one segment:
the fields index file (.fdx),
the fields data file (.fdt).
You can read more about how it works in Lucene50StoredFieldsFormat’s docs.
This means that in case of any I/O issue, it is almost impossible to restore any file.
In order to return one file, Lucene has to read and decompress binary data from the disk block by block. This means high CPU overhead for decompression and a high memory footprint to keep the whole file in the Java heap. There is also no streaming available, unlike with file and network storage.
Maximum document size is limited by the codec implementation: 2 GB per document.
Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents. Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This architecture becomes a big problem for file storage:
you cannot delete a file, only mark it as unavailable
a merge operation requires 2x disk space and consumes a lot of resources and disk throughput: it creates a new .fdt file and copies the content of the other .fdt files through Java code and Java heap memory
So you won't be using MMapDirectory but an actual lucene index.
I have had good experiences using Lucene as the primary data store for some projects.
Just be sure to also include a generated/natural unique ID, because the document IDs are not constant or reliable.
Also make sure you use a Directory implementation that fits your use case. I have switched to the normal RandomAccess implementation in the low-load case, since it uses less memory and is almost as fast.
What is the difference between joins and the distributed cache in Hadoop? I am really confused about map-side joins and reduce-side joins and how they work. How is the distributed cache different when processing data in a MapReduce job? Please share an example.
Regards,
Ravi
Let's say you have 2 files of data with the following records:
word -> frequency
Same words can be present in both files.
Your task is to merge these files, compute total frequency for each term, and produce the aggregated file.
Map side joins.
Useful when the data on both sides of the join is already presorted by key. In this case, it is a simple merge of two streams with linear complexity. In our example, the word-frequency data has to be pre-sorted alphabetically by word in both files.
Pros: works with virtually unlimited input data (does not have to fit in memory).
Does not require a reducer, thus it is very efficient.
Cons: requires your input data to be pre-sorted (for example, as a result of a previous map/reduce job)
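The merge step of a map-side join can be sketched in plain Python (this is not Hadoop API code). It assumes each word appears at most once per file and that both inputs are sorted by word; `merge_sorted_counts` is a made-up name.

```python
def merge_sorted_counts(left, right):
    """Merge two (word, frequency) streams that are both sorted by word."""
    left, right = iter(left), iter(right)
    a, b = next(left, None), next(right, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            yield a                       # word only on the left (so far)
            a = next(left, None)
        elif a is None or b[0] < a[0]:
            yield b                       # word only on the right (so far)
            b = next(right, None)
        else:
            yield (a[0], a[1] + b[1])     # same word on both sides: add up
            a, b = next(left, None), next(right, None)


file_a = [("apple", 2), ("cat", 1)]
file_b = [("bee", 4), ("cat", 3), ("dog", 5)]
print(list(merge_sorted_counts(file_a, file_b)))
# prints [('apple', 2), ('bee', 4), ('cat', 4), ('dog', 5)]
```

Only one record per input is held in memory at a time, which is why input size does not matter here.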
Reduce joins.
Useful when the files are not sorted yet and are too large to fit in memory. In that case you have to merge them using a distributed sort with reducer(s).
Pros: works with virtually unlimited input data (does not have to fit in memory).
Cons: requires reduce phase
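A single-process sketch of the reduce-side variant, again in plain Python rather than Hadoop code: the mappers emit (word, frequency) pairs from both files, the framework sorts and groups them by key, and the reducer sums each group. Here `sorted` stands in for the distributed shuffle and sort; `map_phase` and `shuffle_and_reduce` are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(*files):
    """Emit (word, frequency) pairs from every input file, in any order."""
    for records in files:
        for word, freq in records:
            yield word, freq


def shuffle_and_reduce(pairs):
    """Sort by word (stand-in for the shuffle), then sum each word's group."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(freq for _, freq in group)


unsorted_a = [("cat", 1), ("apple", 2)]
unsorted_b = [("dog", 5), ("cat", 3)]
print(list(shuffle_and_reduce(map_phase(unsorted_a, unsorted_b))))
# prints [('apple', 2), ('cat', 4), ('dog', 5)]
```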
Distributed cache.
Useful when the input word-frequency files are NOT sorted and one of the two files is small enough to fit in memory. In this case you can distribute the small file through the distributed cache and load it into memory as a hash table, Map<String, Integer>. Each mapper will then stream the larger input file as key-value pairs and look up the values of the smaller file in the hash map.
Pros: Efficient, linear complexity based on largest input set size. Does not require reducer.
Cons: Requires one of the inputs to fit in memory.
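The same lookup, sketched in plain Python instead of Hadoop code: the small file becomes an in-memory dict, and the "mapper" streams the large file and adds the cached frequency. The names `load_small_side` and `map_with_cache` are made up. Note the simplification: words that occur only in the small file are not emitted here and would need a separate pass.

```python
def load_small_side(records):
    """Load the small file into a hash table (the cached side of the join)."""
    return dict(records)


def map_with_cache(large_records, cache):
    """Stream the large file and add the cached frequency for each word."""
    for word, freq in large_records:
        yield word, freq + cache.get(word, 0)


small_file = [("cat", 3), ("dog", 5)]
large_file = [("cat", 1), ("apple", 2), ("dog", 1)]
cache = load_small_side(small_file)
print(list(map_with_cache(large_file, cache)))
# prints [('cat', 4), ('apple', 2), ('dog', 6)]
```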
I'm interested in performing knn search on large dataset.
There are some libs: ANN and FLANN, but I'm interested in the question: how to organize the search if you have a database that does not fit entirely into memory(RAM)?
I suppose it depends on how much bigger your index is in comparison to the memory. Here are my first spontaneous ideas:
Supposing it was tens of times the size of the RAM, I would try to cluster my data using, for instance, hierarchical clustering trees (implemented in FLANN). I would modify the implementation of the trees so that they keep the branches in memory and save the leaves (the clusters) on the disk. Therefore, the appropriate cluster would have to be loaded each time. You could then try to optimize this in different ways.
If it was not that much bigger (let's say twice the size of the RAM), I would split the dataset into two parts and create one index for each. I would then need to find the nearest neighbor in each part and choose between them.
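A rough sketch of that split-and-merge idea, with brute-force search standing in for a real per-part index such as FLANN. The function names are hypothetical, and the parts are plain in-memory lists here; in practice each part would be loaded (or its index opened) one at a time.

```python
import heapq
import math


def knn_in_part(part, query, k):
    """k nearest neighbours within one partition, as (distance, point) pairs."""
    return heapq.nsmallest(k, ((math.dist(p, query), p) for p in part))


def knn_over_parts(parts, query, k):
    """Query every partition separately, then keep the k best candidates."""
    candidates = []
    for part in parts:  # only one partition needs to be in memory at a time
        candidates.extend(knn_in_part(part, query, k))
    return heapq.nsmallest(k, candidates)


parts = [[(0.0, 0.0), (5.0, 5.0)], [(1.0, 1.0), (9.0, 9.0)]]
nearest = knn_over_parts(parts, (0.9, 0.9), k=2)
print([point for _, point in nearest])
# prints [(1.0, 1.0), (0.0, 0.0)]
```

Taking the k best from each part and merging is exact, because any true top-k neighbour is also in the top k of its own part.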
It depends if your data is very high-dimensional or not. If it is relatively low-dimensional, you can use an existing on-disk R-Tree implementation, such as Spatialite.
If the data is higher-dimensional, you can use X-Trees, but I don't know of any on-disk implementations off the top of my head.
Alternatively, you can implement locality sensitive hashing with on disk persistence, for example using mmap.