Data structure to store, update, and search very large Gazetteers? - data-structures

I have a set of gazetteers containing tens of millions of entities, and I need to use it for entity spotting in texts. The gazetteer set will also be updated when necessary.
Is a compressed trie the best data structure to employ?
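For reference, here is a minimal sketch in Go of the lookup pattern a trie-based spotter would use. It is a plain (uncompressed) trie with invented names, purely to illustrate insert and longest-match; at tens of millions of entries a compressed (radix) trie, a DAWG, or an Aho-Corasick automaton would likely be more memory-efficient.

package main

import "fmt"

type trieNode struct {
	children map[rune]*trieNode
	terminal bool // true if the path from the root spells a gazetteer entry
}

func newNode() *trieNode { return &trieNode{children: map[rune]*trieNode{}} }

// Insert adds one entity name; gazetteer updates are simply further inserts.
func (n *trieNode) Insert(entity string) {
	cur := n
	for _, r := range entity {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.terminal = true
}

// LongestMatch returns the longest gazetteer entry that is a prefix of text,
// which is the core operation when spotting an entity at a given text offset.
func (n *trieNode) LongestMatch(text string) (string, bool) {
	cur, end, found := n, 0, false
	for i, r := range text {
		next, ok := cur.children[r]
		if !ok {
			break
		}
		cur = next
		if cur.terminal {
			end, found = i+len(string(r)), true
		}
	}
	return text[:end], found
}

func main() {
	root := newNode()
	root.Insert("New York")
	root.Insert("New York City")
	if m, ok := root.LongestMatch("New York City is large"); ok {
		fmt.Println("spotted:", m) // spotted: New York City
	}
}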

Related

Which time series database supports these specific requirements?

We have a database with more than a billion daily statistical records. Each record has multiple metrics (m1 through m10) and several immutable tags.
A record can also be associated with zero or more groups. The idea was to use multiple tags (e.g. g1, g2) to indicate that a specific record belongs to a specific group.
Our data is stored at the daily level, and most time-series databases are optimized for more granular data. This is a problem when we want to produce monthly or quarterly graphs (e.g. InfluxDB has a maximum aggregation period of 7d). We need a database that is optimized for day-level data points and can produce quick aggregations at the month/quarter/year level.
Furthermore, the relationship between records and groups is mutable. We need the database to support batch updates of records (pseudo: ADD TAG group1 TO records WHERE record_id: 101), or at least relatively fast deletion and reinsertion of updated data.
We need something that can produce near-real-time results when aggregating data across tens of millions of (filtered) records.
Our original solution is based on Elasticsearch and it works quite well, but we wanted to explore alternatives in the time-series database niche. Can anyone recommend a time-series database that supports these features?
Try ClickHouse. It is optimized for real-time processing and querying of large amounts of data. We successfully used it to store hundreds of billions of records per day on a 15-node cluster. ClickHouse is able to scan billions of records per second per CPU core, and its query performance scales linearly with the number of available CPU cores.
ClickHouse also supports infrequent data updates, so you can update the groups for particular rows.
If you want a more traditional TSDB, take a look at VictoriaMetrics. It is built on architecture ideas from ClickHouse, so it is fast and provides good on-disk data compression.
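To make the ClickHouse suggestion concrete, here is a hedged sketch of what the day-level schema, the month-level rollup, and the "add record to group" mutation might look like, driven from Go through database/sql with the clickhouse-go driver. The table name, column names, and DSN are assumptions, not something from the original post.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" driver
)

const createTable = `
CREATE TABLE IF NOT EXISTS daily_stats (
    day       Date,
    record_id UInt64,
    m1        Float64,       -- m2 through m10 omitted for brevity
    groups    Array(String)  -- mutable group membership, e.g. ['g1','g2']
) ENGINE = MergeTree()
ORDER BY (record_id, day)`

// Month-level aggregation over day-level points, filtered to one group.
const monthlyRollup = `
SELECT toStartOfMonth(day) AS month, sum(m1) AS m1_total
FROM daily_stats
WHERE has(groups, 'g1')
GROUP BY month
ORDER BY month`

// ClickHouse mutations are asynchronous and comparatively heavy, which fits
// infrequent batch updates of group membership rather than OLTP-style writes.
const addToGroup = `
ALTER TABLE daily_stats
UPDATE groups = arrayPushBack(groups, 'g1')
WHERE record_id = 101 AND NOT has(groups, 'g1')`

func main() {
	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	for _, stmt := range []string{createTable, addToGroup} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
	rows, err := db.Query(monthlyRollup)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var month time.Time
		var total float64
		if err := rows.Scan(&month, &total); err != nil {
			log.Fatal(err)
		}
		fmt.Println(month.Format("2006-01"), total)
	}
}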

Improve mapping performance on Elasticsearch

My Elasticsearch cluster contains indices with giant mapping files, because some of my indices contain up to 60k different fields.
To elaborate a bit on my setup: each index contains information from a single source. Each source has several types of data (what I'll call layers), which are indexed as different types in the index corresponding to the source. Each layer has different attributes (20 on average). To avoid field-name collisions, they are indexed as "LayerId_FieldId".
I'm trying to find a way to reduce the size of my mapping (as, to my understanding, it might cause performance issues). One option is having one index per layer (and perhaps spreading large layers over several indices, each responsible for a different time segment). I have around 4000 different layers indexed right now, so let's say that with this method I will have 5000 different indices. Is Elasticsearch fine with that? What should I be worried about (if at all) with such a large number of indices, some of them very small (some of the layers have as few as 100 items)?
A second possible solution is the following. Instead of saving a layer's data in the way it is sent to me, for example:
"LayerX_name" : "John Doe",
"LayerX_age" : 34,
"LayerX_isAdult" : true,
it will be saved as:
"value1_string" : "John Doe",
"value2_number" : 34,
"value3_boolean" : true,
In the latter option, I would have to keep a metadata index which links the generic names to the real field names. In the above example, I need to know that for layer X the field "value1_string" corresponds to "name". Thus, whenever I receive a new document to index, I have to query the metadata in order to know how to map the given fields to my generic names. This allows me to have a constant-size mapping (say, 50 fields for each value type, so several hundred fields overall). However, it introduces some overhead, and most importantly I feel that it basically reduces my database to a relational one, and I lose the ability to handle documents of arbitrary structure.
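A small Go sketch of what the translation step in this second solution could look like; the field names and generic names are taken from the example above, while the function and the shape of the per-layer metadata are my own assumptions.

package main

import "fmt"

// fieldMap is the per-layer metadata: real field name -> generic field name.
type fieldMap map[string]string

// toGeneric rewrites an incoming document so that only the fixed set of
// generic field names ever reaches the Elasticsearch mapping.
func toGeneric(doc map[string]interface{}, meta fieldMap) map[string]interface{} {
	out := make(map[string]interface{}, len(doc))
	for field, value := range doc {
		if generic, ok := meta[field]; ok {
			out[generic] = value
		} else {
			// Unknown field: a real system would update the metadata index here.
			out[field] = value
		}
	}
	return out
}

func main() {
	metaForLayerX := fieldMap{
		"LayerX_name":    "value1_string",
		"LayerX_age":     "value2_number",
		"LayerX_isAdult": "value3_boolean",
	}
	doc := map[string]interface{}{
		"LayerX_name":    "John Doe",
		"LayerX_age":     34,
		"LayerX_isAdult": true,
	}
	fmt.Println(toGeneric(doc, metaForLayerX))
}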
Some technical details about my cluster:
Elasticsearch version 2.3.5
22 nodes, 3 of them masters; each node has 16 GB of RAM and 2 TB of disk storage. In total I currently have 6 TB of data spread over 1.2 billion docs, 55 indices, and 1500 shards.
I'd appreciate your input on the two solutions I suggested, or any other alternatives you have in mind!

boltdb data format for (space) efficient storage

I need to store business data from MySQL into bolt. The data is a map[string]string that looks like this:
{"id": "<uuid>", "shop_id":"12345678", "date": "20181019"... }
Since the amount of data will be huge and increasing, apart from splitting the data into standalone files (such as 201810.db), I would like to make the final file as small as possible. So I plan to encode the data myself using "lookup tables": since all keys are the same for all data items, I map id=1, shop_id=2, ..., so that each key only consumes 1~2 bytes. And for values, I do the same encoding, so that columns which are highly redundant (i.e. a select distinct only returns a few results) consume less space in the bolt file.
Now my question is: how does bolt store keys and values? If I use the method described above, will bolt store more objects per page, so that space efficiency is eventually improved?
Or, since it utilizes pages, will storing even one byte of data consume a full page? If that is the case, do I have to manually group a bunch of objects until their combined size exceeds a bolt page in order to keep pages fully filled? Of course, this would hurt random access, but for my application that can be overcome at the expense of increased coding complexity.
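Without answering the page-internals question, here is a hedged Go sketch of the lookup-table encoding described above, so the intent is concrete; the byte format, bucket name, and file name are my own assumptions. (As far as I understand, bolt's B+tree leaf pages hold multiple small key/value pairs, so a more compact record generally does mean more records per page; very large values spill onto additional pages instead.)

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

// Fixed key table: every record has the same keys, so one byte per key suffices.
var keyID = map[string]byte{"id": 1, "shop_id": 2, "date": 3}

// encode packs a record as a flat [keyID, valueLen, valueBytes...] sequence.
// It assumes values shorter than 256 bytes; a real implementation would also
// dictionary-encode highly redundant values instead of storing them verbatim.
func encode(rec map[string]string) []byte {
	var buf []byte
	for k, v := range rec {
		buf = append(buf, keyID[k], byte(len(v)))
		buf = append(buf, v...)
	}
	return buf
}

func main() {
	db, err := bolt.Open("201810.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rec := map[string]string{"id": "a1b2", "shop_id": "12345678", "date": "20181019"}
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("records"))
		if err != nil {
			return err
		}
		return b.Put([]byte(rec["id"]), encode(rec))
	})
	if err != nil {
		log.Fatal(err)
	}
}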

How to achieve Data Sharding in Endeca (data partitioning)

Currently Oracle Commerce Guided Search (Endeca) supports only language-specific partitions (i.e., one MDEX per language). For systems with a huge data volume (say ~100 million records across ~200 stores), has anyone successfully implemented data partitioning (sharding) based on logical groups of data (i.e., one MDEX per group of stores), so that the large data set can be divided into smaller sets?
If so, what precautions should be taken while indexing the data, and what strategies should be used for querying the Assembler?
I don't think this is possible. Endeca used to support the Adgidx, which allowed you to split or shard the MDEX, but that is no longer supported. Oracle's justification for removing it is that with multithreading and multi-core processors it is no longer necessary. Apache Solr, however, does support sharding.
The large data set can be broken into smaller sets, where each set is attributed to a property, say record.type, which identifies the different sets. So basically we are normalizing the records in the Endeca index.
Then, while querying Endeca, we can use record relationship navigation (RRN) queries, applying a relationship filter over record-to-record relationships to bring back records of different types.
However, you might have to obtain an RRN license to enable the RRN feature in the MDEX engine.

Column Store Strategies

We have an app that requires very fast access to columns from delimited flat files, without knowing in advance which columns will be in the files. My initial approach was to store each column in a MongoDB document as an array, but it does not scale as the files get bigger, since I hit the 16 MB document limit.
My second approach is to split the file column-wise and essentially treat the columns as blobs that can be served off disk to the client app. I'd intuitively think that storing the location in a database and the files on disk is the best approach, and that storing them in a database as blobs (or in Mongo GridFS) adds unnecessary overhead; however, there may be advantages that are not apparent to me at the moment. So my question is: what would be the advantage of storing them as blobs in a database such as Oracle or Mongo, and are there any databases that are particularly well suited to this task?
Thanks,
Vackar
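For illustration, a minimal Go sketch of the second approach described in the question: splitting a delimited file column-wise into per-column blob files whose paths would then be recorded in the database. The file layout, names, and newline-joined encoding are hypothetical.

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
)

// splitColumns reads a delimited file and writes one newline-separated blob
// per column into outDir, returning the per-column file paths keyed by header.
func splitColumns(path string, delim rune, outDir string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.Comma = delim
	rows, err := r.ReadAll()
	if err != nil || len(rows) == 0 {
		return nil, fmt.Errorf("read %s: %v", path, err)
	}

	header := rows[0]
	paths := make(map[string]string, len(header))
	for col, name := range header {
		var values []string
		for _, row := range rows[1:] {
			values = append(values, row[col])
		}
		out := filepath.Join(outDir, name+".col")
		if err := os.WriteFile(out, []byte(strings.Join(values, "\n")), 0644); err != nil {
			return nil, err
		}
		paths[name] = out // store this path (not the blob) in the database
	}
	return paths, nil
}

func main() {
	paths, err := splitColumns("input.tsv", '\t', ".")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(paths)
}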
