I am trying to find a way to build a full-text index, stored in memory, in a format that can be safely passed through Marshal.dump/Marshal.load so I can encrypt the index before writing it to disk.
My rationale for needing this functionality: I am designing a system where a user's content needs to be both encrypted with their own key and indexed for full-text searching. I realize there would be significant overhead and memory usage if, for each user of the system, I had to unmarshal and load the entire index of their content into memory. For this project security is far more important than efficiency.
A full text index would maintain far too many details about a user's content to leave unencrypted, and simply storing the index on an encrypted volume is insufficient as each user's index would need to be encrypted using the unique key for that user to maintain the level of security desired.
User content will be encrypted and likely stored in a traditional RDBMS. My thought is that loading/unloading the serialized index would be less overhead for a user with large amounts of content than decrypting all the DB rows belonging to them and doing a full scan for every search.
My trials with ferret got me to the point of successfully creating an in-memory index. However, the index failed Marshal.dump because it holds a Mutex, which Ruby cannot serialize. I am also evaluating xapian and solr but seem to be hitting roadblocks there as well.
Before I go any further I would like to know whether this approach is even a sane one, and what alternatives I might want to consider if it's not. I would also like to know if anyone has had any success with serializing a full-text index in this manner, what tool you used, and any pointers you can provide.
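For reference, this is roughly the round trip I have in mind, sketched with a plain Ruby Hash standing in for the index structure and OpenSSL for the encryption. The cipher choice and key handling here are only illustrative.

require 'openssl'

# Toy inverted index: word => array of document ids. A plain Hash of primitives
# marshals cleanly, unlike ferret's index object, which holds a Mutex.
index = { "hello" => [1, 3], "world" => [2, 3] }

key = OpenSSL::Random.random_bytes(32)   # the user's key, however it is derived

cipher = OpenSSL::Cipher.new("aes-256-cbc")
cipher.encrypt
cipher.key = key
iv = cipher.random_iv
encrypted = cipher.update(Marshal.dump(index)) + cipher.final
# ... write iv and encrypted to disk ...

decipher = OpenSSL::Cipher.new("aes-256-cbc")
decipher.decrypt
decipher.key = key
decipher.iv = iv
restored = Marshal.load(decipher.update(encrypted) + decipher.final)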
Why not use a standard full-text search engine and keep each client's index on a separate encrypted disk image, like TrueCrypt? Each client's disk image could have a unique key; that would use less RAM and would probably take less time to implement.
My application has to handle a lot of entities (100,000 or more) with locations, and needs to display only those within a given radius. I basically store everything in SQL but use Redis for caching and optimization (mainly GEORADIUS).
I add the entities roughly as in the following example (not exactly this; I use the Laravel framework with the built-in Redis facade, but it issues the same command in the background):
GEOADD k 19.059982 47.494338 '{"id":1,"name":"Foo","address":"Budapest, Astoria","lat":47.494338,"lon":19.059982}'
Is this bad practice? Will it have a negative impact on performance? Should I store only IDs as members and make a follow-up query to get the corresponding entities?
This is a matter of requirements. There's nothing wrong with storing the raw data as members as long as each member is unique (and it is unique here, given the "id" field). In fact, this is both simple and performant, since all the data is returned with a single query (assuming that's what is actually needed).
That said, there are a few considerations in favour of storing the data outside the Geoset and just "referencing" it by having members carry some form of key name (a minimal sketch of that approach follows the list):
A single data structure, such as a Geoset, is limited by the resources of a single Redis server. Storing a lot of data and members can require more memory than a single server can provide, which would limit the scalability of this approach.
Unless each entry's data is small, it is unlikely that all query types would require all data returned. In such cases, keeping the raw data in the Geoset generates a lot of wasted bandwidth and ultimately degrades performance.
When data needs to be updated, it can become too expensive to try to update small parts of it in place (i.e. ZREM and then GEOADD). Keeping everything outside the Geoset, perhaps in a Hash (or maybe something like RedisJSON), makes more sense then.
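A minimal sketch of the "IDs as members" variant; the key names locations and entity:1 are just illustrative:

GEOADD locations 19.059982 47.494338 entity:1
HSET entity:1 name "Foo" address "Budapest, Astoria" lat 47.494338 lon 19.059982

GEORADIUS locations 19.059982 47.494338 5 km
HGETALL entity:1

GEORADIUS returns the member IDs within the radius; a follow-up HGETALL (or a pipelined batch of them) fetches only the entities you actually need.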
I have BLOB data (pdf file attachment) in a table.
For us, it's too expensive to write Java or other code just to read the BLOB and validate it.
Is there any shortcut/easier/less expensive way to validate my BLOB? Are there any commands to read the metadata and validate the BLOB?
I would like to check whether the BLOB is corrupted or not.
That's not something you should do in the database. A BLOB is a binary file which is interpreted by the appropriate client software (Adobe Reader, MS Word, whatever). As far as the database is concerned it's a black box. So your application ought to validate the file before it uploads it into the database.
However, there is a workaround. You can build an Oracle Text CONTEXT index on your BLOB column. CONTEXT is really designed for free-text searching of documents, but a successful indexing pass is one way to prove that the uploaded file is readable.
The snag with CONTEXT indexes is that they aren't transactional: normally a background job indexes new documents, but for this purpose you would probably want to call CTX_DDL.SYNC_INDEX() as part of the upload so the user gets timely feedback; see the Oracle Text documentation for details.
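A rough sketch of what that could look like; the table, column and index names are assumptions, not something from the question:

-- Hypothetical table/column names; adjust to your schema.
CREATE INDEX docs_attachment_ctx ON docs (attachment)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Call after each upload so indexing failures surface immediately.
BEGIN
  CTX_DDL.SYNC_INDEX('docs_attachment_ctx');
END;
/

-- Documents that could not be filtered/indexed are recorded here.
SELECT err_textkey, err_text FROM ctx_user_index_errors;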
I will reiterate that Oracle Text is a workaround here, and an expensive one in terms of database resources: the index itself consumes space, and the indexing process takes time and CPU cycles. That's a big investment unless you're going to work with the document inside the database.
I am trying to get into the Couchbase world and am learning about views.
Several times in presentations and demos I have heard that it's bad to return the whole doc from a view:
emit(meta.id, doc);
My question is: why? What should I return instead, and how can I get the proper values of the document?
It's a bad idea because it's actually counterproductive. Emitting the document from the view means it will be stored on disk with the view index itself. You pay the I/O price for writing the document to disk again (a duplicate of the original key/value doc), and you pay it again for reading it at query time. Because view queries are served from disk (or the filesystem cache), you will never take advantage of the integrated cache layer to retrieve the document faster. In short, on average it will be faster to get the document ID from the view and retrieve the document by ID than it is to read the whole document from the view. This is especially true for operations on multiple documents.
It's bad because it's a large drain on resources: views frequently update and overwrite their indices, so if you are writing a whole doc repeatedly it's going to require a large amount of processor time and disk I/O, and it will churn the filesystem cache as well.
Therefore, it is recommended (and far more efficient) to have the view return just the document ID and then use the standard get operation to fetch the whole doc (a sketch follows).
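A minimal sketch of such a map function; the check on a "type" field is just an assumed filter, not something from the question:

// Emit only the key you want to query on, with a null value.
// Each result row already carries the document ID, so there is no need to emit the doc.
function (doc, meta) {
  if (doc.type === "user") {   // assumed filter on a "type" field
    emit(doc.name, null);
  }
}

At query time you read row.id from each view row and fetch the documents with an ordinary get (or a bulk get), which goes through the managed cache rather than the view files on disk.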
This is a description of the application I want to build and I'm not sure whether to use Core Data or Sqlite (or something else?):
Single user, desktop, not networked, only one frontend accessing the data storage
User occasionally enters some data; no bulk imports or large inserts
Simple data model: an entity with up to 20-30 attributes
User searches in the data (about 50k records max.)
Searching happens mostly on attribute values; I'm not looking up keys, but searching for text within the values
Writing the data is not something I see as critical; it happens rarely and involves small amounts of data. The text search over the attributes has to be blazingly fast, and a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish
Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
Although one of the FTS extensions is compiled into Apple's build of SQLite, it is not exposed through Core Data.
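As a sketch of the token-entity approach in Swift: the SearchToken entity, its word attribute, and its record relationship are assumed model names, not anything Core Data provides out of the box.

import CoreData

// Assumed model: a "SearchToken" entity with a "word" string attribute and a
// to-one relationship "record" pointing back to the indexed object.
func records(matching query: String,
             in context: NSManagedObjectContext) throws -> [NSManagedObject] {
    let request = NSFetchRequest<NSManagedObject>(entityName: "SearchToken")
    // Prefix match on the token; [cd] makes it case- and diacritic-insensitive.
    request.predicate = NSPredicate(format: "word BEGINSWITH[cd] %@", query)
    let tokens = try context.fetch(request)
    // Follow the relationship back to the owning records.
    return tokens.compactMap { $0.value(forKey: "record") as? NSManagedObject }
}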
We need an in-memory data structure to hold an English word dictionary.
When the computer/WordWeb starts, we need to read the dictionary from disk into an in-memory data structure.
This question asks: how is the in-memory data structure populated from disk in typical real-world dictionaries, say WordWeb?
Ideally we would like to keep the dictionary on disk in the same layout we need in memory, so that we don't have to spend time rebuilding the in-memory data structure and can just read it off the disk. But for linked lists, pointers, etc., how do we store the same image on disk? Would relative addresses (offsets) help here?
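To make the "relative addresses" idea concrete, here is a rough C++ sketch of the kind of layout I mean (purely illustrative, not code from any existing dictionary):

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// No pointers: one contiguous text buffer plus an array of 32-bit offsets into it.
// Both parts can be written to disk as-is and read back (or mmap'd) without
// rebuilding anything.
struct FlatDictionary {
    std::vector<uint32_t> offsets;  // offsets[i] = start of word i inside 'blob'
    std::string blob;               // all words concatenated, '\0'-terminated

    void add(const std::string& word) {
        offsets.push_back(static_cast<uint32_t>(blob.size()));
        blob += word;
        blob += '\0';
    }

    void save(std::FILE* f) const {
        const uint32_t count = static_cast<uint32_t>(offsets.size());
        const uint32_t bytes = static_cast<uint32_t>(blob.size());
        std::fwrite(&count, sizeof count, 1, f);
        std::fwrite(&bytes, sizeof bytes, 1, f);
        std::fwrite(offsets.data(), sizeof(uint32_t), count, f);
        std::fwrite(blob.data(), 1, bytes, f);
    }
};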
Typically, is the entire dictionary read and kept in memory, or is only part of it loaded, with leaf-page I/Os performed when searching for a specific word?
If somebody can also explain what that in-memory data structure typically is, please go ahead.
Thanks,
You mentioned pointers, so I'm assuming you're using C++; if that's the case and you want to read directly from disk into memory without having to "rebuild" your data structure, then you might want to look into serialization: How do you serialize an object in C++?
However, you generally don't want to load the entire dictionary anyway, especially if it's a user application. If the user is looking up dictionary words, then reading from disk happens so fast that the user will never notice the "delay." If you're servicing hundreds or thousands of requests, then it might make sense to cache the dictionary into memory.
So how many users do you have?
What kind of load are you expecting to have on the application?
WordWeb uses an SQLite database as its backend. It makes sense to me to use a database system to store the content, so it's easier to quickly get the content the user is looking for.
WordWeb has word prediction as well, so that would be a query to the database like
select word from table where word like 'ab%';
On the other hand, when the user presses Enter for a word:
select meaning from table where word='abandon'
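A minimal schema sketch that supports both queries; the table name dictionary is just an illustration (the answer above uses table as a placeholder):

-- Assumed table name. COLLATE NOCASE lets SQLite use the primary-key index for
-- the case-insensitive LIKE 'ab%' prefix query as well as for exact lookups.
CREATE TABLE dictionary (
    word    TEXT PRIMARY KEY COLLATE NOCASE,
    meaning TEXT NOT NULL
);

SELECT word    FROM dictionary WHERE word LIKE 'ab%' LIMIT 20;   -- prediction
SELECT meaning FROM dictionary WHERE word = 'abandon';           -- lookup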
You do not want to be deserializing the content from disk into memory while the user is typing or after they have pressed Enter to search. Since the data (a whole dictionary) is large, deserialization will probably take more time per word search than the user will tolerate.
Otherwise, why not create a JSON-format file containing all the meanings, as a compact form of the dictionary?