How to validate a BLOB object in Oracle

I have BLOB data (PDF file attachments) in a table.
For us, it's too expensive to write Java or other code to read the BLOB in order to validate it.
Is there any shortcut/easy/less expensive way to validate my BLOB? Any command(s) to read the metadata and validate the BLOB?

I would like to check whether the BLOB object is corrupted or not.
That's not something you should do in the database. A BLOB is a binary file which is interpreted by the appropriate client software (Adobe Reader, MS Word, whatever). As far as the database is concerned it's a black box. So your application ought to validate the file before it uploads it into the database.
However, there is a workaround. You can build an Oracle Text CONTEXT index on your BLOB column. CONTEXT is really designed for free-text searching of documents, but building the index is one way to prove that the uploaded file is readable.
The snag with CONTEXT indexes is that they aren't transactional: normally a background job indexes new documents, but for this purpose you would probably want to call CTX_DDL.SYNC_INDEX() as part of the upload so the user gets timely feedback.
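A minimal sketch of the idea (the table, column, and index names are illustrative):

-- illustrative names: an ATTACHMENTS table with a BLOB column called DOC
CREATE INDEX attachments_doc_ctx ON attachments (doc)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- after an upload, sync the index so any indexing failure surfaces immediately
BEGIN
  CTX_DDL.SYNC_INDEX('attachments_doc_ctx');
END;
/

-- documents that Oracle Text could not read or filter show up here
SELECT * FROM ctx_user_index_errors;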
I will reiterate that Text is a workaround, and it is expensive in terms of database resources. The index itself consumes space, and the indexing process takes time and CPU cycles. That's a big investment unless you're going to work with the document inside the database.

Related

Azure Data Factory (Graph Data Connect/Office365 Linked Service): how to work with Binary sink dataset?

Here's what I'm doing.
My company needs me to dump all group members and their corresponding groups into an SQL database. Power Automate takes forever with too many loops and API calls...so I'm trying Data Factory for the first time.
Using the Office365 Linked Service, we can get all organization members, but the only compatible sink option is Azure Blob Storage (or Data Lake) because the sink MUST be binary.
OK, fine. So we got an Azure Blob Storage account configured and set up.
But now that the pipeline 'copy data' has completed (after 4 hours?), I don't know what to do with this binary data. There seems to be no function, method or dataflow option to interpret the binary data as JSON, delimited text, or otherwise. The storage account shows 1042 different blobs, ranging haphazardly from a few kilobytes to dozens of megabytes (why???). Isn't there anything in Data Factory that can interpret this binary data and allow me to dump the columns I need into SQL?
I was able to load the blob data into Power Automate and parse it into usable JSON using the base64 and json functions, but this is robbing Peter to pay Paul because I have to use a loop to load the contents of 1042 different blobs and I'm exceeding our bandwidth quota. Besides that, some of the blobs are empty!! (again...why??)
I've looked everywhere for answers, no luck. So thank you for any insight.
You can use a Binary dataset in the Copy, GetMetadata, or Delete activities. When using a Binary dataset, the service does not parse the file content but treats it as-is.
So the Data Flow activity, which is used to transform data in Azure Data Factory, isn't supported for Binary datasets.
Hence, you can take another approach with an Azure service such as Azure Databricks, where you can parse the files with Python or any other data-engineering library in your preferred programming language.
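For example, in a Databricks notebook you can point Spark SQL at the container the pipeline wrote to and read the blobs as JSON. A rough sketch only, assuming the Graph Data Connect output is newline-delimited JSON and that the storage path and field names (id, displayName) match your dataset:

-- illustrative path; configure access to the storage account first
CREATE TEMPORARY VIEW group_members
USING json
OPTIONS (path 'abfss://office365data@yourstorageaccount.dfs.core.windows.net/groupmembers/');

-- pick out the columns you need, then write them on to your SQL database
SELECT id, displayName FROM group_members;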

Core Data or sqlite for fast search?

This is a description of the application I want to build, and I'm not sure whether to use Core Data or SQLite (or something else?):
Single user, desktop, not networked, only one frontend is accessing datastorage
User occasionally enters some data, no bulk data importing or large data inserts
Simple datamodel: entity with up to 20-30 attributes
User searches in data (about 50k records max.)
Search takes place mostly in attribute values, not looking for any keys here, but searching for text in values
Writing the data is nothing I see as critical, it happens not very often and with small amounts of data. The text search in the attributes has to be blazingly fast, a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish
Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
Although one of the FTS extensions is available in Apple's deployment of SQLite, it's not exposed in Core Data.
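If you do drop down to raw SQLite for the search, the FTS module looks roughly like this. This is a sketch only: which FTS version (FTS3/FTS4) is compiled in depends on the OS release, and the table and column names are illustrative:

-- virtual table holding the searchable text
CREATE VIRTUAL TABLE notes_fts USING fts4(content);

INSERT INTO notes_fts (content) VALUES ('the quick brown fox');

-- MATCH uses the full-text index instead of scanning every row
SELECT * FROM notes_fts WHERE content MATCH 'quick';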

Serializeable In-Memory Full-Text Index Tool for Ruby

I am trying to find a way to build a full-text index stored in-memory in a format that can be safely passed through Marshal.dump/Marshal.load so I can take the index and encrypt it before storing it to disk.
My rationale for needing this functionality: I am designing a system where a user's content needs to be both encrypted using their own key, and indexed for full text searching. I realize there would be significant overhead and memory usage if for each user of the system I had to un-marshal and load the entire index of their content into memory. For this project security is far more important than efficiency.
A full text index would maintain far too many details about a user's content to leave unencrypted, and simply storing the index on an encrypted volume is insufficient as each user's index would need to be encrypted using the unique key for that user to maintain the level of security desired.
User content will be encrypted and likely stored in a traditional RDBMS. My thought is that loading/unloading the serialized index would be less overhead for a user with large amounts of content than decrypting all the DB rows belonging to them and doing a full scan for every search.
My trials with ferret got me to the point of successfully creating an in-memory index. However, the index failed a Marshal.dump due to the use of Mutex. I am also evaluating xapian and solr but seem to be hitting roadblocks there as well.
Before I go any further, I would like to know if this approach is even a sane one, and what alternatives I might want to consider if it's not. I also want to know if anyone has had any success with serializing a full-text index in this manner, what tool you used, and any pointers you can provide.
Why not use a standard full-text search engine and keep each client's index on a separate encrypted disk image, like TrueCrypt? Each client's disk image could have a unique key; it would use less RAM and would probably take less time to implement.

What specific issues will I have to consider when saving files as binary data to a SQL Server 2005 database?

I'm writing an online tax return filing application using MVC3 and EF 4.1. Part of the application requires that the taxpayer be able to upload documents associated with their return. The users will be able to come back days or weeks later and possibly upload additional documents. Prior to finally submitting their return the user is able to view a list of files that have been uploaded. I've written the application to save the uploaded files to a directory defined in the web.config. When I display the review page to the user I loop through the files in the directory and display it as a list.
I'm now thinking that I should be saving the files to the actual SQL Server database as binary data in addition to saving them to the directory. I'm trying to avoid "what if" scenarios.
What if
A staff member accidentally deletes a file from the directory.
The file server crashes (Other agencies use the same SAN as us)
A staff member saves other files to the same directory. The taxpayer should not see those
Any other scenario that causes us to have to request another copy of a file from a taxpayer (Failure is not an option)
I'm concerned that saving to the SQL Server database will have dire consequences that I am not aware of since I've not done this before in a production environment.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
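Adding one after the fact might look roughly like this (the database name, file path, and sizes are illustrative):

ALTER DATABASE TaxReturns ADD FILEGROUP LARGE_DATA;

ALTER DATABASE TaxReturns
ADD FILE
(
    NAME = 'TaxReturns_LargeData',
    FILENAME = 'D:\SQLData\TaxReturns_LargeData.ndf',
    SIZE = 100MB,
    FILEGROWTH = 50MB
)
TO FILEGROUP LARGE_DATA;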
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(   -- example columns; define your own fields here
    ID       INT IDENTITY(1,1) PRIMARY KEY,
    Content  VARBINARY(MAX) NULL
)
ON [Data]                  -- the basic "Data" filegroup for the regular row data
TEXTIMAGE_ON [LARGE_DATA]  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!

How is the WordWeb English dictionary implemented?

We need some in-memory data structure to hold an English word dictionary.
When the computer/WordWeb starts, we need to read the dictionary from disk into an in-memory data structure.
This question asks: how do we populate the in-memory data structure from disk in typical real-world dictionaries, say WordWeb?
Ideally we would like to store the dictionary on disk in the same form we need it in memory, so that we don't have to spend time building the in-memory structure and can just read it off the disk. But for linked lists, pointers, etc., how do we store the same image on disk? Would relative addresses or something similar help here?
Typically, is the entire dictionary read and stored in memory, or is only part of it loaded, with leaf-page I/Os done when searching for a specific word?
If somebody wants to help with what that in-memory data structure is typically, please go ahead.
Thanks,
You mentioned pointers, so I'm assuming you're using C++; if that's the case and you want to read directly from disk into memory without having to "rebuild" your data structure, then you might want to look into serialization: How do you serialize an object in C++?
However, you generally don't want to load the entire dictionary anyway, especially if it's a user application. If the user is looking up dictionary words, then reading from disk happens so fast that the user will never notice the "delay." If you're servicing hundreds or thousands of requests, then it might make sense to cache the dictionary into memory.
So how many users do you have?
What kind of load are you expecting to have on the application?
WordWeb uses an SQLite database as its backend. It makes sense to me to use a database system to store the content, so it's easier to quickly get the content the user is looking for.
WordWeb has word prediction as well, so it will run a query against the database like
select word from dictionary where word like 'ab%';
on the other hand, when the user presses Enter for the word:
select meaning from dictionary where word = 'abandon';
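For the prefix query to stay fast as the user types, the word column would normally be indexed. A sketch of how that table and index might be declared (names match the illustrative queries above; note that SQLite only uses an index for a LIKE prefix match when the column's collation matches LIKE's case sensitivity, e.g. a NOCASE column with the default case-insensitive LIKE):

create table dictionary (word text collate nocase, meaning text);
create index idx_dictionary_word on dictionary (word);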
You do not want to be serializing the content from disk into memory while the user is typing or after they have pressed Enter to search. Since the data (a whole dictionary) will be large, serialization will probably take more time than the user will tolerate for every word search.
Otherwise, why not create a JSON-format file containing all the meanings, as a short form of the dictionary?
