Oracle: What is the difference between array binding and SQL*Loader?

I was using bcp (Sybase mass insert) to insert millions of records, but my company is migrating to Oracle.
I am not sure whether I should use array binding or SQL*Loader. I have a lot of data in memory. I can either (1) create a text file with the data and use SQL*Loader to insert it, or (2) use the array binding library to insert the data. I'm not sure which is more practical for my application. What are the differences between the two? Is one better for certain applications?
Which should I use to replace bcp?

SQL*Loader is the most direct replacement for bcp. If you have an existing process that uses bcp, moving to SQL*Loader is probably the path of least resistance.
You say that you already have the data in memory. I assume that means the data is in memory on a client machine, not on the database server. Given that starting point, I'd generally prefer a direct-path load, assuming that whatever database access API you are using provides a direct-path API. Writing a bunch of data to a file, only to have SQL*Loader read that data back off disk just to use the direct-path API (assuming you set it up to do so), should make SQL*Loader less efficient. Of course, since SQL*Loader is a purpose-built tool, it is likely that you can cobble together a SQL*Loader solution with acceptable performance more quickly than you can write your own code, particularly if you're just learning the API.
If you don't have access to a direct path API and you're debating between an application doing a conventional path load using array binds or a SQL*Loader solution doing a direct-path load, the question is much closer. You'd probably need to benchmark the two. Direct-path loading is more efficient than a conventional path load. But writing all the data to disk and reading it all back is going to incur an additional cost. Whether the cost of reading and writing that data to disk outweighs the benefit of a direct-path load will depend on a variety of factors that are specific to your application (data volumes, disk speed, network I/O, etc.).
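For a rough feel of what the conventional-path, array-bind option looks like from application code, here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in so that it runs without an Oracle instance; the table name `trades`, its columns, and the batch size are hypothetical. An Oracle driver follows the same shape (python-oracledb, for example, also exposes `cursor.executemany`, with Oracle-style bind placeholders), but nothing here is a direct-path load.

```python
import sqlite3
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def array_insert(conn, rows, batch_size=10_000):
    """Insert rows in batches; each executemany call binds a whole array of rows."""
    cur = conn.cursor()
    total = 0
    for chunk in batches(rows, batch_size):
        cur.executemany(
            "INSERT INTO trades (id, symbol, qty) VALUES (?, ?, ?)", chunk
        )
        total += len(chunk)
    conn.commit()
    return total

# Stand-in database and hypothetical in-memory data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER, symbol TEXT, qty INTEGER)")
in_memory_rows = ((i, "ABC", i % 100) for i in range(25_000))

inserted = array_insert(conn, in_memory_rows)
row_count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
```

The batch size is the knob to benchmark: larger arrays mean fewer round trips to the server but more client memory per call.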
One additional option to consider may be to write the file to disk, copy the file to the database server, and then use an external table to expose the data to the database. This is generally more efficient than using SQL*Loader from a client machine. Whether it performs better than a direct-path load from an application and whether the additional complexity of writing files, moving them around, and generally moving control from an application to operating system utilities and back to the application outweighs the complexity of writing a bit more code in the application is something that you'd need to answer for yourself.

Why is it recommended practice to store images on disk rather than in a Realm

I am using Realm as the database solution for my app. I need persistent storage ability for my images so I can load them when offline. I also need a cache so I can load the images from there rather than fetching them from the API each time a cell draws them. My first thought was that a Realm database could serve both of these functions just fine if I were to store the images in Realm as NSData. But I have found two answers on SE (here and here) that recommend not doing this if you have many images of a largish size that will change often. Instead they recommend saving the images to disk, and then storing the URL to those images in Realm.
My question is: why is this best practice? The answers linked to above don't give reasons why, except to say that you end up with a bloated database. But why is that a problem? What is the difference between having lots of images in my database and having lots of images on disk?
Is it a speed issue? If so, is there a marked speed difference in an app being able to access an image from disk to being able to access it from a database solution like Realm?
Thanks in advance.
This isn't really just a problem localised to Realm. I remember the same advice being given with Core Data too.
I'm guessing the main reason above all else as to why storing large binary data in a database isn't recommended is because 'You don't gain anything, and actually stand to lose more than you otherwise would'.
With Core Data (i.e. databases backed by SQLite), you'll actually take a performance hit as the data will be copied into memory when you perform the read from SQLite. If it's a large amount of data, then this is wholly unacceptable.
With Realm at least, since it uses a zero-copy, memory-mapped mechanism, you'll be provided with the NSData mapped straight from the Realm file, but then again, this is absolutely no different than if you simply loaded the image file from disk itself.
Where this becomes a major problem in Realm is when you start changing the image often. Realm actually uses an internal snapshotting mechanism when working with changing data across threads, which essentially means that during operation, entire sets of data might be periodically duplicated on disk (to ensure thread safety). If the data sets include large blobs of binary data, these will get duplicated too (which might also mean a performance hit). When this happens, the size of the Realm file on disk will be increased to accommodate the snapshots, but when the operation completes and the snapshots are deleted, the file will not shrink back to its original size. This is because reclaiming that disk space would be a costly performance hit, and since it's easily possible the space could be needed again (e.g. by another large snapshotting operation), it seems inefficient to do so pre-emptively (hence the 'bloat').
It's possible to manually perform an operation to reclaim this disk space if necessary, but the generally recommended approach is to optimise your code to minimise this from happening in the first place.
So, to sum that all up, while you totally can save large data blobs to a database, over time, it'll potentially result in performance hits and file size bloat that you could have otherwise avoided. These sorts of databases are designed to help transform small bits of data to a format that can be saved to and retrieved from disk, so it's essentially wasted on binary files that could easily be directly saved without any modification.
It's usually much easier, cleaner and more efficient to simply store your large binary data on disk, and simply store a file name reference to them inside the database. :)
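The pattern recommended above (bytes on disk, only a file name in the database) can be sketched as follows. This is a minimal illustration in Python, with sqlite3 standing in for Realm; the `images` table, the content-hash file naming, and the helper names `save_image` and `load_image` are all assumptions made for the example, not anything from Realm's API.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

def save_image(conn, image_dir, image_id, data):
    """Write the bytes to a file and record only the file name in the database."""
    file_name = hashlib.sha256(data).hexdigest() + ".bin"
    (image_dir / file_name).write_bytes(data)
    conn.execute(
        "INSERT OR REPLACE INTO images (id, file_name) VALUES (?, ?)",
        (image_id, file_name),
    )
    conn.commit()
    return file_name

def load_image(conn, image_dir, image_id):
    """Look up the file name, then read the bytes from disk; None if unknown."""
    row = conn.execute(
        "SELECT file_name FROM images WHERE id = ?", (image_id,)
    ).fetchone()
    return (image_dir / row[0]).read_bytes() if row else None

# Hypothetical storage locations for the sketch.
image_dir = Path(tempfile.mkdtemp())
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id TEXT PRIMARY KEY, file_name TEXT NOT NULL)")

stored_name = save_image(conn, image_dir, "avatar-1", b"\x89PNG fake image bytes")
loaded = load_image(conn, image_dir, "avatar-1")
```

Storing a file name relative to a known directory, rather than an absolute path, keeps the reference valid if the directory itself moves. Replacing an image then only rewrites one small row in the database.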

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs Oracle fight :) Since an Oracle license is so costly, would Vertica do the job for a better price?
thx all
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebraker, who is the force behind Vertica, founded that company as well. His basic philosophy is that Oracle, SQL Server, et al. do not do well for high performance since they are designed to do everything. The future is having databases designed for specific tasks.
So he had some concepts for data warehousing, which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP then what would happen would be that you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form of queuing mechanism to pass through records in larger batches then this would result in fewer and larger ROS files. It would work, but if you wanted your OLTP system to read very close to its writing activity, it would not fit the use case.
The WOS/ROS mechanism is a neat workaround for the fundamental performance penalty of writes in a column store DB, but fundamentally Vertica is not an OLTP DB but rather a data mart technology that can ingest data in near real time.
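The queuing idea described in this answer can be sketched as a small buffer that collects records and hands them on in large batches. This is only an illustration: the class name `BatchBuffer`, its thresholds, and the `flush_fn` callback (which would wrap whatever bulk load the database offers) are hypothetical.

```python
import time

class BatchBuffer:
    """Collects records and hands them to `flush_fn` in large batches."""

    def __init__(self, flush_fn, max_rows=50_000, max_age_seconds=300.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.rows = []
        self.first_row_at = None

    def add(self, row):
        if not self.rows:
            # Remember when the current batch started filling.
            self.first_row_at = self.clock()
        self.rows.append(row)
        too_big = len(self.rows) >= self.max_rows
        too_old = self.clock() - self.first_row_at >= self.max_age_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.rows:
            batch, self.rows = self.rows, []
            self.flush_fn(batch)

# Usage: collect the batches in a list instead of loading them anywhere.
loaded_batches = []
buffer = BatchBuffer(loaded_batches.append, max_rows=3)
for n in range(7):
    buffer.add({"event_id": n})
buffer.flush()  # push out the final partial batch
batch_sizes = [len(b) for b in loaded_batches]
```

Note that in this sketch the age check only runs when a new record arrives, so a quiet stream needs a periodic call to `flush()` from a timer. The larger the batch, the fewer storage containers get created, at the cost of the data being visible to readers later.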
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single user database. There is practically no RI, no RI locking, table locks on DELETE/UPDATE, and you're likely to accumulate a delete vector in normal OLTP type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, but you can plow through large amounts of data with ease, even normalized. I would not use Vertica primarily to run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now but I'll share my experience.
I would not suggest Vertica for OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has two types of storage: ROS is the Read Optimized Storage and WOS is the Write Optimized Storage. WOS is purely in memory, so it performs better for inserts but is slower to query, as all the small updates need to be queried and unioned. Vertica can handle small loads in theory, but in practice it didn't work out very well for us performance-wise. WOS also has drawbacks: when the database fails, WOS is not necessarily preserved when it rolls back to the last good epoch. (ROS isn't either, but in practice you lose a lot less from ROS.)
ROS is a lot more reliable and gives better read performance but you will never be able to handle more than a certain number of queries without a careful design. Although vertica is horizontally scalable, in practice large tables get segmented across all nodes and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries it just means less work per query. If your tables are small enough to be unsegmented then this might not be an issue for you.
Also worth noting is that OLTP typically implies lots of concurrent transactions, so you'll need to plan resource pools very carefully. By default Vertica has a planned concurrency for the general resource pool of the minimum of the number of cores per server and RAM/2GB. Essentially, this value determines the default memory allocation PER NODE for a segmented query. Therefore, by default Vertica will not let you run more queries than cores. You can adjust this value, but once you hit a cap on memory there's not much you can do, because the memory is allocated per node, so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations, that is the first config you should look at.
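As a worked illustration of the arithmetic in the paragraph above, here is a small Python sketch. The first function just restates the rule as given in this answer (minimum of cores and RAM/2GB). The second is a simplifying assumption on my part, that a pool's memory is divided evenly by the planned concurrency to get a per-query budget on each node; the function names and the example figures are hypothetical.

```python
def default_planned_concurrency(cores, ram_gb):
    """Rule as described in the answer: min(cores per server, RAM / 2GB)."""
    return min(cores, int(ram_gb // 2))

def per_query_budget_gb(pool_memory_gb, planned_concurrency):
    """Assumed simplification: pool memory split evenly across planned queries."""
    return pool_memory_gb / planned_concurrency

# Hypothetical node: 16 cores, 64 GB of RAM, 48 GB available to the pool.
concurrency = default_planned_concurrency(cores=16, ram_gb=64)
budget = per_query_budget_gb(pool_memory_gb=48, planned_concurrency=concurrency)

# A memory-starved node: 8 cores but only 8 GB of RAM, so RAM is the limit.
small_node_concurrency = default_planned_concurrency(cores=8, ram_gb=8)
```

Because both numbers are per node, adding nodes changes neither of them, which is the point the paragraph makes about scale-out not raising concurrency.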
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background) so if these are a regular part of your workload then Vertica is probably a bad choice. Personally we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into vertica to use for joins.
Personally I use Vertica as an OLTP-ish realtime-ish database. We batch our loads into 5 minute intervals which makes vertica happy in terms of how many/large the inserts are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this if they are large batches as this forces ROS container creation and can be bad if you do it too often). As many projections as we can have are unsegmented to allow better scale out since this makes queries hit only 1 node and allocate memory on only 1 node. It has worked well for us so far and we load about 5 billion rows a day with realtime querying from our UI.
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question yes Vertica may be a great fit but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar in this space because I implemented Vertica at a telecom that I worked for at the time.

Running out of RAM memory

I may need to build a hash table that may grow very large in size. I am wondering if the hash table does not fit in memory what is the best way to address this problem as to avoid having the application crash when it runs out of memory.
Use case: This hash table contains a bunch of ids that are referenced in a for loop that needs to consult the id for a particular word.
Any time you have data that can not be easily recreated on the fly, then you need to make provisions to get it out of RAM and onto disk. Any sort of data store will do that. You could use a flat or text file, or a YAML file.
If you need fast access then you'll be looking at some sort of database, because reading a flat/text file doesn't easily allow random access. SQLite can do it, or a NoSQL database.
If you need to allow multiple processes access to the data and have good access restriction, and/or store the data on one machine and access it from another, then you'll be looking at a database of some sort. At that point I'd look into MySQL or Postgres. I prefer the latter, but they'll both work.
If you really think the hash will grow so big, then maybe you should not store this data in a hash in RAM. I don't think you can easily avoid a crash when your app runs out of memory. I guess the key is to create mechanisms that avoid major memory consumption.
I don't know your situation, but I really doubt the hash table you described would make a reasonable computer run out of memory. If you really think so, maybe you should use a key value storage database (Redis is fairly easy to learn http://redis.io/) or other kind of NoSQL database.
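To make the "move it out of RAM" suggestions in this thread concrete, here is a minimal Python sketch of a disk-backed word-to-id lookup using SQLite from the standard library. The class name `DiskBackedIds`, the table layout, and the sample data are assumptions for illustration; a key-value store such as Redis would fill the same role.

```python
import os
import sqlite3
import tempfile

class DiskBackedIds:
    """Maps word -> id with the data kept in a SQLite file instead of RAM."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS word_ids "
            "(word TEXT PRIMARY KEY, id INTEGER NOT NULL)"
        )

    def put_many(self, pairs):
        # One transaction for the whole batch keeps the load reasonably fast.
        self.conn.executemany(
            "INSERT OR REPLACE INTO word_ids (word, id) VALUES (?, ?)", pairs
        )
        self.conn.commit()

    def get(self, word, default=None):
        row = self.conn.execute(
            "SELECT id FROM word_ids WHERE word = ?", (word,)
        ).fetchone()
        return row[0] if row else default

# Hypothetical data: 100,000 words streamed in without building a dict first.
db_path = os.path.join(tempfile.mkdtemp(), "word_ids.sqlite")
ids = DiskBackedIds(db_path)
ids.put_many((f"word{n}", n) for n in range(100_000))

found = [ids.get(w) for w in ("word7", "word99999", "unknown")]
```

The primary key gives an indexed lookup per word, so the loop in the use case can call `get` for each word while memory use stays roughly constant regardless of table size.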

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list would have the Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL Database but is it suitable for WORM (UPDATE: using VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you can optimize the data structure for your most common operations (in this case a lookup) at the cost of memory space. For persistence, you could store the data in a flat file and read it during startup.
SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
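The second option in the paragraph above, a file kept in search-optimized order, might look like the following Python sketch. It assumes fixed-width, zero-padded records so that a record's position can be computed and the file binary-searched with `seek`; the 7-digit width and the function names are hypothetical choices for the example.

```python
import os
import tempfile

RECORD_LEN = 8  # 7 digits plus a newline; fixed width makes seek arithmetic possible

def write_sorted_file(path, values):
    """Write the values once, sorted, one zero-padded record per line."""
    with open(path, "wb") as f:
        for v in sorted(values):
            f.write(f"{v:07d}\n".encode("ascii"))

def contains(path, value):
    """Binary search over the sorted fixed-width records without loading the file."""
    target = f"{value:07d}\n".encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_LEN
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_LEN)
            record = f.read(RECORD_LEN)
            if record == target:
                return True
            # Equal-length zero-padded ASCII compares in numeric order.
            if record < target:
                lo = mid + 1
            else:
                hi = mid
    return False

path = os.path.join(tempfile.mkdtemp(), "sorted_numbers.dat")
write_sorted_file(path, [1000, 5, 999999, 42])
hit, miss = contains(path, 42), contains(path, 43)
```

Each lookup touches about log2(n) records, roughly 20 reads for a million entries, so memory use stays negligible at the cost of being slower than an in-memory hash table.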
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to deal with file system storage, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (extra app dependency, design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list on it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount of numbers for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer reading in 1M numbers from a text file takes under a second and after that I can do about 13M lookups per second.
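The "simplest thing that could work" described above can be sketched in a few lines of Python. The file name, the 7-digit zero-padded format, and the sample contents are assumptions; entries are kept as strings so that leading zeros survive, in line with the question's note about using VARCHAR rather than INT.

```python
import os
import tempfile

def load_numbers(path):
    """Read one entry per line into a set; blank lines are skipped."""
    with open(path, encoding="ascii") as f:
        return {line.strip() for line in f if line.strip()}

# Build a small sample file; a real list would already exist on disk.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w", encoding="ascii") as f:
    f.writelines(f"{n:07d}\n" for n in range(0, 1_000_000, 7))

# Done once at application startup.
numbers = load_numbers(path)

def matches(user_input):
    """Constant-time membership test against the in-memory set."""
    return user_input.strip() in numbers
```

In a web application the set would be loaded once per worker process at startup and then shared read-only by every request, which fits the write-once, read-many access pattern.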

Dealing with Gigabytes of Data

I am about to start on a new project. I need to deal with hundreds of gigs of data in a .NET application. It is too early to give much detail about this project. A brief overview follows:
Lots of writes and Lots of reads on same tables, very realtime
Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented
Each row of data may contain lots of attributes to deal with
I am suggesting/having following as a solution:
Use distributed hash table sort of persistence (not S3 but an inhouse one)
Use Hadoop/Hive likes (any replacement in .NET?) for any analytical process across the nodes
Implement GUI in ASP.NET/Silverlight (with lots of ajaxification, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external harddrive and sample data generator (RedGate's is good), you can simulate that kind of workload quite easily.
Simulate that workload on a non-relational or cloud database and you might end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
Get Ralph Kimball's Data Warehouse Toolkit books.
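The flow this answer describes (flat file, keys pre-assigned in plain file processing, bulk load into fact and dimension tables, then a simple SUM/COUNT with JOIN and GROUP BY) can be sketched end to end in Python. Everything named here is hypothetical: the sample records, the `dim_customer` and `fact_sales` tables, and the `assign_keys` helper. SQLite stands in for the RDBMS and its bulk-load tools.

```python
import csv
import io
import sqlite3

RAW_FLAT_FILE = "alice,10\nbob,5\nalice,7\n"  # hypothetical (customer, amount) records

def assign_keys(raw_rows):
    """Turn (customer_name, amount) rows into dimension and fact rows with keys."""
    customer_keys = {}  # natural key -> surrogate key
    fact_rows = []
    for fact_id, (name, amount) in enumerate(raw_rows, start=1):
        key = customer_keys.setdefault(name, len(customer_keys) + 1)
        fact_rows.append((fact_id, key, amount))
    dim_rows = [(key, name) for name, key in customer_keys.items()]
    return dim_rows, fact_rows

# Step 1: flat-file processing assigns every PK and FK before any load.
parsed = [(name, int(amount)) for name, amount in csv.reader(io.StringIO(RAW_FLAT_FILE))]
dim_rows, fact_rows = assign_keys(parsed)

# Step 2: simple bulk loads into a small "datamart".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE fact_sales "
    "(sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount INTEGER)"
)
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", dim_rows)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", fact_rows)

# Step 3: every aggregate has the same fact JOIN dimension GROUP BY shape.
totals = conn.execute("""
    SELECT d.name, SUM(f.amount), COUNT(*)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
```

Because the keys are already consistent when the rows reach the database, the loads need no lookups or sequence generation, which is what keeps them simple.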
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean; you don't need to use a lot of third-party tools: use the RDBMS tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts all data from the DB using well-written stored procedures on a well-designed physical data model (indexes, partitions, etc).
Teradata can handle a lot of data and is scalable.
