Google BigQuery is running queries slowly - performance

I'm running a simple bigQuery over my dataset which is about 84GB of log data.
The query takes approx 110 seconds to complete. Is this normal for a data set of this size?

After further investigation, it looks like your table was heavily fragmented. We usually have a coalesce process running to prevent this situation, but it had been off for a couple of weeks while we were verifying a bug fix. I've restarted the coalescer and run it against your table. Please let me know if you continue to see poor performance.
As a best practice, you may be better off importing somewhat less frequently in larger chunks, or splitting your data into time-based tables. BigQuery isn't really designed to handle high-volume small imports to the same table.

Related

How to test tuned oracle sql and how to clear clear system/hardware buffer?

I want to know the way to test right sqls before tuned and after tuned.
but once I executed the original sql, I got results too fast for tuned sql.
I found below...
How to clear all cached items in Oracle
I did flush data buffer cache and shared pool but it still didn't work.
I guess this answer from that question is related to what I want to know more:
Keep in mind that the operating system and hardware also do caching which can skew your results.
Oracle's version is 11g and Server is HP-UX 11.31.
If the server was Linux, I could've tried clearing buffer using '/proc/sys/vm/drop_caches'.(I'm not sure it would works)
I'm searching quite long time for this problem. Is there anyone has this kind of problem?
thanks
If your query is such that the results are being cached in the file system, which your description would suggest, then the query is probably not a "heavy-hitter" overall. But if you were testing in isolation, with not much activity on the database, when the SQL is run in a production environment performance could suffer.
There are several things you can do to determine which version of two queries is better. In fact, entire books have been written on just this topic. But to summarize:
Before you begin, ensure statistics on the tables and indexes are up to date.
See how often the SQL will be executed in the grand scheme of things. If it runs once or twice a day, and takes 2 seconds to run, don't bother trying to tune.
Do a explain plan on both and look at the estimated costs and number of steps.
Turn on tracing for both optimizer steps and execution statistics, and compare.

What is the best way to extract big data to file?

I am using Oracle as a DBMS and Tuxedo for application server.
Customer has the need to export data from Oracle to SAMFILE for interface purpose.
Unfortunately, the total number of records size is huge (over 10 million) so
I was wondering what is the best practice to extract big amounts of data to a file on the database server.
I am used to creating a cursor and fetching a record then writing to file.
Is there a better i.e. faster way to handle this? It is a recurring task.
I suggest you read Adrian Billington's article on tuning UTL_FILE. It covers all the bases. Find it here.
The important thing is buffering records, so reducing the number of file I/O calls. You will need to benchmark the different implementations, to see which works best in your situation.
Pay attention to his advice on query performance. Optimising file I/O is pointless if most of the time is spent on data acquisition.

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs Oracle fight :)Since Oracle license is so costly, would Vertica do it job for a better price ?
thx all
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebreaker who is the force behind Vertica founded that company as well. His basic philosophy is that Oracle, SQL Server, et al do not do well for high performance since they are designed to do everything. The future is having databases designed for specific tasks.
So he had some concepts for a data warehousing which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP then what would happen would be that you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form a queuing mechanism to pass through records in larger batches then this would result in fewer and larger ROS files. It would work but if you wanted to take your OLTP system to be reading very close to its writing activity it would not fit the use case.
The WOS/ROS mechanism is a neat work around for the fundamental performance penalty of writes in a column store DB but fundamentally Vertica is not an OLTP DB but rather a data mart technology that can ingest data in near real time
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single user database. There is practically no RI, no RI locking, table locks on DELETE/UPDATE, and you're likely to accumulate a delete vector in normal OLTP type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, and you can plow through large amounts of data with ease, even normalized. I would not be using Vertica to primary run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now but i'll share my experience.
I would not suggest vertica as OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has 2 types of storage. ROS is the Read Optimized Storage and WOS is the Write Optimized Storage. WOS is purely in memory so it performs better for inserts but queries slower as all the small updates need to be queried and unioned. Vertica can handle small loads in theory but in practice it didn't work out very well for us performance wise. Also there are drawbacks to WOS namely being that when the database fails WOS is not necessarily preserved when it rolls back to last good epoch. (ROS isn't either but in practice you lose a lot less from ROS).
ROS is a lot more reliable and gives better read performance but you will never be able to handle more than a certain number of queries without a careful design. Although vertica is horizontally scalable, in practice large tables get segmented across all nodes and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries it just means less work per query. If your tables are small enough to be unsegmented then this might not be an issue for you.
Also worth noting is the OLTP typically implies lots concurrent transactions so you'll need to plan resource pools very carefully. By default vertica has a planned concurrency for the general resource pool of the minimum of number of cores per server or RAM/2GB. Essentially what this value does is determine the default memory allocation PER NODE for a segmented query. Therefore by default vertica will not let you run more queries than cores. You can adjust this value but once you hit a cap on memory theres no much you can do because the memory is allocated per node so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations that is the first config your should look at.
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background) so if these are a regular part of your workload then Vertica is probably a bad choice. Personally we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into vertica to use for joins.
Personally I use Vertica as an OLTP-ish realtime-ish database. We batch our loads into 5 minute intervals which makes vertica happy in terms of how many/large the inserts are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this if they are large batches as this forces ROS container creation and can be bad if you do it too often). As many projections as we can have are unsegmented to allow better scale out since this makes queries hit only 1 node and allocate memory on only 1 node. It has worked well for us so far and we load about 5 billion rows a day with realtime querying from our UI.
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question yes Vertica may be a great fit but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar in this space because I implemented Vertica at a telecom that I worked for at the time.

When is it an overkill to use Hadoop?

I have an Oracle database (roughly 1.2 billion records) of data with a web application sitting on top of it that generates queries (generates SQL code and returns counts). Basically you generated SQL queries graphically through an AJAX UI...and it runs pretty nice performance-wise.
This is roughly a 400 GB database. I've been looking at Hadoop and thinking about using it instead of Oracle (have my app generate HIVE query code), BUT it seems to me like it's an overkill....isn't hadoop targeted more towards tens of terabytes to petabyte scale datasets? Is it suitable in place of a relational database (like Oracle) for the task I'm doing??
It's hard to say without more details. However, in my experience, if all your data is in SQL than your SQL engine probably has more optimizations than simple map reduce has.
Without knowing what you want to crunch exactly and the state of the data, then unless you are hitting some major edge case with your environment, you probably would have more trouble setting up and using hadoop in your case and it would probably wouldn't end up taking a lot longer.
If all your data in Oracle, it's probably all parsed, indexed, and hopefully somewhat regular. If the crunching exists entirely in that domain (and you are not trying to work with something uncommon like massive BLOBs or other weird situtations), most of the time its better letting your database engine handle it.
Moral of the story:
Hadoop is really awesome but it's not
magic and doesn't make regular old SQL faster!
isn't hadoop targeted more towards tens of terabytes to petabyte scale datasets?
Maybe. But it's suitable to a wide variety of problems. It's also suitable for very small datasets where the Hadoop "functional" style of programming helps.
SQL is not the perfect query language. It's just widely-adopted.
Is it suitable in place of a relational database (like Oracle) for the task I'm doing??
Without too many requirements, it's almost impossible to tell. However, if you're doing transactional stuff with lots of inserts, updates and deletes, then SQL RDBMS is probably necessary.
If you're not doing complex transactions; if you're doing bulk loads and bulk queries, then the database is getting in your way. The file system will be faster. And often simpler.
Basically if something isn't broke don't try to repair it. From what i read in wikipedia it definitely is an overkill, but other than that you're saying the application "runs pretty nice performance-wise."

SQL Server - Merging large tables without locking the data

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks
Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't NOLOCK MERGE,INSERT, or UPDATE as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTS.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low memory situations. You might be better off simply stuffing more ram into your database server.
An ideal amount would be to have enough ram to pull the whole database into memory as needed. For example, if you have a 4GB database, then make sure you have 8GB of RAM.. in an x64 server of course.
I'm afraid that I've quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows as the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case) leverages the capabilities of the SQL server engine much better. Again, a fair amount of the work is in the analysis of the two data sets and this is done only once.
Another question I have to ask is well is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables? I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx
From personal experience, the main problem with MERGE is that since it does page lock it precludes any concurrency in your INSERTs directed to a table. So if you go down this road it is fundamental that you batch all updates that will hit a table in a single writer.
For example: we had a table on which INSERT took a crazy 0.2 seconds per entry, most of this time seemingly being wasted on transaction latching, so we switched this over to using MERGE and some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds or even 512 in 0.5 seconds, we tested this with load generators and all seemed to be fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was to not only batch the entries from a single producer in a MERGE operation, but also to batch the batch from producers going to individual DB in a single MERGE operation through an additional level of queue (previously also a single connection per DB, but using MARS to interleave all the producers call to the stored procedure doing the actual MERGE transaction), this way we were then able to handle many thousands of INSERTs per second without problem.
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.

Resources