How to know that Fusion Tables usage is over limit? - google-api

I use Fusion Tables API to add rows to my tables. I found this answer explaining the limits. Will I get some error message when these limits are reached?
I am currently getting the following error:
https://www.googleapis.com/upload/fusiontables/v1/tables/my-table-id/import?uploadType=media&alt=json
returned "Internal error when processing import. Please try again.">
and I don't know the reason.
The total number of rows in my document is 464'938. The number of cells is 13 * 464'938 (non-empty cells: 5'295'364). The downloaded file size is 43M (I'm not sure how to check the file size directly on Google's side). But when I created a new table, it started to work well.

Looks like a capacity limit was hit.
In December 2015, Fusion Tables announced increased limits:
We are happy to announce that, starting immediately:
All users have 1 GB of storage quota for their tables. There continues
to be a 250 MB limit per table.
Newly created tables can show up to 350,000 features on a map. There
continues to be a limit of 1 M characters per cell and 10 M vertices
per table. You can activate the new limit for existing tables by
opening the row editor and then clicking the “Save” button.
If you try again it may work, but 465k rows is still high.

Related

ClickHouse - Inserting more than a hundred entries per query

I cannot figure out how to increase the maximum number of entries per query. I would like to insert a thousand entries per query, and the default value is 100.
According to the doc, the parameter max_partitions_per_insert_block defines the limit of simultaneous entries.
I've tried to modify it from the ClickHouse client, but my insertion still fails:
$ clickhouse-client
my-virtual-machine :) set max_partitions_per_insert_block=1000
*SET* max_partitions_per_insert_block = 1000
Ok.
0 rows in set. Elapsed: 0.001 sec.
Moreover, there is no max_partitions_per_insert_block field in the /etc/clickhouse-server/config.xml file.
After modifying max_partitions_per_insert_block, I've tried to insert my data, but I'm stuck with this error:
infi.clickhouse_orm.database.ServerError: Code: 252, e.displayText() = DB::Exception: Too many partitions for single INSERT block (more than 100). The limit is controlled by 'max_partitions_per_insert_block' setting. Large number of partitions is a common misconception. It will lead to severe negative performance impact, including slow server startup, slow INSERT queries and slow SELECT queries. Recommended total number of partitions for a table is under 1000..10000. Please note, that partitioning is not intended to speed up SELECT queries (ORDER BY key is sufficient to make range queries fast). Partitions are intended for data manipulation (DROP PARTITION, etc). (version 19.5.3.8 (official build))
EDIT: I'm still stuck with this. I cannot even manually set the parameter to the value I want with SET max_partitions_per_insert_block = 1000: the value is changed but goes back to 100 after exiting and reopening clickhouse-client (even with sudo, so it does not look like a permission problem).
I figured it out after reading the documentation again, especially this document. I recognized, in the web profile, the settings I had seen in the system.settings table. I inserted the following into my default profile, reloaded, and my insert of a thousand entries went well: <max_partitions_per_insert_block>1000</max_partitions_per_insert_block>
I guess it was obvious for some, but probably not for inexperienced people.
Most likely you should change the partitioning scheme. Each partition generates several files on the file system, which can put a heavy load on the OS. In addition, this may be the cause of long merges.
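To make the advice about the partitioning scheme concrete, here is a minimal sketch that counts how many partitions a single INSERT block would touch under two partition keys. The question does not show the table's actual partition key, so the daily and monthly keys below (and the generated rows) are assumptions for illustration only.

```python
from datetime import date, timedelta

def partition_by_day(d: date) -> int:
    # Hypothetical key, similar in spirit to PARTITION BY toYYYYMMDD(d)
    return d.year * 10000 + d.month * 100 + d.day

def partition_by_month(d: date) -> int:
    # Hypothetical coarser key, similar in spirit to PARTITION BY toYYYYMM(d)
    return d.year * 100 + d.month

def partitions_touched(rows, key) -> int:
    # Number of distinct partitions a single insert block would write to
    return len({key(row) for row in rows})

# Made-up batch: 1000 rows, one per consecutive day
start = date(2019, 1, 1)
rows = [start + timedelta(days=i) for i in range(1000)]

by_day = partitions_touched(rows, partition_by_day)
by_month = partitions_touched(rows, partition_by_month)
print(by_day, by_month)
```

With a fine-grained key every row lands in its own partition, so a thousand-row insert exceeds the default limit of 100. A coarser key keeps the same batch well below it without raising max_partitions_per_insert_block.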

Initial ElasticSearch Bulk Index/Insert /Upload is really slow, How do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (there were no documents in the index before this).
I have a 3 node ES setup with 16GB with 8GB JVM settings, 1 index, 5 shards.
I have turned off refresh ("-1"), set replica to 0, increased the index buffer size to 30%.
On my upload side I have 22 threads, each sending 150 docs per bulk insert request. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net/HTTP (for the network call), and the ES Bulk API (no gem).
For all of my nodes and upload machines the CPU, Memory, SSD Disk IO is low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million it slows to a crawl.
I'm pretty new to ES so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find and wonder why my upload speed is so much slower.
I've also checked the logs and only saw some errors about mapping field couldn't change, but nothing about memory over or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database and increased the connection pool, which helped. It still slows to a crawl at about 1 million records, but it got to 2 million after about 8 hours of running.
I also tried an experiment on a big machine, the one used to run the upload job, running 80 threads at 1000 documents per upload. I did some calculations and found that my documents are about 7-10 KB each, so each bulk request is 7-10 MB. This got the document count to 1M faster, but once you get there everything slows to a crawl. The machine's stats are still really low. I do see output from the threads about every 5 minutes in the job logs, at about the same time I see the ES count change.
The ES machines still have low CPU, Memory. The IO is around 3.85MBs and the Network Bandwidth was at 55MBs and drops to about 20MBs.
Any help would be appreciated. Not sure if I should try the ES gem, and use the bulk insert which maybe keeps a connection open, or try something totally different to insert.
ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported.
Could you give an example for a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those and I can't really recall anything from the top of my head.
I've started profiling that DB and noticed that once you use offset of about 1 Million the queries are starting to take a long time.
Deep pagination is terrible performance-wise. There is a great blog post, no-offset, which explains:
why it's bad: To get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it will be.
how to avoid it: Make a unique order of your entries (for example by ID or combine date and ID, but something that is absolute) and add a condition on where to start. For example order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that one order by the ID again, but with the condition that the ID must be greater than the last one in your previous run, and fetch the next 10 entries plus remember the last ID again. Repeat until done.
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.
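The approach described above (often called keyset or seek pagination) can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the PostgreSQL source, and the docs table, its columns, and fetch_in_pages are hypothetical names, not taken from the question.

```python
import sqlite3

def fetch_in_pages(conn, page_size):
    """Yield pages of rows ordered by id, without using OFFSET."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, body FROM docs WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        # Remember the last id seen; the next query starts right after it
        last_id = rows[-1][0]

# Made-up data: 25 documents in a hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, body) VALUES (?, ?)",
    [(i, f"doc {i}") for i in range(1, 26)],
)

pages = list(fetch_in_pages(conn, 10))
print([len(page) for page in pages])
```

Because each query filters on the indexed id column, the database can seek straight to the starting row instead of sorting and discarding everything before it, so the cost per page should stay roughly constant however deep the export goes.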

Resizing Oracle Datafiles

I have a tablespace with 3 datafiles (autoextend enabled). Currently Datafile_1 and Datafile_2 are 32GB in size and Datafile_3 is 10GB.
I dropped one huge table and Datafile_2 usage dropped to 4GB. Using some queries**[1]** over the dba_extents view I could see the HWM was still at 32GB. I was able to move/shrink all the objects at the end of the datafile. Using the same query again I could see the HWM dropped to about 4GB.
Then I tried:
ALTER DATABASE DATAFILE '+DATA/myDatabase/datafile/datafile_2' resize 5G;
And got:
ORA-03297: file contains used data beyond requested RESIZE value;
I tried increasing the size up to 30GB without success.
I did some research and found this article:
http://www.dbi-services.com/index.php/blog/entry/resize-your-oracle-datafiles-down-to-the-minimum-without-ora-03297
Here, instead of dba_extents, the author accesses sys.x$ktfbue directly to view the metadata. His script shows that the Datafile_2 HWM was still 32G. For some reason MAX(block_id) in sys.x$ktfbue is different from the one in dba_extents.
Well, I did some more research to discover how to map the block_id in x$ktfbue to a database object, and found another script that does that. After running it I could see that all the blocks at the end of Datafile_2 were marked as "free space".
Well, now I am in a situation where I don't know what to do. Maybe having more than one datafile is the problem? Any tips?
[1] This is the query:
SELECT (MAX((block_id + blocks-1)*8192))/1024/1024 "HWM (MB)" FROM dba_extents WHERE file_id=8;
*file_id 8 represents datafile_2
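As a worked example of the arithmetic in that query, here is a small sketch that computes the same high-water mark from a list of extents. The 8192-byte block size comes from the query itself. The extent values are made up for illustration and are not taken from the database in question.

```python
BLOCK_SIZE = 8192  # bytes per block, as hard-coded in the query above

def hwm_mb(extents, block_size=BLOCK_SIZE):
    """Return the high-water mark in MB for (block_id, blocks) pairs."""
    # Last block used by any extent, mirroring MAX(block_id + blocks - 1)
    last_block = max(block_id + blocks - 1 for block_id, blocks in extents)
    return last_block * block_size / 1024 / 1024

# Hypothetical extents: one near the start of the file, one near the 4GB mark
extents = [(128, 128), (524160, 128)]
print(hwm_mb(extents))
```

A single extent near the end of the file is enough to pin the high-water mark there, which is why the file cannot be resized below that point even when most of it is empty.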
EDIT: Tried one other thing. I created a new tablespace and moved all the objects on Datafile_3 to it. I can see now that Datafile_3 is completely empty via DBA_EXTENTS. Also purged recyclebin.
Then I tried to resize it without success (ORA-03297), and also tried to drop it and got "ORA-03262: the file is not empty".
Finally I decided to sum all the free extents from Datafile_3 on DBA_FREE_SPACE and compare it to the value at DBA_DATA_FILES. There is a difference of 1MB somewhere! I am going crazy. :)

Caching expensive SQL query in memory or in the database?

Let me start by describing the scenario. I have an MVC 3 application with SQL Server 2008. In one of the pages we display a list of Products that is returned from the database and is UNIQUE per logged in user.
The SQL query (actually a VIEW) used to return the list of products is VERY expensive.
It is based on very complex business requirements which cannot be changed at this stage.
The database schema cannot be changed or redesigned as it is used by other applications.
There are 50k products and 5k users (each user may have access to 1 up to 50k products).
In order to display the Products page for the logged in user we use:
SELECT TOP X * FROM [VIEW] WHERE UserID = @UserId -- where 'X' is the size of the page
The query above returns a maximum of 50 rows (maximum page size). The WHERE clause restricts the number of rows to a maximum of 50k (products that the user has access to).
The page is taking about 5 to 7 seconds to load and that is exactly the time the SQL query above takes to run in SQL.
Problem:
The user goes to the Products page and very likely uses paging, re-sorts the results, goes to the details page, etc and then goes back to the list. And every time it takes 5-7s to display the results.
That is unacceptable, but at the same time the business team has accepted that the first time the Products page is loaded it can take 5-7s. Therefore, we thought about CACHING.
We now have two options to choose from, the most "obvious" one, at least to me, is using .Net Caching (in memory / in proc). (Please note that Distributed Cache is not allowed at the moment for technical constraints with our provider / hosting partner).
But I'm not very comfortable with this. We could end up with lots of products in memory (when there are 50 or 100 users logged in simultaneously) which could cause other issues on the server, like .Net constantly removing cache items to free up space while our code inserts new items.
The SECOND option:
The main problem here is that it is very EXPENSIVE to generate the User x Product x Access view, so we thought we could create a flat table (or in other words a CACHE of all products x users in the database). This table would be exactly the result of the view.
However the results can change at any time if new products are added, user permissions are changed, etc. So we would need to constantly refresh the table (which could take a few seconds) and this started to get a little bit complex.
Similarly, we thought we could implement some sort of cache provider: upon a request from a user, we would run the original SQL query and select the products from the view (5-7s, acceptable only once) and save that result in a flat table called ProductUserAccessCache in SQL. On the next request, we would get the values from this cache table (as we could easily tell that the results were cached for that particular user) with a fast query without calculations in SQL.
Any time a product was added or a permission changed, we would truncate the cached-table and upon a new request the table would be repopulated for the requested user.
It doesn't seem too complex to me, but what we are doing here basically is creating a NEW cache "provider".
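The second option described above could look roughly like the following sketch. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for SQL Server, run_expensive_view is a hypothetical callable standing in for the slow view, and only the table name ProductUserAccessCache comes from the text.

```python
import sqlite3

class ProductUserAccessCache:
    """Cache table filled per user on first request, truncated on any change."""

    def __init__(self, conn, run_expensive_view):
        self.conn = conn
        self.run_expensive_view = run_expensive_view  # hypothetical slow query
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ProductUserAccessCache "
            "(UserId INTEGER, ProductId INTEGER)"
        )

    def products_for(self, user_id):
        rows = self.conn.execute(
            "SELECT ProductId FROM ProductUserAccessCache "
            "WHERE UserId = ? ORDER BY ProductId",
            (user_id,),
        ).fetchall()
        if rows:
            return [row[0] for row in rows]  # cache hit: cheap lookup
        # Cache miss: pay the expensive query once, then store the result.
        # Note: a user with zero products is never cached in this sketch.
        product_ids = sorted(self.run_expensive_view(user_id))
        self.conn.executemany(
            "INSERT INTO ProductUserAccessCache (UserId, ProductId) VALUES (?, ?)",
            [(user_id, product_id) for product_id in product_ids],
        )
        return product_ids

    def invalidate(self):
        # Call whenever a product is added or a permission changes
        self.conn.execute("DELETE FROM ProductUserAccessCache")

calls = []

def slow_view(user_id):
    calls.append(user_id)  # record each expensive call
    return [3, 1, 2]

cache = ProductUserAccessCache(sqlite3.connect(":memory:"), slow_view)
first = cache.products_for(42)   # miss: runs the slow view
second = cache.products_for(42)  # hit: served from the cache table
cache.invalidate()
third = cache.products_for(42)   # miss again after invalidation
print(first, len(calls))
```

Truncating the whole table on every change keeps the logic simple, at the cost of making every user pay the slow query again after any edit. Invalidating only the affected users would be a possible refinement.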
Does any one have any experience with this kind of issue?
Would it be better to use .Net Caching (in proc)?
Any suggestions?
We were facing a similar issue some time ago, and we were thinking of using EF caching in order to avoid the delay in retrieving the information. Our problem was a 1-2 second delay. Here is some info that might help on how to cache a table by extending EF. One of the drawbacks of caching is deciding how fresh you need the information to be, so you set your cache expiration accordingly. Depending on that expiration, users might have to wait longer than they would like to get fresh info, but if your users can accept that they might be seeing outdated info in order to avoid the delay, then the tradeoff would be worth it.
In our scenario, we decided we would rather have fresh info than fast info, but as I said before, our waiting period wasn't that long.
Hope it helps

Oracle SQL*loader running in direct mode is much slower than conventional path load

In the past few days I've been playing around with Oracle's SQL*Loader in an attempt to bulk load data into Oracle. After trying out different combinations of options I was surprised to find that the conventional path load runs much quicker than the direct path load.
A few facts about the problem:
Number of records to load is 60K.
Number of records in target table, before load, is 700 million.
Oracle version is 11g r2.
The data file contains date, character (ascii, no conversion required), integer, float. No blob/clob.
Table is partitioned by hash. Hash function is same as PK.
Parallel of table is set to 4 while server has 16 CPU.
Index is locally partitioned. Parallel of index (from ALL_INDEXES) is 1.
There's only 1 PK and 1 index on target table. PK constraint built using index.
Check on index partitions revealed that records distribution among partitions are pretty even.
Data file is delimited.
APPEND option is used.
Select and delete of the loaded data through SQL is pretty fast, almost instant response.
With conventional path, loading completes in around 6 seconds.
With direct path load, loading takes around 20 minutes. The worst run took 1.5 hours to complete, yet the server was not busy at all.
If skip_index_maintenance is enabled, direct path load completes in 2-3 seconds.
I've tried quite a number of options but none of them gives noticeable improvement... UNRECOVERABLE, SORTED INDEXES, MULTITHREADING (I am running SQL*Loader on a multiple CPU server). None of them improve the situation.
Here's the wait event I kept seeing during the time SQL*Loader runs in direct mode:
Event: db file sequential read
P1/2/3: file#, block#, blocks (check from dba_extents that it is an index block)
Wait class: User I/O
Does anyone have any idea what has gone wrong with the direct path load? Or is there anything I can check further to really dig into the root cause of the problem? Thanks in advance.
I guess you are falling foul of this:
"When loading a relatively small number of rows into a large indexed table
During a direct path load, the existing index is copied when it is merged with the new index keys. If the existing index is very large and the number of new keys is very small, then the index copy time can offset the time saved by a direct path load."
from When to Use a Conventional Path Load in: http://download.oracle.com/docs/cd/B14117_01/server.101/b10825/ldr_modes.htm
