How to insert 1 Billion rows in google-bigquery? - sorting

I need to insert 1 billion rows into a Google BigQuery table, but I don't have the data readily available.
I will have to make several million asynchronous HTTP requests (1 asynchronous request = 1,000 rows of ordered data). Each row has a column called ID, and once all requests are completed I need the billion rows in the BigQuery table ordered by ID, because it is time-series data.
The challenge is that the asynchronous calls don't complete in any particular order (and I will also be parallelizing across multiple CPUs). If I insert rows as they come and wait until all billion are inserted, I am afraid that sorting a billion rows at the end might take a lot of time.
One naïve way is to create the ID column with a billion integers beforehand, create empty columns for the other fields, and insert the data by searching for the ID, which I think is also inefficient.
Is there an efficient way of achieving this?
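This is not from the thread, but as a hedged illustration: BigQuery tables have no inherent row order, so one common pattern is to load the rows in whatever order they arrive into a table clustered by ID and apply ORDER BY only at read time. The dataset and table names below are assumptions.

-- Hypothetical dataset/table; clustering co-locates rows with similar IDs.
CREATE TABLE mydataset.timeseries (
  id INT64,
  payload STRING
)
CLUSTER BY id;

-- Rows can be loaded in any order; ordering is applied when reading.
SELECT *
FROM mydataset.timeseries
ORDER BY id;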

Related

Clickhouse: Should i optimize MergeTree table manually?

I have a table like:
create table test (id String, timestamp DateTime, somestring String) ENGINE = MergeTree ORDER BY (id, timestamp)
I inserted 100 records, then inserted another 100 records, and ran a select query:
select * from test
ClickHouse returns 2 parts; each has a length of 100 and is ordered within itself. Then I ran the query optimize table test and it started returning 1 part, with a length of 200, fully ordered. So should I run an optimize query after all inserts, and does it improve select query performance, for example select count(*) from test where id = 'foo'?
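As a hedged aside (not from the answers below), the part layout can also be inspected directly instead of being inferred from select output; this assumes the table lives in the current database.

-- List the active data parts backing the table and their row counts.
SELECT name, rows, active
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'test'
  AND active;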
Merges are eventual and may never happen. It depends on the number of inserts that happened afterwards, the number of parts in the partition, and the size of the parts. If the total size of the input parts is greater than the maximum part size, they will never be merged.
It is very unreasonable to constantly merge down to one part.
The merger does not have that goal. On the contrary, the goal is to reach the minimum number of parts with the smallest number of merges. Merges consume a huge amount of disk and processor resources.
It makes no sense to spend 3 hours merging two 300 GB parts into one 600 GB part. The merger has to read and decompress 600 GB, merge it, compress it, and write it back, and after that the performance of selects will not improve at all, or only minimally.
Usually not; you can rely on ClickHouse's background merges.
Also, ClickHouse does not try to merge all the data in a partition into one part, because "over-optimization" can affect performance too.
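For completeness, a minimal sketch of what forcing a merge looks like; OPTIMIZE ... FINAL is the heavyweight variant that merges everything it can, and as the answers above note it is usually unnecessary.

-- Force a merge of the table's parts (expensive; normally left to background merges).
OPTIMIZE TABLE test FINAL;

-- The query whose performance the question asks about.
SELECT count(*) FROM test WHERE id = 'foo';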

Cassandra Wide Vs Skinny Rows for large columns

I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance, am I better off using:
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
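(As a quick back-of-the-envelope check, not from the thread, the numbers above are consistent:)
100 sets x 150,000 keys x 4 KB = 60,000,000 KB, i.e. roughly 60 GB per day
15,000,000 keys / 86,400 s ≈ 174 writes per second on average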
More specific details about my configuration:
CREATE TABLE stuff (
stuff_id text,
stuff_column text,
value blob,
PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.100000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=39600 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
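With that schema, a single write with the 24-hour TTL looks roughly like the statement below; the row key and column name values are hypothetical, and the blob literal stands in for the packed floats.

INSERT INTO stuff (stuff_id, stuff_column, value)
VALUES ('set_042_t1700000000', 'key_000123', 0x3f8000003fc00000)
USING TTL 86400;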
Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time, as if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, within the set of columns, a random set of levels within the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data that is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read the same column across 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as Bytes_written / (time * 10^6), with time measured in seconds at millisecond precision. Pycassa was used as the Cassandra interface, with the Pycassa batch insert operator: each insert writes multiple columns to a single row, insert sizes are limited to 12 MB, and the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using TTL is a good idea to get an automatic clean-up process.

When is the right time to create Indexes in Oracle?

A brand new application with Oracle as the data store is going to be pushed to production. The database uses the CBO, and I have identified some columns to index. I expect the total number of records in a particular table to reach 4 million after 6 months. After that, very few records will be added and there will not be any updates to the indexed columns; most of the updates will be on non-indexed columns.
Is it advisable to create the indexes now, or should I wait a couple of months?
If the table requires indexes, you will incur a lot of poor performance (full table scans plus actual I/O) once the number of rows in the table grows beyond what can reasonably be kept in the cache. Assume that is 20,000 rows; call it the magic number. You will hit 20,000 rows within a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows with indexed fields. That is a one-time hit. You are trading it against dozens of slower queries and updates for every day that you delay adding the indexes.
The trade-off is largely in favor of adding the indexes right now, especially since we do not know what that magic number (20,000?) really is. It could be larger. Or smaller.
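A minimal sketch of adding an index up front, assuming a hypothetical ORDERS table and CUSTOMER_ID column; ONLINE avoids blocking DML during the build (an Enterprise Edition feature), and gathering statistics afterwards helps the CBO make use of the new index.

-- Hypothetical table and column names.
CREATE INDEX orders_customer_id_ix ON orders (customer_id) ONLINE;

BEGIN
  -- Refresh optimizer statistics, including the new index (cascade => TRUE).
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'ORDERS',
    cascade => TRUE);
END;
/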

Populating a star-schema from a single staging table.

What is the best way of populating a star schema from a single staging table?
The data load is several million rows, and the star schema is one fact table with 10 associated dimension tables.
Scenario 1: do sequential inserts into the dimensions first, and afterwards one big insert into the fact table in which I join the staging table with the updated dimension tables. My biggest concern here is the locking that might occur due to the concurrent inserts into the dimension/fact tables, given the huge amount of data.
Scenario 2: split the data load into smaller batches (10k rows), loop through the entire staging table, and insert the batches in the same manner as described in Scenario 1. The problem I see here is looping through a big table with cursors. Also, if one of the batches fails to insert, I would need to roll back all of the inserts done previously.
Scenario 3: write one big INSERT ALL statement and lock the star schema for the whole duration of the insert. In addition to the locking problems, I would have a complex insert statement that has to hold all of the business logic for the inserts (a nightmare to debug and maintain).
You can try DBMS_PARALLEL_EXECUTE, available in 11g Release 2:
http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/d_parallel_ex.htm#ARPLS233
It works well for splitting a big table into smaller chunks, and it lets you define the degree of parallelism very easily. Don't use parallel hints or INSERT /*+ APPEND */ inside the processing of the chunks, though.
Your assumption that you can load the dimension tables without problems seems too optimistic to me. In my experience, you must cater for the situation where not all of the information needed for the dimension data is valid at load time.
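A rough sketch of the DBMS_PARALLEL_EXECUTE approach mentioned above, with hypothetical staging, fact, and dimension names; the package chunks the staging table by rowid and runs the insert once per chunk, with the rowid range bound to :start_id and :end_id.

BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => 'load_fact');

  -- Split the staging table into rowid ranges of roughly 10,000 rows each.
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => 'load_fact',
    table_owner => USER,
    table_name  => 'STAGING',
    by_row      => TRUE,
    chunk_size  => 10000);

  -- Run the fact insert for each chunk, 4 jobs in parallel.
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
    task_name      => 'load_fact',
    sql_stmt       => 'INSERT INTO fact_sales (customer_key, product_key, amount)
                       SELECT d1.customer_key, d2.product_key, s.amount
                       FROM   staging s
                       JOIN   dim_customer d1 ON d1.customer_id = s.customer_id
                       JOIN   dim_product  d2 ON d2.product_id  = s.product_id
                       WHERE  s.rowid BETWEEN :start_id AND :end_id',
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 4);

  DBMS_PARALLEL_EXECUTE.DROP_TASK(task_name => 'load_fact');
END;
/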

TSQL Merge Performance

Scenario:
I have a table with roughly 24 million records. The table holds pricing history for individual customers and is computed daily; there are on average 6 million records for each day. Every morning the price list is generated and a merge statement is run to reflect the changes in pricing.
The merge statement begins with the previous day's data being inserted into a table variable; that table is then merged into the actual table. The main problem is that the merge statement takes a long time.
My real question centers on the performance of using a table variable vs. a physical table vs. a temp table. What is the best practice for large merges like this?
Thoughts
I'd consider a temp table: temp tables have statistics, which will help, whereas a table variable is always assumed to have one row. Also, the I/O can be shunted onto separate drives (assuming tempdb is on its own storage).
If a single transaction is not required, I'd also split the MERGE into a DELETE, UPDATE, INSERT sequence to reduce the amount of work needed in each action (which reduces the amount of rollback information needed, the amount of locking, etc.).
Temp tables often perform better than table variables for large data sets. Additionally you can put the data into the temp table and then index it.
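A minimal sketch of the temp-table variant, with hypothetical table and column names (the target dbo.PriceHistory and the daily feed dbo.DailyPriceFeed are assumptions, not names from the question):

-- Stage the day's generated prices in a temp table and index it before merging.
SELECT CustomerID, ProductID, PriceDate, Price
INTO #StagedPrices
FROM dbo.DailyPriceFeed;

CREATE CLUSTERED INDEX IX_StagedPrices
    ON #StagedPrices (CustomerID, ProductID, PriceDate);

MERGE dbo.PriceHistory AS tgt
USING #StagedPrices AS src
    ON  tgt.CustomerID = src.CustomerID
    AND tgt.ProductID  = src.ProductID
    AND tgt.PriceDate  = src.PriceDate
WHEN MATCHED AND tgt.Price <> src.Price THEN
    UPDATE SET tgt.Price = src.Price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, ProductID, PriceDate, Price)
    VALUES (src.CustomerID, src.ProductID, src.PriceDate, src.Price);

DROP TABLE #StagedPrices;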
Check whether you have indexes on the tables; indexes are updated every time you add or delete records in the table.
Try removing (or disabling) the indexes before merging the records and then re-creating them after the merge, as sketched below.
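Roughly, with a hypothetical nonclustered index name on the target table, that pattern looks like this:

-- Disable the nonclustered index before the large merge...
ALTER INDEX IX_PriceHistory_Customer ON dbo.PriceHistory DISABLE;

-- ... run the MERGE (or the DELETE/UPDATE/INSERT sequence) here ...

-- ...then rebuild it afterwards.
ALTER INDEX IX_PriceHistory_Customer ON dbo.PriceHistory REBUILD;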
