PyTables time performance

I'm working on a project related to text detection in natural images. I have to train a classifier, and I'm using PyTables to store the information. I have:
62 classes (a-z,A-Z,0-9)
Each class has between 100 and 600 tables
Each table has 1 single column to store a 32bit Float
Each table has between 2^2 and 2^8 rows (depending on parameters)
My problem is that after I train the classifier, it takes a long time to read the information back during testing. For example: one database has 27,900 tables (62 classes * 450 tables per class) with 4 rows per table, and it took approximately 4 hours to read and retrieve all the information I need. The test program reads each table 390 times for classes A-Z and a-z, and 150 times for classes 0-9, to get all the info I need. Is that normal?
I tried using the index option on the single column, but I don't see any performance improvement. I work in a virtual machine with 2GB RAM on an HP Pavilion dv6 (4GB DDR3 RAM, Core 2 Duo).

This is likely because column lookup on tables is one of the slower operations you can do and this is where ALL of your information lives. You have two basic options to increase performance for Tables with many columns and few rows:
Pivot this structure such that you have a Table with many rows and few columns.
Move to a more efficient data structure like a CArray or EArray for every row / column.
Additionally, you can try using compression to speed things up. This is sort of generic advice, because you haven't included any code.
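As a rough sketch of the pivot in option 1 (hypothetical column names; the post includes no code), the 27,900 single-column tables can be collapsed into one table whose rows carry class, sample, and row index as ordinary columns, so one vectorized selection replaces thousands of per-table reads. The sketch below uses a NumPy structured array for the layout; in PyTables the same dtype can be passed as a Table description:

```python
import numpy as np

# Pivoted layout: one wide table instead of 27,900 one-column tables.
# Column names are hypothetical -- the original post shows no schema.
dtype = np.dtype([
    ("label", "S1"),       # class: a-z, A-Z, 0-9
    ("sample", np.int32),  # table index within the class (0..449)
    ("row", np.int32),     # row index within the original table
    ("value", np.float32),
])

# Simulate flattening 2 classes x 3 samples x 4 rows into one array.
records = [
    (lbl, s, r, float(s * 4 + r))
    for lbl in (b"a", b"b")
    for s in range(3)
    for r in range(4)
]
data = np.array(records, dtype=dtype)

# One vectorized selection replaces re-opening hundreds of tables:
subset = data[(data["label"] == b"a") & (data["sample"] == 1)]
print(len(subset))              # 4 rows for class 'a', sample 1
print(subset["value"].sum())    # 4 + 5 + 6 + 7 = 22.0
```

In PyTables, `Table.read_where` with an equivalent condition string would perform this selection in-kernel, without pulling the whole table into Python first.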

Related

Vertica GET_COMPLIANCE_STATUS() discrepancy

We use Vertica Community Edition which allows us to store data up to 1 TB. Vertica is hosted on-premise and we have allocated 153 GB for it to use, out of which 57 GB (39%) is used so far.
When I run SELECT GET_COMPLIANCE_STATUS(), it shows I have used 0.91 TB (91%) of the allowed disk space. I have executed SELECT AUDIT_LICENSE_SIZE() to make sure we get the latest compliance data.
I am wondering why these numbers do not match.
The disk usage figure reflects the compressed/encoded ROS files on disk.
For the license calculation, Vertica sums the sizes of the uncompressed files; the 1 TB limit for Community Edition refers to the uncompressed size of the data.
Disk usage can vary depending upon how many projections you create.
Note: additional projections do not count toward the license size. Only additional tables and external tables do.
As #minatverma says - the audit size is the uncompressed size of the data.
It's actually the answer to the question of how many terabytes the export files would occupy if you exported all data tables to CSV files, not counting the delimiters and counting 0 bytes for NULL values.
This has only a very theoretical correlation with the size of the ROS files on disk. Vertica is a columnar database: each column, roughly, is one file.
So, if you have, for example, a gender column that can only assume 'M' or 'F', have the projection ordered by this column first, and the column encoded as Run-Length-Encoding (RLE), this file will not occupy more than some twenty bytes - whether the table has 100 or 1 million rows: The value 'F', followed by the integer 500002 (the value occurs so many times), and the value 'M', followed by the integer 499998.
So, you see, they have little to do with each other: in the CSV file, you have one million times 1 byte for that.
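To make the arithmetic concrete, here is a toy run-length-encoding sketch in Python (a simplification for illustration, not Vertica's actual ROS format):

```python
# Toy run-length encoding of a sorted gender column, illustrating why
# on-disk size and audited (uncompressed CSV) size diverge so sharply.
def rle(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

column = ["F"] * 500_002 + ["M"] * 499_998  # 1M rows, sorted by gender
runs = rle(column)
print(runs)  # [['F', 500002], ['M', 499998]]

# Rough sizes: 1 byte per value plus a 4-byte run counter per run,
# versus 1 byte per value in the exported CSV (delimiters excluded).
rle_bytes = sum(1 + 4 for _ in runs)
csv_bytes = len(column)
print(rle_bytes, csv_bytes)  # 10 vs 1000000
```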

Discrepancy in Oracle table ETL processing duration into Data Warehouse – possible reasons with data or table structure?

We have 3 tables in our Oracle database that get ETL'd (via SSIS) into our DW environment out of hours.
Table 1 contains 16 million rows and 52 columns (~550k used blocks), and reportedly takes just ~6 minutes to load into the Data Warehouse.
In contrast, tables 2 and 3 are smaller, each containing <3 million rows and ~45 columns (~200k used blocks each), but they take almost an hour each to load.
There’s no CLOB or BLOB data in any of these tables.
The Data Warehouse team advises me that they're using identical ETL packages to pull the data, and therefore suspects there must be something database-side that's influencing this.
On the premise that tables 2 & 3 are processing atypically slowly, is there anything from a data or table-structure perspective that I should be investigating as a possible factor in this big discrepancy between processing times?

Hive partition scenario and how it impacts performance

I want to ask about the number of Hive partitions and how it impacts performance.
Let me illustrate with a real example:
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it has 5 partition columns.
For one day, that resulted in 250 partitions; with the expected 1-year retention, that will reach around 75K partitions, which I suppose is a huge number. When I checked, Hive can go up to 10K partitions, but beyond that performance degrades (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table break down roughly as:
50% will use the exact order of partition columns.
25% will use only 1-3 of the partition columns and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with 1-month retention? Or would partitioning by start date alone be enough, assuming a normal distribution across the other 4 columns (say 500M rows / 250 partitions, giving about 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
Since the time a 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, that provides the motivation to keep that number low, and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based on filter expression. In either scenarios, if the number of partitions accessed is large there can be significant memory pressure at the HMS server end.
... PartitionPruner [can] first fetch the partition names (instead of partition specs) and throw an exception if number of partitions exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.
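The partition-count arithmetic can be sketched quickly (assuming, as in the question, a constant 250 new partitions per day):

```python
# Estimate Hive MetaStore partition counts for different retention windows,
# assuming a constant 250 new partitions per day (figure from the question).
PARTITIONS_PER_DAY = 250

def partition_count(retention_days):
    return PARTITIONS_PER_DAY * retention_days

for label, days in [("1 month", 30), ("1 year", 364)]:
    print(f"{label}: {partition_count(days):,} partitions")

# A year at 250/day is ~91,000 partitions, far beyond the ~10K range
# where HMS operations historically degraded (see HIVE-13884).
```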

SQLite performance on large tables

EDIT: My fault here was very basic: I did not use a PRIMARY KEY for indexing. To make this thread a bit more useful I added performance data for searching my table with and without indexing for performance comparison.
I'm using sqlite3 in python in an application running both under windows and linux. My database file is currently in the range of 700 MB.
I noticed one particular performance issue related to the number of entries in my largest table. It consists of 10 integer and float columns and one varchar column.
The table has 1.6 million rows. At that size, each SELECT or UPDATE command takes 327 ms. That is by far too long for my application, since it now mainly waits on sqlite.
I noticed that performance increases drastically as table size drops. I found:
1.6M entries 327 ms w/o indexing => 29.7 ms with indexing
670k entries 149 ms w/o indexing => 28.8 ms with indexing
280k entries 71 ms w/o indexing => 28.5 ms with indexing
147k entries 44 ms w/o indexing => 28.0 ms with indexing
19k entries 25 ms w/o indexing => 25.0 ms with indexing
CONCLUSION: with indexing, search times stay almost constant, while search times without indexing rise almost linearly with table size. Only for very small tables is the difference negligible.
When query time scales linearly with table size, your queries are probably doing a full table scan, meaning they have to read all the rows in the table. This generally means they're not using indexes.
We can't tell you what you should index without seeing your schema and queries. You can see what your query is doing by putting EXPLAIN QUERY PLAN in front of it, like EXPLAIN QUERY PLAN SELECT * FROM foo. If you see "SCAN TABLE", that's a full table scan. If you see "USING INDEX", it's using an index.
Make sure that each column in the WHERE (and JOIN, if used) clause of your SELECT and UPDATE appears in an index, or is part of the primary key of your table.
Note also that the performance improvement due to an index depends on the query result staying a constant size. If the number of query results grows linearly with table size, the effect of the index is limited, because the amount of result data transferred back to the application cannot be meaningfully reduced. In that case you may need a deeper performance analysis.
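A minimal runnable sketch of that diagnostic loop with Python's built-in sqlite3 module (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER, name TEXT, score REAL)")
conn.executemany("INSERT INTO foo VALUES (?, ?, ?)",
                 [(i, f"name{i}", i * 0.5) for i in range(1000)])

def plan(query):
    # EXPLAIN QUERY PLAN returns rows whose last column describes each step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " | ".join(r[-1] for r in rows)

query = "SELECT * FROM foo WHERE id = 500"
print(plan(query))  # full scan: "SCAN foo" (older SQLite: "SCAN TABLE foo")

conn.execute("CREATE INDEX idx_foo_id ON foo (id)")
print(plan(query))  # now reports a SEARCH ... USING INDEX idx_foo_id step
```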

Cassandra Wide Vs Skinny Rows for large columns

I need to insert 60GB of data into cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider, my data expires after 24 hours so I am using TTL=86400 to automate expiration
More specific details about my configuration:
CREATE TABLE stuff (
  stuff_id text,
  stuff_column text,
  value blob,
  PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=39600 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4KB value is a set of 1,000 4-byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, a random set of levels within the packed strings will change. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data which is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20-60 values. A request will now typically read the same column across 1-5 different rows.
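For illustration, the packed-value layout described above can be sketched with Python's struct module (names hypothetical; the real code path goes through Pycassa):

```python
import random
import struct

# Pack 1,000 4-byte floats into one ~4KB blob, as described above.
levels = [float(i) for i in range(1000)]
blob = struct.pack("<1000f", *levels)
print(len(blob))  # 4000 bytes

# A typical request needs a random selection of 20-60 of those floats;
# each can be unpacked at its offset without decoding the whole blob.
wanted = random.sample(range(1000), 40)
values = [struct.unpack_from("<f", blob, i * 4)[0] for i in wanted]
assert values == [levels[i] for i in wanted]
```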
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface. The Pycassa batch insert operator was used. Each insert inserts multiple columns to a single row, insert sizes are limited to 12 MB. The queue is flushed at 12MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network on different systems.
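The rate formula above, as a small sketch (the 70-minute runtime is an invented example, not a number from the test):

```python
# Rate calculation used above: MBps = bytes_written / (seconds * 10**6)
def rate_mbps(bytes_written, seconds):
    return bytes_written / (seconds * 10**6)

# e.g. writing one day's 60 GB in ~70 minutes would be ~14.3 MBps,
# in the same ballpark as the measured 11-18 MBps below.
print(round(rate_mbps(60 * 10**9, 70 * 60), 1))  # 14.3
```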
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using TTL is a good idea for an automatic clean-up process.