Should one store a search_data tsvector in the same table or external table? - performance

I am implementing full text search in postgres.
I would like to search all posts in my system. The posts fulltext index is an amalgamation of the post title and post body.
I have two ways of achieving this:
create a tsvector column in the posts table, trigger an update to it.
create a second table (posts_search) with a post_id and tsvector column containing the index data.
create a simple gin index ... (out of the question, cause my real world problem needs data in multiple tables for the index)
What is going to perform better, considering I sometimes need to filter down the search by other attributes in the table (like deleted_at is null and so on).
Is it a better approach to keep the tsvector column in the same table as the data (side effect select * now sucks) or a separate table (side effect, join required, index filtering is complicated)?

In my experiments, typical size of tsvector column is about 1% of the size of text field this tsvector was computed from using to_tsvector().
With this in mind, storing tsvector column in another table should provide performance benefit. For example, even if you do not use SELECT * (and you shouldn't, really), any seqscan in original single table will still have to load pages which contain original text. If you offload tsvector field to separate table, page loading will be faster by 100x.
In other words, I would favor second solution of offloading tsvector field to separate table. Or, alternatively, offloading posts (original text) deeper into your table hierarchy (but I guess it is almost the same thing).
Note that for full text search to work, original text is not necessary. You way want to even not store it in database, or store it in highly compressed format (and not necessarily easily accessible by SQL routines). It would work as long as something can create tsvector based on original text, or update when it changes.

Related

Saving float changes to a float or to a varchar2 column?

I need to save before and after value changes of certain fields of an items table to an items_log table. Changes are saved by an after change trigger on the items table.
Some of the items table columns are varchar2 type and some are number(*) type.
What is the better approach? Saving to separate two before and after number fields and two before and after varchar2 fields? Or conserving space by saving everything to two before and after varchar2 fields?
The purpose of this log table is to record which user changed a field and the before and after values.
Could saving a float value to a string field lead to an unexpected diversion from the original value?
Thanks in advance
"What is the better approach?"
There is no "better" approach. There is only an approach that's good enough for your application. If your table will have a few thousand rows in it, it doesn't really matter. If your table will have a few million rows, then space may be more of a concern.
If your goal is to display to a user what changes occurred to your item and it's not going to see a lot of activity, storing everything as a varchar may be good enough. You probably don't want to store rows for fields that did not change.
I use APC's approach often. The items_log table is the same as the item table, and includes a history id, timestamp, action (I, U, or D), and user along with all the columns of the item row. Everything is maintained by a trigger. There are also built-in Oracle auditing features to do auditing for you.

Set sort on first column as default in TOAD

I'm using Toad for Oracle 12.5 and a little thing anoy me : when I look into the "Data" tab of a table, the row order is all jumbled up.
On any other DB software I used (SQL developper, phpmyadmin, etc), the default data view would retur the rows ordered by the primary key
So, I would like it to automaticly by default sort the data in the "Data" tab of each table to the first column, or even better, to the table primary key.
I've looked in the options but I can't see anything related to this.
Have some of you had the same problem ?
No oracle client that I have seen ever tacks an "order by" onto a statement on its own accord. It returns what the query returns in the order (or lack thereov) that it receives it.
Now it may LOOK ordered if the rows were inserted in order, but that is a fluke. Period.
And frankly, I'd be upset if a UI arbitrarily added expensive sorts to my queries unless I specifically told it to.
I have some BIG tables. presuming that I want the UI to take the time to scan the index and grab the lowest PK values just because I opened the DATA tab? No. Dear me - NO!
If I want it ordered, I will open the sort/filter dialog and specify so, or click on the appropriate column header to sort the results.
ADDITION:
If there ARE some tables where you want this behaviour (I can see the convenience if checking code tables for example), then use the sort/filter dialog on the data grid for that table to set an order by and TOAD will remember that setting for that table in this schema until you remove it. So you CAN set this behaviour where you want and not deal with the performance aspects where you don't.

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on; how many "index" items (or filters if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CF just to get basic data, I am curious to know if it's a good idea to include these above "filters" as columns (compound column segment)?
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.

Having more than 50 column in a SQL table

I have designed my database in such a way that One of my table contains 52 columns. All the attributes are tightly associated with the primary key attribute, So there is no scope of further Normalization.
Please let me know if same kind of situation arises and you don't want to keep so many columns in a single table, what is the other option to do that.
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is to ensure most columns got valid default values (null, today etc). That will simplify inserts.
Also ensure your code always specifies the columns (i.e no "select *"). Any kind of future optimization will include indexes with a subset of the columns.
One approach we used once, is that you split your table into two tables. Both of these tables get the primary key of the original table. In the first table, you put your most frequently used columns and in the second table you put the lesser used columns. Generally the first one should be smaller. You now can speed up things in the first table with various indices. In our design, we even had the first table running on memory engine (RAM), since we only had reading queries. If you need to get the combination of columns from table1 and table2 you need to join both tables with the primary key.
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
Which means that your table is in third normal form (or perhaps BCNF). That being the case it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key but there may also be dependencies between columns: e.g, there are multiple values of COL42 associated with each value of COL23. Join dependencies means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF. And it might well be that even with fifty-two columns you table is already in 5NF. But it's worth checking. Because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" MCT design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL ITEM VALUE
------- ---- -----
1 NAME Joe
1 AGE 44
1 WGT 202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact it kind of behaves like a column organized DB as you only pull in the rows you need (i.e. less I/O) rather that an entire wide row when you only need a few columns (i.e. more I/O).

Enumerate indexes on a Extensible Storage Engine (ESENT) table

Background
I'm writing an adapter for ESE to .NET and LINQ in a Google Code project called eselinq. One important function I can't seem to figure out is how to get a list of indexes defined for a table. I need to be able to list available indexes so the LINQ part can automatically determine when indexes can be used. This will allow much more efficient plans for user queries if appropriate indexes can be found.
There are two related functions for querying index information:
JetGetTableIndexInfo - get index information by tableID
JetGetIndexInfo - get index information by tableName
These only differ in how the related table is specified (name or tableid). It sounds like these would support the function I want but all the info levels seem to require that I already have a certain index to query information for. The only exception is JET_IdxInfoCount, but that only counts how many indexes are present.
JET_IdxInfo with its JET_INDEXLIST sounds plausible but it only lists the columns on a specific index.
Alternatives
I am aware that I could get the index information another way, like annotations on .NET types corresponding to database tables, or by requiring a index mapping be provided ahead of time. I think there's enough introspection implemented to make everything else work out of the box without the user supplying extra information, except for this one function.
Another option may be to examine the system tables to find related index objects, but this is would mean depending on an undocumented interface.
To satisfy this question, I want a supported method of enumerating the indexes (just the name would be sufficient) on a table.
You are correct about JetGetTableIndexInfo and JetGetIndexInfo and JET_IdxInfo. The twist is that the data is returned in a somewhat complex: a temporary table is returned containing a row for the index and then a row for each column in the table. To just get the index names you will need to skip the column rows (the column count is given by the value of the columnidcColumn column in the first row).
For a .NET example of how to decipher this, look at the ManagedEsent project. In the MetaDataHelpers.cs file there is a method called GetIndexInfoFromIndexlist that extracts all the data from the temporary table.

Resources