How to store data that has multiple changing characteristics? - data-structures

I have some data in the following format:
Some keywords
Multiple characteristics for every keyword that change over time, biweekly
I am looking for a way to store this data. I cannot use a normal table as there are multiple characteristics for each word, not just one.
Any help would be appreciated.
I tried creating a different table for each characteristic, so that each table would have the keywords, the date, and that characteristic for each keyword at that specific date. But I was wondering whether there is a data structure where I can store all the characteristics in the same place, maybe something like a 3D data structure.
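For illustration only, here is a hedged sketch of how that "keyword x date x characteristic" shape could be held with a composite (keyword, date) index in Python; the characteristic names ("frequency", "sentiment") and the pandas dependency are assumptions for the example, not something given in the question.

# Hedged sketch: one long-format table keyed by (keyword, date), with one
# column per characteristic. "frequency" and "sentiment" are invented names.
import pandas as pd

records = [
    # keyword,  date,         frequency, sentiment
    ("battery", "2024-01-01", 120,       0.4),
    ("battery", "2024-01-15",  95,       0.1),
    ("charger", "2024-01-01",  40,       0.7),
]
df = pd.DataFrame(records, columns=["keyword", "date", "frequency", "sentiment"])

# A (keyword, date) MultiIndex gives the "3D" view: rows are keyword x date,
# columns are the characteristics.
cube = df.set_index(["keyword", "date"]).sort_index()

print(cube.loc[("battery", "2024-01-15")])   # all characteristics of one keyword on one date
print(cube.loc["battery", "frequency"])      # the biweekly history of one characteristic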

Related

Elasticsearch - Modelling video catalogue information into one index vs multiple indexes

I need to model a video catalogue composed of movies, TV shows, episodes, TV channels and live programs in Elasticsearch. Some of these entities are correlated, some are not.
The attributes of these entities are quite different, even if there are some common ones.
Now, since I may need to query across entities (imagine a customer searching for something that could be a movie, a TV channel or a live event program), is it better to have a single index containing a generic entity marked with a logical type attribute, or to have multiple indexes, one for each entity (movie, show episode, channel, program)?
In addition, some of these entities, like movies, can have metadata attributes in multiple languages.
Coming from a relational database background, I would create different indexes, one for every entity, with a language-variant index for every language. Any suggestion or better approach in order to get good search performance and usability?
Whether to use several indexes or not very much depends on the application, so I cannot provide a definite answer, rather a few thoughts.
In my experience, indexes are more a means to help maintenance and operations than a data-modelling tool. It is, for example, much easier to delete an index than to delete all documents from one source from a bigger index. Or if you support totally separate search applications which do not query across each other's data, different indexes are the way to go.
But when you want to query, as you do, documents across data sources, it makes sense to keep them in one index, if only to have comparable ranking across all items in your index. Make sure to re-use fields across your data that have similar meaning (title, year of production, artists, etc.). For fields unique to a source we usually use prefix-marked field names, e.g. movie_... for movie-only metadata.
As for the languages, you need language-specific fields, like title_en, title_es, title_de. Ideally, at query time, you know your user's language (from the browser, or because they selected it explicitly, ...) and then search in the language-specific fields where available. Be sure to use the language-specific analyzers for these fields, at query time and at index time.
I see a search engine a bit as the dual of a database: A database stores data but can also index it. A search engine indexes data but can also store it. A database tends to normalize the schema to remove redundancy, a search engine works best with denormalized data for query performance.
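To make the single-index advice concrete, here is a hedged sketch using the Python Elasticsearch client (8.x-style calls); the index name, field names, analyzers and prefixes are illustrative assumptions rather than anything prescribed above.

# Hedged sketch: one index for all catalogue entities, with a logical "type"
# field, shared fields, prefixed entity-specific fields, and per-language
# title fields wired to language-specific analyzers. All names are examples.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

mapping = {
    "properties": {
        "type":     {"type": "keyword"},                 # movie, episode, channel, program
        "title_en": {"type": "text", "analyzer": "english"},
        "title_es": {"type": "text", "analyzer": "spanish"},
        "title_de": {"type": "text", "analyzer": "german"},
        "year":     {"type": "integer"},                  # shared field where it applies
        "artists":  {"type": "keyword"},
        "movie_runtime_minutes": {"type": "integer"},     # movie_* prefix for movie-only fields
        "channel_epg_id":        {"type": "keyword"},     # channel_* prefix for channel-only fields
    }
}

es.indices.create(index="catalogue", mappings=mapping)

# Query the user's language-specific field when known, falling back to the others:
es.search(index="catalogue", query={
    "multi_match": {"query": "star wars", "fields": ["title_en^2", "title_es", "title_de"]}
})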

Oracle 11g - Building a Type 2 SCD based on existing historical data in a relational model

I'm an ETL developer who is currently tasked with building a Type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a Type 2 SCD that tracks future changes to the data, but I'm completely useless when it comes to the task at hand.
The relational model is in our ODS. Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes which need to be monitored for changes, each in specific related tables in the relational model. Historical changes must be kept on a daily basis, and if multiple changes to the same attribute occur on the same day, only the last one persists.
How can I tackle this? I'm lost. Thanks in advance.
P.S. we're talking tables with 20-30 million rows and multiple attributes that may change at any given time and therefore must result in a new record in the SCD.
This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).
If you have a table 'Table1' containing 'Key', 'Attribute1', and 'Effective From'/'Effective To' columns, then you can 'explode' that table into a virtual table of the form 'Key', 'Attribute1', 'Date', projecting out one row for every date on which that attribute was current.
(Note that you probably don't want to do this as a ranged join against your date dimension, because that will be a triangular join, i.e. it will perform really badly; you probably need to explode the rows in an ETL tool or programmatically.)
If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables on 'Key' and 'Date' to give you the complete daily snapshot across all of the attribute values.
Then, of course, you need to run this through another process to collapse rows with the same Key, contiguous dates, and all the same attribute values, i.e. 'unexplode' the rows back into 'effective from'/'effective to' form. Note again that this is fundamentally a row-by-row operation (or at the very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all through some .NET/Java code to achieve this.
Given data volumes this will take a while, but should be achievable.
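Here is a hedged, in-memory sketch of the explode/join/unexplode idea in Python; the table shapes and values are invented, and a real implementation would stream from and to Oracle rather than hold everything in lists.

# Hedged sketch of the explode -> join -> collapse approach described above.
# Input rows are (key, value, effective_from, effective_to) tuples.
from datetime import date, timedelta

def explode(rows):
    """Turn ranged rows into one {(key, day): value} entry per day."""
    daily = {}
    for key, value, eff_from, eff_to in rows:
        d = eff_from
        while d <= eff_to:
            daily[(key, d)] = value  # later rows for the same day overwrite earlier ones
            d += timedelta(days=1)
    return daily

def join_daily(*attribute_tables):
    """Combine per-attribute daily dicts into {(key, day): (attr1, attr2, ...)}."""
    keys = set().union(*attribute_tables)
    return {k: tuple(t.get(k) for t in attribute_tables) for k in keys}

def collapse(joined):
    """Unexplode: contiguous days with identical values become (key, values, from, to)."""
    out = []
    for (key, day), values in sorted(joined.items()):
        if out and out[-1][0] == key and out[-1][1] == values \
                and out[-1][3] == day - timedelta(days=1):
            out[-1][3] = day  # extend the open range
        else:
            out.append([key, values, day, day])
    return [tuple(r) for r in out]

# Tiny example with two independently varying attributes:
table1 = [("A", "red",   date(2020, 1, 1), date(2020, 1, 3))]
table2 = [("A", "big",   date(2020, 1, 1), date(2020, 1, 2)),
          ("A", "small", date(2020, 1, 3), date(2020, 1, 3))]
print(collapse(join_daily(explode(table1), explode(table2))))
# -> [('A', ('red', 'big'),   2020-01-01, 2020-01-02),
#     ('A', ('red', 'small'), 2020-01-03, 2020-01-03)]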

Does Core Data/SQLite compress redundant information?

I want to use Core Data (probably with a SQLite backing store) to store a large database. Much of the string data will be the same across numerous rows. Does Core Data/SQLite detect such redundancy and automatically save space in the db files?
Do I need to make sure that the same text in different rows is the same string object before adding it to the db? If so, how do I detect that a new piece of text matches something anywhere in the existing db?
No, Core Data does not attempt to analyze your data to avoid duplication. If you want to save 10 million objects with the same attributes, you'll get 10 million copies.
If you want to avoid creating duplicate instances, you need to do a fetch for matching instances before creating a new one. The general approach, sketched in code below, is:
Fetch objects matching the new data, according to whatever standard indicates a duplicate for your app. Use a predicate with the fetch that contains the attribute(s) that you don't want to duplicate.
If you find anything, either (a) update the instances you find with any new values you have, or (b) if there are no new values, do nothing.
If you don't find anything, create a new instance.
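For illustration, here is the same fetch-before-create pattern sketched with plain sqlite3 in Python rather than Core Data; the table, columns and the notion of "duplicate" are invented for the example.

# Hedged sketch of the fetch-before-create pattern, using plain sqlite3 rather
# than Core Data; the table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, note TEXT)")

def find_or_create(conn, name, note):
    # 1. Fetch rows matching the attribute that defines "duplicate" for the app.
    row = conn.execute("SELECT id, note FROM item WHERE name = ?", (name,)).fetchone()
    if row is not None:
        # 2a. Found: update with any new values (or, 2b, do nothing if none).
        if note is not None and note != row[1]:
            conn.execute("UPDATE item SET note = ? WHERE id = ?", (note, row[0]))
        return row[0]
    # 3. Not found: create a new row.
    return conn.execute("INSERT INTO item (name, note) VALUES (?, ?)", (name, note)).lastrowid

find_or_create(conn, "Alice", "first seen")
find_or_create(conn, "Alice", "updated note")   # updates the existing row, no duplicate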
Application-layer logic can help reduce space at the cost of application complexity.
Say your name field can contain either an integer or a string (SQLite's weak typing makes this easy to do).
If it's a string, that's the name right there.
If it's an integer, look it up in a name table, using the integer as the key.
Of course you have to create that name table, either on the fly as data is inserted, or with a once-in-a-while trawl through the data for new names that are worth surrogating in this way (see the sketch below).
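A hedged sketch of that surrogate-name idea using plain SQLite from Python (not Core Data itself); the table and column names are made up, and it leans on SQLite's weak typing as described.

# Hedged sketch: the "name" column holds either the literal string or an
# integer key into a separate name table. Relies on SQLite's weak typing.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, text TEXT UNIQUE)")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name)")  # 'name' has no declared type

def intern_name(conn, text):
    """Return the surrogate integer for a name, creating it on the fly."""
    row = conn.execute("SELECT id FROM names WHERE text = ?", (text,)).fetchone()
    if row:
        return row[0]
    return conn.execute("INSERT INTO names (text) VALUES (?)", (text,)).lastrowid

def resolve_name(conn, value):
    """Give back the string, whether the column holds text or a surrogate int."""
    if isinstance(value, int):
        return conn.execute("SELECT text FROM names WHERE id = ?", (value,)).fetchone()[0]
    return value

conn.execute("INSERT INTO person (name) VALUES (?)", ("one-off name",))                  # store the string
conn.execute("INSERT INTO person (name) VALUES (?)", (intern_name(conn, "common name"),))  # store the int
for (value,) in conn.execute("SELECT name FROM person"):
    print(resolve_name(conn, value))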

HBase MapReduce and Index

I am crawling data for different industries and storing it in a single HBase table. For example, I crawl the Electronics and Computer industries and store both in a table called 'industry_tbl'. Now I want to run a MapReduce job separately over each set of data, i.e. for the Electronics and Computer industries, and produce reducer output per industry, but currently HBase takes the entire data of both industries and gives me reduced results that I can't differentiate by industry.
Any Help or idea on how to solve this?
Include industry as part of the key you emit in the mapper.
Make industry the most significant (leading) part of your HBase row key, and pass that prefix to the Scan you define for the MapReduce job.
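A hedged sketch of the row-key approach using the happybase Python client; the "electronics|" key layout and the "cf" column family are assumptions, and only the table name comes from the question.

# Hedged sketch: industry is the leading, most significant part of the row key
# ("electronics|<page-id>"), so a prefix scan reads only that industry's rows.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("industry_tbl")

# Write: prefix every row key with the industry.
table.put(b"electronics|page-000123", {b"cf:title": b"Some crawled page"})

# Read only the Electronics slice; a MapReduce job would set the same
# prefix (or start/stop row) on its Scan so each run sees one industry.
for key, data in table.scan(row_prefix=b"electronics|"):
    print(key, data)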
You could also do a column scan on the HBase table.
In order to do that, put all the information for a particular industry under a particular industry column family.
For example, for a given row my industry table would have column families like cf1-science, cf2-technology, etc.
This way, your industry data would be closely partitioned in certain regions, bringing down your query time.
Now I would just query using the Scan API and include a particular column family to scan.
So the scan would return me only the details pertaining to a particular industry.
The row in this case would still remain the same as you would have had it previously.
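A hedged sketch of the column-family variant, again with happybase; the family names and the second table are purely illustrative.

# Hedged sketch: one column family per industry, then restrict the scan (or
# the MapReduce Scan object) to a single family. Family names are examples.
import happybase

connection = happybase.Connection("localhost")

# Column families must exist up front in HBase; create one per industry.
connection.create_table("industry_tbl2", {"electronics": {}, "computers": {}})
table = connection.table("industry_tbl2")

table.put(b"page-000123", {b"electronics:title": b"Some crawled page"})
table.put(b"page-000456", {b"computers:title": b"Another page"})

# Scan only the Electronics family; rows with no data in it are skipped.
for key, data in table.scan(columns=[b"electronics"]):
    print(key, data)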
Hope this explanation helps.

Best-performing method for associating arbitrary key/value pairs with a table row in a Postgres DB?

I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data on a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?
That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
Indexes support the "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.
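A hedged sketch of hstore in use from Python with psycopg2; the connection string, table and keys are assumptions, and CREATE EXTENSION assumes a modern PostgreSQL (on 8.4/9.0 hstore is installed from the contrib scripts instead).

# Hedged sketch: an hstore column holding arbitrary key/value pairs, with a GIN
# index and the "exists" (?) operator. Connection details and names are invented.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=mydb")          # assumed connection string

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")

psycopg2.extras.register_hstore(conn)           # adapt Python dicts <-> hstore values

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS widget (
            id    serial PRIMARY KEY,
            attrs hstore
        )""")
    cur.execute("CREATE INDEX IF NOT EXISTS widget_attrs_idx ON widget USING gin (attrs)")

    # Arbitrary, user-generated keys go straight in as a dict.
    cur.execute("INSERT INTO widget (attrs) VALUES (%s)", ({"color": "red", "size": "XL"},))

    # "exists" operator: rows that have a 'color' key at all...
    cur.execute("SELECT id, attrs FROM widget WHERE attrs ? 'color'")
    print(cur.fetchall())
    # ...or rows where a specific key has a specific value.
    cur.execute("SELECT id FROM widget WHERE attrs -> 'color' = 'red'")
    print(cur.fetchall())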
hstore is generally a good solution for that, but personally I prefer to use plain key:value tables: one table with the key definitions, another table with the values, a relation binding each value to its definition, and a relation binding each value to the particular record in the other table.
Why am I against hstore? Because it is like the registry pattern, which is often cited as an anti-pattern. You can put anything in there; it is hard to validate whether an entry is still needed; and when loading a whole row (especially through an ORM) the entire hstore is loaded, which can contain a lot of junk and very little that makes sense. Not to mention that the hstore has to be converted into your language's types and back again when saving, so you get some type-conversion overhead.
So I'm actually trying to convert all hstores in the company I'm working for into simple key:value tables. It's not that hard a task, though. The structures kept in hstore here are huge (or at least big), and reading/writing such an object creates a huge overhead of function calls, so even a simple query like "select * from base_product where id = 1;" makes the server sweat and hits performance badly. I want to point out that the performance issue is not because of the db, but because Python has to convert the results received from Postgres several times, while key:value tables do not require such conversion.
As you do not control the data, do not try to overcomplicate this.
create table sometable_attributes (
  sometable_id int not null references sometable(sometable_id),
  attribute_key varchar(50) not null check (length(attribute_key) > 0),
  attribute_value varchar(5000) not null,
  primary key (sometable_id, attribute_key)  -- one value per key per row
);
This is like EAV, but without an attribute_keys table, which adds no value if you do not control what will be stored there.
For speed you should periodically do "cluster sometable_attributes using sometable_attributes_idx", so all attributes for one row will be physically close.
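A hedged sketch of reading and writing that table from Python with psycopg2; the connection string is an assumption, and the ON CONFLICT upsert needs PostgreSQL 9.5+ (older versions would use an update-then-insert instead).

# Hedged sketch: load all key/value attributes for one sometable row, and
# upsert a single attribute. The connection string is invented.
import psycopg2

conn = psycopg2.connect("dbname=mydb")

def load_attributes(conn, sometable_id):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT attribute_key, attribute_value "
            "FROM sometable_attributes WHERE sometable_id = %s",
            (sometable_id,))
        return dict(cur.fetchall())

def set_attribute(conn, sometable_id, key, value):
    with conn, conn.cursor() as cur:
        # ON CONFLICT requires PostgreSQL 9.5+; on older versions do update-then-insert.
        cur.execute(
            "INSERT INTO sometable_attributes (sometable_id, attribute_key, attribute_value) "
            "VALUES (%s, %s, %s) "
            "ON CONFLICT (sometable_id, attribute_key) DO UPDATE "
            "SET attribute_value = EXCLUDED.attribute_value",
            (sometable_id, key, value))

set_attribute(conn, 1, "color", "red")
print(load_attributes(conn, 1))   # e.g. {'color': 'red', ...}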
