Database Schema vs. Data Structure - data-structures

What's the conceptual difference between a database schema and any data structure, in general? Don't both convey the organisation of data for efficiency? Or am I mixing two completely different things?

A database schema is the logical view of the entire database.
Data structures are the specific formats used to store data (files, arrays, trees, etc.).
Another way to think of it is that a DB schema will contain various data structures.
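To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, index, and variable names are made up for illustration. The schema is the logical description you declare, while the engine (or your own program) picks concrete data structures to hold the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema: a logical description of what the database contains.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.execute("CREATE INDEX idx_users_name ON users (name)")

# The schema is itself stored as metadata that can be queried.
schema = [
    row[0]
    for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    )
]

# Behind that logical view, SQLite stores the table and the index as B-trees
# (data structures). In application code, a dict keyed by name would be a
# different data structure serving a similar lookup pattern.
users_by_name = {"Bob": {"id": 1, "age": 25}}
```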

Related

What are the typical ways to cache the result of a relational database query using Redis?

What do developers commonly use as the key and value to cache the result from a SQL query into Redis? For example, if I have a Users table, and I want to cache the results from the query:
SELECT name, age FROM Users
1) Which Redis data structure should I use? Should I just have a single Key for the query and store the entire object returned by the database as the Value as such:
{ key: { object returned by database } }
Or should I use Redis' List data structure and loop through the rows individually and push them into the List as such:
{ key: [ ... ]}
Wouldn't this add computation time of O(N)? How is this more effective than just simply storing the object returned by the database?
Or should I use Redis' Hash Map data structure and loop through the rows individually and set a unique Key for each row with its corresponding attributes as such:
{ key1: {name: 'Bob', age: 25} }, { key2: {name: 'Sally', age: 15} }, ...
2) What would be a good rule of thumb with regards to the Key? From my understanding, some people just use the SQL query as the Key? But if you do so, does that mean you would have to store the entire object returned by the database as the Value (as per question 1)? Is this the best way to do it? If you are using an ORM, do you still use the SQL query as the key?
This is nicely analyzed in the Database Caching Strategies Using Redis whitepaper, by AWS.
Here are the options discussed in the document. Which one is best is really a design decision based on the tradeoffs you have to make for your specific use case.
Cache the Database SQL ResultSet
Cache a serialized ResultSet object that contains the fetched database
row.
Pro: When data retrieval logic is abstracted (e.g., as in a Data Access Object or DAO layer), the consuming code expects only a
ResultSet object and does not need to be made aware of its
origination. A ResultSet object can be iterated over, regardless of
whether it originated from the database or was deserialized from the
cache, which greatly reduces integration logic. This pattern can be
applied to any relational database.
Con: Data retrieval still requires extracting values from the ResultSet object cursor and does not further simplify data access; it
only reduces data retrieval latency.
Cache Select Fields and Values in a Custom Format
Cache a subset of a fetched database row into a custom structure that
can be consumed by your applications.
Pro: This approach is easy to implement. You essentially store specific retrieved fields and values into a structure such as JSON or
XML and then SET that structure into a Redis string. The format you
choose should be something that conforms to your application’s data
access pattern.
Con: Your application is using different types of objects when querying for particular data (e.g., Redis string and database
results). In addition, you are required to parse through the entire
structure to retrieve the individual attributes associated with it.
Cache Select Fields and Values into an Aggregate Redis Data Structure
Cache the fetched database row into a specific data structure that can
simplify the application’s data access.
Pro: When converting the ResultSet object into a format that simplifies access, such as a Redis Hash, your application is able to
use that data more effectively. This technique simplifies your data
access pattern by reducing the need to iterate over a ResultSet object
or by parsing a structure like a JSON object stored in a string. In
addition, working with aggregate data structures, such as Redis Lists,
Sets, and Hashes provide various attribute level commands associated
with setting and getting data, eliminating the overhead associated
with processing the data before being able to leverage it.
Con: Your application is using different types of objects when querying for particular data (e.g., Redis Hash and database results).
Cache Serialized Application Object Entities
Cache a subset of a fetched database row into a custom structure that
can be consumed by your applications.
Pro: Use application objects in their native application state with simple serializing and deserializing techniques. This can
rapidly accelerate application performance by minimizing data
transformation logic.
Con: Advanced application development use case
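As a rough illustration of the difference between the second and third options, here is a sketch that stores the same row once as a JSON string and once as a hash. FakeRedis is a hypothetical in-memory stand-in so the snippet runs without a server; it only mirrors the names of a few redis-py methods (set, get, hset with a mapping keyword, hget). The key names are assumptions as well.

```python
import json


class FakeRedis:
    """Hypothetical in-memory stand-in mirroring a few redis-py method names."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def hset(self, key, mapping):
        self._data.setdefault(key, {}).update(mapping)

    def hget(self, key, field):
        return self._data.get(key, {}).get(field)


cache = FakeRedis()
row = {"name": "Bob", "age": 25}

# Custom format in a Redis string: the whole structure must be parsed
# before a single attribute can be read.
cache.set("user:1:json", json.dumps(row))
age_from_string = json.loads(cache.get("user:1:json"))["age"]

# Aggregate structure (hash): one attribute can be fetched directly.
# Hash values are stored as strings here, as they would be in Redis.
cache.hset("user:1", mapping={"name": "Bob", "age": "25"})
age_from_hash = cache.hget("user:1", "age")
```

Note that a real Redis client returns bytes unless it is configured to decode responses, so the application still has to convert values back to their intended types.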
Regarding 2)
What would be a good rule of thumb with regards to the Key?
Using the SQL query as the key is OK as long as you are sure it is unique. Add prefixes if there is a risk of non-uniqueness: you may have other databases with the same table names, leading to the same queries. Also normalize the case (all lower case or all upper case), because Redis keys are case-sensitive.
But if you do so, does that mean you would have to store the entire object returned by the database as the Value (as per question 1)?
Not necessarily; it comes down to what processing you do with the query result. Chances are some results are best stored as the raw object for further processing, some as a JSON-stringified object to return quickly to the client, some as rows, etc. The best approach is to adapt accordingly.
Is this the best way to do it?
Not necessarily.
If you are using an ORM, do you still use the SQL query as the key?
You may if your ORM easily exposes the SQL Query programmatically, and it is consistent.
I wouldn't get fixated on the idea of using the SQL query as the key. Use something you can be sure is consistent: that will optimize your processing and give you clear rules for invalidation. It could be the method call with its parameters, the web API call, etc.
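A minimal sketch of the key advice above, assuming values are always passed as bound parameters (so lowercasing the SQL text cannot change a literal). The function names, the "appdb" prefix, and the use of a plain dict in place of Redis are all assumptions made for illustration; sqlite3 stands in for the relational database.

```python
import hashlib
import json
import sqlite3


def make_cache_key(namespace, sql, params=()):
    # Normalize case and whitespace so equivalent query text maps to one key.
    normalized = " ".join(sql.lower().split())
    payload = json.dumps([normalized, list(params)])
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # The namespace prefix separates databases that share table names.
    return f"{namespace}:query:{digest}"


def cached_query(conn, cache, namespace, sql, params=()):
    key = make_cache_key(namespace, sql, params)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip the database
    rows = [list(r) for r in conn.execute(sql, params)]
    cache[key] = json.dumps(rows)  # cache miss: store the serialized result
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO Users VALUES (?, ?)", [("Bob", 25), ("Sally", 15)])

cache = {}  # stand-in for Redis
first = cached_query(conn, cache, "appdb", "SELECT name, age FROM Users")
second = cached_query(conn, cache, "appdb", "select name,  age from users")
```

The second call differs only in case and spacing, so it resolves to the same key and is served from the cache. Invalidation (deleting the key when Users changes) is left out of this sketch.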

neo4j: how to import data from Oracle

I have 5 tables: 3 for nodes and 2 for relationships between them (relationship = child). How can I transfer the data from Oracle to Neo4j?
The Neo4j site has an entire documentation section on moving data from relational databases to neo4j. There are a bunch of different possibilities.
The simplest way though is to use your chosen database tools to export your tables to a CSV format, and then to use Cypher's LOAD CSV command to pull the data in.
The data can't be transferred directly, though: your tables represent entities and the relationships between them, so moving the data to Neo4j requires that you decide what you want your graph data model to look like.
Because the flexibility and power that Neo4j gives you will ultimately depend a lot on how you model your data, give this careful consideration before you dump the CSV and try to import it.
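A rough sketch of the CSV route, under several assumptions: the export helper works with any DB-API cursor (sqlite3 stands in for Oracle here so the snippet runs on its own), and the table names, column names, file names, node label, and relationship type are all invented. The Cypher statements are only held as strings; they would be run separately against Neo4j once the CSV files are in its import directory.

```python
import csv
import io
import sqlite3


def export_query_to_csv(cursor, sql, out):
    """Write the result of `sql` to `out` as CSV with a header row."""
    cursor.execute(sql)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor.fetchall())


# Hypothetical Cypher for one node table and one relationship table.
LOAD_PEOPLE = """
LOAD CSV WITH HEADERS FROM 'file:///person.csv' AS row
CREATE (:Person {id: toInteger(row.id), name: row.name})
"""

LOAD_CHILD_RELS = """
LOAD CSV WITH HEADERS FROM 'file:///child_of.csv' AS row
MATCH (child:Person {id: toInteger(row.child_id)})
MATCH (parent:Person {id: toInteger(row.parent_id)})
CREATE (child)-[:CHILD_OF]->(parent)
"""

# Demo with an in-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ann'), (2, 'Ben')")

buf = io.StringIO()
export_query_to_csv(conn.cursor(), "SELECT id, name FROM person ORDER BY id", buf)
```

The header names in the CSV come from the cursor's column names, so the property lookups in the Cypher (row.id, row.name) have to match whatever your driver reports.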

Hive enforces schema during read time?

What is the difference and meaning of these two statements that I encountered during a lecture here:
1. Traditional databases enforce schema during load time.
and
2. Hive enforces schema during read time.
You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis has probably contributed to the explosion of "data science", just because it makes large-scale data analysis easier in general.
A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store.
Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text:
A:B:C~E:F~G:H~~I::J~K~L
There are a couple ways to interpret this. ~ could be the delimiter or maybe : could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but it gets the point across hopefully.
With schema on read, you just load your data into the data store and think about how to parse and interpret later. At the core of this explanation, schema on read means write your data first, figure out what it is later. Schema on write means figure out what your data is first, then write it after.
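The raw line above can be read under several schemas without touching the stored data. A small sketch (the variable names are just for illustration):

```python
raw = "A:B:C~E:F~G:H~~I::J~K~L"

# Interpretation 1: "~" is the field delimiter.
tilde_fields = raw.split("~")

# Interpretation 2: ":" is the field delimiter.
colon_fields = raw.split(":")

# Interpretation 3: "~" separates records, ":" separates fields in a record.
records = [record.split(":") for record in raw.split("~")]
```

The stored bytes are identical in all three cases; only the reader's choice of schema changes.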
There are tradeoffs here. Some of these points are subjective and my own opinion.
Benefits of schema on write:
Better type safety and data cleansing done for the data at rest
Typically more efficient (storage size and computationally) since the data is already parsed
Downsides of schema on write:
You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
Typically you throw away the original data, which could be bad if you have a bug in your ingest process
It's harder to have different views of the same data
Benefits of schema on read:
Flexibility in defining how your data is interpreted at read time
This gives you the ability to evolve your "schema" as time goes on
This allows you to have different versions of your "schema"
This allows the original source data format to change without having to consolidate to one data format
You get to keep your original data
You can load your data before you know what to do with it (so you don't drop it on the ground)
Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
Downsides of schema on read:
Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)
The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
More error prone and your analytics have to account for dirty data
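The tradeoffs can be sketched with a toy dataset containing one dirty row. This is only an illustration with made-up data and function names, not how Hive or any particular database implements it: the schema-on-write path validates once at ingest and keeps typed rows, while the schema-on-read path keeps the raw lines and has to parse and handle bad values on every read.

```python
raw_lines = ["alice,30", "bob,unknown", "carol,41"]


def ingest_schema_on_write(lines):
    """Parse and validate once, at write time; keep only typed rows."""
    table, rejected = [], []
    for line in lines:
        name, age = line.split(",")
        try:
            table.append((name, int(age)))
        except ValueError:
            rejected.append(line)  # the original line is lost unless kept aside
    return table, rejected


def read_schema_on_read(lines):
    """Parse on every read; dirty values surface as None for the caller."""
    for line in lines:
        name, age = line.split(",")
        yield name, (int(age) if age.isdigit() else None)


# Schema on write: the analysis works on clean, already-parsed data.
table, rejected = ingest_schema_on_write(raw_lines)
average_on_write = sum(age for _, age in table) / len(table)

# Schema on read: the analysis itself must account for dirty data.
ages = [age for _, age in read_schema_on_read(raw_lines) if age is not None]
average_on_read = sum(ages) / len(ages)
```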

Is a NoSQL database a good solution for a small application?

I am developing an internal web application that needs a back end. The data stored is not really RDBMS-type data. Currently it is stored as XML documents that the application parses (with XQuery) to display HTML tables and other types of fields.
It is likely that I will have a few more types of XML documents and CSV (comma-separated values) files coming up. Given the scenario, I could always back the data with a MySQL database, modifying the process that generates the XML or CSV to insert straight into the database.
Is a NoSQL database a good choice in this scenario, or is MySQL still better? I do not see any need for clustering, high availability, or distributed processing.
Define "better".
I think the choice should be made based on how relational (MySQL) or document-based (NoSQL) your data is.
A good way to know is to analyze typical use cases. Better yet, write two prototypes and measure.

Data structure functioning like Database in C or C++

Is there a data structure which gives you functions of a database (like insert, update, delete etc)? For example:
create a struct like the database table
store data on it and query on it
selectively delete it
I know that you can do this with a hash table (e.g., the uthash library). But as far as I know, updating only one column element is not easy in a hash table.
Look at SQLite. Rather than a client-server database system, it is essentially a serverless, file-based database library that supports SQL. You link your program against it and it provides functions to perform SQL queries over data files.
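A short sketch of the operations asked about, written with Python's built-in sqlite3 module for brevity (it wraps the same library; a C program would go through calls such as sqlite3_open and sqlite3_exec instead). The table and column names are made up.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")

# Insert rows.
conn.executemany(
    "INSERT INTO items (name, qty) VALUES (?, ?)",
    [("bolt", 10), ("nut", 20), ("gear", 5)],
)

# Update a single column of a single row.
conn.execute("UPDATE items SET qty = ? WHERE name = ?", (12, "bolt"))

# Selectively delete.
conn.execute("DELETE FROM items WHERE qty < ?", (10,))

# Query what is left.
remaining = conn.execute("SELECT name, qty FROM items ORDER BY name").fetchall()
```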
Look at NoSQL data stores, which are the kind of database used by Facebook.
Use C structs to represent rows of data and then trees (or maybe hashes) for indexes. There are a lot of little problems you will need to solve, especially to make all the operations efficient, but this forms the basis for an in-memory table.
For simple things, a tree structure may be enough.
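The rows-plus-indexes idea can be sketched as follows. It is written in Python to keep it short, with a dataclass standing in for the C struct and dicts standing in for the hash or tree indexes; the Row fields and method names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    name: str
    age: int


class Table:
    def __init__(self):
        self.by_id = {}    # primary index: id -> Row
        self.by_name = {}  # secondary index: name -> set of ids

    def insert(self, row):
        if row.id in self.by_id:
            raise KeyError(f"duplicate id {row.id}")
        self.by_id[row.id] = row
        self.by_name.setdefault(row.name, set()).add(row.id)

    def update_age(self, row_id, age):
        # Updating a non-indexed column touches only the row itself.
        self.by_id[row_id].age = age

    def delete(self, row_id):
        # Every index must be kept in sync when a row goes away.
        row = self.by_id.pop(row_id)
        ids = self.by_name[row.name]
        ids.discard(row_id)
        if not ids:
            del self.by_name[row.name]

    def find_by_name(self, name):
        return [self.by_id[i] for i in sorted(self.by_name.get(name, ()))]


table = Table()
table.insert(Row(1, "Bob", 25))
table.insert(Row(2, "Sally", 15))
table.insert(Row(3, "Bob", 40))
table.update_age(1, 26)
table.delete(3)
```

Updating an indexed column (for example, renaming a row) would also require moving the id between index buckets, which is one of the "little problems" mentioned above.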

Resources