ClickHouse insert nested data - ruby

How can I insert the data below into ClickHouse?
tests => [{"test_id"=>1099803, "test_number"=>"35545585544", "test_number_2"=>"123456", "test_source"=>nil}, {"test_id"=>1099804, "test_number"=>"1313113", "test_number_2"=>"654321", "test_source"=>nil}, {...}]

ClickHouse offers a few approaches that can help in your case:
Store the information as JSON. This is slower to query and may take more disk space, and it is less structured, but it gives you full freedom of expression. The column data type will be String, and you can use the JSONExtract* functions to access the data.
Use the Nested data format.
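A minimal sketch of how the two options shape the data before it is sent, written in Python rather than Ruby for brevity. The column name `tests` and both helper names are assumptions made for illustration. A ClickHouse Nested column is written as one parallel array per sub-column, which is what `to_nested_columns` produces; how the arrays are then passed to the server depends on the client library you use.

```python
import json

# Sample rows shaped like the question's `tests` array (values taken from the question).
tests = [
    {"test_id": 1099803, "test_number": "35545585544", "test_number_2": "123456", "test_source": None},
    {"test_id": 1099804, "test_number": "1313113", "test_number_2": "654321", "test_source": None},
]

def to_json_column(rows):
    """Option 1: a single String value holding the whole array as JSON."""
    return json.dumps(rows, separators=(",", ":"))

def to_nested_columns(rows, prefix="tests"):
    """Option 2: one array per sub-column, all of the same length."""
    keys = list(rows[0].keys()) if rows else []
    return {f"{prefix}.{key}": [row.get(key) for row in rows] for key in keys}

json_value = to_json_column(tests)
nested = to_nested_columns(tests)
print(nested["tests.test_id"])
```

Since `test_source` is nil in the sample, its sub-column would presumably need a Nullable type or a default value substituted before the insert.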

Related

What are the typical ways to cache the result of a relational database query using Redis?

What do developers commonly use as the key and value to cache the result from a SQL query into Redis? For example, if I have a Users table, and I want to cache the results from the query:
SELECT name, age FROM Users
1) Which Redis data structure should I use? Should I just have a single Key for the query and store the entire object returned by the database as the Value as such:
{ key: { object returned by database } }
Or should I use Redis' List data structure and loop through the rows individually and push them into the List as such:
{ key: [ ... ]}
Wouldn't this add computation time of O(N)? How is this more effective than just simply storing the object returned by the database?
Or should I use Redis' Hash Map data structure and loop through the rows individually and set a unique Key for each row with its corresponding attributes as such:
{ key1: {name: 'Bob', age: 25} }, { key2: {name: 'Sally', age: 15} }, ...
2) What would be a good rule of thumb with regards to the Key? From my understanding, some people just use the SQL query as the Key? But if you do so, does that mean you would have to store the entire object returned by the database as the Value (as per question 1)? Is this the best way to do it? If you are using an ORM, do you still use the SQL query as the key?
This is nicely analyzed in the Database Caching Strategies Using Redis whitepaper, by AWS.
Here are the options discussed in the document. Which one is best is really a design decision based on the tradeoffs you have to make for your specific use case.
Cache the Database SQL ResultSet
Cache a serialized ResultSet object that contains the fetched database row.
Pro: When data retrieval logic is abstracted (e.g., as in a Data Access Object or DAO layer), the consuming code expects only a ResultSet object and does not need to be made aware of its origination. A ResultSet object can be iterated over, regardless of whether it originated from the database or was deserialized from the cache, which greatly reduces integration logic. This pattern can be applied to any relational database.
Con: Data retrieval still requires extracting values from the ResultSet object cursor and does not further simplify data access; it only reduces data retrieval latency.
Cache Select Fields and Values in a Custom Format
Cache a subset of a fetched database row into a custom structure that can be consumed by your applications.
Pro: This approach is easy to implement. You essentially store specific retrieved fields and values into a structure such as JSON or XML and then SET that structure into a Redis string. The format you choose should be something that conforms to your application’s data access pattern.
Con: Your application is using different types of objects when querying for particular data (e.g., Redis string and database results). In addition, you are required to parse through the entire structure to retrieve the individual attributes associated with it.
Cache Select Fields and Values into an Aggregate Redis Data Structure
Cache the fetched database row into a specific data structure that can simplify the application’s data access.
Pro: When converting the ResultSet object into a format that simplifies access, such as a Redis Hash, your application is able to use that data more effectively. This technique simplifies your data access pattern by reducing the need to iterate over a ResultSet object or by parsing a structure like a JSON object stored in a string. In addition, working with aggregate data structures, such as Redis Lists, Sets, and Hashes provide various attribute level commands associated with setting and getting data, eliminating the overhead associated with processing the data before being able to leverage it.
Con: Your application is using different types of objects when querying for particular data (e.g., Redis Hash and database results).
Cache Serialized Application Object Entities
Cache a subset of a fetched database row into a custom structure that can be consumed by your applications.
Pro: Use application objects in their native application state with simple serializing and deserializing techniques. This can rapidly accelerate application performance by minimizing data transformation logic.
Con: Advanced application development use case
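To make the difference between the "custom format" and "aggregate data structure" options concrete, here is a small sketch of the two shapes the same rows could take before being written to Redis. It only builds the values and does not talk to a server. The `id` column, the `users` key prefix, and both function names are assumptions for illustration; the question's query selects only `name` and `age`.

```python
import json

# Hypothetical rows; the `id` column is assumed so each row has a unique key.
rows = [
    {"id": 1, "name": "Bob", "age": 25},
    {"id": 2, "name": "Sally", "age": 15},
]

def as_string_entry(query_key, rows):
    """Custom format: the whole result as one JSON string under one key (SET/GET)."""
    return query_key, json.dumps(rows)

def as_hash_entries(table, rows, id_field="id"):
    """Aggregate structure: one hash per row, keyed by table and primary key (HSET/HGET)."""
    entries = {}
    for row in rows:
        key = f"{table}:{row[id_field]}"
        # Redis hash fields and values are strings, so convert explicitly.
        entries[key] = {field: str(value) for field, value in row.items() if field != id_field}
    return entries

string_key, string_value = as_string_entry("users:all", rows)
hash_entries = as_hash_entries("users", rows)
print(hash_entries["users:1"])
```

With the string form the client must parse the whole payload to read one attribute; with the hash form a single field can be fetched directly, at the cost of one write per row.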
Regarding 2)
What would be a good rule of thumb with regards to the Key?
Using the SQL query as the key is OK as long as you are sure it is unique. Add prefixes if there is a risk of collisions: you may have other databases with the same table names, leading to identical queries. Also normalize the case (all lower case or all upper case), because Redis keys are case-sensitive.
But if you do so, does that mean you would have to store the entire object returned by the database as the Value (as per question 1)?
Not necessarily; it comes down to what processing you are doing with the query result. Chances are some results are best stored as the raw object for processing, some as a JSON-stringified object to return quickly to the client, some as rows, etc. The best approach is to adapt accordingly.
Is this the best way to do it?
Not necessarily.
If you are using an ORM, do you still use the SQL query as the key?
You may, if your ORM easily exposes the SQL query programmatically and the generated SQL is consistent.
I wouldn't get fixated on the idea of using the SQL query as the key. Use something you can be sure is consistent, that suits your processing, and that gives you clear rules for invalidation. It could be the method call with its parameters, the web API call, etc.
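The key advice above (prefix, consistent case, something stable to invalidate) can be sketched as follows. This is an illustration under stated assumptions, not a recommended implementation: the function names are hypothetical, a plain dict stands in for the Redis client, and lower-casing the SQL text is only safe when literals are passed as parameters rather than embedded in the query string.

```python
import hashlib
import json

def cache_key(namespace, sql, params=()):
    """Build a stable key: prefix plus a hash of the normalized query and its parameters."""
    normalized = " ".join(sql.split()).lower()  # collapse whitespace, fix the case
    payload = json.dumps([normalized, list(params)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{namespace}:query:{digest}"

def cached_query(cache, namespace, sql, params, run_query):
    """Cache-aside: return the cached value, or run the query and store its result."""
    key = cache_key(namespace, sql, params)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_query(sql, params)
    cache[key] = json.dumps(result)  # stored as a JSON string, as in the custom-format option
    return result
```

The namespace plays the role of the prefix that separates databases with identical table names, and hashing keeps the key short regardless of query length.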

Is aggregating outside of Hive a better choice?

I have more of a conceptual question. I'm using Hive to pull data, and then I want to insert all the retrieved values into IBM BigSQL (basically DB2) so that aggregating the data would be easier and faster. So I want to create a view in Hive and run a nightly CTAS against it, so that I can take the resulting table, migrate it to DB2, and do the rest of the aggregation there.
Is there a better practice?
I wanted to do everything, including the aggregation, in Hive, but it is extremely slow.
Thanks for your suggestions!
Considering that you are using Cloudera, is there a reason why you don't perform the aggregations in Impala? Converting the JSON data to Parquet (which I would recommend if there is not a lot of nested structure) shouldn't be very expensive. Another alternative, depending on the kind of aggregations you are doing, is to use Spark to convert the data (this will also depend a lot on your cluster size). I would like to give you more specific hints, but without knowing what aggregations you are doing that is difficult.

How do we define an HBase row key so that records are retrieved efficiently when the table has millions of records?

I have 30 million records in a table, but finding a single record takes too much time. Could you suggest how I should generate the row key so that records can be fetched quickly?
Right now I use an auto-increment ID (1, 2, 3, and so on) as the row key. What steps do I need to take to improve performance? Let me know your thoughts.
Generally, when tuning performance for a structured SQL table, we follow some basic steps: apply proper indexes to the columns used in queries, apply proper logical partitioning or bucketing to the table, and give the buffer enough memory for complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problems come from context switching between hard disk and buffer, and from context switching between different servers. You need to reduce that context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query structure and try to improve performance.
If you are using an integer row key, it is going to give the best performance, but always create the row key/index when the table is first created, because adding it later kills performance.
When creating external tables in Hive/Impala against HBase tables, map the HBase row key to a string column in Hive/Impala. If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table. Use BETWEEN or =, <, >=.
If you are not filtering on the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of data.
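As an illustration of a row key that "contains the information you need", the sketch below builds a composite key and emulates a start/stop range scan over keys kept in lexicographic order, which is how HBase orders rows. The fields (`customer_id`, `order_date`, `order_id`), the `#` separator, and the function names are all hypothetical; a plain sorted list stands in for the table.

```python
import bisect

def make_row_key(customer_id, order_date, order_id):
    """Composite key with fixed-width numbers so byte order matches logical order."""
    return f"{customer_id:010d}#{order_date}#{order_id:010d}"

def prefix_scan(sorted_keys, prefix):
    """Emulate a range scan: only keys between the start and stop bounds are touched."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after every character used in the keys, giving an upper bound.
    stop = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:stop]

keys = sorted(
    make_row_key(c, d, o)
    for c, d, o in [(42, "2020-01-05", 7), (42, "2020-02-11", 9), (7, "2020-01-05", 3)]
)
january_orders = prefix_scan(keys, make_row_key(42, "2020-01", 0)[:18])
print(january_orders)
```

A query such as "all orders of customer 42 in January 2020" then becomes a bounded range on the key (the BETWEEN style recommended above) instead of a full-table filter.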

How to perform a bulk insert with Riak?

How do I insert millions of keys into a Riak bucket?
Inserting them one at a time takes too long.
Ideally I'd like something like MySQL's "LOAD DATA INFILE".
AFAIK Riak does not have a bulk insert mode. You can improve performance by using the Protocol Buffers interface.
The README at https://github.com/basho/riak-ruby-client also indicates that map types support batch operations.

Serializing objects as BLOBs in Oracle

I have a HashMap that I am serializing and deserializing to an Oracle db, in a BLOB data type field.
I want to perform a query, using this field.
For example, the application will make a new HashMap with some key-value pairs.
I want to query the db to see if a HashMap with this data already exists in the db.
I do not know how to do this. It seems strange if I have to go to every record in the db, deserialize it, and then compare. Does SQL handle comparing BLOBs, so that I could have select * from PROCESSES where foo = ? where foo is a BLOB type and the ? is an instance of the new HashMap?
Thanks
Here's an article for you to read: Pounding a Nail: Old Shoe or Glass Bottle
I haven't heard much about your application's underlying architecture, but I can tell you immediately that there is never a reason why you should need to use a HashMap in this way. It's a bad technique, plain and simple.
The answer to your question is not a clever Oracle query, it's a redesign of your application's architecture.
For a start, you should not serialize a HashMap to a database (more generally, you shouldn't serialize anything that you need to query against). It's much easier to create a table to represent hashmaps in your application as follows:
HashMaps
--------
MapID (pk int)
Key (pk varchar)
Value
Once you have the content of your hashmaps in your database, it's trivial to query the database to see if the data already exists or produce any other kind of aggregate data:
SELECT Count(*) FROM HashMaps where MapID = ? AND Key = ?
Storing serialized objects in a database is almost always a bad idea, unless you know ahead of time that you don't need to query against them.
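The query above checks a single key. To answer the original question ("does a map with exactly this data already exist?") against the same table layout, something like the following sketch could be used. It runs on an in-memory SQLite database purely so that it is self-contained; this is an assumption for illustration, not Oracle syntax, and the helper names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real database; table layout follows the answer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HashMaps (MapID INTEGER, Key TEXT, Value TEXT, PRIMARY KEY (MapID, Key))"
)

def save_map(conn, map_id, mapping):
    """Store one row per key-value pair of the map."""
    conn.executemany(
        "INSERT INTO HashMaps (MapID, Key, Value) VALUES (?, ?, ?)",
        [(map_id, k, v) for k, v in mapping.items()],
    )

def find_matching_maps(conn, mapping):
    """Return MapIDs whose stored pairs are exactly the pairs in `mapping`."""
    if not mapping:
        return []
    pair_match = " OR ".join("(Key = ? AND Value = ?)" for _ in mapping)
    pair_params = [x for pair in mapping.items() for x in pair]
    sql = f"""
        SELECT MapID FROM HashMaps
        GROUP BY MapID
        HAVING COUNT(*) = ?
           AND SUM(CASE WHEN {pair_match} THEN 1 ELSE 0 END) = ?
        ORDER BY MapID
    """
    n = len(mapping)
    # A map matches when it has n pairs in total and all n of them are the wanted pairs.
    rows = conn.execute(sql, [n, *pair_params, n]).fetchall()
    return [row[0] for row in rows]
```

Because the comparison happens per pair, insertion order and serialization format no longer matter.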
How are you serializing the HashMap? There are lots of ways to serialize data and an object like a HashMap. Comparing two maps, especially in serialized form, is not trivial, unless your serialization technique guarantees that two equivalent maps always serialize the same way.
One way you can get around this mess is to use XML serialization for some objects that rarely need to be queried. For example, where I work we have a log table where a certain log message is stored as an XML file in a CLOB field. This xml data represents a serialized Java object. Normally we query against other columns in the record, and only read/write the blob in single atomic steps. However once or twice it was necessary to do some deep inspection of the blob, and using XML allowed this to happen (Oracle supports querying XML in varchar2 or CLOB fields as well as native XML objects). It's a useful technique if used sparingly.
Look into dbms_crypto.hash to make a hash of your blob. Store the hash alongside the blob and it will give you something to narrow down the search to something manageable. I'm not recommending storing the hash map, but this is a general technique for searching for an exact match between blobs.
See also SQL - How do you compare a CLOB
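The hash-alongside-the-blob idea only works if equal maps always serialize to the same bytes, as noted earlier. A minimal sketch of that combination, computed in application code: JSON with sorted keys is an assumed canonical format, and SHA-256 from Python's hashlib stands in for whatever algorithm dbms_crypto.hash would be asked to use. The function names are hypothetical.

```python
import hashlib
import json

def canonical_bytes(mapping):
    """Serialize so that equal maps always yield identical bytes (keys are sorted)."""
    return json.dumps(mapping, sort_keys=True, separators=(",", ":")).encode("utf-8")

def content_hash(mapping):
    """Digest to store in an indexed column next to the blob for exact-match lookups."""
    return hashlib.sha256(canonical_bytes(mapping)).hexdigest()

same_a = content_hash({"host": "db1", "port": 1521})
same_b = content_hash({"port": 1521, "host": "db1"})
different = content_hash({"host": "db2", "port": 1521})
print(same_a == same_b, same_a == different)
```

A lookup would then filter on the hash column first and compare the full blobs only for the few rows that share the digest.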
I cannot disagree, but I'm being told to do so.
I appreciate your solution, and that's sort of what I had previously.
Thanks
I haven't had the need to compare BLOBs, but it appears that it's supported through the dbms_lob package.
See dbms_lob.compare() at http://www.psoug.org/reference/dbms_lob.html
Oracle can have new data types defined with Java (or .NET on Windows); you could define a data type for your serialized object and define how queries work on it.
Good luck if you try this...
If you serialize your data to XML and store it as XML, you can then use XPath expressions within your SQL query. (Sorry, as I am more of a SQL Server person, I don't know the details of how to do this in Oracle.)
If you EVER need to update only part of the serialized data, don't do this.
Likewise, if any of the data is pointed to by other data or points to other data, don't do this.

Resources