Generating time series data using cassandra stress test - cassandra-stress

Is it possible to configure the Cassandra stress test tool to generate an insert workload for time-series data? More specifically, can I provide a columnspec property on a timestamp column that will either:
Increment the inserted value as the test progresses (say, x seconds for every record), or
Use the current system time (I'm OK with running the test for a day)?

Why don't you try inserting epoch time? That way you can give the range of long values in your columnspec as below:
columnspec:
  - name: col1
    population: uniform(1497510499000..1497596899000)
This will give you the range from 2017-06-15 07:08:19.000000+0000 to 2017-06-16 07:08:19.000000+0000. You can change the distribution as per your requirements.
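If you need different bounds, a small helper can convert human-readable timestamps into the epoch milliseconds the columnspec expects. A minimal Go sketch (the dates are just placeholders; swap in your own window):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Parse the desired window bounds (placeholders; errors ignored for brevity).
	start, _ := time.Parse(time.RFC3339, "2017-06-15T07:08:19Z")
	end, _ := time.Parse(time.RFC3339, "2017-06-16T07:08:19Z")

	// Print the epoch-millisecond values to paste into the columnspec range.
	fmt.Printf("population: uniform(%d..%d)\n", start.UnixNano()/1e6, end.UnixNano()/1e6)
}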

Related

ClickHouse DateTime with milliseconds

ClickHouse doesn't yet support DateTime with milliseconds.
I saw two possible suggestions for fields like 2019-03-17T14:00:32.296Z:
Multiply by 100 and store it in UInt32/64. How do I multiply by 100 and store it as UInt32?
Store the milliseconds separately. Is there a way to strip the milliseconds, i.e. 2019-03-17T14:00:32.296Z => 2019-03-17 14:00:32?
Thanks for your help!
You should use the DateTime64 type - https://clickhouse.com/docs/en/sql-reference/data-types/datetime64/
In my mind, the main reason ClickHouse does not support milliseconds in DateTime is worse compression.
Long story short: use DateTime with second precision. If you want to store milliseconds anyway, there are two ways to go:
Store the milliseconds separately. You keep a DateTime with your date, which you can use in all the usual DateTime functions as well as in primary keys, and put the millisecond part into a separate column of type UInt16. You have to prepare the data before storing it, and how you do that depends on the language you use for preprocessing. In Go it could be done with:
time.Now().UnixNano() / 1e6 % 1e3
The other way is to store the whole thing as a timestamp. This means you convert your date to a Unix timestamp with milliseconds yourself and put it into ClickHouse as a UInt64. Again, it depends on what you use to prepare the inserts. In Go it could look like:
time.Now().UnixNano() / 1e6
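Putting the two options together, a minimal Go sketch of that preprocessing step might look like the following (variable names are purely illustrative):

package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.Now()

	// Option 1: a DateTime column (seconds) plus a separate UInt16 column for milliseconds.
	seconds := t.Unix()                // value for the DateTime column
	millis := t.UnixNano() / 1e6 % 1e3 // value for the UInt16 milliseconds column

	// Option 2: a single UInt64 column holding the whole Unix timestamp in milliseconds.
	epochMs := t.UnixNano() / 1e6

	fmt.Println(seconds, millis, epochMs)
}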

Oracle DB: Convert String(Time stamp) into number(minutes)

I am trying to build a query against the RMAN catalogue (using RC_RMAN_BACKUP_JOB_DETAILS) to compare the most recent backup duration (TIME_TAKEN_DISPLAY) for each database (DB_NAME) with its historical average backup duration.
How do I convert TIME_TAKEN_DISPLAY (a timestamp in HH:MM:SS format, stored as VARCHAR2) into minutes, i.e. a plain number, so that I can run the query against the entire RC_RMAN_BACKUP_JOB_DETAILS view and compare the average time taken in the past with the time taken by the last backup for each DB?
One thing that may work is converting the string (TIME_TAKEN_DISPLAY) to a time value and then to a number of minutes, but this would be highly inefficient.
The solution can be pretty simple or quite complex, depending on the requirements.
One simple solution is:
select avg(substr(TIME_TAKEN_DISPLAY, 0,2)*60 + substr(TIME_TAKEN_DISPLAY, 4,2) + substr(TIME_TAKEN_DISPLAY, 7,2)/60) from RC_RMAN_BACKUP_JOB_DETAILS;
Using type-casting functions:
Cast TIME_TAKEN_DISPLAY to a time value using TO_TIMESTAMP and then to a number using TO_NUMBER. I did not want to take this approach because I plan to run my scripts against all databases logged in the view, and the repeated casting would make performance poor.
But as per @Alex Poole's comment, I will be using the ELAPSED_SECONDS field instead, as it is already available in seconds and has a NUMBER data type.
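For reference, the same HH:MM:SS-to-minutes arithmetic outside the database could look like this minimal Go sketch (purely illustrative, not part of the original answer):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minutesFromDisplay converts a "HH:MM:SS" string to minutes,
// mirroring the SUBSTR arithmetic in the query above.
func minutesFromDisplay(s string) (float64, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 3 {
		return 0, fmt.Errorf("unexpected format: %q", s)
	}
	h, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, err
	}
	m, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, err
	}
	sec, err := strconv.Atoi(parts[2])
	if err != nil {
		return 0, err
	}
	return float64(h)*60 + float64(m) + float64(sec)/60, nil
}

func main() {
	mins, _ := minutesFromDisplay("01:30:30")
	fmt.Println(mins) // prints 90.5
}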

AWS Kinesis Stream Aggregating Based on Time Spans

I currently have a Kinesis stream that is populated with JSON messages that are in the form of:
{"datetime": "2017-09-29T20:12:01.755z", "payload":"4"}
{"datetime": "2017-09-29T20:12:07.755z", "payload":"5"}
{"datetime": "2017-09-29T20:12:09.755z", "payload":"12"}
etc...
What I'm trying to accomplish here is to aggregate the data into time chunks. In this case, I'd like to compute averages over 10-minute spans. For example, from 12:00 to 12:10, I want to average the payload values and save the result as the 12:10 value.
For example, the above data would produce:
Datetime: 2017-09-29T20:12:10.00z
Average: 7
The method I'm thinking of is to use caching at the service level plus some way of tracking time. Whenever messages move into the next 10-minute timespan, I average the cached data, store the result in the DB, and then delete that cache value.
Currently, my service sees 20,000 messages every minute, with higher volume expected in the future. I'm a little stuck on how to implement this in a way that guarantees I get all the values for a given 10-minute period from Kinesis. For those of you more familiar with Kinesis and AWS, is there a simple way to go about this?
The reason for doing this is to shorten query times for data spanning long periods, such as a year. I wouldn't want to fetch millions of values, but rather a few aggregated ones.
Edit:
I have to keep track of many different averages at the same time. For example, the above JSON may pertain to just one 'set', such as the average temperature per city in 10-minute timespans. This requires me to keep track of each city's average for every timespan.
Toronto (12:01 - 12:10): average_temp
New York (12:01 - 12:10): average_temp
Toronto (12:11 - 12:20): average_temp
New York (12:11 - 12:20): average_temp
etc...
This could pertain to any city worldwide. If a new temperature arrives for, say, Toronto and it pertains to the 12:01 - 12:10 timespan, I have to recalculate and store that average.
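To make the windowing concrete, here is a minimal in-memory sketch in Go of how an event's timestamp could map to its 10-minute bucket and how a running average per city and bucket could be maintained (field and type names are assumptions, not from the original question):

package main

import (
	"fmt"
	"time"
)

// bucketEnd rounds a timestamp up to the end of its 10-minute window,
// e.g. 20:12:07 -> 20:20:00, so events in [12:00, 12:10) land in the 12:10 bucket.
func bucketEnd(t time.Time) time.Time {
	return t.Truncate(10 * time.Minute).Add(10 * time.Minute)
}

// runningAvg keeps count and sum per (city, bucket) key so the average
// can be updated incrementally as events arrive.
type runningAvg struct {
	count int64
	sum   float64
}

type event struct {
	City    string
	Time    time.Time
	Payload float64
}

func main() {
	averages := map[string]*runningAvg{}

	events := []event{
		{"Toronto", time.Date(2017, 9, 29, 20, 12, 1, 0, time.UTC), 4},
		{"Toronto", time.Date(2017, 9, 29, 20, 12, 7, 0, time.UTC), 5},
		{"Toronto", time.Date(2017, 9, 29, 20, 12, 9, 0, time.UTC), 12},
	}

	for _, e := range events {
		key := fmt.Sprintf("%s|%s", e.City, bucketEnd(e.Time).Format(time.RFC3339))
		agg, ok := averages[key]
		if !ok {
			agg = &runningAvg{}
			averages[key] = agg
		}
		agg.count++
		agg.sum += e.Payload
	}

	for key, agg := range averages {
		fmt.Printf("%s average=%.1f\n", key, agg.sum/float64(agg.count))
	}
}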
This is how I would do it. Thanks for the interesting question.
Kinesis Streams --> Lambda (Event Insertor) --> DynamoDB (Streams) --> Lambda (Count and Value Incrementor) --> DynamoDB (Streams) --> Lambda (Average Updater)
DynamoDB Table Structure:
{
  Timestamp: 1506794597
  Count: 3
  TotalValue: 21
  Average: 7
  Event{timestamp}-{guid}: { event }
}
timestamp -- the timestamp of the actual event
guid -- avoids collisions between events that occurred at the same time
Event{timestamp}-{guid} -- this attribute should be removed by the (count and value incrementor)
When, say, the fourth record for that timespan arrives:
Round the time to its 10-minute timespan, increment the count, and increment the total value. Never read the value and then increment it; that will give wrong results unless you use strongly consistent reads (which are costly). Instead, perform the increment as an atomic counter update.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
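As a rough illustration of that atomic-counter step, a sketch using the AWS SDK for Go v1 might look like the following (the table name and attribute names mirror the example structure above and are assumptions):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// incrementBucket atomically adds one event's value to the 10-minute bucket item,
// so concurrent Lambda invocations never have to read-modify-write.
func incrementBucket(svc *dynamodb.DynamoDB, table string, bucketTs int64, value int64) error {
	_, err := svc.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String(table),
		Key: map[string]*dynamodb.AttributeValue{
			"Timestamp": {N: aws.String(fmt.Sprintf("%d", bucketTs))},
		},
		// ADD is an atomic increment: Count and TotalValue are updated server-side.
		UpdateExpression: aws.String("ADD #c :one, #t :val"),
		ExpressionAttributeNames: map[string]*string{
			"#c": aws.String("Count"),
			"#t": aws.String("TotalValue"),
		},
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":one": {N: aws.String("1")},
			":val": {N: aws.String(fmt.Sprintf("%d", value))},
		},
	})
	return err
}

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))
	// Table name, bucket timestamp, and value are placeholders matching the example above.
	if err := incrementBucket(svc, "EventAverages", 1506794597, 4); err != nil {
		log.Fatal(err)
	}
}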
Create a DynamoDB stream from the above table and listen to it with another Lambda. Now calculate the average value and update it.
When you calculate the average, don't read from the table. The data you need is already available in the stream record; just calculate the average and update it (overwriting the previous average value).
This will work at any scale and with high availability.
Hope it helps.
EDIT 1:
Since the OP is not familiar with AWS services:
Lambda Documentation:
https://aws.amazon.com/lambda/
DynamoDB Documentation:
https://aws.amazon.com/dynamodb/
AWS cloud services used for the solution.

How to optimize Hive query on table with dynamic partitioning

I have to partition a table by date and hour derived from a resultdate field, which is in the format 2/5/2013 9:24:00 AM.
I am using dynamic partitioning with date & hour and doing an insert overwrite along the lines of:
insert overwrite table <target_table> partition (date, hour)
select x, y, z, date, hour
from table1;
I have about 1.5 million records, and it is taking about 4 hours to complete. Is this normal, and what are some ways to optimize it?
Increase the cluster size, otherwise it will take a long time.
This is not normal, unless you are working in a virtual machine with 1 node :) .. Try setting this flag:
set hive.optimize.sort.dynamic.partition=false;
I am not sure why it is set to true by default in some distros.
There are several angles to this:
Check whether the Tez engine can be used to improve your execution time.
Consider changing the storage format; the RC format might help.
Tune hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode to optimal values.
Increasing the cluster size is also good (if feasible).

Importing data incrementally from an RDBMS to Hive/Hadoop using Sqoop

I have an Oracle database and need to import data into a Hive table. The daily import size would be around 1 GB. What would be the better approach?
If I import each day's data as a partition, how can updated values be handled?
For example, suppose I import today's data as a partition, and the next day some fields are updated with new values.
Using --lastmodified we can get the updated values, but where should we send them - to a new partition or to the old (already existing) partition?
If I send them to a new partition, the data is duplicated.
If I want to send them to the already existing partition, how can that be achieved?
Your only option is to overwrite the entire existing partition with 'INSERT OVERWRITE TABLE...'.
The question is: how far back are you going to keep updating the data?
I can think of 3 approaches you can consider:
Decide on a threshold for 'fresh' data, for example '14 days back' or '1 month back'.
Then, each day you run the job, you overwrite partitions (only the ones with updated values) going backwards, up to the decided threshold.
With ~1 GB a day this should be feasible.
Data from before the decided threshold is not guaranteed to be 100% correct.
This scenario is relevant if you know the fields can only change within a certain time window after they were initially set.
Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
Split your daily job into 2 tasks: the new data being written for the run day, and the updated data that needs to be applied backwards. Sqoop is responsible for the new data; take care of the updated data 'manually' (some script that generates the update statements).
Don't use partitions based on time; maybe dynamic partitioning is more suitable for your use case. It depends on the nature of the data being handled.
