User_Sessions in Vertica

I have a requirement wherein I have to capture user_session details for the last few months. When I query the user_sessions table, I have information only for the last three or four days. Is there any way we could get the user_sessions details for the last 6 months?
Thank you,
Sadagopan

USER_SESSIONS is a view on top of three different Data Collector (DC) tables. Data Collector tables include info about many events and activities on Vertica, and this info is persisted on disk with a default retention period.
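To see which DC components exist and which tables they feed, you can query the data_collector catalog table (a sketch; the filter is just one way to narrow it down):
SELECT component, table_name
FROM v_monitor.data_collector
WHERE table_name ILIKE '%session%';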
You have two main options to get a 6-month historical view of your sessions:
1. Change the retention period of the relevant DC tables to 6 months.
2. Develop a script or process that runs every few days and merges the content of USER_SESSIONS into a user-defined local table (a sketch follows the option #1 example below).
For option #1 you need to run the API below for each of the DC tables (be careful: this option requires extra disk space on the Vertica side).
SELECT set_data_collector_time_policy('SessionEnds', '6 months'::interval);
SELECT set_data_collector_time_policy('SessionStarts', '6 months'::interval);
SELECT set_data_collector_time_policy('RuntimePriorityChanges', '6 months'::interval);
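For option #2, a minimal sketch of the merge job, assuming a user-defined table named user_sessions_history and session_id as the deduplication key (both illustrative):
-- One-time setup: an empty copy of the view's structure.
CREATE TABLE user_sessions_history AS
SELECT * FROM v_monitor.user_sessions WHERE 1 = 0;

-- Run every few days: copy only the sessions not captured yet.
INSERT INTO user_sessions_history
SELECT us.*
FROM v_monitor.user_sessions us
WHERE NOT EXISTS (
    SELECT 1 FROM user_sessions_history h
    WHERE h.session_id = us.session_id
);
COMMIT;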

Related

Query to prevent booking overlap

I'm building an app in Oracle APEX and trying to find a query that could prevent people from booking a room that is already booked. I managed to find a query that can prevent picking a date that starts or ends in between the booking time, but I can't find how to prevent overlapping. By that I mean if someone books a conference room Feb 2nd to Feb 5th, someone else can still book the same room from Feb 1st to Feb 7th. That is what I'm trying to prevent. Thanks for the help!
Here's my first query
SELECT RES_ID_LOC FROM WER_RES
WHERE (CAST(RES_DATE_ARRIVE AS DATE) < CAST(TRY_RESERVE_START_DATE AS DATE)
    OR CAST(RES_DATE_DEPART AS DATE) > CAST(TRY_RESERVE_START_DATE AS DATE))
  AND (CAST(RES_DATE_ARRIVE AS DATE) < CAST(TRY_RESERVE_END_DATE AS DATE)
    OR CAST(RES_DATE_DEPART AS DATE) > CAST(TRY_RESERVE_END_DATE AS DATE))
The main issue you'll have here is concurrency. Consider this sequence of events (in chronological order):
User 1: runs the overlap check query, sees Room 5 is free, and inserts a row to book it.
User 2: runs the overlap check query, sees Room 5 is free, and inserts a row to book it.
User 1: commits.
User 2: commits.
And voila! You have data corruption, even though the code all ran as you expected.
To avoid this, you'll need some way to lock a resource that multiple sessions might want to book. So let's say you have a ROOMS table (the list of available rooms) and a BOOKINGS table which is a child of ROOMS.
Then your logic will need to be something like:
select * from ROOMS where ROOM_NO = :selected_room for update;
This gives someone exclusive access to the room to check for bookings.
Now you can run your overlap check on that room against the BOOKINGS table. If that passes, then you insert your booking and commit the change to release the lock on the ROOMS row.
As an aside, take care with simply casting strings to dates, because you're at the whim of the default format mask of the database (or session). It is better to use TO_DATE with an explicit, known format mask, as in the sketch below.
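Putting the pieces together, a minimal sketch of the whole pattern, assuming BOOKINGS has columns ROOM_NO, RES_DATE_ARRIVE and RES_DATE_DEPART, with the bind variable names and format mask as illustrative assumptions. Two date ranges overlap exactly when each one starts before the other ends:
-- 1. Serialize access to the room.
select * from rooms
where room_no = :selected_room
for update;

-- 2. Overlap check: an existing booking conflicts when it starts
--    before the new one ends AND ends after the new one starts.
select count(*)
from bookings
where room_no = :selected_room
and res_date_arrive < to_date(:try_reserve_end_date, 'YYYY-MM-DD')
and res_date_depart > to_date(:try_reserve_start_date, 'YYYY-MM-DD');

-- 3. If the count is zero, insert the booking and commit,
--    which releases the lock on the ROOMS row.
insert into bookings (room_no, res_date_arrive, res_date_depart)
values (:selected_room,
        to_date(:try_reserve_start_date, 'YYYY-MM-DD'),
        to_date(:try_reserve_end_date, 'YYYY-MM-DD'));
commit;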

Custom function by date for related tables

"In a crosstab, latest month where Actuals has any data, show Actuals data for that and all previous months. Future months, show Forecast data."
I have two tables, Forecasts and Actuals, and the common columns between them are Team, Month, and Value.
I'd like to show the data in a crosstab with Month as columns and Team as rows. I'm trying to write an expression to do this in the crosstab: for the most recent month where Actuals has any data, I'd like to show Actuals data for that month and all previous months, for all teams. For the following months, I'd like to show Forecast data.
Any suggestions about how to go about this would be appreciated. I'm still piecing together my knowledge :)
Create a third table from transformations:
Create the third table from a pivot on Team & Month (to ensure every possible combination) from the first table.
Add rows by transforming the second table (a pivot on Team & Month as well).
Join the two original tables to your newly created table (which has every possible combination of Team & Month) so that both of your data sets are now in one table.
Now use the third table in your cross table.
If you try using column matching instead of the above method, only dates from the main table will show: the dates are matched, and ones missing from the other table will not display.
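In case it helps to see the selection logic spelled out, here is the same idea as a SQL sketch, assuming two tables named Actuals and Forecasts, each with columns Team, Month and Value (names illustrative; in the tool itself you would use the transformation steps above):
-- Latest month that has any Actuals data.
WITH cutoff AS (
    SELECT MAX(Month) AS last_actual_month FROM Actuals
)
-- Actuals for that month and all previous months...
SELECT a.Team, a.Month, a.Value
FROM Actuals a, cutoff c
WHERE a.Month <= c.last_actual_month
UNION ALL
-- ...and Forecast data for the months after it.
SELECT f.Team, f.Month, f.Value
FROM Forecasts f, cutoff c
WHERE f.Month > c.last_actual_month;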

How do I extract multiple tables (35-40 tables) from an HTML website into one Excel file?

Currently, I am trying to retrieve data from this page: https://www.hdb.gov.sg/cs/infoweb/residential/renting-a-flat/renting-from-the-open-market/rental-statistics . As you can see, there are 4 quarters in a year, and for each quarter there is a different table. I wish to extract the tables, but currently I am unable to automate the process and can only take one. On top of that, I wish to add two columns to the retrieved data table: "Quarter" and "Year". Any suggestions? Attached photos are my workflow and my Excel.
Get the number of years and loop through them (or start with the first year and go up to the last year).
For each year, try to get the data via data scraping (the elements exist, they are just hidden/not expanded; do the data scraping once on one table for data modelling and reuse it within the loop). For the data scraping you need to change the selector to make it usable for all tables, by using the year and the quarter (just a generic example, like * year * quarter *). The columns are the same for all tables.
I haven't seen details within the website menu or within the page; it is good to check whether robots are allowed to scrape the data.
The above would be the quickest way. A more complex alternative uses the Find Children activity.

Track the rows which were updated or encrypted

I want to scrub (or encrypt) the email information in a few tables for rows which are older than a few years.
I am planning to do this as part of a job; the next time the job runs, how can I omit the rows which are already scrubbed or encrypted?
I am looking for an approach with good performance.
"I want to scrub(or encrypt) the email information from a few tables which are older than a few years"
I hope this means you have a date column on these tables which you can use to determine which rows need to be scrubbed. The most efficient way of tackling the job is to track that date in an operational table, recording the most recent date scrubbed.
For example, suppose you have ten years' worth of data and you need to scrub records which are more than four years old. This would work:
update t23
set email = null
where date_created < add_months(sysdate, -48);
But it seems like you want to batch things up. So build a tracking table, which at its simplest would be:
create table tracker (
    last_date_scrubbed date
);
Populate last_date_scrubbed with a really old date, say DATE '2010-01-01'.
Now you can write a query like this:
update t23
set email = null
where date_created < (select last_date_scrubbed + interval '1' year
                      from tracker);
That will clean all records older than 2011. Increment the date in the tracker table by one year and run the query again to clean stuff from 2011. Repeat until you get to your target state of cleanliness. At that point you can switch to running the query monthly, with an interval of one month, or whatever suits.
Obviously you should proceduralize this. A procedure is the best way to encapsulate the steps and make sure everything is kept in step; a sketch follows. You can also use the database scheduler to run the procedure.
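A minimal sketch of such a procedure, assuming the t23 and tracker tables above (the procedure name is illustrative):
create or replace procedure scrub_emails as
    l_cutoff date;
begin
    select last_date_scrubbed + interval '1' year
    into   l_cutoff
    from   tracker;

    update t23
    set    email = null
    where  date_created < l_cutoff;

    -- sql%rowcount answers "how many rows have we done so far?"
    dbms_output.put_line(sql%rowcount || ' rows scrubbed up to ' || l_cutoff);

    update tracker
    set    last_date_scrubbed = l_cutoff;

    commit;
end;
/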
"there is one downside to this approach. I thought that you want to be free upon choosing which rows to be updated."
I don't see any requirement to track which individual rows have been scrubbed. After all, the end state is that every record older than a certain date has been scrubbed. When I have done jobs like this previously all anybody wanted to know was, "how many rows have we done so far and how many have we still got to do?" Which can be answered by tracking the sql%rowcount for each run.
For the best performance, you can add a flag column to your main table, a column like IsEncrypted. Then every time you run a query that should touch only the not-yet-encrypted rows, you simply add a WHERE condition on IsEncrypted being false. There are other ways, though; a sketch of this one follows.
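A minimal sketch of the flag-column approach, assuming a table named emails with a date_created column (all names illustrative):
alter table emails add (is_encrypted number(1) default 0 not null);

update emails
set    email        = null,
       is_encrypted = 1
where  is_encrypted = 0
and    date_created < add_months(sysdate, -48);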
EDIT
Another way is to create a logger table. Basically, what this table does is record any extra information you want about a certain ID from another table. Have a table called EncryptionLogger with at least two columns: EmailTableId and IsEncrypted. Then in any query you can simply get the rows WHERE their IDs are NOT IN this table.
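A sketch of the logger-table approach, again with illustrative names:
create table encryption_logger (
    email_table_id number primary key,
    is_encrypted   number(1) default 1 not null
);

-- Rows with no entry in the logger have not been scrubbed yet.
select *
from   emails e
where  e.id not in (select l.email_table_id from encryption_logger l);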

Increase scan performance in Apache HBase

I am working on a use case and need help improving the scan performance.
Visits by customers to our website are captured as logs, which we process with Apache Pig, inserting Pig's output directly into an HBase table (test) using HBaseStorage. This is done every morning. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family).
As of now I have generated a random number for each row and inserted it as the row key for that table. For example, I have the following data to be inserted into the table:
1725 | xxx | www.something.com | 127987834 | india | zzzz
1726 | yyy | www.some.com | 128389478 | UK | yyyy
If so, I will add 1 as the row key for the first row, 2 for the second one, and so on.
Note: the same ID will be repeated on different days, so I chose a random number as the row key.
When I query data from the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"} it takes more than 2 minutes to return the results.
Suggest a way to bring this down to 1 to 2 seconds, since I am using it in real-time analytics.
Thanks
As per the query you have mentioned, I am assuming you need records based on Customer ID. If that is correct, then to improve the performance you should use the Customer ID as the row key.
However, there could be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. This unique number could be the timestamp; it depends upon your requirements.
To scan the data in this case, you need to use a PrefixFilter on the row key. This will give you better performance. For example:
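A minimal sketch in the HBase shell, assuming the CustomerID|timestamp row key design above (all values illustrative):
# Write with a composite row key: CustomerID|timestamp
put 'test', '1725|127987834', 'test_family:Name', 'xxx'

# Fetch every row for customer 1725 via a prefix match on the row key:
scan 'test', {FILTER => "PrefixFilter('1725|')"}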
Hope this helps.
