Time range query algorithm

I have a table with an ID and start & end times in milliseconds:
ID  Start  End
1   0      15
2   17     23
3   23     30
4   35     45
and so on.
I have a query to find records overlapping the range 18 to 28; it should select rows whose time range intersects the query's range. For the above data, records 2 and 3 are valid.
My approach would be:
select * from table where
start between 18 and 28 or -- record 3 is selected
end between 18 and 28;     -- record 2 is selected
That's good enough.
Then I have another case: find records overlapping the range 5 to 10.
The above query won't return anything, so I add an extra predicate:
select * from table where
start between 5 and 10 or
end between 5 and 10 or
(start < 5 and end > 10); -- record 1 is selected
My question: is my approach correct, or is there a well-known algorithm that takes care of this problem?
I am pretty sure there are other questions of similar nature. I couldn't think of the correct keyword to find them.
Thanks.

To check whether one range intersects another, a single predicate is enough: two ranges overlap exactly when each one starts before the other ends. For your 5-to-10 query that is:
start < 10 AND end > 5
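As a minimal sketch, here is the same predicate applied to the first query (18 to 28). The names my_table, start_ms, and end_ms are placeholders, since table, start, and end are reserved words in most SQL dialects:
-- A row's range [start_ms, end_ms] overlaps the query range [18, 28]
-- exactly when it starts before the query ends and ends after it starts.
SELECT *
FROM my_table
WHERE start_ms < 28  -- query end
  AND end_ms > 18;   -- query start
-- Returns records 2 (17-23) and 3 (23-30) and nothing else.
Use <= and >= instead if ranges that merely touch at an endpoint should count as overlapping.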

Related

Oracle - determine and return the specific hour of data with the highest sum of the values

I think I can do this in a more roundabout way using arrays, scripting, etc., BUT is it possible to sum up (aggregate) all the values for each "hour" of data in a database for a given field? Basically, I am trying to determine which hour in a day's worth of data had the highest sum, preferably without having to loop through 24 times for each day I want to look at. For example, let's say I have a table called "table" that contains columns for times and values as follows:
Time   Value
00:00  1
00:15  1
00:30  2
00:45  2
01:00  1
01:15  1
01:30  1
01:45  1
If I summed up by hand, I would get the following:
Sum for 00 Hour = 6
Sum for 01 Hour = 4
So, in this example the 00 hour would be my "largest sum" hour. I'd like to end up returning simply which hour had the highest sum and what that value was; the other hours don't matter in this case.
Can this all be done in a single Oracle query, or does it need to be done outside the query with some scripting, working with the times and values separately? If not in a single query, maybe I could just grab the sum for each hour and run multiple queries, one for each hour? Then push each hour to an array and just take the max of that array? I know there is a SUM() function in Oracle, but how to tell it to "sum all the hours and just return the hour with the highest sum" escapes me. Hope all this makes sense. lol
Thanks for any advice to make this easier. :-)
The following query should do what you are looking for:
SELECT SUBSTR(time, 1, 2) AS HOUR,
       SUM(amount) AS TOTAL_AMOUNT
FROM test_data
GROUP BY SUBSTR(time, 1, 2)
ORDER BY TOTAL_AMOUNT DESC
FETCH FIRST ROW WITH TIES;
The query uses the SUM function, grouping by the hour part of your time column. It then orders the results by the summed amounts descending, returning only the top row (and any ties).
Here is a DBFiddle showing the query in use (LINK)
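Note that FETCH FIRST ... WITH TIES requires Oracle 12c or later. On older versions, a ranking subquery is a common substitute; a sketch reusing the same test_data names from above:
SELECT hour, total_amount
FROM (SELECT SUBSTR(time, 1, 2) AS hour,
             SUM(amount) AS total_amount,
             RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk  -- rank hours by their sums
      FROM test_data
      GROUP BY SUBSTR(time, 1, 2))
WHERE rnk = 1;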

Running distinct count in QuickSight

I want to implement a running distinct count with Amazon QuickSight. Here's an example of what that would look like:
Date       ID  Amount  Running Distinct Count
1/1/1900   a   1       1
1/2/1900   a   3       1
1/2/1900   b   6       2
1/4/1900   a   3       2
1/8/1900   c   9       3
1/22/1900  d   2       4
I've tried runningSum(distinct_count({ID}), [{Date} ASC]), but this returns a sum of the distinct counts for each aggregated date field.
Can this be implemented in QuickSight?
You can use this workaround to get the same functionality as runningDistinctCount() as follows:
runningSum(
  distinct_count(
    ifelse(
      {Date} = minOver({Date}, [{ID}], PRE_AGG),
      {ID},
      NULL
    )
  ),
  [{Date} ASC],
  []
)
This gives you the running distinct count of IDs over the Date field. It works by counting each ID only on the first date it appears in the dataset, and then taking a runningSum over those counts.
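If the computation can be pushed down to the data source instead, the same idea translates to SQL. A sketch, assuming a hypothetical table events with columns event_date and id:
-- Count each id only on its first date, then running-sum those counts.
SELECT event_date,
       SUM(COUNT(DISTINCT first_id)) OVER (ORDER BY event_date)
         AS running_distinct_count
FROM (SELECT event_date,
             CASE WHEN event_date = MIN(event_date) OVER (PARTITION BY id)
                  THEN id END AS first_id
      FROM events) t
GROUP BY event_date
ORDER BY event_date;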

Hive SELECT col, COUNT(*) mismatch

Let me start by saying I am very new to Hive, so I'm not sure what information folks will need to help me out. Please let me know what would be useful. Also, while I'd usually create a small dataset to recreate the problem, I think this problem has to do with the scale of my dataset, because I can't seem to reproduce it on a smaller one. Let me know if you have suggestions to make this easier to answer.
Okay now that's out of the way, here's my problem. I have a huge dataset, partitioned by month, with about 500 million rows per month. I have a column with an ID number in it (I'll call it idcol), and I want to closely examine a couple of examples where there's a high number of repeated IDs and a very low number. So, I used this:
SELECT idcol, COUNT(*) FROM table WHERE month = 7 GROUP BY idcol LIMIT 10;
And got:
000005185884381 13
000035323848000 24
000017027256315 531
000010121767109 54
000039844553332 3
000013731352481 309
000024387407996 3
000028461234451 67
000016564844672 1
000032933040806 17
So, I went to investigate the first idcol value with a count of 3, with:
SELECT * FROM table WHERE month = 7 AND idcol = '000039844553332';
I expected to see just 3 rows, but ended up with 469 rows found! That was strange enough, but then I happened to run the original query above with LIMIT 5 instead and ended up with:
000005185884381 13
000017027256315 75
000010121767109 25
000013731352481 59
000024387407996 1
And, it may be hard to see because the idcol is so long, but idcol value 000017027256315 ended up with a count of 531 when I did LIMIT 10 and just 75 when I did LIMIT 5.
What am I missing?! How can I get a correct count of just a small number of values so I can investigate further?!
BTW my first thought was to make the counting part a sub-query, but that didn't change a thing. I used:
SELECT * FROM (SELECT idcol, COUNT(*) FROM table WHERE month = 7 GROUP BY idcol) x LIMIT 10;
...same EXACT results
Most likely the counts are being computed from statistics. See here for the bug and the related discussion.
Try turning that off first:
set hive.compute.query.using.stats = false;
If this doesn't fix it, try the ANALYZE command before running the COUNT(*):
ANALYZE TABLE table_name PARTITION(month) COMPUTE STATISTICS;
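A sketch of a full session applying both suggestions before re-running the aggregate (table and column names follow the question):
-- Stop Hive from answering aggregates out of possibly stale statistics.
SET hive.compute.query.using.stats=false;
-- Refresh the partition-level statistics as well.
ANALYZE TABLE table PARTITION(month) COMPUTE STATISTICS;
-- The counts should now come from an actual scan.
SELECT idcol, COUNT(*) FROM table WHERE month = 7 GROUP BY idcol LIMIT 10;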

In Sqoop, what does "size" mean when used with the --split-limit argument?

From the Sqoop docs:
Using the --split-limit parameter places a limit on the size of the split section created. If the size of the split created is larger than the size specified in this parameter, then the splits would be resized to fit within this limit, and the number of splits will change according to that.
What does "size" refer to here? Can someone explain with a small example?
I was just reading this, and I think it should be interpreted like this.
Say the example table has an INT primary key column called ID and 1000 rows with ID values from 1 to 1000. If you set --num-mappers to 50, you would have 50 tasks, each trying to import 20 rows. The first query would have a predicate that says WHERE ID >= 1 AND ID <= 20, the second mapper would say WHERE ID >= 21 AND ID <= 40, and so on.
If you also define --split-limit, then depending on the size of the splits, this parameter may adjust the number of tasks used to sqoop the data.
For example, with --num-mappers set to 50 and --split-limit set to 10, you would now need 100 tasks, each importing 10 rows, to get all 1000 rows. Your first task would now say WHERE ID >= 1 AND ID <= 10.
In the case of a DateTime column, the size is measured in seconds. So if you have 10 years of data with 1 row for every day, you would have about 3,653 rows. If you set --num-mappers to 10, then each task would try to sqoop about 365 days of data with a predicate that looked something like MYDATETIMECOL >= '2010-01-01' AND MYDATETIMECOL <= '2010-12-31', but if you also set --split-limit to something like 2592000 (the number of seconds in 30 days), then you would need about 122 tasks to sqoop the data, and the first task would have a predicate like MYDATETIMECOL >= '2010-01-01' AND MYDATETIMECOL <= '2010-01-30'.
These two examples both used a 1:1 ratio of column value to row count. If either table had 1000 rows per value in the split-by column, then ALL of those rows would be sqooped as well.
For example, with the DateTime column: if you had loaded 1000 rows every day for the last 10 years, you would have 3,653,000 rows; the predicates and the number of tasks would be the same, but the number of rows sqooped by each task would be 1000x more.
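To make this concrete, a hypothetical Sqoop invocation matching the integer example above (the connection string, table name, and target directory are placeholders):
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --table mytable \
  --split-by ID \
  --num-mappers 50 \
  --split-limit 10 \
  --target-dir /data/mytable
With 1000 ID values, --split-limit 10 caps each split at 10 values, so Sqoop generates about 100 tasks of WHERE ID >= x AND ID <= x+9 instead of the 50 tasks of 20 values that --num-mappers alone implies.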

Use LAG in the next line after its line has been executed

This is a very complicated situation for me and I was wondering if someone can help me with it:
Here is my table:
Record_no Type Solde SQLCalculatedPmu DesiredValues
------------------------------------------------------------------------
2570088 Insertion 60 133 133
2636476 Insertion 67 119,104 119,104
2636477 Insertion 68 117,352 117,352
2958292 Insertion 74 107,837 107,837
3148350 Radiation 73 107,837 107,83 <---
3282189 Insertion 80 98,401 98,395
3646066 Insertion 160 49,201 49,198
3783510 Insertion 176 44,728 44,725
3783511 Insertion 177 44,475 44,472
4183663 Insertion 188 41,873 41,87
4183664 Insertion 189 41,651 41,648
4183665 Radiation 188 41,651 41,64 <---
4183666 Insertion 195 40,156 40,145
4183667 Insertion 275 28,474 28,466
4183668 Insertion 291 26,908 26,901
4183669 Insertion 292 26,816 26,809
4183670 Insertion 303 25,842 25,836
4183671 Insertion 304 25,757 25,751
In my table, every value in the SQLCalculatedPmu column or the DesiredValues column is calculated based on the preceding value.
As you can see, I have calculated the SQLCalculatedPmu column by rounding to 3 decimals. The issue is that on each Radiation line, the client wants the next calculation to start from the value rounded to 2 decimals instead of 3 (represented in the DesiredValues column), with all subsequent values recalculated. For example, line 6 will change because the value in line 5 is now on 2 decimals. I could handle this if there were a single Radiation, but in my case I have many Radiations, and each one changes everything after it based on the 2-decimal calculation.
In summary, here are the steps:
1 - round the value of the row preceding a Radiation and put it in the Radiation row.
2 - recalculate all following Insertion rows.
3 - when we reach another Radiation, we redo steps 1 and 2, and so on.
I'm using an Oracle DB and I'm the owner, so I can create procedures, insert, update, and select.
But I'm not familiar with procedures or loops.
For information, the formula for SQLCalculatedPmu uses two additional columns, price and number, and is calculated cumulatively line by line for each investor:
(price * number) + (cumulative (price * number) of the preceding lines)
I tried something like this:
update PMUTemp
set SQLCalculatedPmu =
  case when Type = 'Insertion' then
    (number * price) + lag(SQLCalculatedPmu, 1) over (partition by investor order by Record_no)
      / (number + lag(solde, 1) over (partition by investor order by Record_no))
  else
    TRUNC(lag(SQLCalculatedPmu, 1) over (partition by investor order by Record_no))
  end;
but it gave me this error (I think it's because I'm looking at the preceding line, which is itself modified during the SQL statement):
ORA-30486: window functions are allowed only in the SELECT list of a query.
I was wondering if creating a procedure that would be called as many times as the number of Radiations would do the job, but I'm really not good with procedures.
Any help?
Regards,
Just to make my need simpler: all I want is to derive the DesiredValues column starting from the SQLCalculatedPmu column. The steps are:
1 - on a Radiation, the value becomes trunc(preceding value, 2)
2 - calculate all following Insertion rows this way: (price * number) + (cumulative (price * number) of the preceding lines). As the Radiation value has changed, I need to recalculate the next lines based on it.
3 - when we reach another Radiation, we redo steps 1 and 2, and so on
Kindest regards
You should not need a procedure here -- a SQL update of the Radiation rows in the table would do this quicker and more reliably.
Something like ...
update my_table t1
set (column_1, column_2) =
    (select round(column_1, 2), round(column_2, 2)
     from my_table t2
     where t2.type = 'Insertion' and
           t2.record_no = (select max(t3.record_no)
                           from my_table t3
                           where t3.type = 'Insertion' and
                                 t3.record_no < t1.record_no))
where t1.type = 'Radiation';
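If you do want to keep the LAG-based formulation from the question, the usual workaround for the ORA-30486 error is to compute the window function in a subquery and merge the result back, since window functions are not allowed directly in an UPDATE's SET clause. A sketch under the question's column names, covering only the truncate-to-2-decimals step (cascading the recalculation through the following Insertion rows would still need a further pass):
merge into PMUTemp t
using (select Record_no,
              -- previous row's value per investor, computed where windows are legal
              lag(SQLCalculatedPmu, 1) over (partition by investor
                                             order by Record_no) as prev_pmu
       from PMUTemp) s
on (t.Record_no = s.Record_no)
when matched then update
  set t.SQLCalculatedPmu = trunc(s.prev_pmu, 2)
  where t.Type = 'Radiation';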
