Word count for all words in an Oracle BI column - oracle

I'm a user of Oracle BI (v. 11.1.1.7.141014). I have a text column "description" and would like to create a new table with the word count for all words in that column. So for instance:
Source:
Description
___________
This is a test
Just a test
Result:
Word Count
_____________
a 2
test 2
is 1
just 1
this 1
Would it be possible? I have a user account (no administration features), but I can work on reports (tables, pivot tables, etc.), data structures, custom SQL queries (limited to reports and data structures) and so on.
Thanks in advance

Defining "word" as any sequence of one or more consecutive English letters (upper or lower case), and assuming that "this" and "This" are the same, here is one possible solution. The first line of the code ends in "... from a),"; substitute your table name in place of "a" (for my own testing purposes, I created a table with your input data and called it a).
with b (d, ct) as (select Description, regexp_count(Description, '[a-zA-Z]+') from a),
h (pos) as (select level from dual connect by level <= 100),
prep (word) as (select lower(regexp_substr(d, '[a-zA-Z]+', 1, pos)) from b, h where pos <= ct)
select word, count(word) as word_count
from prep
group by word
order by word_count desc, word
/
The solution needs to know beforehand the maximum number of words per input string. I used 100, which can be increased in the definition of h (the second line of code).
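If you would rather not pick a fixed upper bound, one possible variation generates exactly as many positions as each row needs. This is only a sketch under the same assumptions as above (table a, column Description); it uses the built-in collection type sys.odcinumberlist and I have not tried it inside an Oracle BI report.

```sql
with b (d, ct) as (
  select Description, regexp_count(Description, '[a-zA-Z]+') from a
),
prep (word) as (
  -- for each row, generate the positions 1..ct and extract the word at each one
  select lower(regexp_substr(b.d, '[a-zA-Z]+', 1, h.column_value))
  from b,
       table(cast(multiset(
         select level from dual connect by level <= b.ct
       ) as sys.odcinumberlist)) h
  where b.ct > 0   -- skip rows that contain no words at all
)
select word, count(word) as word_count
from prep
group by word
order by word_count desc, word;
```

The trade-off is readability: the fixed limit of 100 is simpler to follow, while this version does not silently drop words from very long descriptions.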

Related

Efficient use of an index for a self join with a group by

I'm trying to speed up the following
create table tab2 parallel 24 nologging compress for query high as
select /*+ parallel(24) index(a ix_1) index(b ix_2)*/
a.usr
,a.dtnum
,a.company
,count(distinct b.usr) as num
,count(distinct case when b.checked_1 = 1 then b.usr end) as num_che_1
,count(distinct case when b.checked_2 = 1 then b.usr end) as num_che_2
from tab a
join tab b on a.company = b.company
and b.dtnum between a.dtnum-1 and a.dtnum-0.0000000001
group by a.usr, a.dtnum, a.company;
by using indexes
create index ix_1 on tab(usr, dtnum, company);
create index ix_2 on tab(usr, company, dtnum, checked_1, checked_2);
but the execution plan tells me that it's going to be an index full scan for both indexes, and the calculations are very long (1 day is not enough).
About the data: table tab has over 3 million records. None of the single columns is unique. The unique values here are pairs of (usr, dtnum), where dtnum is a date with time written as a number in the format yyyy,mmddhh24miss. Columns checked_1 and checked_2 have values from the set (null, 0, 1, 2). Company holds an id for a company.
Because each pair is unique, it has exactly one value of checked_1, checked_2 and company. Each user can appear in multiple pairs with different dtnum.
Edit
@Roberto Hernandez: I've attached the picture with the execution plan. As for parallel 24, in our company we are told to create tables with options 'parallel [num] nologging compress for query high'. I'm using 24 but I'm no expert in this field.
@Sayan Malakshinov: http://sqlfiddle.com/#!4/40b6b/2 Here I've simplified by giving data with checked_1 = checked_2, but in real life this may not be true.
@scaisEdge:
For
create index my_id1 on tab (company, dtnum);
create index my_id2 on tab (company, dtnum, usr);
I get
For table tab, your join condition is based on the columns
company, dtnum
so your index should be based primarily on these columns:
create index my_id1 on tab (company, dtnum);
The indexes you are using are useless because their leftmost columns are not the ones used in the join/where condition.
Optionally, you can append usr and the checked columns in the rightmost positions to avoid table access and let the database engine retrieve all the information from the index alone:
create index my_id1 on tab (company, dtnum, usr, checked_1, checked_2);
Indexes (bitmap or otherwise) are not that useful for this execution. If you look at the execution plan, the optimizer thinks the group-by is going to reduce the output to 1 row. This results in serialization (PX SELECTOR), so I would question the quality of your statistics. What you may need is to create a column group on the three group-by columns to improve the cardinality estimate of the group by.
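A column group like the one described can be created with DBMS_STATS. The following is a minimal sketch, assuming the table is called TAB and lives in the current schema; the method_opt value is just a common default, not something taken from the question.

```sql
-- Assumption: table TAB is owned by the current user.
-- Create a column group (extended statistics) on the three group-by columns.
select dbms_stats.create_extended_stats(user, 'TAB', '(USR, DTNUM, COMPANY)')
from dual;

-- Regather table statistics so the optimizer can use the new column group.
begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'TAB',
    method_opt => 'for all columns size auto'
  );
end;
/
```

After regathering, check the execution plan again to see whether the estimated row count for the group by has moved away from 1.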

Bound Oracle Text "near" operator to the same sentences

I have a column that stores paragraphs with multiple sentences, and I am using the "Near" operator to look for the right record. Is it possible to restrict the near operator so that it only matches words within the same sentence?
For example:
Paragraph
"An elderly man has died as a result of coronavirus in the Royal
Hobart Hospital overnight. It follows the death of a woman in her 80s
in the North West Regional Hospital in Burnie on Monday morning, and
brings the national toll to 19. "
indextype is ctxsys.context
select
score(1)
from
tbl
where
contains(Paragraph,
'Near((coronavirus, death),20,false)',1) > 0
The result I want is nothing, as the two words are in different sentences. However, the query currently returns a positive score because the words are less than 20 words apart.
Can you share some ideas on how to do this?
Thanks in advance!
The query should look like this:
select score(1)
from tbl
where contains(Paragraph, 'Near((coronavirus, death),20,false)
WITHIN SENTENCE',1) > 0
;
That is - use the WITHIN operator.
Note that you must tell the index to recognize sentences first. That is: if you created the index with a statement like this:
create index ctxidx on tbl(Paragraph)
indextype is ctxsys.context
-- parameters(' ... ')
;
where the parameters (if you used that clause) don't say anything about "sentences", you will get an error if you try the query above - something along the lines of
DRG-10837: section sentence does not exist
First you will have to define "special" sections for sentences:
begin
ctx_ddl.create_section_group('my_section_group', 'AUTO_SECTION_GROUP');
ctx_ddl.add_special_section('my_section_group', 'SENTENCE');
end;
/
With this in hand:
drop index ctxidx;
create index ctxidx on tbl(Paragraph)
indextype is ctxsys.context
parameters ('section group my_section_group')
;
Now you are ready to successfully run the query at the top of this Answer.

Is there a way to get list of all column values from multiple datasets

We have 2 datasets. Both datasets have a column called Office. Say Dataset1 Office column has values London, Liverpool and Dataset2 the Office column has values Washington, California.
Is there any way in SSRS to do something like UNION ALL (but within SSRS), that is, produce a list of all Office values in one table column?
There may be much more elegant ways of doing this but this is all I could come up with..
You need to have a unique numeric id for each record
Assuming you cannot simply union these together in your dataset query, if you can add a numeric ID to each record then you can do it (albeit a bit clunky)
To start with I created two datasets ds1 and ds2 to hold the two lists of office names.
For example the first dataset query was just this...
DECLARE @t TABLE(OfficeID int, Office varchar(20))
INSERT INTO @t VALUES
(1, 'London'),
(2, 'Liverpool')
SELECT * FROM @t
and the second was this.
DECLARE @t TABLE(OfficeID int, Office varchar(20))
INSERT INTO @t VALUES
(10, 'Washington'),
(11, 'California')
SELECT * FROM @t
As you can see, each office now has a unique numeric id
Next I needed another dataset with a list of numbers that covers the entire range of office ids. In this example the range is just 20 numbers.
Note: this works on SQL Server; for other systems you'll have to come up with another method of getting a list of numbers.
So, the query for dsNums is
declare @n table(num int)
insert into @n
select top 20 row_number() over(order by t1.number) as N
from master..spt_values t1
cross join master..spt_values t2
SELECT * FROM @n
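If you would rather not depend on master..spt_values, a recursive CTE is another way to get the same list of numbers. This is a sketch in SQL Server syntax; the name nums is just a placeholder, and other databases need small changes (PostgreSQL expects WITH RECURSIVE, and Oracle expects a column list after the CTE name).

```sql
-- Recursive CTE producing the numbers 1..20 (SQL Server syntax).
WITH nums AS (
    SELECT 1 AS num
    UNION ALL
    SELECT num + 1 FROM nums WHERE num < 20
)
SELECT num FROM nums;
```

SQL Server stops recursion after 100 levels by default, so a range larger than that needs an OPTION (MAXRECURSION n) hint at the end of the query.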
Next I added a table to the report, reduced it to two columns and bound it to dsNums. The first column is just the [num] field.
The second is an expression that does two lookups and takes the first non blank result. The expression was
=IIF(
LEN(LOOKUP(Fields!num.Value, Fields!OfficeID.Value, Fields!Office.Value, "ds1"))>0
, LOOKUP(Fields!num.Value, Fields!OfficeID.Value, Fields!Office.Value, "ds1")
, LOOKUP(Fields!num.Value, Fields!OfficeID.Value, Fields!Office.Value, "ds2")
)
If we run the report now we get..
Finally we need to hide the empty rows. To do this I set the Row Visibility expression to
=LEN(ReportItems!LookupResult.Value) = 0
The result is this (obviously you don't need the num column but it was just for illustration)

How can a query expanding a delimited list be optimized?

I'm new to Oracle and recently ran into the following query. I'm trying to understand what it's doing and hopefully rewrite it to optimize it. In this example, :NameList would be a comma separated list (like: "Bob,Bill,Fred") and then :N_NameList would be the number of tokens (in above example, 3)
SELECT ... FROM
(
SELECT
REGEXP_SUBSTR(:NameList,'[^,]+',1,LEVEL, 'i') Name
FROM DUAL CONNECT BY LEVEL <= :N_NameList
) x
INNER JOIN PEOPLE ppl
ON ppl.Name LIKE x.Name
...
From what I can tell, it expands out the delimited list into unique rows and then joins it with the following tables for each name, but I'm not sure if that's all it's doing. If that is the case, is there a better way to accomplish this?
You could try this instead:
select ...
from people ppl
where instr (','||:NameList||',',
','||ppl.name||',') > 0;
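The commas wrapped around both the list and the name are what keep a partial name from matching. A small illustration with literal values standing in for :NameList and ppl.name (the column aliases are made up for this example):

```sql
-- Literal values stand in for :NameList and ppl.name (illustration only).
select instr(',' || 'Bob,Bill,Fred' || ',', ',' || 'Bill' || ',') as full_name,
       instr(',' || 'Bob,Bill,Fred' || ',', ',' || 'Bil'  || ',') as partial_name
from dual;
-- full_name is 5 (match found), partial_name is 0 (no match)
```

Note that this is an exact match, whereas the original join uses LIKE, so any wildcard characters passed in the list would behave differently.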
is there a better way to accomplish this?
Well, you could get rid of N_NameList because you can easily count number of tokens. This doesn't mean that it is a better way, it's just a different option. To be honest, it is probably slower option than yours as I have to calculate something that you entered as a parameter.
As this example is based on SQL*Plus, I've used & instead of : for substitution variables. && means that it'll "remember" the previously entered value (otherwise, I would have to type NameList twice).
Your current query:
SQL> select regexp_substr('&namelist', '[^,]+', 1, level, 'i') name
2 from dual
3 connect by level <= &n_namelist;
Enter value for namelist: Bob,Bill,Fred
Enter value for n_namelist: 3
Bob
Bill
Fred
Calculated N_NameList (using REGEXP_COUNT):
SQL> select regexp_substr('&&namelist', '[^,]+', 1, level, 'i') name
2 from dual
3 connect by level <= regexp_count('&&namelist', ',') + 1;
Enter value for namelist: Bob,Bill,Fred
Bob
Bill
Fred
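Counting commas and adding one assumes the list has no empty tokens. With a value such as 'Bob,,Fred' there are two commas but only two names, so the third row would come back as NULL. A possible variation, assuming the same :NameList bind variable as in the question, counts the tokens with the same pattern that extracts them:

```sql
-- Count tokens with the same pattern used to extract them,
-- so the number of generated rows always matches the number of names.
select regexp_substr(:NameList, '[^,]+', 1, level) as name
from dual
connect by level <= regexp_count(:NameList, '[^,]+');
```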

Substring inside string

Suppose this is my table:
ID STRING
1 'ABC'
2 'DAE'
3 'BYYYYYY'
4 'H'
I want to select all rows that have at least one of the characters in the STRING column somewhere in another row's STRING variable.
For example, 1 and 2 have an A in common and 1 and 3 have a B in common, but 4 does not have any characters in common with any of the other rows. So my query should return only the first three lines.
I don't need to know with which line it matched.
Thanks!
@A.B.Cade: Good solution, but it could be done without any distinct or join.
SELECT * FROM test t1
WHERE EXISTS
(
SELECT * FROM test t2
WHERE t1.id<>t2.id AND
regexp_like(t1.string, '['|| replace(t2.string, '.[]', '\.\[\]')||']')
)
The query won't compare the string with extra rows since it'll stop the comparison as soon as 1 match is found for the current row...
See fiddle.
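One caveat with building a character class from data is that characters such as ], ^ or - have a special meaning inside [ ]. If the strings may contain them, a regex-free variation is possible with TRANSLATE. This is only a sketch against the same test table; it assumes chr(1) never occurs in the data, since it is used as a dummy character that is kept.

```sql
select t1.id, t1.string
from test t1
where exists (
  select 1
  from test t2
  where t1.id <> t2.id
    -- remove from t1.string every character that occurs in t2.string;
    -- if the result is shorter (or empty), the two rows share a character
    and nvl(length(translate(t1.string, chr(1) || t2.string, chr(1))), 0)
        < length(t1.string)
);
```

The nvl is needed because TRANSLATE returns NULL when every character is removed, and the length of NULL is NULL in Oracle.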
@GolezTrol's answer is a good one, but here is another approach:
select distinct t1."ID", t1."STRING"
from table1 t1, table1 t2
where t1."ID" <> t2."ID"
and regexp_like(t1."STRING", '['|| t2."STRING"||']')
First, take a Cartesian product of the table.
Then make sure you're not comparing the same string to itself.
Then create a regexp from one string for comparing to the other: [<string1>] means that the string must contain one of the letters inside the [ ], which all come from string1.
Here is a fiddle
Like this:
select distinct
id, name
from
(select distinct
x.id,
x.NAME,
length(x.NAME) as leng,
substr(x.name, level, 1) as namechar
from
YourTable x
start with
level = 0
connect by
level <= length(x.name)) y
where
exists
(select
'x'
from
YourTable z
where
instr(z.name, y.namechar) > 0 and
z.id <> y.id)
order by
id
What it does:
First, (inner select) use the table with a number generator that returns a number for each letter in the name. Now each record in YourTable is returned Length(Name) times, each with another number. That generated number is used to isolate that letter (substr).
Then (subselect in the top-level where clause) check if records exist that contain that isolated letter. Distinct is needed because records are returned more than once if more than one letter matches. You could add namechar to the outer select field list to see the letters that match.

Resources