Oracle - SUM of multiples columns and lines - oracle

I have 2 tables
n_station and val_horaires
One station has multiple val_horaires and they are link via SUBSTR(code_mesure,1,4) = code_stas
The following query is giving me all the stations from a specific group with all the value between 2 dates
select
m.code_mesure,s.code_stas,s.nom_station, s.riviere_bassin, s.lambert_x, s.lambert_y,
m.date_val_hor, m.h_01,m.h_02,m.h_03,m.h_04,m.h_05,m.h_06,m.h_07,m.h_08,m.h_09,m.h_10,m.h_11,m.h_12,m.h_13,m.h_14,m.h_15,m.h_16,m.h_17,m.h_18,m.h_19,m.h_20,m.h_21,m.h_22,m.h_23,m.h_24
from HYDRO.n_station s
join
(
select
substr(vm.code_mesure,1,4) as code_stas, vm.*
from HYDRO.val_horaires vm
where code_mesure in (select code_mesure from HYDRO.GROUPE_MESURE where HYDRO.GROUPE_MESURE.CODE_GROUPE = 'TELEMP')
AND vm.date_val_hor between '12/12/2016 16:00' AND '14/12/2016 17:00'
) m
on m.code_stas = s.code_stas
where s.code_stas in (select SUBSTR(code_mesure,1,4) from HYDRO.GROUPE_MESURE where CODE_GROUPE = 'TELEMP')
AND s.lambert_x > 0 AND s.lambert_Y > 0
Result of above query
code_mesure code_stas nom_station riviere_bassin lambert_X lambert_Y date_val_hor h_01 h_02 h_03 h_04 h_05 ...................................
70480015 7048 EREZEE OURTHE 236667 109356 12/12/2016 00:00:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
70480015 7048 EREZEE OURTHE 236667 109356 13/12/2016 00:00:00 0 0 0 0 0 0 0 0 0 0,5 0 0,1 0,4 0,1 0 0 0 0 0 0 0 0 0 0
70480015 7048 EREZEE OURTHE 236667 109356 14/12/2016 00:00:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
60480015 6048 RACHAMPS-NOVILLE OURTHE 251592 86756 12/12/2016 00:00:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
60480015 6048 RACHAMPS-NOVILLE OURTHE 251592 86756 13/12/2016 00:00:00 0 0 0 0 0 0 0 0 0 0 0,7 0 0 0 0 0 0 0 0,1 0 0 0 0 0
60480015 6048 RACHAMPS-NOVILLE OURTHE 251592 86756 14/12/2016 00:00:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
I am now trying to get for each stations the SUM of all h_01,h_02,...h_24 for each line and also SUM those 3 lines togethers. Numbers below 0 should not be used (eg: -9999)
In this case the result would be
code_mesure code_stas nom_station riviere_bassin lambert_X lambert_Y Sum LastValue
70480015 7048 EREZEE OURTHE 236667 109356 0.11 0
60480015 6048 RACHAMPS-NOVILLE OURTHE 251592 86756 0.8 0
One last thing :
If the start date is 12/12/2016 16h00 - The value in the row with the date 12/12/2016 should only be used starting from h_16 to h_24
if the end date is 14/12/2016 17:00 - The value in the row with the date 14/12/2016 shoul only be used starting from h_01 to h_17

Use GREATEST() to restrict the numbers to positive values then add the values to combine across columns and use the SUM() aggregate function to combine rows with a GROUP BY clause.
WITH bounds ( start_time, end_time ) AS (
SELECT TIMESTAMP '2016-12-12 16:00:00', TIMESTAMP '2016-12-14 17:00:00' FROM DUAL
)
select
m.code_mesure,
s.code_stas,
s.nom_station,
s.riviere_bassin,
s.lambert_x,
s.lambert_y,
SUM(
CASE WHEN m.date_val_hor + 1/24 BETWEEN b.start_time AND b.end_time THEN GREATEST( m.h_01, 0 ) ELSE 0 END
+ CASE WHEN m.date_val_hor + 2/24 BETWEEN b.start_time AND b.end_time THEN GREATEST( m.h_02, 0 ) ELSE 0 END
+ CASE WHEN m.date_val_hor + 3/24 BETWEEN b.start_time AND b.end_time THEN GREATEST( m.h_03, 0 ) ELSE 0 END
...
+ CASE WHEN m.date_val_hor + 24/24 BETWEEN b.start_time AND b.end_time THEN GREATEST( m.h_24, 0 ) ELSE 0 END
) AS h_sum,
MAX(
CASE EXTRACT( HOUR FROM b.end_time )
WHEN 0 THEN m.h_01 -- Possibly m.h_24
WHEN 1 THEN m.h_02 -- Possibly m.h_01
WHEN 2 THEN m.h_03 -- Possibly m.h_02
...
WHEN 22 THEN m.h_23 -- Possibly m.h_22
ELSE m.h_24 -- Possibly m.h_23
END
) KEEP ( DENSE_RANK LAST ORDER BY m.date_val_hor ) AS last_hour_value
from bounds b
cross join
HYDRO.n_station s
join
( select substr(vm.code_mesure,1,4) as code_stas,
vm.*
from HYDRO.val_horaires vm
join
bounds b
ON ( vm.date_val_hor between TRUNC( b.start_time ) AND TRUNC( b.end_time ) )
where code_mesure in (select code_mesure
from HYDRO.GROUPE_MESURE
where HYDRO.GROUPE_MESURE.CODE_GROUPE = 'TELEMP')
) m
on m.code_stas = s.code_stas
where s.code_stas in ( select SUBSTR(code_mesure,1,4)
from HYDRO.GROUPE_MESURE
where CODE_GROUPE = 'TELEMP' )
AND s.lambert_x > 0 AND s.lambert_Y > 0
GROUP BY
m.code_mesure,
s.code_stas,
s.nom_station,
s.riviere_bassin,
s.lambert_x,
s.lambert_y

Related

Is there a way to vectorize this matlab for loop?

for i = 2:N
A(i,i-1:i+1) = [1, -2, 1];
end
Hello, matlab is telling me that this code can be faster by using spalloc for the matrix A (which I have) but also by vectorizing this for loop. I've tried to use the following:
i = 2:N
A(i, i-1:i+1)
but the result obviously turned out to be not what I want.
How can I solve this?
Thank you!
It looks like you're trying to get a second-order difference operator, except your loop winds up missing the first row and including an extra column. The normal (sparse) difference operator is generated like this:
N = 10;
v = ones(N, 1);
A = spdiags([v -2*v v], [-1 0 1], N, N);
full(A) % for display only
You'll see:
ans =
-2 1 0 0 0 0 0 0 0 0
1 -2 1 0 0 0 0 0 0 0
0 1 -2 1 0 0 0 0 0 0
0 0 1 -2 1 0 0 0 0 0
0 0 0 1 -2 1 0 0 0 0
0 0 0 0 1 -2 1 0 0 0
0 0 0 0 0 1 -2 1 0 0
0 0 0 0 0 0 1 -2 1 0
0 0 0 0 0 0 0 1 -2 1
0 0 0 0 0 0 0 0 1 -2
If that's not quite what you want (e.g., you really don't want the first row), then it's probably faster to generate it as above and then fix it up.

Kindly help to get office wise data row wise

select NUM_OFC_CODE,NUM_RO_CODE,
case when TXT_MONTH='JAN' then 1 ELSE 0 end as JAN,
case when TXT_MONTH='FEB' then 1 ELSE 0 end as FEB,
case when TXT_MONTH='MAR' then 1 ELSE 0 end as MAR,
case when TXT_MONTH='APR' then 1 ELSE 0 end as APR,
case when TXT_MONTH='MAY' then 1 ELSE 0 end as MAY,
case when TXT_MONTH='JUN' then 1 ELSE 0 end as JUN,
case when TXT_MONTH='JUL' then 1 ELSE 0 end as JUL,
case when TXT_MONTH='AUG' then 1 ELSE 0 end as AUG,
case when TXT_MONTH='SEP' then 1 ELSE 0 end as SEP,
case when TXT_MONTH='OCT' then 1 ELSE 0 end as OCT,
case when TXT_MONTH='NOV' then 1 ELSE 0 end as NOV,
case when TXT_MONTH='DEC' then 1 ELSE 0 end as DEC
from LEG_OMBUDSMAN_NONMACT where
NUM_YEAR=2019 group by NUM_OFC_CODE,TXT_MONTH,NUM_RO_CODE;
Result is showing as below:-
NUM_OFC_CODE NUM_RO_CODE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
280400 280000 0 0 0 0 0 0 0 1 0 0
282300 280000 0 0 0 0 0 0 0 1 0 0 0
281600 280000 0 0 0 0 0 0 0 1 0 0 0
280500 280000 0 0 0 0 0 0 1 0 0 0 0
280500 280000 0 0 0 1 0 0 0 0 0 0 0
281800 280000 0 0 0 0 0 0 0 1 0 0 0
282200 280000 0 0 0 0 0 0 0 1 0 0 0
280500 280000 0 0 0 0 1 0 0 0 0 0 0
280500 280000 0 0 0 0 0 1 0 0 0 0 0
280500 280000 0 0 0 0 0 0 0 1 0 0 0
281300 280000 0 0 0 0 0 0 0 1 0 0 0
I want office wise data. If August data is present, Then It should show 1 else 0. Like wise for other months. But in my query Separate row is showing for separate months.
Basically you have to group the data only by NUM_OFC_CODE,NUM_RO_CODE (excluding TXT_MONTH as you don't want a row for each instance of TXT_MONTH) and then use something like NVL(MAX(CASE WHEN TXT_MONTH='JAN' THEN 1 END), 0) as JAN (using a aggregate function to decide wether an entry exists or not) etc.
It's easier with the use of pivot:
-- Just some sampledata:
WITH LEG_OMBUDSMAN_NONMACT(NUM_OFC_CODE, NUM_RO_CODE, NUM_YEAR, TXT_MONTH) AS
(SELECT 1,1,2019, 'JAN' FROM dual union ALL
SELECT 1,1,2019, 'FEB' FROM dual)
-- Here starts the actual query:
SELECT NUM_OFC_CODE, NUM_RO_CODE
, NVL(JAN,0) AS JAN
, NVL(FEB,0) AS FEB
, NVL(MAR,0) AS MAR
, NVL(APR,0) AS APR
, NVL(MAY,0) AS MAY
, NVL(JUN,0) AS JUN
, NVL(JUL,0) AS JUL
, NVL(AUG,0) AS AUG
, NVL(SEP,0) AS SEP
, NVL(OCT,0) AS OCT
, NVL(NOV,0) AS NOV
, NVL(DEC,0) AS DEC
FROM LEG_OMBUDSMAN_NONMACT
pivot (MAX(1) FOR TXT_MONTH IN ('JAN' AS JAN,'FEB' AS FEB,'MAR' as MAR, 'APR' as APR, 'MAY' as MAY, 'JUN' as JUN, 'JUL' as JUL, 'AUG' as AUG, 'SEP' as SEP, 'OCT' as OCT, 'NOV' as NOV, 'DEC' as DEC ))
WHERE NUM_YEAR=2019
Your query is perfectly fine, just need couple of changes.
Remove txt_month from group by.
Use Max in all case statements.
So your query should look like this
select NUM_OFC_CODE,
NUM_RO_CODE,
Max(case when TXT_MONTH='JAN' then 1 ELSE 0 end) as JAN,
Max(case when TXT_MONTH='FEB' then 1 ELSE 0 end) as FEB,
Max(case when TXT_MONTH='MAR' then 1 ELSE 0 end) as MAR,
Max(case when TXT_MONTH='APR' then 1 ELSE 0 end) as APR,
Max(case when TXT_MONTH='MAY' then 1 ELSE 0 end) as MAY,
Max(case when TXT_MONTH='JUN' then 1 ELSE 0 end) as JUN,
Max(case when TXT_MONTH='JUL' then 1 ELSE 0 end) as JUL,
Max(case when TXT_MONTH='AUG' then 1 ELSE 0 end) as AUG,
Max(case when TXT_MONTH='SEP' then 1 ELSE 0 end) as SEP,
Max(case when TXT_MONTH='OCT' then 1 ELSE 0 end) as OCT,
Max(case when TXT_MONTH='NOV' then 1 ELSE 0 end) as NOV,
Max(case when TXT_MONTH='DEC' then 1 ELSE 0 end) as DEC
from LEG_OMBUDSMAN_NONMACT
where NUM_YEAR=2019
group by NUM_OFC_CODE ,NUM_RO_CODE;
Cheers!!

Finding islands of ones with zeros boundary

I am trying to find islands of numbers in a matrix.
By an island, I mean a rectangular area where ones are connected with each other either horizontally, vertically or diagonally including the boundary layer of zeros
Suppose I have this matrix:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1
0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0
0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1
0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0
0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
By boundary layer, I mean row 2 and 7, and column 3 and 10 for island#1.
This is shown below:
I want the row and column indices of the islands. So for the above matrix, the desired output is:
isl{1}= {[2 3 4 5 6 7]; % row indices of island#1
[3 4 5 6 7 8 9 10]} % column indices of island#1
isl{2}= {[2 3 4 5 6 7]; % row indices of island#2
[12 13 14 15 16 17]}; % column indices of island#2
isl{3} ={[9 10 11 12]; % row indices of island#3
[2 3 4 5 6 7 8 9 10 11];} % column indices of island#3
It doesn't matter which island is detected first.
While I know that the [r,c] = find(matrix) function can give the row and column indices of ones but I have no clues on how to detect the connected ones since they can be connected in horizontal, vertical and diagonal order.
Any ideas on how to deal with this problem?
You should look at the BoundingBox and ConvexHull stats returned by regionprops:
a = imread('circlesBrightDark.png');
bw = a < 100;
s = regionprops('table',bw,'BoundingBox','ConvexHull')
https://www.mathworks.com/help/images/ref/regionprops.html
Finding the connected components and their bounding boxes is the easy part. The more difficult part is merging the bounding boxes into islands.
Bounding Boxes
First the easy part.
function bBoxes = getIslandBoxes(lMap)
% find bounding box of each candidate island
% lMap is a logical matrix containing zero or more connected components
bw = bwlabel(lMap); % label connected components in logical matrix
bBoxes = struct2cell(regionprops(bw, 'BoundingBox')); % get bounding boxes
bBoxes = cellfun(#round, bBoxes, 'UniformOutput', false); % round values
end
The values are rounded because the bounding boxes returned by regionprops lies outside its respective component on the grid lines rather than the cell center, and we need integer values to use as subscripts into the matrix. For example, a component that looks like this:
0 0 0
0 1 0
0 0 0
will have a bounding box of
[ 1.5000 1.5000 1.0000 1.0000 ]
which we round to
[ 2 2 1 1]
Merging
Now the hard part. First, the merge condition:
We merge bounding box b2 into bounding box b1 if b2 and the island of b1 (including the boundary layer) have a non-null intersection.
This condition ensures that bounding boxes are merged when one component is wholly or partially inside the bounding box of another, but it also catches the edge cases when a bounding box is within the zero boundary of another. Once all of the bounding boxes are merged, they are guaranteed to have a boundary of all zeros (or border the edge of the matrix), otherwise the nonzero value in its boundary would have been merged.
Since merging involves deleting the merged bounding box, the loops are done backwards so that we don't end up indexing non-existent array elements.
Unfortunately, making one pass through the array comparing each element to all the others is insufficient to catch all cases. To signal that all of the possible bounding boxes have been merged into islands, we use a flag called anyMerged and loop until we get through one complete iteration without merging anything.
function mBoxes = mergeBoxes(bBoxes)
% find bounding boxes that intersect, and merge them
mBoxes = bBoxes;
% merge bounding boxes that overlap
anyMerged = true; % flag to show when we've finished
while (anyMerged)
anyMerged = false; % no boxes merged on this iteration so far...
for box1 = numel(mBoxes):-1:2
for box2 = box1-1:-1:1
% if intersection between bounding boxes is > 0, merge
% the size of box1 is increased b y 1 on all sides...
% this is so that components that lie within the borders
% of another component, but not inside the bounding box,
% are merged
if (rectint(mBoxes{box1} + [-1 -1 2 2], mBoxes{box2}) > 0)
coords1 = rect2corners(mBoxes{box1});
coords2 = rect2corners(mBoxes{box2});
minX = min(coords1(1), coords2(1));
minY = min(coords1(2), coords2(2));
maxX = max(coords1(3), coords2(3));
maxY = max(coords1(4), coords2(4));
mBoxes{box2} = [minX, minY, maxX-minX+1, maxY-minY+1]; % merge
mBoxes(box1) = []; % delete redundant bounding box
anyMerged = true; % bounding boxes merged: loop again
break;
end
end
end
end
end
The merge function uses a small utility function that converts rectangles with the format [x y width height] to a vector of subscripts for the top-left, bottom-right corners [x1 y1 x2 y2]. (This was actually used in another function to check that an island had a zero border, but as discussed above, this check is unnecessary.)
function corners = rect2corners(rect)
% change from rect = x, y, width, height
% to corners = x1, y1, x2, y2
corners = [rect(1), ...
rect(2), ...
rect(1) + rect(3) - 1, ...
rect(2) + rect(4) - 1];
end
Output Formatting and Driver Function
The return value from mergeBoxes is a cell array of rectangle objects. If you find this format useful, you can stop here, but it's easy to get to the format requested with ranges of rows and columns for each island:
function rRanges = rect2range(bBoxes, mSize)
% convert rect = x, y, width, height to
% range = y:y+height-1; x:x+width-1
% and expand range by 1 in all 4 directions to include zero border,
% making sure to stay within borders of original matrix
rangeFun = #(rect) {max(rect(2)-1,1):min(rect(2)+rect(4),mSize(1));...
max(rect(1)-1,1):min(rect(1)+rect(3),mSize(2))};
rRanges = cellfun(rangeFun, bBoxes, 'UniformOutput', false);
end
All that's left is a main function to tie all of the others together and we're done.
function theIslands = getIslandRects(m)
% get rectangle around each component in map
lMap = logical(m);
% get the bounding boxes of candidate islands
bBoxes = getIslandBoxes(lMap);
% merge bounding boxes that overlap
bBoxes = mergeBoxes(bBoxes);
% convert bounding boxes to row/column ranges
theIslands = rect2range(bBoxes, size(lMap));
end
Here's a run using the sample matrix given in the question:
M =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1
0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0
0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1
0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0
0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> getIslandRects(M)
ans =
{
[1,1] =
{
[1,1] =
9 10 11 12
[2,1] =
2 3 4 5 6 7 8 9 10 11
}
[1,2] =
{
[1,1] =
2 3 4 5 6 7
[2,1] =
3 4 5 6 7 8 9 10
}
[1,3] =
{
[1,1] =
2 3 4 5 6 7
[2,1] =
12 13 14 15 16 17
}
}
Quite easy!
Just use bwboundaries to get the boundaries of each of the blobs. you can then just get the min and max in each x and y direction of each boundary to build your box.
Use image dilation and regionprops
mat = [...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1;
0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0;
0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1;
0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0;
0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
mat=logical(mat);
dil_mat=imdilate(mat,true(2,2)); %here we make bridges to 1 px away ones
l_mat=bwlabel(dil_mat,8);
bb = regionprops(l_mat,'BoundingBox');
bb = struct2cell(bb); bb = cellfun(#(x) fix(x), bb, 'un',0);
isl = cellfun(#(x) {max(1,x(2)):min(x(2)+x(4),size(mat,1)),...
max(1,x(1)):min(x(1)+x(3),size(mat,2))},bb,'un',0);

Adjacent Elements in MATLAB with Mathematical Formulation

I have a set with elements and the possible adjacent combinations for this are:
So the total possible combinations are c=11 which can be calculated with the formula:
I can model this using a as below whose elements can be represented as a(n,c) are:
I have tried to implement this in MATLAB, but since I have hard-coded the above math my code is not extensible for cases where n > 4:
n=4;
c=((n^2)/2)+(n/2)+1;
A=zeros(n,c);
for i=1:n
A(i,i+1)=1;
end
for i=1:n-1
A(i,n+i+1)=1;
A(i+1,n+i+1)=1;
end
for i=1:n-2
A(i,n+i+4)=1;
A(i+1,n+i+4)=1;
A(i+2,n+i+4)=1;
end
for i=1:n-3
A(i,n+i+6)=1;
A(i+1,n+i+6)=1;
A(i+2,n+i+6)=1;
A(i+3,n+i+6)=1;
end
Is there a relatively low complexity method to transform this problem in MATLAB with n number of elements of set N, following my above mathematical formulation?
The easy way to go about this is to take a bit pattern with the first k bits set and shift it down n - k times, saving each shifted column vector to the result. So, starting from
1
0
0
0
Shift 1, 2, and 3 times to get
|1 0 0 0|
|0 1 0 0|
|0 0 1 0|
|0 0 0 1|
We'll use circshift to achieve this.
function A = adjcombs(n)
c = (n^2 + n)/2 + 1; % number of combinations
A = zeros(n,c); % preallocate output array
col_idx = 1; % skip the first (all-zero) column
curr_col = zeros(n,1); % column vector containing current combination
for elem_count = 1:n
curr_col(elem_count) = 1; % add another element to our combination
for shift_count = 0:(n - elem_count)
col_idx = col_idx + 1; % increment column index
% shift the current column and insert it at the proper index
A(:,col_idx) = circshift(curr_col, shift_count);
end
end
end
Calling the function with n = 4 and 6 we get:
>> A = adjcombs(4)
A =
0 1 0 0 0 1 0 0 1 0 1
0 0 1 0 0 1 1 0 1 1 1
0 0 0 1 0 0 1 1 1 1 1
0 0 0 0 1 0 0 1 0 1 1
>> A = adjcombs(6)
A =
0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1
0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 1 1
0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1
0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1
0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 1 1
0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1

how can I create an incidence matrix in Julia

I would like to create an incidence matrix.
I have a file with 3 columns, like:
id x y
A 22 2
B 4 21
C 21 360
D 26 2
E 22 58
F 2 347
And I want a matrix like (without col and row names):
2 4 21 22 26 58 347 360
A 1 0 0 1 0 0 0 0
B 0 1 1 0 0 0 0 0
C 0 0 1 0 0 0 0 1
D 1 0 0 0 1 0 0 0
E 0 0 0 1 0 1 0 0
F 1 0 0 0 0 0 1 0
I have started the code like:
haps = readdlm("File.txt",header=true)
hap1_2 = map(Int64,haps[1][:,2:end])
ID = (haps[1][:,1])
dic1 = Dict()
for (i in 1:21)
dic1[ID[i]] = hap1_2[i,:]
end
X=[zeros(21,22)]; #the original file has 21 rows and 22 columns
X1 = hcat(ID,X)
The problem now is that I don't know how to fill the matrix with 1s in the specific columns as in the example above.
I'm also not sure if I'm on the right way.
Any suggestion that could help me??
Thanks!
NamedArrays is a neat package which allows naming both rows and columns and seems to fit the bill for this problem. Suppose the data is in data.csv, here is one method to go about it (install NamedArrays with Pkg.add("NamedArrays")):
data,header = readcsv("data.csv",header=true);
# get the column names by looking at unique values in columns
cols = unique(vec([(header[j+1],data[i,j+1]) for i in 1:size(data,1),j=1:2]))
# row names from ID column
rows = data[:,1]
using NamedArrays
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
# now stamp in the 1s in the right places
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],(header[c],data[r,c])] = 1 ; end
Now we have (note I transposed narr for better printout):
julia> narr'
10x6 NamedArray{Int64,2}:
attr ╲ id │ A B C D E F
──────────┼─────────────────
("x",22) │ 1 0 0 0 1 0
("x",4) │ 0 1 0 0 0 0
("x",21) │ 0 0 1 0 0 0
("x",26) │ 0 0 0 1 0 0
("x",2) │ 0 0 0 0 0 1
("y",2) │ 1 0 0 1 0 0
("y",21) │ 0 1 0 0 0 0
("y",360) │ 0 0 1 0 0 0
("y",58) │ 0 0 0 0 1 0
("y",347) │ 0 0 0 0 0 1
But, if DataFrames are necessary, similar tricks should apply.
---------- UPDATE ----------
In case the column of a value should be ignored i.e. x=2 and y=2 should both set a 1 on column for value 2, then the code becomes:
using NamedArrays
data,header = readcsv("data.csv",header=true);
rows = data[:,1]
cols = map(string,sort(unique(vec(data[:,2:end]))))
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],string(data[r,c])] = 1 ; end
giving:
julia> narr
6x8 NamedArray{Int64,2}:
id ╲ attr │ 2 4 21 22 26 58 347 360
──────────┼───────────────────────────────────────
A │ 1 0 0 1 0 0 0 0
B │ 0 1 1 0 0 0 0 0
C │ 0 0 1 0 0 0 0 1
D │ 1 0 0 0 1 0 0 0
E │ 0 0 0 1 0 1 0 0
F │ 1 0 0 0 0 0 1 0
Here is a slight variation on something that I use for creating sparse matrices out of categorical variables for regression analyses. The function includes a variety of comments and options to suit it to your needs. Note: as written, it treats the appearances of "2" and "21" in x and y as separate. It is far less elegant in naming and appearance than the nice response from Dan Getz. The main advantage here is that it works with sparse matrices so if your data is huge, this will be helpful in reducing storage space and computation time.
function OneHot(x::Array, header::Bool)
UniqueVals = unique(x)
Val_to_Idx = [Val => Idx for (Idx, Val) in enumerate(unique(x))] ## create a dictionary that maps unique values in the input array to column positions in the new sparse matrix.
ColIdx = convert(Array{Int64}, [Val_to_Idx[Val] for Val in x])
MySparse = sparse(collect(1:length(x)), ColIdx, ones(Int32, length(x)))
if header
return [UniqueVals' ; MySparse] ## note: this won't be sparse
## alternatively use return (MySparse, UniqueVals) to get a tuple, second element is the header which you can then feed to something to name the columns or do whatever else with
else
return MySparse ## use MySparse[:, 2:end] to drop a value (which you would want to do for categorical variables in a regression)
end
end
x = [22, 4, 21, 26, 22, 2];
y = [2, 21, 360, 2, 58, 347];
Incidence = [OneHot(x, true) OneHot(y, true)]
7x10 Array{Int64,2}:
22 4 21 26 2 2 21 360 58 347
1 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 1 0 0
0 0 0 1 0 1 0 0 0 0
1 0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0 1

Resources