Speeding up ismember in Matlab - performance

I'm using the built-in Matlab command ismember to check whether a certain data set is contained within a larger data set. The purpose of the piece of code is to remove any duplicates of the [0 0] row from the larger data set shown below.
To achieve this, I am using the following piece of code:
[Lia, locB] = ismember([0 0; 0 0], AFdata, 'rows');
if sum(Lia) > 1
    AFdata(locB(1):locB(end-1),:) = [];
end
AFdata = [
1.0000 -0.0114
0.9975 -0.0098
0.9951 -0.0084
0.9928 -0.0074
0.9903 -0.0066
0.9804 -0.0042
0.9705 -0.0018
0.9606 0.0004
0.9507 0.0025
0.9408 0.0045
0.9309 0.0063
0.9210 0.0082
0.9111 0.0100
0.9012 0.0118
0.8913 0.0135
0.8814 0.0152
0.8715 0.0167
0.8616 0.0183
0.8517 0.0199
0.8418 0.0214
0.8318 0.0229
0.8219 0.0243
0.8120 0.0256
0.8021 0.0269
0.7922 0.0282
0.7823 0.0294
0.7724 0.0306
0.7625 0.0318
0.7526 0.0329
0.7427 0.0340
0.7328 0.0350
0.7229 0.0359
0.7130 0.0368
0.7031 0.0377
0.6932 0.0385
0.6833 0.0393
0.6734 0.0401
0.6635 0.0408
0.6536 0.0415
0.6437 0.0422
0.6338 0.0428
0.6239 0.0434
0.6140 0.0439
0.6041 0.0444
0.5942 0.0449
0.5843 0.0454
0.5744 0.0458
0.5645 0.0461
0.5546 0.0465
0.5447 0.0469
0.5348 0.0472
0.5249 0.0475
0.5150 0.0478
0.5051 0.0481
0.4951 0.0483
0.4852 0.0485
0.4753 0.0487
0.4654 0.0489
0.4555 0.0491
0.4456 0.0492
0.4357 0.0493
0.4258 0.0494
0.4159 0.0495
0.4060 0.0495
0.3961 0.0495
0.3862 0.0495
0.3763 0.0495
0.3664 0.0494
0.3565 0.0493
0.3466 0.0492
0.3367 0.0491
0.3268 0.0490
0.3169 0.0488
0.3070 0.0486
0.2971 0.0484
0.2872 0.0482
0.2773 0.0479
0.2674 0.0476
0.2575 0.0473
0.2476 0.0469
0.2377 0.0465
0.2278 0.0461
0.2179 0.0457
0.2080 0.0452
0.1981 0.0447
0.1882 0.0441
0.1783 0.0435
0.1684 0.0428
0.1584 0.0421
0.1485 0.0413
0.1386 0.0404
0.1287 0.0395
0.1188 0.0385
0.1089 0.0374
0.0990 0.0363
0.0891 0.0352
0.0792 0.0338
0.0693 0.0323
0.0594 0.0306
0.0495 0.0287
0.0396 0.0265
0.0297 0.0239
0.0198 0.0204
0.0099 0.0153
0.0050 0.0115
0.0020 0.0075
0 0
0 0
0.0020 -0.0075
0.0050 -0.0115
0.0099 -0.0153
0.0198 -0.0204
0.0297 -0.0239
0.0396 -0.0265
0.0495 -0.0287
0.0594 -0.0306
0.0693 -0.0323
0.0792 -0.0338
0.0891 -0.0352
0.0990 -0.0363
0.1089 -0.0375
0.1188 -0.0386
0.1287 -0.0396
0.1386 -0.0405
0.1485 -0.0414
0.1584 -0.0422
0.1684 -0.0429
0.1783 -0.0436
0.1882 -0.0442
0.1981 -0.0448
0.2080 -0.0454
0.2179 -0.0459
0.2278 -0.0463
0.2377 -0.0467
0.2476 -0.0471
0.2575 -0.0475
0.2674 -0.0478
0.2773 -0.0481
0.2872 -0.0484
0.2971 -0.0486
0.3070 -0.0488
0.3169 -0.0490
0.3268 -0.0491
0.3367 -0.0492
0.3466 -0.0493
0.3565 -0.0493
0.3664 -0.0493
0.3763 -0.0493
0.3862 -0.0492
0.3961 -0.0491
0.4060 -0.0490
0.4159 -0.0488
0.4258 -0.0486
0.4357 -0.0484
0.4456 -0.0481
0.4555 -0.0478
0.4654 -0.0474
0.4753 -0.0470
0.4852 -0.0465
0.4951 -0.0460
0.5051 -0.0455
0.5150 -0.0449
0.5249 -0.0442
0.5348 -0.0435
0.5447 -0.0427
0.5546 -0.0418
0.5645 -0.0408
0.5744 -0.0397
0.5843 -0.0386
0.5942 -0.0374
0.6041 -0.0362
0.6140 -0.0350
0.6239 -0.0337
0.6338 -0.0324
0.6437 -0.0310
0.6536 -0.0296
0.6635 -0.0281
0.6734 -0.0266
0.6833 -0.0252
0.6932 -0.0236
0.7031 -0.0220
0.7130 -0.0204
0.7229 -0.0188
0.7328 -0.0172
0.7427 -0.0156
0.7526 -0.0141
0.7625 -0.0125
0.7724 -0.0110
0.7823 -0.0095
0.7922 -0.0080
0.8021 -0.0067
0.8120 -0.0055
0.8219 -0.0045
0.8318 -0.0035
0.8418 -0.0026
0.8517 -0.0018
0.8616 -0.0012
0.8715 -0.0007
0.8814 -0.0004
0.8913 -0.0003
0.9012 -0.0004
0.9111 -0.0007
0.9210 -0.0012
0.9309 -0.0020
0.9408 -0.0030
0.9507 -0.0042
0.9606 -0.0055
0.9705 -0.0072
0.9804 -0.0092
0.9903 -0.0115
0.9928 -0.0119
0.9951 -0.0121
0.9975 -0.0119
1.0000 -0.0114]
However, this piece of code is executed for multiple data sets over numerous iterations, which makes it a slow piece of code.
Is there any alternative to using ismember, or a quicker way to do this? Unfortunately I am not good at programming.

If you use logical indexing, it will be faster:
% create a logical index of the rows that equal [0 0]
index = sum(AFdata' == 0) == 2;
% clean AFdata
AFdata(index,:) = [];

This is a minor improvement on obchardon's answer. There is no need to transpose the data set; use the dimension argument of sum or all instead. find is unnecessary; use logical indexing instead. Using all instead of sum also makes the comparison unnecessary.
index = all(AFdata == 0, 2);
AFdata(index,:) = [];
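To check the speedup on your own data, here is a minimal benchmark sketch (it assumes a MATLAB version with timeit, R2013b or later, and that AFdata is already in the workspace; the exact factor will depend on the size of AFdata):
% compare the row-wise ismember call against plain logical indexing
f1 = @() ismember([0 0; 0 0], AFdata, 'rows');
f2 = @() all(AFdata == 0, 2);
t1 = timeit(f1);  % runtime of the ismember approach
t2 = timeit(f2);  % runtime of the logical-indexing approach
fprintf('ismember: %.6f s, logical indexing: %.6f s\n', t1, t2);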

Related

Having trouble calculating size for resize-partition

Got a disk with unallocated space before and after a partition. I'm trying to calculate the size that needs to be passed to Resize-Partition. It's probably easier to explain the problem through code.
Output 1 being wrong is understandable, since there is unallocated space before the partition. Not too sure whether $partition.Offset can be reliably used...
Oh, and we can't use Get-PartitionSupportedSize because it seems to mess with some of our other operations. Diskpart is also a no-go, since this runs a lot and we want to minimize PowerShell launching other programs.
Basically, is there any alternative to Get-PartitionSupportedSize?
$physicalDisk = Get-Disk -Number 2
$partition = Get-Partition -DiskNumber 2
$maxSize = (Get-PartitionSupportedSize -DiskNumber 2).SizeMax
$calcSize = $partition.Size + ($physicalDisk.Size - $physicalDisk.AllocatedSize)
"Output 1"
$maxSize - $calcSize
$calcSize = $partition.Size + ($physicalDisk.Size - $physicalDisk.AllocatedSize) - $partition.Offset
"Output 2"
$maxSize - $calcSize
$calcSize = $physicalDisk.AllocatedSize + $physicalDisk.LargestFreeExtent
"Output 3"
$maxSize - $calcSize
$calcSize = $partition.Size + $physicalDisk.LargestFreeExtent
"Output 4"
$maxSize - $calcSize
Results:
Output 1
-132137984
Output 2
3128320
Output 3
-1065472
Output 4
2079744

How to solve a simple mixing operation in Gekko?

I am trying to solve a simple mixing operation in Gekko. The mixer mx takes two inlet streams, Feed1 and Feed2. The expected result is that the mass flow of the outlet stream mx.outlet should be the sum of the mass flows of the inlet streams.
Here is what I have tried.
from gekko import GEKKO, chemical
m = GEKKO(remote=False)
f = chemical.Flowsheet(m)
P = chemical.Properties(m)
c1 = P.compound('Butane')
c2 = P.compound('Propane')
feed1 = f.stream()
m_feed1 = f.massflows(sn= feed1.name)
m_feed1.mdot = 200
m_feed1.mdoti = [50,150]
feed2= f.stream()
m_feed2 = f.massflows(sn= feed2.name)
m_feed2.mdot = 200
m_feed2.mdoti = [50,150]
mx = f.mixer(ni=2)
mx.inlet = [feed1.name,feed2.name]
m.options.SOLVER = 1
mf = f.massflows(sn = mx.outlet)
m.solve()
The code runs successfully. However, mf.mdot seems to output an incorrect value, [-1.8220132454e-06]; the expected value is 400. Any help on what is wrong with my code?
Here is source code that works for this mixing application:
from gekko import GEKKO, chemical
import json
m = GEKKO(remote=False)
f = chemical.Flowsheet(m)
P = chemical.Properties(m)
# define compounds
c1 = P.compound('Butane')
c2 = P.compound('Propane')
# create feed streams
feed1 = f.stream(fixed=False)
feed2 = f.stream(fixed=False)
# create massflows objects
m_feed1 = f.massflows(sn=feed1.name)
m_feed2 = f.massflows(sn=feed2.name)
# create mixer
mx = f.mixer(ni=2)
# connect the feed streams to the mixer inlets
f.connect(feed1,mx.inlet[0])
f.connect(feed2,mx.inlet[1])
m.options.SOLVER = 1
mf = f.massflows(sn = mx.outlet)
# specify mass inlet flows
mi = [50,150]
for i in range(2):
    m.fix(m_feed1.mdoti[i],val=mi[i])
    m.fix(m_feed2.mdoti[i],val=mi[i])
# fix pressure and temperature
m.fix(feed1.P,val=101325)
m.fix(feed2.P,val=101325)
m.fix(feed1.T,val=300)
m.fix(feed2.T,val=305)
m.solve(disp=True)
# print results
print(f'The total massflow out is {mf.mdot.value}')
print('')
# get additional solution information
with open(m.path+'//results.json') as fid:  # a new name avoids shadowing the flowsheet f
    r = json.load(fid)
for name, val in r.items():
    print(f'{name}={val[0]}')
Below is the solver output. This will only work with APM 0.9.1 and Gekko v0.2.3 (release coming Aug 2019). The thermo and flowsheet object libraries were released with v0.2.2 and there are several features that are still under development. The next release should resolve many of them.
----------------------------------------------------------------
APMonitor, Version 0.9.1
APMonitor Optimization Suite
----------------------------------------------------------------
--------- APM Model Size ------------
Each time step contains
Objects : 6
Constants : 0
Variables : 19
Intermediates: 0
Connections : 44
Equations : 2
Residuals : 2
Number of state variables: 14
Number of total equations: - 14
Number of slack variables: - 0
---------------------------------------
Degrees of freedom : 0
----------------------------------------------
Steady State Optimization with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 3.86642E-16 1.99000E+02
1 4.39087E-18 1.11937E+01
2 8.33448E-19 6.05819E-01
3 1.84640E-19 1.62783E-01
4 2.91981E-20 7.21250E-02
5 1.55439E-21 2.28110E-02
6 5.51232E-24 1.21437E-03
7 7.03139E-29 4.30650E-06
8 7.03139E-29 4.30650E-06
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.0469 sec
Objective : 0.
Successful solution
---------------------------------------------------
v1 not found in results file
The total massflow out is [400.0]
time=0.0
feed1.h=44154989.486
feed1.x[2]=0.79815448476
feed1.vdot=104.9180373
feed1.dens=0.040621756423
feed1.c[1]=0.0081993193551
feed1.c[2]=0.032422437068
feed1.mdot=200.0
feed1.y[1]=0.25
feed1.y[2]=0.75
feed1.sfrc=0.0
feed1.lfrc=0.0
feed1.vfrc=1.0
feed2.h=44552246.421
feed2.x[2]=0.79815448476
feed2.vdot=106.66667125
feed2.dens=0.03995582599
feed2.c[1]=0.0080649042837
feed2.c[2]=0.031890921707
feed2.mdot=200.0
feed2.y[1]=0.25
feed2.y[2]=0.75
feed2.sfrc=0.0
feed2.lfrc=0.0
feed2.vfrc=1.0
mixer5.outlet.t=381.10062836
mixer5.outlet.h=44353617.96
mixer5.outlet.ndot=8.5239099109
mixer5.outlet.x[1]=0.20184551524
mixer5.outlet.x[2]=0.79815448476
mixer5.outlet.vdot=1.5797241143
mixer5.outlet.dens=5.5635215396
mixer5.outlet.c[1]=1.0891224437
mixer5.outlet.c[2]=4.3066994177
mixer5.outlet.mdot=400.0
mixer5.outlet.y[1]=0.25
mixer5.outlet.y[2]=0.75
mixer5.outlet.sfrc=0.0
mixer5.outlet.lfrc=1.0
mixer5.outlet.vfrc=0.0
v2=300.0
v3=4.2619549555
v4=0.20184551524
v5=0.79815448476
v6=101325.0
v7=305.0
v8=4.2619549555
v9=0.20184551524
v10=0.79815448476
v11=200.0
v12=50.0
v13=150.0
v14=200.0
v15=50.0
v16=150.0
v17=400.0
v18=100.0
v19=300.0

The most efficient way to read an unformatted file

I am processing 100,000 files with Fortran. The data were generated on an HPC system using MPI I/O. So far I have only figured out the following way to read the raw data, which is not efficient. Is it possible to read a whole slice ut_yz(:,J,K) at a time instead of reading the values one by one? Thanks.
The old code is as follows, and its efficiency is not high:
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED',&
     ACCESS='DIRECT', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
  DO J=1,ny
    DO I=1,nxt
      READ(10,REC=COUNT) ut_yz(I,J,K)
      COUNT = COUNT + 1
    ENDDO
  ENDDO
ENDDO
CLOSE(10)
What I would like is something like this:
OPEN(10,FILE=trim(filename)//".dat",FORM='UNFORMATTED', RECL=4, STATUS='OLD')
!,CONVERT='big_endian'
COUNT = 1
DO K=1,nz
  DO J=1,ny
    READ(10,REC=COUNT) TEMP(:)
    COUNT = COUNT + 153
    ut_yz(:,J,K)=TEMP(:)
  ENDDO
ENDDO
CLOSE(10)
However, it always fails. Can anyone comment on this? Thanks.
A direct-access read reads a single record, if I am not mistaken. Thus, in your new version you need to increase the record length accordingly:
inquire(iolength=rl) ut_yz(:,1,1)
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', recl=rl, status='OLD', action='READ')
count = 1
do k=1,nz
  do j=1,ny
    read(10, rec=count) ut_yz(:,j,k)
    count = count + 1
  end do
end do
close(10)
Of course, in this example you could also read the complete array at once, which should be the fastest option:
inquire(iolength=rl) ut_yz
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', recl=rl, status='OLD', action='READ')
read(10, rec=1) ut_yz
close(10)
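As an aside, beyond what the answer above covers: if the file is one contiguous block of values without Fortran record markers, which is typical of MPI I/O output, unformatted stream access (Fortran 2003) avoids the record-length bookkeeping entirely. A minimal sketch, assuming ut_yz is already sized to match the file:
open(10, file=trim(filename)//'.dat', form='UNFORMATTED', access='STREAM', &
     status='OLD', action='READ')
read(10) ut_yz   ! one request for the whole array
close(10)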

Parallelize row-operation in Julia

Coming from an R background, I have been exploring the parallel capabilities of Julia. My objective is to replicate the performance of mclapply (parallel apply) in R.
**The problem:**
I iterate a function over the rows of a DataFrame that looks like this:
for i in 1:_nrow  # _nrow = number of rows of my DataFrame
    lat1 = Raw_Data[i,"lat1"]
    lat2 = Raw_Data[i,"lat2"]
    lon1 = Raw_Data[i,"long1"]
    lon2 = Raw_Data[i,"long2"]
    iata1 = Raw_Data[i,"iata1"]
    iata2 = Raw_Data[i,"iata2"]
    a[i] = [(iata1::String, iata2::String, trunc(i,2), get_intermediary_points(lat1,lon1,lat2,lon2,j)) for j in 0:.1:1]
end
Now, as a step toward parallelization, I can also create an anonymous function that does quite similar work, running the calculation on each chunk of my DataFrame:
Raw_Data["selector"] = rand(1:nproc,_nrow)  # define how I split my DataFrame: one chunk per proc
B = by(Raw_Data,:selector,intermediary_points)
Is there a way to speed up the calculations with a parallelized "by"? Otherwise, please suggest a good alternative.
Thanks!
Note: this is what my DataFrame Raw_Data looks like:
6x7 DataFrame:
iata1 lat1 long1 iata2 lat2 long2
[1,] 1 "ELH" 0.444616 -1.3384 "FLL" 0.455079 -1.39891
[2,] 2 "BCN" 0.720765 0.0362729 "UFA" 0.955274 0.976218
[3,] 3 "ACE" 0.505053 -0.237426 "VCE" 0.794214 0.215582
[4,] 4 "PVG" 0.543669 2.12552 "LZH" 0.425277 1.91171
[5,] 5 "CDG" 0.855379 0.0444809 "VLC" 0.689233 -0.00835298
[6,] 6 "HLD" 0.858699 2.08915 "CGQ" 0.765906 2.18718
I figured out what happened: I hadn't made all the inputs available to all processors. Basically, if you are running into the same problem:
All functions should have @everywhere in front of them.
All packages should be loaded on every worker: @everywhere using DataFrames.
All parameters should also be declared with @everywhere in front of them (see the sketch below).
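A minimal sketch of the pattern (in current Julia these tools live in the Distributed standard library; the worker count and the constant STEPS are just illustrations):
using Distributed
addprocs(4)                        # start 4 worker processes

@everywhere using DataFrames       # load packages on every worker

@everywhere function get_intermediary_points(lat1, lon1, lat2, lon2, j)
    # ... the per-row computation ...
end

@everywhere const STEPS = 0:0.1:1  # parameters must be visible everywhere too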
Now, that's a lot of work. You can follow http://julia.readthedocs.org/en/latest/manual/parallel-computing/ to find stand-alone packages that simplify the process a bit.
Cheers.

idata.frame: Why error "is.data.frame(df) is not TRUE"?

I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp,
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
With idata.frame, I get Error: is.data.frame(df) is not TRUE:
library(plyr)
df.median<-ddply(idata.frame(exp),
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
So I thought perhaps it was my data and tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow). But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball),
.(id,year,team),
numcolwise(median),
na.rm=TRUE)
Error: is.data.frame(df) is not TRUE
Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
Given you are working with 'big' data and looking for performance, this seems a perfect fit for data.table.
Specifically, the lapply(.SD, FUN) idiom and the .SDcols argument, combined with by.
Set up the data.table:
library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Identify which columns are numeric:
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase,
fCycle), .SDcols = numeric_columns]
Some benchmarking:
library(rbenchmark)
benchmark(data.table = DT[, lapply(.SD, median), by = list(groupname, starttime,
fPhase, fCycle), .SDcols = numeric_columns],
plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle), numcolwise(median), na.rm = TRUE),
idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(inadist = median(x$inadist),
smldist = median(x$smldist), lardist = median(x$lardist), inadur = median(x$inadur),
smldur = median(x$smldur), lardur = median(x$lardur), emptyct = median(x$emptyct),
entct = median(x$entct), inact = median(x$inact), smlct = median(x$smlct),
larct = median(x$larct), na.rm = TRUE)),
aggregate = aggregate(exp[, numeric_columns],
exp[, c("groupname", "starttime", "fPhase", "fCycle")],
median),
replications = 5)
## test replications elapsed relative user.self
## 4 aggregate 5 5.42 1.789 5.30
## 1 data.table 5 3.03 1.000 3.03
## 3 idataframe 5 11.81 3.898 11.77
## 2 plyr 5 9.47 3.125 9.45
Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
In any case, this cuts about 20% off the time (on my system):
# na.rm is passed to each median() call so that missing values are ignored
system.time(df.median <- ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(
    inadist = median(x$inadist, na.rm = TRUE),
    smldist = median(x$smldist, na.rm = TRUE),
    lardist = median(x$lardist, na.rm = TRUE),
    inadur  = median(x$inadur,  na.rm = TRUE),
    smldur  = median(x$smldur,  na.rm = TRUE),
    lardur  = median(x$lardur,  na.rm = TRUE),
    emptyct = median(x$emptyct, na.rm = TRUE),
    entct   = median(x$entct,   na.rm = TRUE),
    inact   = median(x$inact,   na.rm = TRUE),
    smlct   = median(x$smlct,   na.rm = TRUE),
    larct   = median(x$larct,   na.rm = TRUE))
))
Shane asked you in another post whether you could cache the results of your script. I don't really know your workflow, but it may be best to set up a cron job to run this and store the results, daily/hourly, whatever.
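For example, a minimal sketch of that caching pattern (the file name is illustrative):
# run from a cron job: compute the summary once and cache it to disk
df.median <- ddply(exp, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
saveRDS(df.median, "df_median.rds")

# later sessions reload the cached result instead of recomputing
df.median <- readRDS("df_median.rds")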
