I am trying to read fluid levels in a column from a photograph I took, to see if I can automate the data collection.
So far, I have been able to use the code below to identify the outline of where the fluid level is.
import numpy as np
import matplotlib.pyplot as plt
import skimage.io
from skimage.filters import roberts
%matplotlib inline

# Use a raw string so the backslashes in the Windows path are not treated as escape sequences.
perm = skimage.io.imread(r'C:\Users\Spencer\Desktop\perm2_crop.jpg')

# Edge-detect the blue channel with the Roberts cross operator.
edge_roberts = roberts(perm[:, :, 2])
plt.imshow(edge_roberts, cmap=plt.cm.gray)
What I need to figure out now is how to identify the break, and then how to translate that into a data value that I can scale to the values on the column.
Any ideas about what packages or methods I would use to do this? Any examples would also be appreciated.
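One possible approach (not from the original post, and the calibration numbers below are hypothetical): since the fluid boundary should show up as a strong horizontal edge, you could sum the edge image along each row, take the row with the largest response, and linearly interpolate between two hand-measured graduations on the column:

import numpy as np

# Total edge energy per image row; the fluid boundary should dominate.
row_strength = edge_roberts.sum(axis=1)
level_row = int(np.argmax(row_strength))

# Hypothetical calibration: pixel rows and values of two known graduations.
top_row, top_value = 50, 100.0
bottom_row, bottom_value = 900, 0.0
value = top_value + (level_row - top_row) * (bottom_value - top_value) / (bottom_row - top_row)
print(value)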
I am trying to use multiprocessing Pool.map() to speed up my code. In the function where computation occurs for each process, I reference an xarray.DataArray that was opened using xarray.open_rasterio(). However, I receive errors similar to this:
rasterio.errors.RasterioIOError: Read or write failed. /net/home_stu/cfite/data/CDL/2019/2019_30m_cdls.img, band 1: IReadBlock failed at X offset 190, Y offset 115: Unable to open external data file: /net/home_stu/cfite/data/CDL/2019/
I assume this is some issue related to the same file being referenced simultaneously while another worker is opening it? I use DataArray.sel() to select small portions of the raster grid to work with, since the entire .img file is way too big to load all at once. I have tried opening the .img file in the main code and just referencing it in my function, and I've tried opening/closing it inside the function passed to Pool.map(), and I receive errors like this regardless. Is my file corrupted, or will I just not be able to work with this file using multiprocessing Pool? I am very new to working with multiprocessing, so any advice is appreciated. Here is an example of my code:
import pandas as pd
import xarray as xr
import numpy as np
from multiprocessing import Pool

def select_grid(x, y):
    ds = xr.open_rasterio('myrasterfile.img')  # opening large file with xarray
    grid = ds.sel(x=slice(x, x + 50), y=slice(y, y + 50))
    ds.close()
    return grid

def myfunction(row):
    x = row.x
    y = row.y
    mygrid = select_grid(x, y)
    my_calculation = mygrid.sum()  # example calculation, but really I am doing multiple calculations
    my_calculation.to_csv('filename.csv')  # note: every worker writing the same filename will clobber the output

if __name__ == '__main__':  # guard needed when starting worker processes
    with Pool(30) as p:
        p.map(myfunction, list_of_df_rows)
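One common workaround here (a sketch, not a confirmed fix for this exact error) is to give each worker its own file handle via a Pool initializer, so that a single handle is never shared across processes:

import xarray as xr
from multiprocessing import Pool

ds = None  # each worker process gets its own handle

def init_worker():
    global ds
    ds = xr.open_rasterio('myrasterfile.img')  # opened once per worker

def myfunction(row):
    grid = ds.sel(x=slice(row.x, row.x + 50), y=slice(row.y, row.y + 50))
    return float(grid.sum())  # return results instead of writing one shared file

if __name__ == '__main__':
    with Pool(30, initializer=init_worker) as p:
        results = p.map(myfunction, list_of_df_rows)  # list_of_df_rows as in the question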
I can find lots of answers about how to change Seaborn's figure size. I want to know: what is the default figure size?
I believe seaborn uses matplotlib's default parameters when plotting axes, so you can check the figsize with matplotlib.rcParams.
import matplotlib
matplotlib.rcParams['figure.figsize']
# [12.0, 7.0]

import seaborn as sns
import pandas as pd

# An axes-level plot picks up the rcParams figure size.
df = pd.DataFrame(columns=['x', 'y'])
ax = sns.scatterplot(data=df, x="x", y="y")
ax.figure.get_size_inches()
# array([12., 7.])
For the figure-level plotting functions, which create their own figure, the size is instead defined by the function itself.
For example, relplot, displot, ... have default parameters of height=5 (inches) and aspect=1, which translates into a figsize of [5, 5] for a single subplot/facet, and proportionally more for several cols/rows in the FacetGrid.
pairplot has a default of height=2.5 and jointplot of height=6.
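A quick check of that arithmetic (using seaborn's built-in tips dataset; two column facets at height=5 and aspect=1 should give a 10 by 5 inch figure):

import seaborn as sns

tips = sns.load_dataset("tips")
g = sns.relplot(data=tips, x="total_bill", y="tip", col="sex")
g.fig.get_size_inches()
# array([10., 5.])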
The ends of the y-axis are cut off halfway when plotting a confusion matrix from a pandas DataFrame?
This is what I get:
I used the code from here, How can I plot a confusion matrix?, with a pandas DataFrame:
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

array = [[13, 1, 1, 0, 2, 0],
         [3, 9, 6, 0, 1, 0],
         [0, 0, 16, 2, 0, 0],
         [0, 0, 0, 13, 0, 0],
         [0, 0, 0, 0, 15, 0],
         [0, 0, 1, 0, 0, 15]]

df_cm = pd.DataFrame(array, range(6), range(6))
#plt.figure(figsize=(10, 7))
sn.set(font_scale=1.4)  # for label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16})  # font size
I solved the problem, and I think this post explains why it happens.
Simply put, matplotlib 3.1.1 broke seaborn heatmaps; you can solve it by downgrading to matplotlib 3.1.0.
As suggested by sikisis, the following solved my problem:

pip install matplotlib==3.1.0
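If downgrading is not an option, another widely used workaround for this matplotlib 3.1.1 regression (worth verifying on your own version) is to reset the y-limits so the first and last rows are not clipped:

ax = sn.heatmap(df_cm, annot=True, annot_kws={"size": 16})
ax.set_ylim(len(df_cm), 0)  # undo the clipped limits
plt.show()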
The following code is used to preprocess text with a custom lemmatizer function:
%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.utils import simple_preprocess, lemmatize  # lemmatize requires gensim < 4.0 (it wraps the Pattern library)
from gensim.parsing.preprocessing import STOPWORDS

STOPWORDS = list(STOPWORDS)

def preprocessor(s):
    result = []
    for token in lemmatize(s, stopwords=STOPWORDS, min_length=2):
        # lemmatize() returns bytes like b'run/VB'; keep only the lemma.
        result.append(token.decode('utf-8').split('/')[0])
    return result

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
%%time
X_train, X_test, y_train, y_test = train_test_split([preprocessor(x) for x in data.text],
                                                    data.label, test_size=0.2, random_state=0)
# 10.8 seconds
Question:
Can the speed of the lemmatization process be improved?
On a large corpus of about 80,000 documents, it currently takes about two hours. The lemmatize() function seems to be the main bottleneck, as a gensim function such as simple_preprocess is quite fast.
Thanks for your help!
You may want to refactor your code to make it easier to time each portion separately. lemmatize() might be part of your bottleneck, but other significant contributors might also be: (1) composing large documents, one-token-at-a-time, via list .append(); (2) the utf-8 decoding.
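A minimal sketch of that kind of split timing (variable names taken from the question; this only separates the lemmatize() call from the decode/split step):

import time

texts = list(data.text)

t0 = time.perf_counter()
raw = [lemmatize(s, stopwords=STOPWORDS, min_length=2) for s in texts]
t1 = time.perf_counter()
tokens = [[tok.decode('utf-8').split('/')[0] for tok in doc] for doc in raw]
t2 = time.perf_counter()

print(f"lemmatize: {t1 - t0:.1f}s, decode/split: {t2 - t1:.1f}s")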
Separately, the gensim lemmatize() relies on the parse() function from the Pattern library; you could try an alternative lemmatization utility, like those in NLTK or Spacy.
Finally, as lemmatization may be an inherently costly operation, and it might be the case that the same source data gets processed many times in your pipeline, you might want to engineer your process so that the results are re-written to disk, then re-used on subsequent runs – rather than always done "in-line".
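A sketch of that cache-to-disk pattern (the cache filename is hypothetical):

import os
import pickle

CACHE = 'lemmatized.pkl'  # hypothetical cache file

if os.path.exists(CACHE):
    with open(CACHE, 'rb') as f:
        processed = pickle.load(f)
else:
    processed = [preprocessor(x) for x in data.text]
    with open(CACHE, 'wb') as f:
        pickle.dump(processed, f)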
My dataset is split across different files, grouped into samples that belong together, i.e., they were created under similar conditions at a similar time.
The balance of the train/test split is important, and the samples in a group must all go to train or to test; they cannot be separated. So KFold is not simple to use in my scikit-learn code.
Right now, I am using something similar to leave-one-out (LOO), doing something like:
train ~> cat ./dataset/!(1.txt)
test ~> cat ./dataset/1.txt
This is not comfortable and not very useful if I want test folds made of several files, i.e., a "real" CV.
How would it be possible to make a good CV to check for real overfitting?
Looking at this answer, I realized that pandas can concatenate dataframes. I checked that the process is 15-20% slower than the cat command line, but it makes it possible to do the folds I was expecting.
Anyway, I am quite sure that there must be a better way than this one:
import glob
import pandas as pd
from sklearn.model_selection import KFold  # sklearn.cross_validation has been removed

allFiles = glob.glob("./dataset/*.txt")

kf = KFold(n_splits=3, shuffle=True)
for train_files, cv_files in kf.split(allFiles):
    # Concatenate whole files, so related samples stay in the same split.
    dataTrain = pd.concat(pd.read_csv(allFiles[idTrain], header=None) for idTrain in train_files)
    dataTest = pd.concat(pd.read_csv(allFiles[idTest], header=None) for idTest in cv_files)
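For what it's worth, newer scikit-learn versions also ship GroupKFold, which keeps all rows sharing a group label in the same fold; a sketch using the source file as the group (the column handling here is hypothetical):

import glob
import pandas as pd
from sklearn.model_selection import GroupKFold

# Load everything once, tagging each row with the file it came from.
frames = []
for path in glob.glob("./dataset/*.txt"):
    df = pd.read_csv(path, header=None)
    df["group"] = path
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(data, groups=data["group"]):
    dataTrain, dataTest = data.iloc[train_idx], data.iloc[test_idx]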