Importing a Mathematica Matrix - wolfram-mathematica

I am trying to import a very large matrix file into a variable using the code below. The matrix is of the form {{A,B,C},{D,E,D},{G,F,A}}
inputseq2s = Import[ "C:\Users\jdd0758\Desktop\WaveformsOnly.txt","Data"]
However, when I import it like this, it is not read as if it had been entered as:
inputseq2s = {{A,B,C},{D,E,D},{G,F,A}}
Extra commas are added to the input, and it will not work in my algorithm. How can I make this import work properly?

Related

How to use kde_kws parameters for seaborn.histplot()?

I am trying to use sns.histplot() instead of sns.distplot() since I got the following message in colab:
FutureWarning: distplot is a deprecated function and will be removed
in a future version. Please adapt your code to use either displot (a
figure-level function with similar flexibility) or histplot (an axes-level function for histograms).
Code:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
sns.histplot(df['tip'], kde=True, kde_kws={'fill' : True});
I got an error when passing kde_kws parameters inside sns.histplot():
TypeError: __init__() got an unexpected keyword argument 'fill'
From the documentation, kde_kws= is intended to pass arguments "that control the KDE computation, as in kdeplot()." It is not entirely explicit which arguments those are, but they seem to be the ones like bw_method= and bw_adjust= that change the way the KDE is computed, rather than how it is displayed. If you want to change the appearance of the KDE plot, you can use line_kws=, but, as the name implies, the KDE is represented only by a line and therefore cannot be filled.
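For example, a small illustration of that distinction with the same tips data (the specific bw_adjust and linewidth values are just placeholders):
import seaborn as sns

df = sns.load_dataset('tips')
# kde_kws tweaks how the KDE is computed, line_kws tweaks how its line is drawn
sns.histplot(df['tip'], kde=True, kde_kws={'bw_adjust': 2}, line_kws={'linewidth': 3})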
If you want both a histogram and a filled KDE, you need to combine histplot() and kdeplot() on the same axes:
sns.histplot(df['tip'], stat='density')
sns.kdeplot(df['tip'], fill=True)
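A minimal runnable sketch of that combination, assuming the same tips example data as above:
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('tips')
ax = sns.histplot(df['tip'], stat='density')  # histogram on a density scale
sns.kdeplot(df['tip'], fill=True, ax=ax)      # filled KDE on the same axes
plt.show()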

How to use the DBSCAN algorithm for a list of points in Python

I am new to image processing and Python coding.
I have detected a number of features in an image and have their respective pixel locations placed in a list:
My_list = [(x1,y1),(x2,y2),......,(xn,yn)]
I would like to use the DBSCAN algorithm to form clusters from these points.
I am currently using sklearn.cluster to import the built-in DBSCAN function for Python.
If the current format of the points is not compatible, I would like to know which format is.
The error I currently get with this format:
C:\Python\python.exe "F:/opencv_files/dbscan.py"
Traceback (most recent call last):
  File "F:/opencv_files/dbscan.py", line 83, in <module>
    db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # metric=X)
  File "C:\Python\lib\site-packages\sklearn\cluster\dbscan_.py", line 282, in fit
    X = check_array(X, accept_sparse='csr')
  File "C:\Python\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Your data is a list of tuples. There is nothing in this structure that prevents you from doing crazy things with it, such as having tuples of different lengths in there. It is also a very slow and memory-inefficient way of keeping the data, because every value is boxed as a Python object.
Just call data = numpy.array(data) to convert your data into an efficient multidimensional numeric array. This array will then have a shape.
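A minimal sketch of that conversion with a few made-up points (the eps and min_samples values are placeholders to tune for your data):
import numpy as np
from sklearn.cluster import DBSCAN

my_list = [(1.0, 2.0), (1.1, 2.1), (8.0, 9.0), (8.1, 9.2)]  # stand-in for the detected pixel locations
X = np.array(my_list)                    # shape (n_points, 2): one row per (x, y) point
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(X.shape)                           # (4, 2)
print(db.labels_)                        # one cluster label per point, -1 means noise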

Importing images to prep for keras

I am trying to import a bunch of images and get them ready for Keras. The goal here is to have an array with dimensions (length, 160, 320, 3). As you can see, my reshape call is commented out. The print(images.shape) line returns (8037,), and I am not sure how to proceed to get the right array dimensions. For reference, the first column in the CSV file is a list of paths to the images in question. The code below combines the file name of the image inside the folder with the path to the folder.
When I run the commented-out reshape call I get the following error: "ValueError: cannot reshape array of size 8037 into shape (8037,160,320,3)"
import csv
import cv2
import numpy as np

f = open('/Users/username/Desktop/data/driving_log.csv')
csv_f = csv.reader(f)
m = []
for row in csv_f:
    m.append(row)

images = []
for i in range(len(m)):
    img = m[i][1]          # image file name from the CSV row
    img = img.lstrip()
    path = '/Users/username/Desktop/data/'
    img = path + img
    image = cv2.imread(img)
    images.append(image)

item_num = len(images)
images = np.array(images)
#images = np.array(images).reshape(item_num, 160, 320, 3)
print(images.shape)  # returns (8037,)
Can you print the shape of an image before it is appended to images, to verify it is what you expect? Even better would be adding an imshow in the loop to make sure you're loading the images you expect (you only need to do this for one or two). cv2.imread does not throw an error if there isn't an image at the file path you give it, so your array might be all None, which would yield the exact behavior you've described.
If that is the problem, check the img variable and make sure it's pointing exactly where you want it to.
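A small sketch of that check, assuming the same m, path and images variables as in the question's loop:
import cv2

images = []
for i in range(len(m)):
    img = path + m[i][1].lstrip()
    image = cv2.imread(img)
    if image is None:                  # cv2.imread returns None for a bad or missing path
        print('could not read:', img)
        continue
    images.append(image)               # each image should have shape (160, 320, 3)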
Turns out it was including the first line of the CSV file, which was the header row. After I sorted that out it ran great and gave me the requested shape.
images = []
for i in range(1, len(m)):   # start at 1 to skip the CSV header row
    img = m[i][1]
    img = img.lstrip()
    path = '/Users/user/Desktop/data/'
    img = path + img
    image = cv2.imread(img)
    images.append(image)

Cross-validation of a dataset separated into files

The dataset I have is split across different files, grouped into samples that belong together, i.e., they were created under similar conditions at a similar time.
The balance of the train/test split is important, so the samples from one file have to go entirely into train or entirely into test; they cannot be separated. This makes KFold not straightforward to use in my scikit-learn code.
Right now, I am doing something similar to leave-one-out (LOO):
train ~> cat ./dataset/!(1.txt)
test ~> cat ./dataset/1.txt
which is not comfortable and not very useful if I want to build test folds from several files and run a "real" CV.
How would it be possible to do a proper CV to check for real overfitting?
Looking at this answer, I realized that pandas can concatenate dataframes. I checked that this process is 15-20% slower than the cat command line, but it makes it possible to build folds as I was expecting.
Anyway, I am quite sure there must be a better way than this one:
import glob
import numpy as np
import pandas as pd
from sklearn.cross_validation import KFold

allFiles = glob.glob("./dataset/*.txt")
kf = KFold(len(allFiles), n_folds=3, shuffle=True)
for train_files, cv_files in kf:
    dataTrain = pd.concat((pd.read_csv(allFiles[idTrain], header=None) for idTrain in train_files))
    dataTest = pd.concat((pd.read_csv(allFiles[idTest], header=None) for idTest in cv_files))
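For what it is worth, a sketch of the same idea with the current scikit-learn API (sklearn.cross_validation has since been replaced by sklearn.model_selection), assuming the same ./dataset/*.txt layout:
import glob
import pandas as pd
from sklearn.model_selection import KFold

allFiles = sorted(glob.glob("./dataset/*.txt"))
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(allFiles):
    dataTrain = pd.concat(pd.read_csv(allFiles[i], header=None) for i in train_idx)
    dataTest = pd.concat(pd.read_csv(allFiles[i], header=None) for i in test_idx)
    # each file ends up wholly in train or wholly in test, so related samples stay together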

Is it easy to modify this Python code to use pandas, and would it help if I did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine; however, it is very slow. A CSV I tried with 100 million lines took around 28 hours to complete. I did some googling, and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am pretty much a novice when it comes to Python, I am unsure whether using pandas would actually help at all and, if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file: it has 3 columns, the first column is an IP address, the second is a URL, and the third is a timestamp.
import csv

def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next()  # skip header line (Python 2)
        for row in csv_data:
            if len(row) == 3:  # some lines have more/fewer than the 3 expected fields, so skip any malformed data
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp)  # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return(ip_dict)
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas might increase the speed, and would it require a complete rewrite, or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. In your code, you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the CSV by chunks: the arguments skiprows and nrows are your friends, followed by pd.concat.
If it is because IPs/URLs are repeated, then you will want to turn the IPs and URLs from normal columns into indices: parse by chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
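A rough sketch of how those hints could fit together, using pandas' chunksize argument instead of manual skiprows/nrows bookkeeping, a hypothetical log.csv file name, and pd.to_datetime in place of the question's convert_time:
import pandas as pd

colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
chunks = []
for chunk in pd.read_csv("log.csv", names=colnames, chunksize=1000000):
    # keep only complete, non-redundant rows, as above
    chunk = chunk[~chunk.current_timestamp.isnull() & chunk.dummy.isnull()]
    chunks.append(chunk.set_index(["current_IP", "URI"]).sort_index())

df = pd.concat(chunks)
# seconds since the epoch, then a vectorized std per IP/URL pair (the df.groupby() hint);
# note pandas uses ddof=1 by default, while numpy.std uses ddof=0
epoch = pd.to_datetime(df["current_timestamp"]).astype("int64") // 10**9
stds = epoch.groupby(level=["current_IP", "URI"]).std()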
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your numbers, you read 100000000. / 28 / 60 / 60 ≈ 1000 lines per second. Not really slow, but I believe that just reading such a big file can cause a problem.
So take a look at this performance comparison of how to read a huge file. Basically, one suggestion there is that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something
can give you something like a 3x read boost. I also suggest you try a defaultdict instead of your "if the key is not in the dict, create an empty list, otherwise append" pattern.
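A minimal sketch of that defaultdict suggestion, assuming the same current_ip, URI and epoch_time variables as in the question:
from collections import defaultdict

ip_dict = defaultdict(lambda: defaultdict(list))

# inside the CSV loop, the two membership checks collapse to a single line:
current_ip, URI, epoch_time = "10.0.0.1", "/index.html", 1464551430  # example values
ip_dict[current_ip][URI].append(epoch_time)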
And last, not related to Python specifically: working in data analysis, I have found an amazing tool for working with csv/json data. It is csvkit, which allows you to manipulate CSV data with ease.
In addition to what Salvador Dali said in his answer: if you want to keep as much of the current code of your script as possible, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)
