spark similarities between text sentences - apache-spark-mllib

I'm trying to find similarity between text messages (about 1 million text message), in my implementation each line represents an entry.
In order to calculate similarity between those texts we adopt tfidf and columnSimilarities
Below is the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
//import scala.math.Ordering
//import org.apache.spark.RangePartitioner
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
case (row, idx) => IndexedRow(idx, row)}))
val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
new RowMatrix(transposed.rows
.map(idxRow => (idxRow.index, idxRow.vector))
.sortByKey().map(_._2))
}
// split word based on spaces and special characters
val documents = sc.textFile("./test1").map(_.split(" |\\,|\\?|\\-|\\+|\\*|\\(|\\)|\\[|\\]|\\{|\\}|\\<|\\>|\\/|\\;|\\.|\\:|\\=|\\^|\\|").filter(_.nonEmpty).toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
println(tf.getNumPartitions)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
val mat = new RowMatrix(tfidf)
// transpose matrix to get result between row (not between column)
val sim = transposeRowMatrix(mat).columnSimilarities()
val trdd = sim.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
println(trdd.getNumPartitions)
// to descease write to file time
val transformedRDD = trdd.repartition(50)
println(transformedRDD.getNumPartitions)
transformedRDD.cache()
transformedRDD.saveAsTextFile("output")*
The problem is when the number of similar messages increases in file, the similarity decreases.
e.g.
let's assume that we have the file below:
hello world
hello world 123
how is every thing
we are testing
this is a test
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
The output of the former command is:
%cat output/part-000*
5.0,6.0,0.7373482646933146
0.0,1.0,0.8164965809277261
4.0,5.0,0.053913565847778636
1.0,5.0,0.13144171271256438
2.0,4.0,0.16888723050548915
4.0,6.0,0.052731941041749664
Each line in the output represents the similarity between two lines as follows:
"lineX -1", "lineY -1", "similarity"
The output showing the similarity between the last 2 lines is 5.0,6.0,0.7373482646933146, which is fine.
The two lines are
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
and similarity is 0.7373482646933146
while when the file input is:
hello world
hello world 123
hello world 956248
hello world 2564
how is every thing
we are testing
this is a test
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
corporate code 456-458 you ca also tap on this link to verify corporate.co/8965
corporate code 444-444 you ca also tap on this link to verify corporate.co/4444
the output is:
7.0,10.0,0.4855543123154418
2.0,3.0,0.32317021425463427
6.0,8.0,0.03657892871242232
6.0,10.0,0.03097823353416634
0.0,1.0,0.6661166307685673
7.0,8.0,0.5733398760974173
1.0,2.0,0.37867439463004254
9.0,10.0,0.4855543123154418
0.0,3.0,0.5684806190668547
8.0,9.0,0.6716256614182469
4.0,6.0,0.1903502047647684
8.0,10.0,0.4855543123154418
1.0,3.0,0.37867439463004254
6.0,9.0,0.03657892871242232
7.0,9.0,0.5733398760974173
6.0,7.0,0.03657892871242232
1.0,7.0,0.233827426275723
0.0,2.0,0.5684806190668547
the output between the same lines tested in the first example is: 7.0,8.0,0.5733398760974173
the similarity had decreased from 0.7373482646933146 to 0.5733398760974173 for the sames line
the two lines are:
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
corporate code 134-456 you ca also tap on this link to verify corporate.co/5667
and similarity is 0.5733398760974173
Is there any solution to avoid decreasing in similarity between
sentences when their similar line messages increase in the input? (tfidf could be the problem here? when similar sentence number increases similarity decreases due to tfidf?)
is there any solution to cluster similar messages?
i.e the input above, contains multiple sentences like:
hello world 123
same for the sentences like:
corporate code 123-234 you ca also tap on this link to verify corporate.co/1234
could they be grouped based on similarities output?

Related

How can I export layered drawings from drawio to create "animated" slides in beamer?

When preparing lectures, or conference presentations with beamer, I usually use layered drawings. Then for graphics included in consecutive slides ("frames" in beamer), I simply use different sets of layers.
For graphics created in IPE, I have created a dedicated expallviews.lua script.
Unfortunately, for graphics created with diagrams.net locally run as drawio-desktop, no such automated export of various layers exists. The only way is to manually select the visible layers in GUI and then export consecutive drawings to a set of PDF files.
Is there a more convenient method to solve that problem?
The described problem has been reported in issues 405 and 737 in the drawio-desktop repository.
After reviewing those issues, I have found a method based on automated (instead of a manual via GUI) changing the visibility of layers and exporting such drawings to the set of PDF files. The proposed method is described in the comment to the issue 405. It uses a simple Python script:
#!/usr/bin/python3
"""
This script modifies the visibility of layers in the XML
file with diagram generated by drawio.
It works around the problem of lack of a possibility to export
only the selected layers from the CLI version of drawio.
Written by Wojciech M. Zabolotny 6.10.2022
(wzab01<at>gmail.com or wojciech.zabolotny<at>pw.edu.pl)
The code is published under LGPL V2 license
"""
from lxml import etree as let
import xml.etree.ElementTree as et
import xml.parsers.expat as pe
from io import StringIO
import os
import sys
import shutil
import zlib
import argparse
PARSER = argparse.ArgumentParser()
PARSER.add_argument("--layers", help="Selected layers, \"all\", comma separated list of integers or integer ranges like \"0-3,6,7\"", default="all")
PARSER.add_argument("--layer_prefix", help="Layer name prefix", default="Layer_")
PARSER.add_argument("--outfile", help="Output file", default="output.drawio")
PARSER.add_argument("--infile", help="Input file", default="input.drawio")
ARGS = PARSER.parse_args()
INFILENAME = ARGS.infile
OUTFILENAME = ARGS.outfile
# Find all elements with 'value' starting with the layer prefix.
# Return tuples with the element and the rest of 'value' after the prefix.
def find_layers(el_start):
res = []
for el in el_start:
val = el.get('value')
if val is not None:
if val.find(ARGS.layer_prefix) == 0:
# This is a layer element. Add it, and its name
# after the prefix to the list.
res.append((el,val[len(ARGS.layer_prefix):]))
continue
# If it is not a layer element, scan its children
res.extend(find_layers(el))
return res
# Analyse the list of visible layers, and create the list
# of layers that should be visible. Customize this part
# if you want a more sophisticate method for selection
# of layers.
# Now only "all", comma separated list of integers
# or ranges of integers are supported.
def build_visible_list(layers):
if layers == "all":
return layers
res = []
for lay in layers.split(','):
# Is it a range?
s = lay.find("-")
if s > 0:
# This is a range
first = int(lay[:s])
last = int(lay[(s+1):])
res.extend(range(first,last+1))
else:
res.append(int(lay))
return res
def is_visible(layer_tuple,visible_list):
if visible_list == "all":
return True
if int(layer_tuple[1]) in visible_list:
return True
try:
EL_ROOT = et.fromstring(open(INFILENAME,"r").read())
except et.ParseError as perr:
# Handle the parsing error
ROW, COL = perr.position
print(
"Parsing error "
+ str(perr.code)
+ "("
+ pe.ErrorString(perr.code)
+ ") in column "
+ str(COL)
+ " of the line "
+ str(ROW)
+ " of the file "
+ INFILENAME
)
sys.exit(1)
visible_list = build_visible_list(ARGS.layers)
layers = find_layers(EL_ROOT)
for layer_tuple in layers:
if is_visible(layer_tuple,visible_list):
print("set "+layer_tuple[1]+" to visible")
layer_tuple[0].attrib['visible']="1"
else:
print("set "+layer_tuple[1]+" to invisible")
layer_tuple[0].attrib['visible']="0"
# Now write the modified file
t=et.ElementTree(EL_ROOT)
with open(OUTFILENAME, 'w') as f:
t.write(f, encoding='unicode')
The maintained version of that script, together with a demonstration of its use is also available in my github repository.

Is there any good way to rewrite the edgetpu old code by using pycoral api?

I'm a beginner using coral devboard mini.
I want to start a Smart Bird Feeder project.
https://coral.ai/projects/bird-feeder/
I've been trying to execute the code by referring to
I can't run bird_classify.py.
The error is as follows
untimeError: Internal: Unsupported data type in custom op handler: 0Node number 0 (edgetpu-custom-op) failed to prepare.
Originally, the samples in this project seemed to be deprecated, and
The edgetpu requires an old runtimeversion of 13, instead of the current 14.
(tflite is 2.5 ) I have downloaded it directly and re-installed it in
/usr/lib/python3/dist-packagesm
, but I cannot uninstall the new version and cannot match the version.
Is there a better way to do this?
Also, I've decided to give up on running the same environment as the sample, and use the pycoralapi to run the
If there is a good way to rewrite the code to use pycoral, please let me know.
Thanks
#!/usr/bin/python3
"""
Coral Smart Bird Feeder
Uses ClassificationEngine from the EdgeTPU API to analyze animals in
camera frames. Sounds a deterrent if a squirrel is detected.
Users define model, labels file, storage path, deterrent sound, and
optionally can set this to training mode for collecting images for a custom
model.
"""
import argparse
import time
import re
import imp
import logging
import gstreamer
import sys
sys.path.append('/usr/lib/python3/dist-packages/edgetpu')
from edgetpu.classification.engine import ClassificationEngine
from PIL import Image
from playsound import playsound
from pycoral.adapters import classify
from pycoral.adapters import common
from pycoral.utils.dataset import read_label_file
from pycoral.utils.edgetpu import make_interpreter
def save_data(image,results,path,ext='png'):
"""Saves camera frame and model inference results
to user-defined storage directory."""
tag = '%010d' % int(time.monotonic()*1000)
name = '%s/img-%s.%s' %(path,tag,ext)
image.save(name)
print('Frame saved as: %s' %name)
logging.info('Image: %s Results: %s', tag,results)
def load_labels(path):
"""Parses provided label file for use in model inference."""
p = re.compile(r'\s*(\d+)(.+)')
with open(path, 'r', encoding='utf-8') as f:
lines = (p.match(line).groups() for line in f.readlines())
return {int(num): text.strip() for num, text in lines}
def print_results(start_time, last_time, end_time, results):
"""Print results to terminal for debugging."""
inference_rate = ((end_time - start_time) * 1000)
fps = (1.0/(end_time - last_time))
print('\nInference: %.2f ms, FPS: %.2f fps' % (inference_rate, fps))
for label, score in results:
print(' %s, score=%.2f' %(label, score))
def do_training(results,last_results,top_k):
"""Compares current model results to previous results and returns
true if at least one label difference is detected. Used to collect
images for training a custom model."""
new_labels = [label[0] for label in results]
old_labels = [label[0] for label in last_results]
shared_labels = set(new_labels).intersection(old_labels)
if len(shared_labels) < top_k:
print('Difference detected')
return True
def user_selections():
parser = argparse.ArgumentParser()
parser.add_argument('--model', required=True,
help='.tflite model path')
parser.add_argument('--labels', required=True,
help='label file path')
parser.add_argument('--top_k', type=int, default=3,
help='number of classes with highest score to display')
parser.add_argument('--threshold', type=float, default=0.1,
help='class score threshold')
parser.add_argument('--storage', required=True,
help='File path to store images and results')
parser.add_argument('--sound', required=True,
help='File path to deterrent sound')
parser.add_argument('--print', default=False, required=False,
help='Print inference results to terminal')
parser.add_argument('--training', default=False, required=False,
help='Training mode for image collection')
args = parser.parse_args()
return args
def main():
"""Creates camera pipeline, and pushes pipeline through ClassificationEngine
model. Logs results to user-defined storage. Runs either in training mode to
gather images for custom model creation or in deterrent mode that sounds an
'alarm' if a defined label is detected."""
args = user_selections()
print("Loading %s with %s labels."%(args.model, args.labels))
engine = ClassificationEngine(args.model)
labels = load_labels(args.labels)
storage_dir = args.storage
#Initialize logging file
logging.basicConfig(filename='%s/results.log'%storage_dir,
format='%(asctime)s-%(message)s',
level=logging.DEBUG)
last_time = time.monotonic()
last_results = [('label', 0)]
def user_callback(image,svg_canvas):
nonlocal last_time
nonlocal last_results
start_time = time.monotonic()
results = engine.classify_with_image(image, threshold=args.threshold, top_k=args.top_k)
end_time = time.monotonic()
results = [(labels[i], score) for i, score in results]
if args.print:
print_results(start_time,last_time, end_time, results)
if args.training:
if do_training(results,last_results,args.top_k):
save_data(image,results, storage_dir)
else:
#Custom model mode:
#The labels can be modified to detect/deter user-selected items
if results[0][0] !='background':
save_data(image, storage_dir,results)
if 'fox squirrel, eastern fox squirrel, Sciurus niger' in results:
playsound(args.sound)
logging.info('Deterrent sounded')
last_results=results
last_time = end_time
result = gstreamer.run_pipeline(user_callback)
if __name__ == '__main__':
main()
enter code here
I suggest that you follow one of the examples available from the coral examples. There is an example named classify_image.py which uses the edgetpu (tflite) that I found works. After you install the coral examples, you have to drill down through the directory hierarchy. So, in my case, from root it is: /home/pi/ml-projects/coral/pycoral/tensorflow/examples/lite/examples. There are 17 files in that last examples directory. I'm using: numpy 1.19.3, pycoral 2.0.0, scipy 1.7.1, tensorflow 2.4.0, tflite-runtime 2.5.0.post1. I've installed the following edgetpu-runtime: edgetpu_runtime_20201105.zip.

Multiple Securities Trading Algorithm

I am very new to Python and I am having trouble executing my algorithmic trading strategy on more than one security at a time. I am currently using these lines of code for the stocks:
data_p = pd.read_csv('AAPL_30m.csv', index_col = 0, parse_dates = True)
data_p.drop(columns = ['Adj Close'])
Does anyone know how I would go about properly adding more securities?
Since no data is provided, I can only give you a rough idea on how this can be done. Change directory to the folder with all your data series in csv files:
import pandas as pd
import os
os.chdir(r'C:\Users\username\Downloads\new')
files = os.listdir()
Assume the files in the folder is
['AAPL.csv',
'AMZN.csv',
'GOOG.csv']
Then start with an empty dictionary d and loop through all the files in the directory to read as pandas dataframe. Eventually combine all of them to one big dataframe (if you find it more useful)
d = {}
for f in files:
name = f.split('.')[0]
df = pd.read_csv(f)
....
*** Do your processing ***
....
d[name] = df.copy()
dff = pd.concat(d)
Since I do not know your format and your index, I assume you can do pd.concat(d), alternatively, you may also try out pd.DataFrame(d)

How to use entrezpy and Biopython Entrez libraries to access ClinVar data from genomic position of variant

[Disclaimer: I have published this question 3 weeks ago in biostars, with no answers yet. I really would like to get some ideas/discussion to find a solution, so I post also here.
biostars post link: https://www.biostars.org/p/447413/]
For one of my projects of my PhD, I would like to access all variants, found in ClinVar db, that are in the same genomic position as the variant in each row of the input GSVar file. The language constraint is Python.
Up to now I have used entrezpy module: entrezpy.esearch.esearcher. Please see more for entrezpy at: https://entrezpy.readthedocs.io/en/master/
From the entrezpy docs I have followed this guide to access UIDs using the genomic position of a variant: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html in code:
# first get UIDs for clinvar records of the same position
# credits: credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
chr = variants["chr"].split("chr")[1]
start, end = str(variants["start"]), str(variants["end"])
es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
genomic_pos = chr + "[chr]" + " AND " + start + ":" + end # + "[chrpos37]"
entrez_query = es.inquire(
{'db': 'clinvar',
'term': genomic_pos,
'retmax': 100000,
'retstart': 0,
'rettype': 'uilist'}) # 'usehistory': False
entrez_uids = entrez_query.get_result().uids
Then I have used Entrez from BioPython to get the available ClinVar records:
# process each VariationArchive of each UID
handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
clinvar_records = {}
tree = ET.parse(handle)
root = tree.getroot()
This approach is working. However, I have two main drawbacks:
entrezpy fulls up my log file recording all interaction with Entrez making the log file too big to be read by the hospital collaborator, who is variant curator.
entrezpy function, entrez_query.get_result().uids, will return all UIDs retrieved so far from all the requests (say a request for each variant in GSvar), thus this space inefficient retrieval. That is the entrez_uids list will quickly grow a lot as I process all variants from a GSVar file. The simple solution that I have implenented is to check which UIDs are new from the current request and then keep only those for Entrez.fetch(). However, I still need to keep all seen UIDs, from previous variants in order to be able to know which is the new UIDs. I do this in code by:
# first snippet's first lines go here
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
Does anyone have suggestion(s) on how to address these two presented drawbacks?

gspread data does not appear in Google Sheet

I'm trying to write sensor data to a google sheet. I was able to write to this same sheet a year or so ago but I am active on this project again and can't get it to work. I believe the Oauth has changed and I've updated my code for that change.
In the below code, I get no errors, however no data in entered in the GoogleSheet. Also, If I look at GoogleSheets, the "last opened" date does not reflect the time my program would/should be writing to that google sheet.
I've tried numerous variations and I'm just stuck. Any suggestions would be appreciated.
#!/usr/bin/python3
#-- developed with Python 3.4.2
# External Resources
import time
import sys
import json
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import traceback
# Initialize gspread
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('MyGoogleCode.json',scope)
client = gspread.authorize(credentials)
# Start loop ________________________________________________________________
samplecount = 1
while True:
data_time = (time.strftime("%Y-%m-%d %H:%M:%S"))
row = ([samplecount,data_time])
# Append to Google sheet_
try:
if credentials is None or credentials.invalid:
credentials.refresh(httplib2.Http())
GoogleDataFile = client.open('DataLogger')
#wks = GoogleDataFile.get_worksheet(1)
wks = GoogleDataFile.get_worksheet(1)
wks.append_row([samplecount,data_time])
print("worksheets", GoogleDataFile.worksheets()) #prints ID for both sheets
except Exception as e:
traceback.print_exc()
print ("samplecount ", samplecount, row)
samplecount += 1
time.sleep(5)
I found my issue. I've changed 3 things to get gspread working:
Downloaded a newly created json file (probably did not need this step)
With the target worksheet open in chrome, I "shared" it with the email address found in the JSON file.
In the google developers console, I enabled "Drive API"
However, the code in the original post will not refresh the token. It will stop working after 60 minutes.
The code that works (as of July 2017) is below.
The code writes to a google sheet named "Datalogger"
It writes to the sheet shown as Sheet2 in the google view.
The only unique information is the name of the JSON file
Hope this helps others.
Jon
#!/usr/bin/python3
# -- developed with Python 3.4.2
#
# External Resources __________________________________________________________
import time
import json
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import traceback
# Initialize gspread credentials
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('MyjsonFile.json',scope)
headers = gspread.httpsession.HTTPSession(headers={'Connection': 'Keep-Alive'})
client = gspread.Client(auth=credentials, http_session=headers)
client.login()
workbook = client.open("DataLogger")
wksheet = workbook.get_worksheet(1)
# Start loop ________________________________________________________________
samplecount = 1
while True:
data_time = (time.strftime("%Y-%m-%d %H:%M:%S"))
row_data = [samplecount,data_time]
if credentials.access_token_expired:
client.login()
wksheet.append_row(row_data)
print("Number of rows in out worksheet ",wksheet.row_count)
print ("samplecount ", samplecount, row_data)
print()
samplecount += 1
time.sleep(16*60)

Resources