How to load embeddings (in tsv file) generated from StarSpace - gensim

Does anyone know how to load a tsv file with embeddings generated from StarSpace into Gensim? Gensim documentation seems to use Word2Vec a lot and I couldn't find a pertinent answer.
Thanks,
Amulya

You can take the tsv file from a trained StarSpace model and convert it into a txt file in the word2vec format that Gensim is able to import.
The first line of the new txt file should state the line count (make sure to first delete any empty lines at the end of the file) and the vector size (dimensions) of the tsv file. The rest of the file looks the same as the original tsv file, but with spaces instead of tabs.
The Python code to convert the file would then look something like this:
with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
    line_count = '...'  # line count of the tsv file (as string)
    dimensions = '...'  # vector size (as string)
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for line in inp:
        words = line.strip().split()
        outp.write(' '.join(words) + '\n')
You can then import the new file into Gensim like so:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)
I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
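For reference, such a check can be as simple as the following (the two words here are just placeholders; use tokens that actually occur in your StarSpace vocabulary):
print(word_vectors.similarity('word_a', 'word_b'))   # cosine similarity between two words
print(word_vectors.most_similar('word_a', topn=5))   # nearest neighbours of a word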

I've not been able to directly load the StarSpace embedding files using Gensim.
However, I was able to use the embed_doc utility provided by StarSpace to convert my words/sentences into their vector representations.
You can read more about the utility here.
This is the command I used for the conversion:
$ ./embed_doc model train.txt > vectors.txt
This converts the lines from train.txt into vectors and pipes the output into vectors.txt. Sadly, this includes output from the command itself and the input lines again.
Finally, to load the vectors into Python I used the following code (it's probably not very pythonic and clean, sorry).
file = open('vectors.txt')
X = []
for i, line in enumerate(file):
    # Skip the command's own output at the top and the echoed input lines in between
    should_continue = i < 4 or i % 2 != 0
    if should_continue:
        continue
    vector = [float(chunk) for chunk in line.split()]
    X.append(vector)
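As a quick sanity check (assuming every remaining line really is one vector), you can turn the result into a NumPy array and inspect its shape:
import numpy as np

X = np.asarray(X)
print(X.shape)  # (number of embedded lines, embedding dimension)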

I have a similar workaround where I used pandas to read the .tsv file and then converted it into a dict whose keys are the words and whose values are their embeddings as lists.
Here are some functions I used.
from pathlib import Path

import pandas as pd

in_data_path = Path.cwd().joinpath("models", "starspace_embeddings.tsv")
out_data_path = Path.cwd().joinpath("models", "starspace_embeddings.bin")

starspace_embeddings_data = pd.read_csv(in_data_path, header=None, index_col=0, sep='\t')
starspace_embeddings_dict = starspace_embeddings_data.T.to_dict('list')
import numpy as np
from gensim.models import KeyedVectors
from gensim.utils import to_utf8
from smart_open import open as smart_open
from tqdm import tqdm

def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vector_size : int
        The number of dimensions of word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    """
    total_vec = len(vocab)
    with smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in tqdm(vocab.items()):
            if binary:
                row = np.array(row, dtype=np.float32)
                word = str(word)
                fout.write(to_utf8(word) + b" " + row.tobytes())
            else:
                fout.write(to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))

save_word2vec_format(binary=True, fname=out_data_path, vocab=starspace_embeddings_dict, vector_size=100)

word_vectors = KeyedVectors.load_word2vec_format(out_data_path, binary=True)
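A quick way to confirm the converted file loads back correctly (the lookup word is a placeholder; use any key from your embeddings dict):
print(word_vectors.vector_size)   # should match the vector_size passed above (100)
vec = word_vectors['some_word']   # placeholder word
print(vec.shape)                  # (100,)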

Related

Issue in ExcelToCsv nifi processor

I am using the ExcelToCsv nifi processor for conversion of .xlsx files to csv. I want to convert a bunch of .xlsx files, which have data in different formats, to csv. Once a file gets converted to csv, the data is changed as below.
FYI, I have used the property values below inside the ExcelToCsv processor.
I referred to the ExcelToCsv nifi processor documentation:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-poi-nar/1.10.0/org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor/
CSV format: custom
Value separator: comma
Quote character: double quotes
Quote mode: Quote minimal
Here are a few points where I observed the data got changed:
17.90 ==> 17.900000001
270E+11 ==> 270000000000
34,45,67,344 ==> 344567344 (for this third case, the quote character does not get added)
Could somebody please let me know why I am getting wrong results in the csv output file, and how to solve this issue? Or is there another solution for excel to csv conversion?
Comma (",") is used as the separator, so you can't have 34,45,67,344 as a single value in your csv file.
If you still want to keep the comma, you can change the file separator from comma to some other character, e.g. pipe ("|"). To change the file separator, update the "Value Separator" field in the ConvertExcelToCSVProcessor nifi processor.
Another option is to escape the comma; to achieve that you need to play with "Quote Character" and "Escape Character".
To keep values as they were in the excel file, experiment with the "Format Cell Values" value.
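To illustrate the quoting point: in standard CSV, a field that contains the separator has to be wrapped in the quote character, so a row would need to come through looking something like this (a generic CSV example, not NiFi-specific output):
id,amount
1,"34,45,67,344"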
Since Nifi does not have a processor that supports .XLS (older excel) to .CSV conversion, I wrote a Python script to perform the conversion and call it from ExecuteStreamCommand.
While converting excel rows, the Python script also performs cleanup on the rows, such as adding escape characters and removing any \n, so that the resulting CSV won't fail at the ValidateRecord or ConvertRecord processor!
Give it a try (you may need to tweak it) and do let us know whether it's useful in your case!
import csv
import os
import sys
from io import StringIO, BytesIO

import pandas as pd
import xlrd
from pandas import ExcelFile

# Read the .xls workbook from the flowfile content on stdin
# (on Python 3, use sys.stdin.buffer.read() here to get bytes instead of text).
wb = xlrd.open_workbook(file_contents=sys.stdin.read(), logfile=open(os.devnull, 'w'))
excel_file_df = pd.read_excel(wb, sheet_name='Sheet1', index=False, index_col=0, encoding='utf-8', engine='xlrd')
#flowfile_content = ExcelFile(BytesIO(sys.stdin.read()))
#excel_file_df = pd.read_excel(flowfile_content, sheet_name='Sheet1', index=False, index_col=0, encoding='utf-8')

csv_data_rows = []

# Quote every header field and build the header row
header_list = list(excel_file_df.columns.values)
temp_header_list = []
for field in header_list:
    temp = '"' + field + '"'
    temp_header_list.append(temp)
header_row = ','.join([str(elem) for elem in temp_header_list])
csv_data_rows.append(header_row)

is_header_row = True
for index, row in excel_file_df.iterrows():
    if is_header_row:
        is_header_row = False
        continue
    temp_data_list = []
    for item in row:
        #item = item.encode('utf-8', 'ignore').decode('utf-8')
        if hasattr(item, 'encode'):
            item = item.encode('ascii', 'ignore').decode('ascii')
        item = str(item)
        item = item.replace('\n', '')
        item = item.replace('",', '" ')
        if item == 'nan':
            item = ''
        temp = '"' + str(item) + '"'
        temp_data_list.append(temp)
    data_row = ','.join([str(elem) for elem in temp_data_list])
    csv_data_rows.append(data_row)

for item in csv_data_rows:
    sys.stdout.write("%s\r\n" % item)
ExecuteStreamCommand Processor Configuration
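If you want to test the script outside NiFi first, you can pipe a workbook into it on stdin, the same way ExecuteStreamCommand feeds it the flowfile content (the file names here are just placeholders):
$ python excel_to_csv.py < input.xls > output.csv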

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

I want to filter the sequences which have more than 8 of the same consecutive nucleotides, like "GGGGGGGG", "CCCCCCCC", etc., in my fastq files.
How should I do that?
The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq.
This will miss blocks where the 8-mer is split across two lines (e.g. the first line ends with AAAA and the second starts with AAAA). It also assumes the output has blocks of 4 lines each.
The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence do not have an 8-mer as above. If that's the case, then output the buffered block.
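A minimal Python sketch of that idea, assuming the common 4-line FASTQ layout with the whole sequence on a single line (so no 8-mer can be split across lines); the script and file names are just examples:
#!/usr/bin/env python
import re
import sys

# Matches a run of 8 (or more) identical consecutive nucleotides, e.g. "GGGGGGGG".
homopolymer = re.compile(r'A{8}|C{8}|G{8}|T{8}')

with open(sys.argv[1]) as handle:
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        # record[1] is the sequence line; keep the read only if it has no long homopolymer
        if not homopolymer.search(record[1]):
            sys.stdout.write(''.join(record))
Run it as, e.g., python filter_homopolymers.py yourfile.fastq > filtered.fastq.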
I ended up using the R code below, which solved my problem.
library(ShortRead)
fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC",
sread(reads_fq), invert = TRUE)]
writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
You can use the Python package biotite for it (https://www.biotite-python.org).
Let's say you have the following FASTQ file:
@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G#?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7#+6!FDC&
+4=4+,#2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,#'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?#2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'#",8:#5%18&+3>'8:28+:5F0);E9<=,+
This is a script that should do the work:
import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq

# 'GGGGGGGG', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]

fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")

# Iterate over the sequence entries in the file
for header in fastq_file:
    sequence = fastq_file.get_sequence(header)
    # Iterate over each of the consecutive sequences
    for consecutive_nuc in consecutive_nucs:
        # Find all indices where a match was found
        matches = seq.find_subsequence(sequence, consecutive_nuc)
        if len(matches) > 0:
            # If any match was found, report it
            print(
                f"Found '{consecutive_nuc}' "
                f"in sequence '{header}' at position {matches[0]}"
            )
This is the output:
Found 'CCCCCCCC' in sequence 'Read:01' at position 8

write the name of all files in a directory, and their absolute path, to a csv file in bash

This is harder than I expected, but I have a folder with ~100 datasets in .csv format.
I would like to create a .csv file with the following fields:
The first field is the file's name. e.g. user_profile.csv
The second field is the file's absolute path, e.g. /Users/yuqli/project/user_profile.csv
I would like to do this with bash commands. But so far I have only been able to do:
ls >> out.csv
which will write all file names into a txt file... I see some people using a for loop, but manipulating lines in a .csv file seems forbidding, and I don't know what to put inside the for loop...
Am I better off just using Python? Any help is appreciated... Thanks!
Thanks for the advice from the gurus above; I came up with this Python program that 1) extracts the file names and 2) extracts the field names in each file. Any comments are welcome. Thanks!
import os
import csv

info = {}  # store all information in a Python dictionary
for filename in os.listdir(os.getcwd()):
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        row1 = next(reader)
        info[filename] = row1

path = os.getcwd()
header = 'field, dataset, path'
write_file = "output.csv"
with open(write_file, "w") as output:
    output.write(header + '\n')
    for key, value in info.items():
        for elem in value:
            # join directory and file name with a separator, otherwise they run together
            curr_path = os.path.join(path, key)
            line = '{0}, {1}, {2}'.format(elem, key, curr_path)
            output.write(line + '\n')
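If you only need the two columns from the original question (file name and absolute path), a shorter variant along these lines should also work (a minimal sketch; the output file name is arbitrary):
import csv
from pathlib import Path

# Collect the .csv files first so the output file itself is not listed.
files = sorted(Path.cwd().glob('*.csv'))
with open('files_index.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['name', 'path'])
    for p in files:
        writer.writerow([p.name, str(p.resolve())])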

sort fastq file and keep sequences 15-17 bp in length

I have a couple of very large fastq files that I am using cutadapt on to trim off a transposon end sequence; this should leave 15-17 base pairs of genomic DNA. A very large portion of the reads are 15-17 base pairs after using cutadapt, but some sequences are quite a bit longer (indicating they didn't have a transposon end sequence on them and are garbage reads for my experiment).
My question: is there a command or script I can use in Linux to sort through these fastq files and output a new fastq containing only the reads that are 15-17 base pairs long, while still retaining the usual fastq format?
For reference, the fastq format looks like this:
@D64TDFP1:287:C69APACXX:2:1101:1319:2224 1:N:0:
GTTAGACCGGATCCTAACAGGTTGGATGATAAGTCCCCGGTCTAT
+
DDHHHDHHGIHIIIIE?FFHECGHICHHGH>BD?GHIIIIFHIDG
@D64TDFP1:287:C69APACXX:2:1101:1761:2218 1:N:0:
GTTAGACCGGATCCTAACAGGTTGGATGATAAGTCCCCGGTCTAT
+
FFHHHHHJIJJJJJIIJJJIJHIJJGIJIIIFJ?HHJJJJGHIGI
I found a similar question here, but it appears that a correct solution was never found. Does anyone have any solutions?
Read four lines at a time into an array. Print out those four lines, when the read length is between your thresholds.
Here is an example of how to do that with Perl, but the principle would be the same in Python or any other scripting language:
#!/usr/bin/env perl

use strict;
use warnings;

my $fastq;
my $lineIdx = 0;

while (<>) {
    chomp;
    $fastq->[$lineIdx++] = $_;
    if ($lineIdx == 4) {
        my $readLength = length $fastq->[1];
        if (($readLength >= 15) && ($readLength <= 17)) {
            print "$fastq->[0]\n$fastq->[1]\n$fastq->[2]\n$fastq->[3]\n";
        }
        $lineIdx = 0;
    }
}
To use, e.g.:
$ ./filter_fastq_reads.pl < reads.fq > filtered_reads.fq
This prints out reads in the order they are found. This is just filtering, which should be very fast. Otherwise, if you need to sort on some criteria, please specify the sort criteria in your question.
In Python:
#!/usr/bin/env python
import sys
line_idx = 0
record = []
for line in sys.stdin:
record[line_idx] = line.rstrip()
line_idx += 1
if line_idx == 4:
read_length = len(record[1])
if read_length >= 15 and read_length <= 17:
sys.stdout.write('{}\n'.format('\n'.join(record)))
line_idx = 0
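To use it the same way as the Perl version (the script name is just an example):
$ python filter_fastq_reads.py < reads.fq > filtered_reads.fq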

The issue of reading multiple images from a folder in Matlab

I have a set of images located in a folder, and I'm trying to read these images and store their names in a text file, where the order of the images is very important.
My code is as follows:
imagefiles = dir('*jpg');
nfiles = length(imagefiles);   % Number of files found
%*******************
for ii = 1:nfiles
    currentfilename = imagefiles(ii).name;
    % write the name in txt file
end
The images are stored in the folder in the following sequence: {1,2,3,4,100,110}.
The problem is that Matlab reads and writes the sequence of images as {1,100,110,2,3,4}, which is not the correct order.
How can this be overcome?
I would suggest using sscanf to find the number in the file name. For that you have to create a format spec which shows how your file name is built. If it is a number followed by .jpg, that would be: '%d.jpg'.
You can call sscanf (scan string) on the names of the files using cellfun:
imagefiles = dir('*jpg');
fileNo = cellfun(@(x) sscanf(x, '%d.jpg'), {imagefiles(:).name});
Then you sort fileNo, save the indexes of the sorted array and go through these indexes in the for-loop:
[~, ind] = sort(fileNo);
for ii = ind
    currentfilename = imagefiles(ii).name;
    % write the name in txt file
end
