Issue in ExcelToCsv nifi processor - apache-nifi

I am using the ConvertExcelToCSV NiFi processor to convert .xlsx files to CSV. I want to convert a bunch of .xlsx files containing data in different formats. Once a file is converted to CSV, the data changes as shown below.
FYI, I have used the property values below inside the ConvertExcelToCSV processor.
I referred to the ConvertExcelToCSVProcessor documentation:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-poi-nar/1.10.0/org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor/
CSV format: Custom
Value separator: comma
Quote character: double quotes
Quote mode: Quote minimal
Here are a few cases where I observed the data being changed:
17.90 ==> 17.900000001
2.70E+11 ==> 270000000000
34,45,67,344 ==> 344567344 (in this third case, the quote character does not get added)
Could somebody please explain why I am getting wrong results in the CSV output file?
How can I solve this issue? Or is there another solution for Excel-to-CSV conversion?

A comma (",") is used as the separator, so you can't have 34,45,67,344 as a single value in your CSV file.
If you still want to keep the commas, you can change the separator from a comma to some other character, e.g. a pipe ("|"). To change the separator, update the "Value Separator" field in the ConvertExcelToCSVProcessor.
Another option is to escape the commas; to achieve that, you need to play with the "Quote Character" and "Escape Character" properties.
To keep values exactly as they appeared in the Excel file, experiment with the "Format Cell Values" property.
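For illustration, here is how minimal quoting behaves in Python's standalone csv module with equivalent settings. This is just a sketch, independent of NiFi, to show what the expected output looks like:

import csv
import sys

# QUOTE_MINIMAL quotes a field only when it contains the separator,
# the quote character, or a newline, so a value like 34,45,67,344 must be quoted.
writer = csv.writer(sys.stdout, delimiter=',', quotechar='"',
                    quoting=csv.QUOTE_MINIMAL)
writer.writerow(['17.90', '34,45,67,344'])
# Prints: 17.90,"34,45,67,344"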

Since NiFi does not have a processor that converts .XLS (older Excel) files to .CSV, I wrote a Python script to perform the conversion and call it from ExecuteStreamCommand.
While converting the Excel rows, the Python script also performs cleanup on each row, such as adding an escape character and removing any \n, so that the resulting CSV won't fail at the ValidateRecord or ConvertRecord processors.
Give it a try (you may need to tweak it) and let us know whether it's useful in your case!
import os
import sys
from io import BytesIO

import pandas as pd

# Read the raw .xls bytes from stdin (NiFi streams the flowfile content here)
# and hand them to pandas; the xlrd engine handles the older .xls format.
excel_file_df = pd.read_excel(BytesIO(sys.stdin.buffer.read()),
                              sheet_name='Sheet1', index_col=0, engine='xlrd')

csv_data_rows = []

# Quote every header field.
temp_header_list = []
for field in excel_file_df.columns.values:
    temp_header_list.append('"' + str(field) + '"')
csv_data_rows.append(','.join(temp_header_list))

is_header_row = True
for index, row in excel_file_df.iterrows():
    # Skip the first data row (treated as a repeated header in my files).
    if is_header_row:
        is_header_row = False
        continue
    temp_data_list = []
    for item in row:
        if hasattr(item, 'encode'):
            # Drop non-ASCII characters that trip up downstream record readers.
            item = item.encode('ascii', 'ignore').decode('ascii')
        item = str(item)
        item = item.replace('\n', '')    # embedded newlines break CSV rows
        item = item.replace('",', '" ')  # avoid a quote-comma pair inside a field
        if item == 'nan':
            item = ''                    # blank out missing values
        temp_data_list.append('"' + item + '"')
    csv_data_rows.append(','.join(temp_data_list))

for item in csv_data_rows:
    sys.stdout.write("%s\r\n" % item)
[Screenshot: ExecuteStreamCommand processor configuration]

Related

write the name of all files in a directory, and their absolute path, to a csv file in bash

This is harder than I expected, but I have a folder with ~100 datasets in .csv format.
I would like to create a .csv file with the following fields:
The first field is the file's name, e.g. user_profile.csv
The second field is the file's absolute path, e.g. /Users/yuqli/project/user_profile.csv
I would like to do this with bash commands, but so far I have only been able to do:
ls >> out.csv
which writes all the file names into a text file... I have seen some people use a for loop, but manipulating lines in a .csv file seems forbidding, and I don't know what to put inside the loop...
Am I better off just using Python? Any help is appreciated... Thanks!
Thanks to the advice of the gurus above, I came up with this Python program that 1) extracts the file names and 2) extracts the field names in each file. Any comments are welcome. Thanks!
import csv
import os

info = {}  # maps each csv file name to the list of fields in its header row

for filename in os.listdir(os.getcwd()):
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        row1 = next(reader)  # the header row of this dataset
        info[filename] = row1

path = os.getcwd()
header = 'field, dataset, path'

write_file = "output.csv"
with open(write_file, "w") as output:
    output.write(header + '\n')
    for key, value in info.items():
        for elem in value:
            curr_path = os.path.join(path, key)  # path + key alone would drop the separator
            line = '{0}, {1}, {2}'.format(elem, key, curr_path)
            output.write(line + '\n')
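One caveat with the hand-rolled formatting above: if a field name itself contains a comma, the output line will be malformed, which is exactly the quoting problem CSV libraries solve. Here is a minimal sketch of the same write loop using csv.writer, which adds quotes automatically where needed (it assumes the same info dict built above):

import csv
import os

path = os.getcwd()
with open("output.csv", "w", newline="") as output:
    writer = csv.writer(output)
    writer.writerow(["field", "dataset", "path"])
    for key, value in info.items():
        for elem in value:
            # csv.writer quotes any value that contains a comma
            writer.writerow([elem, key, os.path.join(path, key)])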

How to load embeddings (in tsv file) generated from StarSpace

Does anyone know how to load a tsv file with embeddings generated from StarSpace into Gensim? Gensim documentation seems to use Word2Vec a lot and I couldn't find a pertinent answer.
Thanks,
Amulya
You can use the tsv file from a trained StarSpace model and convert that into a txt file in the Word2Vec format Gensim is able to import.
The first line of the new txt file should state the line count (make sure to first delete any empty lines at the end of the file) and the vector size (dimensions) of the tsv file. The rest of the file looks the same as the original tsv file, but using spaces instead of tabs.
The Python code to convert the file would then look something like this:
with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
    line_count = '...'  # line count of the tsv file (as a string)
    dimensions = '...'  # vector size (as a string)
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for line in inp:
        words = line.strip().split()
        outp.write(' '.join(words) + '\n')
You can then import the new file into Gensim like so:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)
I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
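If you would rather not fill in line_count and dimensions by hand, here is a sketch that derives both values from the tsv itself, assuming every non-empty row is a word followed by its vector components:

# Derive the header values from the tsv, assuming uniform row width.
with open('path/to/starspace-model.tsv', 'r') as inp:
    rows = [line.strip().split() for line in inp if line.strip()]

line_count = str(len(rows))
dimensions = str(len(rows[0]) - 1)  # the first token on each row is the word

with open('path/to/word2vec-format.txt', 'w') as outp:
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for words in rows:
        outp.write(' '.join(words) + '\n')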
I've not been able to directly load the StarSpace embedding files using Gensim.
However, I was able to use the embed_doc utility provided by StarSpace to convert my words/sentences into their vector representations.
You can read more about the utility here.
This is the command I used for the conversion:
$ ./embed_doc model train.txt > vectors.txt
This converts the lines from train.txt into vectors and pipes the output into vectors.txt. Sadly, this includes output from the command itself and the input lines again.
Finally, to load the vectors into Python I used the following code (it's probably not very pythonic and clean, sorry).
X = []
with open('vectors.txt') as vector_file:
    for i, line in enumerate(vector_file):
        # Skip the first 4 lines of command output, and every other line
        # after that (embed_doc echoes each input line before its vector).
        should_continue = i < 4 or i % 2 != 0
        if should_continue:
            continue
        vector = [float(chunk) for chunk in line.split()]
        X.append(vector)
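To use the parsed vectors downstream (a hypothetical follow-up, not part of the original answer), it is convenient to turn the list of lists into a NumPy array:

import numpy as np

X = np.array(X)  # shape: (num_vectors, embedding_dim)
print(X.shape)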
I have a similar workaround where I used pandas to read the .tsv file and then converted it into a dict whose keys are words and whose values are their embeddings as lists.
Here are the functions I used.
from pathlib import Path

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.utils import to_utf8
from smart_open import open as smart_open
from tqdm import tqdm

in_data_path = Path.cwd().joinpath("models", "starspace_embeddings.tsv")
out_data_path = Path.cwd().joinpath("models", "starspace_embeddings.bin")

# Read the tsv (word in the first column, vector components in the rest)
# and turn it into {word: [v1, v2, ...]}.
starspace_embeddings_data = pd.read_csv(in_data_path, header=None, index_col=0, sep='\t')
starspace_embeddings_dict = starspace_embeddings_data.T.to_dict('list')

def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the input-hidden weight matrix in the same format used by the
    original C word2vec tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vector_size : int
        The number of dimensions of word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format,
        else it will be saved in plain text.
    """
    total_vec = len(vocab)
    with smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)  # debug
        fout.write(to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in tqdm(vocab.items()):
            if binary:
                row = np.array(row, dtype=np.float32)
                fout.write(to_utf8(str(word)) + b" " + row.tobytes())
            else:
                fout.write(to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))

save_word2vec_format(binary=True, fname=out_data_path, vocab=starspace_embeddings_dict, vector_size=100)
word_vectors = KeyedVectors.load_word2vec_format(out_data_path, binary=True)
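As in the first answer, a quick similarity check can confirm the round-trip worked. The words below are placeholders; substitute tokens that actually exist in your StarSpace vocabulary:

print(word_vectors.vector_size)               # should match the 100 passed above
print(word_vectors.similarity('cat', 'dog'))  # 'cat' and 'dog' are placeholder words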

RUBY CSV Gem generating random quotation mark

I'm trying to generate a csv file from a SQL query result.
99% of the time it works fine, but in some lines (rows) of the CSV file it generates a quotation mark at the start and the end of the row.
[Screenshot of the problem]
I've already checked the content of the SQL cells and it is OK, so I think the problem happens when the file is generated.
Here is the way the file is being generated:
#load query result
dataset = DB["select
                id
                ,action
              from
                some_table"]

#generate csv file
CSV.open("#{table}.csv", "wb", :write_headers => true, :headers => ["id_cliente|""acao"]) do |csv|
  dataset.each do |dbrow|
    csv << [
      "#{dbrow[:id_cliente]}" + "|" + "#{dbrow[:acao]}"
    ]
  end
end

new_object = $bucket_response.objects.build("#{table}.csv")
new_object.content = open("#{table}.csv")
new_object.acl = :public_read
new_object.save
Is there any way to solve it or improve the generating process?
You must specify the column separator as an option instead of embedding it in the strings:
CSV.open("#{table}.csv", "wb", col_sep: '|', ..., headers: ['id_cliente', 'acao'])
...
  csv << [dbrow[:id_cliente], dbrow[:acao]]
...
For more info, check the CSV and CSV::Row docs.

python - importing csv - filtering on column - writing to txt file w/ a timestamp - Output issues with txt

First post, so try not to get mad at my formatting.
I am trying to do ETL on a csv file with Python 3.5. The code I have successfully extracts, filters on the correct column, creates the desired end result in the new_string variable, and produces the correctly named txt file at the end of the run. But opening the txt file shows that only a single value made it in, as if only one index were written; I was expecting the whole column to be printed out in string format. Clearly I am not taking the formatting of the list/string into consideration, but I am stuck for now.
If anyone sees something going on here, I would appreciate the heads up. Thanks in advance...
Here is my code:
import csv
import datetime
import os

cdpath = os.getcwd()

def get_file_path(filename):
    currentdirpath = os.getcwd()
    file_path = os.path.join(os.getcwd(), filename)
    print(file_path)
    return file_path

path = get_file_path('cleanme.csv')  # My test file to work on

def timeStamped(fname, fmt='%Y-%m-%d-%H-%M-%S_{fname}'):  # Time stamp func
    return datetime.datetime.now().strftime(fmt).format(fname=fname)

def read_csv(filepath):
    with open(filepath, 'rU') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            new_list = row[2]
            new_string = str(new_list)
            print(new_string)
    with open(timeStamped('cleaned.txt'), 'w') as outf:
        outf.write(new_string)
In your code, you have:
def read_csv(filepath):
    with open(filepath, 'rU') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            new_list = row[2]
            new_string = str(new_list)
            print(new_string)
    with open(timeStamped('cleaned.txt'), 'w') as outf:
        outf.write(new_string)
As noted in my comment above, there was some question on whether the second with was properly indented, but actually, it doesn't matter:
You generate the new_string inside the for loop (for row in reader). But because you don't use it inside the loop (except printing it out), when the loop finishes, the only value you will have access to will be the last element.
Alternatively, if you had the with ... as outf as part of the loop, each time through, you'd open a new copy and overwrite the data, such that cleaned.txt only has the last value at the end again.
I think what you want is something like:
def read_csv(filepath):
    with open(filepath, 'rU') as csvfile:
        with open(timeStamped('cleaned.txt'), 'w') as outf:
            reader = csv.reader(csvfile)
            for row in reader:
                new_list = row[2]           # extract the 3rd column of each row
                new_string = str(new_list)  # optionally do some transforms here
                print(new_string)           # debug
                outf.write(new_string)      # store result
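One caveat with the sketch above: write() does not append a newline, so the extracted values will run together on a single line in cleaned.txt. If you want one value per line, add the terminator explicitly:

outf.write(new_string + '\n')  # write() adds no newline of its own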

How to store number in text format in csv file using Ruby CSV?

Even though I am inserting the value as a string into the CSV, it gets stored as a number, e.g. "01" gets stored as 1.
I am using CSV writer:
@out = File.open("#{File.expand_path("CSV")}/#{file_name}.csv", "w")
CSV::Writer.generate(@out) do |csv|
  csv << ["01", "02", "test"]
end
@out.close
This generates a csv with the given values, but when we open the csv using Excel, "01" is not stored as text; it gets stored as a number.
Thanks
You have to surround the value with double quotation marks ("...") in order to get it stored as a string.
Use string formatting:
my_int = 1
p "%02d" % my_int
Start here for Ruby 1.9.2:
http://www.ruby-doc.org/core/classes/String.html
and you will see that for the full set of instructions, you need to dig into Kernel::sprintf.
