Why is the head() function showing semicolon-separated data in my Jupyter notebook? - python-2.x

I read the CSV file using the pd.read_csv() method. When I display it, the data still appears as semicolon-separated strings; I expected a table structure.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import random
import os
df=pd.read_csv("E:\Python\data_full.csv")
df.head()
Actual result:
0    56;"housemaid";"married";"basic.4y";"no";"no";...
1    57;"services";"married";"high.school";"unknown...
2    37;"services";"married";"high.school";"no";"ye...
3    40;"admin.";"married";"basic.6y";"no";"no";"no...
4    56;"services";"married";"high.school";"no";"no...

When I opened the CSV file, each row was shown as a single cell. I noticed that the delimiter was a semicolon (;). After I changed the delimiter to a comma (,), each value in the CSV file was displayed in its own cell.
Now the head() method displays the results in a table structure as expected :)
Is there any limitation on using a semicolon as the delimiter in a CSV file?
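There is no need to edit the file itself: pandas can parse semicolon-delimited files directly if you tell read_csv which separator to use. A minimal sketch, reusing the path from the question (the raw-string prefix just avoids accidental escape sequences in the Windows path):
import pandas as pd
# sep=";" makes pandas split on semicolons instead of the default comma
df = pd.read_csv(r"E:\Python\data_full.csv", sep=";")
df.head()  # now renders as a table, one value per column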

Related

How to import data from a file as list in Mathematica

I have a large *.txt file that contains real numbers. I want to import it and execute the RootMeanSquare function on it, but the output of that function is not a real number.
a.txt:
0.00005589924852471949
0.000036651199287161235
0.000016275882123536572
-4.955137498989977*^-6
-0.00002680629351951319
-0.000048814313574683916
ah = Import["a.txt", "List"];
RootMeanSquare[ah]
This returns the unevaluated expression Sqrt[7.83436*10^-9 + ("-4.955137498989977*^-6")^2]/Sqrt[6] instead of a number.
In my opinion, the problem is in the number -4.955137498989977*^-6.
Please help me, thank you.
The import yields a string, so split it and convert to numeric expressions.
ah = ToExpression@StringSplit@Import["a.txt"]
StringSplit splits a string at whitespace by default.
ToExpression is listable, so it operates on the whole list of strings in one step.

Issue with the ExcelToCsv NiFi processor

I am using the ExcelToCsv NiFi processor to convert .xlsx files to CSV. I want to convert a bunch of .xlsx files, with data in different formats, to CSV. Once a file is converted to CSV, the data gets changed as shown below.
I have used the property values below inside the ExcelToCsv processor.
For reference, the ConvertExcelToCSVProcessor documentation:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-poi-nar/1.10.0/org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor/
CSV format: custom
Value separator: comma
Quote character: double quotes
Quote mode: Quote minimal
Here are a few cases where I observed the data being changed:
17.90 ==> 17.900000001
270E+11 ==> 270000000000
34,45,67,344 ==> 344567344 (in this third case, the quote character does not get added)
Could somebody explain why I am getting wrong results in the CSV output file? How can I solve this, or is there another solution for Excel-to-CSV conversion?
A comma (",") is used as the separator, so you can't have 34,45,67,344 as a single unquoted value in your CSV file.
If you still want to keep the commas, you can change the file separator from a comma to some other character, e.g. a pipe ("|"). To change the separator, update the "Value Separator" field in the ConvertExcelToCSVProcessor.
Another option is to escape or quote the comma; to achieve that you need to play with the "Quote Character" and the "Escape Character".
To keep values exactly as they were in the Excel file, experiment with the "Format Cell Values" value.
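For illustration only (these rows are made up, not taken from the question's spreadsheet), the same cell survives either by quoting it or by switching the separator:
with a comma separator and quoting: 17.90,"34,45,67,344",270000000000
with a pipe separator, no quoting needed: 17.90|34,45,67,344|270000000000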
Since NiFi does not have a processor that supports .XLS (older Excel) to .CSV conversion, I wrote a Python script to perform the conversion and call it from ExecuteStreamCommand.
While converting the Excel rows, the Python script also performs cleanup, such as adding escape characters and removing any \n, so that the resulting CSV won't fail at the ValidateRecord or ConvertRecord processor!
Give it a try (it may need tweaking) and do let us know whether it's useful in your case!
import csv
import os
import sys
from io import StringIO, BytesIO

import pandas as pd
import xlrd
from pandas import ExcelFile

# read the incoming Excel flowfile from stdin
wb = xlrd.open_workbook(file_contents=sys.stdin.read(), logfile=open(os.devnull, 'w'))
excel_file_df = pd.read_excel(wb, sheet_name='Sheet1', index=False, index_col=0, encoding='utf-8', engine='xlrd')
#flowfile_content = ExcelFile(BytesIO(sys.stdin.read()))
#excel_file_df = pd.read_excel(flowfile_content, sheet_name='Sheet1', index=False, index_col=0, encoding='utf-8')

csv_data_rows = []

# quote every header field and emit the header row first
header_list = list(excel_file_df.columns.values)
temp_header_list = []
for field in header_list:
    temp = '"' + field + '"'
    temp_header_list.append(temp)
header_row = ','.join([str(elem) for elem in temp_header_list])
csv_data_rows.append(header_row)

is_header_row = True
for index, row in excel_file_df.iterrows():
    if is_header_row:
        is_header_row = False
        continue
    temp_data_list = []
    for item in row:
        #item = item.encode('utf-8', 'ignore').decode('utf-8')
        if hasattr(item, 'encode'):
            item = item.encode('ascii', 'ignore').decode('ascii')
        item = str(item)
        item = item.replace('\n', '')
        item = item.replace('",', '" ')
        if item == 'nan':
            item = ''
        temp = '"' + str(item) + '"'
        temp_data_list.append(temp)
    data_row = ','.join([str(elem) for elem in temp_data_list])
    csv_data_rows.append(data_row)

# write the CSV back to the flowfile via stdout
for item in csv_data_rows:
    sys.stdout.write("%s\r\n" % item)
ExecuteStreamCommand Processor Configuration
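The configuration screenshot is not reproduced here; roughly, the relevant ExecuteStreamCommand properties would look like this (the interpreter and script paths are placeholders for your own environment):
Command Path: /usr/bin/python
Command Arguments: /path/to/xlsx_to_csv.py
Ignore STDIN: false (the script reads the incoming flowfile from stdin)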

CLI "bq load" - how to use non-printable character as delimiter?

I'm having trouble loading data into BigQuery as a single column row. I wish BigQuery offered the ability to have "no delimiter" as an option, but in the meantime I need to choose the most obscure ASCII delimiter I can find so my single column row is not split into columns.
When doing this the CLI won't allow me to input strange characters, so I need to use the API through Python or other channels.
How can I use the CLI instead with a non printable character?
A Python example from the post "BigQuery lazy data loading: DDL, DML, partitions, and half a trillion Wikipedia pageviews":
#!/bin/python
from google.cloud import bigquery
bq_client = bigquery.Client(project='fh-bigquery')
table_ref = bq_client.dataset('views').table('wikipedia_views_gcs')
table = bigquery.Table(table_ref, schema=SCHEMA)
extconfig = bigquery.ExternalConfig('CSV')
extconfig.schema = [bigquery.SchemaField('line', 'STRING')]
extconfig.options.field_delimiter = u'\u00ff'
extconfig.options.quote_character = ''
To use a non-printable character with BQ load you can use echo in bash:
bq load \
--source_format=CSV \
--field_delimiter=$(echo -en "\x01") \
--noreplace --max_bad_records=100 \
<bq_dataset>.<bq_table> gs://<bucket_name>/<file_name>.csv
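If you would rather stay in Python than shell out to the CLI, the load-job path of the google-cloud-bigquery client accepts the same kind of delimiter. A minimal sketch (dataset, table, and bucket names are placeholders, mirroring the bq example above):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.field_delimiter = u'\u0001'  # same non-printable delimiter as the bq CLI example above
job_config.quote_character = ''         # disable quoting so the whole line stays one column
load_job = client.load_table_from_uri(
    'gs://<bucket_name>/<file_name>.csv',
    '<bq_dataset>.<bq_table>',
    job_config=job_config)
load_job.result()  # wait for the load to finish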

How to load embeddings (in tsv file) generated from StarSpace

Does anyone know how to load a tsv file with embeddings generated from StarSpace into Gensim? Gensim documentation seems to use Word2Vec a lot and I couldn't find a pertinent answer.
Thanks,
Amulya
You can use the tsv file from a trained StarSpace model and convert that into a txt file in the Word2Vec format Gensim is able to import.
The first line of the new txt file should state the line count (make sure to first delete any empty lines at the end of the file) and the vector size (dimensions) of the tsv file. The rest of the file looks the same as the original tsv file, but with spaces instead of tabs.
The Python code to convert the file would then look something like this:
with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
line_count = '...' # line count of the tsv file (as string)
dimensions = '...' # vector size (as string)
outp.write(' '.join([line_count, dimensions]) + '\n')
for line in inp:
words = line.strip().split()
outp.write(' '.join(words) + '\n')
You can then import the new file into Gensim like so:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)
I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
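If you prefer not to fill in the line count and dimensions by hand, a small variant of the same conversion (same placeholder paths) can derive both values from the tsv file first:
# first pass: read all non-empty rows of the StarSpace tsv
with open('path/to/starspace-model.tsv', 'r') as inp:
    rows = [line.strip().split('\t') for line in inp if line.strip()]
line_count = len(rows)
dimensions = len(rows[0]) - 1  # first column is the word, the rest is the vector

# second pass: write the word2vec text format that Gensim expects
with open('path/to/word2vec-format.txt', 'w') as outp:
    outp.write('%d %d\n' % (line_count, dimensions))
    for row in rows:
        outp.write(' '.join(row) + '\n')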
I've not been able to directly load the StarSpace embedding files using Gensim.
However, I was able to use the embed_doc utility provided by StarSpace to convert my words/sentences into their vector representations.
You can read more about the utility here.
This is the command I used for the conversion:
$ ./embed_doc model train.txt > vectors.txt
This converts the lines from train.txt into vectors and pipes the output into vectors.txt. Sadly, this includes output from the command itself and the input lines again.
Finally, to load the vectors into Python I used the following code (it's probably not very pythonic and clean, sorry).
file = open('vectors.txt')
X = []
for i, line in enumerate(file):
    # skip the first four lines of command output and every echoed input line
    should_continue = i < 4 or i % 2 != 0
    if should_continue:
        continue
    vector = [float(chunk) for chunk in line.split()]
    X.append(vector)
I have a similar workaround where I used pandas to read the .tsv file and then convert it into a dict where the keys are words and the values are their embeddings as lists.
Here are some functions I used.
from pathlib import Path

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.utils import to_utf8
from smart_open import open as smart_open
from tqdm import tqdm

in_data_path = Path.cwd().joinpath("models", "starspace_embeddings.tsv")
out_data_path = Path.cwd().joinpath("models", "starspace_embeddings.bin")

# read the StarSpace tsv (first column = word, remaining columns = vector)
starspace_embeddings_data = pd.read_csv(in_data_path, header=None, index_col=0, sep='\t')
# transpose so each word maps to its embedding as a list
starspace_embeddings_dict = starspace_embeddings_data.T.to_dict('list')

def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vector_size : int
        The number of dimensions of word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    """
    total_vec = len(vocab)
    with smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in tqdm(vocab.items()):
            if binary:
                row = np.array(row)
                word = str(word)
                row = row.astype(np.float32)
                fout.write(to_utf8(word) + b" " + row.tostring())  # raw float32 bytes, as the binary format expects
            else:
                fout.write(to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))

save_word2vec_format(binary=True, fname=out_data_path, vocab=starspace_embeddings_dict, vector_size=100)

word_vectors = KeyedVectors.load_word2vec_format(out_data_path, binary=True)
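As a quick sanity check that the conversion worked, you can query the loaded vectors (the word here is just a placeholder, use anything from your own vocabulary):
print(word_vectors.most_similar('some_word', topn=5))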

Scrapy response.xpath not returning anything for a query

I am using the scrapy shell to extract some text data. Here are the commands I gave in the scrapy shell:
$ scrapy shell "http://jobs.parklandcareers.com/dallas/nursing/jobid6541851-nurse-resident-cardiopulmonary-icu-feb2015-nurse-residency-requires-contract-jobs"
>>> response.xpath('//*[@id="jobDesc"]/span[1]/text()')
[<Selector xpath='//*[@id="jobDesc"]/span[1]/text()' data=u'Dallas, TX'>]
>>> response.xpath('//*[@id="jobDesc"]/span[2]/p/text()[2]')
[<Selector xpath='//*[@id="jobDesc"]/span[2]/p/text()[2]' data=u'Responsible for attending assigned nursi'>]
>>> response.xpath('//*[@id="jobDesc"]/span[2]/p/text()[preceding-sibling::*="Education"][following-sibling::*="Certification"]')
[]
The third command is not returning any data. I was trying to extract the data between two keywords. Where am I wrong?
//*[@id="jobDesc"]/span[2]/p/text() would return you a list of text nodes. You can filter the relevant nodes in Python. Here's how you can get the text between the "Education/Experience:" and "Certification/Registration/Licensure:" paragraphs:
>>> result = response.xpath('//*[@id="jobDesc"]/span[2]/p/text()').extract()
>>> start = result.index('Education/Experience:')
>>> end = result.index('Certification/Registration/Licensure:')
>>> print ''.join(result[start+1:end])
- Must be a graduate from an accredited school of Nursing.
UPD (regarding an additional question in comments):
>>> response.xpath('//*[@id="jobDesc"]/span[3]/text()').re('Job ID: (\d+)')
[u'143112']
Try:
substring-before(
    substring-after(//*[@id="jobDesc"]/span[2]/p/text(), 'Education'), 'Certification')
Note: I couldn't test it.
The idea is that you cannot use preceding-sibling and following-sibling here because you are looking within a single text node. You have to extract the part of the text you want using substring-before() and substring-after().
By combining those two functions, you select what is in between.
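An untested sketch of how this substring approach could be combined with the Scrapy selector from the answer above (the marker strings are assumed to match the page text exactly):
>>> response.xpath('substring-before(substring-after(string(//*[@id="jobDesc"]/span[2]/p), "Education/Experience:"), "Certification/Registration/Licensure:")').extract()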
