Issue with splitting csv file - stanford-nlp

When I run a CSV file of tweets through the CoreNLP command line, it splits the tweets based on where periods occur in them. I want one output for every line of the CSV file, but that is not always what happens. How do I make the CoreNLP command line split CSV files by line? The same thing happens with txt files as well.
Thank you for any help.

CoreNLP does not currently support processing CSV files. However, it does support one-sentence-per-line files, which covers the special case of a one-column CSV file. To process such a file without any sentence splitting, use: -ssplit.eolonly true
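For example, run from the CoreNLP distribution directory against a one-tweet-per-line file (the file name here is made up):
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -ssplit.eolonly true -file tweets.txt
With -ssplit.eolonly true set, each line is treated as exactly one sentence, so you get one output per tweet.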

Related

comparing 2 csv files using ansible

Has anybody compared 2 CSV files in Ansible? Can you provide example code if so?
I want to compare specific columns in CSV files and output the differences to another file. I was able to do it easily using PowerShell; now I'm looking to do it directly in Ansible.
There is a csvfile lookup plugin in Ansible that can read the content of a comma-separated CSV file. You can use two variables to store the column contents, compare the two stored variables, and redirect the difference to a file.
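As a minimal sketch of that approach, assuming both files share a key value in the first column (the key, file names, and output path are all made up):
- name: Read one column value from each file
  set_fact:
    val_a: "{{ lookup('csvfile', 'some_key file=a.csv delimiter=, col=1') }}"
    val_b: "{{ lookup('csvfile', 'some_key file=b.csv delimiter=, col=1') }}"
- name: Write the difference to another file when the values differ
  copy:
    content: "a.csv: {{ val_a }}, b.csv: {{ val_b }}\n"
    dest: /tmp/csv_diff.txt
  when: val_a != val_b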

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a csv file which, unfortunately, contains multiple data tables. Actually, it's not really a pure csv file.
It contains a header field with some metadata and then the actual csv data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to load such data using Kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.
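For example, a small Python preprocessing script could do the splitting before the Kettle job runs (a sketch: the //------------- separator comes from the example above, while the invocation and output file names are made up):
import sys

# Split a multi-table report into separate CSV files on the //------------- separator.
# Hypothetical usage: python split_report.py report.csv
sections = [[]]
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith('//-------------'):
            sections.append([])
        else:
            sections[-1].append(line)

# The first section is the metadata header; the rest are the data tables.
for i, section in enumerate(sections):
    name = 'header.csv' if i == 0 else 'table_%d.csv' % i
    with open(name, 'w') as out:
        out.writelines(section)
Each table file still starts with its "Table <table_nr>;;;;" line, which you can skip via the header-line settings of the input step.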

Oracle PL/SQL package/procedure for processing CSV from CLOB

I am allowing a user to upload a .csv file via OHS/mod_plsql. This file gets saved as a BLOB file in the database, which I then convert to CLOB. I want to then take the contents of this CLOB file and save its contents into a table. I already have some working code that splits the file based on line endings, then splits each resulting string along commas and inserts the records.
What I need is a way to handle the case where a string within the CSV is enclosed in double-quotes and contains a comma. For example:
col1,col2,col3,col4
some,text,more,text
this,text,has,"commas, semicolons, and periods"
My code will know how to process the second line, but not the third. Does anyone have code that is smart enough to treat "commas, semicolons, and periods" as a single token? I could probably hack something together, but I don't trust my regular expression skills enough, and I figure someone else has probably already written something that does this and would be willing to share it.
There's a good CSV parser in the Alexandria PL/SQL Library - CSV_UTIL_PKG.
https://code.google.com/p/plsql-utils/
More info:
http://ora-00001.blogspot.com.au/2010/04/select-from-spreadsheet-or-how-to-parse.html
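The blog post demonstrates usage roughly like this (a sketch; l_clob stands for your CLOB variable, and you should check the package spec for the exact names and signatures):
select csv.c001, csv.c002, csv.c003, csv.c004
from table(csv_util_pkg.clob_to_csv(l_clob)) csv;
The parser understands quoted fields, so "commas, semicolons, and periods" comes back as a single column value.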

JMeter export graph to CSV exports to multiple columns, write results to file exports to one column

I was annoyed with JMeter writing data results to CSV as one column. So when the CSV file was opened in Excel all values would be added to one single column (which requires annoying manual copy/paste work to get to graphs). I then noticed that if I choose Export to CSV on a Listener graph, it actually exports the CSV file as separate columns in Excel, which is great.
Is it possible to have the "Write results to file" write data into separate columns by default as it does with the graph "Export to CSV"? Thanks!
You have at least 2 options:
Simple Data Writer, which is what you are using at the moment.
In the jmeter.properties file (JMETER_HOME\bin\jmeter.properties), uncomment and set jmeter.save.saveservice.default_delimiter=; to use ';' instead of the default ',' as the separator in the CSV files created by "Write results to file" - this will put the values into separate columns when the file is opened in Excel.
# For use with Comma-separated value (CSV) files or other formats
# where the fields' values are separated by specified delimiters.
jmeter.save.saveservice.default_delimiter=;
Flexible File Writer from the jmeter-plugins pack implements the same functionality and looks to be more customizable.
The idea is the same as above - use ";" to separate the values written to the file:
Write file header: endTimeMillis;responseTime;latency;sentBytes;receivedBytes;isSuccessful;responseCode
Record each sample as: endTimeMillis|;|responseTime|;|latency|;|sentBytes|;|receivedBytes|;|isSuccessful|;|responseCode|\r\n
Hope this helps.

file formats that can be read using PIG

What kind of file formats can be read using PIG?
How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command it creates a directory and stores the file as part-m-00000; how can I change the name of the file and overwrite the directory?
what kind of file formats can be read using PIG? how can i store them in different formats?
There are a few built-in loading and storing methods, but they are limited:
BinStorage - "binary" storage
PigStorage - loads and stores data that is delimited by something (such as tab or comma); see the example after this list
TextLoader - loads data line by line (i.e., delimited by the newline character)
piggybank is a library of community-contributed user-defined functions, and it has a number of loading and storing methods, including an XML loader but not an XML storer.
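For example, PigStorage can load a comma-delimited CSV and store it back out tab-delimited (file names assumed):
A = LOAD 'input.csv' USING PigStorage(',') AS (col1:chararray, col2:int, col3:int);
STORE A INTO 'output_dir' USING PigStorage('\t');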
say we have CSV file n i want to store it as MXL file how this can be done?
I assume you mean XML here... Storing XML is a bit rough in Hadoop because output is split on a per-reducer basis, so how do you know where to put the root tag? Producing well-formed XML will likely require some sort of post-processing.
One thing you can do is to write a UDF that converts your columns into an XML string:
B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);
For example, say col1, col2, col3 are "foo", 37, "lemons", respectively. Your UDF can output the string "<item><name>Foo</name><num>37</num><fruit>lemons</fruit></item>".
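As a sketch, such a UDF could also be written as a Pig Python (Jython) function, registered in the script with REGISTER 'udf.py' USING jython AS customudfs; (the file, namespace, and tag names are illustrative):
@outputSchema('xml:chararray')
def DataToXML(name, num, fruit):
    # Wrap the three columns in fixed XML tags; adjust the tag names to your schema.
    return '<item><name>%s</name><num>%s</num><fruit>%s</fruit></item>' % (name, num, fruit)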
whenever we use STORE command it makes directory and it stores file as part-m-00000 how can i change name of the file and overwrite directory?
You can't change the name of the output file to be something other than part-m-00000. That's just how Hadoop works. If you want to change the name of it, you should do something to it after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the pig script then executes this command.
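A minimal wrapper along those lines (the script and path names are assumed):
#!/bin/bash
# Run the Pig script, then rename its output file.
pig myscript.pig
hadoop fs -mv output/part-m-00000 newoutput/myoutputfile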
