CSV.foreach Not Reading First Column in CSV File - ruby

Learning Ruby for the first time to automate cleaning up some CSV files.
I've managed to piece together the script below from other SO questions but for some reason the script does not read the first column of the original CSV file. If I add a dummy first column everything works perfectly. What am I missing?
require 'csv'
COLUMNS = ['SFID','Date','Num','Transaction Type']
CSV.open("invoicesfixed.csv", "wb",
:write_headers=> true,
:headers => ["Account__c","Invoice_Date__c","Invoice_Number__c","Transaction_Type__c"]) do |csv|
CSV.foreach('invoices.csv', :headers=>true, :converters => :all) do |row|
#convert date format to be compatible with Salesforce
row['Date'] = Date.strptime(row['Date'], '%m/%d/%y').strftime('%Y-%m-%d')
csv << COLUMNS.map { |col| row[col] }
end
end
This input file:
Transaction Type,Date,Num,SFID
Invoice,7/1/19,151466,SFID1
Invoice,7/1/19,151466,SFID2
Invoice,7/1/19,151466,SFID3
Invoice,7/1/19,151466,SFID4
Invoice,7/1/19,151466,SFID5
Invoice,7/1/19,151466,SFID6
Invoice,7/1/19,151153,SFID7
Sales Receipt,7/1/19,149487,SFID8
Sales Receipt,7/1/19,149487,SFID9
Sales Receipt,7/1/19,149758,SFID10
Sales Receipt,7/1/19,149758,SFID11
Yields this output:
Account__c,Invoice_Date__c,Invoice_Number__c,Transaction_Type__c
SFID1,2019-07-01,151466,
SFID2,2019-07-01,151466,
SFID3,2019-07-01,151466,
SFID4,2019-07-01,151466,
SFID5,2019-07-01,151466,
SFID6,2019-07-01,151466,
SFID7,2019-07-01,151153,
SFID8,2019-07-01,149487,
SFID9,2019-07-01,149487,
SFID10,2019-07-01,149758,
SFID11,2019-07-01,149758,
However, this input:
Dummy,Transaction Type,Date,Num,SFID
,Invoice,7/1/19,151466,SFID1
,Invoice,7/1/19,151466,SFID2
,Invoice,7/1/19,151466,SFID3
,Invoice,7/1/19,151466,SFID4
,Invoice,7/1/19,151466,SFID5
,Invoice,7/1/19,151466,SFID6
,Invoice,7/1/19,151153,SFID7
,Sales Receipt,7/1/19,149487,SFID8
,Sales Receipt,7/1/19,149487,SFID9
,Sales Receipt,7/1/19,149758,SFID10
,Sales Receipt,7/1/19,149758,SFID11
Yields the correct output of:
Account__c,Invoice_Date__c,Invoice_Number__c,Transaction_Type__c
SFID1,2019-07-01,151466,Invoice
SFID2,2019-07-01,151466,Invoice
SFID3,2019-07-01,151466,Invoice
SFID4,2019-07-01,151466,Invoice
SFID5,2019-07-01,151466,Invoice
SFID6,2019-07-01,151466,Invoice
SFID7,2019-07-01,151153,Invoice
SFID8,2019-07-01,149487,Sales Receipt
SFID9,2019-07-01,149487,Sales Receipt
SFID10,2019-07-01,149758,Sales Receipt
SFID11,2019-07-01,149758,Sales Receipt
Any ideas why this might be happening?

I had a similar problem, though running your example worked.
I realized that problem (at least for me) was that I was creating CSV file using "Save As UTF-8 CSV" from Excel.
This adds BOM to the beginning of the file - before the first column header name and consequently row['firstColumnName'] was returning nil.
Saving file as CSV fixed the issue for me.

Related

Google Cloud DLP - CSV inspection

I'm trying to inspect a CSV file and there are no findings being returned (I'm using the EMAIL_ADDRESS info type and the addresses I'm using are coming up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV does that mean one can't just sent in the file, and needs to use the table attribute instead of byte_item as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
What about for XSLX files for example? They're an unspecified file type so I tried with a configuration like so, but it still returned no findings:
byte_item: {
type: :BYTES_TYPE_UNSPECIFIED,
data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
}
I'm able to do inspection and redaction with images and text fine, but having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans#gmail.com,anotehu,steve#example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
parent = "projects/xxxxxxxx/global"
inspect_config = {
info_types: [{name: "EMAIL_ADDRESS"}],
min_likelihood: :POSSIBLE,
limits: { max_findings_per_request: 0 },
include_quote: true
}
request = {
parent: parent,
inspect_config: inspect_config,
item: {
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
}
}
dlp = Google::Cloud::Dlp.dlp_service
response = dlp.inspect_content(request)
The CSV file I was testing with was something I created using Google Sheets and exported as a CSV, however, the file showed locally as a "text/plain; charset=us-ascii". I downloaded a CSV off the internet and it had a mime of "text/csv; charset=utf-8". This is the one that worked. So it looks like my issue was specifically due the file being an incorrect mime type.
xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)

A JSON text must at least contain two octets! (JSON::ParserError)

I'm working with a Ruby script that reads a .json file.
Here is the JSON file:
{
"feed.xml": "93d5b140dd2b4779edef0347ac835fb1",
"index.html": "1cbe25936e392161bad6074d65acdd91",
"md5.json": "655d7c1dbf83a271f348a50a44ba4f6a",
"test.sh": "9be192b1b5a9978cb3623737156445fd",
"index.html": "c064e204040cde216d494776fdcfb68f",
"main.css": "21b13d87db2186d22720e8c881a78580",
"welcome-to-jekyll.html": "01d7c7d66bdeecd9cd69feb5b4b4184d"
}
It is completely valid, and is checked for its existence before trying to read from it. Example:
if File.file?("md5.json")
puts "MD5s exists"
mddigests = File.open("md5.json", "r")
puts "MD5s" + mddigests.read
items = JSON.parse(mddigests.read) <--- Where it all goes wrong.
puts items["feed.xml"]
Everything works up until that point:
MD5s exists
MD5s{
"feed.xml": "93d5b140dd2b4779edef0347ac835fb1",
"index.html": "1cbe25936e392161bad6074d65acdd91",
"md5.json": "655d7c1dbf83a271f348a50a44ba4f6a",
"test.sh": "9be192b1b5a9978cb3623737156445fd",
"index.html": "c064e204040cde216d494776fdcfb68f",
"main.css": "21b13d87db2186d22720e8c881a78580",
"welcome-to-jekyll.html": "01d7c7d66bdeecd9cd69feb5b4b4184d"
}
common.rb:156:in `initialize': A JSON text must at least contain two octets! (JSON::ParserError)
I've searched and tried a lot of different things, to no avail. I'm stumped. Thanks!
You have a duplicate call to read() at the point that it all goes wrong. Replace the second call to read() with the variable mddigests and all should be fine.
This code should work like you'd expect:
if File.file?("md5.json")
puts "MD5s exists"
mddigests = File.open("md5.json", "r")
digests = mddigests.read
puts "MD5s" + digests
items = JSON.parse(digests) #<--- This should work now!
puts items["feed.xml"]
end
The reason is that the file pointer is moved after the first read(), and by the second read(), it's at the end of file, hence the message requiring at least 2 octets.

Ruby - CSV works while SmarteCSV doesn't

I want to open a csv file using SmarterCSV.process
market_csv = SmarterCSV.process(market)
p "just read #{market_csv}"
The problem is that the data is not read and this prints:
[]
However, if I attempt the same thing with the default CSV library implementation the content of the file is read(the following print statement prints the file).
CSV.foreach(market) do |row|
p row
end
The content of the file I was reading is of the form:
Date,Close
03/06/15,0.1634
02/06/15,0.1637
01/06/15,0.1638
31/05/15,0.1638
The problem could come from the line separator, the file is not exactly the same if you're using windows or unix system ("\r\n" or "\r"). Try to identify and specify the character in the SmarterCSV.process like this:
market_csv = SmarterCSV.process(market, row_sep: "\r")
p "just read #{market_csv}"
or like this:
market_csv = SmarterCSV.process(market, row_sep: :auto)
p "just read #{market_csv}"

Create a file descriptor in ruby

I am writing a script will perform various tasks with DSV or positional files. These tasks varies and are like creating an DB table for the file, or creating a shell script for parsing it.
As I have idealized my script would receive a "descriptor" as input to perform its tasks. It then would parse this descriptor and perform its tasks accordingly.
I came up with some ideas on how to specify the descriptor file, but didn't really manage to get something robust - probably due my inexperience in ruby.
It seems though, the best way to parse the descriptor would be using ruby language itself and then somehow catch parsing exceptions to turn into something more relevant to the context.
Example:
The file I will be reading looks like (myfile.dsv):
jhon,12343535,27/04/1984
dave,53245265,30/03/1977
...
Descriptor file myfile.des contains:
FILE_TYPE = "DSV"
DSV_SEPARATOR = ","
FIELDS = [
name => [:pos => 0, :type => "string"],
phone => [:pos => 1, :type => "number"],
birthdate => [:pos => 2, :type => "date", :mask = "dd/mm/yyyy"]
]
And the usage should be:
ruby script.rb myfile.des --task GenerateTable
So the program script.rb should load and parse the descriptor myfile.des and perform whatever tasks accordingly.
Any ideas on how to perform this?
Use YAML
Instead of rolling your own, use YAML from the standard library.
Sample YAML File
Name your file something like descriptor.yml, and fill it with:
---
:file_type: DSV
:dsv_separator: ","
:fields:
:name:
:pos: 0
:type: string
:phone:
:pos: 1
:type: number
:birthdate:
:pos: 2
:type: date
:mask: dd/mm/yyyy
Loading YAML
You can read your configuration back in with:
require 'yaml'
settings = YAML.load_file 'descriptor.yml'
This will return a settings Hash like:
{:file_type=>"DSV",
:dsv_separator=>",",
:fields=>
{:name=>{:pos=>0, :type=>"string"},
:phone=>{:pos=>1, :type=>"number"},
:birthdate=>{:pos=>2, :type=>"date", :mask=>"dd/mm/yyyy"}}}
which you can then access as needed to configure your application.

How to replace a particular line in xml with the new one in ruby

I have a requirement where I need to replace the element value with the new one and I dont want any other modification to be done to the file.
<mtn:test-case title='Power-Consist-Message'>
<mtn:messages>
<mtn:message sequence='4' correlation-key='0x0F04'>
<mtn:header>
<mtn:protocol-version>0x4</mtn:protocol-version>
<mtn:message-type>0x0F04</mtn:message-type>
<mtn:message-version>0x01</mtn:message-version>
<mtn:gmt-time-switch>false</mtn:gmt-time-switch>
<mtn:crc-calc-switch>1</mtn:crc-calc-switch>
<mtn:encrypt-switch>false</mtn:encrypt-switch>
<mtn:compress-switch>false</mtn:compress-switch>
<mtn:ttl>999</mtn:ttl>
<mtn:qos-class-of-service>0</mtn:qos-class-of-service>
<mtn:qos-priority>2</mtn:qos-priority>
<mtn:qos-network-preference>1</mtn:qos-network-preference>
this is how the xml file looks like, I want to replace 999 with "some other value", under s section, but when am doing that using formatter in ruby some other unwanted modifications are taking place, the code that am using is as belows
File.open(ENV['CadPath1']+ "conf\\cad-mtn-config.xml") do |config_file|
# Open the document and edit the file
config = Document.new(config_file)
testField=config.root.elements[4].elements[11].elements[1].elements[1].elements[1].elements[11]
if testField.to_s.match(/<mtn:qos-network-preference>/)
test=config.root.elements[4].elements[11].elements[1].elements[1].elements[1].elements[8].text="2"
# Write the result to a new file.
formatter = REXML::Formatters::Default.new
File.open(ENV['CadPath1']+ "conf\\cad-mtn-config.xml", 'w') do |result|
formatter.write(config, result)
end
end
end
when am writting the modifications to the new file, the xml file size is getting changed from 79kb to 78kb, is there any way to just replace the particular line in xml file and save changes without affecting the xml file.
Please let me know soon...
I prefer Nokogiri as my XML/HTML parser of choice:
require 'nokogiri'
xml =<<EOT
<mtn:test-case title='Power-Consist-Message'>
<mtn:messages>
<mtn:message sequence='4' correlation-key='0x0F04'>
<mtn:header>
<mtn:protocol-version>0x4</mtn:protocol-version>
<mtn:message-type>0x0F04</mtn:message-type>
<mtn:message-version>0x01</mtn:message-version>
<mtn:gmt-time-switch>false</mtn:gmt-time-switch>
<mtn:crc-calc-switch>1</mtn:crc-calc-switch>
<mtn:encrypt-switch>false</mtn:encrypt-switch>
<mtn:compress-switch>false</mtn:compress-switch>
<mtn:ttl>999</mtn:ttl>
<mtn:qos-class-of-service>0</mtn:qos-class-of-service>
<mtn:qos-priority>2</mtn:qos-priority>
<mtn:qos-network-preference>1</mtn:qos-network-preference>
EOT
Notice that the XML is malformed, i.e., it doesn't terminate correctly.
doc = Nokogiri::XML(xml)
I'm using CSS accessors to find the ttl node. Because of some magic, Nokogiri's CSS ignores XML name spaces, simplifying finding nodes.
doc.at('ttl').content = '1000'
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <test-case title="Power-Consist-Message">
# >> <messages>
# >> <message sequence="4" correlation-key="0x0F04">
# >> <header>
# >> <protocol-version>0x4</protocol-version>
# >> <message-type>0x0F04</message-type>
# >> <message-version>0x01</message-version>
# >> <gmt-time-switch>false</gmt-time-switch>
# >> <crc-calc-switch>1</crc-calc-switch>
# >> <encrypt-switch>false</encrypt-switch>
# >> <compress-switch>false</compress-switch>
# >> <ttl>1000</ttl>
# >> <qos-class-of-service>0</qos-class-of-service>
# >> <qos-priority>2</qos-priority>
# >> <qos-network-preference>1</qos-network-preference>
# >> </header></message></messages></test-case>
Notice that Nokogiri replaced the content of the ttl node. It also stripped the XML namespace info because the document didn't declare it correctly, and, finally, Nokogiri has added closing tags to make the document syntactically correct.
If you want the namespace to be declared in the output, you'll need to make sure it's there in the input.
If you need to just literally replace that value without affecting anything else about the XML file, even if (as pointed by the Tin Man above) that would mean leaving the original XML file malformed, you can do that with direct string manipulation using a regular expression.
Assuming there is guaranteed to only be one <mtn:ttl> tag in your XML document, you could just do:
doc = IO.read("somefile.xml")
doc.sub! /<mtn:ttl>.+?<\/mtn:ttl>/, "<mtn:ttl>some other value<\/mtn:ttl>"
File.open("somefile.xml", "w") {|fh| fh.write(doc)}
If there might be more than one <mtn:ttl> tag, then this is trickier; how much trickier depends on how you want to figure out which tag(s) to change.

Resources