How to export PDF table data into CSV? - ruby

I am using Rails 4.2, Ruby 2.2, Gem: 'pdf-reader'.
My application reads a PDF file that contains table data and exports it to CSV, which I have already done. But when I match the result against the table header and the table content, the values end up in the wrong positions. This is because a PDF table is not an actual table; some extra logic is needed to pair header cells with row cells, and that is the part I am asking about.
marks.pdf has content similar to what is shown below:

School Name: ABC
Program: MicroBiology    Year: Second

| Roll No | Math |
|---------|------|
| 1000001 | 65   |
Any help would be appreciated.
The working code that reads the PDF and exports to CSV is given below:
require 'pdf-reader'
require 'csv'

class ExportToCsv
  # Reads the PDF and exports each data row, prefixed with the most
  # recent header line, to a tab-separated file.
  def convert_to_csv
    pdf_reader = PDF::Reader.new("public/marks.pdf")
    CSV.open("output100.tsv", "wb", col_sep: "\t") do |csv|
      data_header = ""
      pdf_reader.pages.each do |page|
        page.text.each_line do |line|
          if /^[a-z|\s]*$/i =~ line
            # Line with characters only: treat it as a header line.
            data_header = line.strip
          else
            # Line containing numbers: treat it as a data row.
            data_row = line.split(/[0-9]/).first
            csv_line = line.sub(data_row, '').strip.split(/[\(|\)]/)
            csv_line.unshift(data_row).unshift(data_header)
            csv << csv_line
          end
        end
      end
    end
  end
end
I am not able to attach the original PDF here for security reasons, sorry for that; it can be recreated from the layout shown above. (The original post attached screenshots of the PDF, the generated CSV, and the desired output.)
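For reference, here is a minimal sketch of the kind of "extra logic" being asked about, assuming pdf-reader renders each table row as a single text line with columns separated by runs of two or more spaces (the header pattern and output path are illustrative, not from the original post):

require 'pdf-reader'
require 'csv'

reader = PDF::Reader.new('public/marks.pdf')

CSV.open('output.csv', 'wb') do |csv|
  headers = nil
  reader.pages.each do |page|
    page.text.each_line do |line|
      # Columns in pdf-reader's text output are typically separated by
      # runs of spaces, so split on 2+ spaces to get one cell per column.
      cells = line.strip.split(/\s{2,}/)
      next if cells.empty?
      if headers.nil? && cells.first =~ /Roll No/i
        headers = cells   # remember the header row once we see it
        csv << headers
      elsif headers && cells.length == headers.length
        csv << cells      # row cells pair 1:1 with the header cells
      end
    end
  end
end

Because each data row is emitted together with its matching header, the misalignment between header and content goes away as long as the two have the same number of cells.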

Related

Rails Not Reading a Specific Column from CSV

I'm trying to read my CSV file in Rails 5.2.3 (Ruby 2.6.3) and it's not reading a specific column.
CSV file:
barcode,hardware,title,price
2900007868390,PS4,title1,300
3499550362923,PS4,title2,1800
3499550370973,Nintendo Switch,title3,5000
and so on...
Code:
csv = CSV.read('path/to/my_csv.csv', headers: true)
csv.each do |row|
puts "#{row['barcode']}, #{row['hardware']}, #{row['title']}, #{row['price']}"
end
Result:
, PS4, title1, 300
, PS4, title2, 1800
, Nintendo Switch, title3, 5000
and so on...
As you can see above, it's not reading the barcode column for some reason.
I've managed to get the barcode value if I write row[0] instead of row['barcode'].
Any ideas why this is happening?
This was because my CSV file had a BOM (byte order mark)...
https://en.wikipedia.org/wiki/Byte_order_mark
The code below solved the problem:
csv = CSV.read("resources/games/original_#{date}.csv", 'r:BOM|UTF-8', headers: true)
Thank you @dimitry_n for your help!
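For anyone hitting the same thing, here is a small self-contained demo (file name made up) showing both the failure and the fix via CSV's encoding: option, which passes the bom|utf-8 converter through to the underlying File.open. At least on the Ruby/csv versions in the question, the BOM is not stripped automatically:

require 'csv'

# Write a CSV file that starts with a UTF-8 BOM, as many Windows tools do.
File.write('games_with_bom.csv', "\xEF\xBB\xBFbarcode,hardware\n2900007868390,PS4\n")

# Without BOM handling, the first header is really "\xEF\xBB\xBFbarcode",
# so looking it up by the plain name returns nil.
rows = CSV.read('games_with_bom.csv', headers: true)
p rows.first['barcode']  # => nil

# Telling Ruby to strip the BOM restores the expected header name.
rows = CSV.read('games_with_bom.csv', headers: true, encoding: 'bom|utf-8')
p rows.first['barcode']  # => "2900007868390"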

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")'
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/

def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end

str = 'S\X2\00E9\X0\jour/Cuisine'

puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
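For completeness, a sketch (untested against real IFC files) extending the same idea to the single-byte \S\ and \X\ directives described in the implementation guide; \S\ shifts the next character into the upper half of ISO-8859-1, and \X\ is followed by two hex digits:

def decode_ifc_extended(str)
  # \X2\...\X0\ : groups of four hex digits, each one a Unicode code point.
  str = str.gsub(/\\X2\\(.*?)\\X0\\/) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
  # \S\ : the following character, shifted by 128 into ISO-8859-1.
  str = str.gsub(/\\S\\(.)/) do
    ($1.ord + 128).chr(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  end
  # \X\ : two hex digits naming an ISO-8859-1 code point.
  str.gsub(/\\X\\([0-9A-F]{2})/) do
    $1.to_i(16).chr(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  end
end

puts decode_ifc_extended('S\X2\00E9\X0\jour/Cuisine')  # => Séjour/Cuisine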
In case someone is interested, here is Python code I wrote that decodes 3 of the IFC encodings: \X, \X2\ and \S\.
import re

def decodeIfc(txt):
    # In a regex, "\" is hard to manage in Python... I use this workaround.
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ', '\\')
    return txt

def decodeIfcX2(match):
    # X2 encodes characters as a multiple of 4 hexadecimal digits.
    return ''.join(map(lambda x: chr(int(x, 16)), re.findall('([0-9A-F]{4})', match.group(1))))

def decodeIfcS(match):
    return chr(ord(match.group(1)) + 128)

def decodeIfcX(match):
    # Sometimes IFC files were made with old Macs... which used the MacRoman encoding.
    num = int(match.group(1), 16)
    if (num <= 127) | (num >= 160):
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")

Using a CSV file to insert values using Ruby

I have some sample code I can execute for our Nexpose server and I need to do some mass asset tagging. Here is an example of the code.
nsc = Nexpose::Connection.new('your_nexpose_instance', 'username', 'password', 3780)
nsc.login
criterion = Nexpose::Tag::Criterion.new('IP_RANGE', 'IN', ['ip1', 'ip2'])
criteria = Nexpose::Tag::Criteria.new(criterion)
tag = Nexpose::Tag.new("tagname", Nexpose::Tag::Type::Generic::CUSTOM)
tag.search_criteria = criteria
tag.save(nsc)
I have a file with the following data:
ip1,ip2,tagname
192.168.1.1,192.168.1.255,Workstations
How would I go about running a for loop and using the CSV to quickly process the above code? I have no experience with Ruby and tried to follow some examples, but I'm confused at this point.
There's a CSV library in Ruby's standard lib collection that you can use.
Basic example based on your code example and data, not tested:
require 'csv'

nsc = Nexpose::Connection.new('your_nexpose_instance', 'username', 'password', 3780)
nsc.login

CSV.foreach("path/to/file.csv", headers: true) do |row|
  criterion = Nexpose::Tag::Criterion.new('IP_RANGE', 'IN', [row['ip1'], row['ip2']])
  criteria = Nexpose::Tag::Criteria.new(criterion)
  tag = Nexpose::Tag.new(row['tagname'], Nexpose::Tag::Type::Generic::CUSTOM)
  tag.search_criteria = criteria
  tag.save(nsc)
end
I made a directory with input.csv and main.rb
input.csv
ip1,ip2,tagname
192.168.1.1,192.168.1.255,Workstations
main.rb
require "csv"
CSV.foreach("input.csv", headers: true) do |row|
puts "ip1: #{row['ip1']}"
puts "ip2: #{row['ip2']}"
puts "tagname: #{row['tagname']}"
end
The output is:
ip1: 192.168.1.1
ip2: 192.168.1.255
tagname: Workstations
I hope this can help. If you have questions I'm here :)
If you just need to loop through each line of the file and fire that chunk of code for each line, you could do something like this:
require 'net/http'

file = Net::HTTP.get(URI(<whatever_your_file_name_is>))

file.each_line.with_index do |line, index|
  next if index == 0 # skip the header record

  split_line = line.strip.split(',')
  ip1 = split_line[0]
  ip2 = split_line[1]
  tagname = split_line[2]

  nsc = Nexpose::Connection.new('your_nexpose_instance', 'username', 'password', 3780)
  nsc.login

  criterion = Nexpose::Tag::Criterion.new('IP_RANGE', 'IN', [ip1, ip2])
  criteria = Nexpose::Tag::Criteria.new(criterion)
  tag = Nexpose::Tag.new(tagname, Nexpose::Tag::Type::Generic::CUSTOM)
  tag.search_criteria = criteria
  tag.save(nsc)
end
NOTE: This code example is assuming that the CSV file is stored remotely, not locally.
ALSO: In case you're wondering, the next if index == 0 is there to skip your header record.
UPDATE
To use this approach for a local file, you can use File.open() instead of Net::HTTP.get(), like so:
file = File.open(<whatever_your_file_name_is>).read
Two things to note:
Make sure you use the fully-qualified name of the file, e.g. ~/folder/folder/filename.csv instead of just filename.csv.
If the files you're going to be loading are enormous, this might not be an ideal approach because it's actually reading the whole file into memory. But considering your file only has 3 columns, you'd have to have an extreme number of rows in the file for this to be an issue.
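If the file is local and really does become huge, the CSV.foreach approach from the first answer side-steps the memory issue, since it streams the file one row at a time; a minimal sketch using the input.csv from above:

require 'csv'

# CSV.foreach yields one row at a time instead of slurping the whole
# file, so memory use stays flat; headers: true also replaces the
# manual "skip the header record" bookkeeping.
CSV.foreach('input.csv', headers: true) do |row|
  puts "#{row['ip1']} - #{row['ip2']} => #{row['tagname']}"
end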

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT><A HREF="...">This is a title</A>
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT><A HREF="..." TAGS="tag1,tag2">This is a title</A>
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time thing it should be sufficient.
import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')

# Initialise so the first non-matching line doesn't raise a NameError.
current_dt = ""
current_dd = ""

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            # Collect the #tags and move them into the <A> tag as TAGS="..."
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>',
                                '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">',
                                current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""

print(current_dt)
print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{ FS="#" }
/^<DT>/{
  if(d==1) print "<DT>"s       # print the previous <DT> line if it had no tags
  s=substr($0,5); tags=""      # copy the line after "<DT>". You'll know why
  d=1
}
/^<DD>/{
  d=0
  m=match(s,/>/)               # find the end of the HREF descriptor (first match of ">")
  for(i=2;i<=NF;i++){ sub(/ $/,"",$i); tags=tags","$i }  # concatenate tags
  td=match(tags,/ /)           # parse for a tag description (marked by a preceding space)
  if(td==0){                   # no description exists
    tags=substr(tags,2)
    tagdes=""
  }
  else{                        # description exists
    tagdes=substr(tags,td)
    tags=substr(tags,2,td-2)
  }
  print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
  print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT:
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland
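Not from the original thread, but since most of this page is Ruby-flavoured, here is an equivalent sketch in Ruby under the same assumptions about the input format as the scripts above (untested against the full 3500-entry file; the script name below is hypothetical):

current_dt = nil

ARGF.each_line do |line|
  if line.start_with?('<DT>')
    puts current_dt if current_dt   # previous entry had no <DD> line
    current_dt = line.chomp
  elsif line.start_with?('<DD>') && current_dt
    words = line.chomp.sub('<DD>', '').split(' ')
    tags, desc = words.partition { |w| w.start_with?('#') }
    unless tags.empty?
      # Move the #tags into the <A> tag as a TAGS="..." attribute.
      tag_attr = " TAGS=\"#{tags.map { |t| t[1..-1] }.join(',')}\""
      current_dt = current_dt.sub(/(<A[^>]+)>/) { "#{$1}#{tag_attr}>" }
    end
    puts current_dt
    puts "<DD>#{desc.join(' ')}" unless desc.empty?
    current_dt = nil
  end
end

puts current_dt if current_dt       # flush a trailing <DT>-only entry

Run it as ruby kippt2pinboard.rb kippt > pinboard, analogous to the awk invocation above.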

Error in writing to a file

I have written a Python script that calls Unix sort using the subprocess module. I am trying to sort a table based on two columns (2 and 6). Here is what I have done:
import subprocess

sort_bt = open("sort_blast.txt", 'w+')
sort_file_cmd = "sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd, stdout=sort_bt, shell=True)
The output file, however, contains an incomplete line, which produces an error when I parse the table; yet when I check the corresponding entry in the input file given to sort, the line looks perfect. I guess there is some problem when sort tries to write the result to the specified file, but I am not sure how to solve it.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In the output file, however, only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that Python has a built-in method for sorting items.
Looking at your sample data, it appears to be structured data, with a | delimiter. Here's how you could open that file, and iterate over the results in python in a sorted manner:
def custom_sorter(first, second):
    """ A custom sort function (Python 2 style cmp) which compares items
    based on the values in the 2nd and 6th columns. """
    # First, we break each line into a list.
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare.
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:     # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0

with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!

with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
    for line in sorted(data.splitlines(), cmp=custom_sorter):  # Sort the data on the fly
        dst_sorted_file.write(line + "\n")  # Write the line (and its newline) to the dst file.
FYI, this code may need some jiggling. I didn't test it too well. Also note that sorted's cmp= argument exists only in Python 2; on Python 3 you would wrap the comparator with functools.cmp_to_key.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate the sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:

from subprocess import check_call

with open("sort_blast.txt", 'wb') as output_file:
    check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python, e.g., for a small input file:

def custom_key(line):
    fields = line.split()  # split the line on any whitespace
    return fields[1], float(fields[5])  # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
    L = input_file.read().splitlines()  # read from the input file
    L.sort(key=custom_key)              # sort it
    output_file.write("\n".join(L))     # write to the output file
If you need to sort a file that does not fit in memory, see Sorting text file by using Python.
