I'm doing a pretty simple operation: opening a CSV file, deleting the first column, and writing out to a new file. The following code works fine, but it takes 50-60 seconds on my 700 MB file:
import csv
from time import time
#create empty output file
f = open('testnew.csv',"w")
f.close()
t = time()
with open('test.csv',"rt") as source:
    rdr= csv.reader( source )
    with open('testnew.csv',"a") as result:
        wtr= csv.writer( result )
        for r in rdr:
            del r[0]
            _ = wtr.writerow( r )
print(round(time()-t))
By contrast, the following shell script does the same thing in 7-8 seconds:
START_TIME=$SECONDS
cut -d',' -f2- < test.csv > testnew.csv
echo $(($SECONDS - $START_TIME))
Is there a way I can get comparable performance in Python?
If I understand correctly, the shell script simply splits each line at the first comma, regardless of whether that comma is enclosed in quotes, and writes out the second part. (I do not know what the shell script does if there is no comma.) The csv module does much more work, which is useless for you here. To do the same thing as the shell in Python, skip the csv module:
with open('test.csv', 'rt') as source, open('testnew.csv', 'w') as result:
    for line in source:
        parts = line.split(',', maxsplit=1)
        result.write(parts[-1])
This passes lines without a comma through as-is. It also leaves any spaces after the comma in place (I do not know whether cut does the same). If you do not want that, you can either use re.split instead of line.split or call .lstrip() on parts[-1] in the last line.
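For example, a minimal sketch of the re.split variant could look like this (same file names as above; the pattern also consumes any spaces directly after the first comma):

import re

with open('test.csv', 'rt') as source, open('testnew.csv', 'w') as result:
    for line in source:
        # split at the first comma plus any spaces directly after it
        parts = re.split(r', *', line, maxsplit=1)
        result.write(parts[-1])  # whole line if there was no comma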
Your bash script does not parse the CSV; it only splits and cuts each line. So, in Python, we can do the same:
with open('test.csv',"r") as source:
    with open('testnew.csv',"w") as result:
        for l in source:
            _, tail = l.split(',', 1)
            result.write(tail)
My simple profiling (4 MB file):
bash - 193 ms
python csv parsing - 2391 ms
python string splitting - 620 ms
Python 2 is faster for some reason:
bash - 193 ms
python csv parsing - 1471 ms
python string splitting - 373 ms
Related
I am executing the following awk command on Windows 10.
awk "(NR == 1) || (FNR > 1)" *.csv > bigMergeFile.csv
I want to merge all csv files into a single file named bigMergeFile.csv using only the header of the first file.
I successfully tested the code on small files (4 files, each with 5 columns and 4 rows). However, the code does not halt when I run it on large files (10 files, each with 8k rows and 32k columns, roughly 1 GB each). It only stops when the hard drive runs out of space; at that point the output file bigMergeFile.csv is 30 GB, while the combined size of all input csv files is only 9.5 GB.
I have tested the code on Mac OS and it works fine. Help will be appreciated.
My guess: bigMergeFile.csv ends in .csv so it's one of the input files your script is running on and it's growing as your script appends to it. It's like you wrote a loop like:
while ! end-of-file do
  read line from start of file
  write line to end of file
done
Also, since you're doing basically a concatenation, not a merge, set FS="^$" so awk won't waste time attempting to split fields you won't need anyway.
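If it helps, here is a minimal Python sketch of the same fix, on the assumption that the core problem really is the output file matching the *.csv glob: write to a temporary name the glob cannot pick up, skip any stale copy of the output, keep only the first header, and rename at the end. Only bigMergeFile.csv comes from the question; the temporary name and helper names are illustrative.

import glob, os, shutil

out_name = "bigMergeFile.csv"   # output name from the question
tmp_name = "bigMergeFile.tmp"   # not matched by *.csv, so it is never re-read

with open(tmp_name, "w") as out:
    header_written = False
    for path in sorted(glob.glob("*.csv")):
        if os.path.basename(path) == out_name:
            continue                      # skip a stale output file, if present
        with open(path) as src:
            header = src.readline()
            if not header_written:        # keep only the first file's header
                out.write(header)
                header_written = True
            shutil.copyfileobj(src, out)  # copy the remaining lines verbatim

os.replace(tmp_name, out_name)

Note this sketch assumes every input file ends with a newline; if one does not, its last row would run into the first row of the next file.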
I am writing a script in Perl that takes values from two databases and compares them. When the script is finished, it outputs a report that is supposed to be formatted this way:
Table Name Date value from Database 1 value from Database 2 Difference
The output is printed into a report file, but even when it is output to the command console it looks like this:
tablename 2017-06-20 7629628
7629628
0
Here's my code that makes the string then outputs it to the file:
$outputstring="$tablelist[0] $DATErowcount0[$output_iteration] $rowcount0[$output_iteration] $TDrowcount0[$output_iteration] $count_dif\n";
print FILE $outputstring;
There seems to be a newline character hidden after $rowcount0[$output_iteration] and before $count_dif. What do I need to do to fix this/print it all in one line?
To fill the arrays with values, values are read from files created by SQL commands.
Here's some of the code:
$num_from_TDfile=substr $r2, 16;
$date_from_TDfile = substr $r2, 0, 12;
$TDrowcount0[$TDnum_rows0]=$num_from_TDfile;
$DATETDrowcount0[$TDnum_rows0]=$date_from_TDfile;
$TDnum_rows0=$TDnum_rows0+1;
Adding chomp to each of the strings read from the files, as suggested by tadman, fixed the output so that it all appeared on one line rather than on three lines as in the question's example.
I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, I have about 5 random rows that this doesn't work on even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column on original.csv vs final.csv respectively;
2212026837 2212026837
2256 41688 6 2256416886
2076113566 2076113566
2009 84517 7 2009845177
2067950476 2067950476
2057 90531 5 2057 90531 5
2085271676 2085271676
2095183426 2095183426
2347366235 2347366235
2200160434 2200160434
2229359595 2229359595
2045373466 2045373466
2053849895 2053849895
2300 81552 3 2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
If that solves your problem, great! You got lucky, move on. :)
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
import csv, fileinput as fi, re

for row in csv.reader(fi.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25]) # fields start at 0 instead of 1
    print '|'.join(row)
Save the above in a file like colfixer.py and run it with python colfixer.py original.csv >final.csv.
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
  n = split($26, a, /[[:space:]]+/)
  $26=a[1]
  for(i=2; i<=n; i++)
    $26=$26""a[i]
}1' original.csv > final.csv
Okay, so I am at best a novice at bash scripting, but late last night I wrote this very small script to take the first 40 characters of each line of a fairly large text file (~300,000 lines), search a much larger text file (~2.2 million lines) for matches, and then write all of the matching lines to a new text file.
So the script looks like this:
#!/bin/bash
while read -r line
do
  match=${line:0:40}
  grep "$match" large_list.txt
done <"small_list.txt"
and then I call the script like so:
$ bash my_script.sh > outputfile.txt &
This gives me all the common elements between the two lists. It works, but slowly. I am running it on an m1.small EC2 instance, and fair enough, the processing power there is poor; I could spin up a larger instance to handle this, or do it on my desktop and upload the file. However, I would rather learn a more efficient way of accomplishing the same task, and I can't quite figure it out. Any tidbits on how best to go about this, or how to complete the task more efficiently, would be very much appreciated.
To give you an idea of how slow this is: I started the script about 10 hours ago and I am about 10% of the way through all the matches.
Also, I am not set on using bash, so scripts in other languages are fair game. I figure the pros on S.O. can easily improve on my rock-for-a-hammer approach.
Edit: adding input and output samples and more information about the data.
Input (small text file):
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|Vice S01E09 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/vice-s01e09-hdtv-xvid-fum-ettv-q49614889.html|http://torrage.com/torrent/36A02E282D49EB7D94ACB798654829493CA929CB.torrent
3B9403AD73124A84AAE12E83A2DE446149516AC3|Sons of Guns S04E08 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/sons-of-guns-s04e08-hdtv-xvid-fum-e-q49613491.html|http://torrage.com/torrent/3B9403AD73124A84AAE12E83A2DE446149516AC3.torrent
C4ADF747050D1CF64E9A626CA2563A0B8BD856E7|Save Me S01E06 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/save-me-s01e06-hdtv-xvid-fum-ettv-q49515711.html|http://torrage.com/torrent/C4ADF747050D1CF64E9A626CA2563A0B8BD856E7.torrent
B71EFF95502E086F4235882F748FB5F2131F11CE|Da Vincis Demons S01E08 HDTV x264-EVOLVE|Video TV|http://bitsnoop.com/da-vincis-demons-s01e08-hdtv-x264-e-q49515709.html|http://torrage.com/torrent/B71EFF95502E086F4235882F748FB5F2131F11CE.torrent
Match against (large text file):
86931940E7F7F9C1A9774EA2EA41AE59412F223B|0|0
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|4|2|20705|9550|21419
ADFA5DD6F0923AE641F97A96D50D6736F81951B1|0|0
CF2349B5FC486E7E8F48591EC3D5F1B47B4E7567|1|0|429|428|22248
290DF9A8B6EC65EEE4EC4D2B029ACAEF46D40C1F|1|0|523|446|14276
C92DEBB9B290F0BB0AA291114C98D3FF310CF0C3|0|0|21448
Output:
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|4|2|20705|9550|21419
Additional clarifications: basically, the first 40 characters of each line of the input file (a file I have already reduced to about 15% of its original size) are a hash. For each line in this file there is a matching hash in the larger text file (the one I am matching against) with some corresponding information, and it is that line in the larger file that I would like to write to a new file, so that in the end I have a 1:1 ratio of everything in the smaller text file to my output_file.txt.
In this case I am showing the first line of the input being matched (line 2 of the larger file) and then written to the output file.
awk solution adapted from this answer:
awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1]) print}' small.txt large.txt
Some Python to the rescue.
I created two text-files using the following snippet:
#!/usr/bin/env python
import random
import string
N=2000000
for i in range(N):
    s = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(40))
    print s + '|4|2|20705|9550|21419'
one with 300k lines and one with 2M lines. This gives me the following files:
$ ll
-rwxr-xr-x 1 210 Jun 11 22:29 gen_random_string.py*
-rw-rw-r-- 1 119M Jun 11 22:31 large.txt
-rw-rw-r-- 1 18M Jun 11 22:29 small.txt
Then I appended a line from small.txt to the end of large.txt so that I had a matching pattern
Then some more python:
#!/usr/bin/env python
target = {}
with open("large.txt") as fd:
    for line in fd:
        target[line.split('|')[0]] = line.strip()

with open("small.txt") as fd:
    for line in fd:
        if line.split('|')[0] in target:
            print target[line.split('|')[0]]
Some timings:
$ time ./comp.py
3A8DW2UUJO3FYTE8C5ESE25IC9GWAEJLJS2N9CBL|4|2|20705|9550|21419
real 0m2.574s
user 0m2.400s
sys 0m0.168s
$ time awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1]) print}' small.txt large.txt
3A8DW2UUJO3FYTE8C5ESE25IC9GWAEJLJS2N9CBL|4|2|20705|9550|21419
real 0m4.380s
user 0m4.248s
sys 0m0.124s
Update:
To conserve memory, do the dictionary lookup the other way around:
#!/usr/bin/env python
target = {}
with open("small.txt") as fd:
    for line in fd:
        target[line.split('|')[0]] = line.strip()

with open("large.txt") as fd:
    for line in fd:
        if line.split('|')[0] in target:
            print line.strip()
I have a batch file that does this.
ECHO A41,35,0,a,1,1,N,"Mr ZACHARY KAPLAN">> test.txt
There are about 30k similar lines. It takes the batch file about 5 hours to run.
Is there a way to speed this up?
/Jeanre
Try this:
Put an ECHO OFF at the top of the batch file.
Then change each line to:
ECHO A41,35,0,a,1,1,N,"Mr ZACHARY KAPLAN"
and call your batch file:
mybatch.bat >> test.txt
Afterwards, edit the output to remove the first line, which is the ECHO OFF command being echoed itself.
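If editing 30k lines by hand is impractical, a small script can do the rewrite. Here is a minimal Python sketch of that conversion; the file names original.bat and mybatch.bat are assumptions, not names from the question:

# Strip the per-line ">> test.txt" redirection and put ECHO OFF at the top,
# so the whole batch file can be redirected once instead of 30k times.
with open("original.bat") as src, open("mybatch.bat", "w") as dst:
    dst.write("@ECHO OFF\n")  # the leading @ keeps this line from being echoed
    for line in src:
        dst.write(line.replace(">> test.txt", "").rstrip() + "\n")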
Write a custom script or program to open the file test.txt once, and write all the data into it in one shot.
Right now each line is executed separately by the command interpreter, and the file is opened and closed each time.
Even a small qbasic program should be able to strip out the data between the echo and >> and write it to a text file more quickly than your current method.
-Adam
You can use a scripting language to strip the leading ECHO and the trailing >> test.txt with a small regular expression.
Here's an example in Python:
>>> import re
>>> text = 'ECHO A41,35,0,a,1,1,N,"Mr ZACHARY KAPLAN">> test.txt'
>>> re.sub( r"ECHO\s*(.*?)>>\s*test.txt", r"\1", text )
'A41,35,0,a,1,1,N,"Mr ZACHARY KAPLAN"'
Do this for all lines in the file:
import re
f = open("input.bat")
of = open("output.txt", "w" )
for line in f:
    of.write( re.sub( r"ECHO\s*(.*?)>>\s*test.txt", r"\1", line ) )
I didn't test this code ...
Here's an example using a Java program - with BufferedReader/PrintWriter
http://www.javafaq.nu/java-example-code-126.html
You can also use a BufferedReader and BufferedWriter
http://java.sun.com/docs/books/tutorial/essential/io/buffers.html
http://leepoint.net/notes-java/io/10file/10readfile.html