Retrieving data from a particular column in a CSV File using TCL - data-structures

I have a CSV file containing two columns of data.
I want to retrieve the two columns of data in two separate lists.
I have tried the following code:
set fp [open "D:\\RWTH\\Mini thesis\\EclipseTCL\\TCL trial\\excelv1.csv" r]
set file_data [read $fp]
close $fp
set data [split $file_data " "]
puts $data
The output obtained is:
{0,245
0.0025,249
0.005,250
0.0075,252
0.01,253
0.0125,255
0.015,256
.
.
.
}
The data is in two separate columns in the Excel sheet. I wish to take only the elements from the second column, i.e.
{245,
249,
250,
252,
253,
.
.
.
}
I would be glad if someone could help me with this.

Using the file_data you have already read from the file, you can:
lmap row [split [string trim $file_data] \n] {
    scan $row %*f,%d
}
That is, trim off white space before and after the data, split into rows, then from every row, scan one integer (skipping the real and the comma). All the scanned integers are collected in a list.
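For instance, applied to a single row taken from the sample data above, the format string skips the real number and the comma and returns just the integer:
puts [scan "0.0025,249" %*f,%d]   ;# prints 249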
It is, however, a good idea to always use the right tool for the job.
package require csv
lmap row [split [string trim $file_data] \n] {
    lindex [::csv::split $row] end
}
The ::csv::split command knows exactly how to split CSV data. In this case it isn't strictly necessary, but it's a good habit to use the csv package for CSV data.
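If you really want the two columns as two separate lists, as the question asks, here is a small sketch along the same lines (still assuming file_data holds the file contents read earlier):
package require csv
set col1 {}
set col2 {}
foreach row [split [string trim $file_data] \n] {
    lassign [::csv::split $row] a b
    lappend col1 $a
    lappend col2 $b
}
puts $col1
puts $col2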
Documentation:
csv (package),
lindex,
lmap (for Tcl 8.5),
lmap,
package,
scan,
split,
string

It is better to use gets to read each line from the file:
set fp [open "D:\\RWTH\\Mini thesis\\EclipseTCL\\TCL trial\\excelv1.csv" r]
set secondColumnData {}
while {[gets $fp line] >= 0} {
    if {[llength $line] > 0} {
        lappend secondColumnData [lindex [split $line ","] 1]
    }
}
close $fp
puts $secondColumnData
Documentation: gets

You may also try:
set data [read [open "D:\\RWTH\\Mini thesis\\EclipseTCL\\TCL trial\\excelv1.csv"]]
set result [regexp -all -inline -line -- {^.*,(.*)$} $data]
set items {}
foreach {tmp item} $result {
    lappend items $item
}
puts $items
Output:
245 249 250 252 253 255 256

Related

Perl - to create and sort hash to produce a sorted output

So I have this script that scrapes data from a website: it downloads a CSV, processes the CSV row by row and converts it into TSV, and once that is finished the TSV file is converted into an HTML file. I have the rest of that done, but the output I'm getting is somewhat messed up. The script goes to different table pages on the source site and downloads a dynamically generated CSV file; that CSV file is then turned into a TSV file that is in turn turned into HTML. The CSV file seems to be sorted by the first column of each row, but not by any of the other columns in the same row. What happens, therefore, is that entries with the same first-column value can be jumbled up from one download to the next download of the same file.
A visual representation of only sorting by the first column this follows with numbers representing column data:
1st Download:
1-1
1-2
1-3
2-1
2-2
2-3
3-1
3-2
3-3
2nd Download:
1-1
1-3
1-2
2-2
2-1
2-3
3-3
3-2
3-1
What I have in mind is a process like this: download the CSV file from the source, then sort the lines in that CSV file to normalize them for comparison to one another before writing the TSV or HTML files. This should allow for an accurate comparison of updated data files, but I don't know how to do this. My logic is like this:
I will put the sorting between steps 1 and 2: before the script processes the CSV file into the TSV file, I want the content of the CSV to already be sorted.
My script looks like this:
my $download_dir_link ="C:/Users/jabella/Downloads";
unlink("$download_dir_link/Product Classification List.csv");
#CHECK IF CSV FILE DOWNLOAD IS FINISHED
my $complete_download_flag = 0;
while($complete_download_flag == 0)
{
my @download_directory = read_dir($download_dir_link);
foreach my $downloaded_file (@download_directory)
{
if($downloaded_file =~ /\QProduct Classification List.csv\E/sgi)
{
$complete_download_flag = 1;
}
}
sleep(5);
}
#SORTED CONTENTS OF CSV BEFORE CONVERSION
print "sORTING csv content...\n";
#CONVERT CSV TO TSV
print "Converting csv to tsv...\n";
my $csv = Text::CSV->new ({ binary => 1 });
my $tsv = Text::CSV->new ({ binary => 1, sep_char => "\t", eol => "\n"});
open my $infh, "<:encoding(utf8)", "$download_dir_link/Product Classification List.csv";
open my $outfh, ">:encoding(utf8)", "Product Classification List.tsv";
while (my $row = $csv->getline ($infh))
{
$tsv->print ($outfh, $row);
}
close($infh);
close($outfh);
my $tsv_content = "";
open(my $fh, '<', "Product Classification List.tsv");
while (<$fh>)
{
$tsv_content = $tsv_content.$_;
}
close($fh);
print "Conversion complete! cleaning tsv content...\n";
#CLEAN TSV CONTENT
$tsv_content =~ s/(.*?)\t"(.*?)"\t"(.*?)"\t"(.*?)"\t(.*?)\t"(.*?)"\t(.*)/<tr><th>$1<\/th><th>$2<\/th><th>$3<\/th><th>$4<\/th><th>$5<\/th><th>$6<\/th><th>$7<\/th><\/tr>/gi;
$tsv_content =~ s/"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\n/<tr><td>$1<\/td><td>$2<\/td><td>$3<\/td><td>$4<\/td><td>$5<\/td><td>$6<\/td><td>$7<\/td><\/tr>\n/gi;
$tsv_content =~ s/\"{2}/\"/sgi;
$tsv_content =~ s/(<\/tr>)\n?"/$1/sgi;
$tsv_content =~ s/\s{2,}/ /sgi;
$tsv_content =~ s/.*?(<tr>)/$1/si;
$tsv_content = "<table>\n$tsv_content</table>";
$classification =~ s/_//sgi;
if(exists $existing_index_hash{$doc_uid."_pind.html"})
{
if($existing_index_hash{$doc_uid."_pind.html"} ne $tsv_content)
{
$changed_flag = "1";
$updated_files = $updated_files."-$classification\n";
print "Updated: $classification\n";
I hope someone here can help me with this. Thank you.
Here is a simple script that loads a CSV file specified as an argument and outputs it sorted by the first two columns.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = 'Text::CSV_XS'->new({binary => 1, auto_diag => 1});
open my $in, '<', shift or die $!;
my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;
}
# Here the sorting happens. Compare the first column,
# if the values are the same, compare the second column.
@rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @rows;
$csv->say(*STDOUT, $_) for @rows;
You can use the following to sort by all columns (but it compares the values as strings, it doesn't work for numbers):
sub by_all {
    my ($n, $A, $B) = @_;
    $A->[$n] cmp $B->[$n]
        || $n < $#$A && by_all($n + 1, $A, $B)
}
@rows = sort { by_all(0, $a, $b) } @rows;
To make it work for numbers, too, you can let Perl guess what is a number:
use Scalar::Util qw{ looks_like_number };
sub by_all {
    my ($n, $A, $B) = @_;
    (looks_like_number($A->[$n])
        ? $A->[$n] <=> $B->[$n]
        : $A->[$n] cmp $B->[$n]
    ) || $n < $#$A && by_all($n + 1, $A, $B)
}
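As a rough sketch of how this might be wired into the original script (reusing the $csv, $tsv, $infh and $outfh handles from the question and the by_all sub above), sort the parsed rows before writing the TSV:
# collect all rows first instead of writing them out one by one
my @rows;
while (my $row = $csv->getline($infh)) {
    push @rows, $row;
}
# normalize the order, then emit the TSV
@rows = sort { by_all(0, $a, $b) } @rows;
$tsv->print($outfh, $_) for @rows;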

Adding a new position at the end of the file Shell or Perl

My question is how to add a new position at the end of the file in Shell or Perl?
I have two files:
File A with 536382 lines, where the key is the third column:
abc1111,1070X00Y0,**9999**,B
abc2222,1070X00Y0,**9999**,B
abc3333,1070x00Y0,**9999**,B
File B with 946 lines, where the key is the first column:
**9999**,Position,West
**9998**,Position,West
**9997**,Position,South
**1111**,Position,South
**9999**,Time,Morning
**9997**,Time,Afternoon
I want a combination of these two files:
abc1111,1070X00Y0,9999,B,West,Morning
abc2222,1070X00Y0,9999,B,West,Morning
abc3333,1070x00Y0,9999,B,West,Morning
I was trying a shell script but I was getting an out-of-memory message, so I am open to suggestions.
Thank you.
I was able to get the results you want by making a few changes to your code.
#!/usr/bin/perl
use strict;
use warnings;
open IN2, '<', \<<EOF;
**9999**,Position,West
**9998**,Position,West
**9997**,Position,South
**1111**,Position,South
**9999**,Time,Morning
**9997**,Time,Afternoon
EOF
my %hash;
while ( <IN2> ) {
    chomp;
    my @col2 = split ",";
    $hash{$col2[0]}{$col2[1]} = $col2[2];
}
open IN1, '<', \<<EOF;
abc1111,1070X00Y0,**9999**,B
abc2222,1070X00Y0,**9999**,B
abc3333,1070x00Y0,**9999**,B
EOF
while ( <IN1> ) {
    chomp;
    my $key = (split /,/)[2];
    if ( exists( $hash{$key} ) ) {
        print join(",", $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n";
    }
}
This produced output of:
abc1111,1070X00Y0,**9999**,B,West,Morning
abc2222,1070X00Y0,**9999**,B,West,Morning
abc3333,1070x00Y0,**9999**,B,West,Morning
The changes to the code were:
$hash{$col2[0]}{$col2[1]} = $col2[2]; creates a hash of hashes to hold the Position and Time keys. Those keys are then used in a hash slice here:
@{ $hash{$key} }{ qw/Position Time/ }
Convert the small file into a Perl hash.
Process the big file line by line.
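The heredocs above are just stand-ins for the data; if you are reading from actual files, a sketch with hypothetical file names might look like this:
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical file names: fileB.csv is the small lookup file, fileA.csv the big one
my %hash;
open my $in2, '<', 'fileB.csv' or die $!;
while ( <$in2> ) {
    chomp;
    my @col2 = split ",";
    $hash{$col2[0]}{$col2[1]} = $col2[2];
}
close $in2;

open my $in1, '<', 'fileA.csv' or die $!;
while ( <$in1> ) {
    chomp;
    my $key = (split /,/)[2];
    if ( exists $hash{$key} ) {
        print join(",", $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n";
    }
}
close $in1;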

Parse list of integers (optimization needed for speed test)

I am performing a tiny speed test in order to compare the speed of the Agda programming language with the Tcl scripting language. It's for scientific work and this is just a pre-test, not a real test. I am not in any way trying to perform a realistic speed comparison!
I have come up with a small example in which Agda is 10 times faster than Tcl. There are special reasons I use this example. My main concern is that my Tcl code is badly programmed and that this is the sole reason Tcl is slower than Agda in this example.
The goal of the code is to parse a line that represents a list of integers and check if it is indeed a list of integers.
Example "(1,2,3)" would be a valid list.
Example "(1,a,3)" would not be a valid list.
My input is a file, and I check every third line of it. If any line is not a list of integers, the program prints an error.
My input file:
(613424,505980,317647,870930,75580,897160,716297,668539,689646,196362,533020)
(727375,472272,22435,869407,320468,80779,302881,240382,196077,635360,568517)
(613424,505980,317647,870930,75580,897160,716297,668539,689646,196362,533020)
(however, my real test file is about 3 megabyte large)
My current Tcl code to solve this problem is:
package require Tcl 8.6
proc checkListNat {str} {
    set list [split [string map {"(" "" ")" ""} $str] ","]
    foreach l $list {
        if {[string is integer $l] == 0} {
            return 0
        }
    }
    return 1
}

set i 1
set fp [open "/tmp/test.txt" r]
while { [gets $fp data] >= 0 } {
    incr i
    if { [expr $i % 3] == 0} {
        if { [checkListNat $data] == 0 } {
            puts "error"
        }
    }
}
close $fp
How can I optimize my current Tcl code, so that the speed test between Agda and Tcl is more realistic?
The first thing to do is to put as much code in procedures (or lambda terms) as possible and to ensure that all expressions are braced. Those were your two key problems, and they were killing performance. We'll do a few other things too: you hardly ever need expr inside an if test, and this wasn't one of those cases; string trim is more suitable than string map here; and string is really ought to be used with -strict. With those changes, I get this version, which is relatively similar to what you already had yet ought to be substantially more performant.
package require Tcl 8.6
proc checkListNat {str} {
    foreach l [split [string trim $str "()"] ","] {
        if {[string is integer -strict $l] == 0} {
            return 0
        }
    }
    return 1
}
apply {{} {
    set i 1
    set fp [open "/tmp/test.txt" r]
    while { [gets $fp data] >= 0 } {
        if {[incr i] % 3 == 0 && ![checkListNat $data]} {
            puts "error"
        }
    }
    close $fp
}} {*}$argv
You might get better performance by adding fconfigure $fp -encoding iso8859-1; you'll have to test that yourself. But the key changes are the ones emphasized earlier (procedures and braced expressions), as each substantially impacts the compilation strategy used. (Also, Tcl 8.5 is a little faster than 8.6, whose radically different execution engine is a bit slower for some things, so you might test the new code with 8.5 too; the code itself appears to be valid in both versions.)
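A quick way to compare the two versions is Tcl's built-in time command; here is a sketch (the numbers will of course depend on your data and machine):
# time the whole main loop once; checkListNat is the proc defined above
set elapsed [time {
    apply {{} {
        set i 1
        set fp [open "/tmp/test.txt" r]
        while {[gets $fp data] >= 0} {
            if {[incr i] % 3 == 0 && ![checkListNat $data]} {
                puts "error"
            }
        }
        close $fp
    }}
}]
puts "one run: $elapsed"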
Try checking with regexp {^[0-9,]+$} $line instead of the checkListNat function.
Update
Here is an example:
echo "87,566, 45,67\n56,5r5,45" >! try
...
while {[gets $fp line] > 0} {
    if {[regexp {^[0-9, ]+$} $line] > 0} {
        puts "OK $line"
    } else {
        puts "BAD $line"
    }
}
gives:
>OK 87,566, 45,67
>BAD 56,5r5,45

Trouble setting value in nested dictionary

I'm writing a Tcl script to parse the HTML output of the Firebug profiling console. To begin with, I just want to accumulate the number of method calls on a per-file basis. To do this I'm trying to use nested dictionaries. I seem to have gotten the first level correct (where file is the key) but not the second, nested level, where method should be the key and count the value.
I have read about the dict update command, so I'm open to refactoring using it. My Tcl usage is on-again, off-again, so thanks in advance for any assistance. Below are my code and some sample output.
foreach row $table_rows {
    regexp {<a class="objectLink objectLink-profile a11yFocus ">(.+?)</a>.+?class=" ">(.+?)\(line\s(\d+)} $row -> js_method js_file file_line
    if {![dict exists $method_calls $js_file]} {
        dict set method_calls $js_file [dict create]
    }
    set file_method_calls [dict get $method_calls $js_file]
    if {![dict exists $file_method_calls $js_method]} {
        dict set file_method_calls $js_method 0
        dict set method_calls $js_file $file_method_calls
    }
    set file_method_call_counts [dict get $file_method_calls $js_method]
    dict set $file_method_calls $js_method [expr 1 + $file_method_call_counts]
    dict set method_calls $js_file $file_method_calls
}
dict for {js_file file_method_calls} $method_calls {
    puts "file: $js_file"
    dict for {method_name call_count} $file_method_calls {
        puts "$method_name: $call_count"
    }
    puts ""
}
OUTPUT:
file: localhost:50267
(?): 0
e: 0
file: Defaults.js
toDictionary: 0
(?): 0
Renderer: 0
file: jquery.cookie.js
cookie: 0
decoded: 0
(?): 0
The dict set command, like any setter in Tcl, takes the name of a variable as its first argument. I bet that:
dict set $file_method_calls $js_method [expr 1 + $file_method_call_counts]
should really read:
dict set file_method_calls $js_method [expr {1 + $file_method_call_counts}]
(Also, brace your expressions for speed and safety.)
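As an aside, separate from the fix above: dict set, dict get and dict exists all accept multiple keys, so the whole accumulation can be done directly on the nested dictionary without the intermediate file_method_calls variable. A sketch, assuming the same regexp and row variables as in the question:
set method_calls [dict create]
foreach row $table_rows {
    regexp {<a class="objectLink objectLink-profile a11yFocus ">(.+?)</a>.+?class=" ">(.+?)\(line\s(\d+)} $row -> js_method js_file file_line
    # nested keys work directly with dict exists / dict get / dict set
    if {[dict exists $method_calls $js_file $js_method]} {
        dict set method_calls $js_file $js_method \
            [expr {[dict get $method_calls $js_file $js_method] + 1}]
    } else {
        dict set method_calls $js_file $js_method 1
    }
}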

Perl Out Of Memory

I have a script that reads two csv files and compares them to find out if an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;
my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";
open(OPT, "<$OPT");
while(<OPT>){
my @OPT_Line = split(/,/, $_);
$beg_docs{@OPT_Line[0]} = "Y" if(@OPT_Line[3] eq "Y");
$opt_beg_docs{@OPT_Line[0]} = "Y";
}
close(OPT);
open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
print OUT "$key\n";
}
close(OUT);
open(DAT, "<$DAT");
readline(DAT); #skips header line
while(<DAT>){
$_ =~ s/\xFE//g;
my @DAT_Line = split(/\x14/, $_);
#gets the prefix and the range of the beg and end docs
(my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
(my $beg = @DAT_Line[$beg_doc]) =~ s/\D//g;
(my $end = @DAT_Line[$end_doc]) =~ s/\D//g;
#print OUT "BEGDOC: $beg ENDDOC: $end\n";
foreach($beg .. $end){
my $doc_id = $pre . $_;
if($opt_beg_docs{$doc_id} ne "Y"){
if($beg_docs{$doc_id} ne "Y"){
print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
$doc_counter++;
} else {
print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
$page_counter++;
}
}
}
}
close(DAT);
close(OUT);
print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically, I get all the IDs from the file I am checking against. Then I loop over the other file and generate its IDs, because they are presented as a range. Then I take each generated ID and check for it in the hash of IDs.
I also forgot to note that I am using Windows.
You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file in all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.
I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.
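That is, the other two lines from the question would become:
(my $beg = $DAT_Line[$beg_doc]) =~ s/\D//g;
(my $end = $DAT_Line[$end_doc]) =~ s/\D//g;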
You're closing your OUT file handle and then trying to print to it inside the DAT loop, which, I think, might be outputting to random memory since you closed the filehandle. I'm surprised this didn't produce an error.
Remove the first close(OUT); and see if that improves.
I still don't know what your question is. If it's about the error message, it means you've run out of memory; you're trying to consume too much memory. If it's about why you're consuming too much memory, I'd first ask whether you read my message above, then I'd ask how much memory your system has, and then I'd see whether things improve if you take the regex away.