Cut common strings from multiple files and paste them into another file - ruby

So let's say I have 5 files: f1, f2, f3, f4, f5. How can I remove the common strings (the same text appearing in all files) from all 5 files and put them into a 6th file, f6? Please let me know.
Format of the files:
property.a.p1=some string
property.b.p2=some string2
.
.
.
property.zzz.p4=123455
So if the above is an excerpt from file 1, and files 2 through 5 also contain the line property.a.p1=some string, then I'd like to remove that line from files 1 to 5 and put it in file 6. Each entry is on its own line, so I'd be comparing the files line by line. Each file is around 400 to 600 lines.
I found this on a forum for removing common lines from two files using Ruby:
$ ruby -ne 'BEGIN {a=File.read("file1").split(/\n+/)}; print $_ if a.include?($_.chomp)' file2

See if this does what you want. It's a "2-pass" solution, the first pass uses a hash table to find the common lines, and the second uses that to filter out any lines that match the commons.
$files = gci "file1.txt","file2.txt","file3.txt","file4.txt","file5.txt"
$hash = @{}
$common = new-object system.collections.arraylist
# Pass 1: count how many times each line is seen across all files.
foreach ($file in $files) {
    get-content $file | foreach {
        $hash[$_]++
    }
}
# Any line seen 5 times (once per file) is common to all files.
$hash.keys | % {
    if ($hash[$_] -eq 5) { [void]$common.add($_) }
}
$common | out-file common.txt
# Pass 2: rewrite each file without the common lines.
[regex]$common_regex = '^(' + (($common | foreach { [regex]::escape($_) }) -join '|') + ')$'
foreach ($file in $files) {
    $new_file = get-content $file | ? { $_ -notmatch $common_regex }
    $new_file | out-file "new_$($file.name)"
}
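A small tweak worth considering (my suggestion, not part of the original answer): rather than hard-coding the 5, compare against the number of files actually collected, so the script keeps working if the file list changes:
# Hypothetical variant of the comparison line above
if ($hash[$_] -eq $files.Count) { [void]$common.add($_) }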

Create a table in an SQL database like this:
create table properties (
    file_name  varchar(100) not null, -- Or whatever sizes make sense
    prop_name  varchar(100) not null,
    prop_value varchar(100) not null
)
Then parse your files with some simple regular expressions or even just split:
prop_name, prop_value = line.strip.split('=')
Dump the parsed data into your table, and do a bit of SQL to find the properties that are common to all files:
select prop_name, prop_value
from properties
group by prop_name, prop_value
having count(*) = $n
Where $n is replaced by the number of input files (with your five files, having count(*) = 5). Now you have a list of all the common properties and their values: write those to your new file, delete them from the properties table, and then spin through the rows that are left in properties and write each one to the appropriate file (i.e. the file named by its file_name column).
You say that the files are "huge", so you probably don't want to slurp all of them into memory at the same time. You could do multiple passes and use a hash-on-disk library to keep track of what has been seen and where, but that would be a waste of time if you have an SQL database around, and everyone should have at least SQLite kicking around. Managing large amounts of structured data is what SQL and databases are for.

Related

Unwanted space in substring using powershell

I'm fairly new to PS. I'm extracting fields from multiple XML files ($ABB). The $net var is based on a pattern search and returns a non-static substring on line 2. Here's what I have so far:
$ABB = If ($aa -eq $null) { "nothing to see here" } else {
    $count = 0
    $files = @($aa)
    foreach ($f in $files)
    {
        $count += 1
        $mo = (Get-Content -Path $f)[8].Substring(51,2)
        (Get-Content -Path $f | Select-String -Pattern $lf -Context 0,1) | ForEach-Object {
            $net = $_.Context.PostContext
            $enet = $net -split "<comm:FieldValue>(\d*)</comm:FieldValue>"
            $enet = $enet.trim()
        }
        Write-Host "$mo-nti-$lf-$enet" "`r`n"
    }
}
The output looks like this: 03-nti-260- 8409.
Note the space prefacing the 8409, which corresponds to the $net variable. I haven't been able to solve this on my own; my approach could be all wrong. I'm open to any and all suggestions. Thanks for your help.
Since the first line of $net (after $net = $_.Context.PostContext) starts with the characters you are splitting on, a blank element is emitted as the first element of the output. Then, when the output is stringified, the split output items are joined by single spaces, which is where the leading space comes from.
You need to select lines that aren't empty:
$enet = $net -split "<comm:FieldValue>(\d*)</comm:FieldValue>" -ne ''
Explanation:
Split delimiters that are not surrounded by () are removed from the output, and the remaining string is split into multiple elements at each of those matches. When a matched delimiter starts or ends the string, an empty element is output. Care must be taken to remove those elements if they are not required. Trim() will not work here, because Trim() applies to a single string rather than an array and will not remove empty strings.
Adding -ne '' to the end of the command removes the empty elements. It is just an inline Boolean condition that, when applied to an array, outputs only the elements for which the condition is true.
You can see an example of the blank line condition below:
123 -split 1

23
123 -split 1 -ne ''
23
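To make the stringification behaviour concrete, here is a minimal sketch; the <comm:FieldValue> value 8409 is only illustrative, not taken from the OP's actual XML:
$net  = '<comm:FieldValue>8409</comm:FieldValue>'
$enet = $net -split '<comm:FieldValue>(\d*)</comm:FieldValue>'
$enet.Count    # 3: '', '8409', '' (the captured group plus the empty edge elements)
"260-$enet"    # the empty elements are joined in with spaces: "260- 8409 "
$enet = $net -split '<comm:FieldValue>(\d*)</comm:FieldValue>' -ne ''
"260-$enet"    # "260-8409"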
Just use a -replace to get rid of any spaces
For example:
'03-nti-260- 8409' -replace '\s'
<#
# Results
03-nti-260-8409
#>

Powershell - Having difficulty in ignoring the header row (first row) and footer row (last row) in file

I am looking to find extra delimiters in my file on a line by line basis.
I would, however, like to ignore the header row (first row) and the footer row (last row) in the file and just focus on the file detail.
I am not sure how to ignore the first and last rows using the ReadLine() method. I DO NOT want to alter the file in any way; this script is used just to identify rows in the CSV file that have extra delimiters.
Please note: the file I am searching has millions of rows, so I have to rely on the ReadLine() method rather than the Get-Content approach.
I did try to use Select-Object -Skip 1 | Select-Object -SkipLast 1 in my Get-Content statement, feeding the value into $measure, but I didn't get the desired result.
For example:
H|Transaction|2017-10-03 12:00:00|Vendor --> This is the Header
D|918a39230a098134|2017-08-31 00:00:00.000|2017-08-15 00:00:00.000|SLICK-2340|...
D|918g39230b095134|2017-08-31 00:00:00.000|2017-08-15 00:00:00.000|EX|SRE-68|...
T|1268698 Records --> This is Footer
Basically, I want my script to ignore the header and footer, use the first data row (D|918...) as the example of a correct record, and compare the other detail records against it for errors. In this example the second detail row should be returned, because there is an extra delimiter in one of its fields (EX|SRE-68).
When I tried using -Skip 1 and -SkipLast 1 in the Get-Content statement, the process still used the header row for the comparison and returned all detail records as invalid.
Here's what I have so far...
Editor's note: Despite the stated intent, this code does use the header line (the 1st line) to determine the reference column count.
$File = "test.csv"
$Delimiter = "|"
$measure = Get-Content -Path $File | Measure-Object
$lines = $measure.Count
Write-Host "$File has ${lines} rows."
$i = 1
$reader = [System.IO.File]::OpenText($File)
$line = $reader.ReadLine()
$reader.Close()
$header = $line.Split($Delimiter).Count
$reader = [System.IO.File]::OpenText($File)
try
{
for()
{
$line = $reader.ReadLine()
if($line -eq $null) { break }
$c = $line.Split($Delimiter).Count
if($c -ne $header -and $i -ne${lines})
{
Write-Host "$File - Line $i has $c fields, but it should be $header"
}
$i++
}
}
finally
{
$reader.Close()
}
Any reason you're using ReadLine()? The Get-Content you're doing will already load the entire CSV into memory, so I'd save that to a variable and then use a loop to go through it (starting at 1 to skip the first line).
So something like this:
$File = "test.csv"
$Delimiter = "|"
$contents = Get-Content -Path $File
$lines = $contents.Count
Write-Host "$File has ${lines} rows."
$header = $contents[0].Split($Delimiter).Count
for ($i = 1; $i -lt ($lines - 1); $i++)
{
    $c = $contents[$i].Split($Delimiter).Count
    if ($c -ne $header)
    {
        Write-Host "$File - Line $i has $c fields, but it should be $header"
    }
}
Now that we know that performance matters, here's a solution that uses only [System.IO.File]::OpenText() and StreamReader.ReadLine() (as a faster alternative to Get-Content) to read the large input file, and does so only once:
No up-front counting of the number of lines via Get-Content ... | Measure-Object,
No separate open of the file just to read the header line; keeping the file open after reading the header line has the added advantage that you can just keep reading (no logic needed to skip the header line).
$File = "test.csv"
$Delimiter = "|"

# Open the CSV file as a text file for line-based reading.
$reader = [System.IO.File]::OpenText($File)

# Read the lines.
try {
    # Read the header line and discard it.
    $null = $reader.ReadLine()

    # Read the first data line - the reference line - and count its columns.
    $refColCount = $reader.ReadLine().Split($Delimiter).Count

    # Read the remaining lines in a loop, skipping the final line.
    $i = 2 # Initialize the line number to 2, given that we've already read the header and the first data line.
    while ($null -ne ($line = $reader.ReadLine())) { # $null indicates EOF
        ++$i # increment the line number
        # If we're now at EOF, we've just read the last line - the footer -
        # which we want to ignore, so we exit the loop here.
        if ($reader.EndOfStream) { break }
        # Count this line's columns and warn if the count differs from the
        # reference line's.
        if (($colCount = $line.Split($Delimiter).Count) -ne $refColCount) {
            Write-Warning "$File - Line $i has $colCount fields rather than the expected $refColCount."
        }
    }
} finally {
    $reader.Close()
}
Note: This answer was written before the OP clarified that performance was paramount and that a Get-Content-based solution was therefore not an option. My other answer now addresses that.
This answer may still be of interest for a slower, but more concise, PowerShell-idiomatic solution.
the_sw's helpful answer shows that you can use PowerShell's own Get-Content cmdlet to conveniently read a file, without needing to resort to direct use of the .NET Framework.
PSv5+ enables an idiomatic single-pipeline solution that is more concise and more memory-efficient - it processes lines one by one - albeit at the expense of performance; especially with large files, however, you may not want to read them in all at once, so a pipeline solution is preferable.
PSv5+ is required due to the use of Select-Object's -SkipLast parameter.
$File = "test.csv"
$Delimiter = '|'

Get-Content $File | Select-Object -SkipLast 1 | ForEach-Object { $i = 0 } {
    if (++$i -eq 1) {
        return # ignore the actual header row
    } elseif ($i -eq 2) { # reference row
        $refColumnCount = $_.Split($Delimiter).Count
    } else { # remaining rows, except the footer, thanks to -SkipLast 1
        $columnCount = $_.Split($Delimiter).Count
        if ($columnCount -ne $refColumnCount) {
            "$File - Line $i has $columnCount fields rather than the expected $refColumnCount."
        }
    }
}

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
    sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
    mv $textfile1.tmp $textfile1
    sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
    mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of many files. As a side effect, it loads the replacements and builds an Aho-Corasick-style automaton only once per invocation, which makes this solution quite efficient (O(M+N) instead of the O(M*N) of your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# Parse an optional -i[extension] flag for in-place editing with an optional backup.
my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
    ? do {
        shift;
        my $backup_extension = $1;
        my $backup_name = $backup_extension =~ /\*/
            ? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
            : sub { shift . $backup_extension };
        my $oldargv = '-';
        sub {
            if ( $ARGV ne $oldargv ) {
                rename( $ARGV, $backup_name->($ARGV) );
                open( ARGVOUT, '>', $ARGV );
                select(ARGVOUT);
                $oldargv = $ARGV;
            }
        };
    }
    : sub { };

die "$0: File with replacements required." unless @ARGV;

# Load the find/replace pairs and build a single alternation regex from them.
my ( $re, %replace );
do {
    my $filename = shift;
    open my $fh, '<', $filename;
    %replace = map { chomp; split ',', $_, 2 } <$fh>;
    close $fh;
    $re = join '|', map quotemeta, keys %replace;
    $re = qr/($re)/;
};

while (<>) {
    $in_place->();
    s/$re/$replace{$1}/g;
}
continue { print }
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with a backup name that uses a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
Each CSV row such as a_1,b_1 becomes a line s/a_1/b_1/g in sed.script. Then you will be able to do a one-pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you want to do.
BTW, sorry, but I do not understand what is wrong with your script; it should work unless your CSV file has more than 2 columns.

Parsing Text file and placing contents into an Array Powershell

I am looking for a way to parse a text file and place the results into an array in powershell.
I know Select-String -Path -Pattern will get all strings that match a pattern. But what if I already have a structured text file, perhaps pipe-delimited, with a new entry on each line? Like so:
prodServ1a
prodServ1b
prodServ1c
C:\dir\serverFile.txt
How can I place each of those servers into an array in powershell that I can loop through?
You say 'pipe delimited, like so' but your example isn't pipe delimited. I'll imagine it is; in that case you need to use the Import-Csv cmdlet, e.g. if the data file contains this:
prodServ1a|4|abc
prodServ1b|5|def
prodServ1c|6|ghi
then this code:
$data = Import-Csv -Path test.dat -Header "Product","Cost","SerialNo" -Delimiter "|"
will import and split it, and add headers:
$data
Product    Cost SerialNo
-------    ---- --------
prodServ1a 4    abc
prodServ1b 5    def
prodServ1c 6    ghi
Then you can use
foreach ($item in $data) {
    $item.SerialNo
}
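As a small follow-up (my addition, assuming PowerShell v3 or later), member enumeration also lets you pull a single column out as a plain array:
# PSv3+ member enumeration: grab one column from the imported objects as an array
$servers = $data.Product
foreach ($s in $servers) { $s }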
If your data is flat or unstructured, but has a pattern such as delimited by a space or comma and no carriage returns or line feeds, you can use the Split() method.
PS>$data = "prodServ1a prodServ1b prodServ1c"
PS>$data
prodServ1a prodServ1b prodServ1c
PS>$data.Split(" ")
prodServ1a
prodServ1b
prodServ1c
This works particularly well when someone sends you a list of IP addresses separated by commas.
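If the file really is just one server name per line, as in the question's serverFile.txt example, no delimiter handling is needed at all. A minimal sketch (the path comes from the question; the loop body is a placeholder):
$servers = Get-Content -Path 'C:\dir\serverFile.txt'   # Get-Content returns the lines as an array
foreach ($server in $servers) {
    Write-Host "Processing $server"   # replace with whatever you need to do per server
}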

Compare two CSV files in vbscript

I have to write a VBScript to compare two CSV files.
Both CSV files contain data in the following format:
File1.csv
DBNane  UserGroup  Path                Access
DB_1    Dev_II     DB/Source/Projects  Read/Write
DB_2    Test_I     DB/Source/Doc       Read
File2.csv
DBNane  UserGroup  Path                Access
DB_1    Dev_II     DB/Source/Projects  Read
DB_2    Test_I     DB/Source/Doc       Read
I need to compare these files; the output format should be like this:
File3.csv
DBNane  UserGroup  Path                Access
DB_1    Dev_II     DB/Source/Projects  Read/Write
I'm new to VBScript. Any sample script to do this?
Thanks.
In PowerShell you could get differing lines from 2 text files like this:
$f1 = Get-Content 'C:\path\to\file1.csv'
$f2 = Get-Content 'C:\path\to\file2.csv'
Compare-Object $f1 $f2
If you only need to show what's different in the first file ($f1), you could filter the result like this:
Compare-Object $f1 $f2 | ? { $_.SideIndicator -eq '<=' } | % { $_.InputObject }
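If the goal is the File3.csv from the question rather than console output, the filtered lines can be written back out with the header row prepended. A minimal sketch (the paths are assumptions):
$f1 = Get-Content 'C:\path\to\File1.csv'
$f2 = Get-Content 'C:\path\to\File2.csv'
# Rows that exist only in File1, e.g. the DB_1 row whose Access value differs
$diff = Compare-Object $f1 $f2 | Where-Object { $_.SideIndicator -eq '<=' } | ForEach-Object { $_.InputObject }
# Prepend File1's header line and write the result
@($f1[0]) + @($diff) | Set-Content 'C:\path\to\File3.csv'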
