I have to write a VBScript to compare two CSV files.
Both CSV files contain the following data format:
File1.csv
DBName UserGroup Path Access
DB_1 Dev_II DB/Source/Projects Read/Write
DB_2 Test_I DB/Source/Doc Read
File2.csv
DBName UserGroup Path Access
DB_1 Dev_II DB/Source/Projects Read
DB_2 Test_I DB/Source/Doc Read
I need to compare these files; the output format should be like:
File3.csv
DBName UserGroup Path Access
DB_1 Dev_II DB/Source/Projects Read/Write
I'm new to VBScript. Any sample script to do this?
Thanks.
In PowerShell you could get differing lines from 2 text files like this:
$f1 = Get-Content 'C:\path\to\file1.csv'
$f2 = Get-Content 'C:\path\to\file2.csv'
Compare-Object $f1 $f2
If you only need to show what's different in the first file ($f1), you could filter the result like this:
Compare-Object $f1 $f2 | ? { $_.SideIndicator -eq '<=' } | % { $_.InputObject }
Assume that I have many CSV files located in /home/user/test:
123_24112021_DONG.csv
122_24112021_DONG.csv
145_24112021_DONG.csv
123_24112021_FINA.csv
122_24112021_FINA.csv
145_24112021_FINA.csv
123_24112021_INDEM.csv
122_24112021_INDEM.csv
145_24112021_INDEM.csv
As you can see, all files have three unique prefixes:
145
123
122
And I need to create one zip per prefix, which will contain the CSV files. Note that in reality I don't know the number of CSV files; this is just an example (3 CSV files per prefix).
I developed code that extracts the prefix from each CSV name into a bash array:
for entry in "$search_dir"/*
do
# extract csv files
f1=${entry##*/}
echo $f1
# extract prefix of each file
f2=${f1%%_*}
echo $f2
# add prefix in table
liste_sirets+=($f2)
done
# get uniq prefix in unique_sorted_list
unique_sorted_list=($(printf "%s\n" "${liste_sirets[@]}" | sort -u ))
echo $unique_sorted_list
which gives the result:
145
123
122
Now I want to zip each group of three files, identified by their prefix, into the same zip file.
In other words, create 123_24112021_M2.zip, which will contain
123_24112021_DONG.csv
123_24112021_FINA.csv
123_24112021_INDEM.csv
and 122_24112021_M2.zip 145_24112021_M2.zip ...
So I developed a loop that iterates over each prefix of the CSV files located in the local path and zips all files having the same prefix:
for i in $unique_sorted_list
do
for j in "$search_dir"/*
do
if $(echo $j| cut -d'_' -f1)==$i
zip -jr $j
done
But it does not work. Any help, please! Thank you!
Using bash and shell utilities:
#!/bin/bash
printf '%s\n' *_*.csv | cut -d_ -f1 | uniq |
while read -r prefix
do
zip "$prefix".zip "$prefix"_*.csv
done
Update:
It is also requested to group files by date (the second part of the filename):
#!/bin/bash
printf '%s\n' *_*_*.csv | cut -d_ -f2 | sort -u |
while read -r date
do
zip "$date".zip ./*_"$date"_*.csv
done
Using bash 4+ associative arrays:
# declare an associative array
declare -A unq
# store unique prefixes in array unq
for f in *_*.csv; do
unq["${f%%_*}"]=1
done
# iterate through unq and create zip files
for i in "${!unq[@]}"; do
zip "$i" "${i}_"*
done
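For completeness, the loop from the question can also be made to work with a few fixes. Here is a sketch keeping the original variable names (it assumes the 24112021_M2 part of the archive name is fixed, as in the example; the original if was missing a test command plus then/fi, the outer done was missing, and zip needs an archive name before the files to add):
# iterate over every unique prefix (note the [@] and the quotes)
for i in "${unique_sorted_list[@]}"
do
    for j in "$search_dir"/*.csv
    do
        # compare the file's prefix with the current one
        if [ "$(basename "$j" | cut -d'_' -f1)" = "$i" ]; then
            # add the matching csv to the per-prefix archive (zip appends to an existing archive)
            zip -j "${i}_24112021_M2.zip" "$j"
        fi
    done
done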
Goal
Using PowerShell, find a string in a file, run a simple transformation script on the string, and replace the original string with the new string in the same file
Details
The file is a Markdown file with one or more HTML blocks inside.
The goal is to make the entire file Markdown with no HTML.
Pandoc is a command-line HTML-to-Markdown transformation tool that easily transforms HTML to Markdown.
The transformation script is a Pandoc script.
Pandoc alone cannot transform a Markdown file that includes HTML to Markdown.
Each HTML block is one long string with no line breaks (see example below).
The HTML is a little rough and sometimes not valid; despite this, Pandoc handles much of the transformation successfully. This may not be relevant.
I cannot change the fact that the file is generated originally as part Markdown/part HTML, that the HTML is sometimes invalid, or that each HTML block is all on one line.
PowerShell is required because that's the scripting language my team supports.
Example file of mixed Markdown/HTML code; most HTML is invalid
# Heading 1
Text
# Heading 2
<h3>Heading 3</h3><p>I am all on one line</h><span><div>I am not always valid HTML</div></span><br><h4>Heading 4<h4><ul><li>Item<br></li><li>Item</li><ul><span></span><img src="url" style="width:85px;">
# Heading 3
Text
# Heading 4
<h2>Heading 1</h2><div>Text</div><h2>Heading 2</h2><div>Text</div>
# Heading 5
<div><ul><li>Item</li><li>Item</li><li>Item</li></ul></div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code>
Text
Code for transformation script
pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
Attempts
I surrounded each HTML block with a <start> and <end> tag, with the goal of extracting the text in between those tags with a regex, running the Pandoc script on it, and replacing the original text. My plan was to run a foreach loop to iterate through each block one by one.
This attempt transforms the HTML to Markdown, but does not return the original Markdown with it:
$file = 'file.md'
$regex = '<start>.*?<end>'
$a = Get-Content $file -Raw
$a | Select-String $regex -AllMatches | ForEach-Object {$_.Matches.Value} | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
This poor attempt seeks to perform the replace, but only returns the original file with no changes:
$file = 'file.md'
$regex = '<start>.*?<end>'
$content = Get-Content $file -Raw
$a = $content | Select-String $regex -AllMatches
$b = $a | ForEach-Object {$_.Matches } | Foreach-Object {$_.Value} | Select-Object | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
$content | ForEach-Object {
$_ -replace $a,$b }
I am struggling to move beyond these attempts. I am new to PowerShell. If this approach is wrong entirely, I would be grateful to know. Thank you for any advice.
Given the line-oriented nature of your input, you can process your input file line by line and decide for each line whether it needs transformation or not:
$file = 'file.md'
(Get-Content $file | ForEach-Object {
if ($_ -match '^<') { # Is this an HTML line? - you could make this regex stricter
$_ | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
} else { # A non-HTML line, pass through as-is
$_
}
}) | Set-Content -Encoding Utf8 $file # be sure to choose the desired encoding
Note the (...) around the pipeline before Set-Content: it ensures that $file is read into memory in full up front, which is what allows writing back to the same file. Do note, however, that this convenient approach carries a slight risk of data loss if the command is interrupted before writing completes; always create a backup of the input files first.
My data looks like this below:
STX=ANAA:1+5013546100917:KELLOGG COMPANY (GB) LIMITED+5000119000006:TESCO STORES PLC+160811:134338+63010+PIONEER+INVFIL+B'
MHD=1+INVFIL:9'
TYP=0700+INVOICES'
SDT=5013546100917:12191+KELLOGG COMPANY (GB) LIMITED+THE KELLOGG BUILDING:TALBOT ROAD:MANCHESTER::M16 OPU+151194288'
CDT=5000119000006:5000119000006+TESCO STORES PLC+BOUGHT LEDGER DEPARTMRENT:TESCO HOUSE:PO BOX 506:CARDIFF:CF4 4TS+220430231'
FIL=9476+1+160811'
FDT=160313+160315'
MTR=7'
MHD=2+INVOIC:9'
CLO=5000119008510:0100851:4420009+TESCO (CO ANTRIM)+KILBEGS ROAD:BALLYMENA ROAD:ANTRIM:AT:BT41 4NN'
IRF=92349489+160314+160314'
And I want to grep for "FIL=" and "IRF=" and print those lines to see the results.
I have tried various options; none of them work!
zgrep -i 'FIL=\|IRF=\|' `zgrep -il "5000119000006" *201609*`
zgrep "FIL=|IRF=" `zgrep -il "50001190000006" *201609*'
If the files are zipped:
zegrep 'FIL=|IRF=' *gz
If the files are regular files (not zipped):
egrep 'FIL=|IRF=' *
So you should try it like this:
first filter the data for 5000119000006
if the data is present in the file, then search that file for FIL=|IRF=
zgrep -q '5000119000006' test.file;if [ $? -eq 0 ];then egrep 'FIL=|IRF=' test.file; fi
FIL=9476+1+160811'
IRF=92349489+160314+160314'
Replace test.file with 201611 (the actual file name on your system).
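If you need to check all the files at once rather than a single test.file, a small loop over the glob from the question works; a sketch, assuming the *201609* files are gzip-compressed (zgrep handles both compressed and plain files):
for f in *201609*; do
    # only report from files that mention the customer code at all
    if zgrep -q '5000119000006' "$f"; then
        echo "== $f =="
        # print the FIL= and IRF= lines from that file
        zgrep -E 'FIL=|IRF=' "$f"
    fi
done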
I have never programmed in bash before.
I am reading all the files in a directory; then I need to look at their names and check whether they have R1 or R2 in them. Depending on that, I need to concatenate all files that have R1 in the name and, separately, all files that have R2 in the name.
So as a final output I would like to have something like:
String 1 = file1_R1.gz file2_R1.gz file3_R1.gz...
String 2 = file1_R2.gz file2_R2.gz file3_R2.gz...
How can I do that? The only code that I have so far is:
#!/bin/bash
list=$(echo *.gz)
strR1="R1"
strR2="R2"
if [ "$list" = "*.gz" ] ; then list=""; fi
for str in $list
do
if echo "$strR1" | grep -q "$str"; then
echo "str";
else
echo "no file";
fi
done
I can read all the files in the directory but when I do the if I cannot find any file with R1, and I know that there are at least 4 files with R1 in the name.
Thank you!
Why write a loop to do the filtering that wildcards will already do for you? (You're using the same thing to filter *.gz already.)
files1=`echo *R1*.gz`
files2=`echo *R2*.gz`
Since you said you want to concatenate all those files, that would then be
zcat *R1.gz > result_R1
zcat *R2.gz > result_R2
or something like that.
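If you also want the two lists from the question ("String 1 = ..." and "String 2 = ..."), bash arrays are a bit more robust than command substitution; a sketch, assuming at least one match for each pattern:
files_R1=( *R1*.gz )
files_R2=( *R2*.gz )
echo "String 1 = ${files_R1[*]}"
echo "String 2 = ${files_R2[*]}"
# concatenated gzip files are still a valid gzip stream, so plain cat also works
cat "${files_R1[@]}" > result_R1.gz
cat "${files_R2[@]}" > result_R2.gz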
So let's say I have 5 files: f1, f2, f3, f4, f5. How can I remove the common strings (the same text in all files) from all 5 files and put them into a 6th file, f6? Please let me know.
Format of the files:
property.a.p1=some string
property.b.p2=some string2
.
.
.
property.zzz.p4=123455
So if the above is an excerpt from file 1 and files 2 to 5 also have the string property.a.p1=some string in them, then I'd like to remove that string from files 1 to 5 and put it in file 6. Each line of each file is on a new line. Thus, I would be comparing each string on a newline one by one. Each file is around 400 to 600 lines.
I found this on a forum for removing common strings from two files using ruby:
$ ruby -ne 'BEGIN {a=File.read("file1").split(/\n+/)}; print $_ if a.include?($_.chomp)' file2
See if this does what you want. It's a "2-pass" solution: the first pass uses a hash table to find the common lines, and the second uses that to filter out any lines that match the commons.
$files = gci "file1.txt","file2.txt","file3.txt","file4.txt","file5.txt"
$hash = @{}
$common = new-object system.collections.arraylist
foreach ($file in $files) {
get-content $file | foreach {
$hash[$_] ++
}
}
$hash.keys |% {
if ($hash[$_] -eq 5){[void]$common.add($_)}
}
$common | out-file common.txt
[regex]$common_regex = '^(' + (($common | foreach {[regex]::escape($_)}) -join "|") + ')$'
foreach ($file in $files) {
$new_file = get-content $file |? {$_ -notmatch $common_regex}
$new_file | out-file "new_$($file.name)"
}
Create a table in an SQL database like this:
create table properties (
file_name varchar(100) not null, -- Or whatever sizes make sense
prop_name varchar(100) not null,
prop_value varchar(100) not null
)
Then parse your files with some simple regular expressions or even just split:
prop_name, prop_value = line.strip.split('=')
dump the parsed data into your table, and do a bit of SQL to find the properties that are common to all files:
select prop_name, prop_value
from properties
group by prop_name, prop_value
having count(*) = $n
Where $n is replaced by the number of input files. Now you have a list of all the common properties and their values, so write those to your new file, remove them from your properties table, and then spin through all the rows that are left in properties and write them to the appropriate files (i.e. the file named by the file_name column).
You say that the files are "huge", so you probably don't want to slurp all of them into memory at the same time. You could do multiple passes and use a hash-on-disk library to keep track of what has been seen and where, but that would be a waste of time if you have an SQL database around, and everyone should have at least SQLite kicking around. Managing large amounts of structured data is what SQL and databases are for.