compare 2 files delete duplicate lines batch - sorting

Hope this is not a duplicate question.
I need a batch script for removing duplicate lines across multiple files. As input I have file1 and file2, file3, file4, file5, file6, ... The lines in the files are sorted, one word per line. I need to remove from file2, file3, file4, ... every line that is already in file1; file1 works as a dictionary for the other files, and each file is compared only against file1.
$ file1
12-bk-34652
13-bk-36789
14-bk-76559
$ file2
12-bk-34652
12-bk-36098
13-bk-36789
14-bk-76559
Expected output for file2: 12-bk-36098
For now I'm using this code and running the files one by one. It is also kind of slow, considering every file is around 50 MB and file1 almost 200 MB:
findstr /v /x /g:file1 file2>file2a
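To stop running the files one by one, a small batch sketch that applies the same filter to every file (the file names are assumptions; adjust the set to your naming scheme, and use %F instead of %%F if you type this at an interactive prompt):

for %%F in (file2 file3 file4 file5 file6) do (
    findstr /v /x /g:file1 "%%F" > "%%Fa"
)

Note that findstr still reloads the whole of file1 for each input file, so this removes the manual work but not the per-file cost.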

Related

awk command does not halt on windows for merging large csv files

I am executing the following awk command on Windows 10.
awk "(NR == 1) || (FNR > 1)" *.csv > bigMergeFile.csv
I want to merge all csv files into a single file named bigMergeFile.csv using only the header of the first file.
I successfully tested the code on small files (4 files, each with 5 columns and 4 rows). However, the code does not halt when I run it on large files (10 files, each with 8k rows and 32k columns, approximately 1 GB each). It only stops when the hard drive runs out of space; at that point, the output file bigMergeFile.csv is 30 GB, while the combined size of all input csv files is 9.5 GB.
I have tested the code on Mac OS and it works fine. Help will be appreciated.
My guess: bigMergeFile.csv ends in .csv, so it is matched by *.csv and becomes one of the input files your script runs on, and it keeps growing as your script appends to it. It's as if you wrote a loop like:
while ! end-of-file do
read line from start of file
write line to end of file
done
Also, since you're basically doing a concatenation rather than a merge, set FS = "^$" so awk won't waste time attempting to split fields you don't need anyway.
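Combining both suggestions, a sketch of a fix (the output name bigMergeFile.txt is arbitrary; the point is that the *.csv wildcard cannot match it while awk runs):

awk "BEGIN{FS=\"^$\"} (NR == 1) || (FNR > 1)" *.csv > bigMergeFile.txt
ren bigMergeFile.txt bigMergeFile.csv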

How to aggregate the result of bash sort on multiple files into a single file?

I have a ~90GB file. Each line consists of tab-separated pairs such as Something \t SomethingElse. My main goal is to find the frequency of each unique line in the file. So I tried
sort --parallel=50 bigFile | uniq -c > new_sortedFile
which did not work due to the file size. So I split the original big file into 9 parts (10 GB each), and the same command worked on each part separately.
So my question is: how can I aggregate the results from those 9 files into a single file that has the same result as for bigFile?
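One way to aggregate, assuming the nine results were produced by sort | uniq -c into files named part1.counts ... part9.counts (hypothetical names) and that the set of unique lines fits in memory: strip the count prefix that uniq -c adds and sum the counts per line in awk.

cat part*.counts | awk '{c = $1; sub(/^[[:space:]]*[0-9]+ /, ""); total[$0] += c}
    END {for (l in total) print total[l], l}' > new_sortedFile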

split files with many lines into smaller files based on the content

I want to create a script to split large files into multiple files with respect to line count. Mainly, if a file is split, there must be a complete line at the end and the beginning of each piece.
No partial line should be present in any of the split files.
split is what you might be looking for.
split --lines <linenumber> <file>
and you will get a bunch of split files named like PREFIXaa, PREFIXab, ...
For further info see man split.
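For example, to cut a file into pieces of one million complete lines each (the file name and count are placeholders):

split --lines 1000000 bigfile.txt chunk_

This yields chunk_aa, chunk_ab, ... with exactly one million lines each except possibly the last; in this mode split only cuts at line boundaries, so no line is ever broken across pieces.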

Comparing/finding the difference between two text files using findstr

I have a requirement to compare two text files and find the difference between them. Basically I have an input file (input.txt) which is processed by a batch job, and my batch job logs the entries it has successfully processed to an output file (successful.txt).
In simple words, I need to find the difference between input.txt and successful.txt (input.txt - successful.txt), and I was thinking of using findstr. It seems to work fine, BUT there is one part I don't understand: it always includes the last line of my input.txt in the output. You can see that in the example below. Please note that there is no leading space or line break after the last line of my input.txt.
In the example below, you can see that the line server1,db1 is present in both files, but is still listed in the output. (It is always the last line of input.txt.)
D:\Scripts\dummy>type input.txt
server2,db2
server3,db3
server10,db10
server4,db4
server1,db11
server10,schema11
host1,sch2
host11,sql2
host11,sql3
server1,db1
D:\Scripts\dummy>type successful.txt
server1,db1
server2,db2
server3,db3
server4,db4
server10,db10
host1,sch2
host11,sql2
host11,sql3
D:\Scripts\dummy>findstr /vixg:successful.txt input.txt
server1,db11
server10,schema11
server1,db1
What am I doing wrong?
Cheers,
G
I could reproduce your results by removing the newline after the last line of input.txt. findstr acts on newline-terminated lines, so it is behaving as expected here. Since you say input.txt has no terminating newline, solution 1 is simply to add a newline to the end of input.txt, which cures the problem.
Solution 2 would be
type input.txt|findstr /vixg:successful.txt
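If you go with solution 1 instead, a minimal way to terminate the last line from a script (this appends a CRLF; if the file already ends with a newline it adds an empty line instead, so apply it only when needed):

echo.>>input.txt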

method for merging two files, opinion needed

Problem: I have two folders (one is the Delta folder, where files get updated, and the other is the Original folder, where the original files live). Every time a file is updated in the Delta folder, I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: the file names in the Delta folder and the Original folder match, but the content of the files may differ. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now, I need to merge Delta_Folder/1.properties with Original_Folder/1.properties so that my updated Original_Folder/1.properties becomes:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for:
1. Find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt).
2. Find all *.properties files in the Original folder and save the list to a temp file (original-files.txt).
3. Get the list of files common to both folders and loop over them.
4. For each file, read each line (e.g. delta-line="account.org.com.email=New-Email") from the Delta folder's property file and split it on the delimiter "=" into two string variables (delta-line-string1=account.org.com.email; delta-line-string2=New-Email).
5. Read each line (e.g. orig-line="account.org.com.email=Old-Email") from the Original folder's property file and split it the same way (orig-line-string1=account.org.com.email; orig-line-string2=Old-Email).
6. If delta-line-string1 == orig-line-string1, replace $orig-line with $delta-line, i.e. if the key account.org.com.email matches, replace account.org.com.email=Old-Email in Original_Folder/1.properties with account.org.com.email=New-Email.
7. Once the loop finishes all lines in a file, move on to the next file; continue until all common files have been processed.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it works fine, but it takes a long time (about 4 minutes per file), because it runs three nested loops for every line, splitting the line, finding the variable in the other file, and replacing the line.
I am wondering if there is any way to reduce the looping so that the script executes faster.
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command and its output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or with a single awk command, if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
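To apply that one-liner across the whole folder rather than one pair of files, a bash sketch (the folder names are taken from the question; the .tmp staging name is an assumption, and since -F'=' keeps only the first two fields, it assumes no value itself contains an =):

for f in Delta_Folder/*.properties; do
    name=$(basename "$f")
    # original file first, delta file last, so delta values win
    awk -F'=' '{arr[$1]=$2} END{for (x in arr) print x"="arr[x]}' \
        "Original_Folder/$name" "$f" > "Original_Folder/$name.tmp" &&
        mv "Original_Folder/$name.tmp" "Original_Folder/$name"
done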
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape &, /, and \ as mentioned in this answer.
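A sketch of that approach, with sed itself generating the sed script (escaping of &, /, and \ is omitted here, so this is only safe for keys and values free of sed metacharacters):

# turn each delta line KEY=VALUE into the instruction s/^KEY=.*$/KEY=VALUE/
sed 's|\([^=]*\)=\(.*\)|s/^\1=.*$/\1=\2/|' Delta_Folder/1.properties > patch.sed
sed -f patch.sed Original_Folder/1.properties > 1.properties.merged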
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
