I'm trying to compare the contents of two different files. I don't know what I'm doing wrong, but the things I found online about the diff command didn't work.
For example, if fileA contains this:
AAA:111
BBB:222
CCC:333
And fileB content is:
AAA:111
BBB:222
All I want to see as output is the difference, which is CCC:333. No "<", no ">", just plain CCC:333. I want to use this later in a bash script I'm working on.
Also, would it matter if the files were reversed, i.e. if it were fileB that contained CCC:333?
I don't know if it matters, but the files I'm working with contain MAC addresses.
Is the diff command I was trying to use case-sensitive?
You can use two diff options as follows:
diff --changed-group-format='%<' --unchanged-group-format='' fileA fileB
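With the sample files above, that command should print just CCC:333, with no "<" or ">" markers, so you can capture it in a variable for later use in your script. A minimal sketch (GNU diff, file names as in the question):
# Lines present in fileA but missing from fileB, with no diff markers
extra=$(diff --changed-group-format='%<' --unchanged-group-format='' fileA fileB)
echo "$extra"    # prints: CCC:333
# If the extra line lived in fileB instead, either swap the file arguments
# or use '%>' to print the lines coming from the second file:
diff --changed-group-format='%>' --unchanged-group-format='' fileA fileB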
In case anyone else is looking for these answers, I just wanted to add that both solutions work!
The sort and uniq solution by Cyrus shows the differences in both files (if both files contain the lines aaa and bbb, but only one has xxx and the other has yyy, it prints both xxx and yyy).
The diff solution by Philippe can give you different output depending on whether you pass fileA first and then fileB, or fileB first and then fileA.
Test it yourself.
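For example, something like the two commands below (the exact sort/uniq command is not quoted above, so the first one is only my guess at the solution being referred to):
# Symmetric difference: lines that appear in exactly one of the two files
# (assumes neither file contains duplicate lines of its own)
sort fileA fileB | uniq -u
# One-sided difference: lines of fileA that are missing from fileB (order matters)
diff --changed-group-format='%<' --unchanged-group-format='' fileA fileB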
Correct me if I'm wrong please!
Thank you for your help.
Due to a power failure, I have to clean up jobs that are run based on text files. The problem is, I have a text file with strings like so (they are UUIDs):
out_file.txt (~300k entries)
<some_uuidX>
<some_uuidY>
<some_uuidZ>
...
and a csv like so:
in_file.csv (~500k entries)
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location3/,<some_uuidX>.json.<some_string3>
/path/to/some/location4/,<some_uuidY>.json.<some_string4>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
/path/to/some/location6/,<some_uuidZ>.json.<some_string6>
...
I would like to remove the lines of in_file.csv whose UUIDs appear in out_file.txt.
The end result:
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
...
Since the file sizes are fairly large, I was wondering if there is an efficient way to do it in bash.
Any tips would be great.
Here is a potential grep solution:
grep -vFwf out_file.txt in_file.csv
And a potential awk solution (likely faster):
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt in_file.csv
NB there are caveats to each of these approaches. Although they both appear to be suitable for your intended purpose (as indicated by your comment "the numbers add up correctly"), posting a minimal, reproducible example in future questions is the best way to help us help you.
I have the following awk command to join lines that are shorter than a limit (it is basically used to fix broken lines in a multiline fixed-width file):
awk 'last{$0=last $0;} length($0)<21{last=$0" ";next} {print;last=""}' input_file.txt > output_file.txt
input_file.txt:
1,11,"dummy
111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
output_file.txt (expected):
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
The script works pretty well with small files (~MB) but it does nothing with big files (~GB).
What may be the problem?
Thanks in advance.
Best guess: all the lines in your big file are longer than 21 characters. There are more robust ways to do what you're trying to do with that script, though, so it may not be worth debugging this; ask for help with an improved script instead.
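One quick way to check that guess before anything else (hypothetical file name; adjust 21 to whatever your real limit is):
awk 'length($0) < 21 { short++ } END { print short+0, "lines are shorter than 21 characters" }' big_input_file.txt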
Here's one more robust way to combine quoted fields that contain newlines using any awk:
$ awk -F'"' '{$0=prev $0; if (NF%2){print; prev=""} else prev=$0 OFS}' input_file.txt
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
That may be a better starting point for you than your existing script. To do more than that, see What's the most robust way to efficiently parse CSV using awk?.
I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
I'm out of practice with shell scripting; can anyone point me in the right direction?
I have a bash script that reads the source file, but when I try to print the columns I want to a new file, it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1".
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or is there an easier (and faster) way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.
The snippet looks fine and works for me; maybe you have some weird characters in the file, or it came from a DOS environment (use dos2unix to "clean" it!). Also, you can use read -r to prevent strange behaviour with backslashes.
But let's see how awk can solve this even faster:
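For completeness, the loop with those two suggestions applied would look something like this (same file names as in the question):
# If the file came from Windows, clean it first:  dos2unix test.csv
while IFS=, read -r symbol tr_ven tr_date sec_type sec_name name
do
    echo "$name,$name,$symbol" >> output.csv
done < test.csv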
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} sets the input and output field separators to a comma. Alternatively, you can say -F"," (or -F,) on the command line, or pass it as a variable with -v FS=",". The same applies to OFS.
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that with print, every comma-separated parameter you give is printed joined by the OFS that was set previously; here, a comma.
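For the 12-column header shown in the edit, the same pattern applies; Company Name is the 12th and last field, so either of these should work (assuming none of the fields contain embedded commas):
awk 'BEGIN{FS=OFS=","} {print $12,$12,$1}' test.csv >> output.csv
awk 'BEGIN{FS=OFS=","} {print $NF,$NF,$1}' test.csv >> output.csv    # $NF is the last field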
I have a list of IDs like so:
11002
10995
48981
And a tab delimited file like so:
11002 Bacteria;
10995 Metazoa
I am trying to delete all lines in the tab delimited file containing one of the IDs from the ID list file. For some reason the following won't work and just returns the same complete tab delimited file without any line removed whatsoever:
grep -v -f ID_file.txt tabdelimited_file.txt > New_tabdelimited_file.txt
I also tried numerous other combinations with grep, but so far I'm drawing a blank.
Any idea why this is failing?
Any help would be greatly appreciated
Since you tagged this with awk, here is one way of doing it:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{ids[$1]++;next}!($1 in ids)' idFile tabFile > new_tabFile
BTW your grep command is correct. Just double-check that your file is not formatted for Windows (CRLF line endings).
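To check for (and strip) Windows carriage returns, something along these lines should do (GNU tools assumed):
# Carriage returns show up as ^M at the end of each line:
cat -A ID_file.txt | head
# Remove them, either with dos2unix or with sed:
dos2unix ID_file.txt tabdelimited_file.txt
sed -i 's/\r$//' ID_file.txt tabdelimited_file.txt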
I intend to check submitted homework answers written in C code.
Does anyone have a link, or bash shell script code, that checks for file similarity (percentage of similar lines, etc.)?
Ready-to-use program
On the one hand, there is a little C program called Sherlock from the University of Sydney, which does exactly what you want: it displays the percentage of similarity. You only have to compile it yourself, but I think that won't be a problem.
Do it yourself
On the other hand, if you're using a Unix-based system and want to do it all by yourself, there is the comm command:
compare two sorted files line by line and write to standard output:
the lines that are common, plus the lines that are unique.
(taken from the manpage)
Important to note here is that comm only works on sorted files, so you have to sort both of them first. If you have two files, say first.txt and second.txt, you can use comm like this:
comm -12 <(sort first.txt) <(sort second.txt)
The -12 option suppresses lines that are unique to either file, so you will only get lines that appear in both files.
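Since the question asks for a percentage of similar lines, you can build on that output, roughly like this (a sketch; it measures common lines against the line count of first.txt, and bash arithmetic here is integer-only):
common=$(comm -12 <(sort first.txt) <(sort second.txt) | wc -l)
total=$(wc -l < first.txt)
echo "similarity: $(( 100 * common / total ))%"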