I'm trying to compare the contents of two different files. I don't know what I'm doing wrong, but the things I found online about the diff command didn't work.
For example, if fileA contains this:
AAA:111
BBB:222
CCC:333
And fileB content is:
AAA:111
BBB:222
All I want to see as output is the difference, which is CCC:333. No "<", no ">", just plain CCC:333. I want to use this later in a bash script I'm working on.
Also, would it matter if the files were reversed, i.e. if it were fileB that contained CCC:333?
I don't know if it matters, but the files I'm working with contain MAC addresses.
Is the diff command I was trying to use case-sensitive?
You can use two diff options as follows:
diff --changed-group-format='%<' --unchanged-group-format='' fileA fileB
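With the sample files above, that command should print just CCC:333, with no "<" or ">" markers, so you can capture it in a variable for later use in your script. A minimal sketch (GNU diff, file names as in the question):
# Lines present in fileA but missing from fileB, with no diff markers
extra=$(diff --changed-group-format='%<' --unchanged-group-format='' fileA fileB)
echo "$extra"    # prints: CCC:333
# If the extra line lived in fileB instead, either swap the file arguments
# or use '%>' to print the lines coming from the second file:
diff --changed-group-format='%>' --unchanged-group-format='' fileA fileB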
In case anyone else is looking for these answers, I just wanted to add that both solutions work!
The sort and uniq solution by Cyrus shows the differences in both files (if both files contain the lines aaa and bbb, but only one has xxx and the other has yyy, it prints both xxx and yyy).
The diff solution by Philippe can give you different output depending on whether you pass fileA first and then fileB, or fileB first and then fileA.
Test it yourself.
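For example, something like the two commands below (the exact sort/uniq command is not quoted above, so the first one is only my guess at the solution being referred to):
# Symmetric difference: lines that appear in exactly one of the two files
# (assumes neither file contains duplicate lines of its own)
sort fileA fileB | uniq -u
# One-sided difference: lines of fileA that are missing from fileB (order matters)
diff --changed-group-format='%<' --unchanged-group-format='' fileA fileB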
Correct me if I'm wrong please!
Thank you for your help.
Due to a power failure, I have to clean up jobs that are run based on text files. The problem is, I have a text file with strings like so (they are UUIDs):
out_file.txt (~300k entries)
<some_uuidX>
<some_uuidY>
<some_uuidZ>
...
and a csv like so:
in_file.csv (~500k entries)
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location3/,<some_uuidX>.json.<some_string3>
/path/to/some/location4/,<some_uuidY>.json.<some_string4>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
/path/to/some/location6/,<some_uuidZ>.json.<some_string6>
...
I would like to remove the lines of in_file.csv whose UUIDs appear in out_file.txt.
The end result:
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
...
Since the file sizes are fairly large, I was wondering if there is an efficient way to do it in bash.
Any tips would be great.
Here is a potential grep solution:
grep -vFwf out_file.txt in_file.csv
And a potential awk solution (likely faster):
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt in_file.csv
NB there are caveats to each of these approaches. Although they both appear to be suitable for your intended purpose (as indicated by your comment "the numbers add up correctly"), posting a minimal, reproducible example in future questions is the best way to help us help you.
I have the following awk command to join lines that are shorter than a limit (it is basically used to fix broken lines in a multiline fixed-width file):
awk 'last{$0=last $0;} length($0)<21{last=$0" ";next} {print;last=""}' input_file.txt > output_file.txt
input_file.txt:
1,11,"dummy
111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
output_file.txt (expected):
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
The script works pretty well with small files (~MB) but it does nothing with big files (~GB).
What may be the problem?
Thanks in advance.
Best guess: all the lines in your big file are longer than 21 characters. There are more robust ways to do what you're trying to do with that script, though, so it may not be worth debugging this; ask for help with an improved script instead.
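One quick way to check that guess before anything else (hypothetical file name; adjust 21 to whatever your real limit is):
awk 'length($0) < 21 { short++ } END { print short+0, "lines are shorter than 21 characters" }' big_input_file.txt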
Here's one more robust way to combine quoted fields that contain newlines using any awk:
$ awk -F'"' '{$0=prev $0; if (NF%2){print; prev=""} else prev=$0 OFS}' input_file.txt
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
That may be a better starting point for you than your existing script. To do more than that, see What's the most robust way to efficiently parse CSV using awk?.
I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
I'm out of practice with shell scripting; can anyone point me in the right direction?
I have a bash script that reads the source file, but when I try to print the columns I want to a new file, it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1".
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or is there an easier (and faster) way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.
The snippet looks fine and works for me; maybe you have some weird characters in the file, or it came from a DOS environment (use dos2unix to "clean" it!). Also, you can use read -r to prevent strange behaviour with backslashes.
But let's see how awk can solve this even faster:
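For completeness, the loop with those two suggestions applied would look something like this (same file names as in the question):
# If the file came from Windows, clean it first:  dos2unix test.csv
while IFS=, read -r symbol tr_ven tr_date sec_type sec_name name
do
    echo "$name,$name,$symbol" >> output.csv
done < test.csv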
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} sets the input and output field separators to a comma. Alternatively, you can say -F"," (or -F,) on the command line, or pass it as a variable with -v FS=",". The same applies to OFS.
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that with print, every comma-separated parameter you give is printed joined by the OFS that was set previously; here, a comma.
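For the 12-column header shown in the edit, the same pattern applies; Company Name is the 12th and last field, so either of these should work (assuming none of the fields contain embedded commas):
awk 'BEGIN{FS=OFS=","} {print $12,$12,$1}' test.csv >> output.csv
awk 'BEGIN{FS=OFS=","} {print $NF,$NF,$1}' test.csv >> output.csv    # $NF is the last field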
I have a list of IDs like so:
11002
10995
48981
And a tab delimited file like so:
11002 Bacteria;
10995 Metazoa
I am trying to delete all lines in the tab delimited file containing one of the IDs from the ID list file. For some reason the following won't work and just returns the same complete tab delimited file without any line removed whatsoever:
grep -v -f ID_file.txt tabdelimited_file.txt > New_tabdelimited_file.txt
I also tried numerous other combinations with grep, but so far I'm drawing a blank.
Any idea why this is failing?
Any help would be greatly appreciated
Since you tagged this with awk, here is one way of doing it:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{ids[$1]++;next}!($1 in ids)' idFile tabFile > new_tabFile
BTW your grep command is correct. Just double-check that your file is not formatted for Windows (CRLF line endings).
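To check for (and strip) Windows carriage returns, something along these lines should do (GNU tools assumed):
# Carriage returns show up as ^M at the end of each line:
cat -A ID_file.txt | head
# Remove them, either with dos2unix or with sed:
dos2unix ID_file.txt tabdelimited_file.txt
sed -i 's/\r$//' ID_file.txt tabdelimited_file.txt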
I intend to check submitted homework answers written in C code.
Does anyone have a link, or bash shell script code, that checks for file similarity (percentage of similar lines, etc.)?
Ready-to-use program
On the one hand, there is a little C program called Sherlock from the University of Sydney, which does exactly what you want: it displays the percentage of similarity. You only have to compile it yourself, but I think that won't be a problem.
Do it yourself
On the other hand, if you're using a Unix-based system and want to do it all by yourself, there is the comm command:
compare two sorted files line by line and write to standard output:
the lines that are common, plus the lines that are unique.
(taken from the manpage)
Important to note here is that comm only works on sorted files, so you have to sort both of them first. If you have two files, say first.txt and second.txt, you can use comm like this:
comm -12 <(sort first.txt) <(sort second.txt)
The -12 option suppresses lines that are unique to either file, so you will only get lines that appear in both files.
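Since the question asks for a percentage of similar lines, you can build on that output, roughly like this (a sketch; it measures common lines against the line count of first.txt, and bash arithmetic here is integer-only):
common=$(comm -12 <(sort first.txt) <(sort second.txt) | wc -l)
total=$(wc -l < first.txt)
echo "similarity: $(( 100 * common / total ))%"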