Comparing two files with accented characters (Mac OS / Terminal)

Goal: create a file listing all lines not found in either file
OS: Mac OS X, using Terminal
Problem: lines contain accented characters (UTF-8) and comparison doesn't seem to work
I've used the following command for comparing both files:
comm -13 <(sort file1) <(sort file2) > file3
That command works fine except with lines in files containing accented characters. Would you have any solutions?
One non-optimal thing I've tried is to replace all accented characters with non-accented ones using sed -i, but that didn't seem to work on one of my two files, so I assume one file is weirdly encoded. In fact, ü is displayed as u¨ when opening the file in TextMate but correctly as ü in TextEdit. I had generated that file using find Photos/ -type f > list_photos.txt to go through all filenames that contain accented characters. Maybe I should add another parameter to the find command in the first place? Any thoughts about this as well?
Many thanks.
Update:
I manually created text files with accented characters. The comm command worked without requiring LC_ALL. So the issue must be with the output of filenames into a text file (find command).
Test file A:
Istanbul 001 Mosquée Süleymaniye.JPG
Istanbul 002 Mosquée Süleymaniye.JPG
Test file B:
Istanbul 001 Mosquée Süleymaniye.JPG
Istanbul 002 Mosquée Süleymaniye - Angle.JPG
Istanbul 003 Ville.JPG
The comparison produces the expected results. But when I generate those files automatically, I get Su¨leymaniye, for instance, in the text file. When I don't redirect to an output file, however, the terminal shows the correct word Süleymaniye.
Many, many thanks for looking into it. Much appreciated.

You need to set the locale environment variables for comm. From the man page:
ENVIRONMENT
The LANG, LC_ALL, LC_COLLATE, and LC_CTYPE environment variables affect
the execution of comm as described in environ(7).
For example:
LC_COLLATE=C comm -13 <(sort file1) <(sort file2) > file3
or
LC_ALL=C comm -13 <(sort file1) <(sort file2) > file3
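One caveat, offered as a hedged sketch rather than a definitive fix: in bash, a LC_ALL=C prefix is placed in the environment of comm only, while the sort commands inside <(...) are started during expansion and keep the shell's locale, so the two tools may disagree on ordering. The version below sets the locale for every step and uses temporary files instead of process substitution. The file contents are the test files from the question; the paths under $tmp are assumptions for illustration.

```shell
# Sketch: sort and compare under the same byte-wise (C) collation.
tmp=$(mktemp -d)

printf '%s\n' \
  'Istanbul 001 Mosquée Süleymaniye.JPG' \
  'Istanbul 002 Mosquée Süleymaniye.JPG' > "$tmp/file1"
printf '%s\n' \
  'Istanbul 001 Mosquée Süleymaniye.JPG' \
  'Istanbul 002 Mosquée Süleymaniye - Angle.JPG' \
  'Istanbul 003 Ville.JPG' > "$tmp/file2"

# Sort both inputs with the same locale that comm will use.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -13 suppresses columns 1 and 3, leaving lines found only in file2.
LC_ALL=C comm -13 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"
cat "$tmp/file3"
```

This only makes sorting and comparison consistent with each other. Lines are still compared byte by byte, so two lines that look identical on screen but are encoded differently would still count as different.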

Related

Extracting a value from a same file from multiple directories

Directory name F1 F2 F3……F120
Inside each directory, a file with a common name ‘xyz.txt’
File xyz.txt has a value
Example:
F1
Xyz.txt
3.345e-2
F2
Xyz.txt
2.345e-2
F3
Xyz.txt
1.345e-2
--
F120
Xyz.txt
0.345e-2
I want to extract these values and paste them in a single file say ‘new.txt’ in a column like
New.txt
3.345e-2
2.345e-2
1.345e-2
---
0.345e-2
Any help please? Thank you so much.
If your files look very similar then you can use grep. For example:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.][0-9]{3}e-[0-9]$' > new.txt
This is a general example in which every digit can be anything. The regular expression says that the whole line must consist of: any digit [0-9], a dot character [.], three digits [0-9]{3}, the letter 'e', a minus sign and any digit [0-9].
If your data is more regular you can also try a simpler solution:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.]345e-2$' > new.txt
In this solution only the first digit can be anything.
If your files might contain something other than that line, but the line you want can be extracted unambiguously with a regex, you can use
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F*/Xyz.txt >new.txt
The same can be done with grep, but you have to separately tell it to not print the file name. The -x option can be used as a convenience to simplify the regex.
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F*/Xyz.txt >new.txt
If you have some files which match the wildcard which should be excluded, try a more complex wildcard, or multiple wildcards which only match exactly the files you want, like maybe F[1-9]/Xyz.txt F[1-9][0-9]/Xyz.txt F1[0-9][0-9]/Xyz.txt
This might work for you (GNU parallel and grep):
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120}
Process files in parallel but output results in order.
If the files contain just one line, and you want the whole thing, you can use bash range expansion:
cat /path/to/F{1..120}/Xyz.txt > output.txt
(this keeps the order too).
If the files have more lines and you need to actually extract the value, use grep -o (-o is not POSIX, but your grep probably has it), together with -h so that file names are not printed:
grep -oh '[0-9]\.345e-2' /path/to/F{1..120}/Xyz.txt > output.txt
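A point the wildcard variants leave implicit: F*/Xyz.txt expands in lexical order (F1, F10, F100, F101, ...), whereas F{1..120} keeps numeric order. Below is a minimal sketch with a plain counting loop, which keeps numeric order and also works in shells without brace expansion. The three sample directories, the variable names and the temporary location are assumptions for illustration.

```shell
# Sketch: collect one value per directory in numeric order (F1, F2, ...).
# Hypothetical sample tree with three directories; set last=120 for the real one.
root=$(mktemp -d)
last=3
mkdir "$root/F1" "$root/F2" "$root/F3"
echo '3.345e-2' > "$root/F1/Xyz.txt"
echo '2.345e-2' > "$root/F2/Xyz.txt"
echo '1.345e-2' > "$root/F3/Xyz.txt"

: > "$root/new.txt"              # start with an empty output file
i=1
while [ "$i" -le "$last" ]; do
  # Skip directories without the file instead of aborting.
  [ -f "$root/F$i/Xyz.txt" ] && cat "$root/F$i/Xyz.txt" >> "$root/new.txt"
  i=$((i + 1))
done
cat "$root/new.txt"
```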

Is there an easy and fast solution to compare two csv files in bash?

My Problem:
I have 2 large csv files, with millions of lines.
The one file contains a backup of a database from my server, and looks like:
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...
Now I have another CSV file containing new codes, with the exact same schema.
I would like to compare the two and find only the codes which are not already on the server. Because a friend of mine generates random codes, we want to be certain to update only codes which are not already on the server.
I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv
First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it used too many resources, so I thought there had to be a better way.
I then used diff to find only the new lines in newCodesSorted.csv, like diff serverBackupSorted.csv newCodesSorted.csv.
I believe you could tell diff directly that you want only the differences from the second file, but I didn't understand how, so I grepped the output, knowing that I can cut/remove the unwanted characters later:
diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
But I believe there has to be a better way.
So I ask you, if you have any ideas, how to improve this method.
EDIT:
comm works great so far. But one thing I forgot to mention is, that some of the codes on the server are already scanned.
But new codes are always initialized with isScanned = false. So the newCodes.csv would look something like
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...
I don't know whether it would be sufficient to use cut -d',' -f1 to reduce it to just the codes and then use comm.
I tried that, once with grep and once with comm, and got different results. So I'm kind of unsure which one is the correct way ^^
Yes! comm, a highly underrated tool, is great for this.
Stolen examples from here.
Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b
Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b
Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'
As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as a part of the command:
comm -12 <(sort a) <(sort b)
If you do prefer to stick with diff, you can get it to do what you want without the grep:
diff --changed-group-format='%<%>' --unchanged-group-format='' 1.txt 2.txt
You could then alias that diff command to "comp" or something similar to allow you to just:
comp 1.txt 2.txt
That might be handy if this is a command you are likely to use often in future.
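A shell function is one way to do this, and it handles arguments more gracefully than an alias. This is a minimal sketch, assuming GNU diff (the --changed-group-format option is not part of POSIX diff). The name comp comes from the suggestion above; the sample files are made up.

```shell
# Sketch: wrap the diff invocation in a function named comp (GNU diff assumed).
comp() {
  diff --changed-group-format='%<%>' --unchanged-group-format='' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' one two three > "$tmp/1.txt"
printf '%s\n' one three four > "$tmp/2.txt"

# diff exits with status 1 when the files differ, so guard the call with
# || true in scripts that run under set -e.
changed=$(comp "$tmp/1.txt" "$tmp/2.txt" || true)
printf '%s\n' "$changed"
```

Putting the function definition in a shell startup file would make comp 1.txt 2.txt available in every new session.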
I would think that sorting the file uses a lot of resources.
When you only want the new lines, you can try grep with the option -v
grep -vFxf serverBackup.csv newCodes.csv
or first split serverBackup.csv
split -a 4 --lines 10000 serverBackup.csv splitted
cp newCodes.csv newCodes.csv.org
for f in splitted*; do
    grep -vFxf "${f}" newCodes.csv > smaller
    mv smaller newCodes.csv
done
rm splitted*
Given:
$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
You could use awk:
$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
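If only the code should decide whether a line is new (the isScanned flag can differ between the two files, as the edit to the question points out), the same idea can be keyed on the first comma-separated field instead of the whole line. This is a minimal sketch using the sample data and file names from the question; the temporary directory is an assumption for illustration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/serverBackup.csv" <<'EOF'
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
EOF
cat > "$tmp/newCodes.csv" <<'EOF'
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
EOF

# First file: remember every code ($1). Second file: print only lines whose
# code was not seen. The shared header line is filtered out the same way.
new_codes=$(awk -F, 'FNR==NR{seen[$1]; next} !($1 in seen)' \
  "$tmp/serverBackup.csv" "$tmp/newCodes.csv")
printf '%s\n' "$new_codes"
```

Only the codes from the first file are held in memory, and neither file needs to be sorted.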

Find words from file a in file b and output the missing word matches from file a

I have two files that I am trying to run a find/grep/fgrep on. I have been trying several different commands to try to get the following results:
File A
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
File B
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
*about files- date=20170802 - most all have this date format - some have different date format *
FileA is my control file. I want to search fileb for each whole word in filea (hostnamea to hostnamef) and print the entries from filea that have no match in fileb to the terminal, for use in a shell script.
For this example I made it so hostnamee is not within fileb. I want to run an fgrep/grep/awk - whatever can work for this - and output only the missing hostnamee from filea.
I can get this to work but it does not particularly do what I need and if I swap it around I get nothing.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o
hostnamea
hostnameb
hostnamec
hostnamed
HOSTNAMEF
Cool - I get the matches in File-B but what if I try to reverse it.
host#host:/netops/backups/scripts$ fgrep -f fileb filea -i -w -o
host#host:/netops/backups/scripts$
I have tried several different commands but cannot seem to get it right. I am using -i to ignore case, -w to match whole words and -o to print only the matching part.
I have found some sort of workaround but was hoping there was a more elegant way of doing this with a single command either awk,egrep,fgrep or other.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o > test
user#host:/netops/backups/scripts$ diff filea test -i
5d4
< hostnamee
You can
look for "only-matches", i.e. -o, of a in b
use the result as patterns to look for in a, i.e. -f-
only list what does not match, i.e. -v
Code:
grep -of a.txt b.txt | grep -f- -v a.txt
Output:
hostnamee
hostnamef
Case-insensitive code:
grep -oif a.txt b.txt | grep -f- -vi a.txt
Output:
hostnamee
Edit:
Responding to the interesting input by Ed Morton, I have made the sample input somewhat "nastier" to test robustness against substring matches and regex-active characters (e.g. "."):
a.txt:
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
ostname
lilihostnamec
hos.namea
b.txt:
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
lalahostnamef
hostnameab
stnam
This makes things more interesting.
I provide this case insensitive solution:
grep -Fwoif a.txt b.txt | grep -f- -Fviw a.txt
additional -F, meaning "no regex tricks"
additional -w, meaning "whole word matching"
I find the output quite satisfying, assuming that the following change of the "requirements" is accepted:
Hostnames in "a" only match parts of "b" if all adjoining _ (and other "word characters") are always considered part of the hostname.
(Note the additional output line of hostnamed, which is now not found in "b" anymore, because in "b", it is preceded by an _.)
To match possible occurrences of valid hostnames which are preceded/followed by other word characters, the list in "a" would have to explicitly name those variations. E.g. "_hostnamed" would have to be listed in order to not have "hostnamed" in the output.
(With a little luck, this might even be acceptable for OP, then this extended solution is recommended; for robustness against "EdMortonish traps". Ed, please consider this a compliment on your interesting input, it is not meant in any way negatively.)
Output for "nasty" a and b:
hostnamed
hostnamee
ostname
lilihostnamec
hos.namea
I am not sure whether the changed handling of _ still matches OP's goal (if not, within OP's scope the first case-insensitive solution is satisfying).
_ is one of the "word characters" used by the "whole word only matching" of -w. More detailed regex control at some point gets beyond grep; as Ed Morton has mentioned, using awk or perl (or sed for masochistic brain exercise, the kind I enjoy) is then appropriate.
With GNU grep 2.5.4 on windows.
The files a.txt and b.txt have your content, I made however sure that they have UNIX line-endings, that is important (at least for a, possibly not for b).
$ cat tst.awk
NR==FNR {
    gsub(/^[^_]+_|-[^-]+$/,"")
    hostnames[tolower($0)]
    next
}
!(tolower($0) in hostnames)
$ awk -f tst.awk fileB fileA
hostnamee
$ awk -f tst.awk b.txt a.txt
hostnamee
ostname
lilihostnamec
hos.namea
The only assumption in the above is that your host names don't contain underscores and anything after the last - on the line is a date. If that's not the case and there's a better definition of what the optional hostname prefix and suffix strings in fileB can be then just tweak the gsub() to use an appropriate regexp.
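Before tweaking the gsub(), it can help to look at what the normalisation step produces on its own. This small sketch prints the stripped, lower-cased key for each line of fileB; the file content is the sample from the question and the temporary path is an assumption for illustration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/fileB" <<'EOF'
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
EOF

# Same gsub() as in tst.awk: drop a leading prefix up to and including the
# first underscore, drop everything from the last hyphen onward, then
# lower-case what is left.
keys=$(awk '{ gsub(/^[^_]+_|-[^-]+$/, ""); print tolower($0) }' "$tmp/fileB")
printf '%s\n' "$keys"
```

If a key printed here does not look like a bare host name, the regexp needs adjusting for that line's prefix or suffix.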

How to find specific characters between two files?

I have two files (file1, file2) and I want to make a third one showing their differences using cmd like this:
file1 :qwertyuiop
file2 :qwartyuioa
file3:chr(3)=e a
chr(10)=p a
Any good ideas?

Redirecting two files to standard input

There are several unix commands that are designed to operate on two files. Commonly such commands allow the contents for one of the "files" to be read from standard input by using a single dash in place of the file name.
I just came across a technique that seems to allow both files to be read from standard input:
comm -12 <(sort file1) <(sort file2)
My initial disbelieving reaction was, "That shouldn't work. Standard input will just have the concatenation of both files. The command won't be able to tell the files apart or even realize that it has been given the contents of two files."
Of course, this construction does work. I've tested it with both comm and diff using bash 3.2.51 on cygwin 1.7.7. I'm curious how and why it works:
Why does this work?
Is this a Bash extension, or is this straight Bourne shell functionality?
This works on my system, but will this technique work on other platforms? (In other words, will scripts written using this technique be portable?)
Bash, Korn shell (ksh93, anyway) and Z shell all support process substitution. These appear as files to the utility. Try this:
$ bash -c 'echo <(echo)'
/dev/fd/63
$ ksh -c 'echo <(echo)'
/dev/fd/4
$ zsh -c 'echo <(echo)'
/proc/self/fd/12
You'll see file descriptors similar to the ones shown.
This is a standard Bash extension. <(sort file1) opens a pipe with the output of the sort file1 command, gives the pipe a temporary file name, and passes that temporary file name on the comm command line.
You can see how it works by getting echo to tell you what's being passed to the program:
echo <(sort file1) <(sort file2)
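On the portability question: in shells without process substitution, named pipes give roughly the same effect by hand. This is a minimal sketch, assuming mkfifo and mktemp are available; the sample data and the paths under $tmp are made up for illustration.

```shell
# Sketch: emulate comm -12 <(sort file1) <(sort file2) with named pipes.
tmp=$(mktemp -d)
printf '%s\n' b a c > "$tmp/file1"
printf '%s\n' c d a > "$tmp/file2"

mkfifo "$tmp/p1" "$tmp/p2"

# Each writer blocks until comm opens the matching pipe for reading.
sort "$tmp/file1" > "$tmp/p1" &
sort "$tmp/file2" > "$tmp/p2" &

# comm sees two ordinary-looking file names, just as with /dev/fd/N.
common=$(comm -12 "$tmp/p1" "$tmp/p2")
wait
rm -r "$tmp"
printf '%s\n' "$common"
```

Process substitution does the same bookkeeping automatically: it creates the pipe, starts the writer and passes the pipe's name as an argument.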
