Find same words in two text files - sorting

I have two text files and each contains more than 50,000 lines. I need to find the words that appear in both files. I tried the COMM command, but I got the answer "file 2 is not in sorted order". I tried to sort the files with the SORT command, but it doesn't work. I'm working on Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.

If you want to sort the files, you will have to use some sort of external sort (like merge sort) so that you don't run out of memory. Another way is to go through the first file, find all the words, and store them in a hash table, then go through the second file and check each word against that table. If the words are actual words and not gibberish, the second method will work and be easier. Since the files are so large you may not want to use a scripting language, but it might work.
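If you do go the scripting route, a minimal sketch of that hash-table idea in awk (assuming the files are called firstFile and secondFile, and that words are separated by whitespace) could be:
awk '
    # first file: remember every word
    FNR == NR { for (i = 1; i <= NF; i++) seen[$i]; next }
    # second file: print each remembered word once
    { for (i = 1; i <= NF; i++) if ($i in seen && !printed[$i]++) print $i }
' firstFile secondFile > commonWords
Here commonWords is just a placeholder output name; awk's associative arrays play the role of the hash table.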

If the words are not each on their own line, then comm alone cannot help you.
If you have a set of Unix utilities handy, like Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two lines convert each file into one word per line and sort the result.
If you're only interested in the distinct words, you can change sort to sort -u to get the unique set.

Related

Is there an easy and fast solution to compare two csv files in bash?

My Problem:
I have 2 large csv files, with millions of lines.
The one file contains a backup of a database from my server, and looks like:
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...
Now I have another CSV file containing new codes, with the exact same schema.
I would like to compare the two and find only the codes which are not already on the server. Because a friend of mine generates random codes, we want to be certain to only update codes that are not already on the server.
I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv
First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it took too many resources, so I thought there had to be a better way.
I then used diff to only find the new lines in newCodesSorted.csv, like diff serverBackupSorted.csv newCodesSorted.csv.
I believe you can tell diff directly that you only want the difference from the second file, but I didn't understand how, so I grepped the output, knowing that I could cut/remove the unwanted characters later:
diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
But I believe there has to be a better way.
So I'm asking whether you have any ideas how to improve this method.
EDIT:
comm works great so far. But one thing I forgot to mention is that some of the codes on the server are already scanned.
New codes, however, are always initialized with isScanned = false. So newCodes.csv would look something like:
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...
I don't know whether it would be sufficient to use cut -d',' -f1 to reduce the files to just the codes and then use comm.
I tried that, and got different results once with grep and once with comm, so I'm kind of unsure which one is the correct way ^^
Yes! A highly underrated tool, comm is great for this.
Stolen examples from here.
Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b
Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b
Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'
As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as a part of the command:
comm -12 <(sort a) <(sort b)
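For the edit in the question, where only the securityCode column should be compared (because isScanned may differ), a sketch using the file names from the question might be:
comm -13 <(cut -d',' -f1 serverBackup.csv | sort -u) <(cut -d',' -f1 newCodes.csv | sort -u)
This prints the codes that appear only in newCodes.csv; the securityCode header line exists in both files, so comm suppresses it along with the codes already on the server.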
If you do prefer to stick with diff, you can get it to print only the lines from the second file, without the grep:
diff --changed-group-format='%>' --unchanged-group-format='' 1.txt 2.txt
You could then alias that diff command to "comp" or something similar to allow you to just:
comp 1.txt 2.txt
That might be handy if this is a command you are likely to use often in future.
I would think that sorting the files uses a lot of resources.
If you only want the new lines, you can try grep with the -v option:
grep -vFxf serverBackup.csv newCodes.csv
or first split serverBackup.csv into smaller chunks:
split -a 4 --lines 10000 serverBackup.csv splitted
cp newCodes.csv newCodes.csv.org
# remove the codes found in each chunk from newCodes.csv
for f in splitted*; do
    grep -vFxf "${f}" newCodes.csv > smaller
    mv smaller newCodes.csv
done
rm splitted*
Given:
$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
You could use awk:
$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
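Since the edit to the question notes that isScanned can differ between the two files, a variant of the same idea that compares only the first (securityCode) field would be:
$ awk -F',' 'FNR==NR{seen[$1]; next} !($1 in seen)' f1 f2
SOMETHINGELSE,true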

sorting with terminal after grepping

I was hoping someone might be able to shed some light on how I could sort a set of grepped values in unix.
for example if I have a list such as;
qp_1_v2
qp_50_v1
qp_51_v4
qp_52_v1
qp_53_v1
qp_54_v2
qp_2_v1
Is there a way to sort numerically using the wildcard, i.e. sort qp_*_v1, where * would be read as a number and sorted accordingly (ignoring anything that comes before and after the *)? The problem I'm finding currently is that qp_52_v2 is always read as a string, so I have to cut away the qp_ and _v to leave only the number and then sort.
I hope this makes sense...
Thanks in advance.
edit: A little addition that would be nice if anyone knows how to do it: grep and list only the value with the highest version, i.e. if qp_50 exists three times with the suffixes _v1, _v2, _v3, only qp_50_v3 is listed. The list will still consist of files with various versions, but only the highest version of each file will be output to the terminal.
ls | cut -d '_' -f 2 | sort -n
In your case, substitute ls with your grep command.
Edit: In the example above the output is cut; if you want the original file name, use this:
ls | sort -k2,2g -t '_'
-k gives the number of the field to compare,
g compares using general numeric value (rather than as a string),
-t gives the delimiter.
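For the edit about listing only the highest version of each file, a sketch (assuming GNU sort with version-sort support and names of the form qp_<number>_v<version>; substitute ls with your grep command as above) could be:
ls | sort -t '_' -k2,2n -k3,3V | awk -F'_' '{latest[$2]=$0} END{for (n in latest) print latest[n]}' | sort -t '_' -k2,2n
Because the input is sorted with the highest version last, awk keeps only the last entry it sees for each number, and the final sort puts the result back in numeric order.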

Finding lines containing words that occur more than once using grep

How do I find all lines that contain duplicate lowercase words?
I want to be able to do this using egrep. This is what I've tried thus far, but I keep getting "invalid back reference" errors:
egrep '\<(.)\>\1' inputFile.txt
egrep -w '\b(\w)\b\1' inputFile.txt
For example, if I have the following file:
The sky was grey.
The fall term went on and on.
I hope every one has a very very happy holiday.
My heart is blue.
I like you too too too much
I love daisies.
It should find the following lines in the file:
The fall term went on and on.
I hope every one has a very very happy holiday.
I like you too too too much
It finds these lines because the words "on", "very", and "too" each occur more than once on their line.
This is possible with the -E or -P option:
grep -E '(\b[a-z]+\b).*\b\1\b' file
Example:
$ cat file
The fall term went on and on.
I hope every one has a very very happy holiday.
Hi foo bar.
$ grep -E '(\b[a-z]+\b).*\b\1\b' file
The fall term went on and on.
I hope every one has a very very happy holiday.
Got it, you need to find duplicate words (all lowercase):
sed -n '/\b\([a-z][a-z]*\)\b.*\b\1\b/p' infile
Tools are there to serve your request; restricting yourself to one tool is not a good approach.
Back-references like \1 are a sed feature; GNU grep/egrep supports them as well, as the answer above shows.
I know this is about grep, but here is an awk alternative.
It is more flexible, since you can easily change the counter c:
c==2: some word occurs exactly twice
c>=2: some word occurs two or more times
etc.
awk -F"[ \t.,]" '{c=0;for (i=1;i<=NF;i++) a[$i]++; for (i in a) c=c<a[i]?a[i]:c;delete a} c>=2' file
The fall term went on and on.
I hope every one has a very very happy holiday.
It runs a loop through all the words on a line and creates an array entry for every word.
A second loop then finds the highest repeat count, and the line is printed if that count meets the condition.
try
egrep '[a-z]+' my_file
this will find the lines containing lowercase characters
egrep '[a-z]+' --color my_file
this will highlight the lowercase characters

awk: how to remove duplicated lines in a file and output them in another file at the same time?

I am currently working on a script which processes CSV files, and one of the things it does is remove and keep note of duplicate lines in the files. My current method is to run uniq -d once to display all duplicates, then run uniq again without any options to actually remove them.
Having said that, I was wondering if it would be possible to perform this in one action instead of having to run uniq twice. I've found a bunch of examples of using awk to remove duplicates out there, but as far as I can tell none of them both display the duplicates and remove them at the same time.
If anyone could offer advice or help for this I would really appreciate it though, thanks!
Here's something to get you started:
awk 'seen[$0]++{print|"cat>&2";next}1' file > tmp && mv tmp file
The above will print any duplicated lines to stderr at the same time as removing them from your input file. If you need more, tell us more....
In general, the size of your input should be your guide. If you're processing GBs of data, you often have no choice other than relying on sort and uniq, because these tools support external (out-of-core) operation.
That said, here's the AWK way:
If your input is sorted, you can keep track of duplicates in AWK easily by comparing line i to line i-1 with O(1) state: if they are equal, you have a duplicate.
If your input is not sorted, you have to keep track of all lines, requiring O(c) state, where c is the number of unique lines. You can use a hash table in AWK for this purpose.
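As a minimal sketch of the sorted-input case (duplicates.txt, sorted.csv and unique.csv are placeholder names; unlike uniq -d, this writes every extra copy, not just one line per duplicated value):
awk 'NR > 1 && $0 == prev { print > "duplicates.txt"; next } { print; prev = $0 }' sorted.csv > unique.csv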
This solution does not use awk but it does produce the result you need. In the command below replace sortedfile.txt with your csv file.
cat sortedfile.txt | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt
tee copies the output of cat both to the uniq -d process substitution (which collects the duplicates) and to the uniq on the other end of the pipe (which writes the deduplicated output).

Find unmatched items between two list using bash or DOS

I have two files with two single-column lists:
//file1 - full list of unique values
AAA
BBB
CCC
//file2
AAA
AAA
BBB
BBB
//So the result here would be:
CCC
I need to generate a list of values from file1 that have no matches in file2. I have to use a bash script (preferably without special tools like awk) or a DOS batch file.
Thank you.
Method 1
Looks like a job for grep's -v flag.
grep -v -F -f listtocheck uniques
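If partial matches are a concern (for example AAA matching AAAB), adding -x restricts grep to whole-line matches; with the file names from the question this would be:
grep -v -F -x -f file2 file1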
Method 2
A variation on Drake Clarris's solution (one that can be extended to checking against several files, which grep can't do unless they are first merged) would be:
(
sort < file_to_check | uniq
cat reference_file reference_file
) | sort | uniq -u
By doing this, the subshell in parentheses contributes each word of file_to_check only once to the combined output. Words in reference_file will be output at least twice, and words appearing in both files at least three times: once from the first file, plus twice from the two copies of the second file.
All that remains is to isolate the words we want, those that appear only once, which is exactly what sort | uniq -u does.
Optimization I
If reference_file contains a lot of duplicates, it might be worthwhile to run the heavier
sort < reference_file | uniq
sort < reference_file | uniq
(i.e. emit the deduplicated reference twice) instead of cat reference_file reference_file, in order to have a smaller output and put less weight on the final sort.
Optimization II
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and, in case of repeated checks with different files, we could reuse the same sorted reference file again and again without re-sorting it); therefore:
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
Optimization III
Finally, in the case of very long runs of identical lines in one file, which may happen with some logging systems for example, it may also be worthwhile to run uniq twice, once to get rid of the runs (ahem) and once more after sorting to fully deduplicate, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1
For a Windows CMD solution (commonly referred to as DOS, but not really):
It should be as simple as
findstr /vlxg:"file2" "file1"
but there is a findstr bug that results in possible missing matches when there are multiple literal search strings.
If a case insensitive search is acceptable, then adding the /I option circumvents the bug.
findstr /vlixg:"file2" "file1"
If you are not restricted to native Windows commands then you can download a utility like grep for Windows. The Gnu utilities for Windows are a good source. Then you could use Isemi's solution on both Windows and 'nix.
It is also easy to write a VBScript or JScript solution for Windows.
cat file1 file2 | sort | uniq -u
