My Problem:
I have two large CSV files with millions of lines.
One file contains a backup of a database from my server and looks like:
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...
Now I have another CSV file containing new codes, with the exact same schema.
I would like to compare the two and find only the codes that are not already on the server. Because a friend of mine generates random codes, we want to be certain to only update codes that are not already on the server.
I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv
First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it used too many resources, so I thought there had to be a better way.
I then used diff to find only the new lines in newCodesSorted.csv, like diff serverBackupSorted.csv newCodesSorted.csv.
I believe you could tell diff directly that you only want the lines unique to the second file, but I didn't understand how, therefore I grepped the output, knowing that I could cut/remove the unwanted characters later:
diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
But I believe there has to be a better way, so I'm asking if you have any ideas on how to improve this method.
EDIT:
comm works great so far. But one thing I forgot to mention is that some of the codes on the server are already scanned.
New codes, however, are always initialized with isScanned = false. So newCodes.csv would look something like:
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...
I don't know whether it would be sufficient to use cut -d',' -f1 to reduce both files to just the codes and then use comm.
I tried that, and I got different results with grep than with comm, so I'm not sure which one is the correct way ^^
Yes! comm, a highly underrated tool, is great for this.
Stolen examples from here.
Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b
Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b
Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'
As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as a part of the command:
comm -12 <(sort a) <(sort b)
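For your case, where only the code in the first column matters and isScanned can differ, a sketch that strips the second column on the fly before comparing (assuming bash for the process substitution) would be:
comm -13 <(cut -d, -f1 serverBackup.csv | sort -u) <(cut -d, -f1 newCodes.csv | sort -u)
This prints only the codes that appear in newCodes.csv but not in serverBackup.csv; the shared header securityCode is common to both files and therefore suppressed as well.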
If you do prefer to stick with diff, you can get it to do what you want without the grep:
diff --changed-group-format='%>' --unchanged-group-format='' 1.txt 2.txt
You could then alias that diff command to "comp" or something similar to allow you to just:
comp 1.txt 2.txt
That might be handy if this is a command you are likely to use often in future.
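For example, the alias could be defined like this (a sketch; the single quotes have to stay nested inside the double quotes):
alias comp="diff --changed-group-format='%>' --unchanged-group-format=''"
Put it in your ~/.bashrc if you want it available in every shell.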
I would think that sorting the files uses a lot of resources.
When you only want the new lines, you can try grep with the -v option:
grep -vFxf serverBackup.csv newCodes.csv
or first split serverBackup.csv
split -a 4 --lines 10000 serverBackup.csv splitted
cp newCodes.csv newCodes.csv.org
for f in splitted*; do
grep -vFxf "${f}" newCodes.csv > smaller
mv smaller newCodes.csv
done
rm splitted*
Given:
$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
You could use awk:
$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
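If, as in your edit, only the code in the first field should decide whether a line is new (isScanned may differ between the files), a variant keyed on that field alone would be (a sketch of the same idea):
$ awk -F, 'FNR==NR{seen[$1]; next} !($1 in seen)' f1 f2
SOMETHINGELSE,true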
I currently have a list of terms in words.txt, with each term on its own line, and I want to count how many total occurrences of all those terms exist in the first 500 lines of multiple CSV files in the same directory.
I currently have something like this:
grep -Ff words.txt /some/directory |wc -l
How exactly can I get the program to display, for each file, the count for just the first 500 lines of that file? Do I have to create new files containing those 500 lines? How can I do that for a large number of original files? I'm very new to coding and working on a dataset for research, so any help is much appreciated!
Edit: I want it to display something like this but for each file:
grep -Ff words.txt list1.csv |wc -l
/Users/USER/Desktop/FILE/list1.csv:28
This works for me.
head -n 100 /some/directory/* | grep -Ff words.txt | wc -l
Sample Result: 38
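If you want a separate count for each file, in the filename:count format from your edit, a sketch along these lines should work (assuming the files match /some/directory/*.csv; grep -c counts matching lines, like wc -l did):
for f in /some/directory/*.csv; do
    printf '%s:%s\n' "$f" "$(head -n 500 "$f" | grep -cFf words.txt)"
done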
I have a file that has a couple thousand domain names in a list. I easily generated a list of just the unique names using the uniq command. Now, I want to go through and find how many times each of the items in the uniques list appears in the original, non-unique list. I thought this should be pretty easy to do with this loop, but I'm running into trouble:
for name in 'cat uniques.list'; do grep -c $name original.list; done > output.file
For some reason, it's spitting out a result that shows some count of something (honestly not sure what) for the uniques file and the original file.
I feel like I'm overlooking something really simple here. Any help is appreciated.
Thanks!
Simply use uniq -c on your file:
-c, --count
prefix lines by the number of occurrences
The command to get the final output:
sort original.list | uniq -c
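If you want the output in a name-then-count form, closer to what your loop was aiming for, one possible sketch is to swap the columns afterwards:
sort original.list | uniq -c | awk '{print $2, $1}'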
If I have a CSV file like this:
lion#mammal#scary animal
human#mammal#human
hummingbird#bird#can fly
dog#mammal#man's best friend
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
and I have another CSV file like this:
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
rockets#pewpew#fire
banana#fruit#yellow
I want the output to be like this:
lion#mammal#scary animal
human#mammal#human
hummingbird#bird#can fly
dog#mammal#man's best friend
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
rockets#pewpew#fire
banana#fruit#yellow
Some of the entries in the first CSV file are also present in the second CSV file; they overlap quite a bit. How can I combine these CSV files in the correct order? It is guaranteed that the new entries will always be the first few lines at the beginning of the first CSV file.
Solution 1 (keeps the original order; the awk idiom !a[$0]++ prints a line only the first time it is seen):
awk '!a[$0]++' file1.csv file2.csv
Solution 2 (if you don't care about the original order):
sort -u file1 file2
Here's one way:
Use cat -n to concatenate input files and prepend line numbers
Use sort -u, keyed on the second field (the original data), to remove duplicate lines
Use sort -n to sort again by the prepended line number
Use cut to remove the line numbering
$ cat -n file1 file2 | sort -uk2 | sort -nk1 | cut -f2-
lion#mammal#scary animal
human#mammal#human
hummingbird#bird#can fly
dog#mammal#man's best friend
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
rockets#pewpew#fire
banana#fruit#yellow
$
I have two text files, each containing more than 50,000 lines. I need to find the words that appear in both text files. I tried the comm command but got the answer that "file 2 is not in sorted order". I tried to sort the files with the sort command, but that didn't work. I'm working on Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.
If you want to sort the files, you will have to use some sort of external sort (like merge sort) so that you have enough memory. Alternatively, you could go through the first file, find all the words and store them in a hash table, then go through the second file and check each of its words against that table. If the words are actual words and not gibberish, the second method will work and be easier. Since the files are so large, you may not want to use a scripting language, but it might work.
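A minimal sketch of that hash-table idea in awk (assuming whitespace-separated words; file1 and file2 are placeholders for your two files):
awk 'FNR==NR { for (i = 1; i <= NF; i++) seen[$i]; next }
     { for (i = 1; i <= NF; i++) if ($i in seen && !shown[$i]++) print $i }' file1 file2
The first block loads every word of file1 into the array seen; the second block prints each word of file2 that is in seen, at most once.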
If the words are not each on their own line, then comm cannot help you directly.
If you have a set of Unix utilities handy, like Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two lines put each word in the files on a line of its own and sort the result.
If you're only interested in the distinct words, you can change sort to sort -u to get the unique set.
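For example, the first line would then become:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort -u > firstFileWords
and likewise for the second file.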