Set union of elements in different files


I have multiple files like that:
file1:
item1
item2
item3
file2:
item1
item5
item3
file3:
item2
item1
item4
I want to have a file with all the unique elements. I could do that with Python; the only problem is that each file contains several million lines, and I wanted to know if there is a better method (maybe using only shell scripts?).

How about:
cat * | uniq
or, if each file contains repeats within itself, there may be efficiency gains from:
for file in *; do uniq "$file"; done | uniq
If the files aren't sorted, uniq doesn't work, because it only removes adjacent duplicate lines. In that case this may not be more efficient, as you will need:
for file in *; do sort "$file" | uniq; done | sort | uniq
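Since sort can read several files at once and drop duplicates itself, the union can also be sketched as a single call. The sample files below reproduce the ones from the question; the scratch directory and the variable name union are assumptions made for illustration only.

```shell
# Recreate the three sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' item1 item2 item3 > "$tmp/file1"
printf '%s\n' item1 item5 item3 > "$tmp/file2"
printf '%s\n' item2 item1 item4 > "$tmp/file3"

# sort -u sorts all inputs together and keeps one copy of each line.
# LC_ALL=C selects plain byte ordering (assumption: locale-aware
# ordering is not needed for these lines).
union=$(LC_ALL=C sort -u "$tmp/file1" "$tmp/file2" "$tmp/file3")

printf '%s\n' "$union"
rm -rf "$tmp"
```

On the sample data this prints item1 through item5, one per line.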

If you want the elements in common between all three files, another approach is to use a few grep operations:
$ grep -F -f file1 file2 > file1inFile2
$ grep -F -f file1 file3 > file1inFile3
$ grep -F -f file1inFile2 file1inFile3 > elementsInCommon
The -f option specifies searching against a file of patterns (file1 and file1inFile2 in this case). The -F option does a fixed string search.
If you use bash, you can do a fancy one-liner:
$ grep -F -f <(grep -F -f file1 file2) <(grep -F -f file1 file3) > elementsInCommon
Grep searches in sublinear time, I think. So this may get around the usual O(n log n) time cost of presorting very large files with the sort|uniq approach.
You might be able to speed up a fixed-string grep operation even further by setting the LC_ALL=C environment variable. However, when I explored this, it seemed to be a shell default already. Still, given the time improvement that is reported, this setting seems worth investigating if you use grep.
Grep may use a fair amount of memory loading patterns, though, which could be an issue given the size of your input files. You might use your smallest of the three files as the pattern source.
If your inputs are already sorted, however, you can walk through each file one line at a time, testing string equality between the three lines. You then either move some input file pointers ahead by a line, or print the equal string that is common to the three inputs. This approach uses O(n) time (you walk through each file once) and O(1) memory (you buffer three lines). More time, but much less memory. Not sure if this can be done with bash built-ins or core utilities, but this is definitely doable with Python, Perl, C, etc.
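For two sorted inputs at a time, the core utility comm performs this kind of lockstep walk, so the three-way intersection can be sketched with two chained comm calls. The sort step below is only there because the sample files from the question are not sorted; the scratch directory and the .sorted file names are illustrative assumptions.

```shell
# Recreate the three sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' item1 item2 item3 > "$tmp/file1"
printf '%s\n' item1 item5 item3 > "$tmp/file2"
printf '%s\n' item2 item1 item4 > "$tmp/file3"

# comm expects sorted input; sort and compare under the same locale.
export LC_ALL=C
for f in file1 file2 file3; do
    sort -u "$tmp/$f" > "$tmp/$f.sorted"
done

# comm -12 hides the lines unique to either input, leaving only the
# lines present in both. "-" makes the second comm read the first
# intersection from standard input.
common=$(comm -12 "$tmp/file1.sorted" "$tmp/file2.sorted" |
         comm -12 - "$tmp/file3.sorted")

printf '%s\n' "$common"
rm -rf "$tmp"
```

With the sample data, only item1 appears in all three files.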

Related

How to rewrite a bad shell script to understand how to perform similar tasks? [closed]

So, I wrote a bad shell script (according to several questions, one of which I asked) and now I am wondering which way to go to perform the same, or similar, task(s).
I honestly have no clue about which tool may be best for what I need to achieve and I hope that, by understanding how to rewrite this piece of code, it will be easier to understand which way to go.
There we go:
# read reference file line by line
while read -r linE;
do
# field 2 will be grepped
pSeq=`echo $linE | cut -f2 -d" "`
# field 1 will be used as filename to store the grepped things
fName=`echo $linE | cut -f1 -d" "`
# grep the thing in a very big file
grep -i -B1 -A2 "^"$pSeq a_very_big_file.txt | sed 's/^--$//g' | awk 'NF' > $dir$fName".txt"
# grep the same thing in another very big file and store it in the same file as above
grep -i -B1 -A2 "^"$pSeq another_very_big_file.txt | sed 's/^--$//g' | awk 'NF' >> $dir$fName".txt"
done < reference_file.csv
At this point I am wondering... how can I achieve the same result without using a while loop to read reference_file.csv? What is the best way to solve similar problems?
EDIT: when I mentioned the two very_big_files, I am talking > 5GB.
EDIT II: this should be the format of the files:
reference_file.csv:
object pattern
oj1 ptt1
oj2 ptt2
... ...
ojN pttN
a_very_big_file and another_very_big_file:
>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters
>headN
pttNathirdsequenceofcharacters
+
athirdsequenceofcharacters
Basically, I search for pattern in the two files, then I need to get the line above and the two below each match. Of course, not all the lines in the two files match with the patterns in the reference_file.csv.

Global Maxima
Efficient bash scripts are typically very creative and not something you can achieve by incrementally improving a naive solution.
The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted or data in different files has the same order.
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you assembled some pieces, you might be fooled into thinking you'd know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.
Local Maxima
Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:
Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.
Extract multiple fields from a line
Instead of calling cut for each individual field, use read to read them all at once:
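# slow: starts extra processes for every field of every line
while read -r line; do
field1=$(echo "$line" | cut -f1 -d" ")
field2=$(echo "$line" | cut -f2 -d" ")
...
done < file
# better: read splits the line into fields itself
while read -r field1 field2 otherFields; do
...
done < file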
Combinations of grep, sed, awk
Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipe of these tools you can combine them into a single call.
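Some examples of (in your case) equivalent commands, one per line. The first three all remove the -- group separators that grep prints; the last two both print each match with one line before and two lines after it: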
sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--
grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'
Multiple grep on the same file
You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with
grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
a_very_big_file another_very_big_file
But since you have to store different matches in different files ... (see next point)
Know when to give up and switch to another language
Dynamic output files
Your loop generates multiple files. The typical command line utils like cut, grep and so on only generate one output. I know only one standard tool that generates a variable number of output files: split. But that does not filter based on values, but on position. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
Loops in awk are faster ...
time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s
... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)
Iterate the biggest file
Reduce the number of times you have to read the biggest files. So instead of iterating reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.
Final script
To replace your script, you can try the following awk script. This assumes that ...
- the filenames (first column) in reference_file are unique
- the two big files do not contain > except for the header
- the patterns (second column) in reference_file are not prefixes of each other.
If the last assumption does not hold, simply remove the break.
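awk -v dir="$dir" '
# first file (reference_file): remember output name and pattern of every row
FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
# big files: one record per ">" entry, one field per line, so $2 is the sequence
{
  for (i=1;i<=max;i++)
    if ($2~"^"pat[i]) {
      # parenthesized target: an unparenthesized concatenation after > is ambiguous in awk
      printf ">%s", $0 > (dir "/" file[i])
      break
    }
}' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file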
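To see the record-based approach at work, here is a small sketch that runs the same awk logic on made-up data in the format described in the question. The scratch directory, the file names big_file and out, and the sample sequences are assumptions for illustration; the header row of the reference file is left out.

```shell
# Build a tiny reference file and one "big" file in a scratch directory.
tmp=$(mktemp -d)
mkdir "$tmp/out"
printf '%s\n' 'oj1 ptt1' 'oj2 ptt2' > "$tmp/reference_file"
printf '%s\n' '>head1' 'ptt1aaa' '+' 'aaa' \
              '>head2' 'zzzbbb'  '+' 'bbb' \
              '>head3' 'ptt2ccc' '+' 'ccc' > "$tmp/big_file"

# RS=">" makes every ">" entry one record and FS="\n" makes every line
# one field, so $2 is the sequence line that is tested against the patterns.
awk -v dir="$tmp/out" '
FNR==NR { max++; file[max]=$1; pat[max]=$2; next }
{
    for (i=1; i<=max; i++)
        if ($2 ~ "^" pat[i]) {
            printf ">%s", $0 > (dir "/" file[i])
            break
        }
}' "$tmp/reference_file" RS='>' FS='\n' "$tmp/big_file"

# Collect the per-object output files before cleaning up.
oj1=$(cat "$tmp/out/oj1")
oj2=$(cat "$tmp/out/oj2")
printf '%s\n' "--- oj1 ---" "$oj1" "--- oj2 ---" "$oj2"
rm -rf "$tmp"
```

The entry under head2 matches no pattern and is dropped; head1 ends up in oj1 and head3 in oj2, each as a complete four-line entry.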

Is there an easy and fast solution to compare two csv files in bash?

My Problem:
I have 2 large csv files, with millions of lines.
The one file contains a backup of a database from my server, and looks like:
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...
Now I have another CSV file containing new codes, with the exact same schema.
I would like to compare the two, and only find the codes, which are not already on the server. Because a friend of mine generates random codes, we want to be certain to only update codes, which are not already on the server.
I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv
First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it took too many resources, so I thought there had to be a better way.
I then used diff to only find new lines in newCodesSorted.csv like diff serverBackupSorted.csv newCodesSorted.csv.
I believe you could tell diff directly that you want only the differences from the second file, but I didn't understand how, so I grepped the output, knowing that I could cut/remove the unwanted characters later:
diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
But I believe there has to be a better way.
So I ask you, if you have any ideas, how to improve this method.
EDIT:
comm works great so far. But one thing I forgot to mention is, that some of the codes on the server are already scanned.
But new codes are always initialized with isScanned = false. So the newCodes.csv would look something like
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...
I don't know whether it would be sufficient to use cut -d',' -f1 to reduce it to just the codes and then use comm.
I tried that, once with grep and once with comm, and got different results. So I'm kind of unsure which one is the correct way ^^

Yes! A highly underrated tool, comm, is great for this.
Stolen examples from here.
Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b
Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b
Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'
As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as a part of the command:
comm -12 <(sort a) <(sort b)
If you do prefer to stick with diff, you can get it to do what you want without the grep:
diff --changed-group-format='%<%>' --unchanged-group-format='' 1.txt 2.txt
You could then alias that diff command to "comp" or something similar to allow you to just:
comp 1.txt 2.txt
That might be handy if this is a command you are likely to use often in future.
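For the case from the question's edit, where isScanned may differ between the two files, one possible sketch is to compare the code column only. It uses the sample rows from the question; the scratch directory and the intermediate file names serverCodes and newCodeList are assumptions for illustration.

```shell
# Recreate the sample CSV files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' \
    'NALSKIFKEA,false' 'NAPOIDFNLE,true' > "$tmp/serverBackup.csv"
printf '%s\n' 'securityCode,isScanned' 'ALBSIBFOEA,false' 'OUVOENJBSD,false' \
    'NAPOIDFNLE,false' 'NALEJNSIDO,false' 'NPIAEBNSIE,false' > "$tmp/newCodes.csv"

# Keep only the first column, then sort both lists under the same locale.
export LC_ALL=C
cut -d, -f1 "$tmp/serverBackup.csv" | sort -u > "$tmp/serverCodes"
cut -d, -f1 "$tmp/newCodes.csv"     | sort -u > "$tmp/newCodeList"

# -13 hides lines unique to the first file and lines common to both,
# leaving the codes that exist only in the new list. The header line
# is present in both files and therefore drops out as well.
new_only=$(comm -13 "$tmp/serverCodes" "$tmp/newCodeList")

printf '%s\n' "$new_only"
rm -rf "$tmp"
```

NAPOIDFNLE is filtered out here even though its isScanned value differs, which a whole-line comparison would not do.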

I would think that sorting the file uses a lot of resources.
When you only want the new lines, you can try grep with the option -v
grep -vFxf serverBackup.csv newCodes.csv
or first split serverBackup.csv
split -a 4 --lines 10000 serverBackup.csv splitted
cp newCodes.csv newCodes.csv.org
for f in splitted*; do
grep -vFxf "${f}" newCodes.csv > smaller
mv smaller newCodes.csv
done
rm splitted*

Given:
$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
You could use awk:
$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
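The command above compares whole lines, so a code whose isScanned value changed still counts as new. A variant keyed on the first column only might look like the following sketch; the -F, field separator and the use of $1 are the only changes, and the scratch directory is an assumption for illustration.

```shell
# Recreate f1 and f2 from the example above in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' \
    'NALSKIFKEA,false' 'NAPOIDFNLE,true' > "$tmp/f1"
printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' \
    'NALSKIFKEA,true' 'NAPOIDFNLE,false' 'SOMETHINGELSE,true' > "$tmp/f2"

# First file: remember every code. Second file: print only the lines
# whose code was not seen in the first file.
result=$(awk -F, 'FNR==NR { seen[$1]; next } !($1 in seen)' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only SOMETHINGELSE,true remains, because the other three codes already exist in f1. Note that awk keeps all codes of the first file in memory.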

combine multiple text files and remove duplicates

I have around 350 text files (and each file is around 75MB). I'm trying to combine all the files and remove duplicate entries. The file is in the following format:
ip1,dns1
ip2,dns2
...
I wrote a small shell script to do this
#!/bin/bash
for file in data/*
do
cat "$file" >> dnsFull
done
sort dnsFull > dnsSorted
uniq dnsSorted dnsOut
rm dnsFull dnsSorted
I'm doing this processing often and was wondering if there is anything I could do to improve the processing next time when I run it. I'm open to any programming language and suggestions. Thanks!

First off, you're not using the full power of cat. The loop can be replaced by just
cat data/* > dnsFull
assuming that file is initially empty.
Then there's all those temporary files that force programs to wait for hard disks (commonly the slowest parts in modern computer systems). Use a pipeline:
cat data/* | sort | uniq > dnsOut
This is still wasteful since sort alone can do what you're using cat and uniq for; the whole script can be replaced by
sort -u data/* > dnsOut
If this is still not fast enough, then realize that sorting takes O(n lg n) time while deduplication can be done in linear time with Awk:
awk '{if (!a[$0]++) print}' data/* > dnsOut
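The two approaches differ in more than speed: sort -u returns the lines in sorted order, while the awk version keeps the order of first appearance and has to hold every distinct line in memory. A small sketch on made-up data (the addresses, host names and file names are hypothetical) shows the difference:

```shell
# Two small input files with one duplicate entry between them.
tmp=$(mktemp -d)
mkdir "$tmp/data"
printf '%s\n' '10.0.0.2,b.example' '10.0.0.1,a.example' > "$tmp/data/part1"
printf '%s\n' '10.0.0.1,a.example' '10.0.0.3,c.example' > "$tmp/data/part2"

# sort -u: output is sorted, duplicates removed.
sorted=$(LC_ALL=C sort -u "$tmp/data"/*)

# awk: a line is printed only the first time it is seen, so the
# original order of first appearance is preserved.
first_seen=$(awk '!seen[$0]++' "$tmp/data"/*)

printf 'sort -u:\n%s\nawk:\n%s\n' "$sorted" "$first_seen"
rm -rf "$tmp"
```

Which one is preferable depends on whether the output order matters and on whether all distinct lines fit in memory.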

Find unmatched items between two list using bash or DOS

I have two files with two single-column lists:
//file1 - full list of unique values
AAA
BBB
CCC
//file2
AAA
AAA
BBB
BBB
//So the result here would be:
CCC
I need to generate a list of values from file1 that have no matches in file2. I have to use bash script (preferably without special tools like awk) or DOS batch file.
Thank you.

Method 1
Looks like a job for grep's -v flag.
grep -v -x -F -f listtocheck uniques
Method 2
A variation to Drake Clarris's solution (that can be extended to checking using several files, which grep can't do unless they are first merged), would be:
(
sort < file_to_check | uniq
cat reference_file reference_file
) | sort | uniq -u
By doing this, any words in file_to_check will appear, in the output combined by the subshell in brackets, only once. Words in reference_file will be output at least twice, and words appearing in both files will be output at least three times - one from the first file, twice from the two copies of the second file.
There only remains to find a way to isolate the words we want, those that appear once, which is what sort | uniq -u does.
Optimization I
If reference_file contains a lot of duplicates, it might be worthwhile to run a heavier
sort < reference_file | uniq
sort < reference_file | uniq
instead of cat reference_file reference_file, in order to have a smaller output and weigh less on the final sort.
Optimization II
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and in case of repeated checks with different files, we could reuse again and again the same sorted reference file without need of re-sorting it); therefore
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
Optimization III
Finally in case of very long runs of identical lines in one file, which may be the case with some logging systems for example, it may be also worthwhile to run uniq twice, one to get rid of the runs (ahem) and another to uniqueize it, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1

For a Windows CMD solution (commonly referred to as DOS, but not really):
It should be as simple as
findstr /vlxg:"file2" "file1"
but there is a findstr bug that results in possible missing matches when there are multiple literal search strings.
If a case insensitive search is acceptable, then adding the /I option circumvents the bug.
findstr /vlixg:"file2" "file1"
If you are not restricted to native Windows commands then you can download a utility like grep for Windows. The Gnu utilities for Windows are a good source. Then you could use Isemi's solution on both Windows and 'nix.
It is also easy to write a VBScript or JScript solution for Windows.

cat file1 file2 | sort | uniq -u

Reading millions of files (in a certain order) and putting them into one big file --- fast

In my bash script I have the following (for concreteness I preserve the original names;
sometimes people ask about the background etc., and then the original names make more sense):
tail -n +2 Data | while read count phi npa; do
cat Instances/$phi >> $nF
done
That is, the first line of file Data is skipped, and then all lines, which are of
the form "r c p n", are read, and the content of files Instances/p is appended
to file $nF (in the order given by Data).
In typical examples, Data has millions of lines. So perhaps I should write a
C++ application for that. However I wondered whether somebody knew a faster
solution just using bash?

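Here I use cut instead of your while loop, but you could re-introduce the loop if it provides some utility to you. The loop would have to output the phi variable once per iteration.
tail -n +2 Data | cut -d' ' -f 2 | sed 's|^|Instances/|' | xargs cat >> "$nF"
Without -I, xargs passes as many file names as fit on one command line to each cat, which reduces the number of cat invocations to as few as possible and should improve efficiency (with -I{}, xargs would start one cat per line). This assumes the names contain no whitespace or quote characters. I also believe that using cut here will improve things further.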
