Sort data file based on a separate key file

We have 2 files: data.txt and keys.txt.
data.txt is some proper unicode text with N lines.
keys.txt is a list of newline-separated integers, N lines.
Output a file sorted.txt in which the lines of data.txt are ordered according to keys.txt, ideally without writing an intermediate file such as the output of paste -d',' keys.txt data.txt.
I need to use this for large files (hundreds of GB) on machines with 16-32 GB of memory.
My first attempt was to do it in Python, which is a bit slow. It's simple enough, so we discussed doing it in C++. But I'd prefer if it uses readily available tools so there's no installation needed. This could well be impossible to do efficiently with GNU or Unix tools, but I don't know enough there to make a claim.

You should be able to do this without buffering to an intermediate file:
paste keys.txt data.txt | sort -n -k1 | cut -f2-
For performance, calibrating sort --buffer-size would be the first move, and perhaps sorting in chunks in parallel the second.
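For the hundreds-of-GB case, the same pipeline can be tuned with GNU sort's external-sort options. The buffer size, thread count and temporary directory below are placeholders to adjust for your 16-32 GB machines, not measured values:
paste keys.txt data.txt \
  | sort -k1,1n -s --buffer-size=8G --parallel=4 --compress-program=gzip -T /path/to/big/tmp \
  | cut -f2- > sorted.txt
Here -k1,1n sorts numerically on the key field only and -s keeps the sort stable, so lines with equal keys stay in their original order; --buffer-size, --parallel, --compress-program and -T control how GNU sort spills and merges its temporary runs on disk.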

Related

Set union of elements in different files

I have multiple files like this:
file1:
item1
item2
item3
file2:
item1
item5
item3
file3:
item2
item1
item4
I want to have a file with all the unique elements. I could do that with Python; the only problem is that each file contains several million lines, and I wanted to know if there is a better method (maybe using only shell scripts?).
How about:
cat * | uniq
or there may be efficiency gains if each file contains repeats in itself:
for file in *; do cat $file | uniq; done | uniq
If the files aren't sorted, uniq doesn't work (it only collapses adjacent duplicate lines), so this may not be more efficient, as you will need:
for file in *; do sort $file | uniq; done | sort | uniq
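For the union itself, sort can also take all the files in one invocation; GNU sort switches to an external merge sort (spilling to temporary files) once the data exceeds its memory buffer, so this scales to files with millions of lines:
sort -u file1 file2 file3 > union.txt
If each file is already sorted, sort -m -u file1 file2 file3 merges them in a single streaming pass without re-sorting.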
If you want the elements in common between all three files, another approach is to use a few grep operations:
$ grep -F -f file1 file2 > file1inFile2
$ grep -F -f file1 file3 > file1inFile3
$ grep -F -f file1inFile2 file1inFile3 > elementsInCommon
The -f option specifies searching against a file of patterns (file1 and file1inFile2 in this case). The -F option does a fixed string search.
If you use bash, you can do a fancy one-liner:
$ grep -F -f <(grep -F -f file1 file2) <(grep -F -f file1 file3) > elementsInCommon
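One caveat worth flagging: -F matches the patterns anywhere within a line, so for example item1 would also match a line containing item10. If every item sits alone on its line, adding -x restricts grep to whole-line matches, e.g.:
$ grep -F -x -f file1 file2 > file1inFile2
The same -x applies to each of the grep commands above.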
Grep searches in sublinear time, I think. So this may get around the usual O(n log n) time cost of presorting very large files with the sort|uniq approach.
You might be able to speed up a fixed-string grep operation even further by specifying the LC_ALL=C environment variable. However, when I explored this, it seemed to already be the default in my shell. Still, given the time improvements that have been reported, this setting seems worth investigating if you use grep.
Grep may use a fair amount of memory loading patterns, though, which could be an issue given the size of your input files. You might use the smallest of the three files as the pattern source.
If your inputs are already sorted, however, you can walk through each file one line at a time, testing string equality between the three current lines. You then either move some input file pointers ahead by a line, or print the equal string that is common to the three inputs. This approach uses O(n) time (you walk through each file once) and O(1) memory (you buffer three lines). More time, but much less memory. Not sure if this can be done with bash built-ins or core utilities, but it is definitely doable with Python, Perl, C, etc.
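As it happens, comm from coreutils implements exactly this walk for two sorted inputs at a time, reading both line by line in constant memory, so chaining two comm -12 calls gives the three-way intersection without any scripting (assuming file1, file2 and file3 are each sorted):
comm -12 file1 file2 | comm -12 - file3 > elementsInCommon
comm -12 suppresses the lines unique to either input and prints only the common ones; since its output is itself sorted, it can be fed straight into the second comm via - (standard input).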

combine multiple text files and remove duplicates

I have around 350 text files (each around 75 MB). I'm trying to combine all the files and remove duplicate entries. The files are in the following format:
ip1,dns1
ip2,dns2
...
I wrote a small shell script to do this
#!/bin/bash
for file in data/*
do
cat "$file" >> dnsFull
done
sort dnsFull > dnsSorted
uniq dnsSorted dnsOut
rm dnsFull dnsSorted
I'm doing this processing often and was wondering if there is anything I could do to improve it the next time I run it. I'm open to any programming language and suggestions. Thanks!
First off, you're not using the full power of cat. The loop can be replaced by just
cat data/* > dnsFull
assuming dnsFull is initially empty.
Then there are all those temporary files that force programs to wait for hard disks (commonly the slowest part of modern computer systems). Use a pipeline:
cat data/* | sort | uniq > dnsOut
This is still wasteful since sort alone can do what you're using cat and uniq for; the whole script can be replaced by
sort -u data/* > dnsOut
If this is still not fast enough, then realize that sorting takes O(n lg n) time while deduplication can be done in linear time with Awk:
awk '{if (!a[$0]++) print}' data/* > dnsOut
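One caveat: the Awk one-liner keeps every distinct line in memory, so with roughly 350 files of 75 MB each it only wins if the set of unique lines fits in RAM. If it does not, GNU sort's external-sorting options are the fallback; the sizes below are illustrative placeholders, not tuned values:
sort -u -S 4G --parallel=4 -T /path/to/big/tmp data/* > dnsOut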

Find same words in two text files

I have two text files, each containing more than 50,000 lines. I need to find the words that appear in both text files. I tried the comm command but got the answer "file 2 is not in sorted order". I tried to sort the file with the sort command, but it didn't work. I'm working in Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.
If you want to sort the files you will have to use some kind of external sort (like merge sort) so that you have enough memory. Another way is to go through the first file, store all of its words in a hash table, then go through the second file and check each word against the table. If the words are actual words and not gibberish, the second method will work and is easier. Since the files are so large you may not want to use a scripting language, but it might work.
If the words are not each on their own line, then comm cannot help you.
If you have a set of Unix utilities handy, like Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two lines convert each file into one word per line and sort the result.
If you're only interested in individual words, you can change sort to sort -u to get the unique set.
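If you would rather skip the two intermediate word-list files, bash process substitution (available in Cygwin's bash as well) collapses this into one command using the same tools:
$ comm -12 <(tr -cs "[:alpha:]" "\n" < firstFile | sort -u) <(tr -cs "[:alpha:]" "\n" < secondFile | sort -u) > commonWords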

awk: how to remove duplicated lines in a file and output them in another file at the same time?

I am currently working on a script which processes csv files, and one of the things it does is remove and keep note of duplicate lines in the files. My current method is to run uniq twice: once with uniq -d to display all duplicates, then again without any options to actually remove them.
Having said that, I was wondering whether it would be possible to perform this same function in one action instead of having to run uniq twice. I've found a bunch of different examples of using awk to remove duplicates out there, but as far as I know none of them both display the duplicates and remove them at the same time.
If anyone could offer advice or help with this I would really appreciate it, thanks!
Here's something to get you started:
awk 'seen[$0]++{print|"cat>&2";next}1' file > tmp && mv tmp file
The above will print any duplicated lines to stderr at the same time as removing them from your input file. If you need more, tell us more....
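If you would rather collect the duplicates in a file than on stderr, the same idea works with awk's own output redirection (dups.txt is just a placeholder name):
awk 'seen[$0]++{print > "dups.txt"; next} 1' file > tmp && mv tmp file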
In general, the size of your input should be your guide. If you're processing GBs of data, you often have no choice other than relying on sort and uniq, because these tools support external (out-of-core) sorting.
That said, here's the AWK way:
If your input is sorted, you can keep track of duplicate items in AWK easily by comparing line i to line i-1, with O(1) state: if line i equals line i-1, you have a duplicate.
If your input is not sorted, you have to keep track of all lines, requiring O(c) state, where c is the number of unique lines. You can use a hash table in AWK for this purpose.
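For the sorted case, here is a minimal sketch of that line-to-previous-line comparison (the file names are placeholders; the unsorted case is exactly what the seen[$0]++ one-liner above implements with AWK's hash-based arrays):
awk 'NR > 1 && $0 == prev { print > "duplicates_only.txt"; next } { prev = $0; print }' sortedfile.txt > unique.txt
Note that, unlike uniq -d, this writes every repeated occurrence to duplicates_only.txt rather than one line per duplicated group.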
This solution does not use awk but it does produce the result you need. In the command below replace sortedfile.txt with your csv file.
cat sortedfile.txt | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt
tee writes the stream both to the process substitution running uniq -d and to standard output, which is piped to uniq.

Reading millions of files (in a certain order) and putting them into one big file --- fast

In my bash script I have the following (for concreteness I preserve the original names;
sometimes people ask about the background etc., and then the original names make more sense):
tail -n +2 Data | while read count phi npa; do
    cat Instances/$phi >> $nF
done
That is, the first line of file Data is skipped, and then all lines, which are of
the form "r c p n", are read, and the content of files Instances/p is appended
to file $nF (in the order given by Data).
In typical examples, Data has millions of lines, so perhaps I should write a C++ application for this. However, I wondered whether somebody knows a faster solution using just bash?
Here I use cut instead of your while loop, but you could re-introduce the loop if it provides some utility to you; it would have to output the phi variable once per iteration.
tail -n +2 Data | cut -d' ' -f 2 | xargs -I{} cat Instances/{} >> $nF
This avoids the shell while/read loop, which should improve efficiency, although with -I{} xargs still runs one cat invocation per file. I also believe that using cut here will improve things further.
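If the filenames in Data contain no whitespace or quote characters, you can also drop -I and prepend the directory first, which lets xargs pack as many filenames as possible into each cat call instead of one per file:
tail -n +2 Data | cut -d' ' -f2 | sed 's|^|Instances/|' | xargs cat >> "$nF"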
