Grepping progressively through large file - shell

I have several large data files (~100MB-1GB of text) and a sorted list of tens of thousands of timestamps that index data points of interest. The timestamp file looks like:
12345
15467
67256
182387
199364
...
And the data file looks like:
Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431
The data in the second file is all in order of timestamp. I want to grep through the second file using the time stamps of the first, printing the timestamp and fourth data item in an output file. I've been using this:
grep -wf time.stamps data.file | awk '{print $1 "\t" $4 }' >> output.file
This is taking on the order of a day to complete for each data file. The problem is that this command searches through the entire data file for every line in time.stamps, but I only need the search to pick up from the last data point. Is there any way to speed up this process?

You can do this entirely in awk …
awk 'NR==FNR{a[$1]++;next}($1 in a){print $1,$4}' timestampfile datafile
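Since both files are sorted by timestamp, a single merge-style pass is also possible, which is closer to the "pick up from the last data point" idea in the question. The following is only a sketch: it assumes data lines are exactly those whose first field is all digits, and the names workdir, tsfile, want and more are made up for the illustration. The sample files are the ones from the question.

```shell
workdir=$(mktemp -d)    # hypothetical scratch directory for the demo
printf '%s\n' 12345 15467 67256 > "$workdir/time.stamps"
cat > "$workdir/data.file" <<'EOF'
Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431
EOF

awk -v tsfile="$workdir/time.stamps" '
BEGIN { more = (getline want < tsfile) > 0 }   # load the first timestamp
!more { exit }                                 # no timestamps left: stop early
$1 !~ /^[0-9]+$/ { next }                      # skip lines that are not data
{
    # advance through the timestamp list until it catches up with this line
    while (more && want + 0 < $1 + 0)
        more = (getline want < tsfile) > 0
    if (more && want + 0 == $1 + 0)
        print $1 "\t" $4
}' "$workdir/data.file" > "$workdir/output.file"
```

Only one timestamp is held in memory at a time, and awk stops reading the data file once the timestamp list is exhausted. Whether this beats the array version in practice would need measuring on the real files.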

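JS웃's awk solution is probably the way to go. If join is available and the first field of the irrelevant "data" is not numeric, you could exploit the fact that the files are in the same order and avoid a sorting step. Note that join expects both inputs to be sorted lexically on the join field, so numerically sorted timestamps are only safe here if they all have the same number of digits. This example uses bash process substitution on Linux: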
join -o2.1,2.4 -1 1 -2 1 key.txt <(awk '$1 ~ /^[[:digit:]]+$/' data.txt)

'grep' has a little-used option, -f filename, which reads the patterns from filename and matches against all of them. It is likely to beat the awk solution, and your timestamps would not have to be sorted.

Related

How to make a table using bash shell?

I have multiple text files that have their own column. I hope to combine them into one text file like a table not a long column.
I tried 'paste' and 'column', but it did not make the shape that I wanted.
When I used the paste with two text files, it made a nice table.
paste height_1.txt height_2.txt > test.txt
The trouble starts from three or more text files.
paste height_1.txt height_2.txt height_3.txt > test.txt
At a glance, it looks fine. But when I plot each column of test.txt in gnuplot (p "test.txt"), I get an unexpected graph that differs from the original file, especially in its last part.
The shape of the table is ruined in a strange way in test.txt, which makes the graph look weird.
How could I make a well-structured table in the text file with bash shell?
Is bash the wrong tool for this job?
If so, I will try this with python.
Height files are extracted from other *.csv files using awk.
Thank you so much for reading this question.
awk with simple concatenation can take the records for as many files as you have and join them together in a single output file for further processing. You simply provide the multiple input files as the files for awk to read and then concatenate each record using FNR (file record number) as an index and then use the END rule to print the combined records from all files.
For example, given 3 data files, e.g. data1.txt - data3.txt each with an integer in each row, e.g.
$ cat data1.txt
1
2
3
$ cat data2.txt
4
5
6
(7-9 in data3.txt, and presuming you have an equal number of records in each input file)
You could do:
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $1 : $1; if (FNR>n) n=FNR} END {for (i=1; i<=n; i++) print a[i]}' data1.txt data2.txt data3.txt
(using a tab above with "\t" for the separator between columns of the output file -- you can change to suit your needs; the rows are printed with a counted loop because awk does not guarantee the traversal order of for (i in a))
The result of the command above would be:
1 4 7
2 5 8
3 6 9
(note: this is what you would get with paste data1.txt data2.txt data3.txt, but presuming you have input that is giving paste problems, awk may be a bit more flexible)
Or using a "," as the separator, you would receive:
1,4,7
2,5,8
3,6,9
If your data file has more fields than a single integer and you want to compile all fields in each file, you can assign $0 to the array instead of the first field $1.
Spaced and formatted in multi-line format (for easier reading), the same awk script would be
awk '
{
# append this file's value to the row with the same record number
a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1
# remember the highest record number seen
if (FNR > n) n = FNR
}
END {
# a counted loop keeps the rows in their original order
for (i = 1; i <= n; i++)
print a[i]
}
' data1.txt data2.txt data3.txt
Look things over and let me know if I misunderstood your question, or if you have further questions about this approach.
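The approach above presumes an equal number of records in every file. If the height files can differ in length, one possible variation is to store each value by (row, column) and pad the gaps when printing. This is a sketch under assumptions: the placeholder "NA", the sample data, and the names workdir, pad, cell, rows and nf are all hypothetical, and no input file is empty.

```shell
workdir=$(mktemp -d)    # hypothetical scratch directory for the demo
printf '1\n2\n3\n' > "$workdir/data1.txt"
printf '4\n5\n'    > "$workdir/data2.txt"   # one record short on purpose
printf '7\n8\n9\n' > "$workdir/data3.txt"

awk -v pad="NA" '
FNR == 1 { nf++ }                    # a new input file starts: next column
{
    cell[FNR, nf] = $1               # store the value by (row, column)
    if (FNR > rows) rows = FNR       # track the longest file
}
END {
    for (r = 1; r <= rows; r++) {
        line = ""
        for (c = 1; c <= nf; c++) {
            v = ((r, c) in cell) ? cell[r, c] : pad
            line = line (c > 1 ? "\t" : "") v
        }
        print line
    }
}' "$workdir/data1.txt" "$workdir/data2.txt" "$workdir/data3.txt" > "$workdir/table.txt"
```

Every row then has the same number of tab-separated columns, which a plotting tool can read consistently; the third row of table.txt here carries the placeholder in its second column.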

Efficient search pattern in large CSV file

I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers; the one by user @anubhava was the one I found most straightforward and elegant. For the sake of clarity, I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to @anubhava).
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
This works as expected. But I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not at even 20% of the whole process.
I am running this on a Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 Cores, 8 Logical Processors with 16GB Memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and Memory at 51%. I am running awk inside a Cygwin64 Terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.
With sorting:
awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)
Or with gawk (no limit on the number of open file descriptors):
gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
This is the right way to do it using any awk:
$ tail -n +2 file | sort -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'
The reason I say this is the right approach is it doesn't rely on the 2nd field of the header line coming before the data values when sorted, doesn't require awk to test NR > 1 for every line of input, doesn't need an array to store $2s or any other values, and only keeps 1 output file open at a time (the more files open at once the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks as gawk then has to start opening/closing the files in the background as needed). It also doesn't require you to empty existing output files before you run it, it will do that automatically, and it only does string concatenation to create the output file once per output file, not once per line.
Just like the currently accepted answer, the sort above could reorder the input lines that have the same $2 value - add -s if that's undesirable and you have GNU sort, with other sorts you need to replace the tail with a different awk command and add another sort arg.
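As a small illustration of the stable-sort variant mentioned above, here is a sketch run on a toy version of the question's data. It assumes a sort that supports -s (GNU sort does); the names workdir, dir, prev and out, the row order, and the extra output-directory variable are choices made for this example only.

```shell
workdir=$(mktemp -d)    # hypothetical scratch directory for the demo
cat > "$workdir/dataset.csv" <<'EOF'
action,action_type, Result
up,1,stringA
left,2,stringC
down,1,stringB
EOF

# Drop the header, sort stably on the numeric second field, then let awk
# switch output files only when the category changes.
tail -n +2 "$workdir/dataset.csv" |
sort -s -t, -k2,2n |
awk -F, -v dir="$workdir" '
$2 != prev { close(out); out = dir "/" $2 "_dataset.csv"; prev = $2 }
{ print > out }'
```

With -s, rows that share a category keep their original relative order, so 1_dataset.csv lists the "up" row before the "down" row. Only one output file is open at any time.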

Print lines of a compressed gz file based on another index file

Need to print specific lines of a large txt.gz file, using an index file
Hi all,
I found several examples for printing specific lines of non-compressed files but could not find any solution for a very large gz file.
My index file (idx.txt) looks like this, and contains 700,000 indices:
1745
1746
7379
13920
13921
16681
16682
...
...
...
54830241
54867703
54867710
I would like to retrieve all these 700,000 lines in my other source file, which is a very large compressed CSV file with 55,000,000 rows and looks like this:
100035243,2,"Chronic obstructive pulmonary disease","SS","LETAIRIS","AMBRISENTAN","","Dyspnoea",NA,73,"F","","","CN"
100035672,1,"Myeloproliferative disorder","PS","JAKAFI","RUXOLITINIB","ORAL","Platelet count increased",20131206,48.501,"F","79.37","KG","OT"
100035914,1,"Multiple sclerosis","PS","GILENYA","FINGOLIMOD HYDROCHLORIDE","ORAL","Lymphocyte count decreased",20130718,47.154,"F","","","OT"
....
What I tried so far:
sed -nf idx.txt <(gzip -dc gzfile.gz) > output.txt
awk 'NR==FNR{i[$0];next}i[FNR]' idx.txt <(gzip -dc gzfile.gz) > output.txt
Both are very slow.
Any thoughts?
IMHO your awk code looks OK, but there may be one way to speed it up. I am not sure (your samples are unclear, so I did not test this), but if the last entry in idx.txt is far smaller than the total number of lines in the .gz file, you can exit awk after that line instead of reading the rest of the input. Try it:
awk 'NR==FNR{i[$0]=$0;last=$0;next} i[FNR]{print} FNR!=NR && FNR>last{exit}' idx.txt <(gzip -dc gzfile.gz) > output.txt
What I am doing is creating a variable named last whose value is the last line of idx.txt. The final condition then checks whether the current line number is greater than that value and, if so, exits.
EDIT: Changed OP's code from i[$0] to i[$0]=$0 in the first condition, since the condition i[FNR] only works when the elements of array i have non-empty values. Changed after a user pointed this out in the comments.
PS: This will save time only if there is a large gap between the last value in idx.txt and the total number of lines in the .gz file. I am going by your statement that your data is very large.
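A possible variation on the same idea tests membership with the in operator, which does not create a new array element for every line of the source the way i[FNR] does. This is a sketch on a toy file, assuming idx.txt is sorted in ascending order; the names workdir, want and last and the sample contents are hypothetical.

```shell
workdir=$(mktemp -d)    # hypothetical scratch directory for the demo
printf '%s\n' a b c d e f g h > "$workdir/src.txt"   # toy 8-line source
gzip "$workdir/src.txt"                              # produces src.txt.gz
printf '2\n3\n6\n' > "$workdir/idx.txt"              # wanted line numbers

gzip -dc "$workdir/src.txt.gz" |
awk '
NR == FNR { want[$0]; last = $0; next }   # index file: remember line numbers
(FNR in want)                             # data: print the wanted lines
FNR >= last + 0 { exit }                  # stop at the last wanted line
' "$workdir/idx.txt" - > "$workdir/output.txt"
```

Here awk reads the index file first and then the decompressed stream from standard input (the - argument), and it exits as soon as the last wanted line has been printed.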
Both the sed and awk solutions look good. The sed one is probably faster than the awk one, and they are probably about the fastest you can get. To reduce the time further, reduce the input size. (Note that sed -nf idx.txt needs each line of idx.txt to be a complete sed command such as 1745p; bare line numbers alone are not a valid sed script.)
One extra thing you can do is stop reading after the last printed line. If you know that line is far from the end of the file, you can avoid a lengthy decompression:
sed -nf idx.txt <(gzip -dc gzfile.gz | head -n "$(sort -nr idx.txt | head -1)") > output.txt

Performant way of displaying the number of unique column entries in a set of files?

I'm attempting to pipe a large amount of files in to a sequence of commands which displays the number of unique entries in a given column of said files. I'm inexperienced with the shell, but after a short while I was able to come up with this:
awk '{print $5 }' | sort | uniq | wc -l
This sequence of commands works fine for a small amount of files, but takes an unacceptable amount of time to execute on my target set. Is there a set of commands that can accomplish this more efficiently?
You can count unique occurrences of values in the fifth field in a single pass with awk:
awk '{if (!seen[$5]++) ++ctr} END {print ctr}'
This creates an array of the values in the fifth field and increments the ctr variable if the value has never been seen before. The END rule prints the value of the counter.
With GNU awk, you can alternatively just check the length of the associative array in the end:
awk '{seen[$5]++} END {print length(seen)}'
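For a whole set of files, the inputs need to reach a single awk process so that the seen array is shared across them. One way to arrange that is sketched below, assuming the files can be selected with find; the directory, the *.log pattern and the sample contents are made up for the illustration.

```shell
workdir=$(mktemp -d)    # hypothetical directory holding the input files
printf 'a b c d x\na b c d y\n' > "$workdir/one.log"
printf 'a b c d x\na b c d z\n' > "$workdir/two.log"

# Concatenate every file into ONE awk process so the seen[] table is shared.
find "$workdir" -type f -name '*.log' -exec cat {} + |
awk '!seen[$5]++ { n++ } END { print n + 0 }' > "$workdir/unique_count.txt"
```

The n + 0 in the END rule makes the script print 0 rather than an empty line when there is no input at all.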
Benjamin has supplied the good oil, but depending on just how much data is to be stored in the array, it may pay to pass the data to wc anyway:
awk '!_[$5]++' file | wc -l
The shortest and fastest I could manage using awk, though not far from the previous version by @BenjaminW. I think it is a bit faster (the difference would only matter on a very large file) because the test is made earlier in the process:
awk '!E[$5]++{c++}END{print c}' YourFile
This works with all awk versions.
GNU datamash has a countunique operation for columns:
datamash -W countunique 5

Bash script compare values from 2 files and print output values from one file

I have two files like this;
File1
114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf
114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
114.4.21.213,cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
File2
cl_id=1B3O7M6C8T4O1b559i2g930m0_1165d
cl_id=1X3J7M6J0W5S9535180h90302_101p5
cl_id=1G3D7X6V6A7R81356e3g527m9_101nl
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
cl_id=1Q3Y7Q7J0M3E62953e5g3g5k0_117p6
I want to find the cl_id values that exist in file1 but not in file2 and print the first field (the IP address) of those file1 lines.
The output should look like this:
114.4.21.198
114.4.21.205
114.4.21.205
114.4.21.213
114.4.23.70
114.4.21.201
114.4.21.211
120.172.168.36
I have tried awk, grep, diff and comm, but nothing came close. Please tell me the correct command to do this.
thanks
One proper way to do that is this:
grep -vFf file2 file1 | sed 's|,cl_id.*$||'
I do not see how you get your output. Where does 120.172.168.36 come from?
Here is one solution to compare
awk -F, 'NR==FNR {a[$0]++;next} !($2 in a) {print $1}' file2 file1
With your sample files, only the second line of file1 has a cl_id that is missing from file2:
114.4.21.205
Feed both files into AWK or perl with field separator=",". If there are two fields, add the fields to a dictionary/map/two arrays/whatever ("file1Lines"). If there is just one field (this is file 2), add it to a set/list/array/whatever ("file2Lines"). After reading all input:
Loop over the file1Lines. For each element, check whether the key part is present in file2Lines. If not, print the value part.
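The steps described above could be sketched in awk roughly as follows. This is one possible reading of the description, not a tested solution for the full data: the sample lines are a subset of the question's files, and the names workdir, ip, key, in_file2 and missing.txt are hypothetical.

```shell
workdir=$(mktemp -d)    # hypothetical scratch directory for the demo
cat > "$workdir/file1" <<'EOF'
114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf
114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
EOF
cat > "$workdir/file2" <<'EOF'
cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
EOF

# Feed both files in together; the field count tells them apart.
cat "$workdir/file1" "$workdir/file2" |
awk -F, '
NF == 2 { ip[++n] = $1; key[n] = $2; next }   # two fields: a file1 line
NF == 1 { in_file2[$1] }                      # one field: a file2 line
END {
    # print the IP of every file1 line whose cl_id never appeared in file2
    for (i = 1; i <= n; i++)
        if (!(key[i] in in_file2)) print ip[i]
}' > "$workdir/missing.txt"
```

Because the decision is deferred to the END rule, the order in which the two files are fed in does not matter, at the cost of holding both in memory.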
This seems like what you want to do and might work, efficiently:
grep -Ff file2.txt file1.txt | cut -f1 -d,
First the grep takes the lines from file2.txt to use as patterns, and finds the matching lines in file1.txt. The -F is to use the patterns as literal strings rather than regular expressions, though it doesn't really matter with your sample.
Finally the cut takes the first column from the output, using , as the column delimiter, resulting in a list of IP addresses.
The output is not exactly the same as your sample, but the sample didn't make sense anyway, as it contains text that was not in any of the input files. Not sure if this is what you wanted or something more.
