Getting specific lines of a file - bash

I have a file with 25 million rows, and I want to extract a specific 10 million lines from it.
I have the indices of these lines in another file. How can I do it efficiently?

Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read wanted
do
    while ((current < wanted))
    do
        if read -u 3 line
        then ((current++))
        else break 2
        fi
    done
    echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
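For example, if the loop above is saved as extract-lines.sh (name assumed), a quick sanity check might look like this:
$ seq 100 200 > data-file
$ printf '%s\n' 3 7 12 > list-of-lines
$ bash extract-lines.sh
102
106
111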

Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
output.txt will contain the wanted lines. I am not sure if this is efficient enough, but it seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) in my environment.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-adjusted with leading zeroes and width 8 (since the number of rows in input.txt is known to be 25M).
The 2nd command prints the rows with line numbers in exactly the same zero-padded format, then joins the two streams to get the wanted rows (cut removes the line-number field).
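To see the intermediate formats, here is a tiny hypothetical run (3-line input, wanting lines 1 and 3):
$ printf 'alpha\nbeta\ngamma\n' > input.txt
$ printf '1\n3\n' > no.txt
$ awk '{printf "%08d\n", $1}' no.txt > no.1.txt    # contains 00000001 and 00000003
$ nl -n rz -w 8 input.txt | join - no.1.txt
00000001 alpha
00000003 gamma
$ nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement
alpha
gamma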

Since you said the file with the line numbers you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
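A quick way to sanity-check it (hypothetical test data):
$ seq 1 100 > big_file.txt
$ printf '%s\n' 3 17 42 > line_numbers.txt
$ awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
3
17
42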

Say your index file is index.txt and your data file is data.txt; then you can do it using sed as follows:
#!/bin/bash
while read -r line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt

You could run a loop that reads through the 25-million-line file and, whenever the loop counter reaches a line number that you want, writes that line out. For example:
// br: a BufferedReader over the data file; indice: the wanted line number
String line = "";
int count = 0;
while ((line = br.readLine()) != null)
{
    if (count == indice)
    {
        System.out.println(line); // or file write
    }
    count++;
}
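For comparison, the same counting idea can be done in a single awk pass over the data file (a sketch, assuming the list of wanted line numbers fits comfortably in memory; filenames as in the first answer):
awk 'NR==FNR {want[$1]; next} FNR in want' list-of-lines data-file
It loads list-of-lines into an array, then prints every line of data-file whose line number appears in that array.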

Related

Split large csv file into multiple files and keep header in each part

How to split a large csv file (1GB) into multiple files (say one part with 1000 rows, a 2nd part with 10000 rows, a 3rd part with 100000 rows, etc.) and preserve the header in each part?
How can I achieve this:
h1 h2
a aa
b bb
c cc
.
.
12483720 rows
into
h1 h2
a aa
b bb
.
.
.
1000 rows
And
h1 h2
x xx
y yy
.
.
.
10000 rows
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 {         # first record:
    n=1000      # set first output file size and
    h=$0        # store the header
}
{
    print > n   # output to file
}
NR==n+c {       # once target NR has been reached. close(n) goes here if needed
    n*=10       # grow target magnitude
    c=NR-1      # set the correction factor.
    print h>n   # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?
awk -v l=1000 '(NR==1){
    header=$0
    c=sprintf("%0.5d",c+1); file=FILENAME; sub(/csv$/,c".csv",file)
    print header > file
    next
 }
 (n==l) {
    c=sprintf("%0.5d",c+1);
    close(file); file=FILENAME; sub(/csv$/,c".csv",file)
    print header > file
    n=0;l*=10
 }
 {print $0 > file; n++}' file.csv
This works in the following way:
(NR==1){...next}: if the record/line is the first line, save it as the header and open the first output file (file.00001.csv), writing the header to it.
(n==l){...}: every time we have written the requested number of records/lines, we need to start writing to a new file. This happens whenever n==l, and we perform the following actions:
c=sprintf("%0.5d",c+1): increase the counter by one and format it as 000xx
close(file): close the file we just wrote to
file=FILENAME; sub(/csv$/,c".csv",file): define the new filename
print header > file: open the new file and write the header to it
n=0: reset the current record count
l*=10: increase the maximum record count for the next file
{print $0 > file; n++}: write the record to the current file and increment the record count
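A hypothetical check: generate a CSV with a header plus 11,500 data rows, run the awk command above on it, and count the lines of the output files (the first part gets 1000 data rows, the second 10000, the third the remaining 500, each topped with the header):
$ { echo "h1,h2"; seq 11500 | sed 's/.*/&,&/'; } > file.csv
$ wc -l file.*.csv
 1001 file.00001.csv
10001 file.00002.csv
  501 file.00003.csv
11503 total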
Hacky, but this utilizes the split utility, which does most of the heavy lifting of splitting the files. Then, with the split files following a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and move that file back to the original filename.
# Use the `split` utility to split the csv file, with 5000 lines per file,
# adding numerical suffixes, and adding the additional suffix '.split' to help id
# files.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv
# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'
# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'
# Walk the files that do not have headers. For each one, cat the header from
# file with header, with rest of body, output to tmp.txt, then mv tmp.txt to
# original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
mv tmp.txt $f
done
Here's a first approach:
#!/bin/bash
head -1 "$1" > header
tail -n +2 "$1" | split - y
for f in y*; do
    cp header "h$f"
    cat "$f" >> "h$f"
done
rm -f header
rm -f y*
The following bash solution should work nicely:
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}
We have a first iteration level which produces the number of rows we're going to write to each successive file. It generates the file names and writes the header to them. It is an infinite loop because we don't check how many lines the input has and therefore don't know beforehand how many files we're going to write, so we'll have to break out of this loop to end it.
Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this loop we try to read a line from the input. If it succeeds, we write it to the current output file; if it doesn't (we've reached the end of the input), we break out of both levels of loop.
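For instance, if the loop above is saved as split_growing.sh (name assumed), it reads the CSV from standard input and writes files named file_with_1000_rows, file_with_10000_rows, and so on:
$ bash split_growing.sh < file.csv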

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the two last numbers vary. The penultimate number takes 1, 50, 100, 150,...,300, and the last number ranges from 1,2,3,4,5...,10. A sample of these files are in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result might have a row for each pair of averaged 2nd columns of sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds with the sys and known files with the same two last numbers.
Besides, I would like the first column to contain the penultimate number of the files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (Though you can easily swap out to awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
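If datamash is not available, the same per-file mean of column 2 can be computed with the awk one-liner from the question (a drop-in sketch for the two datamash lines inside the loop):
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$systime" >> "$sysmeans"
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$knownratio" >> "$knownmeans"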
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" |sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=(FILENAME ~ /sys-time/ ? "sys" : "known"); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if(type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, clear any existing average value, reset the running line count, and save the type as either "sys" or "known".
On every line, calculate the Cumulative Moving Average
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
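For intuition, this is how the running average in step #3 evolves for a hypothetical file whose second column holds 2, 4, 9 (relying on gawk evaluating the expression left to right, so c is already incremented when it is used as the divisor):
line 1: (2 + 0*0)/1 = 2
line 2: (4 + 2*1)/2 = 3
line 3: (9 + 3*2)/3 = 5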

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order, so that I can pull all similar lines (e.g. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of each "cluster${i}.txt" file in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would look like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
  {
    if($1 != last) {
      print $0 >"output.txt"
      last=$1
    }
    of="cluster" $1 ".txt";
    print $0 >of
  }'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
    i=0
    until [[ $i -ge 20 ]]; do
        awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
        i=$((i+1))
    done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is somewhat less awful.
for (( i=0; i<=20; i++ )); do
    awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
    head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.
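For comparison, a sketch of the two patterns (filenames as in the loop above):
for (( i=0; i<=20; i++ )); do head -n 1 "cluster${i}.txt"; done > output.txt    # output.txt opened once
for (( i=0; i<=20; i++ )); do head -n 1 "cluster${i}.txt" >> output.txt; done   # reopened every iteration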

Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file with only the data I need in it. It is a tab deliminated file and there are row and column headers. I basically want to select a subset of the data that has a given column and/or row name.
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
I'm planning to have a file with a list of the columns I want.
colname_1 colname_3
I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew in advance which column numbers they wanted, and I don't. Sorry if this is a repeat question; I tried to search.
I would want the result to be
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Bash works best as "glue" between standard command-line utilities. You can write loops which read each line in a massive file, but it's painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities -- grep, tr, cut and paste -- to achieve this goal.
For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)
$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3
An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.
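For example:
$ printf '%s\n' one two three
one
two
three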
Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep to tell us, by using the -n option (which prefixes the output with the line number of the match).
Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4
Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.
So let's look up the column names. Here, we will use several grep options:
-F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
-x The pattern must match the complete line.
-n The output should be prefixed by the line number of the match.
If we have Gnu grep, we could also use -f columns to read the patterns from the file named columns. Or if we're using bash, we could use the bashism "$(<columns)" to insert the contents of the file as a single argument to grep. But for now, we'll stay Posix compliant:
$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3
OK, that's pretty close. We just need to get rid of everything other than the line number; comma-separate the numbers, and put a 1 at the beginning.
$ { echo 1
> grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4
cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
cut -d: Use : instead of tab as a field separator ("delimiter")
paste -s Concatenate the lines of a single file instead of corresponding lines of several files
paste -d, Use a comma instead of tab as a field separator.
So now we have the argument we need to pass to cut in order to select the desired columns:
$ cut -f"$({ echo 1
> head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns
> } | cut -f1 -d: | paste -sd,)" giga.dat
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
You can do this by keeping track of the array indexes of the columns whose names match the names in your column list file. After you have found, in the data file's header, the array indexes for the column names in your column list file, you simply read the data file (beginning at the second line) and output the row label plus the data for the columns at those indexes.
There are probably several ways to approach this and the following assumes the data in each column does not contain any whitespace. The use of arrays presumes bash (or other advanced shell supporting arrays) and not POSIX shell.
The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:
#!/bin/bash
declare -a cols ## array holding original columns from original data file
declare -a csel ## array holding columns to select (from file 2)
declare -a cpos ## array holding array indexes of matching columns
cols=( $(head -n 1 "$1") )  ## fill cols from 1st line of data file
csel=( $(< "$2") )          ## read select columns from file 2
## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
    for ((j = 0; j < ${#cols[@]}; j++)); do
        [ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
    done
done
printf " "
for ((i = 0; i < ${#csel[@]}; i++)); do   ## output header row
    printf " %s" "${csel[i]}"
done
printf "\n"                               ## output newline
unset cols                                ## unset cols to reuse in reading lines below
while read -r line; do                    ## read each data line in data file
    cols=( $line )                        ## separate into cols array
    printf "%s" "${cols[0]}"              ## output row label
    for ((j = 0; j < ${#cpos[@]}; j++)); do
        [ "$j" -eq "0" ] && {             ## handle format for first column
            printf "%5s" "${cols[$((${cpos[j]}+1))]}"
            continue
        }                                 ## output remaining columns
        printf "%13s" "${cols[$((${cpos[j]}+1))]}"
    done
    printf "\n"
done < <( tail -n+2 "$1" )
Using your example data as follows:
Data File
$ cat dat/col+data.txt
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
Column Select File
$ cat dat/col.txt
colname_1 colname_3
Example Use/Output
$ bash colnum.sh dat/col+data.txt dat/col.txt
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Give it a try and let me know if you have any questions. Note, bash isn't known for its blinding speed handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.

Bash - select lines of a file based on values in another file

I have 2 files; let's call them file1 and file2. file1 contains a start and an end coordinate in each row, e.g.:
start end
2000 2696
3465 3688
8904 9546
etc.
file2 has several columns, of which the first is the most relevant for the question:
position v2 v3 v4
3546 value12 value13 value14
9847 value22 value23 value24
12000 value32 value33 value34
Now, I need to output a new file which contains only the lines of file2 for which the 'position' value (1st column) is between the 'start' and 'end' values of any of the rows of file1. In R I'd just make a double loop, but it takes too much time (the files are large), so I need to do it in bash. In case the question is unclear, here's the R loop that would do the job:
file2$select <- 0
for(i in 1:dim(file1)[1]){
  for(j in 1:dim(file2)[1]){
    if(file2[j,1] > file1$start[i] & file2[j,1] < file1$end[i]) file2$select[j] <- 1
  }
}
Very sure there's a simple way of doing this using bash / awk...
The awk will look like this, but you'll need to remove the first line from file1 and file2 first:
awk 'FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
The bit in curly braces after "FNR==NR" only applies to the processing of file1; it says to store field 1 in array x[] and field 2 in array y[] so we have the lower and upper bounds of each range. The bit in the second set of curly braces applies to processing file2 only. It says to iterate through all the bounds in arrays x[] and y[], check whether field 1 is between the bounds, and print the whole record if it is.
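With the sample data from the question (header lines removed), only position 3546 falls inside one of the ranges, so a hypothetical run looks like:
$ awk 'FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
3546 value12 value13 value14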
If you don't want to remove the header line at the start, you can make the awk a little more complicated and ignore it like this:
awk 'FNR==1{next}FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
EDITED
Ok, I have added code to check "chromosome" (whatever that is!) assuming it is in the first field in both files, like this:
File1
x 2000 2696
x 3465 3688
x 8904 9546
File2
x 3546 value12 value13 value14
y 3467 value12 value13 value14
x 9847 value22 value23 value24
x 12000 value32 value33 value34
So the code now stores the chromosome in array c[] as well and checks they are equal before outputting.
awk 'BEGIN{i=0}FNR==NR{c[i]=$1;x[i]=$2;y[i++]=$3;next}{for(j=0;j<i;j++){if(c[j]==$1&&$2>=x[j]&&$2<=y[j]){print $0;next}}}' file1 file2
Don't know how to do this in bash...
I would try a perl script, reading the first file and storing it in memory (if it's possible, it depends on its size) and then going through the second file line by line and doing the comparisons to output the line or not.
I think you can do this in R too, the same way: storing the first file, then looping over each line of the second file.
Moreover, if the intervals don't overlap, you can sort the files to speed up your algorithm.
This should be faster than the for loop:
res <- apply(file2, 1, function(row)
{
  any(as.numeric(row["position"]) > file1$start & as.numeric(row["position"]) < file1$end)
})
Assuming the delimiters for the files are spaces (if not, change the -d setting).
The script uses cut to extract the first field of file2.
Then a simple grep searches for the field in file1. If present, the line from file2 is printed.
#!/bin/bash
while read line
do
    word=$(echo "$line" | cut -f1 -d" ")
    if grep -q "$word" file1; then
        echo "$line"
    fi
done < file2
