Randomly sample lines retaining commented header lines - bash

I'm attempting to randomly sample lines from a (large) file while always retaining a set of "header lines". Header lines are always at the top of the file and, unlike any other lines, begin with a #.
The actual file format I'm dealing with is VCF, but I've kept the question general.
Requirements:
Output all header lines (identified by a # at line start)
The command / script should (have the option to) read from STDIN
The command / script should output to STDOUT
For example, consider the following sample file (file.in):
#blah de blah
1
2
3
4
5
6
7
8
9
10
An example output (file.out) would be:
#blah de blah
10
2
5
3
4
I have a working solution (in this case selecting 5 non-header lines at random) using bash. It is capable of reading from STDIN (I can cat the contents of file.in into the rest of the command); however, it writes to a named file rather than STDOUT:
cat file.in | tee >(awk '$1 ~ /^#/' > file.out) | awk '$1 !~ /^#/' | shuf -n 5 >> file.out

By using process substitution (thanks Tom Fenech), both commands are seen as files.
Then using cat we can concatenate these "files" together and output to STDOUT.
cat <(awk '/^#/' file) <(awk '!/^#/' file | shuf -n 10)
Input
#blah de blah
1
2
3
4
5
6
7
8
9
10
Output
#blah de blah
1
9
8
4
7
2
3
10
6
5
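The command above reads the named file twice, so it cannot take its input from a pipe. A minimal sketch of a single-pass variant that reads STDIN and writes STDOUT could look like the following. The function name sample_keep_header is made up for illustration, and it assumes (as the question states) that all header lines come before any data line.

```bash
# Hypothetical helper: print every leading "#" line as-is, then a random
# sample of n of the remaining lines. Reads STDIN, writes STDOUT.
sample_keep_header() {
    local n=$1 line first_data='' have_data=0
    # Copy header lines through until the first non-header line appears.
    while IFS= read -r line; do
        case $line in
            '#'*) printf '%s\n' "$line" ;;
            *)    first_data=$line; have_data=1; break ;;
        esac
    done
    # Hand the line already consumed, plus the rest of STDIN, to shuf.
    if [ "$have_data" -eq 1 ]; then
        { printf '%s\n' "$first_data"; cat; } | shuf -n "$n"
    fi
}
```

Usage would be something like `cat file.in | sample_keep_header 5 > file.out`. It relies on shuf from GNU coreutils, as the original command does.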

Related

Cannot print in awk command in bash script

I am trying to read values from a file and print specific items into a variable which I will use later.
cat /dir1/file1 | while read blmbline2
do
BLMBFILE2=`print $blmbline2 | awk '{$1=""; print $0}'`
echo $BLMBFILE2
done
When I run that same code at the command line, it runs as expected, but, when I run it in a bash script called testme.sh, I get this error:
./testme.sh: line 3: print: command not found
If I run print by itself at the command prompt, I don't get an error (just a blank line).
If I run "bash" and then print at the command prompt, I get command not found.
I can't figure out what I'm doing wrong. Can someone suggest?
updated: I see some other posts that say to use echo or printf? Is there a difference I need to be concerned with in using one of those in bash?
Since awk can read files, you may be able to do away with the cat | while read and just use awk. Using a sample file containing:
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
Declare your bash array variable and populate with the output from awk:
arr=() ; arr=($(awk '{$1=""; print $0}' /dir1/file1))
Use the following to display array size and contents:
printf "array length: %d\narray contents: %s\n" "${#arr[@]}" "${arr[*]}"
Output:
array length: 30
array contents: 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6
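Change print to echo in your shell script. print is a built-in of ksh and zsh, not of bash, which would explain why it works at your interactive prompt (presumably one of those shells) but not inside a bash script. echo simply writes its arguments, while printf lets you format the data. Also, create an array so you can store multiple items:
BLMBFILE2=()
while IFS= read -r line
do
    # unquoted on purpose: word splitting stores each remaining field as its own element
    BLMBFILE2+=( $(echo "$line" | awk '{$1=""; print $0}') )
    echo "${BLMBFILE2[*]}" # items collected so far
done < /dir1/file1
echo "Items found:"
for value in "${BLMBFILE2[@]}"
do
    echo "$value"
done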
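If the goal is only to drop the first field of each line, the loop can also be sketched without awk, since read can split off the first word itself. The function name drop_first_field is made up for illustration. Note one difference from the awk version: awk's $1="" leaves a leading space on each line, while read strips it.

```bash
# Hypothetical helper: print each input line without its first
# whitespace-separated field. "first" takes the first word and
# "rest" receives everything after it.
drop_first_field() {
    local first rest
    while read -r first rest; do
        printf '%s\n' "$rest"
    done
}
```

It would be used as `drop_first_field < /dir1/file1`. printf '%s\n' is used rather than echo so that a line such as -n is printed literally.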

How to print 1-10,11-20 and so on number of rows of a file in loop using shell? [closed]

I have a file consisting of 4000 rows. I need to iterate over the records of that file in a shell script, extract the first 10 rows and send those rows to my Java code, which I have already written, then the next 10 rows, and so on.
To pass 10 lines at a time as arguments to your script:
< file xargs -d$'\n' -n 10 myscript
To pipe 10 lines at a time as input to your script:
< file xargs -d$'\n' -n 10 sh -c 'printf "%s\n" "$@" | myscript' {}
Assuming your input is in a file named file which I'm creating with 30 instead of 4000 lines of input:
$ seq 30 > file
and modifying to have some lines that contain spaces, some that contain shell variables, and some that contain regexp and globbing chars to show no type of shell expansion is being done:
$ head -10 file
1
here is a multi-field line
3
4
$HOME
6
.*
8
9
10
Here's 10 args at a time being passed to an awk script:
$ < file xargs -d$'\n' -n 10 awk 'BEGIN{for (i=1; i<ARGC; i++) print i, "<" ARGV[i] ">"; exit} END{print "---"}'
1 <1>
2 <here is a multi-field line>
3 <3>
4 <4>
5 <$HOME>
6 <6>
7 <.*>
8 <8>
9 <9>
10 <10>
---
1 <11>
2 <12>
3 <13>
4 <14>
5 <15>
6 <16>
7 <17>
8 <18>
9 <19>
10 <20>
---
1 <21>
2 <22>
3 <23>
4 <24>
5 <25>
6 <26>
7 <27>
8 <28>
9 <29>
10 <30>
---
and here's 10 lines of input at a time being passed to an awk script:
$ < file xargs -d$'\n' -n 10 sh -c 'printf "%s\n" "$@" | awk '\''{print NR, "<" $0 ">"} END{print "---"}'\''' {}
1 <1>
2 <here is a multi-field line>
3 <3>
4 <4>
5 <$HOME>
6 <6>
7 <.*>
8 <8>
9 <9>
10 <10>
---
1 <11>
2 <12>
3 <13>
4 <14>
5 <15>
6 <16>
7 <17>
8 <18>
9 <19>
10 <20>
---
1 <21>
2 <22>
3 <23>
4 <24>
5 <25>
6 <26>
7 <27>
8 <28>
9 <29>
10 <30>
---
Assuming the OP wants to pass the lines as arguments to their program, you could try the following (untested, since I don't have the OP's java code):
awk '
FNR%10==0{
  # the 10th line completes a group: run the program, then start a new group
  system("your_java_code " value OFS $0)
  value=""
  next   # skip the block below so this line is not carried into the next group
}
{
  value=(value?value OFS:"")$0
}
END{
  if(value){
    system("your_java_code " value)
  }
}
' Input_file
OR
awk '
{
value=(value?value OFS:"")$0
}
FNR%10==0{
system("your_java_code " value)
value=""
}
END{
if(value){
system("your_java_code " value)
}
}
' Input_file
PS: To be on the safe side, I kept the END section of the awk code so that if there are leftover lines (say the total number of lines is not evenly divisible by 10), the java program is still called with the remaining lines.
This might work for you (GNU parallel):
parallel -kN10 javaProgram :::: file
This will pass lines 1-10, 11-20, ... as arguments to javaProgram.
If you want to pass 10 lines at a time, use:
parallel -kN10 --cat javaProgram :::: file
Sounds to me like you want to slice out rows from a file, then pipe those rows to java. This interpretation differs from the other answers, so let me know if I'm not understanding you:
$ file=/etc/services
$ count=$(wc -l < "${file}")
$ start=1
$ stride=10
$ for ((i=start; i<=count; i+=stride)); do
awk -v i="${i}" -v stride="${stride}" \
'NR > (i+stride) { exit } NR >= i && NR < (i + stride)' "${file}" \
| java ...
done
file holds the path to the data rows. count is the total count of rows in that file. start is the first row, stride is how many you want to slice out in each iteration.
The for loop then performs the stride addition, while awk slices out the rows so numbered. We pipe them to the java program on standard in.
Assuming that you are passing the groups of 10 lines from your file to your script as command line arguments, this is one way to do it:
rows=4000 # the number of rows in file
groupsize=10 # the number of lines in each group
OIFS="$IFS"; IFS=$'\n' # split on newlines only, so `for` does not split lines on spaces
groups=$((rows / groupsize)) # the number of groups of lines
for i in $(seq 1 $groups); do # loop through each group of lines
    from=$(( (i - 1) * groupsize + 1 ))
    to=$(( i * groupsize ))
    arguments="" # start each group with an empty argument list
    # build the arguments for each script invocation by concatenating each group of lines
    for line in $(sed -n -e "${from},${to}p" file); do # 'file' is your input file name
        arguments="$arguments \"$line\""
    done
    echo script $arguments # remove echo and change 'script' with your script name
done
IFS="$OIFS" # restore original input field separator
Like this :
for ((i=1; i<=4000; i+=10)); do
    arr=( ) # create a new empty array
    for ((j=i; j<i+10; j++)); do
        arr+=( "$j" ) # add line number to array
    done
    printf '%s\n' "${arr[@]}" # or execute a command with all the line numbers
done
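For comparison, here is a rough sketch of the same batching in plain bash, collecting the actual lines rather than line numbers and reading the file only once. The function name in_batches is made up for illustration, and myscript in the usage line stands in for your own program.

```bash
# Hypothetical helper: read STDIN and run the given command once per
# batch of "size" lines, passing the lines as separate arguments.
in_batches() {
    local size=$1; shift
    local -a batch=()
    local line
    while IFS= read -r line; do
        batch+=("$line")
        if (( ${#batch[@]} == size )); then
            # </dev/null keeps the command from consuming the remaining input
            "$@" "${batch[@]}" </dev/null
            batch=()
        fi
    done
    # Run once more for a final, shorter batch if any lines are left over.
    if (( ${#batch[@]} > 0 )); then
        "$@" "${batch[@]}" </dev/null
    fi
}
```

Usage would be something like `in_batches 10 myscript < file`. Because each line is stored and passed quoted, lines containing spaces, $HOME or .* reach the command unchanged.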

SED to spit out nth and (n+1)th lines

EDITS: For reference, "stuff" is a general variable, as is "KEEP".
KEEP could be "Hi, my name is Dave" on line 2 and "I love pie" on line 7. The numbers I've put here are for illustration only and DO NOT show up in the data.
I had a file that needed to be parsed, keeping every 4th line, starting at the 3rd line. In other words, it looked like this:
1 stuff
2 stuff
3 KEEP
4
5 stuff
6 stuff
7 KEEP
8 stuff etc...
Great, sed solved that easily with:
sed -n -e 3~4p myfile
giving me
3 KEEP
7 KEEP
11 KEEP
Now I have a different file format and a different take on the pattern:
1 stuff
2 KEEP
3 KEEP
4
5 stuff
6 KEEP
7 KEEP etc...
and I still want the output of
2 KEEP
3 KEEP
6 KEEP
7 KEEP
10 KEEP
11 KEEP
Here's the problem - this is a multi-pattern "pattern" for sed. It's "every 4th line, spit out 2 lines, but start at line 2".
Do I need to have some sort of DO/FOR loop in my sed, or do I need a different command like awk or grep? Thus far, I have tried formats like:
sed -n -e '3~4p;4~4p' myfile
and
awk 'NR % 3 == 0 || NR % 4 ==0' myfile
and
sed -n -e '3~1p;4~4p' myfile
and
awk 'NR % 1 == 0 || NR % 4 ==0' myfile
source: https://superuser.com/questions/396536/how-to-keep-only-every-nth-line-of-a-file
If your intent is to print lines 2,3 then every fourth line after those two, you can do:
$ seq 20 | awk 'BEGIN{e[2];e[3]} (NR%4) in e'
2
3
6
7
10
11
14
15
18
19
You were pretty close with your sed:
$ printf '%s\n' {1..12} | sed -n '2~4p;3~4p'
2
3
6
7
10
11
This is the idiomatic way to write it in awk:
$ awk 'NR%4==2 || NR%4==3' file
However, this special case can be shortened to:
$ awk 'NR%4>1' file
This might work for you (GNU sed):
sed '2~4,+1p;d' file
Use a range: the first address gives the starting line and the step (here, line 2 and every 4th line after it). The second address is how many lines following the start of the range to include (here, plus one). Print these lines and delete all others.
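In the generic case, you want to keep lines p to p+q, p+n to p+q+n, p+2n to p+q+2n, and so on. So you can write the following (shown with p=2, q=1, n=4; the NR >= p guard keeps lines before p from matching, because awk's % gives a negative result for a negative left operand):
awk -v p=2 -v q=1 -v n=4 'NR >= p && (NR - p) % n <= q' file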
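A quick way to convince yourself that the variants in these answers select the same lines is to run each of them on seq 12 and compare the results. This is only a sanity-check sketch. It assumes GNU sed (for the first~step and addr,+N addresses), and the generic awk form is written with an explicit NR >= p guard and example values p=2, q=1, n=4.

```bash
# Each command should select lines 2,3,6,7,10,11 of a 12-line input.
a=$(seq 12 | sed -n '2~4p;3~4p')          # GNU sed first~step addresses
b=$(seq 12 | sed '2~4,+1p;d')             # GNU sed range: addr1,+N
c=$(seq 12 | awk 'NR%4==2 || NR%4==3')    # explicit remainders
d=$(seq 12 | awk 'NR%4>1')                # shortened special case
e=$(seq 12 | awk -v p=2 -v q=1 -v n=4 'NR >= p && (NR - p) % n <= q')

# Compare every variant against the first one.
for out in "$b" "$c" "$d" "$e"; do
    if [ "$out" = "$a" ]; then echo same; else echo differs; fi
done
```

On a system with GNU sed, every comparison should report same.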

How to get the length of each word in a column without AWK, sed or a loop? [duplicate]

Is it even possible? I currently have a one-liner to count the number of words in a file. If I output what I currently have it looks like this:
3 abcdef
3 abcd
3 fec
2 abc
This is all done in 1 line without loops and I was thinking if I could add a column with length of each word in a column. I was thinking I could use wc -m to count the characters, but I don't know if I can do that without a loop?
As seen in the title, no AWK, sed, perl.. Just good old bash.
What I want:
3 abcdef 6
3 abcd 4
3 fec 3
2 abc 3
Where the last column is length of each word.
while read -r num word; do
printf '%s %s %s\n' "$num" "$word" "${#word}"
done < file
You can do something like this also:
File
> cat test.txt
3 abcdef
3 abcd
3 fec
2 abc
Bash script
> cat test.txt.sh
#!/bin/bash
while read line; do
items=($line) # split the line
strlen=${#items[1]} # get the 2nd item's length
echo $line $strlen # print the line and the length
done < test.txt
Results
> bash test.txt.sh
3 abcdef 6
3 abcd 4
3 fec 3
2 abc 3

setting awk variables through inlining

I've got this:
./awktest -v fields=`cat testfile`
which ought to set fields variable to '1 2 3 4 5' which is all that testfile contains
It returns:
gawk: ./awktest:9: fatal: cannot open file `2' for reading (No such file or directory)
When I do this it works fine.
./awktest -v fields='1 2 3 4 5'
printing fields at the time of error yields:
1
printing fields in the second instance yields:
1 2 3 4 5
When I try it with 12345 instead of 1 2 3 4 5 it works fine for both, so it's a problem with the white space. What is this problem, and how do I fix it?
This is most likely not an awk question. Most likely, it is your shell that is the culprit.
For example, if awktest is:
#!/bin/bash
i=1
for arg in "$@"; do
printf "%d\t%s\n" $i "$arg"
((i++))
done
Then you get:
$ ./awktest -v fields=`cat testfile`
1 -v
2 fields=1
3 2
4 3
5 4
6 5
You see that the file contents are not being handled as a single word.
Simple solution: use double quotes on the command line:
$ ./awktest -v fields="$(< testfile)"
1 -v
2 fields=1 2 3 4 5
The $(< file) construct is a bash shortcut for `cat file` that does not need to spawn an external process.
Or, read the first line of the file in the awk BEGIN block
awk '
BEGIN {getline fields < "testfile"}
rest of awk program ...
'
./awktest -v fields="`cat testfile`"
#note that:
#./awktest -v fields='`cat testfile`'
#does not work
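The effect of the quotes can be seen without any files by counting how many arguments the shell actually passes. This is a small sketch. count_args and the variable contents are made-up names, with contents standing in for the text of testfile.

```bash
# Hypothetical stand-in for the contents of testfile.
contents='1 2 3 4 5'

# Print how many arguments were received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the value on spaces before the command sees it,
# so "-v", "fields=1", "2", "3", "4", "5" arrive as separate arguments.
unquoted=$(count_args -v fields=$contents)

# Quoted: the whole value stays attached to fields= as one argument.
quoted=$(count_args -v fields="$contents")

# With the quotes in place, awk receives the full string in the variable.
from_awk=$(awk -v fields="$contents" 'BEGIN { print fields }')

echo "unquoted: $unquoted  quoted: $quoted  awk sees: $from_awk"
```

In the unquoted case, awk takes the stray 2 as a file name to read, which matches the "cannot open file `2'" error in the question.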

Resources