I would like to make a loop that takes 10 lines of my input file at a time and writes them to an output file, appending to it rather than overwriting it.
This is some sample data:
FilePath Filename Probability ClassifierID HectorFileType LibmagicFileType
/mnt/Hector/Data/benign/binary/benign-pete/ 01d0cd964020a1f498c601f9801742c1 19 S040PDFv02 data.pdf PDF document
/mnt/Hector/Data/benign/binary/benign-pete/ 0299a1771587043b232f760cbedbb5b7 0 S040PDFv02 data.pdf PDF document
I then count each unique file and show how many of each there are with:
cut -f 4 input.txt | sort | uniq -c | awk '{print $2, $1}' | sed 1d
So ultimately I just need help making a loop that can run that pipeline on 10 lines of data at a time and append the result to an output file.
If I understand correctly, for every block of 10 lines, you are trying to:
Skip the header, i.e. the first line of each block
Count how many times each value of field #4 (ClassifierID) occurs, and output the value plus the count.
Here is an AWK script which will do it:
FNR % 10 != 1 {
    ++count[$4]
}

FNR % 10 == 0 {
    for (i in count) {
        print i, count[i]
        delete count[i]
    }
}
Discussion
The FNR % 10 != 1 block processes every line except lines 1, 11, 21, ..., i.e. the lines you want to skip. This block keeps a count of each value of field $4.
The FNR % 10 == 0 block prints out a summary for that block and resets the count (via delete).
My script does not sort the fields, so the order might be different.
If you want to tally the whole file, not just blocks of 10, then replace FNR % 10 == 0 with END.
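To cover the appending requirement from the question, a minimal invocation (assuming the script above is saved as tally.awk, a name chosen here purely for illustration) could be:
awk -f tally.awk input.txt >> output.txt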
I've recently been working on some lab assignments, and in order to collect and analyze the results I prepared a bash script to automate my job. It was my first attempt to create such a script, so it is not perfect, and my question is about improving it.
Example output of the program is shown below, but I would like to make the script more general so it can serve more purposes.
>>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478
The data I want to collect is always on different lines. My first attempt was to run the same program twice (or more times, depending on the amount of data) and use grep on each run to extract the value I need by its keyword. This is very inefficient, as it should be possible to parse the whole output of a single run, but I could not come up with a way to do it. At the moment the script is:
#!/bin/bash
write() {
    o1=$(./progname args | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
    o2=$(./progname args | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
    o3=$(./progname args | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
    echo "$1 $o1 $o2 $o3"
}

for ((i = 1; i <= 10; i++)); do
    write $i >> times.dat
done
It is worth mentioning that echoing results in one line is crucial, as I am using gnuplot later and having data in columns is perfect for that use. Sample output should be:
1 0.019306 3.369 170620476
2 0.019559 3.375 170620475
3 0.021971 3.334 170620478
4 0.020536 3.378 170620480
5 0.019692 3.390 170620475
6 0.020833 3.375 170620477
7 0.019951 3.450 170620477
8 0.019417 3.381 170620476
9 0.020105 3.374 170620476
10 0.020255 3.402 170620475
My question is: how could I improve the script to collect such data in just one program execution?
You could use awk here to capture the values into an array and later access them by index (0, 1 and 2), in case you want to do this in a single command.
myarr=($(your_program args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
Or use the following to force all the values onto a single line, so that they do not arrive as separate lines when the command substitution is quoted (with ") to preserve newlines in the values.
myarr=($(your_program args | awk '/Total/{val=$NF;next} /cycle/{val=(val?val OFS:"")$NF;next} /Time/{print val OFS $(NF-1)}'))
Explanation: a detailed explanation of the first awk program above.
awk ' ##Starting awk program from here.
/Total/{ ##Checking if a line has Total keyword in it then do following.
print $NF ##Printing last field of that line which has Total in it here.
next ##next keyword will skip all further statements from here.
}
/cycle/{ ##Checking if a line has cycle in it then do following.
print $NF ##Printing last field of that line which has cycle in it here.
next ##next keyword will skip all further statements from here.
}
/Time/{ ##Checking if a line has Time in it then do following.
print $(NF-1) ##Printing 2nd last field of that line which has Time in it here.
}'
To access individual items you could use:
echo ${myarr[0]}, echo ${myarr[1]} and echo ${myarr[2]} for Total, cycle and Time respectively.
An example of accessing all elements in a loop, in case you need it:
for i in "${myarr[@]}"
do
    echo "$i"
done
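To write the values into times.dat as in the question, a sketch of the loop (assuming ./progname args is the real invocation, and that the program output keeps the Total/cycle/Time line order shown in the sample, so the array indices are 0, 1 and 2 respectively) could be:
for ((i = 1; i <= 10; i++)); do
    myarr=($(./progname args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
    echo "$i ${myarr[2]} ${myarr[1]} ${myarr[0]}" >> times.dat   # i, Time, cycle, Total
done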
You can execute your program once and save the output in a variable.
o0=$(./progname args)
Then you can grep that saved string as many times as you like:
o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
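Applied to the write() function from the question, a minimal sketch of this single-run approach (keeping the original grep patterns unchanged) might look like:
write() {
    o0=$(./progname args)    # run the program only once
    o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
    o2=$(echo "$o0" | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
    o3=$(echo "$o0" | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
    echo "$1 $o1 $o2 $o3"
}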
Assumptions:
each of the 3 search patterns (Time, cycle, Total) occurs just once in a set of output from ./progname
format of ./progname output is always the same (ie, same number of space-separated items for each line of output)
I've created my own progname script that just does an echo of the sample output:
$ cat progname
echo ">>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478"
One awk solution to parse and print the desired values:
$ i=1
$ ./progname | awk -v i=${i} ' # assign awk variable "i" = ${i}
/Time/ { o1 = $3 } # o1 = field 3 of line that contains string "Time"
/cycle/ { o2 = $5 } # o2 = field 5 of line that contains string "cycle"
/Total/ { o3 = $4 } # o3 = field 4 of line that contains string "Total"
END { printf "%s %s %s %s\n", i, o1, o2, o3 } # print 4x variables to stdout
'
1 0.042127 3.386 170620482
I have a file with 3 columns like this:
NC_0001 10 x
NC_0001 11 x
NC_0002 90 y
I want to change the names in the first column using another .txt file that contains the conversion, like:
NC_0001 1
NC_0001 1
NC_0002 2
...
So finally I should have:
1 10 x
1 11 x
2 90 y
How can I do that?
P.S. The first file is huge (50 GB), so I must use a Unix command like awk.
awk -f script.awk map_file data_file
NR == FNR { # for the first file
tab[$1] = $2 # create a k/v of the colname and rename value
}
NR != FNR { # for the second file
$1 = tab[$1] # set first column equal to the map value
print # print
}
As a one-liner
awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file data_file
If possible, you could split the large data file, run this command on each chunk in parallel, and then concatenate the results.
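A rough sketch of that idea (with a hypothetical chunk size and file names; each chunk re-reads the comparatively small map_file):
split -l 100000000 data_file chunk_          # hypothetical chunk size
for c in chunk_*; do
    awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file "$c" > "out_$c" &
done
wait                                         # let all background jobs finish
cat out_chunk_* > data_file.renamed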
In a text file, I have a sequence of numbers in a column preceded by a short string. It is the 5th column in the example file here under "NAME":
SESSION NAME: session
SAMPLE RATE: 48000.000000
BIT DEPTH: 16-bit
SESSION START TIMECODE: 00:00:00:00.00
TIMECODE FORMAT: 24 Frame
# OF AUDIO TRACKS: 2
# OF AUDIO CLIPS: 2
# OF AUDIO FILES: 2
M A R K E R S L I S T I N G
# LOCATION TIME REFERENCE UNITS NAME COMMENTS
2 0:00.500 24000 Samples xxxx0001
3 0:03.541 170000 Samples xxxx0002
4 0:05.863 281458 Samples xxxx0003
5 0:08.925 428430 Samples xxxx0004
6 0:10.604 509025 Samples xxxx0005
7 0:13.973 670742 Samples xxxx0006
8 0:15.592 748453 Samples xxxx0008
9 0:19.243 923666 Samples xxxx0008
In the example above, 0007 is missing, and 0008 is duplicated.
Therefore, I would like to be able to check:
whether the numbers are sequential, given the range that presently exists in the column
whether there are any duplicates
I would also like to output these results:
SKIPPED:
xxxx0007
DUPLICATES:
xxxx0008
The furthest I have been able to get is to use awk to get the column I need:
cat <file.txt> | awk '{ print $5 }'
which gets me to this:
NAME
xxxx0001
xxxx0002
xxxx0003
xxxx0004
xxxx0005
xxxx0006
xxxx0008
xxxx0008
But I do not know where to go from here.
Do I need to loop through the list items and parse so I get the number only, then start doing some comparisons to the next line?
Any help would be tremendously appreciated
Thank you!
As a starting point, please try the following:
awk '
NR>1 { gsub("[^0-9]", "", $5); count[$5]++ }
END {
print "Skipped:"
for (i=1; i<NR; i++)
if (count[i] == 0) printf "xxxx%04d\n", i
print "Duplicates:"
for (i=1; i<NR; i++)
if (count[i] > 1) printf "xxxx%04d\n", i
} ' file.txt
Output:
Skipped:
xxxx0007
Duplicates:
xxxx0008
The condition NR>1 is used to skip the top header line.
gsub("[^0-9]", "", $5) removes non-number characters from $5.
As a result, $5 is set to a number extracted from the 5th column.
The array count[] counts the occurrences of each number. If the value is 0 (or undefined), the number was skipped. If the value is larger than 1, the number is duplicated.
The END { ... } block is executed after all the input lines are processed
and it is useful to report the final results.
However, the "Skipped/Duplicates" approach cannot properly detect cases such as:
# LOCATION TIME REFERENCE UNITS NAME COMMENTS
1 0:00.500 24000 Samples xxxx0001
2 0:02.888 138652 Samples xxxx0003
3 0:04.759 228446 Samples xxxx0004
4 0:07.050 338446 Samples xxxx0005
5 0:09.034 433672 Samples xxxx0006
6 0:12.061 578958 Samples xxxx0007
7 0:14.111 677333 Samples xxxx0008
8 0:17.253 828181 Samples xxxx0009
or
# LOCATION TIME REFERENCE UNITS NAME COMMENTS
1 0:00.500 24000 Samples xxxx0001
2 0:02.888 138652 Samples xxxx0003
3 0:04.759 228446 Samples xxxx0002
4 0:07.050 338446 Samples xxxx0004
5 0:09.034 433672 Samples xxxx0005
6 0:12.061 578958 Samples xxxx0006
7 0:14.111 677333 Samples xxxx0007
8 0:17.253 828181 Samples xxxx0008
It would be better to perform a line-by-line comparison between the expected and actual values. Then how about:
awk '
NR>1 {
gsub("[^0-9]", "", $5)
if ($5 != NR-1) printf "Line: %d Expected: xxxx%04d Actual: xxxx%04d\n", NR, NR-1, $5
} ' file.txt
Output for the original example:
Line: 8 Expected: xxxx0007 Actual: xxxx0008
[EDIT]
For the revised input file, which includes more header lines, how about:
awk '
f {
gsub("[^0-9]", "", $5)
if ($5 != NR-skip) printf "Line: %d Expected: xxxx%04d Actual: xxxx%04d\n", NR, NR-skip, $5
}
/^#[[:blank:]]+LOCATION[[:blank:]]+TIME REFERENCE/ {
skip = NR
f = 1
}
' file.txt
Output:
Line: 19 Expected: xxxx0007 Actual: xxxx0008
The script above skips the lines until the specific pattern # LOCATION TIME REFERENCE is found.
The f { ... } block is executed if f is true. So the block is skipped
until f is set to a nonzero value.
The /^# .../ { ... } block is executed if the input line matches the
pattern. If found, skip is set to the number of header lines and
f (flag) is set to 1 so the upper block is executed from the next
iteration.
Hope this helps.
I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this:
while read index
do
    sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands and tens of thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could, but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
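With the files from the question (using file1 for the strings and file2 for the indices, matching the variables in the original loop), this prints the matching lines in data-file order:
$ awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' file2 file1
this is line 2
this is line 3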
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR { # processing the index:
++counter
idx[$0] = counter # remember that and at which position you saw
next # the index
}
FNR in idx { # when processing the data file:
lines[idx[FNR]] = $0 # remember selected lines by the position of
} # the index
END { # and at the end: print them in that order.
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
++counter
idx[$0] = idx[$0] " " counter # remember a list here
next
}
FNR in idx {
split(idx[FNR], pos) # split that list
for(p in pos) {
lines[pos[p]] = $0 # and remember the line for
# all positions in them.
}
}
END {
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
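For instance, saving the last script as dupes.awk (a name chosen here for illustration) and feeding it an index with a repeated, out-of-order entry behaves like this:
$ cat indexfile
2
2
1
$ cat datafile
line1
line2
line3
line4
$ awk -f dupes.awk indexfile datafile
line2
line2
line1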
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python3
lines = set()
with open("$file2") as f:
    for line in f:
        lines.add(int(line))

i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            print(line, end="")
EOF
The only advantage here is that Python is way more easy to understand than awk :).
I have a CSV that was exported; some lines have a linefeed (ASCII 012) in the middle of a record. I need to replace it with a space, but preserve the newline at the end of each record so the records can still be loaded.
Most of the lines are fine, however a good few have this:
Input:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :
Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
Output:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
I have been looking into Awk but cannot really make sense of how to preserve the actual row.
Another Example:
Input:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working
Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
Output:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
One way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    FS = "[,~]"
}

NF < 21 {
    line = (line ? line OFS : line) $0
    fields = fields + NF
}

fields >= 21 {
    print line
    line = ""
    fields = 0
}

NF == 21 {
    print
}
Alternatively, you can use this one-liner:
awk -F "[,~]" 'NF < 21 { line = (line ? line OFS : line) $0; fields = fields + NF } fields >= 21 { print line; line=""; fields=0 } NF == 21 { print }' file.txt
Explanation:
I made an observation about your expected output: it seems each line should contain exactly 21 fields. Therefore, if a line contains fewer than 21 fields, store the line and the number of fields seen so far. When we move on to the next line, it is joined to the stored line with a space, and the field counts are totaled. If this total is greater than or equal to 21 (the fields of a broken line sum to 22), print the stored line. Otherwise, if the line itself contains 21 fields (NF == 21), print it as-is. HTH.
I think sed is your choice. I assume all the records end with a double quote, so if a line does not end with a quote it is recognized as an incomplete record and the following line should be joined onto it.
Here is the code:
cat data | sed -e '/[^"]$/N' -e 's/\n//g'
The first expression, -e '/[^"]$/N', matches such an incomplete line and reads in the next line without emptying the pattern space. Then -e 's/\n//g' removes the embedded newline character.
try this one-liner:
awk '{if(t){print;t=0;next;}x=$0;n=gsub(/"/,"",x);if(n%2){printf $0" ";t=1;}else print $0}' file
Idea:
Count the number of " characters in a line. If the count is odd, join the following line; otherwise the current line is considered a complete record.
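For readability, here is the same one-liner written out with comments (the only change is passing the data to printf as an argument rather than as the format string, which avoids problems if a record contains a % sign):
awk '
{
    if (t) {                 # previous line had an odd quote count,
        print                # so this line completes the record
        t = 0
        next
    }
    x = $0
    n = gsub(/"/, "", x)     # count the double quotes on this line
    if (n % 2) {             # odd count: record continues on the next line
        printf "%s ", $0     # print without a newline, add a joining space
        t = 1
    } else {
        print $0             # even count: the line is a complete record
    }
}' file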