Suppose I have the following file:
stub-foo-start: 10
stub-foo-stop: 15
stub-bar-start: 3
stub-bar-stop: 7
stub-car-start: 21
stub-car-stop: 51
# ...
# EOF at the end
with the goal of writing a script which would append to it like so:
stub-foo-start: 10
stub-foo-stop: 15
stub-bar-start: 3
stub-bar-stop: 7
stub-car-start: 21
stub-car-stop: 51
# ...
# appended:
stub-foo: 5 # 5 = stop(15) - start(10)
stub-bar: 4 # and so on...
stub-car: 30
# ...
# new EOF
The format is exactly this sequential pairing of start and stop tags (stop being the closing one), with no nesting in between.
What is the recommended approach to writing such a script using awk and/or sed? Mostly, what I've tried is grepping lines and storing them in variables, but that seemed to overcomplicate things and trailed off.
Any advice or helpful links welcome. (Most tutorials I found on shell scripting were illustrative at best)
A naive implementation in plain bash
#!/bin/bash
# Read the lines two at a time: a start line and its matching stop line.
while read -r start && read -r stop; do
    # "${start%-*}" strips the trailing "-start: N" to get the tag name;
    # "${...##*:}" strips everything up to the colon to get the numbers.
    printf '%s: %d\n' "${start%-*}" $(( ${stop##*:} - ${start##*:} ))
done < file
This assumes pairs are contiguous and there are no interlaced or nested pairs.
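If you want the computed lines appended to the original file, as in the question, capture the output first and append it afterwards, so the file is not being read and written at the same time (a small sketch; diffs.sh is just an assumed name for the script above):
out=$(./diffs.sh)
printf '# appended:\n%s\n' "$out" >> file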
Using GNU awk:
awk -F '[ -]' '{ map[$2][$3]=$4; print } END { for (i in map) { print i": "(map[i]["stop:"]-map[i]["start:"])" // ("map[i]["stop:"]"-"map[i]["start:"]")" } }' file
Explanation:
awk -F '[ -]' '{       # Set the field delimiter to space or "-"
    map[$2][$3]=$4     # Create a two-dimensional array with the second and third fields as indexes and the fourth field as the value
    print              # Print the line
}
END {
    for (i in map) {
        print i": "(map[i]["stop:"]-map[i]["start:"])" // ("map[i]["stop:"]"-"map[i]["start:"]")"   # Loop through the array and print the data in the required format
    }
}' file
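Since the command reprints every input line before the computed summary lines, you can rewrite the file in place by going through a temporary file (a usage sketch; '...' stands for the awk program above):
awk -F '[ -]' '...' file > file.tmp && mv file.tmp file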
I have a file that consists of a bunch of things, but what I need are the numbers between start and end strings. For example:
ghghgh
start
23
34
22
12
end
ghbd
wodkkh
234
start
14
56
74
end
So I need two arrays here: one containing 23,34,22,12 and one containing 14,56,74. What's the best command to use?
If I only had one start and one end I would be able to use mapfile and awk to obtain that array, but there are many starts and ends in the file and I need to save all the arrays.
You can do it with sed.
sed -n '/start/{:a;N;/end/!ba;s/\n/, /g;s/, [^,][a-z][^,]*//Ig;s/start, //p}'
The code will iterate through all chunks between 'start' and 'end' lines.
It will remove any items containing non-digit characters and output each "array" on a separate line.
Here is output from your data sample:
23, 34, 22, 12
14, 56, 74
You need to implement a small state machine, switching between being inside a block and outside of one:
awk '/end/{block = 0; print a; a = ""} (block) {a = a " " $0} /start/{block = 1}'
If at end, leave block, print and empty the accumulator. If in block, accumulate current line. If at start, mark that we're inside a block.
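If you then want each group in a bash array, you can capture the awk output with mapfile and split a group on demand (a sketch, assuming bash 4+ and that the data is in a file named file):
mapfile -t chunks < <(awk '/end/{block = 0; print a; a = ""} (block) {a = a " " $0} /start/{block = 1}' file)
read -r -a first <<< "${chunks[0]}"   # first group as a proper array: 23 34 22 12
echo "${first[2]}"                    # prints 22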
You can tell awk to change the output file every time a new sequence starts
awk '/start/{i++;f=1;next} /end/{f=0} f{print > "arr"i}' file
For the example file, this will create the files arr1 and arr2. Then you can create separate arrays with the lines of these files:
for i in arr*; do readarray -t "$i" < "$i"; done
note: I have assumed that all lines between matching patterns are numeric and acceptable as in the example.
If you trust your input files enough for an eval:
$ cat tst.sh
eval $(
    awk '
        f {
            if ( /end/ ) {
                print "declare arr" ++cnt "=(" vals " )"
                vals = ""
                f = 0
            }
            else {
                vals = vals OFS $0
            }
        }
        /start/ { f = 1 }
    ' "$1"
)
printf "arr1:%s\n" "${arr1[#]}"
printf "arr2:%s\n" "${arr2[#]}"
$ ./tst.sh file
arr1:23
arr1:34
arr1:22
arr1:12
arr2:14
arr2:56
arr2:74
Check the quoting and all other shell gotchas...
I need to split a big file into smaller chunks based on the last occurrence of a pattern in the big file, using a shell script. For example:
Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):
NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/>
"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt
NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
I used
awk -F'|' -v 'pattern="00003"' '$3~pattern big_file' > smallfile
and grep commands, but it was very time consuming since the file is 300+ MB in size.
Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.
It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.
ndx=0; fromRow=1
for val in '00003' '00112' '|'; do  # 2 sample values to match, plus a dummy value
    chunkFile="smallfile$(( ++ndx ))"
    fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
        NR < fromRow { next }
        { if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
    ' big_file)
done
Note that the dummy value | ensures that any rows remaining after the last true value to match are saved to a chunk file too.
Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:
awk -F'|' -v vals='00003|00112' '
    BEGIN { split(vals, val); outFile="smallfile" ++ndx }
    {
        if ($3 != val[ndx]) {
            if (p) { p=0; close(outFile); outFile="smallfile" ++ndx }
        } else {
            p=1
        }
        print > outFile
    }
' big_file
You can try with Perl:
perl -ne '/00003/ && print' big_file > small_file
and compare its timing with other solutions...
EDIT
Limiting my answer to the tools you haven't tried already... you can also use:
sed -n '/00003/p' big_file > small_file
But I tend to believe Perl will be faster. Again... I'd suggest you measure the elapsed time of the different solutions on your own.
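A quick way to run that comparison yourself (wall-clock times will vary with disk caching, so run each command a couple of times):
time perl -ne '/00003/ && print' big_file > /dev/null
time sed -n '/00003/p' big_file > /dev/null
time grep '00003' big_file > /dev/null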
I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
    sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands and tens of thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could, but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. The lines are expected to appear in the order defined by the indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR {            # while processing the first file
    selected[$1] = 1   # remember if an index was seen
    next               # and do nothing else
}
selected[FNR]          # after that, select (print) the selected lines.
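Applied to the question's example files (with File 2 saved as indexfile and File 1 as datafile; the names are just assumptions), this prints the selected lines in data-file order, not index order:
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
this is line 2
this is line 3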
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR {                  # processing the index:
    ++counter
    idx[$0] = counter        # remember that and at which position you saw
    next                     # the index
}
FNR in idx {                 # when processing the data file:
    lines[idx[FNR]] = $0     # remember selected lines by the position of
}                            # the index
END {                        # and at the end: print them in that order.
    for(i = 1; i <= counter; ++i) {
        print lines[i]
    }
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile; an inlined one-liner is shown after the example below. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
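For reference, here is that second variant collapsed into a one-liner, equivalent to the foo.awk script above:
awk 'NR == FNR { ++counter; idx[$0] = counter; next } FNR in idx { lines[idx[FNR]] = $0 } END { for (i = 1; i <= counter; ++i) print lines[i] }' indexfile datafile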
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
    ++counter
    idx[$0] = idx[$0] " " counter   # remember a list here
    next
}
FNR in idx {
    split(idx[FNR], pos)            # split that list
    for(p in pos) {
        lines[pos[p]] = $0          # and remember the line for
                                    # all positions in them.
    }
}
END {
    for(i = 1; i <= counter; ++i) {
        print lines[i]
    }
}
This, finally, is the functional equivalent of the code in the question. How complicated a solution you need for your use case is something you'll have to decide.
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
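If you want only the original lines, without the numbers that nl prepends, strip the first field of the join output (assuming join's default single-space output separator):
join index <(nl strings) | cut -d ' ' -f 2-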
If you want the inverse (show lines that NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python
lines = []
with open("$file2") as f:
    for line in f:
        lines.append(int(line))
i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            print line,
EOF
The only advantage here is that Python is way easier to understand than awk :).
I have 15 files like
file1.csv
a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0
file2.csv
d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0
.
.
.
.
file15.csv
s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0
I have a different number of lines in each file, and I want to compare the second field across all 15 files and extract the lines whose second field is common to all 15 files.
In the above case the output is:
cg2
(taking it that cg2 is common to the second field of all 15 files)
I am a little new to Unix and shell scripting; please help.
Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?
The former:
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .
The latter:
cg2
. . .
If the latter, then this should work
awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] >= c) {print i}}}' file*.csv
Broken out on multiple lines:
awk -F, '{
    arr[$2]++
    if (FILENAME != prevfile) {
        c++
        prevfile = FILENAME
    }
}
END {
    for (i in arr) {
        if (arr[i] >= c) {
            print i
        }
    }
}' file*.csv
Explanation:
increment the count of the number of times a field 2 value occurs
if the filename changes, increment the count of files (for the first file, prevfile changes from a null string to that filename and the count increments from 0 to 1)
save the current filename
once all the counting is done, iterate over the array by its keys
if the count contained in the array is greater than or equal to the number of files, then the field 2 value appeared in all the files (by checking for >= instead of == this will work in case a value appears more than once in a single file)
so print the key (which is a field 2 value)
a glob is used to get all the files, but you could list them explicitly
Edit:
Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.
awk -F, '
FILENAME == first && flag {
    exit
}
! first {
    first = FILENAME
}
FILENAME != first {
    flag = 1
}
{
    arr[$2]++
    if (FILENAME != prevfile) {
        c++
        prevfile = FILENAME
    }
}
END {
    # print the matching lines (only those whose field 2 count reached the file count)
    do {
        if (arr[$2] >= c) {
            print
        }
    } while (getline)
    # print the list of words
    for (i in arr) {
        if (arr[i] >= c) {
            print i
        }
    }
}' file*.csv file*.csv
It depends on the first file in the first group having the same name as the first file in the second group. Using globbing similar to what I've shown will take care of that requirement.
It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).
If you print only the full lines, you can pipe the output to:
sort -t , -k2,2
to have them grouped.
Piping only the list of words to:
sort
will put them in the same order for easier comparison.
Fun problem.
One way to do it, using only the shell and a few standard utilities (sort, join, cut), is as follows.
One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort on the second column.
Do successive joins in a for-loop, because join takes only two files as arguments.
ADDENDUM
Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.
$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b
The first join (on the first two files) produces the joined value in the first column of the result. So as you join to file3, file4, file5, etc., you will be using the first column of the result you are generating, which is why you only need the -2 option. To keep things very efficient, always cut out all but the first column each time you do the join.
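Putting the successive joins into a loop might look roughly like this (a sketch, assuming bash and files named file1.csv through file15.csv; common.txt is just a scratch file holding the running list of common field-2 values):
cut -d ',' -f 2 file1.csv | sort -u > common.txt
for f in file{2..15}.csv; do
    join -t ',' -2 2 common.txt <(sort -t ',' -k2,2 "$f") | cut -d ',' -f 1 | sort -u > tmp.txt
    mv tmp.txt common.txt
done
cat common.txt   # field-2 values present in every file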