I'm trying to split a big TSV file into smaller parts based on column values, but I need to keep the header in every file created by the split. How can I do this?
I've tried some solutions, but they only solve the problem for particular files:
awk -F'\t' 'NR==1 {h=$0};NR>1{print ((!a[$5]++ && !a[$9]++ && !a[$10]++)? h ORS $0 : $0) > "file_first-" $5 "_second-" $9 "_third-" $10 ".tsv"}' file.tsv
I expect to have the header in each file, but for now it only appears in files where the $5 $9 $10 values run like 1 1 1, 2 2 2, ... and not for the other permutations.
You probably want the following per-line logic:

calculate output_file
if !header_sent[output_file]
    print header to output_file
    set header_sent[output_file]
endif
print current line to output_file
An implementation in AWK is below. It can be converted into a one-liner by removing the comments, compacting the variable names, and so on.
NR == 1 { header = $0 }
NR > 1 {
    output_file = "file_first-" $5 "_second-" $9 "_third-" $10 ".tsv"
    # Send the header if it has not been sent to this file yet.
    if (!header_sent[output_file]) {
        print header > output_file
        header_sent[output_file] = 1
    }
    # Print the current line.
    print $0 > output_file
}
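A minimal way to try the script end to end, on hypothetical sample data (column numbers reduced to a single $2 to keep the input short; the logic is identical):

```shell
# Demo of the header-preserving split on a tiny two-column TSV.
printf 'id\tgroup\nr1\tA\nr2\tB\nr3\tA\n' > demo.tsv

awk -F'\t' '
NR == 1 { header = $0 }
NR > 1 {
    output_file = "part-" $2 ".tsv"
    if (!header_sent[output_file]) {
        print header > output_file
        header_sent[output_file] = 1
    }
    print $0 > output_file
}' demo.tsv

head -1 part-A.tsv   # both output files start with the header line
head -1 part-B.tsv
```

part-A.tsv ends up with the header plus rows r1 and r3; part-B.tsv gets the header plus r2.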
I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it through sort), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
Compared with the other solutions posted so far, it has the advantage of not holding the whole data set in memory. This comes at the cost of requiring sorted input (sorting is what would otherwise force lots of data into memory if the input weren't already sorted).
Explanation:
the first line initializes the currentHost and currentApps variables to the values of the first line of input
the second line handles a line with the same host as the previous one: the app mentioned on the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the information for the previous host is printed, then the variables are reinitialized to the values of the current input line
the END block prints the information for the last host once the end of the input is reached
It could probably be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
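Here is the streaming command run end to end on the sample data from the question (input already sorted on the first column):

```shell
cat > data.csv <<'EOF'
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
EOF

awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }' data.csv
# host1;app1,app2,app3
# host2;app4,app5,app6
```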
See it in action !
$ awk '
BEGIN { FS=","; ORS="" }
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }
{ print OFS $2; OFS=FS }
END { print ors }
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read -r host app
do
    [ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
    hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
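A self-contained run of the associative-array approach (bash 4+ is assumed for declare -A), feeding it a shortened version of the question's sample data as testfile:

```shell
#!/bin/bash
# Build "host;app1,app2,..." strings keyed by host, then print them sorted.
cat > testfile <<'EOF'
host1,app1
host1,app2
host2,app4
EOF

declare -A hosts
while IFS=, read -r host app
do
    [ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
    hosts["$host"]+=$app,
done < testfile
# ${hosts[@]%,} strips the trailing comma from each accumulated value.
printf "%s\n" "${hosts[@]%,}" | sort | tee result.txt
# host1;app1,app2
# host2;app4
```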
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column and appends to that entry every element from the second column.
When the file is parsed, the content of the array is printed.
Currently using,
$ awk 'NR==FNR{a[$1];next} ($3 in a)' find.txt path_to_100_files/*
to search a directory of multiple files for strings listed in a .txt file (find.txt)
find.txt contains
example1
example 2
example#eampol.com
exa exa exa123
...
example of .txt files within directory
example example example.com
example 2 example example lol
Currently it searches for the string within column 3, via ($3 in a) (where $3 means column #3), but sometimes the string can be in $1 or $5 and so on. How can I get it to search every column instead of just the 3rd?
awk '
NR==FNR{a[$1];next}
{ for (i=1; i<=NF; i++) if ($i in a) { print; next } }
' find.txt path_to_100_files/*
The above assumes your existing script behaves as desired given exa exa exa123.
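Exercising the any-column loop with hypothetical file names and patterns (the real find.txt and directory are assumptions here):

```shell
printf 'needle1\nneedle2\n' > find.txt
printf 'a b needle1 d\nno match here\nneedle2 x y\n' > haystack.txt

# First pass stores find.txt keys in a; second pass prints any line
# whose fields include one of those keys, whatever the column.
awk '
NR==FNR{a[$1];next}
{ for (i=1; i<=NF; i++) if ($i in a) { print; next } }
' find.txt haystack.txt
# a b needle1 d
# needle2 x y
```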
I have the following script:
awk '
# Write 1st file into array
NR == FNR {
array[NR] = $0;
next;
}
# Process 2nd file
{
...
} ' file1 file2
What I want is to load the first file into an array and later use that array while processing the second file. The first file may be empty, and that is where my problem appears: when awk reads an empty file, it does not execute any user-level awk code for it and skips to the second file. While awk is reading the second file, NR == FNR is still true, so the program writes the second file into the array.
How can I avoid this, so that only the first file is put into the array, if it has any content?
Use this condition to safeguard against the empty-file scenario:
ARGV[1]==FILENAME && FNR==NR {
array[NR] = $0
next
}
ARGV[1] will be set to the first filename on the awk command line, and FILENAME holds the name of the file currently being processed.
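A minimal sketch of the guard in action, with hypothetical file names: when the first file is empty, FNR==NR still holds throughout the second file, but FILENAME no longer matches ARGV[1], so nothing from the second file lands in the array.

```shell
: > empty.txt                     # empty first file
printf 'x\ny\n' > second.txt

awk '
ARGV[1]==FILENAME && FNR==NR {
    array[NR] = $0
    next
}
{ print "processing:", $0 }
' empty.txt second.txt
# processing: x
# processing: y
```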
I have two files, file1 and file2. Both files have 5 columns.
I want to compare the first 4 columns of file1 with file2.
If they are equal, I need to compare the 5th column. If the 5th column values differ, I need to print file1's 5th column as file2's 6th column.
I have used the awk below to compare two columns in two different files, but how do I compare multiple columns and append the particular column to the other file when a match is found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat
Awk can simulate multidimensional arrays by combining the indices into one subscript. Underneath, the indices are concatenated using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)
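A tiny demonstration that the comma subscript and explicit SUBSEP concatenation build the same key, and that a manually joined key (the "|" trick) lives in a separate namespace:

```shell
awk 'BEGIN {
    a["x","y"] = 1                      # comma subscript
    print (("x" SUBSEP "y") in a)       # explicit SUBSEP form finds the same key
    a["p" "|" "q"] = 1                  # manual separator trick: key is "p|q"
    print (("p","q") in a)              # comma form uses SUBSEP, so no match
}'
# 1
# 0
```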
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat