Compare headers of two delimited text files - bash

File1.txt (base file)
header1|header2|header3|header4
1|2|3|4
File2.txt
header1|header10|header3|header4
5|6|7
Desired output:
header2 is missing in file 2 at position 2
header10 is an addition in file 2 at position 2
I need to compare the two files' headers and display the missing or additional columns with respect to the base file's header list.

I would try it with the diff command like this:
diff <(head -n1 fh1.txt | tr "|" "\n") <(head -n1 fh2.txt | tr "|" "\n")
where fh1.txt and fh2.txt are your files. The output contains the information you want, but not in the verbose format you asked for.
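For the sample files above, this prints:
2c2
< header2
---
> header10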

You can use awk, like this:
check.awk:
# In the first line of every input file, save the header line
FNR==1 {
    headers[f++] = $0
}
# Once all lines of input have been processed ...
END {
    # split() returns the number of items. The resulting arrays
    # a_headers and b_headers are indexed starting from 1.
    lena = split(headers[0], a_headers, "|")
    lenb = split(headers[1], b_headers, "|")
    for (h = 1; h <= lena; h++) {
        if (a_headers[h] != b_headers[h]) {
            print a_headers[h] " missing from file2 at column " h
        }
    }
    for (h = 1; h <= lenb; h++) {
        if (b_headers[h] != a_headers[h]) {
            print b_headers[h] " missing from file1 at column " h
        }
    }
}
Call it like this:
awk -f check.awk File1.txt File2.txt
Output:
header2 missing from file2 at column 2
header10 missing from file1 at column 2
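Note that a renamed column (header2 vs. header10 in the example) is reported by both loops, one line from each, which is exactly the desired output.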

Extract common lines from multiple text files and display original line numbers

What I want?
Extract the common lines from n large files.
Append the original line numbers from each file.
Example:
File1.txt has the following content
apple
banana
cat
File2.txt has the following content
boy
girl
banana
apple
File3.txt has the following content
foo
apple
bar
The output should go to a different file:
1 3 2 apple
1, 3 and 2 in the output are the original line numbers in File1.txt, File2.txt and File3.txt where the common line apple exists.
I have tried using grep -nf File1.txt File2.txt File3.txt, but it returns
File2.txt:3:apple
File3.txt:2:apple
Associate each unique line with a space-separated list of the line numbers at which it is seen in each file, and at the end print those numbers next to the line if it was found in all three files.
awk '{
    # append the line number within the current file (FNR) to
    # the list of line numbers stored for this line
    n[$0] = n[$0] FNR OFS
    # count how many times this line has been seen overall
    c[$0]++
}
END {
    # a count of 3 means once per file (assuming no duplicates within a file)
    for (r in c)
        if (c[r] == 3)
            print n[r] r
}' file1 file2 file3
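For the sample files above, this prints:
1 3 2 apple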
If the number of files is unknown, refer to Ravinder's answer, or simply replace the hardcoded 3 in the END block with ARGC-1 as shown there.
A GNU-awk-specific approach that works with any number of files:
#!/usr/bin/gawk -f
BEGINFILE {
    # BEGINFILE runs at the start of each input file (gawk extension)
    nfiles++
}
{
    # remember at which line number the current line occurs
    # in the current file (arrays of arrays are a gawk extension)
    lines[$0][nfiles] = FNR
}
END {
    # iterate over the stored lines in sorted order
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (line in lines) {
        # keep only lines that occurred in every file
        if (length(lines[line]) == nfiles) {
            for (file = 1; file <= nfiles; file++)
                printf "%d\t", lines[line][file]
            print line
        }
    }
}
Example:
$ ./showlines file[123].txt
1 3 2 apple
The following, written and tested with GNU awk, makes use of the ARGC value, which holds the total number of arguments passed to the awk program, so ARGC-1 is the number of input files:
awk '
{
    a[$0] = (a[$0] ? a[$0] OFS : "") FNR
    count[$0]++
}
END {
    for (i in count) {
        if (count[i] == (ARGC - 1)) {
            print i, a[i]
        }
    }
}
' file1.txt file2.txt file3.txt
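Note that this prints the matched line first and its line numbers after it (apple 1 3 2); to get the line numbers first, as in the desired output, swap the print arguments: print a[i], i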
A Perl solution:
perl -ne '
    $h{$_} .= "$.\t";   # append the current line number and a tab to the value in a hash keyed by the current line
    $. = 0 if eof;      # reset the line number when the end of each file is reached
    END {
        while ( ($k, $v) = each %h ) {   # loop over hash entries
            if ( $v =~ y/\t// == 3 ) {   # if the value contains 3 tabs ...
                print $v . $k            # ... print value concatenated with key
            }
        }
    }' file1.txt file2.txt file3.txt
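For the sample files, this prints the tab-separated line numbers followed by the line itself:
1	3	2	apple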

Merge multiple files into a single-row file with a delimiter

Updated question:
I have been working on a bash script that merges multiple text files with numerical values into a single-row text file, using delimiters between the file values while merging.
Example:
File1.txt has the following contents:
168321099
File2.txt has:
151304
151555
File3.txt has:
16980925
File4.txt has:
154292
149092
Now I want an output.txt file like below:
, 168321099 151304 151555 16980925 , 154292 149092
Basically, each file's values are delimited by spaces, all in a single row, with a comma as the first and the sixth field of the output row.
I tried:
cat * > out.txt
but the output is not as expected.
I am not sure if I understood your question correctly, but I interpreted it as follows:
The set of files file1, ..., filen contains a set of words which you want printed on one single line.
Each word is space-separated.
In addition to the string of words, you want the first character to be a "," and between word 4 and word 5 you want another ",".
The cat+tr+awk solution:
$ cat <file1> ... <filen> | tr '\n' ' ' | awk '{$1=", "$1; $4=$4" ,"; print}'
The awk solution:
$ awk 'NR==1||NR==5{printf s",";s=" "}{printf " "$1}' <file1> ... <filen>
If tr is available on your system, you can do the following: cat * | tr "\n" " " > out.txt
tr "\n" " " translates all line breaks to spaces.
If the number of lines per file is constant, then the easiest way is tr as #Littlefinix suggested, with a couple of anonymous files to supply the commas, and an echo at the end to add an explicit newline to the output line:
cat <(echo ",") File1.txt File2.txt File3.txt <(echo ",") File4.txt | tr "\n" " " > out.txt; echo >> out.txt
out.txt is exactly what you specified:
, 168321099 151304 151555 16980925 , 154292 149092
If the number of lines per input file might vary (e.g., File2.txt has 3 or 4 lines, etc.), then placing the commas always in the 1st and 6th field will be more involved, and you'd probably need a script and not a one-liner.
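For example, here is a minimal awk sketch of that idea: it buffers all values and places the commas by output position (one at the start, one after the fourth value), no matter how the values are spread across the input files. The File*.txt glob is assumed to match the input files in the right order.
awk '{ v[++n] = $0 }
END {
    out = ","
    for (i = 1; i <= n; i++) {
        out = out " " v[i]
        # hard-coded: the second comma goes after the fourth value
        if (i == 4) out = out " ,"
    }
    print out
}' File*.txt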
The following single awk could help:
awk 'FNR==1{count++} {printf("%s%s",count==1?", ":(count==(ARGC-1)&&FNR==1?" , ":" "),$0)} END{print ""}' *.txt
The same solution in non-one-liner form:
awk '
FNR==1 { count++ }
{ printf("%s%s", count==1 ? ", " : (count==(ARGC-1) && FNR==1 ? " , " : " "), $0) }
END { print "" }
' *.txt

Include header in grep of specific csv columns

I am trying to extract relevant information from a large csv file for further processing, so I would like to have the column names (header) saved in my output mini-csv files.
I have:
grep "Example" $fixed_file | cut -d ',' -f 4,6 > $outputpath"Example.csv"
which works fine in generating a csv file with two columns, but I would like the header information to also be included in the output file.
Use command grouping and add head -1 to the mix:
{ head -1 "$fixed_file" && grep "Example" "$fixed_file" | cut -d ',' -f 4,6 ;} \
>"$outputpath"Example.csv
My suggestion would be to replace your multiple-command pipeline with a single awk script.
awk '
BEGIN {
    OFS = FS = ","
}
NR==1;
/Example/ {
    print $4, $6
}
' "$fixed_file" > "$outputpath/Example.csv"
If you want your header to contain only header fields 4 and 6, you could change this to:
awk '
BEGIN {
    OFS = FS = ","
}
NR==1 || /Example/ {
    print $4, $6
}
' "$fixed_file" > "$outputpath/Example.csv"
Awk scripts consist of condition { statement } pairs. If the statement is missing, the default is to print the line (which is why NR==1; prints the header).
And of course, you could compact this into a one-liner:
awk -F, 'NR==1||/Example/{print $4 FS $6}' "$fixed_file" > "$outputpath/Example.csv"

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one-line CSV containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and use the count of descriptors as n.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but then the comma delimiters are missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
Trying to replace the " " with "," fails:
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
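Applied to the question's data, with three fields per output line, the same command becomes (file.csv standing in for your input file):
perl -pe 's{,}{++$n % 3 ? $& : "\n"}ge' file.csv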
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here; the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
    print header
    # derive the column count from the number of commas in the header
    colCount = 1 + gsub(",", ",", header)
}
{
    # end every colCount-th field with a newline, all others with a comma
    ORS = NR % colCount == 0 ? "\n" : ","
    print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
    print header
    colCount = 1 + gsub(",", ",", header)
}
{
    ORS = NR % colCount == 0 ? "\n" : ","
    print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
Assuming your input file is named input:
echo '"id","lon","lat"'; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Making a new file using another as template in BASH

I have two files such as:
File_1
c1,c2,c3,c4
File_2
c1,c3,c2,c4
DA,CA,DD,CD
Thus, I want to make a File_3, using File_1 as the model, in BASH:
File_3
c1,c2,c3,c4
DA,DD,CA,CD
In this example, File_1 is a model of the correct disposition of the columns, and File_2 has the columns and their respective information, but in the wrong order. Thus, File_3 uses File_1 as a template and puts the information from File_2 into the correct order.
In the example I gave just 4 columns, but my real file has 402 columns.
So doing an
awk -F"," '{print $1","$3","$2","$4}' File_2
or something like this will not work, because I do not know the positions of the items of File_1 within File_2 (for example, the c1 column of File_2 could be in the sixth, the second, or the last column position).
I hope that you can help me using BASH (if possible), and I would like a small explanation of the script, because I'm a newbie and I don't know many of the commands.
Thanks in advance.
You can make a header index mapping: for each column position in File_1, look up where that header sits in File_2:
File_1 => File_2
------    ------
1    =>    1
2    =>    3
3    =>    2
4    =>    4
awk -F, '
# First file (File_1): remember the position of every header name
FNR==NR {
    for (i = 1; i <= NF; i++)
        a[$i] = i
    # the File_1 header line is already in the desired order
    print
    nextfile
}
# Header line of File_2: b[k] = position in File_2 of the header
# that belongs in output column k (File_1 order)
FNR==1 {
    for (j = 1; j <= NF; j++)
        b[a[$j]] = j
    next
}
# Data lines of File_2: print the fields in File_1 order
{
    for (k = 1; k <= NF; k++)
        printf("%s%s", $b[k], k==NF ? "\n" : ",")
}
' File_{1,2}
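Output for File_{1,2} from the question:
c1,c2,c3,c4
DA,DD,CA,CD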
Note: This command works if File_{1,2} contain no empty lines!
If you are free to change the format of file 2 from:
File_2
c1,c3,c2,c4
DA,CA,DD,CD
to:
s/c1/DA/g
s/c3/CA/g
s/c2/DD/g
s/c4/CD/g
you can use sed:
sed -f File_2 File_1 > File_3
Otherwise you can work with arrays:
key=($(head -n1 File_2 | tr "," " "))
val=($(tail -n1 File_2 | tr "," " "))
len=${#key[*]}
for i in $(seq 0 $((len-1))); do echo s/${key[$i]}/${val[$i]}/g; done > subst.sed
sed -f subst.sed File_1 > File_3
The generated sed program is the same as the one above. If a substituted value matches the key of a later command, you might get unexpected results. If you only want to match whole words, you have to change the sed commands a bit.
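For example, with GNU sed you can generate word-boundary anchors (\< and \>) so that a key like c1 does not also match inside a longer header like c10:
for i in $(seq 0 $((len-1))); do echo "s/\<${key[$i]}\>/${val[$i]}/g"; done > subst.sed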
