Concatenating CSV files in bash preserving the header only once

Imagine I have a directory containing many subdirectories each containing some number of CSV files with the same structure (same number of columns and all containing the same header).
I am aware that I can run from the parent folder something like
find ./ -name '*.csv' -exec cat {} \; > ~/Desktop/result.csv
And this will work fine, except for the fact that the header is repeated each time (once for each file).
I'm also aware that I can do something like sed 1d <filename> or tail -n +<N+1> <filename> to skip the first line of a file.
But in my case, it seems a bit more specialised. I want to preserve the header once for the first file and then skip the header for every file after that.
Is anyone aware of a way to achieve this using standard Unix tools (like find, head, tail, sed, awk etc.) and bash?
For example input files
/folder1
    /file1.csv
    /file2.csv
/folder2
    /file1.csv
Where each file has the header A,B,C and a single data row 1,2,3.
The desired output would be:
A,B,C
1,2,3
1,2,3
1,2,3
Marked As Duplicate
I feel this is different from other questions like this and this, specifically because those solutions reference file1 and file2 explicitly. My question asks about a directory structure with an arbitrary number of files, where I would not want to type out each file one by one.

You may use this find + xargs + awk:
find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1'
The NR==1 || FNR>1 condition is true for the very first line of the combined output (NR==1) and for every line that is not the first line of its own file (FNR>1), so the header is kept once and skipped for every subsequent file.
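If you want the combined output in a single file, as in the question, just redirect it. A minimal sketch reusing the question's result.csv path (note that with a very large number of files xargs may split them across several awk invocations, which would repeat the header):
find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1' > ~/Desktop/result.csv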

$ {
> cat real-daily-wages-in-pounds-engla.tsv;
> tail -n+2 real-daily-wages-in-pounds-engla.tsv;
> } | cat
Commands grouped with { ...; } write to a single combined output stream, which you can redirect or pipe (the trailing | cat here is only for illustration). tail -n +2 selects all lines of a file except the first.
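Applied to the question's layout, the same grouping idea could be sketched like this, assuming at least one CSV exists and that no filename contains a newline: take the header from the first file found, then append the data rows of every file.
{
  head -n 1 "$(find . -name '*.csv' | head -n 1)"
  find . -name '*.csv' -exec tail -n +2 {} \;
} > ~/Desktop/result.csv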

Related

grep on multiple files, output to multiple files named after the original files

I have a folder with 64 items in it, numbered like this: 1-1, 1-2, 1-3, 1-4, 2-1 …
What I want to do is grep the file 1-1 for a specific pattern, have the output saved to a new file named 1-1a, then move on to file 1-2, and so on. The pattern stays the same for all 64 files.
I tried variations with find and -exec grep 'pattern' "{} > {}a" but it seems like I can't use {} twice on one line and still have it interpreted as a variable. I'm also open to suggestions using awk or similar.
Should be easy with awk
awk '/pattern/{print > (FILENAME "a")}' *
In order to use find, replace * with the results of a find command:
awk '/pattern/{print > (FILENAME "a")}' $(find ...)
... but do that only if the file names do not contain special characters (e.g. spaces) and the list doesn't grow too long. Use a loop in these cases.
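A minimal sketch of such a loop, safe with spaces in file names (assumes bash and a find that supports -print0; adjust the -name pattern to your files):
find . -type f -name '*[0-9]-*[0-9]' -print0 |
while IFS= read -r -d '' f; do
    grep 'pattern' "$f" > "${f}a"
done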
You can use this find command
find . -name '*[0-9]-*[0-9]' -exec bash -c '
for f; do grep "pattern" "$f" > "${f}a"; done' - {} +
Note that you can run this command multiple times and it won't keep creating files like 13-3aa or 13-3aaa, because the generated names end in a and therefore no longer match the '*[0-9]-*[0-9]' pattern.

Sort files based on content

I have around 1000 files from a phylogenetic analysis and each file looks something like this
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to that content. The numbers represent branch lengths and will not be the same in any two files, so I would like to classify the files based only on the letters A to H: all files in which the letters A to H are arranged in the same order should be sorted into the same folder. For example:
For the pattern in File1, the pattern will be something like this ignoring the numbers(branch length):
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know to sort contents based on a particular pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.
You can extract the content patterns and sort on them:
$ for f in file{1..2};
do printf "%s\t" $f; tr -d '[ 0-9.]' <$f;
done |
sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Files with the same pattern will appear on consecutive lines. This assumes you have one record (one tree) per file.
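If you then want to physically move files with the same pattern into separate folders, one possible follow-up is sketched below. It is not part of the answer above: the folder name is derived from an md5 hash of the pattern, and it assumes GNU md5sum (use md5 -q on macOS) and files named file*:
for f in file*; do
    pattern=$(tr -d ' 0-9.' <"$f")                            # strip branch lengths, keep the topology
    dir=$(printf '%s' "$pattern" | md5sum | cut -d' ' -f1)    # one folder per distinct pattern
    mkdir -p "$dir"
    mv -- "$f" "$dir"/
done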

OS X, how to merge bulk csv files into an excel worksheet?

On OS X, I have 50,000 CSV files in a folder. How can I merge a specified range from all of these CSV files into one Excel worksheet?
P.S. All of the CSV files have the same form: each has two columns, and what I want is the middle part of the second column, B45:B145. In the new Excel worksheet I want the data from each CSV file pasted next to each other, so that the result is in a single worksheet.
Thanks for the following suggestions. I have found a solution to this problem.
find . -name \*.csv -print0 | xargs -0 -L 256 awk -F, 'FNR>=45 && FNR<=145{print $2}' > BigBoy.csv
Please try the following command with a single CSV file to see if it extracts the fields you want:
awk -F, 'FNR>=45 && FNR<=145{print $2}' AnySingleFile.csv
It prints the second field ($2) of all lines whose line number is greater than or equal to 45 and less than or equal to 145. The -F, sets the field separator to a comma.
If that works, the next command to try would be this, but I doubt it will work with 50,000 files:
awk -F, 'FNR>=45 && FNR<=145{print $2}' *.csv > BigBoy.csv
So, I would suggest you use find and xargs to process, say, 256 files at a time:
find . -name \*.csv -print0 | xargs -0 -L 256 awk -F, 'FNR>=45 && FNR<=145{print $2}' > BigBoy.csv
That command works like this: "Find all files ending in .csv and print their names separated by NUL characters, passing that list to xargs. xargs then splits the list and passes 256 files at a time to awk, which does exactly what the initial awk did."
The idea of passing 256 files to awk is to save having to execute a new process for every single one of your 50,000 CSV files. You may get away with a bigger number, depending on the length of your filenames. See Note 2 at the end.
Your results should be in BigBoy.csv.
Note 1: If your CSV files have many hundreds of lines, you will get a performance increase if you stop reading each file once line 145 has been seen. Because awk is handed several files per invocation here, use nextfile (supported by gawk, mawk and the BSD/macOS awk) rather than exit, which would terminate awk after the first file:
'FNR>=45 && FNR<=145{print $2} FNR==145{nextfile}'
Note 2: The limit (in characters) for the length of the arguments passed to awk (and any other program) can be found with:
sysctl kern.argmax
and on OSX, it is 262,144 characters. So if your CSV filenames are 8-10 characters long on average, you could probably pass over 26,000 filenames. If they are 260 characters long on average, you should not pass more than around 1,000 filenames.
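If you really do want the columns from each file pasted next to each other rather than stacked, one approach is to extract each file's slice into its own temporary file and then paste them side by side. This is only a sketch for a modest number of files; with all 50,000 you would again run into argument-length and open-file limits:
mkdir -p /tmp/cols
for f in *.csv; do
    awk -F, 'FNR>=45 && FNR<=145{print $2}' "$f" > "/tmp/cols/${f%.csv}.col"
done
paste -d, /tmp/cols/*.col > BigBoy.csv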

Diff files in two folders ignoring the first line

I have two folders of files that I want to diff, except I want to ignore the first line in all the files. I tried
diff -Nr <(tail -n +1 folder1/) <(tail -n +1 folder2/)
but that clearly isn't the right way.
If the first lines that you want to ignore have a distinctive format that can be matched by a POSIX regular expression, then you can use diff's --ignore-matching-lines=... option to tell it to ignore those lines.
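For example, if every file started with a known CSV header, something like this would do (Name,Date,Value is a made-up header; diff only ignores a hunk if every changed line in it matches the regular expression):
diff -Nr --ignore-matching-lines='^Name,Date,Value$' folder1 folder2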
Failing that, the approach you want to take probably depends on your exact requirements. You say you "want to diff" the files, but it's not obvious exactly how faithfully your resulting output needs to match what you would get from diff -Nr if it supported that feature. (For example, do you need the line numbers in the diff to correctly identify the line numbers in the original files?)
The most precisely faithful approach would probably be as follows:
Copy each directory to a fresh location, using cp --recursive ....
Edit the first line of each file to prepend a magic string like IGNORE_THIS_LINE::, using something like find -type f -exec sed -i '1 s/^/IGNORE_THIS_LINE::/' '{}' ';'.
Use diff -Nr --ignore-matching-lines=^IGNORE_THIS_LINE:: ... to compare the results.
Pipe the output to sed s/IGNORE_THIS_LINE:://, so as to filter out any occurrences of IGNORE_THIS_LINE:: that still show up (due to being within a few lines of non-ignored differences).
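Put together, that recipe might look like the following sketch, assuming GNU cp and sed and two throw-away copies under /tmp:
cp --recursive folder1 /tmp/folder1 && cp --recursive folder2 /tmp/folder2
find /tmp/folder1 /tmp/folder2 -type f -exec sed -i '1 s/^/IGNORE_THIS_LINE::/' '{}' ';'
diff -Nr --ignore-matching-lines=^IGNORE_THIS_LINE:: /tmp/folder1 /tmp/folder2 |
    sed 's/IGNORE_THIS_LINE:://'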
Using process substitution is the correct way to create intermediate input file descriptors, but tail doesn't work on folders. Just iterate over the files in the folder:
for f in folder1/*.txt; do
    tail -n +2 "$f" | diff - <(tail -n +2 "folder2/$(basename "$f")")
done
Note that I used +2 instead of +1: tail -n +N prints a file starting from line N, and line numbering starts at 1, not 0.
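If the folders contain more than just *.txt files, a slightly more general sketch loops over every regular file in folder1 (it still will not report files that exist only in folder2):
for f in folder1/*; do
    [ -f "$f" ] || continue
    diff <(tail -n +2 "$f") <(tail -n +2 "folder2/$(basename "$f")")
done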

combining grep and find to search for file names from query file

I've found many similar examples but cannot find an example to do the following. I have a query file with file names (file1, file2, file3, etc.) and would like to find these files in a directory tree; these files may appear more than once in the dir tree, so I'm looking for the full path. This option works well:
find path/to/files/*/* -type f | grep -E "file1|file2|file3|fileN"
What I would like is to pass grep a file with filenames, e.g. with the -f option, but I have not been successful. Many thanks for your insight.
This is what the query file looks like: one column of filenames, one per line:
103128_seqs.fna
7010_seqs.fna
7049_seqs.fna
7059_seqs.fna
7077A_seqs.fna
7079_seqs.fna
grep -f FILE gets the patterns to match from FILE, one per line:
cat files_to_find.txt
n100079_seqs.fna
103128_seqs.fna
7010_seqs.fna
7049_seqs.fna
7059_seqs.fna
7077A_seqs.fna
7079_seqs.fna
Remove any whitespace (or do it manually):
perl -i -nle 'tr/ //d; print if length' files_to_find.txt
Create some files to test:
touch `cat files_to_find.txt`
Use it:
find ~/* -type f | grep -f files_to_find.txt
output:
/home/user/tmp/7010_seqs.fna
/home/user/tmp/103128_seqs.fna
/home/user/tmp/7049_seqs.fna
/home/user/tmp/7059_seqs.fna
/home/user/tmp/7077A_seqs.fna
/home/user/tmp/7079_seqs.fna
/home/user/tmp/n100079_seqs.fna
Is this what you want?
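One caveat with grep -f here: the entries are treated as regular expressions, so the dots in names like 7010_seqs.fna match any character, and a name that is a substring of another will also match. Adding -F makes grep treat them as fixed strings, which is usually what you want for filenames:
find ~/ -type f | grep -F -f files_to_find.txt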
