How can I combine a set of text files, leaving off the first line of each? - bash

As part of a normal workflow, I receive sets of text files, each containing a header row. It's more convenient for me to work with these as a single file, but if I cat them naively, the header rows in files after the first cause problems.
The files tend to be large enough (10³–10⁵ lines, 5–50 MB) and numerous enough that it's awkward and/or tedious to do this in an editor or step-by-step, e.g.:
$ wc -l *
20251 1.csv
124520 2.csv
31158 3.csv
175929 total
$ tail -n 20250 1.csv > 1.tmp
$ tail -n 124519 2.csv > 2.tmp
$ tail -n 31157 3.csv > 3.tmp
$ cat *.tmp > combined.csv
$ wc -l combined.csv
175926 combined.csv
It seems like this should be doable in one line. I've isolated the arguments that I need but I'm having trouble figuring out how to match them up with tail and subtract 1 from the line total (I'm not comfortable with awk):
$ wc -l * | grep -v "total" | xargs -n 2
20251 foo.csv
124520 bar.csv
31158 baz.csv
87457 zappa.csv
7310 bingo.csv
29968 niner.csv
2086 hella.csv
$ wc -l * | grep -v "total" | xargs -n 2 | tail -n
tail: option requires an argument -- n
Try 'tail --help' for more information.
xargs: echo: terminated by signal 13

You don't need to use wc -l to calculate the number of lines to output; tail can skip the first line (or the first K lines), just by adding a + symbol when using the -n (or --lines) option, as described in the man page:
-n, --lines=K output the last K lines, instead of the last 10;
or use -n +K to output starting with the Kth
This makes combining all files in a directory without the first line of each file as simple as:
$ tail -q -n +2 * > combined.csv
$ wc -l *
20251 foo.csv
124520 bar.csv
31158 baz.csv
87457 zappa.csv
7310 bingo.csv
29968 niner.csv
2086 hella.csv
302743 combined.csv
605493 total
The -q flag suppresses the ==> file <== headers that tail prints whenever it is given more than one file, as the glob does here.
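For reference, this is what those headers look like without -q (a sketch reusing the file names from the example above; the ... stands for file contents):
$ tail -n +2 foo.csv bar.csv
==> foo.csv <==
...
==> bar.csv <==
...
Those ==> ... <== lines would otherwise end up inside combined.csv.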

Both tail and sed answers work fine.
For the sake of an alternative here is an awk command that does the same job:
awk 'FNR > 1' *.csv > combined.csv
The FNR > 1 condition skips the first row of each file: FNR is the line number within the current file, and unlike NR it resets to 1 at the start of every input file.
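A quick way to see the difference (a hypothetical run over two 2-line files a.csv and b.csv):
$ awk '{print FILENAME, NR, FNR}' a.csv b.csv
a.csv 1 1
a.csv 2 2
b.csv 3 1
b.csv 4 2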

With GNU sed:
sed -ns '2,$p' 1.csv 2.csv 3.csv > combined.csv
or
sed -ns '2,$p' *.csv > combined.csv

Another sed alternative
sed -s 1d *.csv
This deletes the first line from each input file; without -s, sed treats all the inputs as one continuous stream and deletes only the very first line.

Related

How can I send the output of the last pipe to two different commands?

So, I have a text file with a bunch of numbers, one number per line to be specific, so I do:
cat filename.txt|sort -n|head -1 to get the top number, and I can do cat filename.txt|sort -n|tail -1 to get the bottom number.
Just to be sure: is there a way to send cat filename.txt|sort -n and its output to two different commands in one line, and have the output (the highest number and the lowest number) next to each other?
You can do interesting things with tee and process substitutions, but the order of the output may not be stable (due to timing of processes)
sort -n filename.txt | tee >(tail -1 >/dev/tty) | head -1
In this case, I'd use sed to print the first and last line:
sort -n filename.txt | sed -n '1p; $p'
As @chepner suggests
... | sed -n '1p; $p' | paste - - # tab separated
or
... | awk 'NR == 1 {first = $0} END {print first, $0}' # space separated
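Putting the sed and paste version together on concrete data (a hypothetical filename.txt containing 5, 3, 9, 1 and 7, one per line):
$ sort -n filename.txt | sed -n '1p; $p' | paste - -
1	9
The two values come out tab separated, smallest first.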
There is also the useful command tee: tee second.txt copies its input to second.txt as well as to stdout.
You can combine that with bash process substitution, e.g. tee >(wc),
so you can feed 2 or more commands, e.g. tee >(wc) | head

Delete the last 3 lines of my txt with bash? [duplicate]

I want to remove some n lines from the end of a file. Can this be done using sed?
For example, to remove lines from 2 to 4, I can use
$ sed '2,4d' file
But I don't know the line numbers. I can delete the last line using
$ sed '$d' file
but I want to know the way to remove n lines from the end. Please let me know how to do that using sed or some other method.
I don't know about sed, but it can be done with head:
head -n -2 myfile.txt
If hardcoding n is an option, you can use sequential calls to sed. For instance, to delete the last three lines, delete the last one line thrice:
sed '$d' file | sed '$d' | sed '$d'
From the sed one-liners:
# delete the last 10 lines of a file
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2
Seems to be what you are looking for.
A funny & simple sed and tac solution:
n=4
tac file.txt | sed "1,$n{d}" | tac
NOTE
double quotes " are needed for the shell to evaluate the $n variable in the sed command. In single quotes, no interpolation would be performed.
tac is a cat reversed, see man 1 tac
the {} in sed are there to separate $n & d (without them, the shell would try to interpolate a non-existent $nd variable)
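A quick demonstration on a hypothetical 6-line file:
$ seq 6 > file.txt
$ n=4
$ tac file.txt | sed "1,$n{d}" | tac
1
2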
Use sed, but let the shell do the math, with the goal being to use the d command by giving a range (to remove the last 23 lines):
sed -i "$(($(wc -l < file)-22)),\$d" file
To remove the last 23 lines, working from the inside out:
$(wc -l < file)
Gives the number of lines of the file: say 2196
We want to remove the last 23 lines, so for left side or range:
$((2196-22))
Gives: 2174
Thus the original sed after shell interpretation is:
sed -i '2174,$d' file
With -i doing an in-place edit, file is now 2173 lines!
If you want to save the result into a new file instead, drop the -i:
sed '2174,$d' file > outputfile
You could use head for this.
Use
$ head --lines=-N file > new_file
where N is the number of lines you want to remove from the file.
The contents of the original file minus the last N lines are now in new_file
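For example (a hypothetical run; the leading - in -N requires GNU head, as a later answer notes):
$ seq 5 > file
$ head --lines=-2 file > new_file
$ cat new_file
1
2
3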
Just for completeness I would like to add my solution.
I ended up doing this with the standard ed:
ed -s sometextfile <<< $'-2,$d\nwq'
This deletes the last 2 lines using in-place editing (although it does use a temporary file in /tmp !!)
To truncate very large files truly in-place we have truncate command.
It doesn't know about lines, but tail + wc can convert lines to bytes:
file=bigone.log
lines=3
truncate -s -$(tail -$lines $file | wc -c) $file
There is an obvious race condition if the file is written at the same time.
In this case it may be better to use head - it counts bytes from the beginning of file (mind disk IO), so we will always truncate on line boundary (possibly more lines than expected if file is actively written):
truncate -s $(head -n -$lines $file | wc -c) $file
Handy one-liner if you failed a login attempt by typing your password in place of the username:
truncate -s $(head -n -5 /var/log/secure | wc -c) /var/log/secure
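A small self-check of the byte arithmetic on a hypothetical file:
$ printf '%s\n' 1 2 3 4 5 > bigone.log
$ lines=3
$ truncate -s -$(tail -$lines bigone.log | wc -c) bigone.log
$ cat bigone.log
1
2
tail grabs the last 3 lines (6 bytes here, newlines included) and truncate -s - shrinks the file by exactly that many bytes.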
This might work for you (GNU sed):
sed ':a;$!N;1,4ba;P;$d;D' file
Most of the above answers seem to require GNU commands/extensions:
$ head -n -2 myfile.txt
-2: Badly formed number
For a slightly more portable solution:
perl -ne 'push(@fifo,$_);print shift(@fifo) if @fifo > 10;'
OR
perl -ne 'push(@buf,$_);END{print @buf[0 .. $#buf-10]}'
OR
awk '{buf[NR-1]=$0;}END{ for ( i=0; i < (NR-10); i++){ print buf[i];} }'
Where "10" is "n".
With the answers here you'd have already learnt that sed is not the best tool for this application.
However, I do think there is a way to do this using sed; the idea is to append N lines to hold space until you are able to read without hitting EOF. When EOF is hit, print the contents of hold space and quit.
sed -e '$!{N;N;N;N;N;N;H;}' -e x
The sed command above will omit the last 5 lines.
It can be done in 3 steps:
a) Count the number of lines in the file you want to edit:
n=`cat myfile |wc -l`
b) Subtract from that number one less than the number of lines to delete (the d range below is inclusive, so to drop the last 3 lines we subtract 2):
x=$((n-2))
c) Tell sed to delete from that line number ($x) to the end:
sed "$x,\$d" myfile
You can get the total count of lines with wc -l <file> and use
head -n <total lines - lines to remove> <file>
Try the following command, where n is the number of lines to remove (tail -r, which reverses a file, is a BSD option):
tail -r file_name | sed "1,${n}d" | tail -r
This will remove the last 3 lines from file:
for i in $(seq 1 3); do sed -i '$d' file; done;
I prefer this solution;
head -$(gcalctool -s $(cat file | wc -l)-N) file
where N is the number of lines to remove.
sed -n ':pre
1,4 {N;b pre
}
:cycle
$!{P;N;D;b cycle
}' YourFile
This is a POSIX version.
To delete last 4 lines:
$ nl -b a file | sort -k1,1nr | sed '1, 4 d' | sort -k1,1n | sed 's/^ *[0-9]*\t//'
I came up with this, where n is the number of lines you want to delete:
count=`wc -l < file`
lines=`expr $count - n`
head -n "$lines" file > temp.txt
mv temp.txt file
It's a little roundabout, but I think it's easy to follow.
Count up the number of lines in the main file (reading from stdin, wc -l < file, keeps the filename out of the count)
Subtract the number of lines you want to remove from the count
Print out the number of lines you want to keep and store in a temp file
Replace the main file with the temp file (mv removes the temp copy for you, so no separate cleanup is needed)
A function wrapping these steps is sketched below.
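A minimal sketch of those steps as a reusable function (trim_last is a hypothetical name):
# usage: trim_last N file  -- delete the last N lines of file, in place
trim_last() {
  keep=$(( $(wc -l < "$2") - $1 ))
  head -n "$keep" "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}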
For deleting the last N lines of a file, you can use the same concept of
$ sed '2,4d' file
You can use a combo with the tail command to reverse the file: if N is 5,
$ tail -r file | sed '1,5d' | tail -r > new_file
Redirect to a new file: writing back onto file itself would truncate it before tail gets to read it. This way also runs where the head -n -5 file command doesn't (like on a mac!).
#!/bin/sh
echo 'Enter the file name : '
read filename
echo 'Enter the number of lines from the end that needs to be deleted :'
read n
#Subtracting from the line number to get the nth line
m=`expr $n - 1`
# Calculate length of the file
len=`cat $filename|wc -l`
#Calculate the lines that must remain
lennew=`expr $len - $m`
sed "$lennew,$ d" $filename
A solution similar to https://stackoverflow.com/a/24298204/1221137 but with editing in place and not hardcoded number of lines:
n=4
seq $n | xargs -i sed -i -e '$d' my_file
In docker, this worked for me (go through a temporary file; redirecting straight back to file_path would truncate it before head reads it):
head --lines=-N file_path > tmp_path && mv tmp_path file_path
Say you have several lines:
$ cat <<EOF > 20lines.txt
> 1
> 2
> 3
[snip]
> 18
> 19
> 20
> EOF
Then you can grab:
# leave last 15 out
$ head -n5 20lines.txt
1
2
3
4
5
# skip first 14
$ tail -n +15 20lines.txt
15
16
17
18
19
20
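You can also combine the two to grab a range from the middle, e.g. lines 10 through 14 of the same file:
$ tail -n +10 20lines.txt | head -n5
10
11
12
13
14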
POSIX compliant solution using ex / vi, in the vein of @Michel's solution above.
@Michel's ed example uses "not-POSIX" Here-Strings.
Increase the offset in $-1 to remove more lines counting back from EOF ($), or just feed in the lines you want to (d)elete. You could use ex to count line numbers or do any other Unix stuff.
Given the file:
cat > sometextfile <<EOF
one
two
three
four
five
EOF
Executing:
ex -s sometextfile <<'EOF'
$-1,$d
%p
wq!
EOF
Returns:
one
two
three
This uses POSIX Here-Docs so it is really easy to modify - especially using set -o vi with a POSIX /bin/sh.
While on the subject, the "ex personality" of "vim" should be fine, but YMMV.
This will remove the last 10 lines (it is method 2 of the sed one-liners quoted earlier):
sed -n -e :a -e '1,10!{P;N;D;};N;ba'

Concatenating text files in bash

I have many text files, each containing a single float value on one line, all in one folder, and I would like to concatenate them in bash in order: file_1.txt, file_2.txt ... file_N.txt. I would like to have them in one txt file, in order from 1 to N. Could someone please help me? Here is the code I have, but it just concatenates them in a seemingly random manner. Thank you
for file in *.txt
do
cat ${file} >> output.txt
done
As much as I recommend against parsing the output of ls, here we go.
ls has a "version sort" option that will sort numbered files like you want. See below for a demo.
To concatenate, you want:
ls -v file*.txt | xargs cat > output
$ touch file{1..20}.txt
$ ls
file1.txt file12.txt file15.txt file18.txt file20.txt file5.txt file8.txt
file10.txt file13.txt file16.txt file19.txt file3.txt file6.txt file9.txt
file11.txt file14.txt file17.txt file2.txt file4.txt file7.txt
$ ls -1
file1.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file2.txt
file20.txt
file3.txt
file4.txt
file5.txt
file6.txt
file7.txt
file8.txt
file9.txt
$ ls -1v
file1.txt
file2.txt
file3.txt
file4.txt
file5.txt
file6.txt
file7.txt
file8.txt
file9.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file20.txt
for file in *.txt
do
cat ${file} >> output.txt
done
This works for me, as does:
for file in *.txt
do
cat $file >> output.txt
done
You don't need the {}.
But simpler still is:
cat file*.txt > output.txt
So if you have more than 9 files, as suggested in the comment, you can do one of the following:
files=$(ls file*txt | sort -t"_" -k2g)
files=$(find . -name "file*txt" | sort -t "_" -k2g)
files=$(printf "%s\n" file_*.txt | sort -k1.6n) # Thanks to glenn jackman
and then:
cat $files
or
cat $(find . -name "file*txt" | sort -t "_" -k2g)
Best is still to number your files correctly: file_01.txt if you have fewer than 100 files, file_001.txt if fewer than 1000, and so on.
example :
ls file*txt
file_1.txt file_2.txt file_3.txt file_4.txt file_5.txt file_10.txt
They contain only their corresponding number.
$ cat $files
1
2
3
4
5
10
Use this:
find . -type f -name "file*.txt" | sort -V | xargs cat -- >final_file
If the files are numbered, sorting doesn't happen in the natural way that we humans expect. For that to happen, you have to use the -V option with the sort command.
As others have pointed out, if you have files file_1, file_2, file_3... file_123283, the shell's glob ordering will put file_11 before file_2, because the names are sorted as text and not numerically.
You can use sort to get the order you want. Assuming that your files are file_#...
cat $(ls -1 file_* | sort -t_ -k2,2n)
The ls -1 lists your files out on one per line.
sort -t_ says to break the sorting fields down by underscores. This makes the second sorting field the numeric part of the file name.
-k2,2n says to sort by the second field numerically.
Then, you concatenate out all of the files together.
One issue is that you may end up filling up your command line buffer if you have a whole lot of files. Before cat can get the file names, the $(...) must first be expanded.
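If that's a concern, one workaround (assuming the file names contain no whitespace) is to let xargs batch the arguments instead of expanding them all onto one command line:
ls -1 file_* | sort -t_ -k2,2n | xargs cat > output
The redirection applies to xargs as a whole, so even if cat is invoked several times, all output still lands in order in the one file.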
This works for me...
for i in $(seq 0 $N); do [[ -f file_$i.txt ]] && cat file_$i.txt; done > newfile
Or, more concisely
for i in $(seq 0 $N); do cat file_$i.txt 2> /dev/null ;done > newfile
You can use ls for listing files:
for file in `ls *.txt`
do
cat ${file} >> output
done
Some sort techniques are discussed here: Unix's 'ls' sort by name
Glenn Jackman 's answer is a simple solution for GNU/Linux systems.
David W. 's answer is a portable alternative.
Both solutions work well for the specific case at hand, but not generally in that they'll break with filenames with embedded spaces or other metacharacters (characters that, when used unquoted, have special meaning to the shell).
Here are solutions that work with filenames with embedded spaces, etc.:
Preferable solution for systems where sort -z and xargs -0 are supported (e.g., Linux, OSX, *BSD):
printf "%s\0" file_*.txt | sort -z -t_ -k2,2n | xargs -0 cat > out.txt
Uses NUL (null character, 0x0) to separate the filenames and so safely preserves their boundaries.
This is the most robust solution, because it even handles filename with embedded newlines correctly (although such filenames are very rare in practice). Unfortunately, sort -z and xargs -0 are not POSIX-compliant.
POSIX-compliant solution, using xargs -I:
printf "%s\n" file_*.txt | sort -t_ -k2,2n | xargs -I % cat % > out.txt
Processing is line-based, and due to use of -I, cat is invoked once per input filename, making this method slower than the one above.
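A quick sanity check of the POSIX variant with names containing spaces (hypothetical files):
$ printf '1\n' > 'file_1 a.txt'; printf '2\n' > 'file_2 b.txt'
$ printf "%s\n" file_*.txt | sort -t_ -k2,2n | xargs -I % cat %
1
2
With -I, xargs treats each input line as one argument, so the embedded spaces survive.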

Command to get nth line of STDOUT

Is there any bash command that will let you get the nth line of STDOUT?
That is to say, something that would take this
$ ls -l
-rw-r--r--@ 1 root wheel my.txt
-rw-r--r--@ 1 root wheel files.txt
-rw-r--r--@ 1 root wheel here.txt
and do something like
$ ls -l | magic-command 2
-rw-r--r--@ 1 root wheel files.txt
I realize this would be bad practice when writing scripts meant to be reused, BUT when working with the shell day to day it'd be useful to me to be able to filter my STDOUT in such a way.
I also realize this would be semi-trivial command to write (buffer STDOUT, return a specific line), but I want to know if there's some standard shell command to do this that would be available without me dropping a script into place.
Using sed, just for variety:
ls -l | sed -n 2p
This alternative looks more efficient, since it stops reading the input once the required line is printed; however, it may generate a SIGPIPE in the feeding process, which may in turn generate an unwanted error message:
ls -l | sed -n -e '2{p;q}'
I've seen that often enough that I usually use the first (which is easier to type, anyway), though ls is not a command that complains when it gets SIGPIPE.
For a range of lines:
ls -l | sed -n 2,4p
For several ranges of lines:
ls -l | sed -n -e 2,4p -e 20,30p
ls -l | sed -n -e '2,4p;20,30p'
ls -l | head -2 | tail -1
Alternative to the nice head / tail way:
ls -al | awk 'NR==2'
or
ls -al | sed -n '2p'
From sed1line:
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
From awk1line:
# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files
For the sake of completeness ;-)
shorter code
find / | awk NR==3
shorter life
find / | awk 'NR==3 {print $0; exit}'
Try this sed version:
ls -l | sed '2 ! d'
It says "delete all the lines that aren't the second one".
You can use awk:
ls -l | awk 'NR==2'
Update
The above code will not get what we want because of an off-by-one error: the first line of ls -l output is the total line. The following revised code will work:
ls -l | awk 'NR==3'
Another poster suggested
ls -l | head -2 | tail -1
but if you pipe head into tail, it looks like everything up to line N is processed twice.
Piping tail into head
ls -l | tail -n +2 | head -n1
would be more efficient?
Is Perl easily available to you?
$ perl -n -e 'if ($. == 7) { print; exit(0); }'
Obviously substitute whatever number you want for 7.
Yes, the most efficient way (as already pointed out by Jonathan Leffler) is to use sed with print & quit:
set -o pipefail # cf. help set
time -p ls -l | sed -n -e '2{p;q;}' # only print the second line & quit (on Mac OS X)
echo "$?: ${PIPESTATUS[*]}" # cf. man bash | less -p 'PIPESTATUS'
Hmm
sed did not work in my case.
I propose:
for "odd" lines 1,3,5,7... ls |awk '0 == (NR+1) % 2'
for "even" lines 2,4,6,8 ls |awk '0 == (NR) % 2'
For more completeness..
ls -l | (for ((x=0;x<2;x++)) ; do read ; done ; head -n1)
Throw away lines until you get to the second, then print out the first line after that. So, it prints the 3rd line.
If it's just the second line..
ls -l | (read; head -n1)
Put as many 'read's as necessary.
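If you want an actual magic-command to keep around, a minimal sketch is a shell function built on the print-and-quit sed form from above (nth is a hypothetical name):
nth() { sed -n "${1}{p;q;}"; }
$ ls -l | nth 2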

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in each file and write to an output file, with part of the corresponding input file's name as part of the output line? So, for example, if we were looking at file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike other answers, I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids the case of filtering out files named total*.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l * | grep -v " total"
sends
28 Yellow
You can reverse the two columns if you want (with awk, provided you don't have spaces in the file names):
wc -l * | egrep -v " total$" | sed s/[prefix]// | awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file
'wc -l' for finding out the line count
And don't forget to redirect ('>' or '>>') your results to your output file. A minimal sketch along these lines follows below.
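A minimal sketch of such a script, assuming the LOG_ prefix from the question (the output path is hypothetical):
for f in LOG_*; do
  printf '%s\t%s\n' "${f#LOG_}" "$(wc -l < "$f")"
done > output.txt
${f#LOG_} strips the prefix, and wc -l < "$f" keeps the filename out of wc's output, so no grep/sed cleanup is needed.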
