Remove half the lines of files recursively using sed and wc - bash

I wonder if there is a way to remove half the lines of a file using wc and sed.
I can do this:
sed -i '50,$d' myfile.txt
Which removes lines from 50 to the end of file. I can also do this:
wc -l myfile.txt
Which returns the number of lines in the file.
But what I really want to do is something like this:
wc -l myfile.txt | sed -i '{wc -l result}/2,$d' myfile.txt
How can I tell sed to remove the lines starting from the wc -l result divided by 2?
How can I do this recursively?

You can use the head command (the negative line count requires GNU head):
head -n -"$(($(wc -l<file)/2))" file
awk also works, and because of the exit statement it can be faster, since it stops reading at the midpoint:
awk -v t="$(wc -l <file)" 'NR>=t/2{exit}7' file
To save the result, redirect the output to a new file (awk/head ... > newFile), or use cmd > tmp && mv tmp file to get an "in-place" change.

I guess you were close. Just use a command substitution with arithmetic expansion to get the value of the starting line:
startline="$(( $(wc -l <myfile.txt) / 2 ))"
sed -i "$startline"',$d' myfile.txt
Or a oneliner:
sed -i "$(( $(wc -l <myfile.txt) / 2 ))"',$d' myfile.txt

In some sense, reading the file twice (or, as noted by Kent, once and a half) is unavoidable. Perhaps it will be slightly more efficient if you use just a single process.
awk 'NR==FNR { total=FNR; next }
FNR>total/2 { exit } 1' myfile.txt myfile.txt
Doing this recursively with Awk is slightly painful. If you have GNU Awk, you can use the -i inplace option, but I'm not sure of its semantics when you read the same file twice. Perhaps just fall back to a temporary output file.
find . -type f -exec sh -c 'awk "NR==FNR { total=FNR; next }
FNR>total/2 { exit } 1" "$1" "$1" >"$1.tmp" &&
mv "$1.tmp" "$1"' _ {} \;
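The pieces above can be combined into one helper. This is a minimal sketch, not taken from any of the answers: the name halve_file and the use of mktemp are assumptions made for illustration. It keeps the first half of the lines, rounded down (the sed command above also deletes line N/2 itself, so it keeps one line fewer), and writes the result back with cat so that the file keeps its permissions.

```shell
# halve_file FILE
# Hypothetical helper: keep the first floor(N/2) lines of FILE, drop the rest.
halve_file() {
    file=$1
    total=$(wc -l < "$file") || return 1
    keep=$(( total / 2 ))
    tmp=$(mktemp) || return 1
    # awk stops reading as soon as it passes the midpoint
    if awk -v keep="$keep" 'NR > keep { exit } 1' "$file" > "$tmp" &&
       cat "$tmp" > "$file"
    then
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Recursive use (bash; assumes a find that supports -print0):
#   find . -type f -name '*.txt' -print0 |
#       while IFS= read -r -d '' f; do halve_file "$f"; done
```

Because the temporary file lives outside the tree being searched, find never picks it up as an input file.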

Related

Estimate number of lines in a file and insert that value as first line

I have many files for which I have to estimate the number of lines in each file and add that value as first line. To estimate that, I used something like this:
wc -l 000600.txt | awk '{ print $1 }'
However, no success on how to do it for all files and then to add the value corresponding to each file as first line.
An example:
a.txt b.txt c.txt
>>print a
15
>> print b
22
>>print c
56
Then 15, 22 and 56 should be added respectively to: a.txt b.txt and c.txt
I appreciate the help.
You can put a placeholder such as LINENUM in the first line of the file and then use the following command:
wc -l a.txt | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i 's/LINENUM/LINENUM:{}/' a.txt
Or insert the count as a new first line (GNU sed):
wc -l a.txt | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' a.txt
The loop below adds the line count as the first line of every *.txt file in the current directory. For large files, the group command is also likely to be faster than in-place editing commands. Keep the spaces and semicolons inside the braces exactly as written.
for f in *.txt; do
{ wc -l < "$f"; cat "$f"; } > "${f}.tmp" && mv "${f}.tmp" "$f"
done
To iterate over all the files, you can use this loop:
for f in *; do if [ -f "$f" ]; then wc -l < "$f" | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' "$f"; fi; done
This might work for you (GNU sed):
sed -i '1h;1!H;$!d;=;x' file1 file2 file3 etc ...
This stores each file in memory (the hold space) and inserts the line number of the last line, which is the line count, before the file's contents.
Alternative:
sed -i ':a;$!{N;ba};=' file?
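To see what the first sed command does, it helps to run it on a throwaway file. The sketch below assumes GNU sed (for -i and for = writing into the edited file); the function name prepend_count is made up for this example.

```shell
# prepend_count FILE...
# Hypothetical wrapper around the GNU sed one-liner above.
#   1h    first line: copy it to the hold space
#   1!H   every other line: append it to the hold space
#   $!d   not the last line yet: print nothing, start the next cycle
#   =     last line: print its line number, i.e. the line count
#   x     swap in the hold space so the whole file is printed after the count
prepend_count() {
    sed -i '1h;1!H;$!d;=;x' "$@"
}

demo=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$demo"
prepend_count "$demo"
cat "$demo"    # prints 3, then alpha, beta, gamma, one per line
rm -f "$demo"
```

An empty file has no last line for sed to act on, so it stays empty rather than receiving a 0.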

In loop cat file - echo name of file - count

I am trying to write a one-line command for the following:
The folder "data" contains 570 files, named 1.txt to 570.txt, each with some lines of text.
I want to cat each file, grep for a word, and count how many times that word occurs.
For the moment I am trying to do this with a for loop:
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES |grep word ; done |wc -l
This counts correctly, but it does not show which file each count belongs to.
I would like the output to look like this:
----> 1.txt <----
210
---> 2.txt <----
15
etc, etc, etc..
How can I get this?
grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but only prints the match, in this case "word". Each line is prefixed with the filename it was found in.
uniq -c collapses the identical adjacent lines into one line per file and prefixes it with the count.
You can further format it to your needs with awk or whatever, though, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'
You can try this:
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop iterates over the files inside the folder "data".
For each file, it prints the name and the number of lines that contain "word_to_count" (grep -c outputs a count of matching lines).
Be careful: if the search word occurs more than once in a line, this solution counts that line only once.
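The difference between counting lines and counting occurrences is easy to check on a small sample. The sketch below is an illustration only: count_occurrences is a hypothetical helper, and grep -o is assumed to be available (GNU and BSD grep both have it).

```shell
# count_occurrences WORD FILE...
# Hypothetical helper: print each file name followed by the number of times
# WORD occurs in it, counting repeated occurrences on the same line.
count_occurrences() {
    word=$1
    shift
    for file in "$@"; do
        [ -f "$file" ] || continue
        printf '%s\n' "----> ${file##*/} <----"
        count=$(grep -o -- "$word" "$file" | wc -l)
        echo "$(( count ))"    # arithmetic strips any padding added by wc
    done
}

sample=$(mktemp)
printf 'word word\nother\nword\n' > "$sample"
grep -c word "$sample"             # 2: lines that contain the word
count_occurrences word "$sample"   # header line, then 3: every occurrence
rm -f "$sample"
```

For the folder in the question this could be called as count_occurrences word /home/my/data/*.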
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l

Append wc lines to filename

Title says it all. I've managed to get just the lines with this:
lines=$(wc file.txt | awk {'print $1'});
But I could use an assist appending this to the filename. Bonus points for showing me how to loop this over all the .txt files in the current directory.
find -name '*.txt' -execdir bash -c \
'mv -v "$0" "${0%.txt}_$(wc -l < "$0").txt"' {} \;
where
the bash command is executed for each (\;) matched file;
{} is replaced by the currently processed filename and passed as the first argument ($0) to the script;
${0%.txt} deletes shortest match of .txt from back of the string (see the official Bash-scripting guide);
wc -l < "$0" prints only the number of lines in the file (see answers to this question, for example)
Sample output:
'./file-a.txt' -> 'file-a_5.txt'
'./file with spaces.txt' -> 'file with spaces_8.txt'
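Before renaming anything, it can be reassuring to print the names that would be produced. The following is a sketch under that assumption; counted_name is a hypothetical helper, not part of the answer above, and it only uses the same ${...%.txt} expansion and wc -l redirection.

```shell
# counted_name FILE
# Hypothetical helper: print FILE's name with its line count inserted
# before the .txt suffix, without renaming anything.
counted_name() {
    lines=$(wc -l < "$1") || return 1
    printf '%s_%d.txt\n' "${1%.txt}" "$(( lines ))"
}

# Dry run over the current directory: show what would be renamed.
for f in ./*.txt; do
    [ -f "$f" ] || continue
    printf '%s -> %s\n' "$f" "$(counted_name "$f")"
done
```

Once the list looks right, replacing the printf in the loop with mv -v "$f" "$(counted_name "$f")" performs the rename.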
You could use the rename command, which is actually a Perl script, as follows:
rename --dry-run 'my $fn=$_; open my $fh,"<$_"; while(<$fh>){}; $_=$fn; s/.txt$/-$..txt/' *txt
Sample Output
'tight_layout1.txt' would be renamed to 'tight_layout1-519.txt'
'tight_layout2.txt' would be renamed to 'tight_layout2-1122.txt'
'tight_layout3.txt' would be renamed to 'tight_layout3-921.txt'
'tight_layout4.txt' would be renamed to 'tight_layout4-1122.txt'
If you like what it says, remove the --dry-run and run again.
The script counts the lines in the file without using any external processes and then renames them as you ask, also without using any external processes, so it is quite efficient.
Or, if you are happy to invoke an external process to count the lines, and avoid the Perl method above:
rename --dry-run 's/\.txt$/-`grep -ch "^" "$_"` . ".txt"/e' *txt
Use the rename command:
for file in *.txt; do
lines=$(wc -l < "$file")
rename "s/\$/$lines/" "$file"
done
#!/bin/bash
files=$(find . -maxdepth 1 -type f -name '*.txt' -printf '%f\n')
for file in $files; do
lines=$(wc $file | awk {'print $1'});
extension="${file##*.}"
filename="${file%.*}"
mv "$file" "${filename}${lines}.${extension}"
done
You can adjust maxdepth accordingly.
you can do like this as well:
for file in "path_to_file"/'your_filename_pattern'
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
example:
for file in /oradata/SCRIPTS_EL/text*
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
This would work, but there are definitely more elegant ways.
for i in *.txt; do
mv "$i" "${i%.txt}_$(wc -l < "$i" | awk '{print $1}')_.txt"
done
The result puts the line count just before the .txt.
Like:
file1_1_.txt
file2_25_.txt
You could use grep -c '^' to get the number of lines, instead of wc and awk:
for file in *.txt; do
[[ ! -f $file ]] && continue # skip over entries that are not regular files
#
# move file.txt to file.txt.N where N is the number of lines in file
#
# this naming convention has the advantage that if we run the loop again,
# we will not reprocess the files which were processed earlier
mv "$file" "$file".$(grep -c '^' "$file")
done
{ linecount[FILENAME] = FNR }
END {
linecount[FILENAME] = FNR
for (file in linecount) {
newname = gensub(/\.[^\.]*$/, "-"linecount[file]"&", 1, file)
q = "'"; qq = "'\"'\"'"; gsub(q, qq, newname)
print "mv -i -v '" gensub(q, qq, "g", file) "' '" newname "'"
}
close(c)
}
Save the above awk script in a file, say wcmv.awk, then run it like this (gensub requires GNU awk):
awk -f wcmv.awk *.txt
It will list the commands that need to be run to rename the files in the required way (except that it will ignore empty files). To actually execute them you can pipe the output to a shell for execution as follows.
awk -f wcmv.awk *.txt | sh
As with all irreversible batch operations, be careful and execute the commands only if they look okay.
awk '
BEGIN{ for ( i=1;i<ARGC;i++ ) Files[ARGV[i]]=0 }
{Files[FILENAME]++}
END{for (file in Files) {
# if( file !~ "_" Files[file] ".txt$") {
fileF=file;gsub( /\047/, "\047\"\047\"\047", fileF)
fileT=fileF;sub( /\.txt$/, "_" Files[file] ".txt", fileT)
system( sprintf( "mv \047%s\047 \047%s\047", fileF, fileT))
# }
}
}' *.txt
This is another way with awk. It makes a second run easier to manage by giving more control over the names (for example, the commented-out test skips a file whose name already contains the count from a previous run).
Following a good remark from @gniourf_gniourf:
file names with spaces inside are now handled
the once tiny code is now heavy for such a small task

awk parse filename and add result to the end of each line

I have number of files which have similar names like
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out
etc.
I need to take the number before .csv (1 or 2) from the file name and append it to every line of the file, separated by a TAB.
I have written the code below. It finds the number I need, but I do not know how to put that number into the file. The file name contains a space, and my script breaks because of it.
I am also not sure how to pass a list of files to the script. For now I am working with only one file.
My code:
#!/bin/sh
string="DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out"
out=$(echo $string | awk 'BEGIN {FS="_"};{print substr ($7,0,1)}')
awk ' { print $0"\t$out" } ' $string
for file in *
do
sfx=$(echo "$file" | sed 's/.*_\(.*\).csv.*/\1/')
sed -i "s/$/\t$sfx/" "$file"
done
Using sed:
$ sed 's/.*_\(.*\).csv.*/&\t\1/' file
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out 1
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out 2
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out 1
To make this for many files:
sed 's/.*_\(.*\).csv.*/&\t\1/' file1 file2 file3
OR
sed 's/.*_\(.*\).csv.*/&\t\1/' file*
To save the changes in the same file (if you have GNU sed):
sed -i 's/.*\(.\).csv.*/&\t\1/' file
Untested, but this should do what you want (extract the number before .csv and append that number to the end of every line in the .out file)
awk 'FNR==1 { split(FILENAME, field, /[_.]/) }
{ print $0"\t"field[7] > (FILENAME "_aaaa") }' *.out
for file in *_aaaa; do mv "$file" "${file/_aaaa}"; done
If I understood correctly, you want to append the number from the filename to every line in that file - this should do it:
#!/bin/bash
while [[ 0 < $# ]]; do
num=$(echo "$1" | sed -r 's/.*_([0-9]+)\.csv.*/\1/' )
#awk -e "{ print \$0\"\t${num}\"; }" < "$1" > "$1.new"
#sed -r "s/$/\t$num/" < "$1" > "$1.new"
#sed -ri "s/$/\t$num/" "$1"
shift
done
Run the script and give it names of the files you want to process. $# is the number of command line arguments for the script which is decremented at the end of the loop by shift, which drops the first argument, and shifts the other ones. Extract the number from the filename and pick one of the three commented lines to do the appending: awk gives you more flexibility, first sed creates new files, second sed processes them in-place (in case you are running GNU sed, that is).
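As an illustration of the loop described above, here is one way the script could look with the awk variant chosen and written back through a temporary file. It is a sketch, not the answer's own code: the function name append_csv_number, the use of mktemp, and the decision to skip files without a number before .csv are all assumptions.

```shell
# append_csv_number FILE...
# Hypothetical helper: for each FILE, take the number that precedes ".csv"
# in its name and append it, after a tab, to every line of the file.
append_csv_number() {
    while [ "$#" -gt 0 ]; do
        # Work on the base name so underscores in directories do not interfere.
        num=$(printf '%s\n' "${1##*/}" | sed 's/.*_\([0-9][0-9]*\)\.csv.*/\1/')
        case $num in
            ''|*[!0-9]*) shift; continue ;;   # no number before .csv: skip
        esac
        tmp=$(mktemp) || return 1
        awk -v num="$num" '{ print $0 "\t" num }' "$1" > "$tmp" &&
            cat "$tmp" > "$1"
        rm -f "$tmp"
        shift    # drop the file just processed; $# decreases by one
    done
}
```

Quoting "$1" everywhere is what lets this cope with the space in names such as "... 04-01-46.out"; it could be called as append_csv_number DWH_Export*.out.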
Instead of awk, you may want to go with sed or coreutils.
Grab number from filename, with grep for variety:
num=$(<<<filename grep -Eo '[^_]+\.csv' | cut -d. -f1)
<<<filename is equivalent to echo filename.
With sed
Append num to each line with GNU sed:
sed "s/\$/\t$num/" filename
Use the -i switch to modify filename in-place.
With paste
You also need to know the length of the file for this method:
len=$(<filename wc -l)
Combine filename and num with paste:
paste filename <(seq $len | while read; do echo $num; done)
Complete example
for filename in DWH_Export*; do
num=$(echo "$filename" | grep -Eo '[^_]+\.csv' | cut -d. -f1)
sed -i "s/\$/\t$num/" "$filename"
done

grep for multiple strings in file on different lines (ie. whole file, not line based search)?

I want to grep for files containing the words Dansk, Svenska or Norsk on any line, with a usable return code (I really only need to know whether the strings are contained; my one-liner goes a little further than this).
I have many files with lines in them like this:
Disc Title: unknown
Title: 01, Length: 01:33:37.000 Chapters: 33, Cells: 31, Audio streams: 04, Subpictures: 20
Subtitle: 01, Language: ar - Arabic, Content: Undefined, Stream id: 0x20,
Subtitle: 02, Language: bg - Bulgarian, Content: Undefined, Stream id: 0x21,
Subtitle: 03, Language: cs - Czech, Content: Undefined, Stream id: 0x22,
Subtitle: 04, Language: da - Dansk, Content: Undefined, Stream id: 0x23,
Subtitle: 05, Language: de - Deutsch, Content: Undefined, Stream id: 0x24,
(...)
Here is the pseudocode of what I want:
for all files in directory;
if file contains "Dansk" AND "Norsk" AND "Svenska"
then echo the filename
end
What is the best way to do this? Can it be done on one line?
You can use:
grep -l Dansk * | xargs grep -l Norsk | xargs grep -l Svenska
If you want also to find in hidden files:
grep -l Dansk .* | xargs grep -l Norsk | xargs grep -l Svenska
Yet another way using just bash and grep:
For a single file 'test.txt':
grep -q Dansk test.txt && grep -q Norsk test.txt && grep -l Svenska test.txt
Will print test.txt iff the file contains all three (in any combination). The first two greps don't print anything (-q) and the last only prints the file if the other two have passed.
If you want to do it for every file in the directory:
for f in *; do grep -q Dansk "$f" && grep -q Norsk "$f" && grep -l Svenska "$f"; done
grep -irl word1 * | grep -il word2 `cat -` | grep -il word3 `cat -`
-i makes the search case-insensitive
-r searches recursively through folders
-l prints only the names of the files in which the word was found
cat - reads that list from standard input, so the next grep searches only the files passed to it.
You can do this really easily with ack:
ack -l 'cats' | ack -xl 'dogs'
-l: return a list of files
-x: take the files from STDIN (the previous search) and only search those files
And you can just keep piping until you get just the files you want.
How to grep for multiple strings in file on different lines (Use the pipe symbol):
for file in *;do
test $(grep -E 'Dansk|Norsk|Svenska' $file | wc -l) -ge 3 && echo $file
done
Notes:
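With -E, the | needs no escaping, whether the pattern is in single or double quotes. Without -E (basic regular expressions), GNU grep needs \| for the alternation.
Assumes that each language appears on exactly one line of the file and that no line has more than one language; three lines that all mention Dansk would also pass the test.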
Walkthrough: http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/
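Counting matching lines can give a false positive: a file with three Dansk lines and nothing else also reaches 3. One possible variation, shown here as a sketch rather than as the answer's own method, counts distinct matches instead. The name has_all_three is hypothetical, and grep -o is assumed to be available.

```shell
# has_all_three FILE
# Hypothetical helper: succeed only if FILE contains all three words,
# however many times each one is repeated.
has_all_three() {
    distinct=$(grep -Eo 'Dansk|Norsk|Svenska' "$1" | sort -u | wc -l)
    [ "$distinct" -eq 3 ]
}

for file in *; do
    [ -f "$file" ] || continue
    has_all_three "$file" && echo "$file"
done
```

grep -o prints only the matched word, so sort -u reduces the output to at most three distinct lines regardless of how often each language is listed.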
awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }'
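This prints 0 when all three words are found, so you can capture that output in the shell; to set awk's exit status instead, use END{ exit !(a && b && c) }.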
if you have Ruby(1.9+)
ruby -0777 -ne 'print if /Dansk/ and /Norsk/ and /Svenska/' file
This searches multiple words in multiple files:
egrep 'abc|xyz' file1 file2 ..filen
Simply:
grep 'word1\|word2\|word3' *
see this post for more info
This is a blending of glenn jackman's and kurumi's answers which allows an arbitrary number of regexes instead of an arbitrary number of fixed words or a fixed set of regexes.
#!/usr/bin/awk -f
# by Dennis Williamson - 2011-01-25
BEGIN {
for (i=ARGC-2; i>=1; i--) {
patterns[ARGV[i]] = 0;
delete ARGV[i];
}
}
{
for (p in patterns)
if ($0 ~ p)
matches[p] = 1
# print # the matching line could be printed
}
END {
for (p in patterns) {
if (matches[p] != 1)
exit 1
}
}
Run it like this:
./multigrep.awk Dansk Norsk Svenska 'Language: .. - A.*c' dvdfile.dat
Here's what worked well for me:
find . -path '*/.svn' -prune -o -type f -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
./another/path/to/file2.txt
./blah/foo.php
If I just wanted to find .sh files with these three, then I could have used:
find . -path '*/.svn' -prune -o -type f -name "*.sh" -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
Expanding on @kurumi's awk answer, here's a bash function:
all_word_search() {
gawk '
BEGIN {
for (i=ARGC-2; i>=1; i--) {
search_terms[ARGV[i]] = 0;
ARGV[i] = ARGV[i+1];
delete ARGV[i+1];
}
}
{
for (i=1;i<=NF; i++)
if ($i in search_terms)
search_terms[$i] = 1
}
END {
for (word in search_terms)
if (search_terms[word] == 0)
exit 1
}
' "$@"
return $?
}
Usage:
if all_word_search Dansk Norsk Svenska filename; then
echo "all words found"
else
echo "not all words found"
fi
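I did this in two steps, with help from the comments on this page and without writing a script: first make a list of the csv files in one file, then search the listed files. Just type into the terminal:
$ find /csv/file/dir -name '*.csv' > csv_list.txt
$ grep -q Svenska `cat csv_list.txt` && grep -q Norsk `cat csv_list.txt` && grep -l Dansk `cat csv_list.txt`
It did exactly what I needed: it printed the names of the files containing all three words. Note that each grep -q succeeds as soon as any listed file matches, so the final list is really the files containing Dansk; it equals the wanted result only when those files also contain the other two words.
Also mind symbols like ` ' "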
If you only need two search terms, arguably the most readable approach is to run each search and intersect the results:
comm -12 <(grep -rl word1 . | sort) <(grep -rl word2 . | sort)
If you have git installed
git grep -l --all-match --no-index -e Dansk -e Norsk -e Svenska
The --no-index option searches files in the current directory that are not managed by Git, so this command works in any directory, whether or not it is a Git repository.
I had this problem today, and all the one-liners here failed for me because the file names contained spaces.
This is what I came up with that worked:
grep -ril <WORD1> | sed 's/.*/"&"/' | xargs grep -il <WORD2>
A simple one-liner in bash for an arbitrary list LIST for file my_file.txt can be:
LIST="Dansk Norsk Svenska"
EVAL=$(echo "$LIST" | sed 's/[^ ]* */grep -q & my_file.txt \&\& /g'); eval "$EVAL echo yes || echo no"
Replacing eval with echo reveals, that the following command is evaluated:
grep -q Dansk my_file.txt && grep -q Norsk my_file.txt && grep -q Svenska my_file.txt && echo yes || echo no
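The same chain of grep -q calls can also be built with a plain loop, which avoids eval. This is a sketch of that alternative, not the method from the answer above; contains_all is a hypothetical function name.

```shell
# contains_all FILE WORD...
# Hypothetical helper: succeed only if every WORD occurs somewhere in FILE.
contains_all() {
    file=$1
    shift
    for word in "$@"; do
        # Stop at the first missing word, like the && chain above.
        grep -q -- "$word" "$file" || return 1
    done
    return 0
}

# Usage, mirroring the command above:
#   if contains_all my_file.txt Dansk Norsk Svenska; then echo yes; else echo no; fi
```

Because the words are passed as separate arguments, nothing from the list is ever re-parsed by the shell, which is the main risk with eval.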

Resources