extract columns with Bash or awk

I have a folder with 1000 text files. I would like to extract the fourth and fifth columns from each file and save them to another folder under the same filenames. How can I do this with awk or Bash?

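Since you haven't specified a field separator, I am assuming the default (whitespace). The output directory must already exist. Closing each output file when the next input file starts avoids the limit that some awk implementations place on simultaneously open files:
awk 'FNR==1{if(out)close(out); out="/tmp/another_directory/" FILENAME} {print $4,$5 > out}' *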
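A minimal sketch of the same idea wrapped in a shell function, assuming whitespace-separated columns. The function name extract_cols and its two directory arguments are hypothetical, not part of the answer above. It runs awk once per file, so it does not depend on how many output files a single awk process can keep open.

```shell
# extract_cols SRC DEST (hypothetical helper): write columns 4 and 5 of every
# regular file in SRC to a file with the same name in DEST.
extract_cols() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue    # skip directories and an unmatched glob
        awk '{ print $4, $5 }' "$f" > "$dest/$(basename "$f")"
    done
}
```

Called as extract_cols . /tmp/another_directory, it should leave the input files untouched and create one output file per input file.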

With GNU find and GNU sed, assuming space-separated columns and that the files to process are in the current directory:
mkdir tmp/
find . -maxdepth 1 -type f -print0 |
xargs -0 -n 1 /bin/sh -c 'sed -r "s/^([^ ]+ +){3}([^ ]+) +([^ ]+).*/\2 \3/" "$1" > "tmp/$1"' ''
Note that the last pair of single quotes is important: it is a placeholder for $0, so that the filename gets passed to sh as $1. The processed files are saved to tmp/ in the current directory.

Related

How to grep files in date order

I can list the Python files in a directory from most recently updated to least recently updated with
ls -lt *.py
But how can I grep those files in that order?
I understand one should never try to parse the output of ls as that is a very dangerous thing to do.

You may use this pipeline to achieve this with GNU utilities:
find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
sort -z -t : -rnk1 |
cut -z -d : -f2- |
xargs -0 grep 'pattern'
This handles filenames with special characters such as spaces, newlines, and glob characters.
find finds all *.py files in the current directory and prints modification time (epoch value) + : + filename + NUL byte
sort performs a reverse numeric sort on the first column, which is the timestamp
cut removes the first column (the timestamp) from the output
xargs -0 grep searches for the pattern in each file
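The pipeline above can be wrapped in a small function. This is a sketch that assumes GNU find, sort, cut, and xargs; the name grep_newest_first and its arguments are hypothetical. The -r flag stops xargs from running grep when no files match, and -H makes grep print the filename even when only one file is found.

```shell
# grep_newest_first PATTERN [DIR] (hypothetical helper): grep the *.py files
# directly inside DIR (default: current directory), most recently modified first.
grep_newest_first() {
    find "${2:-.}" -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
        sort -z -t : -rnk1 |
        cut -z -d : -f2- |
        xargs -0 -r grep -H -- "$1"
}
```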

There is a very simple way to get the list of files that contain the pattern, in chronological order:
grep -sil <searchpattern> <files-to-grep> | xargs ls -ltr
For example, you grep for "hello world" in *.txt. With -sil, grep is case-insensitive (-i), suppresses error messages (-s), and lists only the names of matching files (-l). The list is then passed to ls (| xargs), which sorts by modification time (-t), shows the date (-l), and reverses the order (-r) so that the oldest file comes first.

Grep to Print all file content [duplicate]

How can I make grep print the full file if its content matches the pattern, instead of printing just the matching line?
I tried using, say, grep -C2 to print two lines above and two below, but this doesn't always work because the number of lines is not fixed.
I am not just searching a single file; I am searching an entire directory where some files may contain the given pattern, and I want those files to be printed completely.
I am also running grep on the result of another grep, without printing the first grep's output.

Simple grep + cat combination:
grep 'pattern' file && cat file

Use grep's -l option to list the paths of files with matching contents, then print the contents of these files using cat.
grep -lR 'regex' 'directory' | xargs -d '\n' cat
The command from above cannot handle filenames with newlines in them.
To overcome the filename with newlines issue and also allow more sophisticated checks you can use the find command.
The following command prints the content of all regular files in directory.
find 'directory' -type f -exec cat {} +
To print only the content of files whose content matches the regexes regex1 and regex2, use
find 'directory' -type f \
-exec grep -q 'regex1' {} \; -and \
-exec grep -q 'regex2' {} \; \
-exec cat {} +
The linebreaks are only for better readability. Without the \ you can write everything into one line.
Note the -q for grep. That option suppresses grep's output. grep's exit status tells find whether to print a file or not.
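As a sketch, the two-regex find command can be parameterized as a function. The name cat_if_both and its arguments are hypothetical. The implicit "and" between find's tests means cat only receives files for which both grep calls succeeded.

```shell
# cat_if_both DIR REGEX1 REGEX2 (hypothetical helper): print the full content
# of every regular file under DIR that matches both regular expressions.
cat_if_both() {
    find "$1" -type f \
        -exec grep -q -- "$2" {} \; \
        -exec grep -q -- "$3" {} \; \
        -exec cat {} +
}
```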

Applying awk pattern to all files with same name, outputting each to a new file

I'm trying to recursively find all files with the same name in a directory, apply an awk pattern to them, and then output to the directory where each of those files lives a new updated version of the file.
I thought it was better to use a for loop than xargs, but I don't know exactly how to make this work...
for f in $(find . -name FILENAME.txt );
do awk -F"\(corr\)" '{print $1,$2,$3,$4}' ./FILENAME.txt > ./newFILENAME.txt $f;
done
Ultimately I would like to be able to remove multiple strings from the file at once using -F, but also not sure how to do that using awk.
Also is there a way to remove "(cor*)" where the * represents a wildcard? Not sure how to do while keeping with the escape sequence for the parentheses
Thanks!

To use (corr*) as a field separator where * is a glob-style wildcard, try:
awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
For example:
$ echo '1(corr)2(corrTwo)3(corrThree)4' | awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
1 2 3 4
To apply this command to every file under the current directory named FILENAME.txt, use:
find . -name FILENAME.txt -execdir sh -c 'awk -F'\''[(]corr[^)]*[)]'\'' '\''{print $1,$2,$3,$4}'\'' "$1" > ./newFILENAME.txt' Awk {} \;
Notes
Don't use:
for f in $(find . -name FILENAME.txt ); do
If any file or directory has whitespace or other shell-active characters in it, the results will be an unpleasant surprise.
Handling both parens and square brackets as field separators
Consider this test file:
$ cat file.txt
1(corr)2(corrTwo)3[some]4
To eliminate both types of separators and print the first four columns:
$ awk -F'[(]corr[^)]*[)]|[[][^]]*[]]' '{print $1,$2,$3,$4}' file.txt
1 2 3 4
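A sketch of the per-file step using find -exec with a shell loop instead of -execdir, so the output name can be derived from each input path. The name strip_corr and its arguments are hypothetical, and only the (corr...) separator is handled here. As in the commands above, only the first four fields of each line are printed.

```shell
# strip_corr ROOT NAME (hypothetical helper): for every file called NAME under
# ROOT, split each line on "(corr...)" and write the first four fields to
# "new<NAME>" in the same directory as the input file.
strip_corr() {
    find "$1" -type f -name "$2" -exec sh -c '
        for f do
            dir=$(dirname "$f")
            base=$(basename "$f")
            awk -F"[(]corr[^)]*[)]" "{ print \$1, \$2, \$3, \$4 }" "$f" > "$dir/new$base"
        done
    ' sh {} +
}
```

Because the filenames are passed to sh as positional arguments rather than through word splitting, directories with spaces in their names are handled.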

Counting number of occurrences in several files

I want to count the occurrences of, let's say, the character '[', recursively in all the files of a directory that have the same extension, e.g. *.c. I am working on Solaris (Unix).
I tried some solutions that are given in other posts, and the only one that works is this one, since on this OS I cannot use grep -o:
sed 's/[^x]//g' filename | tr -d '\012' | wc -c
Where x is the occurrence I want to count. This one works but it's not recursive, is there any way to make it recursive?

You can get a recursive listing from find and execute commands with its -exec argument.
I'd suggest something like:
find . -name '*.c' -exec cat {} \; | tr -c -d ']' | wc -c
The -c argument to tr means to use the complement of the string supplied; combined with -d, tr deletes everything but ], so wc -c counts only the ] characters.
The . in the find command means to search in the current directory, but you can supply any other directory name there as well.
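The find, tr, and wc combination can be sketched as a function that takes the directory, the filename pattern, and the character to count. The name count_char is hypothetical. The trailing tr -d ' ' only strips the padding that some wc implementations put around the number.

```shell
# count_char DIR GLOB CHAR (hypothetical helper): count how often CHAR occurs
# in all files under DIR whose names match GLOB, searching recursively.
count_char() {
    find "$1" -type f -name "$2" -exec cat {} + |
        tr -cd "$3" |
        wc -c |
        tr -d ' '
}
```

For example, count_char . '*.c' '[' prints the total for the current directory tree.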

I hope you have nawk installed. Then you can just:
nawk '{a+=gsub(/\]/,"x")}END{print a}' /path/*

You can also write a short awk snippet yourself. I suggest running the following:
awk '{n += gsub(/\[/, "&")} END {print n+0}' *.c
gsub returns the number of replacements it made, so this counts every "[" in all the .c files in the current directory and prints the total.

Bash Script which recursively makes all text in files lowercase

I'm trying to write a shell script which recursively goes through a directory, then in each file converts all Uppercase letters to lowercase ones. To be clear, I'm not trying to change the file names but the text in the files.
Considerations:
This is an old Fortran project which I am trying to make more accessible
I do not want to create a new file but rather write over the old one with the changes
There are several different file extensions in this directory, including .par .f .txt and others
What would be the best way to go about this?

To convert a file from upper case to lower case you can use ex (a good friend of ed, the standard editor):
ex -s file <<EOF
%s/[[:upper:]]\+/\L&/g
wq
EOF
or, if you like stuff on one line:
ex -s file <<< $'%s/[[:upper:]]\+/\L&/g\nwq'
Combining with find, you can then do:
find . -type f -exec bash -c "ex -s -- \"\$0\" <<< $'%s/[[:upper:]]\+/\L&/g\nwq'" {} \;
This method is 100% safe regarding spaces and funny symbols in the file names. No auxiliary files are created, copied or moved; files are only edited.
Edit.
Using glenn jackmann's suggestion, you can also write:
find . -type f -exec bash -c 'printf "%s\n" "%s/[[:upper:]]\+/\L&/g" "wq" | ex -s -- "$0"' {} \;
(the pro is that it avoids awkward escapes; the con is that it's longer).

You can translate all uppercase characters (A–Z) to lowercase (a–z) using the tr command
and specifying a range of characters, as in:
$ tr 'A-Z' 'a-z' <be.fore >af.ter
There is also special syntax in tr for specifying this sort of range for upper- and lowercase
conversions:
$ tr '[:upper:]' '[:lower:]' <be.fore >af.ter
The tr utility copies its input to its output, substituting or deleting selected characters. The name tr is short for translate or transliterate. It takes two sets of characters as parameters and replaces occurrences of the characters in the first set with the corresponding characters from the second set.
tr "set1" "set2" < input.txt > output.txt
Although tr doesn't support regular expressions, it does support ranges of characters.
Just make sure that both arguments end up with the same number of characters.
With GNU tr, if the second argument is shorter, its last character will be repeated to match the
length of the first argument. If the first argument is shorter, the second argument will
be truncated to match the length of the first.
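Since tr cannot edit a file in place, a recursive version needs a temporary file per input. The following is a minimal sketch; the name lowercase_tree is hypothetical. It rewrites every regular file under the given directory, including binary files, so it assumes the tree contains only text.

```shell
# lowercase_tree ROOT (hypothetical helper): convert uppercase letters to
# lowercase in the content of every regular file under ROOT.
lowercase_tree() {
    find "$1" -type f -exec sh -c '
        for f do
            tmp=$(mktemp "$f.XXXXXX") || exit 1
            # Copy the result back with cat rather than mv so that the
            # original file keeps its permissions and inode.
            tr "[:upper:]" "[:lower:]" < "$f" > "$tmp" && cat "$tmp" > "$f"
            rm -f "$tmp"
        done
    ' sh {} +
}
```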

sed -e 's/\(.*\)/\L\1/g' *
or you could pipe the files in from find

Expanding on @nullrevolution's solution:
find /path_to_files -type f -exec sed --in-place -e 's/\(.*\)/\L\1/g' '{}' \;
This one liner will look for all files in all sub-directories starting with /path_to_files as a base directory.
WARNING: This will change the case in ALL files in EVERY directory under /path_to_files, so make sure you want to do that before you execute this command. You can limit the scope of the find based on file extension as follows:
find /path_to_files -type f -name \*.txt -exec sed --in-place -e 's/\(.*\)/\L\1/g' '{}' \;
You may also want to make a backup of the original file before modifying the original:
find /path_to_files -type f -name \*.txt -exec sed --in-place=-orig -e 's/\(.*\)/\L\1/g' '{}' \;
This will keep the original file name for the modified file, while making an unmodified copy with "-orig" appended to the file name (i.e. file.txt would be backed up as file.txt-orig).
An explanation of each piece:
find /path_to_files This sets the base directory to the path provided.
-type f This will search the directory hierarchy for files only.
-exec COMMAND '{}' \; This executes the provided command once for each matched file. The '{}' is replaced by the current file name. The \; indicates the end of the command.
sed --in-place -e 's/\(.*\)/\L\1/g' The --in-place option makes the changes to the file without backing it up. The regular expression uses a backreference \1 to refer to the entire line and \L to convert it to lower case.
Optional
(For a more archaic solution.)
find /path_to_files -type f -exec dd if='{}' of='{}'-lc conv=lcase \;

Identifying text files can be a bit tricky in Unix-like environments. You can do something like this:
set -e -o noclobber
while IFS= read -r f; do
tr 'A-Z' 'a-z' <"$f" >"$f.$$"
mv "$f.$$" "$f"
done < <(find "$start_directory" -type f -exec file {} + | cut -d: -f1)
This will fail on filenames with embedded colons or newlines, but should work on others, including those with spaces.
