Bash pipes and Shell expansions - bash

I've changed my data source in a bash pipeline from cat ${file} to cat file_${part_number} because preprocessing was causing ${file} to be truncated at 2GB; splitting the output eliminated the preprocessing issues. However, while testing this change, I was unable to work out how to get Bash to keep behaving the same way for some basic operations I was using to test the pipeline.
My original pipeline is:
cat giantfile.json | jq -c '.' | python postprocessor.py
With the original pipeline, if I'm testing changes to postprocessor.py or the preprocessor and I want to just test my changes with a couple of items from giantfile.json I can just use head and tail. Like so:
cat giantfile.json | head -n 2 - | jq -c '.' | python postprocessor.py
cat giantfile.json | tail -n 3 - | jq -c '.' | python postprocessor.py
The new pipeline that fixes the preprocessor issues is:
cat file_*.json | jq -c '.' | python postprocessor.py
This works fine, since every file gets output eventually. However, I don't want to wait 5-10 minutes for each test. I tried to test with the first 2 lines of input using head.
cat file_*.json | head -n 2 - | jq -c '.' | python postprocessor.py
Bash sits there working far longer than it should, so I try:
cat file_*.json | head -n 2 - | jq -c '.'
And my problem is clear. Bash is outputting the content of all the files as if head was not even there because each file now has 1 line of data in it. I've never needed to do this with bash before and I'm flummoxed.
Why does Bash behave this way, and how do I rewrite my little bash command pipeline to work the way it used to, allowing me to select the first/last n lines of data to work with for testing?

My guess is that when you split the JSON up into individual files, you managed to remove the newline character from the end of each line, with the consequence that the concatenated output (cat file_*.json) is really only one line in total, because cat will not insert newlines between the files it is concatenating.
If the files were really one line each with a terminating newline character, piping through head -n 2 should work fine.
You can check this hypothesis with wc, since that utility counts newline characters rather than lines. If it reports that the files have 0 lines, then you need to fix your preprocessing.
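For example, a quick check plus a stopgap (a sketch, not part of the answer above; awk 1 simply re-prints every record with a trailing newline, so head sees line boundaries again):
wc -l file_*.json   # a 0 here means that file's trailing newline is missing
awk 1 file_*.json | head -n 2 | jq -c '.' | python postprocessor.py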

Related

Count the lines from output using pipeline

I am trying to count how many files have words with the pattern [Gg]reen.
#!/bin/bash
for File in `ls ./`
do
    cat ./$File | egrep '[Gg]reen' | sed -n '$='
done
When I do this I get this output:
1
1
3
1
1
So I want to count the lines to get 5 in total. I tried using wc -l after the sed but it didn't work; it counted the lines in all the files. I tried redirecting with >file.txt but it didn't write anything into it. And when I use >> instead it writes, but each time I run the script it appends the lines again.
Since, according to your question, you want to know how many files contain a pattern, you are interested in the number of files, not the number of pattern occurrences.
For instance,
grep -l '[Gg]reen' * | wc -l
would produce the number of files which contain green or Green somewhere as a substring.
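An illustrative run with made-up files, just to show what is being counted:
$ printf 'green tea\n' > a.txt
$ printf 'nothing here\n' > b.txt
$ printf 'Green light\ngreen light\n' > c.txt
$ grep -l '[Gg]reen' * | wc -l
2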

How to create argument variable in bash script

I am trying to write a script such that I can identify number of characters of the n-th largest file in a sub-directory.
I was trying to assign n and the name of sub-directory into arguments like $1, $2.
Current directory: Greetings
Sub-directory: language_files, others
Sub-directory: English, German, French
Files: Goodmorning.csv, Goodafternoon.csv, Goodevening.csv ….
I would be in the directory “Greetings”; when I indicate a subdirectory (English, German, French), the script should show the nth-largest file in that subdirectory and report its number of characters as well.
For instance, if I am trying to figure out the number of characters in the 2nd largest file in English, I did:
langs=$1
n=$2
for langs in language_files/;
Do count=$(find language_files/$1 name "*.csv" | wc -m | head -n -1 | sort -n -r | sed -n $2(p))
Done | echo "The file has $count bytes!"
The result I wanted was:
$ ./script1.sh English 2
The file has 1100 bytes!
The main problem behind all of this is that I don't understand how variables and looping work in a bash script.
no need for looping
find language_files/"$1" -name "*.csv" | xargs wc -m | sort -nr | sed -n "$2{p;q}"
for byte counting you should use -c, since -m is for char counting (it may be the same for you).
You don't use the loop variable in the script anyway.
Bash loops are interesting. You are encouraged to learn more about them when you have some time. However, this particular problem might not need a loop. Set lang (you can call it langs if you prefer) and n appropriately, and then try this:
count=$(stat -c'%s %n' language_files/$lang/* | sort -nr | head -n$n | tail -n1 | sed -re 's/^[[:space:]]*([[:digit:]]+).*/\1/')
That should give you the $count you need. Then you can echo it however you like.
EXPLANATION
If you wish to learn how it works:
The stat command outputs various statistics about the named file (or files), in this case %s the file's size and %n the file's name.
The head and tail commands output, respectively, the first and last several lines of their input. Together, they select a specific line.
The sed command extracts just the size (the leading digits) from that line. (You can use cut instead, if you prefer.)
If you wish to be cleverer, then you can optimize as @karafka has done.
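Putting it together as the script the question describes (a sketch under the question's directory layout; assumes GNU stat, and the variable names are just illustrative):
#!/bin/bash
# usage: ./script1.sh English 2
lang=$1
n=$2
count=$(stat -c'%s %n' language_files/"$lang"/* | sort -nr | head -n"$n" | tail -n1 \
        | sed -re 's/^[[:space:]]*([[:digit:]]+).*/\1/')
echo "The file has $count bytes!"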

How to select the most recent file based on the date in the filename

I have a list of files
- backups/
- backup.2017-08-28.zip
- backup.2017-08-29.zip
- backup.2017-09-2.zip
I would like to be able to upload the most recent back to a server which I can do with command:
dobackup ~/backups/backup.2017-09-2.zip
My question is: within a .sh file (so I can start an automated/cron job for this), how can I get the latest file name and then run that command?
Limitation: I must use the date in the filename, not the modification metadata.
Adding a couple more files:
backup.2017-08-28.zip
backup.2017-08-29.zip
backup.2017-09-10.zip
backup.2017-09-2.zip
backup.2017-09-28.zip
backup.2017-09-3.zip
How about something like this, though granted, a bit convoluted:
ls -1 backup*zip | sed 's/-\([1-9]\)\./-0\1\./g' | sort [-r] | sed 's/-0\([1-9]\)\./-\1\./g'
sed is looking for a match like -[1-9]. (a dash, a single digit, then a literal period)
the escaped/matching parens - \( and \) designates a pattern we want to reference in the replacement portion
the new pattern will be -0\1. where the \1 is a reference to the first pattern wrapped in escaped/matching parens (i.e., \1 will be replaced with the single digit that matched [1-9])
our period (.) is escaped to make sure it's handled as a literal period and not considered as a single-position wildcard
at this point the ls/sed construct has produced a list of files with 2-digit days
we run through sort (or sort -r) as needed
then run the results back through sed to convert back to a single digit day for days starting with a 0
at this point you can use a head or tail to strip off the first/last line based on which sort/sort -r you used
Running against the sample files:
$ ls -1 backup*zip | sed 's/-\([1-9]\)\./-0\1\./g' | sort | sed 's/-0\([1-9]\)\./-\1\./g'
backup.2017-08-28.zip
backup.2017-08-29.zip
backup.2017-09-2.zip
backup.2017-09-3.zip
backup.2017-09-10.zip
backup.2017-09-28.zip
# reverse the ordering
$ ls -1 backup*zip | sed 's/-\([1-9]\)\./-0\1\./g' | sort -r | sed 's/-0\([1-9]\)\./-\1\./g'
backup.2017-09-28.zip
backup.2017-09-10.zip
backup.2017-09-3.zip
backup.2017-09-2.zip
backup.2017-08-29.zip
backup.2017-08-28.zip
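To turn this into the .sh file the question asks about, here is one sketch (not from the original answer; it reuses the zero-padding trick above, and dobackup is the question's own upload command):
#!/bin/bash
# Pick the newest backup by the date embedded in the filename, then upload it.
cd ~/backups || exit 1
latest=$(ls -1 backup*zip \
           | sed 's/-\([1-9]\)\./-0\1\./g' \
           | sort -r \
           | sed 's/-0\([1-9]\)\./-\1\./g' \
           | head -n1)
dobackup ~/backups/"$latest"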
You can sort on the 2nd field, delimited by dots:
printf '%s\n' backup.* | sort -t '.' -k2,2r | head -1
backup.2017-09-2.zip

No new line produced by >>

I have the following piece of code that selects two line numbers in a file, extracts everything between these lines, replaces the newline characters with tabs and places the result in an output file. I want all lines extracted within one loop iteration to be on the same line, but lines extracted in different iterations to go on new lines.
for ((i=1; i<=numTimePoints; i++)); do
    # Get the starting point for line extraction. This is just an integer.
    startScan=$(($(echo "${dataStart}" | sed -n ${i}p)+1))
    # Get the end point for line extraction. This is just an integer.
    endScan=$(($(echo "${dataEnd}" | sed -n ${i}p)-1))
    # From file ${file}, take all lines between ${startScan} and ${endScan}. Replace new lines with tabs and output to file ${tmpOutputFile}
    head -n ${endScan} ${file} | tail -n $((${endScan}-${startScan}+1)) | tr "\n" "\t" >> ${tmpOutputFile}
done
This script works mostly as intended; however, all new lines are appended to the previous line rather than placed on new lines (as I thought >> would do). In other words, if I now do cat ${tmpOutputFile} | wc then it returns 0 12290400 181970555. Can anyone point out what I'm doing wrong?
Redirection, including >>, has nothing to do with newline creation at all -- redirection operations don't generate output themselves, newlines or otherwise; they only control where file descriptors (stdout, stderr, etc.) are connected, and it's the programs performing those writes which are responsible for the contents.
Consequently, your tr '\n' '\t' is entirely preventing newlines from being added to the output file -- there's nowhere one could come from that doesn't go through that pipeline.
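A tiny illustration of that point (not from the answer; demo.txt is just a scratch file):
printf 'foo' >> demo.txt
printf 'bar' >> demo.txt
cat demo.txt   # prints "foobar" -- neither printf nor >> ever wrote a newline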
Consider the following instead:
while read -r startScan <&3 && read -r endScan <&4; do
    # generate your output
    head -n "$endScan" "$file" | tail -n $(( endScan - startScan + 1 )) | tr '\n' '\t'
    # append your newline
    printf '\n'
done 3<<<"$dataStart" 4<<<"$dataEnd" >"$tmpOutputFile"
Note:
We aren't paying the cost of running sed to extract startScan and endScan, but rather are reading them a line at a time from herestrings created from the contents of dataStart and dataEnd
We're redirecting to our output file exactly once, and reusing that file handle for the entire loop (over multiple commands -- first the pipeline, and then the printf)
We're actually running a printf to generate that newline, rather than expecting it to be somehow implicitly created by magic.

bash: shortest way to get n-th column of output

Let's say that during your workday you repeatedly encounter the following form of columnized output from some command in bash (in my case from executing svn st in my Rails working directory):
? changes.patch
M app/models/superman.rb
A app/models/superwoman.rb
In order to work with the output of your command - in this case the filenames - some sort of parsing is required so that the second column can be used as input for the next command.
What I've been doing is to use awk to get at the second column, e.g. when I want to remove all files (not that that's a typical use case :), I would do:
svn st | awk '{print $2}' | xargs rm
Since I type this a lot, a natural question is: is there a shorter (thus cooler) way of accomplishing this in bash?
NOTE:
What I am asking is essentially a shell command question even though my concrete example is on my svn workflow. If you feel that workflow is silly and suggest an alternative approach, I probably won't vote you down, but others might, since the question here is really how to get the n-th column command output in bash, in the shortest manner possible. Thanks :)
You can use cut to access the second field:
cut -f2
Edit:
Sorry, didn't realise that SVN doesn't use tabs in its output, so that's a bit useless. You can tailor cut to the output but it's a bit fragile - something like cut -c 10- would work, but the exact value will depend on your setup.
Another option is something like: sed 's/.\s\+//'
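Applied to the sample output above (illustrative; assumes GNU sed, where \s and \+ are supported):
printf '?       changes.patch\nM       app/models/superman.rb\n' | sed 's/.\s\+//'
# changes.patch
# app/models/superman.rb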
To accomplish the same thing as:
svn st | awk '{print $2}' | xargs rm
using only bash you can use:
svn st | while read a b; do rm "$b"; done
Granted, it's not shorter, but it's a bit more efficient and it handles whitespace in your filenames correctly.
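A quick illustration of the whitespace point (not from the answer; the filename is made up):
printf 'M       app/models/super man.rb\n' | while read a b; do printf '<%s>\n' "$b"; done
# <app/models/super man.rb> -- the internal space survives because b takes the rest of the line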
I found myself in the same situation and ended up adding these aliases to my .profile file:
alias c1="awk '{print \$1}'"
alias c2="awk '{print \$2}'"
alias c3="awk '{print \$3}'"
alias c4="awk '{print \$4}'"
alias c5="awk '{print \$5}'"
alias c6="awk '{print \$6}'"
alias c7="awk '{print \$7}'"
alias c8="awk '{print \$8}'"
alias c9="awk '{print \$9}'"
Which allows me to write things like this:
svn st | c2 | xargs rm
Try the zsh. It supports global aliases, so you can define X in your .zshrc to be
alias -g X="| cut -d' ' -f2"
then you can do:
cat file X
You can take it one step further and define it for the nth column:
alias -g X2="| cut -d' ' -f2"
alias -g X1="| cut -d' ' -f1"
alias -g X3="| cut -d' ' -f3"
which will output the nth column of file "file". You can do this for grep output or less output, too. This is very handy and a killer feature of the zsh.
You can go one step further and define D to be:
alias -g D="|xargs rm"
Now you can type:
cat file X1 D
to delete all files mentioned in the first column of file "file".
If you know bash, zsh is not much of a change except for some new features.
HTH Chris
Because you seem to be unfamiliar with scripts, here is an example.
#!/bin/sh
# usage: svn st | x 2 | xargs rm
col=$1
shift
awk -v col="$col" '{print $col}' "${@--}"
If you save this in ~/bin/x and make sure ~/bin is in your PATH (now that is something you can and should put in your .bashrc) you have the shortest possible command for generally extracting column n: x n.
The script should do proper error checking and bail if invoked with a non-numeric argument or the incorrect number of arguments, etc; but expanding on this bare-bones essential version will be in unit 102.
Maybe you will want to extend the script to allow a different column delimiter. Awk by default parses input into fields on whitespace; to use a different delimiter, use -F ':' where : is the new delimiter. Implementing this as an option to the script makes it slightly longer, so I'm leaving that as an exercise for the reader.
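A hedged sketch of that extension (the -F option handling below is illustrative, not part of the original answer):
#!/bin/sh
# usage: x [-F delim] n [file...]
delim=""
while getopts F: opt; do
    case $opt in
        F) delim=$OPTARG ;;
        *) echo "usage: x [-F delim] n [file...]" >&2; exit 1 ;;
    esac
done
shift $((OPTIND - 1))
col=$1
shift
if [ -n "$delim" ]; then
    awk -F "$delim" -v col="$col" '{print $col}' "${@--}"
else
    awk -v col="$col" '{print $col}' "${@--}"
fi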
Usage
Given a file file:
1 2 3
4 5 6
You can either pass it via stdin (using a useless cat merely as a placeholder for something more useful):
$ cat file | sh script.sh 2
2
5
Or provide it as an argument to the script:
$ sh script.sh 2 file
2
5
Here, sh script.sh is assuming that the script is saved as script.sh in the current directory; if you save it with a more useful name somewhere in your PATH and mark it executable, as in the instructions above, obviously use the useful name instead (and no sh).
It looks like you already have a solution. To make things easier, why not just put your command in a bash script (with a short name) and just run that instead of typing out that 'long' command every time?
If you are ok with manually selecting the column, you could be very fast using pick:
svn st | pick | xargs rm
Just go to any cell of the 2nd column, press c and then hit enter
Note that the file path does not have to be in the second column of svn st output. For example, if you modify a file and also modify its property, the path will be in the 3rd column.
See possible output examples in:
svn help st
Example output:
M wc/bar.c
A + wc/qax.c
I suggest cutting off the first 8 characters with:
svn st | cut -c8- | while read FILE; do echo whatever with "$FILE"; done
If you want to be 100% sure, and deal with fancy filenames with white space at the end for example, you need to parse xml output:
svn st --xml | grep -o 'path=".*"' | sed 's/^path="//; s/"$//'
Of course you may want to use some real XML parser instead of grep/sed.
