shell: Do wildcards guarantee alphabetical order? - bash

When I have the files a.txt, b.txt and c.txt is it guaranteed that
cat *.txt > all_files.txt
or
cat ?.txt > all_files.txt
will combine the files in alphabetical order?
(In all my tests, the alphabetical order was preserved, but I'm not sure because for example with ls the order is undefined and need not be alphabetic - but it often is, because the files have often been written to the directory alphabetically)

No, it depends on the locale. The order is dictated by the collation sequence in the locale, which can be changed using the LC_COLLATE or LC_ALL environment variables. Note that bash behaves differently in this respect to some other shells (e.g. Korn shell).
If you have a locale setting of C or POSIX then it will be in character set order. Otherwise you will probably only notice a difference with mixed case letters, e.g. the sequence for en_ locales is aAbBcC ... xXyYzZ. For example see http://collation-charts.org/fc6/fc6.en_GB.iso885915.html.
Available locales may be listed using locale -a.
Edit: another variable, LANG, is also available, but it is generally not used much nowadays. According to the Single UNIX Specification it is used "in the absence of the LC_ALL and other LC_* ... environment variables".
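To see the effect of the collation sequence for yourself, one option is to expand the same glob with the locale forced to C. The following is only a sketch: the scratch directory and the mixed-case file names are made up for illustration.

```shell
# Scratch directory with hypothetical mixed-case file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/B.txt" "$demo_dir/c.txt"

# Expand the glob in a subshell whose locale is forced to C.
# The assignment has to take effect before the glob is expanded,
# so it is a separate command rather than a prefix of echo.
c_order=$(cd "$demo_dir" && LC_ALL=C && export LC_ALL && echo *.txt)

echo "C locale order: $c_order"
# C locale order: B.txt a.txt c.txt
```

In the C locale the uppercase name sorts first because the order follows the character set. With an en_ locale the same glob would typically list a.txt before B.txt.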

Related

GOBIN root setting with var multi GOPATH in .zshrc config

export GOPATH=~/mygo:~/go
export GOBIN=$GOPATH/bin
I expected $GOBIN to equal ~/mygo/bin:~/go/bin, but it is ~/mygo:~/go/bin instead.
How could I set them in a better way? Thanks.
Solution
export GOPATH=~/mygo:~/go
export GOBIN=${(j<:>)${${(s<:>)GOPATH}/%//bin}}
Explanation
Although whatever program uses GOPATH might interpret it as an array, for zsh it is just a scalar ("string").
In order to append a string (/bin) to every element, the string "$GOPATH" first needs to be split into an array. In zsh this can be done with the parameter expansion flag s:string:, which splits a scalar on string and returns an array. Instead of : any other character, or a matching pair of (), [], {} or <>, can be used as the delimiter. A different delimiter is needed here because the string to split on is itself a colon.
GOPATH_ARRAY=(${(s<:>)GOPATH})
Now the ${name/pattern/repl} parameter expansion can be used to append /bin to each element, or rather to replace the end of each element with /bin. In order to match the end of a string, the pattern needs to begin with a %. As any string should be matched, the pattern is otherwise empty:
GOBIN_ARRAY=(${GOPATH_ARRAY/%//bin})
Finally, the array needs to be converted back into a colon-separated string. This can be done with the j:string: parameter expansion flag. It is the counterpart to s:string::
GOBIN=${(j<:>)GOBIN_ARRAY}
Fortunately, zsh allows Nested Substitution, so this can be done all in one statement, without intermediary variables:
GOBIN=${(j<:>)${${(s<:>)GOPATH}/%//bin}}
Alternative Solution
It is also possible to do this without parameter expansion flags or nested substitution, by simply appending /bin to the end of the string and additionally replacing every : with /bin: as follows:
export GOBIN=${GOPATH//://bin:}/bin
The ${name//pattern/repl} expansion replaces every occurrence of pattern with repl, instead of just the first one as ${name/pattern/repl} does.
This would also work in bash.
Personally, I feel that it is a bit "hackish", mainly because you need to write /bin twice and also because it completely sidesteps the underlying semantics. But that is only personal preference and the results will be the same.
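For comparison, the same split, append and join steps can also be written as a plain loop in sh or bash. This is a sketch rather than a recommendation: the GOPATH value is hypothetical, and the loop relies on word splitting of unquoted expansions, which zsh does not perform by default, so it is not a drop-in for .zshrc.

```shell
# Hypothetical value; the paths are written out in full for clarity.
GOPATH=/home/kevin/mygo:/home/kevin/go

GOBIN=
old_ifs=$IFS
IFS=:                                  # split unquoted expansions on colons
for dir in $GOPATH; do                 # one iteration per GOPATH element
    GOBIN=${GOBIN:+$GOBIN:}$dir/bin    # add ":" before every element but the first
done
IFS=$old_ifs                           # restore normal word splitting
export GOPATH GOBIN

echo "$GOBIN"
# /home/kevin/mygo/bin:/home/kevin/go/bin
```

The loop writes /bin only once and keeps the "list of directories" meaning visible, at the cost of a few more lines.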
Note:
When defining GOPATH like you did in the question
export GOPATH=~/mygo:~/go
zsh will expand each occurrence of ~/ to your home directory. So the value of GOPATH will be /home/kevin/mygo:/home/kevin/go, assuming the user name is "kevin". Accordingly, GOBIN will also have the expanded paths, /home/kevin/mygo/bin:/home/kevin/go/bin, instead of ~/mygo/bin:~/go/bin.
This could be prevented by quoting the value - GOPATH="~/mygo:~/go" - but I would recommend against it. ~ as synonym for the home directory is not a feature of the operating system and while shells usually support it, other programs (those needing GOPATH or GOBIN) might not do so.
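The difference between the quoted and unquoted forms can be checked with a small sketch. HOME is set to a hypothetical /home/kevin here only so that the result is predictable.

```shell
HOME=/home/kevin            # hypothetical home directory for the demonstration

unquoted=~/mygo:~/go        # tildes are expanded at the start and after each ":"
quoted="~/mygo:~/go"        # quoting keeps the tildes literal

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
# unquoted: /home/kevin/mygo:/home/kevin/go
# quoted:   ~/mygo:~/go
```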

Why can't environment variables with dashes be accessed in bash 4.1.2?

On a CentOS 5 host (with bash 3.2.32), we use Ruby (1.8.7) to set an environment variable:
ENV['AWS_foo-bar_ACCESS_KEY'] = xxxxx
Then, using bash, we run a shell script that does:
BUCKET_NAME=$1
AWS_ACCESS_KEY_ID_VAR="AWS_${BUCKET_NAME}_ACCESS_KEY_ID"
AWS_ACCESS_KEY_ID="${!AWS_ACCESS_KEY_ID_VAR}"
export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
This works fine on CentOS 5.
However, on CentOS 6 (with bash 4.1.2), we get the error
-bash: export: `AWS_foo-bar_ACCESS_KEY_ID=xxxxx': not a valid identifier
It is our understanding that this fails because - is not allowed in the variable name. But why does this work on bash 3.2 and not bash 4.1?
The "why" is almost irrelevant: The POSIX standard makes it very clear that export is only required to support arguments which are valid names, and anything with a dash is not a valid name. Thus, no POSIX shell is required to support exporting or expanding variable names with dashes, via indirect expansion or otherwise.
It's worth noting that ShellShock -- a major security bug caused by sloppy handling of environment contents -- is fixed in the bash 4.1 present in the current CentOS 6 updates repo; increased rigor in an area which spawned security bugs should be no surprise.
The remainder of this answer will focus on demonstrating that the new behavior of bash 4.1 is explicitly allowed, or even required, by POSIX -- and thus that the prior behavior was an undefined implementation artifact.
To quote POSIX on environment variables:
These strings have the form name=value; names shall not contain the character '='. For values to be portable across systems conforming to IEEE Std 1003.1-2001, the value shall be composed of characters from the portable character set (except NUL and as indicated below). There is no meaning associated with the order of strings in the environment. If more than one string in a process' environment has the same name, the consequences are undefined.
Environment variable names used by the utilities in the Shell and Utilities volume of IEEE Std 1003.1-2001 consist solely of uppercase letters, digits, and the '_' (underscore) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities.
Note: Other applications may have difficulty dealing with environment variable names that start with a digit. For this reason, use of such names is not recommended anywhere.
Thus:
Tools (including the shell) are required to fully support environment variable names with uppercase and lowercase letters, digits (except in the first position), and the underscore.
Tools (including the shell) may modify their behavior based on environment variables with names that comply with the above and additionally do not contain lowercase letters.
Tools (including the shell) should tolerate other names -- meaning they shouldn't crash or misbehave in their presence -- but are not required to support them.
Finally, shells are explicitly allowed to discard environment variable names which are not also shell variable names. From the relevant standard:
It is unspecified whether environment variables that were passed to the shell when it was invoked, but were not used to initialize shell variables (see Shell Variables) because they had invalid names, are included in the environment passed to execl() and (if execl() fails as described above) to the new shell.
Moreover, what defines a valid shell name is well-defined:
Name - In the shell command language, a word consisting solely of underscores, digits, and alphabetics from the portable character set. The first character of a name is not a digit.
Notably, only underscores (not dashes) are considered part of a valid name in a POSIX-compliant shell.
...and the POSIX specification for export explicitly uses the word "name" (which it defined in the text quoted above), and describes it as applying to "variables" (shell variables, whose names are subject to the restrictions quoted above):
The shell shall give the export attribute to the variables corresponding to the specified names, which shall cause them to be in the environment of subsequently executed commands. If the name of a variable is followed by = word, then the value of that variable shall be set to word.
All the above being said: if your operating system provides a /proc/self/environ which represents the state of your environment variables at process startup (before a shell has, as it is allowed to do, potentially discarded any variables which don't have valid shell names), you can extract content with invalid names like so:
# using a lower-case name where possible is in line with POSIX guidelines, see above
aws_access_key_id_var="AWS_${BUCKET_NAME}_ACCESS_KEY_ID"
while IFS= read -r -d '' var; do
  [[ $var = "$aws_access_key_id_var"=* ]] || continue
  val=${var#"${aws_access_key_id_var}="}
  break
done </proc/self/environ
echo "Extracted value: $val"
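The behavior discussed in this answer can be reproduced with a short sketch, assuming a reasonably recent shell and an env utility that accepts arbitrary names (as the common implementations do). The variable name is the one from the question; the value xxxxx is a placeholder.

```shell
name='AWS_foo-bar_ACCESS_KEY_ID'   # the name from the question

# Try the export in a subshell so that a failure cannot end the script.
if (export "$name=xxxxx") 2>/dev/null; then
    export_status=accepted
else
    export_status=rejected
fi
echo "export: $export_status"

# The env utility is not bound by the shell's rules for variable names,
# so it can place the same name in a child's environment.
child_view=$(env "$name=xxxxx" env | grep '^AWS_foo-bar_ACCESS_KEY_ID=')
echo "child sees: $child_view"
```

This shows the two halves of the argument: the name can exist in a process environment, but the shell is not obliged to treat it as a variable.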

concatenate files with similar names using shell

I have very limited knowledge of shell scripting, for example if I have the following files in a folder
abcd_1_1.txt
abcd_1_2.txt
def_2_1.txt
def_2_2.txt
I want the output to be abcd_1.txt and def_2.txt: for each pattern in the file names, concatenate the matching files and write the result to 'pattern'.txt.
patterns list <-?
for i in patterns; do echo cat "$i"* > "$i".txt; done
I am not sure how to code this in a shell script, any help is appreciated.
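Maybe something like this (assumes bash with associative arrays, and I didn't test it).
declare -A prefix
files=(*_*_*.txt)   # inputs only, so generated files such as abcd_1.txt are skipped
for f in "${files[@]}"; do
  prefix[${f%_*}]=1   # strip the trailing _N.txt; the key is the output name
done
for key in "${!prefix[@]}"; do
  cat "$key"_*.txt > "$key.txt"
done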
for i in abcd_1 def_2
do
cat "$i"*.txt > "$i".txt
done
The above will work in any POSIX shell, such as dash or bash.
If, for some reason, you want to maintain a list of patterns and then loop through them, then it is appropriate to use an array:
#!/bin/bash
patterns=(abcd_1 def_2)
for i in "${patterns[@]}"
do
cat "$i"*.txt > "$i".txt
done
Arrays require an advanced shell such as bash.
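If the prefixes should be derived from the file names instead of being listed by hand, a loop that appends each file to the output named after its prefix is one possibility. The sketch below runs in a scratch directory with made-up file contents, and it assumes that every input name ends in _N.txt.

```shell
# Scratch directory populated with the file names from the question.
work_dir=$(mktemp -d)
printf 'a1\n' > "$work_dir/abcd_1_1.txt"
printf 'a2\n' > "$work_dir/abcd_1_2.txt"
printf 'd1\n' > "$work_dir/def_2_1.txt"
printf 'd2\n' > "$work_dir/def_2_2.txt"

# Only names with at least two underscores count as inputs, so the
# generated abcd_1.txt and def_2.txt are never read back in.
for f in "$work_dir"/*_*_*.txt; do
    prefix=${f%_*}               # drop the trailing _N.txt
    cat "$f" >> "$prefix.txt"    # append in glob order
done

cat "$work_dir/abcd_1.txt"
```

Because the loop appends, old output files would need to be removed before running it a second time.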
Related Issue: File Order
Does the order in which files are added to abcd_1 or def_2 matter to you? The * will result in lexical ordering, which can conflict with numeric ordering. For example:
$ echo def_2_*.txt
def_2_10.txt def_2_11.txt def_2_12.txt def_2_1.txt def_2_2.txt def_2_3.txt def_2_4.txt def_2_5.txt def_2_6.txt def_2_7.txt def_2_8.txt def_2_9.txt
Observe that def_2_12.txt appears in the list ahead of def_2_1.txt. Is this a problem? If so, we can explicitly force numeric ordering. One method to do this is bash's brace expansion:
$ echo def_2_{1..12}.txt
def_2_1.txt def_2_2.txt def_2_3.txt def_2_4.txt def_2_5.txt def_2_6.txt def_2_7.txt def_2_8.txt def_2_9.txt def_2_10.txt def_2_11.txt def_2_12.txt
In the above, the files are numerically ordered.
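When the highest number is not known in advance, brace expansion is awkward. Another option is to sort the glob results numerically on the last field. This sketch uses a scratch directory with made-up contents and assumes file names without whitespace or newlines.

```shell
num_dir=$(mktemp -d)
for n in 1 2 3 10 11 12; do
    printf '%s\n' "$n" > "$num_dir/def_2_$n.txt"   # each file holds its own number
done

(
    cd "$num_dir" || exit 1
    # Sort on the third "_"-separated field numerically, then concatenate.
    printf '%s\n' def_2_*.txt | sort -t _ -k 3,3n | xargs cat > def_2.txt
)

merged=$(paste -s -d ' ' "$num_dir/def_2.txt")
echo "$merged"
# 1 2 3 10 11 12
```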

Opposite of Linux Split

I have a huge file that I split into several small chunks to divide and conquer. Now I have a folder containing a list of files like the ones below:
output_aa #(the output file done: cat input_aa | python parse.py > output_aa)
output_ab
output_ac
output_ad
...
I am wondering whether there is a way to merge those files back together, FOLLOWING THE INDEX ORDER.
I know I could do it by using
cat * > output.all
but I am curious whether some magical command that does this already comes with split.
The magic command would be:
cat output_* > output.all
There is no need to sort the file names as the shell already does it (*).
As its name suggests, cat's original design was precisely to conCATenate files, which is basically the opposite of split.
(*) Edit:
Should you use a (hypothetical?) locale whose collating order for a-z is not abcdefghijklmnopqrstuvwxyz, here is one way to overcome the issue:
LC_ALL=C sh -c 'cat output_* > output.all'
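A round trip is an easy way to check that concatenating in glob order really restores the original. The sketch below uses a made-up 1000-line input in a scratch directory, splits it, rejoins the pieces with the collation order pinned to C, and compares the result.

```shell
split_dir=$(mktemp -d)

# Hypothetical input: 1000 numbered lines.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "line", i }' > "$split_dir/input"

(
    cd "$split_dir" || exit 1
    split -l 100 input output_                    # creates output_aa ... output_aj
    LC_ALL=C sh -c 'cat output_* > output.all'    # rejoin in C collation order
)

if cmp -s "$split_dir/input" "$split_dir/output.all"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
```

The glob output_* does not match output.all, so the result file is not read back into itself.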
There are other ways to concat files together, but there is no magical "opposite of split" in "linux".
Of course, talking about "linux" in general is a bit far fetched, as many distributions have different tools (most of them use a different shell already by default, like sh, bash, csh, zsh, ksh, ...), but if you're talking about debian based linux at least, I don't know of any distribution which would provide such a tool.
For sorting you can use the Linux command "sort".
Also be aware that using ">" to redirect stdout will overwrite any existing contents, while ">>" will append to an existing file.
I don't want to copycat, but to make this answer complete, what jlliagre said about the cat command should of course also be considered: "cat" was made to con-"cat" files, which effectively makes it possible to reverse the split command. That only holds if you use the same ordering of files, so it's not exactly the "opposite of split", but it will work that way in close to 100% of the cases (see the comments under jlliagre's answer for specifics).
