Concatenate files based on numeric sort of name substring in awk w/o header - bash

I am interested in concatenating many files together based on the number in their names, and also removing the first line of each.
e.g. chr1_smallfiles then chr2_smallfiles then chr3_smallfiles.... etc (each without the header)
Note that chr10_smallfiles needs to come after chr9_smallfiles -- that is, this needs to be numeric sort order.
When I run the two commands, awk and ls -v1, separately, each does its job properly, but when I put them together it doesn't work. Please help, thanks!
awk 'FNR>1' | ls -v1 chr*_smallfiles > bigfile

The issue is with the way that you're trying to pass the list of files to awk. At the moment, you're piping the output of awk to ls, which makes no sense.
Bear in mind that, as mentioned in the comments, ls is a tool for interactive use, and in general its output shouldn't be parsed.
If sorting weren't an issue, you could just use:
awk 'FNR > 1' chr*_smallfiles > bigfile
The shell will expand the glob chr*_smallfiles into a list of files, which are passed as arguments to awk. For each filename argument, all but the first line will be printed.
Since you want to sort the files, things aren't quite so simple. If you're sure the full range of files exist, just replace chr*_smallfiles with chr{1..99}_smallfiles in the original command.
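That is, assuming chr1_smallfiles through chr99_smallfiles all exist, the command becomes:
awk 'FNR > 1' chr{1..99}_smallfiles > bigfile
Brace expansion generates the names in numeric order, but unlike a glob it doesn't check that the files exist, so a missing file would make awk complain.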
Using some Bash-specific and GNU sort features, you can also achieve the sorting like this:
printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
printf '%s\0' prints each filename followed by a null-byte
sort -z sorts records separated by null-bytes
-n -k1.4 does a numeric sort, starting from the 4th character (the numeric part of the filename)
xargs -0 passes the sorted, null-separated output as arguments to awk
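Alternatively, if your sort supports version sort (-V, a GNU extension), you can let it handle the embedded numbers instead of giving a key position; a sketch:
printf '%s\0' chr*_smallfiles | sort -zV | xargs -0 awk 'FNR > 1' > bigfile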
Otherwise, if you want to go through the files in numerical order, and you're not sure whether all the files exist, then you can use a shell loop (although it'll be significantly slower than a single awk invocation):
for file in chr{1..99}_smallfiles; do   # 99 is the maximum file number
    [ -f "$file" ] || continue          # skip missing files
    awk 'FNR > 1' "$file"
done > bigfile

You can also use tail to concatenate all the files without their headers:
tail -q -n+2 chr*_smallfiles > bigfile
In case you want to concatenate the files in a natural sort order as described in your question, you can pipe the result of ls -v1 to xargs using
ls -v1 chr*_smallfiles | xargs -d $'\n' tail -q -n+2 > bigfile
(Thanks to Charles Duffy) xargs -d $'\n' sets the delimiter to a newline (\n), in case a filename contains whitespace or quote characters
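If you'd rather not involve ls at all, a null-delimited pipeline built on GNU find, sort and xargs should do the same job; a sketch assuming GNU tools:
find . -maxdepth 1 -name 'chr*_smallfiles' -print0 | sort -zV | xargs -0 tail -q -n +2 > bigfile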

Using a bash 4 associative array to extract only the numeric substring of each filename; sort those individually; and then retrieve and concatenate the full names in the resulting order:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "Requires bash 4.0 or newer" >&2; exit 1;; esac
# when this is done, you'll have something like:
# files=( [1]=chr_smallfiles1.txt
# [10]=chr_smallfiles10.txt
# [9]=chr_smallfiles9.txt )
declare -A files=( )
for f in chr*_smallfiles.txt; do
    files[${f//[![:digit:]]/}]=$f
done
# now, emit those indexes (1, 10, 9) to "sort -n -z" to sort them as numbers
# then read those numbers, look up the filenames associated, and pass to awk.
while read -r -d '' key; do
    awk 'FNR > 1' <"${files[$key]}"
done < <(printf '%s\0' "${!files[@]}" | sort -n -z) >bigfile

You can do it with a for loop like the one below, which works for me:
for file in chr*_smallfiles
do
    tail +2 "$file" >> bigfile
done
How does it work? The for loop reads all the files in the current directory matching the wildcard pattern chr*_smallfiles and assigns each file name to the variable file; tail +2 "$file" (on some systems you need the POSIX form tail -n +2) outputs all the lines of that file except the first and appends them to bigfile. So finally all the files are merged into one file, bigfile, minus the first line of each.

Just for completeness, how about a sed solution?
for file in chr*_smallfiles
do
    sed -n '2,$p' "$file" >> bigfile
done
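With GNU sed you can also drop the loop entirely, since -s (--separate) treats each input file on its own rather than as one continuous stream; a sketch assuming GNU sed (note that the glob expands in lexical, not numeric, order, so combine it with one of the sorting approaches above if chr10 must follow chr9):
sed -s '1d' chr*_smallfiles > bigfile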
Hope it helps!


1. How to use the input not including the first one 2. Using grep and sed to find the pattern entered by the user and how to create the next line

The command that I'm writing takes a file as its first input and then searches for how many times certain patterns occur within that file, using grep and sed.
Ex:
$ cat file1
oneonetwotwotwothreefourfive
Intended output:
$ ./command file1 one two three
one 2
two 3
three 1
The problem is that the file does not have any lines; it is just one long string of letters. I'm trying to use sed to replace each occurrence of the pattern I'm looking for with "FIND" and to push the remaining text onto the next line, continuing until the end of the file. Then I use grep FIND to get the lines that contain FIND. Finally, I use wc -l to count those lines. However, I cannot find the option to move the text to the next line.
Ex:
$cat file1
oneonetwosixone
Intended output:
FIND
FIND
twosixFIND
Another problem that I've been having is how to use the rest of the input, not including the file.
Failed attempt:
file=$1
for PATTERN in 2 3 4 5 ... N
do
variable=$(sed 's/$PATTERN/find/g' $file | grep FIND $file | wc -l)
echo $PATTERN $variable
exit
Another failed attempt:
file=$1
PATTERN=$($2,$3 ... $N)
for PATTERN in $*
do variable=$(sed 's/$PATTERN/FIND/g' $file | grep FIND $file | wc-1)
echo $PATTERN $variable
exit
Any suggestions and help will be greatly appreciated. Thank you in advance.
Non-portable solution with GNU grep:
file=$1
shift
for pattern in "$@"; do
    echo "$pattern" $(grep -o -e "$pattern" <"$file" | wc -l)
done
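For the sample file in the question, a run would look like this (assuming the script is saved as command and made executable):
$ ./command file1 one two three
one 2
two 3
three 1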
If you want to use sed and your "patterns" are actually fixed strings (which don't contain characters that have special meaning to sed), you could do something like:
file=$1
shift
for pattern in "$@"; do
    echo "$pattern" $(
        sed "s/$pattern/\n&\n/g" "$file" |\
            grep -e "$pattern" | wc -l
    )
done
Your code has several issues:
you should quote variable uses wherever word splitting may happen
don't use ALLCAPS variable names - they are reserved for use by the shell
if you put a string in single-quotes, variable expansion does not happen
if you give grep a file, it won't read standard input
your for loop has no terminating done
This might work for you (GNU bash, sed and uniq):
f(){ local file=$1;
shift;
local args="$#";
sed -E 's/'${args// /|}'/\n&\n/g
s/(\n\S+)\n\S+/\1/g
s/\n+/\n/g
s/.(.*)/echo "\1"|uniq -c/e
s/ *(\S+) (\S+)/\2 \1/mg' $file; }
Separate arguments into file and remaining arguments.
Apply arguments as alternation within a sed substitution command which splits words into lines separated by a newline either side.
Remove unwanted words and unwanted newlines.
Evaluate the manufactured file within a sed substitution using the uniq command with the -c option.
Rearrange the output and print the result.
The problem is the file does not have any lines
Great! So the problem reduces to putting newlines.
func() {
    file=$1
    shift
    rgx=$(printf "%s\\|" "$@" | sed 's#\\|$##');
    # put the newline between words
    sed 's/\('"$rgx"'\)/&\n/g' "$file" |
    # it's just standard here
    sort | uniq -c |
    # filter only input - i.e. exclude fourfive
    grep -xf <(printf " *[0-9]\+ %s\n" "$@")
};
func <(echo oneonetwotwotwothreefourfive) one two three
outputs:
2 one
1 three
3 two

Using NOT operator in an IF with multiple grep commands

I am new to Shell scripting, and am writing a Korn shell script.
My aim is to search for each line of fileA.txt in 4 separate files (let's call them fileA.txt, fileB.txt, fileC.txt and fileD.txt). I need to print "not found", in a separate file, for the lines from fileA.txt that were found in none of the four files.
So I came up with the following if statement. I am trying to combine the 4 grep commands using &&, and doing a logical NOT (!), since I only need the lines that were found in none of the 4 files.
for i in $(<fileA.txt);
do
if !((grep -q $i fileB.txt) && (grep -q $i fileB.txt) && (grep -q $i fileC.txt) && (grep -q $i fileD.txt)); then
print "$i not found in either of 4 files"
fi
done
I know there's something definitely wrong with the syntax, but being a beginner in shell scripting, I can't figure it out.
It doesn't answer the question you asked, and thus violates SO policy, but there's a way to solve your actual problem with awk in one pass that I can't fit in a reasonable comment:
awk 'FNR==NR{a[$0];next} {for(p in a)if($0~p){delete a[p]}} \
END{for(p in a)print "notfound: ",p}' patternfile data1 data2 data3 etc
The notfound: is just for clarity, you can change or omit it as desired.
The output values (patterns that were not found in any data file) are not necessarily in the same order as they were in patternfile; if you care about that:
awk 'FNR==NR{a[$0]=FNR;next} {for(p in a)if($0~p){delete a[p]}} \
END{for(p in a)print a[p],p}' patternfile data1 data2 data3 etc | sort -k1n | cut -f2-
# or in GNU awk v4+ only
awk 'FNR==NR{a[$0]=FNR;next} {for(p in a)if($0~p){delete a[p]}} \
END{PROCINFO["sorted_in"]="#val_num_asc";for(p in a)print p}' patternfile data1 data2 data3 etc
Your question is also ambiguous about 'lines'; do you mean each line in patternfile should occur as a line in one of the data files, or can it occur within a line but not necessarily the whole line? Also, are the values in the patternfile only data characters or are any of them special characters that match something different in the data? For example with grep defaults as you posted (or awk with ~ as I have above) if patternfile contains a line boojum.. that item will be considered found if a data file contains any of the following lines:
boojum..
boojumXY
the snark was a boojum!!
OTOH the patternfile line ^abc will match:
abc
abcdefghi
but will NOT match:
^abc
You can get full-line match in grep with option -x, literal (non-regex) match with -F, or both. These can also be achieved in awk but differently.
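For example, to make the per-line test from the question match whole lines literally, the grep call could look something like this (using the question's loop variable):
# whole-line (-x), literal (-F) match; -q keeps it quiet for the if test
grep -Fxq -e "$i" fileB.txt fileC.txt fileD.txt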
You don't need the parentheses. In fact, because you are using &&, you don't need 3 separate calls to grep.
while IFS= read -r line; do
    if ! grep -q "$line" fileB.txt fileC.txt fileD.txt; then
        print "$line not found in any of the 3 files"
    fi
done < fileA.txt
You don't even need the loop; this pattern is covered by the -f option:
if ! grep -f fileA.txt fileB.txt fileC.txt fileD.txt; then
...
fi
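As a side note (not part of the original answer): if whole-line, literal matching is what you're after, an inverted match can print the missing lines directly; a sketch, assuming ksh93 or bash for the process substitution:
# lines of fileA.txt that do not appear verbatim in any of the other files
grep -Fxv -f <(cat fileB.txt fileC.txt fileD.txt) fileA.txt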

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
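For the sample pairs shown in the question, the first sed produces a script like the following, which the second sed then applies to file:
s|old|new|g
s|tobereplaced|replacement|g
s|(stuffiwant).*(too)|\1\2|g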
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and a probably not-so-widely-known MySQL command-line utility, replace. Since replace is optimized for string replacements, it was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.
More on sed: you can utilize multiple cores with sed by splitting your replacements into #cpus groups and then piping them through chained sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has a UTF-8 setup, then placing LANG=C in front of the commands also boosts performance:
$ LANG=C sed ...
You can cut down on unnecessary awk invocations and use Bash to split the name-value pairs:
while IFS='|' read -r old new; do
    # echo "$old :: $new"
    sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' enables read to populate the name and value into two different shell variables, old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
Here is what I would try:
store your sed search-replace pairs in a Bash array, like below;
build your sed command based on this array using parameter expansion;
run the command.
patterns=(
    old new
    tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of patterns
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # don't need to loop on the replacement…
    search=${patterns[i]};
    replace=${patterns[i+1]}; # … here we got the replacement part
    sedArgs+=(-e "s/$search/$replace/g")
done
sed "${sedArgs[@]}" file
This results in the following command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this.
pattern=''
while read -r i
do
    old=$(echo "$i" | awk -F'|' '{print $1}') # due to the need for extended regex
    new=$(echo "$i" | awk -F'|' '{print $2}')
    pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list
sed -r "${pattern}" -i file
This will run the sed command only once on the file, with all the replacements. You may also want to replace awk with cut. cut may be better optimized than awk, though I am not sure about that.
old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them each one by one. The 1 at the end means that the line is printed.
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
This is more for fun, coding it via sed. It may still be worth timing, since it starts only a single sed process, which works recursively.
For POSIX sed (so use --posix with GNU sed).
Explanation:
Copy the replacement list in front of the file content, with delimiters (³ for each line and -End- for the end of the list), for easier sed handling (it's hard to use \n in a character class in POSIX sed).
Place all lines in the buffer (adding the line delimiter for the replacement list and -End- beforehand).
If the buffer starts with -End-³, remove that line and go to the final print.
Replace each occurrence of the first pattern (group 1) found in the text with the second pattern (group 2).
If one was found, restart (t again).
Remove the first line.
Restart the process (t again); t is needed because b does not reset the test flag, and the next t would otherwise always be true.
Thanks to @miku above;
I have a 100MB file with a list of 80k replacement-strings.
I tried various combinations of seds, sequential or parallel, but couldn't get the runtime below about 20 hours.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 minutes, a lot less than 20 hours!
Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script ?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
To expand on chthonicdaemon's solution:
live demo
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$@"
prefixing lines with -+/ allows empty "plus" values, and protects leading whitespace from buggy text editors
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/;;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
this regex format is compatible with sed and perl
Since miku mentioned mysql replace: replacing fixed strings with regex is non-trivial, since you must escape all regex chars, but you also must handle backslash escapes ...
A naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n

Loop over list of files to merge according to their names

Files in the directory look like that:
A_1_email.txt
A_1_phone.txt
A_2_email.txt
A_2_phone.txt
B_1_email.txt
B_1_phone.txt
B_2_email.txt
B_2_phone.txt
What I want:
To merge files A_1_email.txt and A_1_phone.txt; to merge files B_1_email.txt and B_1_phone.txt and so on.
What I mean by that: if the first two fields of the file names match (for example A to A; 1 to 1), then merge the files.
How I tried to do this:
ls * | cut -d "_" -f 1-2 | sort | uniq -c | awk '{print $2}' > names && for name in
$(cat names); do
And I am lost here; I really don't know how I should go on from there.
The following are based on #MichaelJ.Barber's answer (which had the excellent idea of using join), but with the specific intention to avoid the dangerous practice of parsing the output of ls:
# Simple loop: avoids subshells, pipelines
for file in *_email.txt; do
    if [[ -r "$file" && -r "${file%_*}_phone.txt" ]]; then
        join "$file" "${file%_*}_phone.txt"
    fi
done
or
##
# Use IFS and a function to avoid contaminating the global environment.
joinEmailPhone() {
local IFS=$'\n'
local -x LC_COLLATE=C # Ensure glob expansion sorting makes sense.
# According to `man (1) bash`, globs expand sorted "alphabetically".
# If we use LC_COLLATE=C, we don't need to sort again.
# Use an awk test (!seen[$0]++) to ensure uniqueness and a parameter expansion instead of cut
awk '!seen[$0]++{ printf("join %s_email.txt %s_phone.txt\n", $1, $1) }' <<< "${*%_*}" | sh
}
joinEmailPhone *
But in all probability (again assuming LC_COLLATE=C) you can get away with:
printf 'join %s %s\n' * | sh
I'll assume that the files all have tab-separated name-value pairs, where the value is email or phone as appropriate. If that's not the case, do some pre-sorting or otherwise modify as appropriate.
ls *_{email,phone}.txt |
cut -d "_" -f1-2 | # could also do this with variable expansion
sort -u |
awk '{ printf("join %s_email.txt %s_phone.txt\n", $1, $1) }' |
sh
What this does is to identify the unique prefixes for the files and use 'awk' to generate shell commands for joining the pairs, which are then piped into sh to actually run the commands.
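With the file names listed in the question, the generated (and executed) commands would be:
join A_1_email.txt A_1_phone.txt
join A_2_email.txt A_2_phone.txt
join B_1_email.txt B_1_phone.txt
join B_2_email.txt B_2_phone.txt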
You may use printf '%s\n' *_{email,phone}.txt | ... instead of ls *-... in the given scenario, i.e. when no newline characters are to be expected in the file path names. That's at least one external command fewer!
Use a for loop to iterate over the email files, using the read command
with the proper value of IFS to split the file name into the necessary parts.
Note that this does use one non-POSIX feature that bash provides, namely
using a here-string (<<<) to pass the value of $email to the read command.
for email in *_email.txt; do
    IFS=_ read fst snd <<< "$email"
    phone=${fst}_${snd}_phone.txt
    # merge $email and $phone
done

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string="line of input from CSV file"
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
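For example (the second line is a made-up record illustrating the caveat):
$ echo 'foo,bar,baz' | awk -F, '{print NF-1}'
2
$ echo '"Doe, John",42' | awk -F, '{print NF-1}'   # the comma inside the quotes is counted too
2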
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
    echo "$((${#array[@]} - 1))"
done < inputfile
or
while read -r line
do
    count=${line//[^,]}
    echo "${#count}"
done < inputfile
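To see what the second loop is doing, here is the same expansion applied to a single value:
$ line='foo,bar,baz'
$ count=${line//[^,]}    # delete every character that is not a comma
$ echo "${#count}"
2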
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since it's going to be installed on most modern shells) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
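If the goal is to make an existing bash script fail when a line has the wrong number of commas, a small awk check along these lines might work (the expected count of 5 and the file name my_file.csv are placeholders):
awk -F, -v expected=5 '
    NF - 1 != expected { printf "line %d has %d commas\n", FNR, NF - 1; bad = 1 }
    END { exit bad }
' my_file.csv || exit 1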
Just remove all of the carriage returns:
tr -d "\r" < old_file > new_file
