Grab just the first filename from a zip file stream? - bash

I want to extract just the first filename from a remote zip archive without downloading the entire zip. In particular, I'm trying to get the build number of dartium (link to zip file). Since the file is quite large, I don't want to download the entire thing.
If I download the entire thing, unzip -l reports the first file as being: 0 2013-04-07 12:18 dartium-lucid64-inc-21033.0/. I want to get just this filename so I can parse out the 21033 portion as the build number.
I was doing this (total hack):
_url="https://storage.googleapis.com/dartium-archive/continuous/dartium-lucid64.zip"
curl -s $_url | head -c 256 | sed -n "s:.*dartium-lucid64-inc-\([0-9]\+\).*:\1:p"
It was working when I had my shell in ASCII mode, but I recently switched it to UTF-8 and it seems sed is now honoring that, which breaks my script.
I thought about hacking it by doing:
export LANG=
curl -s ...
But that seemed like an even bigger hack.
Is there a better way?

First, you can request just a byte range with curl.
Next, use strings to extract the printable strings from the binary stream.
Add "q" after "p" so sed quits after the first match.
curl -s $_url -r0-256 | strings | sed -n "s:.*dartium-lucid64-inc-\([0-9]\+\).*:\1:p;q"
Or this:
curl -s $_url -r0-256 | strings | sed -n "/dartium-lucid64/{s:.*-\([^-]\+\)\/.*:\1:p;q}"
This should be a bit faster and more reliable. It also extracts the full version, including the sub-version (if you need it).
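Putting it together, a minimal sketch that captures just the build number for later use in a script (the BUILD variable name is mine; the byte range and sed expression are from the commands above):
_url="https://storage.googleapis.com/dartium-archive/continuous/dartium-lucid64.zip"
# fetch only the first 257 bytes and pull the build number out of the first member's name
BUILD=$(curl -s "$_url" -r0-256 | strings | sed -n "s:.*dartium-lucid64-inc-\([0-9]\+\).*:\1:p;q")
echo "$BUILD"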

Related

how to copy all the URLs of a certain column of a web page?

I want to download a number of files to my server using wget; the 492 files are here:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736
So I want to copy the URLs of all the files in the "File Name" column, save them into a file, and download them with wget.
So how can I copy all those URLs from that column?
Thanks for reading :)
Since you've tagged bash, this should work.
wget -O- is used to output the data to the standard output, where it's greppable. (curl would do that by default.)
grep -oE is used to capture the URLs (which happily are in a regular enough format that a simple regexp works).
Then, wget -i is used to read URLs from the file generated. You might wish to add -nc or other suitable partial-fetch flags; those files are pretty hefty.
wget -O- https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 | grep -oE 'http://ftp.sra.ebi.ac.uk/[^"]+' > urls.txt
wget -i urls.txt
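If you'd rather skip the intermediate file, the same approach should work as a single pipeline, reading the URL list from standard input (a sketch under the same assumption about the URL pattern):
wget -O- 'https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736' | grep -oE 'http://ftp.sra.ebi.ac.uk/[^"]+' | wget -i - -nc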
First, I recommend using a more specific and robust implementation...
but, if you are up against a wall and in a hurry -
$: curl -s https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 |
sed -En '/href="http:\/\/.*clean.fastq.gz"/{s/^.*href="([^"]+)".*/\1/;p;}' |
while read url; do wget "$url"; done
This is a quick and dirty rough first pass, but it will give you something to work with.
If you aren't in a screaming hurry, try writing something more robust and step-wise in perl or python.

How to download URLs in a csv and naming outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) csv (as fast/as parallel as possible) and name each output based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
A folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried two main approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports the -i option to import a file of URLs to download, and (I think) it processes them in parallel for maximum speed. It does have a --force-sequential option to force downloads in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into pieces and run a script like the following to process each piece:
#!/bin/bash
INPUT=$1
while IFS=, read serino url
do
aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this restarts aria2c for every line, which seems to cost time and lower the overall speed.
Though one can run the script multiple times to get 'shell-level' parallelism, that doesn't seem like the best way.
Any suggestion ?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
 out=001
http://farm8.staticflickr.com/7413/b.jpg
 out=002
http://farm4.staticflickr.com/3742/c.jpg
 out=003
However this won't create files 001.jpg, 002.jpg, … but 001, 002, … since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option tells xargs to run as many commands in parallel as possible.
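If the files should end up in the outputs folder from the example, the same pipeline can write there directly (a sketch; only the output path differs from the command above):
mkdir -p outputs
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "outputs/$1.jpg"' -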

How to get the highest numbered link from curl result?

I have created a small program consisting of a couple of shell scripts that work together; it's almost finished
and everything seems to work fine, except for one thing that I'm not really sure how to do,
which I need in order to finish this project.
There seem to be many routes that can be taken, but I just can't get there.
I have some curl results with lots of unused data including different links, and among all that data there is a bunch of similar links.
I only need to get (into a variable) the link with the highest number (without the always-same text).
The links are all similar, and have this structure:
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
I was thinking about something like:
content="$(curl -s "$url/$param")"
linksArray= get from $content all links that are in the href section of the links
that contain "always same text"
declare highestnumber;
for file in $linksArray
do
href=${1##*/}
fullname=${href%.html}
OIFS="$IFS"
IFS='_'
read -a nameparts <<< "${fullname}"
IFS="$OIFS"
if ${nameparts[1]} > $highestnumber;
then
highestnumber=${nameparts[1]}
fi
done
echo ${nameparts[1]}_${highestnumber}.html
result:
https://always/same/link/unique-name_19.html
This was just my guess; any working code that can be run from a bash script is fine.
Thanks.
Update
I found this nice program (xidel); it is easily installed by:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
It looks awesome, but I'm not really sure how to tweak it to my needs.
Based on that link and the answer below, I guess a possible solution would be:
use xidel, or use "sed -n 's/.*href="\([^"]*\)".*/\1/p' file" as suggested in that link, but then tweak it to get the whole link with the HTML tags, like:
<a href="https://always/same/link/same-name_17.html">always same text</a>
then filter out everything that doesn't end with ">always same text</a>",
and then use the grep/sort approach mentioned below.
Continuing from the comment, you can use grep, sort and tail to isolate the highest number in your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in a file dat/links.txt for the purposes of the example), you can easily isolate the highest number in a variable:
Example List
$ cat dat/links.txt
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is all one line, continued across two lines with the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)
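One caveat (my addition, not part of the original answer): a plain sort compares the names lexicographically, which only works while the numbers all have the same number of digits; same-name_9.html would sort after same-name_10.html. If the counter can cross a digit boundary, GNU sort's version sort should cope, for example:
myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort -V | tail -n1)
echo "myvar : '$myvar'"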
Why not use Xidel with xquery to sort the links and return the last?
xidel -q links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml
The input-format parameter makes sure you don't need any html tags at the start and end of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with gzip package), which essentially uses the same command internally as mentioned in other answers.
However, it reads better in a script and the command is easier to write manually.
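For example, applied to the pipeline from the question (a sketch; 'my regex', 'other regex' and the part prefix are placeholders, and -P is simply passed through to grep, so it needs a grep with Perl-regex support):
zgrep -P 'my regex' path/to/test/file.gz | grep -vP 'other regex' | split -l1000000 - part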
I'm afraid it's impossible; quoting from the gzip man page:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After the edit, if the gz only contains one file, a one-step tool like awk should be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/ {count+=1; print $0 > ("part" global ".txt"); if (count==1000000){count=0; global+=1}}'
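The question also asks to drop lines that match a second regex; that condition can be folded into the same awk program (a sketch; 'other regex' is the question's placeholder):
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/ && !/other regex/ {count+=1; print $0 > ("part" global ".txt"); if (count==1000000){count=0; global+=1}}'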
split is also a good choice, but you will have to rename the files afterwards.
Your solution is almost right. The problem is that you need to tell gzip what to do; to decompress, use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will get a bunch of files like xaa, xab, xac, ... I suggest using the PREFIX and numeric-suffix options to create better output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like file00, file01, file02, etc.
If you also want to drop lines that match another Perl-style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
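For the .zip variant mentioned in the question, the same pipeline should work with unzip -p, which extracts the archive member to standard output (a sketch, assuming the archive holds the single text file):
unzip -p path/to/test/file.zip | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file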
I hope this helps.

Making a script that uses 'sed' to patch hex strings inside binaries in OSX

Patching hex strings inside binaries with sed.
How do I use sed to open a binary file inside a .app, search for a unique string of hex values, replace it with a new string, and then save the binary and exit?
I have done a lot of research and I'm stuck.
Ultimately I would like to write this as a script; below I have written some code, as terminal commands, that basically doesn't work but represents what I want to happen to the best of my ability.
//binary patcher script attempt
hexdump -ve '1/1 "%.2X"' /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
sed "s/\x48\x85\xc0\x75\x33/\x48\x85\xc0\x74\x33/g" | \
xxd -r -p > /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched | \
cd /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/ | \
mv /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
sudo chmod u+x /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp
//returns 1 if the string is in the file
xxd -p /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | tr -d '\n' | grep -c '4885c07533'
(This is not used in the script at the moment, but I tested it and it does return 1 if the sequence is there, so I thought it would be handy if these patches are ever made into small applications of their own, implemented by means of something along the lines of:
'if(xxd -p /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | tr -d '\n' | grep -c '4885c07533' == 1){runTheRestOfTheScript;
else if (xxd -p /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | tr -d '\n' | grep -c '4885c07533' == 1){ThrowERROR;'
OK, so back to the stuff in the script:
//First it dumps the binaries hex information into memory
hexdump -ve '1/1 "%.2X"' /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
//calls sed to find the string of values and replace it with the new one.
sed "s/\x48\x85\xc0\x75\x33/\x48\x85\xc0\x74\x33/g" | \
//saves the new patched file as MyApp.Patched
xxd -r -p > /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched | \
//cds to the directory of the patched file
cd /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/ | \
// renames the file to its original executable name
mv /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
//sets the new file as executable after a password.
sudo chmod u+x /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp
Now, this is my first attempt and I am aware some of the commands are probably completely wrong. Really, apart from the fact that it does not do the patching and it deletes the contents of the binary, it works as far as the renaming goes, and hopefully it gives you an overview of how I need the script to run.
I am a real newbie, but I really need to get this done and I have no idea what to do.
I need this script to work by waiting for the user to point the program at the file that needs patching (since I'm patching apps I've made, it would preferably accept dragging a .app file into the window and find the binary in the MacOS folder by itself, but this will come later and could be done in various ways).
I then need it to search for the string in the binary and replace it with the edited string, in this case:
original: 4885c07533
patched: 4885c07433 (it's worth mentioning that this string will always be unique, but it may vary in length depending on the function that needs patching)
I then need to save it with the same name as the original, which this script handles by saving the patched file with .Patched appended and subsequently renaming it accordingly.
It then makes the file executable and exits, leaving the patched .app ready to run.
This method of creating patches would be particularly helpful if, for instance, I notice I have made a mistake in many of my programs: if the function is unique, I could make a single patch that edits the binaries at the touch of a button, holding only the section of code that is relevant to the patch.
So, to sum up:
What I am looking for is some way of getting this script working and maybe, if any of you can help, a little advice on turning this into a little application to make my life easier.
Many thanks in advance for any and all help you can offer.
I will be checking daily, so if I need to clarify something let me know and I'll be on it in a flash.
MiRAGE
With regards to the sed line
sed "s/\x48\x85\xc0\x75\x33/\x48\x85\xc0\x74\x33/g"
Firstly, you can use sed to change arbitrary binary data, but you should beware of newlines: sed always processes its input newline-separated, so if the value \x0a appears in your string you will have problems.
The following will allow you to treat the entire file as pure binary (call sed with the -n option so that it won't print out lines by default after processing them).
# Copy the first line into the hold space, then append the remaining lines
# (a bare H on every line would also prepend a spurious newline to the output)
1h
1!H
# On the last line the hold space contains all of the file - now swap pattern and hold space, operate on the pattern space and print the line
${
# exchange hold and pattern space
x
# do substitution
s/.../.../g
# print out result, required due to -n option
p
}
or, more succinctly
sed -n '1h;1!H;${x;s/.../.../g;p}'
When you append the pattern space to the hold space, the newline is re-inserted between them, so the whole file ends up in one space with its newlines intact and the newline issue is circumvented.
Also, in your example you used double quotes for your sed expression. Due to shell escaping rules for backslashes and the nature of sed, I would recommend the use of single quotes to avoid complication. Apologies if it is the case that this is not true for your shell.
Lastly, about sed, you should beware of special values contained in the hex.
If you escape a byte literal in sed with \x.., it is interpreted by first replacing the escaped byte literal with its value and then executing the line. Importantly, regex special characters still act as metacharacters even when written as \x escapes.
Example:
sed 's/\x5e\x2f/foo/'
# becomes
substitute pattern '\x5e\x2f' for 'foo'
# becomes
substitute pattern '^/' for 'foo'
# which matches a / at the beginning of a line, as opposed to the literal characters ^/
So the characters to look out for on the left of a substitution are the usual suspects, and beware \x26 (&) on the right hand side of a substitution.
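If operating on the raw bytes with sed feels fragile, another option is to do the substitution on a plain-hex dump of the file, where neither newlines nor regex metacharacters are a concern, and then rebuild the binary with xxd -r -p. A sketch along the lines of the hex-dump commands already in the question (the path and byte pattern are the question's examples; for very large binaries the single long line may be slow):
BIN="/Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp"
# dump as one long line of lowercase hex, patch the hex string, convert back to binary
# (in theory the hex match could start mid-byte; keep that in mind for short patterns)
xxd -p "$BIN" | tr -d '\n' | sed 's/4885c07533/4885c07433/' | xxd -r -p > "$BIN.Patched"
# swap in the patched binary and make sure it is executable
mv "$BIN.Patched" "$BIN"
chmod +x "$BIN"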
Hopefully that at least clarifies sed's potential role in your script :-).
