bash command for splitting the elements of an XPath - shell

I'm searching for a Linux shell command that will help me examine the elements of an XPath.
For example, given:
/a[#e=2]/bee/cee[#e<9]/#dee
I would need three commands returning, respectively:
a
[#e=2]
/bee/cee[#e<9]/#dee
Later I can repeat the process in order to analyse the whole XPath.
I have tried using sed and regular expressions, but I was not able to get it working.

Strange request ... but you can do this pretty easily with sed, using the pattern ^/\([-a-zA-Z0-9_]*\)\(\[[^]]*\]\)\(.*\)$. The segments that you requested are the three captures.
zsh% data='/a[#e=2]/bee/cee[#e<9]/#dee'
zsh% pattern='^/\([-a-zA-Z0-9_]*\)\(\[[^]]*\]\)\(.*\)$'
zsh% sed "s#$pattern#\1#" <($data)
a
zsh% sed "s#$pattern#\2#" <($data)
[#e=2]
zsh% sed "s#$pattern#\3#" <($data)
/bee/cee[#e<9]/#dee
zsh%
The pattern is very much specialized for the request. XPath expressions are pretty difficult to rip apart in shell - both cumbersome and expensive. Depending on what you are trying to accomplish, you would probably be better off translating the shell script into Python, Ruby, or Perl, depending on language preference. You might also want to take a look at using zsh instead of Bash. I've found it to be much more capable for advanced scripting.
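To make the "repeat the process" part concrete, here is a rough sketch of such a loop. It is only an illustration, not part of the answer above: the pattern is loosened so the predicate is optional and names may start with #, and it assumes GNU sed (a backreference to an unmatched group substitutes as empty) and a shell with <<< here-strings:
# illustrative loop; the loosened pattern and variable names are assumptions
pattern='^/\([-a-zA-Z0-9_#]*\)\(\[[^]]*\]\)\{0,1\}\(.*\)$'
rest='/a[#e=2]/bee/cee[#e<9]/#dee'
while [ -n "$rest" ]; do
  prev=$rest
  name=$(sed "s|$pattern|\1|" <<< "$rest")
  predicate=$(sed "s|$pattern|\2|" <<< "$rest")
  rest=$(sed "s|$pattern|\3|" <<< "$rest")
  echo "name=$name predicate=$predicate"
  # stop if the pattern failed to match (sed left the input unchanged)
  [ "$rest" = "$prev" ] && break
done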

Related

Unix: Optimized command for substituting words in a large file

This question is not related to any code issue. Just need your suggestions.
We have a file which is ~ 100GB and we are applying sed to substitute a few parameters.
This process is taking a long time and eating up CPU as well.
Will replacing sed with awk/tr/perl or any other Unix utility help in this scenario?
Note:
Please suggest something other than the time command.
You can do a couple of things to speed it up:
use fixed pattern matching instead of regexes wherever you can
run sed for example as LANG=C sed '...'
These two are likely to help a lot (there is a short sketch after these suggestions). Anything else will lead to just minor improvements, even with different tools.
About LANG=C - normally the matching is done in whatever encoding your environment is set to, which is likely UTF-8, and that causes additional lookups of the UTF-8 characters. If your patterns use just ASCII, then definitely go for LANG=C.
Other things that you can try:
if you have to use regexes then use the longest fixed character strings you can - this will allow the regex engine to skip non-matching parts of the file faster (it will skip bigger chunks)
avoid line by line processing if possible - the regex engine will not have to spend time looking for the newline character
Try different AWK implementations: mawk has been particularly fast for me.
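A quick illustration of the suggestions above; the file names and strings here are made up, and mawk stands in as a commonly available fast awk:
# fixed-string search in the C locale (grep -F skips the regex engine entirely)
LANG=C grep -F 'needle' big.log > hits.txt
# sed has no fixed-string mode, but the C locale still avoids UTF-8 lookups
LANG=C sed 's/foo/bar/g' big.log > big.out
# the same substitution in mawk
LANG=C mawk '{ gsub(/foo/, "bar"); print }' big.log > big.out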

How to read text between two particular words in Unix shell scripting

I want to read the text between two particular words in a text file, using Unix shell scripting.
For example in the following:
"My name is Sasuke Uchiha."
I want to get Sasuke.
This is one of the many ways it can be done:
To capture the text between "is" and "Uchiha" (keeping the space before "Uchiha" out of the capture, so the result has no trailing blank):
sed -n "s/^.*is \(.*\) Uchiha.*/\1/p" inFile
I'm tempted to add a "let me google that for you" link, but it seems like you're having a hard enough time as is.
What's the best way to find a string/regex match in files recursively? (UNIX)
Take a look at that. It's similar to what you're looking for. Regex is the go-to tool for matching strings and such, and grep is the easiest way to use it from the shell in Unix.
Take a look at this as well: http://www.robelle.com/smugbook/regexpr.html
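With GNU grep built with PCRE support, grep -o is a common alternative; a sketch (\K discards everything matched so far, and the lookahead stops before " Uchiha"):
grep -oP 'is \K.*?(?= Uchiha)' inFile
# prints: Sasuke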

Store and query a mapping in a file, without re-inventing the wheel

If I were using Python, I'd use a dict. If I were using Perl, I'd use a hash. But I'm using a Unix shell. How can I implement a persistent mapping table in a text file, using shell tools?
I need to look up mapping entries based on a string key, and query one of several fields for that key.
Unix already has colon-separated records for mappings like the system passwd table, but there doesn't appear to be a tool for reading arbitrary files formatted in this manner. So people resort to:
key=foo
fieldnum=3
value=$(cat /path/to/mapping | grep "^$key:" | cut -d':' -f$fieldnum)
but that's pretty long-winded. Surely I don't need to make a function to do that? Hasn't this wheel already been invented and implemented in a standard tool?
Given the conditions, I don't see anything hairy in the approach. But maybe consider awk to extract the data. The awk approach allows for picking only the first or the last entry, or imposing arbitrary additional conditions:
value=$(awk -F: "/^$key:/{print \$$fieldnum}" /path/to/mapping)
Once bundled in a function it's not that scary:)
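Bundled into a function, it might look like this sketch (the function name and argument order are made up; -v passes shell values into awk without interpolating them into the program text):
lookup () {
  # usage: lookup KEY FIELDNUM FILE
  # prints the requested field of the first record whose first
  # colon-separated field equals KEY exactly
  awk -F: -v key="$1" -v field="$2" '$1 == key { print $field; exit }' "$3"
}
value=$(lookup foo 3 /path/to/mapping)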
I'm afraid there's no better way, at least within POSIX. But you may also have a look at the join command.
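For reference, a rough join invocation; both inputs must be sorted on the join field, and the file names are examples:
sort -t: -k1,1 /path/to/mapping > mapping.sorted
echo 'foo' | join -t: mapping.sorted -
# prints the whole matching record, e.g. foo:field2:field3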
Bash supports arrays, which is not exactly the same thing. See for example this guide.
area[11]=23
area[13]=37
area[51]=UFOs
echo ${area[11]}
See this LinuxJournal article for Bash >= 4.0. For other versions of Bash you can fake it:
hput () {
  # store: creates a shell variable named hash<key>
  eval hash"$1"='$2'
}
hget () {
  # fetch: expands the variable named hash<key>
  eval echo '"${hash'"$1"'}"'
}
# then
hput a blah
hget a # yields blah
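For Bash >= 4.0, mentioned above, real associative arrays make the workaround unnecessary; a minimal sketch:
declare -A hash    # -A is what makes the array associative
hash[a]=blah
echo "${hash[a]}"  # yields blah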
Your example is one of several ways to do this using shell tools. Note that cat is unnecessary.
key=foo
fieldnum=3
filename=/path/to/mapping
value=$(grep "^$key:" "$filename" | cut -d':' -f$fieldnum)
Sometimes join comes in handy, too.
AWK, Python, Perl, sed and various XML, JSON and YAML tools as well as databases such as MySQL and SQLite can also be used, of course.
Without them, everything else can get convoluted. Unfortunately, there isn't any "standard" utility. I would say that the answer posted by pooh comes closest. AWK is especially adept at dealing with plain-text fields and records.
The answer in this case appears to be: no, there's no widely-available implementation of the ‘passwd’ file format for the general case, and wheel re-invention is necessary in each case.

How do you convert character case in UNIX accurately? (assuming i18n)

I'm trying to get a feel for how to manipulate characters and character sets in UNIX accurately, given the existence of differing locales - and doing so without requiring special tools outside of standard UNIX items.
My research has shown me the problem of the German sharp-s character: one character changes into two when uppercased - and other problems. Using tr is apparently a very bad idea. The only alternative I see is this:
echo StUfF | perl -n -e 'print lc($_);'
but I'm not certain that will work, and it requires Perl - not a bad requirement necessarily, but a very big hammer...
What about awk and grep and sed and ...? That, more or less, is my question: how can I be sure that text will be lower-cased in every locale?
Perl lc/uc works fine for most languages, but it won't work correctly with Turkish; see this bug report of mine for details. But if you don't need to worry about Turkish, Perl is good to go.
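Expanding on that, a hedged sketch with explicit UTF-8 handling (assuming a UTF-8 locale; -CSD tells Perl to decode STDIN and encode STDOUT as UTF-8, so multi-byte characters are cased as characters rather than bytes):
echo 'StUfF Straße' | perl -CSD -pe '$_ = lc'   # -> stuff straße
# the one-to-two expansion mentioned in the question appears when uppercasing:
echo 'Straße' | perl -CSD -pe '$_ = uc'         # -> STRASSE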
You can't be sure that text will be correct in every locale. That's not possible; there are always some errors in software libraries in the implementation of i18n-related stuff.
If you're not afraid of using C++ or Java, you may take a look at ICU, which implements a broad set of collation, normalization, etc. rules.

What is the best character to use as a delimiter in a custom batch syntax?

I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot be used for the first three? That would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!
You can use either:
A control character: control characters don't normally appear in text files. Tab (\t) is probably the best choice here.
Some combination of characters which is unlikely to occur in your files, e.g. #s#.
Tab is the generally preferred choice though.
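As an illustration of the tab-delimited option (the file name and fields are invented):
# one record per line: url, destination folder, file name, separated by tabs
printf '%s\t%s\t%s\n' 'http://example.com/a.png' '/tmp/pics' 'a.png' > batch.tsv
while IFS=$'\t' read -r url folder name; do
  echo "would fetch $url into $folder/$name"
done < batch.tsv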
Why not just use something that exists already? There are one or two choices: perl, python, ruby, bash, sh, csh, Groovy, ECMAScript, heaven forbid, Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of colliding with the contents of any variable you may have (which precludes #, /, :, etc). The comma (,) looks good to me (unless your custom messages have a few), or < and > (subject to the previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
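A rough sketch of such escaping (the backslash convention and function name are assumptions); backslashes must be escaped before tabs so the encoding stays reversible:
escape () { sed -e 's/\\/\\\\/g' -e "s/$(printf '\t')/\\\\t/g"; }
printf 'a\tb\n' | escape   # -> a\tb (a two-character escape, not a real tab)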
This sounds like a really bad idea. There is no need to create yet another (data-representation) language; there are plenty that might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|'.
It's one of the rarest characters.
How about String.fromCharCode(1)?
