Differences between boost::string_ref and boost::string_view

Boost provides two different implementations of string_view, which will be a part of C++17:
boost::string_ref in utility/string_ref.hpp
boost::string_view in core/string_view.hpp
Are there any significant differences between these? Which should be preferred going forward?
Note: I noticed in Boost 1.61, boost::log has deprecated string_ref in favor of string_view; perhaps that's an indicator? (http://www.boost.org/users/history/version_1_61_0.html)

Funnily enough, I'm at the ACCU conference right now with Marshall Clow (the force behind string_view et al. on the committee), and earlier today I was quite literally about to ask him at the bar, before I was called away, about his views on string_view versus gsl::span<T> from Bjarne's Guideline Support Library (GSL), which is a very similar thing (gsl-lite is my personal favourite implementation of the GSL, as it's C++03-compatible, but there are many others). I had heard the two were to be unified into a single implementation for standardisation, with the gsl::span<T> direction being the future, but I'll report back here from the horse's mouth himself if I'm wrong on that. For now, assume the gsl::span<T> direction is the current future and that Boost will be updated to have something similar soon, even if using string_view = gsl::span<char> is essentially string_view.
Edit: I just spoke to Marshall downstairs. He tells me that string_view, as per the implementation in Boost, is definitely in C++17. array_view is not, nor is anything else historically surrounding string_view for now.
The GSL string_span is a separate entity not expected to enter C++17, nor are there any present plans to unify the implementations, as they solve different use cases. Specifically, string_view is always a constant view of the borrowed character array, whereas string_span is expected to be a potentially modifiable view of the borrowed character array, with potential uses as a source for constructing new strings. string_span might therefore eventually become a generalisation of string_view in some future C++ standard.
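To make that constant-view versus modifiable-view distinction concrete, here is a rough sketch using plain structs rather than the real Boost/GSL classes; the type and function names below are purely illustrative and are not the actual string_view or string_span interfaces.
// Illustrative only: a read-only borrowed view vs. a writable borrowed view.
#include <cstddef>

struct const_char_view {   // string_view-like: borrowed and read-only
    const char* data;
    std::size_t size;
};

struct mutable_char_view { // string_span-like: borrowed but writable
    char* data;
    std::size_t size;
};

// A constant view can only be inspected...
std::size_t count_spaces(const_char_view v) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < v.size; ++i)
        if (v.data[i] == ' ') ++n;
    return n;
}

// ...whereas a modifiable view can change the borrowed characters in place.
void shout(mutable_char_view v) {
    for (std::size_t i = 0; i < v.size; ++i)
        if (v.data[i] >= 'a' && v.data[i] <= 'z')
            v.data[i] = static_cast<char>(v.data[i] - ('a' - 'A'));
}

int main() {
    char buffer[] = "hello world";
    shout({buffer, sizeof(buffer) - 1});
    return count_spaces({buffer, sizeof(buffer) - 1}) == 1 ? 0 : 1;
}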

According to this email from the boost mailing list, boost::string_ref won't be used in the future and is being replaced by string_view in other boost libraries.
boost::string_view has the following advantages:
- It better matches what the standards committee is doing for C++17.
- It has WAY more constexpr support.
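As a minimal sketch of that constexpr point (assuming Boost 1.61 or later, where <boost/utility/string_view.hpp> provides boost::string_view, and a C++14 compiler; exact constexpr coverage varies with the Boost and compiler versions):
// Compile-time queries on a boost::string_view; boost::string_ref offers
// much less of this.
#include <boost/utility/string_view.hpp>

constexpr boost::string_view greeting("hello", 5);

static_assert(greeting.size() == 5, "size() usable in constant expressions");
static_assert(greeting[0] == 'h', "element access usable at compile time");

int main() {}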

Related

Complexity of IDE error detection and auto-completion dependent upon language syntax?

Are fewer checks/less rigorous code analysis required to provide development-environment error feedback and auto-completion for programming languages that are composed largely of human-readable phrases and words (e.g., Python, VB.NET)? This is in contrast to C-style languages, which depend more upon symbols and punctuation for code structure.
I have experience/am responsible for building dozens of language front ends.
Wordy languages vs. punctuationy languages are generally equally hard to parse and statically analyze.
The folks that define languages of either kind have either been decorating them for decades (e.g., COBOL since 1958), or building sophisticated languages (C++, Scala, Ruby) with both complex syntax and complex name resolution and type inference rules; the compiler vendors then proceed to add obscure syntax to support the strange things they do or to provide customer lock-in (e.g., MS "managed C++", DLL declarations, etc.). There's the third problem of lousy definitions; the top languages may have precise rules about how they work, but many languages have sloppy definitions (e.g., PHP) which create dark corner cases that have to be ironed out by painful experimentation with the actual implementation.
C++ has been our worst, especially with the C++11 committee making a massive recent mess of things. We have full C++ parsers, but are still working on full name resolution for C++11 on top of our C++98 implementation. (The name resolution code is some 250,000 lines of code and it's not enough!)
IBM COBOL is a close second; the language is just giant, and there are all sorts of funny name resolution rules ("an unqualified name can refer to a particular name without qualification if the reference is unambiguous" -- so, is this name an unambiguous reference in this context?).
Once you get past parsing and name/type resolution, then you get into control flow, data flow, points-to analysis, range analysis, call graph construction, ... which are generally about the same amount of effort as the earlier phases; we get away with less by having really good libraries that support these tasks.
With all this as background analyses, you can start to do "static analysis" of the smart kind that people want.
Another poster noted the difficulty of recovering from syntax errors while (emphasis) "continuing to generate meaningful error messages". All I can say to this is "Amen, brother". See this SO answer https://stackoverflow.com/a/6657974/120163 for a discussion of what goes wrong when you have "partial programs", which is essentially what you get when syntax error repairs guess at a fix.

GTK data types vs base data types

I'm starting to fiddle around a little bit with GTK+ for some little project.
GLib defines a series of data types, like gint, gpointer and so on, which are just typedefs of base data types (gint is a typedef for int, gpointer for void*, and so on).
Now, say I have a function or a class that does in no way make use of GTK. I would be really tempted to use the base data types so that I can reuse the class/function somewhere else even if I don't include the GTK headers.
On the other hand, I find it quite ugly to have a mix of gint and int in the code, when they are actually the same thing.
In summary, I am wondering whether there is a standard practice of when to use one or the other, or if one should just mix them at will...
I deal with this issue a lot working with third-party libraries, where they all want their own type aliases for integers, floats, longs, shorts, byte aliases instead of chars, etc.
It's very annoying. This is often done to ensure portability, but it ends up giving each library its own standards.
What I find most displeasing here is from a coupling perspective. I might have a general mesh interface which should be decoupled from any rendering concerns. Yet some of its data may be passed directly to an OpenGL function which assumes that the size of the integers we pass will match sizeof(GLint).
In some cases this isn't merely aesthetic. It might not even be plausible to include GL headers in this mesh header, as it may be part of a widely-used software development kit which should not impose such compile-time dependencies on the third-party plugin writers who use it.
Yet portability is an issue. I managed to survive a nightmarish scenario in a very large-scale legacy C codebase where the implicit assumption was made throughout the codebase that sizeof(int) == sizeof(void*). It took years of looking for needles in a haystack to port this codebase to 64-bit.
What I've personally settled on over the years is to favor plain old unaliased data types. I've also taken a liking to just using signed integers, e.g. I found it a nuisance in the past even to avoid warnings in basic loops through containers where some would use int, others unsigned int, others size_t, etc. to indicate the number of elements contained. At least personally, I found my maintenance time reduced by just favoring int without a very good reason not to do so.
To try to mitigate a potential worst-case scenario on some platform where, e.g., sizeof(int) != sizeof(GLint), I tend to liberally sprinkle assertions around code that assumes these two are equal: assert(sizeof(int) == sizeof(GLint));. This should significantly mitigate the pain associated with that kind of nightmarish scenario I faced before when porting from 32-bit to 64-bit. It also explicitly expresses these assumptions.
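Here is a self-contained sketch of that pattern. GLint is stood in by a typedef so the snippet compiles without the GL headers (a real rendering source file would include <GL/gl.h> instead), and since both sizes are known at compile time a static_assert can do the same job as the runtime assert:
#include <cassert>
#include <cstdint>
#include <vector>

using GLint = std::int32_t;  // stand-in for the real GLint from <GL/gl.h>

// The public mesh interface stays in plain int, with no GL dependency.
void upload_indices(const std::vector<int>& indices) {
    // State and enforce the layout assumption right at the GL boundary.
    static_assert(sizeof(int) == sizeof(GLint),
                  "plain int indices are reinterpreted as GLint below");
    const GLint* gl_indices = reinterpret_cast<const GLint*>(indices.data());
    (void)gl_indices;  // would be handed to glBufferData / glDrawElements here
}

int main() {
    upload_indices({1, 2, 3});
}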
I've found this to establish a comfortable balance for my case. Of course this is all subjective and can vary considerably based on your use cases. But this is one possible solution that might allow you to just favor plain old unaliased data types more and more in spite of all these third party libraries and not face a worst-case scenario if your assumptions cease to be correct on some platform.

Why doesn't Haskell have symbols (a la ruby) / atoms (a la erlang)?

The two languages where I have used symbols are Ruby and Erlang and I've always found them to be extremely useful.
Haskell does have algebraic datatypes, but I still think symbols would be mighty convenient. An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
The syntactic sugar for atoms can be minor - :something or <something> is an atom. All atoms are instances of a Type called Atom which derives Show and Eq. You can then use it for more descriptive error codes, for example
type ErrorCode = Atom
type Message = String
data Error = Error ErrorCode Message
loginError = Error :redirect "Please login first"
In this case :redirect is more efficient than using a string ("redirect") and easier to understand than an integer (404).
The benefit may seem minor, but I say it is worth adding atoms as a language feature (or at least a GHC extension).
So why have symbols not been added to the language? Or am I thinking about this the wrong way?
I agree with camccann's answer that it's probably missing mainly because it would have to be baked quite deeply into the implementation and it is of too little use for that level of complication. In Erlang (and Prolog and Lisp) symbols (or atoms) usually serve as special markers and fill mostly the same role as a constructor. In Lisp, the dynamic environment includes the compiler, so it's partly also a (useful) compiler concept leaking into the runtime.
The problem is the following: symbol interning is impure (it modifies the symbol table). Because we never modify an existing object it is still referentially transparent, but if implemented naïvely it can lead to space leaks in the runtime. In fact, as currently implemented in Erlang you can actually crash the VM by interning too many symbols/atoms (the current limit is 2^20, I think), because they can never get garbage collected. It's also difficult to implement in a concurrent setting without a huge lock around the symbol table.
Both problems can be (and have been) solved, however. For example, see Erlang EEP 20. I use this technique in the simple-atom package. It uses unsafePerformIO under the hood, but only in (hopefully) rare cases. It could still use some help from the GC to perform an optimisation similar to indirection shortening. It also uses quite a few IORefs internally which isn't too great for performance and memory usage.
In summary, it can be done but implementing it properly is non-trivial. Compiler writers always weigh the power of a feature against its implementation and maintenance efforts, and it seems like first-class symbols lose out on this one.
I think the simplest answer is that, of the things Lisp-style symbols (which is where both Ruby and Erlang got the idea, I believe) are used for, in Haskell most are either:
- Already done in some other fashion--e.g. a data type with a bunch of nullary constructors, which also behave as "convenient names for integers".
- Awkward to fit in--things that exist at the level of language syntax instead of being regular data usually have more type information associated with them, but symbols would have to either be distinct types from each other (nearly useless without some sort of lightweight ad-hoc sum type) or all the same type (in which case they're barely different from just using strings).
Also, keep in mind that Haskell itself is actually a very, very small language. Very little is "baked in", and of the things that are, most are just syntactic sugar for other primitives. This is a bit less true if you include a bunch of GHC extensions, but GHC with -XAndTheKitchenSinkToo is not the same language as Haskell proper.
Also, Haskell is very amenable to pseudo-syntax and metaprogramming, so there's a lot you can do even without having it built in. Particularly if you get into TH and scary type metaprogramming and whatever else.
So what it mostly comes down to is that most of the practical utility of symbols is already available from other features, and the stuff that isn't available would be more difficult to add than it's worth.
Atoms aren't provided by the language, but can be implemented reasonably as a library:
http://hackage.haskell.org/package/simple-atom
There are a few other libs on hackage, but this one looks the most recent and well-maintained.
Haskell uses type constructors* instead of symbols so that the set of symbols a function can take is closed, and can be reasoned about by the type system. You could add symbols to the language, but it would put you in the same place that using strings would - you'd have to check all possible symbols against the few with known meanings at runtime, add error handling all over the place, etc. It'd be a big workaround for all the compile-time checking.
The main difference between strings and symbols is interning - symbols are atomic and can be compared in constant time. Both are types with an essentially infinite number of distinct values, though, which goes against the grain of Haskell's practice of specifying arguments and results with finite types.
* I'm more familiar with OCaml than Haskell, so "type constructor" may not be the right term. I mean things like None or Just 3.
An immediate use that springs to mind is that since symbols are isomorphic to integers you can use them where you would use an integral or a string "primary key".
Use Enum instead.
data FileType = GZipped | BZipped | Plain
  deriving Enum

descr ft = ["compressed with gzip",
            "compressed with bzip2",
            "uncompressed"] !! fromEnum ft

What levels should static analyzers analyze?

I've noticed that some static analyzers operate on source code, while others operate on bytecode (e.g., FindBugs). I'm sure there are even some that work on object code.
My question is a simple one, what are the advantages and disadvantages of writing different kinds of static analyzers for different levels of analysis?
Under "static analyzers" I'm including linters, bug finders, and even full-blown verifiers.
And by levels of analysis I would include source code, high-level IRs, low-level IRs, bytecode, object code, and compiler plugins that have access to all phases.
The following considerations can influence the level at which an analyzer may decide to work:
1. Designing a static analyzer is a lot of work. It would be a shame not to factor this work for several languages compiled to the same bytecode, especially when the bytecode retains most of the structure of the source program: Java (FindBugs), .NET (various tools related to Code Contracts). In some cases, the common target language was made up for the purpose of analysis, although the compilation scheme wasn't following this path.
2. Related to 1, you may hope that your static analyzer will be a little less costly to write if it works on a normalized version of the program with a minimum number of constructs. When authoring static analyzers, having to write the treatment for repeat ... until when you have already written while ... do is a bother. You may structure your analyzer so that several functions are shared for these two cases, but the carefree way to handle this is to translate one to the other, or to translate the source to an intermediate language that only has one of them (see the short sketch after this list).
3. On the other hand, as already pointed out in Flash Sheridan's answer, source code contains the most information. For instance, in languages with fuzzy semantics, bugs at the source level may be removed by compilation. C and C++ have numerous "undefined behaviors" where the compiler is allowed to do anything, including generating a program that works accidentally. Fine, you might think: if the bug is not in the executable, it's not a problematic bug. But when you re-compile the program for another architecture or with the next version of the compiler, the bug may appear again. This is one reason for not doing the analysis after any phase that might potentially remove bugs.
4. Some properties can only be checked with reasonable precision on compiled code. That includes the absence of compiler-introduced bugs, as pointed out again by Flash Sheridan, but also worst-case execution time. Similarly, many languages do not let you know what floating-point code does precisely unless you look at the assembly generated by the compiler (this is because existing hardware does not make it convenient for them to guarantee more). The choice is then to write an imprecise source-level analyzer that takes into account all possibilities, or to analyze precisely one particular compilation of a floating-point program, as long as it is understood that it is that precise assembly code that will be executed.
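To make point 2 concrete, here is a toy illustration of that kind of loop normalization (the functions are made up for the example): the two forms behave identically, so a front end can lower the first into the second and the analyzer only ever has to understand one loop construct.
#include <cstdio>

static int remaining = 3;
static bool more() { return --remaining > 0; }
static void body() { std::puts("step"); }

void as_written() {     // "repeat ... until" / do-while shape
    do { body(); } while (more());
}

void as_normalized() {  // body emitted once, then a plain while loop
    body();
    while (more()) { body(); }
}

int main() {
    as_written();
    remaining = 3;      // reset so both variants run the same trace
    as_normalized();
}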
Source code analysis is the most generally useful, of course; sometimes heuristics even need to analyze comments or formatting. But you’re right that even object code analysis can be necessary, e.g., to detect bugs introduced by GCC misfeatures. Thomas Reps, head of GrammaTech and a Wisconsin professor, gave a good talk on this at Stanford a couple of years ago: http://pages.cs.wisc.edu/~reps/#TOPLAS-WYSINWYX.

Pointer aliasing in C++0x

I'm thinking about (just as an idea) disjointed pointer aliasing in C++0x. I was thinking about seeing if it could be implemented similarly to const correctness- that is, enforced by the compiler. What would be the requirements for such a thing? As this is more of a thought experiment, I'm perfectly happy to look at solutions that destroy legacy code or redefine half the language and that kind of thing.
What I'd really rather not do is have, say, restrict from C99 where the programmer just promises it. It should be enforced.
I was thinking about having unique_ptr be not part of the library, but part of the language. That way, the compiler can perform special optimizations on it, and people can still write their own unique pointer classes if they need to.
The Standard C++ Library (including std::unique_ptr) is a part of the language.
Also, conforming programs are not allowed to add declarations and definitions to the namespace std.
Upon seeing an instantiation of std::unique_ptr<T>, the compiler knows everything about the behavior of this instantiation: it is exactly the behavior implemented as part of the language implementation the compiler itself belongs to, so the compiler is free to perform "special optimizations" based on the guarantees of the C++ standard.
As an example of something coming from the same line of thinking, GCC already does this with a number of standard C99 functions in hosted mode: it may replace standard function calls with inline instruction sequences or with calls to other functions, precisely because GCC knows the exact semantics just from the name of the function.
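A small illustration of that kind of substitution: with optimization enabled, GCC typically lowers a printf call whose format string is a constant ending in '\n' with no conversions into a call to puts, which is visible by inspecting the generated assembly (e.g. gcc -O1 -S).
#include <cstdio>

int main() {
    std::printf("hello, world\n");  // commonly emitted as puts("hello, world")
    return 0;
}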

Resources