fs.hdfs.impl.disable.cache causes SparkSQL slowness


This question is related to "Hive/Hadoop intermittent failure: Unable to move source to destination".
We found that we could avoid the problem of "Unable to move source ... Filesystem closed" by setting fs.hdfs.impl.disable.cache to true.
However, we also observed that SparkSQL queries became very slow: queries that used to finish within a few seconds now take 30 to 40 seconds or more, even when the query is very simple, such as reading a tiny table.
Is this normal?
My understanding is that when fs.hdfs.impl.disable.cache is true, FileSystem#get() always calls createFileSystem() instead of returning a cached FileSystem. This prevents a FileSystem object from being shared by multiple clients, which makes sense because it stops, for example, two callers of FileSystem#get() from closing each other's filesystem.
(For example, see this discussion.)
I would expect this setting to slow things down, but probably not by this much.
From: hadoop-source-reading
/**
 * Returns the FileSystem for this URI's scheme and authority. The scheme of
 * the URI determines a configuration property name,
 * <tt>fs.<i>scheme</i>.class</tt> whose value names the FileSystem class.
 * The entire URI is passed to the FileSystem instance's initialize method.
 */
public static FileSystem get(URI uri, Configuration conf)
    throws IOException {
  String scheme = uri.getScheme();
  String authority = uri.getAuthority();
  if (scheme == null) { // no scheme: use default FS
    return get(conf);
  }
  if (authority == null) { // no authority
    URI defaultUri = getDefaultUri(conf);
    if (scheme.equals(defaultUri.getScheme()) // if scheme matches default
        && defaultUri.getAuthority() != null) { // & default has authority
      return get(defaultUri, conf); // return default
    }
  }
  String disableCacheName = String.format("fs.%s.impl.disable.cache",
      scheme);
  if (conf.getBoolean(disableCacheName, false)) {
    return createFileSystem(uri, conf);
  }
  return CACHE.get(uri, conf);
}
Would the slowness point to some other networking issues, such as resolving domain names? Any insights to this problem are welcome.
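To make the trade-off in the last branch of the method above concrete, here is a minimal toy model in plain Java. It is not Hadoop code: FsCacheSketch, Handle and createdCount are hypothetical names, and the real cache key involves more than the URI string. It only illustrates that a cached lookup hands the same object to every caller (so one caller's close() affects the others), while the uncached path builds a new object on every call, which is where any per-instance setup cost would be paid repeatedly.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the cache/no-cache branch in FileSystem.get(); not Hadoop code. */
final class FsCacheSketch {

    /** Hypothetical stand-in for a FileSystem instance. */
    static final class Handle {
        private boolean open = true;

        void close() {
            open = false;
        }

        boolean isOpen() {
            return open;
        }
    }

    // Simplified cache keyed by the URI string only (an assumption of this sketch).
    private final Map<String, Handle> cache = new HashMap<>();
    private int created = 0;

    /** Mirrors the branch: a fresh instance when caching is disabled, a shared one otherwise. */
    Handle get(String uri, boolean disableCache) {
        if (disableCache) {
            return create();
        }
        return cache.computeIfAbsent(uri, key -> create());
    }

    private Handle create() {
        created++; // counts how often the (potentially expensive) setup runs
        return new Handle();
    }

    int createdCount() {
        return created;
    }
}
```

With caching enabled, two get() calls for the same URI return the same Handle, so closing one closes "both". With caching disabled, each call creates its own Handle, and createdCount() grows with every call.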

Related

Performance in microservice-to-microservice data transfer

I have controller like this:
@RestController
@RequestMapping("/stats")
public class StatisticsController {
    @Autowired
    private LeadFeignClient lfc;
    private List<Lead> list;

    @GetMapping("/leads")
    private int getCount(@RequestParam(value = "count", defaultValue = "1") int countType) {
        list = lfc.getLeads(AccessToken.getToken());
        if (countType == 1) {
            return MainEngine.getCount(list);
        } else if (countType == 2) {
            return MainEngine.getCountRejected(list);
        } else if (countType == 3) {
            return MainEngine.getCountPortfolio(list);
        } else if (countType == 4) {
            return MainEngine.getCountInProgress(list);
        } else if (countType == 5) {
            return MainEngine.getCountForgotten(list);
        } else if (countType == 6) {
            return MainEngine.getCountAddedInThisMonth(list);
        } else if (countType == 7) {
            return MainEngine.getCountAddedInThisYear(list);
        } else {
            throw new RuntimeException("Wrong mapping param");
        }
    }

    @GetMapping("/trends")
    private boolean getTrend() {
        return MainEngine.tendencyRising(list);
    }
}
It is basically a microservice that computes statistics based on a list of 'Business Leads'. The FeignClient GETs a list of leads trimmed to the required data. Everything works properly.
My only concern is performance: all of these statistics (countTypes) will be presented on the landing page of the web app. If I call them one by one, will every call retrieve the lead list again, or will the list be stored in memory somehow? I can imagine that if the list becomes longer, it could take a while to load.
I've tried to load the list outside this method, using @PostConstruct, to populate it when the service starts, but this solution has two major problems: authentication cannot be handled by the OAuth token, and the retrieved list will not reflect added or deleted leads, because it is loaded only at startup.

The call list = lfc.getLeads(AccessToken.getToken()); will be executed on each GET request. Take a look at caching the responses, which is useful when you need to obtain a large volume of data often.
I'd start with Baeldung's Spring cache tutorial, which gives you an idea of how caching works. Then you can take a look at the EhCache implementation, or implement your own interceptor that puts data into and gets it from external storage such as Redis.
Caching is the only way I see to resolve this: since the Feign client is called with a different request (based on the token), the data are not static and need to be cached.
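As a rough illustration of the idea, independent of Spring's cache abstraction, EhCache or Redis, here is a minimal in-memory sketch in plain Java. TtlCache is a hypothetical class, not a library API. It keeps one value per key (for example, per token) and reloads it only after a time-to-live has passed, so several statistics requests in quick succession would reuse one fetched list while still picking up added or deleted leads after expiry.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical time-to-live cache; a sketch, not a replacement for a real cache library. */
final class TtlCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested without sleeping

    TtlCache(Duration ttl, LongSupplier clock) {
        this.ttlMillis = ttl.toMillis();
        this.clock = clock;
    }

    /** Returns the cached value for key, calling loader only if it is missing or expired. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        // Note: the loader runs while the map holds a lock for this key's bin,
        // which is acceptable for a sketch but may matter for slow remote calls.
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && old.expiresAtMillis > now)
                        ? old
                        : new Entry<>(loader.apply(k), now + ttlMillis));
        return entry.value;
    }
}
```

In the controller from the question, the Feign call could then be wrapped along the lines of leadCache.get(token, t -> lfc.getLeads(t)), where leadCache is an assumed TtlCache<String, List<Lead>> field created with System::currentTimeMillis as the clock. Entries are never evicted in this sketch, so a real implementation would also need a size bound or cleanup.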

You need to implement a caching layer to improve performance. You can preload the cache immediately after the application starts, so that the response is already in the cache when it is first requested. I would suggest using a Redis cache, but any cache will do the job.
Also, it would be better to move the logic of getCount() into a service class.

Azure cache write implementation approaches - when to use which

I used to call the Put(Key, Value) method to set data in Azure cache. I later learnt that this method could lead to race conditions during writes and introduced the following code for setting data into cache.
try
{
    if (GetData(key) == null)
    {
        _cache.Add(key, "--dummy--");
    }
    DataCacheLockHandle lockHandle;
    TimeSpan lockTimeout = TimeSpan.FromMinutes(1);
    _cache.GetAndLock(key, lockTimeout, out lockHandle);
    if (ttlInMinutes == 0)
    {
        _cache.PutAndUnlock(key, value, lockHandle);
    }
    else
    {
        TimeSpan ttl = TimeSpan.FromMinutes(ttlInMinutes);
        _cache.PutAndUnlock(key, value, lockHandle, ttl);
    }
}
catch (Exception e)
{}
This involves two I/O operations, compared with one in the previous call. Is this locking really needed in application code? Is cache consistency not taken care of by Azure's caching framework? What is the standard way of managing cache writes in Azure? When should Put be used, and when PutAndUnlock?

Cache Shows Old Values on IIS7, not Debug Server

I have a pretty standard MVC3 application. I'm trying to store some data that's application-wide (not user-wide) in the cache (in this case, a Theme object/name). When debugging (on the development server that integrates with Visual Studio), if I call SwitchTheme, I see the new theme right away. On IIS7, whatever theme was cached stays cached; it doesn't update to the new theme.
Edit: Some code:
public static Theme CurrentTheme { get {
    Theme currentTheme = HttpContext.Current.Cache[CURRENT_THEME] as Theme;
    if (currentTheme == null)
    {
        string themeName = DEFAULT_THEME;
        try
        {
            WebsiteSetting ws = WebsiteSetting.First(w => w.Key == WebsiteSetting.CURRENT_THEME);
            if (ws != null && !string.IsNullOrEmpty(ws.Value))
            {
                themeName = ws.Value;
            }
        }
        catch (Exception e)
        {
            // DB not inited, or we're installing, or something broke.
            // Don't panic, just use the default.
        }
        // Sets HttpContext.Current.Cache[CURRENT_THEME] = new Theme(themeName)
        Theme.SwitchTo(themeName);
        currentTheme = HttpContext.Current.Cache[CURRENT_THEME] as Theme;
    }
    return currentTheme;
} }

public static void SwitchTo(string name)
{
    HttpContext.Current.Cache.Insert(CURRENT_THEME, new Theme(name), null, System.Web.Caching.Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(30));
    // Persist change to the DB.
    // But don't do this if we didn't install the application yet.
    try
    {
        WebsiteSetting themeSetting = WebsiteSetting.First(w => w.Key == WebsiteSetting.CURRENT_THEME);
        if (themeSetting != null)
        {
            themeSetting.Value = name;
            themeSetting.Save();
        }
        // No "else"; if it's not there, we're installing, or Health Check will take care of it.
    }
    catch (Exception e)
    {
        // DB not inited or install not complete. No worries, mate.
    }
}
I'm not sure where the problem is. I am calling the same method and updating the cache; but IIS7 just shows me the old version.
I can disable output caching in IIS, but that's not what I want to do. That seems like a hacky work-around at best.

Without a code sample it's difficult to know what your problem is. In an attempt to provide some assistance, here is how I frequently set the cache in my applications:
public static void SetCache(string key, object value) {
    if (value != null) {
        HttpRuntime.Cache.Insert(key, value, null, System.Web.Caching.Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(30));
    }
}

The HTTP cache is reset only if you do so manually or the app domain (or app pool) resets for whatever reason. Are you sure that's not happening in this case? And generally speaking, any global static variables would also be maintained in memory under the same circumstances.
There are many reasons why an app pool might be reset at any given point, such as a change to a web.config file, etc. I suggest checking that's not happening in your case.
By the way, output caching is a different thing, although it is maintained in memory largely the same way.

Given that this only happens on IIS7 when Output Caching is not disabled, this seems very likely to be an IIS7 bug. Seriously.
Whether it is or not, is irrelevant to the solution. What you need to do is find some manual process of invalidating the cache, such as touching the web.config file.
But beware: doing this will wipe out the cache (as you expect), but also all static variables (as a side-effect). Whether this is another bug or not, I don't know; but in my case, this was sufficient to solve the problem.

.Net 4/Mvc Runtime Cache strangeness

Update: I have dropped the cache system in favor of a database solution. Pity.
I have a backend MVC controller where I need data caching. I use MemoryCache.Default to store key/value pairs, nothing big. Never mind policies and expiry times, I've got that. The thing that mystifies me is why my cache gets cleaned out after I've accessed a key (retrieved the value) for the first time. If I don't access the cached item, eventually the item will expire and my remove handler is called, which is all good. But when I retrieve the item for the first time, my remove handler is called after a short while. The CacheEntryRemovedReason is set to:
CacheSpecificEviction // A cache entry was evicted for a reason that is defined by a particular cache implementation.
I can't find any explanation of what this means.
The mystifying thing here is that when I inspect the cache object while debugging in the handler (and on succeeding controller calls), the cache enum is empty. If I "set" (add) a new CacheItem to the cache, I can once again access the key once, and history repeats.
The behavior is like a one-off caching mechanism, which I don't need at all.
Any help or comments would be much appreciated!
Some simplified code just for the fun of it:
private static ObjectCache cache = MemoryCache.Default;

internal void insertInCache(string key, int value) {
    CacheItemPolicy policy = new CacheItemPolicy() {
        AbsoluteExpiration = ObjectCache.InfiniteAbsoluteExpiration,
        Priority = CacheItemPriority.NotRemovable,
        SlidingExpiration = TimeSpan.FromMinutes(ITEM_EXPIRE_TIME),
        RemovedCallback = new CacheEntryRemovedCallback(RemovedHandler)
    };
    cache.Set(key, value, policy);
}

static void RemovedHandler(CacheEntryRemovedArguments args) {
    if (args.RemovedReason == CacheEntryRemovedReason.Expired) {
        //do something - or i actually want it to disappear when expired
    } else {
        cache.Set(args.CacheItem, somepolicy);//reinsert to keep in cache
    }
}

//Apparently this triggers some cache mong mode
internal void getSome(string key) {
    int thisIsWhatIWanted = (int)cache.GetCacheItem(key).Value;
}
This is just example code, so please don't nag me about my skillz.
My own best guess is that it may have to do with the cache not being set up properly, MVC witchery, or the fact that I'm running my application on a debug IIS (Visual Studio).

Is spring Resource a file or directory?

I am using the spring Resource API and using a ResourcePatternResolver to scan my classpath for files.
In one situation the scan is picking up some directories and files that are in a pre-built jar and some that are on the file system.
In either case a 'resource' will either be a file or a directory. How can I reliably detect whether a resource points to a directory or file, whether in a jar file or not? Calling getFile() on a Resource inside a jar throws an Exception so I can't use that plus isFile() as I initially tried.

Spring's Resource interface is meant to be a more capable interface for abstracting access to low-level resources.
Sometimes it wraps a File, and sometimes it does not.
It has six built-in implementations: UrlResource, ClassPathResource, FileSystemResource, ServletContextResource, InputStreamResource, ByteArrayResource.
You can also implement your own Resource type.
The UrlResource wraps a java.net.URL, and may be used to access any object that is normally accessible via a URL. If you use the http: prefix, the resource is a URL.
The ClassPathResource represents a resource which should be obtained from the classpath. This Resource implementation supports resolution as java.io.File if the class path resource resides in the file system, but not for classpath resources which reside in a jar and have not been expanded (by the servlet engine, or whatever the environment is) to the filesystem. To address this, the various Resource implementations always support resolution as a java.net.URL.
FileSystemResource is an implementation for java.io.File handles. It obviously supports resolution as a File and as a URL.
InputStreamResource is a Resource implementation for a given InputStream. Do not use it if you need to keep the resource descriptor somewhere, or if you need to read a stream multiple times.
ByteArrayResource is a Resource implementation for a given byte array. It creates a ByteArrayInputStream for the given byte array.
So you should not always use getFile(), because Spring's Resource doesn't always represent a file system resource. For this reason, we recommend that you use getInputStream() to access resource contents, because it is likely to function for all possible resource types.
Refer to: Resources
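Since resolution as a java.net.URL is the common denominator, one possible way to tell directories from files is to inspect the URL rather than call getFile(). The following is a minimal sketch using only the JDK; ResourceKindSketch and isDirectory are hypothetical names, and it assumes that only the "file" and "jar" protocols need handling and that the jar contains explicit directory entries (some jars omit them, in which case this check cannot see the directory).

```java
import java.io.IOException;
import java.net.JarURLConnection;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.jar.JarFile;

/** Hypothetical helper: decides whether a file: or jar: URL points at a directory. */
final class ResourceKindSketch {

    private ResourceKindSketch() {
    }

    static boolean isDirectory(URL url) throws IOException {
        String protocol = url.getProtocol();
        if ("file".equals(protocol)) {
            try {
                return Files.isDirectory(Paths.get(url.toURI()));
            } catch (URISyntaxException e) {
                throw new IOException("Not a valid file URL: " + url, e);
            }
        }
        if ("jar".equals(protocol)) {
            JarURLConnection connection = (JarURLConnection) url.openConnection();
            // Disable caching so the JarFile opened here can be closed safely.
            connection.setUseCaches(false);
            String entryName = connection.getEntryName();
            if (entryName == null) {
                return false; // the URL points at the jar file itself, not at an entry
            }
            try (JarFile jar = connection.getJarFile()) {
                // Directory entries are stored with a trailing slash.
                String dirName = entryName.endsWith("/") ? entryName : entryName + "/";
                return jar.getEntry(dirName) != null;
            }
        }
        throw new IOException("Unsupported protocol: " + protocol);
    }
}
```

With Spring, the URL would come from resource.getURL(), for example ResourceKindSketch.isDirectory(resource.getURL()). Other URL protocols used by some application servers are not covered here.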

I think you can just surround the code that checks for a file with a try/catch block:
boolean isFile = true;
try {
    resource.getFile();
    ...
} catch (...Exception e) {
    isFile = false;
}

I had a similar requirement, and solved it by excluding directories from my search pattern. Then, for each resource found, I look up the parent item in the path and ensure the directory has been created before writing the file.
In my case the file could be in the filesystem or in the classpath, so I check the scheme of the URI first.
My search pattern may still pick up directories if they have a dot in the name, so it would be better to catch the exception in that case.
Search pattern: classpath*:/**/sprout/plugins/**/*.*
Example code:
private void extractClientPlugins() throws IOException {
    Resource[] resourcePaths = resolver.getResourcePaths(sproutPluginSearchPattern);
    Path pluginFolderPath = Paths.get(sproutHome, "./plugins/");
    pluginFolderPath.toFile().mkdirs();
    if (resourcePaths.length == 0) {
        log.info("No Sprout client side plugins found");
    }
    for (Resource resource : resourcePaths) {
        try {
            Path destinationPath = generateDestinationPath(pluginFolderPath, resource);
            File parentFolder = destinationPath.getParent().toFile();
            if (!parentFolder.exists()) {
                parentFolder.mkdirs();
            }
            destinationPath.toFile().mkdirs();
            copy(resource, destinationPath.toFile());
        } catch (IOException e) {
            log.error("could not access resource", e);
            throw e;
        }
    }
}

private Path generateDestinationPath(Path rootDir, Resource resource) throws IOException {
    String relativePath = null;
    String scheme = resource.getURI().getScheme();
    // Resources inside a jar have the "jar" scheme; everything else is treated as a file.
    if ("jar".equalsIgnoreCase(scheme)) {
        String[] uriParts = resource.getURL().toString().split("!");
        relativePath = trimPluginPathPrefix(uriParts[1]);
    } else {
        String filePath = resource.getFile().getAbsolutePath();
        relativePath = trimPluginPathPrefix(filePath);
    }
    return Paths.get(rootDir.toString(), relativePath);
}

private String trimPluginPathPrefix(String filePath) {
    String[] pathParts = filePath.split("sprout/plugins/");
    if (pathParts.length != 2) {
        throw new RuntimeException("The plugins must be located in a path containing '**/sprout/plugins/*'");
    }
    return pathParts[1];
}
I use it in this project:
https://github.com/savantly-net/sprout-platform/blob/master/sprout-core/src/main/java/net/savantly/sprout/core/ui/UiLoader.java
