Friday, May 27, 2011

Mule startup quirks

Mule saga continues. Next problem I wanted to tackle was "we had to restart the mule server and the problems went away … or not ... and we had to restart it again ...".

A small detour: Mule can run as a standalone application where it handles everything itself. It can also be embedded into some application or packaged as a JEE web application and then deployed in a web container. Originally the project's Mule was running as a standalone server. This was changed at the client's request some time before going into production and now Mule is an EAR file with a single web application that runs in a separate OC4J instance. Mule provides the necessary code to make it work: a servlet context listener ( MuleXmlBuilderContextListener) to initialize Mule and a servlet ( MuleReceiverServlet) to receive HTTP requests. As a side effect of this embedded mode Mule can't really handle its lifecycle. Everything is managed through OC4J and OPMN. This was also given as the reason to use embedded mode: "look, this way we get Mule into our management infrastructure!"

Back to the topic: needless to say stopping/starting the mule server multiple times locally or in a development or test environment did not cause any problems. Time to look at the production log files. No surprises, it is full with exception stack traces, which makes it much harder to find something if one does not know what to look for. It took me some time to find this lonely line:

ERROR [OC4J Launcher] org.mule.retry.notifiers.ConnectNotifier: Failed to connect/reconnect: jms://QUEUE_NAME. Root Exception was: Io exception: Connection reset. Type: class java.sql.SQLException


After looking at the log entries before and after the found one I realized what has happened and why it was necessary to restart Mule multiple times. It goes like this:

  1. The Oracle database that provides AQ JMS support experiences some problems. As a result Mule also has problems, connections get dropped, etc.

  2. Administrators restart Mule server without (fully) solving database problems.

  3. Mule JMS support code fails to initialize JMS connection at Mule startup. This stops Mule startup sequence but not OC4J instance or OC4J Mule application. Mule can't stop itself because it does not control its lifecycle when it runs embedded.

  4. As a result Mule is started in a half broken state with at least some services uninitialized but is seen by OPMN as successfully started. So much for the management infrastructure.

  5. Subsequent requests to Mule fail because even though Mule servlet is up and running most of Mule services are not.


Okay, the problem is clear. Admins are notified to watch for that "Failed to connect/reconnect: jms://QUEUE_NAME" message in the logs. Except that it does not help much because OC4J (actually OPMN) can choose to bring OC4J instances down and up any time OPMN finds necessary. Given the software runs in OC4J cluster this can lead to situations when Mule runs on one node in the cluster and fails on another. Oops.

Now the question is: can anything be done to improve the situation?
I looked at the Mule initialization logic (class MuleXmlBuilderContextListener) and found that it does something ... interesting.

Method MuleXmlBuilderContextListener.initialize(ServletContext context) is responsible for initialization and here it is (from Mule 2.2.1), with some minor details omitted:

public void initialize(ServletContext context)
{
...
try
{
muleContext = createMuleContext(config, context);
muleContext.start();
}
catch (MuleException ex)
{
context.log(ex.getMessage(), ex);
// Logging is not configured OOTB for Tomcat, so we'd better make a
// start-up failure plain to see.
ex.printStackTrace();
}
catch (Error error)
{
...
throw error;
}
}

Basically almost any failure during initialization, be it configuration file errors, connection problems, whatever, are ignored. Sure they are logged, at best, but that is it. RuntimeExceptions are not caught, but most of Mule failures during startup result in MuleException, which is caught. For example, errors in the configuration file lead to an exception thrown somewhere down the chain from createMuleContext(). Some connection failures happen somewhere in muleContext.start() and do not throw any exception but do stop the initialization. What were they thinking?!

Is it possible to do better than that? I ended up with this, but a copy-paste version of MuleXmlBuilderContextListener with some modification would work as well:

public class CustomMuleXmlBuilderContextListener extends MuleXmlBuilderContextListener {

public void contextInitialized(ServletContextEvent event) {
initialize(event.getServletContext());
if (muleContextEx != null) {
String errMsg = "Mule context is not initialized";
logger.error(errMsg, muleContextEx);
throw new IllegalStateException(errMsg, muleContextEx);
}
if (muleContext != null && !muleContext.isStarted()) {
String errMsg = "Mule context is initialized but some services failed to start";
IllegalStateException ex = new IllegalStateException(errMsg);
logger.error(errMsg, ex);
throw ex;
}
}

protected MuleContext createMuleContext(String configResource, ServletContext context)
throws ConfigurationException, InitialisationException
{
try {
return muleContext = super.createMuleContext(configResource, context);
} catch (ConfigurationException ex) {
muleContextEx = ex;
throw ex;
} catch (InitialisationException ex) {
muleContextEx = ex;
throw ex;
}
}

//
private MuleContext muleContext;
private MuleException muleContextEx;
}

I had to override createMuleContext to get the created context instance because muleContext.isStarted() is the only way to find out if Mule has actually started successfully. This new listener makes sure that:

  1. If Mule is not started but reports no exception the listener makes sure there is one, and it is logged. (Yeah, even more noise in the log.) But this makes a failure like the one caused this investigation "Failed to connect/reconnect: jms://QUEUE_NAME" be more visible in the log.

  2. A failure during Mule initialization and startup is reported back to the web container. This hits another grey area of servlet specification which does not say what MUST or SHOULD happen in this case. There is a note, for example, in v. 2.4 of Java Servlet Specification:
    SRV.10.6 Listener Exceptions

    Some exceptions do not occur under the call stack of another component in the application. An example of this is ... a ServletContextListener that throws an unhandled exception during a notification of servlet context initialization ... In this case, the Developer has no opportunity to handle the exception. The container may respond to all subsequent requests to the Web application with an HTTP status code 500 to indicate an application error....

    OC4J has here fortunately a sensible behavior: the web application is marked internally as not started. Any HTTP request to it triggers an attempt to start it. So the listener gets another chance to start Mule and if the problem went away Mule starts and ready to serve requests. This does not work of course for configuration file failures but does its job nicely for connection failures.

Note: I know about RetryPolicy and its friends. It is a subject of one of my upcoming posts.

1 comment:

  1. I have not faced this kind of problem yet. But I am sure that your experience is going to help me in resolving this kind of problem. Thanks for sharing your experience.

    sap erp system

    ReplyDelete