Monday, September 28, 2015

WTF collection: how not to add an API

Some time ago we decided to move one of our projects to JDK 1.8. After all, JDK 1.7 had reached EOL. While we did not really care about all those new JDK 1.8 features, the security updates were important for us. In addition, the newest versions of popular browsers refuse to work over HTTPS with web servers running on JDK 1.7, and our project has a web UI.

The move was not an easy exercise. Yes, most of the problems were traced to our code, although for some of them I would have expected better diagnostics from the tools or frameworks.

But one problem stood out.

The project communicates over HTTPS with some remote web services. And we started getting errors for one of them. This web service is a bit different from the others: the web service provider insisted on using the SSL/TLS Server Name Indication (SNI).

We had had some problems with this web service when we first started communicating with it, while we were still running on JDK 1.7. The errors we were now getting with JDK 1.8 were remarkably similar to those initial errors. It was immediately clear who the primary suspect was.

After all, I knew that JDK 1.8 includes an API to explicitly control SNI behavior. But I had hoped the JDK would do the right thing when SNI settings are not explicitly controlled, and our code did not control them.

Let's look closer at it. This is what -Djavax.net.debug=all is for.

First surprise: after turning the property on, I could not log in to our web-based UI! The browsers reported errors saying that the HTTPS connection could not be established. Removing the property fixed the UI. How is it even possible to release a product where enabling some debugging output breaks the functionality?! Yes, a lot of developers, me included, have made similar mistakes, but leaving such a bug in a released version goes too far in my opinion.

And how the hell am I supposed to debug the suspected SNI problem? Let's try another JDK: OpenJDK 1.8.0.51 out, Oracle JDK 1.8.0_60 in. Result: the problem with the debugging property and HTTPS in the web UI was gone.

Hope dies last ... the problem with the web service was still there. Of course, it would have been too easy if this problem had also been solved.

But at least I could now look at the SSL debugging output. And indeed, exactly what I thought: SNI is not sent.

I also knew about the jsse.enableSNIExtension system property. We started the JVM without it, so the default behavior must have been in effect. Needless to say, explicitly setting the property to true did not change a thing.

The rest was just the tedious work of creating a reproduction scenario, plus some googling. A simple program doing some basic HttpURLConnection manipulations did not reproduce the problem: the JDK was sending the SNI info. Time to look at the JDK source code and do more debugging.
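For reference, this is roughly the kind of minimal probe program I mean; a sketch only, with a placeholder URL rather than the actual web service. Run with -Djavax.net.debug=ssl, the handshake trace shows whether the server_name (SNI) extension ends up in the ClientHello.

import java.io.InputStream;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

// Minimal sketch of an SNI probe over a plain HttpsURLConnection.
// The URL below is a placeholder, not the real web service.
public class SniProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://sni-test.example.com/ping");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // With -Djavax.net.debug=ssl the handshake output shows whether the
        // server_name extension was included in the ClientHello.
        try (InputStream in = conn.getInputStream()) {
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }
}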

From my point of view, the authors of that part of Java reserved themselves a permanent place in developer hell a long time ago. This code is a mess and a bloody nightmare. Yes, I have seen code that was much worse. But somehow I expected the JDK code to be of better quality...

After many cycles of modifying the test program, debugging, and studying the mess they call "source code", I came to this beauty:

HttpsClient.java, starting from line 430, method public void afterConnect():
[430]    public void afterConnect()  throws IOException, UnknownHostException {
...
[439]                    s = (SSLSocket)serverSocket;
[440]                    if (s instanceof SSLSocketImpl) {
[441]                       ((SSLSocketImpl)s).setHost(host);
[442]                    }
...
[470]            // We have two hostname verification approaches. One is in
[471]            // SSL/TLS socket layer, where the algorithm is configured with
            ...
            The rest of the very long and insightful comment is stripped
[518]             boolean needToCheckSpoofing = true;
[519]             String identification =
[520]                s.getSSLParameters().getEndpointIdentificationAlgorithm();
[521]             if (identification != null && identification.length() != 0) {
[522]                 if (identification.equalsIgnoreCase("HTTPS")) {
[523]                    // Do not check server identity again out of SSLSocket,
[524]                    // the endpoint will be identified during TLS handshaking
[525]                    // in SSLSocket.
[526]                    needToCheckSpoofing = false;
[527]                }   // else, we don't understand the identification algorithm,
[528]                    // need to check URL spoofing here.
[529]            } else {
[530]                 boolean isDefaultHostnameVerifier = false;
...
[535]                 if (hv != null) {
[536]                    String canonicalName = hv.getClass().getCanonicalName();
[537]                     if (canonicalName != null &&
[538]                    canonicalName.equalsIgnoreCase(defaultHVCanonicalName)) {
[539]                        isDefaultHostnameVerifier = true;
[540]                    }
[541]                }  else {
...
[545]                    isDefaultHostnameVerifier = true;
[546]                }

[548]                 if (isDefaultHostnameVerifier) {
[549]                    // If the HNV is the default from HttpsURLConnection, we
[550]                    // will do the spoof checks in SSLSocket.
[551]                    SSLParameters paramaters = s.getSSLParameters();
[552]                    paramaters.setEndpointIdentificationAlgorithm("HTTPS");
[553]                    s.setSSLParameters(paramaters);

[555]                    needToCheckSpoofing = false;
[556]                }
[557]            }

[559]            s.startHandshake();
...
[581]    }
There are so many things that are wrong here. But let's start:
  1. Lines 440 - 442: the hostname is passed to the SSL socket via a non-public API. This basically prevents you from providing your own SSL socket factory with your own SSL socket delegates: your sockets will not get the hostname info. And the hostname is used by the default trust verification mechanism invoked from the default SSL socket implementation.

  2. The biggest "wrong" in this code starts on line 470. The authors probably spent all their energy on those 50 lines of comments and had nothing left to properly implement the logic. Basically, the SNI information is sent only if SSLSocketImpl.setSSLParameters() is invoked; if it is not invoked, no SNI is sent. And the code above shows that setSSLParameters() is invoked in one case only: if no endpoint identification algorithm was specified and no custom hostname verifier was set (i.e. the default one is in use). Our code had a custom hostname verifier, and oops, SNI was not sent.

    The funny thing about it: if one bothers to explicitly specify an endpoint identification algorithm, even the default one, SNI is not sent either.

    There is actually a bug JDK-8072464 about the "non-default hostname verifier", but it does not mention an explicitly specified endpoint identification algorithm. And it looks like they do not plan to fix it in 1.8.

  3. There is another bug lurking in the API and the implementation: there is no easy way to disable SNI, or to customize it for a particular connection. Yes, one can disable sending SNI by setting the jsse.enableSNIExtension system property to false, but it is a JVM-wide setting. Don't you hate it when the only way to get some functionality is to use some system property? I do hate that kind of "all or nothing" approach. And the JDK is full of it. One of the worst offenders is JavaMail: it gives you a way to specify per-connection settings and still relies on JVM system properties in some cases. Really clever technique!

    Back to SNI: you see, to explicitly specify SNI you have to implement an SSL socket factory, which is already quite a step (a sketch of this factory-level attempt follows right after this list). Then you can use setSSLParameters() to customize the SNI, or provide an empty list if you do not want SNI sent at all. So far so good, but this is the only place where you are in control of the socket. And it is too early, because HttpsClient.afterConnect() is invoked much later. Say no endpoint identification algorithm is specified and no custom hostname verifier is set. Or just imagine bug JDK-8072464 is actually fixed. In this case the default SNI behavior kicks in and overwrites whatever you specified in the socket factory. Remember that little setHost() on line 441? This is where the hostname gets into the SSL parameters. And then the code on lines 551 - 553 overrides your precious SNI with the one that was set on line 441.

    So in reality you have to implement both an SSL socket factory and an SSLSocket, so that you can do some additional SNI manipulation in the startHandshake() method. But then you will not get the hostname set, because lines 440 - 442 are not executed for your custom SSLSocket.
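For illustration, this is roughly what the factory-level attempt from point 3 boils down to; a minimal sketch, not our production code. In a real application this code would live inside a custom SSLSocketFactory passed to HttpsURLConnection.setSSLSocketFactory(), and, as described above, HttpsClient.afterConnect() can later overwrite these parameters anyway.

import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Sketch: open an SSL socket and set the SNI name explicitly.
public class ExplicitSniExample {
    static SSLSocket openWithSni(String host, int port) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        SSLSocket socket = (SSLSocket) factory.createSocket(host, port);

        SSLParameters params = socket.getSSLParameters();
        List<SNIServerName> names = new ArrayList<>();
        names.add(new SNIHostName(host));
        // Explicit SNI; an empty list here would disable SNI for this socket.
        params.setServerNames(names);
        socket.setSSLParameters(params);
        return socket;
    }
}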

A small detour: go read what they call a "JDK Enhancement Proposal" about SNI, especially the section about testing.
Need to verify that the implementation doesn't break backward
compatibility in unexpected ways.
A noble goal, is it not?

Just imagine how much time they have spent on that JEP, on the API, on the implementation. Net result? Puff, zilch, and JDK-8072464.

Of course, all this applies only if your code relies on the JDK's HTTP support. This is probably another very good reason to move to libraries like Apache HttpComponents. I do not know whether it properly supports SNI and gives you enough rope, but at least it can be patched much more easily if needed.

Since we still have to rely on the JDK's HTTP support, I had to resort to a custom SSL socket factory, a custom SSL socket, and things like "instanceof SSLSocketImpl" and typecasts. Too much code for my liking just to work around a silly bug. But at least we can now send messages to that web service.

And, by the way, there is another problem with the JDK's SNI. From my point of view it is also a bug, but this time it falls into the somewhat grey area of definition ambiguity. The code in question is Utilities.rawToSNIHostName().

The SNI RFC section 3.1 prohibits use of IP addresses in SNI, and the JDK behaves correctly if hostname in the above code is an IP address.

But they also ignore hostnames without a '.' character. This is wrong. I guess they try to follow the RFC where it says:
"HostName" contains the fully qualified DNS hostname of the server,
as understood by the client.

There are two problems with the JDK's behavior. First of all, the spec says "as understood by the client". If a hostname, with or without a '.' character, is correctly resolved to an IP address, it is as good as "fully qualified as understood by the client". So the JDK incorrectly excludes hostnames without a '.' character.

On the other hand, if the JDK follows some other specification of a "fully qualified DNS hostname", then the mere presence of a '.' character in a hostname does not make it fully qualified. It is at most "a qualified name". Unless, of course, the JDK authors have a specification somewhere that says "a hostname is fully qualified if it has at least one '.' character". But I bet they just got tired after writing all those specs, JEPs, APIs, and comments in the code.

Sunday, August 2, 2015

WTF collection: JDK and broken expectations

Imagine: you have a piece of code that has been working fine for a long time. The code connects to some web server, sends some data, gets the result, etc. Nothing fancy here.

One day the very same piece of software was used to send data to yet another web server. This time it was different: we got long delays and a lot of errors like "Unexpected end of stream". Time to investigate.

The packet capture revealed one particular HTTP request header that just screamed "You want some problems? Just use me! " The header was Expect: 100-continue

The code in question uses the java.net.HttpURLConnection class. (Yes, I know about Apache Commons HTTP, but that's beside the point.) I just hoped that this header was not set automatically by the JDK code. It is always a great PITA to change HTTP headers that the JDK babysits: just try to set the "Host" request header, for example! Fortunately the "Expect" header is not one of those restricted by the JDK. It was set by the application itself based on some conditions. A couple of changes, and this header is not sent any more. No more delays, no more "Unexpected end of stream" errors, everything works.
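For context, the kind of request involved looks roughly like this; a minimal sketch with a placeholder URL, not our actual code. Dropping the setRequestProperty() line was essentially the fix: without the Expect header the JDK sends the body immediately and never waits for a "100 Continue" response.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a POST that triggers the behavior discussed below.
public class ExpectContinuePost {
    static int post(byte[] body) throws Exception {
        URL url = new URL("https://api.example.com/upload");    // placeholder
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setReadTimeout(30000);                              // see the timeout discussion below
        conn.setRequestProperty("Expect", "100-continue");       // the problematic header
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }
}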

By now I was curious enough to find out what was going on. Even the fact that I had to go through some JDK code did not stop me. Almost every time I have had to look at the JDK code I got the feeling I could easily guess where the authors of that code grew up...

Yep, no surprises here, the same style, see the sun.net.www.protocol.http.HttpURLConnection class. Here is the relevant piece of "logic":

private void expect100Continue() throws IOException {
            // Expect: 100-Continue was set, so check the return code for
            // Acceptance
            int oldTimeout = http.getReadTimeout();
            boolean enforceTimeOut = false;
            boolean timedOut = false;
            if (oldTimeout <= 0) {
                // 5s read timeout in case the server doesn't understand
                // Expect: 100-Continue
                http.setReadTimeout(5000);
                enforceTimeOut = true;
            }

            try {
                http.parseHTTP(responses, pi, this);
            } catch (SocketTimeoutException se) {
                if (!enforceTimeOut) {
                    throw se;
                }
                timedOut = true;
                http.setIgnoreContinue(true);
            }
            if (!timedOut) {
                // Can't use getResponseCode() yet
                String resp = responses.getValue(0);
                // Parse the response which is of the form:
                // HTTP/1.1 417 Expectation Failed
                // HTTP/1.1 100 Continue
                if (resp != null && resp.startsWith("HTTP/")) {
                    String [] sa = resp.split("\\s+");
                    responseCode = -1;
                    try {
                        // Response code is 2nd token on the line
                        if (sa.length > 1)
                            responseCode = Integer.parseInt(sa[1]);
                    } catch (NumberFormatException numberFormatException) {
                    }
                }
                if (responseCode != 100) {
                    throw new ProtocolException("Server rejected operation");
                }
            }

            http.setReadTimeout(oldTimeout);

            responseCode = -1;
            responses.reset();
            // Proceed
    }

Nice thing: they decided to work around some broken HTTP servers that may send a 100 HTTP response even if a request does not have the "Expect: 100-continue" header. See those http.setIgnoreContinue() here and there?

This is actually the only nice thing I can say about this code.

The authors also decided to work around another possible misbehavior with respect to the "Expect" header. See that comment "5s read timeout in case the server doesn't understand Expect: 100-Continue" in the code above? Except that all this happens only if the connection does not have its own read timeout set; look at the oldTimeout <= 0 check right above it.

But if you decided to protect your application against servers that take too long to respond by setting the read timeout on an HttpURLConnection to some reasonable value like 30 or 60 seconds, you are screwed. The JDK starts waiting for a 100 HTTP response, and if the server does not send it, the JDK times out only after waiting for your full read timeout (30 or 60 or however many seconds!). In that case the JDK does not bother sending the request body to the server, and your application gets a SocketTimeoutException. Nice work, guys!

And that is not all. Another very interesting piece of logic follows: if some response was read from the server, the code verifies the response code (the if (!timedOut) block above). A completely useless and extremely harmful piece of code: if the server responded with anything other than 100, the JDK reports a "Server rejected operation" error.

Now go back to RFC 2616 and read the very first requirement for HTTP/1.1 origin servers (emphasis mine):

- Upon receiving a request which includes an Expect request-header
field with the "100-continue" expectation, an origin server MUST
either respond with 100 (Continue) status and continue to read
from the input stream, or respond with a final status code. The
origin server MUST NOT wait for the request body before sending
the 100 (Continue) response. If it responds with a final status
code, it MAY close the transport connection or it MAY continue
to read and discard the rest of the request.  It MUST NOT
perform the requested method if it returns a final status code.

See that "or respond with a final status code" part? Say the server implements that part of the specification correctly. It receives a request with the "Expect: 100-continue" header and decides that it cannot handle the request, because, for example, it does not recognize the specified context path. The server immediately sends a 404 HTTP response. But the JDK knows better, and instead of the real error your application gets a stupid and meaningless "Server rejected operation" error. Good luck finding out what went wrong.

Be warned and do not use the "Expect: 100-continue" HTTP request header if you have to rely on the java.net.HttpURLConnection class. And do not think the JDK code is somehow of better quality than the rest of the code out there. In fact it is normally on par, or even worse.

Thursday, February 6, 2014

Oracle

I have had my share of irritations with Oracle's software. Oracle buys companies and software products, good or bad, rebrands them, makes them worse over the years, and still manages to sell that crap. In the end poor developers have to deal with the enormous behemoths of the software world, not really understanding what it is all about and how the hell the thing is supposed to work. And even if developers do understand all of that, the crap from Oracle does not work as documented or expected anyway.

In my mind there was one exception to this: the Oracle database. Probably because it is what the company started with.

Yes, the Oracle database has its share of strange features, non-standard implementations, etc. Yes, many argue that Oracle is the worst database out there. No, I am not going to join holy wars over which database is better.

Note that I wrote 'was one exception'. Recently my opinion has changed.

I am not a DBA. I do not have really deep knowledge of relational databases. I know enough to write SQL if necessary. And when I can, I try to off-load some work from my applications to the database. This keeps the processing close to the data being processed. As a result some of my SQL statements end up quite complicated, but the application code is much cleaner. Sometimes I run into strange behavior that turns out to be a feature of the database product. It is not a big deal if it happens every now and then.

Unfortunately for the last week or two I had to write a lot of SQL statements. Some of them looked complicated, at least to me. I tested them against PostgreSQL and Oracle. I did not have any problem with PostgreSQL.

But Oracle... Well, it turned out my SQL was complicated for Oracle as well. Constructs that looked obvious just did not work with Oracle, resulting either in error messages, sometimes very strange ones, or in downright wrong results.

Did you know that Oracle treats an empty string as null? Now I do. If you have a NOT NULL column of, say, VARCHAR2 type, you cannot insert an empty string into it. How clever is that?! Now what? Insert a string containing a single space??? Brr. Worst of all: there are people out there who defend Oracle on this issue.
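To illustrate from the application side, here is a minimal JDBC sketch; the table and column names are made up for this example:

import java.sql.Connection;
import java.sql.PreparedStatement;

// Hypothetical table: CREATE TABLE customers (name VARCHAR2(100) NOT NULL).
// Oracle treats the empty string as NULL, so this insert is rejected with a
// "cannot insert NULL" error; the same code works fine on PostgreSQL.
public class EmptyStringInsert {
    static void insertEmptyName(Connection connection) throws Exception {
        try (PreparedStatement stmt = connection.prepareStatement(
                "insert into customers (name) values (?)")) {
            stmt.setString(1, "");   // an empty string, not null
            stmt.executeUpdate();
        }
    }
}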

Or how about not having a boolean type? You cannot have a column of boolean type in a table or in a query result. You have to go with 1/0 or 'Y'/'N', or whatever. Wow, real database shines through.

Another thing I ran into is the limitations of updatable result sets. Sure, these limitations do not have anything to do with the database itself; they are part of the JDBC driver. But that is just playing with words. I do not care how great the database is if its JDBC driver is lousy.

Oracle documents support for updatable result sets. It is interesting to look at the difference between the versions: Oracle8i and Oracle9i. There are also more recent versions, for example, Oracle 11g, but they are quite similar to Oracle9i.

There is one interesting limitation that is mentioned in Oracle8i documentation and is not there in Oracle9i: "A query cannot use ORDER BY"

One might draw the conclusion that Oracle lifted this limitation a long time ago. Ha-ha, gotcha! It is still there, even in driver version 11.2.0.2.0.

But the funniest thing is not the documentation, but why the limitation is there in the first place. It turns out the driver parses the SQL statements passed to it, looking for quite a number of SQL fragments. One such fragment is ORDER BY. And when an application uses methods like ResultSet.updateRow(), the driver takes the original statement, truncates it right before the ORDER BY, and then appends the result to some UPDATE ... SET ... fragment it has generated. Now imagine what that does to a statement that has analytic functions like ROW_NUMBER() OVER (ORDER BY ...). Bloody clowns!
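For reference, this is the JDBC pattern involved; a sketch with made-up table and column names, and drivers typically have further requirements (single table, no SELECT *) for a result set to be updatable at all:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of an updatable result set. With the Oracle driver, adding ORDER BY,
// or an analytic function containing ORDER BY, to the query runs into the
// limitation described above, because the driver rewrites the SQL text when
// updateRow() is called.
public class UpdatableResultSetExample {
    static void markProcessed(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
             ResultSet rs = stmt.executeQuery(
                     "select id, status from events where status = 'NEW'")) {
            while (rs.next()) {
                rs.updateString("status", "PROCESSED");
                rs.updateRow();   // here the driver generates UPDATE ... SET ...
            }
        }
    }
}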

The next Oracle "feature" hit me hard. I had quite a complicated INSERT from SELECT query that did some grouping, reducing a 1M-row table to a result set of about 70 rows. It worked without a hitch in PostgreSQL. But in Oracle I was getting one of those "unable to extend segment ..." errors. Looking into what was happening and why, I discovered that Oracle did not apply the grouping at all: the result set contained 1M rows! WTF?!

The grouping was done "per date": the original data contained event timestamps and the query should have produced "so many events of a particular type per day". To convert a timestamp to a date I used cast(timestamp as DATE). "CAST converts one built-in datatype ... value into another built-in datatype ... value." The hell it does. It probably just sets the type tag on the value without any conversion. So yes, if you run something like
select
    cast(systimestamp as date) d1,
    cast(systimestamp + interval '1' second as date) d2
  from dual;
you see the same two values in the output and think "yeap, cast() works". But if you run
select d1, d2 from (
    select
        cast(systimestamp as date) d1,
        cast(systimestamp + interval '1' second as date) d2
      from dual
) where d1 = d2;
you get no rows back! On the other hand this works as expected returning a single row:
select d1, d2 from (
    select
        trunc(systimestamp) d1,
        trunc(systimestamp + interval '1' second) d2
      from dual
) where d1 = d2;

Now my SQL produces the expected 70 rows both in PostgreSQL and Oracle. But I still wonder how many of such "small details" will hit me tomorrow?

The next one is a bit questionable: ORA-00904: XXX: invalid identifier. It is described in quite some detail, for example, on "Ask Tom".

Why is it questionable? Well, Tom claims Oracle follows the standard here:
ANSI SQL has table references (correlation names) scoped to just one level deep.

This might very well be the case. The language used in all those standards is usually quite incomprehensible. It is really difficult to understand what the authors meant to say, and everyone who tries to follow a standard usually understands things slightly differently. I found the SQL92 text on the web. As I expected, it is completely unclear whether there is such a limitation in the standard. Actually, I will put it even more strongly: if I did not know about Oracle's interpretation, I would not even think there is a limitation.

Now imagine Java had the same scoping rules:
void a() {
    int  c = 0;
    while (c < 10) {  // OK, visible
        ...
        if (c == 5) { // OK, visible
            int  k = c; // Oops, c is not visible here
            ...
        }
    }
}
But let's give Oracle the benefit of the doubt on this one.

Time for my favorite: error ORA-32036. Just go and read its description. And read it again ... and again ... A wonderful piece of English prose, is it not? I guess if you cannot appreciate its beauty you cannot be a real DBA. Now google it. It turns out the error depends on how a JDBC connection is made. The error happens if the statement is executed on an XA connection, but everything works OK on a non-XA connection. And it is not only JDBC; it is the same in the .NET world.

I could have added much more but it is getting late...

Only one thing bothers me: I can explain all the issues I mentioned above. Stupid decisions like not having a boolean type or treating the empty string as null, bugs in parsing, [mis-]interpretation of the standard, etc. Yes, I can see how they could have happened. But that last one (ORA-32036)?! Damn, my imagination fails me.

Friday, September 13, 2013

SQL: Upsert in PostgreSQL

Upsert is a very useful SQL statement. Unfortunately not every database supports it. The product I am working on can use several different databases. One of them is PostgreSQL which does not support upserts.

It is not a big deal: googling for it turns up several solutions, including triggers, functions, and the "Writeable CTE". I find the Writeable CTE solution quite elegant. Unfortunately it has one ... well, feature, that might be completely unimportant sometimes, just a nuisance in some cases, or a real problem in other situations.

For me it was the latter.

If you execute the example from the "Writeable CTE" page you will get the following results before executing the upsert:
1   12   CURR
2   13   NEW
3   15   CURR
After the upsert the results are:
1   12   CURR
2   37   CURR
3   15   OBS
4   42   NEW
So the rows with ids 2 and 3 were updated and a new row with id 4 was inserted. But if you paid close attention to the messages in pgAdmin or psql, you might have noticed the following:
Query returned successfully: 1 row affected 

The query did its job, but reported only the inserted rows! Imagine the query results in only updates. It will then report
Query returned successfully: 0 row affected

By the way Oracle reports the correct result: combined number of affected rows. In the example above it says
3 rows merged

Is it important? After all the query did what it was asked to do.

For me it is important. My real query could do three things: update a single row, insert a single row, or do nothing. And I need to know which way it went. Actually, all I need to know is whether a row was affected or not. With Oracle I know. With PostgreSQL I know only if a row was inserted. Sure, I can always go to the database and ask, but that means another query, another roundtrip...

But who says my upsert query must stop at only one CTE? Meet the beauty:
WITH
upsert as
(update mytable2 m
    set sales = m.sales + d.sales,
        status = d.status
   from mytable d where m.pid = d.pid
 returning m.*),
ins as
(insert into mytable2
 select a.pid, a.sales, 'NEW' from mytable a
  where a.pid not in (select b.pid from upsert b)
 returning *)
select (select count(*) from ins) inserted,
       (select count(*) from upsert) updated;

If you repeat the example, but run this query instead of the original upsert, you get the job done and you also get the following result:
inserted   updated
1          2

You immediately know the answer. And it is better than Oracle because in Oracle you cannot differentiate between inserted and updated rows!

You can tweak the select the way you want. Need only "total number of affected rows"? Use:
select (select count(*) from ins) + (select count(*) from upsert);

I ended up with something like:
select 'i' kind from ins
union
select 'u' kind from upsert

Since there is at most one affected row in my case, I get either an empty result set, or a result set with a single row and a single column whose value is 'u' or 'i'. And I do not really need to know whether the row was inserted or updated, so my Java code looks really simple:
boolean  isChanged = stmt.executeQuery().next();

Nice and simple.
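Put together, the JDBC side might look roughly like this; a sketch where the SQL text is assumed to be the writeable-CTE upsert from above with the 'i'/'u' select at the end:

import java.sql.Connection;
import java.sql.PreparedStatement;

// Sketch: execute the writeable-CTE upsert and report whether any row was
// inserted or updated. One row comes back if something changed, none otherwise.
public class UpsertCheck {
    static boolean upsert(Connection connection, String upsertSql) throws Exception {
        try (PreparedStatement stmt = connection.prepareStatement(upsertSql)) {
            return stmt.executeQuery().next();
        }
    }
}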

Wednesday, June 19, 2013

Aaaaaah! Metro, you made my day!

As I have mentioned before, the WS-Security unit tests in metro use SAAJ to handle SOAP messages. A typical test goes like this:
  1. A SOAP message is created and populated with some elements by the test code.
  2. A WS-Policy-based configuration that defines what to do with the message is created.
  3. A security operation is performed.
  4. The resulting SOAP message is written to a file.
  5. A new SOAP message instance is created from the file.
  6. A new WS-Policy-based configuration that defines how to validate the message is created.
  7. Validation is performed.
I decided to add some JAX-WS code to one of the tests. Originally the test did both signing and encryption, but I removed the encryption: it is much easier to see what is going on. Then I verified that the test was still green, and that it failed if I modified the file the test generates before the file is read back in. Just in case: you can never be too careful.

I have added the following steps at the end of the test:
  1. The file with the message, created after the security operation, is read in the streaming mode as a JAX-WS message.
  2. A new WS-Policy-based validation configuration is created. It is actually exactly the same code as in SAAJ SOAP message case.
  3. Validation is performed. This is done by a different set of classes than in the SAAJ case, although the class names are similar. Here metro shines again: although the operations are similar, the way validation is invoked is different. Worse, all the parameters for the JAX-WS way of validating are type-compatible with SAAJ validation, so trying to use the SAAJ validation code with a JAX-WS message compiles but fails with an NPE.

After some failed attempts I got a version that not only compiled but also went quite deep into the metro code, and then failed at signature validation. The error was "Reference #idxxx: signature digest values mismatch" or something like that.

This was ... interesting. The same message was OK if validated as a SOAPMessage instance and failed if validated as a JAX-WS message. Something fishy was going on. Metro not only provides multiple implementations of XML-Security, they also managed to make them incompatible. Remind me, what does "WSIT" stand for?

Of course the problem might have been in the way I had set up the JAX-WS-based validation, but I was quite sure it was another genuine "feature" of metro.

In order to understand what the error message means it is necessary to understand what the signed message looks like. It is a SOAP message with a lot of additional stuff added to the SOAP Header (a lot of details omitted):
<Envelope ...
   <Header>
       <Header1 wsu:Id="id_h1".../>
...
          <ds:Signature >
          <ds:SignedInfo>
...
              <ds:Reference URI="#id_h1">
                  <ds:DigestValue>some base64</ds:DigestValue>
              </ds:Reference>
              <ds:Reference URI="#id_body">... <ds:Reference>
...
          </ds:SignedInfo>
          <ds:SignatureValue>some base64</ds:SignatureValue>
          </ds:Signature>
       <HeaderN wsu:Id="id_hN".../>
   </Header>
   <Body wsu:Id="id_body">...</Body>
</Envelope>

Each <ds:Reference> element "describes" a particular thing that is signed, typically some element from the same message, as well as how that thing has to be preprocessed before calculating the digest, what algorithm is used to produce the message digest, and the message digest value itself. The URI attribute of <ds:Reference> specifies which element is digested.

<ds:SignedInfo> can contain many such <ds:Reference> elements. And then comes <ds:SignatureValue>, which is actually a digitally signed message digest of the <ds:SignedInfo> element.

The order of the signed header elements and <ds:Signature> is not important. Some say the signature must come after the to-be-signed header elements to facilitate streaming, but this is a moot point. More often than not the SOAP Body also has to be signed, so you can kiss that nice streaming theory goodbye.

Anyway, the error I was getting, "Reference #idxxx: signature digest values mismatch", was about the very first <ds:Reference> in the message. It meant that the verifying code looked at the URI attribute, URI="#id_h1" in this case, found the corresponding header element by its id, <Header1 wsu:Id="id_h1".../>, and calculated its digest. And the calculated digest did not match the <ds:DigestValue> of the <ds:Reference>.

I switched on the logging and repeated the test several times, with the same result. I was not sure what I wanted to see, and the logging did not show anything exciting. But then I noticed a pattern. The output contained the calculated digest and the expected digest taken from the <ds:DigestValue> of the <ds:Reference>. The values were unreadable because a digest is just byte[], and the metro guys did not bother encoding or pretty-printing them. While the expected digest was clearly changing from run to run, the calculated digest looked the same every time. This was clearly wrong, because the digest should have been calculated over the header element including all its attributes, and while most things remained unchanged, the wsu:Id attribute differed from run to run. So the calculated digest had to differ as well.

Checking the verification code under the debugger confirmed this: the calculated digest was exactly the same every time the test was executed. So what exactly was metro using as the source for the digest calculation? It turned out: in this particular case, nothing.

Yep, nothing. So the calculated "digest" is probably just some fixed initial state of the message digest implementation.

The problem had nothing to do with how I used the metro API. The real reason was the signed message itself. Time to show the relevant <ds:Reference> in its full glory:
<ds:Reference URI="#_5002">
    <ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
    <ds:DigestValue>dKof/iss1y+eaCxi5xQGzXZw8RQ=</ds:DigestValue>
</ds:Reference>

The key to the problem is not what is there, but rather what is absent. An important piece is missing, namely the "instructions" on how the data has to be preprocessed before calculating the digest. Normally a <ds:Reference> looks like this:
<ds:Reference URI="#_5006">
    <ds:Transforms>
        <ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#">
            <exc14n:InclusiveNamespaces PrefixList="S"/>
        </ds:Transform>
    </ds:Transforms>
    <ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
    <ds:DigestValue>JWq3aJtzUP98fkiThJ0WYtcrWCY=</ds:DigestValue>
</ds:Reference>
<ds:Transforms> and its child <ds:Transform> elements are such "instructions".

The rest was easy: knowing that there were no <ds:Transform> elements I looked at what metro does in this case with the referenced element. Well, nothing. No data is digested.

Some questions remained though:
  1. Why did the signing code produce a message without <ds:Transform> elements?
  2. Why was the SAAJ message successfully validated?
  3. What is the expected behavior according to the specification when there are no <ds:Transform> elements?

Let's start with the last question. The answer is: I do not care. It might be valid, it might be invalid, but in any case the right outcome is definitely not "signature digest values mismatch". This is actually an answer to the first question as well. Why did the signing code produce a message without <ds:Transform> elements? It does not matter, because metro might need to process such messages anyway, no matter how they were created.

Why was the SAAJ message successfully validated? Well, the validation was performed by the same code as the signing. For SAAJ messages metro delegates all the work to the JSR 105 "Java XML Digital Signature API Specification" implementation, which is now part of the JDK. Basically it is some older version of Apache Santuario, repackaged by Sun. I checked the Apache Santuario source and found some remarkable similarities with the metro code. Except that Santuario's code does not have this particular bug, because after applying all the transforms it checks whether there is any data left to be processed, and processes it. Metro does not perform this check. The check has existed in Santuario's code for ages, from the first check-in of JSR 105 support in 2005. I guess metro "borrowed" some even older version of that code. As a result metro fails if there are no <ds:Transform> elements, and might also fail if there are <ds:Transform> elements: I did not fully check the logic there, but it looks like some combinations of <ds:Transform> elements might result in the same error. The "I" in WSIT looks more and more like a joke.
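For reference, driving that JSR 105 API directly looks roughly like this; a minimal sketch where the document and key come from the caller. A real validator would resolve the key from ds:KeyInfo and register the wsu:Id attributes as ID attributes, which is omitted here.

import java.security.PublicKey;
import javax.xml.crypto.KeySelector;
import javax.xml.crypto.dsig.XMLSignature;
import javax.xml.crypto.dsig.XMLSignatureFactory;
import javax.xml.crypto.dsig.dom.DOMValidateContext;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Minimal sketch of JSR 105 (javax.xml.crypto.dsig) signature validation,
// the API metro delegates to for SAAJ messages.
public class Jsr105Validation {
    static boolean validate(Document doc, PublicKey key) throws Exception {
        NodeList nodes = doc.getElementsByTagNameNS(XMLSignature.XMLNS, "Signature");
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("No ds:Signature element found");
        }
        DOMValidateContext ctx = new DOMValidateContext(
                KeySelector.singletonKeySelector(key), nodes.item(0));
        XMLSignatureFactory factory = XMLSignatureFactory.getInstance("DOM");
        XMLSignature signature = factory.unmarshalXMLSignature(ctx);
        // validate() checks the signature value and every ds:Reference digest,
        // applying the ds:Transform steps (if any) before digesting.
        return signature.validate(ctx);
    }
}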

By the way, why did the signing code produce a message without <ds:Transform> elements? The transforms to use come from WS-Policy, but not directly. At least when I tested some metro samples, the generated messages had <ds:Transform> elements, yet the WS-Policy declarations used by the samples did not mention transforms explicitly. The metro runtime probably adds this while parsing the WS-Policy. The signing code in the unit test uses a combination of a WS-Policy file and some runtime policy manipulation to create the final policy for the signature process. What exactly has to be signed is defined by Java code, so this is probably why the final policy ended up with no transforms specified. Sure enough, after I found out how to add this info to the Java-created policy and modified the code, the signed messages were produced with <ds:Transform> in <ds:Reference>. And the JAX-WS way of verifying the message went OK as well.


What can I add? At least interoperability-wise metro really shines.

Monday, June 17, 2013

Web services: in search of interoperability

I have some experience with web services, SOAP, SAAJ, what not, but until recently I did not have the "pleasure" of working with all the rest of the WS-* standards: WS-Addressing, WS-Crap, WS-Security, WS-whatever.

Not that long ago however I ran out of luck.

The product I am working on can operate as a WS service or a WS client. And it works. But there are situations when "it works" is not good enough. Usually it has to do with artificial barriers created to protect certain markets. All this is accompanied by some "specification" to which a product must conform. As a rule such a specification is a badly edited collection of fragments copy-pasted from standards documents by well-known authorities like W3C, OASIS, etc., often with contradictory statements. Nobody cares.

Recently we needed to prove that our product conforms to one such spec. We were lucky: not only was there a specification, there was also a "compliancy service" provided. The "compliancy service" could accept SOAP requests and thus validate whether the sender conforms to the specification, or it could send SOAP requests and validate the SOAP responses to check whether the receiver conforms to the specification.

Last year I had to deal with another "specification" and "compliancy service" from the same people. Do you know what one needs to have a good compliancy service? No, you do not have to conform to well-known standards, or even to your own specification. Monopoly is good enough. Add some crappy software and you are set.

For example, the software they used then (and still use) could not handle some erroneous HTTP requests. Instead of returning any kind of response the "compliancy service" did nothing. Literally. The connection was kept open, but not a single byte of response was sent back. Eventually the client timed out trying to read a response. It took us more than a month collecting data and e-mailing them before they agreed that the problem is on their side. The problem is still not fixed.

So I knew I had a lot of fun ahead, I just did not know how much.

This time everything revolved around WS-Addressing and, optionally, WS-Security. How exactly the WS-* stuff had to be applied was specified in an "interoperability standard" document. The document was unclear on a couple dozen points, but it was a "standard", so our product had to be "standard"-compliant.

The "compliancy service" found no problem in our requests and responses in case no WS-Security had to be applied. Adding XML signature to the picture changed everything.

First, the "compliancy service" did not like what request elements our product signed. It complained we were signing more than needed. Turned out it was the case of "do what I do and not what I say". The "standard" defined some elements as mandatory and allowed some optional elements to be present. In the section that described what has to be signed it said that all mandatory and optional (if present) elements must be signed. But "compliancy service" did not like that our requests had optional elements that were signed. OK, no optional stuff then. And no more complains from the "compliancy service".

But when I started testing our product as a web service provider all hell broke loose. No matter what I did the "compliancy service" said "signature verification failed". Just that.

Since then I have learned what JWSDP, XWSS, WSIT, Metro, you name it, mean. I have seen monsters much worse than in JBoss code.

And I found out that by 2008 there were still XML parsers in the wild that would reject valid XML documents as invalid. And that in 2013 somebody would still use such a parser. OK, OK, granted, I am not really sure the XML parsing problem is a feature of the parser itself. It might very well be that the parser was "improved" as part of the "compliancy service" development. But still... failing to parse XML because there is a character entity representing a whitespace character between some XML elements? Like this:

<Header>
<header1 …/>
<header2 …/>&#x20;
</Header>&#x20;
<Body …/>
</Envelope>

Remove either of these two entities and the problem goes away, even if the entity is added somewhere else. +100 to the "compliance level". Grrr.

After a lot of experiments, and quite some test code to generate XML signatures, I found out that the "compliancy service" did not like newlines in our response messages. Only after I produced a signed response that did not contain newline characters did the "compliancy service" give up and accept the response.

This was really strange because request messages with new lines did not cause any trouble. Submitting a bug report to them was not really an option. We did not have another month.

I found out that the "compliancy service" uses some WS-* toolkit from Sun, though I am not sure of the exact name and version. Nowadays it goes under the name "Metro". Or is it WSIT? Beats me. Anyway, based on some stack traces I saw, it was a version from around 2008. Oh, Sun! I had the pleasure of debugging the JAX-WS RI some time ago. A fine experience, unforgettable.

So I decided to download the latest version of that toolkit to play with it. The decision opened up a bright world of project code names and their relationships. Googling classes from the stack trace turned up XWSS, JWSDP, and WSIT, with XWSS being the primary suspect. Project migrations, consolidations, broken download links, and Oracle buying Sun added even more fun.

All the roads led to metro and WSIT. The latest version is 2.3, so be it.

Setting it up and running some samples went mostly flawlessly, but when I started experimenting with soapUI, I immediately ran into an issue. The sample I was using was a SOAP 1.2 web service, but I sent it a SOAP 1.1 request. Granted, it was a faulty request, but a SOAPFault with a NullPointerException and nothing more is quite an extreme way to say "I do not support SOAP 1.1 here".

By the way do you know what WSIT stands for? Web Services Interoperability Technologies. Yeap, "Interoperability".

I also tested how character entities are parsed. I could not reproduce the problem. At least this one is solved. Or it was not a problem of the toolkit at all.

The real fun began when I started sending signed requests from soapUI. First I got bitten by the fact that soapUI "helpfully" modified my messages.

Next problem I ran into was much more serious: some of the signed messages my test code produced were happily accepted by metro and some were rejected with an error that sounded like "Signature verification of SOAP Body failed".

Some of the messages accepted by metro had newline characters, so again, the problem we had with the "compliancy service", if it was a problem of the toolkit at all, was solved here. Needless to say, when I generated response messages with the exact same formatting they were still rejected by the "compliancy service".

And what about the test messages that metro rejected? I actually found the cause quite quickly. Under some circumstances metro chops off whitespace characters, and probably also comments, that are direct child nodes of the SOAP Body. They probably do it to promote the "I" ("-nteroperability"). What else could the reason be? And of course whitespace is significant for XML digital signatures.

For example, this:
<Envelope>
…
<Body>
    <elementX .../>
</Body>
</Envelope>

is treated by metro as if it were
<Envelope>
…
<Body><elementX .../></Body>
</Envelope>
But not always. Who said life is easy?

Looking back, I know I was lucky when I tested our product as a WS client sending data to the "compliancy service". Purely by chance the request messages did not have any whitespace characters in the Body.

I should say the source code of metro is a ... well, I do not know. Saying "mess" would not do it justice; it would be more like a huge compliment. Classes with the same name in multiple packages, maybe doing the same thing? Or maybe not quite the same? Methods longer than a couple of thousand lines? Copy-paste? You name it, it is there.

I also found that the problem had been reported to them, maybe even multiple times. It is always easy when you know exactly what you are looking for. And of course it was reported as fixed. Ha! This is another thing I do not understand: you have found a problem, you have fixed it. Is it so much work to run "find usages" in your IDE? One of the remaining broken places is in the file right next to the one you just modified! To me it says a lot about the quality of the project and the people working on it.

The problem is in the JAX-WS integration code of WSIT, but given the complexity of metro, JAX-WS is probably the only way metro is used, so the problem affects everybody who uses metro with XML-Security. And people are still running into it. The answer from the "Interoperability" specialists is of course "it is your problem". Unfortunately, that is true.

Another "stamp of quality" is their unit tests. WS-Security subproject has only 11 tests that do something about WS-Security. Compare that with 27 tests around WS-Policy parsing. Even more interesting fact is that their WS-Security tests do not test JAX-WS code paths. Metro web site claims that they use XML streaming to improve performance. And part of their code is using XMLStreamReader. Whether it improves performance I do not know since they like to copy data into XMLStreamBuffer objects to use them in other places as XMLStreamReader. But their unit tests are using SAAJ to read SOAP messages, and not the streaming code. As a result the code that is actually used by a metro-based WS client or server is not tested.

I should probably not even mention the possibility of having some unit tests for interoperability with other toolkits. I doubt they would understand the concept.

Anyway, knowing the problem and the fix, I repeated my failing tests, this time fixing the data as needed under the debugger. Sure enough, no more signature errors.

Net result: our product is compliant with the WS-Security standard. I know what we need to do to make a particular configuration of the latest version of "the great interoperability toolkit of all time" accept our messages. Given the complexity of metro I have no idea whether some other configuration would be OK with our messages.

I still have no idea why the "compliancy service" did not like our responses with newlines in them. Needless to say, I tried again, making sure there was no whitespace in the Body, but the error was still the same.

If you are thinking of using metro for your projects, do not. Even if you do not need WS-Security. If they manage to screw things up during parsing I do not want to think what they can do in more complicated cases.

If you are unfortunate enough to be using metro now, especially with WS-Security... Well, if you have metro on both sides, you will not be bitten by this bug, because normally metro generates the SOAP Body without whitespace characters.

If you have interoperability issues with XML signatures between metro and some other toolkit, check the messages. Maybe you are lucky and all your issues are caused by whitespace in the SOAP Body.

If you are a metro developer... let me say no more.

Tuesday, June 11, 2013

Watch out what soapUI is sending out

I recently had to debug an interoperability issue in how two software products exchange SOAP messages. soapUI is quite handy in such cases, but it can be too helpful and change messages for you.

In my case the messages I was sending had some WS-Addressing elements in SOAP header. I needed these elements to have some specific values, so my messages were prepared accordingly. Nonetheless the receiving side seemed to get confused during processing of these messages.

I wasted some time poking around the receiving side's configuration and logging before I did what I should have done from the beginning: look very closely at the "Raw request" tab in soapUI. It turned out soapUI had "helpfully" modified my carefully crafted messages.

The soapUI project was created from a WSDL that contained some WS-Addressing. soapUI detected it and automatically modified every message. The thing is, my messages already had all the necessary and correct WSA elements, but soapUI went ahead anyway and replaced my wsa:Action element with its own, with the same text value. It also added some namespace declarations here and there. Why could it not leave my wsa:Action element alone? It was correct from the start. soapUI did not just insert its own wsa:Action, it replaced the original one, so it could just as well have checked whether the original was valid. Probably that was too much work for them.

And it is not that easy to spot such a small change in all that XML mess. Yeah, it is a silly excuse, but it is true.