2013-11-11

Component Contracts in Service Oriented Systems


PRINCIPLE: Relationships must be governed by contracts that are monitored for performance.
To build a reliable system composed of many services, we need some guidelines for making the services reliable, both in the technical sense and in the more psychological sense of people having confidence that things will work.

In a system of services, just as in a society, business relationships should be governed by contracts that are monitored for performance. Wherever a dependency exists between services, components, or teams, a contract needs to exist to govern that dependency. That contract is an agreement that defines the scope of responsibility of the service provider and the service consumer. Here's a description of the contracts each service should provide to its customers.

Interface Contract

Every service must guarantee that its interface will remain consistent. Assuming the service is delivered over HTTP, the interface includes:

  • Names and meanings of query string parameters.
  • Definitions of what HTTP headers are used or ignored.
  • Format of any document body submitted in the request.
  • Format of the response body.
  • Use of HTTP methods.

Note that in this context, "consistent" does not have to mean unchanging. It only means that no backwards-incompatible changes can be made. If your service is designed on the same RESTful hypermedia principles as the web, your interface can remain consistent while growing over time.
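To make "consistent but growing" concrete, here is a minimal sketch of a backward-compatibility check for a JSON response body. The field names are hypothetical; substitute the fields your own Interface Contract guarantees.

```python
# The fields the v1 contract promises will always be present (hypothetical).
REQUIRED_FIELDS = {"id", "title", "created_at"}

def is_backward_compatible(response_body: dict) -> bool:
    """A response stays compatible as long as every contracted field
    is still present; new, extra fields are allowed to appear."""
    return REQUIRED_FIELDS.issubset(response_body)

# Adding a field does not break existing clients:
grown = {"id": 1, "title": "a", "created_at": "2013-11-11", "tags": []}
assert is_backward_compatible(grown)

# Removing or renaming a contracted field does:
broken = {"id": 1, "name": "a", "created_at": "2013-11-11"}
assert not is_backward_compatible(broken)
```

A check like this belongs in the test plan derived from the contract: it fails the build on exactly the class of change the contract forbids.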

The Interface Contract must be documented and available to both your customers and your delivery team. In fact, I would strongly recommend that the Interface Contract be created and delivered before you begin writing code for your service. It serves not only as documentation, but as the specification for developers to work from, and as the starting point for your test plan.

If changes require breaking compatibility, the best policy is to expose a new version of your service at a different endpoint. You must then establish a deprecation cycle to ensure clients have time to move to the new version. Only after all clients have migrated to the new version can you stop providing the old version. Such deprecation cycles can be very long, depending on the complexity of the service and the velocity of client development. Avoid backwards-incompatible changes in your interface if at all possible.
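One common way to expose the new version at a different endpoint is a version prefix in the path. The routes and handlers below are hypothetical illustrations, not a prescribed layout:

```python
# Two versions of a hypothetical "orders" resource served side by side
# while the old one goes through its deprecation cycle.

def orders_v1(request=None):
    return {"orders": [], "api_version": 1}  # deprecated behavior

def orders_v2(request=None):
    return {"orders": [], "api_version": 2}  # current, incompatible behavior

ROUTES = {
    "/v1/orders": orders_v1,  # keep serving until every client has migrated
    "/v2/orders": orders_v2,
}

def dispatch(path, request=None):
    handler = ROUTES.get(path)
    if handler is None:
        return {"error": "404"}
    return handler(request)
```

The point of the structure is that removing `"/v1/orders"` from the table is a deliberate, visible act, taken only at the end of the deprecation cycle.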

Service Level Agreement

Where your Interface Contract defines what your service will deliver, the Service Level Agreement (SLA) governs how it will be delivered (or how much). Things that need to be documented in your SLA include:

  • Availability: Uptime guarantees, scheduled maintenance windows, and communication policies around downtime.
  • Response time: What is the target for acceptable response times? What is the limit beyond which you will consider the service unavailable?
  • Throughput: How many requests is the service expected to handle? How many is the client allowed to send in a given time window?
  • Service classes: Are there certain kinds of requests that have non-standard response time or throughput requirements? Document them explicitly.
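An SLA that exists only as prose tends to drift; it helps to make the targets explicit data that monitoring can check against. The numbers below are placeholders, not recommendations — your real targets come from the agreement itself:

```python
# Hypothetical SLA targets, written down as data rather than prose.
SLA = {
    "availability_pct": 99.9,        # uptime guarantee
    "response_ms_target": 200,       # acceptable response time
    "response_ms_limit": 2000,       # beyond this, count the service as unavailable
    "max_requests_per_minute": 600,  # throughput the client may send
}

def within_sla(measured_availability_pct: float, p95_response_ms: float) -> bool:
    """Check two measured KPIs against the documented targets."""
    return (measured_availability_pct >= SLA["availability_pct"]
            and p95_response_ms <= SLA["response_ms_target"])
```

The same structure can feed the KPI dashboard or the regular email report, so provider and consumer are always reading conformance off the same numbers.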

Your SLA should also describe how you monitor and report on conformance with the agreement. Measurements of these aspects of performance are usually called Key Performance Indicators (KPIs), and those measurements should be made available to your customers as well as your delivery team. These might be circulated in a regular email, or made available as a web-based dashboard.

If there is a financial arrangement involved in using the service, your SLA should also include remedies for non-conformance. However, even for services designed for internal consumption only, the SLA should be explicitly documented and agreed on by the service provider and the service consumer.

Internally, you should also monitor the error rate of your application and subtract it from your availability. A server that throws a 500 Internal Server Error was not available to the customer who received the error. If a high percentage of requests result in errors, you have an availability problem.
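The arithmetic is simple but worth writing down. A sketch, assuming you can count total requests and 5xx responses over the measurement window:

```python
def effective_availability(total_requests: int,
                           error_responses: int,
                           uptime_pct: float) -> float:
    """Scale raw uptime by the fraction of requests served without a
    server error: a request answered with a 500 was not 'available'
    to the caller who received it."""
    if total_requests == 0:
        return uptime_pct
    success_rate = 1 - (error_responses / total_requests)
    return uptime_pct * success_rate

# 99.95% raw uptime, but 2% of requests returned 500s:
# effective availability drops to roughly 97.95%.
```

Numbers like these are why error rate belongs in the KPIs alongside uptime: a server that is up but erroring is failing the availability guarantee.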

Communication and Escalation Policy

The key to any relationship is communication. When you provide a service, you must have a communication plan around delivering that service to customers. Some of that communication is discussed above. Issues to cover in your communication plan include:

  • Notification of changes and new service features.
  • Notification of deprecation cycles.
  • Reporting on service level performance.
  • Notification of incidents and how problems that affect customers are being managed.

In addition to these important communications from you to your consumers, it is also important to establish how your customers will communicate to you.

  • How can your customers contact you with questions or concerns?
  • How do they report problems?
  • What are the business hours for normal communications, and what is the policy for after-hours emergencies?

Establishing these policies up front will help people remain calm when an emergency does occur. A clear communication plan can ensure that you can focus on solving problems rather than fielding complaints. It also ensures that the customer feels confident that you have things well in hand.

Conclusion

At any point where dependencies exist between systems (or teams), that relationship must be governed by a contract. That contract is an agreement defining the scope of responsibility of the service provider: the interface for the service, a Service Level Agreement that establishes Key Performance Indicators along with targets and limits, and a Communication and Escalation Policy to ensure good support for the running service.

With these parameters defined and clearly communicated, all parties should have confidence in the reliability of the service (or at least a clear path to getting there).

2013-07-22

Toward a Reusable Content Repository

There are a plethora of web-based content management systems and website publishing systems in the world. Almost all of them are what you might call "full stack solutions," meaning that they try to cover everything you need to cook up a full publishing system, from content editing to theming. WordPress is the most obvious example, but there are hundreds of such systems varying in complexity, cost, and implementation platform.

So many of the available products are full stack solutions that the market seems to have forgotten the possibility of anything else. What would it look like if you could assemble a CMS from ready-made components? What might those components be, and how would they interoperate?

2013-07-20

CASTED: Cooperative Agents, Single Threaded, Event Driven

The past looked like this: A User logs into a Computer, launches a Program, and interacts with it.

The future looks like this: The Computer on your desk runs a Program (in the background) that collaborates with a Program running on the Computer in your pocket and another Program running on a Computer in the Cloud, operating on your behalf without the need to interact.

In the past, a Program and an Application were the same thing. More and more, the Applications of today and tomorrow are made up of multiple Programs running on multiple Computers but cooperating with each other to achieve some utility for You (formerly the User).

The web development community has lately been very excited about single-threaded event-driven servers like Node.js. These processes are very good at maintaining a large number of connections, each of which requires only a small amount of work. (These servers are not very good at the inverse case, a small number of clients asking for very hard work to be done. For that, you want a different model.)

This paradigm of a large number of connections and small amounts of work fits neatly into the world where large numbers of processes collaborate to create a useful result. Each process does a relatively small amount of work, but the value emerges from the coordination of the processes through their communication.
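The same model exists outside Node.js; Python's asyncio makes a compact sketch of it. One thread interleaves many "connections," each doing a small amount of I/O-bound work while the others wait — the numbers and handler are illustrative:

```python
import asyncio

async def handle_client(client_id: int) -> str:
    # Simulate a small unit of I/O-bound work (e.g. reading one message).
    await asyncio.sleep(0.01)
    return f"client {client_id} served"

async def main():
    # A single thread services 100 "connections" concurrently; because
    # each task spends its time waiting, the total wall time is close
    # to one task's wait, not the sum of all of them.
    return await asyncio.gather(*(handle_client(i) for i in range(100)))

results = asyncio.run(main())
```

Replace the sleep with a CPU-heavy computation and the model collapses — every other connection stalls behind it — which is exactly the inverse case the paragraph above warns about.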

Example: There is a process on your phone that displays emails. There is a process on the mail server that sends the messages to your phone. There is a process that examines messages as they arrive at the server to filter out junk mail. There is another process that examines the messages to rank them by importance and places some in your Priority Inbox. These processes are constantly running, on multiple servers, operating on your behalf in the background.

Years ago, Tim O'Reilly was writing about software above the level of a single device as part of his Web 2.0 concept. Tim's classic example is the iPod-iTunes-iStore triumvirate. You have servers on the Internet, a desktop or laptop computer, and a small handheld device all coordinating your data for you.

As more devices have computers embedded into them, there are more opportunities for cross-device applications. And as more such applications emerge, users will expect applications to coordinate across devices like this. If you are designing a new application today, you'd better be thinking about it as a distributed system of cooperating processes.

2013-02-04

Evolving Systems vs Design Consultants - A Recurring Pattern

I often think of systems architecture as analogous to this word game I played as a child. I don't know if the game has a name, but it begins by selecting two words, say "cat" and "dog". The goal is to start with one word and end with the other. The rules: you can change only one letter each turn, and at the end of every turn you must be left with a real word. Hence, one way the game might play out is CAT -> COT -> COG -> DOG. You might also get there through CAT -> COT -> DOT -> DOG. Either path is valid, but there is no direct "upgrade" from CAT to DOG.

This is an apt analogy for the problem of systems architecture when dealing with an operational system. The constraints of the system's operation almost always prevent you from changing more than one component at a time. Every change to any component must result in a system that continues to operate. Real-life systems also tend to have far more components than a three-letter word; in fact, they comprise sentences, paragraphs, even whole novels.

In my work, I have occasionally had the good fortune to work with some great outside consultants. To date, I have always found these interactions to be productive and educational on multiple levels. It is a remarkable luxury to pick the brain of someone who is truly an expert in their field, and I try to take advantage of such opportunities whenever I can. In those interactions, I have noticed a curious recurring pattern.

Because of my role, I am often dealing with a consultant who is a systems designer. This expert comes in to help us improve the design of our systems. Unfortunately for her (or me), evolving operational systems tend to be more organically grown than designed, and the consultant must infer a design intent from examining the system as built, because the original design intent is lost in the mists of time.

Invariably, a conversation will occur that goes something like this.

"I see that you are using a COT in this part of the system," the consultant will say, attempting to hide a smirk. "A DOG would be much more appropriate. Why don't you try using a DOG?"

Of course, the consultant is being tactful here. No person in his right mind would use a COT as a replacement for a DOG. We, who built the system, are embarrassed even to be showing anyone this particular mangled part of our system. My response, when I have sufficient presence of mind to compose a rational one, always has a similar pattern.

"Well yes, ideally you want a DOG there, but when we were building this aspect of the system, we didn't have enough budget left for a pre-built DOG component. It would have taken us several months to build a custom DOG, which would have caused us to miss our launch deadline. But we had a well-tested CAT component we had built for a different system, and that mostly did the job. We found we could use that if we made some adjustments to the FOOD component to accommodate the CAT, and we could do that faster than building a whole new DOG."

Pause for a breath. Here's where the explanation gets messy. "After we launched, we wanted to come back and fix this to use a DOG, as originally designed, but of course we couldn't switch from a CAT to a DOG without changing the FOOD component again. Since we can only change one component at a time, during the upgrade process either the CAT or the DOG would get the wrong FOOD at some point, breaking the system." Remember that constraint about changing only one component at a time?

"We can't afford to break the system; we have live customers to support now." Here's that other constraint: every change must result in an operational system. Paying customers enforce that pretty strictly. It's hard to say you're lucky if you don't have paying customers, but sometimes it feels that way.

"So instead, we have migrated to using a COT. It's obviously not very efficient, but it fits, and it eliminates the dependency on the FOOD component (a COT does not eat). We're planning to replace the COT with a COG in a future release, which should be a smooth transition, and free up some system resources. Once that's done, we can use those resources to re-engineer the FOOD component to support a DOG, assuming management signs off on the additional cost."

By this time, depending on the consultant's level of experience, she will either be staring at me like I'm a lunatic, or shaking her head with a sympathetic grimace (usually the latter). In either case, the response is usually some variant of "I see." And the final report will advise, "Upgrade from COT to DOG ASAP."

Sigh.

There is no aspect of an organically grown system that could not be better designed in retrospect. The shape of the completed system is not governed solely by the appropriateness of the design/architecture. It is largely shaped by convenience, the accessibility of specific tools or components, and the cost-benefit trade-offs and time constraints imposed externally on the design process.

The line between sense and nonsense is squiggly, because it must be drawn through the whole history of the system. And it's not always obvious which side of the line you are on.

2013-01-22

Take heed, managers: your "best practices" are killing your company

If you are a manager, you need to understand the ideas of W. Edwards Deming. Deming wrote several books about management, in which he chastised American business schools and American corporate management for perpetuating a failed philosophy and failed management techniques.

Deming proposed a new philosophy of management motivated by quality and grounded in systems theory. The Deming philosophy is too deep, too broad, and too rich to be explained in a mere blog post. Volumes have been written about it, and as I read those volumes I am sharing my thoughts through this venue (with apologies to Mr. Deming if I misrepresent anything; I am still learning).

Probably the best introduction to Deming and his theories is his Red Bead Experiment. The experiment is detailed in Chapter 7 of his book, The New Economics for Industry, Government, Education. The experiment is extremely educational, and I highly recommend you watch it play out in the video below (you'll need about an hour).

2013-01-18

Systems vs Habits: Why GTD Often Fails


In my previous post, I wrote about David Allen's Getting Things Done book and productivity system. If GTD has a weakness, it is that, although the book describes the system very well, it does a poor job of describing the changes in daily habits you'll have to make if you really want to implement it. The major reason people fail at implementing a GTD-style productivity system in their lives is that, no matter how simple the system may be, it's a big change from what they are used to.

Leo Babauta is a self-made expert in changing and forming habits. His Zen Habits blog has changed the lives of many of its readers. So when I decided to try getting organized once again, there were two books on my reading list: David Allen's (the system) and Leo Babauta's Zen To Done, Leo's personal take on productivity.

2013-01-13

Getting Things Done -- Productivity System

Workflow diagram from "Getting Things Done"
David Allen's Getting Things Done: The Art of Stress-Free Productivity is a phenomenon in the tech community. If you're reading this blog, you've probably already read the book, or at least know something about the productivity system that it defines. I read it years ago, but like many readers never put into practice more than a tiny portion of the system.

As 2012 drew to a close and I looked back on all the things I meant to accomplish, I decided that I should give this productivity bible another look, in the hopes of getting more things done in 2013. I won't bother to summarize the system that David Allen defines. The book is very readable and does a much better job than I could. Instead, I'm just going to note how I decided to apply the principles of his system in my own life, especially given the changes in technology and lifestyle since the book was originally published a dozen years ago.