A commonly overlooked area of many systems is the non-functional requirements and the design to meet them. Patterns for Performance and Operability by Ford, Gileadi, Purba and Moerman gives everyone involved in the software life-cycle, from development to support, a good foundation in understanding why non-functional requirements are important, with real examples of how to capture, develop, test and operate with these requirements. Systems fail when non-functional requirements have not been considered, and it is everyone’s role in the SDLC to consider them.
As a software architect I find it is always a challenge to ensure the non-functional requirements (NFRs) are considered. NFRs are typically seen as a tax or penalty for implementing functional changes. As hardware and software technologies have evolved, and with the emergence of the cloud, the importance of non-functional requirements is even greater, particularly performance and operability. A central tenet of any system moving to the cloud is to expect failures and design your applications to cope with infrastructure failures; this does not mean the cloud is not resilient, but rather that your applications must be designed to work in a scaled-out design with instances potentially being unavailable or slow.
Patterns for Performance and Operability is aimed more at traditional (non-cloud) architectures for complex enterprise systems but many principles are still relevant to cloud systems. It is written for the business owner of a system, the architects, QAs, developers and support staff. This can lead to the writing potentially being obvious in parts to specific individuals but as a whole it provides a common view that all can relate to and understand.
I’ve summarised the book here; in the following sections I have added my views in italics.
Planning for the unexpected is what this book is about:
- Establish a comprehensive set of non-functional tests to minimise the set of unexpected system conditions. This requires CxO backing as there are costs that are challenged against the perceived risks.
- Enhance the system design to gracefully handle unexpected events. Mandatory in the cloud.
- Create a set of diagnostics to monitor and alert on unexpected conditions so that intervention and resolution can be swift, if not automated. In my view there are tools that can be obtained to perform some of this, e.g. APMs.
Key principles for Patterns for Operability in Application Design:
- Data and Transactions must never be lost or corrupted. Two-phase commits where possible, or compensating transactions with clearly defined boundaries for any rollbacks.
- Exception conditions (expected or unexpected) must be captured and reported in a consistent fashion. This can be challenging when individual teams are responsible for their own components; an extension to this is using centralised logging and potentially tooling such as Splunk or LogStash. Classify based on alerts, monitors and reporting. Push alerts to the correct teams and business users.
- The application must recover in an automated fashion once the exception condition is removed. Recover when the exception condition is no longer present. Audit all stages of recovery. Tooling and documentation of manual recovery.
- Applications must provide visibility into the availability and health of their various components, with hooks to monitor the health of the applications as well as quickly detect and correct any issues. Potentially tooling such as an APM can play a key role to avoid building your own. Automate health checks.
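As a minimal sketch of the compensating-transaction idea in the first principle above, assuming each step can supply its own undo action. The names `run_with_compensation` and `Step` are my own and do not come from the book:

```python
from typing import Callable, List, Tuple

# (name, action, compensation): a hypothetical shape for one step of work
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in reverse.

    Returns an audit trail so that every stage of the rollback is recorded.
    """
    audit: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            audit.append(f"failed:{name}:{exc}")
            # Compensate in reverse order, newest completed step first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                audit.append(f"compensated:{done_name}")
            return audit
        completed.append(step)
        audit.append(f"done:{name}")
    return audit
```

A real system would also need to handle a compensation that itself fails, which this sketch deliberately leaves out.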
Planning and Project Initiation
Testing of the non-functional requirements is key and needs to be carefully planned as there are potentially implications on the environments used and tools used. Be prepared to adjust the design based on the results of the testing. The minimum should be:
- On-line Performance, i.e. response times need to be verified under load conditions. If the design does not compensate for 3rd party issues and page loads are synchronous this can cripple your site, so you need to test for this, potentially with stubs / mocks.
- Batch Performance (Asynchronous services) how long does each process take and how long is recovery from failure going to take. Just because it is batch does not mean it does not need to be completed within certain windows including recovery time.
- Capacity Tests, how much load can the system cope with rather than how quick is it. Also how does the system behave under different load conditions, does one slow element bring everything down.
- Fail over Tests, identify all key points and initialise failure under load. Test each component individually and verify what happens to in-flight transactions.
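To illustrate verifying response times under load, here is a rough sketch that calls an operation concurrently and reports a percentile. It is not a substitute for a proper load-testing tool; `measure_latencies` and `percentile` are hypothetical helpers, and the nearest-rank percentile method is my own choice:

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def measure_latencies(operation: Callable[[], object],
                      requests: int, concurrency: int) -> List[float]:
    """Call operation `requests` times across `concurrency` threads.

    Returns the elapsed time of each call in seconds.
    """
    def timed_call(_: int) -> float:
        started = time.perf_counter()
        operation()
        return time.perf_counter() - started

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(requests)))


def percentile(samples: List[float], percent: int) -> float:
    """Nearest-rank percentile, e.g. percent=95 for the 95th percentile."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]
```

The operation passed in could be a stub that sleeps to mimic a slow 3rd party, for example `lambda: time.sleep(0.2)`, which shows how synchronous page loads behave when a dependency degrades.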
Justifying the investment for some of the testing is not simple. Everyone understands the impact of slow systems, but many do not fully understand why capacity needs to be tested in a world where hardware is cheap. Try to reuse any environments for more than NFT, and relate the need for NFT to previous outages. Automate the deployments. If this fails then ensure the risks of not performing the NFT are clearly understood by senior management.
These are still business requirements even if your business does not understand the need for them. The technical team must convey the need for the requirements in a way that allows the business to provide them. Non-functional requirements:
- Serve as a basis for constructing a robust systems design based on the behaviour expected from the system.
- Are a pre-requisite for non-functional testing.
- Help to define a usage contract with end users / product owner.
- Provide a basis for capacity planning.
The types of requirements to focus on are:
- Performance Requirements – capture throughput and response times.
- Operability Requirements – capture how robust the system must be and what should happen when failures occur. Five Nines is a target often set but the actual requirements of the system need to be captured.
- Availability Requirements – when should the system be available, although many are 24×7 there are always specific windows which are critical or potentially even seasons.
- Security Requirements – these can be driven by regulation / compliance such as PCI DSS. Security is typically cross cutting and impacts all aspects of the SDLC and subsequent life of the system.
- Archive Requirements – this also covers deletion of data.
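To make a target such as Five Nines concrete, it helps to convert the percentage into permitted downtime. The arithmetic below assumes a 365-day year, and the function name is my own:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% allows {allowed_downtime_minutes(target):.2f} minutes per year")
```

Five Nines works out at a little over five minutes a year, whereas 99.9% allows almost nine hours, which is why the actual requirement is worth capturing rather than assuming.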
One method to help capture the requirements is to establish a usage model for the system to identify human and machine inputs. These in turn allow for the above requirements to be captured at a lower level of granularity. For example at thetrainline.com we have login, registration, journey search, journey planning, payment and post payment usage scenarios. Each one has a specific set of non-functional requirements.
Operability Requirements can also be identified by analysing:
- Component Autonomy – systems are made of many components and 3rd party systems, but one single component should not cause non-related components to fail. For example if payments are failing then this should not stop the journey planning from functioning. If a service / component is removed for maintenance the entire platform should not require a restart.
- Trace Logging – it should be possible to trace the path of a business transaction through the components of a system to identify where it failed; ideally logging should be configurable or asynchronous.
- Exception Logging – should be consistent, these requirements should help define how to realise this in the system.
- Fault Tolerance – what should be the user experience or system behaviour if a fault occurs.
- Fail over – availability is achieved through increasing quality in the solution and redundancy in the software but there will still be failures and in some cases fail over will be required. For example, if a certain ticket issue server is not functioning the system should fail over to another server.
- Communicating Outages and Maintenance Windows
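One way to realise the trace logging requirement above is to stamp every log line with a transaction identifier. This sketch uses Python's standard `logging` and `contextvars` modules; the identifier format and the helpers `start_transaction` and `build_logger` are assumptions of mine:

```python
import contextvars
import logging
import uuid

# Holds the id of the business transaction currently being processed.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copies the current transaction id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.transaction_id = transaction_id.get()
        return True


def start_transaction() -> str:
    """Generate and register an id at the entry point of a transaction."""
    new_id = uuid.uuid4().hex
    transaction_id.set(new_id)
    return new_id


def build_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Create a component logger whose output includes the transaction id."""
    handler.setFormatter(logging.Formatter(
        "%(transaction_id)s %(name)s %(levelname)s %(message)s"))
    handler.addFilter(TransactionIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Across process boundaries the identifier would need to travel with the request, for example in a message header, which this in-process sketch does not cover.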
Designing for Operability
Ensure your systems have standard error severities – this is a pattern as ancient as the mainframe:
- Fatal / Critical – a service is down and needs immediate attention as it is unlikely to recover.
- Error – an individual transaction has failed, one may not be that important but if the frequency increases this is a sign of a larger problem.
- Warning – an early sign of potential issues, typically these are not relevant to operations staff as they cannot act on them but they can help during post mortems following an incident. These are hard to code as they typically indicate a situation that is not well understood.
- Info – helps to identify the use of a system and for log scrapers to analyse baseline use.
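The point that a rising frequency of Error events signals a larger problem can be sketched as a sliding-window counter. Escalating to Fatal at a fixed threshold is my own simplification, and the class and parameter names are hypothetical:

```python
from collections import deque
from enum import IntEnum


class Severity(IntEnum):
    INFO = 10
    WARNING = 20
    ERROR = 30
    FATAL = 40


class ErrorRateEscalator:
    """Escalates ERROR to FATAL when too many occur within a time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._error_times = deque()

    def record(self, severity: Severity, now: float) -> Severity:
        """Return the severity to report for an event seen at time `now`."""
        if severity != Severity.ERROR:
            return severity
        self._error_times.append(now)
        # Forget errors that have fallen out of the window.
        while self._error_times and now - self._error_times[0] > self.window_seconds:
            self._error_times.popleft()
        if len(self._error_times) >= self.threshold:
            return Severity.FATAL
        return Severity.ERROR
```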
Retrying for fault tolerance is a key pattern of a good systems design. If errors were simply sent back to the end user they would either stop using your systems or tie up valuable resources from multiple teams trying to identify where in the series of connected components the issue occurred. By building in retries the system is more fault tolerant. But it needs to be done taking the following into consideration.
- Are 3rd party systems idempotent.
- Is there sufficient time to complete the process with retries.
- What are the appropriate intervals between retries, i.e. no point retrying within milliseconds if the scenario being handled requires several minutes for a faulting component to recover.
- How long should you keep retrying before failing. In a web site your customer is not going to wait for more than 30 seconds for the payment page to complete processing.
- Settings for retries should be configurable as they will need adjustments.
- If queues are used to handle requests then make sure the queue lengths are monitored and potentially retries should stop if the queues are too large.
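The considerations above can be sketched as a retry helper with configurable attempts, growing intervals and an overall time budget. It assumes the operation is idempotent; all names and default values are illustrative rather than taken from the book:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the attempts or the time budget run out."""


def call_with_retries(operation: Callable[[], T],
                      max_attempts: int = 3,
                      initial_delay: float = 0.5,
                      backoff: float = 2.0,
                      deadline_seconds: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> T:
    """Retry an idempotent operation, waiting longer between each attempt."""
    start = clock()
    delay = initial_delay
    last_error: Optional[Exception] = None
    attempt = 0
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
        if attempt == max_attempts:
            break
        # Stop early if the next wait would overrun the overall budget.
        if clock() - start + delay > deadline_seconds:
            break
        sleep(delay)
        delay *= backoff
    raise RetryBudgetExceeded(f"gave up after {attempt} attempt(s)") from last_error
```

Passing `sleep` and `clock` in as parameters keeps the settings adjustable and lets the behaviour be tested without real waiting.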
Software Fuses / Circuit Breakers are another key pattern for operability. Allowing the system to identify when there is a problem and prevent other transactions from failing allows for problems to be contained and hence less effort to recover failed transactions. The circuit breaker must not add additional load to the component and it must fail safe, i.e. always open if it is not possible for the breaker to operate.
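A minimal, single-threaded sketch of a software fuse follows. The threshold, the timeout and the class names are assumptions of mine; a production version would also need locking and monitoring hooks:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the timeout one trial call is allowed through again.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, operation: Callable[[], T]) -> T:
        if self.is_open:
            # Reject without touching the component, so no load is added.
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers fail fast instead of queuing behind a struggling component, which is what contains the problem.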
Building in System Health Checks also greatly improves the operability of a system:
- Check connectivity to interdependent systems.
- Monitor availability of major subsystems.
- Monitor database and file systems.
- Monitor the performance of critical operations.
- Statistically roll-up transaction level errors.
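The checks above could be rolled up into one health report along these lines. The component names and the report layout are illustrative assumptions; each check is simply a callable returning True or False:

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run every check and summarise the results in one report.

    A check that raises is treated as down rather than breaking the report.
    """
    components: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            components[name] = "up" if check() else "down"
        except Exception as exc:
            components[name] = f"down ({exc})"
    healthy = all(status == "up" for status in components.values())
    return {"status": "healthy" if healthy else "degraded",
            "components": components}
```

An operations dashboard or an APM could poll a report like this on a schedule, which gives the automated health check described earlier.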
Isolation of sub-systems / applications that require high availability on shared infrastructure improves the robustness of a system. Although using a shared platform saves money in terms of hardware it does increase the risk of systems impacting each other if the isolation has not been implemented.
As well as standard error severities the application logging design is important in all of your components:
- Ensure log levels are dynamically configurable. log4net / log4j are widely used for this.
- More is better, but ensure it does not impact systems performance.
- Debug, Trace and Performance logging require specific attributes to be logged and subsequently analysed. They should be kept separate from each other.
- Transparency of the system state is best achieved using XML rather than delimited or binary formats.
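Dynamically configurable log levels can be illustrated with Python's standard `logging` module instead of log4net / log4j. The function `set_log_level` is a hypothetical hook that an admin endpoint or a configuration watcher might call:

```python
import logging

ALLOWED_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def set_log_level(logger_name: str, level_name: str) -> None:
    """Change a logger's level at runtime without restarting the process."""
    level = ALLOWED_LEVELS.get(level_name.upper())
    if level is None:
        raise ValueError(f"unknown log level: {level_name}")
    logging.getLogger(logger_name).setLevel(level)
```

This allows debug logging to be switched on for one component during an incident and switched off again afterwards, so the extra logging does not become a permanent performance cost.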
Agile / Extreme Programming assists in reviewing the code as it is being written; alternatively, design reviews or show cases allow specific non-functional requirements around operability to be checked / accepted by operations teams.
Designing for Performance
Apart from how fast a page loads or how quickly the transaction is processed through fulfilment systems there are other ‘-ilities’ that need to be designed for which are intertwined with performance and potentially conflicting:
- Scalability – based on the planned growth of the use of the system vertical scalability may be an option but for most systems horizontal scalability is preferable and with the use of load balancers it is possible to use commodity hardware.
- Usability – getting the balance between an enticing system and one that is easy to use can be difficult, error handling and how the feedback is provided to end users is important and needs to be designed into the system.
- Extensibility – this is one aspect of a system that can result in performance issues, the more configurable and customisable a solution the more complexity in the architecture, carefully choose how to implement customisations.
- Secureability – the need for encryption, access controls and non-repudiation will determine the architecture required. The internet has also resulted in a whole new source of patterns and technologies from DMZs, ALFs, IDS and DDoS protection. Your systems have to work with these technologies.
- Operability & Measureability – separate capturing of data from the analysis and presentation of the systems state. Modern tools such as APMs allow for real time instrumentation.
- Maintainability – software and test driven development can assist in the understanding of code, but do not forget to document your code: why an algorithm works the way it does cannot be derived from the code or tests, so there needs to be documentation.
- Recoverability – if rerunning a process is quicker, then recovering from failure within the process can be removed from the design. If data is transient or static then database transactional logging is not required, again simplifying the system and improving performance.
Identification of performance hotspots helps to ensure the architecture of a system is determined using a pragmatic approach. Ensure the non-functional requirements have been captured, map the input and output channels to identify potential volume or response time hotspots and ensure any potential bottlenecks are designed out of the system:
- Divide & Conquer – break your system up into separate components and services that allow the business process to be broken down into smaller parts that can then be designed, built and tested separately.
- Load Balancing – helps to distribute load and isolate potential issues.
- Parallelism – if it is possible to run certain calculations in parallel this can improve performance but also adds to the complexity.
- Sync v Async – asynchronous processing improves performance and can also aid the resilience of a system BUT adds to the complexity and requires persistence between components.
- Caching – will be present across the entire technology stack from SAN through to your code. Caching improves performance but needs careful design to ensure stale data does not contribute to functional issues.
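To show how caching can limit the stale data risk, here is a small time-to-live cache. Expiry by age is only one possible invalidation policy; the class name and the loader signature are assumptions for illustration:

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TtlCache:
    """Caches loaded values and reloads them once older than ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def get(self, key: Hashable, loader: Callable[[Hashable], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh, serve from the cache
        value = loader(key)  # missing or stale, go back to the source
        self._entries[key] = (now, value)
        return value
```

The TTL is the explicit trade-off: a longer value means fewer calls to the source but a longer period during which readers may see out-of-date data.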
Anti-patterns to be wary of:
- Over Design – do not spend too much time designing the perfect solution, once you have considered the non-functional requirements then get early feedback from the actual system in use.
- Over Serialisation – you can break your system down into too small a size resulting in most of the time being spent serialising messages in and out of components rather than doing anything with it.
- Over Synchronisation – for example, implementing locks and continually updating the data / object so that readers always see the most up-to-date data, when that level of freshness is often not required.
- User Session Bloat – try to avoid storing everything that may be required in session state and also ensure the same information is not persisted many times.
Selecting your programming language, database technology and hardware platforms all play a key role in the performance of your systems BUT these are whole books in their own right. The book provides general guidance on these subjects.
Test Planning, Preparation and Execution
Two chapters are devoted to these topics. Critical to the success of the design and implementation is the testing conducted to verify that the requirements are met. Valuable information is also gathered to help in testing future changes to the system, i.e. variations against baselines.
Good deployment procedures that manage the risk to your business have the following characteristics:
- Minimal deployment, where possible only deploy what has changed or a functionally complete sub-system / component.
- Automated, using the same tools and process to deploy into test and production environments, which removes human error.
- Auditable, each stage of the deployment should be logged to help investigate issues and monitor progress.
- Reversible, there must be a backout process.
- Tested, both the deployment and backout must be tested.
Understand your rollout strategy as this will provide requirements for your deployment process but also helps you to manage the risk of the deployment:
- Pilot – small group of users selected to make use of the new system helps to obtain feedback from use of the system but does require your systems to support side by side deployment OR feature switches i.e. where a new feature can be enabled for selected users.
- Phased Rollout – the software is deployed and then gradually rolled out for example in a ticketing system enable a new fulfilment method for a small number of products.
- Big Bang – in some scenarios you may only be able to implement the changed system to all users at the same time, typically this is seen as too risky but it can also act as a focus to ensure all teams deliver the product.
- Leap Frog – deploy a new system to a new set of infrastructure and move all users to the new infrastructure.
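The feature switch mentioned under Pilot, and the gradual enablement under Phased Rollout, can be sketched with a deterministic hash so that a given user always gets the same answer. The function names and the feature name `new_fulfilment` are hypothetical:

```python
import hashlib
from typing import FrozenSet


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, rollout_percent: int,
               pilot_users: FrozenSet[str] = frozenset()) -> bool:
    """Pilot users always get the feature; others by rollout percentage."""
    if user_id in pilot_users:
        return True
    return rollout_bucket(feature, user_id) < rollout_percent


# Example: named pilot users first, then 10% of everyone else.
print(is_enabled("new_fulfilment", "user-42", 10, frozenset({"user-1"})))
```

Because each user's bucket is fixed, raising the percentage only ever adds users, so nobody loses a feature they have already been given.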
Resisting Pressure from Functional Requirements
Consequences of not capturing and catering for non-functional requirements:
- Completely Neglected – inadequate architecture, design and code, could result in project failure in the worst case but will lead to system failures.
- Completed Separately – misalignment of functional and technical requirements, will require more time to align the design.
- Parallel Start but Abandoned or ad-hoc Support – resources made to focus on functional requirements will result in system failures.
Where are the areas of contention:
- Human Resources – establishing clear roles and responsibilities early on and for shared resources across projects planning for their involvement is critical to success.
- Hardware Resources – set-up of non-functional test environments can have long lead times and in particular for a new project you may not actually know what hardware is required until well into the project. It is important to continually review the architecture and start to implement the hardware required as soon as possible.
- Software Resources – budgets for specific testing software are often the first to be cut. But for businesses where there is continual development, the up-front cost of such tools can be shown to have a business case.
- Issue Resolution – non-functional issues are usually given a lower priority again it is important to continually remind the whole team of the risks of not resolving the non-functional issues.
Establishing the success criteria early in the project is key to eliminating pressure from functional requirements. The project phases and the outputs from those phases help drive the continued buy-in from all stakeholders:
- Plan – non-functional resource estimates completed and budgets secured for the hardware and software needed.
- Architecture and Design – non-functional test environments defined and software testing tools defined.
- Develop – non-functional requirements completed and development completed with attention to performance and operability.
- Test – Environments ready, automated testing ready, performance testing completed, and fail over/operability testing completed.
- Deploy – capacity model and capacity plan completed.
The best way to protect NFRs is to build in the activities required (above) into the SDLC. Establish clear milestones for when NFRs will be captured and reviewed by the relevant stakeholders. Make it clear to each of the stakeholders what their responsibilities are.
Operations Trending and Monitoring
There are two distinct objectives of monitoring: detect and alert as quickly as possible, and provide maximum diagnostics information. Effective monitoring has:
- Redundant Monitoring – have multiple ways to detect an issue with performance. For example, if an SMTP gateway fails, monitoring both that the application is sending emails and that the gateway is running provides the detail required to identify issues.
- Monitors that Correlate – the application and SMTP gateway monitoring should both help identify that the gateway is broken.
- Detailed Alerts – having details of the transaction types impacted and specific hardware components failing help with identification of the issue and speedy recovery.
- Consolidation – providing a consolidated view of the monitors allows operations staff to identify issues quickly.
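Using the SMTP example, correlating two redundant monitors might look like the sketch below. The wording of the diagnoses is my own and would need tailoring to a real environment:

```python
def diagnose_email(app_sending: bool, gateway_running: bool) -> str:
    """Combine two independent monitors into one more specific alert."""
    if app_sending and gateway_running:
        return "healthy"
    if not app_sending and not gateway_running:
        return "gateway down: application cannot send emails"
    if not app_sending:
        return "application fault: gateway is up but no emails are being sent"
    return "monitor disagreement: emails sent but gateway reported down"
```

Neither monitor alone distinguishes a broken gateway from a broken application; together they point operations staff at the right component.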
Key metrics to monitor:
- Server – availability (up or down), CPU available, memory available, swap memory in use, network i/o and disk i/o.
- Storage – availability (up or down), space available, i/o speeds read and write.
- Switches/Routers – availability (up or down), CPU available, error rates, error count by port, and network speed.
Capturing and trending the monitors is important and must feed into your capacity model. Both hardware and software performance as well as error rates should be included in the capacity model.
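As a simple illustration of feeding a trended metric into a capacity model, this sketch fits a straight line to daily disk usage samples and projects when capacity will be reached. Linear growth is an assumption, and real capacity models are usually richer than this:

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]],
                    capacity: float) -> Optional[float]:
    """Project the days remaining until usage reaches capacity.

    samples are (day, usage) pairs. Uses a least-squares straight line
    and returns None when usage is flat or falling.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean_day = sum(day for day, _ in samples) / len(samples)
    mean_use = sum(use for _, use in samples) / len(samples)
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one day")
    slope = sum((day - mean_day) * (use - mean_use) for day, use in samples) / spread
    if slope <= 0:
        return None
    intercept = mean_use - slope * mean_day
    day_full = (capacity - intercept) / slope
    last_day = max(day for day, _ in samples)
    return max(0.0, day_full - last_day)
```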
Troubleshooting and Crisis Management
Even though a system may be well designed, have high quality software and have all of the NFRs captured and implemented, there will still be issues. Reproducing an issue is key to long term resolution of the cause, hence the level of logging and diagnostics as well as the monitoring in use play a key role in being able to reproduce the issue.
Production outages occur because of:
- Software Defects in your application – this is usually the easiest to identify and resolve.
- Software Defects in vendor applications – requires your vendor to have built in sufficient logging / diagnostics and may need access to your environments.
- An illegal input – your users will do something you have not catered for; having bug hunts or destructive testing can help identify these.
- An illegal or mistaken procedural error – that is why it is very important to minimise human involvement in the operation of a system and also to make it clear what the process should be to identify mistakes.
- An infrastructure event – hardware will fail.
When trying to troubleshoot it is important to have the following available:
- Understanding the changes in the environment, changes are typically the root cause of most issues – have a change log, reduce undocumented changes, understand all those who have access to an environment and make sure they understand their role / responsibility in the change process. Identify if there have been changes in the usage patterns. Are external systems changing? Check administrative/BAU changes through audit logs. Has time played a role, i.e. over time a system can experience problems as it has more data to process or data types reach their limits.
- Gathering all possible inputs – what processes and user load were active at the time of the issue? Where were the errors experienced, i.e. channels impacted, all users, all transactions? What do the application logs show? If the issues have been raised by end users get as much information as possible from as many users as possible. Are there any application / memory dumps? Have there been previous failures with the same signature and what was done then to recover?
- Predicting related failures – if it is not possible to recreate the issue and no cause has been found then it is time to get creative. If other side effects have been seen then trying to reproduce them may help find the cause of the main issue.
- Discouraging bias – try to avoid team politics between development and support teams. Pride can cause individuals to not want to admit their error. Don’t base your investigations on the experts you have to hand; they may be seeing a symptom, not the cause. Not everyone will have the same understanding of the issue, hence clear and regular communication is important.
- Work-arounds – sometimes you will need to find a work around to get the system back into a usable state.
Once the cause has been identified applying the fix will also require planning and you should weigh up applying the fix versus the continued mitigation in place versus tolerating the issue until the next release. The level of testing to be conducted on the fix is also important to balance versus implementing the fix and seeing the impact.
Post Mortem reviews should be conducted to help everyone learn. There may be changes required to development, testing or monitoring to avoid similar issues in the future, or a fundamental flaw in the architecture or design that needs to be raised.
Common Impediments to Good Design
In most cases a team will be seeking good design, not excellent and not bad. You should be able to measure your design and ensure it does not have these bad elements:
- Does not solve the business problem, it may meet NFRs but must also meet the business requirements.
- Difficult to modify for changing conditions.
- Cannot be built within the constraints available.
- Too expensive.
- No standards.
- Solves the wrong business problem or one likely not to materialise.
- Does not meet the functional or non-functional requirements.
IT is a combination of science and art so there will be human factors. Other impediments to good design:
- Confusing architecture and design
- Insufficient time
- Missing skills
- Lack of design standards
- Personal preferences
- Insufficient information
- Constantly changing technology
- Fad designs
- Trying to do too much
- 80/20 rule not applied
- Minimalistic viewpoint (too tactical)
- Lack of consensus
- Constantly changing requirements
- Bad decision making
- Lack of funding
To improve the odds of getting a good design:
- Reuse good designs
- Test as you go
- Use proven methods
- Set objectives and measure against them
- Competent Resources
- Set realistic expectations
- Communicate the design decisions, issues and considerations to all the stakeholders
The first requirement of good design is that it meets the known functional and non-functional requirements. The second requirement of good design is to anticipate future needs and provide a roadmap to meet them.
The book provides a good overview to all stakeholders in the design of a system and provides architects / technical leads with a good reminder of what is important when building complex enterprise systems. The use of agile methodologies and cloud computing does alter some of the risks the book discusses, and they introduce new patterns which can be used to create a good design, but the core of the subjects discussed is still relevant.