When you talk web development, many people seem to think your audience must fall on one side of the fence – either business or technology. That business folks don’t want to hear a bunch of “technical jargon,” and that IT folks want to get down to the “nuts and bolts” of things without a care for business impact. Unfortunately, I believe this notion often leaves both sides at a bit of a disadvantage, creating a veil that hinders empathy and can cost everyone in the long run.
For products to succeed, technology and business have to be aligned on goals and a product roadmap. Business folks need to feel empowered, to expect the technology built will not only support product features but match business needs and style. On the other side, technical folks should think beyond the code needed to provide a feature and understand that business operations should heavily influence a product’s architecture.
For the past 18 months my team has been working with a client building a SaaS platform for social media management that we’ve helped grow from a startup to a grounded market leader. My team’s scrum master has talked before about how consistent user testing and iteration have delivered mountains of valuable information about its customers and its competitors, but we also uncovered a lot about its technology.
When we started, we learned quickly that the company had outgrown its first iteration, essentially an MVP with which it had proved a market need and acquired thousands of customers. Yet, as the business scaled and the platform grew more complex, version 1.0 couldn’t keep up. The system would lose stability under load and adding new features was beyond cumbersome.
Since the client plays in the crowded, competitive social media space (and focuses on a niche audience), it had to stay on the leading edge of technology, while minimizing downtime and optimizing performance. As a result, we constantly pushed new features and changes to existing ones, balancing old systems with new features.
So we focused on building a 2.0 version, a new platform that iterated upon everything we’d learned with version 1.0, but with meticulously chosen technology that mapped directly to business objectives and user needs. We knew the ability to respond quickly was not just important, but essential to survival. During our Sprint 0, we settled on 6 major goals for our architecture:
- 2.0 needed to be able to scale to handle its growth for the foreseeable future.
- The architecture could not be cost prohibitive.
- 2.0 needed to be designed in such a way that adding features would be easy.
- 2.0 needed to remain online even if some particular features were unavailable.
- We needed to accommodate multiple teams working on different areas of the product, all at the same time.
- In the end, 2.0 needed to be manageable by a relatively small in-house team (5 or fewer people).
With these goals in mind, our team set out to come up with an architecture that could meet our customer’s needs.
Dealing with Product Entropy
Developers and marketers have all been there. You begin with a glorious product vision – features mapped out, stakeholders appeased. Then as you start building, plans change and objectives shift. Your product slowly morphs into a monstrosity that barely resembles anything you had originally envisioned. If you’re working in waterfall web development, your project is about to be over-budget, over-deadline and feature-deficient.
If you’re working in some flavor of agile, it’s OK, even a good thing. It means you are reacting to your environment and adapting your product to the demands of the marketplace. But what about the technical side of things? Sure, being flexible with your product roadmap is great, but how do you make your architecture flexible? Too often, tech stacks lag behind your product vision (hence the need for new technology), so how can you build something that remains tolerant to constant change?
Our team had spent time looking into microservice architectures and believed they would be the key to solving this problem. In short, instead of building one big application to represent the product, we would build lots of smaller applications or “services” based on product features. In the same way you would group features together to make a product, we would group services together to make our application – an interconnected ecosystem, rather than a single organism with one function. This would give us the following benefits:
- New features could be added easily as new “services,” without interrupting other features (goal 3).
- Multiple teams could be working on different services at the same time without running the risk of stepping on each other’s toes (goal 5).
- Since the services were small in nature, replacing or changing them would be much less difficult and time-consuming (goal 3).
- Since each service remains discrete from the others, we could eliminate downtime when adding/replacing/changing features (goal 4).
- If one of the services “crashed” the rest of the application could remain online (goal 4).
- Launching new features was unlikely to impact the performance of other services, minimizing bug fixes and optimizing the stability of the entire platform (goal 3).
- We could use much cheaper resources to run our services and scale them up easily and independently (goal 2).
- Future iteration becomes much easier, since the product no longer needs to be locked into a particular tech stack or a legacy codebase (goal 6).
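To make the isolation benefits above a little more concrete, here is a minimal sketch in Python. It is not our actual code: the `Gateway` class and both service functions are hypothetical names, and plain function calls stand in for what would really be network calls between separately deployed services. The point is only that a failure in one feature stays inside that feature.

```python
from typing import Callable, Dict

Handler = Callable[[str], dict]


def mentions_service(query: str) -> dict:
    """Hypothetical feature service that is healthy."""
    return {"suggestions": [f"{query} Page"]}


def reporting_service(query: str) -> dict:
    """Hypothetical feature service that is currently down."""
    raise ConnectionError("reporting service unavailable")


class Gateway:
    """Routes each request to one feature service and contains its failures."""

    def __init__(self) -> None:
        self._services: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        # Adding a feature means registering one more service;
        # the services already registered are not touched.
        self._services[name] = handler

    def call(self, name: str, query: str) -> dict:
        handler = self._services.get(name)
        if handler is None:
            return {"status": "unknown_feature"}
        try:
            return {"status": "ok", "data": handler(query)}
        except Exception:
            # A crash in one service degrades only that feature.
            return {"status": "unavailable"}


gateway = Gateway()
gateway.register("mentions", mentions_service)
gateway.register("reporting", reporting_service)
```

With this wiring, `gateway.call("reporting", "weekly")` reports the feature as unavailable while `gateway.call("mentions", "Acme")` keeps answering normally.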
After a few meetings (with pretty diagrams and smiley faces) to explain how all of this would work, our client eventually approved our plan, and we were on our way.
So, I know all of this stuff sounds great, but what about the nuts and bolts? What are some real-world examples of this actually making such a huge impact?
Did I “mention” we had some problems?
Not long into the project we encountered one of our first major problems. The application processes a massive amount of data, especially during live broadcasts. We had built a service responsible for talking to Facebook’s “firehose” (public feed API) in order to get real-time posts from all public pages, handling anywhere between 1,500–10,000 posts per second. Our first use case for this data was auto-completing Facebook page names when a user wanted to “@mention” them, a common feature you’d expect when using Facebook or Twitter – but a thorny task to complete.
Unfortunately, the database we had selected early in the project, MongoDB, just wasn’t cut out for this specific job. As we started to reach tens of millions of pages, the responses became slower and slower. We tried optimizing our queries and indices and reducing the record sizes, but to no avail. We knew the service would soon stop working. This was a technical hurdle, but potentially a huge derailment for business. Every microsecond counts for user experience, and while this may seem like a small feature, it impacted nearly every user on a daily basis.
When we started building, these mentions weren’t a core feature – but they’d become a requirement for business. Had we been coding in a legacy fashion, fixing this database issue would have required a large-scale reworking of the entire application. Fortunately, because we built everything as services, swapping out our mentions service would not be an uphill battle. We quickly wrote a service that used a different database, Elasticsearch, making sure this new service looked just like the previous one as far as the rest of the system was concerned. Doing so allowed us to swap the old service out without the rest of the system needing any changes. Once we were confident, we flipped the switch and began using the new mentions service. This was all done without any downtime and without needing to change existing code (goals 3 & 4).
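The swap is easier to picture with a small sketch. Everything below is illustrative rather than our production code: `MentionsBackend`, `ScanBackend`, `IndexedBackend` and `autocomplete` are hypothetical names, and two in-memory classes stand in for the original data store and its replacement. What matters is that both expose the same `suggest` contract, so the calling code never changes.

```python
from bisect import bisect_left
from typing import Iterable, List, Protocol


class MentionsBackend(Protocol):
    """The contract the rest of the system depends on (hypothetical)."""

    def suggest(self, prefix: str, limit: int = 5) -> List[str]: ...


class ScanBackend:
    """Stand-in for the original store: checks every page name per query."""

    def __init__(self, page_names: Iterable[str]) -> None:
        self._names = list(page_names)

    def suggest(self, prefix: str, limit: int = 5) -> List[str]:
        p = prefix.lower()
        matches = [n for n in self._names if n.lower().startswith(p)]
        return sorted(matches, key=lambda n: (n.lower(), n))[:limit]


class IndexedBackend:
    """Stand-in for the replacement: a sorted index searched by bisection."""

    def __init__(self, page_names: Iterable[str]) -> None:
        self._index = sorted((n.lower(), n) for n in page_names)

    def suggest(self, prefix: str, limit: int = 5) -> List[str]:
        p = prefix.lower()
        results: List[str] = []
        # Jump to the first key >= prefix, then walk while keys still match.
        for i in range(bisect_left(self._index, (p,)), len(self._index)):
            key, name = self._index[i]
            if len(results) == limit or not key.startswith(p):
                break
            results.append(name)
        return results


def autocomplete(backend: MentionsBackend, typed: str) -> List[str]:
    """Caller code: strips the '@' and never learns which backend it has."""
    return backend.suggest(typed.lstrip("@"))
```

Because `autocomplete` only knows about the contract, "flipping the switch" amounts to handing it a different backend object. Running both side by side on the same input is also a cheap way to build confidence before the cutover.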
We Got a Bad Batch
As with any modern software development project, reporting and analytics are becoming more and more important, and this was no exception to the trend. With the volume of data processed and the need for in-depth user reporting, we had batch jobs running all over the place. The number of batch jobs grew out of an unspoken rule from when we originally started building: services were tied to specific topics. So when we would say things like, “We want our app to talk to Facebook,” we knew we’d build a Facebook service. This approach inadvertently increased the size of our services beyond a manageable scope – eventually, you reach the point where you want to do so many things with the Facebook service that it starts to turn into its own project.
Such was the case with batch jobs for reporting. A few of our services were constantly running batch jobs to manipulate their data for reporting and to populate dashboard metrics. This was beginning to slow down the performance of our primary application. We now had two options: scale up the servers running the services (by adding more CPU and RAM) or break out our batch jobs into their own services. The first option would be more costly and would only be a Band-Aid while the services continued to grow in scope and complexity. So, we opted for option 2 and broke out the batch jobs into their own services (goals 1 & 2).
Thanks to the modular design of our architecture, it only required setting some flags and making a couple of conditional changes in our existing code (fewer than 30 lines per service). Since we kept the internal wiring the same, we were able to deploy the new “batch” services while the original services were running. Again, we were able to make a major change to the system with zero downtime (goal 4).
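As a rough illustration of what "setting some flags" can look like, here is a minimal sketch. The `SERVICE_ROLE` flag and the function names are assumptions made up for this example, not the names we used. The same codebase is deployed twice, and the flag decides whether a given deployment answers requests, runs batch jobs, or does both as it did before the split.

```python
from typing import List, Mapping


def serve_requests(log: List[str]) -> None:
    """Placeholder for the request-handling side of a service."""
    log.append("serving API requests")


def run_batch_jobs(log: List[str]) -> None:
    """Placeholder for the reporting and dashboard batch jobs."""
    log.append("rebuilding report aggregates")


def start(env: Mapping[str, str]) -> List[str]:
    """Start only the work this deployment is responsible for."""
    # SERVICE_ROLE is a hypothetical flag; "all" keeps the pre-split behavior.
    role = env.get("SERVICE_ROLE", "all")
    if role not in ("api", "batch", "all"):
        raise ValueError(f"unknown SERVICE_ROLE: {role!r}")
    log: List[str] = []
    if role in ("api", "all"):
        serve_requests(log)
    if role in ("batch", "all"):
        run_batch_jobs(log)
    return log
```

In a real deployment you would call something like `start(os.environ)` (after `import os`), with `SERVICE_ROLE=batch` set on the new batch instances and `SERVICE_ROLE=api` on the originals once the batch instances are healthy.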
Cache on Delivery
Near the end of our engagement, we already had beta testers using the system and responses were positive. It was time to begin migrating all the legacy accounts to the new system and re-routing the legacy application to use our shiny new backend.
Our first plan of attack was to migrate all of the traffic coming from the legacy application to our new system. Basically, users would still be using the legacy system but it would now leverage the new system to go out and retrieve data instead of having to do it itself. This was a good way to confirm that our new system was ready for prime time. We flipped the switch service-by-service, and we quickly realized that, at scale, some services caused our database to slow down. We were a bit surprised since all of our performance benchmarking led us to believe our current solution would meet performance needs. Unfortunately, our tests did not account for one major factor: the amount of data in the database. As our data grew, the response times became slower at a rate we had not predicted.
Luckily, our services were isolated and so was the data they were responsible for managing. This made it very simple to give each service the ability to cache frequently accessed segments of data. In fact, our first implementation of caching (using Redis) took just 2 hours to code, test and deploy (goal 3). Because of this quick turnaround, we were not forced to switch the legacy system back to handling its own traffic. We had bought ourselves enough time to create a more robust caching solution that would scale for the foreseeable future (goal 1). Again, thanks to keeping our internal wiring the same, all of these changes took place without affecting the other services in the system (goal 3) and without incurring any downtime (goal 4).
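For readers who want to see the shape of that first caching pass, here is a minimal cache-aside sketch. It is an approximation under stated assumptions: `TTLCache` is an in-memory stand-in for Redis so the example runs on its own, and `AccountService`, the key format and the 60-second expiry are all hypothetical. The service checks the cache first, falls back to the database on a miss, and stores the result with an expiry.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a cache server; entries expire after a TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


class AccountService:
    """Hypothetical service that owns its data and its cache."""

    def __init__(self, load_from_db: Callable[[str], dict],
                 cache: TTLCache) -> None:
        self._load_from_db = load_from_db
        self._cache = cache
        self.db_reads = 0  # counted only to make the effect visible

    def get_account(self, account_id: str) -> dict:
        key = f"account:{account_id}"
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        # Cache miss: hit the database once, then remember the answer.
        self.db_reads += 1
        value = self._load_from_db(account_id)
        self._cache.set(key, value)
        return value
```

Since the cache lives entirely inside the service, callers of `get_account` see the same interface as before, which is why a change like this can ship without touching anything else.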
Unforeseen problems will always arise in any web or software development process – it’s just a fact of life. Fostering a culture that accepts change and choosing the right architecture or development methodology allow teams and clients to mitigate the business impact of these problems and minimize the technical debt teams incur throughout the project, whether the problem is adding new features that have become business critical, dealing with scale you didn’t imagine, pushing technologies beyond their limits or just plain getting it wrong.
No matter how hard you try, these things will happen. It is everyone’s responsibility (not just engineers’) to expect these types of principles to be applied. Rather than just telling the team to build a feature, business people must consider questions like:
- How hard will adding a new feature be?
- Will this continue to work if our product takes off?
- How hard is it to change the way X works?
- Does this feature or service track to a concrete business goal?
- Are we building this the right way or the quickest way? (Those aren’t always mutually exclusive, as we’ve seen here.)
These are not just technical problems; they’re everyone’s problems.
As for engineers, my hope is that we can start to have a better understanding that product development is a chaotic place. And while I love predictability in my projects, developers need to accept that chaos doesn’t mean things are without structure – the freedom to adapt when priorities, goals and direction change as often as the weather just requires building a structure tolerant of change. We need to architect solutions that can adapt in ways that don’t severely impact our time, our users or the bottom line.
And I’d encourage business and marketing folks to roll up their sleeves and dive into the meat of a project. Building successful projects requires more than architecture and elegant code – architecting the right strategy requires a team working in sync at every step of the way.