10 second bio...
Howdy folks. I'm Matt and I'm a Systems Engineer.
I grew up (Middle School, High School, College) in the US Military communities in Germany (Heidelberg mostly).
I started my programming/systems career in 1999 and have mostly been doing contract work, with several stints in leadership for startups (CTO, VP Engineering).
Currently I'm focused on training developers to be able to set up and manage their own systems.
Books and courses for developers to learn systems management.
Here are my first books:
My weekly newsletter about the systems orchestration tool Ansible:
The point at which I just list a bunch of braggy stuff (Hi Mom!)...
B.S. in Computer Science from University of Maryland: Europe
Magna Cum Laude
Smith Scholarship / University Scholarship
Projects Featured In
Basecamp: Customer data leakage
DigitalOcean: Inadvertent SSH key reuse
Carnegie Mellon University: SQL injection vulnerability
(and more for actual clients...)
ROI Project Sampler
Note: Where possible, I estimate the return on investment (ROI). Most of the cases are anonymized so I can share the ROI numbers more openly. I've used famous ships names (in quotes) for the anonymous companies.
Background: The University of Maryland (UM) has had contracts with the U.S. military to provide degree programs in Europe since just after WWII. Every 10 years the contract comes up for renewal and UM has to compete against heavyweight contenders like the University of Phoenix to win the new contract. The military had just issued its requirements for a $350 million, 10-year contract and the proposal was due in 2 months.
Problem: The military uses a points system to determine which contender wins the contract. In this new contract, the military allocated 7 out of 100 points to requirements for web applications UM didn't have yet. These were: 1) an online registration system for web-based courses, 2) an online syllabi management system, and 3) an online course scheduling system. The engineering team estimated 6-12 months to complete the applications. Meanwhile, our top competitors already had these applications in place.
Solution: The approach I took was: 1) identify the minimum scope for the apps and 2) bypass any bottlenecks. Once I reduced the scope for the apps to their bare minimum, it was clear that they could be done within the time frame. The main bottleneck I had to bypass was altering the database. UM had a very slow bureaucracy around modifying the database that would have taken months for any updates to be approved, so I chose to use email, existing manual data entry systems, and flat files for the data instead.
Results: All 3 apps were completed and in production use by the University by the time we submitted our bid for the contract. Those additional points made ours the most competitive bid and we won the $350M contract.
ROI: UM paid me less than $20K for this project (I had lower rates back then). The ROI is difficult to calculate on this one, but it was insanely high given they won the $350M contract (yes, that's over 1/3 of a billion!).
Lesson: Ruthlessly reducing scope and bypassing bureaucratic bottlenecks can lead to great speed-ups.
Background: In 2007 Scribd was a very young and growing service. After the founders graduated from YCombinator, I pitched them on having me design and implement their API and they agreed.
Opportunity: Scribd was already growing rapidly, but they could accelerate that dramatically if other applications could integrate with the service directly.
Solution: I researched the top APIs at the time and designed the API with the simplest implementation I could think of. Within 2 months I had finished the design, implementation, and documentation. Then I took another month and wrote the first API clients for Ruby, PHP, and JS.
Results: The API has been a great success and years later remains nearly identical to its original specification. Scribd is now a top 100 site and the #2 most trafficked Ruby on Rails site (after Twitter).
ROI: Scribd paid me less than $30K and the API hasn't required much further investment. Meanwhile, the API has been used to upload millions of documents into their service. A conservative valuation of Scribd puts them over $100M and the API would have contributed at least a few million to that. Therefore, their ROI would be over 100X.
Lesson: Spending a little extra time up front making something as simple and focused as possible can pay off big in the long-term.
Background: Beagle is a multi-billion dollar company. They created a new consumer web app that was storage intensive (several GB per user). The team for the new app included: 4 QA, 10 developers, 2 system admins, and several others in supporting roles.
Problem: The app was plagued with bugs and development was incredibly slow. Developers didn't have development environments and had to compete for a single staging environment to try out their code. The app was overly complex and extremely brittle. There was no documentation for the servers. Each server had been built by hand from the memory of one of the system admins. The app required 5 different server types in order to run. It took 2 days to bring up a new server, so it was going to be nearly impossible to scale the app quickly when load increased.
Solution: It was clear that the biggest problem was chaos in the systems. I used Puppet to script all the servers and also documented all of the systems. It was just a matter of organizing, documenting, and monitoring everything. Soon each developer was able to have their own development environment. New servers took less than 5 minutes to create. I rebuilt the existing servers with the Puppet scripts so that they would all be identical.
Results: Bugs caused by system inconsistencies disappeared. QA was finally able to focus on actual application QA. Developers were able to work about 3 times faster with their own development environments. The app was easily scalable. Morale on the team greatly improved. We were able to launch the app probably 6 months earlier than would have been possible otherwise.
ROI: Very conservatively, let's estimate each developer at $10K/month. For 10 developers that is $100K/month. At triple the speed, that's $200K/month of additional value. For the 4 QA, if we estimate $8K/month for each and that half of their time was previously spent on systems bugs, that's $16K/month of additional value. The 2 system admins were at least $10K/month each and spent at least 3/4ths of their time putting out fires. Now they could focus on actually improving the systems, for additional value of $15K/month.
The value from increased team productivity was at least $2.7 million just in the first year. Beagle paid me about $60K for this project. Without even counting the value of releasing the app 6 months earlier, the ROI is at least 45X.
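For anyone who wants to check the math, here is the Beagle estimate as a small Python sketch. The numbers come straight from the paragraphs above; the variable names are just illustrative, and the whole thing is a back-of-the-envelope estimate, not an audited figure.

```python
# Back-of-the-envelope check of the Beagle ROI estimate.
# All figures are from the text above; names are illustrative only.

developers, dev_cost = 10, 10_000   # 10 developers at $10K/month each
qa_staff, qa_cost = 4, 8_000        # 4 QA at $8K/month each
admins, admin_cost = 2, 10_000      # 2 system admins at $10K/month each

# Triple the speed means two extra teams' worth of output per month.
dev_gain = developers * dev_cost * (3 - 1)

# Half of QA time and three quarters of admin time were freed up.
qa_gain = qa_staff * qa_cost // 2
admin_gain = admins * admin_cost * 3 // 4

monthly_gain = dev_gain + qa_gain + admin_gain
first_year_gain = monthly_gain * 12

fee = 60_000                        # approximate project fee
roi_multiple = first_year_gain / fee

print(f"Monthly gain:    ${monthly_gain:,}")
print(f"First-year gain: ${first_year_gain:,}")
print(f"ROI multiple:    {roi_multiple:.1f}X")
```

That works out to a little over the $2.7 million and 45X quoted above, which is why I call those figures "at least".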
Lesson: Systems chaos can slow a team and its project down drastically. Organizing and automating systems can almost magically solve problems and speed up teams.
Background: Calypso is a rapidly growing user-generated-content (UGC) company. Its users upload many thousands of images per hour.
Problem: As with any UGC company, dealing with adult content is a real problem. Most of the adult content isn't the attractive kind either - it's primarily old men exhibitionists. We wanted the content to be safely browsable and didn't want to have to rely on the user community to do all the filtering for us. The company developers wanted to attack the problem with software. However, that would have taken months of developer time and even then it might not be as accurate as we needed it to be.
Solution: The human brain has evolved over many thousands of years to detect other naked humans. So, I tried to think of the lowest cost ways we could use humans as the filters. Having the workers scan the images one-by-one would have taken ages, so I created a new workflow in the app that would allow the workers to view a page of image thumbnails 100 at a time. Then they could quickly scan the images, flag any adult images, then click "Next". I then put the job on oDesk and tested a bunch of applicants to see which ones were the fastest and most accurate. I found a person in Southeast Asia for $3/hr who was super at it. I hired him and also offered a small bonus for every 10,000 images he reviewed (for extra motivation to go quickly).
Results: The backlog of 1,000,000 images was reviewed in under 3 weeks and for less than $3,000. Then we had our worker continue reviewing content as it came in, which was then a much less intense and cheaper job.
ROI: A software solution to the problem would have been a boondoggle. Google with its thousands of engineers still doesn't do that great at filtering adult images. If we very conservatively estimate that they would only have spent 3 months on a software solution, then that'd be a savings of $90K for 3 developers at $10K/mo. They paid me about $10K for this project, so the ROI just for the first year was at least 9X (but probably many times higher).
Lesson: For some problems, using human processing power is far cheaper than a software solution.
Background: Discovery is a billion dollar retail company and a household name in North America. They were launching a new type of retail store with a selection of products that were unfamiliar to many of their customers. In-store kiosks were going to be an informational resource for customers, explaining the background and purpose of the new products.
Problem: They didn't realize how much they needed this early on, so they didn't recruit me to build it until 4 weeks before they launched the new store locations.
Solution: I simplified the functionality specification of the app until it was something that could be designed, developed, tested, and deployed in under 4 weeks. Then I built the app in collaboration with the designers and content experts.
Results: The app was finished just in time and was deployed successfully with the launch of the new stores.
ROI: This is a bit more difficult to quantify. A rough guess is that it provided at least $200K of value to the store in the first year. They paid me $20K for the project, so that's about 10X ROI.
Lesson: Reduce the functionality scope of an app as dramatically as you can if you have a tight deadline.
Background: Endurance was an iPhone app development boutique with 4 engineers, 1 QA person, and 1 designer.
Problem: Development was very slow. The iPhone apps were all about the same complexity and it was taking 4 months for the team to complete 1 app.
Solution: There were 2 main things that needed to happen. First, organizing the projects so that it was very clear who was doing what and when. Second, augmenting the current staff with outside help so they could offload the tedious work and focus on their core work. For example, the developers were spending a lot of time slicing up design assets to use in the app. So we identified all the non-core tasks that would be easy to offload to others and hired additional staff on oDesk to do that work.
Results: The team was able to produce an app in 1 month instead of 4 months. Morale improved dramatically.
ROI: The monthly cost of the team was about $45K. With the oDesk employees that went up to $55K. However, now an app cost about $55K to produce instead of $180K. The savings alone for the first year were at least $1.5M. They paid me about $80K for the project, for an ROI of at least 18X.
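Here is the same Endurance estimate as a short Python sketch. The costs and timelines are the ones given above; the 12 apps per year is my assumption that the team keeps shipping one app a month at the new pace, and the variable names are just illustrative.

```python
# Rough check of the Endurance savings estimate.
# Costs and timelines are from the text above; apps_per_year is an
# assumption (one app per month, sustained for the whole first year).

team_cost_before = 45_000    # $/month, original team
team_cost_after = 55_000     # $/month, including oDesk help
months_per_app_before = 4
months_per_app_after = 1

cost_per_app_before = team_cost_before * months_per_app_before
cost_per_app_after = team_cost_after * months_per_app_after
saving_per_app = cost_per_app_before - cost_per_app_after

apps_per_year = 12 // months_per_app_after   # assumed sustained pace
first_year_saving = saving_per_app * apps_per_year

fee = 80_000                 # approximate project fee
roi_multiple = first_year_saving / fee

print(f"Cost per app: ${cost_per_app_before:,} -> ${cost_per_app_after:,}")
print(f"First-year saving: ${first_year_saving:,}")
print(f"ROI multiple: {roi_multiple:.2f}X")
```

If the team shipped fewer than 12 apps in the year, the saving would shrink proportionally, so treat this as an upper-pace estimate.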
Lesson: Sometimes you don't need more on-site staff to speed up a team. Instead, you can get them outside help for the tedious tasks so that they can move faster on their core job.
Background: Glomar is an active community site in a popular vertical niche.
Problem: The site was very slow (6 seconds per request) and crashed several times a week. The servers had all been deployed by hand and many intermittent bugs resulted from the inconsistencies. The parent company took months to provision new hardware. The site was losing traffic rapidly because of the poor user experience.
Solution: First was to script all the servers with Puppet and get them all consistent. Second was to identify the bottlenecks in the application and resolve those. It turned out slow database queries and lack of caching were the main problems. Next, I documented everything and set up monitoring for everything that mattered.
Results: Most site requests were now under 1 second. The site was far more stable. It stopped losing traffic and over the next year gained another 2 million uniques per month (from 4M to 6M).
ROI: The site's future was secured by these fixes. It also allowed them to grow quickly and get more lucrative ad campaigns. Because things were more stable, when several of the developers left over the next few months, they didn't need to replace them. I estimate the first year benefit of these changes to be at least $2M. They paid me about $60K for this project for an ROI of at least 33X.
Lesson: Chaos can cripple systems. Organization can save them.
Background: Hunley is a $100M per year revenue company.
Opportunity: They had a small team of workers that spent all day visiting different websites and collecting any updated information in spreadsheets. These were good capable people whose talents were wasted on browsing and copy-and-pasting from the web.
Solution: It took a week to automate the data scraping.
Results: The team of 3 were now able to focus on higher value work for the company.
ROI: The monthly cost of the whole team was about $12K/month. That's $144K per year saved. They paid me about $10K for a first year ROI of about 14X.
Lesson: Simple automation can deliver huge savings.
Background: Intrepid is a B2B company serving Fortune 500 companies.
Problem: Their entire business depended on their systems' uptime. However, the systems had no documentation and were a mystery to the entire development team. The previous system administrator had left and taken all the systems knowledge with him. Due to the complexity and opaqueness of the system, if a server went down it would take at least a week to re-create it. An incident like that could kill the company since the Fortune 500 clients would probably quickly lose confidence and switch to another provider.
Solution: It took about a month to document and script all the systems.
Results: Not long after the systems had been scripted in Chef, one of the servers went down. Fortunately the systems were now replicable and we were able to replace the server quickly. If the new systems scripts hadn't been there, it would have been a catastrophic scenario.
ROI: It's harder to calculate the ROI of this since it was a preventative measure. Being able to replace that one server in that one incident probably saved the company $200K at least. They paid me $20K for this project, so the ROI is at least 10X after just one incident.
Lesson: Scripted and documented systems can protect you from catastrophe.