#3 - Understanding the Data Solution Cost

and its components to minimize it

Jul 14, 2024

In the last two newsletters we explored why knowing the Data Solution Profit is important and how to calculate the value generated by a data solution. You can review both articles here:

#1 - Data Solution Profit (Francesco Marino, June 5, 2024)

#2 - Maximizing the Data Solution Value (Francesco Marino, June 22, 2024)

In the last article we arrived at this definition:

Data Solution Profit = (New value generated + Cost reduced) x Impacting factors – Data Solution cost

Today we will see how to calculate the cost of a data solution and which factors impact it.

Cost of a data solution

The cost of a data solution can be calculated easily:

Data Solution Cost = Development cost + Operations cost

Typically, the Development cost accounts for around 30% of the whole data solution cost. This means it is often better to spend more time making a data solution stable up front than to invest time later in its maintenance.

Let’s now dive deeper into calculating the components of the Data Solution cost.

Development cost

It is simple to calculate the development cost.

Development cost = Time spent on development x Cost of the time spent

But it is difficult to estimate the time required to complete the development of a solution.

There are different ways to estimate it. We will not go into the details here, but you can find more info at these 2 links.

I’ve heard of many estimations going wrong, but they are essential, and we need to get better at making them.

What helped me make more precise estimations was:

looking at my past experiences with similar tasks
looking at the different steps required to go live (without forgetting any)
understanding what I got wrong in my latest estimation.

Let me share an example of how these approaches help me.

Do you need to estimate the time to build a new pipeline?

Look at past experiences: I check how long it took me to build the most similar pipeline and I adjust the estimate according to the differences from the current project.
Look at the different steps: I list the steps required to understand which data I need, which authentication to the data source I should use, how much time I need to develop, to test, to deploy the solution to the different environments, to write the documentation, etc…
I then sum the estimation of all these tasks, plus a buffer, and I get the total time.
Understand what I got wrong in my latest estimation: I know that last time I was too optimistic… so I now add more buffer time, driven by common sense.
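The bottom-up approach above can be sketched in a few lines of Python. This is only an illustration: the task names, the hours, the 25% buffer and the hourly rate are all hypothetical values, not figures from a real project.

```python
# Hypothetical per-task estimates in hours, for illustration only.
TASK_HOURS = {
    "requirements collection": 8,
    "technical analysis": 12,
    "development": 40,
    "testing": 16,
    "deployment": 6,
    "documentation": 8,
}


def estimate_hours(task_hours: dict[str, float], buffer_ratio: float) -> float:
    """Sum the per-task estimates and add a proportional buffer."""
    if buffer_ratio < 0:
        raise ValueError("buffer_ratio must be non-negative")
    return sum(task_hours.values()) * (1 + buffer_ratio)


def development_cost(hours: float, hourly_rate: float) -> float:
    """Development cost = time spent on development x cost of the time spent."""
    return hours * hourly_rate


# Assumed 25% buffer and an assumed rate of 60 per hour.
total_hours = estimate_hours(TASK_HOURS, buffer_ratio=0.25)
total_cost = development_cost(total_hours, hourly_rate=60)
print(f"Estimated effort: {total_hours:.1f} h")
print(f"Estimated development cost: {total_cost:.2f}")
```

Keeping the task list explicit makes it harder to forget a step, and the buffer ratio is the single place to adjust when the previous estimate turned out too optimistic.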

A mistake I've made in the past is forgetting some of the tasks required to go live. To help you avoid the same error, here is a list of tasks that should be done when building a data solution:

Requirements collection
Technical analysis
Approvals on architectural/technical decisions
Development
All the required test phases
Documentation
Training for users
Training for operations team
Bug fixing in the first weeks after the go-live
Communication with stakeholders and end-users

I'll stop here with the suggestions, but I hope the message is clear 😊

Operations cost

Operations cost should include:

Operations cost = Human Resource cost + License cost + Infrastructure cost

Human Resource cost

The Human Resource cost component can be further divided into:

Human Resource cost = Production Support cost + Maintenance cost + Administration cost

Production Support cost

It includes the expenses of the technical staff who ensure the platform is stable and functional. They solve issues when the platform is not working or when a user has a technical problem.

Maintenance cost

It is the cost of the housekeeping work that keeps the data solution stable. A few examples: renewing a certificate that is about to expire, or cleaning up disk space when this is not automated.

Administration cost

It is the cost of making sure the application still satisfies business and security requirements. Examples are periodically performing cybersecurity tests on your systems or reviewing the access granted to the data solution.

I want to highlight here the importance of automating operations and developing a maintainable platform to reduce the data solution cost.

It doesn’t bring any value to give authorization to each user one by one manually if this process can be automated.

It doesn’t bring any value to solve the same issue every day for the same reason. It should be solved once and disappear.

It doesn’t bring any value to explain how the application works or how to solve a recurrent issue again and again to new colleagues only because the documentation is missing.

Of course, automating tasks or writing documentation requires time… But often this time is a good investment.

To keep it simple:

Maintainability should often be a priority when you are building a data platform.

License costs

This is the cost we pay for the licenses of the tools we use. Knowing the pricing and the licensing model is important to keep this cost under control.

Let’s take another example: I have a Business Intelligence platform that will be used by many users. In this case it might be better to pay a fixed price per year rather than a per-user license.

We just need to remember to perform an analysis of the license price and model before we start the project.
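A license analysis like this often comes down to a break-even calculation. Here is a minimal sketch, assuming only two pricing models (a flat yearly price and a yearly price per user); the prices are hypothetical and real vendor pricing usually has more tiers and conditions.

```python
import math


def per_user_cost(users: int, price_per_user: float) -> float:
    """Yearly cost under a per-user licensing model."""
    return users * price_per_user


def cheaper_model(users: int, price_per_user: float, flat_price: float) -> str:
    """Return which of the two models costs less for the given number of users."""
    return "flat" if flat_price < per_user_cost(users, price_per_user) else "per-user"


def break_even_users(price_per_user: float, flat_price: float) -> int:
    """Smallest number of users at which the flat price is no more expensive."""
    return math.ceil(flat_price / price_per_user)


# Hypothetical prices: 120 per user per year vs. a flat 30,000 per year.
threshold = break_even_users(price_per_user=120, flat_price=30_000)
print(f"Break-even at {threshold} users")
print(cheaper_model(users=400, price_per_user=120, flat_price=30_000))
print(cheaper_model(users=100, price_per_user=120, flat_price=30_000))
```

Comparing the expected number of users against the break-even threshold before the project starts makes the choice of licensing model an explicit decision.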

Infrastructure cost

This is the cost we pay for the infrastructure we use (e.g. databases, Virtual Machines).

One thing to keep in mind here is:

A specific architecture or technical approach can lead to considerable costs or savings.

Let’s look at an example here too: if I pay for the time my ETL compute layer is in use, it might be smart to reduce that time by reducing the amount of data I transform. I can do this by applying transformations only to the new data I received and not to the whole dataset.
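One common way to process only new data is to keep a "watermark" (the load timestamp of the last processed row) between runs. The sketch below assumes each row carries a `loaded_at` timestamp; the field names and the `transform` step are hypothetical placeholders.

```python
from datetime import datetime


def transform(row: dict) -> dict:
    """Placeholder transformation: convert cents to a decimal amount."""
    return {**row, "amount": row["amount_cents"] / 100}


def run_incremental(rows: list[dict], watermark: datetime) -> tuple[list[dict], datetime]:
    """Transform only rows loaded after the watermark and return the new watermark."""
    new_rows = [r for r in rows if r["loaded_at"] > watermark]
    transformed = [transform(r) for r in new_rows]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((r["loaded_at"] for r in new_rows), default=watermark)
    return transformed, new_watermark


# Hypothetical source data; the previous run processed everything up to July 2.
rows = [
    {"id": 1, "amount_cents": 1250, "loaded_at": datetime(2024, 7, 1)},
    {"id": 2, "amount_cents": 990, "loaded_at": datetime(2024, 7, 2)},
    {"id": 3, "amount_cents": 4300, "loaded_at": datetime(2024, 7, 3)},
]
transformed, watermark = run_incremental(rows, watermark=datetime(2024, 7, 2))
print(f"Transformed {len(transformed)} of {len(rows)} rows; new watermark: {watermark}")
```

In a real pipeline the watermark would be persisted between runs and the filter would be pushed down to the source query, so the compute layer never reads the old rows at all.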

Honestly, infrastructure cost is often one of the least expensive components… but it is also one where we can have more impact, and one that is often challenging to optimize once the data solution is live.

Short note: there are cases where infrastructure cost can be significant, especially when handling Big Data.

Factors that impact the data solution cost

We saw the components of the Data Solution cost, but we also need to consider the factors that impact it. Let’s review a few of them:

Factors driven by business decisions:

Functional requirements:

The more complex a requirement is to fulfil, the more it costs. Sometimes a complex requirement is the only way to get the desired result for our business. But sometimes it is not.

Simplifying a requirement while still bringing a similar value to our business is one of the most powerful tools we have to improve the value of our data solution.

For example, users want to see the data hosted in three different data sources. One of the data sources is complex to onboard for an XYZ reason and its usage increases the cost of the solution by 30%. We also discover that the data from this data source is used only once every quarter and only for 5% of the cases in scope.

It might still be important to get this data and, in any case, the final users’ representative decides. But we should at least stop and discuss whether we can, for now, put this data source out of the project’s scope.

Data solution availability requirement:

Keeping a data solution available 99.99% of the time requires great error handling practices, better infrastructure (e.g. redundancy, more frequent backups) and more resources invested in application monitoring. Satisfying this requirement has a cost… a bigger cost than satisfying a 99% platform availability. But having a data solution down can also be expensive.

We just need to perform a good assessment according to our business requirements and make sure we respect it.
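To make the difference between the two targets concrete, it helps to translate them into allowed downtime per year: 99% allows roughly 87.6 hours, while 99.99% allows only about 53 minutes. A minimal sketch of that arithmetic, assuming a 365-day year and availability measured around the clock:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime that still meets the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for target in (99.0, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.0f} min)")
```

Multiplying the allowed downtime by an estimated cost per hour of outage gives a rough figure to compare against the extra cost of reaching the stricter target.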

Other platform requirements that impact the data solution development:

There are other factors that influence the development of our data solution. We will not go into the details, but I want you to keep these factors in mind.
The factors are: geolocation requirements of the data center, number and location of users, performance of the systems.

Technical factors:

Quality of code:

The quality of the code is a fundamental factor that impacts the data solution. Just imagine a data pipeline with zero error handling that processes Excel files provided by different users… I cannot imagine any other result than the IT support team trying to get the pipeline working every day.

Keeping good code quality means developing a solution that works, that is maintainable, scalable, secure and efficient.

My data pipeline should run securely and efficiently, and it should not fail. When it fails, I should understand why quickly, and I should be able to fix it easily. This is easy to say, but it requires a proper setup across the whole development process and good developers.
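As an illustration of what "understanding why quickly" can look like, here is a minimal sketch of per-file error handling. It assumes the files have already been read into lists of dictionaries (reading the Excel files themselves is out of scope here), and the column names and file names are hypothetical.

```python
import logging

logger = logging.getLogger("pipeline")

REQUIRED_COLUMNS = {"customer_id", "amount"}  # hypothetical expected schema


class ValidationError(Exception):
    """Raised when an input file does not match the expected schema."""


def validate_rows(rows: list[dict]) -> list[dict]:
    """Return cleaned rows, or raise ValidationError naming the offending row."""
    cleaned = []
    for i, row in enumerate(rows, start=1):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValidationError(f"row {i}: missing columns {sorted(missing)}")
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            raise ValidationError(f"row {i}: amount {row['amount']!r} is not a number") from None
        cleaned.append({"customer_id": str(row["customer_id"]), "amount": amount})
    return cleaned


def process_files(files: dict[str, list[dict]]) -> tuple[dict, dict]:
    """Process every file; record failures with a reason instead of stopping the run."""
    loaded, rejected = {}, {}
    for name, rows in files.items():
        try:
            loaded[name] = validate_rows(rows)
        except ValidationError as exc:
            rejected[name] = str(exc)
            logger.warning("Rejected %s: %s", name, exc)
    return loaded, rejected


# Hypothetical inputs from three different users.
files = {
    "sales_alice.xlsx": [{"customer_id": 1, "amount": "10.5"}],
    "sales_bob.xlsx": [{"customer_id": 2, "amount": "n/a"}],
    "sales_carol.xlsx": [{"customer": 3, "amount": 4}],
}
loaded, rejected = process_files(files)
print(f"Loaded: {sorted(loaded)}")
print(f"Rejected: {rejected}")
```

One bad file no longer blocks the others, and the rejection message tells the support team (or the user who sent the file) exactly what to fix.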

Good architecture:

A perfect data platform architecture doesn’t exist, but I can have one that satisfies my requirements.

Making a wrong decision at the architecture level of a data solution can reduce its value or increase its cost.

A simple example might be a data platform that hosts a large amount of data without following proper performance or archival best practices. This means poor dashboard performance, which means a lower adoption rate, which means a lower value of our data solution. Or it means a bigger cost due to a huge amount of data that I simply do not use.

Summary

The cost of a data solution is composed of the Development cost and the Operations cost. Operations cost often accounts for around 70% of the Data Solution cost. This means that the maintainability of a data solution is often important to reduce its overall cost.

Just to remember:

Data Solution Cost = Development cost + Operations cost

Development cost = Time spent on development x Cost of the time spent

Operations cost = Human Resource cost + License cost + Infrastructure cost
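The three formulas above can be expressed as a small cost model. This is a sketch only: the class and field names are my own choice, and the figures are hypothetical, picked so that operations end up at 70% of the total.

```python
from dataclasses import dataclass


@dataclass
class HumanResourceCost:
    production_support: float
    maintenance: float
    administration: float

    @property
    def total(self) -> float:
        return self.production_support + self.maintenance + self.administration


@dataclass
class OperationsCost:
    human_resources: HumanResourceCost
    licenses: float
    infrastructure: float

    @property
    def total(self) -> float:
        return self.human_resources.total + self.licenses + self.infrastructure


def data_solution_cost(development: float, operations: OperationsCost) -> float:
    """Data Solution Cost = Development cost + Operations cost."""
    return development + operations.total


# Hypothetical figures, for illustration only.
ops = OperationsCost(
    human_resources=HumanResourceCost(
        production_support=30_000, maintenance=10_000, administration=5_000
    ),
    licenses=15_000,
    infrastructure=10_000,
)
total = data_solution_cost(development=30_000, operations=ops)
print(f"Total cost: {total}")
print(f"Operations share: {ops.total / total:.0%}")
```

Tracking each component separately makes it visible which part of the operations cost is worth attacking first.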

As we discussed, there are also factors that impact the cost of the data solution: functional requirements, data solution availability requirement, other platform requirements, code quality and data solution architecture. We need to take these factors into consideration when building a new platform to minimize its cost.

Series Summary

Consolidating what we saw today with the previous articles, we can say:

Data Solution Profit = Data Solution Value – Data Solution Cost
Data Solution Value = (New value generated + Cost reduced) x Value Impacting Factors
Data Solution Cost = (Development cost + Operations cost) x Cost Impacting Factors
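Put together, the three formulas can be sketched as follows. Treating the impacting factors as simple multipliers is how the formulas are written, but the specific factor values and amounts below are hypothetical.

```python
def data_solution_value(new_value: float, cost_reduced: float, value_factor: float) -> float:
    """Data Solution Value = (New value generated + Cost reduced) x Value Impacting Factors."""
    return (new_value + cost_reduced) * value_factor


def cost_with_factors(development: float, operations: float, cost_factor: float) -> float:
    """Data Solution Cost = (Development cost + Operations cost) x Cost Impacting Factors."""
    return (development + operations) * cost_factor


def data_solution_profit(value: float, cost: float) -> float:
    """Data Solution Profit = Data Solution Value - Data Solution Cost."""
    return value - cost


# Hypothetical figures: only half of the potential value is realized,
# and the cost impacting factors add 25% to the base cost.
value = data_solution_value(new_value=80_000, cost_reduced=40_000, value_factor=0.5)
cost = cost_with_factors(development=12_000, operations=28_000, cost_factor=1.25)
profit = data_solution_profit(value, cost)
print(f"Value: {value}, Cost: {cost}, Profit: {profit}")
```

Even with rough inputs, running the numbers this way shows how sensitive the profit is to each factor.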

For more details about the factors and the components, see the respective articles:

#1 - Data Solution Profit

#2 - Maximizing the Data Solution Value

The next chapter of this article is here:

#4 - The business case of a data solution (Francesco Marino, August 3, 2024)

In the next newsletter, we'll discuss a potential real-life example of Data Solution Profit calculation. It will be the last newsletter about this topic, at least for now.

If there's something specific you'd like to hear about, let me know.

To the next one,
Francesco

Do you want to give feedback about Better at Data? 📝 Do it here
