Mark Van de Wiel

APIs are a great way to share data and, especially in Software-as-a-Service (SaaS) environments, they have become a de facto “standard” for integrating data. I put “standard” in quotes because there are few actual standards for APIs. Most APIs adhere to the architectural properties defined by REST, but beyond that there is almost unlimited freedom to define what an API looks like and does. For example, a SaaS company’s API may let an organization push and pull data as desired, effectively giving access to all of the organization’s data in the service, or the API (or a separate API) may be designed to return only the data that changed since the organization last pulled.

There are some important considerations from the perspective of both the API provider (e.g. the SaaS company) and the organization using the API (the consumer). I discuss them in this blog post, focusing on pulling rather than pushing data.

Last month I discussed different approaches to performing Change Data Capture (CDC). As data volumes grow, and with them the requirement to access data continuously and in real time, it becomes more important that changes are captured incrementally.

For an organization that consumes data through an API the trade-off is quite simple: does the API provide change data in a usable format, or not? If the API doesn’t provide change data, then it is up to the organization to decide what to do next:

  • Always pull all data and simply overwrite the copied data each time new data is retrieved.
  • Implement a change data capture strategy outside the API. There may be last_updated fields (or similar) that reliably identify changed data. In the extreme case, and to catch deletions as well, all data may be pulled into a staging environment on a regular basis and compared there with the previous copy to derive the changes.
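
The two variants of the second option can be sketched roughly as follows. This is a minimal illustration in Python, not a reference implementation: the row layout, the `id` key, the `last_updated` field stored as a `datetime`, and both function names are assumptions made for the example.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def changed_since(rows: List[Row], watermark: datetime) -> List[Row]:
    """Return rows whose (assumed) last_updated field is newer than the watermark.

    This cannot detect deletes: a deleted row is simply absent from the pull.
    """
    return [row for row in rows if row["last_updated"] > watermark]


def diff_snapshots(
    previous: Dict[str, Row], current: Dict[str, Row]
) -> Tuple[List[Row], List[Row], List[str]]:
    """Compare two full pulls, both keyed by primary key.

    Returns (inserted rows, updated rows, keys of deleted rows).
    """
    inserts = [row for key, row in current.items() if key not in previous]
    updates = [
        row for key, row in current.items()
        if key in previous and previous[key] != row
    ]
    deletes = [key for key in previous if key not in current]
    return inserts, updates, deletes


# Hypothetical staging copies from two consecutive full pulls.
old = {"1": {"id": "1", "name": "Ann"}, "2": {"id": "2", "name": "Bob"}}
new = {"2": {"id": "2", "name": "Bobby"}, "3": {"id": "3", "name": "Cy"}}
inserts, updates, deletes = diff_snapshots(old, new)
```

The timestamp filter is cheap but depends on the field being maintained reliably, while the snapshot comparison catches deletes at the cost of pulling and storing everything on every refresh.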

In either case, how frequently the refresh should be done is a business requirement, but how frequently it can realistically be done is an implementation question that depends on numerous factors, including the data volume, the performance of the API and, in some cases, the cost of using the API. From the API consumer’s point of view it would be both better and easier if the API provided incremental change data in an easy-to-consume format.
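
A back-of-the-envelope calculation shows how these factors bound the refresh frequency. The sketch below assumes a paginated API with a fixed page size, a roughly constant time per request and a daily request quota; all of these, and every number used, are made up for illustration and do not describe any particular service.

```python
import math


def full_refresh_seconds(
    total_rows: int, rows_per_page: int, seconds_per_request: float
) -> float:
    """Time for one sequential full pull, ignoring retries and processing."""
    pages = math.ceil(total_rows / rows_per_page)
    return pages * seconds_per_request


def max_refreshes_per_day(
    total_rows: int, rows_per_page: int, daily_request_quota: int
) -> int:
    """How many full pulls fit within an (assumed) daily request quota."""
    pages = math.ceil(total_rows / rows_per_page)
    return daily_request_quota // pages


# Illustrative figures only: 1,000,000 rows, 500 rows per page,
# 0.25 s per request, 10,000 requests allowed per day.
print(full_refresh_seconds(1_000_000, 500, 0.25))   # 500.0
print(max_refreshes_per_day(1_000_000, 500, 10_000))  # 5
```

With these figures a full pull takes more than eight minutes and can run only five times a day, so a requirement for near-real-time data could not be met by full refreshes alone.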

As an API provider you can take the relatively easy route of allowing unrestricted access to all data and leaving it up to the consumer to perform change data capture as needed, or you can go the extra mile and develop the API to deliver change data itself. In the long run the latter approach may be much better. Consider this:

  • When consumers pull all of their organization’s data in order to perform change data capture outside of your environment, the result may be significant extra load on your system. This challenge increases with growing data volumes and real-time requirements. If you are a SaaS provider, this may mean that you have to invest more in your hosted solution in order to provide acceptable performance to your clients, which translates into extra costs.
  • Your customers may end up unhappy if they cannot retrieve the data they want in a timely manner, which may cause them to look for alternative solutions. This could result in lost revenue.
  • Providing a change data capture interface lets you control the implementation, so you get to decide what works best and scales well for the environment(s) you manage. That implementation may even change over time, which is a major benefit of APIs: the underlying implementation does not matter as long as the API stays the same.
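
As an illustration of the last point, here is a minimal in-memory sketch of what a provider-side change interface could look like. The `ChangeFeed` class, its method names and the integer cursor are hypothetical and chosen for the example; a real service would back this with its own storage and expose it over its own API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Change:
    seq: int                       # monotonically increasing position in the feed
    op: str                        # "insert", "update" or "delete"
    key: str                       # primary key of the affected row
    row: Optional[Dict[str, Any]]  # new row image, None for deletes


class ChangeFeed:
    """Hypothetical change log that consumers read by cursor."""

    def __init__(self) -> None:
        self._log: List[Change] = []

    def record(self, op: str, key: str, row: Optional[Dict[str, Any]] = None) -> Change:
        """Append one change; called by the provider whenever data is modified."""
        change = Change(seq=len(self._log) + 1, op=op, key=key, row=row)
        self._log.append(change)
        return change

    def changes_since(self, cursor: int = 0, limit: int = 100) -> Tuple[List[Change], int]:
        """Return up to `limit` changes after `cursor`, plus the cursor to pass next time."""
        batch = [c for c in self._log if c.seq > cursor][:limit]
        next_cursor = batch[-1].seq if batch else cursor
        return batch, next_cursor


feed = ChangeFeed()
feed.record("insert", "42", {"name": "Ann"})
feed.record("update", "42", {"name": "Anna"})
feed.record("delete", "42")

batch, cursor = feed.changes_since(0, limit=2)   # first two changes
rest, cursor = feed.changes_since(cursor)        # only the delete remains
```

The consumer only ever sees `changes_since` and a cursor, so the provider is free to replace the list with a database table, a log reader or anything else without the consumer noticing.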

API – Always Pretty Inefficient? Absolutely not! APIs are a very powerful way to share data, but I think they should be well thought through and, in the end, provide what the consumers are looking for. In my (strongly biased) opinion, they are looking for change data.

To discuss your environment, your use case and whether HVR is a fit for your organization, contact us.

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.
