From Big Data to Long Data in Asset Management

One of the aspects that differentiates humans from other living species is our understanding and acceptance, albeit reluctant, of the passing of time. We picture our lives along a horizontal time axis running from the past, through the present, to the future. That future is not infinite, quite the contrary. It is the finality of our existence that defines our human condition. Since the beginning, humans have therefore traced their experiences in time, using diaries to record where they were and what they did.

All major strands of human thought studied the concept of time. Religion, philosophy and of course science have attempted to grant us a more accurate, circular and relative understanding of the meaning of time. Time traverses all of the concepts we have created to explain our reality. It is the great constant and the ultimate judge of the sustainability of all human action and creation.

Computer science has not been absent from that debate and continues to produce meaningful and often controversial research on how the concept of time should be replicated in software. The assumption is simple: if time is so prevalent in our lives, then it would logically follow that software would not only reflect that reality but digitalize it (analog to digital).

Computer science, like all sciences, builds models to mirror our reality. This is why the consideration of time as a dimension of data is so critical. A data model built without taking the temporal nature of data into account will fall short of what users and consumers of that data expect.

The objective of any database model (relational or graph) is to organize data in such a way that it can be queried by a human or another system. The purpose is to provide information, insight, knowledge and potentially meaning to a person who is searching for that data. Anyone tasked with designing a database should ask themselves a question that is unfortunately unanswerable, yet needs to be asked: What question could a human ask of the dataset? What could they potentially want to know? You will agree with me that there is no exhaustive answer to that question. But, whatever we do, time must be one of the concepts that a database should try to replicate. I emphasize the word try because the temporal nature of data is so complex that computer science has never solved it perfectly.

Let’s start with some theory.

Scientists use technical language to communicate within their community: call it a glossary with agreed definitions of concepts. The first important concept on time and data is the concept of valid time (VT). “Valid time of a fact is the time when the fact is true in the modeled reality. A fact may be associated with any number of instants and time intervals”.[1]

An example of a fact in asset management is the Net Asset Value (NAV) of a fund on the 18th of January of this year. That value represents a concept and has meaning. But the concept implies that the NAV has a valid time from the 18th of January until a new NAV is computed (most likely on the 19th of January).

A database with a temporal focus will be able to chart the change of the NAV on a vertical axis with time on a horizontal axis.

In this example we have only one dimension of time. The valid time can lie in the past, it can be current (valid now, possibly from now to infinity), or it can lie in the future. Future effective dates are very common in our industry.
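To make the idea of valid time concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class NavFact, the function nav_on, the fund name, the year and the NAV figures are all hypothetical and do not come from any real system. Each fact carries the interval during which it is true, and an open end stands for "from now to infinity".

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class NavFact:
    """A NAV together with its valid-time interval (hypothetical model)."""
    fund: str
    nav: float
    valid_from: date           # first day on which the value is true
    valid_to: Optional[date]   # exclusive end; None means "until further notice"


def nav_on(facts: List[NavFact], fund: str, day: date) -> Optional[float]:
    """Return the NAV whose valid-time interval contains `day`, if any."""
    for fact in facts:
        starts_in_time = fact.valid_from <= day
        still_valid = fact.valid_to is None or day < fact.valid_to
        if fact.fund == fund and starts_in_time and still_valid:
            return fact.nav
    return None


# Illustrative figures only: the NAV of the 18th is superseded on the 19th.
facts = [
    NavFact("Fund A", 101.20, date(2024, 1, 18), date(2024, 1, 19)),
    NavFact("Fund A", 101.45, date(2024, 1, 19), None),
]

nav_on_the_18th = nav_on(facts, "Fund A", date(2024, 1, 18))
nav_a_week_later = nav_on(facts, "Fund A", date(2024, 1, 25))
```

Charting the NAV over time then amounts to calling nav_on for each day on the horizontal axis.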

We can now introduce a second dimension of time which is called Transaction Time (TT).

“A database fact is stored in a database at some point in time, and after it is stored, it is current until logically deleted”[2]. This means that the transaction time is the time when the data (the fact) was entered or serialized into the database.

A database model that considers two time dimensions is called a bi-temporal database model.

In some cases, VT and TT will be identical: a NAV is calculated today, is valid from today and is stored in the database today. However, an investment strategy may be inserted into the product database today but only become applicable next week. Hence VT is Today + X Days and TT is Today.
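The two cases can be sketched in a few lines of Python. This is a hedged illustration, not a reference design: BitemporalFact, record, the keys, the values and the dates are hypothetical names and figures chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BitemporalFact:
    """A fact stamped with both time dimensions (hypothetical model)."""
    key: str
    value: str
    valid_from: date    # VT: when the fact becomes true in the modeled reality
    recorded_on: date   # TT: when the fact was stored in the database


def record(key: str, value: str, today: date,
           effective_in_days: int = 0) -> BitemporalFact:
    """Store a fact today that takes effect today or X days from now."""
    return BitemporalFact(
        key=key,
        value=value,
        valid_from=today + timedelta(days=effective_in_days),
        recorded_on=today,
    )


today = date(2024, 1, 18)  # illustrative date

# NAV: calculated, valid and stored on the same day, so VT equals TT.
nav = record("Fund A / NAV", "101.20", today)

# Strategy: stored today, applicable next week, so VT is Today + 7 days.
strategy = record("Fund A / strategy", "revised strategy", today,
                  effective_in_days=7)
```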

A relevant question is therefore the following: why did computer scientists feel it necessary to build data models with two time dimensions? One answer is that humans are prone to making mistakes. A mistake made by a system is just the direct consequence of a conceptual error and its execution by a human being. Because we make mistakes, we then retroactively correct data (facts) in our database (the repository of facts).

A real example will make this much clearer. Asset managers distribute data to vendors and platforms on a daily basis. So let’s imagine that on the 18th of February an asset manager distributes an ongoing charge of 1.67% for a specific share class. On the 20th of February this value is corrected from 1.67% to 1.85%, retroactively to the 18th of February. A distributor will have received the wrong value, but the right value will be in the database. Operations will need not just to understand why this happened but also to produce evidence of it and potentially re-send corrected data to the distributor.

At this stage the need for bi-temporality in data management becomes highly relevant. Time travel needs to be enabled, a form of back to the future. The database will need to answer the following question: please show me all the ongoing charge values valid on the 18th of February, given my knowledge of the world on the 18th of February. Only two dimensions of time linked to data allow organizations to travel back in time: Valid Time and Transaction Time.
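The ongoing charge example can be replayed as a small Python sketch. It assumes an append-only store in which a correction is a new row rather than an overwrite. ChargeRow, as_of, the share class name and the year are hypothetical, and the selection rule (latest valid-from date, then latest recording date) is a simplification that suits this example rather than a general bi-temporal algorithm.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ChargeRow:
    """One recorded version of an ongoing charge (hypothetical model)."""
    share_class: str
    ongoing_charge: float   # in percent
    valid_from: date        # VT
    recorded_on: date       # TT


def as_of(rows: List[ChargeRow], share_class: str,
          valid_on: date, known_on: date) -> Optional[float]:
    """Value valid on `valid_on`, as the database knew it on `known_on`."""
    candidates = [
        r for r in rows
        if r.share_class == share_class
        and r.valid_from <= valid_on     # already in effect
        and r.recorded_on <= known_on    # already recorded at that time
    ]
    if not candidates:
        return None
    # Prefer the most recent effective date, then the most recent recording.
    best = max(candidates, key=lambda r: (r.valid_from, r.recorded_on))
    return best.ongoing_charge


# The correction of the 20th is appended; the original row is never overwritten.
rows = [
    ChargeRow("Share Class X", 1.67, date(2024, 2, 18), date(2024, 2, 18)),
    ChargeRow("Share Class X", 1.85, date(2024, 2, 18), date(2024, 2, 20)),
]

# What was distributed on the 18th, with the knowledge of the 18th.
sent_to_distributor = as_of(rows, "Share Class X",
                            date(2024, 2, 18), date(2024, 2, 18))

# What we now know to have been true on the 18th.
corrected_value = as_of(rows, "Share Class X",
                        date(2024, 2, 18), date(2024, 2, 20))
```

Because nothing is overwritten, both answers stay available: 1.67 is what the distributor received, and 1.85 is what should be re-sent.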

The overwhelming majority of data systems used in asset management today cannot answer the above question.

Today, because the cost of computer storage has fallen so significantly, in parallel with the increase in computing power and the cloud revolution, the bi-temporal approach to data management not only makes sense (it always did), it is actually affordable.

The advances in computer science will allow asset managers to overcome past system limitations to finally understand data in a historical context. This development is of huge significance for our industry.

Current and future compliance requirements applicable to investment managers and their service providers justify the need for bi-temporal data management. Errors are a symptom of human activity, and regulators will demand that the data provided reflect the world as it was when the error occurred or when the decision was taken.

Data has a temporal dimension. This is what I am trying to express by bringing forward the concept of long data as opposed to big data. The term long data was coined by Samuel Arbesman in the magazine Wired in 2013.[3]

Becoming data savvy will require asset management leaders to develop a deeper understanding of the concept of time linked to data. Data is not just current. It is created, recorded, altered, changed for the future, retired and even deleted. In its temporal dimension, data is akin to our human existence. But yes, data can be current from today to infinity - alas!

[1] SIGMOD Record, Vol. 23, No. 1, March 1994, page 53
[2] SIGMOD Record, Vol. 23, No. 1, March 1994, page 53
[3] Samuel Arbesman, Wired, 2013