Adṛśya — What if you could simplify Data Sharing Challenges?

Mondweep Chakravorty
Mar 19, 2019


If you are a technology innovation company, you will relate to the challenge of getting your hands on the right data needed to fuel your innovation drive. If you are a major institution that wants to promote innovation, you can relate to the various regulatory restrictions in place (for the right reasons) to prevent sensitive data from being compromised. Even when data sharing agreements are signed, it is not always possible to maintain full control, which means data is often masked or redacted before being shared with a third party.

Example of redacted data
Example of masked data

As you would expect, masking data limits the degree of product refinement. This is particularly true where a key product output is analytics, which is increasingly relevant in a world of cognitive AI applications. There has been significant research across the world into optimising the relationship between data utility (i.e. retaining the properties of the underlying data) and data privacy. From our high school mathematics lessons, we know that key summary properties of a data set include the mean, median and mode. Statistical techniques also let us proxy or represent qualitative data such as categorical variables (e.g. gender, direction, occurrence of certain events), for example through the use of 'dummy variables'. If we have a representative sample, we can use these properties to build analytical toolkits with the right level of predictive power. When sharing a dataset, algorithms could replace sensitive data with the underlying properties (e.g. averages/means) describing the entire data set. So why isn't this used more widely in practice?
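As a toy illustration of the idea (entirely hypothetical data, plain Python standard library): summary properties can stand in for a sensitive numeric column, and a categorical variable can be encoded as dummy variables:

```python
from statistics import mean, median, mode

# Hypothetical salary data for a small team (a sensitive attribute)
salaries = [42000, 48000, 51000, 48000, 60000]

# Summary properties that describe the data set as a whole
print(mean(salaries))    # 49800
print(median(salaries))  # 48000
print(mode(salaries))    # 48000

# Before sharing, replace each sensitive value with the group mean
shared_salaries = [mean(salaries)] * len(salaries)

# Encode a categorical variable (here, gender) as dummy variables
genders = ["F", "M", "F", "M", "M"]
categories = sorted(set(genders))  # ["F", "M"]
dummies = [[1 if g == c else 0 for c in categories] for g in genders]
print(dummies[0])  # [1, 0] -> "F"
```

The shared column preserves the mean exactly, but an analyst can no longer see any individual's salary, which is precisely the utility-versus-privacy trade-off described above.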

A face is exposed……

To answer this, let me tell you the story of AOL from 2006. "For the greater good", AOL 'anonymised' and released 20 million search queries from 650k users, covering three months of activity. Researchers took the data set and, using triangulation techniques across a series of search queries from a single user (#4417749), such as 'numb fingers', '60 landscapers in Lilburn, Ga' and 'dogs that urinate on everything', exposed a face: that of Thelma Arnold, a sixty-two-year-old widow! Read more about the story here. Over the following 13 years, with the proliferation of our data across social media, it has arguably become much easier for data from different sources to be triangulated and individual privacy compromised. Hence the increasing concern about, and restrictions on, data sharing are justified.

I now take the opportunity to introduce Adṛśya, a product innovation that simplifies the data sharing challenge by allowing organisations to build controls over shared data sets that protect individual privacy while still allowing meaningful, relevant insights to be drawn through data analytics. Before I describe how Adṛśya achieves this and the use cases it fits, I would like to expand on three key data concepts that can protect or compromise individual privacy.

Personal identifier — Attributes that uniquely identify an individual: eg someone's name, phone number

Quasi-identifier — Attributes which, if combined with other publicly available datasets, can uniquely identify an individual — eg address, race

Sensitive attribute — Results, conditions related to an individual — eg specific disease, salary, search queries, online ratings

If someone is able to triangulate publicly available data to gain access to such information, they can exploit it ("inference-based attacks") without necessarily having the individual's consent to use that information. In an illustrative example, the quasi-identifiers age, sex and postcode from a patient database were correlated with public datasets to identify that Bob suffers from bronchitis! This would allow, for example, a drug company to advertise respiratory treatment directly to Bob, without his realising how they managed to reach him!
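The attack above is essentially a table join on quasi-identifiers. A minimal sketch (all names and records hypothetical) of such a linkage attack:

```python
# Toy linkage ("inference-based") attack on hypothetical data.
# A "de-identified" patient table still carries quasi-identifiers.
patients = [
    {"age": 34, "sex": "M", "postcode": "SW1A", "disease": "Bronchitis"},
    {"age": 51, "sex": "F", "postcode": "E14",  "disease": "Diabetes"},
]

# A public data set (e.g. an electoral roll) links names to the same
# quasi-identifier combinations.
public = [
    {"name": "Bob",   "age": 34, "sex": "M", "postcode": "SW1A"},
    {"name": "Alice", "age": 51, "sex": "F", "postcode": "E14"},
]

QUASI = ("age", "sex", "postcode")

def link(patients, public):
    """Join the two tables on quasi-identifiers to re-identify patients."""
    matches = {}
    for person in public:
        for record in patients:
            if all(person[q] == record[q] for q in QUASI):
                matches[person["name"]] = record["disease"]
    return matches

print(link(patients, public))  # {'Bob': 'Bronchitis', 'Alice': 'Diabetes'}
```

No personal identifier was ever shared, yet every sensitive attribute is recovered: this is why anonymisation must treat quasi-identifiers, not just names and phone numbers.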

Adṛśya has been developed to prevent shared data from being subjected to such attacks. It uses its proprietary 'overlapping bucket' algorithm, which allows organisations to protect the personal identifiers, quasi-identifiers and sensitive attributes of a data set before sharing it with a third party. The 'overlapping buckets' algorithm minimises the 'distance' between the quasi-identifiers of a data set, while at the same time ensuring that the probability of identifying sensitive attributes is kept within strict limits. Adṛśya also provides a selection of other industry algorithms, including k-anonymisation, l-diversity, t-closeness and beta-likeness, for organisations to suit their specific anonymisation needs. Tests have demonstrated that Adṛśya's proprietary algorithm performs better at reducing the loss of data utility from anonymised data sets.
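Adṛśya's overlapping-bucket algorithm is proprietary, so I cannot show it here, but the standard alternatives it ships with share a common idea. A minimal sketch of k-anonymisation (hypothetical data; the generalisation scheme and helper names are my own for illustration): coarsen quasi-identifiers until every record shares its quasi-identifier combination with at least k-1 others:

```python
from collections import Counter

def generalise_age(age, width=10):
    # Coarsen an exact age into a range such as "30-39"
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, quasi, k):
    # Every quasi-identifier combination must occur at least k times
    counts = Counter(tuple(r[q] for q in quasi) for r in records)
    return all(c >= k for c in counts.values())

records = [
    {"age": 34, "postcode": "SW1A", "disease": "Bronchitis"},
    {"age": 36, "postcode": "SW1A", "disease": "Flu"},
    {"age": 38, "postcode": "SW1A", "disease": "Asthma"},
]

# With exact ages every record is unique, so the table is not 2-anonymous
print(is_k_anonymous(records, ("age", "postcode"), k=2))  # False

# Generalise ages into decades; all three records now share "30-39"/"SW1A"
for r in records:
    r["age"] = generalise_age(r["age"])
print(is_k_anonymous(records, ("age", "postcode"), k=2))  # True
```

The linkage attack from the AOL story no longer yields a unique match, but each generalisation step also blurs the quasi-identifiers analytics relies on, which is exactly the utility loss the comparison charts below measure.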

Key features of Adṛśya:

  • Anonymise the same data set for different confidentiality levels — eg Internal Departments, Suppliers, Customers, Regulators
  • Allow meaningful analytics for intended purposes only — within the desired security clearance levels and expected collaboration levels
  • Balance data utilisation and data privacy concerns — prevent misuse of sensitive attributes, in particular, less frequent ones

Below I include a few test results showing how Adṛśya's proprietary 'overlapping bucket' algorithm (referred to as 'Incognito' in the results below) performed and scaled against other standard algorithms. The comparison covers the time taken by the various algorithms to anonymise large data sets, how the anonymisation process scaled, and the resultant loss of data utility.

Algorithm Legend
Time taken by various algorithms to anonymise large data sets
Scalability — using a 5GB data set
Loss of data utility against different anonymisation algorithms

The Adṛśya team has found t-closeness to be the poorest at preserving data utility. This is because the t-closeness algorithm prevents semantically similar sensitive attributes of a data set from being grouped into the same bucket, and in the process it also limits close quasi-identifiers from being grouped together. Quasi-identifiers are what analytics algorithms rely on most to generate insights. Adṛśya's proprietary 'overlapping bucket' Incognito algorithm performs much better, as can be seen from how close its analytics results are to those obtained with the original data.
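To make the t-closeness constraint concrete, here is a minimal sketch with hypothetical data. The original t-closeness formulation measures distribution distance with the Earth Mover's Distance; I substitute the simpler total variation distance here purely for illustration. A bucket passes only if the distribution of its sensitive attribute stays within t of the overall distribution:

```python
from collections import Counter

def distribution(values):
    # Empirical distribution of a sensitive attribute
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def tv_distance(p, q):
    # Total variation distance between two discrete distributions
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

overall = ["Flu", "Flu", "Bronchitis", "Asthma"]   # whole data set
bucket = ["Flu", "Bronchitis"]                      # one candidate bucket

d = tv_distance(distribution(bucket), distribution(overall))
print(round(d, 3))  # 0.25

# Under this (simplified) metric the bucket is t-close only if d <= t
t = 0.3
print(d <= t)  # True
```

Tightening t forces every bucket's sensitive attributes to mirror the overall mix, which is what stops semantically similar values from clustering together and drives the utility loss the charts show.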

To know more about Adṛśya or if you have requirements to consider data set anonymisation, please feel free to contact me at mondweep@bridgeconnect.biz

Thank you for your attention! Feel free to leave a comment or ask any questions.
