Data Warehousing Ethical Concerns: Security, Access and Control

 

Published in TDAN.com April 2004


1.0 Introduction

There are ethical consequences when we apply learning algorithms to large data sets and generate patterns and models. To understand how the use of a data mining tool can cause ethical
concerns and security breaches, the project manager must understand the basics of what data mining can do. Data mining offers correlation analysis, market basket analysis, neural networks, and other
advanced artificial intelligence (AI) techniques, allowing discovery of patterns and relationships that were not visible before. Data mining works because it produces higher levels of confidence with
higher volumes of information at its disposal.

For instance, it would be difficult to learn anything new from 5 rows of customer demographic data, such as where these people live in relation to others or what the most densely
populated areas of a city are. With 100,000 rows of customer demographics it’s a little easier to spot the trends and patterns. The flip side is also true: it’s easier to spot
oddities, abnormalities and outliers beyond the patterns. The chance of receiving false positives also increases.
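
The point about false positives can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, a single normally distributed attribute and a cutoff of three standard deviations for flagging a row as an outlier; neither assumption comes from any particular data mining tool, and the function names are hypothetical.

```python
import math


def tail_probability(z_cutoff: float) -> float:
    """Two-sided probability that a normal value lies beyond +/- z_cutoff sigma."""
    return math.erfc(z_cutoff / math.sqrt(2))


def expected_chance_outliers(rows: int, z_cutoff: float = 3.0) -> float:
    """Expected number of rows flagged by chance alone (assumes normal data)."""
    return rows * tail_probability(z_cutoff)


# Compare the two data set sizes used in the text.
for rows in (5, 100_000):
    print(rows, "rows:", round(expected_chance_outliers(rows), 2), "expected chance outliers")
```

Under these assumptions, roughly 270 of the 100,000 rows would cross the cutoff by chance alone, while with 5 rows the expected count is close to zero. Each of those flagged rows is a customer who may be singled out for no real reason.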

The project manager is responsible for providing the tools that the business uses to gain new insights. They must consider whether or not the statement of work raises ethical issues in the use of
information, according to Dr. Donald Burton, Executive Director, The International Import Export Institute.

“The project manager should worry about what uses the data will be put to within the organization, they have a need to establish different layers/gatekeepers and qualifications on who has
access to the information,” says Burton. “The task of deciding what is ethical usage and what is not falls on focus groups of business users to look at nomenclature, access and
security.” [1]

Without these considerations, there is a chance that end users may have access to information that they should not be examining. Without knowing it, the end user may break federal
regulations, state laws, or worse.

For example: Let’s say customer A has prescriptions filled at drugstore D. They authorize the doctor and the prescription company to know about each other, and for the drugstore to know
where they live and what their co-pay for insurance is. They have insurance with company C but have not authorized company C to know what their prescriptions are. Company C acquires drugstore D and
now the executives want customer segmentation by prescription and drugstore. Did customer A authorize company C to have medical information about prescriptions that they filled at drugstore D?
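
One way to keep that question visible during integration is to record consent explicitly and check it before data is handed to a new party. The following is a minimal sketch, not a description of how any real system works: the `ConsentRegistry` class, the party names, and the data categories are all hypothetical, chosen to mirror the example above.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRegistry:
    """Records which party a customer has authorized to see which data category."""
    grants: set = field(default_factory=set)

    def grant(self, customer: str, party: str, category: str) -> None:
        self.grants.add((customer, party, category))

    def is_authorized(self, customer: str, party: str, category: str) -> bool:
        return (customer, party, category) in self.grants


def rows_allowed_for(party: str, category: str, rows: list, registry: ConsentRegistry) -> list:
    """Keep only the rows whose customer authorized this party for this category."""
    return [r for r in rows if registry.is_authorized(r["customer"], party, category)]


registry = ConsentRegistry()
# Customer A's authorizations, as described in the example.
registry.grant("customer_a", "drugstore_d", "prescriptions")
registry.grant("customer_a", "drugstore_d", "address")
registry.grant("customer_a", "drugstore_d", "copay")
registry.grant("customer_a", "company_c", "insurance_policy")

prescriptions = [{"customer": "customer_a", "drug": "example-drug"}]
# Company C acquired drugstore D, but no consent was recorded for company C.
print(rows_allowed_for("company_c", "prescriptions", prescriptions, registry))
```

In this model the acquisition does not transfer consent on its own, so the segmentation request returns no rows for customer A. Whether an acquisition legally transfers consent is a question for counsel; the sketch only forces someone to answer it before the data moves.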


2.0 Ethically Speaking…

The above situation can happen, and is happening, in our world today. The implementers of the technology are simply told to integrate the data, and the project manager builds a project to make it happen
(with the support of the business). In the future, as ethical concerns become a hot topic in Washington, DC, it will be more important that project managers begin to ask the business users to supply the
documents that outline access, roles, and ethical uses of the information they will receive.

There are also ethical considerations around the use of basic ETL processes and BI tools in the small data set arena. Ethical considerations abound with small data sets being moved from source
systems to target systems for testing purposes. It doesn’t have to be a large data set to be an ethical concern, although large data sets lend themselves to a particular host of ethical
problems such as profiling and segmentation: users are learning things they shouldn’t know, and in some cases aren’t allowed to know (especially in classified areas).
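
For the test-data case, one common mitigation is to mask identifying fields before rows leave the source system. The sketch below is an illustration under stated assumptions: the field names in `SENSITIVE_FIELDS` and the helper names are hypothetical, and keyed hashing is only one of several possible masking techniques.

```python
import hashlib
import hmac

# Hypothetical list of columns that identify a person.
SENSITIVE_FIELDS = {"name", "ssn", "email"}


def mask_value(value: str, secret: bytes) -> str:
    """Replace a value with a keyed hash so the original cannot be read back directly."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def mask_row(row: dict, secret: bytes) -> dict:
    """Mask only the sensitive columns; leave the rest usable for testing."""
    return {
        key: mask_value(str(value), secret) if key in SENSITIVE_FIELDS else value
        for key, value in row.items()
    }


source_row = {"name": "Jane Doe", "ssn": "000-00-0000", "city": "Denver"}
test_row = mask_row(source_row, secret=b"keep-this-key-out-of-the-test-environment")
print(test_row)
```

The same input always maps to the same token, so joins between masked tables still work in the test system. This is pseudonymization rather than anonymization: if the key leaks, or the remaining columns are distinctive enough, individuals may still be identifiable, so the test environment still needs to be locked down.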

Project managers are also faced with requirements to gather outside or public information and integrate it into these already large warehouses. In some cases, end users may begin to ask the warehousing
team to integrate external data sources such as stock trades, financial portfolio information, newsletters, and Yahoo subscription information, all of which is public (to a degree). The PM must now
decide which of the publicly available information is acceptable to integrate and which is a risky proposition that may raise ethical concerns once integrated.

The argument “it’s publicly available” should not settle the ethical question of whether the information may be used; the ethics should
focus on how the information is utilized, and by whom.

A company might ask at this point:

“Ok, so what are we to do? We’re a large organization and we intend on using our information with the utmost regard for the customer. Anything we learn will help us compete and
offer better deals to the customer. We’re trying to help our customers not hurt them.”

There is a delicate balance between acting on what’s morally right and ethically acceptable, and what’s devious and destructive. There is no single answer, but there is a plea to all
executives: consider yourselves as one of your own customers, and if someone else discovered or learned X about you, would it make you upset? How would you respond? If we personalize the outcomes,
we might begin to glimpse the impact of the business decisions we’re about to make and the ethics that go with them.


3.0 Summary

It is a challenging quest to maintain balance, control and security over our ever-growing data sets. It’s also our duty to examine the ethical consequences of the business decisions we make
through the use of that information. Finally, we must consider the quality of the information we are basing our decisions on. Incorrect information can harm more than it can help.

Here is a checklist of items that project managers and technology implementers may consider when embarking on a very large data warehousing (VLDW) project and wanting to manage ethical concerns:

  • Develop SLAs with end users that define who has access to what levels of information.
  • Have end users involved in defining the ethical standards of use for the data that will be delivered.
  • Define the bounds around the integration efforts of public data, where it will be integrated and where it will not, so as to avoid conflicts of interest.
  • Do not use “live” or real data for testing purposes, or else lock down the test environment; too often test environments are left wide open and accessible to too many individuals.
  • Define where, how, and by whom data mining will be used, and restrict the mining efforts to specific sets of information. Build a notification system to monitor data mining usage.
  • Allow customers to “block” the integration of their own information, depending on whether the integrated customer information will be made available on
    the web (this one is questionable).
  • Remember that any efforts made are still subject to governmental laws.
  • Nothing is sacred. If a government wants access to the information, they will get it.
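
Two of the checklist items, access levels and monitoring of data mining usage, can be sketched together. Everything named below is an assumption made for illustration: the level names, the role and data set mappings, and the `request_mining_access` function do not come from any specific product, and a real deployment would take these definitions from the SLAs agreed with the business.

```python
from datetime import datetime, timezone

# Hypothetical access levels, ordered from least to most sensitive.
ACCESS_LEVELS = {"public": 0, "internal": 1, "restricted": 2}

# Hypothetical mappings that an SLA with the business users would define.
ROLE_CLEARANCE = {"analyst": "internal", "data_steward": "restricted"}
DATASET_LEVEL = {"store_sales": "internal", "prescriptions": "restricted"}

# In-memory log; a real notification system would persist and alert on this.
usage_log = []


def request_mining_access(user: str, role: str, dataset: str) -> bool:
    """Decide whether a role may mine a data set, and log every request."""
    # Unknown roles get the lowest clearance; unknown data sets are treated
    # as restricted, so anything not explicitly classified is denied.
    clearance = ACCESS_LEVELS[ROLE_CLEARANCE.get(role, "public")]
    required = ACCESS_LEVELS[DATASET_LEVEL.get(dataset, "restricted")]
    allowed = clearance >= required
    usage_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


print(request_mining_access("pat", "analyst", "store_sales"))
print(request_mining_access("pat", "analyst", "prescriptions"))
print(len(usage_log), "requests logged")
```

Denied requests are logged as well as granted ones, since repeated attempts to reach restricted data are exactly what a gatekeeper would want to be notified about.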

References

  1. Dr. Donald Burton, Executive Director, The International Import Export Institute.
  2. Company ABC – Not allowed to release their corporate information
  3. Bob Terdeman – CTO and Senior VP of Rogers Medical Intelligence Solutions

InnerCore Users – Paraphrased comments regarding the ethical use of test data.

© Copyright 2002-2003, Core Integration Partners, All Rights Reserved. Unless otherwise indicated, all materials are the property of Core Integration Partners, Inc. No part of this document
may be reproduced in any form, or by any means, without written permission from Core Integration Partners, Inc.

About Dan Linstedt

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.
