How Technology can Find, Manage and Protect Personal Content in Unstructured Data

www.indexengines.com [email protected] 732-817-1060

Copyright 2017. All rights reserved. Index Engines Inc. All products mentioned are trademarked

by their respective organizations.

Personal data exists in obvious documents and files, for example spreadsheets created by specific users, such as members of the finance team, and located on specific servers. Finding and managing these files may be a relatively easy task. However, personal data may also be hidden deep in files that you do not know exist. This is the content that will keep Data Protection Officers (DPOs) up at night and will create the risk of fines and sanctions from European Union regulators.

Complying with the GDPR will be a monumental task for most organizations. The sheer volume of data, existing on unmanaged networks and file servers or tucked away in archives, will require technology and intelligence to fully comply with this new regulation. The upside is that the regulation will drive a governance-ready strategy that not only streamlines corporate data centers but also recoups costs by managing data more intelligently and optimizing the data storage footprint.

Overview

Technology is the foundation that will help support GDPR compliance. Technology can classify data into

manageable groups, provide comprehensive search to find personal data, manage the disposition of this

content, implement automation to ensure an auditable workflow, and secure and protect sensitive data

ensuring it does not fall into the wrong hands.

Technology, combined with an intelligent workflow, is the only approach that ensures all personal data is protected, managed and secured. Index Engines has developed a workflow that not only allows personal data to be found in support of the regulation, but also ensures an auditable and defensible process that will allow DPOs to sleep well at night.

Workflow

Finding, restricting, rectifying and deleting personal data will be the new normal in 2018. Since organizations have amassed significant stockpiles of unstructured user data over decades, searching and finding the relevant personal content can be a challenge unless you implement a smart workflow that breaks down tasks into achievable segments.

Index Engines’ software provides comprehensive classification, search, disposition, protection and preservation capabilities all in an integrated and automated platform that scales to petabyte-class data centers.




Step 1: Classify and Clean – Provide Insight into Unstructured Data

Understanding what data exists, determining its value, and classifying it to be in scope for the GDPR provides the

foundation for compliance with the regulation. Data classification is performed by scanning data sources at a

metadata level and capturing high-level information about the files.

Metadata fields such as owner and Active Directory membership (e.g., Legal or HR department) enable you to determine whether these users typically work with personal data. Fields such as file type or folder location can help you determine whether files such as presentations or spreadsheets stored on a shared marketing server are likely to contain personal data.

Additionally, you can use metadata to separate data of value from data that has long outlived its usefulness. Criteria such as a last-accessed date more than three years old, or multimedia file types such as photos and movies, can help you determine whether the content has business value and then take the appropriate action.
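As a rough illustration of metadata-level scanning, the sketch below walks a directory tree and records file properties without opening any file. It is not Index Engines' implementation. The function names `scan_metadata` and `days_since_access` are hypothetical, and it uses only local file system attributes, so Active Directory group membership would have to come from a separate lookup.

```python
import os
import time
from pathlib import Path


def scan_metadata(root):
    """Walk `root` and collect file-level metadata without reading file content."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                "path": path,
                "extension": Path(name).suffix.lower(),
                "size_bytes": info.st_size,
                # Numeric owner id only; mapping it to a department is
                # assumed to happen elsewhere (e.g. a directory lookup).
                "owner_uid": info.st_uid,
                "last_accessed": info.st_atime,
            })
    return records


def days_since_access(record, now=None):
    """Return how many days ago the file was last accessed."""
    now = time.time() if now is None else now
    return (now - record["last_accessed"]) / 86400
```

A record whose `days_since_access` exceeds roughly three years (about 1,095 days) would be a candidate for the "outlived its usefulness" group described above. Note that some file systems are mounted with access-time updates disabled, in which case the last-modified time may be a better signal.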




ROT Data: Using metadata classification, you can organize content into a few manageable categories. The first category is data that no longer has any business value but is clogging up corporate servers and complicating GDPR requests. This data could include redundant, obsolete or trivial (ROT) files, or simply content that has not been accessed in more than a few years. This data can be defensibly deleted so that it no longer gets in the way of future queries or data requests. This process of data minimization is discussed in more detail in the following section.

Out-Of-Scope Data: This category includes data that is out of scope for the GDPR, mainly content that, based on the classification, would not contain personal data. This could be Word documents created by members of the engineering department and stored on a specific set of servers. In many cases, traditional data mapping interviews provide valuable information that complements the metadata properties captured during the classification process. Together, these give you a comprehensive profile of your data assets based on detailed intelligence.

In-Scope Data: Finally, what remains after the classification process is data that is in scope for the GDPR. This is content that would contain personal data, for example spreadsheets created by members of the finance department and located on a specific set of servers or folders. This in-scope data set will become the target for future GDPR queries and requests. It is usually a much smaller set of data, typically around 10% to 20% of what exists on corporate networks.

Data classification reduces the challenge of managing hundreds of terabytes or even petabytes of data in support of the GDPR to a focused set of in-scope data that can amount to only 10% to 20% of the entire environment. This helps you find all personal data using a more efficient and targeted approach.
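The three categories described above can be sketched as a simple rule-based classifier over metadata records. This is a minimal illustration under stated assumptions, not the product's logic: the `FileRecord` fields, the three-year threshold, the extension list and the department list are all hypothetical values chosen to mirror the examples in the text.

```python
from dataclasses import dataclass

# Assumed thresholds and lists; a real deployment would tune these
# using data mapping interviews and its own retention policies.
ROT_AGE_DAYS = 3 * 365
TRIVIAL_EXTENSIONS = {".jpg", ".png", ".mp4", ".mov", ".mp3", ".log"}
PERSONAL_DATA_DEPARTMENTS = {"finance", "hr", "legal"}


@dataclass
class FileRecord:
    path: str
    extension: str
    owner_department: str
    days_since_access: float


def classify(record):
    """Assign a record to 'rot', 'in_scope' or 'out_of_scope' from metadata alone."""
    # ROT is checked first: aged or trivial content is a cleanup candidate.
    if record.days_since_access > ROT_AGE_DAYS or record.extension in TRIVIAL_EXTENSIONS:
        return "rot"
    # Files owned by departments that typically handle personal data are in scope.
    if record.owner_department.lower() in PERSONAL_DATA_DEPARTMENTS:
        return "in_scope"
    return "out_of_scope"
```

Checking ROT first is a design choice that follows the order of the categories in the text. An organization that worries about personal data inside aged files might instead route old files from in-scope departments to review before deletion.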




Data Minimization

Organizations have amassed significant stockpiles of user data over decades. Using the classification method described above, this data can be organized into meaningful groups, simplifying the management of user files. One of the larger classes of data on corporate networks will be aged data that no longer has business value, or redundant, obsolete and trivial files, commonly referred to as ROT. This class of data can easily be found, tagged, and migrated to low-cost cloud storage or even purged.

Clearing out this useless content from the data center will simplify requests for personal data, making its management a less complex task. Additionally, this data center housekeeping will generate cost savings that help fund support for the GDPR. For organizations that have not budgeted sufficiently for technology to manage personal data, data minimization will generate a return on investment by recouping storage capacity and data center resources, thus funding the acquisition of software and services.

Typical classification categories that support data minimization include redundant data that has not been accessed in years; files owned by ex-employees (based on Active Directory) that current users are not accessing; and trivial files such as logs, photos, movies and iTunes libraries. Going further into the metadata analysis, you can find file types owned by specific departments that have outlived their business value, such as old marketing presentations or other obsolete content.

Many organizations realize 40% or more savings in storage capacity by executing a data minimization project. This

will result in a significant reduction in data center expenses as shown in the example below.




Step 2: Search and Find - Discover Personal Data

The GDPR provides citizens with the right to access, rectify, erase or restrict their personal data. Therefore,

search is core to any technology that will be implemented in support of the regulation.

Many search products claim to support the discovery of personal data. However, personal data comes in many types and is hidden in diverse types of files and locations. The search technology needs to find data you know exists, as well as personal content you do not know about.

Beyond common keyword search for specific known personal content in files and email, Boolean search (AND, OR, proximity, etc.) allows for more refined queries. More sophisticated pattern search is key to finding personal data such as national identity or driver’s license numbers. Each country in the EU may have specific patterns that are not included in the search engine; regular expressions (regex) allow for the custom definition of even the most obscure patterns.
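To make the idea of custom patterns concrete, here is a small sketch that scans text with a dictionary of regular expressions. The patterns are deliberately simplified assumptions for illustration: `uk_nino_like` only approximates the shape of a UK National Insurance number and does not enforce the official prefix rules, and the `PATTERNS` table and `find_personal_data` function are hypothetical names.

```python
import re

# Simplified, illustrative patterns; production patterns should be
# validated against each country's official identifier format.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Two letters, six digits (optionally spaced in pairs), suffix A-D.
    "uk_nino_like": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}


def find_personal_data(text, patterns=PATTERNS):
    """Return a dict mapping each pattern label to the matches found in `text`."""
    hits = {}
    for label, pattern in patterns.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits
```

Adding support for another country's identifier is then a matter of adding one more entry to the pattern table, which is the flexibility the paragraph above describes.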

However, personal data that you don’t know exists can be very difficult to find, and the above techniques will be of little use. To ensure compliance with the GDPR you will need to find everything with confidence. Conceptual search technology builds sophisticated search queries based on machine learning algorithms and will find data even when you don’t know what you are looking for. Concept search learns what personal data looks like and where it is contained, and develops queries that you can use to search across large data environments. For example, if you know personal data is typically hidden inside specific finance documents, you can train the query engine on those documents so that it finds similar ones.
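The "find similar documents" idea can be illustrated with a much simpler stand-in than a trained concept model. The sketch below builds a word-frequency profile from a few example documents and ranks candidates by cosine similarity to it. This is a swapped-in technique for illustration only, not the algorithm Index Engines uses, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Represent a document as a bag-of-words frequency vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(seed_docs, candidates):
    """Rank candidate documents (name -> text) by similarity to the seed examples."""
    profile = Counter()
    for doc in seed_docs:
        profile.update(vectorize(doc))
    scored = [(cosine(profile, vectorize(text)), name) for name, text in candidates.items()]
    return sorted(scored, reverse=True)
```

Documents that share vocabulary with the known finance examples rise to the top of the ranking, while unrelated documents score zero. A production concept search would go further, for instance by weighting rare terms and learning from reviewer feedback.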

Based on the definition of this regulation, you will need all the search capabilities available to efficiently find and manage personal data. Index Engines provides comprehensive search technology, including conceptual search, enabling you to find personal data with confidence.




Step 3: Manage and Protect – Disposition and management of sensitive data

Manage: Once personal data is discovered, it needs to be managed according to the data owner’s request. Deleting, migrating, archiving, restricting and correcting content will become the new normal in 2018.

Index Engines delivers integrated disposition, empowering users to search and find content then manage it

appropriately. This is critical to a defensible workflow. Once queries are executed, data can be tagged and

organized and then deleted, moved, secured or simply monitored as needed. All data actions will be logged to

maintain a defensible and auditable process. If disposition is not integrated, then the workflow will not be

defensible and personal data can easily be mismanaged.
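The link between disposition and logging can be sketched as a single function that performs an action and records it in the same step. This is an assumption-laden illustration rather than the product's mechanism: the `apply_disposition` name, the three action labels and the JSON-lines log format are all hypothetical.

```python
import json
import os
import shutil
from datetime import datetime, timezone

ALLOWED_ACTIONS = {"delete", "archive", "monitor"}  # assumed action set


def apply_disposition(path, action, audit_log, actor, archive_dir=None):
    """Carry out one disposition action and append a record of it to the audit log."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "path": path,
    }
    if action == "delete":
        os.remove(path)
    elif action == "archive":
        if archive_dir is None:
            raise ValueError("archive_dir is required for the archive action")
        os.makedirs(archive_dir, exist_ok=True)
        # shutil.move returns the final location of the moved file.
        entry["destination"] = shutil.move(path, archive_dir)
    # "monitor" changes nothing on disk; it is only recorded.
    with open(audit_log, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Because the action and the log entry live in one function, no file can be deleted or moved through this path without leaving a record, which is the property the paragraph above calls a defensible workflow.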

Integrated archiving is also critical to preserving and securing personal data. Sensitive data that is no longer needed on the primary storage network, but must be maintained for long-term retention requirements, should be moved to an archive. The archive can be easily managed and will ensure sensitive data is not left unprotected on the network. Retention policies can be defined, and compliance teams can easily search and manage the content. Leveraging cloud storage to archive data is not only cost effective but also more secure than leaving this data on servers and user shares available to potential rogue employees.

Protect: Protection of personal data against rogue employees and data breaches is a core aspect of the GDPR. Organizations can no longer leave sensitive data unprotected on networks and hope for the best. As this regulation is enforced, a typical data breach will become much more serious: it could damage a business’s reputation as well as result in significant fines.

Indexing file properties, including activity logs (who has accessed what) and ACLs (who has read/write/browse permissions to specific files), facilitates a proactive approach to data protection. Many security tools will notify you after the fact that some unusual behavior has occurred, but this regulation requires proactive assessment of data security to protect data assets.
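A proactive permission check might look like the sketch below, which flags files that any local user can read or write. It uses POSIX permission bits as a simplified stand-in for full ACLs, so it is an assumption-based illustration only. Windows or NFSv4 ACLs would need a platform-specific API, and `find_overexposed` is a hypothetical name.

```python
import os
import stat


def find_overexposed(paths):
    """Return the subset of `paths` that are readable or writable by 'other' users."""
    exposed = []
    for path in paths:
        mode = os.stat(path).st_mode
        # S_IROTH / S_IWOTH are the world-read and world-write permission bits.
        if mode & (stat.S_IROTH | stat.S_IWOTH):
            exposed.append(path)
    return exposed
```

Running a check like this over the in-scope data set before an incident occurs, rather than after an alert, is the kind of proactive assessment the paragraph above describes.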




Step 4: Monitor - Ongoing management and defensible audit trails

Monitor: Organizations will need to provide a defensible and auditable workflow when the regulators

come to review their support of the GDPR. Implementing a fully integrated platform to manage all aspects of

personal data protection, as outlined above, will empower organizations to not only comply but ensure a

repeatable and reliable process.

Data Protection Officers will need to depend on these tools to understand how data is protected while

refining and improving upon processes based on the data management logs and activity. Without an

integrated approach there will be too many aspects to the workflow and too many areas that can fail when

managing significant volumes of personal data.

Index Engines provides an enterprise index that is incrementally refreshed as the data changes. This allows a

current view into data and actionable insight into new and sensitive data that must be proactively managed.

A dashboard with pre-defined reports and queries will provide visual knowledge of data assets allowing

organizations to refine and audit data policies as needed. Additionally, notifications can be defined, and

emails delivered based on stored queries that are customized to support compliance with the GDPR.

Combined, these tools provide a defensible and auditable workflow that will provide confidence to any data

protection officer that personal data is protected and managed.

Audit: When the EU regulatory body comes to audit the workflow and processes, it is imperative that all activity be logged and accountable. Index Engines maintains a defensible log of all disposition activity, allowing for an audit and review of citizens’ requests.

Detailed reports can be run on the data, to visually review content and determine if there are breaches in

policies and if sensitive data is vulnerable to a breach. Once these policies are defined, simply storing them

and scheduling execution will deliver a detailed report via email for review and management.




Legacy Data on Backup Tapes

And what about your legacy backup tapes? This data was initially preserved for disaster recovery but is now an archive of all your long-term retention data.

Personal and sensitive data exists on backup tapes. In order to comply with the regulation, this data must be

managed as well. Simply managing your network data would leave a legacy instance in your backup data that

would cause risk and liability in the future.

The easiest approach would be to profile the backup data once it is no longer used for disaster recovery, determine what is required for long-term retention, and migrate this content out of backup and into a policy-based archive. Once this is accomplished, the backup data can be purged.

By eliminating the practice of saving old backup data, specifically backup tapes, companies eliminate the need to manage, and go to, these repositories in the future. By leveraging the GDPR to clean up and remediate legacy tapes, offsite tape costs and ad-hoc eDiscovery costs can be recouped.

Finding the data that has value and selectively migrating this content makes for cost-effective management of these records. The policies that are defined to support GDPR on the network data content can easily be applied to legacy tape data in order to streamline the migration.




Index Engines Support for the GDPR

The GDPR is a significant information management challenge. Organizations must embrace technology to help

meet and exceed the requirements of the regulation. It is critical to deploy a solution that can meet the

following requirements:

Petabyte-class indexing platform

Support for all classes of data, including legacy data on backup tapes

Comprehensive search and reporting

Integrated management and disposition capabilities

Secure archiving to support secure preservation and access

Automated processing and monitoring of data policies

Index Engines is architected to support management and governance of sensitive data in support of corporate

policies. Only Index Engines can deliver a workflow as outlined above on all classes of data, including legacy

data on backup tapes.

Getting started in your efforts to support the GDPR can be accomplished today. Index Engines enables

organizations to deploy technology in stages. Start with some critical data servers, understand what exists,

and then develop a policy that will provide sound support for personal data management. Then use this

footprint to expand the environment and incorporate more data sources, including legacy backup data.

Index Engines delivers a number of deployment options and services that make easy work of compliance with the GDPR.

Contact us today at [email protected] or learn more at www.indexengines.com/gdpr
