TECHNOLOGY FOR EDD CONSULTANTS, NOT DUMMIES TOPIC: PRE-PROCESSING By Sara Emami

Sara's technology presentation pre processing

Jan 14, 2015


Technology

Sara Emami

Pre-processing in E-Discovery. Educational PowerPoint for attorneys and paralegals.
Transcript
Page 1: Sara's technology presentation pre processing

TECHNOLOGY FOR EDD CONSULTANTS, NOT DUMMIES

TOPIC: PRE-PROCESSING

By Sara Emami

Page 2: Sara's technology presentation pre processing

THE EDD CONSULTANT AND E-DISCOVERY

The ultimate goal of electronic discovery is this: to obtain information that is truly relevant to the case.

There is a demand by legal entities and corporations to process, review and ultimately produce this information in the most cost-effective manner.

Page 3: Sara's technology presentation pre processing

WHY PRE-PROCESSING MATTERS TO THE DISCOVERY PROCESS

It enables clients to greatly reduce electronic data sizes at the earliest stages of e-Discovery! This alone is very significant and enables cost-efficiency!

Pre-processing is intended to de-NIST, de-dupe and apply data filters to efficiently cull (remove) large sets of data which may be deemed unnecessary or irrelevant to each case.

Page 4: Sara's technology presentation pre processing

A SUCCESSFUL PRE-PROCESSING PROGRAM SHOULD DO THE FOLLOWING:

Data cataloging (think of organization of files for quick and efficient access)

File extension filtering

Document level and date/time filtering

File type identification

MD5 hashing

NIST filtering

De-Duplication
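
As an illustration of the "file type identification" item above, here is a minimal Python sketch that identifies a file by its leading bytes rather than by its extension. The signature table is a tiny illustrative sample, and the function name and labels are assumptions made for this example, not part of any particular pre-processing product.

```python
# Illustrative sample of well-known leading-byte signatures.
# Real tools rely on much larger signature databases.
MAGIC_SIGNATURES = {
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip container (docx, xlsx, zip)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2 container (legacy doc, xls, msg)",
    b"MZ": "windows executable",
}


def identify_file_type(header: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for signature, label in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return label
    return "unknown"


# A file named "invoice.txt" that starts with %PDF is really a PDF.
detected = identify_file_type(b"%PDF-1.7 sample bytes")
```

Checking content instead of the extension matters because a renamed file keeps its original signature.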

Page 5: Sara's technology presentation pre processing

DeNISTing

What is “NIST”?

Why are “NIST” and DeNISTing significant?

How does this apply to electronic discovery and how does this influence technology?

Page 6: Sara's technology presentation pre processing

What is a NIST? It is not a laying ground for eggs!!!

The proper method for excluding these files is known as DeNISTing.

The National Institute of Standards and Technology (“NIST”) publishes the National Software Reference Library (“NSRL”).

The NSRL is basically the Library of Congress of software files. Its comprehensive listing of files includes all of the files known to be distributed with software packages such as Microsoft Office.
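
The DeNISTing idea can be sketched in a few lines of Python: hash every collected file and drop those whose hash appears in a set of known software-file hashes. This is a minimal sketch under stated assumptions. The set `known_hashes` stands in for hash values loaded from the NIST list, and the file names, contents, and function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the data as uppercase hex."""
    return hashlib.md5(data).hexdigest().upper()


def denist(files: dict[str, bytes], known_hashes: set[str]) -> dict[str, bytes]:
    """Keep only files whose hash is NOT in the known-software set."""
    return {
        name: data
        for name, data in files.items()
        if md5_hex(data) not in known_hashes
    }


# Hypothetical stand-in for hashes loaded from the NIST list.
known_hashes = {md5_hex(b"stock system library bytes")}

# Hypothetical collection: one stock system file, one user document.
collection = {
    "kernel32.dll": b"stock system library bytes",
    "memo.txt": b"Quarterly results memo",
}

remaining = denist(collection, known_hashes)
```

Only `memo.txt` survives the filter, because its hash does not match any known software file.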

Page 7: Sara's technology presentation pre processing

WHY IS DeNISTing IMPORTANT?

In the typical electronic discovery case, DeNISTing alone will reduce the volume of information to be examined by 20%.

For large corporations, DeNISTing comes in handy because reduced volume means less money spent on the discovery process.

Page 8: Sara's technology presentation pre processing

INTERESTING FACTS ABOUT “NIST”

The NIST list contains over 28 million file signatures.

It is used regularly by the FBI and other law enforcement entities to identify files with no evidentiary value. 

The list is free. 

Many e-Discovery companies take advantage of this free list and incorporate it into their software. 

Page 9: Sara's technology presentation pre processing

CONCERNS WITH de-“NIST”

While the NIST list is updated four times per year, it may not include important files.

In the past, significant system files were not removed during the DeNISTing process on workstations using Windows 7 and the latest release of Microsoft Office. This was a known problem in 2011, so DeNISTing is not always foolproof.

Page 10: Sara's technology presentation pre processing

CONCERNS (continued)

At that time, the NIST list did not yet include Windows 7 files, despite the fact that more than three hundred million workstations ran Windows 7. Additionally, it did not yet include Microsoft Office 2010 files.

Supplementing the NIST list by removing system files such as EXE and DLL files is a clearly documentable method to reduce the number of files in the review set. 

This method doesn’t depend on hash values and, assuming that these file types are not responsive (which is usually the case), can be an effective method for eliminating files from review.
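
The extension-based supplement described above could look roughly like the following Python sketch. The excluded set mirrors the EXE and DLL examples in the slide. The function name and the sample file names are assumptions for illustration. Culled files are returned rather than discarded so the exclusion stays documentable.

```python
from pathlib import PurePath

# Extensions treated as system files, per the EXE/DLL examples above.
SYSTEM_EXTENSIONS = {".exe", ".dll"}


def split_by_extension(paths, excluded=SYSTEM_EXTENSIONS):
    """Split paths into (kept, culled) using a case-insensitive extension check."""
    kept, culled = [], []
    for path in paths:
        if PurePath(path).suffix.lower() in excluded:
            culled.append(path)
        else:
            kept.append(path)
    return kept, culled


# Hypothetical file names from a collection.
kept, culled = split_by_extension(["report.docx", "setup.EXE", "msvcrt.dll"])
```

Here `kept` holds the Word document and `culled` holds the two system files, which can be logged as the record of what was removed and why.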

Page 11: Sara's technology presentation pre processing

What is “Hash”?

When we think of “hash”, we are not referring to McDonald’s breakfast potatoes.

From a technology perspective, we must think of hash as an individual file’s digital fingerprint. The NIST listing includes the names of the files, their typical file sizes and the “hash” value for each file.

Page 12: Sara's technology presentation pre processing

HASH – THINK ALGORITHMS!!

When we think of hash we should think: a one-way mathematical algorithm (a hash function, not encryption) that forms the mathematical foundation of e-discovery.

Hashing generates a practically unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive.

Hash also allows for the identification of particular files, and the easy filtration of duplicate documents, a process called “de-duplication” that is essential to all e-discovery document processing.

Page 13: Sara's technology presentation pre processing

EXAMPLE

Let us say, for instance, the hash values of a Word document I am working on now are:

MD5: 588BCBD1845342C10D9BBD1C23294459
SHA-1: C24AE3125BFDBCE01A27FDDA21B3A7E83FAFF69E

If I only change one comma in this multipage document, all else remaining the same, the hash values are now:

MD5: 5F0266C4C326B9A1EF9E39CB78C352DC
SHA-1: 4C37FC6257556E954E90755DEE5DB8CDA8D76710

Although the two files have only this trivial difference, there are no similarities in these hash values, showing that hashing will detect even the slightest file alteration.
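
The same effect can be reproduced with Python's standard `hashlib` module. The sketch below hashes two short byte strings that differ by a single comma. The sample sentences and the `fingerprints` helper are made up for illustration and are not the Word document from the slide, so the digests will not match the values shown above.

```python
import hashlib


def fingerprints(data: bytes) -> dict[str, str]:
    """Return the MD5 and SHA-1 digests of the data as uppercase hex."""
    return {
        "MD5": hashlib.md5(data).hexdigest().upper(),
        "SHA-1": hashlib.sha1(data).hexdigest().upper(),
    }


# Two versions of the same sentence, differing by one comma.
original = b"The parties agree, in principle, to the terms."
edited = b"The parties agree in principle, to the terms."

before = fingerprints(original)
after = fingerprints(edited)

for algorithm in ("MD5", "SHA-1"):
    print(algorithm, before[algorithm], "->", after[algorithm])
```

An MD5 digest is always 32 hex characters and a SHA-1 digest is always 40, regardless of the size of the input.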

Page 14: Sara's technology presentation pre processing

HASHING (cont.)

Hashing can also be used to determine when fields or segments within files are identical, even though the entire file might be quite different (may require software).

For instance, you can hash only the body of an email, the actual message, to determine whether it is identical with another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
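
A rough Python sketch of body-only hashing follows. It parses two simple, single-part messages with the standard `email` module and hashes only the message text, after collapsing whitespace. The addresses, the sample messages, and the `body_hash` helper are hypothetical. Multipart messages and attachments would need additional handling that is not shown here.

```python
import hashlib
from email import message_from_string


def body_hash(raw_email: str) -> str:
    """Hash only the body of a simple (non-multipart) email."""
    message = message_from_string(raw_email)
    body = message.get_payload()
    # Collapse whitespace so trivial line-wrapping differences do not matter.
    normalized = " ".join(body.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Same message text sent to two different recipients (hypothetical data).
raw_a = (
    "From: alice@example.com\nTo: bob@example.com\nSubject: Draft\n\n"
    "Please review the attached draft.\n"
)
raw_b = (
    "From: alice@example.com\nTo: carol@example.com\nSubject: Draft\n\n"
    "Please review the attached draft.\n"
)

same_body = body_hash(raw_a) == body_hash(raw_b)
same_file = hashlib.md5(raw_a.encode()).hexdigest() == hashlib.md5(raw_b.encode()).hexdigest()
```

The whole-file hashes differ because the “to” fields differ, while the body hashes match.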

Page 15: Sara's technology presentation pre processing

THE MD5 AND WHY IT IS SIGNIFICANT!

An MD5 message digest helps e-discovery professionals both verify the integrity of transferred files and confirm the digital fingerprint of those files.

When hash functions are applied, legal teams can quickly locate documents in different formats within a sizeable data collection.

Additionally, through the use of pre-culling hashing tools, they can rapidly identify duplicate documents by comparing hash values.
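
Identifying exact duplicates by comparing hash values can be sketched as grouping files under their digest. The following Python example is a minimal illustration with hypothetical file names and contents, not the behavior of any specific pre-culling tool.

```python
import hashlib
from collections import defaultdict


def group_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Return groups of file names whose contents share the same MD5 hash."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[hashlib.md5(data).hexdigest()].append(name)
    # Only groups with more than one member are duplicates.
    return [names for names in by_hash.values() if len(names) > 1]


# Hypothetical collection: the first two entries have identical content.
files = {
    "contract.doc": b"Final contract text",
    "Copy of contract.doc": b"Final contract text",
    "notes.doc": b"Meeting notes",
}

duplicate_groups = group_duplicates(files)
```

A reviewer would then need to look at only one file from each duplicate group.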

Page 16: Sara's technology presentation pre processing

NEAR DE-DUPE? WHAT IS THAT?

Near de-duplication reveals whether two documents are similar. Think of having created a Word document entitled “mydog.doc”. Let us imagine that we have revised “mydog.doc” and saved it as “mydog_revisedversion1.doc”. Even though the Word file was saved with the intent of being the same document in revision mode, the two drafts are near duplicates because they are similar but not identical.

Here is another example of near de-duplication: one file exists, “File AV1.0”. This file is then opened, spell checked, and saved as “File AV1.1”. These files are very similar and are classed as near duplicates.

Page 17: Sara's technology presentation pre processing

NEAR DE-DUPLICATION (continued)

Building on de-duplication, near-de-duplication software allows for an even higher level of data reduction because it identifies files that are similar but are not bit-level duplicates.

These near-de-duplication technologies help identify and group/tag electronic files that are “near duplicates”: files that are largely alike but differ in content, metadata, or both.

Examples of near de-duplication can include document versions, emails sent to multiple custodians, different parts of email chains, or similar proposals sent to several clients.
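
One simple way to approximate near-duplicate detection is to compare overlapping word sequences ("shingles") from two texts and measure how many they share. The Python sketch below uses Jaccard similarity for this. It is an illustrative technique chosen for this example, and the shingle size and the 0.5 threshold are arbitrary assumptions. Commercial near-de-duplication tools may work quite differently.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    shingles_a, shingles_b = shingles(a), shingles(b)
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Similar enough to group, but not identical (threshold is an assumption)."""
    return a != b and jaccard(a, b) >= threshold


# Hypothetical contents of "mydog.doc" and its revised version.
version_1 = "my dog likes to chase the ball in the park every morning"
version_2 = "my dog likes to chase the ball in the park every evening"
unrelated = "the quarterly budget report is due on friday"

revised_is_near_duplicate = is_near_duplicate(version_1, version_2)
unrelated_is_near_duplicate = is_near_duplicate(version_1, unrelated)
```

Exact duplicates are excluded on purpose, since ordinary hash-based de-duplication already handles them.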

Page 18: Sara's technology presentation pre processing

IT IS ALL ABOUT ORGANIZATION!!!

Imagine you are digging through years of files in your local storage shed in an attempt to find one significant document related to a past lawsuit. You dig through 20 boxes of paper and can’t find that one document! Wouldn’t it be nice to have an efficient way of cutting a full day of searching to just under an hour? Finding and grouping documents does this!

In recent years, the finding and grouping of documents in e-discovery has also been enhanced by new pre-culling tools that go beyond basic query methodology to include concept and fuzzy searching.

Historically, document sets were compiled with keyword searches and then narrowed by using fewer search terms.

Page 19: Sara's technology presentation pre processing

ALL ABOUT ORG. (cont.)

Now, with the advent of concept clustering (i.e., foldering), advanced document analysis can help organize information more effectively by subject.

This clustering capability greatly facilitates the review process by showing attorneys which subjects warrant the greatest attention or have the greatest relevance to a particular case.

Page 20: Sara's technology presentation pre processing

DATA MAPPING

Data mapping software is one of the most powerful pre-culling tools.

It provides the framework for visual analysis, showing users the different “points” across their continent of data.

It extracts and indexes metadata and text from native files, creates clusters based on any combination of attributes, and allows users to search and analyze document collections prior to full EDD processing.

Page 21: Sara's technology presentation pre processing

DATA MAPPING (cont.)

Data mapping applications should be able to remove duplicates in advance and can help legal professionals reduce document volume by as much as eighty percent. This is why mapping is significant.

One huge benefit of data mapping is that it can provide litigators direct control over the document collection. They can manipulate data themselves, in real time, without the need for vendor assistance or external processing.

Page 22: Sara's technology presentation pre processing

EXAMPLE OF WHAT A PRE-PROCESSING TOOL LOOKS LIKE

Page 23: Sara's technology presentation pre processing

CONCLUSION ON PRE-PROCESSING

Pre-processing enables clients to greatly reduce electronic data sizes at the earliest stages of the e-Discovery lifecycle.

Targets the relevance of the data

Organizes

Cost efficiency – equals happy clients!!!