Caradoc: a Pragmatic Approach to PDF Parsing and Validation

IEEE Security & Privacy LangSec Workshop 2016

Guillaume Endignoux, Olivier Levillain, Jean-Yves Migeon

École Polytechnique, France; EPFL, Switzerland; ANSSI, France

Thursday 26th May, 2016

Portable Document Format?

A commonly used format, but with many security issues:
- 500+ reported vulnerabilities in Adobe Reader [1] (since 1999).
- Discrepancies between implementations.
- Syntax facilitates polymorphism [2] (PDF+ZIP, PDF+JPEG, etc.).

In our work, we aim to verify PDF files at the syntactic level.

Two approaches to validate files:
- Blacklist: does not detect new malware...
- Whitelist: higher rejection rate, but accepted files are clean.

[1] http://www.cvedetails.com
[2] See for example PoC||GTFO


Table of contents

1 Syntactic and structural problems: a quick tour
2 Caradoc: a pragmatic solution
3 Application to real-world files

1 Syntactic and structural problems: a quick tour

PDF syntax 101

A PDF document is made of objects:
- null
- booleans: true, false
- numbers: 123, -4.56
- strings: (foo)
- names: /bar
- arrays: [1 2 3], [(foo) /bar]
- dictionaries: << /key (value) /foo 123 >>
- references: 1 0 obj ... endobj and 1 0 R
- streams: << ... >> stream ... endstream
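The object kinds listed above map naturally onto a small tagged data model. The Python sketch below is only an illustration: the class names (`Name`, `Ref`, `Stream`) and the `kind` helper are assumptions made for this example and are not taken from Caradoc.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Name:
    """A PDF name such as /bar (stored without the leading slash)."""
    value: str


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as `1 0 R`."""
    num: int
    gen: int


@dataclass
class Stream:
    """A stream: a dictionary followed by raw bytes."""
    dictionary: dict
    data: bytes


# Hypothetical union of all object kinds: null is None, strings are bytes.
PdfObject = Union[None, bool, int, float, bytes, Name, list, dict, Ref, Stream]


def kind(obj) -> str:
    """Return the PDF kind of a Python value in this model."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # tested before int: bool is a subclass of int
        return "boolean"
    if isinstance(obj, (int, float)):
        return "number"
    if isinstance(obj, bytes):
        return "string"
    if isinstance(obj, Name):
        return "name"
    if isinstance(obj, list):
        return "array"
    if isinstance(obj, dict):
        return "dictionary"
    if isinstance(obj, Ref):
        return "reference"
    if isinstance(obj, Stream):
        return "stream"
    raise TypeError(f"not a PDF object: {obj!r}")


# << /key (value) /foo 123 >> from the slide, expressed in this model.
example = {Name("key"): b"value", Name("foo"): 123}
```

Keeping names and strings as distinct types matters for validation: `/bar` and `(bar)` look alike but are not interchangeable.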

Structure of a PDF file

Organization of a simple PDF file (header, objects, reference table, trailer, end-of-file):

    %PDF-1.7

    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Count 1 /Kids [3 0 R] >>
    endobj

    xref
    0 6
    0000000000 65535 f
    0000000009 00000 n
    0000000060 00000 n
    ...

    trailer
    << /Size 6 /Root 1 0 R >>

    startxref
    428
    %%EOF
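To make the reference table concrete, here is a minimal Python sketch that reads a classical cross-reference section. The function name `parse_xref` and the input convention (a list of lines already stripped of line endings, taken between the `xref` and `trailer` keywords) are assumptions for this example. It rejects anything unexpected rather than guessing.

```python
import re

# One entry: 10-digit offset, 5-digit generation, then n (in use) or f (free).
ENTRY = re.compile(r"^(\d{10}) (\d{5}) ([nf])$")
# One subsection header: first object number, then number of entries.
SUBSECTION = re.compile(r"^(\d+) (\d+)$")


def parse_xref(lines):
    """Parse the lines following the `xref` keyword, up to `trailer`.

    Returns {object number: (offset, generation, in_use)}.
    Raises ValueError on anything unexpected instead of guessing.
    """
    table = {}
    i = 0
    while i < len(lines):
        head = SUBSECTION.match(lines[i])
        if head is None:
            raise ValueError(f"bad subsection header: {lines[i]!r}")
        first, count = int(head.group(1)), int(head.group(2))
        i += 1
        for num in range(first, first + count):
            if i >= len(lines):
                raise ValueError("truncated xref subsection")
            entry = ENTRY.match(lines[i])
            if entry is None:
                raise ValueError(f"bad xref entry: {lines[i]!r}")
            if num in table:
                raise ValueError(f"duplicate entry for object {num}")
            offset, generation = int(entry.group(1)), int(entry.group(2))
            table[num] = (offset, generation, entry.group(3) == "n")
            i += 1
    return table


# First three entries of the table on the slide (the rest is elided there).
xref = parse_xref([
    "0 3",
    "0000000000 65535 f",
    "0000000009 00000 n",
    "0000000060 00000 n",
])
```

Here `xref[1]` is `(9, 0, True)`: object 1 starts at byte offset 9, which is right after the 9-byte header line `%PDF-1.7` and its newline.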

Structure of a PDF file

More complex structures:
- incremental updates,
- object streams,
- linearization.

Incremental update: the original file (header, objects, table + trailer #1, end-of-file #1) is followed by the update (objects, table + trailer #2, end-of-file #2).

Original file:

    %PDF-1.7

    xref
    0 6
    0000000000 65535 f
    0000000009 00000 n
    0000000060 00000 n
    ...
    trailer
    << /Size 6 /Root 1 0 R >>

    startxref
    428
    %%EOF

Incremental update:

    xref
    0 3
    0000000002 65535 f
    0000000567 00001 n
    0000000000 00001 f
    6 1
    0000001234 00000 n
    trailer
    << /Size 7 /Root 1 1 R /Prev 428 >>

    startxref
    1347
    %%EOF
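A reader resolves objects in an updated file by layering each update's table over the previous one, following `/Prev` back to the original. The Python sketch below shows that layering on the entries visible on the slide. The helper names `merge_updates` and `live_objects`, and the tuple layout `(offset, generation, in_use)`, are assumptions for this example.

```python
def merge_updates(tables):
    """Merge cross-reference tables given oldest first; later entries win."""
    merged = {}
    for table in tables:
        merged.update(table)
    return merged


def live_objects(merged):
    """Object numbers still in use once every update has been applied."""
    return sorted(num for num, (_, _, in_use) in merged.items() if in_use)


# Entries visible on the slide; the original table is elided after object 2.
original = {
    0: (0, 65535, False),
    1: (9, 0, True),
    2: (60, 0, True),
}
update = {
    0: (2, 65535, False),  # head of the free list now points to object 2
    1: (567, 1, True),     # object 1 rewritten, now with generation 1
    2: (0, 1, False),      # object 2 freed
    6: (1234, 0, True),    # new object 6
}
merged = merge_updates([original, update])
```

After merging, object 1 resolves to offset 567 with generation 1, which is why the second trailer refers to the root as `1 1 R`.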

Logical structure of a PDF file

Figure: object graph of a document of 17 pages (about 1000 objects).

Graph organization

The graph of objects is organized into sub-structures, especially trees.

Figure: page tree. Nodes shown: Catalog, Root of the page tree, Node, Page 1, Page 2, Page 3, Page 4.
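Checking that such a sub-structure really is a tree can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Caradoc's algorithm: objects are modelled as plain dictionaries keyed by object number, references are plain integers, and the function name `check_page_tree` is hypothetical. The example tree assumes one plausible reading of the figure (the root holds an intermediate node plus pages 3 and 4; the node holds pages 1 and 2).

```python
def check_page_tree(objects, root):
    """Validate a page tree stored as {object number: dict}.

    Inner nodes look like {"Type": "Pages", "Kids": [...], "Count": n}
    and leaves like {"Type": "Page"}; every node except the root also
    carries "Parent". Returns the number of pages, or raises ValueError
    if the structure is not a consistent tree.
    """
    seen = set()

    def visit(num, parent):
        if num in seen:
            raise ValueError(f"object {num} reached twice (cycle or shared node)")
        seen.add(num)
        node = objects[num]
        if node.get("Parent") != parent:
            raise ValueError(f"object {num} has a wrong /Parent")
        if node["Type"] == "Page":
            return 1
        if node["Type"] != "Pages":
            raise ValueError(f"object {num} is not part of a page tree")
        leaves = sum(visit(kid, num) for kid in node["Kids"])
        if node["Count"] != leaves:
            raise ValueError(f"object {num} has a wrong /Count")
        return leaves

    return visit(root, None)


# Assumed shape of the tree in the figure; object numbers are made up.
tree = {
    1: {"Type": "Pages", "Kids": [2, 5, 6], "Count": 4},
    2: {"Type": "Pages", "Kids": [3, 4], "Count": 2, "Parent": 1},
    3: {"Type": "Page", "Parent": 2},  # Page 1
    4: {"Type": "Page", "Parent": 2},  # Page 2
    5: {"Type": "Page", "Parent": 1},  # Page 3
    6: {"Type": "Page", "Parent": 1},  # Page 4
}
pages = check_page_tree(tree, 1)
```

The recursion keeps the sketch short; a checker facing hostile input would more likely use an explicit stack so that a very deep tree cannot exhaust the call stack.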

Graph organization

The table of contents uses doubly-linked lists.

Figure: table of contents. Nodes shown: Catalog, Outline root, three Chapter items, three Section items.

Problematic structure

An attacker may write an invalid structure.

Figure: invalid table of contents. Nodes shown: Catalog, Outline root, three Chapter items, three Section items.
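One way a doubly-linked outline can be invalid is a loop in its sibling chain, which can trap a reader that follows `/Next` without bookkeeping. The Python sketch below walks one level of an outline and rejects loops and inconsistent links. It is a hedged illustration, not Caradoc's code: the function name `check_siblings`, the dictionary model, and the object numbers are assumptions.

```python
def check_siblings(objects, parent):
    """Walk the children of an outline node and return them in order.

    `objects` maps object numbers to dicts using the keys First, Last,
    Next, Prev and Parent, with references stored as plain integers.
    Raises ValueError on a loop or on an inconsistent link.
    """
    node = objects[parent]
    first = node.get("First")
    if first is None:
        if node.get("Last") is not None:
            raise ValueError(f"object {parent} has /Last but no /First")
        return []
    order, seen = [], set()
    prev, cur = None, first
    while cur is not None:
        if cur in seen:
            raise ValueError(f"loop in outline at object {cur}")
        seen.add(cur)
        item = objects[cur]
        if item.get("Prev") != prev:
            raise ValueError(f"object {cur} has a wrong /Prev")
        if item.get("Parent") != parent:
            raise ValueError(f"object {cur} has a wrong /Parent")
        order.append(cur)
        prev, cur = cur, item.get("Next")
    if node.get("Last") != prev:
        raise ValueError(f"object {parent} has a wrong /Last")
    return order


# A valid outline root (object 1) with three chapters (objects 2, 3, 4).
valid = {
    1: {"First": 2, "Last": 4},
    2: {"Parent": 1, "Next": 3},
    3: {"Parent": 1, "Prev": 2, "Next": 4},
    4: {"Parent": 1, "Prev": 3},
}
chapters = check_siblings(valid, 1)

# The same outline, except that the last chapter points back to the first.
looping = {**valid, 4: {"Parent": 1, "Prev": 3, "Next": 2}}
```

Calling `check_siblings(looping, 1)` raises `ValueError` instead of iterating forever. A full check would apply the same walk recursively to each item's own children.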

Demonstration

Demonstration: two examples
- Loop in the outline structure:
  https://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/outlines/cycle.pdf
- Polymorphic file:
  https://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/polymorph/polymorph.pdf

These files were reported to software vendors.

Demonstration

These problems may lead to several attacks:
- Attacks on the structure (denial of service).
- Evasion techniques (attacks taking advantage of implementation discrepancies).

2 Caradoc: a pragmatic solution

Solution proposals

Caradoc verifies a document at three levels:
- File syntax.
- Objects consistency (type checking).
- Higher-level verifications (graph, etc.).

Syntax restriction

At syntax level, guarantee extraction of objects without ambiguity:
- Grammar formalization [3] (BNF).
- Structure restrictions (no updates, no linearization, etc.).
- Systematic rejection of “corrupted” files.

    “When a conforming reader reads a PDF file with a damaged or missing cross-reference table, it may attempt to rebuild the table by scanning all the objects in the file.”
    (ISO 32000-1:2008, annex C.2)

[3] https://github.com/ANSSI-FR/caradoc/tree/master/doc/grammar
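The contrast with the recovery behaviour quoted above can be illustrated by a strict check of the file envelope. The Python sketch below is an assumption-laden example, not Caradoc's actual rule set: the function name `strict_envelope` is hypothetical, only the classical `xref` table is handled, and the policy (header at byte 0, a single trailing `startxref` block, no rebuilding) is chosen for illustration.

```python
import re

# Header must be the very first bytes of the file.
HEADER = re.compile(rb"%PDF-1\.[0-7]\r?\n")
# The file must end with: startxref, an offset, %%EOF.
FOOTER = re.compile(rb"startxref\r?\n(\d+)\r?\n%%EOF\r?\n?\Z")


def strict_envelope(data: bytes) -> int:
    """Return the offset announced by `startxref`, or raise ValueError.

    Nothing is rebuilt or guessed: any deviation rejects the file.
    """
    if HEADER.match(data) is None:
        raise ValueError("header is not at the start of the file")
    footer = FOOTER.search(data)
    if footer is None:
        raise ValueError("file does not end with startxref ... %%EOF")
    offset = int(footer.group(1))
    if data[offset:offset + 4] != b"xref":
        raise ValueError("startxref does not point at a cross-reference table")
    return offset


# A tiny synthetic file built for this example.
body = b"%PDF-1.7\n1 0 obj\nnull\nendobj\n"
table = b"xref\n0 1\n0000000000 65535 f\ntrailer\n<< /Size 1 >>\n"
sample = body + table + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"
xref_offset = strict_envelope(sample)
```

Under this policy, prepending another format's signature (for example `b"PK\x03\x04" + sample`) is rejected immediately, whereas a lenient reader that searches for the header could accept both interpretations of the same file.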


Type checking

At the object level: guarantee semantic consistency.

For this purpose: a type checking algorithm.

Type checking

Example on a Hello World file:

    trailer
    << /Size 7
       /Root 1 0 R
       /Info 6 0 R >>

    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Count 1 /Kids [3 0 R] >>
    endobj

    3 0 obj <<
      /Type /Page
      /MediaBox [0 0 700 200]
      /Parent 2 0 R
      /Contents 4 0 R
      /Resources << /Font << /F1 5 0 R >> >>
    >> endobj

    4 0 obj << /Length 35 >>
    stream
    BT /F1 100 Tf (Hello world !) Tj ET
    endstream
    endobj

    5 0 obj <<
      /Name /F1
      /BaseFont /Helvetica
      /Type /Font
      /Subtype /Type1
    >> endobj

    6 0 obj <<
      /Author (G. E.)
    >> endobj

Type checking

Constraint propagation, shown step by step on the Hello World file.
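The idea of constraint propagation can be sketched as a work-list algorithm: the trailer has a known type, each type says what its keys must refer to, and those expectations are pushed along references until no new object is reached or two expectations conflict. The Python code below is a simplified illustration only. The `SCHEMA` table, the function name `infer_types`, and the modelling of references as plain integers are assumptions, and the schema covers just enough keys for the Hello World file (nested dictionaries such as `/Resources` are left out).

```python
from collections import deque

# Hypothetical, heavily simplified type table: for each dictionary type,
# the expected type of the object each key refers to. A real checker must
# also handle alternatives (a kid of a page tree node may itself be a node).
SCHEMA = {
    "trailer": {"Root": "catalog", "Info": "info"},
    "catalog": {"Pages": "pages"},
    "pages": {"Kids": "page"},
    "page": {"Parent": "pages", "Contents": "content stream"},
    "info": {},
    "content stream": {},
}


def infer_types(objects, trailer):
    """Propagate expected types from the trailer along references.

    `objects` maps object numbers to dicts whose values are object numbers
    (or lists of them). Returns {object number: type name} and raises
    TypeError when one object receives two different types.
    """
    types = {}
    queue = deque()

    def constrain(num, expected):
        known = types.get(num)
        if known is None:
            types[num] = expected
            queue.append(num)
        elif known != expected:
            raise TypeError(f"object {num}: expected {expected}, already {known}")

    def follow(dictionary, type_name):
        for key, expected in SCHEMA[type_name].items():
            target = dictionary.get(key)
            if target is None:
                continue
            for num in target if isinstance(target, list) else [target]:
                constrain(num, expected)

    follow(trailer, "trailer")
    while queue:
        num = queue.popleft()
        follow(objects[num], types[num])
    return types


# References of the Hello World file, reduced to the keys in SCHEMA.
hello = {
    1: {"Pages": 2},
    2: {"Kids": [3]},
    3: {"Parent": 2, "Contents": 4},
    4: {},
    6: {},
}
types = infer_types(hello, {"Root": 1, "Info": 6})
```

The back-reference `/Parent 2 0 R` is handled without looping: object 2 already has the type `pages`, so the constraint is simply confirmed. A trailer whose `/Root` and `/Info` pointed at the same object would raise `TypeError`.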


Type checking

Figure: types of a 17-page document. Legend: action, page, destination, annotation, resource, outline, content stream, font, name tree, other.

More complex verifications

At a higher level:
- Verification of tree structures (page tree, outlines, etc.).
- Other verifications easily integrable in the future (fonts, images, existing analyses, etc.).

3 Application to real-world files

Implementation

Implementation in OCaml from the PDF specification [4].

Figure: validation workflow. Elements shown: PDF, strict parser, relaxed parser, objects, graph of references, extraction of specific objects, type checking, list of types, graph checking, other checks to develop, no error detected, normalization.

[4] https://www.adobe.com/devnet/pdf/pdf_reference.html

Real-world files

10K files collected from random queries on a web search engine.

Some files are directly accepted.

Direct validation:

    PDF (10000 files)
      -> strict parser  -> parsed (1465 files)
      -> type checking  -> type checked (536 files)
      -> graph checking -> no error found (536 files)
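The counts of the direct validation translate into acceptance rates with a one-line computation. The Python snippet below only restates the numbers from the slide; the helper name `rate` is an assumption for this example.

```python
def rate(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts from the direct validation figure.
collected, parsed, type_checked, no_error = 10000, 1465, 536, 536

parsed_rate = rate(parsed, collected)          # share accepted by the strict parser
accepted_rate = rate(no_error, collected)      # share accepted end to end
typed_of_parsed = rate(type_checked, parsed)   # share of parsed files that type check
```

So about 14.65% of the collected files pass the strict parser and 5.36% are accepted without any normalization; every file that type checks also passes graph checking.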

Normalization

Many files do not pass the first stage... But they can be normalized beforehand.

The relaxed parser supports common structures: incremental updates, object streams, etc.

Normalization:

    PDF (10000 files)
      -> relaxed parser   -> parsed (8993 files)
      -> cleaning objects -> normalized (8993 files)

Some files were not normalized: encryption, unrecoverable syntax errors, etc.

Normalization

Validation after normalization:

    normalized (8993 files)
      -> type checking  -> type checked (1429 files) / type error (1391 files)
      -> graph checking -> no error found (1427 files)

Our type-checker detected typos:
- /Blackls1 instead of /BlackIs1,
- /XObjcect instead of /XObject.

We identified incorrect tree structures in the wild.
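Once a type checker has flagged an unknown name, pointing at the closest known name makes such typos easy to diagnose. The Python sketch below does this with the standard `difflib` module. This is an illustrative addition and not a feature attributed to Caradoc: the `suggest` helper, the short `KNOWN_NAMES` list, and the 0.8 cutoff are all assumptions.

```python
import difflib

# A few real PDF names; a full checker would use the complete set it knows.
KNOWN_NAMES = ["BlackIs1", "XObject", "Font", "Catalog", "Pages", "Page"]


def suggest(name, known=KNOWN_NAMES):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
    return matches[0] if matches else None


# The two typos reported on the slide (names written without the slash).
fix_black = suggest("Blackls1")
fix_xobject = suggest("XObjcect")
```

Both typos from the slide resolve to the intended names, while an unrelated name such as `Helvetica` gets no suggestion. The first typo swaps an uppercase `I` for a lowercase `l`, which is nearly invisible in many fonts and a good argument for checking names mechanically.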

Future work

What remains to be done:
- Complete the set of types.
- Check compression filters.
- Check graphic content.
- Check fonts, images, etc.

Conclusion

Summary of our contributions:
- We identified novel issues in PDF parsers.
- We proposed and formalized a simplified syntax for PDF.
- We implemented Caradoc to parse and validate PDF files.

Project page: https://github.com/ANSSI-FR/caradoc

Questions?

https://xkcd.com/1301/