[DISTRIBUTION STATEMENT A] This material has been approved for public release and unlimited distribution.
Experiences Developing an IBM Watson Cognitive Processing Application to Support Q&A of Application Security (Software Assurance) Diagnostics
SEI staff:
Mark Sherman (PI)
Lori Flynn (Tech lead)
Chris Alberts (Assurance SME)
Distribution Statements
Copyright 2016 Carnegie Mellon University
This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute.
NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.
[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].
Preparation: Defining the Problem
Focus assurance questions around information generated during source code evaluation
Three internal experts in secure coding and assurance spent ~2 weeks in preparation:
• Focus on building a corpus of CERT Secure Coding Standards and MITRE’s CWEs
Google Results from Example Query
• Entire document found, no subcomponents
• “Recall” is spread across irrelevant documents (no match quality provided)
• Imprecise excerpt produced
INT33-C. Ensure that division and remainder operations do not result …
https://www.securecoding.cert.org/…/c/INT33-C. =Ensure+that+dividion+and+remaind…
Corpus Training
Config.zip: Customized settings for Apache Solr (including schema)
Data.json
ground_truth.csv
IBM Watson: Loads user configuration and data onto Apache Solr. Watson then trains a machine learning model based on known relevant results and uses this model to provide improved results to end users in response to their questions.
Database Schema Definition
XML file used by Watson and passed through to Solr
Defines the key concepts (keywords) for the domain of discussion
Watson extends keyword types (e.g., “Watson text”)
18 fields defined (16 significant, 2 used by the system)
Training Materials
CSV-formatted file used by Watson to establish context and cognitive baseline
• Collection of questions and “answers” (Solr documents)
• Each question has 3 to 4 ranked answers
• Each schema keyword needs to be exercised in 50-62 questions
• Each original (unfragmented) document needs to be an answer in 100-200 questions

What are the common consequences of CWE-125?, 4ab71dc7be9b63aa9fc56e66e75dbeec6a98ec9074b8f427f592dc19fe3a2201,5,448e86d7880eaa27d71b4810d7bfcf776f8ff434f257a7fb4b6dd2c5e47899a8,3,69e7b7e822c017e611d008cb16a9de996535661ca8e9e1fdfdb26a7c27634b01,1,16a76232b606883d19fe71946d72b439676cebeac8293e8d024e59d04f4ad56b,2,,
What are the common consequences of CWE-127?, 874b072810a74d4dc9583449ad2f873d6f5c0f0065e4aaecbed60c1240abe251,5,fab02c56385917ab1808912649abcd9f3d5db63ecd451ac382ef786c51d4440a,3,dea8ae7d527ee0c53741b5a69535332c288b7fe735bb125358aa2198dadff946,1,f1844eace7f0a7f84fa3c16156aff1c16c111def85dc167b63092cadfc21bfb0,2,,
What is the risk likelihood of CWE-267?, d552947662106e8a191d944f270c9044173528092967f3ce32d18d06c8bffd93,5,855ce52ec9156bc7e272a1a02df53ee850e21f1d665ac97aeb90ef3c1763010a,3,fb6b4baf311808b97333ac39a6cd7949603e75de782a6a84c6aeacec3dd89e97,1,f16964c6b2a4e2dad3d52c07f8432a0a7476a04e886e39cc41aa45db71d723a8,2,,
Give me the programming languages that apply to CWE-234., f6c1ed915e63dcd0a425d96fc4b4596e8952996532b835dd3cdcbb3a67465f03,5,ff56bc00960e1a86f14fe36f55e3d667245c58ede48d033aae06943343ab3085,3,9ec316aa9d7670e5787397f1cdc18c26a01217ef3162d21472988f2bff3fb4df,2,,,,
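The row layout in the sample above (a question, then document-id and relevance pairs, then empty padding fields) can be read back and checked against the stated guidelines. The following is a minimal sketch, not the project's tooling: the function names are hypothetical, the short ids such as `doc-a` stand in for the long document ids, and it counts answers per id as written in the file, without mapping fragments back to their original documents.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample; "doc-a" etc. stand in for the long document ids.
SAMPLE = (
    "What are the common consequences of CWE-125?, doc-a,5,doc-b,3,doc-c,1,doc-d,2,,\n"
    "Give me the programming languages that apply to CWE-234., doc-e,5,doc-f,3,doc-g,2,,,,\n"
)

def parse_ground_truth(text):
    """Return a list of (question, [(doc_id, relevance), ...]) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue
        question = fields[0]
        rest = [f for f in fields[1:] if f != ""]  # drop trailing padding fields
        if len(rest) % 2 != 0:
            raise ValueError(f"unpaired id/relevance in row: {question!r}")
        answers = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
        rows.append((question, answers))
    return rows

def questions_with_bad_answer_count(rows, low=3, high=4):
    """Questions whose number of ranked answers falls outside low..high."""
    return [q for q, answers in rows if not low <= len(answers) <= high]

def answer_counts(rows):
    """How many questions list each document id as an answer."""
    return Counter(doc_id for _, answers in rows for doc_id, _ in answers)

def documents_outside_range(rows, low=100, high=200):
    """Ids that appear as an answer in fewer than low or more than high questions."""
    return sorted(d for d, n in answer_counts(rows).items() if not low <= n <= high)

rows = parse_ground_truth(SAMPLE)
print(questions_with_bad_answer_count(rows))
print(answer_counts(rows).most_common(3))
```

Documents that never appear as an answer are not reported by `documents_outside_range`; catching those would need the full list of document ids from the corpus.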
Over 150,000 questions were generated, so question generation needs to be mechanical
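One way such mechanical generation might look is template expansion: each schema field gets a few question phrasings, and each phrasing is filled in for every corpus entry. This is only a sketch under stated assumptions. The field names, the second phrasing, the `doc-*` ids, and the relevance values are hypothetical, and the 11-column row width is inferred from the sample rows rather than taken from Watson documentation.

```python
import csv
import io

# Hypothetical phrasing templates keyed by schema field. Two of them mirror
# questions shown on the slide; the one marked "invented" is illustrative only.
TEMPLATES = {
    "common_consequences": [
        "What are the common consequences of {name}?",
        "What can go wrong because of {name}?",  # invented
    ],
    "applicable_languages": [
        "Give me the programming languages that apply to {name}.",
    ],
}

ROW_WIDTH = 11  # inferred from the sample rows: question + up to 5 id/relevance pairs

def generate_rows(entries, max_answers=4):
    """Expand every template for every entry into one ground-truth row."""
    rows = []
    for entry in entries:
        for field, answers in entry["answers"].items():
            for template in TEMPLATES.get(field, []):
                row = [template.format(name=entry["name"])]
                for doc_id, relevance in answers[:max_answers]:
                    row += [doc_id, str(relevance)]
                row += [""] * (ROW_WIDTH - len(row))  # pad with empty fields
                rows.append(row)
    return rows

def to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

# Hypothetical corpus entry with placeholder document ids and relevance labels.
ENTRIES = [
    {
        "name": "CWE-125",
        "answers": {
            "common_consequences": [("doc-a", 5), ("doc-b", 3), ("doc-c", 1), ("doc-d", 2)],
            "applicable_languages": [("doc-e", 5), ("doc-f", 3), ("doc-g", 2)],
        },
    },
]

print(to_csv(generate_rows(ENTRIES)), end="")
```

With a handful of phrasings per field and many entries, the row count grows multiplicatively, which is consistent with a question set far too large to write by hand.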
Lessons Learned: System Behavior
Answers must be found in a single document
Retraining means reloading: feedback is collected externally to the application and then used to modify or augment the original training material
• Challenging when a reload runs overnight
Reconciling field names when combining multiple corpora requires significant effort, including reconciling the training data
• Field names become a key architectural consideration when designing a system
Recall (getting a good set of answers) is relatively easy; precision (getting that set in the best order) is difficult
• Best advice received was to carry out user studies to help guide training data
Plan for technology evolution
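The distinction drawn above between finding a good set of answers and ordering that set well can be made concrete with two standard retrieval measures. The sketch below uses recall@k for the set and NDCG@k for the ordering. NDCG is a substitute chosen for illustration, not necessarily what the authors measured, and the sketch assumes that a higher relevance label means a better answer. The ids and labels are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevance, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = {d for d, r in relevance.items() if r > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant)

def dcg(gains):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the returned order divided by the DCG of the ideal order."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance labels for one question (assumed: higher is better).
RELEVANCE = {"a": 5, "b": 3, "c": 2, "d": 1}
BEST_ORDER = ["a", "b", "c", "d"]
WORST_ORDER = ["d", "c", "b", "a"]

for name, order in (("best", BEST_ORDER), ("worst", WORST_ORDER)):
    print(name, recall_at_k(order, RELEVANCE, 4), ndcg_at_k(order, RELEVANCE, 4))
```

Both orderings return the same four documents, so their recall is identical; only the order-sensitive measure separates them.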
Government use rights apply. IBM Watson software (and any dependencies) must be licensed from IBM.
SparkCognition is an IBM Watson business partner (independent software vendor) and has licensed the project materials from CMU for use in their products.