Use Case: Data and Analysis Requirements in Scanning Probe and ...

Jan 09, 2017

Information note (Draft – final numbers subject to verification)

Data and Analysis Requirements in Scanning Probe and Electron Microscopies

S.V. Kalinin,1,2 A. Belianinov,1,2 A. Lupini,1,3 S. Somnath,1,2 E. Strelcov,1,2 S. Jesse1,2

1 Institute for Functional Imaging of Materials, 2 The Center for Nanophase Materials Sciences, and 3 Materials Sciences and Technology Division,

Oak Ridge National Laboratory, Oak Ridge, TN 37831

1.1 Science Use Case

Scanning probe (atomic force, scanning tunneling, etc.) and scanning transmission electron microscopies now form the mainstay of nanoscience by providing capabilities for local characterization and manipulation of matter on the nanometer and atomic scales. Until very recently, development of these techniques was based on the synergy of instrumentation platforms (stability/noise/environment), improved probes and detectors, measurement modalities, and mathematical tools for extraction of materials-specific parameters from imaging data. However, in almost all cases the information provided to the researcher is in the form of 2D images and (in the last decade) 3D spectroscopic imaging data sets, whereas the full information flow within the microscope is unavailable to the operator and end users, and the internal analytics (required for e.g. feedback systems) are limited. Furthermore, the generated data sets are usually manually sub-selected by the researcher for subsequent detailed analysis, further limiting the information generation capability of these tools and precluding data re-use. This paradigm differs significantly from that in e.g. large scattering and synchrotron facilities. Here, we analyze the information aspects of probe and electron microscopy imaging as a first step toward developing systematic solutions for full data utilization and re-use in imaging. Given the traditional gap between the fields, we perform the analysis separately for SPM and STEM, and elucidate commonalities when possible. We also note that SPM operates with scalar (single data stream) excitation and detection signals over a 2D scanning area, whereas STEM allows for much broader variability of detection schemes (0D, 1D, 2D) and scanned areas (2D or 3D). Hence we analyze SPM first and STEM second.
SPM: The SPM group at the CNMS is actively working on the development of scanning probe microscopy techniques for probing bias-induced (ferroelectric polarization switching, electrochemical reactions) and thermal (glass transition, melting) transformations on the nanoscale. In these experiments, the SPM tip focuses an electric or thermal field in a small (5 – 30 nm) region of material, inducing local transformations. In parallel, measured dynamic strain, resonance frequency shift, or quality factor of the cantilever (piezoresponse force microscopy, electrochemical strain microscopy) or tip-surface current (conductive AFM) provides information on processes in the material (polarization, domain size, ionic motion, second phase formation, melting) induced by local stimulus. In the future, the detection strategies can include microwave, Raman, focused X-ray, electron microscopy, and other high-bandwidth local (~10 nm and below) structural and chemical probes.

The uniqueness of this approach is that transformations can be probed in material volumes containing no extended defects or a single one, paving a pathway for studying phase transformations and electrochemical reactions at the single-defect level (as opposed to the volume averaging of typical materials science methods; compare to the impact of molecular unfolding spectroscopy in biomolecular chemistry), a target of crucial importance for materials science in linking defect structure to functionality. The hardware platforms for these studies can be realized on the 30,000+ SPMs worldwide and follow a classical development path of minimizing noise level, improving drift stability, and introducing proper chemical and thermal environments. However, these studies require a drastic improvement in the capability to collect and analyze multidimensional data sets, well beyond the state of the art (2D imaging or 3D spectroscopic imaging) in the field. This can be demonstrated as follows:

• The spatial scanning necessitates data acquisition over a dense 2D grid of points.
• Probing local transformations requires sweeping the local stimulus (tip bias or temperature) while measuring the response.
• All first order phase transitions are hysteretic and hence history dependent. This necessitates first order reversal curve type studies, effectively increasing the dimensionality of the data (e.g. probing Preisach densities).
• First order phase transitions often possess slow time dynamics, necessitating probing kinetic hysteresis (and differentiating it from thermodynamics) by measuring the response as a function of time.
• The detection in force-based SPMs necessitates probing the response in a frequency band around resonance (since the resonant frequency can be position dependent and single-frequency methods fail to capture these changes).

These simple physical arguments illustrate that complete probing of local transformations necessitates a 6D (space × frequency × (stimulus × stimulus) × time) detection scheme, as compared to 1D molecular unfolding spectroscopy. To date, we have realized 5D and tentative 6D detection schemes (first order reversal curves, time relaxation within hysteresis loop methods). The development of these techniques is summarized in Table I. Figure 1 shows the evolution of information volume for selected scanning probe microscopy techniques since their invention.
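To make the data-volume consequence of this dimensionality concrete, the following minimal Python sketch multiplies out the grid of a 6D measurement. The grid sizes are the target values listed for FORC Time BE in Table I. The 8 bytes per sample (for example, amplitude and phase stored as two 32-bit values) is an assumption, chosen because it reproduces the 128 GB quoted in that table; the function name is hypothetical.

```python
from math import prod

def dataset_bytes(shape, bytes_per_sample=8):
    """Size of a dense N-dimensional grid of samples, in bytes.

    bytes_per_sample=8 is an assumption (e.g. amplitude and phase
    stored as two 32-bit values); it is not stated in the text.
    """
    return prod(shape) * bytes_per_sample

# Target FORC Time BE grid from Table I:
# (space x space) x frequency x (voltage x voltage) x time
forc_time_be = (64, 64, 64, 64, 16, 64)

size = dataset_bytes(forc_time_be)
print(f"{prod(forc_time_be):,} samples, {size / 2**30:.0f} GiB")
```

Each added dimension multiplies the volume by its number of steps, which is why moving from 5D to 6D at these grid sizes raises the estimate from gigabytes to more than a hundred gigabytes.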

Figure 1. Evolution of information volume in multidimensional scanning probe microscopies.

STEM: Scanning (transmission) electron microscopies and associated focused ion beam microscopies and spectroscopies (S(T)EM & FIB) are well-established, robust imaging tools that have proved powerful for visualizing the structure and functionality of materials with atomic resolution [1, 2]. The ultimate goal of localized imaging and spectroscopy is to observe and quantitatively correlate structure-property relationships with functionality by evaluating chemical, electronic, optical and phonon properties of individual atomic and nanometer-sized structural elements [3]. Historic improvements in the underlying instrument hardware and data processing technologies have allowed determination of atomic positions with sub-10 pm precision [4, 5], which enabled the visualization of chemical and mechanical strains [6] and order parameter fields including ferroelectric polarization [7-10] and octahedral tilts [11-15]. Ideally, complete studies have to be performed as a function of global stimuli, such as temperature or a uniform electric field applied to the system, as well as local stimuli induced by an additional probe or ionic interactions [16-18]. Furthermore, this combinatorial instrumentation challenge is exacerbated by the wealth of information extracted at both global and local scales, necessitating a drastic improvement in the capability to transfer, store and analyze multidimensional data sets.

Figure 2. Scientific data sizes on the processing and generation ends. Laptop and workstation capabilities are estimated from average machines available on the market today. The 25 GB G-mode STEM entry is a single “small” (see Table 1) 4D (200×200×400×400) data set. The full detector and probe data size is for a single 4D hyperspectral data set where the output at all electron or ion probe positions (2048×2048) is captured on an average-size detector array (2048×2048). The Large Hadron Collider ATLAS detector output is for 2012-2013. The 50-frame movie is captured at 2048×2048 probe positions with a 2048×2048 detector array. The modest G-mode movie is captured at 200×200 probe positions on a 768×768 pixel detector (per exposure), with 256 Ronchigram energy channels over 40 frames.

1.1.1 Present or Near Term (SPM and STEM)

Traditionally, data analysis, storage, and distribution in the scanning probe, electron and ion microscopy domains are the responsibility of an individual staff member or user, with whatever limited data analysis knowledge and capability is available to them. Only an insignificant fraction of the data is analyzed (based on initial screening during the acquisition process), and only a fraction of the analyzed data is published and becomes available for community-wide examination. Many analytical tools are custom designed and are rarely traceable. Delayed use or re-use of data is common for a single PI, but is highly unlikely outside the group by the broader community. At the same time, the value of complete utilization and re-use of data is obvious, and both domain-specific and synergistic opportunities enabled by it can be easily envisioned.

In the last year, the Scanning Probe Microscopy (SPM) group at CNMS, in association with the Oak Ridge Leadership Computing Facility (OLCF) and the Institute for Functional Imaging of Materials (IFIM), made significant strides in implementing a High Performance Computing (HPC) infrastructure called BEAM (Bellerophon Environment for Analysis of Materials) [19]. BEAM enables instrument scientists to leverage the integrated computational and analytical power of ORNL’s Compute And Data Environment for Science (CADES) platform, together with HPC resources at the OLCF and at the National Energy Research Scientific Computing Center (NERSC), to perform near real-time, scalable data analysis via a web-deliverable, cross-platform Java application. At the core of this cluster-based computing system is a web and data server located in CADES that enables multiple, concurrent users to securely upload and manage data, execute materials science workflows, and interactively engage with analysis artifacts. BEAM’s long-term data management services use the CADES large-scale storage system and enable users to easily manipulate remote directories and upload/download new and processed data in their private data storage area as if they were browsing on a local workstation. Additionally, this framework accepts custom data analysis algorithms (developed by mathematicians, computational scientists, and materials scientists) to meet user-defined workflow needs, and allows post-authentication, “push button” execution of dynamically generated workflows on multiple DOE HPC platforms and CADES compute clusters (a.k.a. the “DOE HPC Cloud”). Currently a custom set of algorithms that are broadly described as multivariate analysis, curve fitting and image feature recognition is being implemented on BEAM. These algorithms are being used to process staff and user data for the Band Excitation suite [20], atom finding and local crystallography analysis [21], ptychography [22], and large spectral datasets [23, 24]. The overall workflow is shown in Figure 3.

Figure 3. The BEAM workflow and infrastructure.

The workflow process can be succinctly summarized as follows:
a. Data is generated at the “Scientific Instrument Tier” on an appropriate microscope platform.
b. The data is transferred via the “BEAM User Tier” to the CADES resource, either over a local in-house connection (SCP) from the microscope control resource or via HTTPS from a personal staff/user machine.
c. The CADES resource affords a multi-tier architecture that simultaneously serves as the data storage repository, the BEAM web and metadata server, and the CADES cluster computing resource that executes parallel user workflows.
d. BEAM can then allocate jobs to additional DOE HPC platforms and allows post-authentication, “push button” execution of dynamically generated workflows.

As of now, only a small percentage (1-5%) of the data is analyzed on BEAM, owing to the short lifetime of the project so far. Ideally, 95% or more of the data would be analyzed on such an infrastructure, with a small subset reserved for customized processing and algorithm development.

1.1.2 Future (SPM and STEM)

As illustrated in Tables 1 and 2, current data volumes are already approaching the capacity for analysis on a local compute resource such as a workstation computer. In the near future, computational clusters will be necessary to handle even the simplest operations in data visualization. It is important to note that, unlike in physical probe microscopies (Atomic Force Microscopy (AFM), Scanning Tunneling Microscopy (STM)), data generation in S(T)EM and FIB is at least three orders of magnitude faster. On average, a single high-quality image on a physical probe microscope is collected in approximately five minutes at 512×512 pixels, whereas in a STEM/FIB a single 4k×4k image is captured in well under a second, and images are perhaps the most basic and easiest data type to handle and process. These data generation volumes raise issues not only in processing and storage but also in data transfer, particularly in experiments that rely on real-time feedback to the tool operator. The problem is complicated further by the fact that many of the experiments summarized may run concurrently, with parallel data flows coming from independent detectors. Combined tilt-focal series, time series spectra, or through-focus ptychograms [25], as well as movies that are even minutes in length, occupy large amounts of storage and require throughputs high enough to necessitate live-streaming capabilities from the microscopes for efficient transfer. It is apparent from the trends in Figure 2 and Table 2 that these problems are expected to become more severe in the near future. We envision that operating within an HPC environment will provide the key interface for intimate interaction of experiment and theory.
The multimodal, hyperspectral data collected in these new-generation microscopy techniques is an amalgam of signals that are typically processed independently by well-established theoretical techniques that use self-contained approaches but are rarely cross-validated. These independent analysis workflows are well understood and widely used by theoreticians today in computationally intensive environments. We expect that combining storage, preprocessing, and theoretical efforts will intertwine experiment and theory into a single, streamlined analysis process enabled by an HPC environment. Naturally, the grand goal of uniting these efforts is to enable true theoretical feedback to guide experiment and discovery in near real time.
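The acquisition-rate comparison made earlier in this section (a 512×512 image in about five minutes versus a 4k×4k image in under a second) can be checked with a few lines of Python. This is only an illustrative sketch: 4k is taken to mean 4096 pixels, and one second is used as an upper bound for the STEM/FIB frame time, so the resulting ratio is a lower bound.

```python
def pixel_rate(width, height, seconds):
    """Pixels acquired per second for a single frame."""
    return width * height / seconds

# Figures quoted in the text; one second is an upper bound for STEM/FIB.
spm_rate = pixel_rate(512, 512, 5 * 60)    # physical probe microscope
stem_rate = pixel_rate(4096, 4096, 1.0)    # "well under a second"

ratio = stem_rate / spm_rate
print(f"SPM:   {spm_rate:,.0f} pixels/s")
print(f"STEM:  {stem_rate:,.0f} pixels/s")
print(f"ratio: {ratio:,.0f}x")
```

Under these assumptions the ratio exceeds 10^4, consistent with the statement that S(T)EM and FIB data generation is at least three orders of magnitude faster.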


1.1.3 Data Lifecycle (SPM and STEM)

The data lifecycle follows a familiar cyclical pattern commonly found in Data Life Management literature and replicated in Figure 4.

Figure 4. Data lifecycle process in imaging

Creating Data: The microscope generates almost all of the data. There is some additional descriptor metadata associated with the sample, operator and microscope state, but its size is negligible compared to that of the detector output. The number of detectors can vary, but currently rarely reaches double digits. The raw data output is uncompressed and is currently typically stored as 32-bit integers (software limited), with older microscope detectors limited to 16 bits. The data transfer mechanism from the detector to the storage media depends on the manufacturer, but the most commonly used interfaces are USB, Ethernet and PCI/PCI-e.

Processing Data: Classical processing methods use binning and averaging to improve the signal-to-noise ratio; however, over the last decade the detection hardware has improved significantly, and these methods are now largely used to control data volumes. At the very first stages of the analysis workflow we are interested in collecting the full detector response at the fastest meaningful rates in order to assess tool performance and adjust parameters on the fly. Fast visualization schemes would additionally be of use to monitor the sample and the quality of the output signal.
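As an illustration of the binning mentioned above, the following NumPy sketch sums non-overlapping blocks of a synthetic detector frame. It is a minimal example under stated assumptions: the frame, its constant signal level, and the uncorrelated Gaussian noise are all synthetic, and the function name `bin2d` is hypothetical. For such noise, binning by a factor f reduces the number of stored values by f² and improves the signal-to-noise ratio by roughly f.

```python
import numpy as np

def bin2d(frame, factor):
    """Sum non-overlapping factor x factor blocks of a 2D frame."""
    h, w = frame.shape
    if h % factor or w % factor:
        raise ValueError("frame dimensions must be divisible by factor")
    blocks = frame.reshape(h // factor, factor, w // factor, factor)
    return blocks.sum(axis=(1, 3))

rng = np.random.default_rng(0)
signal = 10.0                                        # hypothetical signal level
frame = signal + rng.normal(0.0, 5.0, (512, 512))    # synthetic noisy frame

binned = bin2d(frame, 4)
snr_raw = frame.mean() / frame.std()
snr_binned = binned.mean() / binned.std()
print(frame.size // binned.size, "x fewer values")
print(f"SNR gain: {snr_binned / snr_raw:.2f}")
```

The same reshaping trick extends to higher-dimensional data sets by binning only the detector axes and leaving the scan axes untouched.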


Analyzing Data: The analysis framework has to be scalable, parallelizable and flexible. Due to the maturity of the field, the large number of instrumental configurations, and the breadth of scientific interests, the analytical backend must be able either to adopt completely new analysis library software or to combine analysis workflows in an unconstrained fashion. Currently the analysis requirements include standard packages for plotting, 2D/3D visualization, file I/O, and mathematical and multivariate statistical processing libraries. In the near future, more exotic neural network, image and hyperspectral registration and segmentation, compressed sensing, and robust file sharing packages will be necessary.

Preserving Data: As a part of the user center service, data stewardship has to be included with the data collection and processing services. Data quotas, length of storage, redundancy, and encryption are only some of the important details that have to be discussed at the proposal preparation stage. However, some form of basic, short-term storage and access for pertinent analyzed and raw data sets has to be available over the lifetime of the user project.

Accessing Data: In the current framework of the user Nano-centers, the data belongs to the Principal Investigator (PI) on the user proposal. Associated students, staff, and co-PIs have access to the data with the permission of the main PI. As such we envision data access during its lifetime on the ORNL compute resources to be limited to the individuals with proper training and security credentials vetted through the ORNL system that are a part of the user project for which the data is being collected.

Re-using Data: BEAM framework at CNMS was conceived with data re-usability and re-analysis in mind. Due to the nature of the experiments and the statistical framework to analyze and refine the data, recurring analysis of the same data set is vital for understanding the underlying physics and chemistry, as well as validity of proposed theoretical models. We expect that users will re-analyze their data sets multiple times, and have in fact built the capability to capture and contain the results of periodic re-analysis into the data file structure. Additionally, the results of such a persistent approach to analysis will be cross-correlated and form some of the very basis of scientific arguments enabled by the BEAM framework.

1.2 Data-centric Requirements: Capabilities, Speeds, and Feeds

SPM: The data generation speed in SPM is presently limited by the bandwidth of the optical detector (~10 MHz) multiplied by the bit depth of the DAQ card (16-32 bit). The information content of the data stream depends on the specific imaging mode, but can be estimated from the typical oscillation amplitude (~0.1 nm in contact modes, 50-100 nm in non-contact modes) and the magnitude of thermal noise in the system. Traditionally, the first step of data utilization is heterodyne filtering (lock-in or phase-locked loop) that compresses the ~10 MHz data stream from the photodetector to a ~1 kHz data stream of amplitude/phase or frequency/amplitude data (the compression is chosen to match the acquisition time of a single spatial pixel, which in turn is controlled by the speed of the topography feedback). In band excitation mode (developed in 2007), excitation is performed at multiple frequencies, effectively multiplexing the data stream to ~100 kHz. In the recently developed G-mode (2015), the full data stream is captured. Functional imaging of materials is achieved by scanning the time, voltage, etc. parameter space at each spatial location, giving rise to multidimensional data sets as summarized in Table I below.
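The compression performed by heterodyne (lock-in) filtering can be illustrated with a short NumPy sketch. This is a simplified digital lock-in under stated assumptions, not the instrument's implementation: the 300 kHz drive frequency, the signal amplitude and phase, and the noise level are hypothetical, and the low-pass stage is a plain block average. Only the ~10 MHz input rate and ~1 kHz output rate come from the text.

```python
import numpy as np

def lock_in(signal, fs, f_ref, n_avg):
    """Demodulate `signal` at f_ref; return amplitude and phase per block.

    Each output point averages n_avg input samples, so the output
    rate is fs / n_avg.
    """
    t = np.arange(signal.size) / fs
    # Mix with a complex reference, then low-pass by block averaging.
    mixed = signal * np.exp(-2j * np.pi * f_ref * t)
    n_blocks = signal.size // n_avg
    iq = mixed[: n_blocks * n_avg].reshape(n_blocks, n_avg).mean(axis=1)
    return 2 * np.abs(iq), np.angle(iq)

fs = 10e6        # photodetector stream, ~10 MHz (from the text)
f_ref = 300e3    # hypothetical cantilever drive frequency
n_avg = 10_000   # 10 MHz / 10,000 = 1 kHz output rate

t = np.arange(50_000) / fs
rng = np.random.default_rng(1)
raw = 0.1 * np.cos(2 * np.pi * f_ref * t + 0.5) + rng.normal(0, 0.05, t.size)

amp, phase = lock_in(raw, fs, f_ref, n_avg)
print(amp.round(3), phase.round(3))
```

Each block of 10,000 raw samples is reduced to one amplitude and one phase value, which is the kind of compression that G-mode avoids by retaining the full stream.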

Table I. Development of multidimensional SPM methods at CNMS

Technique | Dimensionality | Target data set* | Target data size*** | References****
Band Excitation PFM (BE-PFM) | 3D, space and ω | (256×256)×64 | 32 MB | 1, 2, 3, 4, 5, 6, 7
Switching spectroscopy PFM (SS-PFM) | 3D, space and voltage | (64×64)×128 | 4 MB | 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
Time relaxation PFM (TR-PFM) | 3D, space and time | (64×64)×128 | 4 MB | 19, 20, 21
AC sweeps | 4D, space, ω, voltage | (64×64)×64×256 | 512 MB | 22, 23
BE Polarization Switching (BEPS) | 4D, space, ω, voltage | (64×64)×64×128 | 256 MB | 24, 25, 26, 27, 28, 29, 30
BE thermal | 4D, space, ω, temperature | (64×64)×64×256 | 512 MB | 31, 32, 33, 34
Time relaxation BE (TR-BE) | 4D, space, ω, time | (64×64)×64×64 | 64 MB | 35, 36, 35, 37
First order reversal curves (FORC) BEPS | 5D, space, ω, voltage, voltage | (64×64)×64×64×16 | 2 GB | 37, 38, 39, 40, 41, 42
Time relaxation on sweep, BE | 5D, space, ω, voltage, time | (64×64)×64×64×64 | 16 GB | 43, 44
FORC Time BE | 6D, space, ω, voltage, voltage, time | (64×64)×64×64×16×64 | 128 GB (lower resolution realized) | 45
FORC IV BEPS | 5D, space, ω, voltage, cycle | (64×64)×64×64×16 | 4 GB | 42, 46
FORC IV and FORC IV-Z | 4D, space, voltage, cycle | (64×64)×64×20 | 200 MB | 47
Time-resolved Kelvin Probe Force Microscopy (KPFM) | 3D, space, time | (60×20)×1·10^6 | 8 MB | 48, 49, 50
Open loop (OL) BE KPFM | 4D, space, ω, voltage | (256×256)×32×16 | 256 MB | 51, 52
General-mode PFM (G-PFM) | 3D, space and voltage | (256×256)×1.6·10^4 | 4 GB | 53
G-mode Voltage Spectroscopy (G-VS) | ND, space, voltage#,¶ | (256×256)×1.6·10^6 | 400 GB | In development

* Dimensionality is given as (space × space) × frequency × (parameters). Note that the signal can be multimodal (e.g. phase and amplitude of the response, or three vector components of the signal). The highest number of measured variables to date is 8 (phase and amplitude in the on/off state for vertical and lateral signals). The collection of multimodal data multiplies file size by N/2.
** Not realized yet due to data acquisition and processing limitations, but is the ultimate goal for data acquisition and analysis developments.
*** Current data acquisition times are limited by the eigenfrequency of the cantilever in contact mode. However, the introduction of fast DAQ electronics and small cantilevers is expected to push these by a factor of ~10 in the next 2-4 years.
**** Applications for ferroelectric, electrochemical, and biological/macromolecular systems.
# Additional output/spectroscopy channel, e.g. photothermal and electrical excitation.
¶ Voltage resolution determined by analysis of complete spectra.

The acquisition of these compound data sets brings the obvious challenges of data storage, dimensionality reduction, visualization, and interpretation. While the analysis is tailored to new materials systems and detection sequences, we can summarize the typical procedure for Band Excitation (BE) Piezoresponse Force Microscopy operation. Here, the first step in the analysis of a 5D data set is a simple harmonic oscillator fit along the frequency dimension, reducing dimensionality by 1 and giving rise to 4D amplitude, quality factor, and resonant frequency data sets (vs. position and stimulus). For time measurements, the data can be analyzed to yield time delay hysteresis loops (e.g. using proper relaxation function fits). The resultant 3D data sets are fitted using phenomenological models to give 2D images of polarization dynamics (and e.g. their time dispersion). For FORC type measurements, we typically convert to the Preisach type plane and then study the spatial variability of Preisach parameters. For 5D and 4D data sets we regularly use multivariate statistics methods such as principal component analysis (PCA) to explore the variability of materials responses and its relationship to surface morphology. More complex multivariate statistical analysis tools, such as endmember extraction using Bayesian methods, have also been applied to 4D and 5D data sets. Similarly, independent component analysis and k-means clustering may be applied to specific problems. Note that despite the complexity of the analysis procedure, in some cases e.g. 5D data sets can be reduced to two 2D images with readily identifiable physical meaning (e.g. separation of reaction and diffusion in electrochemical systems, or components of relaxation in a ferro-relaxor arising from field-induced phase transitions and ordinary polarization switching). However, some of the multivariate methods, such as the Bayesian unmixing approach, require HPC resources to be realized on the 5D and 6D data sets, due to memory and computation requirements.

General-mode (G-mode) data is typically processed in sections or chunks to ease the handling of large data sizes. Data sections are transformed to the frequency domain via a fast Fourier transform, and signal processing routines such as low-pass filters and noise thresholds are applied to reject noise in the signal. Alternatively, multivariate statistical analysis methods such as PCA may be applied to statistically filter the signal. After filtering, the aforementioned statistical methods are used, as for BE data, to extract relevant material properties.
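The chunked G-mode processing described above can be sketched in NumPy as follows. This is a minimal illustration rather than the group's actual pipeline: the function names, the 4 MHz sampling rate, the 100 kHz test tone, the cutoff, and the noise threshold are all hypothetical values chosen for the synthetic example.

```python
import numpy as np

def filter_chunk(chunk, fs, f_cutoff, noise_floor):
    """Low-pass and noise-threshold one section of a G-mode stream."""
    spectrum = np.fft.rfft(chunk)
    freqs = np.fft.rfftfreq(chunk.size, d=1.0 / fs)
    spectrum[freqs > f_cutoff] = 0                 # low-pass filter
    # Single-sided amplitude (exact for bins other than DC and Nyquist).
    amplitude = 2 * np.abs(spectrum) / chunk.size
    spectrum[amplitude < noise_floor] = 0          # noise threshold
    return np.fft.irfft(spectrum, n=chunk.size)

def filter_stream(stream, chunk_size, **kwargs):
    """Process a long 1D stream section by section to bound memory use."""
    n_chunks = stream.size // chunk_size
    chunks = stream[: n_chunks * chunk_size].reshape(n_chunks, chunk_size)
    return np.concatenate([filter_chunk(c, **kwargs) for c in chunks])

# Synthetic stream: a 100 kHz response buried in white noise.
fs = 4e6
chunk_size = 4000
t = np.arange(5 * chunk_size) / fs
rng = np.random.default_rng(2)
clean = np.cos(2 * np.pi * 100e3 * t)
stream = clean + rng.normal(0.0, 0.2, t.size)

filtered = filter_stream(stream, chunk_size, fs=fs,
                         f_cutoff=500e3, noise_floor=0.05)
print(np.abs(stream - clean).max(), np.abs(filtered - clean).max())
```

Because each chunk is filtered independently, the chunks could equally be read from disk one at a time or distributed across cluster nodes; in this sketch the chunk length is chosen so that the test tone falls exactly on a frequency bin.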

Figure 5. Variation in computational time for BE and G-mode data sets for a range of data sizes.

Figure 5 depicts the time required to process data acquired with the BE and G-mode techniques. The processing capability of a computer is generally limited by its memory and by the speed and parallel processing capability of its processor. Consumer laptops are feasible only for processing BE data smaller than 1 GB. Although workstations and desktops can process large BE and G-mode data sets, processing times can approach 24 hours for data sets exceeding 4-10 GB. High performance computing clusters can provide a 100-500× improvement in processing time and can potentially enable real-time processing.

Figure 6. Limitations in realizing the target BE and G-mode data sizes.

The parameters in Table I illustrate that even for 4D methods data sampling is insufficient to ensure high temporal and energy resolution, whereas for 5D methods these problems are presently critical (e.g. 8 FORC sets are insufficient for Preisach map sampling; ~100 are typically required) and so far preclude 6D imaging, although a low-resolution form has been conducted. Figure 6 illustrates the bottlenecks that currently limit data generation in select experimental methods. They can be broadly classified as limitations in:

Data acquisition – Our current instrumentation software is capable of transferring only a limited number of data samples (10^6) to and from the data acquisition hardware. This limitation results in tradeoffs between the number of FORC cycles, the number of voltage steps in BEPS measurements, and the number of measurements that can be averaged to improve the signal-to-noise ratio. It precludes data sizes in Tr-KPFM and BEPS larger than those currently possible.

Microscope drift – Scanning probe microscopes suffer from drift in the scanner position with respect to the tip. Though the drift can be neglected in measurements that span just a few minutes (e.g. BE PFM or G-PFM), it can be substantial in slower measurements such as Tr-KPFM, BEPS, and FORC-IV BEPS. The drift is exacerbated by the sudden, jerking motion of the scanner as the tip moves between measurement points. It forces tradeoffs in the spatial or voltage resolution of the measurements and thereby precludes acquisition of data sets larger than those currently possible.

Tip wear - Many techniques such as BEPS, FORC IV BEPS and G-VS apply large biases between the tip and the substrate which can result in electrochemical reactions at the tip that can erode the conductive metal layer that covers the tip. Mechanical abrasion, due to friction when scanning, can also result in tip damage which deteriorates the spatial resolution.

Analysis – As described earlier, the analysis time scales linearly with the data size, and our current techniques generate data that can require more than 24 hours of processing time on existing workstations. This limitation precludes analysis of G-mode data sets larger than 8 GB.

Data storage and transfer – Until 2014, data generated by most experimental techniques occupied only a small portion of the data storage drive, while others, such as FORC time BE, generated larger data sets but at a slow rate. The newly developed G-mode techniques can generate as much as 38 MB of data per information channel every second. A few such G-mode data sets can fill existing storage drives within a day. Access to large, fast, cloud-based storage is necessary to store data sets of this size.
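The storage pressure described above can be put into numbers with a short sketch. Only the 38 MB/s per-channel rate comes from the text; the function names and the 2 TB drive size are hypothetical values chosen purely for illustration.

```python
# Rough storage estimate for continuous G-mode acquisition.
# Only RATE_MB_PER_S is taken from the text; the drive size used
# below is an assumed, illustrative figure.
RATE_MB_PER_S = 38  # MB per second, per information channel


def gmode_volume_gb(seconds, channels=1, rate_mb_per_s=RATE_MB_PER_S):
    """Data volume in GB (decimal) produced by a continuous acquisition."""
    return rate_mb_per_s * channels * seconds / 1000.0


def seconds_to_fill(drive_tb, channels=1, rate_mb_per_s=RATE_MB_PER_S):
    """Seconds of continuous acquisition needed to fill a drive of drive_tb TB."""
    return drive_tb * 1e6 / (rate_mb_per_s * channels)


if __name__ == "__main__":
    # One hour of acquisition on a single channel.
    print(f"1 h, 1 channel: {gmode_volume_gb(3600):.1f} GB")
    # Time to fill a hypothetical 2 TB drive on a single channel.
    print(f"2 TB drive full after {seconds_to_fill(2.0) / 3600:.1f} h")
```

Under these assumptions a single channel produces on the order of a hundred gigabytes per hour, which is consistent with the statement that a few data sets per day can exhaust a local drive.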

STEM: Table 2 summarizes common experiment types and their data sizes for a given number of probe positions (x, y) and higher-dimensional energy channels. Looking beyond the proposed values to a 5-10 year outlook, data sizes will continue to grow. Realistically, for each electron we could record the probe position (x, y), the scattering angle (u, v), and the energy loss E (assumed here to be recorded as a single value rather than a complete spectrum), resulting in a 5-dimensional dataset as the most basic data unit. These data could additionally be a function of frame, focus, or some other physical parameter (ptychogram, focal series, tilt series, etc.), adding dimensionality and size. Recording (x, y, u, v, E) requires roughly 20 bytes (five 32-bit integer values) per electron. For a standard imaging case of ~32 pA beam current, which is roughly 200 electrons per microsecond per detector, this gives a data rate of about 4 GB/s.
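The arithmetic behind the per-electron data rate can be checked with a minimal sketch. The beam current, the five recorded values, and the 32-bit width come from the text; the function names are illustrative, and the elementary charge is the standard SI value.

```python
# Per-electron event recording: data rate implied by a given beam current.
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per electron (SI value)


def electrons_per_second(current_amps):
    """Number of electrons per second carried by a beam current."""
    return current_amps / ELEMENTARY_CHARGE_C


def event_data_rate(current_amps, values_per_electron=5, bytes_per_value=4):
    """Bytes per second when (x, y, u, v, E) is stored for every electron."""
    bytes_per_electron = values_per_electron * bytes_per_value  # 20 bytes
    return electrons_per_second(current_amps) * bytes_per_electron


if __name__ == "__main__":
    current = 32e-12  # 32 pA, the "standard imaging case" in the text
    print(f"{electrons_per_second(current) / 1e6:.0f} electrons per microsecond")
    print(f"{event_data_rate(current) / 1e9:.2f} GB/s")
```

The result is in bytes per second, which is why the rate is quoted in GB/s rather than Gb/s.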

| Experiment type | Probe positions | Pixels or channels | Estimated size (32-bit int values) | Near-future outlook (32-bit int values) |
|---|---|---|---|---|
| Spectrum* | 1 | 1k | 4 kB | - |
| Ronchigram [26] | 1 | 1k × 1k | 4 MB | 16 MB |
| Line spectrum | 128 | 1k | 512 kB | 16 MB |
| Image | 1k × 1k | 1 | 4 MB | 64 MB (4k × 4k) × channels |
| Spectrum image | 64 × 64 | 1k | 16 MB | 16 GB (1k × 1k × 4k) |
| Ptychogram [25] | 64 × 64 | 256 × 256 | 1 GB | 4 TB (1k × 1k × 1k × 1k) |
| Focal series | 512 × 512 | 160 | 167 MB | 1 GB (2k × 2k × 160) |
| Tilt series | 1k × 1k | 100 | 400 MB | 1.6 GB (4k × 4k × 100) |
| Time series | 512 × 512 | 100 | 100 MB/frame | Many 4k × 4k × 100 frames (hours) |

Table 1** Current trends and short-term outlook for data generation and sizes in electron and ion beam microscopies.
*Spectra could be electron energy loss spectra, X-ray spectra, or other detector feed.
**These are intended to be ‘typical’ values based on the ORNL systems currently in use, rather than the maximum possible. Depending on the sample, set-up, or particular microscope hardware, values could easily increase by a factor of 2-4.
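The "Estimated size" column of the table appears to follow from probe positions × channels × 4 bytes. The sketch below reproduces that arithmetic under the assumption that "k" means 1024 and that sizes are quoted in binary multiples; the dictionary and helper names are illustrative and not part of the original note. The near-future outlook column is not checked here.

```python
from math import prod

BYTES_PER_VALUE = 4  # 32-bit integer values, as stated in the table header
K = 1024             # assumption: "1k" in the table means 1024


def dataset_bytes(probe_shape, channel_shape):
    """Raw size in bytes: probe positions x detector channels x 4 bytes."""
    return prod(probe_shape) * prod(channel_shape) * BYTES_PER_VALUE


def human(nbytes):
    """Format a byte count using binary multiples (1 kB = 1024 B here)."""
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if nbytes < 1024:
            return f"{nbytes:g} {unit}"
        nbytes /= 1024
    return f"{nbytes:g} PB"


# (probe positions, pixels or channels) for the current-size rows of the table.
EXPERIMENTS = {
    "Spectrum": ((1,), (K,)),
    "Ronchigram": ((1,), (K, K)),
    "Line spectrum": ((128,), (K,)),
    "Image": ((K, K), (1,)),
    "Spectrum image": ((64, 64), (K,)),
    "Ptychogram": ((64, 64), (256, 256)),
    "Focal series": ((512, 512), (160,)),
    "Tilt series": ((K, K), (100,)),
}

if __name__ == "__main__":
    for name, (probe, channels) in EXPERIMENTS.items():
        print(f"{name:15s} {human(dataset_bytes(probe, channels))}")
```

With these assumptions every row matches the table except the focal series, which comes out as 160 MB in binary units; that is 167,772,160 bytes, so the table's 167 MB is presumably the decimal figure.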


1.3 Impediments, Gaps, Needs, Challenges (SPM and STEM)

The following specific impediments can be identified:

1. Data capture/compression/storage:

a. Real-time processing and high data throughput to support real-time or near-real-time experimental feedback. Data transfer rates of at least 4 GB/s are required to address this challenge.

b. A large, accessible data repository for archival and data sharing, with an annual storage capacity of 5 PB.

c. Limited access to sufficient computing resources for the initial processing of generated data. A dedicated system of ~2000 nodes (64 per microscopy tool) with GPU capability is required to address this challenge.

While each of the aforementioned roadblocks is a challenging issue even at the current rates of data generation and analysis, what hinders the future of STEM/FIB can be summarized as the lack of resources to move, store, share and process scientific data. According to even the modest near-future estimates outlined in Tables 1 and 2, continuous, secure data transfer rates of approximately 4 GB/s per microscope are required to sustain tool operation and adequately fulfill the center’s obligation to the user community. The alternative is efficient data compression at the point of generation. Furthermore, global data accessibility is an important prerequisite to analysis, as input from various experts in the field is vital to achieve real scientific progress. Therefore, a flexible data repository that allows fast and secure access to data by a handful of individuals who advise and oversee the analysis process is critical. Finally, providing real-time or near-real-time feedback to the operator requires a computational resource dedicated to each microscope. In the near future, satisfying both data processing and theory requirements for the data stream from each microscope is likely unrealistic; however, previewing the analyzed data and preparing it for serious theoretical effort is an excellent near-term goal.

2. Data analytics and visualization:

- Data visualization for high-dimensional data sets (note that, given the highly regular spatial grids, it is likely to be less complex than for more abstract data sets)

- Infrastructure to support, verify, and reuse custom codes for mapping to physical models (e.g. recognition and classification algorithms based on neural nets, Bayesian endmember extraction, and other component analysis methods)

- Image registration (for mapping multiple data sets over time, e.g. during aging of a battery or upon changing gas pressure or global temperature, when consecutive images can be shifted spatially and must be aligned, e.g. using topography as a reference)

- Development of physics-based statistical tools (e.g. constrained unmixing, etc.)

3. Workforce training:

Many of the data curation problems, and the extraction of materials-specific responses from instrument data, require close interaction between domain scientists and data scientists. Such a dual background is generally the exception, and some amount of professional training (boot camps, intensive courses) is required to fill this gap.

4. Infrastructure for data re-use and integration across user facilities:

The development of a universal framework for full data capture within individual facilities further brings forward the question of their integration, providing an integrated data environment for subsequent re-use, data mining, etc. This necessitates discussion of the interoperability of the chosen architectures, data formats, etc.

In conclusion, the emerging trends place heavy emphasis on combinatorial imaging that correlates spatial, chemical and physical information. Serious challenges in processing these data have been slowly but steadily addressed by the scientific community on many fronts; however, scaling, validating and cross-correlating these independent efforts is a serious roadblock that can only be addressed by close collaboration with high-performance computing and data handling experts. We foresee that close ties with the information technology community will usher in a technical revolution on the scientific fronts by providing the infrastructure for truly close-knit multidisciplinary collaboration.

References

1. Pennycook, S. J.; Nellist, P. D., Scanning Transmission Electron Microscopy: Imaging and Analysis. Springer: New York, 2011.
2. Pennycook, S. J.; Chisholm, M. F.; Lupini, A. R.; Varela, M.; van Benthem, K.; Borisevich, A. Y.; Oxley, M. P.; Luo, W.; Pantelides, S. T., Materials Applications of Aberration-Corrected Scanning Transmission Electron Microscopy. In Advances in Imaging and Electron Physics, Vol 153, Hawkes, P. W., Ed. Elsevier Academic Press Inc: San Diego, 2008; Vol. 153, pp 327-+.
3. Mody, C., Instrumental Community: Probe Microscopy and the Path to Nanotechnology. MIT Press: 2011.
4. Yankovich, A. B.; Berkels, B.; Dahmen, W.; Binev, P.; Sanchez, S. I.; Bradley, S. A.; Li, A.; Szlufarska, I.; Voyles, P. M., Picometre-Precision Analysis of Scanning Transmission Electron Microscopy Images of Platinum Nanocatalysts. Nature Communications 2014, 5.
5. Kim, Y. M.; He, J.; Biegalski, M. D.; Ambaye, H.; Lauter, V.; Christen, H. M.; Pantelides, S. T.; Pennycook, S. J.; Kalinin, S. V.; Borisevich, A. Y., Probing Oxygen Vacancy Concentration and Homogeneity in Solid-Oxide Fuel-Cell Cathode Materials on the Subunit-Cell Level. Nature Materials 2012, 11, 888-894.
6. Kim, Y. M.; Morozovska, A.; Eliseev, E.; Oxley, M. P.; Mishra, R.; Selbach, S. M.; Grande, T.; Pantelides, S. T.; Kalinin, S. V.; Borisevich, A. Y., Direct Observation of Ferroelectric Field Effect and Vacancy-Controlled Screening at the BiFeO3/LaxSr1-xMnO3 Interface. Nature Materials 2014, 13, 1019-1025.
7. Chang, H. J.; Kalinin, S. V.; Morozovska, A. N.; Huijben, M.; Chu, Y. H.; Yu, P.; Ramesh, R.; Eliseev, E. A.; Svechnikov, G. S.; Pennycook, S. J., et al., Atomically Resolved Mapping of Polarization and Electric Fields across Ferroelectric/Oxide Interfaces by Z-Contrast Imaging. Advanced Materials 2011, 23, 2474-+.
8. Nelson, C. T.; Winchester, B.; Zhang, Y.; Kim, S. J.; Melville, A.; Adamo, C.; Folkman, C. M.; Baek, S. H.; Eom, C. B.; Schlom, D. G., et al., Spontaneous Vortex Nanodomain Arrays at Ferroelectric Heterointerfaces. Nano Letters 2011, 11, 828-834.
9. Jia, C. L.; Nagarajan, V.; He, J. Q.; Houben, L.; Zhao, T.; Ramesh, R.; Urban, K.; Waser, R., Unit-Cell Scale Mapping of Ferroelectricity and Tetragonality in Epitaxial Ultrathin Ferroelectric Films. Nature Materials 2007, 6, 64-69.
10. Jia, C. L.; Urban, K. W.; Alexe, M.; Hesse, D.; Vrejoiu, I., Direct Observation of Continuous Electric Dipole Rotation in Flux-Closure Domains in Ferroelectric Pb(Zr,Ti)O3. Science 2011, 331, 1420-1423.
11. Jia, C. L.; Mi, S. B.; Faley, M.; Poppe, U.; Schubert, J.; Urban, K., Oxygen Octahedron Reconstruction in the SrTiO3/LaAlO3 Heterointerfaces Investigated Using Aberration-Corrected Ultrahigh-Resolution Transmission Electron Microscopy. Physical Review B 2009, 79.
12. Kim, Y. M.; Kumar, A.; Hatt, A.; Morozovska, A. N.; Tselev, A.; Biegalski, M. D.; Ivanov, I.; Eliseev, E. A.; Pennycook, S. J.; Rondinelli, J. M., et al., Interplay of Octahedral Tilts and Polar Order in BiFeO3 Films. Advanced Materials 2013, 25, 2497-2504.
13. Borisevich, A. Y.; Chang, H. J.; Huijben, M.; Oxley, M. P.; Okamoto, S.; Niranjan, M. K.; Burton, J. D.; Tsymbal, E. Y.; Chu, Y. H.; Yu, P., et al., Suppression of Octahedral Tilts and Associated Changes in Electronic Properties at Epitaxial Oxide Heterostructure Interfaces. Physical Review Letters 2010, 105.
14. He, J.; Borisevich, A.; Kalinin, S. V.; Pennycook, S. J.; Pantelides, S. T., Control of Octahedral Tilts and Magnetic Properties of Perovskite Oxide Heterostructures by Substrate Symmetry. Physical Review Letters 2010, 105.
15. Borisevich, A.; Ovchinnikov, O. S.; Chang, H. J.; Oxley, M. P.; Yu, P.; Seidel, J.; Eliseev, E. A.; Morozovska, A. N.; Ramesh, R.; Pennycook, S. J., et al., Mapping Octahedral Tilts and Polarization across a Domain Wall in BiFeO3 from Z-Contrast Scanning Transmission Electron Microscopy Image Atomic Column Shape Analysis. ACS Nano 2010, 4, 6071-6079.
16. Chang, H. J.; Kalinin, S. V.; Yang, S.; Yu, P.; Bhattacharya, S.; Wu, P. P.; Balke, N.; Jesse, S.; Chen, L. Q.; Ramesh, R., et al., Watching Domains Grow: In-Situ Studies of Polarization Switching by Combined Scanning Probe and Scanning Transmission Electron Microscopy. Journal of Applied Physics 2011, 110, 052014.
17. Nelson, C. T.; Gao, P.; Jokisaari, J. R.; Heikes, C.; Adamo, C.; Melville, A.; Baek, S. H.; Folkman, C. M.; Winchester, B.; Gu, Y. J., et al., Domain Dynamics During Ferroelectric Switching. Science 2011, 334, 968-971.
18. Belianinov, A.; Iberi, V.; Tselev, A.; Susner, M. A.; McGuire, M. A.; Joy, D.; Jesse, S.; Rondinone, A. J.; Kalinin, S. V.; Ovchinnikova, O. S., Polarization Control Via He-Ion Beam Induced Nanofabrication in Layered Ferroelectric Semiconductors. ACS Nano 2015, Under Review.
19. Lingerfelt, E.; Jesse, S.; Belianinov, A.; Endeve, E.; Ovchinnikov, O.; Okatan, M. B.; Symons, C.; Shankar, M.; Archibald, R., Near Real-Time Scalable Analysis of Multi-Dimensional Nanophase Materials Imaging Data with Beam; http://sc15.supercomputing.org/, 2015.
20. Jesse, S.; Vasudevan, R. K.; Collins, L.; Strelcov, E.; Okatan, M. B.; Belianinov, A.; Baddorf, A. P.; Proksch, R.; Kalinin, S. V., Band Excitation in Scanning Probe Microscopy: Recognition and Functional Imaging. Annu. Rev. Phys. Chem. 2014, 65, 519-536.
21. Belianinov, A.; He, Q.; Kravchenko, M.; Jesse, S.; Borisevich, A.; Kalinin, S. V., Identification of Phases, Symmetries and Defects through Local Crystallography. Nature Communications 2015, 6.
22. S., J.; Chi, M.; Belianinov, A.; Kalinin, S. V.; Borisevich, A.; Lupini, A., Big Data Analytics in Scanning Transmission Electron Microscopy Ptychography. In preparation 2015.
23. Belianinov, A.; Vasudevan, R.; Strelcov, E.; Steed, C.; Yang, S. M.; Tselev, A.; Jesse, S.; Biegalski, M.; Shipman, G.; Symons, C., Big Data and Deep Data in Scanning and Electron Microscopies: Deriving Functionality from Multidimensional Data Sets. Advanced Structural and Chemical Imaging 2015, 1, 1-25.
24. Belianinov, A.; Kalinin, S. V.; Jesse, S., Complete Information Acquisition in Dynamic Force Microscopy. Nature Communications 2015, 6.
25. Hoppe, W., Beugung Im Inhomogenen Primärstrahlwellenfeld. I. Prinzip Einer Phasenmessung Von Elektronenbeugungsinterferenzen. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 1969, 25, 495-501.
26. Ronchi, V., Forty Years of History of a Grating Interferometer. Applied Optics 1964, 3, 437-451.


NATURE MATERIALS | VOL 14 | OCTOBER 2015 | www.nature.com/naturematerials 973

Big–deep–smart data in imaging for guiding materials design

Sergei V. Kalinin [1,2]*, Bobby G. Sumpter [1,2,3] and Richard K. Archibald [1,3]

Harnessing big data, deep data, and smart data from state-of-the-art imaging might accelerate the design and realization of advanced functional materials. Here we discuss new opportunities in materials design enabled by the availability of big data in imaging and data analytics approaches, including their limitations, in material systems of practical interest. We specifically focus on how these tools might help realize new discoveries in a timely manner. Such methodologies are particularly appropriate to explore in light of continued improvements in atomistic imaging, modelling and data analytics methods.

The accelerating progress of technology over the past century has driven the need for a broad spectrum of materials and functionalities. For over 70 years, synergistic progress in growth, theory, and fabrication of semiconductor materials and devices has laid the foundation for modern civilization [1–3]. Now, semiconductor technology is being extended to the atomic-level design of materials and devices [4]. However, progress has been remarkably slower for materials with complex functionalities or coupled behaviours, including ferroelectric relaxors and morphotropic systems [5,6], multiferroics [7], spin and cluster glasses [8], nanoscale phase-separated oxides [9,10], and energy storage and conversion systems. Despite the fact that ultimately desired functionalities are well defined (room-temperature superconductivity, giant electromechanical or magnetoelectric couplings [11], efficient room-temperature oxygen reduction reactions [12], and so on), the material systems are unlikely to directly yield the desired properties since the structures underpinning them, much less the pathways to synthesize them, are not well understood. Therefore an ever-increasing spectrum of functionalities required for developing and optimizing materials for energy, information technology, and other applications requires efficient paradigms for materials discovery and design that go beyond serendipitous discoveries and the classical synthesis–characterization–theory approach.

Recently, the confluence of advanced theoretical and simulation methods and the exponential growth of computation capabilities has formed the foundation of the materials genome approach [13]. Here it is recognized that theory and simulation can provide the ability to enhance the efficiency (time to solution) for materials discovery and design. Indeed, exploring large numbers of materials and predicting desired properties have enabled the creation of searchable databases for rapid selection of candidates for experimental studies (often termed high-throughput screening) [14]. However, this theory-driven approach has at least two limitations. First, it needs closer coupling with experiment to better enable feedback, so as to improve the theoretical models used to make predictions. Second, many interesting material functionalities are defined on length scales well beyond the atomic scale, where the number of possible atomic configurations and computational cost both grow exponentially. This is especially the case for materials with spatially inhomogeneous ground states (from relaxors to high-temperature superconductors) that exhibit particularly useful functionalities. Similarly, purely theory- or simulation-based approaches are often difficult to realize due to complications in synthesis, existence and production of defects, and the rich set of possible metastable (dynamic) states. Bridging that gap will require seamless integration of all aspects of theory and simulation with experiment and data.

In this Progress Article, we discuss new opportunities for discovery, design and optimization of novel functional materials enabled by the progress of atomically resolved imaging techniques and high-performance theoretical approaches, and potential pathways to achieve this goal via synergy of imaging, theory and simulations, and big data, as highlighted in Fig. 1. We suggest that imaging data now offers considerably more than merely the illustration of a system’s behaviour; in fact, it contains quantitative structural and functional information. This information is spatially distributed and often has a complex multidimensional nature. Hence, the use of statistical unsupervised learning, decorrelation, clustering and visualization techniques, generally referred to as ‘big data’ approaches, is the first step towards harnessing it. The proper use of big data, or the vast amount of data that can be measured and simulated, can act as a bridge between theory and functional imaging. It can further be extended to what we term ‘deep data’, by fusing scientific knowledge of the physics and chemistry of the system to big-data analysis, and ‘smart data’, interactive knowledge discovery of big and deep data made possible by machine learning technologies implemented interactively at all stages of the scientific discovery process, from instrument operation to data analysis.

[1] Institute for Functional Imaging of Materials, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA. [2] Center for Nanophase Materials Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA. [3] Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA. *e-mail: [email protected]

PROGRESS ARTICLE | PUBLISHED ONLINE: 23 SEPTEMBER 2015 | DOI: 10.1038/NMAT4395

© 2015 Macmillan Publishers Limited. All rights reserved

Imaging as a quantitative tool

Recent progress in high-resolution, real-space imaging techniques such as scanning transmission electron microscopy (STEM) [15–17], scanning tunnelling microscopy (STM) [18,19], and (non-contact) atomic force microscopy (AFM) [20] has allowed direct and efficient imaging of atomic columns and surface atomic structures. For decades, these techniques allowed direct visualization of the structure of matter, providing information on structural motifs underlying low-symmetry crystals, grain boundaries, dislocation cores, and quasicrystals, providing insight into the chemistry of these systems. In the last decade, the resolution (more specifically, the information limit [21] or precision [22]) of these methods has improved enough to quantify the picometre-level [23] displacement of atoms from idealized high-symmetry positions, thereby providing direct insight into chemical, electrochemical and physical behaviour. Examples in the field of aberration-corrected (S)TEM include direct imaging of ferroelectric polarization [24–27], octahedral tilts [28,29], and chemical expansion strains [30]. Continued progress to even higher resolutions will enhance the precision of these measurements as well as reveal new properties, for example thermal vibration amplitudes. These opportunities will be enabled by the development of high-stability instrumentation, as well as the development of mathematical tools for quantification of structure from STEM and STM data based on parameter estimation methods, as well as blind and physics-based reconstructions [21,31,32]. Such progress will be further assisted by the development of mathematical and simulation tools for the description of image formation mechanisms, which will enable achieving limiting precision levels for structural analysis. Other examples are high-resolution STM and non-contact AFM, which provide real-space atomic and electronic structures of material surfaces, visualizing structures of molecular vibration levels and phonons, complex electronic phenomena [33,34], and even chemical bonds. This progress now allows quantitative measurements of local bond lengths and bond angles, which are structural parameters directly related to local chemical reactivity and electronic and magnetic properties.

In parallel, scanning probe and electron microscopies enable a broad range of spectroscopies, providing information on local functionality including electronic, dielectric, and chemical properties (for example, electron energy-loss spectroscopy in STEM or force–distance curves in AFM). This functional imaging necessarily gives rise to multidimensional datasets, with 3D and 4D imaging becoming common and higher dimensional modes being actively developed. Jointly, structural and functional imaging capabilities offer a unique opportunity to explore structure–property relationships on the level of single atoms and chemical bonds, linking local electronic, magnetic, and superconductive functionalities to local bond lengths and angles. Similar to structural imaging, the development of physical models for image formation mechanisms and associated theoretical tools will enable transitions from experimentally measured spectra to materials-specific functional behaviour defined at each location.

Beyond static structural and functional imaging, both probe and electron microscopies allow direct observation of the dynamics of atomic and molecular species and point defects, providing insight into the dynamic evolution of materials [35,36–38]. Finally, we can even arrange atoms in desired configurations, controlling matter via current- and force-based scanning probes [39–41] and electron beams [35]. This completes a full loop in the materials discovery and design cycle, from observation to (theory-based) prediction and to atomic-level control of matter.

Big data in imaging and simulation

Progress in high-resolution structural and functional imaging opens the floodgates for quantitative information on structure and properties at the atomic level, observed in either static or dynamic regimes. Remarkably, similar progress has been achieved on the theory side, via density functional theory, molecular dynamics, and mesoscale and multiscale modelling. However, to date, capabilities of understanding and harnessing experimental information have been limited. In most cases, theory–experiment integration proceeds through multiple, off-line iterative cycles involving interactions between theorists and experimentalists. Typically, we look for only expected phenomena (that is, our opinions are biased), and serendipitous behaviours are identified only if their experimental signatures are clear. Even on a more basic level, the vast majority of high-quality experimental data is never analysed beyond a cursory examination, and an even smaller fraction ever gets published, rendering it inaccessible to the scientific community. Similarly, the dearth of physical models often precludes direct reduction of the multidimensional spectroscopic data to material-specific information (property maps).

The first step towards full information recovery from high-resolution structural and functional imaging data is the community-wide adoption of big-data analytics [42,43], which generally refers to the unsupervised learning, dimensionality reduction, and clustering techniques that have found broad applicability in many aspects of everyday life in the last decade [44]. The development of unsupervised image analysis tools targeted to high-performance computing platforms has demonstrated the ability to allow full analysis of atomic configurations in high-resolution imaging data in 2D in real time [45]. With some recent developments opening pathways for partial information on predominant ordering in 3D and the emergence of the focal-series-based methods for 3D structure reconstructions [21,22], the need for high-performance computing environments will only grow.
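As one illustration of the dimensionality reduction mentioned above, the sketch below applies principal component analysis (via a singular value decomposition) to a small synthetic spectroscopic image. The data, array sizes, and function name are assumptions made for this example; it is not the analysis pipeline of the cited works.

```python
import numpy as np


def pca_spectral_image(cube, n_components):
    """PCA of a spectral image of shape (ny, nx, n_channels).

    Returns spatial score maps, spectral components, and the fraction of
    variance carried by each retained component.
    """
    ny, nx, nch = cube.shape
    X = cube.reshape(ny * nx, nch).astype(float)  # one spectrum per row (copy)
    X -= X.mean(axis=0)                           # centre each spectral channel
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    maps = scores.reshape(ny, nx, n_components)
    var_ratio = s ** 2 / np.sum(s ** 2)
    return maps, Vt[:n_components], var_ratio[:n_components]


# Synthetic data: two spatial regions, each with its own spectral peak.
rng = np.random.default_rng(0)
ny, nx, nch = 16, 16, 64
energy = np.linspace(0.0, 1.0, nch)
peak_a = np.exp(-((energy - 0.3) / 0.05) ** 2)
peak_b = np.exp(-((energy - 0.7) / 0.05) ** 2)
weight = np.zeros((ny, nx))
weight[:, nx // 2:] = 1.0                         # right half shows peak_b
cube = (1.0 - weight)[..., None] * peak_a + weight[..., None] * peak_b
cube += 0.01 * rng.standard_normal(cube.shape)    # small measurement noise

maps, components, ratio = pca_spectral_image(cube, n_components=2)
print("variance captured by first component:", round(float(ratio[0]), 3))
```

In this toy case the first score map separates the two regions, which is the kind of unsupervised decorrelation the text refers to; clustering or physics-constrained unmixing would be applied as further steps.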

[Figure 1 panels (i)–(viii). Row labels: Imaging; Big data; Theory. Panel labels: Static matter, Functional matter, Dynamic matter, Controlled matter; Unsupervised learning, Correlative learning, Image recognition, In situ control; Electronic structure, Ab initio dynamics, Molecular dynamics, Multiscale.]

Figure 1 | Bridging theory and imaging for understanding materials structure and functionalities. Imaging techniques allow direct measurements of atomic positions and hence bond lengths and angles (i), local functionalities including chemical states, dielectric properties, and superconducting gap (ii), visualizing atomic, molecular, and defect dynamics in real time (iii), and offer possibilities to control matter on the molecular and atomic level (iv). In parallel, theoretical methods allow detailed studies of atomic (v) and electronic structure (vi), and dynamics of matter (vii) along with prediction of their properties on mesoscopic scales (viii). Lacking, however, are the pathways to bridge theory and experiment. The new advances in data analytics and scientific inference are capable of treating large volumes of data/information and hence linking theory to experiment via microscopic degrees of freedom. For example, efficiently matching imaging information about static structure to theoretical simulations on the same material or, similarly, matching the dynamics of oxygen vacancies in materials to corresponding molecular dynamics simulations. Figure adapted with permission from: (i), ref. 91, ACS; (ii), ref. 92, ACS; (iv), ref. 35, Nature Publishing Group; (v), ref. 93, ACS; (vi), ref. 94, ACS; (viii), ref. 95, ACS.

The capability of measuring atomic positions, and hence bond angles and bond lengths, with picometre precision [23] opens a plethora of exciting physics involving local phases and ferroic variants, interplay between dissimilar structural distortions and functional properties, and necessitates development of local descriptors for local materials structure, as well as definitions of local point and translational symmetry. Whereas physics of materials as studied by scattering is intrinsically linked to the symmetry theory, this bottom-up approach to materials structure will be based on multivariate statistics and graph theory. The combination of dimensional reduction and registration methods should allow direct matching between spectroscopic (that is, functional) and imaging data, as illustrated in Fig. 2. Note that utilization of both structural and spectroscopic data will strongly benefit from the development of physical models of image formation that will allow improvement in precision of atomic position determination and extraction of materials-specific parameters. This combination opens the pathway to the creation of libraries of extant materials configurations and for the development of structure–property relationships on atomic levels.
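The registration step referred to in this paragraph can be illustrated with phase correlation, a standard FFT-based way to estimate a rigid translation between two images. This is a minimal sketch under the assumption of a pure integer-pixel, periodic shift; the function name is illustrative, and real microscope data would typically need sub-pixel and drift-aware methods.

```python
import numpy as np


def estimate_shift(reference, moving):
    """Estimate the integer shift (dy, dx) that aligns `moving` to `reference`.

    The returned shift is the one for which
    np.roll(moving, (dy, dx), axis=(0, 1)) best matches `reference`.
    """
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)
    cross_power = f_ref * np.conj(f_mov)
    cross_power /= np.abs(cross_power) + 1e-12   # keep only the phase
    corr = np.fft.ifft2(cross_power).real        # peak marks the translation
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices past the midpoint correspond to negative shifts (wrap-around).
    return tuple(int(p) if p <= n // 2 else int(p) - n
                 for p, n in zip(peak, corr.shape))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    topo = rng.random((64, 64))                   # stand-in for a topography map
    drifted = np.roll(topo, (3, -5), axis=(0, 1)) # simulated drift
    dy, dx = estimate_shift(topo, drifted)
    aligned = np.roll(drifted, (dy, dx), axis=(0, 1))
    print("correction:", (dy, dx), "aligned:", np.allclose(aligned, topo))
```

Once the shift is known from a reference channel such as topography, the same correction can be applied to the accompanying spectroscopic channels so that structural and functional data refer to the same spatial locations.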

There are several challenges to overcome on the way towards this goal. First and foremost, while both the physics and data analytics communities have achieved a considerable degree of sophistication in the use of respective experimental, theoretical and computational tools, the overlap between these communities is small. This often extends to the level of basic language and philosophy, and will necessitate extensive cross-disciplinary training, as discussed below. Second, all statistical big-data approaches greatly benefit from the availability of universal centralized or distributed databases and repositories. This requires information exchange between active research centres, necessitating development of associated infrastructures, adoption of compatible and potentially universal data formats with traceability, and addressing inevitable intellectual property and socio-cultural issues.

That said, the broad adoption of big-data approaches to imaging data might offer a significant step towards the design and creation of better materials. Full information retrieval from imaging data could significantly increase the efficiency of imaging studies across the scientific community, and also enable unbiased and traceable discoveries of unusual and serendipitous behaviours. Facile information exchange between research groups and imaging facilities will significantly broaden the knowledge base available to each scientist and minimize redundant studies. Ultimately, large volumes of searchable imaging and functional data will allow the building of a database of atomic configurations and associated functionalities in solids, revealing the local physics and chemistry underpinning structure–property relationships in materials.

Notably, data analytics have long been a part of computational materials modelling, from the early efforts of chemometrics to the computation and comparison of specific correlation or response functions with those from experimental data. In the 1990s, stemming from rapid advances in quantum density functional theory [46], the new age of computational-based materials by design began to look promising. This optimism was corroborated by efforts to utilize the power of supervised and unsupervised neural networks, genetic and evolutionary algorithms, along with graph theory and statistics-based methods that demonstrated new capabilities, even though the data available at that time was very modest compared with that available today [47,48]. While theory and intuition have been used to help successfully guide experiment to new materials [49,50], the emergence of practical applications of density functional theory and data analytics (new capabilities emerging in neural nets, genetic algorithms, machine learning) in the late 1980s and early 1990s gave initial hope for accelerating the frequency of successful materials-by-design cases.

Now, with advances in experimental imaging and availability of well-resolved information and big data, along with the notable advances in high-performance computing [51], there is a clear confluence in the required capabilities for advancing ‘materials by design’ [52]. Indeed, there have been striking examples of recent progress such as that highlighted by the Materials Project [53–57] as well as the recently published properties and structures of 134,000 organic molecules by Ramakrishnan and colleagues [58]. All of these advances are pushing towards the goal of realizing materials by design. However, we feel that materials by design still requires further advances to integrate big data in imaging with computational-based methods for materials modelling. In particular, much needed are improved capabilities for inference of the physics contained in imaging data, giving rise to the concepts of deep data, as discussed below.

Deep data in imaging

A well-known dictum states ‘correlation does not imply causation’, meaning that a statistically significant coincidence between two observations does not necessarily imply a causative relationship between the two. This recognition necessitates a transition from big data to deep data, which synergistically combines the physical knowledge of the system and data analytics approaches. The simplest form of such an approach is using theory to fill in missing experimental data. Indeed, a theoretical approach for studying materials offers the advantage that individual interactions can be tuned at will and the full information on structure and computable functionality can be obtained, allowing for direct cause and effect studies in predefined systems. Atomically resolved structural imaging can yield (partial) information on atomic configurations and local bond lengths, but not parameters such as chemical reactivity or local thermodynamic properties. Theory can be used to provide the remaining information based on the experimental inputs as a starting structural model. For example, chemical bond lengths in carbon networks are directly correlated to chemical reactivity. Similarly, bond lengths and angles in perovskites are good indicators of magnetic and transport properties. Alternatively, theoretical modelling can be performed using experimentally available atomic configurations as inputs, essentially using experimental data to freeze some of the degrees of freedom in the system.

Figure 2 | Big data in image analysis. Big data and image analytics in atomically resolved imaging and spectroscopy open a pathway to systematically extract spatially distributed information from structural images, decorrelate and visualize spectroscopic imaging, and ultimately register the two to get structural and functional data for each spatial location. The incorporation of physical models of imaging in the form of proper deconvolutions and parameter extraction can improve precision of these measurements and extraction of materials-specific functionalities. Shown are pathways for analysing the structural and functional images and cross-correlation of resultant data to yield structure–property relationships on the single-defect and atomic-configuration levels. Broad incorporation of this approach across multiple research teams could allow the building of a library of structure–property relationships. The latter could be validated for completeness by scattering methods (both mesoscopic, including focused X-ray, and macroscopic).

[Figure 2 panel labels: Structure analysis; Structure; Identify and classify; Deconvolute; Spectra (3D, 4D, 5D); Properties; Multivariate analysis; Physics and chemistry on single-defect level; Register and deconvolute; Genomic library; Image, Positions, Physics; Phase 1, Phase 2.]

PROGRESS ARTICLE NATURE MATERIALS DOI: 10.1038/NMAT4395

© 2015 Macmillan Publishers Limited. All rights reserved

976 NATURE MATERIALS | VOL 14 | OCTOBER 2015 | www.nature.com/naturematerials

Despite its simplicity, such an approach requires extreme care and expertise in interpreting the experimental data, to correct for instrumental artefacts (for example, calibration, giving systematic errors in measured structural parameters), the partial character of the information (for example, missing atoms of light elements poorly visible in electron microscopy), and systematic errors in theoretical models. Practically, this requires close interactions between theorists and experimentalists who are well aware of the strengths and limitations of their respective methods. Despite the limitations, this ‘theoretical microscope’ approach allows significant insight into material structures and properties. The addition of a big-data component enables researchers to establish statistical significance and veracity of the data, as well as to introduce a means for systematic correction. In some sense, this approach is a direct problem of materials design, that is, prediction of materials functionality from known atomic configurations (as compared with the inverse problem of establishing the atomic configuration that will yield a desired functionality).

A much more interesting and significantly less explored aspect of theory–experiment matching is whether we can improve a theoretical model given experimental observations (Fig. 3). On the qualitative level, this is what imaging provides, namely direct observation of atomic configurations that give more information on local structures, defects, interfaces, and so on. On a semi-quantitative level, the numerical values of observed atomic spacings can indicate the incompleteness of the model, for example the presence of (invisible) light atoms59 or vacancies30. However, virtually unexplored is the potential of quantitative studies, that is, whether the parameters of the mesoscopic or quantum theory can be improved based on these types of high-quality experimental observations and the information on the multiple spatially distributed degrees of freedom they yield. The areas where such analytical capabilities would be heavily impactful are manifold. The questions we can ask include, but are not limited to: can we establish the parameters of quantum theory (for example, Hubbard U or parameters of the chosen pseudopotential60–65) given experimental observations of the atomic structure and functionality for given materials? Can the parameters of Ginzburg–Landau free energy be improved given the observations of the ferroelectric domain structure and topology in a solid? Can the size-dependent thermodynamic properties of nanoparticles be determined from in situ observations of shape evolution during the growth process? Can elementary reaction mechanisms and reaction rates be obtained from the observation of the dynamics of step-edge motion during deposition or dissolution of solids in liquids, or direct observation of molecular motion on surfaces?

A systematic analysis of these problems is possible via the use of Bayesian inference methods for parameter estimation. Material properties can be derived from ensembles of computational simulations that are fit to observational data by parameter estimation66,67. Bayesian inference is one of the two dominant statistical inference methods, the other being frequentist inference, and has the ability to incorporate prior information about known aspects of the underlying physical system being simulated68. Bayesian inference is consistent with the philosophy of science69, in the sense that computation and experimentation require prior knowledge to extract a more complete meaning from their results. It provides a statistical structure to use prior information in prediction and inference and an approach to update predictions of computational models based on additional data and refinement of knowledge of the modelled system. This method of inference fits well with materials-by-design goals and provides a well-developed statistical and mathematical framework to merge experimental data with a range of theoretical models, yielding predictions that become more accurate across the range of possible phenomena as experimental results, computational simulations, and the experimental and theoretical insights of scientists accumulate. The analysis of functional imaging data and its connection to theory will greatly benefit from high-performance computing and workflow software stacks70 that seamlessly integrate experimentation, computational simulation, and machine learning into near-real-time feedback. Recent examples of such inference, covering molecular energies, molecular properties, transmission coefficients in nanoribbons, densities of states, cohesive energies of crystals, and molecular thermochemical properties and combining several levels of theory, have been demonstrated with good success71–77.
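As a minimal sketch of the Bayesian updating described above, the following example estimates a single model parameter on a grid and shows how the posterior narrows as more observations are added. Everything specific here is an assumption made for illustration and is not taken from the cited work: the toy model y = theta * x, the Gaussian noise of known width, the prior values, the synthetic data, and the function names.

```python
import math
import random


def grid_posterior(xs, ys, sigma, grid, prior_mean, prior_sd):
    """Normalized posterior over one parameter theta, evaluated on a grid.

    Hypothetical toy model: y = theta * x + Gaussian noise of known width sigma.
    Prior: Gaussian with mean prior_mean and standard deviation prior_sd.
    """
    log_post = []
    for theta in grid:
        log_prior = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        log_like = sum(-0.5 * ((y - theta * x) / sigma) ** 2 for x, y in zip(xs, ys))
        log_post.append(log_prior + log_like)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean_sd(grid, post):
    """Mean and standard deviation of a discrete posterior."""
    mean = sum(t * p for t, p in zip(grid, post))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, post))
    return mean, math.sqrt(var)


# Synthetic "experiment": true parameter and noise width are assumed values.
rng = random.Random(0)
true_theta, sigma = 2.0, 0.5
xs = [0.1 * i for i in range(1, 41)]
ys = [true_theta * x + rng.gauss(0.0, sigma) for x in xs]

grid = [-3.0 + 0.005 * i for i in range(2001)]  # theta from -3 to 7
prior_mean, prior_sd = 1.5, 1.0  # prior "theory" value, deliberately offset

post_few = grid_posterior(xs[:5], ys[:5], sigma, grid, prior_mean, prior_sd)
post_all = grid_posterior(xs, ys, sigma, grid, prior_mean, prior_sd)

mean_few, sd_few = posterior_mean_sd(grid, post_few)
mean_all, sd_all = posterior_mean_sd(grid, post_all)
print(f"5 observations:  theta = {mean_few:.3f} +/- {sd_few:.3f}")
print(f"40 observations: theta = {mean_all:.3f} +/- {sd_all:.3f}")
```

A grid is used only because the example has one parameter; realistic models with many parameters would more plausibly rely on sampling or the stochastic spectral methods cited above.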

Finally, a completely new spectrum of opportunities for predictive materials design is enabled by a combination of new approaches, such as the Fisher information matrix-based parameter space compression78,79. Sethna and co-authors78 have developed the mathematical formalism that allows systematic compression of the microscopic degrees of freedom accessible in statistical physics models to derive macroscopic variables describing the systems, verified for simple statistical physical models. The application of this parameter compression approach will clearly be of interest for more complex cases, including molecular dynamics simulations and models of disordered solids (for example, spin glasses or relaxors) to establish the nature of the relevant macroscopic variables, including order parameters and other state descriptors. It is naturally of interest whether this approach can be extended to incorporate the experimental data, narrowing down the scope of experimental possibilities and verifying theoretical trajectories versus experiment by introducing experimentally observed behaviours into the coarse-graining procedure. Such a deep-data approach, illustrated in Fig. 3, will allow merging of this knowledge with physical models and provide input into the Materials Genome Initiative80 by enabling a new paradigm of materials research based on theory–experiment matching at the level of microscopic degrees of freedom.

[Figure 3 panel labels: Continuum simulation; Atomistic imaging; Mesoscale imaging; Atomistic simulation; Mesoscale simulation; panels (i)–(v).]

Figure 3 | Deep-data approaches allow scientists to establish or improve the link between theory, simulation and experiment. These include filling in missing experimental properties from theory, refining theoretical models based on experimental data, noise reduction or elimination of outliers via theoretical predictions, and building the pathways to determine macroscopic functionalities from measured and modelled microscopic descriptions via coarse graining, parameter space compression, and machine learning methods. Shown here is such possible discovery for ferroic perovskites, as exemplified by an atomistic molecular dynamics simulation of SrTiO3 (i), polar nanodomains in relaxors (ii), phase field modelling of ferroelectric domains (iii), snapshot of atomically resolved image of vacancy dynamics in ferroelastic (LaSr)CoO3 (iv), and mesoscopic domain patterns in morphotropic perovskites (v). Using deep-data approaches, we aim to match theory and experiment on the level of microscopic degrees of freedom, improve theoretical models with experimental knowledge via Bayesian inference, interpret the mesoscopic imaging data, and ultimately understand the emergence of macroscopic functionality from microscopic degrees of freedom.
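The idea behind parameter space compression can be illustrated with a toy calculation of a Fisher information matrix, whose eigenvalues separate parameter combinations that the data constrain well ("stiff") from those they barely constrain ("sloppy"). The sketch below is not the formalism of refs 78 and 79. It assumes a hypothetical two-parameter model y(t) = exp(-a t) + exp(-b t) observed with independent Gaussian noise, and all names and numerical values are illustrative.

```python
import math


def fisher_information(a, b, times, sigma):
    """Entries (p, q, r) of the 2x2 Fisher matrix [[p, q], [q, r]].

    Assumed model: y(t) = exp(-a t) + exp(-b t) with independent Gaussian
    noise of width sigma, for which FIM = J^T J / sigma^2 and J = dy/d(a, b).
    """
    p = q = r = 0.0
    for t in times:
        da = -t * math.exp(-a * t)  # dy/da
        db = -t * math.exp(-b * t)  # dy/db
        p += da * da
        q += da * db
        r += db * db
    s2 = sigma * sigma
    return p / s2, q / s2, r / s2


def symmetric_2x2_eig(p, q, r):
    """Eigenpairs of [[p, q], [q, r]], larger eigenvalue first (needs q != 0)."""
    half_trace = 0.5 * (p + r)
    disc = math.sqrt((0.5 * (p - r)) ** 2 + q * q)
    pairs = []
    for lam in (half_trace + disc, half_trace - disc):
        vx, vy = q, lam - p  # satisfies (p - lam) * vx + q * vy = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


times = [0.1 * i for i in range(51)]  # observation times from 0 to 5
p, q, r = fisher_information(1.0, 1.2, times, sigma=0.05)
(stiff, v_stiff), (sloppy, v_sloppy) = symmetric_2x2_eig(p, q, r)

print(f"stiff eigenvalue  {stiff:.4g} along {v_stiff}")
print(f"sloppy eigenvalue {sloppy:.4g} along {v_sloppy}")
print(f"eigenvalue ratio  {stiff / sloppy:.1f}")
```

In this toy case the stiff direction has both components of the same sign, so it mixes the two decay rates roughly as their sum, while their difference is poorly determined. That is the sense in which a single effective variable can stand in for two microscopic parameters.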

Smart data in imaging

The incorporation of big and deep data further opens the pathway for the use of machine learning methods in physical materials research81, an approach we refer to here as ‘smart data’. Several recent examples from areas such as computer vision82 potentially integrated with control83, biological imaging84, and cancer research85 demonstrate the tremendous potential of machine learning methods as applied to solving scientific problems. It can be anticipated that a similar or larger impact can be achieved in the field of materials research, especially for exploring large databases of structural and functional imaging data and correlating them to macroscopic properties and synthesis conditions.

The machine learning methods can benefit researchers at all stages of the research process (Fig. 4), including operation of specific microscopes and data analysis, computational synthesis of expertise between different groups, and tapping into collective past knowledge in the field by exploring the (often poorly structured) archival records and data.

On the level of the individual scientist, machine learning methods offer enormous benefits both when acquiring and analysing data. Indeed, the revolution in modern electron and scanning probe microscopy was largely enabled by the introduction of the personal computer, which can automatically tune microscope controls and store and analyse data86. Many modern algorithms for tuning high-level instrumentation already use limited versions of machine learning. Similar opportunities exist for the analysis step, ranging from routine data analysis and automation of repeated and time-consuming operations to providing ‘second opinions’ in detecting anomalies and outliers within the data. Tremendous opportunities for data analysis and interpretation are offered by expert systems that can be trained either by direct algorithm development or through the observation of domain specialists. These new opportunities could be enabled by the coalescence of such expertise across multiple groups and the facilitation of training for the next generation of scientists. Currently, data analysis, interpretation, and training are typically centred within individual groups and are transferred via long-term training within them. Online interaction between different groups can significantly facilitate data interpretation and scientific discovery and reduce the knowledge lost when group members leave.

Finally, smart-data systems could play a large role in the context of accessing extant information. At present, data interpretation and analysis by an individual scientist is based on individual experience augmented by access to the scientific literature in one’s field. However, the basic principles of literature searches remain essentially unchanged, utilizing a set of keywords and names as a basis for search queries, often formulated based on an individual’s experience. Both approaches limit the researcher to the domain of known interest. In many cases, keywords evolve with time, or different keywords are adopted by communities developing in parallel (for example, Kelvin probe force microscopy is also known as scanning surface potential microscopy). Of significant interest is the analysis of text data based on the semantic context of publications87,88 or automated searches within professional social networks89 (unsupervised clustering)90. This will enable development of integrative context- and data-centred search engines that allow putting knowledge in the context of published experimental and theoretical results; development of existing structural and property databases; general search and incorporation of information that went unused or unnoticed when published; and establishment of links between past and present knowledge.
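The limitation of keyword queries noted above can be made concrete with a small sketch. It compares invented one-sentence "abstracts" using bag-of-words TF-IDF vectors and cosine similarity, which is only the simplest stand-in for the semantic analysis of refs 87 and 88. The three texts, the smoothed inverse-document-frequency weighting, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF, log((1+n)/(1+df))."""
    tokens = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(w for toks in tokens for w in set(toks))
    vectors = []
    for toks in tokens:
        tf = Counter(toks)
        vectors.append({w: c * math.log((1 + n) / (1 + df[w])) for w, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Invented toy abstracts: the first two describe the same technique under
# different names, the third is unrelated.
docs = [
    "kelvin probe force microscopy maps surface potential and work function of semiconductor films",
    "scanning surface potential microscopy measures work function variations on semiconductor surfaces",
    "density functional theory predicts band structure of bulk perovskite crystals",
]

vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])

# A literal search for "kelvin probe" would miss the second abstract entirely.
print("keyword hit in second abstract:", "kelvin probe" in docs[1])
print(f"similarity to second abstract: {sim_related:.3f}")
print(f"similarity to third abstract:  {sim_unrelated:.3f}")
```

The second abstract never mentions the search phrase, yet it scores closer to the first than the unrelated one does because the surrounding vocabulary overlaps. A practical system would need much richer text models and far larger corpora than this sketch suggests.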

[Figure 4 panel labels: Expert control; Automatic expert system; Decision making; User model; Experimental data; Timeline. Plot axes: piezoresponse force microscopy signal (a.u.) versus bias (V). Panels (i)–(iii).]

Figure 4 | Illustration of an envisioned smart-data approach to materials discovery. Here the active feedback between the expert and the experimental instrument (i) provides the inputs for training of the automatic expert systems (ii) that subsequently can perform automatic data interpretation (iii). Such a system can also explore information contained in the extant analyses via correlative analysis of citation networks and semantic text content with subsequent knowledge integration. Smart-data systems can provide a gateway to enable training and image interpretation for a large number of entry- and mid-level users of the instrument, and assist advanced users in research. Ideally, we envision an image acquired by a microscope in real time being interpreted in terms of relevant functionalities, and the smart-data system offering suggestions based on exploration history of the potential impact of observed features on materials functionalities, possible origin, and so on.

Towards the materials genome

To date, materials discovery follows a classical synthesis–characterization–theory–computation paradigm driven by individual researchers and strongly affected by serendipity (chance discoveries) and institutional memory. This approach is extremely limited by the finite expertise and personal biases of individual researchers (for example, a material synthesized as a potential oxygen conductor is unlikely to be tested for superconductivity) and by available instrumental capabilities. From the instrumental perspective, large-scale scattering, electron microscopy and nanoscale centres worldwide address the latter limitation for experimentalists, whereas supercomputer centres address it for theorists. However, knowledge integration across the fields and disciplines still remains an issue. As a result, many materials discoveries that could have been made were not made, and many accidental discoveries went unnoticed or are forgotten.

The incorporation of the big-, deep-, and smart-data approaches in imaging coupled with computational-based simulations could potentially enable breakthroughs in the rate and quality of materials discoveries. The use of big-data approaches could enable full information retrieval and exploration of correlations in structural and functional imaging to develop structure–property relationships on the atomic level, and creation of libraries of atomic configurations as well as associated properties. Whereas macroscopic properties yield information on single points in chemical space, local measurements can provide similar information for multiple points in chemical space. This information, when suitably processed, can be directly linked to theoretical simulations to enable effective exploration of material behaviours and properties. Furthermore, knowledge of extant defect configurations in solids can significantly narrow the spectrum of atomic configurations to be probed theoretically, precluding exponential growth of the number of possible configurations with system size. These approaches can further be used to build experimental databases across imaging facilities worldwide, establish links to X-ray and other structural databases, and enable immediate in-line interpretation of information flows from microscopes and simulations (experiment-informed theoretical modelling).

The deep-data analytics could allow merging of this knowledge with physical models, providing input into the Materials Genome Initiative by enabling a new paradigm of materials research based on theory–experiment matching at the level of the microscopic degrees of freedom. This deep data will allow for verification of theoretical models and hence improvement of theory-based predictive capabilities.

Finally, the smart-data approach might enable a new spectrum of paradigms for control of matter and scientific discovery. At the instrument level, it could significantly change the instrumental operation and enable realization of real-time feedback to control matter. At the community level, algorithms for data identification and expert assessment, as well as development of machine-learning-based contexts and information search engines, could enable training of a new generation of scientists. It also has the potential to limit the number of repetitive experimental studies and allow research groups to focus their attention on exploratory work and scientific discovery rather than routine analytics.

Outlook

The pathway towards achieving the vision of accelerating materials by design is clearly nontrivial. The use of big-data techniques relies on well-established image analysis and unsupervised learning tools, with the primary difficulties lying in the need for synchronization of languages between physics, microscopy and big-data analytics communities. This vision could be facilitated by broadening the scope of academic education, developing new university curricula and degrees in undergraduate and graduate studies, which combine statistical learning, programming, physical chemistry, and materials science, and highlighting the intrinsic links between disciplines (for example, statistical physics and information theory). This effort will strongly benefit from the availability of the appropriate infrastructure. On the deep-data level, the appropriate mathematical algorithms are becoming available. However, their implementation on large-scale computing platforms and accessibility to a broad scientific community is still a challenge. Finally, the smart-data level will require broad acceptance of machine learning across multiple disciplines. While at first glance this is daunting, changes in our lifestyle over the last decade, towards a reliance on technology, clearly suggest that this is possible.

Received 9 March 2015; accepted 24 July 2015; published online 23 September 2015

References

1. Riordan, M. & Hoddeson, L. Crystal Fire: The Invention of the Transistor and the Birth of the Information Age (W. W. Norton & Company, 1998).

2. Sze, S. M. Physics of Semiconductor Devices 2nd edn (Wiley-Interscience, 1981).

3. Shockley, W. Electrons and Holes in Semiconductors: With Applications to Transistor Electronics (D. Van Nostrand, 1950).

4. Fuechsle, M. et al. A single-atom transistor. Nature Nanotech. 7, 242–246 (2012).

5. Woodward, D. I., Knudsen, J. & Reaney, I. M. Review of crystal and domain structures in the PbZrxTi1–xO3 solid solution. Phys. Rev. B 72, 104110 (2005).

6. Vugmeister, B. E. Polarization dynamics and formation of polar nanoregions in relaxor ferroelectrics. Phys. Rev. B 73, 174117 (2006).

7. Fiebig, M. Revival of the magnetoelectric effect. J. Phys. D 38, R123–R152 (2005).

8. Binder, K. & Young, A. P. Spin glasses: Experimental facts, theoretical concepts, and open questions. Rev. Mod. Phys. 58, 801–976 (1986).

9. Dagotto, E. Complexity in strongly correlated electronic systems. Science 309, 257–262 (2005).

10. Dagotto, E., Hotta, T. & Moreo, A. Colossal magnetoresistant materials: The key role of phase separation. Phys. Rep. 344, 1–153 (2001).

11. Spaldin, N. A. & Fiebig, M. The renaissance of magnetoelectric multiferroics. Science 309, 391–392 (2005).

12. Adler, S. B. Factors governing oxygen reduction in solid oxide fuel cell cathodes. Chem. Rev. 104, 4791–4843 (2004).

13. Fischer, C. C., Tibbetts, K. J., Morgan, D. & Ceder, G. Predicting crystal structure by merging data mining with quantum mechanics. Nature Mater. 5, 641–646 (2006).

14. Curtarolo, S. et al. The high-throughput highway to computational materials design. Nature Mater. 12, 191–201 (2013).

15. Crewe, A. V. Scanning electron microscopes—is high resolution possible. Science 154, 729–738 (1966).

16. Pennycook, S. J. & Nellist, P. D. (eds) Scanning Transmission Electron Microscopy: Imaging and Analysis (Springer, 2011).

17. Ardenne, M. v. Das elektronen-rastermikroskop. Praktische Ausführung. Z. Tech. Phys. 19, 407–416 (1938).

18. Binnig, G., Rohrer, H., Gerber, C. & Weibel, E. 7 × 7 reconstruction on Si(111) resolved in real space. Phys. Rev. Lett. 50, 120–123 (1983).

19. Binnig, G. & Rohrer, H. Scanning tunneling microscopy. Helv. Phys. Acta 55, 726–735 (1982).

20. Gerber, C. & Lang, H. P. How the doors to the nanoworld were opened. Nature Nanotech. 1, 3–5 (2006).

21. Pennycook, S. J. & Kalinin, S. V. Microscopy: Hasten high resolution. Nature 515, 487–488 (2014).

22. Van Tendeloo, G., Bals, S., Van Aert, S., Verbeeck, J. & Van Dyck, D. Advanced electron microscopy for advanced materials. Adv. Mater. 24, 5655–5675 (2012).

23. Yankovich, A. B. et al. Picometre-precision analysis of scanning transmission electron microscopy images of platinum nanocatalysts. Nature Commun. 5, 4155 (2014).

24. Jia, C. L. et al. Atomic-scale study of electric dipoles near charged and uncharged domain walls in ferroelectric films. Nature Mater. 7, 57–61 (2008).

25. Chang, H. J. et al. Atomically resolved mapping of polarization and electric fields across ferroelectric/oxide interfaces by Z-contrast imaging. Adv. Mater. 23, 2474–2479 (2011).

26. Nelson, C. T. et al. Spontaneous vortex nanodomain arrays at ferroelectric heterointerfaces. Nano Lett. 11, 828–834 (2011).

27. Chisholm, M. F., Luo, W. D., Oxley, M. P., Pantelides, S. T. & Lee, H. N. Atomic-scale compensation phenomena at polar interfaces. Phys. Rev. Lett. 105, 197602 (2010).

28. Borisevich, A. et al. Mapping octahedral tilts and polarization across a domain wall in BiFeO3 from Z-contrast scanning transmission electron microscopy image atomic column shape analysis. ACS Nano 4, 6071–6079 (2010).

29. Jia, C. L. et al. Oxygen octahedron reconstruction in the SrTiO3/LaAlO3 heterointerfaces investigated using aberration-corrected ultrahigh-resolution transmission electron microscopy. Phys. Rev. B 79, 081405 (2009).

30. Kim, Y. M. et al. Probing oxygen vacancy concentration and homogeneity in solid-oxide fuel-cell cathode materials on the subunit-cell level. Nature Mater. 11, 888–894 (2012).

31. Van Aert, S., Van Dyck, D. & den Dekker, A. J. Resolution of coherent and incoherent imaging systems reconsidered—Classical criteria and a statistical alternative. Opt. Express 14, 3830–3839 (2006).

32. Van Aert, S., den Dekker, A. J., Van Dyck, D. & van den Bos, A. High-resolution electron microscopy and electron tomography: Resolution versus precision. J. Struct. Biol. 138, 21–33 (2002).

33. Pan, S. H. et al. Imaging the effects of individual zinc impurity atoms on superconductivity in Bi2Sr2CaCu2O8+δ. Nature 403, 746–750 (2000).

34. Roushan, P. et al. Topological surface states protected from backscattering by chiral spin texture. Nature 460, 1106–1109 (2009).

35. Lin, J. H. et al. Flexible metallic nanowires with self-adaptive contacts to semiconducting transition-metal dichalcogenide monolayers. Nature Nanotech. 9, 436–442 (2014).

36. Ishikawa, R. et al. Direct observation of dopant atom diffusion in a bulk semiconductor crystal enhanced by a large size mismatch. Phys. Rev. Lett. 113, 155501 (2014).

37. Huang, P. Y. et al. Imaging atomic rearrangements in two-dimensional silica glass: Watching silica’s dance. Science 342, 224–227 (2013).

38. Zheng, H. M. et al. Observation of transient structural-transformation dynamics in a Cu2S nanorod. Science 333, 206–209 (2011).

39. Eigler, D. M. & Schweizer, E. K. Positioning single atoms with a scanning tunnelling microscope. Nature 344, 524–526 (1990).

40. Garcia, R., Knoll, A. W. & Riedo, E. Advanced scanning probe lithography. Nature Nanotech. 9, 577–587 (2014).

41. Balke, N., Bdikin, I., Kalinin, S. V. & Kholkin, A. L. Electromechanical imaging and spectroscopy of ferroelectric and piezoelectric materials: State of the art and prospects for the future. J. Am. Ceram. Soc. 92, 1629–1647 (2009).

42. Runkler, T. A. Data Analytics: Models and Algorithms for Intelligent Data Analysis (Vieweg, 2012).

43. Bonnet, N. in Advances in Imaging and Electron Physics Vol. 114 (ed. P. W. Hawkes) 1–77 (Elsevier Academic Press, 2000).

44. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction 2nd edn (Springer, 2009).

45. Belianinov, A. et al. Big data and deep data in scanning and electron microscopies: Deriving functionality from multidimensional data sets. Adv. Struct. Chem. Imaging 1, 1–25 (2015).

46. Parr, R. G. & Yang, W. Density-Functional Theory of Atoms and Molecules (Oxford Univ. Press, 1994).

47. Sumpter, B. G. & Noid, D. W. On the design, analysis, and characterization of materials using computational neural networks. Annu. Rev. Mater. Sci. 26, 223–277 (1996).

48. Sumpter, B. G., Getino, C. & Noid, D. W. Theory and applications of neural computing in chemical science. Annu. Rev. Phys. Chem. 45, 439–481 (1994).

49. Phillips, J. C. & Rabe, K. M. Transport anomalies and internal structural models of stable quasi-crystals. Phys. Rev. Lett. 66, 923–925 (1991).

50. Villars, P., Phillips, J. C. & Chen, H. S. Icosahedral quasi-crystals and quantum structural diagrams. Phys. Rev. Lett. 57, 3085–3088 (1986).

51. Dongarra, J. et al. The International Exascale Software Project roadmap. Int. J. High Perform. Comput. Appl. 25, 3–60 (2011).

52. Materials Genome Initiative; http://go.nature.com/Rkw2mj

53. The Materials Project; https://www.materialsproject.org

54. Jain, A. et al. A high-throughput infrastructure for density functional theory calculations. Comput. Mater. Sci. 50, 2295–2310 (2011).

55. Jain, A. et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).

56. AFLOW; http://materials.duke.edu/aflow.html

57. Setyawan, W. & Curtarolo, S. High-throughput electronic band structure calculations: Challenges and tools. Comput. Mater. Sci. 49, 299–312 (2010).

58. Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, 140022 (2014).

59. Sohlberg, K., Rashkeev, S., Borisevich, A. Y., Pennycook, S. J. & Pantelides, S. T. Origin of anomalous Pt-Pt distances in the Pt/alumina catalytic system. ChemPhysChem 5, 1893–1897 (2004).

60. Bachelet, G. B. & Schluter, M. Relativistic norm-conserving pseudopotentials. Phys. Rev. B 25, 2103–2108 (1982).

61. von Lilienfeld, O. A., Tavernelli, I., Rothlisberger, U. & Sebastiani, D. Optimization of effective atom centered potentials for London dispersion forces in density functional theory. Phys. Rev. Lett. 93, 153004 (2004).

62. Baumeier, B., Kruger, P. & Pollmann, J. Self-interaction-corrected pseudopotentials for silicon carbide. Phys. Rev. B 73, 195205 (2006).

63. von Lilienfeld, O. A. & Schultz, P. A. Structure and band gaps of Ga-(V) semiconductors: The challenge of Ga pseudopotentials. Phys. Rev. B 77, 115202 (2008).

64. von Lilienfeld, O. A. Force correcting atom centred potentials for generalised gradient approximated density functional theory: Approaching hybrid functional accuracy for geometries and harmonic frequencies in small chlorofluorocarbons. Mol. Phys. 111, 2147–2153 (2013).

65. Zhou, F., Cococcioni, M., Marianetti, C. A., Morgan, D. & Ceder, G. First-principles prediction of redox potentials in transition-metal compounds with LDA + U. Phys. Rev. B 70, 235121 (2004).

66. Marzouk, Y. M., Najm, H. N. & Rahn, L. A. Stochastic spectral methods for efficient Bayesian solution of inverse problems. J. Comp. Phys. 224, 560–586 (2007).

67. Marzouk, Y. & Xiu, D. A stochastic collocation approach to Bayesian inference in inverse problems. Commun. Computational Phys. 6, 826–847 (2009).

68. Howson, C. & Urbach, P. Scientific Reasoning: The Bayesian Approach (Open Court, 2006).

69. Robert, C. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer Texts in Statistics) (Springer, 2001).

70. Lingerfelt, E. J., Messer, O. E. B., Desai, S. S., Holt, C. A. & Lentz, E. J. Near real-time data analysis of core-collapse supernova simulations with Bellerophon. Procedia Comput. Sci. 29, 1504–1514 (2014).

71. Rupp, M., Tkatchenko, A., Muller, K. R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).

72. Montavon, G. et al. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 15, 095003 (2013).

73. Lopez-Bezanilla, A. & von Lilienfeld, O. A. Modeling electronic quantum transport with machine learning. Phys. Rev. B 89, 235411 (2014).

74. Ramakrishnan, R., Dral, P. O., Rupp, M. & Anatole von Lilienfeld, O. Big data meets quantum chemistry approximations: The Δ-machine learning approach. J. Chem Theory Comput. 11, 2087–2096 (2015).

75. Pyzer-Knapp, E. O., Suh, C., Gómez-Bombarelli, R., Aguilera-Iparraguirre, J. & Aspuru-Guzik, A. What is high throughput virtual screening? A perspective from organic materials discovery. Annu. Rev. Mater. Sci. 45, 195–216 (2015).

76. Hachmann, J. et al. Lead candidates for high-performance organic photovoltaics from high-throughput quantum chemistry—the Harvard Clean Energy Project. Energy Environ. Sci. 7, 698–704 (2014).

77. Bartok, A. P., Gillan, M. J., Manby, F. R. & Csanyi, G. Machine-learning approach for one- and two-body corrections to density functional theory: Applications to molecular and condensed water. Phys. Rev. B 88, 054104 (2013).

78. Machta, B. B., Chachra, R., Transtrum, M. K. & Sethna, J. P. Parameter space compression underlies emergent theories and predictive models. Science 342, 604–607 (2013).

79. Katsoulakis, M. A. & Plechac, P. Information-theoretic tools for parametrized coarse-graining of non-equilibrium extended systems. J. Chem. Phys. 139, 074115 (2013).

80. Materials Genome Initiative Strategic Plan; http://www.nist.gov/mgi/upload/MGI-StrategicPlan-2014.pdf

81. Spiegelhalter, D. The future lies in uncertainty. Science 345, 264–265 (2014).

82. Ovsjanikov, M., Bronstein, A. M., Bronstein, M. M. & Guibas, L. J. Shape Google: a computer vision approach to invariant shape retrieval. Proc. NORDIA 1, 1 (2009).

83. Zhu, J., Ferguson, D. I. & Dolgov, D. A. System and method for predicting behaviors of detected objects. US patent 8660734 B2 (2014).

84. Tourassi, G. D., Vargas-Voracek, R., Catarious, D. M. & Floyd, C. E. Computer-assisted detection of mammographic masses: A template matching scheme based on mutual information. Med. Phys. 30, 2123–2130 (2003).

85. Scharcanski, J. & Celebi, M. E. (eds) Computer Vision Techniques for the Diagnosis of Skin Cancer (Springer, 2013).

86. Mody, C. C. M. Instrumental Community (The MIT Press, 2011).

87. Reed, J. W. et al. TF-ICF: A new term weighting scheme for clustering dynamic data streams. 5th Int. Conf. Machine Learning Appl. 258–263 (IEEE, 2006).

88. http://cda.ornl.gov/piranha.shtml

89. Bollen, J. et al. Clickstream data yields high-resolution maps of science. PLoS ONE 4, e4803 (2009).

90. Aiello, L. M., Schifanella, R. & State, B. Reading the source code of social ties. Preprint at http://arXiv.org/abs/1407.5547v1 (2014).

91. He, Q., Woo, J., Belianinov, A., Guliants, V. V. & Borisevich, A. Y. Better catalysts through microscopy: Mesoscale M1/M2 intergrowth in molybdenum-vanadium based complex oxide catalysts for propane ammoxidation. ACS Nano 9, 3470–3478 (2015).

92. Lin, W. Z. et al. Direct probe of interplay between local structure and superconductivity in FeTe0.55Se0.45. ACS Nano 7, 2634–2641 (2013).

93. Tselev, A. et al. Oxygen control of atomic structure and physical properties of SrRuO3 surfaces. ACS Nano 7, 4403–4413 (2013).

94. Cruz-Silva, E. et al. Edge-edge interactions in stacked graphene nanoplatelets. ACS Nano 7, 2834–2841 (2013).

PROGRESS ARTICLE NATURE MATERIALS DOI: 10.1038/NMAT4395

© 2015 Macmillan Publishers Limited. All rights reserved


980 NATURE MATERIALS | VOL 14 | OCTOBER 2015 | www.nature.com/naturematerials

95. Romo-Herrera, J. M., Terrones, M., Terrones, H., Dag, S. & Meunier, V. Covalent 2D and 3D networks from 1D nanostructures: Designing new materials. Nano Lett. 7, 570–576 (2007).

Acknowledgements
The authors thank A. Borisevich, H. Christen, J. Morris, and D. Levy, as well as multiple colleagues at ORNL and elsewhere for valuable discussions. R.K.A. acknowledges The Compute and Data Environment (CADES) for continuous support. E. Strelcov and R. Vasudevan are gratefully acknowledged for help with figure preparation. Research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy. A portion of this research was conducted at the Center for Nanophase Materials Sciences, which is a DOE Office of Science User Facility. The algorithmic aspects were sponsored by the applied mathematics program at the DOE and the computational aspects made use of the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility at ORNL supported under contract no. DE-AC05-00OR22725.

Additional information
Reprints and permissions information is available online at www.nature.com/reprints. Correspondence should be addressed to S.V.K.

Competing financial interests
The authors declare no competing financial interests.


Belianinov et al. Advanced Structural and Chemical Imaging (2015) 1:6 DOI 10.1186/s40679-015-0006-6

REVIEW Open Access

Big data and deep data in scanning and electron microscopies: deriving functionality from multidimensional data sets

Alex Belianinov1,2*, Rama Vasudevan1,2, Evgheni Strelcov1,2, Chad Steed1,7, Sang Mo Yang1,2,4,5, Alexander Tselev1,2, Stephen Jesse1,2, Michael Biegalski2, Galen Shipman1,8, Christopher Symons1,7, Albina Borisevich1,3, Rick Archibald1,6 and Sergei Kalinin1,2

Abstract

The development of electron and scanning probe microscopies in the second half of the twentieth century has produced spectacular images of the internal structure and composition of matter with nanometer, molecular, and atomic resolution. Largely, this progress was enabled by computer-assisted methods of microscope operation, data acquisition, and analysis. Advances in imaging technology in the beginning of the twenty-first century have opened the proverbial floodgates on the availability of high-veracity information on structure and functionality. From the hardware perspective, high-resolution imaging methods now routinely resolve atomic positions with approximately picometer precision, allowing for quantitative measurements of individual bond lengths and angles. Similarly, functional imaging often leads to multidimensional data sets containing partial or full information on properties of interest, acquired as a function of multiple parameters (time, temperature, or other external stimuli). Here, we review several recent applications of big and deep data analysis methods to visualize, compress, and translate this multidimensional structural and functional data into physically and chemically relevant information.

Keywords: Scanning probe microscopy; Multivariate statistical analysis; High-performance computing

* Correspondence: [email protected]
1Institute for Functional Imaging of Materials, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
2The Center for Nanophase Materials Sciences, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
Full list of author information is available at the end of the article

© 2015 Belianinov et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

Review

Introduction

The ultimate goal for local imaging and spectroscopy techniques is to measure and correlate structure-property relationships with functionality by evaluating chemical, electronic, optical, and phonon properties of individual atomic and nanometer-sized structural elements [1]. If available directly, information on structure-property correlations at the single-molecule, bond, or defect level enables theoretical models to accurately guide materials scientists and engineers in using materials optimally at any length scale, and allows direct verification of fundamental and phenomenological physical models and direct extraction of the associated parameters.

Particularly significant challenges are offered by spatially inhomogeneous, partially ordered, and disordered systems, ranging from spin glasses [2,3] and ferroelectric relaxors [4,5], to solid-electrolyte interface (SEI) layers in batteries [6] and amorphized layers in fuel cells [7,8], to organic and biological materials. These systems offer a triple challenge: defining relevant local chemical and physical descriptors, probing their spatial distribution, and exploring their evolution in dynamic temperature, light, and chemical and electrochemical reaction processes. While the task is complex, recent progress in information and application [9] of statistics suggests that such descriptions are possible; the challenge is to visualize and explore the data in ways that allow decoupling of various local dynamics under external physical and chemical stimuli.

Ideally, complete studies have to be performed as a function of global stimuli, such as temperature or a uniform electric field applied to the system, as well as local stimuli, using localized electric [10-13], thermal [14-18], or stress fields [19-23] exerted by a scanning probe microscopy (SPM) probe [24-26], either within classical SPM platforms or in combined SPM-scanning transmission electron microscopy (STEM) set-ups [27,28]. A further complication of the detection scheme in force-based SPMs is the need to probe the response in a frequency band around resonance, since the resonant frequency can be position dependent and single-frequency methods fail to capture these changes [29-32].

Additionally, the instrument hardware challenge is exacerbated by the wealth of extracted information at both global and local scales, necessitating a drastic improvement in the capability to collect and analyze multidimensional data sets. For example, probing a local transformation requires sweeping a local stimulus (tip bias or temperature) while measuring the response. Note that all first-order phase transitions are hysteretic and often slow, so the kinetic hysteresis must be measured (and differentiated from thermodynamics) by recording the system response as a function of time. This caveat requires first-order reversal curve-type studies, which effectively increase the dimensionality of the data (e.g., probing Preisach densities [33,34]).

The arguments presented above can be summarized as follows: to achieve complete probing of local transformations in SPM, 6D (space × frequency × (stimulus × stimulus) × time) data detection schemes are necessary. Figure 1a illustrates the data set size and Figure 1b the computational power evolution for 3D to 6D data sets for SPM techniques developed over the last decade. Some further details pertaining to these techniques are given in Table 1. The acquisition of large, compound data sets also brings obvious information technology challenges, namely, data storage, dimensionality reduction, visualization, and interpretation.

Figure 1 Data set size and computational power evolution. (a) Evolution of multidimensional data sets and their sizes over the last decade. Acronym list: BE, band excitation; SSPFM, switching spectroscopy piezoresponse force microscopy; TR PFM, time resolved piezoresponse force microscopy; BE SSPFM, band excitation piezoresponse force microscopy; TR BE, time resolved band excitation; FORC, first-order reversal curve. (b) Typical processing/acquisition time (smaller value is better) on a laptop, desktop, and cluster for multidimensional data sets. Hardware configurations were assumed as follows: laptop - 4-core processor, 8 GB of RAM, integrated video, and 1 hard drive with approximately 1 TB of space; desktop - 12-core processor, 32 GB RAM, dedicated video, 2 hard drives, 4 TB of space; cluster - 10 nodes, each node with 8 processors at 8 cores, 20 GB of RAM, 160 GB storage space.

Additional registration-based problems emerge in combined structural and functional imaging, when the information obtained via a high-resolution structural channel (imaging) is complemented by lower resolution spectroscopic probing collected on a coarse grid. These types of experiments bring about problems associated with drift correction and spatial registration of disparate data sets. Therefore, to identify relevant physical behaviors in the intrinsically high-dimensional data, without a deterministic physical model, clustering and unsupervised learning techniques can be utilized to establish statistically significant correlations in the data sets.

As instrumental platforms and data acquisition electronics become ubiquitous, efficiently storing and handling the large data sets they generate becomes critical. Hence, the key missing element is mastering the 'big data' implicitly present in the (S)TEM/SPM data sets. Here, we review some of the recent advances in the application of big data analysis techniques to structural and functional imaging data. These techniques include unsupervised learning and clustering techniques, supervised neural network-based classification, and deep data analysis of physically relevant multivariate statistics data.

Table 1 Development of multidimensional SPM methods at Oak Ridge National Laboratory

| Technique | Dimensionality | Current data set | File volume | References |
| --- | --- | --- | --- | --- |
| 1. Band excitation (BE) | 3D, space, and ω | (256 × 256) × 64 | 32 MB | [35,68,69,127-129] |
| 2. Switching spectroscopy PFM | 3D, space, and voltage | (64 × 64) × 128 | 4 MB | [61,130-139] |
| 3. Time relaxation PFM | 3D, space, and time | (64 × 64) × 128 | 4 MB | [71,140,141] |
| 4. AC sweeps | 4D, space, ω, voltage | (64 × 64) × 64 × 256 | 512 MB | [142,143] |
| 5. BE SSPFM | 4D, space, ω, voltage | (64 × 64) × 64 × 128 | 256 MB | [11,72,144-148] |
| 6. BE thermal | 4D, space, ω, temp | (64 × 64) × 64 × 256 | 512 MB | [16,17,149,150] |
| 7. Time relaxation BE | 4D, space, ω, time | (64 × 64) × 64 × 64 | 4 MB | [151-153] |
| 8. First-order reversal curves | 5D, space, ω, voltage, voltage | (64 × 64) × 64 × 64 × 16 | 2 GB | [153-157] |
| 9. Time relaxation on sweep, BE | 5D, space, ω, voltage, time | (64 × 64) × 64 × 64 × 64 | 8 GB | [158,159] |
| 10. FORC time BE | 6D, space, ω, voltage, voltage, time | (64 × 64) × 64 × 64 × 16 × 64 | 128 GB | Not yet realized |

ω, frequency; BE, band excitation; PFM, piezoresponse force microscopy; AC, alternating current; SS PFM, switching spectroscopy piezoresponse force microscopy; FORC, first-order reversal curve.

Multivariate statistical methods

The purpose of this section is to familiarize the reader with the basic unsupervised and supervised learning methods used to reduce dimensionality and visualize data behavior in a high-dimensional data set. The material presented in this section gives only a brief overview, and the reader is encouraged to explore the methods further before applying them. Minimal mathematical formalism is presented, as the focus is to explain the functional aspect of each method as applied to spectral and imaging data, to outline each method's strengths and weaknesses, and to give a brief overview of the input and output parameters, if any, to ease the transition to practical use. All of the methods presented below share the same 2D data structure at the input, with rows as observations and columns as variables. This arrangement implies that in a high-dimensional data set, certain dimensions have to be combined. In our work, presented below, we combine dimensions by type; that is, spatial dimensions in X, Y, or Z can be mixed, and similarly energy dimensions, such as AC or DC voltage. Other mixing schemes are also possible and in some areas perhaps necessary. More details on how each method was implemented are given in the respective technical sections.
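The 2D input structure described above (rows as observations, columns as variables) can be sketched with a plain array reshape. This is a minimal illustration, not the authors' code: the helper names `flatten_to_2d` and `volume_bytes` are hypothetical, the spatial axes are assumed to come first, and the 8 bytes per stored value is an assumption that happens to reproduce most file volumes listed in Table 1 (under the same assumption, the time relaxation BE row would come to 128 MB rather than the listed 4 MB, so that entry may deserve verification).

```python
import numpy as np

def flatten_to_2d(data, n_spatial_axes=2):
    """Reshape an N-D data set into (observations, variables).

    Assumes the first `n_spatial_axes` axes are spatial; they are merged into
    rows, and all remaining (spectral/stimulus) axes are merged into columns.
    """
    data = np.asarray(data)
    n_obs = int(np.prod(data.shape[:n_spatial_axes]))
    return data.reshape(n_obs, -1)

def volume_bytes(shape, bytes_per_value=8):
    """Storage for a dense array of the given shape (8 bytes/value is assumed)."""
    n_values = 1
    for size in shape:
        n_values *= int(size)
    return n_values * bytes_per_value

# Small stand-in for a 4D data set ordered as (x, y, frequency, voltage).
cube = np.arange(4 * 4 * 3 * 5, dtype=float).reshape(4, 4, 3, 5)
matrix = flatten_to_2d(cube)
print(matrix.shape)  # (16, 15): 16 pixels, 15 combined spectral variables

# Size of a BE SSPFM-like data set, (64 x 64) x 64 x 128, without allocating it.
print(volume_bytes((64, 64, 64, 128)) / 2**20, "MiB")
```

Each row of `matrix` is the full flattened response at one pixel, so any of the methods below can operate on it directly, and the resulting per-pixel outputs can be reshaped back to the spatial grid.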

Principal component analysis

Perhaps the easiest way to visualize a multidimensional data set is through principal component analysis (PCA), an approach previously reported for various applications in electron and force-based scanning probe microscopy data [35-41]. PCA is widely used across scientific fields and owes its popularity to its ease of use and the wide availability of source code in practically any programming language. The algorithm does not take any parameters besides the data itself and outputs three important results: eigenvectors (arranged from most to least information dense), the respective loading (or score) maps associated with each eigenvector, and a Scree plot that represents the information content as a function of eigenvector number. These three results allow the user to visualize the principal behaviors in the data, through eigenvectors and their loading maps, and to judge the information content of each eigenvector via the Scree plot. PCA, however, suffers from difficulty of interpretation of higher eigenvectors, where the information content typically decreases; from the qualitative nature of information content assignment; and from slow processing for truly large data sets (hundreds of thousands of observations with hundreds of thousands long arrays of variables).

Here we describe the PCA functionality as it applies to a spectral data set collected on a grid. In PCA, a spectroscopic data set of N × M pixels, each holding a spectrum of P points, is converted into a linear superposition of orthogonal, linearly uncorrelated eigenvectors $w_k$:

$$A_i(U_j) = a_{ik} w_k(U_j) \tag{1}$$

where summation over the repeated index k is implied, $a_{ik} \equiv a_k(x, y)$ are position-dependent expansion coefficients or component weights, $A_i(U_j) \equiv A(x, y, U_j)$ is the spectral information at a selected pixel, and $U_j$ are the discrete bias values at which current is measured. The eigenvectors $w_k(U)$ and the corresponding eigenvalues $\lambda_k$ are found from the singular value decomposition of the covariance matrix, $C = A^{T}A$, where A is the matrix of all experimental data points $A_{ij}$, i.e., the rows of A correspond to individual grid points (i = 1, ..., N⋅M), and the columns correspond to voltage points, j = 1, ..., P. With this arrangement C is a P × P matrix, so each eigenvector has one entry per voltage point. The eigenvectors $w_k(U_j)$ are orthogonal and are arranged such that the corresponding eigenvalues are placed in descending order, $\lambda_1 > \lambda_2 > \dots$, i.e., by variance. In other words, the first eigenvector $w_1(U_j)$ contains the most information within the spectral image data, the second contains the most common response after the subtraction of variance from the first one, and so on. In this manner, the first p maps, $a_k(x, y)$, contain the majority of information within the data set, while the remaining P − p sets are dominated by noise. The number of significant components, p, can be chosen based on the overall shape of the $\lambda_k$ dependence on component number or from correlation analysis of the loading maps, $a_{ik} \equiv a_k(x, y)$, which correspond to each of the eigenvectors. Additionally, a Scree plot is used to display the variance of each component as a function of the component number.
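The decomposition in Equation 1 can be sketched with a singular value decomposition of the data matrix. This is a minimal illustration under stated assumptions, not the implementation used by the authors: the function name `pca_svd` is hypothetical, the data are synthetic, and subtracting the mean spectrum before the decomposition is a common choice that the text does not specify.

```python
import numpy as np

def pca_svd(A, center=True):
    """PCA of a (pixels x spectral points) matrix via SVD.

    Returns eigenvectors w_k (rows), loadings a_ik (columns), and eigenvalues
    of A^T A in descending order. Centering is an assumption of this sketch.
    """
    A = np.asarray(A, dtype=float)
    if center:
        A = A - A.mean(axis=0)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    eigenvalues = s ** 2      # eigenvalues of A^T A, already sorted descending
    eigenvectors = Vt         # row k is w_k(U_j)
    loadings = U * s          # column k is a_ik
    return eigenvectors, loadings, eigenvalues

# Synthetic spectral image: 8 x 8 pixels, 32 bias points, two real components.
rng = np.random.default_rng(0)
n_rows, n_cols, n_pts = 8, 8, 32
U_axis = np.linspace(-1.0, 1.0, n_pts)
w_true = np.vstack([U_axis, U_axis ** 2])
a_true = rng.normal(size=(n_rows * n_cols, 2)) * np.array([5.0, 2.0])
A = a_true @ w_true + 0.01 * rng.normal(size=(n_rows * n_cols, n_pts))

eigvecs, loadings, eigvals = pca_svd(A)
scree = eigvals / eigvals.sum()                   # values for a Scree plot
maps = loadings.T.reshape(-1, n_rows, n_cols)     # loading maps a_k(x, y)
print(scree[:4])
```

In this synthetic case the first two entries of `scree` carry nearly all of the variance, which is the kind of break in the eigenvalue dependence that the text suggests using to choose p.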

Independent component analysis

Independent component analysis (ICA) is a method designed to extract presumably independent signals mixed within the data. Much like PCA, the output is a collection of independent spectra and their loading maps. Unlike PCA, however, the order of ICA components is insignificant, ICA takes some input parameters, and it generally takes longer to run than PCA. One of the key ICA parameters is the number of independent components, a decision that can be highly non-trivial to make. Another often overlooked parameter is the number of principal components to retain; ICA uses PCA as a filter, and for low-dimensional data sets, or data sets with relatively few observations, the last retained principal component plays a large role in the quality of the signal separation, as it may allow or bar certain details in the data from being presented to the algorithm.

ICA is part of a family of algorithms aimed at blind source separation, where the objective is to 'un-mix' several sources that are present in a mixed signal [42]. The data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. ICA assumes that the latent variables are non-Gaussian and mutually independent. The problem of blind source separation can be modeled in the following manner:

$$x = As \tag{2}$$

where s is a two-dimensional vector containing the independent signals, A is the mixing matrix, and x is the observed output. As the initial step, ICA whitens the data to remove any correlation; in other words, we are after a linear transformation V such that, if

$$y = Vx \tag{3}$$

then the covariance of y is the identity I:

$$E\{yy'\} = I \tag{4}$$

This is possible with $V = C^{-1/2}$, where $C = E\{xx'\}$, giving us

$$E\{yy'\} = E\{Vxx'V'\} = C^{-1/2} C C^{-1/2} = I \tag{5}$$

After the whitening, independent signals can be approximated by an orthogonal transformation of the whitened signal, rotating the joint density of the mixed signals in a way that maximizes the non-normality of the marginal densities.
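The whitening step in Equations 3 to 5 can be checked numerically. The sketch below covers only the whitening, not the subsequent rotation that maximizes non-normality; the function name `whiten`, the two synthetic sources, and the mixing matrix are all illustrative assumptions.

```python
import numpy as np

def whiten(X):
    """Whiten X (variables x samples) with V = C^(-1/2), C = E{xx'}.

    The mean is removed first, so C is the sample covariance.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    C = Xc @ Xc.T / Xc.shape[1]
    evals, evecs = np.linalg.eigh(C)                 # C is symmetric
    V = evecs @ np.diag(evals ** -0.5) @ evecs.T     # inverse matrix square root
    return V @ Xc, V

# Two non-Gaussian sources mixed by an (assumed) 2 x 2 matrix, as in x = As.
rng = np.random.default_rng(1)
s = np.vstack([rng.uniform(-1.0, 1.0, 5000), rng.laplace(size=5000)])
A = np.array([[1.0, 0.5], [0.3, 2.0]])
x = A @ s

y, V = whiten(x)
cov_y = y @ y.T / y.shape[1]
print(np.round(cov_y, 6))   # close to the 2 x 2 identity
```

After this step the components of `y` are uncorrelated with unit variance but not yet independent; an ICA routine would still need to find the rotation of `y` that separates the sources.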

Bayesian de-mixing

Bayesian de-mixing is a powerful technique that performs well where PCA and ICA fall short. First and foremost, Bayesian de-mixing returns a quantitative result, with the units of the de-mixed spectra being the units of the input data. The de-mixed abundance vectors are also always non-negative and sum to one, which makes the transition from statistics to science quite natural. There are many optional parameters that can be tuned within the Bayesian code, but typically at least the number of independent components is required. The disadvantages of the Bayesian method are its speed and the additional insight needed to optimize the algorithm. Typically, in our analysis flow, we start with PCA and ICA to identify the parameter space; once the region of interesting solutions or phenomena is identified, we perform Bayesian de-mixing.

While a plethora of Bayesian-based statistical methods exist, we have found the algorithm provided by Dobigeon et al. to be the fastest and easiest to use [43]. The Bayesian approach assumes data of the form Y = MA + N, where the observations Y are a linear combination of position-independent endmembers, M, each weighted with respective relative abundances, A, and corrupted by additive Gaussian noise N. This approach imposes the following constraints: the endmembers and the abundance coefficients are non-negative, and the abundances are fully additive, i.e., they sum to one [44-47].

The algorithm operates by estimating the initial projection of endmembers in a reduced subspace via the N-FINDR [48] algorithm, which finds a simplex of maximum volume that can be inscribed within the hyperspectral data set using a non-linear inversion. The endmember abundance priors, along with the noise variance priors, are picked from a multivariate Gaussian distribution found within the data, whereas the posterior distribution is based on endmember independence calculated by Markov Chain Monte Carlo, with asymptotically distributed samples probed by the Gibbs sampling strategy. An additional, unique aspect of Bayesian analysis is that the endmember spectra and abundances are estimated jointly, in a single step, unlike multiple least square regression methods where initial spectra should be known [43].
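The linear mixing model Y = MA + N and its constraints can be made concrete with synthetic data. The sketch below only builds the forward model and, for contrast, runs an ordinary least-squares fit that requires the endmember spectra to be known; it is not the Bayesian algorithm of Dobigeon et al., and the Gaussian-shaped endmembers, noise level, and array sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_bands, n_end, n_pixels = 50, 3, 100

# Endmember spectra M (bands x endmembers): non-negative by construction.
axis = np.linspace(0.0, 1.0, n_bands)
M = np.column_stack([np.exp(-((axis - c) / 0.1) ** 2) for c in (0.2, 0.5, 0.8)])

# Abundances A (endmembers x pixels): non-negative and sum to one per pixel.
A = rng.dirichlet(np.ones(n_end), size=n_pixels).T

# Additive Gaussian noise N and the observed data Y = MA + N.
N = 0.01 * rng.normal(size=(n_bands, n_pixels))
Y = M @ A + N

# Contrast case from the text: least squares needs M to be known in advance.
A_ls = np.linalg.lstsq(M, Y, rcond=None)[0]
print(np.abs(A_ls - A).max())   # small, because M was supplied
```

The least-squares estimate is accurate here only because `M` was given; the joint estimation of endmembers and abundances described above is what removes that requirement.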

Clustering

A very natural way to analyze data is to cluster it. There are many algorithms available that have a variety of built-in assumptions about the data and as such can predict the optimal number of clusters, order clusters based on variance or other distance metrics, etc. We present one method, k-means clustering, which is rather flexible and easy to find on a variety of platforms and in many programming languages. The only required input value for k-means is the number of clusters; however, additional variables such as the distance metric, the number of iterations, how the initial sample is calculated, and how to handle unorthodox data events can all have drastic effects on the results. This clustering algorithm is moderately fast and returns a simple index of integers which assigns each observation to its respective cluster. The biggest downside of the k-means clustering algorithm is the random cluster ordering on the output; however, ordering information can be indirectly accessed by looking at the average distance between clusters (based on the supplied metric) as well as the number of points in each cluster.

The k-means algorithm divides M points in N dimensions into k clusters $S_1, \dots, S_k$ so that the within-cluster sum of squares is minimized [49,50]:

$$\arg\min \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \tag{6}$$

where $\mu_i$ is the mean of points in $S_i$. Here, we have used an implementation of the k-means algorithm that minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. As a measure of distance (minimization parameter), in our data, we have typically used the sum of absolute differences, with each centroid being the component-wise median of the points in a given cluster.
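The variant described in the last sentence above (sum of absolute differences as the distance, component-wise median as the centroid) can be sketched in a few lines. This is an illustrative implementation, not the one used by the authors: the function name `kmeans_cityblock` is hypothetical, and the farthest-point initialization is an assumption chosen to make the small example deterministic.

```python
import numpy as np

def kmeans_cityblock(X, k, n_iter=50, seed=0):
    """k-means-style clustering with L1 distance and median centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Farthest-point initialization (an assumption of this sketch).
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([np.abs(X - c).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)

    labels = None
    for _ in range(n_iter):
        # Sum of absolute differences from every point to every centroid.
        dist = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignments are stable
        labels = new_labels
        for i in range(k):
            members = X[labels == i]
            if len(members):
                centroids[i] = np.median(members, axis=0)
    return labels, centroids

# Two well-separated groups of 4-point "spectra".
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.1, (30, 4)), rng.normal(5.0, 0.1, (30, 4))])
labels, centroids = kmeans_cityblock(X, k=2)
print(np.bincount(labels))   # number of points in each cluster
```

The returned `labels` are the integer index mentioned in the text; which group receives index 0 depends on the initialization, which is the random-ordering caveat noted above.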

Neural networks

Artificial neural networks (ANNs) are an entire family of algorithms, modeled after the neural systems found in the animal kingdom, used to estimate unknown functions that may have a very large number of inputs. ANNs are similar to biological neural systems in that functions are performed collectively and in parallel by the computational units, as opposed to each unit having a clearly assigned task. In a mathematical sense, a neuron's function f(x) can be defined as a mixture of other functions $g_i(x)$ with weighting factors $w_i$, where f(x) is a non-linear weighted sum of the $g_i(x)$:

$$f(x) = K\left(\sum_i w_i g_i(x)\right) \tag{7}$$

Here K is commonly referred to as an activation function that defines the node output based on the set of inputs.

What has attracted people to ANNs is the ability of these algorithms to simulate learning. Here by learning we imply that, for a specific task and a class of functions, a set of observations is used to find the function in that class that solves the task in some optimal sense. To utilize this concept, we must define a cost function C, which is a measure of how far away a particular solution is from the optimal solution. Consider the problem of finding a model f which minimizes the cost function

$$C = E\left[(f(x) - y)^2\right] \tag{8}$$

for pairs (x, y) drawn from a distribution D. In practice, only a finite number of samples N from D is available, and the cost function is minimized in its empirical form

$$C = \frac{1}{N}\sum_{i=1}^{N}\left(f(x_i) - y_i\right)^2$$

As the reader may note, ultimately, the cost function is dictated by the problem we are trying to solve. In the case of an unsupervised learning problem, we are dealing with a general estimation problem, so the cost function is chosen to reflect our imposed model on the system. In the case of supervised learning, we are given a set of examples and aim to infer the mapping implied by the data in the training and other data sets. In the simplest case, the cost function would be of a mean-squared error type, which would try to minimize the average error between the network's output and the target values of the example sets.
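A single neuron and the mean-squared error cost can be written out directly. The sketch below is a toy illustration under stated assumptions: the inputs $g_i(x)$ are taken to be the raw input components, the activation K is `tanh`, and plain gradient descent with a fixed step is used as the training rule; none of these choices comes from the text.

```python
import numpy as np

def neuron(x, w, K=np.tanh):
    """f(x) = K(sum_i w_i g_i(x)), with g_i(x) = x_i assumed for simplicity."""
    return K(x @ w)

def mse_cost(w, x, y):
    """Empirical cost: average squared error over the N samples."""
    return np.mean((neuron(x, w) - y) ** 2)

# Training examples generated by a "true" neuron (supervised setting).
rng = np.random.default_rng(4)
x = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 0.25])
y = neuron(x, w_true)

w = np.zeros(3)
learning_rate = 0.5
cost_start = mse_cost(w, x, y)
for _ in range(500):
    f = neuron(x, w)
    # Gradient of the cost: d/dw mean((f - y)^2), using d tanh(u)/du = 1 - tanh(u)^2.
    grad = 2.0 * ((f - y) * (1.0 - f ** 2)) @ x / len(x)
    w -= learning_rate * grad
cost_end = mse_cost(w, x, y)
print(cost_start, cost_end)
```

The cost decreases as the weights approach `w_true`, which is the sense in which minimizing C corresponds to learning the mapping from the example set.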

Spectral domains

We illustrate the applications of multivariate data analytics techniques to multidimensional functional spectroscopies, which include bias, current, frequency, and time channels in SPM and electron energy loss spectroscopy (EELS) in STEM. The analysis involves signal decomposition along the energy or stimulus direction, whereas the spatial portion of the signal is left pristine. In this section, we illustrate analysis via unsupervised and supervised learning algorithms for scanning tunneling microscopy (STM)-based tunneling spectroscopy and atomic force microscopy (AFM)-based electromechanical force spectroscopies.

3D data - CITS in STM

In STM, an electrically conductive tip is brought within tunneling distance of a conductive sample [51,52]. In Z-imaging mode, the tip is scanned over the sample and Z feedback is used to maintain a constant current while the feedback position is simultaneously adjusted and recorded. Conversely, in the current imaging mode, the Z height is kept constant and the current variation is measured [53]. In current imaging tunneling spectroscopy (CITS), the measurement is performed at individual spatial points located at (x, y) positions on a grid, with the current I recorded for a given applied voltage waveform U. The final data object is a 3D stack of spectral current images I(x, y, U), where I is the detected current, U is the tip bias, and (x, y) are spatial surface coordinates of the measurement [54].

In this example of an Fe-based superconductor FeTe0.55Se0.45 (Tc = 15 K), CITS imaging was performed on a 150 × 150 grid at −0.05 to 0.05 V bias, sampled over 256 points. The layered FeTe0.55Se0.45 compound is a prototypical layered, high-temperature superconductor, described at some length in prior publications [55,56]. These data were collected on a 50 × 50 nm2 area, i.e., each pixel corresponds to approximately 3.3 × 3.3 Å2 (500 Å divided over 150 pixels); the lattice constant of the material is 3.8 Å. The Z position of the piezo was recorded on a separate channel before the Z feedback was disengaged at the beginning of each bias waveform sequence, resulting in a 150 × 150 pixel topographic map. The Z-channel image is shown in Figure 2a, and the inset in the top right corner is the average current-voltage (IV) curve for the entire image. The approximate acquisition time for the CITS map was 8 to 10 h, which resulted in some drift, apparent in the bottom portion of Figure 2a.

The spatial variability of the electronic behavior across the surface was analyzed using PCA [35-37,39,57]. PCA eigenvector and loading map pairs for components 2 and 3 of the FeTe0.55Se0.45 CITS data set are shown in Figure 2b,c. It is useful to analyze the eigenvectors and loadings together, examining first the changes in the signal (the eigenvector) and then its spatial distribution (the loading). From a statistical perspective, we are mapping sources of electronic inhomogeneity arising from the negative portion of the IV curve in both components, as illustrated in Figure 2b,c. The second eigenvector, shown in Figure 2b (upper right corner inset), shows an increase in the −0.05 to 0 V half of the range, where the average signal has a negative slope. In the third eigenvector-loading pair, shown in Figure 2c, the variation is also more prominent in the negative half of the bias range, where the current forms a well, compared to the smooth decay behavior in the average IV. Therefore, changes in the current at negative bias are strong sources of data variance in the system and can be attributed to chemical segregation at the surface.

While the components are statistically significant and reflect major changes in the variability of the data, the connection to the physical properties that PCA highlights is always non-trivial. This is mostly because information variance, the property with respect to which PCA organizes the data, is sensitive to the variability in the signal rather than to the physical origin of the change. This suggests that PCA allows one to de-noise, de-correlate, and visualize the spatial variability of the response but does not directly yield additional knowledge about the effects being studied. In the case of CITS data on FeTe0.55Se0.45, the results for the third component, Figure 2c, can be legitimately questioned, as the loading map suggests behavior that is erratic and typical of an unstable tip-surface tunneling regime. It is then necessary to use other methods to supplement the PCA results and determine the underlying source of variance in the signal and its relevance to the problem at hand, as will be illustrated by Bayesian de-mixing analysis of the local conductance behavior in the section 'Deep learning' [40,58,59].

Figure 2 Unsupervised learning methods, PCA, and k-means on the FeSeTe CITS data. (a) 150 × 150 pixel CITS data from the Z-channel before the feedback is disabled for the IV spectroscopy. The inset shows the average IV for the data set. (b) Second PCA loading of the CITS data shown in (a); the inset shows the eigenvector. (c) Third PCA loading of the CITS data shown in (a); the inset shows the eigenvector. (d) k-means clustering results for a five-cluster case; the inset shows the mean IV for each cluster.


Figure 3 PCA of band excitation piezoresponse force spectroscopy (BEPS) data. (a) First eigenvector (principal component) of the BEPS data. (b) Second eigenvector (principal component) of the BEPS data. (c) Third eigenvector (principal component) of the BEPS data. (d) Fourth eigenvector (principal component) of the BEPS data.

Belianinov et al. Advanced Structural and Chemical Imaging (2015) 1:6 Page 7 of 25

Another commonly used unsupervised learning method that reflects major organization in the data structure is k-means clustering. Insight into the spatial variability of the electronic structure on the surface that is inaccessible by PCA can be gained from k-means clustering analysis of the CITS data [60]. As a measure of distance (minimization parameter), we have used the sum of absolute differences with each centroid being the component-wise median of the points in a given cluster.

The k-means result for five clusters using the squared Euclidean distance metric is shown in Figure 2d, with the inset in the top right showing the mean IV curves for the individual clusters (color-coded accordingly). As seen in the k-means clustering result, the mapping is indeed sensitive to the changes in the negative bias portion of the IV curve. Here we see clustering that is based on variance of conductivity or alternatively the width of the band gap. Perhaps a more interesting observation is the spatial distribution of the clusters, where the regions of the highest maximum current (cherry red) and lowest maximum current (green) are segregated and in most cases surrounded by patches of varying conductivity. Note that in this result, single-pixel and short line-like agglomerates of pixel outliers seen in Figure 2c are absent. Overall, the behavior is more in line with the results of the second PCA component.
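The two unsupervised steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pca_spectral_cube` and `kmedians`, the (rows, columns, bias points) array layout, and the synthetic two-region cube are all hypothetical. The clustering step follows the first description in the text (sum of absolute differences with component-wise median centroids).

```python
import numpy as np


def pca_spectral_cube(cube, n_components):
    """PCA of a (rows, cols, n_bias) cube via SVD of the mean-centered spectra."""
    rows, cols, n_bias = cube.shape
    spectra = cube.reshape(rows * cols, n_bias)
    centered = spectra - spectra.mean(axis=0)  # subtract the average IV curve
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvectors = vt[:n_components]  # shape (n_components, n_bias)
    loadings = (u[:, :n_components] * s[:n_components]).reshape(
        rows, cols, n_components
    )
    return eigenvectors, loadings


def kmedians(spectra, k, n_iter=50, seed=0):
    """Cluster spectra using L1 distance and component-wise median centroids."""
    rng = np.random.default_rng(seed)
    centroids = spectra[rng.choice(len(spectra), size=k, replace=False)].copy()

    def assign(c):
        # Sum of absolute differences from every spectrum to every centroid.
        return np.abs(spectra[:, None, :] - c[None, :, :]).sum(axis=2).argmin(axis=1)

    labels = assign(centroids)
    for _ in range(n_iter):
        for j in range(k):
            members = spectra[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = np.median(members, axis=0)
        new_labels = assign(centroids)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids


# Synthetic stand-in for a CITS cube: two regions with different IV curves.
rng = np.random.default_rng(1)
bias = np.linspace(-0.1, 0.1, 32)
cube = np.empty((20, 20, 32))
cube[:, :10] = 10 * bias + rng.normal(0, 0.01, (20, 10, 32))
cube[:, 10:] = 10 * bias + 5 + rng.normal(0, 0.01, (20, 10, 32))

eigenvectors, loadings = pca_spectral_cube(cube, n_components=3)
labels, centroids = kmedians(cube.reshape(-1, 32), k=2)
cluster_map = labels.reshape(20, 20)  # spatial map of cluster membership
```

Each slice `loadings[:, :, i]` plays the role of a loading map and `eigenvectors[i]` the role of the corresponding inset curve; `centroids` correspond to the per-cluster representative IV curves.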

4D and 5D data - band excitation spectroscopy analysis
The multivariate analysis of a higher-dimensional data set (beyond the 3D) is effectively illustrated by a band excitation piezoresponse force spectroscopy (BEPS) data set. This technique probes the electromechanical response of materials, which is directly related to the material’s ferroelectric state. The spectroscopic version of piezoresponse force microscopy (PFM) probes the local ferroelectric switching induced by the DC bias applied to the tip via a dynamic electromechanical response, effectively yielding the local piezoresponse loop.

The data shown in this section consist of a 30 × 30 grid of points (x, y), where each point contains a ferroelectric hysteresis loop captured by applying a voltage waveform and measuring the piezoresponse [61-66]. The sample is a relaxor ferroelectric, PMN-0.28PT, which is in the ferroelectric phase and displays strong piezoelectricity. The amplitude of the piezoelectric response A is then a function of (x, y, V), A = A(x, y, V); the voltage waveform consists of 64 voltage steps, implying 64 spatial maps of amplitude (one for each voltage step). Alternatively, at each (x, y) location, one can inspect the ferroelectric hysteresis loop, i.e., the amplitude as a function of voltage for x = x1, y = y1. In total, 900 hysteresis loops were captured that could be analyzed, and PCA was undertaken for this data set (method described in the previous section), with the first four eigenvectors in Figure 3a,b,c,d and their respective loading maps shown in Figure 4a,b,c,d. The first component represents the mean of the data set (since the mean captures the most variance in the data), and subsequent components capture successive deviations of the (amplitude) hysteresis loop shape from the mean. Note that the components can arbitrarily switch sign, but in these instances, the eigenvalues will also be reversed to preserve the correct orientation of the reconstructed data set. The second component in Figure 3b is a measure of the asymmetry of the loop (related to ferroelectric imprint), while the third appears to widen the loop (i.e., change the coercive field). Finally, the fourth component displays non-trivial features which, in the reconstructed data set, appear as mound-like features on either side of the switching cycle. The spatial maps illustrate significant heterogeneity for each component, as a result of the widely varying ferroelectric switching behavior across the sample. Thus, PCA once again provides an effective method to quickly map the trends in the data set.

Although PCA is useful in visualizing the structure of the data, there are no physically meaningful constraints on the eigenvectors. For example, if it is known (or postulated) that the measured signal is a linear combination of n independent signals, one may want to determine the pure components that correspond to each of these cases. For this particular problem, the ICA [42] technique provides a solution and allows de-mixing of signals into a user-defined number of vectors (components), with the constraint that the components must be statistically independent.

Consider the amplitude signal A in the BEPS example

written as a sum of four independent components $s_i$ ($i = 1, \dots, 4$), with mixing coefficients $c_i$ ($i = 1, \dots, 4$); the amplitude is then described by Equation 9:

$$A(x, y, V) = c_1(x, y)\, s_1(V) + \dots + c_4(x, y)\, s_4(V) \tag{9}$$

Figure 4 PCA loading maps of band excitation piezoresponse force spectroscopy (BEPS) data. (a) First loading map associated with the principal components in Figure 3 of the BEPS data. (b) Second loading map associated with the principal components in Figure 3 of the BEPS data. (c) Third loading map associated with the principal components in Figure 3 of the BEPS data. (d) Fourth loading map associated with the principal components in Figure 3 of the BEPS data.

ICA can be used to find $s_i(V)$ and $c_i(x, y)$. In essence, such a transformation allows the data to be represented by a specific number of independent ‘processes’ (components) that are mixed in the final signal, while the coefficients determine the relative weights of each process to the total signal contribution. The results of this de-mixing are shown in Figure 5, with the independent components shown in Figure 5a,b,c,d and the corresponding mixing coefficients shown in Figure 6a,b,c,d. Unlike in PCA, there is no particular ordering to the components; however, similar to PCA, the components may flip in sign.

The spatial maps of the mixing coefficients show variability in the response and are markedly different from the PCA eigenvalue maps; for example, the bottom right area of the sample displays high response of the second component (Figure 6b), which increases the area enclosed within the left side of the butterfly loop. In this example, note that there is no a priori reason for the hysteresis loop to have exactly four components, i.e., we illustrate an example of the method, but based on component shape, there should be at least four as all components appear significantly different. Importantly, the fourth component displays a near-ideal ferroelectric loop (Figure 5d), and the strength of this component with respect to the other components can be seen as an indication of the degree of purely ferroelectric switching in those regions, as opposed to other components that appear to result from dominating influences by surface charges, polar nanoregions, or field-induced phase transformations. For instance, the first component appears largest in the top-left corner of the region studied (Figure 6a), and the coercive fields for this component are much lower, possibly due to the increased propensity of field-induced phase transformations (likely rhombohedral to tetragonal [67]) in this area. Thus, ICA is a highly useful method for blind source separation and provides a powerful complement to PCA for de-mixing signals where the number of constituent components is either known from physics or can be postulated.
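To make the de-mixing model of Equation 9 concrete, the following is a minimal NumPy sketch of symmetric FastICA with a tanh nonlinearity. It is one common ICA algorithm and is not claimed to be the one used for the figures; the function names, the synthetic source curves, and the random mixing coefficients are assumptions made for illustration. Each pixel's curve is treated as one mixed signal sampled over the 64 voltage steps.

```python
import numpy as np


def _sym_decorrelate(w):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    vals, vecs = np.linalg.eigh(w @ w.T)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ w


def fast_ica(data, n_components, n_iter=500, tol=1e-8, seed=0):
    """Symmetric FastICA (tanh contrast).

    data: (n_pixels, n_steps) array holding one curve A(V) per pixel.
    Returns sources s_i(V), shape (n_components, n_steps), and mixing
    coefficients c_i, shape (n_pixels, n_components).
    """
    n_steps = data.shape[1]
    centered = data - data.mean(axis=1, keepdims=True)
    # Whitening: leading right-singular vectors scaled to unit variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    white = vt[:n_components] * np.sqrt(n_steps)
    rng = np.random.default_rng(seed)
    w = _sym_decorrelate(rng.normal(size=(n_components, n_components)))
    for _ in range(n_iter):
        g = np.tanh(w @ white)
        # Fixed-point update: E[z g(w.z)] - E[g'(w.z)] w for every row of W.
        w_new = g @ white.T / n_steps - (1.0 - g**2).mean(axis=1)[:, None] * w
        w_new = _sym_decorrelate(w_new)
        converged = np.max(np.abs(np.abs(np.sum(w_new * w, axis=1)) - 1.0)) < tol
        w = w_new
        if converged:
            break
    sources = w @ white
    mixing = centered @ sources.T / n_steps  # least-squares coefficients
    return sources, mixing


# Hypothetical 30 x 30 grid with 64 voltage steps and four made-up sources.
volts = np.linspace(-1.0, 1.0, 64)
true_sources = np.vstack([
    np.tanh(5 * volts),
    volts**2,
    np.sin(3 * np.pi * volts),
    np.sign(np.sin(7 * volts)),
])
true_mixing = np.random.default_rng(2).uniform(0.0, 1.0, (900, 4))
data = true_mixing @ true_sources

sources, mixing = fast_ica(data, n_components=4)
coefficient_maps = mixing.reshape(30, 30, 4)  # one map c_i(x, y) per component
```

As noted in the text, the recovered components have no particular order and may flip in sign. With only 64 voltage samples the estimated sources should be read as an approximation; a maintained implementation such as scikit-learn's `FastICA` is a reasonable alternative for real data.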

Figure 5 ICA of band excitation piezoresponse force spectroscopy (BEPS) data. (a) The first independent component from the ICA analysis of the BEPS data. (b) The second independent component from the ICA analysis of the BEPS data. (c) The third independent component from the ICA analysis of the BEPS data. (d) The fourth independent component from the ICA analysis of the BEPS data.

Figure 6 ICA of band excitation piezoresponse force spectroscopy (BEPS) data. (a) Mixing coefficient maps associated with the first independent component in Figure 5. (b) Mixing coefficient maps associated with the second independent component in Figure 5. (c) Mixing coefficient maps associated with the third independent component in Figure 5. (d) Mixing coefficient maps associated with the fourth independent component in Figure 5.

Supervised learning
Functional recognition imaging is an example of the supervised learning approach that employs artificial neural networks. The process of recognition obviates the need for sophisticated analytical models, instead relying on statistical analysis of the complex spectroscopic data sets. Nikiforov et al. [68] describe functional recognition imaging of bacterial samples containing live Micrococcus lysodeikticus and Pseudomonas fluorescens on a poly-L-lysine-coated mica substrate. These bacteria differ in shape and therefore present a good model system for creating training data sets. The spectroscopic data were provided by the band excitation PFM method [69,70] in the form of the electromechanical response vs. excitation frequency spectra collected across chosen sample areas that contained both of the bacteria species, the substrate, and debris. The spectra (shown in Figure 7a,b) of bacteria and substrate clearly contain unique signatures allowing for their identification on the single-pixel (i.e., spectral) level. Note that these electromechanical responses originate in a very complex interplay of different interaction mechanisms between the AFM tip and surface: long-range electric double layer forces and bacterial electromobility and flexoelectric properties, with the control over cantilever dynamics governed by the boundary conditions at the tip-surface junction. This complexity precludes analytical modeling of the data but provides enough statistical significance for the successful application of a neural network.

The training of the neural network was performed on a region of the sample outlined with a black box in Figure 7c. The inputs to the network were the first six

components of the principal component analysis decomposition (here, acting as a filter) of the data set within the training region. A network of three neurons was trained repeatedly on multiple examples until a minimal error was achieved. Following training, the network was presented with the data set collected on the whole area shown in Figure 7c and it correctly identified both of the bacterial species. Interestingly, other topographical features, distinct from the substrate, were classified as background, which identifies them as non-bacterial debris. However, a small, relatively flat region (right upper corner in Figure 7c) was classified as M. lysodeikticus, implying that this region could be covered in a membrane of lysed bacteria of that species. Thus, supervised learning presents a powerful image recognition tool that can identify objects based on a small subset of information provided in the training set. Even though successful neural network operation requires extensive training for accuracy, the computational cost during operation is negligible. The illustrated example was computed on a typical user desktop without additional high-end components or computational clusters. Similarly, neural network approaches can be extended to training on theoretical model outputs, with the experimental results presented for analysis. Examples include functional fits to relaxation parameters [71] or Ising model simulations [72,73].

Figure 7 Functional recognition imaging of a bacterial sample. (a, b) Electromechanical response spectra of two bacterial species and background. (c) AFM topographic image of an area containing bacteria with the training set marked with a rectangular box; coloration indicates neural network identification, with green corresponding to P. fluorescens, red to M. lysodeikticus, and gray to the background.
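The workflow above (PCA as a filter, then a small network trained on a labeled region and applied to the whole scan) can be sketched as follows. The source does not specify the network architecture beyond "three neurons", so this sketch assumes, purely for illustration, a single layer of three softmax output neurons, one per class. The synthetic Lorentzian spectra, the peak positions, and all function names are likewise hypothetical.

```python
import numpy as np


def fit_pca(train_spectra, n_components=6):
    """Return the mean spectrum and leading principal axes of the training set."""
    mean = train_spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(train_spectra - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(spectra, mean, components):
    """Project spectra onto principal axes fitted on the training region."""
    return (spectra - mean) @ components.T


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_softmax_layer(features, labels, n_classes=3, lr=0.5, n_epochs=500):
    """Full-batch gradient descent on the cross-entropy loss."""
    n, d = features.shape
    weights = np.zeros((d, n_classes))
    bias = np.zeros(n_classes)
    targets = np.eye(n_classes)[labels]
    for _ in range(n_epochs):
        grad = (softmax(features @ weights + bias) - targets) / n
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias


# Synthetic stand-in: three classes with different resonance peak positions.
rng = np.random.default_rng(0)
freq = np.linspace(0.0, 1.0, 64)
peaks = np.array([0.3, 0.5, 0.7])  # hypothetical peak position per class
labels = rng.integers(0, 3, size=900)
spectra = 1.0 / (1.0 + ((freq[None, :] - peaks[labels][:, None]) / 0.05) ** 2)
spectra += rng.normal(0, 0.05, spectra.shape)

train = np.arange(200)  # stands in for the boxed training region
mean, components = fit_pca(spectra[train], n_components=6)
scale = project(spectra[train], mean, components)[:, 0].std()
train_features = project(spectra[train], mean, components) / scale
weights, bias = train_softmax_layer(train_features, labels[train])

all_features = project(spectra, mean, components) / scale
predicted = (all_features @ weights + bias).argmax(axis=1)
accuracy = (predicted == labels).mean()
```

Once `weights` and `bias` are fixed, classifying a new pixel costs one projection and one small matrix product, which is consistent with the low operating cost noted in the text.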

Deep learning
In this section, we discuss the pathways to establish correspondence between statistical analysis and a physical model, i.e., to transition from a search for correlation to a search for causation. The previously introduced first-order reversal curve current-voltage (FORC-IV) SPM technique [74] has been deployed in imaging and analysis of spatially uniform Ca-substituted BiFeO3 and NiO systems [74,75]. Those studies have shown that the locally measured hysteresis in the FORC-IV curves is related to changes in electronic conduction sensed by the tip in response to a bias-induced electrochemical process, and the area of the IV loop is overall indicative of local ionic activity. FORC-IV spectroscopic imaging modes lack adequate data analysis and interpretation pathways due to the flexible, multidimensional nature of the data set and the volume of the data collected. In this example, we combine FORC-IV measurements with the multivariate statistical methods based on signal de-mixing, in order to discriminate between different conductivity behaviors based on the shapes of the IV curves in the full spectroscopic data set.

A CoFe2O4-BiFeO3 nanocomposite thin film (CFO-BFO, Figure 8c) was grown by pulsed laser deposition and is a self-assembled, tubular heterostructure that forms spontaneously due to segregation of the perovskite BFO matrix and the CFO spinel inclusions [58,76]. The FORC-IV spectroscopy was performed at humidity values ranging from 0% to 87%, with an intermediate 58% case also shown. To gain insight into the fine structure of the CFO pillars and the CFO-BFO tubular interface, we first used conductive AFM (c-AFM) to image areas of size 500 × 500 nm² of the film and then collected FORC-IV data using a waveform with six triangular pulses and a maximum DC peak bias of 3 V on a 50 × 50 pixel

grid overlaid on the imaged area. This corresponds to a pixel size of 10 × 10 nm². Figure 8a,b illustrates a typical ambient c-AFM result, including topography (Figure 8a) and a conductivity map collected at 100 mV (Figure 8b). Notice that the current is maximized at the edges of the pillars extending into the BFO matrix. Furthermore, some of the pillars feature a central spot of low conductivity, while others are fully conductive. The inset of Figure 8d illustrates details of the FORC-IV experiment, specifically the applied triangular bias waveform, shown in blue, and average current response for the entire 50 × 50 pixel spectroscopy area as a function of time, shown in green. Figure 8d shows the average current loop for the whole bias waveform as a function of voltage; note that these curves are essentially featureless with little to no hysteresis and are highly smooth in both forward and reverse voltage sweep directions.

Figure 8 c-AFM on CFO-BFO. (a) Contact mode c-AFM topography of CFO-BFO. (b) c-AFM of CFO-BFO at 100 mV tip bias. (c) Schematic of the CFO-BFO nanocomposite sample and experimental setup. (d) FORC-IV average current for all pixels at 87% humidity; the inset shows FORC-IV bias waveform (blue) and current response (green) for the 87% humidity case.

The multidimensional nature of these data, combined

with the lack of analytical or numerical physical models, naturally calls for multivariate statistical analysis in order to extract the most comprehensive view of the physical behavior of the CFO-BFO system. While PCA and ICA are powerful methods that allow one to take a closer look into the structure of the data, a preferable method would preserve physical information in the data and allow fully quantitative analysis. Such a method will separate the data into a combination of well-defined components with clear spectroscopic behavior, each with an intensity weight that provides insight into the spatial distribution of the behavior. Ideally, these components should be physically viable, well-behaved, positive, have additive weights, etc. This level of analysis can be achieved by Bayesian linear de-mixing methods, specifically an algorithm conglomerate introduced by Dobigeon et al. [43].

The main advantage of these methods is a quantitative, interpretable result where the final endmembers are non-negative, in the units of input data, and with all of the respective abundances adding up to 1. Therefore, at each location, the data are decomposed into a linear combination of spectra where each pixel in the probed grid consists of a number of components (i.e., conducting behaviors) present in a corresponding proportion. Note that these constraints allow a direct transition from statistical analysis to physical behavior. By making the abundances additive and the endmembers positive, we can begin assigning physical behavior to the shape and nature of the endmember curves. By extension, analysis of the endmember loading maps adds the spatial component to the behavior that non-statistical methods of analysis lack entirely.

Following the experiments at 0%, 58%, and 87% humidity, we performed Bayesian de-mixing of the current signal into four components. The reasons for choosing four components and the supporting arguments are discussed in detail by Strelcov et al. [58]. De-mixed vectors for all three experiments as well as loadings for the 0% and 87% humidity cases are shown in Figure 9. The de-mixed components correspond to 1) electronic transport through a potential barrier (Figure 9a) active in the central and outer

parts of the CFO pillars, 2) an Ohmic conductance (Figure 9b) present in the bulk of CFO islands, 3) negligible conductivity of the BFO matrix (Figure 9c), and 4) interfacial electrochemistry that generates hysteresis in IV curves (Figure 9d). Evidently, although an increase in humidity level brings about an increase in overall conductivity, the response of individual components is much more complex, implying several competing mechanisms. A decrease in CFO conductivity (component 1) on humidity increase from 0% to 58% might be due to formation of a water meniscus at the tip-surface junction, which effectively decreases the strength of the electric field and hampers transport through the barrier. On the other hand, the Ohmic component stays almost unaffected in these conditions, being dependent on the potential difference between the electrodes, rather than the electric field strength. A further increase in humidity to 87% not only increases maximal current in components 1 and 2, but also leads to an intensity shift from component 1 to component 2 in the abundance maps (cf. Figure 9e,i and Figure 9f,j pairwise), i.e., decreasing the height of the potential barrier for electron/hole injection from the tip into the nanocomposite. This might be due to generation of H+ ions by the tip via water electrosplitting. Finally, the fourth (electrochemical) component steadily intensifies as the humidity increases from 0% to 87%, as expected from water electrolysis. The threshold voltage decreases and the reaction zone widens, as can be observed by comparing Figure 9h and Figure 9l. This exemplifies the ability of deep data analysis not only to highlight statistically significant traits in multidimensional data, but also to extract physically and chemically relevant behaviors, preserving the units of measurement in the process.

Figure 9 Bayesian de-mixing on CFO-BFO FORC-IV data. Top row are de-mixed vectors for the 0%, 58%, and 87% humidity; middle row are loading maps for the 87% humidity; and bottom row are loadings for the 0% humidity. (a) Island conductivity Bayesian de-mixed component for the 0%, 58%, and 87% humidity. (b) Inner island conductivity Bayesian de-mixed component for the 0%, 58%, and 87% humidity. (c) BFO matrix conductivity Bayesian de-mixed component for the 0%, 58%, and 87% humidity. (d) Interfacial conductivity between the CFO and BFO matrix Bayesian de-mixed component for the 0%, 58%, and 87% humidity. (e) Island conductivity loading map for the 87% humidity. (f) Inner island conductivity loading map for the 87% humidity. (g) BFO matrix conductivity loading map for the 87% humidity. (h) Interfacial conductivity loading map between the CFO and BFO matrix at 87% humidity. (i) Island conductivity loading map for the 0% humidity. (j) Inner island conductivity loading map for the 0% humidity. (k) BFO matrix conductivity loading map for the 0% humidity. (l) Interfacial conductivity loading map between the CFO and BFO matrix at 0% humidity.
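The constraints that make the de-mixed result interpretable (non-negative abundances that sum to 1 at every pixel) can be illustrated with a much simpler stand-in than the Bayesian algorithm of Dobigeon et al. [43]. The sketch below assumes the endmember curves are already known and estimates only the abundances, by projected gradient descent onto the probability simplex. It is not the Bayesian method, which also estimates the endmembers; the function names and the synthetic Gaussian-shaped endmembers are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Project each row of v onto the set {a : a >= 0, sum(a) = 1}."""
    n = v.shape[1]
    u = np.sort(v, axis=1)[:, ::-1]  # each row sorted in descending order
    css = np.cumsum(u, axis=1) - 1.0
    rho = (u - css / np.arange(1, n + 1) > 0).sum(axis=1)
    theta = css[np.arange(len(v)), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)


def unmix(spectra, endmembers, n_iter=500):
    """Constrained least-squares abundances for known endmembers.

    spectra: (n_pixels, n_points); endmembers: (k, n_points).
    Returns abundances of shape (n_pixels, k), non-negative, rows summing to 1.
    """
    k = endmembers.shape[0]
    gram = endmembers @ endmembers.T
    step = 1.0 / np.linalg.eigvalsh(gram)[-1]  # 1 / Lipschitz constant
    cross = spectra @ endmembers.T
    abundances = np.full((spectra.shape[0], k), 1.0 / k)
    for _ in range(n_iter):
        gradient = abundances @ gram - cross
        abundances = project_to_simplex(abundances - step * gradient)
    return abundances


# Synthetic 50 x 50 grid mixed from four made-up non-negative endmembers.
axis = np.linspace(0.0, 1.0, 64)
centers = np.array([0.2, 0.4, 0.6, 0.8])
endmembers = np.exp(-((axis[None, :] - centers[:, None]) ** 2) / (2 * 0.08**2))
true_abundances = np.random.default_rng(3).dirichlet(np.ones(4), size=2500)
spectra = true_abundances @ endmembers

estimated = unmix(spectra, endmembers)
abundance_maps = estimated.reshape(50, 50, 4)  # one map per endmember
```

Because the abundances are fractions and the endmembers stay in the units of the input, each map in `abundance_maps` can be read directly as the local share of one conduction behavior, which is the property the text emphasizes.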

Image domain
The clustering and dimensionality reduction algorithms used in the previous sections are equally applicable to analysis in image coordinate space, exploring the correlation between individual structural elements found in the image itself. These can originate from both contrast and shape-based features contained in the image, as well as an analysis of features that are mathematically condensed into a representative set.

Sliding Fourier transform
As an example of image domain analysis, we demonstrate a sliding fast Fourier transform (FFT) filter [77] for analysis of surface reconstructions on epitaxially grown films of La5/8Ca3/8MnO3 (LCMO). The image analyzed in Figure 10a is a 50 × 50 nm² STM topography image (captured at a resolution of 512 × 512 pixels). We analyze a small window of the surface (outlined as a white square in the figure, of size 128 × 128 pixels) and generate an FFT image of that area. The window is then slid across the image by a preset number of pixels (in this case, 8), and the FFT image is captured once again; this process is repeated until the window has covered the entire real-space image in the horizontal direction. The window is then stepped in the y-direction and the process repeated until all areas of the image have been covered. The output of this procedure is a large array (here, size 60 × 60 × 128 × 128) of position-dependent FFT images, which are then analyzed using PCA to identify the trends and the spatial variations in the data set. The 2D PCA eigenvectors are plotted in Figure 10b, with the first two eigenvectors showing spacing closely aligned across the [63] direction, while the third eigenvector shows periodicity more closely aligned in the [010] direction. The loading maps for the eigenvectors, in Fourier space, are shown in Figure 10c, and the real-space loadings are plotted in Figure 10d. The real-space loadings readily identify the sites of interest, as measured by changes in the lattice (be it spacing or angle). The second real-space loading is particularly adept at finding edges of the ordered/disordered areas, as well as ordered but differently oriented lattices. The fourth component identifies the regions in the image where there is a clear lattice. These results show the promise of using the sliding FFT/PCA algorithm to quickly identify the types of periodicity and their spatial distribution in an image.

Figure 10 Sliding FFT on an STM image of La5/8Ca3/8MnO3. (a) STM topography image of 16-unit cell sample of La5/8Ca3/8MnO3 grown on (001) SrTiO3. Sliding FFT was carried out, which consists of creating a window (white square in image) in which the FFT is captured, and subsequently sliding the window across the image a preset distance and recording the next FFT of the windowed area until the entire surface is covered to produce the data set. PCA of this data set was performed, and the first four components are shown in (b), with the respective Fourier loadings (c). Transforming the loadings to real space allows investigating the spatial distribution, and the first four real-space components are shown in (d).
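The sliding FFT procedure followed by PCA can be sketched as below. This is an illustration under assumptions, not the published filter [77]: the function names are hypothetical, a Hann taper and mean subtraction are added to reduce edge leakage (the source does not say whether any windowing function was used), and a small synthetic image with two stripe orientations replaces the STM data so the example runs quickly.

```python
import numpy as np


def sliding_fft(image, window, step):
    """Return FFT magnitudes of every window position.

    Output shape: (n_row_positions, n_col_positions, window, window).
    """
    row_starts = range(0, image.shape[0] - window + 1, step)
    col_starts = range(0, image.shape[1] - window + 1, step)
    # Assumption: Hann taper to suppress leakage from the window edges.
    taper = np.outer(np.hanning(window), np.hanning(window))
    stack = np.empty((len(row_starts), len(col_starts), window, window))
    for i, r in enumerate(row_starts):
        for j, c in enumerate(col_starts):
            patch = image[r:r + window, c:c + window]
            patch = (patch - patch.mean()) * taper
            stack[i, j] = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    return stack


def pca_of_fft_stack(stack, n_components):
    """PCA over window positions; each FFT image is one observation."""
    n_r, n_c, w, _ = stack.shape
    flat = stack.reshape(n_r * n_c, w * w)
    u, s, vt = np.linalg.svd(flat - flat.mean(axis=0), full_matrices=False)
    eigen_images = vt[:n_components].reshape(n_components, w, w)
    loading_maps = (u[:, :n_components] * s[:n_components]).T.reshape(
        n_components, n_r, n_c
    )
    return eigen_images, loading_maps


# Synthetic image: horizontal stripes on the left, vertical on the right.
yy, xx = np.mgrid[0:64, 0:64]
image = np.where(xx < 32, np.sin(2 * np.pi * yy / 4), np.sin(2 * np.pi * xx / 4))

stack = sliding_fft(image, window=16, step=4)
eigen_images, loading_maps = pca_of_fft_stack(stack, n_components=4)
```

Here `eigen_images` are the Fourier-space eigenvectors and `loading_maps` are the real-space loadings indexed by window position; in the synthetic case the first loading map changes sign across the boundary between the two stripe orientations. For a full-size image the stacked array becomes large, so a truncated or randomized SVD may be preferable to the dense SVD used here.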


Clustering and classification of atomic features
An example of correlative learning is shown by the k-means clustering algorithm on atomically resolved scanning transmission electron microscopy (STEM) images. We demonstrate phase separation based on local analysis of the atomic neighborhood. We identify all atoms in the image, assign them to nearest neighbors (six neighbors in this case), and perform clustering analysis on the relative bond lengths of the resultant six-member array set. The material system is a Mo-V-M-O (M = Ta, Te, Sb, and Nb) mixed oxide, which is one of the most promising catalysts for propane (amm)oxidation, with improvement of its performance being widely pursued [78-80]. In this system, the catalytic performance can be altered by intermixing two phases referred to as the M1 (ICSD no. 55097) and the M2 (ICSD no. 55098) phases, with the correlative analysis serving as a quantitative framework that allows one to separate M1 and M2 as well as estimate their relative contributions as a function of the catalytic conversion. Figure 11a is a scanning transmission electron microscopy high-angle annular dark-field imaging (STEM-HAADF) image of the Mo-V-Sb-Ta oxide with the M1 phase (highlighted by a green square) and the remaining area being largely an M2 phase speckled with M1 dislocations. Figure 11b shows the result of the k-means clustering for four clusters (one of the clusters consists of the edge atoms in the image and is not shown) that clearly delineate the M1 phase members (shown in red), M2 matrix phase (shown in green), and a strain-relieving interface between the two shown in blue. Figure 11c,d,e shows each cluster individually. While the example is relatively simple, it serves to bridge machine learning methods with high-quality experimental data that, due to its intrinsic complexity, is typically only qualitatively analyzed. While it may be possible to manually distinguish phases and assign them to the atomic species, performing this task quantitatively for a large number of frames quickly becomes a monumental feat.

Figure 11 K-means clustering results for four clusters on the STEM-HAADF image of the Mo-V-Sb-Ta oxide two-phase catalyst. (a) Raw STEM image. (b) K-means result for four clusters based on the length to the six nearest neighbors distance metric. (c) Sole cluster 1. (d) Sole cluster 2. (e) Sole cluster 3.

Imaging in k-space
The concept of image frame analysis can be further extended to image sequences, which are perfectly suited for analysis using the same multivariate statistical methods. As an example, we turn to an image sequence of reflection high-energy electron diffraction (RHEED) data acquired during deposition of SrRuO3 on (001) SrTiO3. The (00) or specular spot is closely monitored for signs of oscillations, which would indicate a layer-by-layer growth of the film on the substrate. As a first approximation, these oscillations arise due to a filling of incomplete layers (which reduces step density and therefore increases the intensity of the diffracted beam), until the layer is complete, followed by more roughening as more material is deposited, with a corresponding decrease in intensity of the specular spot [81]. The process continues as

the deposition proceeds, and the resulting profile of the specular spot intensity over time is periodic if the growth mode is layer-by-layer. The intensity of the specular spot over the course of the deposition is shown inset in the lower panel in Figure 12, plotted as an olive line. This graph shows that, after the start of the deposition, oscillations can only be observed up to t ~ 110 s (see expanded inset in graph), and afterward, no oscillations are observed, indicating a transition to a step-flow growth mode.

Figure 12 Clustering on RHEED image sequences. K-means analysis of the first segment of the RHEED movie (of SrRuO3 growing on (001) SrTiO3) was performed. The mean of all members within each cluster for the ten clusters is shown in the upper panel of the figure, and the temporal dependence of the clusters is shown in the lower panel. Inset, below: The mean of the specular diffraction spot, with an expanded view of the first 220 s of the growth.

We studied the first 220 s of growth by using k-means clustering, with ten clusters, with the mean of the clusters plotted in the upper half of Figure 12, while the temporal dependence of each cluster is graphed in the lower panel. After the deposition begins (at t = 50 s), there are five distinct clusters that characterize the growth process before the transition to step-flow mode. These highlight the pathway for the transition: it appears to occur with the streaks gradually losing intensity over time until they are more spot-like (seen in cluster 6). Beyond t = 120 s, four clusters characterize the remaining 100 s of growth, and these are outlined with olive dashed lines in the upper panel. Interestingly, there is little difference between these clusters (compared with the clusters in the layer-by-layer growth segment), and moreover, the similarity suggests little roughening effects in the grown film. We can therefore conclude, unambiguously, that the layer-by-layer growth transitions to the step-flow mode when cluster 1 is active, i.e., at t = 120 s. Thus, the k-means clustering allows identification of growth mode transitions as well as the pathway through which this occurs in k-space, and furthermore allows identification of existence or absence of surface roughening. The method

is equally applicable to detecting 2D → 3D growth mode transitions [82], disordered → ordered transitions [83], strain relaxation [84], etc.
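Clustering an image sequence and reading off the times at which the active cluster changes can be sketched as follows. This is a minimal Euclidean k-means written for illustration; the function names, the choice to seed centroids with frames evenly spaced in time, and the synthetic three-stage "movie" are assumptions and do not reproduce the RHEED analysis itself.

```python
import numpy as np


def kmeans_frames(frames, k, n_iter=100):
    """Euclidean k-means over a movie of shape (n_frames, height, width)."""
    flat = frames.reshape(len(frames), -1).astype(float)
    # Assumption: seed the centroids with frames evenly spaced in time.
    centroids = flat[np.linspace(0, len(flat) - 1, k).astype(int)].copy()
    labels = np.full(len(flat), -1)
    for _ in range(n_iter):
        dist = ((flat[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its centroid
                centroids[j] = flat[labels == j].mean(axis=0)
    mean_images = centroids.reshape((k,) + frames.shape[1:])
    return labels, mean_images


def transition_frames(labels):
    """Indices of frames at which the active cluster changes."""
    return np.flatnonzero(np.diff(labels)) + 1


# Synthetic movie: a streak that shortens and finally becomes a spot.
rng = np.random.default_rng(4)
movie = rng.normal(0, 0.01, (60, 8, 8))
movie[:20, :, 4] += 1.0      # frames 0-19: full streak
movie[20:40, 2:6, 4] += 1.0  # frames 20-39: shortened streak
movie[40:, 4, 4] += 1.0      # frames 40-59: spot

labels, mean_images = kmeans_frames(movie, k=3)
changes = transition_frames(labels)
```

Plotting `labels` against acquisition time gives the temporal dependence of the clusters, and `mean_images` gives the mean diffraction pattern of each cluster; multiplying `changes` by the frame period would convert the transition indices to times.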

Supervised learning: domain shape recognition
Principal component analysis combined with neural networks can be used for the analysis of ferroelectric domain shapes, which provides insight into the highly non-trivial mechanism of ferroelectric domain switching, and potentially establishes a new paradigm for information encoding based on the captured domain shape in the image.

Recent investigation of the SPM tip-induced ferroelectric domain switching by sequences of positive and negative electric pulses (labeled as a sequence of 0s and 1s) demonstrated unexpectedly complex, symmetrical, and asymmetrical morphologies of the formed domains (Figure 13a) [85]. These results suggest an intriguing possibility of practical applications in modern data storage devices, where data is encoded as a set of parameters that define domain shape and size. However, development of this approach into a viable device necessitates reliable analysis techniques which allow recognition of the sequence of the written electrical pulses via shape and size of the resulting domain. We illustrate a combinatorial PCA and neural network approach to address this problem [86].

The experimental data set consisted of PFM images of the domains produced by an application of a number of electrical pulses of varying length, and a total of 288 domains were acquired for testing.

We used PCA to obtain a set of the descriptors that characterized the individual domains. Each domain image consisted of N × N pixels and was unfolded into a 1D vector of length N². PCA eigenvectors (Figure 13b) and corresponding weight coefficients (Figure 13c) characterized the domain morphology. The color map of the weights demonstrates clear differences between the domain groups corresponding to different switching pulses (Figure 13c). This approach illustrates use of eigenvectors for characterization of all of the experimentally observed features of the domain morphology, and the weights can be used as an input parameter for the recognition by a feed-forward neural network.

Figure 13 Recognition of the shape of the ferroelectric domains. (a) Shape of the ferroelectric domains produced by application of the sequences of positive and negative electric pulses to the SPM tip. (b, c) Principal component analysis over experimental data set of 288 domains. (b) First 16 eigenvectors and (c) weights.

For testing of this approach, the experimental data set was divided into training and test data sets. The PCA over the training data set (about 15% of the domains) was used for calculation of reference (etalon) eigenvectors, which were used for deconvolution of the testing weight coefficients

Page 42: Use Case: Data and Analysis Requirements in Scanning Probe and ...

Belianinov et al. Advanced Structural and Chemical Imaging (2015) 1:6 Page 17 of 25

over the test data set. The set of the training weights andcorresponding switching sequences are then applied forneural network training. The set of testing weights are fur-ther used as inputs for recognition.Experimental simulations of the suggested approach

showed its practical applicability and demonstratedprobability of the recognition above 65%; however, thisrelatively low value is mainly defined by irreproducibilityof the switching process, caused by the non-ideal natureof the ferroelectric crystal.
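The train/test scheme described above can be sketched as follows. This is a hedged illustration on synthetic 4 × 4 images, not the authors' implementation: the images, the pulse-sequence labels "01" and "10", and all function names are assumptions, and a nearest-centroid classifier is substituted for the feed-forward neural network to keep the example short.

```python
import numpy as np

def unfold(images):
    """Flatten a stack of N x N images into rows of length N*N."""
    images = np.asarray(images, dtype=float)
    return images.reshape(images.shape[0], -1)

def fit_pca(X, n_components):
    """Return the mean and leading eigenvectors of the training matrix X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, eigvecs):
    """Weight coefficients of each sample on the reference eigenvectors."""
    return (X - mean) @ eigvecs.T

def nearest_centroid(train_w, train_labels, test_w):
    """Stand-in classifier (not a neural network): nearest class mean."""
    train_labels = np.asarray(train_labels)
    classes = sorted(set(train_labels.tolist()))
    centroids = np.array([train_w[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_w[:, None, :] - centroids[None, :, :], axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]

# Synthetic stand-in for PFM domain images: two shapes plus weak noise.
rng = np.random.default_rng(0)
square = np.zeros((4, 4)); square[:2, :2] = 1.0
bar = np.zeros((4, 4)); bar[3, :] = 1.0
templates = {"01": square, "10": bar}  # hypothetical pulse-sequence labels

def make_set(n_per_class):
    images, labels = [], []
    for seq, template in templates.items():
        for _ in range(n_per_class):
            images.append(template + 0.05 * rng.standard_normal((4, 4)))
            labels.append(seq)
    return unfold(images), labels

train_X, train_labels = make_set(3)    # small training subset
test_X, test_labels = make_set(10)

# Eigenvectors come from the training set only; test images are projected onto them.
mean, eigvecs = fit_pca(train_X, n_components=2)
train_w = project(train_X, mean, eigvecs)
test_w = project(test_X, mean, eigvecs)
predicted = nearest_centroid(train_w, train_labels, test_w)
accuracy = float(np.mean([p == t for p, t in zip(predicted, test_labels)]))
```

The key design point mirrored here is that the reference eigenvectors are never recomputed on the test data, so the test weights live in the same coordinate system as the training weights.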

High-performance computing
The volume of scientific data generated by instruments, experiments, sensors, or supercomputer simulations has historically been characterized by exponential growth, and this trend is anticipated to continue into the future [87,88]. As detailed in Table 1, the current scientific data volume output by local imaging and spectroscopy techniques is significant and will require high-performance computing platforms to meet the demands of analysis and visualization. There is clearly a need for a framework that will allow for near real-time processing and interactive visualization of scanning and electron microscopy data. Figure 14 exemplifies the hardware types and algorithms needed in the life cycle of scientific data, from the point of generation to analysis, visualization, and data archival. Customizable scalable methods, big data workflows, and visualization for scanning and electron microscopy data are detailed in the following sections.

Scalable methods
The key concepts in generating effective high-performance computing methods are managing the latency of data transfer and balancing workload. Algorithms that are structured to effectively utilize the ever-increasing capacity of high-performance computing are called scalable methods. The movement of data within high-performance computing systems and across storage devices is well understood, with known hierarchies of transmission latency; therefore, analyzing scanning and

Figure 14 High-performance computation microscopy workflow. Life cycle ofexperimental measurements.

electron microscopy data on these platforms will require physics-based algorithms that are customized to exploit parallel work while minimizing communication cost [89]. Experimental scientists will need to join with computational scientists and applied mathematicians to continue to scale this analysis to the next generation of high-performance computing (HPC) systems [90].

Future HPC systems are expected to have processor cores, memory units, communication nodes, and other components totaling in the hundreds of millions [91], and it is expected that faults in these systems will occur in the time frame of seconds [92]. This underscores the requirement that algorithms designed for analysis of scanning and electron microscopy data use robust workload balancing tools that are resilient to errors in algorithmic execution, as well as in data transfer.
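As a small-scale illustration of fault-resilient workload balancing, the sketch below distributes chunks of work over a thread pool and resubmits any chunk whose worker fails. It is a single-node toy under stated assumptions: `run_resilient`, `flaky_sum`, and the simulated fault rule are hypothetical, and a production HPC code would rely on the resilience features of its own runtime.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_resilient(tasks, worker, n_workers=4, max_retries=3):
    """Run worker(task, attempt) for every task, resubmitting failed tasks."""
    results = {}
    attempts = {i: 0 for i in range(len(tasks))}
    pending = list(range(len(tasks)))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending:
            # The pool hands each task to whichever worker is free (dynamic balancing).
            futures = {pool.submit(worker, tasks[i], attempts[i]): i for i in pending}
            pending = []
            for fut in as_completed(futures):
                i = futures[fut]
                try:
                    results[i] = fut.result()
                except RuntimeError:
                    attempts[i] += 1
                    if attempts[i] > max_retries:
                        raise  # give up on a task that keeps failing
                    pending.append(i)  # schedule the task for another round
    return [results[i] for i in range(len(tasks))]

def flaky_sum(chunk, attempt):
    """Toy analysis step with a simulated fault on the first attempt."""
    if attempt == 0 and chunk[0] % 3 == 0:
        raise RuntimeError("simulated node fault")
    return sum(chunk)

chunks = [list(range(i, i + 4)) for i in range(0, 16, 4)]
totals = run_resilient(chunks, flaky_sum)
```

Passing the attempt counter to the worker keeps the simulated fault deterministic; results are returned in task order regardless of which tasks had to be retried.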

Big data workflows
To effectively leverage scalable methods for analysis on large-scale HPC systems, a sophisticated data workflow is required. Whereas computational scientists are accustomed to dealing with the idiosyncrasies of HPC environments (compiler technologies, scientific libraries, communication libraries, complex data models), microscopists generally are not. This presents a challenge in delivering the promise of near real-time analysis to the users at scanning probe, focused X-ray, tomography, and electron microscopy imaging facilities [89]. To overcome this challenge, we are employing an automated workflow-based approach.

In a typical example, the user will collect data from the instrument via the instrument control interface. As measurements progress, data is generated in a standard microscopy data format such as Digital Microscopy version 3 (DM3) or a text-based file (see Figure 14). Upon the completion of a measurement, the workflow begins with the data transfer via a light communication node from the instrument to high-performance storage [93]. This approach allows pipelining of the data to an HPC

near-real-time analysis of large data from local imaging and spectroscopy



environment in parallel while subsequent measurements are taken at the instrument and other instruments are sending data.

Once the data file is stored within an HPC environment, the next stage of the workflow includes conversion to a data model suitable for HPC-based analysis, generally using the Hierarchical Data Format version 5 (HDF5). With the data set now converted and resident on a parallel file system, the next stage of the workflow, analysis via scalable methods, can be executed. At this juncture, an analysis algorithm is selected based on the instrument, the measurement, the material composition, and other user-specified criteria. Once selected, the analysis is executed on an HPC system. The resultant data and statistics are then made available to the user for inspection and further analysis. Initial experimentation with this concept has shown that analysis can be completed in seconds, allowing near real-time feedback from the measurement. Upon completion of the analysis, the data is then organized for possible archival. Once data movement and analysis are completed, interactive visual analysis is made available for further inspection of the data.
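The sequence of workflow stages described above can be sketched as a simple pipeline. Everything here is a hypothetical illustration: the `Record` class, the stage functions, the `ALGORITHMS` registry, and the group name `Measurement_000` are assumptions, and a nested dictionary stands in for the HDF5 hierarchy so that the example needs no external library.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    instrument: str
    measurement: str
    raw: list
    log: list = field(default_factory=list)     # stages completed so far
    model: dict = field(default_factory=dict)   # stand-in for an HDF5 hierarchy
    result: dict = field(default_factory=dict)

def transfer(rec):
    # Placeholder for moving the file from the instrument to HPC storage.
    rec.log.append("transfer")
    return rec

def convert(rec):
    # Placeholder for conversion to a hierarchical data model.
    rec.model = {"Measurement_000": {
        "Raw_Data": list(rec.raw),
        "attrs": {"instrument": rec.instrument, "measurement": rec.measurement},
    }}
    rec.log.append("convert")
    return rec

# Hypothetical registry: the algorithm is chosen from instrument and measurement type.
ALGORITHMS = {
    ("SPM", "spectroscopy"): lambda data: {"mean": sum(data) / len(data)},
    ("STEM", "imaging"): lambda data: {"max": max(data)},
}

def analyze(rec):
    algorithm = ALGORITHMS[(rec.instrument, rec.measurement)]
    rec.result = algorithm(rec.model["Measurement_000"]["Raw_Data"])
    rec.log.append("analyze")
    return rec

def archive(rec):
    # Placeholder for organizing the data set for long-term storage.
    rec.log.append("archive")
    return rec

def run_workflow(rec):
    for stage in (transfer, convert, analyze, archive):
        rec = stage(rec)
    return rec

rec = run_workflow(Record("SPM", "spectroscopy", [1.0, 2.0, 3.0]))
```

Keeping each stage as a function of the record makes it straightforward to run several records through the pipeline concurrently while new measurements are still being taken.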

Scalable analytics
It is important to note that the difficulties surrounding scalable analytics in the context of the imaging methods discussed thus far extend far beyond the need for task-based and data-based parallelism. In particular, one of the primary challenges expected to impede further progress is the application of statistical methods in extremely high dimensions. Due to the structure of the analysis problems in computational settings, the complexity of the problem space manifests itself as a high-dimensional analysis problem, where dimensionality is most often associated with the number of measurements being considered simultaneously. The curse of dimensionality is a persistent phenomenon in modern statistics due to our ability to measure at rates and scales unheard of until the modern era [94]. However, there are many strategies to mitigate the statistical consequences of high dimensionality.

While some of the methods noted earlier in this paper are computationally scalable, in many cases, they are not appropriate for other reasons. For example, although PCA, ICA, k-means, and back propagation for neural networks all fit the Statistical Query Model, and thus belong to a known set of problems that can essentially scale linearly, this does not necessarily solve the issues raised by high-dimensional analysis [95]. In particular, it is important to observe that in high-dimensional spaces, nearest neighbors become nearly equidistant [96]. This is particularly problematic for clustering algorithms but also has significant consequences for other dimensionality reduction techniques.
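The statement that nearest neighbors become nearly equidistant can be checked numerically. The sketch below is an illustrative experiment on uniformly random points, with `relative_contrast` as a hypothetical helper: it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)      # large: some points are far closer than others
high_dim = relative_contrast(1000)  # small: all points are at similar distances
```

As the dimension grows, the contrast shrinks toward zero, which is why distance-based clustering loses discriminating power unless the data have lower-dimensional structure.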

Clustering in high-dimensional spaces has been addressed using a variety of methods that consider scalability. A good example is the use of hashing in similarity measurements. Hashing techniques that facilitate neighborhood searches in high-dimensional space rely on various assumptions for tractability. Often, these assumptions include independence among the dimensions; in the case of Weiss et al., the authors suggest the use of PCA to prepare the data in such a way that these assumptions are more accurate [97]. Moreover, various hashing techniques attempt to preserve distances between points in different ways, such that the user must understand these assumptions in order to choose the best approach [97,98]. For example, the method of Weiss et al. gains much of its power by only attempting to preserve the relative order of small distances. Beyond a certain distance, all distances are allowed to become equal in the space represented by the hash codes. However, this brings us back to the unfortunate situation that in extremely high dimensions, points tend to become equidistant, such that these hashing approaches cannot be expected to work for problems that do not have structure allowing some sort of dimensionality reduction.

We also suspect that many important patterns cannot be captured by linear dimensionality reduction techniques alone. However, non-linear techniques, such as those shown by Roweis et al., Tenenbaum et al., Belkin et al., and Gerber et al., are less scalable [99-102]. Many such methods fall under the umbrella of manifold learning, which is a technique meant to take advantage of cases where the data lie on a non-linear subspace that can be represented by a significantly smaller number of dimensions [103]. Many manifold learning approaches involve the solution of a symmetric diagonally dominant (SDD) linear system, and recent progress has been made in finding more efficient, scalable solutions to such problems [104].

When dimensionality reduction techniques still leave large numbers of potentially relevant measurements, other scalable approaches for dealing with high-dimensional analysis are still required. In the case of clustering, one such scalable approach can be found in the methodology of Vatsavai et al. [105]. Note that this method also automatically attempts to select the number of clusters, a known problem for k-means clustering.

Many of the most effective solutions to the challenges presented by high-dimensional data have relied on the injection of additional knowledge. In the case where human expertise can play a part in pattern discovery and dimensionality reduction, data analysis becomes much easier. Unfortunately, more often than not, we are dealing with problems where the physics are unknown and the manual discovery of patterns is extremely difficult even in the case of deep domain knowledge. Thus, more automated methods for incorporating additional information, such as the integration of alternate imaging modalities, become important.

Moreover, methods of automated pattern discovery in large data sets have made great progress in recent years. In particular, in the case of imagery methods, much progress in automated feature extraction has occurred in the area known as deep learning [106]. However, such methods rely on large aggregated image repositories. This means that big data workflows have to be in place to retain large numbers of experimental results and allow their joint analysis. In addition, while these methods have proven to be scalable, they are also subject to finding many irrelevant patterns when utilizing networks consisting of extreme numbers of parameters [107].
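To make the idea of hashing for similarity search, discussed earlier in this section, more concrete, the sketch below uses sign-random-projection hashing. This is a generic locality-sensitive scheme chosen for brevity, not the method of Weiss et al.; the function names and the 16-bit code length are assumptions.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def hamming(a, b):
    """Number of differing bits between two codes."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(dim=4, n_bits=16)
v = [1.0, 2.0, -0.5, 3.0]
same_direction = [2.0 * x for x in v]
opposite = [-x for x in v]

code_v = hash_code(v, planes)
d_same = hamming(code_v, hash_code(same_direction, planes))
d_opposite = hamming(code_v, hash_code(opposite, planes))
```

Vectors pointing the same way share a code, while the opposite vector differs in every bit; the Hamming distance between codes therefore tracks the angle between vectors, which is the property that lets hash codes replace exact distance computations.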

Visualization
Dynamic hypothesis generation and confirmation techniques are a necessity for enabling scientific progress in extreme-scale science domains. Indeed, when insight is gained from the data, new questions arise, leading to more detailed examination of specific constituents. Accordingly, scientific analysis techniques should enhance the scientist’s cognitive workflow by intelligently blending human interaction and computational analytics at scale via interactive data visualization. The orchestration of human cognition and computational power is critical for two primary reasons: (i) the data are too large for purely visual methods and require assistance from data processing and mining algorithms, and (ii) the tasks are too exploratory for purely analytical methods and call for human involvement. Having established our strategy for harnessing computational power through automated analytical algorithms, we devote the remainder of this section to several key strategies for integrating human-guided scientific analysis at scale in the materials sciences.

Given the scale and complexity of the materials data, a visual analytics approach is the most viable solution to accelerate knowledge discovery. Thomas et al. define visual analytics as ‘the science of analytical reasoning facilitated by interactive visual interfaces’ [108]. The fundamental goal of visual analytics is to turn the challenge of information overload into an opportunity by visually representing information and allowing direct interaction to generate and test hypotheses. The advantage of visual analytics is that users can focus their full cognitive and perceptual capabilities on the analytical process, while simultaneously leveraging advanced computational capabilities to guide the discovery process [109]. Visual analytics is a modern take on the concept of exploratory data analysis (EDA) [110]. Introduced by Tukey, EDA is a data analysis philosophy that emphasizes the involvement of both visual and statistical understanding in the analysis process.

To allow efficient EDA in materials science, the combination of multiple views (CMV) and focus + context information visualizations are needed. CMV is an interaction methodology that involves linked view manipulations distributed across multiple visualizations, and recent evaluations demonstrate that this approach fosters more creative and efficient analysis than non-coordinated views [111]. In a CMV system, as the scientist manipulates a particular visualization (e.g., item selections, filtering, variable integrations), the manipulations are immediately propagated to the other visualizations using a linked data model. In conjunction with CMV, focus + context representations support efficient EDA by preserving the context of the more complete overview of the data during zooming and panning operations. As the scientist zooms into the data views to see more details, the focus + context display simultaneously maintains the context or gestalt [112] of the entire data set. In this way, the operator is less likely to lose their orientation within the overall data space while investigating fine-grain details.

Given the need to analyze multiple dimensions in materials science scenarios, multidimensional information visualization techniques that enable comparative studies are required. In conjunction with the dimensionality reduction techniques mentioned previously, lossless multidimensional visualizations are also desired. A promising solution is to use an approach similar to the Exploratory Data Analysis Environment (EDEN) system [113], which is built around a highly interactive variant of the parallel coordinates technique. Inselberg initially popularized the parallel coordinates technique as an approach for representing hyper-dimensional geometries [114]. In general, the technique yields a compact two-dimensional representation of multidimensional data sets by representing the N-dimensional data tuple C with coordinates (c1, c2,…, cN) [115] on N parallel axes that are joined with a polyline. In theory, the number of dimensions that can be displayed is only limited by the horizontal resolution of the display device (e.g., Figure 15 shows a particular parallel coordinates plot in EDEN that accommodates the simultaneous display of 88 variable axes). Consequently, parallel coordinates avoid the loss of information incurred by dimensionality reduction techniques. In a practical sense, however, the axes that are immediately adjacent to one another yield the most obvious information about relationships between attributes. In order to analyze attributes that are separated by one or more axes, interactions and graphical indicators are required. Several innovative extensions that seek to improve interaction and cognition with parallel coordinates have been described in the visualization research literature. For example, Hauser et al. [116] described a histogram display, dynamic axis re-ordering, axis inversion,


Figure 15 Lossless multidimensional visualization. EDEN is used to visually analyze a 1,000-simulation CLM4 point ensemble data set with 81 parameters and 7 output variables on ORNL’s EVEREST power wall facility, which offers 11,520 × 3,072 (35 million) pixels. EDEN is a promising technique for materials science data analysis, especially when it is coupled with dimensionality reduction and statistical learning algorithms.


and details-on-demand capabilities for parallel coordinates. The literature covering parallel coordinates is vast and covers multiple domains, as recently surveyed by Heinrich and Weiskopf [117].

EDEN extends the classical parallel coordinates axis by providing cues that guide and refine the analyst’s exploration of the information space. This approach is akin to the concept of the scented widget described by Willett et al. [118]. Scented widgets are graphical user interface components that are augmented with an embedded visualization to enable efficient navigation in the information space of the data items. The concept arises from the information foraging theory described by Pirolli and Card [119], which relates human information gathering to the food foraging activities of animals. In this model, the concept of information scent is identified as the ‘user perception of the value, cost, or access path of information sources obtained by proximal cues’ [119]. In EDEN, the scented axis widgets are augmented with information from automated data mining processes (e.g., statistical filters, automatic axis arrangements, regression mining, correlation mining, and subset selection capabilities) that highlight potentially relevant associations and reduce knowledge discovery timelines.

The parallel coordinates plot is ideal for exploratory analysis of materials science data because it accommodates the simultaneous display of a large number of variables in a two-dimensional representation. In EDEN, the parallel coordinates plot is extended with a number of capabilities that facilitate exploratory data analysis and guide the scientist to the most significant relationships in the data. A full description of these extensions is beyond the scope of this article, but the reader is encouraged to consult prior publications for more detailed explanations of our multidimensional analysis techniques [113,120]. EDEN is an exemplary case of the indispensable visual analytics techniques that provide intelligent user interfaces by leveraging both visual representations and human interaction, thereby enhancing scientific discovery with vital assistance from automated analytics. As we develop new visual analytics approaches like EDEN for materials science workflows, we expect to dramatically reduce knowledge discovery timelines through more intuitive and exploratory analysis guided by machine learning algorithms in an intelligent visual interface.
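The parallel coordinates construction described in this section, in which each N-dimensional tuple becomes a polyline across N parallel axes, can be sketched in a few lines. This is a generic illustration and not EDEN code: the `polylines` function, the min-max scaling of each axis, and the sample records are assumptions.

```python
def polylines(records, width=1.0, height=1.0):
    """Map each N-dimensional record to N (x, y) vertices, one per parallel axis."""
    n_axes = len(records[0])
    lo = [min(r[i] for r in records) for i in range(n_axes)]
    hi = [max(r[i] for r in records) for i in range(n_axes)]
    # Axes are evenly spaced along the horizontal direction.
    xs = [width * i / (n_axes - 1) for i in range(n_axes)]
    lines = []
    for r in records:
        ys = []
        for i in range(n_axes):
            if hi[i] > lo[i]:
                # Scale each variable independently to the axis height.
                ys.append(height * (r[i] - lo[i]) / (hi[i] - lo[i]))
            else:
                ys.append(height / 2)  # constant variable: draw at mid-height
        lines.append(list(zip(xs, ys)))
    return lines

# Three hypothetical records with three variables each.
records = [(0.0, 10.0, 5.0), (1.0, 20.0, 5.0), (0.5, 15.0, 5.0)]
lines = polylines(records)
```

Each returned list of vertices can be passed directly to a line-drawing routine; since every variable keeps its own axis, no information is discarded, although the horizontal spacing shrinks as more axes are added.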

Conclusions
The development of electron and scanning probe microscopies in the second half of the twentieth century was enabled by computer-assisted methods for automatic data acquisition, storage, analysis, and tuning and refinement of feedback loops as well as imaging parameters. In the last decade, high-resolution STEM and STM imaging techniques have enabled acquisition of high-veracity information [121] at the atomic scale, readily providing insight into positions and functionality of materials that have been inaccessible due to a lagging analysis framework in the microscopy communities. Naturally, progress in the complexity of dynamic and functional imaging leads to multidimensional data sets containing spectral information on local physical and chemical functionalities, which can be easily expanded further to acquire data as a function of a plethora of parameters such as time, temperature, or many other external stimuli.

Maximizing the scientific output from existing and future microscopes brings forth the challenge of analysis, visualization, and storage of data, as well as decorrelation and classification of the known and unknown hidden data parameters, that is, traditional big data analysis. The existing infrastructure for such analysis has been developed in the context of medical and satellite imaging, and its extension to functional and structural imaging data is a natural next step. Of course, further development toward a flexible infrastructure where scientists can select or define their own analysis algorithms to analyze the data ‘on the fly’ as it is being collected can be envisioned. This will require scalable algorithms, high-performance computing, and storage infrastructure. Reducing the data sets to a more manageable size, while initially attractive, comes with the risk of losing significant information within the data, particularly for exploratory studies in which the phenomena of interest may not be captured by statistical methods.

Beyond the big data challenges [122,123] is the transition to a deep data approach, in which we fully utilize all the information present within the data to derive an understanding [124] - namely, how do we ascribe relevant



physical and chemical information contained within the data sets, differentiate relevant and coincidental behaviors, move beyond simple correlation, and link to scientific theory? High-resolution imaging allows us to explore the microscopic degrees of freedom in the system - how can we use theory to understand these behaviors, refine theoretical models, and ultimately enable knowledge-driven design and optimization of new materials? [125]. To achieve this goal, new methods and theories will be necessary for defining the local chemical and physical descriptions, their spatial distribution, and their evolution during reactions. While complicated, recent progress in information and statistical theory suggests that such descriptions are possible [126].

One of the approaches to achieve this goal is through the user center model that combines development and maintenance of cutting-edge tools, as well as experience and detailed knowledge of data interpretation in terms of relevant behaviors, all while maintaining an open access policy - making the findings available to the broader scientific community. Equally important will be the cross-disciplinary synergy between theory, imaging, and data analytics, harnessing the power of multivariate statistical methods to understand and explore multidimensional imaging and spectroscopy data sets.

Integration of the knowledge in the field will allow development of universal database libraries allowing identification and data mining of novel and well-understood materials, refinement and improvement of dynamic data, and ultimately creation of supervised expert systems that will allow rapid identification and analysis of unknown systems. Successes in fields such as medical diagnostics and imaging suggest that this is fully possible. These developments will further open the pathway for exploration and tailoring of desired material functionalities based on better information. We anticipate the emergence of Google-like environments that will allow storage and interpretation of collective knowledge and image interpretation in the context of data and historical knowledge. Rather than creating multiple samples, the structure-property relationships extracted from a single disordered sample could offer a statistical picture of materials functionality, providing the experimental counterpart to Materials Genome-type programs.

Abbreviations
AC: alternating current; AFM: atomic force microscopy; ANN: artificial neural network; BE: band excitation; BEPS: band excitation polarization spectroscopy; c-AFM: conductive atomic force microscopy; CFO-BFO: CoFe2O4-BiFeO3; CITS: current imaging tunneling spectroscopy; CMV: combination of multiple views; DC: direct current; DM3: Digital Microscopy version 3; EDA: exploratory data analysis; EDEN: Exploratory Data Analysis Environment; FFT: fast Fourier transform; FORC: first-order reversal curve; FORC-IV: first-order reversal curve current-voltage; HPC: high-performance computing; HDF5: Hierarchical Data Format version 5; ICA: independent component analysis; IV: current-voltage; PCA: principal component analysis; PFM: piezoresponse force microscopy; RHEED: reflection high-energy electron diffraction; SPM: scanning probe microscopy; SS PFM: switching spectroscopy piezoresponse force microscopy; STEM: scanning transmission electron microscopy; STEM-HAADF: scanning transmission electron microscopy high-angle annular dark-field imaging; STM: scanning tunneling microscopy; STS: scanning tunneling spectroscopy; SDD: symmetric diagonally dominant; ω: frequency.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
AB prepared the manuscript and assembled the detailed statistical methods. RV prepared sections 4D and 5D data - Band Excitation Spectroscopy Analysis and Imaging, ‘Sliding Fourier Transform’, and ‘Imaging in k-space’. ES prepared the sections ‘Independent component analysis’ and ‘Bayesian de-mixing’. CS, GS, CS, and RA prepared the section ‘Image domain’ in its entirety. SMY carried out experiments in the section ‘Independent component analysis’. AT and MB characterized and prepared samples described in sections ‘3D data - CITS in STEM’ and ‘Imaging in k-space’. AB characterized samples in the section ‘4D and 5D data - band excitation spectroscopy analysis’. SJ and SK contributed heavily to the writing of the manuscript as well as to meaningful discussion. All authors read and approved the final manuscript.

Acknowledgements
This research was sponsored by the Division of Materials Sciences and Engineering, BES, DOE (RKV, AT, SVK). The data analysis portion of this research (ES, MB) was conducted at the Center for Nanophase Materials Sciences, which is a DOE Office of Science User Facility. Research related to atomic resolution imaging (AB, AB, SJ) was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy. The authors gratefully acknowledge Dr. S. Zhang (Penn. State) for providing the PMN-PT ferroelectric relaxor sample as well as Dr. Ying-Hao Chu and Ying-Hui Hsieh for providing BFO-CFO nanocomposite samples. SMY acknowledges the support by IBS-R009-D1, Korea.

Author details
1 Institute for Functional Imaging of Materials, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
2 The Center for Nanophase Materials Sciences, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
3 Materials Sciences and Technology Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
4 Center for Correlated Electron Systems, Institute for Basic Science (IBS), Seoul 151-747, South Korea.
5 Department of Physics and Astronomy, Seoul National University, Seoul 151-747, South Korea.
6 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
7 Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
8 Computer, Computational, and Statistical Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.

Received: 27 January 2015 Accepted: 21 April 2015

References
1. Mody, C: Instrumental Community: Probe Microscopy and the Path to Nanotechnology. MIT Press, Boston, MA (2011)
2. Binder, K, Young, AP: Spin-glasses: experimental facts, theoretical concepts, and open questions. Rev Mod Phys 58(4), 801–976 (1986). doi:10.1103/RevModPhys.58.801
3. Binder, K, Reger, JD: Theory of orientational glasses models, concepts, simulations. Adv Phys 41(6), 547–627 (1992). doi:10.1561/2200000006
4. Westphal, V, Kleemann, W, Glinchuk, MD: Diffuse phase transitions and random-field-induced domain states of the “relaxor” ferroelectric PbMg1/3Nb2/3O3. Phys Rev Lett 68(6), 847–850 (1992). doi:10.1103/PhysRevLett.68.847
5. Tagantsev, AK, Glazounov, AE: Does freezing in PbMg1/3Nb2/3O3 relaxor manifest itself in nonlinear dielectric susceptibility? Appl Phys Lett 74(13), 1910–1912 (1999). doi:10.1063/1.123710
6. Winter, M, Besenhard, JO, Spahr, ME, Novak, P: Insertion electrode materials for rechargeable lithium batteries. Adv Mater 10(10), 725–763 (1998). doi:10.1002/(sici)1521-4095(199807)10:10<725::aid-adma725>3.0.co;2-z
7. Bagotsky, VS: Fuel Cells: Problems and Solutions. Wiley, Hoboken, NJ (2009)



8. Adler, SB: Factors governing oxygen reduction in solid oxide fuel cell cathodes. Chem Rev 104(10), 4791–4843 (2004). doi:10.1021/cr020724o
9. Machta, BB, Chachra, R, Transtrum, MK, Sethna, JP: Parameter space compression underlies emergent theories and predictive models. Science 342, 604–607 (2013). doi:10.1126/science.1238723
10. Kalinin, SV, Balke, N: Local electrochemical functionality in energy storage materials and devices by scanning probe microscopies: status and perspectives. Adv Mater 22(35), E193–E209 (2010). doi:10.1002/adma.201001190
11. Balke, N, Jesse, S, Morozovska, AN, Eliseev, E, Chung, DW, Kim, Y, Adamczyk, L, Garcia, RE, Dudney, N, Kalinin, SV: Nanometer-scale electrochemical intercalation and diffusion mapping of Li-ion battery materials. Nat Nanotechnol 5, 7349–7357 (2010)
12. Balke, N, Bdikin, I, Kalinin, SV, Kholkin, AL: Electromechanical imaging and spectroscopy of ferroelectric and piezoelectric materials: state of the art and prospects for the future. J Am Ceram Soc 92(8), 1629–1647 (2009). doi:10.1111/j.1551-2916.2009.03240.x
13. Kalinin, SV, Rodriguez, BJ, Jesse, S, Maksymovych, P, Seal, K, Nikiforov, M, Baddorf, AP, Kholkin, AL, Proksch, R: Local bias-induced phase transitions. Materials Today 11(11), 16–27 (2008). doi:10.1016/s1369-7021(08)70235-9
14. Felts, JR, Somnath, S, Ewoldt, RH, King, WP: Nanometer-scale flow of molten polyethylene from a heated atomic force microscope tip. Nanotechnology 23(21), 215301 (2012). doi:10.1088/0957-4484/23/21/215301
15. King, WP, Kenny, TW, Goodson, KE, Cross, G, Despont, M, Dürig, U, Rothuizen, H, Binnig, GK, Vettiger, P: Atomic force microscope cantilevers for combined thermomechanical data writing and reading. Appl Phys Lett 78(9), 1300–1302 (2001). doi:10.1063/1.1351846
16. Jesse, S, Nikiforov, MP, Germinario, LT, Kalinin, SV: Local thermomechanical characterization of phase transitions using band excitation atomic force acoustic microscopy with heated probe. Appl Phys Lett 93(7), 073104 (2008). doi:10.1063/1.2965470
17. Nikiforov, MP, Jesse, S, Morozovska, AN, Eliseev, EA, Germinario, LT, Kalinin, SV: Probing the temperature dependence of the mechanical properties of polymers at the nanoscale with band excitation thermal scanning probe microscopy. Nanotechnology 20(39), 395709 (2009). doi:10.1088/0957-4484/20/39/395709
18. Somnath, S, Corbin, EA, King, WP: Improved nanotopography sensing via temperature control of a heated atomic force microscope cantilever. Sensors J, IEEE 11(11), 2664–2670 (2011). doi:10.1109/JSEN.2011.2157121
19. Kelly, SJ, Kim, Y, Eliseev, E, Morozovska, A, Jesse, S, Biegalski, MD, Mitchell, JF, Zheng, H, Aarts, J, Hwang, I: Controlled mechanical modification of manganite surface with nanoscale resolution. Nanotechnology 25(47), 475302 (2014). doi:10.1088/0957-4484/25/47/475302
20. Kim, Y, Kelly, SJ, Morozovska, A, Rahani, EK, Strelcov, E, Eliseev, E, Jesse, S, Biegalski, MD, Balke, N, Benedek, N: Mechanical control of electroresistive switching. Nano Lett 13(9), 4068–4074 (2013). doi:10.1021/nl401411r
21. Lu, H, Kim, D, Bark, C-W, Ryu, S, Eom, C, Tsymbal, E, Gruverman, A: Mechanically-induced resistive switching in ferroelectric tunnel junctions. Nano Lett 12(12), 6289–6292 (2012). doi:10.1021/nl303396n
22. Zhang, JX, Xiang, B, He, Q, Seidel, J, Zeches, RJ, Yu, P, Yang, SY, Wang, CH, Chu, YH, Martin, LW, Minor, AM, Ramesh, R: Large field-induced strains in a lead-free piezoelectric material. Nat Nanotechnol 6(2), 98–102 (2011). doi:10.1038/nnano.2010.265
23. Dao, M, Chollacoop, N, Van Vliet, K, Venkatesh, T, Suresh, S: Computational modeling of the forward and reverse problems in instrumented sharp indentation. Acta Mater 49(19), 3899–3918 (2001). doi:10.1016/S1359-6454(01)00295-6
24. Garcia, R, Martinez, RV, Martinez, J: Nano-chemistry and scanning probe nanolithographies. Chem Soc Rev 35(1), 29–38 (2006). doi:10.1039/B501599P

25. Martinez, J, Martínez, RV, Garcia, R: Silicon nanowire transistors with a channelwidth of 4 nm fabricated by atomic force microscope nanolithography. NanoLett 8(11), 3636–3639 (2008). doi:10.1021/nl801599k

26. Van Vliet, KJ, Li, J, Zhu, T, Yip, S, Suresh, S: Quantifying the early stages ofplasticity through nanoscale experiments and simulations. Phys Rev B67(10), 104105 (2003). doi:dx.doi.org/10.1103/PhysRevB.67.104105

27. Chang, HJ, Kalinin, SV, Yang, S, Yu, P, Bhattacharya, S, Wu, PP, Balke, N, Jesse, S,Chen, LQ, Ramesh, R, Pennycook, SJ, Borisevich, AY: Watching domains grow:in-situ studies of polarization switching by combined scanning probe andscanning transmission electron microscopy. J Appl Phys 110(5), 052014 (2011).doi:10.1063/1.3623779

28. Nelson, CT, Gao, P, Jokisaari, JR, Heikes, C, Adamo, C, Melville, A, Baek, SH,Folkman, CM, Winchester, B, Gu, YJ, Liu, YM, Zhang, K, Wang, EG, Li, JY,

Chen, LQ, Eom, CB, Schlom, DG, Pan, XQ: Domain dynamics duringferroelectric switching. Science 334(6058), 968–971 (2011). doi:10.1126/science.1206980

29. Jesse, S, Guo, S, Kumar, A, Rodriguez, BJ, Proksch, R, Kalinin, SV: Resolutiontheory, and static and frequency-dependent cross-talk in piezoresponseforce microscopy. Nanotechnology 21(40), 405703 (2010). doi:10.1088/0957-4484/21/40/405703

30. Jesse, S, Kalinin, SV: Band excitation in scanning probe microscopy: sines ofchange. J Phys D Appl Phys 44(46), 464006–464021 (2011). doi:10.1088/0022-3727/44/46/464006

31. Kalinin, SV, Jesse, S, Proksch, R: Information acquisition & processing inscanning probe microscopy. J Name: R & D Magazine 50(4), 20 (2008)

32. Rodriguez, BJ, Callahan, C, Kalinin, SV, Proksch, R: Dual-frequencyresonance-tracking atomic force microscopy. Nanotechnology 18(47),475504–475509 (2007)

33. Mayergoyz, ID, Friedman, G: Generalized Preisach model of hysteresis. IEEETrans Magn 24(1), 212–217 (1988). doi:10.1109/20.43892

34. Mitchler, PD, Roshko, RM, Dahlberg, ED: A Preisach model with a temperatureand time-dependent remanence maximum. J Appl Phys 81(8), 5221–5223(1997). doi:10.1063/1.364473

35. Jesse, S, Kalinin, SV: Principal component and spatial correlation analysis ofspectroscopic-imaging data in scanning probe microscopy. Nanotechnology20(8), 085714 (2009). doi:10.1088/0957-4484/20/8/085714

36. Nan Y, Belianinov A, Strelcov E, Tebano A, Foglietti V, Di Castro D, Schlueter C,Lee T-L, Baddorf A P, Balke N, Jesse S, Kalinin S V, Balestrino G, Aruta C: Effect ofdoping on surface reactivity and conduction mechanism in samarium-dopedceria thin films. ACS Nano, 8(12), 12494-12501. doi:10.1021/nn505345c

37. Bosman, M, Watanabe, M, Alexander, DTL, Keast, VJ: Mapping chemical andbonding information using multivariate analysis of electron energy-lossspectrum images. Ultramicroscopy 106(11–12), 1024–1032 (2006).doi:10.1016/j.ultramic.2006.04.016

38. Bonnet, N: Artificial intelligence and pattern recognition techniques inmicroscope image processing and analysis. In: Hawkes, PW (ed.) vol. 114.Advances in Imaging and Electron Physics, pp. 1–77. Elsevier AcademicPress Inc, San Diego (2000)

39. Bonnet, N: Multivariate statistical methods for the analysis of microscopeimage series: applications in materials science. J Microsc-Oxf 190, 2–18(1998). doi:10.1046/j.1365-2818.1998.3250876.x

40. Belianinov, A, Ganesh, P, Lin, W, Sales, BC, Sefat, AS, Jesse, S, Pan, M, Kalinin, SV:Research update: spatially resolved mapping of electronic structure on atomiclevel by multivariate statistical analysis. APL Materials 2(12), 120701 (2014).doi:dx.doi.org/10.1063/1.4902996

41. Belianinov, A, Kalinin, SV, Jesse, S: Complete information acquisition in dynamicforce microscopy. Nat Commun. 6, (2015). doi:10.1038/ncomms7550

42. Hyvärinen, A, Karhunen, J, Oja, E: Independent component analysis, vol. 46.John Wiley & Sons, Danvers, MA (2004)

43. Dobigeon, N, Moussaoui, S, Coulon, M, Tourneret, JY, Hero, AO: JointBayesian endmember extraction and linear unmixing for hyperspectralimagery. IEEE Trans Signal Process 57(11), 4355–4368 (2009). doi:10.1109/tsp.2009.2025797

44. Parra, L, Mueller, K-R, Spence, C, Ziehe, A, Sajda, P: Unmixing hyperspectraldata. Advances in Neural Information Processing Systems (NIPS) 12,942–948 (2000)

45. Dobigeon, N, Tourneret, JY, Chein, IC: Semi-supervised linear spectral unmixingusing a hierarchical Bayesian model for hyperspectral imagery. IEEE TransSignal Process 56(7), 2684–2695 (2008). doi:10.1109/tsp.2008.917851

46. Moussaoui, S, Brie, D, Mohammad-Djafari, A, Carteret, C: Separation ofnon-negative mixture of non-negative sources using a Bayesian approachand MCMC sampling. IEEE Trans Signal Process 54, 4133–4145 (2006).doi:10.1109/TSP.2006.880310

47. Dobigeon, N, Moussaoui, S, Tourneret, JY: Blind unmixing of linear mixturesusing a hierarchical Bayesian model. Application to spectroscopic signalanalysis, pp. 79–83. Proc. IEEE-SP Workshop Stat. and Signal Processing,Madison, WI (2007)

48. Winter, ME: N-FINDR: an algorithm for fast autonomous spectral end-memberdetermination in hyperspectral data. In: Shen, MRDSS (ed.) SPIE, pp. 266–275.(1999)

49. Hartigan, JA, Wong, MA: Algorithm AS 136: a K-means clustering algorithm.J R Stat Soc: Ser C: Appl Stat 28(1), 100–108 (1979). doi:10.2307/2346830

50. MacQueen, JB: Some methods for classification and analysis of multivariateobservations. In: Cam, L.M.L., Neyman, J. (eds.) Proc. of the fifth Berkeley

Page 48: Use Case: Data and Analysis Requirements in Scanning Probe and ...

Belianinov et al. Advanced Structural and Chemical Imaging (2015) 1:6 Page 23 of 25

Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.University of California Press (1967)

51. Binnig, G, Rohrer, H: Scanning tunneling microscopy. Helv Phys Acta 55(6),726–735 (1982)

52. Binnig, G, Rohrer, H, Gerber, C, Weibel, E: 7X7 Reconstruction on Si(111)resolved in real space. Phys Rev Lett 50(2), 120–123 (1983). doi:10.1103/PhysRevLett.50.120

53. Stroscio, J, Stroscio, A, Kaiser, W, Kaiser, J: Scanning Tunneling Microscopy,vol. volume 27. Methods in Experimental Physics. Academic Press, SanDiego, CA (1993)

54. Asenjo, A, Gomezrodriguez, JM, Baro, AM: Current imaging tunnelingspectroscopy of metallic deposits of silicon. Ultramicroscopy 42, 933–939(1992). doi:10.1016/0304-3991(92)90381-s

55. Sales, BC, Sefat, AS, McGuire, MA, Jin, RY, Mandrus, D, Mozharivskyj, Y: Bulksuperconductivity at 14 K in single crystals of Fe{1+y}Te{x}Se{1−x}. Phys Rev B79(9), 094521 (2009)

56. Sefat, AS, Singh, DJ, Mater, DJ: Chemistry and electronic structure ofiron-based superconductors. Mater Research Bull 36, 614 (2011)

57. Tselev A, Ivanov I N, Lavrik N V, Belianinov A, Jesse S, Mathews J P, MitchellG D, Kalinin SV: Mapping internal structure of coal by confocal micro-Ramanspectroscopy and scanning microwave microscopy. Fuel. 126, 32-37.doi:10.1016/j.fuel.2014.02.029

58. Strelcov, E, Belianinov, A, Hsieh, Y-H, Jesse, S, Baddorf, AP, Chu, Y-H, Kalinin, SV:Deep data analysis of conductive phenomena on complex oxide interfaces:physics from data mining. ACS Nano 8(6), 6449–6457 (2014). doi:10.1021/nn502029b

59. Strelcov, E, Belianinov, A, Sumpter, BG, Kalinin, SV: Extracting physics throughdeep data analysis. Materials Today 17(9), 416–417 (2014). doi:10.1016/j.mattod.2014.10.002

60. Haykin, SS: Neural Networks: A Comprehensive Foundation. Prentice Hall,New York, NY (1999)

61. Bintachitt, P, Trolier-McKinstry, S, Seal, K, Jesse, S, Kalinin, SV: Switchingspectroscopy piezoresponse force microscopy of polycrystalline capacitorstructures. Appl Phys Lett 94(4), 042906 (2009). doi:10.1063/1.3070543

62. Marincel D M, Zhang H R, Britson J, Belianinov A, Jesse S, Kalinin SV, Chen LQ,Rainforth WM, Reaney IM, Randall CA, Trolier-McKinstry S: Domain pinning neara single-grain boundary in tetragonal and rhombohedral lead zirconatetitanate films. Physical Review B. 91, 134113. doi:10.1103/PhysRevB.91.134113

63. Gruverman, A, Kholkin, A: Nanoscale ferroelectrics: processing, characterizationand future trends. Rep Prog Phys 69(8), 2443–2474 (2006). doi:10.1088/0034-4885/69/8/r04

64. Gruverman, A, Auciello, O, Ramesh, R, Tokumoto, H: Scanning forcemicroscopy of domain structure in ferroelectric thin films: imagingand control. Nanotechnology 8, A38–A43 (1997). doi:10.1088/0957-4484/8/3a/008

65. Gruverman, AL, Hatano, J, Tokumoto, H: Scanning force microscopy studiesof domain structure in BaTiO3 single crystals. Jpn J Appl Phys Part 1 - RegulPap Short Notes Rev Pap 36(4A), 2207–2211 (1997). doi:10.1143/jjap.36.2207

66. Roelofs, A, Bottger, U, Waser, R, Schlaphof, F, Trogisch, S, Eng, LM: Differentiating180 degrees and 90 degrees switching of ferroelectric domains with three-dimensional piezoresponse force microscopy. Appl Phys Lett 77(21), 3444–3446(2000). doi:10.1063/1.1328049

67. Li, F, Zhang, S, Xu, Z, Wei, X, Luo, J, Shrout, TR: Composition and phasedependence of the intrinsic and extrinsic piezoelectric activity of domainengineered (1− x) Pb (Mg1/3Nb2/3) O3− xPbTiO3 crystals. J Appl Phys108(3), 034106 (2010). doi:dx.doi.org/10.1063/1.3466978

68. Nikiforov, MP, Reukov, VV, Thompson, GL, Vertegel, AA, Guo, S, Kalinin, SV,Jesse, S: Functional recognition imaging using artificial neural networks:applications to rapid cellular identification via broadband electromechanicalresponse. Nanotechnology 20(40), 405708 (2009). doi:10.1088/0957-4484/20/40/405708

69. Jesse, S, Kalinin, SV, Proksch, R, Baddorf, AP, Rodriguez, BJ: The bandexcitation method in scanning probe microscopy for rapid mapping ofenergy dissipation on the nanoscale. Nanotechnology 18(43), 435503(2007). doi:10.1088/0957-4484/18/43/435503

70. Jesse, S, Vasudevan, RK, Collins, L, Strelcov, E, Okatan, MB, Belianinov, A,Baddorf, AP, Proksch, R, Kalinin, SV: Band excitation in scanning probemicroscopy: recognition and functional imaging. Annu Rev Phys Chem 65,519–536 (2014). doi:10.1146/annurev-physchem-040513-103609

71. Kalinin, SV, Rodriguez, BJ, Budai, JD, Jesse, S, Morozovska, AN, Bokov, AA, Ye, ZG:Direct evidence of mesoscopic dynamic heterogeneities at the surfaces of

ergodic ferroelectric relaxors. Phys Rev B 81(6), 064107 (2010). doi:dx.doi.org/10.1103/PhysRevB.81.064107

72. Kumar, A, Ovchinnikov, O, Guo, S, Griggio, F, Jesse, S, Trolier-McKinstry, S,Kalinin, SV: Spatially resolved mapping of disorder type and distribution inrandom systems using artificial neural network recognition. Phys Rev B84(2), 024203 (2011). doi:dx.doi.org/10.1103/PhysRevB.84.024203

73. Ovchinnikov, OS, Jesse, S, Bintacchit, P, Trolier-McKinstry, S, Kalinin, SV:Disorder identification in hysteresis data: recognition analysis of therandom-bond-random-field Ising model. Phys. Rev. Lett. 103(15) (2009).doi:10.1103/PhysRevLett.103.157203

74. Strelcov, E, Kim, Y, Jesse, S, Cao, Y, Ivanov, IN, Kravchenko, II, Wang, CH,Teng, YC, Chen, LQ, Chu, YH, Kalinin, SV: Probing local ionic dynamics infunctional oxides at the nanoscale. Nano Lett 13(8), 3455–3462 (2013).doi:10.1021/nl400780d

75. Kim, Y, Strelcov, E, Hwang, IR, Choi, T, Park, BH, Jesse, S, Kalinin S.:Correlative multimodal probing of ionically-mediated electromechanicalphenomena in simple oxides. Sci. Rep. 3, 2924-2921-2927 (2013).doi:10.1038/srep02924

76. Hsieh, YH, Liou, JM, Huang, BC, Liang, CW, He, Q, Zhan, Q, Chiu, YP, Chen,YC, Chu, YH: Local conduction at the BiFeO3-CoFe2O4 tubular oxide interface.Adv Mater 24(33), 4564–4568 (2012). doi:10.1002/adma.201201929

77. Vasudevan, RK, Belianinov, A, Gianfrancesco, AG, Baddorf, AP, Tselev, A,Kalinin, SV, Jesse, S: Big data in reciprocal space: sliding fast Fouriertransforms for determining periodicity. Appl Phys Lett 106(9), 091601 (2015).doi:dx.doi.org/10.1063/1.4914016

78. DeSanto, P, Buttrey, DJ, Grasselli, RK, Lugmair, CG, Volpe, AF, Toby, BH, Vogt, T:Structural characterization of the orthorhombic phase M1 in MoVNbTeOpropane ammoxidation catalyst. Top Catal 23(1–4), 23–38 (2003). doi:10.1023/A:1024812101856

79. Grasselli, RK, Buttrey, DJ, Burrington, JD, Andersson, A, Holmberg, J, Ueda, W,Kubo, J, Lugmair, CG, Volpe, AF: Active centers, catalytic behavior, symbiosisand redox properties of MoV(Nb, Ta)TeO ammoxidation catalysts. Top Catal38(1–3), 7–16 (2006). doi:10.1007/s11244-006-0066-x

80. Shiju, NR, Guliants, VV: Recent developments in catalysis using nanostructuredmaterials. Appl Catal A Gen 356(1), 1–17 (2009). doi:10.1016/j.apcata.2008.11.034

81. Dobson, P, Joyce, B, Neave, J, Zhang, J: Current understanding andapplications of the RHEED intensity oscillation technique. J Cryst Growth81(1), 1–8 (1987). doi:10.1016/0022-0248(87)90355-1

82. Boschker, JE, Folven, E, Monsen, ÅF, Wahlström, E, Grepstad, JK, Tybell, T:Consequences of high adatom energy during pulsed laser deposition ofLa0. 7Sr0. 3MnO3. Cryst Growth Des 12(2), 562–566 (2012). doi:10.1021/cg201461a

83. Vasudevan, RK, Tselev, A, Baddorf, AP, Kalinin, SV: Big-data reflection highenergy electron diffraction analysis for understanding epitaxial film growthprocesses. ACS Nano 8(10), 10899–10908 (2014). doi:10.1021/nn504730n

84. Massies, J, Grandjean, N: Oscillation of the lattice relaxation in layer-by-layerepitaxial growth of highly strained materials. Phys Rev Lett 71(9), 1411(1993). doi:dx.doi.org/10.1103/PhysRevLett.71.1411

85. Ievlev, AV, Morozovska, AN, Eliseev, EA, Shur, VY, Kalinin, SV: Ionic field effectand memristive phenomena in single-point ferroelectric domain switching.Nat Comm 5, 4545 (2014). doi:10.1038/ncomms5545

86. Ievlev, A.V., Kalinin, S.V.: Data encoding based on the shape of the ferroelectricdomains produced by the a scanning probe microscopy tip. Nano Letters (2015)

87. Department of Energy Scientific Grand Challenges Workshop Series:Architectures and Technology for Extreme Scale Computing. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Arch_tech_grand_challenges_report.pdf (2009). Accessed 3 March 2015

88. Department of Energy Scientific Grand Challenges Workshop Series:Discovery in Basic Energy Sciences: The Role of Computing at the ExtremeScale. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Bes_exascale_report.pdf (2009). Accessed 3 March 2015

89. Chen, J, Choudhary, A, Feldman, S, Hendrickson, B, Johnson, CR, Mount, R,Sarkar, V, White, V, Williams, D: Synergistic Challenges in Data-IntensiveScience and Exascale Computing. Department of Energy Office of Science,http://sdav-scidac.org/images/publications/Che2013a/ASCAC_Data_Intensive_Computing_report_final.pdf (2013). Accessed 2 March, 2015

90. Department of Energy Scientific Grand Challenges Workshop Series: Cross-Cutting Technologies for Computing at the Exascale. http://science.energy.-gov/~/media/ascr/pdf/program-documents/docs/Crosscutting_grand_challenges.pdf (2009). Accessed 3 March 2015

Page 49: Use Case: Data and Analysis Requirements in Scanning Probe and ...

Belianinov et al. Advanced Structural and Chemical Imaging (2015) 1:6 Page 24 of 25

91. Dongarra, J, Beckman, P, Moore, T, Aerts, P, Aloisio, G, Andre, JC, Barkai, D,Berthou, JY, Boku, T, Braunschweig, B, Cappello, F, Chapman, B, Chi, X,Choudhary, A, Dosanjh, S, Dunning, T, Fiore, S, Geist, A, Gropp, B, Harrison, R,Hereld, M, Heroux, M, Hoisie, A, Hotta, K, Jin, Z, Ishikawa, Y, Johnson, F, Kale, S,Kenway, R, Keyes, D, et al.: The international exascale software projectroadmap. Int J High Perform Comput Appl 25(1), 3–60 (2011). doi:10.1177/1094342010391989

92. Department of Energy Scientific Grand Challenges Workshop Series: ExascaleWorkshop Panel Meeting Report http://extremecomputing.labworks.org/crosscut/index.stm (2010). Accessed 3 March 2015

93. Oak Ridge National Laboratory: Accelerating Data Acquisition, Reductionand Analysis. http://www.csm.ornl.gov/newsite/adara.html (2015).Accessed 3 March 2015

94. Donoho, DL: High-Dimensional Data Analysis: The Curses and Blessings ofDimensionality. Aide-Memoire of a Lecture at AMS Conference on MathChallenges of the 21st Century. (2000)

95. Chu, C-T, Kim, SK, Lin, Y-A, Yu, YY, Bradski, G, Ng, AY, Olukotun, K.: Map-Reducefor machine learning on multicore. In: Advances in Neural InformationProcessing Systems (NIPS). (2006)

96. Parsons, L, Haque, E, Liu, H: Supspace clustering for high dimensional data:a review. SIGKDD Explor Newsl 6, 90–105 (2004)

97. Weiss, Y, Fergus, R, Torralba, A: Multidimensional spectral hashing. In:European Conference on Computer Vision. Florence, Italy (2012)

98. Weiss, Y, Torralba, A, Fergus, R: Spectral hashing. In: Advances in NeuralInformation Processing Systems (NIPS). Vancouver, Canada (2008)

99. Belkin, M, Niyogi, P: Laplacian eigenmaps for dimensionality reduction anddata representation. Neural Comput 15(6), 1373–1396 (2003). doi:10.1162/089976603321780317

100. Gerber, S., Tasdizen, T., Whitaker, R.: Robust non-linear dimensionalityreduction using successive 1-dimensional Laplacian eigenmaps. In:Proceedings of the 24th International Conference on Machine Learning(ICML), Corvallis, OR 2007, pp. 281–288

101. Roweis, ST, Saul, LK: Nonlinear dimensionality reduction by locally linearembedding. Science 290(5500), 2323–2326 (2000)

102. Tenenbaum, JB, de Silva, V, Langford, JC: A global geometric framework fornonlinear dimensionality reduction. Science 290, 2319–2323 (2000).doi:10.1126/science.290.5500.2319

103. Lin, T, Zha, H: Riemannian manifold learning. IEEE Trans Pattern Anal MachIntell 30(5), 796–809 (2008). doi:10.1109/TPAMI.2007.70735

104. Kelner, J.A., Orecchia, L., Sidford, A., Zhu, Z.A.: A simple, combinatorialalgorithm for solving SDD systems in nearly-linear time. Paper presented atthe Proceedings of the Forty-Fifth Annual ACM Symposium on Theory ofComputing, Palo Alto, California, USA,

105. Vatsavai, RR, Symons, CT, Chandola, V, Jun, G: GX-Means: a model-baseddivide and merge algorithm for geospatial image clustering. InternationalConference on Computational Science Singapore, In (2011)

106. Bengio, Y: Learning deep architectures for AI Found. Trends Mach Learn2(1), 1–127 (2009)

107. Coates, A., Huval, B., Wang, T., Wu, D.J., Ng, A.Y., Catanzaro, B.: Deep learningwith COTS HPC systems. In: 30th International Conference on MachineLearning, Atlanta, Georgia, USA 2013

108. Thomas, JJ, Cook, KA: A visual analytics agenda. Computer Graphics andApplications, IEEE 26(1), 10–13 (2006)

109. Keim, DA, Mansmann, F, Schneidewind, J, Thomas, J, Ziegler, H: Visualanalytics: scope and challenges. Springer Berlin Heidelberg, Berlin (2008)

110. Tukey, JW: Exploratory Data Analysis. 1977. Addison-Wesley,Massachusetts (1976)

111. Roberts, JC: Exploratory visualization with multiple linked views. In: Dykes, J,MacEachren, AM, Kraak, M-J (eds.) Exploring Geovisualization. Elsevier, SanDiego, CA (2005)

112. Arnheim, R: Art and visual perception: a psychology of the creative eye.Univ of California Press, Los Angeles, CA (1954)

113. Steed, CA, Ricciuto, DM, Shipman, G, Smith, B, Thornton, PE, Wang, D, Shi, X,Williams, DN: Big data visual analytics for exploratory earth system simulationanalysis. Comput Geosci 61, 71–82 (2013). doi:10.1016/j.cageo.2013.07.025

114. Inselberg, A: The plane with parallel coordinates. Vis Comput 1(2), 69–91(1985). doi:10.1007/BF01898350

115. Inselberg, A: Parallel coordinates. Springer, New York, NY (2009)116. Hauser, H, Ledermann, F, Doleisch, H: Angular brushing of extended parallel

coordinates. In: IEEE Symposium on Information Visualization. INFOVIS 20022002, pp. 127–130. (2002). IEEE

117. Heinrich, J, Weiskopf, D: Eurographics 2013-State of the Art Reports, pp.95–116. The Eurographics Association, Goslar (2012)

118. Willett, W, Heer, J, Agrawala, M: Scented widgets: improving navigation cues withembedded visualizations. IEEE Trans Vis Comput Graph 13(6), 1129–1136 (2007)

119. Pirolli, P, Card, S: Information foraging. Psychol Rev 106(4), 643 (1999)120. Steed, CA, Swan, J, Jankun-Kelly, T, Fitzpatrick, PJ: Guided analysis of

hurricane trends using statistical processes integrated with interactiveparallel coordinates. In: IEEE Symposium on Visual Analytics Science andTechnology. VAST 2009 2009, pp. 19–26. (2009). IEEE

121. Yankovich, AB, Berkels, B, Dahmen, W, Binev, P, Sanchez, SI, Bradley, SA, Li, A,Szlufarska, I, Voyles, PM: Picometre-precision analysis of scanning transmissionelectron microscopy images of platinum nanocatalysts. Nat. Comm. 5 (2014).doi:10.1038/ncomms5155

122. Spiegelhalter, D: The future lies in uncertainty. Science 345(6194), 264–265(2014). doi:10.1126/science.1251122

123. Efron, B: Bayes’ theorem in the 21st century. Science 340(6137), 1177–1178(2013). doi:10.1126/science.1236536

124. Baldi, P, Sadowski, P, Whiteson, D: Searching for exotic particles in high-energyphysics with deep learning. Nat. Comm. 5 (2014). doi:10.1038/ncomms5308

125. Brouwer, WJ, Kubicki, JD, Sofo, JO, Giles, CL: An investigation of machinelearning methods applied to structure prediction in condensed matter.arXiv preprint arXiv, pp. 1405–3564. (2014)

126. Schmidt, M, Lipson, H: Distilling free-form natural laws from experimentaldata. Science 324(5923), 81–85 (2009). doi:10.1126/science.1165893

127. Jesse, S, Mirman, B, Kalinin, SV: Resonance enhancement in piezoresponseforce microscopy: Mapping electromechanical activity, contact stiffness, andQ factor. Appl. Phys. Lett. 89(2) (2006). doi:10.1063/1.2221496

128. Jesse, S, Kalinin, SV, Proksch, R, Baddorf, AP, Rodriguez, BJ: Energy dissipationmeasurements on the nanoscale: band excitation method in scanning probemicroscopy. Nanotechnology 18, 435503 (2007). doi:10.1088/0957-4484/18/47/475504

129. Nikiforov, MP, Thompson, GL, Reukov, VV, Jesse, S, Guo, S, Rodriguez, BJ,Seal, K, Vertegel, AA, Kalinin, SV: Double-layer mediated electromechanicalresponse of amyloid fibrils in liquid environment. ACS Nano 4(2), 689–698(2010). doi:10.1021/nn901127k

130. Jesse, S, Baddorf, AP, Kalinin, SV: Switching spectroscopy piezoresponseforce microscopy of ferroelectric materials. Appl Phys Lett 88(6), 062908(2006). doi:10.1063/1.2172216

131. Jesse, S, Lee, HN, Kalinin, SV: Quantitative mapping of switching behavior inpiezoresponse force microscopy. Rev Sci Instrum 77(7), 073702 (2006).doi:10.1063/1.2214699

132. Rodriguez, BJ, Jesse, S, Alexe, M, Kalinin, SV: Spatially resolved mapping ofpolarization switching behavior in nanoscale ferroelectrics. Adv Mater 20,109 (2008). doi:10.1002/adma.200700473

133. Jesse, S, Rodriguez, BJ, Choudhury, S, Baddorf, AP, Vrejoiu, I, Hesse, D, Alexe, M,Eliseev, EA, Morozovska, AN, Zhang, J, Chen, LQ, Kalinin, SV: Direct imaging ofthe spatial and energy distribution of nucleation centres in ferroelectricmaterials. Nat Mater 7(3), 209–215 (2008). doi:10.1038/nmat2114

134. Tan, Z, Roytburd, AL, Levin, I, Seal, K, Rodriguez, BJ, Jesse, S, Kalinin, SV,Baddorf, AP: Piezoelectric response of nanoscale PbTiO3 in compositePbTiO3−CoFe2O4 epitaxial films. Appl Phys Lett 93, 074101 (2008).doi:dx.doi.org/10.1063/1.2969038

135. Rodriguez, BJ, Choudhury, S, Chu, YH, Bhattacharyya, A, Jesse, S, Seal, K,Baddorf, AP, Ramesh, R, Chen, LQ, Kalinin, SV: Unraveling deterministicmesoscopic polarization switching mechanisms: spatially resolved studies ofa tilt grain boundary in bismuth ferrite. Adv Funct Mater 19(13), 2053–2063(2009). doi:10.1002/adfm.200900100

136. Seal, K, Jesse, S, Nikiforov, MP, Kalinin, SV, Fujii, I, Bintachitt, P, Trolier-McKinstry, S: Spatially resolved spectroscopic mapping of polarizationreversal in polycrystalline ferroelectric films: crossing the resolution barrier.Phys Rev Lett 103(5), 057601 (2009). doi:10.1103/PhysRevLett.103.057601

137. Wicks, S, Seal, K, Jesse, S, Anbusathaiah, V, Leach, S, Garcia, RE, Kalinin, SV,Nagarajan, V: Collective dynamics in nanostructured polycrystallineferroelectric thin films using local time-resolved measurements and switchingspectroscopy. Acta Mater 58(1), 67–75 (2010). doi:10.1016/j.actamat.2009.08.057

138. Rodriguez, BJ, Jesse, S, Bokov, AA, Ye, ZG, Kalinin, SV: Mapping bias-inducedphase stability and random fields in relaxor ferroelectrics. Appl Phys Lett 95,9 (2009). doi:10.1063/1.3222868

139. Rodriguez, BJ, Jesse, S, Morozovska, AN, Svechnikov, SV, Kiselev, DA, Kholkin, AL,Bokov, AA, Ye, ZG, Kalinin, SV: Real space mapping of polarization dynamicsand hysteresis loop formation in relaxor-ferroelectric PbMg1/3Nb2/3O3-PbTiO3

solid solutions. J Appl Phys 108(4), 042006 (2010). doi:10.1063/1.3474961

Page 50: Use Case: Data and Analysis Requirements in Scanning Probe and ...

Belianinov et al. Advanced Structural and Chemical Imaging (2015) 1:6 Page 25 of 25

140. Rodriguez, BJ, Jesse, S, Kim, J, Ducharme, S, Kalinin, SV: Local probingof relaxation time distributions in ferroelectric polymer nanomesas:time-resolved piezoresponse force spectroscopy and spectroscopic imaging.Appl Phys Lett 92(23), 232903 (2008). doi:10.1063/1.2942390

141. Kalinin, SV, Rodriguez, BJ, Jesse, S, Morozovska, AN, Bokov, AA, Ye, ZG:Spatial distribution of relaxation behavior on the surface of a ferroelectricrelaxor in the ergodic phase. Appl Phys Lett 95(14), 142902 (2009).doi:dx.doi.org/10.1063/1.3242011

142. Bintachitt, P, Jesse, S, Damjanovic, D, Han, Y, Reaney, IM, Trolier-McKinstry, S,Kalinin, SV: Collective dynamics underpins Rayleigh behavior in disorderedpolycrystalline ferroelectrics. Proc Natl Acad Sci U S A 107(16), 7219–7224(2010). doi:10.1073/pnas.0913172107

143. Griggio, F, Jesse, S, Kumar, A, Marincel, DM, Tinberg, DS, Kalinin, SV,Trolier-McKinstry, S: Mapping piezoelectric nonlinearity in the Rayleighregime using band excitation piezoresponse force microscopy. Appl PhysLett 98(21), 212901 (2011). doi:10.1063/1.3593138

144. Jesse, S, Maksymovych, P, Kalinin, SV: Rapid multidimensional dataacquisition in scanning probe microscopy applied to local polarizationdynamics and voltage dependent contact mechanics. Appl Phys Lett 93(11),112903 (2008). doi:10.1063/1.2980031

145. Maksymovych, P, Balke, N, Jesse, S, Huijben, M, Ramesh, R, Baddorf, AP,Kalinin, SV: Defect-induced asymmetry of local hysteresis loops on BiFeO3

surfaces. J Mater Sci 44(19), 5095–5101 (2009). doi:10.1007/s10853-009-3697-z

146. Anbusathaiah, V, Jesse, S, Arredondo, MA, Kartawidjaja, FC, Ovchinnikov, OS,Wang, J, Kalinin, SV, Nagarajan, V: Ferroelastic domain wall dynamics inferroelectric bilayers. Acta Mater 58(16), 5316–5325 (2010). doi:10.1016/j.actamat.2010.06.004

147. McLachlan, MA, McComb, DW, Ryan, MP, Morozovska, AN, Eliseev, EA,Payzant, EA, Jesse, S, Seal, K, Baddorf, AP, Kalinin, SV: Probing local andglobal ferroelectric phase stability and polarization switching in orderedmacroporous PZT. Adv Funct Mater 21(5), 941–947 (2011). doi:10.1002/adfm.201002038

148. Kim, Y, Kumar, A, Tselev, A, Kravchenko, II, Han, H, Vrejoiu, I, Lee, W, Hesse, D,Alexe, M., Kalinin, SV: Non-linear phenomena in multiferroic nanocapacitors:Joule heating and electromechanical effects. ACS Nano. 5(11), 9104–9112.doi:10.1021/nn203342v

149. Nikiforov, MP, Gam, S, Jesse, S, Composto, RJ, Kalinin, SV: Morphologymapping of phase-separated polymer films using nanothermal analysis.Macromolecules 43(16), 6724–6730 (2010). doi:10.1021/ma1011254

150. Nikiforov, MP, Hohlbauch, S, King, WP, Voitchovsky, K, Contera, SA, Jesse, S,Kalinin, SV, Proksch, R: Temperature-dependent phase transitions in zeptolitervolumes of a complex biological membrane. Nanotechnology. 22(5) (2011).doi:10.1088/0957-4484/22/5/055709

151. Balke, N, Jesse, S, Kim, Y, Adamczyk, L, Tselev, A, Ivanov, IN, Dudney, NJ,Kalinin, SV: Real space mapping of Li-ion transport in amorphous Si anodeswith nanometer resolution. Nano Lett 10(9), 3420–3425 (2010). doi:10.1021/nl101439x

152. Guo, S, Jesse, S, Kalnaus, S, Balke, N, Daniel, C, Kalinin, SV: Direct mapping ofion diffusion times on LiCoO2 surfaces with nanometer resolution. JElectrochem Soc 158(8), A982–A990 (2011). doi:10.1149/1.3604759

153. Ovchinnikov, O, Jesse, S, Guo, S, Seal, K, Bintachitt, P, Fujii, I, Trolier-McKinstry, S,Kalinin, SV: Local measurements of Preisach density in polycrystallineferroelectric capacitors using piezoresponse force spectroscopy. Appl Phys Lett96(11), 112906 (2010). doi:dx.doi.org/10.1063/1.3360220

154. Guo, S, Ovchinnikov, OS, Curtis, ME, Johnson, MB, Jesse, S, Kalinin, SV:Spatially resolved probing of Preisach density in polycrystalline ferroelectricthin films. J Appl Phys 108(8), 084103–084110 (2010). doi: dx.doi.org/10.1063/1.3493738

155. Balke, N, Jesse, S, Kim, Y, Adamczyk, L, Ivanov, IN, Dudney, NJ, Kalinin, SV:Decoupling electrochemical reaction and diffusion processes in ionically-conductive solids on the nanometer scale. ACS Nano 4(12), 7349–7357(2010). doi:10.1021/nn101502x

156. Vasudevan, R, Liu, Y, Li, J, Liang, WI, Kumar, A, Jesse, S, Chen, YC, Chu, YH,Valanoor, N, Kalinin, SV: Nanoscale-control of phase-variants in strain-engineeredBiFeO3. Nano Lett 11(8), 3346–3354 (2011). doi:10.1021/nl201719w

157. Arruda, TM, Kumar, A, Kalinin, SV, Jesse, S: Mapping irreversible electrochemicalprocesses on the nanoscale: ionic phenomena in Li ion conductive glassceramics. Nano Lett 11(10), 4161–4167 (2011). doi:10.1021/nl202039v

158. Kumar, A, Ovchinnikov, OS, Funakubo, H, Jesse, S, Kalinin, SV: Real-spacemapping of dynamic phenomena during hysteresis loop measurements:dynamic switching spectroscopy piezoresponse force microscopy.Appl Phys Lett 98(20), 202903 (2011). doi: dx.doi.org/10.1063/1.3590919

159. Kumar, A, Ciucci, F, Morozovska, AN, Kalinin, SV, Jesse, S: Measuring oxygenreduction/evolution reactions on the nanoscale. Nat Chem 3(9), 707–713(2011). doi:10.1038/nchem.1112

Hasten high resolution

Build precision microscopes to map atoms, say Stephen J. Pennycook and Sergei V. Kalinin.

27 November 2014 | Vol 515 | Nature | 487. © 2014 Macmillan Publishers Limited. All rights reserved.

Figure: Sulphur atoms (yellow) ‘dance’ on a copper-layered catalyst under a scanning tunnelling microscope. (Credit: Brookhaven Natl Lab./SPL)

The best electron and scanning probe microscopes today can resolve individual atoms and chemical bonds [1–4]. Views of materials such as graphene, catalysts and oxides on these scales, around 0.5 ångströms, reveal structures and the impacts of crystal defects on their properties.

To truly understand materials’ chemical and physical properties, atomic arrangements need to be mapped with much greater precision. Resolutions of 0.1 Å (the goal set by physicist Richard Feynman in his 1959 American Physical Society lecture, ‘There’s Plenty of Room at the Bottom’ [5]) would take us to the physical limit of microscope vision, set by the thermal vibrations of atoms. Small structural distortions that determine magnetism, valence (the number of chemical bonds an atom can form) and spin state would become apparent.

Currently, limits inherent to electron-microscope optics restrict us to seeing atoms or columns of atoms in two dimensions. Lens imperfections, electronic instabilities, thermal noise and environmental factors also blur views. Some scientists argue that microscopes will never clear these hurdles [6,7]. Others feel that because there are no materials in which atoms are spaced closer than 0.5 Å, greater resolution is not worth chasing.

We disagree. Keener microscopes are needed urgently to solve major world problems: solar, battery and fuel cells, computer memory chips and solid-state lighting all need to be more efficient. Three-dimensional (3D) maps of atoms would reveal how their interactions enable or limit functionality and, importantly, how materials can be improved.

Within a few years, at relatively low cost, researchers could improve microscope resolutions to 0.2 Å by honing aberration-corrector designs that are already available [8]. Stability could be improved through judicious tweaks to microscopes, materials, optics and electronics, and by reducing ambient noise. The main barrier is commercial: companies that build microscopes do not invest in specialist technologies if demand, and thus financial return, is expected to be low.

Three things are needed to accelerate microscope technology development: partnerships between academia and industry; government seed funds; and centres of excellence to develop computing power, data storage and analytical techniques.

CLEAR VISION

Exceeding atomic resolution is crucial for understanding important classes of materials such as superconductors, magnets and catalysts. It is often the small deviations from symmetry in atom positions that allow materials to store charge, information or energy, for example in ferroelectric oxides used in computer memory chips or electrocatalytic oxides used in solid fuel cells. The complex atomic arrangements of nanophase metals, ceramics, alloys, solar cells, batteries and different types of glass have yet to be probed.

Interfaces between different materials, such as magnet–superconductor or oxide–oxide junctions, might exhibit properties such as electrical conductivity, chemical reactivity, superconductivity and ferromagnetism that are not found in their separate constituents. More-exact measurements of bond lengths and angles, ideally in three dimensions, are needed to hone materials for use in next-generation energy and information-technology devices.

Aberrations are inherent in electron lenses, which use magnetic fields and which, unlike glass lenses, cannot be shaped to arbitrary curvature. As in a camera, opening the aperture reduces the depth of field and so improves depth resolution. At today’s practical resolution limit of 0.5 Å, the limited apertures available restrict depth resolution to the nanometre scale, which is too coarse to discern individual atoms.

Lateral resolution at Feynman’s level would allow us to distinguish atoms vertically. No longer would specimens need to be aligned to their internal crystal planes. One image would reveal a cross-section. A series of images taken using different foci would build up a 3D scan. Seeing atomic positions in three dimensions could distinguish between competing theories for a material’s behaviour.
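The link between lateral resolution, aperture size and depth resolution can be made concrete with a rough estimate. The sketch below is not taken from the article: it assumes a 300 kV accelerating voltage, the textbook Rayleigh-type limit d ≈ 0.61 λ/α for lateral resolution, and a depth of field Δz ≈ 2λ/α², whose prefactor varies between roughly 1 and 2 depending on the criterion used. All function names are illustrative.

```python
import math

# CODATA constants in SI units
H = 6.62607015e-34       # Planck constant, J s
M0 = 9.1093837015e-31    # electron rest mass, kg
Q = 1.602176634e-19      # elementary charge, C
C = 299792458.0          # speed of light, m/s


def electron_wavelength(voltage_v):
    """Relativistic de Broglie wavelength (m) after acceleration through voltage_v volts."""
    energy = Q * voltage_v
    return H / math.sqrt(2.0 * M0 * energy * (1.0 + energy / (2.0 * M0 * C**2)))


def semi_angle_for_resolution(d_m, wavelength_m):
    """Aperture semi-angle (rad) needed for lateral resolution d, assuming d = 0.61*lambda/alpha."""
    return 0.61 * wavelength_m / d_m


def depth_resolution(wavelength_m, alpha_rad, prefactor=2.0):
    """Depth of field (m), assuming dz = prefactor*lambda/alpha^2 (prefactor is an assumption)."""
    return prefactor * wavelength_m / alpha_rad**2


VOLTAGE_V = 300e3  # assumed accelerating voltage; the article does not specify one
lam = electron_wavelength(VOLTAGE_V)

for d_angstrom in (0.5, 0.1):
    alpha = semi_angle_for_resolution(d_angstrom * 1e-10, lam)
    dz = depth_resolution(lam, alpha)
    print(f"d = {d_angstrom} A: alpha = {alpha * 1e3:.0f} mrad, depth resolution = {dz * 1e9:.2f} nm")
```

Under these assumptions, 0.5 Å lateral resolution corresponds to a semi-angle of about 24 mrad and a depth resolution of several nanometres, whereas 0.1 Å corresponds to about 120 mrad and a depth resolution of a few tenths of a nanometre, comparable to interatomic spacings. Because Δz scales as d², a fivefold gain in lateral resolution gives a 25-fold gain in depth resolution. The small-angle formulas become approximate at the larger aperture.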

Such scans, like today’s imagery, would have to be limited so that the sample is not destroyed by the electron beam. And biologists will have to take images using shorter, lower-current scans than can be used for materials. Researchers will need to learn new tricks, but the pay-off would be huge. As resolution improves, noise reduces, and we will be able to scan faster and monitor how things change with time.

Other types of microscopy would also be improved with higher resolution. Electron energy loss spectroscopy (EELS) would benefit from 3D capability to reveal elements, chemical valence and energy band levels at the same time as atomic structure.

In scanning probe microscopy (SPM), an image forms as a result of interaction between a sharp probe tip and the sample surface. Measuring the current between the two, as in scanning tunnelling microscopy, or the minute forces, as in atomic force microscopy, traces the structure of the surface. Although the maximum resolution achievable is limited by the fundamental physics of tip–surface interactions, new low-noise systems will allow mapping of displacements of surface atoms and changes in bond lengths to less than 10 picometres.

By probing electronic, phonon (vibrations in a lattice) and spin responses with SPM, we will better understand the factors that control the ferroelectric, magnetic and superconductive functions of materials. By tuning the force and current applied through the probe tip, atoms and molecules could be manipulated and their chemical and electrochemical responses explored away from equilibrium.

Improved energy resolution would enable physicists to map energy band gaps and phonons in materials for solid-state lighting, thermoelectrics and solar cells using EELS [9]. This resolution could be achieved using advanced electron optics such as monochromators that narrow the energy spread of the electron beam from today’s 300 millielectronvolts to 10 millielectronvolts or less. SPM provides information on local superconductivity, energy band gap and molecular vibrations or phonon structure.
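Vibrational spectra are commonly quoted in wavenumbers, so it can help to express the two energy spreads mentioned above in those units. The following is a minimal sketch of the conversion ν̃ = E/(hc); the helper name and the choice of output units are illustrative and not part of the original text.

```python
# CODATA constants in SI units
H = 6.62607015e-34   # Planck constant, J s
Q = 1.602176634e-19  # elementary charge, C (1 eV = Q joules)
C = 299792458.0      # speed of light, m/s


def mev_to_wavenumber(energy_mev):
    """Convert an energy in millielectronvolts to a wavenumber in cm^-1 via E = h*c*nu."""
    energy_j = energy_mev * 1e-3 * Q
    per_metre = energy_j / (H * C)
    return per_metre / 100.0  # m^-1 to cm^-1


for spread_mev in (300.0, 10.0):
    print(f"{spread_mev:.0f} meV corresponds to {mev_to_wavenumber(spread_mev):.0f} cm^-1")
```

The conversion gives about 2,400 cm⁻¹ for a 300 meV spread and about 81 cm⁻¹ for 10 meV, which indicates the scale of the improvement the monochromators described above would bring.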

For electron and scanning-probe microscopy, additional signals, such as emitted light or electronic current, could be collected simultaneously. From this, one could test whether particular lattice defects kill or enhance the effectiveness of solid-state lighting or solar cells, how a molecule interacts with the substrate, or how local polarization gradients affect oxidation states and magnetic properties in ferroelectrics and polar materials.

PUSHING AHEAD

To achieve Feynman’s goal, microscope optics and electronic and mechanical stability must be improved. We need new designs for correctors with larger apertures.

A main problem is a lack of awareness among scientists of how much can be gained from even higher-resolution microscopy. With aberration correctors selling well (hundreds each year), there is little incentive for manufacturers to develop new capabilities. The low-noise scanning probe microscopes used for cutting-edge studies are largely lab-built.

Significant further investment will be necessary to deliver a factor-of-two improvement in spatial resolution in electron microscopes within five years, just as multimillion-dollar government-funded projects in the United States and Japan have led to the previous factor-of-two resolution increase in the past decade. As was the case with today’s aberration-corrected machines, the new state-of-the-art microscopes would soon become available, at a probable cost of between US$5 million and $10 million each.

One of the pathways to achieving this goal is through community-wide workshops to construct a road map for instrumental developments and identify scientific opportunities. The crucial transition from taking images to acquiring detailed information on atomic positions, bond lengths and local functionalities will require new methodologies. Large, multi-dimensional data sets will pose challenges for data collection, storage and analysis. New approaches will be needed for extracting relevant knowledge and linking it to theory.

The scientific community should set up centres to host and coordinate the high-power computing services needed to support high-resolution microscopy. Shared online environments will foster collective interpretation. By pooling data, fewer experiments would have to be repeated, providing the experimental counterpart to other programmes sharing analytical tools, such as the $100-million US Materials Genome Initiative.

To paraphrase Feynman: there’s still plenty to see at the bottom. ■

Stephen J. Pennycook is research professor in the department of materials science and engineering, University of Tennessee, Knoxville, Tennessee, USA. Sergei V. Kalinin is a director of the Institute for Functional Imaging of Materials, and theme leader at the Center for Nanophase Materials Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA. e-mail: [email protected]

1. Erni, R., Rossell, M. D., Kisielowski, C. & Dahmen, U. Phys. Rev. Lett. 102, 096101 (2009).

2. Sawada, H. et al. J. Electron. Microsc. 58, 357–361 (2009).

3. Krivanek, O. L. et al. Nature 464, 571–574 (2010).

4. Zhou, W. et al. Phys. Rev. Lett. 109, 206803 (2012).

5. Feynman, R. P. J. Microelectromech. Sys. 1, 60–66 (1992).

6. Reich, E. S. Nature 499, 135–136 (2013).

7. Uhlemann, S., Müller, H., Hartel, P., Zach, J. & Haider, M. Phys. Rev. Lett. 111, 046101 (2013).

8. Sasaki, T. et al. J. Electron. Microsc. 59, S7–S13 (2010).

9. Krivanek, O. L. et al. Nature 514, 209–212 (2014).


Figure: A scanning transmission image reveals a silicon atom in a layer of graphene. (Credit: Wu Zhou/Oak Ridge Natl Lab.)
