Improving Semantics in Agriculture Workshop Pre‐workshop ...

Jan 03, 2017

Improving Semantics in Agriculture Workshop

Pre‐workshop questions ‐ all answers received

- Bayer
- CABI
- CGIAR + Bioversity
- Italian National Centre for Soil Mapping (CNCP)
- International Maize and Wheat Improvement Center (CIMMYT)
- Embrapa Agricultural Informatics
- French National Institute for Agricultural Research (INRA)
- International Food Policy Research Institute (IFPRI)
- Indian Statistical Institute (ISI)
- Integrated Modelling Collaboratory (IMC)
- Kenya Agricultural and Livestock Research Organisation (KALRO)
- Association for Technology and Structures in Agriculture (KTBL)
- Syngenta
- Agricultural Sustainability Institute (ASI) UC DAVIS
- USDA Agricultural Research Service (ARS)
- USDA National Agricultural Library (NAL)
- Wageningen UR Library (Wageningen UR Lib)
- Wageningen UR Alterra (Alterra)
- U Aston
- Cornell

Bayer

A) About

The plant biotechnology Innovation Center of Bayer in Ghent is one of the most important research and development centers of the Seeds division of Bayer CropScience. The Ghent site now has more than 400 employees. Bayer CropScience is a leading international player in the research, development and marketing of agricultural and vegetable seeds with added‐value traits. Innovation and product development in Ghent focus on agricultural crops such as oilseed rape, cotton, rice and wheat. The headquarters of Bayer CropScience are in Monheim, Germany. The company has more than 23,100 employees in more than 120 countries. Erick Antezana belongs to the Computational Life Sciences (CLS) division. Within CLS, he is responsible for the Master Data for R&D (controlled vocabularies, taxonomies, ontologies) as well as the platform that stores those artifacts.

B) Datasets maintained

We exploit publicly available resources (e.g. UniProt) in our research activities. Moreover, we have developed proprietary datasets for the various R&D programs targeting specific areas (e.g. yield increase). Some datasets are shared in specific situations (e.g. collaboration with academia). We maintain an internal database (gathering public and private datasets) using semantic web technologies. Those resources conform to semantic web standards and technologies (e.g. RDF, SPARQL). However, the RDF models are custom developments tailored to our needs.

C) Vocabularies maintained

We mainly rely on public resources (e.g. Trait Ontology, Sequence Ontology) and therefore do not publish them as separate resources. We do, however, interact with some of the consortia behind those resources so that they can be enriched with our contributions. We maintain several internal vocabularies, all of them in English. Some of those vocabularies are strongly linked to external ones (e.g. taxonomy of species); others are verbatim copies of external vocabularies (e.g. Gene Ontology).

D) Uses of datasets and vocabularies

The applications consuming our datasets and vocabularies range from pure research tools (e.g. functional characterization of genes and proteins) to tools used in the development phase (e.g. plant breeding activities). Users come from diverse backgrounds and have diverse skills: data scientists and wet‐lab scientists. We have no smartphone applications.

E) Vocabulary maintenance

Maintenance is performed using various tools: custom pipelines, OBO‐Edit, Protégé, Excel, …

F) Interoperability and future visions

The agricultural data management arena lacks a coordinated approach to delivering relevant, specialised vocabularies. A long‐term approach to the sustainability of publicly available vocabularies is desirable. New vocabularies resulting from new technologies are often missing from current resources. The coming decade will demand solutions that accommodate new technologies and trends in modern agriculture, such as digital farming and precision phenotyping. Standards are needed in those areas. Some of the needed vocabularies:

- Taxonomy of species + varieties + EPPO integration
- Typical units used in agricultural activities
- Processes / activities (e.g. molecular breeding)

This community should follow the F.A.I.R. approach (https://www.force11.org/node/6062). We should follow and learn from the Pistoia Alliance and other major successful initiatives (e.g. OpenPhacts, iPlant). Academia, government, industry, … should work in a pre‐competitive framework that benefits not only the participants but especially the end users. An output of this workshop should be a position paper or similar, stating the needs of this community and encouraging others to join the effort.

CABI

A) About

CABI is an inter‐governmental, not‐for‐profit organization, originally formed as the Commonwealth Agricultural Bureaux. CABI works in the academic publishing, knowledge management, international agricultural development and scientific sectors. Anton Doroszenko is the Thesaurus Manager at CABI, in charge of the management and upkeep of the CAB Thesaurus and associated authority files. Anton has worked on the GACS project during its first two phases. Phil Roberts works in the Solutions and Architecture team and is in charge of data management, innovation and the future strategy for knowledge creation.

B) Datasets maintained

On behalf of the UK government, we maintain R4D, which is fully open and available as RDF. R4D is a free‐access on‐line portal containing the latest information about research funded by DFID, including details of current and past research in over 40,000 project and document records (http://r4d.dfid.gov.uk/). CABI maintains a more detailed version of geonames and releases updates back into geonames. In addition, we have a number of products that are open access but have yet to be released as LOD. The plan is for these to be opened up further over the next 6‐12 months. There are permission issues relating to how much can be released as LOD due to the nature of the data, especially data relating to pest distribution records around the world.

C) Vocabularies maintained

Does your organization maintain its own vocabularies, whether strictly for internal use or for publication? If publicly available, please list their URLs along with brief descriptions (e.g., available languages). Are any of your vocabularies mapped or linked to external vocabularies?

The CAB Thesaurus (http://www.cabi.org/cabthesaurus/) contains more than 250,000 concepts. It is almost complete in four languages (English, Spanish, Portuguese and Dutch), with less content in seven other European languages. It covers all aspects of agriculture and applied life sciences. The thesaurus, along with many other authority files, is used for the indexing of over 11 million bibliographic records (http://www.cabdirect.org/) and tens of thousands of datasheets on crop protection, invasive species, forestry, aquaculture and other related areas. A core of the CAB Thesaurus is part of the Global Agricultural Concept Scheme (GACS) project. CABI has separately published this core set of concepts in a trial of LOD (see http://id.cabi.org/cabt/page/54380 for example). The URIs are live RDFa versions under a restrictive CC license, which will be opened up further by the end of 2015 when we have full‐scale LOD APIs in place.

D) Uses of datasets and vocabularies

R4D is used in numerous applications, including DevTracker. CABI data is not currently consumed directly by other applications; rather, it is fed to vendors and other organizations via FTP. This is changing in 2016. The majority of this is subscription‐based and not open. These models will also change during the coming 12‐18 months.

E) Vocabulary maintenance

MultiTes (http://www.multites.com/) is used for thesaurus maintenance, for web deployment, and for distributing files in multiple formats, including SKOS/RDF, to customers and online hosts for our bibliographic databases. MultiTes uses a very simple but effective system for importing vocabularies and their relationships from specially formatted text files, which requires no IT support. Anton works on the thesaurus full time and draws on additional support from 2 other people on an ad hoc basis.

F) Interoperability and future visions

Interoperability of data is important in the coming years. Currently this is a major bottleneck. Whilst automated processes to incorporate datasets and vocabularies will increasingly play a role, manual curation will remain important to maintain data quality. Vocabulary maintenance is a labor‐intensive process and requires domain expertise that cannot be shortcut. This is particularly important for international trade and the legal landscape of agriculture. Data at all geographical resolutions are required. Data needs to flow more quickly, both in terms of release of data and of political permissions/reporting. A major priority for CABI is that whatever we decide upon is sustainable. URIs must be persistent and maintained. CABI is happy to assist in making this project into a long‐term, self‐sustaining program with appropriate business models to achieve this.

CGIAR + Bioversity

CGIAR (the Consultative Group for International Agricultural Research) is a partnership addressing agricultural research for development. CGIAR contributes to the global effort to tackle poverty, hunger and major nutrition imbalances, and environmental degradation. Work is carried out by fifteen research centers, members of the CGIAR Consortium, with almost 10,000 scientists and staff in 96 countries, in collaboration with hundreds of partners, including national and regional research institutes, civil society organizations, academia, development organizations, and the private sector. CGIAR Research Programs are supported by a CGIAR Fund.

C) Vocabularies maintained

We do not maintain our 'own' institutional vocabulary, but we develop and maintain vocabularies/ontologies in the domains of plant genetic resources, breeding and agronomy with all CGIAR centers and non‐CGIAR partners: www.cropontology.org

The Crop Ontology trait terms are cross‐referenced to the Plant Ontology, the Trait Ontology and the Bioversity plant descriptors. The Crop Research Ontology will be mapped to the ICASA variables of the AgMIP project and to the Plant Environment Ontology.

- Agronomy Ontology: initiated
- Multi‐Crop Passport Data, in collaboration with FAO: mapped to the Darwin Core germplasm extension of GBIF
- The series of plant descriptors for genebanks (booklets): not mapped, as these are produced in PDF format. A first attempt to parse them for better integration in the ontology is being performed.
- Collecting mission forms
- Household surveys on biodiversity

D) Uses of datasets and vocabularies

- Multi‐Crop Passport Data: used by all ex situ genebank databases and by EURISCO and Genesys
- The descriptors are traditionally downloaded by national genebanks and botanists as a model for field forms
- The Crop Ontology is used by the Breeding Management System of the Integrated Breeding Platform for annotation of collected data and for fieldbooks; by the Next Generation Cassava Database maintained by the Boyce Thompson Institute; by the Global Agricultural Evaluation Trials Database (Agtrials.org); and by the European Solanaceae breeding database. It is also tested by the Australian Phenotyping Facility for their Phenomic ontology‐driven database.
- INRA has uploaded the Vitis ontology to the Crop Ontology site; USDA, the soybean ontology of SoyBase; POLAPGEN, the barley ontology; Triticeae‐global, the oat ontology

E) Vocabulary maintenance

The Crop Ontology web site is used for publishing; it is currently based on Google Apps, hosted on the Google cloud, and has an API. The species‐specific ontologies are developed using an Excel template called the Trait Dictionary, to facilitate interaction with breeders, and also in OBO. The site provides conversion routines from Excel to OBO.
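As an illustration of what an Excel‐to‐OBO conversion involves, here is a minimal Python sketch. It is not the Crop Ontology site's actual routine. It assumes the Trait Dictionary has been exported to CSV, and the column names (`id`, `name`, `definition`, `parent_id`) and the `EX:` identifier prefix are hypothetical; the real template defines its own columns.

```python
import csv
import io


def row_to_obo_term(row):
    """Render one spreadsheet row (a dict) as an OBO [Term] stanza."""
    lines = ["[Term]", f"id: {row['id']}", f"name: {row['name']}"]
    if row.get("definition"):
        # OBO definitions are quoted and followed by a (possibly empty) xref list.
        text = row["definition"].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'def: "{text}" []')
    if row.get("parent_id"):
        lines.append(f"is_a: {row['parent_id']}")
    return "\n".join(lines)


def csv_to_obo(csv_text):
    """Convert CSV text with a header row into OBO stanzas separated by blank lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n\n".join(row_to_obo_term(row) for row in reader) + "\n"


# Hypothetical one-row export of a trait dictionary.
SAMPLE = """id,name,definition,parent_id
EX:0000001,plant height,Height of the plant from ground to tip.,EX:0000000
"""

OBO_TEXT = csv_to_obo(SAMPLE)
print(OBO_TEXT)
```

Keeping the spreadsheet as the editing surface and generating OBO from it lets breeders work in a familiar tool while curators still obtain a machine‐readable ontology file.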

Mappings to the Trait Ontology are currently done manually by curators, but we are testing mapping tools to produce automatic mappings.

The Crop Ontology terms are embedded into the Breeding Management System with a local ontology and variable manager.

The Bioversity plant descriptors are available in PDF as publications; first trials to parse them into JSON are available on GitHub. A first small test to convert the PDFs into XML files and support their integration into the Crop Ontology was performed with Cambridge University.
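The mapping tools being tested are not named in this answer. As a rough illustration of one common approach, lexical matching of term labels, the Python sketch below uses only the standard library. The term identifiers, labels and the 0.85 threshold are all hypothetical, and the output is meant as a list of candidates for curator review rather than as accepted mappings.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lower-case a label and treat hyphens/underscores as spaces."""
    return " ".join(label.lower().replace("-", " ").replace("_", " ").split())


def suggest_mappings(local_terms, reference_terms, threshold=0.85):
    """Return (local_id, reference_id, score) for the best match above threshold.

    Both arguments are dicts mapping a term identifier to its label.
    """
    suggestions = []
    for local_id, local_label in local_terms.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_label in reference_terms.items():
            score = SequenceMatcher(
                None, normalize(local_label), normalize(ref_label)
            ).ratio()
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            suggestions.append((local_id, best_id, round(best_score, 2)))
    return suggestions


# Hypothetical identifiers and labels, for illustration only.
reference = {"R:1": "plant height", "R:2": "grain yield"}
candidates = suggest_mappings({"L:1": "Plant-Height"}, reference)
print(candidates)
```

Label similarity alone cannot tell whether two traits are measured with the same method or scale, which is one reason manual curation stays in the loop.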

F) Interoperability and future visions

- An ontology supporting near‐natural language query in agriculture
- A Google‐like search engine for agriculture
- Information services adapted to users of mobile technology (requires open data)
- Semantics must facilitate queries and discovery of agriculture‐related data supporting research on climate change, nutrition, food security, land restoration, etc. This means linking data sets from multidisciplinary research where the semantics may differ. A mediation language supporting interpretation between stakeholders is important: understand how farmers describe the varieties they prefer and translate this into traits that breeders can integrate in a breeding scheme (adoption); create the basis for crop modelers to understand what the breeders' traits represent and how they translate into their models, etc.
- Semantics should support discovery of data across species (between crops, between animals and plants, pollinators, etc.) from environmental, socio‐economic and cultural data sets.
- The Linked Open Data cloud does not contain much data related to agriculture. To enable data discovery, data should be formatted for LOD (RDF) and properly described with ontologies so that they can be queried.
- An integrated access point to all ontologies related to the agricultural domain, to facilitate annotation and the construction of controlled vocabularies for databases and fieldbooks.
- A model for an annotation tool using multiple ontologies and for field book creation
- Processes/tools for annotating images and videos
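To make the Linked Open Data point concrete, the following Python sketch shows how a single field observation could be written as RDF triples (N‐Triples syntax) that point to an ontology term. This is an assumption‐laden illustration: every `http://example.org/` URI, the property names and the sample values are invented; only the `rdfs:label` URI is a real W3C term.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/"  # placeholder namespace, not a real vocabulary


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def observation_triples(obs_id, trait_iri, trait_label, value, unit):
    """Describe one measurement and link it to the ontology term for its trait."""
    subject = iri(EX + "observation/" + obs_id)
    return [
        f"{subject} {iri(EX + 'vocab/trait')} {iri(trait_iri)} .",
        f"{subject} {iri(EX + 'vocab/value')} {literal(value)} .",
        f"{subject} {iri(EX + 'vocab/unit')} {literal(unit)} .",
        f"{iri(trait_iri)} {iri(RDFS_LABEL)} {literal(trait_label)} .",
    ]


triples = observation_triples(
    "1", EX + "trait/plant_height", "plant height", "92", "cm"
)
print("\n".join(triples))
```

Because the trait is referenced by URI rather than by a free‐text column name, two data sets that use the same ontology term can be queried together even if their local variable names differ.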

Problems and solutions

- Barriers: semantic heterogeneity of the metadata prevents interoperability; there is no homogeneous naming of the variables measured and no systematic description of the method used and of the unit or scale, which leads to low‐quality data that are not re‐usable
- Quality of the data and lacking metadata
- Adding metadata to data sets is cumbersome; a solution is automatic annotation in the pipelines and workflows producing data (e.g. high‐throughput phenotyping)
- Quality checking tools

Coming decade

- Yes, integration of semantics in geographic information systems is one area to reinforce
- An open ontology for farmers' preferences is currently lacking, as are ontologies for functional traits useful for agro‐ecology (e.g. functional traits like N‐fixation, water‐use efficiency and nutritional value; cover plants and their use: shade, fodder, etc.; species interactions, such as those with pollinators)
- The agronomy ontology for field books is just starting and needs more support

- An ontology for monitoring indicators will be crucial for assessing the impact of interventions and for decision making

Future actions:

- Identify the vocabularies and ontologies needed by the agricultural domain (existing and missing in GACS) and propose a single, easy access point for users to select the vocabularies they need
- Stimulate the production of ontologies for agronomy, agro‐ecology, experimental design, measurement methods and scales
- Engage the targeted scientific/data management community in contributing to and using the vocabularies to support open data
- Engage the community in translating the vocabularies to expand their use
- Engage with public open data repositories so that they include the agricultural domain vocabularies in the metadata used for describing data sets
- Define the short‐term and long‐term steps towards Linked Open Data (RDF conversion of the data platforms)
- Identify a suite of tools/wizards for data set annotation and for semantic‐based data submission in repositories
- Expand the group to platforms already publishing their data in RDF and to actors of the semantic web standards: W3C, Biosharing, Bio2RDF, etc.
- Develop text mining activities

Relevant projects or initiatives in related areas

- Planteome: NSF‐awarded pilot project to create a platform that will compile the main reference ontologies for plants and provide a suite of tools for ontology mapping and annotation; also a project proposed for the DivSeek initiative
- AgroPortal: prototype built on BioPortal, supported by a French grant involving INRA and Bioversity alongside LIRMM
- UNEP/WCMC: ontology development for the indicators supporting the SDGs on water and air quality, biodiversity, land use, etc.
- ISA‐Tools: open source ISA metadata tracking tools for experiments and assays, developed at Oxford University by the team of Philippe Rocca‐Serra: http://www.isa‐tools.org/. The suite contains the OntoMaton tool, which provides a semantically supported Google spreadsheet (https://github.com/ISA‐tools/OntoMaton), and ISAcreator for creating experiment templates. The group also develops ontologies:
  - Ontology for Biomedical Investigations (OBI) project: http://www.obi‐ontology.org (wet lab, but can be expanded to fields). Browsing OBI: http://www.ontobee.org/browser/index.php?o=OBI
  - STATO: describes the most commonly used statistical methods for experimental design, and graphical representation too: http://www.stato‐ontology.org
- Biosharing: https://www.biosharing.org/pages/about/
- Bio2RDF
- The ContentMine project (contentmine.org) at Cambridge University: designed to crawl, scrape, normalize and mine the scientific literature, extracting hundreds of millions of facts annually

Direction in coming decade

- Natural language queries
- User‐friendly workflow for data sharing
- Metadata and vocabularies embedded in the data production pipelines

Italian National Centre for Soil Mapping (CNCP)

A) About

The Italian National Centre for Soil Mapping (Centro Nazionale di Cartografia Pedologica ‐ CNCP) is a research group maintaining the Italian Soil Information System (SISI) and sample collection. CNCP was created in 1999 with the project "Soil Methodologies: definition of criteria and specifications for the construction, maintenance, updating and consultation of the 1:250.000 scale soil map of Italy". CNCP belongs to the Consiglio per la Ricerca e sperimentazione in Agricoltura (Agricultural Research Council ‐ CRA), a national research organization operating under the supervision of the Italian Ministry of Agriculture (MiPAAF), and is located at the Research Centre for Agrobiology and Pedology of Florence (CRA‐ABP). Currently, CNCP is funded by CRA and two international projects (EU 7th Framework Programme agINFRA, www.aginfra.eu; Azerbaijan Project ‐ AZER) and belongs to the networks of other projects (INQUA Aeomed; CostAction Desertnub; Italy‐Israel Ringo). CNCP maintains the soil database of Italy and a collection of several thousand soil samples taken all over Italy and in the Peloponnese (Greece), Israel, and Azerbaijan.

Roles: data technologist; researcher; data user; data aggregator; data producer.

B) Datasets

The CRA infrastructure is curated by the CRA Informative Systems Service, which ensures the integrated development of IT tools to support research and administration: http://sito.entecra.it/portale/public/documenti/PSI/cra‐directory.ods

Soil data managed by CRA‐ABP can be classified into three main categories: soil maps, soil profiles and soil samples.

Soil Maps
CODE: CRA‐ABP‐000
ACRONYM: Soilmaps
NAME: Banca dati delle risorse cartografiche e/o informative sui suoli italiani (database of cartographic and/or information resources on Italian soils).
DESCRIPTION: The database was created thanks to the Moncapri project, commissioned by the Ministry of Agriculture to monitor soil maps printed on the national territory, and has been expanded over the years to include the many map documents available today. http://soilmaps.entecra.it/ita/ric_av.php

Soil Profiles
CODE: CRA‐ABP‐001
ACRONYM: SISI
NAME: Soil Information System of Italy.
DESCRIPTION: A WebGIS application based on a cloud computing service, published for on‐line consultation of data on Italian soil profiles. It consists of a hierarchy of geo‐databases that includes the Soil Regions, which aim to correlate the soils of Italy with those of other European countries in terms of soil typological units valid at 1:5,000,000, and, at the national level, the Soil Systems, valid at a scale of 1:1,000,000. http://aginfra‐sg.ct.infn.it/webgis/cncp/public/ (tool to represent the geographical location of samples)

Soil Samples
CODE: CRA‐ABP‐002
ACRONYM: ARCAN
NAME: Banca dati dei campioni di suolo collezionati presso la pedoteca del CRA‐ABP (database of the soil samples collected in the CRA‐ABP soil archive).
DESCRIPTION: WebGIS service to represent the location of soil samples collected at CRA‐ABP. The catalogued samples have been described and analyzed; they allow calibration of new analytical instruments and soil monitoring, and could potentially be reused for new research topics. http://93.63.35.107:8080/geoexplorer/composer/#maps/1 (tool to represent the geographical location of samples)

According to the document "Strategia per la valorizzazione del patrimonio informativo del CRA" (November 6, 2013), the collections produced are shared under the IODL v2.0 license and the principle of Open by default: freely available to all, free of copyright or other forms of control that prevent reproduction, and free of other restrictions besides the obligation to cite the source. Open Data are published on the Web in an open format, suitable for use regardless of the tools needed for their subsequent processing.

C) Vocabularies

- http://vocabularies.aginfra.eu/soil# (tool to represent data vocabulary). English.
- Agrovoc/Vocbench: http://artemide.art.uniroma2.it/vocbench2/#Concepts; http://202.73.13.50:55481/vocbench/#Concepts (tool to represent data KOSs and thesaurus). English‐Italian.
- http://rdf.entecra.it/soilmaps/ (tool to represent data in RDF formats). English.
- http://202.45.139.84:10035/catalogs/fao/repositories/agINFRA (AllegroGraph WebView 4.11 is the reference repository for agINFRA). English‐Italian.

Soil classification systems used:

- USDA Soil Taxonomy, 10th edition (2006): www.nrcs.usda.gov/Internet/FSE_DOCUMENTS/nrcs142p2_052172.pdf. English.
- World Reference Base, 2nd edition (2006): ftp://ftp.fao.org/agl/agll/docs/wsrr103e.pdf. English.

D) Uses of datasets and vocabularies

We are looking for implementation in AGRIS – http://agris.fao.org/. Users: researchers and professional data users in the fields of agriculture, agro‐industry, food, fishery and forestry. The data are not yet used in smartphone applications.

E) Vocabulary maintenance

Name: rdf.entecra.it
IP address: 93.63.35.32
Location: CRA Informative Systems Service, Roma
Hosted tool: D2RQ

Name: vocabularies.aginfra.eu
IP address: 37.187.148.108
Location: GFAR, FAO, Roma
Hosted tool: Neologism ‐ data vocabularies

Name: artemide.art.uniroma2.it
IP address: 202.73.13.50 (http://artemide.art.uniroma2.it/vocbench2/#Concepts)
Location: Università degli Studi di Roma Tor Vergata
Hosted tool: Agrovoc/Vocbench – Thesaurus, KOS

Name: not named
IP address: 202.45.139.84
Location: GFAR, FAO, Roma
Hosted tool: AllegroGraph WebView 4.11 ‐ agINFRA reference repository

F) Interoperability and future visions

Our dream is to soon have a smartphone application that could make use of every sort of data produced within CRA (see http://sito.entecra.it/portale/public/documenti/PSI/cra‐directory.ods ). Besides that, some work has been done on RDF implementation for germplasm and soil data. Since our organization does not yet plan to develop any application based on such data, it would be nice to see other organizations or hackathons using our Open Data. Smartphones could also be used in soil surveys, because they include GPS, a camera and positional sensors (e.g. for slope), and other sensors could be connected to survey environmental data.

There is a need for an improved understanding of soil distribution, function and state to support science and policy development, to improve agricultural productivity in a sustainable manner and to address other global issues such as climate change and biodiversity decline. This understanding needs to be underpinned by quality‐assessed soil data and information that can be organized, aggregated and made accessible in a consistent, granular and consumable form. The soil interoperability experiment will refine and test SoilML2, consolidating existing soil standards by testing them (through working implementations) against an agreed set of use cases for the exchange and analysis of soil data.

We envisage passing from an agriculture based upon the concept of marginal utility to one based on the concept of global sustainability, by means of a holistic approach. Dataset needed: Soil Maps ‐ economic value of soil ecosystem services.

CNCP has been involved in the EU INSPIRE Thematic Working Group on Soil, the Research Data Alliance (RDA) Interest Group on Agriculture Data (IGAD), and the Open Geospatial Consortium (OGC) Soil Data Interoperability Experiment (Soil IE).

International Maize and Wheat Improvement Center (CIMMYT)

A) About

CIMMYT (International Maize and Wheat Improvement Center) belongs to a consortium of 15 international agricultural research centers that is part of CGIAR. It is also the lead center for two CGIAR Research Programs, MAIZE and WHEAT, and works as a partner in four other such programs. CIMMYT’s germplasm bank holds large, unique collections of native maize (28,000 accessions) and wheat (125,000 accessions) varieties and wild relatives from the respective centers of origin and other key eco‐regions for each crop, as well as some advanced materials. The seeds are conserved, studied, and shared under the terms of the International Treaty on Plant Genetic Resources for Food and Agriculture. Each year the center ships over half a million seed packets to more than 600 partner organizations in 100 countries.

Rosemary Shrestha works in the Data Management Unit (DMU) under the Genetic Resources Program (GRP). The unit mainly focuses on improving data management efficiency through the implementation of integrated platforms (Breeding Management System; electronic data capture and data curation systems like KSU FieldBook, KDSmart and KD‐Xplore), the implementation of the CIMMYT Research Data and Information policy in compliance with the CG open access data policy, and the management and publication of data through Dataverse. My major role is to ensure that the data are stored in appropriate databases/repositories, to introduce new informatics tools to scientists, to collect feedback from them, and to coordinate with the Informatics and Knowledge Management teams to create or improve tools as well as data standards and policies, ultimately bridging the gap between scientists and developers. In addition, I help maintain the maize and wheat trait ontologies. See below for more detail.

B) Datasets maintained

CIMMYT produces diverse data, which are maintained in different repositories/databases. The link http://www.cimmyt.org/en/resources provides access to the following information:

i) Repository for publications and multimedia: to access journal articles, books, annual reports, images, videos, etc.

ii) Databases and Datasets:

a. Maize Finder (Maize Trial Data and Software)

b. CIMMYT Maize Inbred Lines (CMLs)

c. International Maize Trials Network

d. International Wheat Improvement Network (IWIN)

e. CIMMYT evaluation of lines

f. GRIN Global Maize

g. GRIN Global Wheat

iii) CIMMYT Research Data Repository: Recently launched (http://data.cimmyt.org/dvn/).

All data are available only in English. The above mentioned datasets/databases are available for the public and we have a policy in place as well as licensing arrangements. However, the data are not published as Linked Data.

C) Vocabularies maintained CIMMYT contributed to developing the crop ontology in collaboration with the Crop Ontology team (Bioversity International) and the Integrated Breeding Platform (IBP). Three major ontologies (maize trait ontology, wheat trait ontology and crop research ontology) are developed and used by CIMMYT. These ontologies are published and maintained in crop ontology website. The MCPD is commonly used in the germplasm bank for passport data collection and evaluation of the germplasm along with the GRIN‐Global database. The URL to access the Crop Ontology is http://www.cropontology.org/. At present, the ontologies are available only in English but there is a high demand of translating these ontologies in to Spanish, French and Chinese. The vocabularies are mapped to Plant Ontology (PO) and Plant Trait Ontology (TO) whenever possible. D) Uses of datasets and vocabularies The Breeding management system (BMS) is using the crop trait ontology and crop research ontology. The vocabularies will be used soon in FieldBooks and handheld devices (KSU FieldBook, and KDSmart). Breeders and scientists who design crosses, nurseries, advancing lines and trials in mulita‐locations, assistants (local staffs) who use the tool/systems to create FieldBooks, record data in fields, data managers and developers who assist breeders and their assistants to use the systems/tools are the users of these applications. They will be used in the KDSmart v.2 which will be available for use on Android phones. E) Vocabulary maintenance We submit the updated versions of the ontologies to the Crop Ontology team and they publish the ontology and/or data dictionaries through the official website. Additions to the Crop Ontology traits can be made locally within the CIMMYT‐specific installation of the BMS, but not all of these additions will likely be shared with the public if they do not seem to have broad applicability. 
F) Interoperability and future visions Heterogeneous data: Agronomic data, socio‐economic data, climate data, germplasm passport data, geo‐spatial data, image data, phenotypic data, low‐high molecular maker genotypic data. It is not possible to store all these diverse data in one platform, but we should find the way to link these to another. One of the approaches is to use common vocabularies and standards across the platforms which is challenging at the semantic level as well. Harmonization of ontologies: Mapping vocabularies or terms to external vocabularies is always challenging. For example: A single trait can be measured by different methods and units. Fully functional data management platforms: The use of controlled vocabularies in the systems is possible only when they are fully functional. Otherwise, there is no control of vocabularies which complicates linking information from one platform to another. Management of Big data: IT Infrastructure (including adequate network bandwith, reliable connectivity, etc.), databases which can hold large volume of data (e.g. data generated by precision phenotyping (Image data) and genotyping by sequencing data, and adequate compute power will be required. All of these needs should receive special support in developing countries to ensure greater impact. Often scientists focus on generating data and use these mainly for their publications. Understanding how to transform these research based outputs into knowledge including sustainable intensification aspects so that

poor farmers and national partners can utilize it to improve livelihoods is a key need. A large investment should be made in standardized and interoperable data management systems, as well as in data managers and curators. How we deal with semantics should be greatly improved in all data domains, including GIS, and this should be applied in developing as well as in developed countries. Recommendations on controlled vocabulary for GIS data would be helpful. We may need to discuss the databases/repositories/platforms that are in use in our organizations so that we can figure out how these systems are linked to each other and whether they fully comply with interoperability standards. From another viewpoint, we should work on changing the culture of data and knowledge sharing within and outside of our domains. The Planteome project (http://planteome.org/node/9) and the Semantic GIS project (http://www.semanticgis.net/) are recent projects working in this domain. We should focus on the large investment needed in standardized and interoperable data management systems and, at the same time, have experienced promoters (evangelists) drive the required cultural change.
Acknowledgement: I would like to acknowledge the CIMMYT Data Management Team (Jesus Herrera, Kai Sonder, Kate Dreher and Richard Fulss) for providing their inputs.

Embrapa Agricultural Informatics

A) About
Embrapa (PT: Empresa Brasileira de Pesquisa Agropecuária; EN: Brazilian Agricultural Research Corporation) – www.embrapa.br. Legally, Embrapa is “a public corporation with private rights”, i.e., it is not just a “public” institution; Embrapa can reserve for itself “private rights” on its products, services or technologies, e.g. commercial ones. Embrapa is directly linked and subordinated to the Brazilian Ministry of Agriculture, Livestock and Food Supply. It is part of, and coordinates, the “Brazilian national system of agricultural research” (SNPA, in its Portuguese acronym) and so represents one of the major sources of agricultural information in Brazil. The SNPA consists of Brazilian public and private institutions and universities that cooperatively carry out agricultural research across several areas and fields of scientific knowledge.
Geographical distribution: 15 central Units in Brasilia, DF; 47 decentralized Units, represented by thematic, eco‐regional or product (e.g., cotton, rice and bean, beef cattle) research centers and services, distributed across all Brazilian states; 4 virtual laboratories abroad (Labex) in the USA, Europe, China and South Korea; and 3 international offices in Latin America and Africa.
People: 9,790 employees (2,444 researchers; 2,503 analysts; 1,780 technicians; 3,063 assistants).
Budget (2014): approx. € 750,000,000 / USD 850,000,000.
Describe the department or unit of the organization represented at the workshop. As representatives of your organization at the workshop, what are your roles?
Embrapa Agricultural Informatics (EAI) is one of the thematic research centers. It is located in Campinas, São Paulo State, and develops computing and information technology applications for agriculture.
EAI has nine research groups. One of them is the Laboratory of Organization and Processing for Electronic Information, where I have worked as a researcher since 2009, after more than 15 years in RD&I management positions in another Embrapa Unit (Embrapa Satellite Monitoring). On returning to my academic functions, I chose the theme of knowledge organization and representation, and I have been working on the interface “agricultural KOS + NLP (Natural Language Processing)”, aiming to explore textual corpora, terminologies and conceptual structures. As a researcher at EAI, I have the freedom to choose the subject of my technical‐scientific activities and to propose projects to carry them out. However, I must contextualize my research initiatives by both (1) aiming at improvements to the general corporate processes of information and knowledge management at Embrapa and (2) introducing KOS/terminologies in scientific projects, as a tool that facilitates collaborative work in multi‐, inter‐ and trans‐disciplinary networks and that prepares the informational content (databases, information systems, publications) produced during the projects for greater global interoperability and visibility.

B) Datasets maintained by your organization (2‐3 paragraphs)
Please describe the datasets maintained by your organization, ideally with a list of URLs with brief descriptions. Please specify the language or languages of the datasets. Do you publish any of this data on the Web as open data, and does your organization have a policy regarding open data (e.g., copyright)? Is any of the data published as Linked Data, defined here to mean data that conforms to open semantic standards?
Before describing Embrapa's datasets, I must introduce some considerations: 1. Information in Embrapa is highly fragmented but reasonably well organized in datasets, despite its incipient interoperability; 2.
None of Embrapa's scientific databases (I mean raw scientific data, not the scientific production repositories) offers widespread access; only a restricted audience, usually those close to the construction of the database, can access the data, even internally at Embrapa. This is a very critical issue from a cultural point of view, and it has a direct impact on, for example, Embrapa's adherence to the Open Data paradigm: what kind of data

must (or can) Embrapa open? This cultural aspect is currently being addressed in a wider project at Embrapa, which proposes an information governance model for the institution; 3. We can present two kinds of datasets at Embrapa that are currently available to be worked on immediately from a semantic point of view:

I. Embrapa projects database [1]

Ideare (a trade name). Ideare is the programming management system of Embrapa, built on PostgreSQL. It is available directly on the web and supports the management of the project portfolio. Access to Ideare is restricted to Embrapa employees and to partners from other institutions who participate in its projects: https://sistemas.sede.embrapa.br/ideare/.

II. Embrapa technical‐scientific production datasets
BDP@ (PT: Base de dados da Pesquisa Agropecuária; EN: Agricultural Research Database). BDP@ is the web interface (http://www.bdpa.cnptia.embrapa.br/consulta/) to Ainfo [2], the general documentary collection of Embrapa's libraries.

Three other digital repositories are related to BDP@:

Alice (PT: Acesso Livre à Informação Científica da Embrapa; EN: Open Access to Embrapa Scientific Information); http://www.alice.cnptia.embrapa.br/ (DSpace/DC). Alice is intended to gather, organize, store, preserve and disseminate scientific information produced by Embrapa researchers and published in book chapters, articles in indexed journals, articles in conference proceedings, theses and dissertations, technical notes, and more. Because it uses standardized technologies also adopted by the scientific community, it is interoperable with other open access systems and is therefore part of a global network of scientific information. So, in addition to contributing directly and automatically to increasing the impact of research results, it will also contribute to greater visibility for Embrapa and its researchers.

Infoteca‐e (PT: Informação Tecnológica em Agricultura; EN: Technological Information on Agriculture); http://www.infoteca.cnptia.embrapa.br (DSpace/DC). Infoteca‐e collects and provides access to information on technologies produced by the Brazilian Agricultural Research Corporation (Embrapa), which are related to the areas of expertise of its other research centers. Its collections are made up of content edited in‐house (in the form of booklets, books for technology transfer, radio and TV), with language adapted so that farmers, extension workers, agricultural technicians, students and teachers of rural schools, cooperatives and other sectors of agricultural production can assimilate them more easily and thus adopt the technologies generated by Embrapa.

Sabiia (PT: Sistema Aberto e Integrado de Informação em Agricultura; EN: Open and Integrated Information System on Agriculture); http://www.sabiia.cnptia.embrapa.br/sabiia/?initQuery=t. Sabiia is an automated search engine that collects and centralizes metadata from previously selected open access providers of scientific data.
This interface gathers information on agriculture and related areas, enabling access to the full text of thousands of scientific publications available in several national and international institutions. Sabiia provides access to documents such as books, book chapters, journal articles, brochures, theses and proceedings of events, among others. The Sabiia system was built with the free software jOAI (Digital Library for Earth System Education) and Solr (The Apache Software Foundation).

[1] Currently being prepared to incorporate semantic tools for information retrieval.
[2] Ainfo is library management software developed by Embrapa Agricultural Informatics, launched in 1991 and currently in its 6th version. It is mainly used by Embrapa's libraries to organize, preserve and disseminate their documentary collections. It was developed with open source technologies, including the Java programming language and the MySQL database management system, and its digital module is integrated with the DSpace repository technology. The Ainfo metadata standard is MARC 21, with automatic conversion to the DC used by Alice and Infoteca‐e.

Notes: 1. Language: PT, but all databases include EN records; 2. BDP@, Alice and Infoteca‐e publish open data, but there are no policies; 3. No semantic standards are used in the datasets.

C) Vocabularies maintained by your organization (2‐3 paragraphs)
Until recently (1‐3 years ago), Embrapa had no technological, methodological or procedural framework to support the design, construction and management of a controlled vocabulary. Until now, librarians have used controlled vocabularies only to index documentary information, manually consulting thesaurus websites (mainly the multilingual Agrovoc, the bilingual NAL Thesaurus and the Brazilian monolingual Thesagro), with no automated support for this process. Faced with this situation, and considering that we obviously could not propose that Embrapa build its own controlled vocabulary “starting from point zero” (i.e., concept by concept, term by term, relationship by relationship), we started a project to provide such a framework and proposed to populate it initially with terminologies and conceptual structures re‐used from other sources. We designed a very broad proposal for knowledge organization and engineering, based on corpus linguistics, conceptual structures and terminologies (Figure 1). In this process we suggest using several software tools to run the various steps (Figure 1). The need to re‐use other sources has led to the development of a specific computational tool, an "automatic extractor of terms and multi‐lingual agricultural conceptual structures" (ETECAM, in its Portuguese acronym) (Figure 2), to automate the task of matching terms with thesauri [3]. ETECAM was included as an alternative step in the proposed corporate framework mentioned above.
Does your organization maintain its own vocabularies, whether strictly for internal use or for publication? If publicly available, please list their URLs along with brief descriptions (e.g., available languages).
Yes, for internal use only for now! But we intend to make them available, either to process actors or to users, through web services. We are also considering associating language equivalents in EN, ES, FR and IT. Note that Embrapa's vocabularies must take into account the particularities of Brazilian Portuguese (PT/BR), which differ from European PT.
Are any of your vocabularies mapped or linked to external vocabularies?
Yes. We have mapped them to Agrovoc, the NAL Thesaurus and Thesagro.

D) Uses of your datasets and vocabularies (1‐2 paragraphs)
What kinds of applications consume your datasets and vocabularies? Please characterize the users of these applications; can you distinguish different user groups?
Recall what was said in C), just above: the conditions for Embrapa to build and manage controlled vocabularies have been developed only recently and are currently being implemented. Applications derived from these new resources should begin to be developed and implemented as subsequent results. However, precisely to familiarize Embrapa's environment with such applications, some exercises have been developed in parallel and may be cited as examples:
1. Proposal of the corporate process Knowledge Organization and Engineering at Embrapa, included as an essential element of the Corporate Project "Data and Information Governance for Knowledge at Embrapa: Model Development and Implementation Plan (GovIE)";

[3] NOTE: the matching is “terminological” and is based on comparing the input term with its occurrences in the triples (concept‐relationship‐concept) registered in the thesauri; a real “semantic” matching could be achieved once the concepts in Embrapa's vocabularies have been transcribed into SKOS, which is the next step to be worked on in this process.
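The purely terminological matching described in this note can be illustrated with a minimal Python sketch. It is not ETECAM's implementation: the `Triple` type, the `match_term` function, the normalization rule and the sample triples are all assumptions made for illustration. It only shows the idea of comparing an input term, as a string, with the concepts at either end of concept‐relationship‐concept triples.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One concept-relationship-concept record from a thesaurus (hypothetical shape)."""
    subject: str
    relation: str
    obj: str


def normalize(term: str) -> str:
    # Assumed normalization: ignore case and collapse surrounding/repeated whitespace.
    return " ".join(term.casefold().split())


def match_term(term: str, triples: List[Triple]) -> List[Triple]:
    """Return every triple in which the term occurs as subject or object.

    The comparison is string-based ("terminological"), not concept-based.
    """
    key = normalize(term)
    return [
        t for t in triples
        if key in (normalize(t.subject), normalize(t.obj))
    ]


# Invented sample data; BT = broader term, RT = related term.
TRIPLES = [
    Triple("maize", "BT", "cereals"),
    Triple("Maize", "RT", "corn oil"),
    Triple("wheat", "BT", "cereals"),
]

hits = match_term("  MAIZE ", TRIPLES)
for hit in hits:
    print(hit.subject, hit.relation, hit.obj)
```

Because only strings are compared, two labels for the same concept in different languages would not match each other here; that is the gap the note expects a SKOS concept‐based representation to close.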

2. Re‐engineering proposal for Thesagro, the Brazilian agricultural thesaurus, adding a more refined semantics to its structure, following the Agrovoc model. This proposal was developed as a PhD thesis, presented to the Post‐Graduate Program of the Information Science School of the Federal University of Minas Gerais on June 24th, 2015;
3. Development of a formal ontology (in the OWL language), OntoAgroHidro, for domain representation and for information retrieval related to the interfaces of “environmental and socioeconomic impacts of agriculture and climate change on water resources in Brazilian biomes”;
4. Proposal of terminological alignment between the controlled vocabularies of INRA and Embrapa on the theme of Agroecology, via conceptual and terminological mapping to Agrovoc, using the SKOS representation language and enabling mutual visibility of the scientific production recorded as open data in the respective repositories of both institutions, through 510 common terms with language equivalents between French and Portuguese, and between those languages and English;
5. Selection and ranking of Embrapa RD&I projects from the Ideare dataset that can be re‐grouped into thematic arrangements; this process is based on terminological similarity indexes between reference documents and the textual content of the projects.
Are your datasets or vocabularies used in any smartphone applications?
Not yet!

E) Vocabulary maintenance (1‐2 paragraphs)
What software tools does your organization use to maintain or publish your vocabularies? Please describe any tools or processes used to import vocabularies for use in your organization or to create mappings or links between your vocabularies and others.
Currently, the main software used is e‐Termos [4], a free‐access collaborative web environment dedicated to terminology management, which has been adapted to support the corporate process of knowledge organization and engineering at Embrapa. In its original form, e‐Termos rests on linguistic theoretical assumptions and implements six work steps representing the stages in the creation of terminological products. Each stage gathers specific tasks inherent to the process of making these products; different linguistic analysis tools are linked to the six steps to support the Natural Language Processing tasks involved. To make things easier for users building controlled vocabularies when the whole e‐Termos protocol is not necessary, another tool was brought into the process: TheXML® [5], a proprietary Brazilian software package, based on ISO 2788 and ISO 5964, for the creation and maintenance of monolingual, multilingual or poly‐hierarchical thesauri, taxonomies (flat, hierarchical, network or faceted), controlled vocabularies and ontologies. At Embrapa's request, TheXML was upgraded to import/export XML files, allowing minimal interoperability with e‐Termos. Please see the attached Figures 1 and 3 for more details of the entire process and of the interoperability between the software tools.

F) Interoperability and future visions (no limit)
Please describe your vision for semantics in agriculture and consider the following issues in your response. Please describe any problems your organization may be experiencing with regard to the interoperability of datasets or vocabularies. From your perspective, where are the bottlenecks, and what sorts of tools, resources, or actions are needed to solve them?
Problems from a cultural/organizational point of view: Embrapa must recognize the strategic role of linked information in supporting several corporate processes such as planning, collaborative work, global visibility, decision making, etc.; Embrapa must evolve its organizational, methodological and technological framework for information and knowledge management, aiming to facilitate creation, reuse, sharing and dissemination of data and information

[4] https://www.etermos.cnptia.embrapa.br/index.php
[5] http://www.viaapia.com.br/index.php/thexml

and aiming to bring them closer to those corporate processes. In this scenario, the potential of semantic tools is not yet explored!
Problems from a conceptual point of view: Embrapa must consolidate epistemological models for modelling complex systems (including agricultural ones) and emergent scientific knowledge subdomains such as bio/geo/nano‐technologies, allowing them to be brought closer to the current and strictly local/organizational/political/circumstantial/transitory models and, in doing so, also providing better convergence and coherence in the creation, analysis and recombination of data and information on a global scale.
Problems from a tactical point of view: Embrapa must consolidate its terminological databases, controlled vocabularies and other KOS (Knowledge Organization Systems) with a view to applications (including computing ones) for agricultural knowledge organization and representation; Embrapa must introduce approaches from the Communicative Theory of Terminology (Teresa Cabré) into the semantic construction of terminological resources, trying to capture the meaning of concepts expressed in natural language and bring it into specialized language, or vice versa.
Problems from an operational point of view: Embrapa must develop better, user‐friendly tools for KOS editing and visualization (visualization is a very important element in KOS conception and management); Embrapa must transform its current controlled vocabulary into SKOS; Embrapa must develop methodological tools for KOS merging; Embrapa must create its own namespace, allowing reciprocal linkage with other controlled vocabularies worldwide; Embrapa must develop technological tools to recognize and capture, from textual corpora, the defining contexts of concepts and associate them with the controlled vocabularies.
What do you see as the most pressing needs in agriculture for the coming decade?
I think the most pressing needs in agriculture for the coming decade will remain the same as in past and present decades: first of all, to ensure food for billions and billions of people, and also to become an alternative for energy generation, both on a sustainable basis. I would prefer to answer what the pressing needs are for agricultural information management! So, I think we must work from three conceptual perspectives: 1. To align the DIK (data‐information‐knowledge) relationship with the management process in a more suitable and pragmatic way, abandoning the conventional hierarchical/pyramidal or linear/progressive models and adopting a cyclical, continuous, feedback‐based model; 2. To bring information closer and closer to decision makers at each and every level and instance in which it is necessary and available; 3. To align the concepts of the DIK life cycles with each other, and to align the several subprocesses of DIK management with each other. Semantics can bring speed and objectivity to pragmatic information retrieval and reuse when addressing complex problems… e.g.: What could Embrapa say about this hypothetical information context?: (…) Mitigation proposals to socioeconomic and environmental impacts due to silting caused by deforestation of riparian forest, with decreased water flow in the river bed, and due to the increase of population density and economic activities related to irrigated orcharding, and due to eutrophication caused by leaching of fertilizer (N) in

Rio do Peixe in São Francisco river basin, Cerrado‐Caatinga biomes transition after extreme drought events over the last five years (…)
Notes: 1. If “to say” means “to have” (data, projects, publications, etc.), then: syntactic information retrieval; Boolean rules; numerous lexical combinations to compose keywords for queries in databases and repositories… Embrapa operates well at this level! 2. If “to say” means “to know” (reasoning), then: semantics adds cognitive value... Embrapa needs technological and methodological evolution and innovation at this level.
I would like to think that semantics could help to reconcile the “ubiquity vs. ambiguity” conflict of digital information, for example in an information search process: when requesting information, semantics should support the widest possible reach to potentially relevant and available information… When returning the answer, semantics would help point to the information with the most pragmatic value (based on the query) within the set of relevant information retrieved. Obviously, semantics is not the only tool to be employed in such a situation, but it certainly has an important role to play. (Please see Figures 4‐8, comparing search engine answers to different queries built in natural language.)
What sort of datasets are needed, and what sorts of vocabularies are needed to support access to and use of those datasets?
I think that we have to prepare a very wide conceptual framework (perhaps like the NASA SWEET foundational ontologies) to house all sorts of agricultural KOS (existing ones or those to be developed) and to provide the conditions to interrelate them reciprocally.
This extensive exercise, which aims to develop a general conception of world agriculture and to select its fundamental elements, involves choosing a shared epistemological view of agriculture, identifying its generalities and its regional particularities, and trying to represent it through KOS such as thesauri, semantic networks and ontologies. To achieve this goal, the systematization of vocabularies concerning agriculture and related disciplines is a basic task, including some fine points, e.g. the identification, expression and recording of regional idiomatic variations within the same language.
Do particular areas need to be strengthened, such as integration of semantics in geographic information systems?
Yes! In OntoAgroHidro we have used geo‐spatial information to disambiguate rivers (entities) that share the same name across Brazilian territory, with very popular names such as Rio do Peixe (“river of the fish”) or Rio Verde (“green river”).
What priorities should the organizations represented at the workshop set for future actions?
[Strategic level] ‐ To work politically and institutionally (inside or outside their own organizations) to clarify for, and convince, decision makers that well organized, shared and disseminated information is strategic to organizational management and governance processes, strengthening the role of RD&I institutions in their commitment to provide solutions to the current global problems at the Economy‐Environment‐Agriculture interfaces;
[Tactical level] ‐ To work together, collaboratively developing methodological and technological tools aimed at KOS convergence and interoperability and showing pragmatic applications in information and knowledge management processes;
[Operational level] ‐ To create or participate in global scientific communities or networks and develop joint projects.
Are you aware of, or involved in, other relevant projects or initiatives in related areas?
I know numerous virtual communities, but I really see very few pragmatic outcomes from them!
In what direction should we try to head over the coming decade?
I can think of so many things! But concerning “semantics” specifically, I would like to explore the technological possibilities for developing KOS better suited to representing complex systems! How can we semantically represent the three characteristics of such systems: uncertainty, randomness and unpredictability?
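The use of geo‐spatial information to tell apart rivers that share a name, mentioned above for OntoAgroHidro, can be illustrated with a small Python sketch. This is not how OntoAgroHidro works: the response does not describe its method, so the nearest‐candidate rule, the `River` record, the identifiers and the coordinates below are all invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class River:
    river_id: str   # hypothetical identifier for one river entity
    name: str
    lat: float      # reference point for the river, decimal degrees (invented)
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def disambiguate(name: str, lat: float, lon: float,
                 gazetteer: List[River]) -> Optional[River]:
    """Among rivers with the given name, pick the one nearest to (lat, lon).

    Assumed rule for illustration: the mention refers to the closest homonym.
    """
    candidates = [r for r in gazetteer if r.name.casefold() == name.casefold()]
    if not candidates:
        return None
    return min(candidates, key=lambda r: haversine_km(lat, lon, r.lat, r.lon))


# Invented gazetteer: two homonymous rivers and one unrelated river.
GAZETTEER = [
    River("river-A", "Rio do Peixe", -10.0, -40.0),
    River("river-B", "Rio do Peixe", -27.0, -51.0),
    River("river-C", "Rio Verde", -17.0, -51.0),
]

# A document that mentions "Rio do Peixe" and is located near (-11, -41).
best = disambiguate("rio do peixe", -11.0, -41.0, GAZETTEER)
print(best.river_id if best else "no candidate")
```

The name alone returns two candidates; it is the location attached to the mention that selects one entity, which is the point made in the answer above.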

French National Institute for Agricultural Research (INRA)

A) About
INRA is a public research organisation ranked 2nd in the world and 1st in Europe for publications in agricultural plant and animal sciences. The missions of INRA are to: (i) serve the public interest by maintaining a balance between excellence of research and the demands of society; (ii) produce and disseminate scientific knowledge and innovation, particularly in the fields of agriculture, food and environment; (iii) contribute to expertise, training, the promotion of scientific and technical culture, and the science/society debate. The INRA department of Scientific Information (DIST) works transversely to support INRA's scientific strategy and provide innovative services in the area of data, information and knowledge management. DIST develops tools and services to support research activities and promote open access to scientific and technical information. In particular, DIST supports scientists in building and publishing domain vocabularies according to open standards.

B) Datasets maintained
Many datasets are maintained by INRA in a great variety of agriculture‐ and food‐related domains. 90 platforms offer access to scientific databases, tools and services. Some examples:

‐ GnpIS is a multispecies integrative information system dedicated to plant and fungi pests.

‐ GenoTool is dedicated to sequence analysis.

‐ Oqali is a database gathering the nutritional characteristics of processed foodstuffs sold on the French market, at the brand level.

A couple of current projects aim at publishing data on the web of data. Last year, DIST published an open linked data version of the content of the institutional archive Prodinra.

C) Vocabularies maintained
DIST maintains the institutional reference vocabulary VOCINRA, which is used by STI professionals and researchers to index scientific publications and research activities. Up to now, VOCINRA has been for internal use, though an RDF version has recently been released (somewhat confidentially). Ongoing work in collaboration with Embrapa (Brazil) resulted in the automatic mapping of some 1,500 terms on the basis of a corpus of publications in the area of agroecology. Research teams build, and sometimes publish, vocabularies for the needs of their projects. These can be very specialised regarding the domain, are expressed in French and/or English, range from thesauri to formal ontologies, and may or may not follow representation standards. DIST has started an inventory with the objective of offering those teams facilities to promote their vocabularies, among which is the LOVINRA portal. The vocabularies included in LOVINRA are also available on Agroportal (http://agroportal.lirmm.fr/), which aims to become a reference ontology repository for the agronomic domain. Some examples:

‐ The ANAEE thesaurus deals with gene‐environment interactions, biodiversity, biotic interactions and ecosystem functioning, and eco‐evolutionary processes (more information: http://agroportal.lirmm.fr/ontologies/ANAEEF)

‐ ATOL is a reference ontology on animal phenotypes (http://www.atol‐ontology.com/index.php/fr/)

‐ BIOREFINERY (http://ist.blogs.inra.fr/lovinra/2015/05/04/biorefinery/, http://agroportal.lirmm.fr/ontologies/BIOREFINERY)

‐ TRANSMAT (http://ist.blogs.inra.fr/lovinra/2015/05/04/transmat/)

D) Uses of your datasets and vocabularies
VOCINRA is used by STI professionals who index research publications in Prodinra. Datasets maintained by research and bioinformatics teams are directly exploited by dedicated tools for analysis and modelling. The users are researchers.

E) Vocabulary maintenance
DIST recently released the beta version of LOVINRA (Linked Open Vocabularies @ INRA), which allows the LOD publication of vocabularies. It is based on a WordPress website, the Sesame triple store and the Pubby Linked Data API, which provides REST access to the concepts and facilitates linking them to external resources. DIST is also involved in the Agroportal project, which is based on the NCBO BioPortal technology. Indeed, the vocabularies published through LOVINRA are also available within Agroportal. The ANAEE team works with VocBench to build and maintain its thesaurus. As VOCINRA contains common keywords from international vocabularies but also very specific keywords from recent research, we plan to enhance vocabulary quality and to perform alignments with the main vocabularies in agriculture and life sciences. Ongoing work by DIST with an Embrapa team used Onagui to compute alignments of VOCINRA with Agrovoc. We will study the possibility of using VocBench for its maintenance.

F) Interoperability and future visions (no limit)
INRA is well engaged in the Open Science movement, both internally and by taking part in inter‐institutional and international groups (Research Data Alliance, Science Europe, etc.). But it still lacks a clear policy regarding what can be published and under what conditions (access, sensitivity, licences, etc.). Researchers face a diversity of repositories (institutional, domain‐specific, recommended/prescribed by publishers) and of legal issues, with little or no time to spend on them. This is true for both data and vocabularies.
Vocabularies are rarely reusable due to 1) lack of visibility, 2) poor description (scope, goals, etc.), 3) lack of confidence indicators. Repositories already exist but are not well identified by the community of researchers. As a consequence, researchers tend to redevelop ad hoc vocabularies, which is an impediment to database interoperability. Regarding tools for vocabularies, the need is for collaborative and intuitive interfaces that allow 1) the finding and integration of existing resources, 2) their adaptation with data from texts, 3) alignment with published vocabularies, 4) easy structuring, 5) discussion about structure and content. Guidelines for discovery metadata are also needed. The current challenges regarding data interoperability, from our point of view:

‐ Resource identification: there is a need for machine‐readable identifiers as well as human‐friendly identifiers (naming conventions).

‐ Provenance information: what are the minimum requirements in terms of provenance metadata to allow one to contextualize and assess the data? What kind of incentives could we offer scientists so that they provide this information? What could be collected automatically, and how?

‐ Semantic coverage: a lot of work has been done; we need to have the overall big picture and identify the gaps.

‐ How do we link new datasets back to the existing ones, given the distributed context of data infrastructures?

‐ Data formats: we need at least common formats for data exchange.

Some relevant initiatives and projects

COPO (Collaborative Open Plant Omics) is a BBSRC granted project that aims to combat the lack of interoperability in the plant research sector by developing a framework that establishes a shared system for the description, publication, identification and citation of datasets. By improving the transparency of data, COPO intends to increase access to datasets and facilitate easier reproduction of data, a function central to the analysis and advancement of the research field.

Agroportal: The Agroportal project is based on the NCBO BioPortal technology and aims at reusing the scientific outcomes and experience of the biomedical domain in the context of plant, agronomic and environment sciences. It will offer an ontology portal which features ontology hosting, search, versioning, visualization, comment, but also services for semantically annotating data with the ontologies, as well as storing and exploiting ontology alignments and data annotations. All of these within a fully semantic web compliant infrastructure. The main objective of this project is to enable straightforward use of agronomic related ontologies, avoiding data managers and researchers the burden to deal with complex knowledge engineering issues to annotate the research data. The AgroPortal project will specifically pay attention to respect the requirements of the agronomic community and the specificities of the crop domain.

WheatIS: this project aims at building an International Wheat Information System, hereafter called WheatIS, to support the wheat research community. The main objective is to provide a single‐access web‐based system giving access to the available data resources and bioinformatics tools.


International Food Policy Research Institute (IFPRI)

A) About The International Food Policy Research Institute (IFPRI) provides research‐based policy solutions to sustainably reduce poverty and end hunger and malnutrition in developing countries. Established in 1975, IFPRI currently has more than 500 employees working in over 50 countries. It is a research center of the CGIAR Consortium, a worldwide partnership engaged in agricultural research for development. Vision and Mission: IFPRI’s vision is a world free of hunger and malnutrition. Its mission is to provide research‐based policy solutions that sustainably reduce poverty and end hunger and malnutrition. The Library & Knowledge Management unit captures, organizes, and provides access to and exchange of IFPRI's research through its knowledge repositories and academic networks. It supports and trains IFPRI researchers with tools, datasets, data visualizations, widgets, and internal blogs, and administers the library where IFPRI staff can access collections of books, journals, datasets, and databases. My role is Knowledge Management Systems Coordinator: I create and maintain the LKM website, create LOD for IFPRI datasets, format data as RDF for the VIVO initiative at IFPRI, work on the creation of the Agricultural Technology Ontology, and will coordinate the upcoming triple store for IFPRI publications. B) Datasets maintained IFPRI hosts an internal instance of Dataverse (IFPRI staff only) that holds some of the research data and some third‐party data that has been acquired to support research. IFPRI also has an external dataset supporting the research conducted by the institution on the Harvard Dataverse network: https://dataverse.harvard.edu/dataverse/IFPRI . These are some of the types of datasets we maintain: Social Accounting Matrix (SAM), Household Surveys, Global Hunger Index (GHI), Global Nutrition Report (GNR), Agricultural Science and Technology Indicators (ASTI).
Linked Open Data is also published for the Global Hunger Index (GHI), Agricultural Science and Technology Indicators (ASTI), Statistics of Public Expenditure for Economic Development (SPEED), and Arab Spatial datasets; these are housed on a separate server and can be accessed through http://data.ifpri.org/ . IFPRI requires that proper citation and attribution be given in the reuse of data, and follows the IFPRI copyright and fair use statement, http://www.ifpri.org/copyright . C) Vocabularies maintained The vocabularies we maintain are used in our repository for internal purposes. These are lists of descriptors specific to IFPRI research, organizational structure, and classifications. The keywords used to


describe materials in the repository come from AGROVOC and CABI, as well as the ‘author supplied’ field that serves as a catch all for the terms not found in these vocabularies. In development – Agricultural Technology Ontology (ATO) is an effort aimed to map technologies, crops, and methods from research across the CGIAR. So far, this ontology links to the Geopolitical ontology with plans on linking to the Crop Ontology. VIVO is an open source semantic application, an internal ontology covering the research and organizational structures at IFPRI has been created to work with and supplement the existing VIVO set of ontologies. D) Uses of datasets and vocabularies The IFPRI LOD page has a SPARQL Endpoint for accessibility and use of data, as well as browsing options and the original files for download. The data has been reused to populate country profiles in other websites such as AGRODEP (http://www.agrodep.org/country/NGA ). In 2014 The IFPRI LOD page generated 4,645 visits with 1,907 unique visitors. The major users for IFPRI LOD were AGRODEP, Land Portal, and FAO. E) Vocabulary maintenance The LOD site is Java code running on an Apache server, with annual data updates created through Java and Ruby scripts. Creation of the ATO has been through the use of Protégé, and the IFPRI VIVO ontology was created through the VIVO interface. A large effort was undertaken to map repository keywords to ‘Topics’ used on the new IFPRI website, as a way of categorizing IFPRI outputs for browsing; Open Refine, spreadsheets, and Access were used iteratively in these efforts. F) Interoperability and future visions We should be working on standardizing the data collection and recording processes. Currently, different countries have different methods making it difficult to compare and use with other sources; making the data interoperable allows for the development of more applications and tools. 
These innovations can aid in revealing more efficient ways to invest, and new methods for solving problems. Having interoperable agriculture, health, GIS, and soils information helps represent the ‘big picture’ of what the data reveals, allowing us the opportunities to create better tools that fight poverty and hunger issues more effectively. Problems and solutions Donors require different methods and standards, some require the use of certain vocabularies and/or reporting methods, making it difficult to map and relate data for reuse. A huge problem is the lack of having the staff involved, in curating data, using normalized vocabularies, or ontologies. The lack of staff training leads to individualized fixes that further perpetuate the difficulties in reusing data. Most pressing needs in agriculture for the coming decade


Training of staff. The use of tools, standards, and vocabularies should be prominent in order for changes to have lasting effects. This also involves hiring real data curators and not only techies. In recent years, with rising awareness of the need for data curation, skillsets applicable to data curation (including GIS) have been nurtured; it is time to bring these to the forefront. Future priorities Focus on ways to implement tools and developments that are interoperable, so they may be integrated with other projects that are being funded. The Agricultural Technology Ontology started with this in mind. The process and work that has gone into the creation of the ATO has shown how hard it is to find the line between being broad enough to cover everything and specific enough for precision. This work is necessary if we hope for widespread use of ontologies and vocabularies to make data interoperable. Our work in the future should lean towards a comprehensive framework of related standards for the exchange, integration, sharing and retrieval of electronic agricultural, nutrition, gender and health information. The agricultural framework should be based on a truly modern web services approach, making it easier for systems to exchange very specific, well‐defined pieces of information rather than entire documents.


Indian Statistical Institute (ISI) A) About The Indian Statistical Institute (ISI) is a unique institution devoted to the research, higher education and application of statistics, natural sciences and social sciences. Founded in 1931, the institute gained the status of an Institution of National Importance by an act of the Indian Parliament in 1959. The Documentation Research and Training Center (DRTC), established in 1962 under ISI, has grown into a world‐renowned centre of higher learning and research. The objective of the center is to contribute to the development of the different branches of information sciences by guiding and supporting research and development in the concerned fields. The aim is to develop expertise and excellence in different areas of information science. B) Datasets maintained DRTC, ISI hosts and maintains the following:

Librarians' Digital Library [LDL] https://drtc.isibang.ac.in Indus Asian Agricultural Resources gateway http://drtc.isibang.ac.in/indus CALIS: Current awareness service based on RSS feeds in LIS discipline

C) Vocabularies maintained No, we do not maintain our own vocabularies. However, we help institutes and research organizations manage their vocabularies using a semantic framework built on principles and postulates of LIS. D) Uses of datasets and vocabularies Our datasets are mainly digital repositories, used by researchers to search and retrieve scholarly articles. Smartphone applications are possible but not yet implemented. E) Vocabulary maintenance Homegrown [software tools]. F) Interoperability and future visions Issues: Interoperability problem: The only way forward to maximize the use of agriculture data is to adhere to world standards in storage, representation and interchange, and at each stage that data passes through from its production to its consumption. While this is desirable, it is true that huge amounts of legacy data exist and are being used perfectly well within the communities that are familiar with the data. The issue of interoperability arises when we think of web‐wide use of such data. Disparity in the process of acquisition, storage, description, exchange formats and many more is then a major issue. This major issue can be sub‐divided into several others such as vocabulary, representation standards, IT stack and standards, exchange format standards and so on. Challenges: the challenge, put in a single sentence, therefore is:


'How to make agricultural data accessible and usable by web‐wide communities?' Solutions: THE ONLY solution now is to empower communities to harness available data, at all levels of stakeholders in the agriculture community worldwide. Care should be taken that it is aimed not only at IT professionals and information managers but also at agriculture scientists, field workers and researchers. Whatever the solutions offered, they will succeed if the data are thrown open by any means, but with a lightweight architecture and in a simple, parsable mode of web publishing along with embedded semantics.


Integrated Modelling Collaboratory (IMC) A) About Fernando Villa leads the Integrated Modelling Collaboratory (IMC), an entity dedicated to semantically integrated modelling joining many collaborators worldwide and publishing the only semantic modelling software platform in existence. We also maintain a set of ontologies and a core ontology for modeling and scientific observation. The main project under the integratedmodelling.org umbrella is the ARIES (Artificial Intelligence for Ecosystem Services), which provides a next‐generation platform for enviro‐social assessment and valuation, widely published and used by many groups worldwide. B) Datasets maintained We do not authoritatively maintain any datasets, but are responsible for the semantic annotation and distribution of many global and local datasets for use with our integratedmodelling.org platform. We maintain collaboration sites and have secure upload and semantic annotation facilities for users of the collaboratory. C) Vocabularies maintained We have developed the k.IM language which is the only descriptional framework available for semantic annotation of data and models. It uses the core IM ontology as a defining feature, which encodes both the semantics of observation and of observables for semantic mediation within the integratedmodelling.org platform. The language has facilities for integration of authorities, which bridge to existing vocabularies such as AGROVOC, GBIF encoders and IUPAC chemical identities so that they can be used within semantically explicit workflows. D) Uses of datasets and vocabularies The ARIES platform (www.ariesonline.org, soon to be superseded by aries.integratedmodelling.org) is the flagship project that is serving as the testing and development grounds for the IMC community. The project is widely known as one of the major platforms for ecosystem services assessment and valuation. 
E) Vocabulary maintenance An open source software stack (named k.LAB) is the main activity of the IMC and contains both a user‐side platform (language, IDE and modeling engine) and a network node package that allows users to install semantic web nodes using the infrastructure and ontologies. We do not publish a vocabulary for a specific discipline, but maintain a comprehensive set of ontologies that constitute a socio‐environmental system worldview. Internally, our approach is based on OWL2 and DL reasoning. F) Interoperability and future visions We have a fully worked out infrastructure and community where long‐standing hypotheses about semantically‐driven interoperability of data and models are being tested. The road is long but we believe our approach solves all the outstanding issues that have doomed efforts like OBOE, SWEET, SUMO and other initiatives that have tried to provide communities with tools for data/model interoperability. The main dimensions of the effort are aimed at:


1. Awareness of scale, whose lack has created unworkable ambiguities in all previous efforts. Is lightning a process or an event? Is the movement of molecules in a gas a random/statistical process or a deterministic one? The explicit scale stance in a worldview defines the phenomenology embedded in the ontologies.

2. Providing a mechanism that can keep the core domain concept set small and flexible. Most ontology efforts just specialize terms as needed, which creates monster ontologies that are never complete. We have defined and standardized an approach based on explicit notions of domain, traits and authorities to successfully manage size and complexity.

3. Avoidance of “domain creep” and jargon control. Concepts differ by discipline and disciplines overlap and conflict for the use of the same terms. We work above disciplines and only concentrate on observables at specified scales; we use jargon files to bridge if necessary. Domains are explicit and carefully encapsulate what is disciplinary and what is not.

4. There have been no satisfactory solutions so far to the complexities resulting from bridging to naming systems such as controlled vocabularies and taxonomies. We use only traits for those, assigning them the special status of "identities", and use authorities to allow flexible bridging to recognized naming systems without having to explicitly build ontologies with an infinite number of identities.

5. All our efforts are carefully designed to allow a mapping to models. The rationale for using explicit ontologies in models is to discover, connect and mediate models and data in safe and unambiguous ways. The innovations we are pioneering (explicit, scale‐aware worldviews; ontology size control; use of traits to describe data reduction and other observation‐related issues; conformance to observation ontology through language constraints; use of recognized authorities for large‐scale identities, etc.) enable this practice for the first time. ARIES is the demonstration project that illustrates this. The k.IM language is used for models as well as concepts and there is a seamless continuity between the two, thanks to the unifying metaphor of observation.


Kenya Agricultural and Livestock Research Organisation (KALRO)

A) About

The Kenya Agricultural and Livestock Research Organisation (KALRO) is the premier national institution bringing together research programmes in crops, livestock, biotechnology and socio‐economics and envisions a vibrant commercially‐oriented and competitive agricultural sector propelled by science, technology and innovation. KALRO promotes sound agricultural research, technology generation and dissemination to ensure food security through improved productivity and environmental conservation.

One of KALRO’s goals is to promote the dissemination and application of research findings in the field of agriculture and the establishment of a Science Park. KALRO coordinates Kenya Agricultural Information Network (KAINet), which is a comprehensive national repository of research, extension, training and other materials related to research. It aims at making research open and accessible to various stakeholders.

KALRO has created a Knowledge Management unit which is in charge of the development, implementation and management of knowledge management systems. I am currently in‐charge of the knowledge management systems and also in‐charge of the KAINet.

B) Datasets maintained

KAINet and the KALRO E‐repository http://www.kalro.org:8080/repository are electronic metadata sets of documents incorporating agricultural research carried out in Kenya by KALRO and its partners over the years. The KAINet e‐repository www.kainet.or.ke has access to a large collection of information covering aspects of agricultural and forestry sciences and grey literature which is not available through normal publication and distribution channels.

The KAINet online repository contains over 35,000 bibliographic metadata references in the AGRIS AP XML format, from four collaborating institutions. Until now, the biggest contributor by far is the KALRO E‐repository, with 1,745 metadata sets with full‐text documents, all in the English language.

Currently, KALRO scientists individually collect and maintain agricultural datasets on livestock, food science, climate change and socio‐economics research, mainly due to the lack of policies. Unfortunately, these datasets are not available or accessible online on any platform. However, efforts are underway to sensitize researchers on the importance of sharing datasets, the implementation of the right and interoperable tools and standards, and the development of strategies/policies to make datasets accessible and available online.

C) Vocabularies maintained

The adoption of AgriDrupal and AgriOceanDspace also ensured that KAINet and KARI repositories conform to some of the recommendations by the Coherence in Information for Agricultural Research for Development (CIARD). CIARD is a global initiative working to make agricultural research information publicly available and accessible to all. Both tools have adopted Resource Description Framework (RDF), linked data, Really Simple Syndication (RSS), and Open Archives Initiative Protocol for Metadata Harvesting (OAI‐PMH), thus enhancing visibility and accessibility of agricultural content in the repositories.


AGROVOC is used to index documents, and can also be used as a hub to access many other vocabularies available on the web. The adoption of these tools has:

• raised metadata standards (AGRIS AP, MODS), with OAI‐PMH compliance and controlled vocabularies (ASFA, AGROVOC);

• improved the exchange of metadata, by offering import and export functionalities based on widely used formats (Dublin Core, AGRIS AP, CSV or RSS); AGROVOC can be used to index any content;

• improved the quality of metadata; a cataloguing interface that out‐of‐the‐box provides the most commonly used metadata elements in bibliographic databases, in particular those defined by AGRIS AP, but is easily extendable to include any other element; special input interface for subject indexing with the AGROVOC thesaurus.
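
As a rough illustration of the OAI‐PMH exchange mentioned above, the following Python sketch builds a ListRecords harvesting request. The arguments verb, metadataPrefix, set and from, and the oai_dc (Dublin Core) prefix, are defined by the OAI‐PMH protocol itself; the base URL and the function name build_list_records_url are hypothetical and do not refer to the actual KAINet or KALRO endpoints.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    # 'verb' and 'metadataPrefix' are required arguments of ListRecords.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # 'set' and 'from' are optional selective-harvesting arguments.
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only for illustration.
url = build_list_records_url(
    "http://repository.example.org/oai/request",
    from_date="2015-01-01",
)
print(url)
```

A harvester would fetch this URL and parse the XML response; records in other formats such as AGRIS AP could be requested through a different metadataPrefix if the repository advertises one.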

D) Use of datasets and vocabularies

The e‐repositories, which bring together what the institution has done, have become an important ‘First Stop, One Stop Tool’ for researchers and policy makers to identify research gaps for purposes of resource allocation and beefing up existing research activities. KAINet partners and other stakeholders are able to exchange metadata sets amongst themselves through the adoption of systems that meet specific architectural and functional requirements for information exchange. This has been achieved by using AgriDrupal and AgriOceanDspace, which have integrated the AGRIS Application Profile (AGRIS AP) and AGROVOC.

Interoperability within KAINet has increased the accessibility of the information/data and has resulted in partner institutions harvesting metadata from the FAO AGRIS database and vice versa. For example, KALRO has shared its metadata with CABI. Access of real time news and events, using RSS Feeds, from other agricultural related sources like AgriFeeds and e‐Agriculture has improved the information/data on KAINet.

Visibility has been enhanced with the KAINet repository being registered with the CIARD Ring and the KALRO repository being among the few repositories from Africa being listed and accessed through the OPENDOAR (http://www.opendoar.org).

E) Vocabulary maintenance

KALRO and KAINET both use AGROVOC which is a controlled vocabulary covering all areas of interest of the Food and Agriculture Organization (FAO) of the United Nations, including food, nutrition, agriculture, fisheries, forestry, environment etc. It is published by FAO and edited by a community of experts.

The vocabulary is normally integrated, customized and configured as modules in AgriDrupal and AgriOceanDspace, and is hence maintained mainly by FAO. Its use on the KAINet portal ensured that content and repository metadata reside on the platform, increasing usability and accessibility of both content and the repository.

F) Interoperability and future visions (no limit)

To see semantics in agriculture play a key and enabling role to publish and organize agricultural data so as to be compatible and integrated with the data produced by other entities.

The absence of appropriate information management policies makes it difficult for information managers to collect data from generators for the repositories. There is also a lack of awareness of the importance of semantics in improving the value and use of open data among the key stakeholders.


To address some of the challenges, there is a need to develop a platform that would sensitize agricultural stakeholders on the need to adopt appropriate tools and standards to ensure datasets are easily accessible, visible and exchangeable.

We also need to promote semantics in agriculture by documenting successful case studies that show the benefits, value and profitability of sharing data and making it available.

Agriculture plays a vital role in the GDP of many developing countries, hence the need to keep up with the demands of the market and of feeding the ever‐growing human population.

To achieve its targets the agricultural stakeholders need to generate datasets on climate change, population density, technologies, and best science practices. The development of particular geographic information systems is key with issues of climate change.

The workshop should look at strategies that would close the gap in accessible and available datasets from developing countries. It should also look into ways of setting up a support community that would offer technical support, advice and resources to interested agriculture stakeholders so that they can take advantage of the semantics in agriculture movement.

With the milestone made in semantics in agriculture, KALRO is strategizing to build a common and freely accessible information system in partnerships with Kenya Open Data, OCSDNet (http://ocsdnet.org/) and Kenya Agricultural Information Network (KAINET). This effort is aimed to facilitate the generation, collection, processing, archival, and dissemination of agricultural datasets.


Association for Technology and Structures in Agriculture (KTBL) A) About The Association for Technology and Structures in Agriculture (KTBL) is a registered association in Darmstadt, Germany with a staff of approximately 70 people. KTBL is supported by the German Federal Ministry of Food and Agriculture. The activities are based on the government's agricultural and environmental policies and on the needs of our target groups. Its mandate is knowledge transfer of scientific findings into agricultural practice. The main objective is to promote environmentally friendly and society‐embedded agriculture in accordance with the needs of consumers, particularly with regard to agricultural engineering techniques and methods in cultivation, species‐adapted livestock farming, landscape management, agricultural raw materials and energy production and recycling of organic wastes. With its activities, it supports policymakers and administrators in drafting legislation and helps to create regulations and legal instruments. Apart from printed publications containing planning and decision support data for farmers, consultants and administration, the focus is increasingly shifting towards internet‐based information provision. With increasing value attributed to information, technologies of data exchange and information and knowledge management in agriculture have become a major field of activity for KTBL. B) Datasets maintained KTBL maintains a large collection of planning data for agriculture. It provides standard calculation values e. g. for investment, resource allocation and process planning for plant production as well as for livestock farming. Included are e. g. average purchase prices, useful lifespans of machinery and equipment or average execution times of tasks such as harvesting or fertilization, depending upon crop, machinery used etc.
Recently, basic values necessary for calculating nutrient balances or the carbon footprint of production have also gained importance and have thus been added to the database. Descriptive labels and text fields in the dataset are mostly in German, but part of it has been translated into English. However, we have received requests for further multilingualization and are preparing for it. We have started work to publish the data as linked open data. A subset containing data about agricultural machinery is available at http://srv.ktbl.de/data/MachineClass/ with a semantic search system on top at https://search.ktbl.de. Data is currently published using the Creative Commons BY‐NC‐SA 4.0 license; however, we are considering switching to the more permissive BY‐SA variant. The server mentioned above delivers data in RDF/XML, Turtle, JSON and XML format plus an HTML view for human consumption. It runs an instance of the Linked Data API as specified at https://github.com/UKGovLD/linked‐data‐api plus a SPARQL endpoint at http://srv.ktbl.de/query. Among others, the datasets use the RDF, RDFS and SKOS vocabularies, but also QUDT (http://qudt.org) for representation of units and dimensions. C) Vocabularies maintained For a research project, we drafted the agroRDF vocabulary, which provides classes and properties for representation of process planning and documentation data as commonly captured in farm management information systems. It was published in an early draft stage during the iGreen project at


http://data.igreen‐services.com already but is currently unavailable due to a larger refactoring and redesign. The linked data server above also contains a vocabulary definition and machine taxonomy within their own namespaces http://srv.ktbl.de/vocabulary# and http://srv.ktbl.de/taxonomy/ respectively. They can currently however not yet be resolved at the URLs given and only be queried via the SPARQL endpoint. Where possible, the vocabulary and taxonomy have been mapped to AGROVOC concepts. D) Uses of datasets and vocabularies As the service can currently be considered to be within a beta testing phase, there are not many known applications already consuming the dataset. However, we know of two farm management system providers in Germany that are using data via the web services provided. Another three have expressed interest. As far as we know, there are not yet any smartphone applications available, but we are currently building one ourselves for the Android platform. It is using the JSON serialization of the service and offers possibilities to do basic depreciation, fixed and variable cost and operating supply consumption calculations for agricultural machines. E) Vocabulary maintenance On the vocabulary and taxonomy side, we rely only on text editors and standard source code versioning tools. To facilitate maintenance, the material is split into a number of modules. For the data set, which is much larger than the vocabularies, we have to move along a different path: we use d2rq (http://d2rq.org) for generating an RDF dump out of an Oracle database. This is then fed alongside the vocabularies into a Jena Fuseki instance to provide a SPARQL endpoint. When we reference an external vocabulary, that is imported either fully or as a subset into the Fuseki instance as well. Mappings are created manually. F) Interoperability and future visions Regarding interoperability, I think things have turned out to improve during the last few years. 
At least with the facilities provided by graph‐oriented (and thus much more flexible, extensible and transferable) models like RDF and concept mapping constructs in SKOS, the convergence path is much more open than it was a few years ago, when everybody was relying on relational and/or tree‐oriented models like XML. Nevertheless, there are still a few things to be done. A notable problem in our context of farm management information systems is, e.g., the use of different crop code systems that are inconsistent both internally and with each other. Variety registration agencies use a different one than pesticide registration agencies, and yet another one is used for the agricultural subsidy and statistics system within the EU. Most of them have serious conceptual inconsistencies, such as mixing biological species with land usage or different production forms.

Fostering vocabulary reuse is another issue that should be addressed. There are a number of simple and proven vocabularies out there, like the work done at the schema.org initiative, Dublin Core or FOAF, and wherever possible these should be used and only extended where necessary. I think a tool to support reuse should be modelled on, e.g., the taginfo.osm.org website maintained for the OpenStreetMap initiative. There, users of the platform can get statistical information about the usage of certain tags within OpenStreetMap. It is thus easy to find out, e.g., which tags are used the most for annotating certain objects like shops, roads etc. It is then possible to react accordingly and only propose new tags when no alternatives exist. In the semantic web context, such a tool would have to crawl as many datasets as possible and provide statistics on the usage of certain RDF classes and properties. People providing new datasets could then see which properties are used most often and base their own development on common ground that is already available.
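
As a rough illustration of the kind of statistics such a tool might compute, the following Python sketch counts how often each property and each rdf:type class occurs in a list of triples. The sample triples, the example.org namespace and the function name usage_statistics are assumptions made for this illustration; a real tool would obtain its triples by crawling published datasets.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_TITLE = "http://purl.org/dc/terms/title"
EX = "http://example.org/"  # hypothetical namespace for the sample data


def usage_statistics(triples):
    """Count how often each property and each rdf:type class occurs."""
    properties = Counter()
    classes = Counter()
    for _subject, predicate, obj in triples:
        properties[predicate] += 1
        if predicate == RDF_TYPE:
            classes[obj] += 1
    return properties, classes


# Hypothetical triples standing in for the content of crawled datasets.
sample_triples = [
    (EX + "tractor1", RDF_TYPE, EX + "Machine"),
    (EX + "tractor1", RDFS_LABEL, "Tractor 1"),
    (EX + "plough1", RDF_TYPE, EX + "Machine"),
    (EX + "plough1", DCT_TITLE, "Plough 1"),
    (EX + "field1", RDF_TYPE, EX + "Field"),
    (EX + "field1", RDFS_LABEL, "Field 1"),
]

properties, classes = usage_statistics(sample_triples)
for uri, count in properties.most_common():
    print(count, uri)
```

A dataset provider could consult such a ranking before minting a new property, for example to see that rdfs:label is used more often than dcterms:title in the crawled data.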

Apart from that, most of the bottlenecks and problems we face are more basic and practical. For example, the lack of n‐ary relations in the RDF data model (or at least of a common recommendation on how to deal with this) is rather problematic in some contexts. Think of, e.g., representing physical quantities as properties: dimension + value. We solve the problem for our own datasets using blank nodes, as this is the simplest solution, also in terms of being able to generate other serializations of the data later on. Using datatypes would be another approach, but it deprives us of the possibility to assign an XML Schema datatype (xsd:double, xsd:decimal) to the value and an rdfs:label to the unit. We're fully aware that other people are skeptical of using blank nodes in linked data, but from a pragmatic point of view, that seemed to be the best solution.

Also, we have not found a web service frontend component that fulfills all of our requirements with reasonable setup effort. We're currently using ELDA (Epimorphics Linked Data API, http://github.com/epimorphics/elda). It provides nice features for adjusting the HTML view to the organization's corporate identity (using Apache Velocity templates) and is sufficiently easy to configure. But it does not support content negotiation natively; delivery of a localized rdfs:label for the properties to be shown (i.e. delivering the label with the respective language code), based on e.g. Accept‐Language headers in the HTTP request, does not work; and you cannot apply property selection to the data provided by the SPARQL endpoint (i.e. property selection by CONSTRUCT queries does not work as advertised in the documentation). We are therefore currently in the process of rolling our own linked data server implementation that supports all of these features while still being sufficiently easy to adapt and configure.
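
The blank‐node pattern for physical quantities described earlier in this answer can be sketched as follows. This is a minimal Python illustration that writes N‐Triples by hand; the example.org namespace, the property names value, unit and enginePower, and the helper quantity_triples are hypothetical and are not taken from the KTBL vocabulary.

```python
import itertools

XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/vocab#"  # hypothetical namespace

_bnode_ids = itertools.count(1)


def quantity_triples(subject, prop, value, unit_uri, unit_label, lang="en"):
    """Return N-Triples lines that attach a typed value and a labelled unit
    to `subject` through an intermediate blank node."""
    bnode = f"_:q{next(_bnode_ids)}"
    return [
        # The subject points at the blank node instead of a plain literal.
        f"<{subject}> <{prop}> {bnode} .",
        # The value keeps its XML Schema datatype.
        f'{bnode} <{EX}value> "{value}"^^<{XSD_DECIMAL}> .',
        # The unit is a resource of its own, so it can carry a label.
        f"{bnode} <{EX}unit> <{unit_uri}> .",
        f'<{unit_uri}> <{RDFS_LABEL}> "{unit_label}"@{lang} .',
    ]


lines = quantity_triples(
    subject="http://example.org/machine/tractor1",
    prop=EX + "enginePower",
    value="92.0",
    unit_uri=EX + "kilowatt",
    unit_label="kilowatt",
)
print("\n".join(lines))
```

The intermediate node is what turns the binary property into an n‐ary relation: value and unit stay separate statements, which a single typed literal could not express.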

The most pressing need in agriculture, in my opinion, is an increase in resource efficiency, i.e. using fewer inputs to produce higher outputs.

Chemical pest control is facing more and more problems, from being expensive in development and usage to resistance and so on. Mineral fertilizers might either run out at some time in the future or be washed out into the groundwater. So we have to learn how to produce more (or at least at the same level) with less of these inputs by applying them sparingly and in a very directed, controlled manner. Also, we have to think about how to put sparse sites into production and how to protect landscapes from degradation (e.g. prevent desertification), so that they can be used sustainably. As a consequence, the datasets needed would be (no claim for completeness):

‐ variety production site/regime suitability datasets, i. e. variety trial results

‐ pest spread datasets (real time)

‐ nutrient supply data sets

‐ base data on agricultural supplies

‐ ...

All in all, a higher degree of cooperation and openness to share data with each other would be required.

Integration of semantics in geographic information systems is certainly a question to be worked on, although in my opinion that problem is not that difficult to solve from a technical/scientific point of view: there are already geospatial vocabularies (OGC GeoSPARQL, also others) out there that would allow for proper representation. Tagging systems and object identification systems using URLs, like the ones used in OpenStreetMap, can relatively easily be linked to semantic information. It's thus more a matter of organizing resources and simply building a prototype/demonstrator to convince people than of having to solve hard representation problems or needing a large research project with accompanying overhead.

Future actions should in my opinion focus on doing: it is not as hard to get a proper semantically enabled service system up and running as it was, say, ten years ago. The basic technology is available. It is possible to build convincing systems if you focus on simplicity and reuse. The hardest part may be teaching the IT departments to think in globally distributed graphs rather than in central, relational databases, and making management recognize that information is much too valuable for the survival of the human race to be kept behind locked doors, and that there is still more than enough room for earning money with services and applications based upon open data instead of trying to make money with the data itself. Although the latter (unfortunately) seems to work for large corporations, I have never seen small to medium‐sized organizations succeed with such an approach.

Syngenta

A) About

Syngenta is one of the world's leading companies, with more than 28,000 employees in some 90 countries dedicated to our purpose: Bringing plant potential to life.

Derek Scuffell: R&D Data Strategist, Sensors Innovation Lead for R&D

B) Datasets maintained

The data sets that Syngenta maintains are captured in‐house as part of the Syngenta R&D innovation pipeline. They include:

‐ Small molecule bioactivity data for crop protection
‐ Genomic sequence and phenotype data
‐ Regulatory study information for crop protection and seed breeding development
‐ Environmental profile data, such as soil profiles, climate and weather

In addition, we interact with external datasets, typically those serviced by the European Bioinformatics Institute and similar public institutions. These data are serialized in RDBMSs and RDF.

Our Good Growth Plan data is published on the web in both CSV and RDF. We use different Creative Commons licenses when we publish our data, for example NonCommercial‐NoDerivatives, or Attribution‐ShareAlike, an open data license. Each year, we report our progress on all six commitments and provide detailed data and definitions in accordance with the best data practices. http://www.syngenta.com/global/corporate/en/GOODGROWTHPLANDATA/Pages/progress.aspx

C) Vocabularies maintained by your organization (2‐3 paragraphs)

Syngenta R&D maintains a vocabulary system, which hosts both public and internal reference terms. Public sources: FAO, obofoundry, bioportal, CGIAR

http://www.obofoundry.org/
https://sweet.jpl.nasa.gov/
http://www.qudt.org/
http://bioportal.bioontology.org/
http://www.w3.org/2005/Incubator/ssn/ssnx/ssn
http://www.w3.org/2004/02/skos/
http://www.w3.org/TR/vocab‐org/

D) Uses of your datasets and vocabularies (1‐2 paragraphs)

The table below shows an overview of general data use in Syngenta R&D.

Data Type | Roles using the data | Data Access for users
Small molecule bioactivity | Crop Protection research scientists and analysts | In‐house and OTS tools for data capture and analysis
Genomic sequence and phenotype data | Crop Protection and plant breeding research scientists and analysts, and plant breeders | In‐house and OTS tools for data capture and analysis
Regulatory study information for crop protection and seed breeding development | Crop Protection and plant breeding research scientists and analysts, and toxicologists | Mainly OTS tools for data capture and analysis
Environmental profile data, such as soil profiles, climate and weather | All researchers in R&D | This data is used across the whole data landscape and integrated with our phenotype data.

E) Vocabulary maintenance

Vocabularies are maintained in house using an in‐house application. A new vocabulary system based on RDF/OWL is due for delivery in 2015. It is based on a Virtuoso store and TopQuadrant EVN (http://www.topquadrant.com/products/topbraid‐enterprise‐vocabulary‐net/). The new system will provide URIs that can be used for linking between RDF resources and can be augmented by identity resolution services like Unichem (https://www.ebi.ac.uk/unichem/) and http://identifiers.org/.

F) Interoperability and future visions (no limit)

Semantics is the only way that the diversity and homogeneity of the agricultural systems can be reconciled to meet the analysis and development needs for duiut5e crop production. A semantic web approach is the only way that data producers and consumers can break free from the shackles of vendor lock‐in in order to bring together the vast amounts of diverse data that are needed to meet the analytical challenges of agriculture. There are three primary goals that we need to achieve:

1) Accept that data production for agriculture will never be carried out in a single way. There will always be pockets of activity. Our objective is to make sure that those pockets of data collection and activity can be joined together to form a web of data that can be reused for the benefit of food security.

2) Set a direction for how we are going to use metadata and the semantic web to “glue together” disparate activities that are creating and using valuable data assets. In particular, I would like to see GODAN step up to set this direction by engaging across all of the GODAN partners, to ensure we are able to exchange data interoperably across industry, NGOs, governments and academic research. This strategy needs to describe not just the data, but also the concepts that it represents and the provenance and qualities of the data that are of interest to a consumer of the data.

3) Set out a strategy for publishing and sharing data that ensures that access and use entitlement is either clearly stated or can be determined.

Agricultural Sustainability Institute (ASI) UC DAVIS

A) About

At the Agricultural Sustainability Institute at University of California, Davis, we educate the next generation of leaders through hands‐on experience at our 10 hectare Student Farm. We advance knowledge of water, energy, and soil management for our Century Experiment on the 120 hectare Russell Ranch Sustainable Agriculture Facility. We work with farmers to save them money on nitrogen fertilizer and, at the same time, to reduce pollution. ASI leads statewide efforts to understand how agricultural research, education, and extension can improve well‐being of farmworkers and food system workers. Additionally, we work with small‐scale and ethnic farmers to expand their markets in California as well as with some of the biggest food companies to ensure sustainable sources of raw materials across the planet.

Sustainable Sourcing of Agricultural Raw Materials

The “Sustainable Sourcing of Global Agricultural Raw Materials” project platform was initially funded as a gift to the Agricultural Sustainability Institute at UC Davis (ASI) by Mars, Incorporated. The goal is to promote the long‐term sustainability of the global food system by providing publicly available and scientifically validated information and tools to the broader food industry, using advances in information technology to measure and improve its sustainability. With this mandate, ASI brought in the Information Center for the Environment (ICE) at UC Davis for its informatics expertise. The project team did an intensive assessment of the issues important to sustainability that are directly related to the sourcing of agricultural raw materials (currently 44 issues are identified, comprising over 300 component issues), through use of AGROVOC, and of the indicators that can be used as measurement tools for these issues and how effective they are for different purposes (our database currently includes more than 2000 indicators).

The team built a prototype of a sophisticated semantic web informatics structure to manage the data and developed a process for quickly winnowing the expansive set of indicators to a manageable set (approximately 10‐20) that can be used to support a comprehensive understanding of sustainability in multiple dimensions for a given commodity in a given location. With additional support from Kraft Foods, Inc. and other collaborators, including a new partnership with the World Food System Center at ETH Zürich, the project is currently expanding and enhancing the underlying technical infrastructure and developing specific use cases with our partners.

Russell Ranch

The Russell Ranch Sustainable Agriculture Facility houses a 100‐year study referred to as the Century Experiment, formerly called Long Term Research in Agricultural Sustainability. Initiated in 1993, the Century Experiment, comprising 72 replicated 0.4 ha plots, is a critical member of the world‐wide set of long‐term agricultural research sites. The original experiment design of the Century Experiment varied the amount of water, nitrogen and carbon for ten different cropping systems on a gradient from a rainfed, unfertilized wheat/fallow system to an irrigated organic corn/tomato system. The cropping systems in the Century Experiment were designed to compare resource use efficiency, productivity, environmental effect (soil quality, movement of pollutants) and economic return from cropping systems that differ in crop rotation and degree of reliance on rainfall and fertilizer nitrogen. Core variables measured include crop yields, soil properties, weed populations, weather data and economic indicators.

Representatives

Ruthie Musker is the program representative for the Sustainable Sourcing project. She helped with the initial use of controlled vocabularies for the platform and continues to update the list of terms used. She is also managing the use of controlled vocabulary for a Century Experiment dataset at the Russell Ranch. Thomas Tomich is the founding director of the Agricultural Sustainability Institute and has emphasized the importance of controlled vocabularies throughout ASI’s numerous projects.

B) Datasets maintained by your organization (2‐3 paragraphs)

Sustainable Sourcing Project

The Agricultural Sustainability Institute has a large database of over 2000 indicators that are linked to controlled vocabulary terms. This information is hosted on a wiki. We are in the process of connecting these indicators to open‐access spatial and tabular datasets through our wiki. Our checklist generator, currently under development, is open source; its code repository is at https://github.com/adhollander/checklist.

http://asi.ice.ucdavis.edu/sustsource/wiki/index.php/Main_Page

Our sustainable sourcing wiki. If you click on “2000+ indicators” you can see all indicators and the issues we have attached to each. All “issues” are controlled vocabulary terms.

Our ontology and data instances are available as OWL and RDF files at http://asi.ice.ucdavis.edu/sustsource/schemas/sustsource.owl and http://asi.ice.ucdavis.edu/sustsource/schemas/sustsourceindiv.rdf.

Our intent is to make our datasets available as linked open data but we have not yet adopted a particular license for the data and also need to improve the quality of the formal representation of the data to be strictly formal linked open data.

Russell Ranch

Data for the Russell Ranch project includes the long‐term core data from the Century Experiment from 1993‐2014. The core data includes: crop yield, biomass and moisture data; cover crop biomass and moisture; crop elemental content; Normalized Difference Vegetation Index (NDVI) data; soil properties (bulk density, moisture, carbon, nitrogen, phosphorus, sulfur, potassium); weather; water elemental content; winter weed populations; and operational data including fertilizer and pesticide application amounts and dates, planting dates, planting quantity and crop variety, and harvest dates. In addition to the core data described here, a physical sample archive is maintained that contains the crop and biomass samples collected annually and soil samples collected every ten years (1993, 2003, 2012).

Data for the Russell Ranch study is hosted in an Access database (.accdb) and data files take the form of .csv files. Currently, the study is in preparation for publication in Ecological Archives; the data will become open‐access once published.

C) Vocabularies maintained by your organization (2‐3 paragraphs)

ASI has used AGROVOC, the Library of Congress Subject Headings, and the MARC 21 Code List for Geographic Areas as controlled vocabularies.

D) Uses of your datasets and vocabularies (1‐2 paragraphs)

To date, we have used our controlled vocabularies internally, primarily to help analyze our database linking environmental issues with indicators. Software we've used for this analysis has included Marxan (a heuristic optimization program common in conservation planning), Matlab, and R, all for undertaking the integer linear programming calculations at the heart of our analyses. We are providing these datasets as RDF and OWL files at the links given above, but we are unaware of consumers of these vocabularies other than ourselves.

E) Vocabulary maintenance (1‐2 paragraphs)

To develop and maintain these vocabularies we have used a combination of Google Spreadsheets, a Django‐fronted web database, Protege, and a wiki developed using the Semantic MediaWiki platform.

F) Interoperability and future visions

Problems: Currently, ASI is building a partnership with Esri, an online GIS supplier, to link our indicators with their maps. Esri manages a lot of datamaps, but does not use a controlled vocabulary. We hesitate to justify the links between some of our indicators and datasets because our two groups may be interpreting the information in different ways. If Esri used the controlled vocabulary, the link between our two datasets would be incredibly useful. In this case, the labor to convert Esri’s names of datasets and maps to a controlled vocabulary is the bottleneck.

Uses: It is perhaps not datasets that are needed but the usability of datasets. Is it up to the user to be conscious of choosing a platform that uses a controlled vocabulary? Or is it up to the platform‐creators to manage and agree on the usage of controlled vocabularies? We envision continuously updating our database with indicators and linking these indicators to controlled vocabularies. As new issues emerge, we will continue to add them and link them to others that already exist. Hopefully, we could also link to other users who are using the same terms.

We hope to keep a strong relationship with AGROVOC to help update the thesaurus as we find these new issues. Another use of controlled vocabularies would be through the application to certification standards. This would be very useful for consumers and large food companies to maintain understanding of key terms.

Priorities: A main priority is awareness of controlled vocabulary. Many who work in food systems do not even know that controlled vocabularies exist! Ease of convertibility from one vocabulary to another would also be very useful (e.g. “people often search for ‘x’; the term that AGROVOC uses instead is ‘y’”). Publicizing success stories of controlled vocabulary usage will help motivate groups without a controlled vocabulary to integrate it into their methodology.

USDA Agricultural Research Service (ARS)

A) About

The Agricultural Research Service (ARS) is the U.S. Department of Agriculture's chief scientific in‐house research agency. Our job is finding solutions to agricultural problems that affect Americans every day from field to table. Here are a few numbers to illustrate the scope of our organization:

‐ 750 research projects within 17 National Programs
‐ 2000 scientists and post docs
‐ 6,000 other employees
‐ 90+ research locations, including overseas laboratories
‐ $1.1 billion fiscal year budget

My own research deals with using simulation modeling and related tools to understand potential impacts of climate change and adaptation to climate. Access to high quality datasets is key to such work, so I have long supported efforts to coordinate how field trial data are organized.

B) Datasets maintained by your organization (2‐3 paragraphs)

USDA ARS publishes a wide range of data through on‐line resources. My efforts have mainly been through three initiatives that provide data access:

‐ GRACEnet: http://www.ars.usda.gov/research/programs/programs.htm?np_code=212&docid=21223
‐ AgMIP: agmip.org
‐ DSSAT: http://dssat.net/data/exchange

I don’t understand the reference to language – human or machine. Our interfaces and terminology are in English. We use multiple digital formats. None of these are open data, and given the complexity of the datasets, we see open data as a low priority compared to improving data organization and data acquisition. Data from ARS is US Federal data and has no protections except for proprietary rights for a reasonable period prior to publication and where security or specifically recognized commercial interests are a concern.

C) Vocabularies maintained by your organization

AgMIP and DSSAT are working from the ICASA dictionary, which is maintained as a Google spreadsheet at https://docs.google.com/spreadsheets/d/1MYx1ukUsCAM1pcixbVQSu49NU‐LfXg‐Dtt‐ncLBzGAM/pub?output=html. We have done some preliminary work to map variables to CropOntology.org.

D) Uses of your datasets and vocabularies

Our datasets are primarily used in crop simulation modeling and general meta‐analyses. AgMIP is an especially eager consumer of data for climate change scenario studies. Not on smartphones…

E) Vocabulary maintenance

We currently use Google spreadsheets for routine maintenance of the ICASA dictionary. For major checking and revisions, I have used SAS. We would be very interested in a shared resource to host and maintain the ICASA dictionary.

F) Interoperability and future visions

I would start with harmonization of vocabularies and, as this advances, improve tools to help people record and manage data. From there, it will be much easier to move data into queryable/discoverable data resources. I see ontologizing vocabularies and seeking linked open data as low priorities given the pressing needs for quality data from field trials.

USDA National Agricultural Library (NAL)

Institution: Agricultural Research Service, United States Department of Agriculture

A) About

ABOUT ARS and NAL: The Agricultural Research Service (ARS) is the United States Department of Agriculture’s (USDA) principal in‐house research agency. ARS works to ensure that Americans have reliable, adequate supplies of high‐quality food and other agricultural products. ARS accomplishes its goals through scientific discoveries that help solve problems in crop and livestock production and protection, human nutrition and the interaction of agriculture and the environment. Within ARS, the National Agricultural Library (NAL) is one of the four national libraries of the United States and houses one of the world’s largest collections devoted to agriculture and its related sciences. The mission of NAL is to collect, organize, preserve and provide access to global agricultural information.

PARTICIPANTS: Dr. Simon Y. Liu is Associate Administrator for Operations and Management at ARS, USDA; he leads, manages, and is responsible for all aspects of ARS’s research operations. Simon recently served as Director of the National Agricultural Library, and was a principal collaborator in the Global Agricultural Concept Scheme. Previously, he was responsible for the Unified Medical Language System, which facilitates the development of computer systems that behave as if they “understand” the language of biomedicine and health. Lori Finch is Chief, Indexing and Informatics Branch at NAL, and is responsible for the automated indexing of agricultural literature and the NAL Agricultural Thesaurus. Lori serves on the Working Group for GACS.
B) Datasets maintained by your organization

AGRICOLA and PubAg
URL: http://agricola.nal.usda.gov and http://pubag.nal.usda.gov
Description: AGRICOLA (AGRICultural OnLine Access) is a bibliographic database that serves as the catalog and index to the collections of the National Agricultural Library, as well as a primary public source for world‐wide access to agricultural information. The database covers materials in all formats and periods, including printed works from as far back as the 15th century. PubAg is a portal to USDA‐authored and other highly relevant agricultural research. PubAg contains full‐text articles relevant to the agricultural sciences, along with citations to peer‐reviewed journal articles. NAL Thesaurus is the controlled vocabulary.
Languages: primarily English
Audiences: All 6 user groups listed in “D”
Open data: yes, CC0
Applications: Fedora repository

Life Cycle Assessment Commons (LCA Commons)
URL: http://www.lcacommons.gov
Description: The LCA Commons provides open access to life cycle assessment (LCA) data sets and tools. The project makes North American agricultural data more accessible to the community of researchers, policy‐makers, industry process engineers, and LCA practitioners.
Language: English
Audiences: Researchers, Agribusiness Executive
Open data: yes
Applications: uses Open LCA Framework and in‐house app for web delivery

Insect 5,000 Genome “i5K”
URL: https://i5k.nal.usda.gov/
Description: The Insect 5,000 genome (i5K) project is an international effort to sequence the genomes of insects of importance to agriculture and human health. NAL’s i5K project mission is to support applied genomics by providing tools and resources to help scientists identify protein coding genes and other features in otherwise raw genome assemblies.
Language: English
Audiences: Researchers
Open data: yes
Applications: i5K workspace is built on Drupal, Tripal, and Chado and feeds data to Web Apollo for community gene annotation.

Long‐Term Agro‐ecosystem Research (LTAR)
URL: not available until October 1, 2015
Description: The vision for ARS’ LTAR network is: “Transdisciplinary science conducted over decades on the land in different regions, geographically scalable, enhancing the sustainability of agro‐ecosystem goods and services.” Eighteen sites from across the U.S. participate in the network. Data is at the heart of the LTAR initiative. The National Agricultural Library (NAL) is a data management partner in the LTAR network. In the first phase of the project, NAL will build a database for near‐real time meteorological data; this data is geospatial and will be displayed appropriately. As the overall initiative progresses, NAL will also manage data from the common agricultural management experiments and other data as determined by the network. Public access to near real‐time meteorological data is expected by October 1, 2015.
Language: English
Audiences: most likely Researchers
Open data: yes
Vocabulary: Uses Global Change Master Directory

Ag Data Commons
URL: http://data.nal.usda.gov
Description: The Ag Data Commons is a hybrid repository/catalog that leverages and links existing repositories while providing a platform for datasets that require specialized metadata/search, visualizations or analytics, or don’t otherwise have a home that can ensure long‐term access, stewardship and preservation. A prototype is now available with 46 datasets open to public access. Additional datasets are in the pipeline. System functionality currently deployed and/or planned includes: ingesting, organizing, managing, disseminating, and enabling compliance enforcement for open government‐funded agricultural research data.
Language: English
Audiences: Researchers
Open data: yes
Applications: is built on DKAN and is customized for scientific data.

National Nutrient Database for Standard Reference
URL: http://ndb.nal.usda.gov
Description: Standard Reference for food composition. The database consists of several sets of data: food descriptions, nutrients, weights and measures, footnotes, and sources of data.
Language: English
Audiences: Researchers, Citizens, Agribusiness
Open data: yes
Applications: Dietary Reference Intake calculator App for Apple and Android devices

C) Vocabularies maintained by your organization

NAL Agricultural Thesaurus – URL: http://agclass.nal.usda.gov. NALT is published in English/Spanish parallel versions in cooperation with the Inter‐American Institute for Cooperation on Agriculture (IICA). The thesaurus has been annually updated since 2002 and is available for public searching. It can be downloaded in multiple formats (SKOS, XML, PDF) from the thesaurus website. The 2015 edition contains over 98,000 terms. NALT is part of the Global Agricultural Concept Scheme Demo, which links NALT, CAB Thesaurus and AGROVOC terminologies. Previous mappings with AGROVOC and GEMET were done in 2006 and 2007. NALT is under a Creative Commons license CC0 Public Domain Dedication.

D) Uses of your datasets and vocabularies Response to this question is included in response to “B” for each dataset. Generally, 6 user groups (personas) are used for NAL products: Researcher, Policy Maker, Small Producer, Information Professional, Agribusiness Executive and Citizen. E) Vocabulary maintenance NAL uses MultiTes for thesaurus maintenance and Web Development Kit for web delivery. In‐house application enables the delivery of RDF for each skos:Concept URI. F) Interoperability and future visions To meet the desired ends of semantic interoperability (exchange of data with common meaning between sender and receiver) among disparate datasets or information systems, we need:

1) An indexing vocabulary with richer semantics, such as what was done with the Unified Medical Language System for the medical community,
2) More research on semantic search that will take full advantage of these semantic vocabularies, and
3) Tools to implement and integrate semantic vocabularies in search, to “ease” search and query for seekers of agricultural information and data.

Problems that need our attention:

1) The search engines of most systems do not support natural language processing of queries,
2) The language of the search (e.g. English) may not match the language of a relevant resource (e.g. Chinese),
3) Resources found may need further translation into a language that is understood by the searcher,
4) The search engines of most systems do not “understand” the meaning of the search; that is, search is word‐based, not concept‐based, and so does not give targeted results,
5) Synonyms and spelling variants (such as British English vs. American English spelling) are not included in search automatically, and so relevant information resources are missed,
6) Search engines do not exploit hierarchical structures to enable the searcher to broaden or narrow search queries,
7) Homographs are not disambiguated in search, and so search does not give targeted results,
8) The searcher may only know the common term (e.g., “cancer” or “E. coli”) and may not be familiar with the technical jargon (e.g., “neoplasms” or “Escherichia coli”), and the common term may or may not match what is in the indexing language,
9) The searcher can only query one information system at a time and needs to customize the query to each separate information system,
10) Indexes can be incomplete or lacking detail due to lack of specificity in the indexing language,
11) Data sets may not have adequate description,
12) Searchers are frustrated with finding the “right” dataset among many,
13) Fear of data misuse is voiced among researchers who share data,
14) Our data folks note that most searchers they observed used “author names” to find the data they wished, since they already “knew” who was doing the work.

How can we make datasets more transparent to their use and context?

To solve these problems, these tools or actions need to evolve:

1) Natural language search engines that “understand” the meaning of the search,
2) Automatic language translation within information systems for queries and for resources,
3) Indexing vocabulary that has richer relationships,
4) Search engines that take advantage of semantic relationships,
5) Shared vocabulary across disparate information systems or datasets,
6) More research on semantic search and semantic tools.
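As an illustration of what a search engine that takes advantage of semantic relationships might do, the following minimal Python sketch expands a user's term into its preferred label, its synonyms and, optionally, its narrower terms before searching. The tiny thesaurus, the `Concept` class and the `expand_query` function are all hypothetical and invented for this example; a real system would load a published vocabulary instead.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One entry in a toy thesaurus (all data here is illustrative)."""
    preferred: str
    synonyms: list = field(default_factory=list)
    narrower: list = field(default_factory=list)  # keys of narrower concepts


# Hypothetical mini-thesaurus keyed by preferred label.
THESAURUS = {
    "neoplasms": Concept("neoplasms", synonyms=["cancer", "tumours", "tumors"]),
    "escherichia coli": Concept("escherichia coli", synonyms=["e. coli"]),
    "cereals": Concept("cereals", synonyms=["grain crops"], narrower=["maize", "wheat"]),
    "maize": Concept("maize", synonyms=["corn"]),
    "wheat": Concept("wheat"),
}

# Reverse index: any label (preferred or synonym) -> key of its concept.
LABEL_INDEX = {}
for key, concept in THESAURUS.items():
    LABEL_INDEX[concept.preferred] = key
    for synonym in concept.synonyms:
        LABEL_INDEX[synonym] = key


def expand_query(term, include_narrower=False):
    """Return the sorted list of labels to search for a single user term."""
    cleaned = term.strip().lower()
    start = LABEL_INDEX.get(cleaned)
    if start is None:
        # Unknown term: fall back to plain word-based search.
        return [cleaned]
    labels, visited, stack = set(), set(), [start]
    while stack:
        key = stack.pop()
        if key in visited:
            continue
        visited.add(key)
        concept = THESAURUS[key]
        labels.add(concept.preferred)
        labels.update(concept.synonyms)
        if include_narrower:
            # Narrowing the hierarchy pulls in child concepts as well.
            stack.extend(concept.narrower)
    return sorted(labels)
```

With this toy data, `expand_query("cancer")` returns `['cancer', 'neoplasms', 'tumors', 'tumours']`, so a searcher who only knows the common term still reaches records indexed under the technical one, and `expand_query("Cereals", include_narrower=True)` also includes maize, corn and wheat.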

As a provider of a controlled vocabulary, we desire to create a common vocabulary for agriculture, as we see this as an advantage to our organization and also to patrons of our information systems:

1) To gain linkages in other languages beyond English and Spanish.
2) To add relationships that will enrich the semantics of the vocabulary.
3) To gain the advantage of distributed work load among partners.
4) To find methods with partners to keep taxonomic data up‐to‐date and reduce the amount of resources doing this task.
5) To use this information to improve our automated indexing tools for entity and relationship extraction.
6) To find an infrastructure, governance and quality control policy that will enable the vocabulary to be extended 24/7 by many people with different language expertise from all over the globe.


Wageningen UR Library (Wageningen UR Lib)

A) About

Wageningen UR Library does the usual things that university libraries do. We also maintain the research information system for Wageningen UR (http://library.wur.nl/WebQuery/wurpubs/show), and on the basis of that we do bibliometric analyses. We maintain (as part of the output registration) a database of published datasets (http://library.wur.nl/WebQuery/wurpubs?A240=OK&wq_inf_pre=pluspre&wq_rel=AND&wurpublikatie/publikatietype/hoofdtype==Dataset), and we maintain databases for Dutch vocational training (http://www.groenkennisnet.nl/nl/groenkennisnet.htm). We have an emphasis on research data management planning (see http://www.Wageningenur.nl/en/Expertise‐Services/Data‐Management‐Support‐Hub.htm) and give a course on the subject (http://datamanagementplancourse.pbworks.com/), primarily for PhD researchers. Hugo Besemer’s group is primarily involved with bibliometric assessments and research data management.

B) Datasets maintained

See above.

C) Vocabularies maintained

We used to maintain the Dutch version of the CAB Thesaurus, but I do not expect that we will be able to continue (see last set of questions). We maintain a subject category tree that may be useful elsewhere; it is on the left side of our resource browser, http://library.wur.nl/WebQuery/catbrowser/journal?recordtype=seriewerk%20OR%20monografie%20OR%20deel%20OR%20koepel (“Filter by subject”).

D) Uses of datasets and vocabularies

Archived datasets are meant to be consumed by research peers for verification and re‐use.

E) Vocabulary maintenance

We use a locally developed thesaurus maintenance tool that we will probably discontinue when we move to WorldCat.

F) Interoperability and future visions (no limit)

Libraries are shifting their bias from bringing information from outside in, to bringing the organisation’s output out into external systems. Resource discovery by our internal users happens in the cloud, in systems that we do not control. Therefore we can no longer afford to maintain vocabularies for resource discovery.

What do you see as the most pressing needs in agriculture for the coming decade? What sort of datasets are needed, and what sorts of vocabularies are needed to support access to and use of those datasets? Do particular areas need to be strengthened, such as integration of semantics in geographic information systems?

Hard to speak on behalf of agriculture as a whole.

What priorities should the organizations represented at the workshop set for future actions? Are you aware of, or involved in, other relevant projects or initiatives in related areas? In what direction should we try to head over the coming decade?


Re: new projects that we should try to get on board: there are text / data mining projects like https://www.esciencecenter.nl/project/prediction‐of‐candidate‐genes‐for‐traits‐using‐interoperable‐genome‐annotat and https://www.esciencecenter.nl/project/creation‐of‐food‐specific‐ontologies‐for‐food‐focused‐text‐mining

I think that for the coming decade linking different models (climate models, crop growth models, etc.) will be an important challenge. Genomic / phenotype / germplasm information may also be a challenge. In general I think that the resource discovery world (that is, those maintaining thesauri like AGROVOC and the CAB Thesaurus) and specific communities that do their own standardization, like ICASA did for agronomic trials (http://research.agmip.org/display/dev/ICASA+Master+Variable+List)


Wageningen UR Alterra (Alterra)

A) About your organization (1‐2 paragraphs)

I am representing Alterra, part of Wageningen UR, and the AgMIP network. Alterra is a research institute in the Netherlands that is developing into a knowledge brokerage organisation. Alterra offers a combination of practical, innovative and interdisciplinary scientific research across many disciplines related to the green world around us and the sustainable use of our living environment. Aspects of our environment that Alterra focuses on include soil, water, the atmosphere, the landscape and biodiversity ‒ on a global scale as well as regionally, from the Dutch polders to the Himalayas and from Amsterdam to the Arctic.

AgMIP is the Agricultural Model Intercomparison and Improvement Project, which was formed in 2010 as a global network of agricultural modellers to provide better evidence of the impacts of climate change and food security. By now 700 researchers from across the globe have signed on, working on different aspects of agricultural systems modelling, which includes elements of regional and global integrated assessment of climate change impacts, modelling of sustainable farming systems, and designing the next generation of agricultural models and data products.

B) Datasets maintained by your organization (2‐3 paragraphs)

Alterra manages different data sets and websites targeted at data visualization:

Land use Netherlands, versions 1 – 7
Object Height Netherlands, height of objects (buildings, trees, etc.) in the landscape
www.groenmonitor.nl: processed satellite images of the Netherlands, with some key indicators
Bodem Informatie Nederland (Soil Information for the Netherlands), together with TNO
www.yieldgap.org, together with partners, on yield gaps across Europe
Spatially explicit geo‐enabled clearinghouse for forestry data, with partners, from the Trees4Future project, with some linked data components
Historic Weather Viewer, together with MeteoGroup

AgMIP has different data sets of relevance, and is building up its infrastructure to disclose these. Most advanced is the infrastructure for inputs and outputs of crop models in the ACE database, with links to AgTrials (CGIAR) and USDA; see data.agmip.org.

Alterra and Wageningen UR are involved in the establishment of the Open Data Journal for Agricultural Research as a new, peer‐reviewed publication avenue for data sets from agricultural research.

C) Vocabularies maintained by your organization (2‐3 paragraphs)

Alterra: From history, we (Alterra) have the Seamless ontologies, made together with Ioannis Athanasiadis, and these are only available. From the Trees4Future project, there is a vocabulary structure developed, that is linked to some ontologies, and on the meta‐data level. It is described in an MTSR paper.


We are managing, as a national centre, some of the data structures on nature and biodiversity, although they have not been fully developed as vocabularies.

AgMIP: There is the ICASA variable list, which is a data dictionary with relevant descriptors of variables, mostly for crop modelling.

D) Uses of your datasets and vocabularies

Alterra: We used most of the vocabularies to link data sets to models, or to make data sets more discoverable online. The users are mostly researchers with a domain background and some developers who have an interest in developing these services. We used some vocabularies to discover more data and information on a meta‐data level for websites for the general public or practitioners in a relevant domain.

For the datasets, we made many different applications for all sorts of audiences, with smartphone apps as part of the products. We worked with SMEs delivering services, government departments (of national government, EC), large private companies, and development projects to share data. These platforms are either for data visualization, data analysis, or collecting more data in crowdsourcing applications. There are too many to list. Note that the semantically enabled applications are in the minority, and we are still discovering, as part of research projects, the added value of such applications.

AgMIP: data sets and vocabularies are being used to connect models to data.

E) Vocabulary maintenance

Alterra: we used Protégé, but also other tools for editing. We don’t have very formal processes for updating the vocabularies.

F) Interoperability and future visions

My vision: Semantics could be the glue for improved discovery of data for use in applications. Ideally it should provide the means for any user to get data without understanding all the complexities. Ideally we would have a semantically enabled Google, where you can just write queries like: ‘provide me with the data of the warmest month in South of the Netherlands, for those regions that have the majority of land in potatoes’. The data analytical possibilities would become infinite in combining all sorts of data, thereby offering much better possibilities to analyze the relationships in agriculture (cause‐effect), and to develop new services for farmers, processors, food chains and policy makers. To some extent, the question of agro‐semantics is thus tied intimately to creating the big‐data reservoirs for the future.

However, we are still very far from this situation, and I think the next decade should give us some of the building blocks to get there. At the moment, as mentioned above, we use very few semantic tools in our applications, so the usefulness is not that high. The obstacles as we experience them, mostly from being a consumer of data in applications and combining it with different data sets, are:

1. Vocabularies are around, but only cover (small) parts of the spectrum and are quite intimately linked to their original purpose. For example, Agrovoc is mainly focused on describing meta‐data for literature, while the AgMIP ICASA variable list is mainly targeting crop models. We tested their broader and more integrated application, and this proved very difficult ‐‐> Proposed step: we need some bridging way to link vocabularies one to another, creating the whole. This should not happen through designing the super‐vocabulary, but by aliasing the different concepts across the vocabularies and discovering where elements are missing (white spots).

2. Vocabularies are not yet focused on raw quantitative data, but mostly on the meta‐data level. In our experience the strongest use of data is in applications that analyse them as part of monitoring or decision‐making processes; however, this part is not covered by the vocabularies, which could facilitate it. ‐‐> Proposed step: we need ways to better describe, and relationally describe, the variables in (commonly used) data sets for some high‐profile real‐world problems, and to design tutorials for others to use these standardized variable descriptors.

3. Vocabularies are not frequently used in applications, and are mostly in the research sphere, except for bibliographic information. They need to have a stronger link to some real‐world problem and demonstrate their added value in developing things more efficiently and increasing the common good ‐‐> Proposed step: develop some vocabulary‐enabled applications for real‐world problems. We think of agricultural monitoring in relation to GeoGLAM, as this is a strong initiative already underway, where the semantics could help to link agricultural monitoring information to relevant knowledge on problems (drought, excess rain, shortfalls in production) or plant/animal diseases, and rapidly build the information base in case of incidents for governments and the supply chain.
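The aliasing approach proposed in step 1 can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the two vocabularies, all concept identifiers and the alias pairs below are invented for illustration and are not taken from Agrovoc or the ICASA variable list. The idea is that an alias table links concepts across vocabularies without merging them, and that concepts with no alias show up as the “white spots”.

```python
# Two illustrative vocabularies (all identifiers are invented for this sketch).
LITERATURE_VOCAB = {"lit:rainfall", "lit:air_temperature", "lit:soil_ph", "lit:crop_yield"}
MODEL_VOCAB = {"mod:RAIN", "mod:TMAX", "mod:TMIN", "mod:YIELD"}

# Alias table: pairs of concepts asserted to mean (roughly) the same thing.
ALIASES = [
    ("lit:rainfall", "mod:RAIN"),
    ("lit:air_temperature", "mod:TMAX"),
    ("lit:air_temperature", "mod:TMIN"),
    ("lit:crop_yield", "mod:YIELD"),
]


def build_alias_index(aliases):
    """Index the alias pairs in both directions."""
    index = {}
    for left, right in aliases:
        index.setdefault(left, set()).add(right)
        index.setdefault(right, set()).add(left)
    return index


def translate(concept, alias_index):
    """Return the sorted aliases of a concept in the other vocabulary."""
    return sorted(alias_index.get(concept, set()))


def white_spots(vocabulary, alias_index):
    """Return concepts of a vocabulary that have no alias anywhere."""
    return sorted(c for c in vocabulary if c not in alias_index)


ALIAS_INDEX = build_alias_index(ALIASES)
```

Here `translate("lit:air_temperature", ALIAS_INDEX)` returns `['mod:TMAX', 'mod:TMIN']`, and `white_spots(LITERATURE_VOCAB, ALIAS_INDEX)` returns `['lit:soil_ph']`, the one concept that nothing in the other vocabulary covers.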

Ultimately, in our view this should lead to a sort of semantically enabled data cube of frequently used data sources that is available as a public good, that others can use as a basis for their analysis and development of applications, and to which they are incentivized to contribute their data to further extend the data cube. There are many smaller problems along the way, such as poor semantics of many data sets, lack of meta‐data elements, and a lack of perspective on what descriptors enable users to make a good assessment of the usefulness of data for application purposes. Furthermore, a lot of the applications stay in one discipline, so discovery across disciplinary boundaries would be a huge step forward.


U Aston

Christopher Brewster

A) About your organization (1‐2 paragraphs)

Operations and Information Management group, Aston Business School, Aston University, Birmingham, UK

Aston University (AST) is a long‐established research‐led university known for its world‐class teaching quality and strong links to industry, government and commerce. Aston Business School (ABS) is a triple‐accredited business school (AMBA, EQUIS and AACSB), and has been ranked in the top 10 in the UK in the Eduniversal rankings for the 3rd year running in 2013. ABS works in collaboration with industry, governments and the academic community to produce new research initiatives, with a particular focus on applied research that contributes to sustainable economic growth and development. The Business School has over 140 faculty members teaching and researching across all areas of business.

The Operations and Information Management group within the Business School has a world‐class reputation in systems modelling and simulation, supply chain management, knowledge management, technology and operations management. The group is organised into four research teams concerning Business Analytics, Operations Management, Global ICT Management, and Systems Modelling and Simulation. It is a highly interdisciplinary group and has a growing body of research concerning agrifood, sustainability and the role of ICT in supply chains and operations.

B) Datasets maintained by your organization (2‐3 paragraphs)

N/A

C) Vocabularies maintained by your organization (2‐3 paragraphs)

As a result of the FIspace project, we have formalised the GS1 EPCIS standard in the form of two ontologies:

The EPCIS Event Model ‐ http://fispace.aston.ac.uk/ontologies/eem.html#
The Core Business Vocabulary ‐ http://fispace.aston.ac.uk/ontologies/cbv.html#

Furthermore, there is a mapping between the W3C PROV namespace (http://www.w3.org/ns/prov#) and the EPCIS Event Model, available here: http://fispace.aston.ac.uk/ontologies/eem_prov.html#

D) Uses of your datasets and vocabularies (1‐2 paragraphs)

The ontologies mentioned above are intended to be used for the agri‐food supply chain. There is a functioning Java library for the generation of RDF triples from RFID/barcode events. Currently there has been no commercial uptake.
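To give a feel for what generating RDF triples from a barcode event involves, here is a minimal sketch in Python (not the Java library mentioned above, whose API is not described here). It serialises one event as N‐Triples lines. Several things are assumptions made for illustration: the ontology URL from the text is used verbatim as a namespace prefix, the term names `ObjectEvent`, `eventOccurredAt`, `action` and `associatedWithEPC` are hypothetical placeholders rather than confirmed terms of the EPCIS Event Model, and the event identifier, base IRI and EPC value are made up.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
# Assumption: the ontology URL quoted in the text, used here as a namespace prefix.
EEM = "http://fispace.aston.ac.uk/ontologies/eem.html#"
# Hypothetical base IRI for minting event identifiers.
EVENT_BASE = "http://example.org/events/"


def iri(value):
    """Wrap a string as an N-Triples IRI term."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Build an N-Triples literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    text = f'"{escaped}"'
    return f"{text}^^{iri(datatype)}" if datatype else text


def event_to_ntriples(event):
    """Turn one scanned event (a plain dict) into a list of N-Triples lines."""
    subject = iri(EVENT_BASE + event["id"])
    triples = [
        # The class and property names below are hypothetical placeholders.
        (subject, iri(RDF_TYPE), iri(EEM + "ObjectEvent")),
        (subject, iri(EEM + "eventOccurredAt"), literal(event["time"], XSD_DATETIME)),
        (subject, iri(EEM + "action"), literal(event["action"])),
    ]
    for epc in event["epcs"]:
        triples.append((subject, iri(EEM + "associatedWithEPC"), iri(epc)))
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Illustrative event, as it might come from a barcode reader.
SAMPLE_EVENT = {
    "id": "e1",
    "time": "2015-06-01T09:30:00Z",
    "action": "OBSERVE",
    "epcs": ["urn:epc:id:sgtin:0614141.107346.2017"],
}

if __name__ == "__main__":
    print("\n".join(event_to_ntriples(SAMPLE_EVENT)))
```

Writing the triples as plain N‐Triples strings keeps the sketch free of dependencies; in practice an RDF library would normally handle serialisation and escaping.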


E) Vocabulary maintenance (1‐2 paragraphs)

N/A

F) Interoperability and future visions (no limit)

The problems we are interested in concern how to achieve greater interoperability in the food supply chain. We believe that greater use of semantic technologies and standardised vocabularies is essential for better integrated agri‐food supply chains.

There are a great many different types of actors in the agri‐food supply chain, which makes it all the more complex to comprehend: not just different types of farmers (large, small, etc.) but also a multiplicity of associated actors all along the supply chain, down to different types of retailers and consumers. One of the key consequences of this number and heterogeneity of actors is the very poor information flow which exists along the supply chain. This is compounded by a very conservative “need‐to‐know” attitude, such that information essentially flows only “one up, one down”. Thus, for example, the farmer might communicate with the wholesaler or food processor but not directly with the retailer. The retailer communicates with the consumer and wholesaler but (typically) few other actors.

This lack of information flow has been “solved” so far by a combination of government or EC level regulation (food standards, health and safety) and third‐party certification (organic food certification bodies, GlobalGAP, etc.). Although there is a very large number of such bodies and regulations, the overall result has been a series of either/or categories, i.e. either food is safe or not, either it is organic or not, either it is fairtrade or not, with a corresponding dearth of numerical values. No information is available as to how much water was used to produce a pint of beer, and even the ingredients on packaged goods are listed in order of quantity but without details of how much.

The lack of information has been recognised as a critical issue for a long time in the agri‐food sector, expressed partly in the need for greater transparency, but also in the importance given to tracking and tracing of foods in the context of health and safety, in order both to prevent and to respond to food emergencies (mad cow disease, and most recently E. coli). Another major factor is the growing desire on the part of food consumers to know more about their food, a desire for greater food awareness. However, “the complexities in reaching transparency are due to complexities in products and processes but also due to the dynamically changing open network organization of the food sector with its multitude of SMEs, its cultural diversity, its differences in expectations, its differences in the ability to serve transparency needs, and its lack of consistent appropriate institutional infrastructure that could support coordinated initiatives towards higher levels of transparency throughout the food value chain” (SmartAgriFood internal document 2011).


The natural expectation would be that IT offerings in the agri‐food area would attempt to solve these problems. For a variety of reasons the overall sector has remained very conservative with respect to technology. While supermarket chains have very sophisticated IT systems, these do not connect up very well with logistics, let alone with farm producers. Equally, while much farm machinery includes computers in one form or another, machinery, hardware and software have tended to be sold together as a unit and thus offer little interoperability. Furthermore, given the complexity and variety of actors, any solutions would have to be extremely flexible and scalable to achieve a significant degree of penetration. As Martini et al. note, “in practice the principle architecture of information systems in agriculture has roughly stayed the same over the years. ... With rigid process models, it is often impossible to adapt flexibly and quickly enough to new situations.” (Martini et al., 2010b).

The end result of this is that attempts by companies to provide overarching IT‐based solutions have failed so far. There are a considerable number of tracking and tracing IT solutions available on the market (for a survey cf. Martini et al. (2010a)), but these have been successful only in specific sub‐areas or geographical regions. Furthermore, on the one hand, there appears to be a great lack of adherence to international standards such as EPC/GS1, and on the other there is a total absence of “semantic harmonisation”, i.e. adherence to some international standards in the description of the data (e.g. one or more ontologies). In the agri‐food sector, information and knowledge in any form is a “high cost” item, i.e. costly to gather by any actor, costly to communicate, costly to certify and thereby provide assurance to subsequent actors on the supply chain, and costly to deliver to the information user.

In summary, we can describe the agri‐food system as a highly heterogeneous, loosely coupled, large‐scale network of entities with variable but largely minimal degrees of communication and trust between the actors. It is in this context that we believe semantic technologies, especially the Linked Data paradigm, as a flexible, distributed and open system, can provide the technological underpinnings for a gradual increase in communication and a corresponding flow of information and knowledge through the system. Semantic technologies integrate well with the growing use of sensors (for example in precision farming and agri‐logistics) and the growing availability of data from third parties (for example concerning environmental impact). The point of using semantic technologies is to reduce the effort and cost of making information available across the supply chain.

cf. Monika Solanki and Christopher Brewster. Enhancing visibility in EPCIS governing Agri‐food Supply Chains via Linked Pedigrees, International Journal on Semantic Web and Information Systems, Volume 10, Issue 3, 2015. http://windermere.aston.ac.uk/~solankm2/papers/Enhancing-Visibility-in-EPCIS-Governing-Agri-Food-Supply-Chains-via-Linked-Pedigrees.pdf


Cornell

John Fereira

A) About your organization (1‐2 paragraphs)

I work at Cornell University Albert R. Mann Library in the Information Technology Services department. I am a programmer/analyst/technology strategist with a focus on Agriculture Information Systems and Semantic Web technologies.

B) Datasets maintained by your organization (2‐3 paragraphs)

Our department manages a dataset called CUGIR. CUGIR is an active online repository in the National Spatial Data Clearinghouse program. CUGIR provides geospatial data and metadata for New York State, with special emphasis on those natural features relevant to agriculture, ecology, natural resources, and human‐environment interactions. It is only available in English and is not published as Linked Data.

C) Vocabularies maintained by your organization (2‐3 paragraphs)

We do not maintain any agriculture‐specific vocabularies. However, the VIVO semantic web application includes the VIVO ontology, which is primarily used for representing person, department, and publication information across multiple domains. The application also makes use of many other ontologies such as FOAF, SKOS, BIBO, the UN geopolitical ontology, and many more.

D) Uses of your datasets and vocabularies (1‐2 paragraphs)

My perspective for the workshop is more as a dataset consumer than a dataset provider. I developed a portion of the External Concept Service in the VIVO application. This provides an integration layer for external datasets such as Agrovoc, MeSH, and GEMET, and supports lookup and assignment of concepts from these vocabularies for indicating research areas in a user's profile.

I also created a wrapper service that uses the same underlying code used by the AgroTagger for autotagging approximately 480 thousand research articles in the TEEAL project and have just started using it for mapping freetext expertise and area of interest terms to Agrovoc concepts.

E) Vocabulary maintenance (1‐2 paragraphs)


Although we did not develop the vocabulary, I am using VIVO to maintain and edit a version of the ONLD (Organization Name Linked Data) dataset for disambiguating organization names provided from multiple data sources providing user profile data.

F) Interoperability and future visions (no limit)

While Agrovoc and GACS can provide a robust vocabulary of agriculture related concepts, my interest is in being able to use those concepts to link People with Organizations, Publications, and Events.

One of the most common issues that I have encountered is the lack of an accurate mechanism for disambiguating people as authors of publications. One of the challenges is that most of the user profile data provided is free text in nature when captured at the source, rather than selected from a controlled vocabulary. For example, when users are asked to specify the name of their affiliated organization, many variations on the name of the organization might be provided. Disambiguating “FAO” and “Food and Agriculture of the United Nations” as the same organization can be a challenge.
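One common way to approach this kind of organization name matching is to normalise the free text, look it up in a table of known aliases, and fall back to fuzzy string matching. The sketch below uses only Python's standard library; the small authority table, the `resolve` function and the similarity cutoff of 0.8 are assumptions chosen for illustration, and this is not how VIVO or the ONLD dataset handle the problem.

```python
import difflib
import re

# Hypothetical authority table: canonical name -> known aliases and acronyms.
AUTHORITY = {
    "Food and Agriculture Organization of the United Nations": ["FAO", "UN FAO"],
    "Cornell University": ["Cornell"],
}


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


# Lookup table from every normalised label to its canonical name.
INDEX = {}
for canonical, aliases in AUTHORITY.items():
    for label in [canonical, *aliases]:
        INDEX[normalize(label)] = canonical


def resolve(name, cutoff=0.8):
    """Return the canonical organization name, or None if nothing is close."""
    key = normalize(name)
    if key in INDEX:
        return INDEX[key]  # exact match on a known label or acronym
    # Fuzzy fallback: closest known label above the similarity cutoff.
    close = difflib.get_close_matches(key, list(INDEX), n=1, cutoff=cutoff)
    return INDEX[close[0]] if close else None
```

With this table, both `resolve("FAO")` and `resolve("Food and Agriculture of the United Nations")` return the same canonical name, while an unrelated string such as `resolve("Unknown Institute")` returns `None`. The cutoff trades missed matches against false merges, so in practice low‐scoring matches would probably need human review.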

What do you see as the most pressing needs in agriculture for the coming decade? What sort of datasets are needed, and what sorts of vocabularies are needed to support access to and use of those datasets? Do particular areas need to be strengthened, such as integration of semantics in geographic information systems?

I would like to see better data for geospatial information so that relationships between people, their areas of research, and events can be viewed spatially.

Although Agrovoc and GACS provide a good vocabulary for topical areas, I would like to see other data, such as agriculture‐related “job titles”, as a controlled vocabulary.

What priorities should the organizations represented at the workshop set for future actions? Are you aware of, or involved in, other relevant projects or initiatives in related areas? In what direction should we try to head over the coming decade?

We need better tools for consuming data. Web services and APIs that can be easily used on a variety of platforms are needed. Services which can provide data that interoperates with large amounts of data in an automated fashion, as well as services which can provide interactive data quickly, are needed.