
Lexicall - homepages.widged.comhomepages.widged.com/mlange/publications/drafts/lexic…  · Web viewLexicall.org: A web-site for complete and convenient access to lexical resources

Sep 23, 2020


Lexicall.org: A web-site for complete and convenient access to lexical resources

Abstract

This paper introduces Lexicall.org, a website which aims to become a trusted repository of shared lexical resources, committed to the long-term accessibility of the archived resources and their appropriate indexing for research and educational use in the psycholinguistics community.

This repository provides access to four types of materials: (a) data files providing lexical statistics, (b) scripts and tools for the manipulation of lexical data, (c) documentation about these resources and their use, and (d) links to materials relevant to research and teaching activities in psycholinguistics. A large range of linguistic materials is covered, including parts of words, words and non-words, textual and visual material, and experimental datasets from psycholinguistic studies.

A distinctive feature of our approach is that, in addition to the usual mechanisms for listing and downloading lexical resources, we also provide mechanisms for directly querying (for data files) or running (for Unix-compatible scripts) files held in the archive on the lexicall.org website.

To ensure complete interoperability and re-use in different contexts and to anticipate the possibility of having this archive become a distributed repository within a peer-to-peer sharing system, metadata are used to hold information about the lexical resources and parameters for the mechanisms in place.

Introduction

Psycholinguistic studies typically present one item at a time (word segment, word, non-word, image, paragraph) on a computer screen and collect the response times and error rates of a group of university student participants who are asked to name or categorize the item. The control of variables is an essential aspect of such experiments. To test a prediction about the impact of one or two variables on reading performance, a researcher attempts to select items (often words) that present a clear contrast on the chosen variables (words of either low or high frequency, which are either regular or irregular) while being as similar as possible with respect to other word characteristics that are known to influence reading performance (the size of the orthographic neighbourhood, that is, the number of existing words that share all letters but one; the average bigram frequency; the number of syllables; etc.). It is only the belief that the items in the different experimental conditions (HF-regular, HF-irregular, LF-regular, LF-irregular) are equivalent that allows the researcher to reach a firm conclusion as to the correctness of the prediction. If the experimental manipulation entails systematic variations in orthographic properties, then it cannot be unambiguously demonstrated that it is differences in the levels of the variable, rather than differences in poorly controlled orthographic properties, that influenced the participants’ performance.

The importance of lexical statistics for psycholinguistic research has long been recognized. Numerous resources exist, contributed by a number of teams (the CELEX lemma and lemme database of Baayen et al.; the MRC word database of Coltheart, 1981; the ARC non-word database of Rastle et al.; Brulex of Content et al., 1991; Lexique of New et al., 2001), and new resources are regularly created.

These resources are usually advertised in journals, and the raw data are made publicly available on the website of the resource manager. For instance, the publishers of the BRMIC journal took the initiative to set up an online archive of Norms, Stimuli, and Data to group the resources presented in publications of the Psychonomic Society (Behavior Research Methods, Memory & Cognition, Perception & Psychophysics). Balota and colleagues. Rastle. Boris New.

However, in contrast with the Natural Language Processing (NLP) community, no centralized resource distribution system has been set up yet. Instead, we only have independent, disconnected archives maintained by individual researchers, which presents the following disadvantages: (1) Although journals and other forms of scholarly publication enable information about these resources to circulate to some extent, page limit constraints force a separation between the information about a resource and the resource content. (2) There is no place to host resources created by a lab that does not have the means to create a dedicated website. (3) The creation of distributed resources encourages the development of different solutions to slightly different problems. (4) With resources stored at many individual research institutes, there is a risk that a resource will disappear when its creator leaves the academic community.

The aim of the Lexicall.org project is to address these limitations and to offer researchers in psycholinguistics:

Sharing – lexicall.org offers researchers in the field a platform on which they can easily share the results of their work with colleagues, if appropriate.

Preservation – resources are stored for the long term.

Ease of identification – Each resource will be accompanied by a file providing extensive information about it. This file will be used by dedicated search facilities that help scholars rapidly locate specific resources and evaluate their quality and suitability.

Visibility – Resource information will be encoded using standard-compliant Metadata (that is data about data) embedded in XML markup. The adoption of such standards will facilitate the interchange of resource information with repositories of closely related fields (e.g., Natural Language Processing community) or the visibility of the resource in popular search engines (e.g., Google).

Usability and interoperability – Resources will be stored in open formats in order to guarantee platform independence and encourage the easy exchange of resources between users and platforms.

Multiple modes of access – information about existing resources, a download facility, and mechanisms for accessing and manipulating resources within lexicall.org. Specifically, including in the metadata file a section that describes the resource content would allow the automatic creation of a query interface that lets the user interact directly with the content of the archived data file.

Protection of access restrictions – mechanisms are set up to respect and implement any sensitivities or restrictions.

Acknowledgment – clear information will be given about the creators of the resources, the licensing terms, and the reference to cite when using the work.

Quality and standards – Lexicall.org provides an integrated environment with access to documentation and tutorials, as well as a forum on which advice can be sought to ensure that your materials are of the highest quality and conform to robust standards.

In what follows, we provide an overview of the different formats and uses of lexical statistics data files and describe the available metadata standards for describing language resources and linguistic tools. We then consider mechanisms that can offer the user an integrated view of, and access to, these two complementary domains of resources and tools.

Requirements imposed by lexical resources

Accommodate the variety of forms of lexical resources

The implementation of such a comprehensive service requires an understanding of the different forms that lexical resources can take.

Linguistic resources can take many forms, such as wordlists, lexical databases, terminology databases, dictionaries, glossaries, and thesauri. Psycholinguists are usually concerned with only a subset of them.

Statistics about parts of words are traditionally stored in tables which list the word segment along with one or two variables (e.g., a bigram and its frequency of occurrence, or a rime or VC part of a monosyllabic word and its consistency of pronunciation in the language). Statistics about words or non-words are traditionally stored in huge lexical databases which list all the words of the language alongside a large range of lexical statistics extracted from a corpus (for instance, word frequency, word grammatical class, word age of acquisition). Shorter files may list a limited set of norms for a more specific material (words or non-words, visual or textual material) collected through experimental testing. Finally, other files provide information about the experimental or simulation results for a set of words used in a specific study.

Fortunately, despite these different sizes and formats, a common representation scheme can be used for their storage, with a text file organized in columns, with one column that contains the main entry (typically a lexical headword) and other columns containing either text (the part of word, the word, the name of the pictorial stimulus) or numbers (estimates of the frequency of occurrence, age-of-acquisition, and so forth).

Similarly, the actions typically carried out on the data files are fairly limited. When preparing material for a new experiment, a researcher typically searches for the items that follow a specific orthographic pattern (all words of two syllables which have either PB or BP as the transition between the two syllables) or the items that fall into specific ranges of values for a given variable (all words which have a frequency value higher than 300 or lower than 10, to create a contrast between low and high frequency words).
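The two kinds of selection just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lexicall.org implementation: the column names (`word`, `frequency`), the sample rows, and the helper `select_items` are all hypothetical. The pattern search only looks for the letter sequence and does not check syllable boundaries, which would require a syllabified column in the data file.

```python
import csv
import io
import re

# Hypothetical tab-separated sample; column names and values are illustrative only.
SAMPLE = (
    "word\tfrequency\n"
    "cupboard\t12\n"
    "table\t320\n"
    "upbeat\t5\n"
    "house\t590\n"
)

def select_items(lines, pattern=None, above=None, below=None):
    """Return the rows whose word matches `pattern` (if given) and whose
    frequency is higher than `above` or lower than `below` (if given)."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for row in csv.DictReader(lines, delimiter="\t"):
        if regex and not regex.search(row["word"]):
            continue
        freq = float(row["frequency"])
        no_bounds = above is None and below is None
        is_high = above is not None and freq > above
        is_low = below is not None and freq < below
        if no_bounds or is_high or is_low:
            selected.append(row)
    return selected

# Orthographic pattern: words containing the letter transition "pb" or "bp".
pb_words = [r["word"] for r in select_items(io.StringIO(SAMPLE), pattern=r"pb|bp")]

# Frequency contrast: frequency higher than 300 or lower than 10.
contrast = [r["word"] for r in select_items(io.StringIO(SAMPLE), above=300, below=10)]
```

Because the file is plain tab-separated text, the same selection could equally be written in any scripting language available to the researcher.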

Portability and interoperability

As noted by Boker and Crowell (2004), a format must be chosen for data storage that is readable by the largest possible number of researchers and that can still be read in 50 or 100 years. With that in mind, proprietary formats like Excel should be avoided, as they have a tendency to either change or disappear (the once popular “dbf” format, for instance, has now almost completely disappeared).

Instead, data storage and query formats should be selected that are (1) simple to implement, (2) open and non-proprietary, and (3) in wide use in the community.

For lexical data, which usually come in table or matrix format, the recommended format is simple ASCII (American Standard Code for Information Interchange) text, with columns of data separated by tabulations.

It is worth noting, however, that despite its extreme simplicity, this format does not necessarily guarantee maximum portability and durability out of the box. Notably, characters outside the basic ASCII set are encoded differently on Mac and Windows platforms, and this has implications for the encoding of lexical resources for alphabetic writing systems like French, which adds diacritic signs above or below letters to better specify their pronunciation (like the “é” in “café” or the “ï” in “naïve”). An even more complex problem is associated with the representation of non-alphabetic writing systems (Chinese, Japanese).

However, plain ASCII presents the advantage that openly available conversion programs exist to allow migration between Mac and Windows formats and that well-accepted standards exist for the encoding of non-alphabetic languages in Unicode format.
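As an illustration of the kind of migration mentioned above, the following minimal Python sketch converts a tab-separated line written with the classic Mac OS encoding (Mac Roman) into Unicode (UTF-8). The sample bytes are made up for the example, and the choice of Mac Roman and Windows-1252 as the two platform encodings is an assumption about the files concerned.

```python
# Bytes as a classic Mac OS program would have written "café<TAB>naïve"
# in the Mac Roman encoding, where 0x8E is "é" and 0x95 is "ï".
mac_bytes = b"caf\x8e\tna\x95ve\n"

# Decoding with the right source encoding recovers the intended text.
text = mac_bytes.decode("mac_roman")

# Decoding the same bytes as Windows-1252 silently yields the wrong letters,
# which is the portability problem described in the text.
misread = mac_bytes.decode("cp1252")

# Re-encoding as UTF-8 gives a platform-independent file content.
utf8_bytes = text.encode("utf-8")
```

The point of the sketch is that the conversion is lossless only when the source encoding is known, which is one reason to record it alongside the resource.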

Plain ASCII is usually fast enough to process, given the relatively small size, by today’s standards, of files holding lexical resources (50 MB maximum). For larger files, it may be necessary to consider conversion into another format to maintain a reasonable processing speed: a compressed archive (for example, a gzipped tar archive) for faster download, or a MySQL format for faster querying of the resource. However, all files should be kept in ASCII format, with conversion performed only by conversion programs.

Similarly, code for tools should preferably be in plain text format, written as source code in an open source language such as R (Ihaka & Gentleman, 1996). Open source languages may change over time but are likely to provide migration paths to many computer platforms and operating systems, thereby reducing the possibility of having the source code become obsolete or usable only on an obsolete computer.

Indexing for the rapid identification and evaluation of resources

An archive is of value only if resources can easily be identified, evaluated, and accessed. The quality of the indexing is central.

The accepted way of describing language resources is by using metadata (literally, data about data). The metadata consist of descriptions of the resources: what the resource is about and how the data are coded. Keywords are an obvious form of metadata that offer handles to index resources and facilitate their identification according to a user’s specific need. For instance, if one tag encodes information about the language level (from part of words to running text), another the category of material (data or scripts), and still another the language of the resource (English, French, Spanish, or others), it is very easy to provide the user with the facility to select resources that match any given value for these three parameters. The use of unified metadata standards for all types of resources further allows a user to map available resource types onto tools for the manipulation of these data or documentation about the material.
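The selection by three keyword tags described above can be sketched as follows. This is an illustrative sketch only: the class `ResourceTags`, the function `find_resources`, and the catalogue entries are hypothetical and do not describe actual resources held in the archive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceTags:
    """Keyword metadata for one resource (hypothetical structure)."""
    name: str
    level: str      # language level, e.g. "Part of words", "Words"
    material: str   # category of material, e.g. "data", "tools"
    language: str   # language of the resource

# Made-up catalogue used only to demonstrate the filtering.
CATALOGUE = [
    ResourceTags("Hypothetical bigram table", "Part of words", "data", "french"),
    ResourceTags("Hypothetical word database", "Words", "data", "french"),
    ResourceTags("Hypothetical neighbourhood script", "Words", "tools", "english"),
]

def find_resources(catalogue, **criteria):
    """Return the resources whose tags match every supplied criterion."""
    return [
        resource for resource in catalogue
        if all(getattr(resource, key) == value for key, value in criteria.items())
    ]

french_word_data = find_resources(
    CATALOGUE, level="Words", material="data", language="french"
)
```

Leaving a criterion out simply widens the selection, so the same function serves both browsing (one tag) and precise searching (all three tags).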

Other forms of metadata involve general information about the resource, such as its name, description, or the journal article in which the resource has been described. These are essential for the evaluation of the resource.

These metadata are best stored as ASCII text embedded in an XML document. The ASCII format guarantees the long-term access to the resource. The XML format helps overlay a clear structure on the text content, in a format that is readable by humans and computers alike.

These metadata are best stored in a text file separate from the resource itself. The motivations for this are simple. First, this makes the metadata held in the XML file faster to access for identification and evaluation of the resource by a human reader, as well as easier to process with the variety of searching, indexing, and associational tools now in wide use on the Web.

Second, any access restrictions can only be guaranteed if the information used for access restriction is independent of the resource itself. We have to be aware that some highly sophisticated resources have required years of team effort to create (CELEX, for instance). Such resources cannot really be expected to be made available for free. The creator of a lexical resource may wish to restrict the distribution of the data file but still provide free access to information about the resource. Also, researchers who already provide access to their resources locally are not keen to see them duplicated on different websites, as this tends to obscure the source and makes updates difficult.

Only a system that separates the metadata from the data file can successfully accommodate these two situations, with at least two requirements:

(1) The located resources do not need to be locally available and can be identified by a URL. It should be possible to link to an external website for download and interfacing facilities.

(2) For each resource, specifications and terms on which it can be acquired must be clearly indicated.

Compatibility with standards in use

For maximum impact, the information to be exchanged must be stored in a format compliant with the standards already in use within the community. The greater the visibility of a resource, the greater its usefulness to members of the community and the greater the chances of speeding up scientific discovery in psycholinguistics.

It is also important to ensure compatibility with the standards in use in other communities, such as the NLP community. Bringing the data to the notice of users who are not in the same field as the original researcher could lead to novel and important analyses that the original researcher had not foreseen.

With respect to data files, two initiatives claim to provide a relevant metadata set for the linguistic domain. The first is the ISLE Metadata Initiative (IMDI), which proposes a complex set of tags that can be considered the beginning of an ontology for the domain. The second is the Open Language Archives Community (OLAC), which builds on the Dublin Core (DC) set for use in the linguistic domain. Both IMDI and OLAC aim to describe linguistic tools as well as language resources. The DFKI ACL Natural Language Software Registry uses a well-described taxonomy for the more specific definition of tools.

However, it is not necessarily the case that we should stick to the standards they propose. What matters is to ensure that our metadata can be used by other archiving bodies (Google, NLP archives, etc.), either directly or after conversion.

Trustworthiness

Every step should be taken to provide clear information on the reliability and trustworthiness of each resource. Materials will be made publicly accessible only after approval, and it will be clearly indicated whether a publication on the resource has appeared in a peer-reviewed journal.

Multiple methods of delivery

The least that a typical user would expect from our web-based archive is rapid identification of suitable resources either by browsing or searching, preferably with the possibility to download one of the selected resources.

The possibility of querying and subsetting the selected resource prior to delivery of the data would be a clear advantage for the success of the website. Often, researchers and students in psycholinguistics are only interested in a few variables or a subset of records within a larger data file.

Providing easy-to-use Web-based query mechanisms embedded within the archive itself would, however, require an additional expense of time and computing resources.

As Boker and Crowell stated, "[…T]here is the risk that if the work for the data archivist is too great, the archive would not be instituted at all."

For a website managed by volunteers, it is not worth considering initiatives that provide a better user experience if they create maintenance duties, such as the need for manual manipulation of each file.

This does not necessarily mean that it would be impossible altogether, however. Both user satisfaction and archivist peace of mind could be guaranteed by setting up appropriate mechanisms and protocols that enable an automatic procedure for resource querying.

Lexicall.org

Our contribution is the effort to set up the structure and mechanisms. Beyond that, success depends on the ongoing collaboration of members of the community.

1. Resource storage

We have chosen to make the format of the data as simple as possible so as to facilitate future migration of the data to new platforms and operating systems. By default, all data are stored in ASCII text files, and lexical data are stored as fields of data organized in columns separated by tabulations. When required, Unicode format or conversion programs are used. In this, we followed Boker and Crowell’s (2004) recommendation to avoid proprietary formats like Excel.

2. Metadata for the specification of lexical resources

Metadata are stored in an XML format, as a publicly accessible text file, with functions to convert the content of the metadata file into HTML format for immediate viewing in the browser.

Included in the XML file are information about the resource, information about its content (including explicit descriptions of the meaning of the variables), and information about code conversion.

In this section, we describe the metadata used to index and classify the resources in our repository. Existing initiatives were used as guidelines.

Metadata will consist of three sections: one with information about the resource, for its advertisement in our repository; one with information about the variables or parameters contained in the resource; and one with information about code conversion.
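The conversion of a metadata file into HTML for viewing in the browser can be sketched as follows. The site itself relies on PHP; Python is used here purely for illustration. The element names `resource_description`, `name`, `version`, and `description` are taken from the appendix, while the root element `resource`, the sample values, and the function `metadata_to_html` are assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical metadata file; the <resource> root element is an assumption.
METADATA = """<resource>
  <resource_description>
    <name>Example word list</name>
    <version>1.0</version>
    <description>Hypothetical list of words &amp; frequencies</description>
  </resource_description>
</resource>"""

def metadata_to_html(xml_text):
    """Render the resource description as an HTML definition list."""
    root = ET.fromstring(xml_text)
    description = root.find("resource_description")
    rows = []
    for child in description:
        # Escape both labels and values so that the output is valid HTML.
        label = html.escape(child.tag.replace("_", " "))
        value = html.escape((child.text or "").strip())
        rows.append(f"<dt>{label}</dt><dd>{value}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"

page = metadata_to_html(METADATA)
```

Because the function walks over whatever child elements are present, adding a new field to the metadata file would not require changing the conversion code.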

3. Mechanism for the specification of metadata

Given the time constraints of an academic researcher, it seems unlikely that any scenario that requires the producers or users of lexical resources to go through difficult or lengthy processes would result in widespread adoption. Therefore, we have undertaken to write a procedure that makes the creation of metadata files and the upload of resources self-explanatory for a graduate research assistant.

A submitted resource will be listed in the repository only after the Lexicall.org team has approved its relevance. In any case, the Lexicall.org team will be in contact with the authors submitting their resource. A facility for updating already listed resources will also be made available.

4. Basic mechanisms

The lexicall.org website relies heavily on PHP and MySQL. Both are server-based applications. PHP is a server-side scripting language that makes it very easy to generate dynamic content (content that can be adapted as a function of the context or the user). MySQL is a relational database engine that provides very fast access to small and medium-sized databases.

Together, they can be used to create a database-driven website, with content dynamically pulled from the database to create Web pages that can be viewed with any regular Web browser and impose no requirement for the user to have any plug-in pre-installed. Combined with the use of cascading style sheets (CSS) for page formatting, this favours a highly modular approach in which changes to the content are independent of changes in appearance (and conversely). In this model, some extra time needs to be dedicated to the design and organisation of the website at the set-up stage, but this is usually rewarded with easy ongoing management of the archive. In particular, changes to one element of the website template (such as the addition of a new item in the menu bar) are automatically applied to all pages in the website.

The advantages of using these techniques are numerous. First, MySQL and PHP are now among the most widely used online database and scripting technologies, which means that documentation is easy to find. Second, they are both open source technologies. Not only do very powerful tools come for free, but many ready-made content management applications built with them can also be found on the web for free. At the same time, they offer built-in functions that make features highly desirable in a repository fairly easy to set up. In particular:

Handle file uploads using HTML forms

Build a Web-based file repository or photo gallery

Utilize sessions and cookies to track site visitors

Implement query processes or utilize available Web-enabled systems, using Oracle, PostgreSQL, or MySQL, that have built-in query tools.

Boker and Crowell (2004) suggested creating a password-protected Web page to contain the data and then adding an e-mail link on the homepage so that potential [users] can easily e-mail for permission to use the data. This, however, imposes heavy support and maintenance loads on the volunteer archivist. We have taken a different approach. All resources are stored in a protected area that cannot be accessed directly from a web browser. The specifications in the metadata file are used to automatically create download and query links when appropriate.

When an author has asked users to contact him or her for access authorization, a login interface is automatically put in place, which requests the user to enter a keyword provided by the author.
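The automatic creation of links from the metadata, and the keyword check, might look like the following sketch. It is written in Python for illustration only (the site itself relies on PHP). The accepted values ("lexicall.org", an external URL, or "none") follow the appendix; the local route format `/download.php?id=…`, the function names, and the way the author's keyword is stored are assumptions.

```python
import hmac

LOCAL_MARKER = "lexicall.org"  # value meaning "use the copy held in the archive"

def resolve_link(value, resource_id, action):
    """Turn a metadata value into a link target, or None if no link is wanted.

    `action` is "download" or "query"; the local route format is hypothetical.
    """
    value = (value or "").strip()
    if not value or value.lower() == "none":
        return None                      # authors do not want a public link
    if value.lower() == LOCAL_MARKER:
        return f"/{action}.php?id={resource_id}"
    return value                         # external URL supplied by the authors

def access_granted(supplied_keyword, expected_keyword):
    """Check the keyword given by the author; None means no restriction."""
    if expected_keyword is None:
        return True
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(
        supplied_keyword.encode("utf-8"), expected_keyword.encode("utf-8")
    )
```

Since every decision is read from the metadata file, no manual step is required from the archivist when a resource is added or its access terms change.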

5. Mechanisms for resource access

We rely on a combination of adequate metadata information provided by the authors and a suitable interfacing protocol for the automatic provision of a querying process. For data and script files, the variable names, their descriptions, and the query operations that apply to each of them are also specified in the metadata file.
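To make the idea of an automatically generated query interface concrete, here is a minimal Python sketch that builds an HTML form from a list of variable definitions. The dictionary keys (`name`, `type`, `description`), the two variable types, and the field naming scheme (`_min`, `_max`, `_pattern`) are assumptions for illustration, not the format actually used by lexicall.org.

```python
import html

# Hypothetical variable definitions, as they might be read from a metadata file.
VARIABLES = [
    {"name": "word", "type": "text", "description": "Orthographic form"},
    {"name": "frequency", "type": "number", "description": "Frequency of occurrence"},
]

def build_query_form(variables, action="query"):
    """Generate an HTML query form with one set of fields per variable."""
    fields = []
    for var in variables:
        name = html.escape(var["name"])
        label = html.escape(var["description"])
        if var["type"] == "number":
            # Numeric variables are queried by range.
            fields.append(
                f'<label>{label} (min) <input type="number" name="{name}_min"></label>'
            )
            fields.append(
                f'<label>{label} (max) <input type="number" name="{name}_max"></label>'
            )
        else:
            # Text variables are queried by pattern.
            fields.append(
                f'<label>{label} (pattern) <input type="text" name="{name}_pattern"></label>'
            )
    body = "\n".join(fields)
    return (
        f'<form method="get" action="{html.escape(action)}">\n'
        f'{body}\n<input type="submit" value="Search">\n</form>'
    )

form = build_query_form(VARIABLES)
```

The design choice illustrated here is that the form is derived entirely from the variable descriptions, so a new data file needs only a metadata entry, not a hand-written interface.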

6. Visibility

For increased visibility, we followed Boker and Crowell’s (2004) recommendation to place the keyword “PsychologyDataArchive” on any web page in the data archive that is deemed accessible to search engines.

An RSS feed will also be added, so as to give other distributed archive websites the opportunity to inform their users of the latest additions to our archive.
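A feed of latest additions could be generated along the following lines. This is only a sketch in Python using the standard library: the channel title, the site address `http://lexicall.org/`, the item links, and the function `build_feed` are assumptions, and only the basic RSS 2.0 elements (`title`, `link`, `description`) are produced.

```python
import xml.etree.ElementTree as ET

def build_feed(additions):
    """Build a minimal RSS 2.0 document listing newly added resources."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Lexicall.org: latest additions"
    ET.SubElement(channel, "link").text = "http://lexicall.org/"  # assumed address
    ET.SubElement(channel, "description").text = "Resources recently added to the archive"
    for addition in additions:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = addition["title"]
        ET.SubElement(item, "link").text = addition["link"]
        ET.SubElement(item, "description").text = addition["description"]
    return ET.tostring(rss, encoding="unicode")

# Made-up entries used only to demonstrate the output.
feed = build_feed([
    {"title": "Hypothetical word list", "link": "http://lexicall.org/resource/1",
     "description": "Example entry"},
    {"title": "Hypothetical script", "link": "http://lexicall.org/resource/2",
     "description": "Another example entry"},
])
```

In practice the list of additions would be drawn from the same metadata files that drive the rest of the site, so the feed requires no separate maintenance.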


7. Forum and community tools

Some community tools like forums and wikis have been set up to encourage the sharing of information and expertise within the community. Some material has already been contributed by the archivist (conference announcements, tutorials on easy-to-use scripting languages, on regular expressions, and on resources useful for collecting and analysing experimental data). However, the forum and wiki are not intended as a channel for material contributed by the archivist but rather as a means to facilitate and encourage contributions from members of the community (support and information exchange in the forum, tutorials and informative documents in the wiki). As already stated, a website like this has long-term viability only if the load of supporting the users is shared within the community rather than assumed by a single person maintaining the website on a purely voluntary basis.

Conclusions

This paper has presented an overview of techniques that can be used to create usable data archives and integrate them into a heterogeneous meta-archive on the Web. In such an archive, individual researchers and their academic institutions retain responsibility for maintaining the resources held in the repository (trustworthiness, documentation, support); the volunteer archivist takes responsibility for the correct functioning of all mechanisms in place for the identification, distribution, and querying of lexical resources.

References

Boker, S. M., & Crowell, C. R. (2004). Proposal for the Creation of a Web-Based Heterogeneous Distributed Archive for Psychological Data. Behavior Research Methods, 36(3), 670–677.

Oehlmann (1997). A web-based archive of psychological experiments: Challenges for client server interactions. Iassist Quarterly.

ISLE Metadata Initiative (2003). Metadata Elements for Lexicon Descriptions. Draft Proposal Version 1.1c. MPI Internal version. [on line: http://www.mpi.nl/IMDI/documents/Proposals/IMDI_Lexicon_1.1c.pdf]

Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational & Graphical Statistics, 5, 299–314.


Appendix

Standards used for the description of resources

Each item in our repository will be accompanied by a description file, which provides useful information about the resource. This document will be stored as a plain text XML document.

The definition file is a simple text file divided into six parts.

1. Keywords that classify the resource.

2. Information about the resource.

3. Information about the contact person for this resource.

4. Parameters used for the management of the resource. [Required for a download link or a query interface to be provided by lexicall.org, optional otherwise]

5. Definition of the variables held in the resource, to be used for information purposes as well as for the automatic generation of a query interface to the resource.

6. Code conversion information (for example, matching each key used to code the phonology onto the equivalent DISC code).

1. Keywords

<keywords>
<material>
Type of material. [text, one of “data”, “tools”, “docs”, “links”]
</material>
<category>
Category under which the resource should appear in our repository. [text, one of: "Part of words", "Words", "Nonwords", “Running Text”, "Visual Material", "Associations", "Performance Measures"]
</category>
<language>
One of: any language, english, dutch, french, japanese, spanish
</language>
</keywords>
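A filled-in keywords block can be checked automatically against the allowed values listed above. The following Python sketch shows one way to do so; the function `validate_keywords` and the two sample blocks are illustrative assumptions, while the allowed values are copied from the template.

```python
import xml.etree.ElementTree as ET

# Allowed values, copied from the keywords template above.
ALLOWED = {
    "material": {"data", "tools", "docs", "links"},
    "category": {
        "Part of words", "Words", "Nonwords", "Running Text",
        "Visual Material", "Associations", "Performance Measures",
    },
    "language": {"any language", "english", "dutch", "french", "japanese", "spanish"},
}

def validate_keywords(xml_text):
    """Return a list of error messages; an empty list means the block is valid."""
    root = ET.fromstring(xml_text)
    errors = []
    for tag, allowed in ALLOWED.items():
        value = (root.findtext(tag) or "").strip()
        if value not in allowed:
            errors.append(f"{tag}: {value!r} is not one of {sorted(allowed)}")
    return errors

# Hypothetical submissions: the first is valid, the second has a bad language.
GOOD = ("<keywords><material>data</material><category>Words</category>"
        "<language>french</language></keywords>")
BAD = ("<keywords><material>data</material><category>Words</category>"
       "<language>klingon</language></keywords>")

good_errors = validate_keywords(GOOD)
bad_errors = validate_keywords(BAD)
```

A check of this kind could be run at upload time so that contributors receive immediate feedback before the archivist reviews the submission.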

2. Resource Description

<resource_description><name>

Name of the Database.[text, up to a maximum of 100 characters]

</name><version>

Version of the resource (by default 1.0) [text, up to a maximum of 10 characters]

</version><description>

Short description of the Database[text, up to a maximum of 255 characters]

</description><reference_to_cite>

Reference to a peer-reviewed paper or report which presents this resource and which should be cited in any work that makes use of this resource. [Bibliographical reference in APA format: Smith, J. (xxx). Title. Journal, Vol(Iss), pages]

</reference_to_cite>


<url_information>Link to a page on the web which provides information about the resource. Ideally, this would be a link to a manual in html or pdf format.[url|Psychonomic.org|Lexique.org|none]

</url_information><url_download>

Link to a webpage where the resource file can be downloaded. If the value is "lexicall.org", a link will be created to the copy held on the lexicall.org website. In this case, the authors should make sure that the data file is also uploaded. If the authors prefer users to be directed to another webpage (for instance, psychonomic.org or lexique.org), they should simply provide the url of the website on which a link to the data file can be found. If the authors do not wish the resource to be publicly available for download, they should simply enter "none". [url|Lexicall.org|Psychonomic.org|Lexique.org|none]

</url_download><url_query_interface>

Link to a webpage with an interface that lets users query the resource. If the value is "lexicall.org", a link will be provided to the automatic interface generator of the lexicall website. In this case, the contributor should make sure to provide information about the variables in the resource file and that the data file is also uploaded. If an interface already exists, simply provide the url of the website on which this interface can be found. If the authors do not wish the resource to be publicly interfaced, they should simply enter "none". [url|Lexicall.org|Psychonomic.org|Lexique.org|none]

</url_query_interface> <notes_public>

Notes visible to the users.[text up to 255 characters]

</notes_public><notes_private>

Any information that authors prefer to keep attached to the database but do not wish to be seen publicly.[text up to 255 characters]

</notes_private></resource_description>

3. Contact details

<contact_details><contact_person>

Name of the person to contact about this resource (note that users of this website are invited to consult the on-line documentation and forums before contacting the authors). [text, up to a maximum of 30 characters]

</contact_person><contact_email>

Email of the contact person. This email will be listed in the information box only if the author specifies it is for public access (next field). [text, up to a maximum of 30 characters, of a format xxx@xxx.xxx, where x can be either alphanumeric or "."]

</contact_email><contact_email_ispublic>

Indicates whether the email information should be made public (1) or not (0)[digit, either 1 or 0]

</contact_email_ispublic><contact_lab>

Name of the lab or department to which the principal author belongs.[text, up to 50 characters, alphanumeric characters only]

</contact_lab><contact_url>

Link to the contributor's personal webpage or to their lab or department webpage. [text, up to 50 characters, of a format xxx.xxx where x can be either alphanumeric, ".", or "/"]

</contact_url></contact_details>


4. Lexicall management

<lexicall_management><file_format>

File format for the resource (it must be one of the options proposed if the resource is to be interfaced automatically on the lexicall website). [text, one of "text file", "awk script", "perl script", "other"]

</file_format><file_copy_at_lexicall>

Indicates whether a copy is held in the lexicall repository. The value must be 'yes' if the contributor indicated that links for download or query interface should be created in the lexicall.org website. [text, either "yes" or "no"]

</file_copy_at_lexicall><file_size>

Indicates the approximate size of the file (rounded up). This will be used, among other things, to automatically convert big data files into a MySQL database, for faster querying. [number, up to a maximum of 999999999]

</file_size><variables_defined>

Indicates whether a description of the variables is also provided. We strongly encourage the contributors to do so, as it guarantees a better use of their resource. The value must be 'yes' if the contributor indicated that links to a query interface should be created in the lexicall.org website.[text, either "yes" or "no"]

</variables_defined><nb_variables>

Indicates the number of variables that will be described. Typically, there should be one variable per column in a data file or one variable per parameter in a script file.[number, up to a maximum of 99]

</nb_variables></lexicall_management>
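The dependencies between the management fields and the url fields of part 2 can be checked mechanically. The following Python sketch is one possible way to do so; the function name check_management and the flat dictionary of tag names to values are assumptions made for illustration, not part of the lexicall.org implementation. Values are compared case-insensitively here, which is also an assumption.

```python
FILE_FORMATS = {"text file", "awk script", "perl script", "other"}


def check_management(fields):
    """Return a list of rule violations; `fields` maps tag names to text values."""
    def get(tag):
        return fields.get(tag, "").strip().lower()

    hosted_download = get("url_download") == "lexicall.org"
    hosted_query = get("url_query_interface") == "lexicall.org"
    problems = []

    if get("file_format") not in FILE_FORMATS:
        problems.append("file_format is not one of the listed options")
    # A copy must be held at lexicall.org for any link created on the site.
    if (hosted_download or hosted_query) and get("file_copy_at_lexicall") != "yes":
        problems.append("file_copy_at_lexicall must be 'yes' for lexicall.org links")
    # The automatic query interface needs the variable descriptions.
    if hosted_query and get("variables_defined") != "yes":
        problems.append("variables_defined must be 'yes' for a lexicall.org query interface")
    nb = get("nb_variables")
    if nb and not (nb.isdecimal() and int(nb) <= 99):
        problems.append("nb_variables must be a number up to 99")
    return problems


record = {
    "url_download": "Lexicall.org",
    "url_query_interface": "Lexicall.org",
    "file_format": "text file",
    "file_copy_at_lexicall": "yes",
    "variables_defined": "no",
    "nb_variables": "12",
}
print(check_management(record))
```

In this example the record asks for a query interface on lexicall.org without describing its variables, so one violation is reported.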

5. Variables Details

(Column Number, Variable Name, Variable Description, Query Type, Query Specifications)

<variables_definition>

[00-99] [text] [text] [text] [text]
[00-99] [text] [text] [text] [text]
[00-99] [text] [text] [text] [text]
[00-99] [text] [text] [text] [text]
[00-99] [text] [text] [text] [text]

</variables_definition>


This part is organized as 5 columns of data separated by tabs. It can easily be created in Excel and pasted into a text file.

Column 1: A number (00-99) that indicates the column, in the data file, in which the variable is found.

Column 2: A word or two naming the variable.

Column 3: A short text (maximum 255 characters) describing the variable.

Column 4: One of a set of predefined options indicating the type of query to be provided for that column of data. Current options are:

[Regular Expression] or [RE]: Typically of use for text strings. Lets the user find the data in that column that match a specific search pattern, the search pattern being defined with regular expressions (see in-site documentation for details on these).

[Min-Max] or [MM]: Typically of use for continuous values. Lets the user find any matching data inside a range of values. To speed up computation, the fifth column should contain the absolute minimum and maximum values in the database. If these are not provided, they will be computed automatically (at the cost of longer processing time).

[Single Choice] or [SC]: Typically of use for categorical data. Lets the user define the value to match as one of a list of options. The fifth column should then contain information about the keys and their meaning. The format for this is key1: value-key2: value-key3: value. The values will be displayed in a drop-down menu and the keys used to find matching data. This can be used, for instance, to define syntactic class (V: verb-N: Noun-A: Adj.).

[Multiple Choice] or [MC]: Similar to Single Choice, except that the user can select multiple values. In this case, the data retrieved will be those that match any of the values selected by the user. An example of use is the selection of words that are either CVC or CVCC in structure.

[Tick Box] or [TB]: Of use mainly in scripts. Lets the user determine whether an option should be on or off.

[Word List] or [WL]: Of use mainly in scripts. Lets the user enter a list of words or items to process.

[None] or [NO]: No query to provide. Typically of use for variables like standard deviations.

Column 5: Specifications for the query. With the Single Choice and Multiple Choice query options, a description of the (key: value) pairs. With the Min-Max query option, an indication of the absolute minimum and maximum values in the database.
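To make the five-column format more concrete, here is a minimal Python sketch of how a definition row and a choice specification could be read, and how four of the query types could be applied to a cell of data. All function names are hypothetical. The sketch assumes that column numbers start at 1, that a regular expression may match anywhere in the value, and that choice labels contain no hyphen (since the hyphen separates the pairs); none of these points is settled by the description above.

```python
import re

# Long and short forms of the query types listed above.
QUERY_CODES = {
    "Regular Expression": "RE", "Min-Max": "MM", "Single Choice": "SC",
    "Multiple Choice": "MC", "Tick Box": "TB", "Word List": "WL", "None": "NO",
}


def parse_choices(spec):
    """Turn 'key1: value-key2: value' into a dictionary of keys to labels."""
    choices = {}
    for pair in spec.split("-"):
        key, _, label = pair.partition(":")
        choices[key.strip()] = label.strip()
    return choices


def parse_variable(line):
    """Parse one tab-separated definition row into a dictionary."""
    parts = (line.split("\t") + [""] * 5)[:5]  # pad rows with a missing 5th column
    column, name, description, query_type, spec = parts
    query_type = query_type.strip().strip("[]")
    code = QUERY_CODES.get(query_type, query_type)  # accept long or short form
    return {"column": int(column), "name": name, "description": description,
            "query": code, "spec": spec.strip()}


def matches(variable, value, criterion):
    """Test one cell value against a user criterion for the variable's query type."""
    query = variable["query"]
    if query == "RE":
        return re.search(criterion, value) is not None
    if query == "MM":
        low, high = criterion
        return low <= float(value) <= high
    if query == "SC":
        return value == criterion
    if query == "MC":
        return value in criterion
    return True  # NO, TB and WL do not filter data rows in this sketch


pos = parse_variable("2\tSyntactic class\tPart of speech\t[SC]\tV: verb-N: Noun-A: Adj.")
rows = [["chat", "N"], ["manger", "V"]]  # made-up data rows
nouns = [row for row in rows if matches(pos, row[pos["column"] - 1], "N")]
print(parse_choices(pos["spec"]))
print(nouns)
```

An interface generator could use the parsed labels to fill a drop-down menu and the keys to filter the data file, as described for the Single Choice option.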

6. Code Conversion -- Not provided yet

<code_conversion><apply_to_variables>x y z</apply_to_variables><code_table>

this_file DISC
x x
y y
z z

</code_table></code_conversion>

Files in a text format can be easily processed, exchanged between platforms, and used on any platform. Two problems nevertheless need to be addressed: (1) characters do not always get displayed the same way on different platforms; (2) current databases often adopt different coding options.

With databases stored as text files, however, it is fairly easy to write a program that converts codes from non-standard formats to the standard format, using a code conversion table.

The way this information will be coded still needs to be defined.
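Since the coding is not yet defined, the following Python sketch only illustrates the general idea of a table-driven conversion. The function name convert_codes is hypothetical, and the table in the example contains made-up placeholder symbols that are not taken from the DISC standard. The sketch assumes that a source code may span several characters and that symbols absent from the table should be left unchanged.

```python
def convert_codes(text, table):
    """Replace each source code found in `text` with its target equivalent.

    Longer source codes are tried first, so that a multi-character code is
    not broken up by a shorter code that happens to be its prefix.
    """
    keys = sorted((key for key in table if key), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # unknown symbol: keep it unchanged
            i += 1
    return "".join(out)


# Placeholder table: left column as coded in this file, right column the target code.
table = {"aa": "1", "b": "2"}
print(convert_codes("aab?", table))  # prints "12?"
```

The same function could be applied to every variable named in apply_to_variables, one column of the data file at a time.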