Publishing EC archival corpus as Open Data


In the context of a paper currently in preparation, the MaSTIC research group has received permission from the European Commission to publish the collection of the so-called COM files. These holdings were produced by the Registry of the Executive Secretariat of the Commission of the European Economic Community from 1958 to 1967 and by the Registry of the General Secretariat of the Commission of the European Communities from 1967 to 1986.

COM documents are most often legislative proposals, but they may also be reports, communications to the Council and/or other institutions, white or green papers, etc. A COM file compiles the successive versions of a text, as well as its different language versions, which makes it possible to follow the drafting process.

This collection is currently available through full-text search, which has its limitations because of polysemy (one word with several meanings) and synonymy (several words with the same meaning). By combining Topic Modeling and Word Embeddings, the MaSTIC research group is exploring ways to semantically index the collection with EUROVOC, so that researchers can access the collection in a more meaningful manner.
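To give a rough idea of what such semantic indexing could look like, the sketch below ranks candidate thesaurus labels against the top terms of a topic by cosine similarity of averaged word vectors. It is a minimal illustration and not the method used in the project: the three-dimensional vectors are invented toy values, the labels are illustrative strings rather than a verified extract of EUROVOC, and the names `EMBEDDINGS`, `mean_vector`, `cosine` and `rank_labels` are hypothetical.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# A real pipeline would load vectors trained on a large corpus.
EMBEDDINGS = {
    "farm": [0.9, 0.1, 0.0],
    "crop": [0.8, 0.2, 0.1],
    "subsidy": [0.6, 0.5, 0.1],
    "agricultural": [0.9, 0.2, 0.0],
    "policy": [0.4, 0.4, 0.4],
    "customs": [0.1, 0.9, 0.0],
    "duties": [0.2, 0.8, 0.1],
}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_labels(topic_terms, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    topic_vec = mean_vector(topic_terms)
    if topic_vec is None:
        return []
    scored = []
    for label in labels:
        label_vec = mean_vector(label.split())
        if label_vec is not None:
            scored.append((label, cosine(topic_vec, label_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Top terms of a hypothetical topic and two illustrative candidate labels.
ranking = rank_labels(
    ["farm", "crop", "subsidy"],
    ["agricultural policy", "customs duties"],
)
for label, score in ranking:
    print(f"{label}: {score:.2f}")
```

In this toy setup the topic terms share no word with the label that ranks first, which is precisely the kind of match a plain full-text search would miss.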

All components of this research project (method, tools, results, original source data and preprocessed data) will be made available in June 2018, along with a pre-print of the first A1 article written about the project. Researchers and practitioners worldwide will then be able to verify the results and to compare them with the results delivered by other methods and tools.
