Hands-on

What you need to prepare

The morning session assumes that you bring your own laptop, with some software pre-installed and the data we will work with already downloaded. This page summarises everything you need to know to prepare for the workshop.

1. Hardware requirements:
  • Please ensure your laptop has a minimum of 2 GB of RAM and at least 3 GB of available storage space.
  • Make sure that you have an up-to-date version of Java. The latest version of the Java SDK can be downloaded HERE.
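
If Python happens to be installed on your laptop, the two checks above can be sketched as follows. This is an optional convenience and not part of the workshop material: the function names are ours, "GB" is read here as decimal gigabytes, and RAM is not checked because the Python standard library offers no portable call for it.

```python
import shutil

MIN_FREE_GB = 3  # storage requirement stated above


def free_space_gb(path="."):
    """Return the free space, in decimal gigabytes, on the drive holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def java_on_path():
    """Return the location of the `java` executable, or None if it is not on PATH."""
    return shutil.which("java")


def report(path="."):
    """Build two human-readable lines: one for storage, one for Java."""
    free = free_space_gb(path)
    status = "OK" if free >= MIN_FREE_GB else "too little"
    lines = [f"Free storage: {free:.1f} GB (need at least {MIN_FREE_GB} GB): {status}"]
    java = java_on_path()
    lines.append(f"Java: found at {java}" if java else "Java: not found on PATH")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Finding `java` on the PATH only shows that some version of Java is installed; it does not tell you whether that version is recent.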
2. Software requirements:

Both morning workshops require that you download a total of four files: a corpus sample, a word vector model and two software packages. The software does not require installation and can therefore be used without an administrator account on the computer. Both packages run on any operating system.

During the entire morning session, we’ll work with a subset of the EC archive corpus, which can be downloaded here:

For the first hands-on (using topic modelling & word embeddings):
For the second hands-on (supervised machine learning):

Once all the files are downloaded and unzipped, we recommend putting them in a folder named "workshop" somewhere on your desktop.
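
For those who prefer to script this step, here is a minimal Python sketch that creates the "workshop" folder and lists what it contains. The function name is hypothetical, and the location of the desktop is passed in by the caller because it differs between systems.

```python
from pathlib import Path


def prepare_workshop_folder(desktop):
    """Create `desktop/workshop` if it does not exist yet.

    Returns the folder path and the sorted names of the files and
    folders currently inside it.
    """
    folder = Path(desktop) / "workshop"
    folder.mkdir(parents=True, exist_ok=True)
    return folder, sorted(p.name for p in folder.iterdir())
```

On many systems the desktop is at `Path.home() / "Desktop"`, so `prepare_workshop_folder(Path.home() / "Desktop")` would be a typical call, but please check that this matches your own setup.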

Detailed instructions will be handed out during the workshop for each step of the process, so apart from downloading the software and the test data, no other preparation is required.

Support: if you have trouble downloading the software and test data, please send a tweet or an email to @sethvanhooland.

3. Skills required

The tutorials do not require any prior knowledge of computer science or programming.

For some of the software tools we will be using the command line, but all of the specific commands will be given and explained.

However, in order to save time, it would help if you at least knew how to open a terminal in a specific folder. On Windows or Linux, this is very simple.

If you have a Mac, please enable the "New terminal at folder" option. To do this, go to "System Preferences", then "Keyboard", then "Shortcuts", then "Services", and check the box "Enable New Terminal at Folder". This operation will save you a lot of time in the future.

4. Download results

In order for the workshop to run smoothly and to ensure a common base for the group discussions, we are making some results based on the test corpus available in advance:

  • topic_distribution.csv : this file shows the probability distribution of the topics for each document
  • overview_topics.csv : this file gives an overview of the 250 topics
  • results_jex.csv : this file gives an overview of the Eurovoc descriptors attached to each document

The three files are gathered in a ZIP archive downloadable HERE.
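
To give an idea of how a topic distribution file can be explored, the Python sketch below finds the most probable topic for each document. The layout it expects is an assumption on our part and not the documented format of topic_distribution.csv: a header row, then one row per document with the document identifier first, followed by one probability per topic. The function name is also hypothetical.

```python
import csv


def dominant_topics(lines):
    """Map each document id to (topic column name, probability) of its top topic.

    Assumed layout (hypothetical): a header row, then one row per document,
    with the document id in the first column and one probability per topic
    in the remaining columns.
    """
    reader = csv.reader(lines)
    header = next(reader)
    topic_names = header[1:]
    result = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        probs = [float(value) for value in row[1:]]
        best = max(range(len(probs)), key=probs.__getitem__)
        result[row[0]] = (topic_names[best], probs[best])
    return result
```

The function accepts any iterable of text lines, so it can be given an open file, for instance `with open("topic_distribution.csv", newline="") as f: dominant_topics(f)`. If the real file uses a different layout, the column handling needs to be adapted.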