Labeling text using Doccano

Doccano is an open-source text annotation tool. It can be used to create labeled datasets for:

  • Text classification
  • Entity extraction
  • Sequence to sequence translation

Doccano can be used to create labeled data for training the EntityRecognizer model in arcgis.learn.

This software was created by Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang.

Deploying doccano for data labeling

For Windows

Method 1 (uses Docker Desktop, which requires Microsoft Windows 10 Professional or Enterprise 64-bit):

  1. Install Docker Desktop.
  2. Launch Command Prompt (cmd.exe) as Administrator and run the commands below:
    • docker pull doccano/doccano:1.2.4
    • docker container create --name doccano -e "ADMIN_USERNAME=admin" -e "ADMIN_EMAIL=admin@example.com" -e "ADMIN_PASSWORD=password" -p 8000:8000 doccano/doccano:1.2.4
    • docker start doccano
  3. You can now access Doccano at http://localhost:8000

Method 2:

  1. Download or clone the arcgis-python-api GitHub repo.
  2. Navigate to misc/tools/doccano_deployment folder.
  3. Run install.bat as administrator.
  4. On the command prompt, you will be asked to create your username and password for accessing Doccano.
  5. Once the install script completes, you should have Doccano running on your local system.
  6. Open your browser and go to http://localhost:8000/

For Linux

  1. Install Docker Engine (Community) for your Linux distribution.
  2. Launch a terminal and run the commands below. You can modify the ADMIN_USERNAME, ADMIN_EMAIL and ADMIN_PASSWORD values.
    • sudo docker pull doccano/doccano:1.2.4
    • sudo docker container create --name doccano -e "ADMIN_USERNAME=admin" -e "ADMIN_EMAIL=admin@example.com" -e "ADMIN_PASSWORD=password" -p 80:8000 doccano/doccano:1.2.4
    • sudo docker start doccano
  3. You can now access Doccano at http://localhost (the -p 80:8000 option publishes the container's port 8000 on host port 80).
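
Whichever deployment method you followed, it can help to confirm that the server is answering before you open the browser. The snippet below is a minimal sketch using only the Python standard library; the function name doccano_is_up is hypothetical, and the check only tells you that some HTTP server responds at the address, not that it is Doccano specifically.

```python
import urllib.error
import urllib.request


def doccano_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at `url` without an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP 4xx/5xx, or a malformed URL.
        return False
```

Pass the address from your own deployment, for example doccano_is_up("http://localhost:8000") when the container's port is published on host port 8000. The container may need a short while after docker start before it begins answering.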

How to label training data for named entity recognition with Doccano

  1. After Doccano has been deployed to the local machine, go to the Doccano homepage and log in with your credentials.
  2. Create a new project with the project type 'Sequence labeling'.
  3. To import data for annotation, go to 'Dataset' in the left panel, then click 'Actions' > 'Import dataset'.
  4. Select 'JSONL', click 'Select file(s)', and point it to the reports file (doccano_deployment\reports_label.jsonl). Alternatively, text documents can be uploaded using the 'Plain text' option.
  5. After the file has been imported, you will see the documents loaded on the screen.
  6. Click 'Start annotation' in the top menu bar.
  7. All the documents are pre-labeled except three (documents 2, 3 and 4), which are intentionally left unlabeled for you to try labeling. Review the first labeled document, then move on to the second (use the bottom navigation bar to move between documents). Select a span of text with your mouse and choose the relevant label.
  8. New labels can also be created by navigating to 'Labels' in the left panel.
  9. Once all the documents have been labeled, go to 'Dataset' > 'Actions' > 'Export dataset'.
  10. Select JSONL(Text-Labels).
  11. Set an export file name.
  12. Click Export.

The downloaded file can be used to train an EntityRecognizer model from arcgis.learn. You can find a sample notebook here
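
Before training, it can be useful to check the exported file for how many spans each label received and whether any span offsets fall outside their text. The following is a minimal sketch, not part of Doccano or arcgis.learn. It assumes each JSONL line is an object with a "text" field and a list of [start, end, label] spans stored under "labels" (or "label"; the key name may differ between Doccano versions). The function name summarize_annotations, the sample sentences, and the label names "Crime" and "Address" are purely illustrative.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_annotations(lines: Iterable[str]) -> Counter:
    """Count labeled spans per entity type in Doccano-style JSONL lines."""
    counts: Counter = Counter()
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        # Assumption: spans live under "labels" (or "label") as [start, end, name].
        spans = record.get("labels", record.get("label", []))
        for start, end, name in spans:
            if not 0 <= start < end <= len(text):
                raise ValueError(
                    f"line {line_number}: span [{start}, {end}] is outside the text"
                )
            counts[name] += 1
    return counts


# Made-up sample lines in the assumed format.
sample = [
    '{"text": "Theft reported at 12 Main St.", "labels": [[0, 5, "Crime"], [18, 28, "Address"]]}',
    '{"text": "No incident recorded.", "labels": []}',
]
print(summarize_annotations(sample))
```

To run it on a real export, pass an open file handle instead of the sample list, for example with open("export.jsonl", encoding="utf-8") as f: summarize_annotations(f), where export.jsonl stands in for whatever file name you chose during export.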
