How to create a corpus from scratch
- Create a directory with the name that your corpus should have. Avoid special characters and spaces when naming your directory.
- [Optional] Create subdirectories for any subcorpora you want to create.
- Add plain text files ending in .txt that contain the raw text of your corpus. Create one file for each text that should become a separate document in the corpus. If you have subcorpora, sort the files into the corresponding subdirectories. Make sure to save the text files with the UTF-8 character encoding. If you use Windows, do not use the default Notepad. Use another text editor instead, for example the free and open source Notepad++, which uses UTF-8 by default.
A simple corpus structure may, for example, now look like this:
my-corpus/
├── document1.txt
└── document2.txt
A (slightly) more complex corpus with two subcorpora may now look like this:
gutenberg/
├── authors_A-M
│   ├── jane_eyre.txt
│   ├── moby_dick.txt
│   ├── sense_and_sensibility.txt
│   └── sherlock_holmes.txt
└── authors_N-Z
    ├── dracula.txt
    ├── frankenstein.txt
    └── time_machine.txt
- Open Hexatomic and import the newly created main corpus directory (from step 1) using the text importer, as described in Importing and exporting corpora.
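
If the texts are already available to a script, the directory layout from the steps above can also be prepared programmatically before the import. The following is a minimal sketch in Python, not part of Hexatomic: the function names `create_corpus` and `find_non_utf8_files` are hypothetical helpers introduced only for illustration. It writes every document as a UTF-8 encoded .txt file, creating subcorpus directories as needed, and can list files that are not valid UTF-8.

```python
from pathlib import Path


def create_corpus(root, documents):
    """Write each text below `root` as a UTF-8 encoded .txt file.

    `documents` maps a relative file path (optionally inside a
    subcorpus directory) to the raw text of that document.
    Hypothetical helper, not a Hexatomic function.
    """
    root = Path(root)
    for relative_path, text in documents.items():
        target = root / relative_path
        # Create the corpus directory and any subcorpus directories.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Pass the encoding explicitly; the platform default may differ.
        target.write_text(text, encoding="utf-8")
    return root


def find_non_utf8_files(root):
    """Return the .txt files below `root` that cannot be decoded as UTF-8."""
    problems = []
    for path in sorted(Path(root).rglob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append(path)
    return problems
```

For example, calling `create_corpus("gutenberg", {"authors_A-M/jane_eyre.txt": text_a, "authors_N-Z/dracula.txt": text_b})`, where `text_a` and `text_b` stand for the document texts, would produce two subcorpus directories with one document each. Running `find_non_utf8_files("gutenberg")` on a corpus assembled by hand returns the files that should be re-saved as UTF-8 before importing; an empty list means every .txt file decoded successfully.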