Spanish Linguistic Pipeline – Guide

Contents
1. Access TACC Account
2. Enter Jupyter
3. Create a folder
4. Upload input file
5. Go to the terminal
7. Type the commands
8. Run the python script
9. Find output files
10. Download output files
11. Clear cache and log out
12. Sanity Check REQUIRED for researcher in charge of project
- 12.1 Added information about the output:

Before beginning this pipeline, you MUST complete the first TWO steps of the Sanity Check pipeline.

1. Access TACC Account
Access the shared credentials to enter the TACC Analysis Portal:
- Go to https://stache.utexas.edu/
- Enter with your UT credentials
- Click on “secret”

1.1 Access TACC Analysis Portal
- Go to https://tap.tacc.utexas.edu/
- Enter Dr. Grasso’s account information generated by Stache and log in.

1.2 Submit a job on TACC
Click on the dropdowns and select:
- Lonestar 6
- Jupyter Notebook
- DBS23006
- vm-small
- Nodes 1; Tasks 1
- Job name (can be anything; we will use SpanishLing_Trial in this tutorial guide)
- Time limit (I will use 2 hours in this tutorial guide)
Then click on “submit”.
Note. If you're struggling to get nodes on the vm-small queue, try the development queue.
Note. If you aren’t able to submit a job (see error on the right as an example), follow the steps in this recording to clear cache on TACC.

2. Enter Jupyter
If there are available nodes (picture A), you will be able to enter Jupyter right away. In that case, follow these steps:
- Click on “connect”
- Click on “work”
- Click on “MADR_LingFeatPipelines”
If there are no available nodes (picture B), you will have to wait in a queue until one becomes available.

3. Create a folder
- Click on “new”
- Click on “folder”
- Name the folder. In this tutorial guide, I will use the name InputFolder_Trial
If you cannot see the folder you created, click on “last modified” a couple of times. Sometimes it doesn’t update immediately.

4. Upload input file
- Enter the folder you just created
- Click on “upload” and upload the input files

5. Go to the terminal
- Once the files have been uploaded, click on the “new” dropdown menu
- Click on “terminal”

7. Type the commands
Once you are in the terminal, follow these steps (if running several audio files):
- Type cdw, press enter
- Type cd MADR_LingFeatPipelines, press enter
- Type conda activate spanishPipeline, press enter
- Type or copy the command below, then press enter:
  python spanishLingPipeline.py InputFolder_Trial/ spanishLing_Trial
Keep in mind that the red and green sections change depending on your input and output files. The purple section always remains the same.

IF AN ERROR SHOWS UP:
- If you get the “IsADirectory” error, as in:
  IsADirectoryError: [Errno 21] Is a directory: 'YourFileHere/.ipynb_checkpoints'
  This means that there is a hidden folder that is disrupting the code. To fix this, enter the following into the terminal:
  rm -r YourFileHere/.ipynb_checkpoints
  This will remove the hidden folder.
- If you get the Zero Division error, as in:
  ZeroDivisionError: division by zero
  This means that one (or more) of your files has no text, or too little text, remaining. You will need to go through your files and identify the file(s). It is likely you will need to exclude them from the pipeline.

Note. The parts circled in red, green, and gold change depending on the names of your input and output files.
- InputFolder_Trial is the name of my input folder. Your command will change depending on the name you gave your folder. Remember that the exact spelling and capitalization must match: if the name you gave your folder is all in lowercase, the command in the terminal must be all in lowercase.
- spanishLing_Trial is the name of the output file and output folder. You can change this part of the command (in the terminal) depending on the name you want to give your output file and folder.

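The two errors described above can be checked for before running the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name preflight and the min_words threshold are hypothetical, and it assumes the input files are plain text readable as UTF-8. The pipeline's real cutoff for "too little text" is not documented in this guide, so a file that passes this check may still fail.

```python
import shutil
from pathlib import Path


def preflight(input_dir, min_words=1):
    """Clean an input folder and list files that look too short.

    min_words is a hypothetical threshold chosen for illustration;
    the pipeline's actual cutoff is not known.
    """
    folder = Path(input_dir)
    if not folder.is_dir():
        raise NotADirectoryError(
            f"{input_dir} is not a folder (check spelling and capitalization)"
        )

    # Same effect as: rm -r InputFolder/.ipynb_checkpoints
    checkpoints = folder / ".ipynb_checkpoints"
    removed = checkpoints.is_dir()
    if removed:
        shutil.rmtree(checkpoints)

    # Flag files with fewer than min_words whitespace-separated words,
    # since empty or near-empty files may trigger ZeroDivisionError.
    too_short = []
    for path in sorted(folder.iterdir()):
        if path.is_file():
            words = path.read_text(encoding="utf-8", errors="ignore").split()
            if len(words) < min_words:
                too_short.append(path.name)
    return removed, too_short
```

For example, calling preflight("InputFolder_Trial") from inside MADR_LingFeatPipelines returns whether a hidden checkpoint folder was removed and the names of any files worth inspecting before you run the command.
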
8. Run the python script
After writing your command and pressing enter, wait a few seconds. You will know it has finished running when you see this at the bottom of the terminal (see picture, circled in red).

9. Find output files
Once the files are finished running, follow the next steps:
- Go back to the notebook
- Click on the refresh button (if needed)
- If the generated files did not pop up, click on “last modified”; sometimes it takes a minute to update.
You will see 1 output file in comma-separated values (CSV) format and 1 folder (see picture).
Note. There will be two outputs: a spreadsheet named spanishLing_Trial_LinguisticFeatures.csv and a folder named PosMorphDepTree_spanishLing_Trial.

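If you want to confirm that both outputs exist without relying on the notebook view refreshing, a small check along these lines may help. It is a sketch only: the helper names expected_outputs and missing_outputs are hypothetical, and the naming pattern is inferred from the single example in this guide (spanishLing_Trial), so it may not hold for every run.

```python
from pathlib import Path


def expected_outputs(output_name, base_dir="."):
    """Return the CSV path and folder path expected for an output name.

    The pattern is inferred from this guide's example, not from the
    pipeline's source code.
    """
    base = Path(base_dir)
    csv_path = base / f"{output_name}_LinguisticFeatures.csv"
    tree_dir = base / f"PosMorphDepTree_{output_name}"
    return csv_path, tree_dir


def missing_outputs(output_name, base_dir="."):
    """List the names of expected outputs that are not present yet."""
    csv_path, tree_dir = expected_outputs(output_name, base_dir)
    missing = []
    if not csv_path.is_file():
        missing.append(csv_path.name)
    if not tree_dir.is_dir():
        missing.append(tree_dir.name)
    return missing
```

An empty list from missing_outputs("spanishLing_Trial") would suggest that both the spreadsheet and the folder were written.
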
10. Download output files
- First, we want to download the linguistic features. Check the box with the csv file and click on download (picture A).
- Then, enter the supplementary output folder and download the files one by one (picture B).
If you want/need to create a .zip file of all the output in a given folder, you can use the following code:
  zip -r YourOutputFolderHere.zip YourOutputFolderHere
For example:
  zip -r PosMorphDepTree_AndrewProjOut.zip PosMorphDepTree_AndrewProjOut

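The zip command above is all you need in the terminal. If you happen to be working in a notebook cell instead, roughly the same result can be had with Python's standard library. This is a sketch under that assumption; zip_folder is a hypothetical helper name, not something the pipeline provides.

```python
import shutil
from pathlib import Path


def zip_folder(folder):
    """Create <folder>.zip next to the folder and return its path."""
    folder = Path(folder).resolve()
    if not folder.is_dir():
        raise NotADirectoryError(folder)
    # root_dir is the parent and base_dir the folder name, so the archive
    # keeps the top-level folder, much like zip -r does.
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )
    return Path(archive)
```

For example, zip_folder("PosMorphDepTree_AndrewProjOut") would write PosMorphDepTree_AndrewProjOut.zip alongside the folder, ready to download as a single file.
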
11. Clear cache and log out
- Clear cache by typing the code rm -r ~/.cache/ and pressing enter
- Then, log out from the terminal by typing the command logout (see picture A) and pressing enter
- Go back to the original TACC page and click on “end job” (see picture B).
IT IS VERY IMPORTANT TO END THE JOB AS THE NODES ARE VERY LIMITED. THE JOB WILL KEEP RUNNING UNLESS YOU COMPLETE THIS STEP.

12. Sanity Check REQUIRED for researcher in charge of project
If you are the researcher in charge of the project that uses the data you just ran through the linguistic pipeline, you MUST go through the sanity check process found here.
