LaCour! Generation
December 13, 2023 ยท View on GitHub
Companion code to the arXiv preprint presenting the LaCour! corpus.
If you are looking for the dataset, please visit LaCour! Corpus.
Note
This repo is still a work in progress and will be updated in the coming days!
Installation
For installation with Miniconda:
conda create -n lacour-generation python=3.9
conda activate lacour-generation
git clone https://github.com/trusthlt/lacour-generation.git
cd lacour-generation
pip install -r requirements.txt
Running the scraper
Producing the hearing transcripts and associated documents is divided into several steps. The code for all scrapers is located in scrape.
- Download video files and video information by running
scrape_webcast_videos.py, producesall_webcasts_{date}.json - Find associated files in HUDOC and download them by running
scrape_case_html_matching_webcast.py - Find related press releases in HUDOC and download them by running
scrape_press_releases.py
Downloading the videos
Warning
Due to changes to the webcast website, the scraper for videos no longer works. You can instead skip the first step and download the last scraped file all_webcasts.json
Transcribing the videos
Transcribing a video into a hearing transcript requires several steps, with one manual annotation step. The code for transcription is located in transcribe.
- Diarize the video by running
diarize.py. This requires a huggingface token to access the models of pyannote/speaker_diarization@2.1 - Generate a speaker schedule, clustering the diarization output by running
generate_speaker_schedule.py. This will result in one text file with a speaker schedule per hearing webcast - (MANUAL) Annotate the speaker schedule with the correct tags
- Generate a transcript by passing the annotated speaker schedule with the video to
transcribe_segmented_whisper.py