SmartWeb Video Corpus (SVC) – META-SHARE

Last view: 2025-07-01

Last update: 2013-06-26

SmartWeb Video Corpus (SVC)

View resource name in all available languages

Corpus SVC (SmartWeb Video Corpus)

http://catalog.elra.info/product_info.php?products_id=1070

ID:

ELRA-S0280

The SMARTWEB UMTS data collection was created within the publicly funded German SmartWeb project in the years 2004-2006. It comprises a collection of user queries to a spoken natural-language Web interface, with the main focus on the 2006 soccer World Cup. The recordings include field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC, ref. ELRA-S0278), mobile recordings performed on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC, ref. ELRA-S0279), as well as field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC, ref. ELRA-S0280).

This multimodal corpus corresponds to the video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus). It contains 99 recordings, each containing a human-human-machine dialogue: one speaker (the one being recorded) interacts with a human partner as well as with a dialogue system via a smart phone (SmartWeb system).

The speaker uses a client-server based dialogue system (SmartWeb) for spoken access to Internet contents in a natural environment (office, hallway, street, park, cafe, etc.). Speech was captured over a Bluetooth headset and transferred via a UMTS cellular line to the server; a second, collar-attached microphone was recorded on a portable iRiver recorder to yield an undisturbed, high-quality reference signal. The face of the speaker was captured by the built-in face camera of the smart phone. The speech signal was segmented into queries (automatically, by the prompting system) and a second time manually into turns, and then transcribed according to the Verbmobil transliteration standard. The video signal was labelled manually as OnView / OffView (depending on whether or not the speaker is looking at the camera) and, in part, spatially segmented for face detection.
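The format list for this corpus gives the Bluetooth/UMTS channel as "ALAW 8kHz 8bit". For readers who want to compare that channel with the 16-bit collar microphone signal, the following is a minimal sketch of A-law expansion to 16-bit linear PCM. It assumes the files hold standard ITU-T G.711 A-law samples with no header; the record does not confirm this, and the function names are illustrative only.

```python
import struct


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code word to a signed 16-bit sample."""
    code ^= 0x55                       # undo the even-bit inversion used on the wire
    magnitude = (code & 0x0F) << 4     # 4-bit mantissa
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


# Precompute all 256 code words once; decoding is then a table lookup.
_ALAW_TABLE = [alaw_to_linear(c) for c in range(256)]


def decode_alaw(data: bytes) -> bytes:
    """Decode raw A-law bytes (assumed headerless) to little-endian 16-bit PCM."""
    samples = [_ALAW_TABLE[b] for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

The decoded bytes could then be written with the standard `wave` module at 8000 Hz, 2 bytes per sample; the channel count is an assumption to check against the delivered files.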

The motivation for this corpus was to capture realistic multimodal (speech + face) data in a realistic human-machine interaction, and to capture as many OffTalk situations as possible (OffTalk being all speech uttered by the speaker that is not intended as input to the system).

The corpus contains:
- number of dialogues / recorded speakers: 99
- number of segmented turns: 2,218
- total duration: 971 minutes
- formats:
  o collar mic: WAV, 44.1 kHz, 16 bit
  o Bluetooth/UMTS channel: ALAW, 8 kHz, 8 bit
  o video: 176x144, 24 bpp, 15 fps, 3GPP + MPEG1
  o Verbmobil Transliteration (TRS), BAS Partitur Format (BPF), ATLAS Annotation Graph (XML)
  o meta data: speaker and recording protocol (XML)
- segmentation: automatic segmentation into input queries by the prompting system; manual segmentation into turns; OffTalk labelling; OffView labelling; spatial segmentation of the face (partly manual)
- distribution: 5 DVD-R
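As a rough orientation, the figures in the list above can be turned into per-dialogue averages and raw audio volumes. The sketch below assumes mono audio and assumes that the 971 minutes apply in full to each audio channel; neither is stated in the record, so the byte counts are estimates, not the actual size of the distribution.

```python
# Figures quoted in the corpus description.
N_DIALOGUES = 99
N_TURNS = 2218
TOTAL_MINUTES = 971


def bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Raw (headerless) data rate of a PCM-like stream; mono is an assumption."""
    return sample_rate_hz * bits_per_sample // 8 * channels


def corpus_summary() -> dict:
    total_seconds = TOTAL_MINUTES * 60
    return {
        "minutes_per_dialogue": TOTAL_MINUTES / N_DIALOGUES,
        "turns_per_dialogue": N_TURNS / N_DIALOGUES,
        # Collar microphone: WAV, 44.1 kHz, 16 bit.
        "collar_mic_bytes": bytes_per_second(44100, 16) * total_seconds,
        # Bluetooth/UMTS channel: A-law, 8 kHz, 8 bit.
        "umts_bytes": bytes_per_second(8000, 8) * total_seconds,
    }


if __name__ == "__main__":
    for key, value in corpus_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions a dialogue lasts a little under ten minutes on average and contains about 22 turns, and the collar microphone audio comes to roughly 5.1 GB.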

See also ELRA-S0278 and ELRA-S0279.

View resource description in all available languages

The SMARTWEB UMTS data collection was produced within the SmartWeb project, funded by the German government from 2004 to 2006. It comprises a collection of user questions put to a natural-speech web interface, with the 2006 soccer World Cup as its main topic. The collection includes field recordings made with a hand-held UMTS device (one person, SHC-SmartWeb Handheld corpus, ref. ELRA-S0278), recordings made via mobile phones on a BMW motorbike (one speaker, SMC-SmartWeb Motorbike corpus, ref. ELRA-S0279), as well as field recordings with video capture of a primary speaker and a secondary speaker (SVC-SmartWeb Video corpus, ref. ELRA-S0280).

This multimodal corpus corresponds to the video captures of a primary speaker and a secondary speaker (SmartWeb Video) and contains 99 recordings of person-person-machine dialogues: one speaker (who is recorded) interacts with a human partner as well as with a dialogue system via a smartphone (SmartWeb system).

The speaker uses a client-server dialogue system (SmartWeb) for voice access to Internet content in a natural environment (office, entrance hall, street, park, bar, etc.). Speech was recorded via a Bluetooth headset and transferred to the server over a UMTS cellular line; a second, collar microphone was used with a portable iRiver recorder to produce a distortion-free, high-quality reference signal. The speaker's face was filmed by the smartphone camera. The speech signal was first segmented into queries (automatically, by the prompting system) and a second time into speech turns, then transcribed according to the Verbmobil transcription standard. The video signal was labelled manually into OnView / OffView situations (depending on whether or not the speaker is looking at the camera) and, in part, spatially segmented for face detection.

The driving objective in building this corpus was to collect realistic multimodal data (speech + face) in a realistic human interaction, and also to collect it in as many OffTalk situations as possible (OffTalk being all speech uttered by the speaker that is not intended to be used as input to the system).

The corpus comprises:
- number of dialogues / recorded speakers: 99
- number of segmented turns: 2,218
- total duration: 971 minutes
- formats:
  o collar mic: WAV, 44.1 kHz, 16 bit
  o Bluetooth/UMTS channel: ALAW, 8 kHz, 8 bit
  o video: 176x144, 24 bpp, 15 fps, 3GPP + MPEG1
  o Verbmobil transcription (TRS), BAS Partitur Format (BPF), ATLAS annotation graph (XML)
  o meta data: speaker and recording protocol (XML)
- segmentation: automatic segmentation into system input queries by the prompting system; manual segmentation into speech turns; OffTalk labelling; OffView labelling; spatial segmentation of the face (partly manual)
- distribution: 5 DVD-R

See also ELRA-S0278 and ELRA-S0279.

Distribution

Availability

Available - Restricted Use

Start date: 07/11/2008

Licence

- ELRA VAR; Restrictions: Commercial Use; For Members of ELRA; User Nature: Academic
- ELRA END USER; Restrictions: Academic - Non Commercial Use; For Non Members of ELRA; User Nature: Academic
- ELRA VAR; Restrictions: Commercial Use; For Non Members of ELRA; User Nature: Academic
- ELRA END USER; Restrictions: Academic - Non Commercial Use; For Non Members of ELRA; User Nature: Commercial
- ELRA VAR; Restrictions: Commercial Use; For Non Members of ELRA; User Nature: Commercial
- ELRA END USER; Restrictions: Academic - Non Commercial Use; For Members of ELRA; User Nature: Academic
- ELRA END USER; Restrictions: Academic - Non Commercial Use; For Members of ELRA; User Nature: Commercial
- ELRA VAR; Restrictions: Commercial Use; For Members of ELRA; User Nature: Commercial

Contact Person

Mapelli Valérie

audio
video

Monolingual audio corpus

Languages

German

Linguality

Linguality type: Monolingual

Size

no size available

Monolingual video corpus

Languages

German

Linguality

Linguality type: Monolingual

Size

no size available

Resource Creation

Creation ended: 01/01/2006

Funding Project

SmartWeb

Funding Type: National Funds

Metadata

Created: 05/12/2005

Version

Version: 1.0

Last Updated: 07/11/2008

Usage

Actual Use - NLP Applications

Use NLP Specific: Speech Recognition
