Pattern Recognition and Human Language Technology

Pattern Recognition and Human Language Technology Research Center
Centro de Investigación en Reconocimiento de Patrones y Tecnología del Lenguaje Humano
Abbreviation	PRHLT
Type	University research centre
Purpose	Research in pattern recognition and human language technology
Headquarters	Valencia, Spain
Parent organization	Universitat Politècnica de València
Website	www.prhlt.upv.es

The Pattern Recognition and Human Language Technology Research Center (PRHLT) is a research centre of the Universitat Politècnica de València (UPV) in Valencia, Spain.^[1] Its research covers pattern recognition, machine learning and human language technology, including machine translation, speech recognition, handwritten text recognition, document analysis, information retrieval and the computational analysis of written and spoken language.

Researchers associated with PRHLT have participated in European and Spanish research projects concerning interactive machine translation, speech transcription and the recognition and retrieval of historical handwritten documents. The centre contributed research on handwriting recognition, keyword spotting and document indexing to European projects that preceded and supported the development of Transkribus.^[2]

History

PRHLT developed from research conducted within the Department of Computer Systems and Computation at the Universitat Politècnica de València. Its work brought together research in statistical pattern recognition, speech and language processing, machine translation and document-image analysis.

Francisco Casacuberta and Enrique Vidal were among the researchers who led work in these fields at UPV. Their research included grammatical inference, finite-state models, speech translation, computer-assisted translation and handwritten-text recognition.

Vidal received the Aritmel National Informatics Award in 2011 for contributions to pattern recognition, grammatical inference and language technologies.^[3] Casacuberta received the José García Santesmases National Informatics Award in 2022 for his research career in areas including pattern recognition, machine learning and machine translation.^[4]

Research

Machine translation

PRHLT researchers have worked on statistical and neural machine translation, speech translation, language modelling and interactive computer-assisted translation. Part of this research has examined systems in which a machine translation model responds to corrections supplied by a human translator and updates its predictions during the translation process.^[5]
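
The interaction loop described above can be pictured with a deliberately small sketch. It assumes the system already holds a fixed list of scored candidate translations (the hypothetical `N_BEST` below) and, each time the translator validates or types a prefix, proposes the continuation of the best candidate consistent with it. Real interactive systems search the model's hypothesis space rather than a fixed list, and all names and scores here are invented for illustration.

```python
# Hypothetical n-best list for one source sentence: (score, translation).
# The sentences and scores are made up for this example.
N_BEST = [
    (0.50, "click on the left button"),
    (0.30, "press the left button"),
    (0.20, "press the left key"),
]


def suggest_suffix(n_best, validated_prefix):
    """Return the continuation of the best-scoring candidate that starts with
    the prefix the translator has accepted or typed, or None if none matches."""
    matching = [
        (score, text) for score, text in n_best if text.startswith(validated_prefix)
    ]
    if not matching:
        return None
    _, best = max(matching)  # highest score among the consistent candidates
    return best[len(validated_prefix):]
```

With an empty prefix the sketch proposes the top candidate in full; once the translator types "press", the proposal switches to the best candidate that begins with that word.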

Earlier work by Casacuberta and Vidal applied stochastic finite-state transducers to machine translation. In this approach, translation is modelled as a probabilistic transformation between sequences in a source and a target language.^[6]
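
As a minimal sketch of the idea, and not the inferred models used in that work, a stochastic finite-state transducer can be written as a table of transitions that each read one source word, emit zero or more target words and carry a probability. The states, words and probabilities below are invented, and the function simply enumerates every path for the input and keeps the most probable one, which only approximates the most probable translation when several paths produce the same output.

```python
# Toy transducer: state -> source word -> list of
# (emitted target words, next state, probability). All values are hypothetical.
TRANSITIONS = {
    0: {"una": [(("a",), 1, 1.0)]},
    1: {"habitación": [(("room",), 2, 0.3), ((), 3, 0.7)]},  # () delays the output
    2: {"doble": [(("with", "two", "beds"), 4, 1.0)]},
    3: {"doble": [(("double", "room"), 4, 1.0)]},
}
FINAL_STATES = {4}


def best_translation(source_words):
    """Return (probability, target words) for the most probable accepting path,
    or None if the transducer cannot read the whole input."""
    # Each hypothesis is (current state, probability so far, output so far).
    hypotheses = [(0, 1.0, ())]
    for word in source_words:
        expanded = []
        for state, prob, output in hypotheses:
            for emitted, next_state, p in TRANSITIONS.get(state, {}).get(word, []):
                expanded.append((next_state, prob * p, output + emitted))
        hypotheses = expanded
    accepted = [(prob, out) for state, prob, out in hypotheses if state in FINAL_STATES]
    return max(accepted) if accepted else None
```

Calling `best_translation(["una", "habitación", "doble"])` selects the path that delays its output until "doble" has been read, which lets the toy model reorder the adjective and the noun in the target sentence.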

The centre participated in “CasMaCat”, a European research project that developed an open-source workbench for interactive and adaptive computer-assisted translation. The project investigated online learning, confidence estimation, active learning and interaction between translators and machine translation systems.^[7]^[8]^[9]

PRHLT researchers also developed resources for multilingual speech translation. The Europarl-ST corpus contains aligned parliamentary audio, transcriptions and translations in several European languages and was designed for research into speech translation and multilingual speech processing.^[10]

Handwritten-text recognition and document analysis

A major area of research at PRHLT is the automatic recognition, transcription, indexing and retrieval of handwritten historical documents. Historical manuscripts present difficulties for automated recognition because of differences in handwriting, spelling, document layout, image quality and the physical condition of the source material.

Researchers at the centre developed methods for probabilistic indexing and keyword spotting in document images. Rather than requiring an exact transcription of every page, probabilistic indexing estimates where words or sequences of words may occur and assigns probabilities to alternative readings.^[11]
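
A rough way to picture such an index, offered as an illustration and not as the centre's implementation, is a list of hypothesised word occurrences, each with a page, a reading and a probability, queried with a confidence threshold. The schema, entries and default threshold below are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    """One hypothesised occurrence of a word in a page image (hypothetical schema)."""
    page: str
    word: str
    probability: float


# Toy index: one image region may yield several alternative readings,
# each with its own probability. All values are invented.
INDEX = [
    Spot("page_001", "galeon", 0.82),
    Spot("page_001", "galera", 0.11),
    Spot("page_002", "galeon", 0.35),
    Spot("page_003", "naufragio", 0.91),
]


def search(index, query, threshold=0.5):
    """Return the spots whose reading equals the query with probability at or
    above the threshold, most confident first."""
    wanted = query.lower()
    hits = [s for s in index if s.word == wanted and s.probability >= threshold]
    return sorted(hits, key=lambda s: s.probability, reverse=True)
```

Lowering the threshold returns more candidate pages at the cost of more false positives, which is the trade-off a searcher controls when no exact transcription exists.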

These methods were applied in the Carabela project, undertaken by PRHLT and the Centre for Underwater Archaeology of the Andalusian Institute of Historical Heritage. The project processed digitised collections concerning Spanish maritime activity and shipwrecks between the fifteenth and nineteenth centuries. It was designed to allow researchers to search manuscript images for names, locations and other expressions without first requiring a complete manual transcription of the documents.^[12]

PRHLT participated in tranScriptorium, a European project focused on the indexing, searching and transcription of historical handwritten document images.^[13] It subsequently participated in READ (Recognition and Enrichment of Archival Documents), which extended handwriting-recognition and document-analysis technology and supported the development of Transkribus.^[14]

Transkribus was developed through a multi-institutional European research programme and is maintained by READ-COOP. PRHLT’s work formed one part of that broader collaboration, particularly in handwritten-text recognition, keyword spotting and probabilistic document indexing.

Speech and multimodal processing

The centre has conducted research in automatic speech recognition, speech translation and multimodal human–computer interaction. This work has included systems combining text, speech, pen input and visual information in interactive recognition and translation workflows.^[15]

PRHLT led and coordinated “transLectures”,^[16] a European project that investigated the automatic transcription and translation of recorded lectures. The project developed methods for adapting speech-recognition and machine-translation systems to educational recordings and different subject domains.

Natural-language processing

Research at PRHLT has also addressed text classification, authorship analysis, plagiarism detection, information retrieval, question answering and the identification of harmful or deceptive online content.

Researchers associated with the centre have participated in multilingual evaluation campaigns concerning hate speech, author profiling, irony, humour and misinformation. Some of this work has been conducted through the Iberifier and Iberifier Plus projects, which study digital media and disinformation in Spain and Portugal.^[17]^[18]^[19]

European research projects

Selected European projects involving PRHLT include:

“CasMaCat”, concerning interactive and adaptive computer-assisted translation;
“transLectures”, concerning the transcription and translation of recorded lectures;
“tranScriptorium”, concerning the recognition, retrieval and transcription of historical manuscripts;
“READ”, concerning the recognition and enrichment of archival documents and the development of technology used in Transkribus;
“Iberifier” and “Iberifier Plus”, concerning the analysis of digital media and disinformation in Spain and Portugal.

These projects were conducted by multi-institutional consortia under different European research programmes.

Technology transfer

Research undertaken by members of PRHLT has been transferred through software, research demonstrators, collaborations with public and private organisations and university spin-off companies.

In 2020, researchers associated with the centre established tranSkriptorium AI, a spin-off of the Universitat Politècnica de València. The company develops systems for transcribing, indexing and extracting information from digitised handwritten, printed and typewritten documents.^[20]

Selected publications

Casacuberta, Francisco; Vidal, Enrique (2004). “Machine translation with inferred stochastic finite-state transducers”. Computational Linguistics. 30 (2): 205–225. doi:10.1162/089120104323093294.
Vidal, Enrique; Casacuberta, Francisco; Rodríguez, Luis; Civera, Jorge; Martínez-Hinarejos, Carlos D. (2006). “Computer-assisted translation using speech recognition”. IEEE Transactions on Audio, Speech, and Language Processing. 14 (3): 941–951.
Toselli, Alejandro H.; Pastor, Moisés; Romero, Verónica; Vidal, Enrique (2007). “Computer assisted transcription of handwritten text”. Proceedings of the International Conference on Document Analysis and Recognition. pp. 944–948.
Casacuberta, Francisco; Civera, Jorge; Cubel, Elsa; Lagarda, Antonio L.; Lapalme, Guy; Macklovitch, Elliott; Vidal, Enrique (2009). “Human interaction for high-quality machine translation”. Communications of the ACM. 52 (10): 135–138. doi:10.1145/1562764.1562798.
Toselli, Alejandro H.; Romero, Verónica; Pastor, Moisés; Vidal, Enrique (2010). “Multimodal interactive transcription of text images”. Pattern Recognition. 43 (5): 1814–1825.
Romero, Verónica; Toselli, Alejandro H.; Vidal, Enrique (2012). Multimodal Interactive Handwritten Text Transcription. World Scientific.
Sánchez, Joan Andreu; Romero, Verónica; Toselli, Alejandro H.; Villegas, Mauricio; Vidal, Enrique (2019). “A set of benchmarks for handwritten text recognition on historical documents”. Pattern Recognition. 94: 122–134.
Iranzo-Sánchez, Javier; et al. (2020). “Europarl-ST: A multilingual corpus for speech translation of parliamentary debates”. Proceedings of ICASSP 2020. pp. 8229–8233.
Vidal, Enrique; Romero, Verónica; Toselli, Alejandro H.; Sánchez, Joan Andreu; Bosch, Vicente; Quirós, Lorenzo; et al. (2020). “The Carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification”. Proceedings of the International Conference on Frontiers in Handwriting Recognition. pp. 85–90.

References

[1] "Pattern Recognition and Human Language Technology Research Center". UPV Innovación. Retrieved 21 June 2026.
[2] "Machine learning and big data are unlocking Europe's archives | Horizon Magazine". projects.research-and-innovation.ec.europa.eu. 10 December 2020. Retrieved 21 June 2026.
[3] "El catedrático de la UPV Enrique Vidal, Premio Nacional de Informática 2011". www.20minutos.es - Últimas Noticia (in Spanish). 22 September 2011. Retrieved 21 June 2026.
[4] "El catedrático de la UPV Casacuberta recibe el Premio Nacional de Informática". Valencia Plaza (in Spanish). Retrieved 21 June 2026.
[5] Casacuberta, Francisco; Civera, Jorge; Cubel, Elsa; Lagarda, Antonio L.; Lapalme, Guy; Macklovitch, Elliott; Vidal, Enrique (1 October 2009). "Human interaction for high-quality machine translation". Communications of the Association for Computing Machinery. 52 (10): 135–138. doi:10.1145/1562764.1562798. ISSN 0001-0782.
[6] Casacuberta, Francisco; Vidal, Enrique (June 2004). "Machine Translation with Inferred Stochastic Finite-State Transducers". Computational Linguistics. 30 (2): 205–225. doi:10.1162/089120104323093294. ISSN 0891-2017.
[7] Alabau, Vicent; Buck, Christian; Carl, Michael; Casacuberta, Francisco; García-Martínez, Mercedes; Germann, Ulrich; González-Rubio, Jesús; Hill, Robin; Koehn, Philipp; Leiva, Luis; Mesa-Lao, Bartolomé; Ortiz-Martínez, Daniel; Saint-Amand, Herve; Sanchis Trilles, Germán; Tsoukala, Chara (April 2014). Wintner, Shuly; Tadić, Marko; Babych, Bogdan (eds.). "CASMACAT: A Computer-assisted Translation Workbench". Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: Association for Computational Linguistics: 25–28. doi:10.3115/v1/E14-2007.
[8] "Pangeanic collaborates with machine translation post-editing EU program CASMACAT". blog.pangeanic.com. Retrieved 21 June 2026.
[9] "Cognitive Analysis and Statistical Methods for Advanced Computer Aided Translation". CORDIS | European Commission (in Spanish). Retrieved 21 June 2026.
[10] Iranzo-Sanchez, Javier; Silvestre-Cerda, Joan Albert; Jorge, Javier; Rosello, Nahuel; Gimenez, Adria; Sanchis, Albert; Civera, Jorge; Juan, Alfons (May 2020). Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. IEEE. pp. 8229–8233. doi:10.1109/ICASSP40776.2020.9054626. ISBN 978-1-5090-6631-5.
[11] Vidal, E.; Romero, Verónica; Toselli, A. H.; Sánchez, J.A.; Bosch, V.; Quirós, L.; Benedí, José Miguel; Prieto, Jose Ramón; Pastor, Moisés; Casacuberta, F.; Alonso, C.; García, C.; Márquez, L.; Orcero, C. (2020). "The Carabela Project and Manuscript Collection: Large-Scale Probabilistic Indexing and Content-based Classification". 17th International Conference on Frontiers in Handwriting Recognition (ICFHR): 85–90. doi:10.1109/ICFHR2020.2020.00026.
[12] Cañas, Jesús A. (31 October 2019). "Así funciona la 'piedra de Rosetta' de los legajos". El País (in Spanish). ISSN 1134-6582. Retrieved 21 June 2026.
[13] "tranScriptorium | tranScriptorium | Project | Fact Sheet | FP7". CORDIS | European Commission. Retrieved 21 June 2026.
[14] "Recognition and Enrichment of Archival Documents | H2020". CORDIS | European Commission. Retrieved 21 June 2026.
[15] Vidal, Enrique; Rodríguez, Luis; Casacuberta, Francisco; García-Varea, Ismael (2008), "Interactive Pattern Recognition", in Popescu-Belis, Andrei; Renals, Steve; Bourlard, Hervé (eds.), Machine Learning for Multimodal Interaction, vol. 4892, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 60–71, doi:10.1007/978-3-540-78155-4_6, ISBN 978-3-540-78154-7, retrieved 21 June 2026
[16] "Transcription and Translation of Video Lectures | FP7". CORDIS | European Commission. Retrieved 21 June 2026.
[17] "Iberifier | Iberian Digital Media Observatory". iberifier.eu. Retrieved 21 June 2026.
[18] "Paolo Rosso". scholar.google.com. Retrieved 21 June 2026.
[19] "NLP applications at UPV to fight against disinformation and sexism". ELLIS Alicante. Retrieved 21 June 2026.
[20] "La innovacion es cosa de empresas grandes, pequeñas y de todos los sectores", Como gestionar la innovacion en las PYMES, Netbiblo, pp. 39–82, retrieved 21 June 2026
