Petrolês is a public repository of artifacts for Natural Language Processing applications in the petroleum domain in Portuguese.
This repository aims to serve as a reference for artificial intelligence research groups and companies related to the oil and gas sector.

Petrolês is a partnership between the Petrobras Research and Development Center (CENPES), the Applied Computational Intelligence Lab (PUC-Rio/ICA), UFRGS and PUC-RS. It aims to promote research initiatives related to Natural Language Processing and Computational Linguistics.

Available Artifacts

Select a category from the navigation panel below. On each tab, select the desired items by activating their corresponding pills.

Domain Corpora

Domain corpora are provided as combinations of sub-corpora, intended to train specialized Natural Language Processing (NLP) models.

The corpora were preprocessed only to eliminate noise, numeric tokens and special characters.

When citing the Petrolês Corpora in academic papers or theses, please use this BibTeX entry [Download .bib].

Corpora | Description | Sentences | Tokens
IBICT-BDTD | Academic theses and dissertations on petroleum-related subjects, obtained from the Brazilian Digital Library of Theses and Dissertations. | 2,672,927 | 63,424,309
Petrolês - domain-specific | Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins; theses and dissertations from IBICT-BDTD on petroleum-related subjects; ANP's technical reports). | 7,152,493 | 146,996,520
Petrolês - hybrid corpus | Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins; theses and dissertations from IBICT-BDTD on petroleum-related subjects; ANP's technical reports), plus a general-context corpus in Portuguese from NILC. | 49,310,552 | 829,350,869
Petro1 and Petro2 | Gold standard corpora, entirely revised, annotated with lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The corpora are available separately because they were created in different ways, but they can be grouped together. Content is a subset of the Petrolês domain-specific corpus. | 818 | 27,536
PetroTok | Small gold standard corpus, revised only in terms of pre-processing, specifically the sentence segmentation stage. Content is a subset of the Petrolês domain-specific corpus. The corpus does not preserve the order in which sentences appear in the original texts; it is a selection of sentences that can be especially difficult for automatic processing. | 1,139 | 38,472
PetroGold v1 | Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. The material is presented in: de Souza, E., Silveira, A., Cavalcanti, T., Castro, M. C., & Freitas, C. (2021, November). PetroGold – Corpus padrão ouro para o domínio do petróleo. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (pp. 29-38). Available at https://sol.sbc.org.br/index.php/stil/article/view/17781. | 9,127 | 253,640
PetroGold-v2 | Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. The material is described in: de Souza, E., Freitas, C. (2022). Polishing the gold – how much revision do we need in treebanks?. I Universal Dependencies Brazilian Festival (UDFest-BR). Available at: https://aclanthology.org/2022.udfestbr-1.2/. | 8,949 | 250,595
PetroGold-v3 | Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. | 8,946 | 250,605
PetroNer | Gold standard corpus annotated with named entities in the oil & gas domain. It was built from a set of 11 Technical Reports from Petrobras, which are part of the Petrolês corpus [Cordeiro, 2020]. The corpus is described in: Freitas, C., De Souza, E., Castro, M. C., Cavalcanti, T., Ferreira da Silva, P., & Corrêa Cordeiro, F. (2023). Recursos linguísticos para o PLN específico de domínio: o Petrolês. Linguamática, 15(2), 51-68. | 24,035 | 615,418
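
For readers who want to inspect the lemma, POS and dependency annotation of the gold standard treebanks, the sketch below reads the CoNLL-U format, which is the standard file format of the Universal Dependencies project. It assumes the treebanks are distributed as CoNLL-U files; the sample sentence and its annotation are invented for illustration and are not taken from the corpora, and `read_conllu` is a hypothetical helper name.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        current.append({
            "id": int(token_id),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Invented sample sentence, annotated by hand for this example only.
ROWS = [
    ("1", "O", "o", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "poço", "poço", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "produz", "produzir", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "óleo", "óleo", "NOUN", "_", "_", "3", "obj", "_", "_"),
    ("5", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
SAMPLE = "# text = O poço produz óleo.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

for token in read_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

A token whose `head` is 0 is the root of the sentence; every other token points to the `id` of its syntactic head.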
Corpora for Embedding Models

Preprocessed domain corpora are provided as combinations of sub-corpora, intended to train specialized word embedding models. We tested different corpora compositions as training strategies, aiming for the best representation quality for domain-specific vocabulary.

The presented corpora were preprocessed with the following steps: lowercasing; removal of stopwords, diacritics, punctuation and special characters; and replacement of numeric tokens with the tag <TOKEN>.
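
The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the tokenization by whitespace, and the rule for what counts as a numeric token are all placeholders, since the exact choices used to build the corpora are not described here. The function names are hypothetical.

```python
import re
import string
import unicodedata

# Tiny illustrative stopword list (assumption); the list actually used is not given.
STOPWORDS = {"a", "o", "as", "os", "de", "do", "da", "em", "e", "para"}


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (e.g. 'ç' -> 'c')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(sentence, number_tag="<TOKEN>"):
    """Lowercase, drop stopwords/punctuation/special characters, tag numbers."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(string.punctuation)  # punctuation at token edges
        if not word:
            continue
        # Assumed definition of a numeric token: digits with optional . or , groups.
        if re.fullmatch(r"\d+([.,]\d+)*", word):
            tokens.append(number_tag)
            continue
        word = strip_diacritics(word)
        word = re.sub(r"[^a-z0-9]", "", word)  # remaining special characters
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


print(preprocess("A perfuração do poço atingiu 3.500 metros."))
```

Replacing numbers with a single tag keeps the vocabulary small, since individual numeric values rarely carry useful distributional information for embedding training.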

When citing the Petrolês Corpora in academic papers or theses, please use this BibTeX entry [Download .bib].

Corpora | Description | Sentences | Tokens
IBICT-BDTD | Academic theses and dissertations on petroleum-related subjects, obtained from the Brazilian Digital Library of Theses and Dissertations. | 2,558,837 | 37,825,743
Petrolês - domain-specific | Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins; theses and dissertations from IBICT-BDTD on petroleum-related subjects; ANP's technical reports). | 6,295,231 | 85,725,834
Petrolês - hybrid corpus | Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins; theses and dissertations from IBICT-BDTD on petroleum-related subjects; ANP's technical reports), plus a general-context corpus in Portuguese from NILC. | 43,622,972 | 451,021,003
Knowledge Organization Systems - KOS

Knowledge Organization Systems (KOS) are semantically structured schemes that can be applied to information retrieval tasks, comprising terms, definitions, relationships and concepts. They include glossaries, lists, thesauri, taxonomies and ontologies (see more).

KOS | Description
Multiword Expressions | List of multiword expressions (MWEs) for the Oil & Gas domain in Portuguese. It was extracted from the Petrolês specialized corpora using pointwise mutual information (PMI) and manually checked by domain specialists. The file contains nearly 65.2 thousand unique MWEs.
Oil and Gas vocabulary and frequency | List of unique words extracted from the Petrolês domain-specific corpora, comprising 237.9 thousand words and their corresponding occurrence counts in the corpora.
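
To illustrate how pointwise mutual information can surface candidate multiword expressions, the sketch below scores adjacent word pairs with PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is a minimal example restricted to bigrams, with a hypothetical `min_count` threshold; it is not the extraction pipeline used for the published list, whose details are not given here.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)                  # word frequency counts
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy tokenized sentences, invented for illustration.
toy_sentences = [
    ["pre", "sal"],
    ["pre", "sal"],
    ["campo", "de", "petroleo"],
]
for pair, score in bigram_pmi(toy_sentences).items():
    print(pair, round(score, 3))
```

A high PMI means the two words co-occur far more often than their individual frequencies would predict, which is why high-scoring pairs are good candidates for manual review by domain specialists.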
Word Embedding Models - PetroVec

Word embedding models are mathematical representations of vocabulary. Unsupervised learning algorithms encode words as dense vectors in a way that may capture syntactic and semantic properties from their context.

PetroVec is a set of word embedding models pre-trained on a specialized O&G corpus: Petrolês. The models are also available in the PetroVec semantic space projector, an interactive environment where users can test semantic similarities, explore neighborhoods, and generate 2D and 3D PCA and t-SNE projections.
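
Semantic similarity between word vectors is commonly measured with cosine similarity, and a word's neighborhood is the set of words with the highest similarity to it. The sketch below shows both ideas on toy 3-dimensional vectors invented for illustration; they are not PetroVec vectors, and `cosine` and `nearest` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=3):
    """Return the k words most similar to `word`, with their similarity scores."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors (assumption): real models use 100 or 300 dimensions.
toy = {
    "poco": [0.9, 0.1, 0.0],
    "perfuracao": [0.8, 0.2, 0.1],
    "contrato": [0.0, 0.1, 0.9],
}
print(nearest("poco", toy, k=2))
```

Cosine similarity ignores vector length and compares direction only, so frequent and rare words can be compared on the same scale.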

We trained the models with Word2vec after applying the following preprocessing steps: lowercasing; removal of stopwords, punctuation and special characters; and replacement of numeric tokens with the tag <NUMBER>.

Datasets and source code for training and evaluating the models are publicly available in the PetroVec GitHub repository.

The research behind PetroVec's word embedding models is presented in a paper published in the journal Computers in Industry (Elsevier): "Portuguese word embeddings for the oil and gas industry: Development and evaluation".
When citing the PetroVec embedding models in academic papers or theses, please use this BibTeX entry [Download .bib].

Model | Size | Description
Petrovec-O&G (Word2vec) | 100 | Models trained with Word2vec and FastText, with vectors of 100 and 300 dimensions, from public resources related to the O&G domain (Petrobras Technical Bulletins; theses and dissertations on petroleum-related subjects; ANP's technical reports).
Petrovec-O&G (FastText) | 100 | (as above)
Petrovec-O&G (Word2vec) | 300 | (as above)
Petrovec-híbrido (Word2vec) | 100 | Models trained with Word2vec and FastText, with vectors of 100 and 300 dimensions, from hybrid corpora composed of the O&G-specific corpora (Petrobras Technical Bulletins; theses and dissertations on petroleum-related subjects; ANP's technical reports) plus a general-context corpus in Portuguese from NILC.
Petrovec-híbrido (FastText) | 100 | (as above)
Petrovec-híbrido (Word2vec) | 300 | (as above)
Initiatives in Development

This section describes some of the main research initiatives in progress, in collaboration with Petrobras and Universities.

Initiative | Description | Partnership
OCR corrector | Tools for correcting texts extracted with Optical Character Recognition (OCR) methods. | Petrobras, UFRGS
Socrates Corrector | Corrector for texts extracted by OCR from PDFs. | Petrobras, UFRGS
GeoCSV | GeoCSV is a web solution for manually annotating the image dataset and integrating it with the annotation tool Labelweb. | Petrobras, UFRGS
REGIS-system | REGIS - Retrieval Evaluation for Geoscientific Information Systems. Tool for generating a test collection for multimodal information retrieval (paper). | Petrobras, UFRGS
GeoImageOntology | GeoImageOntology - Ontology of Visual Artifacts for the Oil Exploration area. This ontology represents the main forms of representation in figures used in the Exploration chain, such as maps, sections, profiles and diagrams. | Petrobras, UFRGS
Image classification - Geodigital | This project aims to automatically classify images using CNN models. | Petrobras, UFRGS
OCRAnno - OCR text annotation tool | OCRAnno is a textual annotation tool designed to provide annotation data for improving OCR extraction systems. | Petrobras, UFRGS
Labelweb | Labelweb is a web image annotation system that lets users take part in the annotation process, i.e., the assignment of categories to images in the database. | Petrobras, UFRGS
PetroBERT | Initiative for training and evaluating contextual language models in Portuguese specialized in the Oil and Gas domain, based on the BERT architecture and its variations. | Petrobras, UFRGS, ICA/PUC-Rio, PUC-RS, UFF, LNCC
PetroVec | Initiative for training and evaluating word embedding models in Portuguese for the Oil and Gas domain. The products of this project are presented in this paper published in the journal Computers in Industry, Elsevier: "Portuguese word embeddings for the oil and gas industry: Development and evaluation". | Petrobras, UFRGS, PUC-RS
Entity Recognition | Initiative for training Named Entity Recognition models for the Oil and Gas domain in Portuguese. | Petrobras, ICA/PUC-Rio
Tornado - text extractor | A tool for extracting text from PDF documents, using modern computer vision and optical character recognition (OCR) techniques. Tornado is a process and corresponding software tool that relies heavily on machine learning to selectively extract information from PDF files. It identifies individual visual elements on a page, such as blocks of text, figures, charts or tables, in a human-like manner. It automatically selects the best available strategy to process and extract each element. For textual elements, it attempts simple PDF text parsing first and then, if necessary, performs deep-learning-based image enhancement prior to OCR, without human intervention. It is aimed at the efficient parallel processing of large numbers of files, for instance as a tool for building a corpus or as part of a search engine's indexing pipeline. Tornado is tailored for document extraction in the Oil and Gas industry domain. | Petrobras, ICA/PUC-Rio
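
The fallback described for Tornado's textual elements (direct PDF text parsing first, then image enhancement followed by OCR when parsing is not enough) can be sketched as a small control-flow pattern. Everything below is hypothetical: the function names, the `min_chars` threshold, and the stand-in callables are assumptions made for illustration and do not reflect Tornado's actual code or API.

```python
def extract_text(element, parse_text, enhance_image, run_ocr, min_chars=1):
    """Try direct text parsing; fall back to enhancement plus OCR if it yields too little."""
    text = parse_text(element)
    if text is not None and len(text.strip()) >= min_chars:
        return text, "parsed"
    # Parsing failed or returned (almost) nothing: enhance the image, then run OCR.
    return run_ocr(enhance_image(element)), "ocr"


# Stand-in callables (assumptions) so the sketch runs without any PDF or OCR library.
def fake_parse(element):
    return element.get("embedded_text")


def fake_enhance(element):
    return element["pixels"]


def fake_ocr(image):
    return "text recognised from " + image


digital = {"embedded_text": "Boletim técnico", "pixels": "page-1.png"}
scanned = {"embedded_text": None, "pixels": "page-2.png"}

print(extract_text(digital, fake_parse, fake_enhance, fake_ocr))
print(extract_text(scanned, fake_parse, fake_enhance, fake_ocr))
```

Passing the strategies in as callables keeps the decision logic independent of any particular PDF parser or OCR engine, which would also make each element easy to process in parallel.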



Publications

De Souza, Elvis

Master's thesis. PUC-Rio, 2023.

DE SOUZA, Elvis & FREITAS, Cláudia

Proceedings of the Universal Dependencies Brazilian Festival, p. 1–11, Fortaleza, Brazil. Association for Computational Linguistics, 2022.

de Souza, E., Silveira, A., Cavalcanti, T., Castro, M. C., & Freitas, C.

Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pp. 29-38, 2021.

Cavalcanti, T., Silveira, A., de Souza, E., & Freitas, C.

Revista Brasileira De Iniciação Científica, 8, e021033, 2021.

Diogo Gomes (Petrobras), Fábio Cordeiro (Petrobras), Bernardo Consoli, Nikolas Santos, Viviane Moreira, Renata Vieira, Silvia Moraes, Alexandre Evsukoff

Computers in Industry, Elsevier. Volume 124, 2021. ISSN 0166-3615

Lucas Lima de Oliveira, Regis Kruel Romeu, Viviane Pereira Moreira

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021

Bernardo Consoli, Joaquim Santos, Diogo Gomes, Fabio Cordeiro, Renata Vieira, Viviane Moreira

Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, France. 2020

Fábio Cordeiro (Petrobras)

Final monograph for the Business Intelligence Master specialization course, 2020

https://doi.org/10.48072/2525-7579.rog.2020.387

de Araújo, J.C.C., Freitas, C., Pacheco, M.A.C., Forero-Mendoza, L.A.

Computational Processing of the Portuguese Language, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 281–290, 2020.

Alexandre Tessarollo (Petrobras), Alexandre Rademaker.

LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

Fábio Cordeiro (Petrobras), Diogo Gomes (Petrobras), Flavio Gomes (Petrobras) and Renata Texeira (Petrobras).

OTC Brazil 2019

Alexandre Rademaker, Bruno Cuconato, Henrique Muniz, Alexandre Tessarollo (Petrobras).

Proceedings of the 10th Global Wordnet Conference, 2019

Aline Silveira (PUC-Rio), Elvis de Souza (PUC-Rio), Tatiana Cavalcanti (PUC-Rio), Cláudia Freitas (PUC-Rio)

VI Workshop de Iniciação Científica em Tecnologia da Informação e da Linguagem Humana (VI TILic), pp. 391-394. October 15-18, Salvador, Bahia, Brazil, 2019

Mara Abel, Eduardo Simões Lopes Gastal, Cassiana Roberta Lizzoni Michelin, Luiza Gonçalves Maggi, Bruno Eduardo Firnkes, Felix Eduardo Huaroto Pachas and Renata dos Santos Alvarenga (UFRGS)

Ontobras - Seminário de pesquisa em ontologias no Brasil 2019

Alexandre Rademaker, Alexandre Tessarollo (Petrobras), Henrique Muniz, Adam Pease.

CEUR Workshop, 2019

João Marcos Correia Marques, Fabio Gagliardi Cozman, Ismael Humberto Ferreira dos Santos

8th Brazilian Conference on Intelligent Systems (BRACIS), 2019

Diogo Gomes (Petrobras), Fábio Cordeiro (Petrobras) and Alexandre Evsukoff (UFRJ)

Rio O&G 2018

Team

Petrobras

Regis Kruel Romeu
Fábio Corrêa Cordeiro
Diogo Magalhães
Claudio Marcos Ziglio
Antônio Marcelo Azevedo Alexandre
Max de Castro Rodrigues
Vitor Alcantara Batista
Eugenio Pacelli Ferreira Dias Junior
Luciana Santana



ICA/PUC-Rio

Aline da Silveira Matos
Cristian Munoz
Eleonora Cominato Weiner
Elvis Alves de Souza
Evelyn Batista
Jose Ruiz
Leonardo Mendonza
Marco Aurelio C.
Maria Cláudia de Freitas
Renato Sayão
Tatiana Cavalcanti



UFRGS

Viviane Pereira Moreira
Danny Suarez Vargas
Lucas Lima de Oliveira
Gabriel Vogel Pinto



PUC-RS

Renata Vieira
Sílvia Maria Wanderley Moraes
Bernardo Scapini Consoli
Nikolas Lacerda Santos
Nathan Schneider Gavenski



COPPE/UFRJ - LAMCE and NTT

Alexandre G. Evsukoff