Petrolês

Petrolês is a public repository of artifacts for Natural Language Processing applications in the petroleum domain in Portuguese.
This repository aims to serve as a reference for artificial intelligence research groups and companies related to the oil and gas sector.

Petrolês is a partnership between the Petrobras Research and Development Center (CENPES), the Applied Computational Intelligence Lab (PUC-Rio/ICA), UFRGS and PUC-RS, and aims to promote research initiatives related to Natural Language Processing and Computational Linguistics.

Available Artifacts

Select a category from the navigation panel below. On each tab, select the desired items by activating their corresponding pills.

Domain Corpora

Domain corpora are provided as combinations of sub-corpora, intended to train specialized Natural Language Processing (NLP) models.

The corpora were preprocessed only to eliminate noise, numeric tokens and special characters.

When citing the Petrolês corpora in academic papers or theses, please use this BibTeX entry [Download .bib].

Corpus	Description	Sentences	Tokens
IBICT-BDTD	Academic theses and dissertations on petroleum-related subjects, obtained from the Brazilian Digital Library of Theses and Dissertations.	2,672,927	63,424,309
Petrolês - domain-specific	Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, theses and dissertations from IBICT-BDTD on petroleum-related subjects, and ANP's technical reports).	7,152,493	146,996,520
Petrolês - hybrid corpus	Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, theses and dissertations from IBICT-BDTD on petroleum-related subjects, and ANP's technical reports), plus a general-context corpus in Portuguese from NILC.	49,310,552	829,350,869
Petro1 and Petro2	Gold standard corpora, fully revised, annotated with lemma, POS and syntactic dependency information according to the framework of the Universal Dependencies project. The corpora are available separately because they were created in different ways, but they can be grouped together. The content is a subset of the Petrolês domain-specific corpus.	818	27,536
PetroTok	Small gold standard corpus, revised only in terms of pre-processing, specifically the sentence segmentation stage. The content is a subset of the Petrolês domain-specific corpus. The corpus does not contain sentences in the order in which they appear in the original texts, but a selection of sentences that can be especially difficult for automatic processing.	1,139	38,472
PetroGold-v1	Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. The material is described in: de Souza, E., Silveira, A., Cavalcanti, T., Castro, M. C., & Freitas, C. (2021, November). PetroGold – Corpus padrão ouro para o domínio do petróleo. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (pp. 29-38). Available at: https://sol.sbc.org.br/index.php/stil/article/view/17781.	9,127	253,640
PetroGold-v2	Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. The material is described in: de Souza, E., Freitas, C. (2022). Polishing the gold – how much revision do we need in treebanks? I Universal Dependencies Brazilian Festival (UDFest-BR). Available at: https://aclanthology.org/2022.udfestbr-1.2/.	8,949	250,595
PetroGold-v3	Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus.	8,946	250,605
PetroNer	Gold standard corpus annotated with named entities in the oil & gas domain. It was built from a set of 11 Technical Reports from Petrobras, which are part of the Petrolês corpus [Cordeiro, 2020]. The corpus is described in: Freitas, C., De Souza, E., Castro, M. C., Cavalcanti, T., Ferreira da Silva, P., & Corrêa Cordeiro, F. (2023). Recursos linguísticos para o PLN específico de domínio: o Petrolês. Linguamática, 15(2), 51-68.	24,035	615,418
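
The gold standard treebanks listed above are annotated with lemma, POS and syntactic dependencies following Universal Dependencies. As a minimal sketch, assuming the files are distributed in the CoNLL-U format that Universal Dependencies defines (the page does not state the file format), a treebank could be read as follows. The `Token` class, the `read_conllu` helper and the sample sentence are hypothetical and not taken from the corpus.

```python
from dataclasses import dataclass

# The ten tab-separated CoNLL-U columns defined by Universal Dependencies.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str


def read_conllu(text):
    """Return a list of sentences, each one a list of Token objects."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = dict(zip(FIELDS, line.split("\t")))
        if not cols["id"].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(Token(int(cols["id"]), cols["form"], cols["lemma"],
                             cols["upos"], int(cols["head"]), cols["deprel"]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sentence, invented for illustration only.
rows = [
    ("1", "O", "o", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "poço", "poço", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "produz", "produzir", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "óleo", "óleo", "NOUN", "_", "_", "3", "obj", "_", "SpaceAfter=No"),
    ("5", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
sample = "# text = O poço produz óleo.\n" + "\n".join("\t".join(r) for r in rows) + "\n"

sentences = read_conllu(sample)
print([(t.form, t.upos, t.deprel) for t in sentences[0]])
```

The sketch keeps only six of the ten columns. A dedicated CoNLL-U library would be a better choice for real work with the treebanks.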

Corpora for Embedding Models

Preprocessed domain corpora are provided as combinations of sub-corpora, intended to train specialized word embedding models. We tested different corpus compositions as training strategies, aiming for the best representation quality for domain-specific vocabulary.

These corpora were preprocessed with the following steps: lowercasing; removal of stopwords, diacritics, punctuation and special characters; and replacement of numeric tokens with the tag <TOKEN>.
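
The preprocessing steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the stopword list is a tiny hypothetical sample, the tokenization regex is a simplification, and the `preprocess` function name is invented for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list (already without diacritics).
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "em", "e"}

# Tag used for numeric tokens, as stated in the text above.
NUMBER_TAG = "<TOKEN>"


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (e.g. 'ç' -> 'c')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(sentence, number_tag=NUMBER_TAG):
    """Lowercase, strip diacritics, drop punctuation and stopwords, tag numbers."""
    text = strip_diacritics(sentence.lower())
    # Keep only numbers (with optional . or , separators) and plain words;
    # punctuation and special characters are discarded by not matching them.
    raw_tokens = re.findall(r"\d+(?:[.,]\d+)*|[a-z]+", text)
    tokens = []
    for tok in raw_tokens:
        if tok[0].isdigit():
            tokens.append(number_tag)
        elif tok not in STOPWORDS:
            tokens.append(tok)
    return tokens


result = preprocess("A produção de petróleo atingiu 2.672 barris em 2020!")
print(result)
```

Stripping diacritics before the stopword lookup means the stopword list itself has to be stored without diacritics, which is why the sample list above is plain ASCII.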

When citing the Petrolês corpora in academic papers or theses, please use this BibTeX entry [Download .bib].

Corpus	Description	Sentences	Tokens
IBICT-BDTD	Academic theses and dissertations on petroleum-related subjects, obtained from the Brazilian Digital Library of Theses and Dissertations.	2,558,837	37,825,743
Petrolês - domain-specific	Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, theses and dissertations from IBICT-BDTD on petroleum-related subjects, and ANP's technical reports).	6,295,231	85,725,834
Petrolês - hybrid corpus	Consolidated file containing all of Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, theses and dissertations from IBICT-BDTD on petroleum-related subjects, and ANP's technical reports), plus a general-context corpus in Portuguese from NILC.	43,622,972	451,021,003

Knowledge Organization Systems - KOS

Knowledge Organization Systems (KOS) are semantically structured schemes that can be applied to information retrieval tasks, comprising terms, definitions, relationships and concepts. They include glossaries, lists, thesauri, taxonomies and ontologies.

KOS	Description
Multiword Expressions	List of multiword expressions (MWEs) for the Oil & Gas domain in Portuguese. It was extracted from the Petrolês specialized corpora using pointwise mutual information (PMI) and manually checked by domain specialists. The file contains nearly 65.2 thousand unique MWEs.
Oil and Gas vocabulary and frequency	List of unique words extracted from the Petrolês domain-specific corpora, comprising 237.9 thousand words and their corresponding occurrence counts in the corpora.
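
The multiword expression list above was extracted with pointwise mutual information. For two adjacent words x and y, PMI is log2(p(x, y) / (p(x) p(y))), so pairs that co-occur more often than their individual frequencies would predict score higher. The sketch below scores bigrams only and uses an invented toy corpus. The `bigram_pmi` function, the `min_count` parameter and the probability estimates are assumptions made for illustration, not the procedure actually used to build the list.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=1):
    """Score each adjacent word pair with PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus, invented for illustration (already preprocessed tokens).
corpus = [
    ["agua", "produzida", "no", "campo"],
    ["agua", "produzida", "e", "oleo"],
    ["oleo", "no", "campo"],
]

scores = bigram_pmi(corpus)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

On a real corpus a higher `min_count` is usually needed, because PMI tends to overrate pairs that occur only once or twice. The counts from the same pass also yield a word frequency list like the one in the second row of the table.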

Word Embedding Models - PetroVec

Word embedding models are mathematical representations of vocabulary. Unsupervised learning algorithms encode words as dense vectors, in a way that may capture syntactic and semantic properties from their context.

PetroVec is a set of word embedding models, pre-trained on an O&G specialized corpus: Petrolês. The word embedding models are also available in the PetroVec semantic space projector, an interactive environment where users can test semantic similarities, inspect word neighborhoods, and generate PCA and t-SNE projections in 2D and 3D visualizations.
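
Semantic similarity between embedded words is commonly measured with cosine similarity, and a word's neighborhood is the set of words with the highest similarity to it. The sketch below illustrates the idea with toy 3-dimensional vectors invented for this example. They are not PetroVec vectors, and the `cosine` and `nearest` helpers are hypothetical names.

```python
import math

# Toy vectors, invented for illustration only (real models use 100 or 300 dimensions).
vectors = {
    "petroleo": [0.9, 0.1, 0.0],
    "oleo": [0.8, 0.2, 0.1],
    "poco": [0.3, 0.9, 0.1],
    "contrato": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, k=2):
    """Return the k most similar words to `word`, best first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


neighbors = nearest("petroleo")
print(neighbors)
```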

We trained the models with Word2vec after the following preprocessing steps: lowercasing; removal of stopwords, punctuation and special characters; and replacement of numeric tokens with the tag <NUMBER>.

Datasets and source code for training and evaluating the models are publicly available in the PetroVec GitHub repository.

The research results behind PetroVec's word embedding models are presented in a paper published in the journal Computers in Industry (Elsevier): "Portuguese word embeddings for the oil and gas industry: Development and evaluation".
When citing the PetroVec embedding models in academic papers or theses, please use this BibTeX entry [Download .bib].

Model	Size	Description
Petrovec-O&G (Word2vec) / Petrovec-O&G (FastText) / Petrovec-O&G (Word2vec)	100 / 100 / 300	Models trained with Word2vec and FastText, with 100- and 300-dimensional vectors, on public resources related to the O&G domain (Petrobras Technical Bulletins, theses and dissertations on petroleum-related subjects, and ANP's technical reports).
Petrovec-hybrid (Word2vec) / Petrovec-hybrid (FastText) / Petrovec-hybrid (Word2vec)	100 / 100 / 300	Models trained with Word2vec and FastText, with 100- and 300-dimensional vectors, on hybrid corpora composed of the O&G-specific corpora (Petrobras Technical Bulletins, theses and dissertations on petroleum-related subjects, and ANP's technical reports) plus a general-context corpus in Portuguese from NILC.

Initiatives in Development

This section describes some of the main research initiatives in progress, carried out in collaboration between Petrobras and universities.

Initiative	Description	Partnership
OCR corrector	Tools used to correct texts extracted from Optical Character Recognition (OCR) methods	Petrobras, UFRGS
Socrates Corrector	Corrector for texts extracted by OCR from PDFs.	Petrobras, UFRGS
GeoCSV	GeoCSV is a web solution that allows users to manually annotate the image dataset and that integrates with the annotation tool Labelweb.	Petrobras, UFRGS
REGIS-system	REGIS - Retrieval Evaluation for Geoscientific Information Systems. Tool for generating a test collection for multimodal information retrieval.	Petrobras, UFRGS
GeoImageOntology	GeoImageOntology - Ontology of Visual Artifacts for the Oil Exploration area. This ontology represents the main forms of representation in figures used in the Exploration chain, such as maps, sections, profiles and diagrams.	Petrobras, UFRGS
Image classification - Geodigital	This project aims to automatically classify images using CNN models.	Petrobras, UFRGS
OCRAnno - OCR text annotation tool	OCRAnno is a textual annotation tool designed to provide annotation data for improving OCR extraction systems.	Petrobras, UFRGS
Labelweb	Labelweb is a web image annotation system that allows users to participate in the annotation process, i.e., the assignment of categories to images in the database.	Petrobras, UFRGS
PetroBERT	Initiative for training and evaluating contextual language models in Portuguese specialized in the Oil and Gas domain, based on the BERT architecture and its variations.	Petrobras, UFRGS, ICA/PUC-Rio, PUC-RS, UFF, LNCC
PetroVec	Initiative for training and evaluating word embedding models in Portuguese for the Oil and Gas Domain. The products of this project are presented in this paper published in the journal Computers in Industry, Elsevier: "Portuguese word embeddings for the oil and gas industry: Development and evaluation".	Petrobras, UFRGS, PUC-RS
Entity Recognition	Initiative for training Named Entity Recognition Models for the Oil and Gas Domain in Portuguese.	Petrobras, ICA/PUC-Rio
Tornado - text extractor	A tool for extracting text from PDF documents, using modern computer vision and optical character recognition (OCR) techniques. Tornado is a process and corresponding software tool that relies heavily on machine learning to selectively extract information from PDF files. It identifies individual visual elements on a page, such as blocks of text, figures, charts or tables, in a human-like manner, and automatically selects the best available strategy to process and extract each element. For textual elements, it first attempts simple PDF text parsing and then, if necessary, performs deep-learning-based image enhancement prior to OCR, without human intervention. It is aimed at the efficient parallel processing of large numbers of files, for instance as a tool for building a corpus or in a search engine's indexing pipeline. Tornado is tailored for document extraction in the Oil and Gas industry domain.	Petrobras, ICA/PUC-Rio

Publications

Recursos linguísticos para o PLN específico de domínio: o Petrolês.

Freitas, C., De Souza, E., Castro, M. C., Cavalcanti, T., Ferreira da Silva, P., & Corrêa Cordeiro, F.

Linguamática, 15(2), 51-68. 2023.

Building and evaluating a gold-standard treebank.

De Souza, Elvis

Master's thesis. PUC-Rio, 2023.

Polishing the gold – how much revision do we need in treebanks?

de Souza, Elvis & Freitas, Cláudia

Proceedings of the Universal Dependencies Brazilian Festival, p. 1–11, Fortaleza, Brazil. Association for Computational Linguistics, 2022.

PetroGold – Corpus padrão ouro para o domínio do petróleo

de Souza, E., Silveira, A., Cavalcanti, T., Castro, M. C., & Freitas, C.

Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (pp. 29-38). 2021.

Os limites da palavra e da sentença no processamento automático de textos

Cavalcanti, T., Silveira, A., de Souza, E., & Freitas, C.

Revista Brasileira de Iniciação Científica, 8, e021033. 2021.

Portuguese word embeddings for the oil and gas industry: Development and evaluation

Diogo Gomes (Petrobras), Fábio Cordeiro (Petrobras), Bernardo Consoli, Nikolas Santos, Viviane Moreira, Renata Vieira, Silvia Moraes, Alexandre Evsukoff

Computers in Industry, Elsevier. Volume 124, 2021. ISSN 0166-3615

PetroVec: Development and evaluation of Portuguese word embedding models for the oil and gas domain

Diogo Gomes (Petrobras)

PhD Thesis, COPPE/UFRJ, 2021

REGIS: A Test Collection for Geoscientific Documents in Portuguese

Lucas Lima de Oliveira, Regis Kruel Romeu, Viviane Pereira Moreira

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021

Embeddings for Named Entity Recognition in Geoscience Portuguese Literature

Bernardo Consoli, Joaquim Santos, Diogo Gomes, Fabio Cordeiro, Renata Vieira, Viviane Moreira

Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, France. 2020

Petrolês - How to Build a Specialized Oil and Gas Corpus in Portuguese. ROG 20, 387–388.

Fábio Cordeiro (Petrobras)

Final monograph for the Business Intelligence Master specialization course, 2020

https://doi.org/10.48072/2525-7579.rog.2020.387

An Investigation of Pre-trained Embeddings in Dependency Parsing

de Araújo, J.C.C., Freitas, C., Pacheco, M.A.C., Forero-Mendoza, L.A.

Computational Processing of the Portuguese Language, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 281–290, 2020.

Expanding the open Wordnets for English and Portuguese to geology domain: inclusion of lithology and geological time concepts

Alexandre Tessarollo (Petrobras)

M.Sc thesis, FGV, 2020

Inclusion of Lithological terms (rocks and minerals) in The Open Wordnet for English

Alexandre Tessarollo (Petrobras), Alexandre Rademaker.

LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

Processamento de linguagem natural em Português e aprendizagem profunda para o domínio de Óleo e Gás.

Diogo Gomes (Petrobras), Alexandre Evsukoff (UFRJ)

arXiv, 2019

Technology Intelligence Analysis Based on Document Embedding Techniques for Oil and Gas Domain.

Fábio Cordeiro (Petrobras), Diogo Gomes (Petrobras), Flavio Gomes (Petrobras) and Renata Texeira (Petrobras).

OTC Brazil 2019

Completing the Princeton Annotated Gloss Corpus Project

Alexandre Rademaker, Bruno Cuconato, Henrique Muniz, Alexandre Tessarollo (Petrobras).

Proceedings of the 10th Global Wordnet Conference, 2019

Do PDF ao TXT: Desafios na extração de informação em textos técnico-científicos.

Aline Silveira (PUC-Rio), Elvis de Souza (PUC-Rio), Tatiana Cavalcanti (PUC-Rio), Cláudia Freitas (PUC-Rio)

VI Workshop de Iniciação Científica em Tecnologia da Informação e da Linguagem Humana (VI TILic). pp. 391-394. October 15-18, Salvador, Bahia, Brazil, 2019

A knowledge organization system for image classification and retrieval in petroleum exploration domain.

Mara Abel, Eduardo Simões Lopes Gastal, Cassiana Roberta Lizzoni Michelin, Luiza Gonçalves Maggi, Bruno Eduardo Firnkes, Felix Eduardo Huaroto Pachas and Renata dos Santos Alvarenga (UFRGS)

Ontobras - Seminário de pesquisa em ontologias no Brasil 2019

Extending SUMO to Geological Times.

Alexandre Rademaker, Alexandre Tessarollo (Petrobras), Henrique Muniz, Adam Pease.

CEUR Workshop, 2019

Automatic Summarization of Technical Documents in the Oil and Gas Industry

João Marcos Correia Marques, Fabio Gagliardi Cozman, Ismael Humberto Ferreira dos Santos

8th Brazilian Conference on Intelligent Systems (BRACIS), 2019

Word embeddings em português para o domínio específico de óleo e gás.

Diogo Gomes (Petrobras), Fábio Cordeiro (Petrobras) and Alexandre Evsukoff (UFRJ)

Rio O&G 2018
