Petrolês is a public repository of artifacts for Natural Language Processing applications in the petroleum domain in Portuguese.
This repository aims to serve as a reference for artificial intelligence research groups and companies related to the oil and gas sector.
Petrolês is a partnership of the Petrobras Research and Development Center (CENPES), the Applied Computational Intelligence Lab (PUC-Rio/ICA), UFRGS and PUC-RS, and aims to promote research initiatives related to Natural Language Processing and Computational Linguistics.
Select the category from the navigation panel below. On each tab, select the desired items by activating their corresponding pills.
Domain corpora are provided as combinations of sub-corpora, intended to train specialized Natural Language Processing (NLP) models.
The corpora were preprocessed only to eliminate noise, numeric tokens and special characters.
When citing the Petrolês Corpora in academic papers or theses, please use this BibTeX entry [Download .bib].
Corpora | Description | Sentences | Tokens |
---|---|---|---|
IBICT-BDTD | Academic theses and dissertations on petroleum-related subjects, obtained from the Brazilian Digital Library of Theses and Dissertations | 2.672.927 | 63.424.309 |
Petrolês - domain-specific | Consolidated file containing all Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, Theses and Dissertations from IBICT-BDTD in petroleum-related subjects; ANP's technical reports). | 7.152.493 | 146.996.520 |
Petrolês - hybrid corpus | Consolidated file containing all Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, Theses and Dissertations from IBICT-BDTD in petroleum-related subjects; ANP's technical reports), plus a general-context corpus in Portuguese from NILC. | 49.310.552 | 829.350.869 |
Petro1 and Petro2 | Gold standard corpora, fully revised and annotated with lemma, POS and syntactic dependency information according to the framework of the Universal Dependencies project. The corpora are available separately because they were created in different ways, but they can be combined. The content is a subset of the Petrolês domain-specific corpus. | 818 | 27.536 |
PetroTok | Small gold standard corpus, revised only for pre-processing, specifically the sentence segmentation stage. The content is a subset of the Petrolês domain-specific corpus. The corpus does not preserve the order in which sentences appear in the original texts; it is a selection of sentences that can be especially difficult for automatic processing. | 1.139 | 38.472 |
PetroGold v1 | Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. The material is presented in: de Souza, E., Silveira, A., Cavalcanti, T., Castro, M. C., & Freitas, C. (2021, November). PetroGold – Corpus padrão ouro para o domínio do petróleo. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (pp. 29-38). Available at: https://sol.sbc.org.br/index.php/stil/article/view/17781. | 9.127 | 253.640 |
PetroGold-v2 | Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. The material is described in: de Souza, E., Freitas, C. (2022). Polishing the gold – how much revision do we need in treebanks? I Universal Dependencies Brazilian Festival (UDFest-BR). Available at: https://aclanthology.org/2022.udfestbr-1.2/. | 8.949 | 250.595 |
PetroGold-v3 | Gold standard treebank, with revision of the automatic annotation of lemma, POS and syntactic dependencies according to the framework of the Universal Dependencies project. The content is a subset of the Petrolês corpus. | 8.946 | 250.605 |
PetroNer | Gold standard corpus annotated with named entities in the oil & gas domain. It was built from a set of 11 Technical Reports from Petrobras, which are part of the Petrolês corpus [Cordeiro, 2020]. The corpus is described in: Freitas, C., De Souza, E., Castro, M. C., Cavalcanti, T., Ferreira da Silva, P., & Corrêa Cordeiro, F. (2023). Recursos linguísticos para o PLN específico de domínio: o Petrolês. Linguamática, 15(2), 51-68. | 24.035 | 615.418 |
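Treebanks annotated under the Universal Dependencies framework are commonly exchanged in the CoNLL-U format, which stores ten tab-separated fields per token. The sketch below assumes the gold standard treebanks listed above follow that convention (the file format is not stated on this page). The sample sentence and the names `read_conllu` and `FIELDS` are invented for illustration.

```python
from collections import Counter

# Hypothetical sample in CoNLL-U layout; it is not taken from the corpora.
SAMPLE = """\
# sent_id = example-1
# text = O poço produz óleo.
1\tO\to\tDET\t_\t_\t2\tdet\t_\t_
2\tpoço\tpoço\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tproduz\tproduzir\tVERB\t_\t_\t0\troot\t_\t_
4\tóleo\tóleo\tNOUN\t_\t_\t3\tobj\t_\tSpaceAfter=No
5\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield each sentence as a list of token dictionaries."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        else:
            cols = line.split("\t")
            # Skip malformed lines, multiword ranges ("1-2") and empty nodes ("1.1").
            if len(cols) != len(FIELDS) or not cols[0].isdigit():
                continue
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE))
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
print(upos_counts.most_common())
```

For real files, the same function can be applied to the text returned by `open(path, encoding="utf-8").read()`; a dedicated CoNLL-U library may be preferable for large treebanks.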
Preprocessed domain corpora are provided as combinations of sub-corpora, intended to train specialized word embedding models. We tested different corpus compositions as training strategies, aiming for the best representation quality for domain-specific vocabulary.
The corpora were preprocessed with the following steps: lowercasing; removal of stopwords, diacritics, punctuation and special characters; and replacement of numeric tokens with the tag <TOKEN>.
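The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the released files: the stopword list is a tiny hypothetical sample, the order of the steps is an assumption, and punctuation is stripped per token, so a number such as "3.500" collapses to a single numeric token.

```python
import re
import unicodedata

# Hypothetical stopword sample (already without diacritics); the list
# actually used for the corpora is not given on this page.
STOPWORDS = {"a", "o", "de", "do", "da", "e", "em", "com"}


def strip_diacritics(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(sentence):
    """Lowercase, strip diacritics and punctuation, drop stopwords, tag numbers."""
    text = strip_diacritics(sentence.lower())
    tokens = []
    for raw in text.split():
        # Keep only ASCII letters and digits inside each whitespace-delimited token.
        token = re.sub(r"[^a-z0-9]", "", raw)
        if not token or token in STOPWORDS:
            continue
        # Tokens made only of digits are replaced by the placeholder tag.
        tokens.append("<TOKEN>" if token.isdigit() else token)
    return tokens


print(preprocess("A perfuração do poço atingiu 3.500 metros."))
```

Stripping punctuation inside tokens also joins hyphenated words; splitting on punctuation instead is an equally plausible reading of the step list.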
When citing the Petrolês Corpora in academic papers or theses, please use this BibTeX entry [Download .bib].
Corpora | Description | Sentences | Tokens |
---|---|---|---|
IBICT-BDTD | Academic theses and dissertations on petroleum-related subjects, obtained from the Brazilian Digital Library of Theses and Dissertations | 2.558.837 | 37.825.743 |
Petrolês - domain-specific | Consolidated file containing all Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, Theses and Dissertations from IBICT-BDTD in petroleum-related subjects; ANP's technical reports). | 6.295.231 | 85.725.834 |
Petrolês - hybrid corpus | Consolidated file containing all Petrolês' O&G domain-specific corpora (Petrobras Technical Bulletins, Theses and Dissertations from IBICT-BDTD in petroleum-related subjects; ANP's technical reports), plus a general-context corpus in Portuguese from NILC. | 43.622.972 | 451.021.003 |
Knowledge Organization Systems (KOS) are semantically structured schemes that can be applied to information retrieval tasks, comprising terms, definitions, relationships and concepts. They include glossaries, lists, thesauri, taxonomies and ontologies (See more).
KOS | Description |
---|---|
Multiword Expressions | List of multiword expressions (MWEs) for the Oil & Gas domain in Portuguese. It was extracted from the Petrolês specialized corpora using pointwise mutual information (PMI) and manually checked by domain specialists. The file contains nearly 65.2 thousand unique MWEs. |
Oil and Gas vocabulary and frequency | List of unique words extracted from the Petrolês domain-specific corpora, comprising 237.9 thousand words and their corresponding occurrence counts in the corpora. |
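Pointwise mutual information compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The sketch below scores adjacent word pairs in a toy corpus. The sentences, the `min_count` threshold and the restriction to bigrams are assumptions made for illustration, not the procedure used to build the released list.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Return a PMI score for every adjacent word pair seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            # Rare pairs get unreliable (inflated) PMI, so they are filtered out.
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Invented toy corpus, already tokenized and preprocessed.
corpus = [
    ["agua", "produzida", "no", "campo"],
    ["tratamento", "de", "agua", "produzida"],
    ["campo", "de", "petroleo"],
    ["producao", "de", "petroleo", "no", "campo"],
]

scores = bigram_pmi(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
for pair in ranked:
    print(pair, round(scores[pair], 3))
```

In this toy corpus the pair ("agua", "produzida") ranks first because both words appear only together. The unigram counter built along the way is the same kind of word-frequency table described in the vocabulary row above.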
Word embedding models are mathematical representations of vocabulary. Unsupervised learning algorithms encode words as dense vectors in a way that may capture syntactic and semantic properties from their context.
PetroVec is a set of word embedding models pre-trained on a specialized O&G corpus: Petrolês.
The word embedding models are also available in the PetroVec semantic space projector, an interactive environment where users can test semantic similarities, explore word neighborhoods, and generate PCA and t-SNE projections in 2D and 3D visualizations.
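Semantic similarity between two embedded words is usually measured with the cosine of the angle between their vectors, and a word's neighborhood is the set of words with the highest cosine. The sketch below uses invented 3-dimensional vectors in place of the real models; the words, values and function names are illustrative only.

```python
import math

# Invented toy vectors; real embedding models have far more dimensions.
VECTORS = {
    "poco": [0.9, 0.1, 0.0],
    "perfuracao": [0.8, 0.2, 0.1],
    "reservatorio": [0.6, 0.5, 0.1],
    "contrato": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words most similar to `word`, with their cosine scores."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


neighbors = nearest("poco", VECTORS)
for other, score in neighbors:
    print(other, round(score, 3))
```

With pre-trained vectors loaded into a dictionary of the same shape, the same two functions reproduce a basic neighborhood query; libraries built for embeddings do this far more efficiently on full vocabularies.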
We trained the models with Word2vec after the following preprocessing steps: lowercasing; removal of stopwords, punctuation and special characters; and replacement of numeric tokens with the tag <NUMBER>.
Datasets and source code for training and evaluating the models are publicly available in the PetroVec GitHub repository.
The research results behind PetroVec's word embedding models are presented in a paper published in the journal Computers in Industry (Elsevier): "Portuguese word embeddings for the oil and gas industry: Development and evaluation".
When citing the PetroVec embedding models in academic papers or theses, please use this BibTeX entry [Download .bib].
Model | Size | Description |
---|---|---|
Petrovec-O&G (Word2vec); Petrovec-O&G (FastText); Petrovec-O&G (Word2vec) | 100; 100; 300 | Models trained with Word2vec and FastText, with 100- and 300-dimensional vectors, on public resources related to the O&G domain (Petrobras Technical Bulletins, Theses and Dissertations in petroleum-related subjects; ANP's technical reports). |
Petrovec-híbrido (Word2vec); Petrovec-híbrido (FastText); Petrovec-híbrido (Word2vec) | 100; 100; 300 | Models trained with Word2vec and FastText, with 100- and 300-dimensional vectors, on hybrid corpora combining the O&G-specific corpora (Petrobras Technical Bulletins, Theses and Dissertations in petroleum-related subjects; ANP's technical reports) with a general-context corpus in Portuguese from NILC. |
This section describes some of the main research initiatives in progress, carried out in collaboration between Petrobras and universities.
Initiative | Description | Partnership |
---|---|---|
OCR corrector | Tools used to correct texts extracted from Optical Character Recognition (OCR) methods | Petrobras, UFRGS |
Socrates Corrector | Corrector for texts extracted by OCR from PDFs. | Petrobras, UFRGS |
GeoCSV | GeoCSV is a web solution that allows users to manually annotate the image dataset and integrates it with the annotation tool Labelweb. | Petrobras, UFRGS |
REGIS-system | REGIS - Retrieval Evaluation for Geoscientific Information Systems. Tool for generating a test collection for multimodal information retrieval (paper). | Petrobras, UFRGS |
GeoImageOntology | GeoImageOntology - Ontology of Visual Artifacts for the Oil Exploration area. This ontology represents the main forms of representation in figures used in the Exploration chain, such as maps, sections, profiles and diagrams. | Petrobras, UFRGS |
Image classification - Geodigital | This project aims to automatically classify images using CNN models. | Petrobras, UFRGS |
OCRAnno - OCR text annotation tool | OCRAnno is a textual annotation tool designed to provide annotation data for improving OCR extraction systems. | Petrobras, UFRGS |
Labelweb | Labelweb is a web image annotation system that allows users to participate in the annotation process, i.e., the assignment of categories to images in the database. | Petrobras, UFRGS |
PetroBERT | Initiative for training and evaluating contextual language models in Portuguese specialized in the Oil and Gas domain, based on the BERT architecture and its variations. | Petrobras, UFRGS, ICA/PUC-Rio, PUC-RS, UFF, LNCC |
PetroVec | Initiative for training and evaluating word embedding models in Portuguese for the Oil and Gas Domain. The products of this project are presented in this paper published in the journal Computers in Industry, Elsevier: "Portuguese word embeddings for the oil and gas industry: Development and evaluation". | Petrobras, UFRGS, PUC-RS |
Entity Recognition | Initiative for training Named Entity Recognition Models for the Oil and Gas Domain in Portuguese. | Petrobras, ICA/PUC-Rio |
Tornado - text extractor | A tool for extracting text from PDF documents, using modern computer vision and optical character recognition (OCR) techniques. Tornado is a process and corresponding software-based tool that relies heavily on machine learning to selectively extract information from PDF files. It is able to identify individual visual elements on a page, such as blocks of text, figures, charts, or tables, in a human-like manner, and it automatically selects the best available strategy to process and extract each element. For textual elements, it attempts simple PDF text parsing first and then, if necessary, performs deep-learning-based image enhancement prior to OCR, without human intervention. It is aimed at the efficient parallel processing of large numbers of files, for instance as a tool for building a corpus or in a search engine's indexing pipeline. Tornado is tailored for document extraction in the Oil and Gas industry domain. | Petrobras, ICA/PUC-Rio |
Publications
Linguamática, 15(2), 51-68. 2023.
Master's thesis. PUC-Rio, 2023.
Proceedings of the Universal Dependencies Brazilian Festival, p. 1–11, Fortaleza, Brazil. Association for Computational Linguistics, 2022.
Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pp. 29-38. 2021.
Revista Brasileira De Iniciação Científica, 8, e021033. 2021
Computers in Industry, Elsevier. Volume 124, 2021. ISSN 0166-3615
PhD Thesis, COPPE/UFRJ, 2021
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021
Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, France. 2020
Final monograph for the Business Intelligence Master specialization course, 2020
Computational Processing of the Portuguese Language, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 281–290, L.A., 2020.
M.Sc thesis, FGV, 2020
LREC 2020 Workshop on Multimodal Wordnets (MMW2020)
arXiv, 2019
OTC Brazil 2019
Proceedings of the 10th Global Wordnet Conference, 2019
VI Workshop de Iniciação Científica em Tecnologia da Informação e da Linguagem Humana (VI TILic). pp. 391-394. October 15-18. Salvador, Bahia, Brazil, 2019
Ontobras - Seminário de pesquisa em ontologias no Brasil 2019
CEUR Workshop, 2019
8th Brazilian Conference on Intelligent Systems (BRACIS), 2019