GitHub Profile Data (fetched 2026-03-05)
Profile
- Username: ruanchaves (Ruan Chaves)
- Bio: Senior AI Engineer with 5+ years of experience delivering real-world solutions using Generative AI, LLMs, and NLP.
- Location: Brazil
- Website: https://ruanchaves.github.io/
- LinkedIn: in/ruanchaves
- Email: ruanchaves93@gmail.com
- Followers: 62
- Following: 30
- Repositories: 75
- Stars given: 67
Achievements
- Pull Shark (x3)
- YOLO
- Quickdraw
- Starstruck
- Arctic Code Vault Contributor
Pinned Repositories
hashformers (77 stars, 5 forks)
- Description: Accurate word segmentation for hashtags and text, powered by Transformers and Beam Search. A scalable alternative to heuristic splitters and massive LLMs.
- Language: Python
- License: MIT
- Created: 2020-05-21
- Last updated: 2026-02-21
- Topics: deep-learning, hashtag-segmentor, large-language-models, llms, natural-language-processing, nlp, paper, segmentation, sentiment-analysis, sentiment-classification, sentiment-polarity, spacy, spacy-extension, spacy-extensions, transformers, tweet-analysis, tweets-classification, twitter, twitter-sentiment-analysis, word-segmentation
- Notable: Recognized as state-of-the-art at LREC 2022 in the paper “HashSet - A Dataset For Hashtag Segmentation” by researchers from IIT. Leverages GPT-2 and beam search for accurate, multilingual hashtag and text segmentation.
napolab (72 stars, 3 forks)
- Description: The Natural Portuguese Language Benchmark (Napolab). Stay up to date with the latest advancements in Portuguese language models and their performance across carefully curated Portuguese language tasks.
- Language: Python
- License: MIT
- Created: 2023-03-29
- Last updated: 2026-01-26
- Topics: benchmarks, catalan, datasets, english, galician, hate-speech, huggingface, huggingface-transformers, large-language-models, nlp, portuguese, python, question-answering, semantic-similarity, spanish, text-simplification, textual-entailment, transformers
- Notable: Key finding: the performance gap between general-purpose LLMs and Portuguese-specific models is smaller than previously believed. Also exposes systemic issues in LLM benchmarking (investigation gap, data contamination).
- Leaderboard: https://huggingface.co/spaces/ruanchaves/napolab
- Master's Thesis: https://www.um.edu.mt/library/oar/handle/123456789/120557
elmo (11 stars, 2 forks)
- Description: Supporting code for the paper “Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks”.
- Language: Jupyter Notebook
- Topics: elmo, embeddings, natural-language-processing, natural-language-understanding, nlp, portuguese, portuguese-language, semantic-similarity, textual-entailment
song2vec (6 stars, 0 forks)
- Description: Telegram bot that recommends songs as YouTube playlists through gensim’s word2vec
- Language: Python
assin (5 stars, 3 forks)
- Description: Supporting code for the paper “Multilingual Transformer Ensembles for Portuguese Natural Language Tasks”.
- Language: Jupyter Notebook
- Topics: bert, natural-language-processing, natural-language-understanding, nlp, portuguese, portuguese-language, roberta, semantic-similarity, textual-entailment, transformers
reddit-html-archiver-image-plugin (5 stars, 0 forks)
- Description: Image downloader plugin for reddit-html-archiver
- Language: Python
BERT-WS (0 stars, 0 forks)
- Description: Supporting code for the paper “Domain Adaptation of Transformers for English Word Segmentation”.
- Language: Python
Other Notable Repos
Zero-Shot-Entity-Linking (2 stars)
- Zero-shot Entity Linking with blitz start in 3 minutes.
medical-assistant-bot
- A medical question-answering system that can effectively answer user queries related to medical diseases.
- Language: Jupyter Notebook
countgpt
- Language: Python
qa-dataset
- Language: Python
ml-tech-assessment
- Language: Python
Open Source Contributions (from github.md)
argilla-io/argilla
- Fixed bugs and shipped features related to semi-supervised learning (SSL) during internship at Argilla.
- Argilla was acquired by Hugging Face in June 2024 (~$10M deal).
huggingface/transformers (PR #10823)
- Modified the Trainer class for simultaneous Ray Tune and Weights & Biases execution.
nathanshartmann/portuguese_word_embeddings (PR #11)
- Fixed a severe bug in the evaluation procedure. Documented in research paper.
facebookresearch/BLINK (PR #25)
- Fixed a parameter bug in the script for the BLINK benchmark.
awslabs/mlm-scoring (PR #12)
- Addressed an installation instruction issue for the mlm-scoring library.
All Repos (75 total, sorted by stars)
| Name | Stars | Forks | Language | Description |
|---|---|---|---|---|
| hashformers | 77 | 5 | Python | Accurate word segmentation for hashtags and text |
| napolab | 72 | 3 | Python | The Natural Portuguese Language Benchmark |
| elmo | 11 | 2 | Jupyter Notebook | Portuguese Language Models and Word Embeddings |
| song2vec | 6 | 0 | Python | Telegram bot recommending songs via word2vec |
| assin | 5 | 3 | Jupyter Notebook | Multilingual Transformer Ensembles for Portuguese Natural Language Tasks |
| reddit-html-archiver-image-plugin | 5 | 0 | Python | Image downloader plugin for reddit-html-archiver |
| Zero-Shot-Entity-Linking | 2 | 0 | Python | Zero-shot Entity Linking |
| CPS3235-Twitter-Data-Collection | 1 | 0 | - | Twitter data collection |
| old_website | 1 | 0 | HTML | Previous personal website |
| pdfsandwich-cli | 1 | 0 | JavaScript | CLI for Dockerized pdfsandwich on AWS EC2 |
| prawstreams | 1 | 0 | Python | Fetch live Reddit comments/submissions |
| lsystem | 1 | 0 | Processing | Algorithmic Botany, L-Systems |
| srnn-svm | 1 | 0 | Jupyter Notebook | SRNN-SVM |
| hdp | 1 | 0 | Jupyter Notebook | HDP + T-SNE + k-NN topic modeling |
| agendamento | 1 | 0 | - | Task scheduling application |