GitHub Profile Data (fetched 2026-03-05)

Profile

  • Username: ruanchaves (Ruan Chaves)
  • Bio: Senior AI Engineer with 5+ years of experience delivering real-world solutions using Generative AI, LLMs, and NLP.
  • Location: Brazil
  • Website: https://ruanchaves.github.io/
  • LinkedIn: in/ruanchaves
  • Email: ruanchaves93@gmail.com
  • Followers: 62
  • Following: 30
  • Repositories: 75
  • Stars given: 67

Achievements

  • Pull Shark (x3)
  • YOLO
  • Quickdraw
  • Starstruck
  • Arctic Code Vault Contributor

Pinned Repositories

hashformers (77 stars, 5 forks)

  • Description: Accurate word segmentation for hashtags and text, powered by Transformers and Beam Search. A scalable alternative to heuristic splitters and massive LLMs.
  • Language: Python
  • License: MIT
  • Created: 2020-05-21
  • Last updated: 2026-02-21
  • Topics: deep-learning, hashtag-segmentor, large-language-models, llms, natural-language-processing, nlp, paper, segmentation, sentiment-analysis, sentiment-classification, sentiment-polarity, spacy, spacy-extension, spacy-extensions, transformers, tweet-analysis, tweets-classification, twitter, twitter-sentiment-analysis, word-segmentation
  • Notable: Recognized as state-of-the-art at LREC 2022 in the paper “HashSet - A Dataset For Hashtag Segmentation” by researchers from IIT. Leverages GPT-2 and beam search for accurate, multilingual hashtag and text segmentation.
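
The pairing of a language-model score with beam search noted above can be illustrated with a toy example. This is a minimal sketch of beam-limited segmentation in general, not hashformers' code or API: the vocabulary, the probabilities, and the names `TOY_UNIGRAMS`, `word_logp`, and `segment` are all hypothetical, and a hand-written unigram table stands in for GPT-2.

```python
import math

# Hypothetical toy vocabulary with made-up probabilities.
# It stands in for a real language-model score such as GPT-2's.
TOY_UNIGRAMS = {
    "we": 0.05, "need": 0.03, "a": 0.06, "national": 0.01,
    "park": 0.01, "nation": 0.005, "al": 0.0005,
}
UNKNOWN_CHAR_LOGP = math.log(1e-8)  # per-character penalty for unseen words


def word_logp(word):
    """Log-probability of one word under the toy unigram model."""
    if word in TOY_UNIGRAMS:
        return math.log(TOY_UNIGRAMS[word])
    return UNKNOWN_CHAR_LOGP * len(word)


def segment(text, beam_width=3, max_word_len=10):
    """Split `text` into words, keeping at most `beam_width`
    scored hypotheses for every prefix of the input."""
    # beams[j] holds (score, words) hypotheses that cover text[:j].
    beams = [[] for _ in range(len(text) + 1)]
    beams[0] = [(0.0, [])]
    for j in range(1, len(text) + 1):
        candidates = []
        for i in range(max(0, j - max_word_len), j):
            word = text[i:j]
            for score, words in beams[i]:
                candidates.append((score + word_logp(word), words + [word]))
        # Prune: keep only the highest-scoring hypotheses for this prefix.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[j] = candidates[:beam_width]
    return beams[len(text)][0][1]


print(" ".join(segment("weneedanationalpark")))  # → we need a national park
```

With an additive unigram score the best hypothesis is never pruned, so the beam width only matters once the scorer depends on context, as a Transformer language model does.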

napolab (72 stars, 3 forks)

  • Description: The Natural Portuguese Language Benchmark (Napolab). Stay up to date with the latest advancements in Portuguese language models and their performance across carefully curated Portuguese language tasks.
  • Language: Python
  • License: MIT
  • Created: 2023-03-29
  • Last updated: 2026-01-26
  • Topics: benchmarks, catalan, datasets, english, galician, hate-speech, huggingface, huggingface-transformers, large-language-models, nlp, portuguese, python, question-answering, semantic-similarity, spanish, text-simplification, textual-entailment, transformers
  • Notable: Key finding: the performance gap between general-purpose LLMs and Portuguese-specific models is smaller than previously believed. Exposes systemic issues in LLM benchmarking (investigation gap, data contamination).
  • Leaderboard: https://huggingface.co/spaces/ruanchaves/napolab
  • Master's Thesis: https://www.um.edu.mt/library/oar/handle/123456789/120557

elmo (11 stars, 2 forks)

  • Description: Supporting code for the paper “Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks”.
  • Language: Jupyter Notebook
  • Topics: elmo, embeddings, natural-language-processing, natural-language-understanding, nlp, portuguese, portuguese-language, semantic-similarity, textual-entailment

song2vec (6 stars, 0 forks)

  • Description: Telegram bot that recommends songs as YouTube playlists through gensim’s word2vec
  • Language: Python

assin (5 stars, 3 forks)

  • Description: Supporting code for the paper “Multilingual Transformer Ensembles for Portuguese Natural Language Tasks”.
  • Language: Jupyter Notebook
  • Topics: bert, natural-language-processing, natural-language-understanding, nlp, portuguese, portuguese-language, roberta, semantic-similarity, textual-entailment, transformers

reddit-html-archiver-image-plugin (5 stars, 0 forks)

  • Description: image downloader plugin for reddit-html-archiver
  • Language: Python

BERT-WS (0 stars, 0 forks)

  • Description: Supporting code for the paper “Domain Adaptation of Transformers for English Word Segmentation”.
  • Language: Python

Other Notable Repos

Zero-Shot-Entity-Linking (2 stars)

  • Zero-shot Entity Linking with blitz start in 3 minutes.

medical-assistant-bot

  • A medical question-answering system that can effectively answer user queries related to medical diseases.
  • Language: Jupyter Notebook

countgpt

  • Language: Python

qa-dataset

  • Language: Python

ml-tech-assessment

  • Language: Python

Open Source Contributions (from github.md)

argilla-io/argilla

  • Fixed bugs and shipped features related to semi-supervised learning (SSL) during an internship at Argilla.
  • Argilla was acquired by Hugging Face in June 2024 (~$10M deal).

huggingface/transformers (PR #10823)

  • Modified the Trainer class for simultaneous Ray Tune and Weights & Biases execution.

nathanshartmann/portuguese_word_embeddings (PR #11)

  • Fixed a severe bug in the evaluation procedure. Documented in research paper.
  • Fixed a parameter bug in the script for the BLINK benchmark.

awslabs/mlm-scoring (PR #12)

  • Addressed an installation instruction issue for the mlm-scoring library.

All Repos (75 total, sorted by stars)

Name                               Stars  Forks  Language          Description
hashformers                        77     5      Python            Accurate word segmentation for hashtags and text
napolab                            72     3      Python            The Natural Portuguese Language Benchmark
elmo                               11     2      Jupyter Notebook  Portuguese Language Models and Word Embeddings
song2vec                           6      0      Python            Telegram bot recommending songs via word2vec
assin                              5      3      Jupyter Notebook  Multilingual Transformer Ensembles for Portuguese NLT
reddit-html-archiver-image-plugin  5      0      Python            Image downloader plugin for reddit-html-archiver
Zero-Shot-Entity-Linking           2      0      Python            Zero-shot Entity Linking
CPS3235-Twitter-Data-Collection    1      0      -                 Twitter data collection
old_website                        1      0      HTML              Previous personal website
pdfsandwich-cli                    1      0      JavaScript        CLI for Dockerized pdfsandwich on AWS EC2
prawstreams                        1      0      Python            Fetch live Reddit comments/submissions
lsystem                            1      0      Processing        Algorithmic Botany, L-Systems
srnn-svm                           1      0      Jupyter Notebook  SRNN-SVM
hdp                                1      0      Jupyter Notebook  HDP + T-SNE + k-NN topic modeling
agendamento                        1      0      -                 Task scheduling application