Questions tagged [sentencepiece]

The tag has no usage guidance.

14 votes
1 answer
17k views

sentencepiece library is not being installed in the system

While running pip install tf-models-official, I hit the following problem during installation: Collecting tf-models-official Using cached tf_models_official-2.8.0-py2.py3-none-any....
Daremitsu
  • 587
10 votes
2 answers
21k views

How to add new special token to the tokenizer?

I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased). QUERY: I want to ask a question. ANSWER: Sure, ask away. ...
sid8491
  • 6,740
7 votes
1 answer
3k views

Why does the Hugging Face T5 tokenizer ignore some whitespace?

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer ...
Berkay Berabi
2 votes
1 answer
348 views

Some doubts about SentencePiece

I ran into some questions while learning Google’s SentencePiece. BPE, WordPiece and Unigram are all common subword algorithms, so what is the relationship between SentencePiece and ...
korangar leo
2 votes
0 answers
441 views

SentencePiece tokenizer encodes to unknown token

I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no unicode ...
Shital Shah
  • 66.2k
2 votes
0 answers
291 views

How to correctly integrate sentencepiece and protobuf into an existing Android project

I am trying to integrate a PyTorch model for language processing. This is why I need sentencepiece to tokenize the sentence chunks, but I am unable to do that correctly. I did not find any robust ...
im07
  • 396
2 votes
1 answer
3k views

Error while converting pth file to ggml.py format

This is the error I get when I run convert-pth-to-ggml.py. I don't know whether it comes from my file management, which leaves the model unable to load, or from the OS. Traceback (most recent call ...
Tanish Shah
1 vote
1 answer
103 views

Trouble installing NewsSentiment, Rust, and sentencepiece in conda

I'm trying to install NewsSentiment on anaconda, which gave me this error: (pytorch) C:\Users\chenx>pip3 install newssentiment Collecting newssentiment Using cached NewsSentiment-1.0.7-py3-none-...
Yooshinhee
1 vote
1 answer
2k views

SentencePiece in Google Colab

I want to use sentencepiece, from https://github.com/google/sentencepiece, in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the sentencepiece ...
Jose Chavez
1 vote
1 answer
2k views

How can I update sentencepiece package to its latest version using conda?

I have installed conda on Ubuntu 16 Linux. When I install or update the sentencepiece package, it installs version 0.1.85 (which I guess is from 2 months ago, according to the Anaconda website). ...
Ahmad
  • 9,458
1 vote
1 answer
625 views

(OpenNMT) Spanish to English Model Improvement

I’m currently trying to train a Spanish to English model using yaml scripts. My data set is pretty big but just for starters, I’m trying to get a 10,000 training set and 1000-2000 validation set ...
Jose Chavez
1 vote
1 answer
2k views

How to add a new token to the T5 tokenizer, which uses sentencepiece

I am training the T5 transformer, which is based on TensorFlow, from the following link: https://github.com/google-research/text-to-text-transfer-transformer Here is a sample (input, output): input: b'[atomic]:&...
Ahmad
  • 9,458
1 vote
0 answers
440 views

Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset?

I am trying to test a simple model using a SentencePieceTokenizer layer over a (HuggingFace) dataset. But I seem unable to get the shape of the dataset's target to agree with the model's output. All ...
rikb
  • 652
1 vote
1 answer
221 views

libsentencepiece.so.0: cannot open shared object file: No such file or directory when creating BERTopic model

I am trying to train a BERTopic model in Python. However, I get this error: RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback): ...
kmcclenn
1 vote
0 answers
226 views

Getting "Unable to load vocabulary from file." while using pipelines

I have been trying to use the "csebuetnlp/mT5_multilingual_XLSum" model for summarization. The code I tried is listed below: !pip install transformers !pip install sentencepiece ...
dicloflom
1 vote
1 answer
504 views

Saving SentencepieceTokenizer in Keras model throws TypeError: Failed to convert elements of [None, None] to Tensor

I'm trying to save a Keras model which uses a SentencepieceTokenizer. Everything is working so far but I am unable to save the Keras model. After training the sentencepiece model, I am creating the ...
Stefan Falk
  • 24.7k
1 vote
0 answers
770 views

Slow and fast tokenizers give different outputs (sentencepiece tokenization)

When I use T5TokenizerFast (the tokenizer of the T5 architecture), the output is as expected: ['▁', '</s>', '▁Hello', '▁', '<sep>', '</s>'] But when I use the normal tokenizer, it ...
canP
  • 25
1 vote
0 answers
1k views

"OSError: Model name './XX' was not found in tokenizers model name list" - cannot load custom tokenizer in Transformers

I'm trying to create my own tokenizer with my own dataset/vocabulary using Sentencepiece and then use it with AlbertTokenizer from transformers. I closely followed the tutorial on how to train a ...
tlqn
  • 379
0 votes
0 answers
15 views

Treat Hawaiian Glottal stop (ʻOkina) as consonant, not punctuation

I'm struggling to get the ʻokina, the Hawaiian glottal stop character (U+02BB), treated as a letter and not as punctuation in SentencePiece subword tokenization, whether BPE or Unigram. Can someone ...
HURIMOZ
  • 41
0 votes
0 answers
11 views

Keras-NLP Albert Finetuning - Resource localhost/_0_SentencepieceOp/N10tensorflow4text12_GLOBAL__N_121SentencepieceResourceE does not exist

Describe the bug: I am fine-tuning the Keras implementation of Albert on my dataset for a classification problem, following the documentation here - https://keras.io/api/keras_nlp/models/...
Aakash Howlader
0 votes
0 answers
20 views

Why does the sentencepiece tokenizer in NLLB return IDs that differ by 1 depending on whether I specify the model name or file?

When I use HF tokenizers and specify the model by name, facebook/nllb-200-distilled-600M, I get tokens such as 256047, 30311, 104, 253990 and so on. But when I take the sentencepiece.bpe.model from ...
Daniel Kukula
0 votes
1 answer
356 views

Getting requirements to build wheel did not run successfully

Windows PowerShell Copyright (C) Microsoft Corporation. All rights reserved. Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows PS D:\Translator> pip ...
Mohit Sapat
0 votes
1 answer
137 views

Trying to install bertopic, but getting WinError 2

I am trying to install bertopic in VS Code using pip inside a virtual environment, but I am getting a WinError 2 while building sentencepiece. I tried installing sentencepiece separately but ...
unnk
  • 21
0 votes
0 answers
17 views

Does the same string always produce the same sentencepiece tokenizer encode result?

Will the sentencepiece tokenizer always produce the same encode result for the same string, regardless of the context of the string? For example: sentence 1: abc bcde aa sentence 2: nnnabc bcde zks ...
Zip
  • 11
0 votes
0 answers
50 views

How to modify a trained SentencePiece tokenizer to stop splitting the chatml tokens?

We are using a pre-trained SentencePiece tokenizer (the SentencePiece tokenizer from Google, not huggingface), and we would like to preserve the chatML tokens: <|im_start|> and <|im_end|> ...
vgoklani
  • 11.3k
0 votes
0 answers
17 views

ImportError: cannot import name 'SentencePieceModel' from 'sentencepiece' (/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py)

ImportError Traceback (most recent call last) in <cell line: 4>() 2 import numpy as np 3 from sklearn.model_selection import train_test_split ----> 4 from ...
Dan Shitkar
0 votes
0 answers
105 views

Sentencepiece tokenizer incorrectly concatenating input files

I am trying to use sentencepiece to tokenize a large number of source code files in several different languages. # Train SentencePiece model file_paths = [] for dir_name, _, file_list in ...
Zeratul777
0 votes
0 answers
160 views

ImportError: CamembertTokenizer requires the SentencePiece library but it was not found in your environment

I am trying to create an .exe from Python code. Here is my .spec: # -*- mode: python ; coding: utf-8 -*- from PyInstaller.utils.hooks import copy_metadata datas = [("C:/Users/pierr/OneDrive/Bureau/...
pierrevslouis
0 votes
0 answers
395 views

_sentencepiece.SentencePieceProcessor_LoadFromFile No such file or directory

I'm trying to run a script for the deepparse NN, but I got this error: _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) OSError: Not found: "C:\Users\MyUserName\.cache\deepparse\multi\multi....
MR.Max
  • 56
0 votes
0 answers
197 views

RuntimeError: Graph is finalized and cannot be modified

After running the sample below: def embed_muse(module): with tf.Graph().as_default(): sentences = tf.placeholder(tf.string) embed = hub.load(module) embeddings = embed(...
Sweety Tripathi