Questions tagged [sentencepiece]
30
questions
14
votes
1
answer
17k
views
sentencepiece library is not being installed in the system
While running pip install tf-models-official, I hit the following problem during installation:
Collecting tf-models-official
Using cached tf_models_official-2.8.0-py2.py3-none-any....
10
votes
2
answers
21k
views
How to add new special token to the tokenizer?
I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
...
7
votes
1
answer
3k
views
Why does the Hugging Face T5 tokenizer ignore some whitespace?
I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer ...
2
votes
1
answer
348
views
Some doubts about SentencePiece
I recently encountered some questions when I was learning Google’s SentencePiece.
BPE, WordPiece and Unigram are all common subword algorithms, so what is the relationship between SentencePiece and ...
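One part of that relationship can be sketched without the library. SentencePiece is usually described as a tokenizer framework that runs BPE or Unigram directly on raw text and treats whitespace as an ordinary symbol (▁, U+2581), so that decoding is lossless. The sketch below illustrates only that whitespace convention. The helper names and the hand-written segmentation are hypothetical and are not library calls.

```python
# Sketch of the whitespace convention only; this is not the SentencePiece API.
META = "\u2581"  # "▁", the symbol that stands in for a space

def escape_whitespace(text: str) -> str:
    """Prefix the text with the meta symbol and replace each space with it."""
    return META + text.replace(" ", META)

def detokenize(pieces: list[str]) -> str:
    """Join the pieces, turn the meta symbol back into spaces, drop the leading one."""
    return "".join(pieces).replace(META, " ")[1:]

escaped = escape_whitespace("Hello world.")
# A hand-picked segmentation, of the kind a subword model might produce.
pieces = [META + "Hello", META + "wor", "ld", "."]
assert "".join(pieces) == escaped  # the pieces cover the escaped text exactly
print(detokenize(pieces))
```

Because the space is carried inside the pieces, the original string can be rebuilt by plain concatenation, whichever subword algorithm chose the pieces.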
2
votes
0
answers
441
views
SentencePiece tokenizer encodes to unknown token
I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no unicode ...
2
votes
0
answers
291
views
How to integrate sentencepiece and protobuf into an existing Android project correctly
I am trying to integrate a PyTorch model to process language. This is why I need sentencepiece to tokenize the sentence chunks. But I am unable to do that correctly.
I did not find any robust ...
2
votes
1
answer
3k
views
Error while converting pth file to ggml.py format
I get the error below when I run convert-pth-to-ggml.py.
I don't know whether it comes from my file management, which would prevent the model from loading, or from the OS.
Traceback (most recent call ...
1
vote
1
answer
103
views
Having trouble installing NewsSentiment and RUST and sentencepiece in conda?
I'm trying to install NewsSentiment on anaconda, which gave me this error:
(pytorch) C:\Users\chenx>pip3 install newssentiment
Collecting newssentiment
Using cached NewsSentiment-1.0.7-py3-none-...
1
vote
1
answer
2k
views
SentencePiece in Google Colab
I want to use sentencepiece, from https://github.com/google/sentencepiece in a Google Colab project where I am training an OpenNMT model. I'm a little confused with how to set up the sentencepiece ...
1
vote
1
answer
2k
views
How can I update sentencepiece package to its latest version using conda?
I have installed conda on Ubuntu 16 (Linux). When I install or update the package named sentencepiece, it installs version 0.1.85 (which I guess is from 2 months ago according to the Anaconda website). ...
1
vote
1
answer
625
views
(OpenNMT) Spanish to English Model Improvement
I’m currently trying to train a Spanish to English model using yaml scripts. My data set is pretty big but just for starters, I’m trying to get a 10,000 training set and 1000-2000 validation set ...
1
vote
1
answer
2k
views
How to add a new token to the T5 tokenizer, which uses sentencepiece
I am training the T5 transformer, which is based on TensorFlow, from the following link:
https://github.com/google-research/text-to-text-transfer-transformer
Here is a sample (input, output):
input:
b'[atomic]:&...
1
vote
0
answers
440
views
Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset?
I am trying to test a simple model using a SentencePieceTokenizer layer
over a (HuggingFace) dataset. But I seem unable to get the shape of
the dataset's target to agree with the model's output. All ...
1
vote
1
answer
221
views
libsentencepiece.so.0: cannot open shared object file: No such file or directory when creating BERTopic model
I am trying to train a BERTopic Model in python. However, I get this error:
RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
...
1
vote
0
answers
226
views
Got the "Unable to load vocabulary from file." while using pipelines
I have been trying to use the "csebuetnlp/mT5_multilingual_XLSum" model for summarization purposes.
The code I tried is listed as below:
!pip install transformers
!pip install sentencepiece
...
1
vote
1
answer
504
views
Saving SentencepieceTokenizer in Keras model throws TypeError: Failed to convert elements of [None, None] to Tensor
I'm trying to save a Keras model which uses a SentencepieceTokenizer.
Everything is working so far but I am unable to save the Keras model.
After training the sentencepiece model, I am creating the ...
1
vote
0
answers
770
views
Slow and fast tokenizers give different outputs (sentencepiece tokenization)
When I use T5TokenizerFast (the tokenizer of the T5 architecture), the output is as expected:
['▁', '</s>', '▁Hello', '▁', '<sep>', '</s>']
But when I use the normal tokenizer, it ...
1
vote
0
answers
1k
views
"OSError: Model name './XX' was not found in tokenizers model name list" - cannot load custom tokenizer in Transformers
I'm trying to create my own tokenizer with my own dataset/vocabulary using Sentencepiece and then use it with AlbertTokenizer from Transformers.
I closely followed the tutorial on how to train a ...
0
votes
0
answers
15
views
Treat Hawaiian Glottal stop (ʻOkina) as consonant, not punctuation
I'm struggling to get the ʻokina, the Hawaiian glottal stop character (U+02BB), treated as a letter and not as punctuation in SentencePiece subword tokenization, whether BPE or Unigram.
Can someone ...
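A first diagnostic, assuming the corpus might contain look-alike characters instead of U+02BB: Python's standard unicodedata module reports U+02BB as a modifier letter, while the left single quotation mark and the ASCII apostrophe are punctuation. The sketch below only inspects characters. Whether SentencePiece's splitting depends on these categories in the way the question assumes is not established here, and the `candidates` table and `describe` helper are hypothetical.

```python
import unicodedata

# Characters that are easily confused with the ʻokina in real text.
candidates = {
    "okina": "\u02bb",
    "left single quote": "\u2018",
    "apostrophe": "'",
}

def describe(ch: str) -> str:
    """Return the code point, general category and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}"

for label, ch in candidates.items():
    print(f"{label}: {describe(ch)}")
```

Running `describe` over the distinct non-ASCII characters of the training text shows which of these code points the corpus actually contains.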
0
votes
0
answers
11
views
Keras-NLP Albert Finetuning - Resource localhost/_0_SentencepieceOp/N10tensorflow4text12_GLOBAL__N_121SentencepieceResourceE does not exist
Describe the bug
I am fine-tuning the Keras implementation of Albert for my dataset for a classification problem by following the documentations present here - https://keras.io/api/keras_nlp/models/...
0
votes
0
answers
20
views
Why does the sentencepiece tokenizer in NLLB return IDs that differ by 1 when I specify the model by name versus by file?
When I'm using Hugging Face tokenizers and specify the model by name facebook/nllb-200-distilled-600M, I get tokens such as: 256047, 30311, 104, 253990 and so on. But when I take the sentencepiece.bpe.model from ...
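One possible explanation, stated as an assumption and not confirmed by the excerpt: tokenizer wrappers for fairseq-derived models are commonly understood to reserve a different set of leading special tokens than the raw SentencePiece model does, and to shift every raw id by a fixed offset of 1 to make room. The sketch below shows that arithmetic only. The constant names, the function, and the sample ids are hypothetical and chosen to differ by 1, as the question describes.

```python
# Assumption: the wrapper shifts every raw SentencePiece id by a fixed offset.
FAIRSEQ_OFFSET = 1
WRAPPER_UNK_ID = 3  # assumed id of <unk> in the wrapper's vocabulary

def raw_to_wrapper_id(raw_id: int) -> int:
    """Shift a raw id by the offset; raw id 0 (unknown) maps to the wrapper's <unk>."""
    return raw_id + FAIRSEQ_OFFSET if raw_id != 0 else WRAPPER_UNK_ID

raw_ids = [30310, 103]  # illustrative values only
wrapper_ids = [raw_to_wrapper_id(i) for i in raw_ids]
print(wrapper_ids)
```

If this assumption holds, ids produced by loading the model file directly need the same shift before they are fed to a model that expects the wrapper's numbering.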
0
votes
1
answer
356
views
Getting requirements to build wheel did not run successfully
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows
PS D:\Translator> pip ...
0
votes
1
answer
137
views
Trying to install bertopic, but Winerror2
I am trying to install bertopic, in VScode using pip and I am using a virtual environment but I am getting a Winerror2 while building sentencepiece.
I tried installing sentencepiece separately but ...
0
votes
0
answers
17
views
Does sentencepiece always encode the same string to the same result?
Will the sentencepiece tokenizer always produce the same encoding for the same string, regardless of the context around it?
For example:
sentence 1: abc bcde aa
sentence 2: nnnabc bcde zks
...
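A toy sketch of why the surrounding characters can matter. This is a greedy longest-match segmenter over a hypothetical vocabulary, not the SentencePiece algorithm. It only illustrates that a word following a space carries the ▁ prefix, while the same letters inside a longer word do not, so "abc" can map to different pieces in the two sentences. What a real model does depends on its vocabulary and on whether subword sampling is enabled.

```python
META = "\u2581"  # "▁", the symbol that stands in for a space

# Hypothetical vocabulary, chosen only for this illustration.
VOCAB = {META + "abc", META + "bcde", META + "aa", META + "nnn", META + "zks", "abc"}

def toy_encode(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy longest-match segmentation; falls back to single characters."""
    s = META + text.replace(" ", META)
    pieces, i = [], 0
    while i < len(s):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces

print(toy_encode("abc bcde aa"))
print(toy_encode("nnnabc bcde zks"))
```

In this toy setup "bcde" gets the same piece in both sentences because it follows a space in both, while "abc" does not.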
0
votes
0
answers
50
views
How to modify a trained SentencePiece tokenizer to stop splitting the chatml tokens?
We are using a pre-trained SentencePiece tokenizer (the SentencePiece tokenizer from Google, not huggingface), and we would like to preserve the chatML tokens:
<|im_start|> and <|im_end|>
...
0
votes
0
answers
17
views
ImportError: cannot import name 'SentencePieceModel' from 'sentencepiece' (/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py)
ImportError Traceback (most recent call last)
in <cell line: 4>()
2 import numpy as np
3 from sklearn.model_selection import train_test_split
----> 4 from ...
0
votes
0
answers
105
views
Sentencepiece tokenizer incorrectly concatenating input files
I am trying to use the sentencepiece to tokenize a large amount of source code files in several different languages.
# Train SentencePiece model
file_paths = []
for dir_name, _, file_list in ...
0
votes
0
answers
160
views
ImportError: CamembertTokenizer requires the SentencePiece library but it was not found in your environment
I am trying to create an .exe from Python code. Here is my .spec:
# -*- mode: python ; coding: utf-8 -*-
from PyInstaller.utils.hooks import copy_metadata
datas = [("C:/Users/pierr/OneDrive/Bureau/...
0
votes
0
answers
395
views
_sentencepiece.SentencePieceProcessor_LoadFromFile No such file or directory
I'm trying to run a script of the deepparse NN.
But I got this error.
_sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "C:\Users\MyUserName\.cache\deepparse\multi\multi....
0
votes
0
answers
197
views
RuntimeError: Graph is finalized and cannot be modified
After running the sample below:
def embed_muse(module):
    with tf.Graph().as_default():
        sentences = tf.placeholder(tf.string)
        embed = hub.load(module)
        embeddings = embed(...