Highest scored 'sentencepiece' questions

14 votes

1 answer

17k views

sentencepiece library is not being installed in the system

While running pip install tf-models-official, I hit the following problem during installation: Collecting tf-models-official Using cached tf_models_official-2.8.0-py2.py3-none-any....

Daremitsu

587

asked Mar 22, 2022 at 16:12

10 votes

2 answers

21k views

How to add new special token to the tokenizer?

I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased). QUERY: I want to ask a question. ANSWER: Sure, ask away. ...

sid8491

6,740

asked Sep 15, 2021 at 10:24

7 votes

1 answer

3k views

why does huggingface t5 tokenizer ignore some of the whitespaces?

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, such as line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer ...

Berkay Berabi

2,148

asked May 12, 2022 at 11:04

2 votes

1 answer

348 views

Some doubts about SentencePiece

I recently encountered some questions when I was learning Google’s SentencePiece. BPE, WordPiece and Unigram are all common subword algorithms, so what is the relationship between SentencePiece and ...

korangar leo

49

asked Sep 4, 2023 at 10:07

2 votes

0 answers

441 views

SentencePiece tokenizer encodes to unknown token

I am using the Hugging Face implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no unicode ...

Shital Shah

66.2k

asked Aug 2, 2023 at 8:58

2 votes

0 answers

291 views

how to integrate sentencepiece, protobuf into existing android project correctly

I am trying to integrate a PyTorch model to process language, so I need sentencepiece to tokenize the sentence chunks. But I am unable to do that correctly. I did not find any robust ...

im07

396

asked Apr 3, 2023 at 4:23

2 votes

1 answer

3k views

Error while converting pth file to ggml.py format

This is the error I get when I run convert-pth-to-ggml.py. I don't know whether the error is in my file management, which prevents the model from loading, or whether it is due to the OS. Traceback (most recent call ...

Tanish Shah

39

asked Mar 18, 2023 at 16:11

1 vote

1 answer

103 views

Having trouble installing NewsSentiment and RUST and sentencepiece in conda?

I'm trying to install NewsSentiment on anaconda, which gave me this error: (pytorch) C:\Users\chenx>pip3 install newssentiment Collecting newssentiment Using cached NewsSentiment-1.0.7-py3-none-...

Yooshinhee

37

asked Nov 16, 2023 at 13:01

1 vote

1 answer

2k views

SentencePiece in Google Colab

I want to use sentencepiece, from https://github.com/google/sentencepiece in a Google Colab project where I am training an OpenNMT model. I'm a little confused with how to set up the sentencepiece ...

Jose Chavez

115

asked Apr 29, 2021 at 4:45

1 vote

1 answer

2k views

How can I update sentencepiece package to its latest version using conda?

I have installed conda on Ubuntu 16 Linux. When I install or update the sentencepiece package, it installs version 0.1.85 (which I guess is from 2 months ago according to the Anaconda website). ...

Ahmad

9,458

asked Jul 5, 2020 at 10:39

1 vote

1 answer

625 views

(OpenNMT) Spanish to English Model Improvement

I’m currently trying to train a Spanish to English model using yaml scripts. My data set is pretty big but just for starters, I’m trying to get a 10,000 training set and 1000-2000 validation set ...

Jose Chavez

115

asked May 1, 2021 at 0:09

1 vote

1 answer

2k views

How to add new token to T5 tokenizer which uses sentencepiece

I am training the T5 transformer, which is based on TensorFlow, from the following link: https://github.com/google-research/text-to-text-transfer-transformer Here is a sample (input, output): input: b'[atomic]:&...

Ahmad

9,458

asked Apr 21, 2021 at 9:57

1 vote

0 answers

440 views

Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset?

I am trying to test a simple model using a SentencePieceTokenizer layer over a (HuggingFace) dataset. But I seem unable to get the shape of the dataset's target to agree with the model's output. All ...

rikb

652

asked Mar 12 at 19:56

1 vote

1 answer

221 views

libsentencepiece.so.0: cannot open shared object file: No such file or directory when creating BERTopic model

I am trying to train a BERTopic Model in python. However, I get this error: RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback): ...

kmcclenn

87

asked Jul 14, 2023 at 17:54

1 vote

0 answers

226 views

Got the "Unable to load vocabulary from file." while using pipelines

I have been trying to use the "csebuetnlp/mT5_multilingual_XLSum" model for summarization purposes. The code I tried is listed as below: !pip install transformers !pip install sentencepiece ...

dicloflom

11

asked Apr 6, 2023 at 10:52

1 vote

1 answer

504 views

Saving SentencepieceTokenizer in Keras model throws TypeError: Failed to convert elements of [None, None] to Tensor

I'm trying to save a Keras model which uses a SentencepieceTokenizer. Everything is working so far but I am unable to save the Keras model. After training the sentencepiece model, I am creating the ...

Stefan Falk

24.7k

asked Aug 2, 2022 at 8:55

1 vote

0 answers

770 views

Slow and Fast tokenizer gives different outputs(sentencepiece tokenization)

When I use T5TokenizerFast (the tokenizer for the T5 architecture), the output is as expected: ['▁', '</s>', '▁Hello', '▁', '<sep>', '</s>'] But when I use the normal tokenizer, it ...

canP

25

asked Jul 30, 2022 at 14:13

1 vote

0 answers

1k views

"OSError: Model name './XX' was not found in tokenizers model name list" - cannot load custom tokenizer in Transformers

I'm trying to create my own tokenizer with my own dataset/vocabulary using SentencePiece and then use it with AlbertTokenizer from Transformers. I closely followed the tutorial on how to train a ...

tlqn

379

asked Dec 8, 2020 at 12:42

0 votes

0 answers

15 views

Treat Hawaiian Glottal stop (ʻOkina) as consonant, not punctuation

I'm struggling to get the ʻokina, the Hawaiian glottal stop character (U+02BB), treated as a letter and not as punctuation in SentencePiece subword tokenization, whether BPE or Unigram. Can someone ...

HURIMOZ

41

asked Apr 19 at 2:29

0 votes

0 answers

11 views

Keras-NLP Albert Finetuning - Resource localhost/_0_SentencepieceOp/N10tensorflow4text12_GLOBAL__N_121SentencepieceResourceE does not exist

I am fine-tuning the Keras implementation of ALBERT on my dataset for a classification problem, following the documentation here: https://keras.io/api/keras_nlp/models/...

Aakash Howlader

5

asked Apr 10 at 23:26

0 votes

0 answers

20 views

Why sentencepiece tokenizer in nllb returns id's that differ by 1 when I specify model name or file?

When I use Hugging Face tokenizers and specify the model by name, facebook/nllb-200-distilled-600M, I get tokens such as 256047, 30311, 104, 253990 and so on. But when I take the sentencepiece.bpe.model from ...

Daniel Kukula

333

asked Apr 3 at 22:00

0 votes

1 answer

356 views

Getting requirements to build wheel did not run successfully

Mohit Sapat

5

asked Feb 4 at 9:31

0 votes

1 answer

137 views

Trying to install bertopic, but Winerror2

I am trying to install bertopic in VS Code using pip, inside a virtual environment, but I am getting a Winerror2 while building sentencepiece. I tried installing sentencepiece separately but ...

unnk

21

asked Jan 31 at 14:40

0 votes

0 answers

17 views

Does a string always have the same sentencepiece tokenizer encode result?

Will the tokenizer of sentencepiece always have the same encode result for the same string regardless of the context of the string? For example: sentence 1: abc bcde aa sentence 2: nnnabc bcde zks ...

Zip

11

asked Jan 5 at 8:56

0 votes

0 answers

50 views

How to modify a trained SentencePiece tokenizer to stop splitting the chatml tokens?

We are using a pre-trained SentencePiece tokenizer (the SentencePiece tokenizer from Google, not huggingface), and we would like to preserve the chatML tokens: <|im_start|> and <|im_end|> ...

vgoklani

11.3k

asked Nov 28, 2023 at 16:44

0 votes

0 answers

17 views

ImportError: cannot import name 'SentencePieceModel' from 'sentencepiece' (/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py)

ImportError Traceback (most recent call last) in <cell line: 4>() 2 import numpy as np 3 from sklearn.model_selection import train_test_split ----> 4 from ...

Dan Shitkar

1

asked Nov 27, 2023 at 8:22

0 votes

0 answers

105 views

Sentencepiece tokenizer incorrectly concatenating input files

I am trying to use sentencepiece to tokenize a large number of source code files in several different languages. # Train SentencePiece model file_paths = [] for dir_name, _, file_list in ...

Zeratul777

1

asked Nov 20, 2023 at 1:29

0 votes

0 answers

160 views

ImportError: CamembertTokenizer requires the SentencePiece library but it was not found in your environment

I am trying to create an .exe from Python code. Here is my .spec: # -*- mode: python ; coding: utf-8 -*- from PyInstaller.utils.hooks import copy_metadata datas = [("C:/Users/pierr/OneDrive/Bureau/...

pierrevslouis

11

asked Jul 30, 2023 at 1:32

0 votes

0 answers

395 views

_sentencepiece.SentencePieceProcessor_LoadFromFile No such file or directory

I'm trying to run a deepparse NN script, but I got this error: _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) OSError: Not found: "C:\Users\MyUserName\.cache\deepparse\multi\multi....

MR.Max

56

asked Feb 10, 2023 at 8:25

0 votes

0 answers

197 views

RuntimeError: Graph is finalized and cannot be modified

After running the sample below: def embed_muse(module): with tf.Graph().as_default(): sentences = tf.placeholder(tf.string) embed = hub.load(module) embeddings = embed(...

Sweety Tripathi

25

asked Jan 4, 2021 at 8:34
