chore: add virtual environment to the repository
- Add the backend_service/venv virtual environment
- Includes all Python dependency packages
- Note: the virtual environment is roughly 393 MB and contains 12,655 files
@@ -0,0 +1 @@
pip
@@ -0,0 +1,210 @@
Metadata-Version: 2.4
Name: tokenizers
Version: 0.22.1
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: huggingface-hub>=0.16.4,<2.0
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: pytest-asyncio ; extra == 'testing'
Requires-Dist: requests ; extra == 'testing'
Requires-Dist: numpy ; extra == 'testing'
Requires-Dist: datasets ; extra == 'testing'
Requires-Dist: black==22.3 ; extra == 'testing'
Requires-Dist: ruff ; extra == 'testing'
Requires-Dist: sphinx ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme ; extra == 'docs'
Requires-Dist: setuptools-rust ; extra == 'docs'
Requires-Dist: tokenizers[testing] ; extra == 'dev'
Provides-Extra: testing
Provides-Extra: docs
Provides-Extra: dev
Keywords: NLP,tokenizer,BPE,transformer,deep learning
Author-email: Nicolas Patry <patry.nicolas@protonmail.com>, Anthony Moi <anthony@huggingface.co>
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/huggingface/tokenizers
Project-URL: Source, https://github.com/huggingface/tokenizers

<p align="center">
    <br>
    <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
    <br>
</p>
<p align="center">
    <a href="https://badge.fury.io/py/tokenizers">
        <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
    </a>
    <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
    </a>
</p>
<br>

# Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

These are bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

## Main features:

- Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token (see the sketch right after this list).
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
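
As a minimal sketch of the alignment tracking mentioned above (it reuses the `Tokenizer.from_pretrained` loading shown further down and assumes access to the Hugging Face Hub; the sample sentence is arbitrary):

```python
from tokenizers import Tokenizer

# Downloading "bert-base-cased" requires access to the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

sentence = "Welcome to the Tokenizers library."
encoded = tokenizer.encode(sentence)

# Each token carries the (start, end) character span it was produced from,
# so any token can be mapped back to the original sentence.
for token, (start, end) in zip(encoded.tokens, encoded.offsets):
    print(f"{token!r:>12} -> {sentence[start:end]!r}")
```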

### Installation

#### With pip:

```bash
pip install tokenizers
```

#### From sources:

To use this method, you need to have Rust installed:

```bash
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
```

Once Rust is installed, you can compile by running the following:

```bash
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
```
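
As a quick sanity check that the editable install picked up the compiled extension (a minimal sketch; run it inside the same virtual env):

```python
# Importing the package and printing its version confirms the build is usable
import tokenizers

print(tokenizers.__version__)
```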

### Load a pretrained tokenizer from the Hub

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
```

### Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of these using some `vocab.json` and `merges.txt` files:

```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```

And you can train them just as simply:

```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
```

#### Provided Tokenizers

- `CharBPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte level version of the BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
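
For instance, here is a minimal sketch using `BertWordPieceTokenizer`; the vocabulary and training-file paths are placeholders:

```python
from tokenizers import BertWordPieceTokenizer

# Load from an existing WordPiece vocabulary (placeholder path)
tokenizer = BertWordPieceTokenizer("./path/to/bert-vocab.txt", lowercase=True)
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)

# Or train a fresh one on your own files (placeholder paths)
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"], vocab_size=30000)
tokenizer.save("./path/to/bert-wordpiece.tokenizer.json")
```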

### Build your own

Whenever the provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need.
You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.

#### Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
```
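
If your corpus lives in memory rather than in files, the same kind of trainer can also be driven through `train_from_iterator`. A minimal, self-contained sketch (the two sample sentences stand in for a real corpus):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

trainer = trainers.BpeTrainer(
    vocab_size=1000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)

# Any iterator of strings works: a list, a generator, a dataset column, ...
corpus = [
    "I can feel the magic, can you?",
    "Training straight from an in-memory iterator.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.get_vocab_size())
```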

Now, using this tokenizer is as simple as:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
```
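
A couple of follow-up calls that come in handy at this point, shown as a minimal sketch (it assumes the `byte-level-bpe.tokenizer.json` file saved above):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# Encode a whole batch in one call
encodings = tokenizer.encode_batch(["I can feel the magic, can you?", "So can I."])
print([e.tokens for e in encodings])

# And map the ids back to text
print(tokenizer.decode(encodings[0].ids))
```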
@@ -0,0 +1,45 @@
tokenizers-0.22.1.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
tokenizers-0.22.1.dist-info/METADATA,sha256=3ckvBh_0IvsF0Z2ljpFhxQmNoUPDiqI3eb-KW0jOdso,6779
tokenizers-0.22.1.dist-info/RECORD,,
tokenizers-0.22.1.dist-info/WHEEL,sha256=EmbG9zyqShfWQD8iunhabgzlMZGQwxFV_zZkaywLDn0,127
tokenizers/__init__.py,sha256=ZE5ZagUvobBScrHBQdEobhx4wqM0bsq9F9aLYkBNjYQ,2615
tokenizers/__init__.pyi,sha256=4To1kfbT82HE2tSszJJwSKUmy7m3Y5dw6Oqzy_-0Iao,45154
tokenizers/__pycache__/__init__.cpython-313.pyc,,
tokenizers/decoders/__init__.py,sha256=hfwM6CFUDvlMGGL4-xsaaYz81K9P5rQI5ZL5UHWK8Y4,372
tokenizers/decoders/__init__.pyi,sha256=72hidxCIWgV-dbJxFe9KReGs2YOXAdjq_kJhJI_mgLY,7395
tokenizers/decoders/__pycache__/__init__.cpython-313.pyc,,
tokenizers/implementations/__init__.py,sha256=VzAsplaIo7rl4AFO8Miu7ig7MfZjvonwVblZw01zR6M,310
tokenizers/implementations/__pycache__/__init__.cpython-313.pyc,,
tokenizers/implementations/__pycache__/base_tokenizer.cpython-313.pyc,,
tokenizers/implementations/__pycache__/bert_wordpiece.cpython-313.pyc,,
tokenizers/implementations/__pycache__/byte_level_bpe.cpython-313.pyc,,
tokenizers/implementations/__pycache__/char_level_bpe.cpython-313.pyc,,
tokenizers/implementations/__pycache__/sentencepiece_bpe.cpython-313.pyc,,
tokenizers/implementations/__pycache__/sentencepiece_unigram.cpython-313.pyc,,
tokenizers/implementations/base_tokenizer.py,sha256=HzK6Nm36LxJaKTZrIM1Cx_Ld9gNjeJdB9E29M0aYGBI,15791
tokenizers/implementations/bert_wordpiece.py,sha256=sKCum0FKPYdSgJFJN8LDerVBoTDRSqyqSdrcm-lvQqI,5520
tokenizers/implementations/byte_level_bpe.py,sha256=iBepM_z1s5Ky7zFDVrYLc3L5byYrIouk7-k0JGuF10s,4272
tokenizers/implementations/char_level_bpe.py,sha256=Nag_HFq8Rvcucqi8MhV1-0xtoR0C7FjHOecFVURL7ss,5449
tokenizers/implementations/sentencepiece_bpe.py,sha256=c08fKf6i92E2RsKgsxy7LzZfYX8-MACHSRG8U_I5ytY,3721
tokenizers/implementations/sentencepiece_unigram.py,sha256=SYiVXL8ZtqLXKpuqwnwmrfxgGotu8yAkOu7dLztEXIo,7580
tokenizers/models/__init__.py,sha256=eJZ4HTAQZpxnKILNylWaTFqxXy-Ba6OKswWN47feeV8,176
tokenizers/models/__init__.pyi,sha256=clPTwiyjz7FlVdEuwo_3Wa_TmQrbZhW0SGmnNylepnY,16929
tokenizers/models/__pycache__/__init__.cpython-313.pyc,,
tokenizers/normalizers/__init__.py,sha256=_06w4cqRItveEgIddYaLMScgkSOkIAMIzYCesb5AA4U,841
tokenizers/normalizers/__init__.pyi,sha256=lSFqDb_lPZBfRxEG99EcFEaU1HlnIhIQUu7zZIyP4AY,20898
tokenizers/normalizers/__pycache__/__init__.cpython-313.pyc,,
tokenizers/pre_tokenizers/__init__.py,sha256=KV9-EsAykGENUUzkGWCbv4n6YM6hYa1hfnY-gzBpMNE,598
tokenizers/pre_tokenizers/__init__.pyi,sha256=n6BFClhxm8y7miCC0lJd7oVeo8oj3kPg2tU9ObV4PGU,26556
tokenizers/pre_tokenizers/__pycache__/__init__.cpython-313.pyc,,
tokenizers/processors/__init__.py,sha256=xM2DEKwKtHIumHsszM8AMkq-AlaqvBZFXWgLU8SNhOY,307
tokenizers/processors/__init__.pyi,sha256=hx767ZY8SHhxb_hiXPRxm-f_KcoR4XDx7vfK2c0lR-Q,11357
tokenizers/processors/__pycache__/__init__.cpython-313.pyc,,
tokenizers/tokenizers.abi3.so,sha256=-T5w9yJa7pkS35XUNSagc57jj1Nvn4dayt95bQDaX10,10376544
tokenizers/tools/__init__.py,sha256=xG8caB9OHC8cbB01S5vYV14HZxhO6eWbLehsb70ppio,55
tokenizers/tools/__pycache__/__init__.cpython-313.pyc,,
tokenizers/tools/__pycache__/visualizer.cpython-313.pyc,,
tokenizers/tools/visualizer-styles.css,sha256=zAydq1oGWD8QEll4-eyL8Llw0B1sty_hpIE3tYxL02k,4850
tokenizers/tools/visualizer.py,sha256=zEXELCLSxXL_RgQgCTjVQrc3G_ptuDwr7deaU95b3dA,14625
tokenizers/trainers/__init__.py,sha256=UTu22AGcp76IvpW45xLRbJWET04NxPW6NfCb2YYz0EM,248
tokenizers/trainers/__init__.pyi,sha256=OwdiVOlMXhU5hOq7a5TYYG1vw3fk8nTqH88tVr05NZ0,5860
tokenizers/trainers/__pycache__/__init__.cpython-313.pyc,,
@@ -0,0 +1,4 @@
Wheel-Version: 1.0
Generator: maturin (1.9.4)
Root-Is-Purelib: false
Tag: cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64