Environment
The elastic resource instances are themselves Docker containers, so you cannot run Docker inside them.
Student verification
Proxy:
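The proxy details were not captured above; a minimal sketch using plain environment variables, with the proxy address being a hypothetical placeholder:

# Hypothetical proxy endpoint; replace with your actual proxy address
export http_proxy=http://127.0.0.1:7890
export https_proxy=http://127.0.0.1:7890
# Undo when finished
unset http_proxy https_proxy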
apt update
apt install git-lfs build-essential ninja-build vim tmux
ulimit -n 2048
Choose a GPU and select the PyTorch image; pay attention to the PyTorch version and the matching CUDA version.
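To confirm that the image's PyTorch and CUDA versions actually match, a quick check using standard commands (nothing instance-specific):

nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"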
The /root/autodl-tmp directory is the data disk; it is fairly large (50 GB), but its contents are not kept when you save the instance as an image.
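Since the data disk is not captured in saved images, it helps to keep large model downloads there instead of on the system disk. A minimal sketch that points the Hugging Face cache at the data disk (HF_HOME is a standard Hugging Face environment variable; the directory layout below is just one choice):

mkdir -p /root/autodl-tmp/hf_cache
export HF_HOME=/root/autodl-tmp/hf_cache
# Persist the setting for new shells
echo 'export HF_HOME=/root/autodl-tmp/hf_cache' >> ~/.bashrc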
vLLM
vLLM is now built against CUDA 12 by default, so pip install vllm followed by pip install flash-attn is enough. Below are the steps for the CUDA 11.8 build (make sure the Python, CUDA, and CXX ABI versions all match):
export VLLM_VERSION=0.5.5
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
# Check which CXX ABI the installed torch was built with, then pick the matching flash-attn wheel
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

python -m vllm.entrypoints.api_server --model ${model_dir} --tensor-parallel-size 1 --gpu-memory-utilization 0.93
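Once the server is up, a quick smoke test against the demo api_server's /generate endpoint (this is the simple demo server, not the OpenAI-compatible one; default port 8000 and the prompt/sampling values below are assumptions):

curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.0}'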
TensorRT-LLM
Image used: https://www.codewithgpu.com/i/triton-inference-server/tensorrtllm_backend/tensorrtllm_backend
References: https://www.cnblogs.com/zackstang/p/18269743 and https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh
git clone https://github.com/NVIDIA/TensorRT-LLM.git -b v0.8.0

# convert_checkpoint.py: convert the weight format
# trtllm-build: build the engine
#   --use_gpt_attention_plugin=fp16 --remove_input_padding --enable_context_fmha
#   --use_inflight_batching --paged_kv_cache --use_gemm_plugin=fp16 --max_input_len=1024
# run.py: static (offline) run
#   --input_text="hello" --max_output_len=1024 --run_profiling

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.8.0
cp TensorRT-LLM/examples/llama/trt_engines/* all_models/inflight_batcher_llm/tensorrt_llm/1/
# fill_template.py fills in the config.pbtxt templates under the model repo (its arguments are omitted here)
python3 tools/fill_template.py
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=./all_models/inflight_batcher_llm/ --http_port 8080
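After the Triton server starts, a minimal request against the generate endpoint of the ensemble model (the field names follow the tensorrtllm_backend examples; the port matches --http_port above, and the prompt and max_tokens are arbitrary):

curl -X POST http://localhost:8080/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'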