Environment
The elastic resource instances are themselves Docker containers, so you cannot run Docker inside them.
Student verification
Proxy:
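The proxy details were not captured above; a minimal sketch using plain environment variables, with the proxy address being a hypothetical placeholder:

# Hypothetical proxy endpoint; replace with your actual proxy address
export http_proxy=http://127.0.0.1:7890
export https_proxy=http://127.0.0.1:7890
# Undo when finished
unset http_proxy https_proxy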
apt update
apt install git-lfs build-essential ninja-build vim tmux
ulimit -n 2048
Choose a GPU and select the PyTorch image; pay attention to the PyTorch version and the matching CUDA version.
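To confirm that the image's PyTorch and CUDA versions actually match, a quick check using standard commands (nothing instance-specific):

nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"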
The /root/autodl-tmp directory is the data disk; it is fairly large (50 GB), but its contents are not kept when you save the instance as an image.
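Since the data disk is not captured in saved images, it helps to keep large model downloads there instead of on the system disk. A minimal sketch that points the Hugging Face cache at the data disk (HF_HOME is a standard Hugging Face environment variable; the directory layout below is just one choice):

mkdir -p /root/autodl-tmp/hf_cache
export HF_HOME=/root/autodl-tmp/hf_cache
# Persist the setting for new shells
echo 'export HF_HOME=/root/autodl-tmp/hf_cache' >> ~/.bashrc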
vLLM
vLLM is now built against CUDA 12 by default, so pip install vllm followed by pip install flash-attn is enough. Below are the steps for the CUDA 11.8 build (make sure the Python, CUDA, and CXX ABI versions all match):
export VLLM_VERSION=0.5.5
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
# Check which CXX ABI the installed torch was built with, then pick the matching flash-attn wheel
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

python -m vllm.entrypoints.api_server --model ${model_dir} --tensor-parallel-size 1 --gpu-memory-utilization 0.93
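Once the server is up, a quick smoke test against the demo api_server's /generate endpoint (this is the simple demo server, not the OpenAI-compatible one; default port 8000 and the prompt/sampling values below are assumptions):

curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.0}'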
TensorRT-LLM
Image used: https://www.codewithgpu.com/i/triton-inference-server/tensorrtllm_backend/tensorrtllm_backend
References: https://www.cnblogs.com/zackstang/p/18269743 and https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh
git clone https://github.com/NVIDIA/TensorRT-LLM.git -b v0.8.0

# convert_checkpoint.py: convert the weight format
# trtllm-build: build the engine
#   --use_gpt_attention_plugin=fp16 --remove_input_padding --enable_context_fmha
#   --use_inflight_batching --paged_kv_cache --use_gemm_plugin=fp16 --max_input_len=1024
# run.py: static (offline) run
#   --input_text="hello" --max_output_len=1024 --run_profiling

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.8.0
cp TensorRT-LLM/examples/llama/trt_engines/* all_models/inflight_batcher_llm/tensorrt_llm/1/
# fill_template.py fills in the config.pbtxt templates under the model repo (its arguments are omitted here)
python3 tools/fill_template.py
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=./all_models/inflight_batcher_llm/ --http_port 8080
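After the Triton server starts, a minimal request against the generate endpoint of the ensemble model (the field names follow the tensorrtllm_backend examples; the port matches --http_port above, and the prompt and max_tokens are arbitrary):

curl -X POST http://localhost:8080/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'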