OVERVIEW GPU Embedding Overview

A text embedding API based on the BAAI/bge-m3 model. It converts text into 1024-dimensional numeric vectors, so that semantically similar texts map to nearby vectors. Both single and batch requests (up to 100 texts) are supported, and repeated requests with identical input are returned immediately from the cache.

BAAI/bge-m3: a multilingual embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). A single model handles more than 100 languages and supports three retrieval modes at once: Dense, Sparse, and ColBERT. It can process long documents of up to 8192 tokens and performs well on Korean, English, Chinese, and Japanese. It is served with HuggingFace TEI (Text Embeddings Inference), which provides high throughput through Continuous Batching and Flash Attention. VRAM usage is about 2 GB.
Model information
Item | Value
Model | BAAI/bge-m3
VRAM | ~2 GB (FP16)
Output dimension | 1024
Supported languages | 100+ languages (including Korean)
Features | Dense + Sparse + ColBERT multi-vector support (this API uses Dense)
Request flow
POST   /api/v1/ns/{ns}/gpu/embed → 202 + request_id
↓ Worker runs inference with the bge-m3 model
GET    /api/v1/ns/{ns}/gpu/embed/{request_id} → 200 + vector result
DELETE /api/v1/ns/{ns}/gpu/embed/{request_id} → delete the result (optional)
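The submit-then-poll flow above could be sketched as follows. This is a minimal illustration, not the service's official client: the function names, `base_url`, the polling interval, and the poll limit are assumptions, while the paths, the `X-API-Key` header, and the 202/200 status handling follow this page.

```python
import json
import time
import urllib.error
import urllib.request


def http_json(method, url, api_key, body=None):
    """Send one request and return (http_status, parsed_body)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("X-API-Key", api_key)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # 4xx/5xx arrive as exceptions; some bodies (e.g. 429) are not JSON.
        raw = err.read().decode("utf-8", errors="replace")
        try:
            return err.code, json.loads(raw)
        except json.JSONDecodeError:
            return err.code, {"raw": raw}


def embed_and_wait(base_url, ns, api_key, texts, send=http_json,
                   poll_interval=0.2, max_polls=50):
    """Submit texts, then poll until the request is no longer pending.

    poll_interval and max_polls are illustrative defaults (assumptions).
    """
    submit_url = f"{base_url}/api/v1/ns/{ns}/gpu/embed"
    status, body = send("POST", submit_url, api_key, {"texts": texts})
    if status != 202:
        raise RuntimeError(f"submit failed: HTTP {status}: {body}")
    request_id = body["data"]["request_id"]

    result_url = f"{submit_url}/{request_id}"
    for _ in range(max_polls):
        status, body = send("GET", result_url, api_key)
        if status == 202:  # queued or processing: wait and retry
            time.sleep(poll_interval)
            continue
        if status == 200:  # completed or failed; check data["status"]
            return body["data"]
        raise RuntimeError(f"poll failed: HTTP {status}: {body}")
    raise TimeoutError(f"{request_id} still pending after {max_polls} polls")
```

The `send` parameter exists only so the flow can be exercised without a live server; in normal use the default `http_json` performs the real calls.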
⚠️ Activation check: The Embedding task must be enabled. Check the REDGX_GPU_EMBEDDING_ENABLED=true environment variable and the load status of the model (BAAI/bge-m3) in the settings bar at the top.
💡 Before you start: Enter your API Key in the settings bar at the top. A key with write permission is required. The Namespace defaults to HRM.
POST Basic embedding request

Submit a single text and retrieve its vector result.

1 Request → 2 Wait → 3 Result
POST Batch embedding

Submit multiple texts at once. The input order is preserved.

Response structure: vectors[i].index is guaranteed to match the input order. All texts are processed together under a single request_id.
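Because each vector carries its own index, a client can pair results with inputs by that field instead of relying on list position. A small sketch, assuming the hypothetical helper name `align_vectors` and the `vectors` item shape shown in the API reference (`index`, `embedding`):

```python
def align_vectors(texts, vectors):
    """Pair each submitted text with its embedding, using the index field."""
    by_index = {v["index"]: v for v in vectors}
    # Defensive check: every input position must appear exactly once.
    if sorted(by_index) != list(range(len(texts))):
        raise ValueError("response indices do not cover the submitted texts")
    return [(texts[i], by_index[i]["embedding"]) for i in range(len(texts))]
```

The check is purely defensive; with the ordering guarantee described above it should never fail.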
POST Cache behavior check

If the same text is requested twice, the second request is served from the cache without using the GPU.

Cache key: generated as SHA-256(model name + text list). A request with the same model and the same texts is always served from the cache (cached: true).
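To illustrate the idea of a key derived from the model name and the text list, here is a rough sketch. The server's actual serialization is not documented on this page; the JSON encoding and the function name `cache_key` below are assumptions chosen only to make the example deterministic.

```python
import hashlib
import json


def cache_key(model, texts):
    """Hash the model name and the ordered text list with SHA-256.

    Assumption: inputs are serialized as compact JSON before hashing.
    The real service may combine them differently.
    """
    payload = json.dumps(
        {"model": model, "texts": texts},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under this scheme the text order is part of the key, so the same texts in a different order would produce a different key.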
POST Cosine similarity calculation

Embed two texts and compute their semantic similarity.

Cosine similarity: the closer to 1.0, the more similar the meaning; the closer to 0, the less related. When texts are embedded with normalize: true, the dot product equals the cosine similarity.
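The relationship between the dot product and cosine similarity can be checked on the client side. The sketch below decodes the base64 float32 `embedding` field and compares the two values. The little-endian byte order and the helper names are assumptions; the page only states that the vector is base64-encoded float32.

```python
import base64
import math
import struct


def decode_embedding(b64, dim):
    """Decode a base64 string into a tuple of `dim` float32 values.

    Assumption: values are packed little-endian.
    """
    raw = base64.b64decode(b64)
    if len(raw) != 4 * dim:
        raise ValueError(f"expected {4 * dim} bytes, got {len(raw)}")
    return struct.unpack(f"<{dim}f", raw)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; equals dot(a, b) when both vectors have unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For L2-normalized vectors both norms are 1, so `cosine(a, b)` and `dot(a, b)` agree up to floating-point error, and the cheaper dot product is enough.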
Result retrieval / deletion / cancellation

Test explicit result deletion, the auto_clear option, and cancellation while queued.

GET Result retrieval + auto_clear
Status code meanings
status | Meaning
queued | Waiting in the queue (HTTP 202)
processing | GPU inference in progress (HTTP 202)
completed | Completed, result included (HTTP 200)
failed | Inference failed (HTTP 200)
not_found | No result / expired / deleted (HTTP 404)
POST Batch status lookup

Look up the processing status of multiple request_ids at once.

ERROR Error cases

Test various error situations such as authentication failure, invalid parameters, and a disabled task.

Error code list
Code | HTTP | Description
UNAUTHORIZED | 401 | API Key missing or invalid
NAMESPACE_DENIED | 403 | No access permission for the Namespace
(none) | 422 | Pydantic validation error, e.g. an empty texts array (FastAPI default format: {"detail":[...]}, no error.code)
GPU_INVALID_INPUT | 400 | More than 100 texts (max_texts_per_request)
GPU_NOT_FOUND | 404 | request_id not found or expired (outbox TTL 3600s)
GPU_PROCESSING | 409 | Cannot cancel, already processing (cancel endpoint)
GPU_TASK_DISABLED | 503 | embedding task disabled (REDGX_GPU_EMBEDDING_ENABLED=false)
GPU_UNAVAILABLE | 503 | Worker warming up or in drain mode
GPU_QUEUE_FULL | 503 | Queue capacity exceeded (max_inflight or max_requests reached)
GPU_CIRCUIT_OPEN | 200 (failed) | Worker result: call blocked because the inference server Circuit Breaker is OPEN
GPU_INFERENCE_FAILED | 200 (failed) | Worker result: inference server call failed / exception
GPU_TIMEOUT | WS 4008 | WebSocket /wait timed out waiting for the result
(none) | 429 | nginx Rate Limit exceeded (10 r/s per IP, burst 20). The response is plain HTML with no JSON code
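Since the 422 and 429 responses do not carry an `error.code`, a client has to branch on the HTTP status before trying to read the JSON envelope. The sketch below shows one way to do that. The labels `RATE_LIMITED`, `VALIDATION_ERROR`, and `UNKNOWN`, and the choice of which codes are worth retrying, are assumptions for illustration and are not defined by the API.

```python
import json

# Assumption: only transient capacity conditions are retried.
RETRYABLE_CODES = {"GPU_UNAVAILABLE", "GPU_QUEUE_FULL"}


def classify_error(http_status, body_text):
    """Return (code, retryable) for a non-2xx HTTP response."""
    if http_status == 429:
        # nginx answers with plain HTML, so there is no JSON to parse.
        return "RATE_LIMITED", True
    if http_status == 422:
        # FastAPI validation errors use {"detail": [...]} without error.code.
        return "VALIDATION_ERROR", False
    try:
        code = json.loads(body_text)["error"]["code"]
    except (ValueError, KeyError, TypeError):
        return "UNKNOWN", False
    return code, code in RETRYABLE_CODES
```

GPU_TASK_DISABLED also returns 503 but is treated as non-retryable here, on the assumption that it reflects configuration rather than temporary load.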
API Reference

All endpoints and their request/response schemas.

POST /api/v1/ns/{ns}/gpu/embed: Embedding request

Response: 202 Accepted

// Request body
{
  "texts": ["text 1", "text 2"],    // required, 1 to 100 items
  "model": null,                    // optional, null uses the server default (BAAI/bge-m3)
  "normalize": true                 // optional, whether to apply L2 normalization (default true)
}

// Response body (202 Accepted)
{
  "ok": true,
  "data": {
    "request_id": "emb-1710000000000-a1b2c3d4",
    "task_type": "embedding",
    "chunk_count": 1
  }
}
GET /api/v1/ns/{ns}/gpu/embed/{req_id}: Result retrieval

Completed: 200 OK  |  Queued/processing: 202 Accepted  |  Not found/expired: 404

// Completed: HTTP 200, status: "completed"
// Result fields are flattened directly under data (no nesting)
{
  "ok": true,
  "data": {
    "request_id": "emb-...",
    "status": "completed",
    "model": "BAAI/bge-m3",
    "vectors": [
      {
        "index": 0,
        "text": "original text",
        "embedding": "base64_encoded_float32_vector...",
        "dim": 1024
      }
    ],
    "cached": false,
    "elapsed_ms": 42.5
  }
}

// Queued: HTTP 202, status: "queued"
{ "ok": true, "data": { "request_id": "emb-...", "status": "queued" } }

// GPU processing: HTTP 202, status: "processing"
{ "ok": true, "data": { "request_id": "emb-...", "status": "processing" } }

// Inference failed: HTTP 200, status: "failed"
{ "ok": true, "data": { "request_id": "emb-...", "status": "failed", "error": {...} } }

// Not found/expired/deleted: HTTP 404
{ "ok": false, "error": { "code": "GPU_NOT_FOUND", "message": "..." } }
WS /api/v1/ns/{ns}/gpu/embed/{req_id}/wait: Wait for result push

The server receives the completion notification over Redis Pub/Sub and sends the result immediately. Latency is lower than with GET polling.

// Authentication: pass the API key in the Sec-WebSocket-Protocol header (URL query not supported)
// Browser: new WebSocket(url, ["redgx_ak_hrm_..."])
wss://<host>/api/v1/ns/HRM/gpu/embed/emb-xxx/wait?timeout=10          // timeout: maximum wait in seconds, default 10, max 300

// Completed: status: "completed" (same structure as REST GET)
{ "ok": true, "data": { "request_id": "emb-...", "status": "completed",
    "model": "BAAI/bge-m3", "vectors": [...], "cached": false, "elapsed_ms": 42.5 } }

// Failed: status: "failed"
{ "ok": true, "data": { "request_id": "emb-...", "status": "failed", "error": {...} } }

// Authentication failure (API key missing/invalid, no ns permission) → HTTP 403 (WebSocket upgrade rejected, before accept)
// Not Found    → { "ok": false, "error": { "code": "GPU_NOT_FOUND" } } + close(4004)
// Timeout      → { "ok": false, "error": { "code": "GPU_TIMEOUT" } }  + close(4008)
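The connection parameters above can be assembled without any WebSocket library. The sketch below builds the URL and subprotocol list and maps the documented close codes back to error codes. The function names and the lower bound of 1 second are assumptions; the 300-second maximum, the subprotocol-based key passing, and the 4004/4008 close codes come from this page. The returned values would then be handed to whichever WebSocket client is in use.

```python
from urllib.parse import quote, urlencode

# Close codes documented for the /wait endpoint.
CLOSE_CODES = {4004: "GPU_NOT_FOUND", 4008: "GPU_TIMEOUT"}


def ws_wait_request(host, ns, request_id, api_key, timeout=10):
    """Return (url, subprotocols) for the /wait endpoint."""
    # Documented maximum is 300 s; the minimum of 1 s is an assumption.
    timeout = max(1, min(int(timeout), 300))
    path = f"/api/v1/ns/{quote(ns)}/gpu/embed/{quote(request_id)}/wait"
    url = f"wss://{host}{path}?{urlencode({'timeout': timeout})}"
    # The API key travels as the WebSocket subprotocol, not in the URL.
    return url, [api_key]


def describe_close(code):
    """Map a WebSocket close code to the documented error code."""
    return CLOSE_CODES.get(code, "UNKNOWN")
```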
Headers
Header | Description
X-API-Key | Required. Client API key
Content-Type: application/json | Required for POST requests