{"note":"OpenAPI conversion -- returning structured metadata","name":"huggingface-text-gen","description":"Text Generation Inference","version":"3.3.6-dev0","base_url":"","endpoints":12,"raw":"@lap v0.3\n# Machine-readable API spec. Each @endpoint block is one API call.\n@api Text Generation Inference\n@version 3.3.6-dev0\n@endpoints 12\n@toc root(1), chat_tokenize(1), generate(1), generate_stream(1), health(1), info(1), invocations(1), metrics(1), tokenize(1), chat(1), completions(1), models(1)\n\n@group root\n@endpoint POST /\n@desc Generate tokens if `stream == false` or a stream of tokens if `stream == true`\n@required {inputs: str}\n@optional {parameters: map{adapter_id: str, best_of: int, decoder_input_details: bool, details: bool, do_sample: bool, frequency_penalty: num(float), grammar: any, max_new_tokens: int(int32), repetition_penalty: num(float), return_full_text: bool, seed: int(int64), stop: [str], temperature: num(float), top_k: int(int32), top_n_tokens: int(int32), top_p: num(float), truncate: int, typical_p: num(float), watermark: bool}, stream: bool=false}\n@returns(200) Generated Text\n@errors {422: Input validation error, 424: Generation Error, 429: Model is overloaded, 500: Incomplete generation}\n\n@endgroup\n\n@group chat_tokenize\n@endpoint POST /chat_tokenize\n@desc Template and tokenize ChatRequest\n@required {messages: [any] # A list of messages comprising the conversation so far.}\n@optional {frequency_penalty: num(float) # Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim., logit_bias: [num(float)] # UNUSED Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. 
The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token., logprobs: bool # Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message., max_tokens: int(int32)=1024 # The maximum number of tokens that can be generated in the chat completion., model: str # [UNUSED] ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API., n: int(int32) # UNUSED How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep n as 1 to minimize costs., presence_penalty: num(float) # Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics, response_format: any=null, seed: int(int64), stop: [str] # Up to 4 sequences where the API will stop generating further tokens., stream: bool, stream_options: any, temperature: num(float) # What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.  We generally recommend altering this or `top_p` but not both., tool_choice: any=auto, tool_prompt: str # A prompt to be appended before the tools, tools: [map{function!: map, type!: str}] # A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for., top_logprobs: int(int32) # An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. 
logprobs must be set to true if this parameter is used., top_p: num(float) # An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.}\n@returns(200) {templated_text: str, tokenize_response: [map]} # Templated and tokenized ChatRequest\n@errors {404: Failed to tokenize ChatRequest}\n\n@endgroup\n\n@group generate\n@endpoint POST /generate\n@desc Generate tokens\n@required {inputs: str}\n@optional {parameters: map{adapter_id: str, best_of: int, decoder_input_details: bool, details: bool, do_sample: bool, frequency_penalty: num(float), grammar: any, max_new_tokens: int(int32), repetition_penalty: num(float), return_full_text: bool, seed: int(int64), stop: [str], temperature: num(float), top_k: int(int32), top_n_tokens: int(int32), top_p: num(float), truncate: int, typical_p: num(float), watermark: bool}}\n@returns(200) {details: any?, generated_text: str} # Generated Text\n@errors {422: Input validation error, 424: Generation Error, 429: Model is overloaded, 500: Incomplete generation}\n\n@endgroup\n\n@group generate_stream\n@endpoint POST /generate_stream\n@desc Generate a stream of tokens using Server-Sent Events\n@required {inputs: str}\n@optional {parameters: map{adapter_id: str, best_of: int, decoder_input_details: bool, details: bool, do_sample: bool, frequency_penalty: num(float), grammar: any, max_new_tokens: int(int32), repetition_penalty: num(float), return_full_text: bool, seed: int(int64), stop: [str], temperature: num(float), top_k: int(int32), top_n_tokens: int(int32), top_p: num(float), truncate: int, typical_p: num(float), watermark: bool}}\n@returns(200) Generated Text\n@errors {422: Input validation error, 424: Generation Error, 429: Model is overloaded, 500: Incomplete generation}\n\n@endgroup\n\n@group health\n@endpoint GET /health\n@desc Health check method\n@returns(200) 
Everything is working fine\n@errors {503: Text generation inference is down}\n\n@endgroup\n\n@group info\n@endpoint GET /info\n@desc Text Generation Inference endpoint info\n@returns(200) {docker_label: str?, max_best_of: int, max_client_batch_size: int, max_concurrent_requests: int, max_input_tokens: int, max_stop_sequences: int, max_total_tokens: int, model_id: str, model_pipeline_tag: str?, model_sha: str?, router: str, sha: str?, validation_workers: int, version: str} # Served model info\n\n@endgroup\n\n@group invocations\n@endpoint POST /invocations\n@desc Generate tokens from Sagemaker request\n@returns(200) Generated Chat Completion\n@errors {422: Input validation error, 424: Generation Error, 429: Model is overloaded, 500: Incomplete generation}\n\n@endgroup\n\n@group metrics\n@endpoint GET /metrics\n@desc Prometheus metrics scrape endpoint\n@returns(200) Prometheus Metrics\n\n@endgroup\n\n@group tokenize\n@endpoint POST /tokenize\n@desc Tokenize inputs\n@required {inputs: str}\n@optional {parameters: map{adapter_id: str, best_of: int, decoder_input_details: bool, details: bool, do_sample: bool, frequency_penalty: num(float), grammar: any, max_new_tokens: int(int32), repetition_penalty: num(float), return_full_text: bool, seed: int(int64), stop: [str], temperature: num(float), top_k: int(int32), top_n_tokens: int(int32), top_p: num(float), truncate: int, typical_p: num(float), watermark: bool}}\n@returns(200) Tokenized ids\n@errors {404: No tokenizer found}\n\n@endgroup\n\n@group chat\n@endpoint POST /v1/chat/completions\n@desc Generate tokens\n@required {messages: [any] # A list of messages comprising the conversation so far.}\n@optional {frequency_penalty: num(float) # Number between -2.0 and 2.0. 
Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim., logit_bias: [num(float)] # UNUSED Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token., logprobs: bool # Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message., max_tokens: int(int32)=1024 # The maximum number of tokens that can be generated in the chat completion., model: str # [UNUSED] ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API., n: int(int32) # UNUSED How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep n as 1 to minimize costs., presence_penalty: num(float) # Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics, response_format: any=null, seed: int(int64), stop: [str] # Up to 4 sequences where the API will stop generating further tokens., stream: bool, stream_options: any, temperature: num(float) # What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.  
We generally recommend altering this or `top_p` but not both., tool_choice: any=auto, tool_prompt: str # A prompt to be appended before the tools, tools: [map{function!: map, type!: str}] # A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for., top_logprobs: int(int32) # An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used., top_p: num(float) # An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.}\n@returns(200) {choices: [map], created: int(int64), id: str, model: str, system_fingerprint: str, usage: map{completion_tokens: int(int32), prompt_tokens: int(int32), total_tokens: int(int32)}} # Generated Chat Completion\n@errors {422: Input validation error, 424: Generation Error, 429: Model is overloaded, 500: Incomplete generation}\n\n@endgroup\n\n@group completions\n@endpoint POST /v1/completions\n@desc Generate tokens\n@required {prompt: [str]}\n@optional {frequency_penalty: num(float) # Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim., max_tokens: int(int32)=1024 # The maximum number of tokens that can be generated in the chat completion., model: str # UNUSED ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API., repetition_penalty: num(float), seed: int(int64), stop: [str] # Up to 4 sequences where the API will stop generating further tokens., stream: bool, suffix: str # The text to append to the prompt. 
This is useful for completing sentences or generating a paragraph of text. Please see the completion_template field in the model's tokenizer_config.json file for the completion template., temperature: num(float) # What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or `top_p` but not both., top_p: num(float) # An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.}\n@returns(200) {choices: [map], created: int(int64), id: str, model: str, system_fingerprint: str, usage: map{completion_tokens: int(int32), prompt_tokens: int(int32), total_tokens: int(int32)}} # Generated Chat Completion\n@errors {422: Input validation error, 424: Generation Error, 429: Model is overloaded, 500: Incomplete generation}\n\n@endgroup\n\n@group models\n@endpoint GET /v1/models\n@desc Get model info\n@returns(200) {created: int(int64), id: str, object: str, owned_by: str} # Served model info\n@errors {404: Model not found}\n\n@endgroup\n\n@end\n"}