LlaSMolGeneration

The LlaSMolGeneration class provides an interface for generating chemistry predictions using fine-tuned LlaSMol models.

Initialization

from generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    model_name="osunlp/LlaSMol-Mistral-7B",
    base_model="mistralai/Mistral-7B-v0.1",
    device="cuda"
)

Parameters

model_name
string
required
The name or path of the fine-tuned LlaSMol model. Supported models:
  • osunlp/LlaSMol-Mistral-7B
  • osunlp/LlaSMol-Galactica-6.7B
  • osunlp/LlaSMol-Llama2-7B
  • osunlp/LlaSMol-CodeLlama-7B
base_model
string
default:"None"
The base model architecture. If not specified, it will be inferred from the model_name using the BASE_MODELS mapping in config.py.
device
string
default:"None"
The device to run inference on. Options: cuda or cpu. If None, CUDA is selected when available; otherwise the CPU is used.
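The fallback behaviour described for base_model and device can be sketched as follows. This is a minimal illustration under stated assumptions, not the class's actual code: resolve_base_model and resolve_device are hypothetical helpers, the BASE_MODELS dict here reproduces only the pairing shown in the initialization example, and cuda_available stands in for a runtime check such as torch.cuda.is_available().

```python
# Illustrative stand-in for the BASE_MODELS mapping in config.py. Only the
# pairing shown in the initialization example is reproduced here.
BASE_MODELS = {
    "osunlp/LlaSMol-Mistral-7B": "mistralai/Mistral-7B-v0.1",
}


def resolve_base_model(model_name, base_model=None):
    """Hypothetical helper: prefer the explicit base_model, else look it up."""
    if base_model is not None:
        return base_model
    if model_name not in BASE_MODELS:
        raise ValueError(
            f"No base model known for {model_name!r}; pass base_model explicitly."
        )
    return BASE_MODELS[model_name]


def resolve_device(device=None, cuda_available=False):
    """Hypothetical helper: prefer the explicit device, else CUDA if present."""
    if device is not None:
        return device
    return "cuda" if cuda_available else "cpu"
```

Passing base_model explicitly is the safer choice when loading a checkpoint from a local path, since a path is unlikely to appear as a key in the mapping.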

generate()

Generate predictions for input text prompts.
outputs = generator.generate(
    input_text="Convert the IUPAC name to SMILES: 2-acetoxybenzoic acid",
    batch_size=1,
    max_input_tokens=512,
    max_new_tokens=1024,
    canonicalize_smiles=True,
    print_out=True,
    num_return_sequences=1,
    num_beams=4
)

Parameters

input_text
string | list[string]
required
The input prompt(s) for generation. Can be a single string or a list of strings.
batch_size
int
default:"1"
Number of samples to process in parallel.
max_input_tokens
int
default:"512"
Maximum number of tokens allowed in the input text. Inputs exceeding this limit are skipped, and their output entry is None.
max_new_tokens
int
default:"1024"
Maximum number of tokens to generate in the response.
canonicalize_smiles
bool
default:"True"
Whether to canonicalize SMILES strings in the input text before generation.
print_out
bool
default:"False"
Whether to print input and output pairs during generation.
**generation_settings
dict
Additional generation parameters passed to the Hugging Face GenerationConfig:
  • num_return_sequences: Number of sequences to generate per input
  • num_beams: Number of beams for beam search
  • temperature: Sampling temperature
  • top_p: Nucleus sampling parameter
  • And other Hugging Face generation parameters
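Because these settings are forwarded to Hugging Face generation, some combinations are likely to be rejected at generation time. The sketch below is a hypothetical pre-flight check (check_generation_settings is not part of generation.py) that encodes the rule, as I understand the Hugging Face behaviour, that num_return_sequences may not exceed num_beams unless plain sampling is used with a single beam. Treat the exact rule as an assumption and verify it against the installed transformers version.

```python
def check_generation_settings(num_beams=1, num_return_sequences=1, do_sample=False):
    """Hypothetical check: raise ValueError for an inconsistent combination."""
    if num_beams < 1 or num_return_sequences < 1:
        raise ValueError("num_beams and num_return_sequences must be >= 1")
    # Sampling with a single beam can return any number of sequences.
    if num_beams == 1 and do_sample:
        return
    # Otherwise at most one returned sequence per beam is available.
    if num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences={num_return_sequences} exceeds "
            f"num_beams={num_beams}"
        )


# The combination used in the generate() example above passes the check.
check_generation_settings(num_beams=4, num_return_sequences=1)
```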

Returns

A list of dictionaries, one for each input, containing:
  • input_text: Original input text
  • real_input_text: Processed input text (with canonicalized SMILES and chat formatting)
  • output: List of generated text sequences (or None if input was too long)
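Since output can be None for inputs that were too long, callers may want to separate usable results from skipped ones before indexing into output. The sketch below uses mock data in the documented return shape; split_results is a hypothetical helper, and the mock values are not real model output.

```python
def split_results(outputs):
    """Separate results that have generated text from inputs that were skipped."""
    generated, skipped = [], []
    for item in outputs:
        if item["output"] is None:
            skipped.append(item["input_text"])
        else:
            generated.append(item)
    return generated, skipped


# Mock data in the documented return shape (not real model output).
mock_outputs = [
    {"input_text": "short prompt", "real_input_text": "formatted short prompt",
     "output": ["CCO"]},
    {"input_text": "very long prompt", "real_input_text": "formatted very long prompt",
     "output": None},
]

generated, skipped = split_results(mock_outputs)
for item in generated:
    print(item["input_text"], "->", item["output"][0])
print("Skipped:", skipped)
```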

create_sample()

Create a tokenized sample from input text.
sample = generator.create_sample(
    text="Convert IUPAC to SMILES: benzene",
    canonicalize_smiles=True,
    max_input_tokens=512
)

Parameters

text
string
required
The input text to process.
canonicalize_smiles
bool
default:"True"
Whether to canonicalize SMILES strings in the input.
max_input_tokens
int
default:"None"
Maximum token limit. If exceeded, the sample will be marked with input_too_long=True.

Returns

A dictionary containing:
  • input_text: Original input text
  • real_input_text: Formatted prompt with chat template
  • input_ids: Tokenized input IDs
  • attention_mask: Attention mask
  • labels: Labels for training (same as input_ids)
  • input_too_long: Boolean flag, set to True if the input exceeds max_input_tokens
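The length check behind input_too_long can be sketched as follows. This is an assumption about the logic, not the code in generation.py: flag_too_long is a hypothetical helper, the comparison is assumed to be strict (a sample of exactly max_input_tokens tokens passes), and a limit of None is assumed to disable the check.

```python
def flag_too_long(sample, max_input_tokens=None):
    """Hypothetical helper: return a copy of sample with input_too_long set."""
    flagged = dict(sample)
    flagged["input_too_long"] = (
        max_input_tokens is not None
        and len(flagged["input_ids"]) > max_input_tokens
    )
    return flagged


# Toy token IDs, for illustration only.
sample = {"input_ids": [1, 15, 27, 42, 2], "attention_mask": [1, 1, 1, 1, 1]}
print(flag_too_long(sample, max_input_tokens=4)["input_too_long"])
print(flag_too_long(sample, max_input_tokens=None)["input_too_long"])
```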

Usage Examples

Basic Generation

from generation import LlaSMolGeneration

# Initialize generator
generator = LlaSMolGeneration(
    model_name="osunlp/LlaSMol-Mistral-7B",
    device="cuda"
)

# Generate prediction
outputs = generator.generate(
    input_text="What is the molecular formula of caffeine?",
    print_out=True
)

print(outputs[0]['output'][0])

Batch Generation

inputs = [
    "Convert IUPAC to SMILES: ethanol",
    "Convert IUPAC to SMILES: acetic acid",
    "Convert IUPAC to SMILES: benzene"
]

outputs = generator.generate(
    input_text=inputs,
    batch_size=4,
    num_return_sequences=3,
    num_beams=6
)

for inp, out in zip(inputs, outputs):
    print(f"Input: {inp}")
    print(f"Outputs: {out['output']}")
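With batch_size=4 and three prompts, all inputs fit in a single batch. For intuition, one common way to split a list of prompts into batches is sketched below; iter_batches is a hypothetical helper, and the actual batching inside generate() may differ.

```python
def iter_batches(items, batch_size=1):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = ["p1", "p2", "p3", "p4", "p5"]
for batch in iter_batches(prompts, batch_size=2):
    print(batch)
```

The last batch may be smaller than batch_size when the number of prompts is not an exact multiple.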

Name Conversion (IUPAC to SMILES)

from config import TASKS_GENERATION_SETTINGS

# Use task-specific settings
task = "name_conversion-i2s"
task_settings = TASKS_GENERATION_SETTINGS.get(task, {})
generation_kargs = task_settings.get("generation_kargs", {})

outputs = generator.generate(
    input_text="Convert IUPAC name to SMILES: 2-acetoxybenzoic acid",
    **generation_kargs
)

Forward Synthesis

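from config import TASKS_GENERATION_SETTINGS

# Look up the task-specific settings; fall back to defaults if the task is absent
task = "forward_synthesis"
task_settings = TASKS_GENERATION_SETTINGS.get(task, {})

# Reuses the generator created in Basic Generation
outputs = generator.generate(
    input_text="Predict the product: <SMILES> CC(=O)OC1=CC=CC=C1C(=O)O </SMILES>",
    **task_settings.get("generation_kargs", {})
)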

Helper Functions

canonicalize_smiles_in_text()

Canonicalizes SMILES strings within text that are wrapped in <SMILES> tags.
from generation import canonicalize_smiles_in_text

text = "The molecule is <SMILES> c1ccccc1 </SMILES>"
canonicalized = canonicalize_smiles_in_text(
    text,
    tags=('<SMILES>', '</SMILES>'),
    keep_text_unchanged_if_no_tags=True,
    keep_text_unchanged_if_error=False
)
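The tag-based rewriting can be illustrated without a chemistry toolkit. In the sketch below, rewrite_tagged_spans is a hypothetical helper, and the KNOWN lookup table is a placeholder for a real canonicalizer (for example one built on RDKit); it is not how canonicalize_smiles_in_text is implemented.

```python
import re

# Placeholder for a real canonicalizer: maps Kekule benzene to its aromatic form.
KNOWN = {"C1=CC=CC=C1": "c1ccccc1"}


def rewrite_tagged_spans(text, transform, tags=("<SMILES>", "</SMILES>")):
    """Apply transform to every span enclosed by the given open/close tags."""
    open_tag, close_tag = tags
    pattern = re.compile(
        re.escape(open_tag) + r"\s*(.*?)\s*" + re.escape(close_tag),
        re.DOTALL,
    )

    def _substitute(match):
        return f"{open_tag} {transform(match.group(1))} {close_tag}"

    return pattern.sub(_substitute, text)


text = "The molecule is <SMILES> C1=CC=CC=C1 </SMILES>"
result = rewrite_tagged_spans(text, lambda s: KNOWN.get(s, s))
print(result)  # The molecule is <SMILES> c1ccccc1 </SMILES>
```

Text without tags passes through unchanged in this sketch, which mirrors the intent of keep_text_unchanged_if_no_tags=True.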

tokenize()

Tokenizes prompt text for the model.
from generation import tokenize

tokenizer = generator.tokenizer
result = tokenize(
    tokenizer,
    prompt="Convert to SMILES: benzene",
    add_eos_token=True
)
# Returns: {'input_ids': [...], 'attention_mask': [...], 'labels': [...]}
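The shape of the returned dictionary, including the effect of add_eos_token, can be sketched with toy token IDs. This is an assumption about the post-processing step, not the module's code: finish_tokenized is a hypothetical helper, and the rule "append the EOS ID only if it is not already the last token" is a common pattern rather than confirmed behaviour.

```python
def finish_tokenized(input_ids, eos_token_id, add_eos_token=True):
    """Hypothetical helper: optionally append EOS, then build mask and labels."""
    ids = list(input_ids)
    if add_eos_token and (not ids or ids[-1] != eos_token_id):
        ids.append(eos_token_id)
    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),
        "labels": list(ids),  # labels mirror input_ids
    }


# Toy IDs: 2 plays the role of the EOS token here.
print(finish_tokenized([5, 6], eos_token_id=2))
# {'input_ids': [5, 6, 2], 'attention_mask': [1, 1, 1], 'labels': [5, 6, 2]}
```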

File Location

LLM4Chem/generation.py
The generation module automatically handles chat formatting, SMILES canonicalization, and batch processing. For custom tasks, refer to config.py for task-specific generation settings.
