SiMologics — Protein Sequence Analysis

Documentation

Learn how to use SiMologics for antibody sequence analysis and design

Getting Started

1. Create an account — Sign up for free at the signup page.

2. Verify your email — Enter the 6-digit code sent to your email.

3. Start analysing — Go to the Analyse page, pick an operation, paste your sequence, and submit.

4. View results — Results appear in the right panel once processing completes. You can also find them in your Dashboard.

Operations

Sequence Embedding (embed)

Generates a high-dimensional vector representation of your protein sequence using AntiBERTy, a transformer-based antibody language model trained on 558M sequences. These embeddings capture structural and functional properties and can be used for clustering, similarity search, or as input features for downstream ML models.

INPUT

A single amino acid sequence (10–1024 residues)

OUTPUT

A tensor of shape [1, N, 512]: one sequence, N token positions, and 512 embedding dimensions. N includes the [CLS] and [SEP] special tokens, so a sequence of L residues yields N = L + 2.
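For clustering or similarity search, a per-sequence vector is often more convenient than a per-token tensor. The sketch below mean-pools the residue positions, assuming the embedding has been loaded as nested Python lists (for example from JSON) and that [CLS] is the first token and [SEP] the last, as is conventional for BERT-style models. The function name mean_pool is illustrative and not part of SiMologics.

```python
from typing import List


def mean_pool(embedding: List[List[List[float]]]) -> List[float]:
    """Average the residue vectors of a [1, N, D] embedding, skipping special tokens."""
    tokens = embedding[0]      # drop the batch dimension -> [N, D]
    residues = tokens[1:-1]    # assumption: [CLS] is first and [SEP] is last
    if not residues:
        raise ValueError("embedding contains no residue positions")
    dim = len(residues[0])
    return [sum(vec[d] for vec in residues) / len(residues) for d in range(dim)]


# Toy example with D = 2 instead of 512 to keep the output readable.
toy = [[[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
pooled = mean_pool(toy)
print(pooled)  # [2.0, 3.0]
```

Mean pooling is only one option; other reductions (such as taking the [CLS] vector) may suit some downstream models better.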

Species Classification (classify)

Identifies the likely species of origin and antibody chain type (heavy or light) from the sequence alone. Powered by a classification head trained on paired antibody datasets.

INPUT

A single amino acid sequence

OUTPUT

Species label (e.g., human, mouse) and chain type (heavy, light) with confidence probabilities

Fill Mask (fill)

Predicts the most likely amino acids at masked positions in your sequence. Useful for exploring sequence variants, understanding residue importance, or generating conservative mutations. Use _ (underscore) to mark positions for prediction.

INPUT

A sequence with one or more _ mask tokens (e.g., EVQL_ESGG...)

OUTPUT

Top-k predicted residues with probabilities for each masked position
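As a rough illustration of preparing fill input, the sketch below masks chosen positions with _ and then checks the result before submission. The helper names and the 0-based indexing are assumptions made for this example; the service may report positions differently.

```python
from typing import List

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MASK = "_"


def mask_residues(sequence: str, positions: List[int]) -> str:
    """Replace the residues at the given 0-based positions with the mask character."""
    chars = list(sequence)
    for p in positions:
        chars[p] = MASK
    return "".join(chars)


def find_mask_positions(sequence: str) -> List[int]:
    """Return 0-based indices of mask characters, rejecting any other invalid character."""
    positions = []
    for i, ch in enumerate(sequence):
        if ch == MASK:
            positions.append(i)
        elif ch not in VALID_RESIDUES:
            raise ValueError(f"invalid character {ch!r} at position {i}")
    if not positions:
        raise ValueError("sequence contains no '_' mask positions")
    return positions


masked = mask_residues("EVQLVESGG", [4])
print(masked)                       # EVQL_ESGG
print(find_mask_positions(masked))  # [4]
```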

Log Likelihood (log_likelihood)

Computes the pseudo-log-likelihood score of your sequence under our antibody language model. Higher scores indicate the sequence is more "natural" according to the model's learned distribution. Useful for scoring variants or filtering generated sequences.

INPUT

A single amino acid sequence

OUTPUT

A single float score (pseudo-log-likelihood)
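A pseudo-log-likelihood is conventionally computed by masking each position in turn and adding up the log-probability the model assigns to the true residue. The sketch below shows that procedure with a stand-in model that predicts every residue uniformly; the predict callable and uniform_model are hypothetical, and whether SiMologics reports a sum or a per-residue average is not stated here, so treat the summation as an assumption.

```python
import math
from typing import Callable, Dict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_log_likelihood(
    sequence: str,
    predict: Callable[[str, int], Dict[str, float]],
) -> float:
    """Sum log P(true residue | rest of sequence) over all positions."""
    total = 0.0
    for i, residue in enumerate(sequence):
        masked = sequence[:i] + "_" + sequence[i + 1:]  # hide one position at a time
        probs = predict(masked, i)                      # distribution over residues at i
        total += math.log(probs[residue])
    return total


def uniform_model(masked_sequence: str, position: int) -> Dict[str, float]:
    """Stand-in for a real language model: every residue is equally likely."""
    return {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}


score = pseudo_log_likelihood("EVQLVESGG", uniform_model)
print(score)
```

Because each term is the logarithm of a probability, scores are never positive; a higher score means a value closer to zero.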

Batch Analysis

Upload a CSV file with a sequence column to analyse multiple sequences at once. The free tier allows up to 50 rows per batch and 2 batches per month. The Pro tier supports up to 10,000 rows per batch with unlimited batches.

Example CSV format:

sequence,label
EVQLVESGGGLVQPGGSLRL...,antibody_1
DIQMTQSPSSLSASVGDRVTITC...,antibody_2
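Before uploading, it can help to check a file locally against the column and row limits described above. The following is a minimal sketch using Python's csv module; load_batch is a hypothetical helper, and the two short sequences are placeholders built from the truncated prefixes in the example, not complete antibody sequences.

```python
import csv
import io
from typing import Dict, List

FREE_TIER_MAX_ROWS = 50     # per-batch limits as stated in this section
PRO_TIER_MAX_ROWS = 10_000


def load_batch(csv_text: str, max_rows: int = FREE_TIER_MAX_ROWS) -> List[Dict[str, str]]:
    """Parse batch CSV text and enforce the 'sequence' column and the row limit."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("CSV must have a 'sequence' column")
    # Skip rows whose sequence cell is empty.
    rows = [row for row in reader if (row.get("sequence") or "").strip()]
    if len(rows) > max_rows:
        raise ValueError(f"{len(rows)} rows exceeds the limit of {max_rows}")
    return rows


sample = (
    "sequence,label\n"
    "EVQLVESGGGLVQPGGSLRL,antibody_1\n"
    "DIQMTQSPSSLSASVGDRVTITC,antibody_2\n"
)
rows = load_batch(sample)
print(len(rows), rows[0]["label"])  # 2 antibody_1
```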

Pro Tools

Sequence Generation (ProGen2-OAS)

A causal language model trained on the OAS (Observed Antibody Space) antibody corpus. Provide a seed sequence and the model extends it autoregressively to the target length. The Temperature and Top-P sampling parameters control how diverse the generated sequences are.

INPUT

Seed amino acid sequence (5–512 residues)

OUTPUT

1–10 extended sequences as clean amino acid strings (end tokens removed)
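To give a feel for what Temperature and Top-P do, the sketch below applies the standard definitions to a toy set of next-residue scores: temperature rescales the logits before the softmax, and top-p (nucleus) sampling keeps only the most probable residues whose cumulative probability reaches the threshold. This illustrates the general technique, not the service's internal implementation, and the logits are made up.

```python
import math
import random
from typing import Dict


def top_p_distribution(logits: Dict[str, float], temperature: float, top_p: float) -> Dict[str, float]:
    """Apply temperature scaling, then keep the smallest top set with mass >= top_p."""
    if temperature <= 0 or not 0 < top_p <= 1:
        raise ValueError("temperature must be > 0 and top_p in (0, 1]")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    ranked = sorted(((tok, e / norm) for tok, e in exps.items()), key=lambda kv: kv[1], reverse=True)
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}  # renormalise the nucleus


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one residue from the filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = {"A": 2.0, "G": 1.0, "S": 0.0, "W": -3.0}  # made-up scores
nucleus = top_p_distribution(toy_logits, temperature=1.0, top_p=0.8)
print(sorted(nucleus))  # ['A', 'G']
print(sample(nucleus, random.Random(0)))
```

Lower temperatures and smaller top-p values concentrate sampling on the most likely residues; higher values admit rarer ones.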

Antibody Design Studio (IgCraft)

IgCraft offers four design modes: de novo generation (unconditional), region inpainting (redesign selected CDR/FWR regions from IMGT CSV input), inverse folding (predict a sequence from a PDB structure), and CDR grafting (transplant CDRs onto an acceptor framework).

INPAINT INPUT FORMAT

CSV with IMGT region columns: H-fwr1, H-cdr1, H-fwr2, H-cdr2, H-fwr3, H-cdr3, H-fwr4, L-fwr1…L-fwr4. Download a template from the Design page.
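The sketch below shows one way to write a row using the heavy-chain column names listed above and to reassemble the variable domain by joining the regions in IMGT order. It covers the heavy-chain columns only; the light-chain columns should follow the downloaded template. The region strings are short placeholders, not a real antibody, and both helper names are hypothetical.

```python
import csv
import io
from typing import Dict, List

HEAVY_COLUMNS = ["H-fwr1", "H-cdr1", "H-fwr2", "H-cdr2", "H-fwr3", "H-cdr3", "H-fwr4"]


def assemble_heavy(row: Dict[str, str]) -> str:
    """Concatenate the heavy-chain regions in IMGT order."""
    return "".join(row[col] for col in HEAVY_COLUMNS)


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialise rows with one column per region."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEAVY_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder regions for illustration only.
row = {
    "H-fwr1": "EVQL", "H-cdr1": "GFTF", "H-fwr2": "WVRQ", "H-cdr2": "ISSS",
    "H-fwr3": "RFTI", "H-cdr3": "ARDY", "H-fwr4": "WGQG",
}
print(assemble_heavy(row))  # EVQLGFTFWVRQISSSRFTIARDYWGQG
print(to_csv([row]))
```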

Humanisation (BioPhi / Sapiens)

Two modes: Humanise applies Sapiens BERT-based humanisation to suggest VH and/or VL mutations that increase humanness while preserving function; Score computes per-residue humanness scores without modifying the sequence. Upload a CSV with vh_sequence / vl_sequence columns for batch processing.

FAQ

How long does analysis take?

Standard tools (embed, classify, fill, log-likelihood) return results within seconds once a job is running. GPU-backed Pro tools (generation, design, humanisation) run on serverless GPU infrastructure that scales to zero when idle, so the first request after an idle period has to wait for a cold start and may take a few minutes. Subsequent requests complete in seconds.

What sequence formats are supported?

Standard single-letter amino acid codes (ACDEFGHIKLMNPQRSTVWY). Sequences must be 10–1024 residues.
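A local check along these lines can catch invalid input before submission. This is a minimal sketch: the alphabet and length limits come from the answer above, while stripping whitespace and upper-casing are conveniences assumed for the example (the service itself may reject lowercase or spaced input). validate_sequence is a hypothetical helper.

```python
VALID_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH, MAX_LENGTH = 10, 1024


def validate_sequence(raw: str) -> str:
    """Normalise a sequence and check it against the alphabet and length limits."""
    sequence = "".join(raw.split()).upper()  # assumption: tolerate whitespace and lowercase
    if not MIN_LENGTH <= len(sequence) <= MAX_LENGTH:
        raise ValueError(
            f"length {len(sequence)} is outside {MIN_LENGTH}-{MAX_LENGTH} residues"
        )
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"unsupported characters: {''.join(invalid)}")
    return sequence


cleaned = validate_sequence("evqlvesggg lvqpg")
print(cleaned)  # EVQLVESGGGLVQPG
```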

Is my data secure?

Yes. All sequences are encrypted in transit (HTTPS), and results are stored in your private S3 bucket, where they expire automatically after 30 days.