Simplified molecular-input line-entry system

SMILES
Filename extension	.smi
Internet media type	chemical/x-daylight-smiles
Type of format	chemical file format

Generation of SMILES: Break cycles, then write as branches off a main backbone. (Ciprofloxacin)

The simplified molecular-input line-entry system (SMILES) is a specification in form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s.^[1]^[2]^[3] Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch (Pomona College) for supporting the work, and Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system."^[4] The Environmental Protection Agency funded the initial project to develop SMILES.^[5]^[6]

It has since been modified and extended by others, most notably by Daylight Chemical Information Systems. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.

Terminology

The term SMILES refers to a line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive.

Typically, a number of equally valid SMILES strings can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one of them. This SMILES is unique for each structure, although dependent on the canonicalization algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, MolSoft LLC, and the Chemistry Development Kit. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database.

The original paper that described the CANGEN^[2] algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. cuneane, 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically.^[7] There is currently no systematic comparison across commercial software to test if such flaws exist in those packages.

SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone and SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which isotopes are specified.

Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.

Examples

Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.

An atom holding one or more electrical charges is enclosed in brackets, followed by the symbol H if it is bonded to one or more atoms of hydrogen, followed by the number of hydrogen atoms (as usual one is omitted example: NH4 for ammonium), then by the sign '+' for a positive charge or by '-' for a negative charge. The number of charges is specified after the sign (except if there is one only); however, it is also possible write the sign as many times as the ion has charges: instead of "Ti+4", one can also write "Ti++++" (Titanium IV, Ti⁴⁺). Thus, the hydroxide anion is represented by [OH-], the oxonium cation is [OH3+] and the cobalt III cation (Co³⁺) is either [Co+3] or [Co+++].

Bonds

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. For example, the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES string, which for cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2 (naphthalene: c1cccc2c1cccc2 (note the lower case for aromatic compounds)), and so on. After reaching 9, the label must be preceded by a '%', in order to differentiate it from two different labels bonded to the same atom (~C12~ will mean the atom of carbon holds the ring closure labels 1 and 2, whereas ~C%12~ will indicate one label only, 12). Double, triple, and quadruple bonds are represented by the symbols '=', '#', and '$' respectively as illustrated by the SMILES O=C=O (carbon dioxide), C#N (hydrogen cyanide) and [Ga-]$[As+] (gallium arsenide).

Aromaticity

Aromatic C, O, S and N atoms are shown in their lower case 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic although these can be specified explicitly using the ':' symbol. Aromatic atoms can be singly bonded to each other and biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole must be represented as [nH] and imidazole is written in SMILES notation as n1c[nH]cc1.

The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.

Visualization of 3-cyanoanisole as COc(c1)cccc1C#N.

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring as illustrated by the SMILES COc(c1)cccc1C#N (see depiction) and COc(cc1)ccc1C#N (see depiction) which encode the 3 and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.

Stereochemistry

Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F (see depiction) is one representation of trans-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F (see depiction) is one possible representation of cis-difluoroethene, in which the Fs are on the same side of the double bond, as shown in the figure.

Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine can be written as N[C@@H](C)C(=O)O (see depiction). The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the sequence of substituents hydrogen (H), methyl (C) and carboxylate (C(=O)O) appear clockwise. D-Alanine can be written as N[C@H](C)C(=O)O (see depiction). The order of the substituents in the SMILES string is very important and D-alanine can also be encoded as N[C@@H](C(=O)O)C (see depiction).

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.

Examples

Molecule	Structure	SMILES Formula
Dinitrogen	N≡N	N#N
Methyl isocyanate (MIC)	CH₃–N=C=O	CN=C=O
Copper(II) sulfate	Cu²⁺ SO₄²⁻	[Cu+2].[O-]S(=O)(=O)[O-]
Oenanthotoxin (C₁₇H₂₂O₂)		CCC[C@@H](O)CC\C=C\C=C\C#CC#C\C=C\CO
Pyrethrin II (C₂₂H₂₈O₅)		COC(=O)C(\C)=C\C1C(C)(C)[C@H]1C(=O)O[C@@H]2C(C)=C(C(=O)C2)CC=CC=C
Aflatoxin B1 (C₁₇H₁₂O₆)		O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Glucose (glucopyranose) (C₆H₁₂O₆)		OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
Bergenin (cuscutin) (a resin) (C₁₄H₁₆O₉)		OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2
A pheromone of the Californian scale insect		CC(=O)OCCC(/C)=C\C[C@H](C(C)=C)CCC=C
2S,5R-Chalcogran: a pheromone of the bark beetle Pityogenes chalcographus^[8]		CC[C@H](O1)CC[C@@]12CCCO2
Vanillin		O=Cc1ccc(O)c(OC)c1
Melatonin (C₁₃H₁₆N₂O₂)		CC(=O)NCCC1=CNc2c1cc(OC)cc2
Flavopereirin (C₁₇H₁₅N₂)		CCc(c1)ccc2[n+]1ccc3c2Nc4c3cccc4
Nicotine (C₁₀H₁₄N₂)		CN1CCC[C@H]1c2cccnc2
Alpha-thujone (C₁₀H₁₆O)		CC(C)[C@@]12C[C@@H]1[C@@H](C)C(=O)C2
Thiamin (C₁₂H₁₇N₄OS⁺) (vitamin B1)		OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2

Illustration with a molecule with more than 9 rings, Cephalostatin-1^[9] (a steroidic trisdecacyclic pyrazine with the empirical formula C₅₄H₇₄N₂O₁₀ isolated from the Indian Ocean hemichordate Cephalodiscus gilchristi):

Starting with the left-most methyl group in the figure:

C[C@@](C)(O1)C[C@@H](O)[C@@]1(O2)[C@@H](C)[C@@H]3CC=C4[C@]3(C2)C(=O)C[C@H]5[C@H]4CC[C@@H](C6)[C@]5(C)Cc(n7)c6nc(C[C@@]89(C))c7C[C@@H]8CC[C@@H]%10[C@@H]9C[C@@H](O)[C@@]%11(C)C%10=C[C@H](O%12)[C@]%11(O)[C@H](C)[C@]%12(O%13)[C@H](O)C[C@@]%13(C)CO

Note that '%' appears in front of the index of ring closure labels above 9; see section Bonds above.

Other examples of SMILES

The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.

Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism. SMIRKS is a line notation for specifying reaction transforms.

Conversion

SMILES can be converted back to 2-dimensional representations using Structure Diagram Generation algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to 3-dimensional representation is achieved by energy minimization approaches. There are many downloadable and web-based conversion utilities.

References

↑ Weininger 1988
↑ ^2.0 ^2.1 Weininger, Weininger & Weininger 1989
↑ Weininger 1990
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Anderson, Veith & Weininger 1987
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ National Center for Biotechnology Information (NCBI). PubChem Compound. (accessed May 12, 2012) PubChem Compound CID=183413 (Cephalostatin-1)

External links

Lua error in package.lua at line 80: module 'strict' not found.

SMILES related software utilities

NCI/CADD Chemical Identifier Resolver – resolves or generates SMILES from chemical names, CAS Registry Numbers, InChI/InChIKey and many other chemical structure file formats
NCI/CADD Online SMILES Translator and Structure File Generator – Java online molecule editor
PubChem server side structure editor – online molecule editor
smi23d^{[dead link]} – 3D Coordinate Generation
Daylight Depict – Translate a SMILES formula into graphics
GIF/PNG-Creator for 2D Plots of Chemical Structures
JME molecule editor - Chemical editor/viewer and SMILES/SMARTS generator in Java
JSME molecule editor - Free chemical editor/viewer and SMILES/SMARTS generator in JavaScript
ACD/ChemSketch
Marvin by ChemAxon – online chemical editor/viewer and SMILES generator/converter
Instant JChem by ChemAxon – desktop application for storing/generating/converting/visualizing/searching SMILES structures, particularly batch processing; personal edition free
JChem for Excel by ChemAxon – MS Excel add-in for storing/generating/converting/visualizing/searching SMILES structures
Smormo-Ed – a molecule editor for Linux which can read and write SMILES
InChI.info – an unofficial InChI website featuring on-line converter from InChI and SMILES to molecular drawings
Balloon – A free program for 3D coordinate generation and conformational analysis.
Indigo – an open-source cross-platform cheminformatics library with a plugin for IUPAC-compliant molecule and reaction 2D structural formula rendering.
Open Babel – an open-source chemical toolbox allowing anyone to search, convert, analyze, or store biochemical data.
Bioclipse – a free and open source workbench for the life sciences
MolEngine – A .NET cheminformatics toolkit to read/write SMILES, generate 2D coordinate from SMILES, and convert SMILES from/into other Chemical file formats.
JSDraw – A cross-platform javascript chemical structure editor to generate SMILES and SMARTS.

[SMILES1-1] Weininger 1988

[SMILES2-2] 2.0 ^2.1 Weininger, Weininger & Weininger 1989

[SMILES3-3] Weininger 1990

[4] Lua error in package.lua at line 80: module 'strict' not found.

[5] Anderson, Veith & Weininger 1987

[SMILES_Tutorial:_What_is_SMILES.3F-6] Lua error in package.lua at line 80: module 'strict' not found.

[7] Lua error in package.lua at line 80: module 'strict' not found.

[8] Lua error in package.lua at line 80: module 'strict' not found.

[9] National Center for Biotechnology Information (NCBI). PubChem Compound. (accessed May 12, 2012) PubChem Compound CID=183413 (Cephalostatin-1)

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

Simplified molecular-input line-entry system

Contents

Terminology

Graph-based definition

Examples

Atoms

Bonds

Aromaticity

Branching

Stereochemistry

Isotopes

Examples

Other examples of SMILES

Extensions

Conversion

See also

References

Further reading

External links

SMILES related software utilities

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools

Filename extension	`.smi`
Internet media type	`chemical/x-daylight-smiles`
Type of format	chemical file format