Working with Syntax Highlighting

Working with Syntax Highlighting Overview Syntax Highlighting is what makes the editor automatically display text in different styles/colors, depending on the function of the string in relation to the purpose of the file. In program source code for example, control statements may be rendered bold, while data types and comments get different colors from the rest of the text. This greatly enhances the readability of the text, and thus helps the author to be more efficient and productive. A Perl function, rendered with syntax highlighting. A Perl function, rendered with syntax highlighting. The same Perl function, without highlighting. The same Perl function, without highlighting. Of the two examples, which is easiest to read? &kate; comes with a flexible, configurable and capable system for doing syntax highlighting, and the standard distribution provides definitions for a wide range of programming, scripting and markup languages and other text file formats. In addition you can provide your own definitions in simple &XML; files. &kate; will automatically detect the right syntax rules when you open a file, based on the &MIME; Type of the file, determined by its extension, or, if it has none, the contents. Should you experience a bad choice, you can manually set the syntax to use from the DocumentsHighlight Mode menu. The styles and colors used by each syntax highlight definition can be configured using the Appearance page of the Config Dialog, while the &MIME; Types it should be used for, are handeled by the Highlight page. Syntax highlighting is there to enhance the readability of correct text, but you cannot trust it to validate your text. Marking text for syntax is difficult depending on the format you are using, and in some cases the authors of the syntax rules will be proud if 98% of text gets correctly rendered, though most often you need a rare style to see the incorrect 2%. You can download updated or additional syntax highlight definitions from the &kate; website by clicking the Download button in the Highlight Page of the Config Dialog. The &kate; Syntax Highlight System This section will discuss the &kate; syntax highlighting mechanism in more detail. It is for you if you want to know about it, or if you want to change or create syntax definitions. How it Works Whenever you open a file, one of the first things the &kate; editor does is detect which syntax definition to use for the file. While reading the text of the file, and while you type away in it, the syntax highlighting system will analyze the text using the rules defined by the syntax definition and mark in it where different contexts and styles begin and end. When you type in the document, the new text is analyzed and marked on the fly, so that if you delete a character that is marked as the beginning or end of a context, the style of surrounding text changes accordingly. The syntax definitions used by the &kate; Syntax Highlighting System are &XML; files, containing Rules for detecting the role of text, organized into context blocks Keyword lists Style Item definitions When analyzing the text, the detection rules are evaluated in the order in which they are defined, and if the beginning of the current string matches a rule, the related context is used. The start point in the text is moved to the final point at which that rule matched and a new loop of the rules begins, starting in the context set by the matched rule. Rules The detection rules are the heart of the highlighting detection system. A rule is a string, character or regular expression against which to match the text being analyzed. It contains information about which style to use for the matching part of the text. It may switch the working context of the system either to an explicitly mentioned context or to the previous context used by the text. Rules are organized in context groups. A context group is used for main text concepts within the format, for example quoted text strings or comment blocks in program source code. This ensures that the highlighting system does not need to loop through all rules when it is not necessary, and that some character sequences in the text can be treated differently depending on the current context. Contexts may be generated dynamically to allow the usage of instance specific data in rules. Context Styles and Keywords In some programming languages, integer numbers are treated differently than floating point ones by the compiler (the program that converts the source code to a binary executable), and there may be characters having a special meaning within a quoted string. In such cases, it makes sense to render them differently from the surroundings so that they are easy to identify while reading the text. So even if they do not represent special contexts, they may be seen as such by the syntax highlighting system, so that they can be marked for different rendering. A syntax definition may contain as many styles as required to cover the concepts of the format it is used for. In many formats, there are lists of words that represent a specific concept. For example in programming languages, the control statements is one concept, data type names another, and built in functions of the language a third. The &kate; Syntax Highlighting System can use such lists to detect and mark words in the text to emphasize concepts of the text formats. Default Styles If you open a C++ source file, a &Java; source file and an HTML document in &kate;, you will see that even though the formats are different, and thus different words are chosen for special treatment, the colors used are the same. This is because &kate; has a predefined list of Default Styles which are employed by the individual syntax definitions. This makes it easy to recognize similar concepts in different text formats. For example comments are present in almost any programming, scripting or markup language, and when they are rendered using the same style in all languages, you do not have to stop and think to identify them within the text. All styles in a syntax definition use one of the default styles. A few syntax definitions use more styles that there are defaults, so if you use a format often, it may be worth launching the configuration dialog to see if some concepts are using the same style. For example there is only one default style for strings, but as the Perl programming language operates with two types of strings, you can enhance the highlighting by configuring those to be slightly different. All available default styles will be explained later. The Highlight Definition &XML; Format Overview This section is an overview of the Highlight Definition &XML; format. Based on a small example it will describe the main components and their meaning and usage. The next section will go into detail with the highlight detection rules. The formal definition, aka the DTD is stored in the file language.dtd which should be installed on your system in the folder $TDEDIR/share/apps/katepart/syntax. Main sections of &kate; Highlight Definition files A highlighting file contains a header that sets the XML version and the doctype: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE language SYSTEM "language.dtd"> The root of the definition file is the element language. Available attributes are: Required attributes: name sets the name of the language. It appears in the menus and dialogs afterwards. section specifies the category. extensions defines file extensions, like "*.cpp;*.h" Optional attributes: mimetype associates files &MIME; Type based. version specifies the current version of the definition file. kateversion specifies the latest supported &kate; version. casesensitive defines, whether the keywords are casesensitiv or not. priority is necessary if another highlight definition file uses the same extensions. The higher priority will win. author contains the name of the author and his email-address. license contains the license, usually LGPL, Artistic, GPL and others. hidden defines, whether the name should appear in &kate;'s menus. So the next line may look like this: <language name="C++" version="1.00" kateversion="2.4" section="Sources" extensions="*.cpp;*.h" /> Next comes the highlighting element, which contains the optional element list and the required elements contexts and itemDatas. list elements contain a list of keywords. In this case the keywords are class and const. You can add as many lists as you need. The contexts element contains all contexts. The first context is by default the start of the highlighting. There are two rules in the context Normal Text, which match the list of keywords with the name somename and a rule that detects a quote and switches the context to string. To learn more about rules read the next chapter. The third part is the itemDatas element. It contains all color and font styles needed by the contexts and rules. In this example, the itemData Normal Text, String and Keyword are used. <highlighting> <list name="somename"> <item> class </item> <item> const </item> </list> <contexts> <context attribute="Normal Text" lineEndContext="#pop" name="Normal Text" > <keyword attribute="Keyword" context="#stay" String="somename" /> <DetectChar attribute="String" context="string" char=""" /> </context> <context attribute="String" lineEndContext="#stay" name="string" > <DetectChar attribute="String" context="#pop" char=""" /> </context> </contexts> <itemDatas> <itemData name="Normal Text" defStyleNum="dsNormal" /> <itemData name="Keyword" defStyleNum="dsKeyword" /> <itemData name="String" defStyleNum="dsString" /> </itemDatas> </highlighting> The last part of a highlight definition is the optional general section. It may contain information about keywords, code folding, comments and indentation. The comment section defines with what string a single line comment is introduced. You also can define a multiline comments using multiLine with the additional attribute end. This is used if the user presses the corresponding shortcut for comment/uncomment. The keywords section defines whether keyword lists are casesensitive or not. Other attributes will be explained later. <general> <comments> <comment name="singleLine" start="#"/> </comments> <keywords casesensitive="1"/> </general> </language> The Sections in Detail This part will describe all available attributes for contexts, itemDatas, keywords, comments, code folding and indentation. The element context belongs into the group contexts. A context itself defines context specific rules like what should happen if the highlight system reaches the end of a line. Available attributes are: name the context name. Rules will use this name to specify the context to switch to if the rule matches. lineEndContext defines the context the highlight system switches to if it reaches the end of a line. This may either be a name of another context, #stay to not switch the context (eg. do nothing) or #pop which will cause to leave this context. It is possible to use for example #pop#pop#pop to pop three times. lineBeginContext defines the context if a begin of a line is encountered. Default: #stay. fallthrough defines if the highlight system switches to the context specified in fallthroughContext if no rule matches. Default: false. fallthroughContext specifies the next context if no rule matches. dynamic if true, the context remembers strings/placeholders saved by dynamic rules. This is needed for HERE documents for example. Default: false. The element itemData is in the group itemDatas. It defines the font style and colors. So it is possible to define your own styles and colors, however we recommend to stick to the default styles if possible so that the user will always see the same colors used in different languages. Though, sometimes there is no other way and it is necessary to change color and font attributes. The attributes name and defStyleNum are required, the other optional. Available attributes are: name sets the name of the itemData. Contexts and rules will use this name in their attribute attribute to reference an itemData. defStyleNum defines which default style to use. Available default styles are explained in detail later. color defines a color. Valid formats are '#rrggbb' or '#rgb'. selColor defines the selection color. italic if true, the text will be italic. bold if true, the text will be bold. underline if true, the text will be underlined. strikeout if true, the text will be stroked out. The element keywords in the group general defines keyword properties. Available attributes are: casesensitive may be true or false. If true, all keywords are matched casesensitive weakDeliminator is a list of characters that do not act as word delimiters. For example the dot '.' is a word delimiter. Assume a keyword in a list contains a dot, it will only match if you specify the dot as a weak delimiter. additionalDeliminator defines additional delimiters. wordWrapDeliminator defines characters after which a line wrap may occur. Default delimiters and word wrap delimiters are the characters .():!+,-<=>%&*/;?[]^{|}~\, space (' ') and tabulator ('\t'). The element comment in the group comments defines comment properties which are used for ToolsComment and ToolsUncomment. Available attributes are: name is either singleLine or multiLine. If you choose multiLine the attributes end and region are required. start defines the string used to start a comment. In C++ this would be "/*". end defines the string used to close a comment. In C++ this would be "*/". region should be the name of the the foldable multiline comment. Assume you have beginRegion="Comment" ... endRegion="Comment" in your rules, you should use region="Comment". This way uncomment works even if you do not select all the text of the multiline comment. The cursor only must be in the multiline comment. The element folding in the group general defines code folding properties. Available attributes are: indentationsensitive if true, the code folding markers will be added indentation based, like in the scripting language Python. Usually you do not need to set it, as it defaults to false. The element indentation in the group general defines which indenter will be used, however we strongly recommend to omit this element, as the indenter usually will be set by either defining a File Type or by adding a mode line to the text file. If you specify an indenter though, you will force a specific indentation on the user, which he might not like at all. Available attributes are: mode is the name of the indenter. Available indenters right now are: normal, cstyle, csands, xml, python and varindent. Available Default Styles Default Styles were already explained, as a short summary: Default styles are predefined font and color styles. So here only the list of available default styles: dsNormal, used for normal text. dsKeyword, used for keywords. dsDataType, used for data types. dsDecVal, used for decimal values. dsBaseN, used for values with a base other than 10. dsFloat, used for float values. dsChar, used for a character. dsString, used for strings. dsComment, used for comments. dsOthers, used for 'other' things. dsAlert, used for warning messages. dsFunction, used for function calls. dsRegionMarker, used for region markers. dsError, used for error highlighting and wrong syntax. Highlight Detection Rules This section describes the syntax detection rules. Each rule can match zero or more characters at the beginning of the string they are test against. If the rule matches, the matching characters are assigned the style or attribute defined by the rule, and a rule may ask that the current context is switched. A rule looks like this: <RuleName attribute="(identifier)" context="(identifier)" [rule specific attributes] /> The attribute identifies the style to use for matched characters by name, and the context identifies the context to use from here. The context can be identified by: An identifier, which is the name of the other context. An order telling the engine to stay in the current context (#stay), or to pop back to a previous context used in the string (#pop). To go back more steps, the #pop keyword can be repeated: #pop#pop#pop Some rules can have child rules which are then evaluated only if the parent rule matched. The entire matched string will be given the attribute defined by the parent rule. A rule with child rules looks like this: <RuleName (attributes)> <ChildRuleName (attributes) /> ... </RuleName> Rule specific attributes varies and are described in the following sections. Common attributes All rules have the following attributes in common and are available whenever (common attributes) appears. attribute and context are required attributes, all others are optional. attribute: An attribute maps to a defined itemData. context: Specify the context to which the highlighting system switches if the rule matches. beginRegion: Start a code folding block. Default: unset. endRegion: Close a code folding block. Default: unset. lookAhead: If true, the highlighting system will not process the matches length. Default: false. firstNonSpace: Match only, if the string is the first non-whitespace in the line. Default: false. column: Match only, if the column matches. Default: unset. Dynamic rules Some rules allow the optional attribute dynamic of type boolean that defaults to false. If dynamic is true, a rule can use placeholders representing the text matched by a regular expression rule that switched to the current context in its string or char attributes. In a string, the placeholder %N (where N is a number) will be replaced with the corresponding capture N from the calling regular expression. In a char the placeholer must be a number N and it will be replaced with the first character of the corresponding capture N from the calling regular expression. Whenever a rule allows this attribute it will contain a (dynamic). dynamic: may be (true|false). The Rules in Detail DetectChar Detect a single specific character. Commonly used for example to find the ends of quoted strings. <DetectChar char="(character)" (common attributes) (dynamic) /> The char attribute defines the character to match. Detect2Chars Detect two specific characters in a defined order. <Detect2Chars char="(character)" char1="(character)" (common attributes) (dynamic) /> The char attribute defines the first character to match, char1 the second. AnyChar Detect one character of a set of specified characters. <AnyChar String="(string)" (common attributes) /> The String attribute defines the set of characters. StringDetect Detect an exact string. <StringDetect String="(string)" [insensitive="true|false"] (common attributes) (dynamic) /> The String attribute defines the string to match. The insensitive attribute defaults to false and is passed to the string comparison function. If the value is true insensitive comparing is used. RegExpr Matches against a regular expression. <RegExpr String="(string)" [insensitive="true|false"] [minimal="true|false"] (common attributes) (dynamic) /> The String attribute defines the regular expression. insensitive defaults to false and is passed to the regular expression engine. minimal defaults to false and is passed to the regular expression engine. Because the rules are always matched against the beginning of the current string, a regular expression starting with a caret (^) indicates that the rule should only be matched against the start of a line. See Regular Expressions for more information on those. keyword Detect a keyword from a specified list. <keyword String="(list name)" (common attributes) /> The String attribute identifies the keyword list by name. A list with that name must exist. Int Detect an integer number. <Int (common attributes) (dynamic) /> This rule has no specific attributes. Child rules are typically used to detect combinations of L and U after the number, indicating the integer type in program code. Actually all rules are allowed as child rules, though, the DTD only allowes the child rule StringDetect. The following example matches integer numbers follows by the character 'L'. <Int attribute="Decimal" context="#stay" > <StringDetect attribute="Decimal" context="#stay" String="L" insensitive="true"/> </Int> Float Detect a floating point number. <Float (common attributes) /> This rule has no specific attributes. AnyChar is allowed as a child rules and typically used to detect combinations, see rule Int for reference. HlCOct Detect an octal point number representation. <HlCOct (common attributes) /> This rule has no specific attributes. HlCHex Detect a hexadecimal number representation. <HlCHex (common attributes) /> This rule has no specific attributes. HlCStringChar Detect an escaped character. <HlCStringChar (common attributes) /> This rule has no specific attributes. It matches literal representations of characters commonly used in program code, for example \n (newline) or \t (TAB). The following characters will match if they follow a backslash (\): abefnrtv"'?\. Additionally, escaped hexadecimal numbers like for example \xff and escaped octal numbers, for example \033 will match. HlCChar Detect an C character. <HlCChar (common attributes) /> This rule has no specific attributes. It matches C characters enclosed in a tick (Example: 'c'). So in the ticks may be a simple character or an escaped character. See HlCStringChar for matched escaped character sequences. RangeDetect Detect a string with defined start and end characters. <RangeDetect char="(character)" char1="(character)" (common attributes) /> char defines the character starting the range, char1 the character ending the range. Usefull to detect for example small quoted strings and the like, but note that since the highlighting engine works on one line at a time, this will not find strings spanning over a line break. LineContinue Matches at end of line. <LineContinue (common attributes) /> This rule has no specific attributes. This rule is useful for switching context at end of line, if the last character is a backslash ('\'). This is needed for example in C/C++ to continue macros or strings. IncludeRules Include rules from another context or language/file. <IncludeRules context="contextlink" [includeAttrib="true|false"] /> The context attribute defines which context to include. If it a simple string it includes all defined rules into the current context, example: <IncludeRules context="anotherContext" /> If the string begins with ## the highlight system will look for another language definition with the given name, example: <IncludeRules context="##C++" /> If includeAttrib attribute is true, change the destination attribute to the one of the source. This is required to make for example commenting work, if text matched by the included context is a different highlight than the host context. DetectSpaces Detect whitespaces. <DetectSpaces (common attributes) /> This rule has no specific attributes. Use this rule if you know that there can several whitespaces ahead, for example in the beginning of indented lines. This rule will skip all whitespace at once, instead of testing multiple rules and skipping one at the time due to no match. DetectIdentifier Detect identifier strings (as a regular expression: [a-zA-Z_][a-zA-Z0-9_]*). <DetectIdentifier (common attributes) /> This rule has no specific attributes. Use this rule to skip a string of word characters at once, rather than testing with multiple rules and skipping one at the time due to no match. Tips & Tricks Once you have understood how the context switching works it will be easy to write highlight definitions. Though you should carefully check what rule you choose in what situation. Regular expressions are very mighty, but they are slow compared to the other rules. So you may consider the following tips. If you only match two characters use Detect2Chars instead of StringDetect. The same applies to DetectChar. Regular expressions are easy to use but often there is another much faster way to achieve the same result. Consider you only want to match the character '#' if it is the first character in the line. A regular expression based solution would look like this: <RegExpr attribute="Macro" context="macro" String="^\s*#" /> You can achieve the same much faster in using: <DetectChar attribute="Macro" context="macro" char="#" firstNonSpace="true" /> If you want to match the regular expression '^#' you can still use DetectChar with the attribute column="0". The attribute column counts character based, so a tabulator still is only one character. You can switch contexts without processing characters. Assume that you want to switch context when you meet the string */, but need to process that string in the next context. The below rule will match, and the lookAhead attribute will cause the highlighter to keep the matched string for the next context. <Detect2Chars attribute="Comment" context="#pop" char="*" char1="/" lookAhead="true" /> Use DetectSpaces if you know that many whitespaces occur. Use DetectIdentifier instead of the regular expression '[a-zA-Z_]\w*'. Use default styles whenever you can. This way the user will find a familiar environment. Look into other XML-files to see how other people implement tricky rules. You can validate every XML file by using the command xmllint --dtdvalid language.dtd mySyntax.xml. If you repeat complex regular expression very often you can use ENTITIES. Example: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE language SYSTEM "language.dtd" [ <!ENTITY myref "[A-Za-z_:][\w.:_-]*"> ]> Now you can use &myref; instead of the regular expression.