Korp API v7.1.1

Introduction

Korp is a tool for searching in text corpora, developed at Språkbanken. The Korp API is used by the Korp frontend, but can also be used independently. This documentation will give you an overview of all the available commands, which in some cases include functionality not yet available in the Korp frontend.

The source code is made available under the MIT license on GitHub.

Most examples in this documentation will link to Språkbanken’s instance of the Korp backend, to take advantage of its corpora.

The Basics of a Query

Queries to the web service are done using HTTP GET requests:

/command?parameter=...&...

It is also possible to use POST requests (both regular form data and JSON), with the same result.

The service responds with a JSON object.

The parameters available for each command are presented in this documentation as a bulleted list, separated into required and optional parameters. A parameter marked with [multi] can take multiple values separated by commas.
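As a rough illustration of the request format, the sketch below builds a GET URL from a command name and a parameter mapping, joining the values of a [multi] parameter with commas. The host https://korp.example.org and the helper name build_url are placeholders invented for this example; they are not part of the API.

```python
from urllib.parse import urlencode

# Placeholder host, invented for this example; substitute a real Korp backend.
BASE_URL = "https://korp.example.org"


def build_url(command, params, base_url=BASE_URL):
    """Return the GET URL for a command (build_url is a hypothetical helper)."""
    flat = {}
    for key, value in params.items():
        # A [multi] parameter takes several values separated by commas.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        flat[key] = value
    # Keep commas readable; other reserved characters are percent-encoded.
    query = urlencode(flat, safe=",")
    url = f"{base_url}/{command}"
    return f"{url}?{query}" if query else url


print(build_url("info", {"corpus": ["ROMI", "PAROLE"]}))
```

The same mapping could be sent as form data or JSON in a POST request instead, since the service accepts both.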

Many of the commands make use of the CQP query language. For further information about CQP, please refer to the CQP Query Language Tutorial.

Every response includes the key time in the JSON object, indicating the execution time in seconds:

{
  "time": 0.0125
}

Global Options

The following parameters can be used together with any of the commands:

Basic Information

General Information

Get information about the available corpora, which of them are protected, and the CWB and API versions.

Endpoint

/info

Returns

{
  "version": "7.1.1",
  "cqp-version": "<CQP version>",
  "corpora": [<list of corpora on the server>],
  "protected_corpora": [<list of the above corpora that are password protected>]
}

Example

/info
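One possible use of this response is to work out which corpora can be searched without a password. The sketch below does that on an already decoded JSON object; the helper name and the sample values (corpus names, CQP version string) are invented for illustration and are not actual server output.

```python
def unprotected_corpora(info):
    """List the corpora in an /info response that are not password protected."""
    protected = set(info.get("protected_corpora", []))
    return [name for name in info["corpora"] if name not in protected]


# Made-up response for illustration only.
sample = {
    "version": "7.1.1",
    "cqp-version": "example",
    "corpora": ["CORPUS_A", "CORPUS_B", "CORPUS_C"],
    "protected_corpora": ["CORPUS_B"],
    "time": 0.01,
}

print(unprotected_corpora(sample))
```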

Corpus Information

Fetch information about one or more corpora.

Endpoint

/info

Parameters

Returns

{
  "corpora": {
    "<corpus>": {
      "attrs": {
        "p": [<list of positional attributes>],
        "s": [<list of structural attributes>],
        "a": [<list of align attributes, for linked corpora>]
      },
      "info": {
        "Charset": "<character encoding of the corpus>",
        "FirstDate": "<date and time of the oldest dated text in the corpus>",
        "LastDate": "<date and time of the newest dated text in the corpus>",
        "Size": <number of tokens in the corpus>,
        "Sentences": <number of sentences in the corpus>,
        "Updated": "<date when the corpus was last updated>"
      }
    }
  },
  "total_size": <total number of tokens in the above corpora>,
  "total_sentences": <total number of sentences in the above corpora>
}

Example

/info?corpus=ROMI,PAROLE

Concordance

Do a concordance search in one or more corpora.

Endpoint

/query

Parameters

The positional attribute “word” will always be shown, even if omitted.

Returns

{
  "hits": <total number of hits>,
  "corpus_hits": {
    "<corpus>": <number of hits>,
    ...
  },
  "kwic": [
    {
      "match": {
        "start": <start position of the match within the context>,
        "end": <end position of the match within the context>,
        "position": <global corpus position of the match>
      },
      "structs": {
        "<structural attribute>": "<value>",
        ...
      },
      "tokens": [
        {
          "word": "<word form>",
          "<positional attribute>": "<value>",
          ...
        },
        ...
      ],
      <if aligned corpora>
      "aligned": {
        "<aligned corpus>": [<list of tokens>], ...
      }
    },
    ...
  ]
}

If in_order is set to ‘false’, "match" will instead consist of a list of match objects, one per highlighted word.
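A client reading the "kwic" list has to cope with both shapes of "match". The following sketch extracts the highlighted word forms from a single hit. It assumes that "start" and "end" are token indices into "tokens" and that "end" is exclusive, which the description above does not state; the helper name and the sample hit are invented for illustration.

```python
def highlighted_words(hit):
    """Return the matched word forms of one "kwic" item, one list per match."""
    matches = hit["match"]
    # With in_order set to 'false', "match" is a list of match objects.
    if isinstance(matches, dict):
        matches = [matches]
    tokens = hit["tokens"]
    # Assumption: "end" is exclusive, like the end of a Python slice.
    return [[t["word"] for t in tokens[m["start"]:m["end"]]] for m in matches]


# Made-up hit for illustration only.
hit = {
    "match": {"start": 1, "end": 3, "position": 100},
    "tokens": [{"word": "a"}, {"word": "b"}, {"word": "c"}, {"word": "d"}],
}

print(highlighted_words(hit))
```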

Examples

Query the corpus SUC3 and show the first 10 sentences matching the CQP query "och" [] [pos="NN"], including part of speech and base form in the result:
/query?corpus=SUC3&start=0&end=9&default_context=1+sentence&cqp="och"+[]+[pos="NN"]&show=msd,lemma

Query the parallel corpus SALTNLD-SV and show the part of speech together with the linked Dutch sentence:
/query?corpus=SALTNLD-SV&start=0&end=9&context=1+link&cqp="och"+[]+[pos="NN"]&show=saltnld-nl

Sample Concordance

Same as regular concordance, but does a sequential search in the selected corpora in random order until at least one hit is found, then aborts. The result will be randomly sorted. Use this to get one or more random sample sentences.

Endpoint

/query_sample

Parameters

Same as /query, but sort will always be ‘random’.

Statistics

Given a CQP query, calculate the frequency for one or more attributes. Both absolute and relative frequency are calculated. The relative frequency is given as hits per 1 million tokens.
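To make the unit concrete, here is a minimal sketch of the conversion from absolute to relative frequency. It assumes the denominator is the token count of the material searched; the function name and the numbers in the example are made up.

```python
def relative_frequency(absolute, token_count):
    """Convert an absolute frequency to hits per 1 million tokens."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return absolute * 1_000_000 / token_count


# 250 hits in a 5,000,000-token corpus (invented numbers).
print(relative_frequency(250, 5_000_000))
```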

Endpoint

/count

Parameters

When you want to calculate statistics for every token in one or more corpora, use the /count_all command instead, since it is optimized for that kind of query.

If you want to base your statistics on a single token in a multi-token query, prefix that token with an @, e.g. [pos = "JJ"] @[pos = "NN"].

Returns

{
  "corpora": {
    "<corpus>": {
      "absolute": [
        {
          "value": {
            "<positional attribute>": [
              "<value for first token>",
              "<value for second token>",
              ...
            ],
            "<structural attribute>": "<value>",
            ...
          },
          "freq": <absolute frequency>
        },
        ...
      ],
      "relative": [
        {
          "value": {
            "<positional attribute>": [
              "<value for first token>",
              "<value for second token>",
              ...
            ],
            "<structural attribute>": "<value>",
            ...
          },
          "freq": <relative frequency>
        },
        ...
      ],
      "sums": {
        "absolute": <absolute sum>,
        "relative": <relative sum>
      }
    },
    ...
  },
  "total": {
    "absolute": [
      {
        "value": {
          "<positional attribute>": [
            "<value for first token>",
            "<value for second token>",
            ...
          ],
          "<structural attribute>": "<value>",
          ...
        },
        "freq": <absolute frequency>
      },
      ...
    ],
    "relative": [
      {
        "value": {
          "<positional attribute>": [
            "<value for first token>",
            "<value for second token>",
            ...
          ],
          "<structural attribute>": "<value>",
          ...
        },
        "freq": <relative frequency>
      },
      ...
    ],
    "sums": {
      "absolute": <absolute sum>,
      "relative": <relative sum>
    }
  },
  "count": <total number of different values>
}

When subcqp# parameters are used, "<corpus>" and "total" above will instead each contain a list, with the first item being the result of the main cqp query, and the following items the results of the subcqp# queries. The subcqp# results will each have an additional key, "cqp", containing the CQP query for that particular subquery.
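Because an entry can be either a single object or a list, client code may want to normalize it before use. The sketch below collects the absolute sum for the main query and each subquery from one "<corpus>" or "total" entry. The helper name, the "<main>" label, and the sample numbers are invented for illustration.

```python
def absolute_sums(entry):
    """Map each query to its absolute sum for one "<corpus>" or "total" entry."""
    # Without subcqp# parameters the entry is a single object, not a list.
    results = entry if isinstance(entry, list) else [entry]
    sums = {}
    for result in results:
        # Only subcqp# results carry a "cqp" key; "<main>" is a made-up label.
        label = result.get("cqp", "<main>")
        sums[label] = result["sums"]["absolute"]
    return sums


# Made-up entry: the main query followed by two subqueries.
entry = [
    {"sums": {"absolute": 30, "relative": 3.0}},
    {"cqp": '[pos="NN"]', "sums": {"absolute": 20, "relative": 2.0}},
    {"cqp": '[pos="VB"]', "sums": {"absolute": 10, "relative": 1.0}},
]

print(absolute_sums(entry))
```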

Example

Get frequencies for the different word forms of the lemgram ge..vb.1:
/count?corpus=ROMI&cqp=[lex+contains+"ge..vb.1"]&group_by=word&ignore_case=word

Complete Statistics

Just like regular /count but without specifying cqp, resulting in a complete list of every value of the given attributes.

Endpoint

/count_all

Parameters

Takes the same parameters as /count except it doesn’t use cqp.

Example

Get statistics for all parts of speech in one corpus:
/count_all?corpus=ROMI&group_by=pos

Statistics Over Time

Show the change in frequency of one or more search results over time.

Endpoint

/count_time

Parameters

If subcqp is omitted, the result will only contain frequency information for the CQP query in cqp (or the last cqp# query). If one or more sub-queries are specified using subcqp1, subcqp2 and so on, the result will contain frequency information for these as well.

The result is returned both per corpus and as a combined total.

Strategies

What should happen when you ask for time data with a granularity finer than that of the annotated material? Does a search limited to the period 2005-01-01 – 2005-01-31 include material dated with only “2005”? The strategy parameter gives you some control over this, affecting both how from and to work, and what parts of the material contribute to the results.

The list below describes the three different strategies, and for each strategy the rules that decide what part of the material is included in the search, as well as what tokens contribute to the token count for each data point.

The term “result time span” below refers both to the from and to span given by the user, and to the different time spans making up the data points in the result data, the size of which is determined by the granularity parameter. For example, the data point “2015” represents the whole of year 2015 when granularity is set to ‘y’, and “2015-01” represents the whole of January 2015 when granularity is set to ‘m’.

t1 and t2 represent the from and to dates for an annotated part of the material, and t1' and t2' are the from and to dates of the “result time span” described above.

Strategy 1
The material time span needs to be completely contained by the result time span, or the result time span needs to be completely contained by the material time span.
(t1 >= t1' AND t2 <= t2') OR (t1 <= t1' AND t2 >= t2')
Strategy 2
All overlaps allowed between material time span and result time span.
t1 <= t2' AND t2 >= t1'
Strategy 3
The material time span is completely contained by the result time span.
t1 >= t1' AND t2 <= t2'
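The three rules can be written out as a small predicate, which makes it easy to check how a given piece of material is treated. This is only a sketch of the conditions as stated above, not the backend's implementation; the function name is invented, and r1 and r2 stand in for t1' and t2'.

```python
from datetime import date


def is_included(strategy, t1, t2, r1, r2):
    """Apply a strategy rule; (t1, t2) is the material span, (r1, r2) the result span."""
    if strategy == 1:
        # Either span completely contains the other.
        return (t1 >= r1 and t2 <= r2) or (t1 <= r1 and t2 >= r2)
    if strategy == 2:
        # Any overlap at all.
        return t1 <= r2 and t2 >= r1
    if strategy == 3:
        # The material span lies completely within the result span.
        return t1 >= r1 and t2 <= r2
    raise ValueError("strategy must be 1, 2 or 3")


# Material dated only "2005", searched with a result span of January 2005.
material = (date(2005, 1, 1), date(2005, 12, 31))
january = (date(2005, 1, 1), date(2005, 1, 31))

for strategy in (1, 2, 3):
    print(strategy, is_included(strategy, *material, *january))
```

Under these rules, material dated only “2005” is included in a January 2005 search with strategies 1 and 2, but not with strategy 3.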

Returns

{
  "corpora": {
    "<corpus>": [
      {
        "relative": {
          "<date>": <relative frequency>,
          ...
        },
        "sums": {
          "relative": <sum, relative frequency>,
          "absolute": <sum, absolute frequency>
        },
        "absolute": {
          "<date>": <absolute frequency>,
          ...
        }
      },
      {
        "cqp": "<sub-CQP query>",
        "relative": {
          "<date>": <relative frequency>,
          ...
        },
        "sums": {
          "relative": <sum, relative frequency>,
          "absolute": <sum, absolute frequency>
        },
        "absolute": {
          "<date>": <absolute frequency>,
          ...
        }
      },
      <more structures like the one above, one per sub-query>
    ],
    ...
  },
  "combined": [
    {
      "relative": {
        "<date>": <relative frequency>,
        ...
      },
      "sums": {
        "relative": <sum, relative frequency>,
        "absolute": <sum, absolute frequency>
      },
      "absolute": {
        "<date>": <absolute frequency>,
        ...
      }
    },
    {
      "cqp": "<sub-CQP query>",
      "relative": {
        "<date>": <relative frequency>,
        ...
      },
      "sums": {
        "relative": <sum, relative frequency>,
        "absolute": <sum, absolute frequency>
      },
      "absolute": {
        "<date>": <absolute frequency>,
        ...
      }
    },
    <more structures like the one above, one per sub-query>
  ]
}

The data points in the result indicate the number of hits from that point onward until the next data point, meaning that the following data:

"2010": 100,
"2012": 50,
"2013": 0,
"2016": null

should be interpreted as 100 hits during 2010–2011, then 50 hits during 2012, zero hits 2013–2015, and finally from 2016 onwards we have no data at all.
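This reading can be made explicit by turning the data points into intervals. The sketch below pairs each data point with the start of the next one; it assumes all keys share one date format so that sorting them as strings gives chronological order. The function name is invented for this example.

```python
def to_intervals(series):
    """Turn {date: hits} into (start, next_start, hits) tuples in date order."""
    points = sorted(series.items())
    intervals = []
    for i, (start, hits) in enumerate(points):
        # The interval runs until the next data point; the last one is open-ended.
        end = points[i + 1][0] if i + 1 < len(points) else None
        intervals.append((start, end, hits))
    return intervals


series = {"2010": 100, "2012": 50, "2013": 0, "2016": None}

for start, end, hits in to_intervals(series):
    print(start, end, hits)
```

Here the second element is the start of the following data point, so it is exclusive, and a hits value of None means that no data is available for that interval.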

Example

Show how the use of “tsunami” and “flodvåg” (“tidal wave”) has changed over time in the Swedish newspaper Göteborgs-Posten:
/count_time?cqp=[lex+contains+"tsunami\.\.nn\.1|flodvåg\.\.nn\.1"]&corpus=GP2001,GP2002,GP2003,GP2004,GP2005,GP2006,GP2007,GP2008,GP2009,GP2010,GP2011,GP2012&subcqp0=[lex+contains+'tsunami\.\.nn\.1']&subcqp1=[lex+contains+'flodvåg\.\.nn\.1']

Distribution Over Time

Show the distribution of all tokens in a corpus over time.

Endpoint

/timespan

Parameters

Returns

{
  "corpora": {
    "<corpus>": {
      "<date>": <token count>,
      ...
    },
    ...
  },
  "combined": {
    "<date>": <token count>,
    ...
  }
}

Example

Show distribution of tokens in the Swedish Party Programs and Election Manifestos corpus over time:
/timespan?corpus=VIVILL

Log-Likelihood Comparison

Compare the results of two different searches by using log-likelihood.

Endpoint

/loglike

Parameters

Returns

{
  "average": <average for log-likelihood>,
  "loglike": {
    "<value>": <log-likelihood value>,
    ...
  },
  "set1": {
    "<value>": <absolute frequency>,
    ...
  },
  "set2": {
    "<value>": <absolute frequency>,
    ...
  }
}

A positive log-likelihood value indicates a relative increase in set2 compared to set1, while a negative value indicates a relative decrease.
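A typical next step is to rank the values by how strongly they differ between the two sets. The sketch below sorts the "loglike" object by magnitude and labels the direction according to the sign convention just described. The helper name, the labels, and the sample scores are invented for illustration.

```python
def rank_loglike(result):
    """Sort values by absolute log-likelihood, largest first, with a direction label."""
    ranked = sorted(result["loglike"].items(), key=lambda kv: abs(kv[1]), reverse=True)
    labelled = []
    for value, score in ranked:
        if score > 0:
            direction = "increase in set2"
        elif score < 0:
            direction = "decrease in set2"
        else:
            direction = "no difference"
        labelled.append((value, score, direction))
    return labelled


# Made-up scores for illustration only.
result = {"loglike": {"hus": 12.5, "bil": -40.0, "dag": 0.0}}

for row in rank_loglike(result):
    print(row)
```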

Example

Compare the nouns of two different corpora:
/loglike?set1_cqp=[pos="NN"]&set2_cqp=[pos="NN"]&group_by=word&max=10&set1_corpus=ROMI&set2_corpus=GP2012

Word Picture

Get typical dependency relations for a given lemgram or word.

Endpoint

/relations

Parameters

Returns

{
  "relations": [
    {
      "dep": "<dependent lemgram or word>",
      "depextra": "<dependent prefix>",
      "deppos": "<dependent part of speech>",
      "freq": <number of occurrences>,
      "head": "<head lemgram or word>",
      "headpos": "<head part of speech>",
      "mi": <lexicographer's mutual information score>,
      "rel": "<relation>",
      "source": [
        <list of IDs, for getting the source sentences>
      ]
    },
    ...
  ]
}

Example

Get dependency relations for the lemgram ge..vb.1:
/relations?word=ge..vb.1&type=lemgram&corpus=ROMI

Word Picture Sentences

Given the source ID for a relation (from a Word Picture query), return the sentences in which this relation occurs.

Endpoint

/relations_sentences

Parameters

Returns

Returns a structure identical to a regular /query.

Lemgram Statistics

Return the number of occurrences of one or more lemgrams in one or more corpora.

Endpoint

/lemgram_count

Parameters

Returns

{
  "<lemgram>": <number of occurrences>,
  ...
}

Example

Get number of occurrences of the lemgrams ge..vb.1 and ta..vb.1 in a single corpus:
/lemgram_count?lemgram=ge..vb.1,ta..vb.1&corpus=ROMI

Structural Values

Get all available values for one or more structural attributes, together with the number of tokens for each value. Similar to /count_all, but without relative frequencies and with support for hierarchies.

Endpoint

/struct_values

Parameters

struct can be either a plain attribute, or a hierarchy of two or more attributes, like so: text_author>text_title.

Returns

Without count the result will contain lists:

{
  "corpora": {
    "<corpus>": {
      "<attribute 1>": [
        "<value>",
        ...
      ],
      ...
    },
    ...
  },
  "combined": {
    "<attribute 1>": [
      "<value>",
      ...
    ],
    ...
  }
}

With count the result will consist of objects:

{
  "corpora": {
    "<corpus>": {
      "<attribute 1>": {
        "<value>": <token count>,
        ...
      },
      ...
    },
    ...
  },
  "combined": {
    "<attribute 1>": {
      "<value>": <token count>,
      ...
    },
    ...
  }
}

Example

Get all authors and their titles together with token count:
/struct_values?corpus=ROMI&struct=text_author>text_title&count=true

Authentication

Authenticate a user against an authentication system, if available.

Endpoint

/authenticate

Parameters

Returns

A list of protected corpora that the user has access to.

{
  "corpora": [
    "<corpus>",
    ...
  ]
}