Introduction
Korp is a tool for searching in text corpora, developed at Språkbanken. The Korp API is used by the Korp frontend, but can also be used independently. This documentation will give you an overview of all the available commands, which in some cases include functionality not yet available in the Korp frontend.
The source code is made available under the MIT license on GitHub.
Most examples in this documentation will link to Språkbanken’s instance of the Korp backend, to take advantage of its corpora.
The Basics of a Query
Queries to the web service are done using HTTP GET requests:
/command?parameter=...&...
It is also possible to use POST requests (both regular form data and JSON), with the same result.
The service responds with a JSON object.
The parameters available for each command are presented in this documentation as a bulleted list, separated into required and optional parameters. A parameter marked with [multi] can take multiple values separated by commas.
Many of the commands make use of the CQP query language. For further information about CQP, please refer to the CQP Query Language Tutorial.
For every request the key time
will always be included in the JSON object, indicating the execution time
in seconds:
{
"time": 0.0125
}
Global Options
The following parameters can be used together with any of the commands:
- Optional
- encoding – Character encoding used when communicating with Corpus Workbench. Default is UTF-8.
- indent – Indent the JSON response by n spaces to make it human readable. By default a compact un-indented JSON is used.
- callback – A string which will surround the returned JSON object, sometimes necessary with AJAX requests.
- cache – Set to ‘false’ to bypass cache. (default: ‘true’)
- debug – Set to ‘true’ to enable verbose error messages. (default: ‘false’)
Basic Information
General Information
Get information about available corpora, which corpora are protected, and CWB and API version.
Endpoint
Returns
{
"version": "7.1.1",
"cqp-version": "<CQP version>",
"corpora": [<list of corpora on the server>],
"protected_corpora": [<list of which of the above corpora that are password protected>]
}
Example
Corpus Information
Endpoint
Fetch information about one or more corpora.
Parameters
- Required
- corpus – Corpus name. [multi]
Returns
{
"corpora": {
"<corpus>": {
"attrs": {
"p": [<list of positional attributes>],
"s": [<list of structural attributes>],
"a": [<list of align attributes, for linked corpora>]
},
"info": {
"Charset": "<character encoding of the corpus>",
"FirstDate": "<date and time of the oldest dated text in the corpus>",
"LastDate": "<date and time of the newest dated text in the corpus>",
"Size": <number of tokens in the corpus>,
"Sentences": <number of sentences in the corpus>,
"Updated": "<date when the corpus was last updated>"
}
}
},
"total_size": <total number of tokens in the above corpora>,
"total_sentences": <total number of sentences in the above corpora>
}
Example
Concordance
Do a concordance search in one or more corpora.
Endpoint
Parameters
- Required
- corpus – Corpus name. [multi]
- cqp – CQP query.
- Optional
- start – First result to return; used for pagination. (default: 0)
- end – Last result to return; used for pagination. (default: 9)
- default_context – Context to show, e.g. ‘1 sentence’. (default: ‘10 words’)
- context – Context to show for specific corpora, overriding the default. Specified using the format ‘corpus:context’. [multi]
- show – Positional attributes to show. ‘word’ will always be included. [multi]
- show_struct – Structural attributes to show. [multi]
- default_within – Prevent search from crossing boundaries of the given structural attribute, e.g. ‘sentence’.
- within – Like default_within, but for specific corpora, overriding the default. Specified using the format ‘corpus:attribute’. [multi]
- in_order – By default the order of the tokens in your query matters, and will only match tokens in that particular order. By setting this parameter to ‘false’ the order of the tokens will no longer matter, and every occurrence of each matched token will be highlighted. Requires default_within or within. (default: ‘true’)
- sort – Sort the results within each corpus by: hit (‘keyword’), left (‘left’) or right context (‘right’), random (‘random’), or a given positional attribute. (default: no sorting)
- random_seed – Numerical value for reproducible random order, used together with sort=random.
- cut – Limit total number of hits per corpus to this number. (default: no limit)
- cqp# – Where # is a number. In addition to the cqp parameter, you can add additional CQP queries that will be executed on the result of the previous query (i.e. searching within search results). The final result returned to the user will be that of the last numbered query.
- expand_prequeries – When using multiple CQP queries (cqp# above), this determines whether subsequent queries should be executed on the containing sentences (or any other structural attribute defined by within) from the previous query, or just the actual matched tokens. (default: ‘true’)
- incremental – Return results incrementally when set to ‘true’ and more than one corpus is specified.
The positional attribute “word” will always be shown, even if omitted.
Returns
{
"hits": <total number of hits>,
"corpus_hits": {
"<corpus>": <number of hits>,
...
},
"kwic": [
{
"match": {
"start": <start position of the match within the context>,
"end": <end position of the match within the context>,
"position": <global corpus position of the match>
},
"structs": {
"<structural attribute>": "<value>",
...
},
"tokens": [
{
"word": "<word form>",
"<positional attribute>": "<value>",
...
},
...
],
<if aligned corpora>
"aligned": {
"<aligned corpus>": [<list of tokens>], ...
}
},
...
]
}
If in_order
is set to ‘false’, "match"
will instead consist of a list of match objects, one per highlighted
word.
Examples
Query the corpus SUC3 and show the first 10 sentences matching the CQP query
"och" [] [pos="NN"]
, including part of speech and base form in the result:
/query?corpus=SUC3&start=0&end=9&default_context=1+sentence&cqp="och"+[]+[pos="NN"]&show=msd,lemma
Query the parallel corpus SALTNLD-SV and show part of speech + the linked Dutch sentence:
/query?corpus=SALTNLD-SV&start=0&end=9&context=1+link&cqp="och"+[]+[pos="NN"]&show=saltnld-nl
Sample Concordance
Same as regular concordance, but does a sequential search in the selected corpora in random order until at least one hit is found, then aborts. The result will be randomly sorted. Use this to get one or more random sample sentences.
Endpoint
Parameters
Same as /query
, but sort will always be ‘random’.
Statistics
Given a CQP query, calculate the frequency for one or more attributes. Both absolute and relative frequency are calculated. The relative frequency is given as hits per 1 million tokens.
Endpoint
Parameters
- Required
- corpus – Corpus name. [multi]
- cqp – CQP query.
- Optional
- group_by – Positional attribute by which the hits should be grouped. (default: ‘word’ if neither group_by nor group_by_struct is defined) [multi]
- group_by_struct – Structural attribute by which the hits should be grouped. The value for the first token of the hit will be used. [multi]
- default_within – Prevent search from crossing boundaries of the given structural attribute.
- within – As above, but for specific corpora, overriding the default. Specified using the format ‘corpus:attribute’. [multi]
- ignore_case – Change all values of the given attribute(s) to lowercase. [multi]
- relative_to_struct – Calculate relative frequencies based on total number of tokens with the same value for the structural annotations specified here, instead of relative to corpus size. [multi]
- split – Attributes that should be split (used for sets). [multi]
- top – Preserve only the first n annotations in a set. Format: ‘annotation:n’. If :n is omitted only the first value will be preserved. Must be used together with split. [multi]
- cqp# – Where # is a number. In addition to the cqp parameter, you can add additional CQP queries that will be executed on the result of the previous query (i.e. searching within search results). The final result returned to the user will be that of the last numbered query.
- expand_prequeries – When using multiple CQP queries (cqp# above), this determines whether subsequent queries should be executed on the containing sentences (or any other structural attribute defined by within) from the previous query, or just the actual matched tokens. (default: ‘true’)
- subcqp# – Where # is a number. Sub-queries to the main query (or last cqp# query). Any number of numbered subcqp-parameters can be used. These will always be executed on just the actual matched tokens from the main query (i.e. no expansion), and the result for each subquery will be included as a separate object in the final JSON, in addition to the main query result.
- start – Start row; used for pagination. (default: 0)
- end – End row; used for pagination. (default: no limit)
- incremental – Incrementally return progress updates when the calculation for each corpus is finished.
For instances when you want to calculate statistics for every token in one or several corpora, the
/count_all
command
should be used instead since it is optimized for that kind of query.
If you want to base your statistics on one single token in a multi token query, prefix that token with an @
, e.g.
[pos = "JJ"] @[pos = "NN"]
.
Returns
{
"corpora": {
"<corpus>": {
"absolute": [
{
"value": {
"<positional attribute>": [
"<value for first token>",
"<value for second token>",
...
],
"<structural attribute>": "<value>",
...
},
"freq": <absolute frequency>
},
...
],
"relative": [
{
"value": {
"<positional attribute>": [
"<value for first token>",
"<value for second token>",
...
],
"<structural attribute>": "<value>",
...
},
"freq": <relative frequency>
},
...
],
"sums": {
"absolute": <absolute sum>,
"relative": <relative sum>
}
},
},
"total": {
"absolute": [
{
"value": {
"<positional attribute>": [
"<value for first token>",
"<value for second token>",
...
],
"<structural attribute>": "<value>",
...
},
"freq": <absolute frequency>
},
...
],
"relative": [
{
"value": {
"<positional attribute>": [
"<value for first token>",
"<value for second token>",
...
],
"<structural attribute>": "<value>",
...
},
"freq": <relative frequency>
},
...
],
"sums": {
"absolute": <absolute sum>,
"relative": <relative sum>
}
},
"count": <total number of different values>
}
When subcqp#
parameters are used, "<corpus>"
and "total"
above will instead each contain a list, with the first
item being the result of the main cqp
query, and the following items the results of the subcqp#
queries. The
subcqp#
results will each have an additional key, "cqp"
, containing the CQP query for that particular subquery.
Example
Get frequencies for the different word forms of the lemgram
ge..vb.1
:
/count?corpus=ROMI&cqp=[lex+contains+"ge..vb.1"]&group_by=word&ignore_case=word
Complete Statistics
Just like regular /count
but without specifying cqp
, resulting in a complete list of every value of the
given attributes.
Endpoint
Parameters
Takes the same parameters as /count
except it doesn’t use cqp
.
Example
Get statistics for all parts of speech in one corpus:
/count_all?corpus=ROMI&group_by=pos
Statistics Over Time
Show the change in frequency of one or more search results over time.
Endpoint
Parameters
- Required
- corpus – Corpus name. [multi]
- cqp – CQP query.
- Optional
- default_within – Prevent search from crossing boundaries of the given structural attribute.
- within – As above, but for specific corpora, overriding the default. Specified using the format ‘corpus:attribute’. [multi]
- subcqp# – Where # is a number. Sub-queries to the main query (or last cqp# query). Any number of numbered subcqp-parameters can be used.
- granularity – Time resolution. y = year (default), m = month, d = day, h = hour, n = minute, s = second.
- from, to – Only include results contained by specified date range. On the format
YYYYMMDDhhmmss
. - strategy – Time matching strategy. One of 1 (default), 2 or 3. See below for explanation.
- cqp# – Where # is a number. In addition to the cqp parameter, you can add additional CQP queries that will be executed on the result of the previous query (i.e. searching within search results). The final result returned to the user will be that of the last numbered query.
- expand_prequeries – When using multiple CQP queries (cqp# above), this determines whether subsequent queries should be executed on the containing sentences (or any other structural attribute defined by within) from the previous query, or just the actual matched tokens. (default: ‘true’)
- incremental – Incrementally return progress updates when the calculation for each corpus is finished.
If subcqp
is omitted, the result will only contain frequency information for the CQP query in cqp
(or the last
cqp#
query). If one or
more sub-queries are specified using subcqp1
, subcqp2
and so on, the result will contain frequency information for
these as well.
The result is returned both per corpus, and a total.
Strategies
What should happen when you ask for time data with a granularity finer than that of the annotated material? Does a
search limited to the period 2005-01-01 – 2005-01-31 include material dated with only “2005”? The strategy
parameter
gives you some control over this, affecting both how from
and to
work, and what parts of the material contribute
to the results.
The list below describes the three different strategies, and for each strategy the rules that decide what part of the material is included in the search, as well as what tokens contribute to the token count for each data point.
The term “result time span” below refers both to the from
and to
span given by the user, and the different time spans
making up the data points in the result data, the size of which are determined by the granularity
parameter.
For example the data point “2015” representing the whole of year 2015 when granularity
is set to ‘y’, and “2015-01” representing the whole of January 2015 with granularity
set to ‘m’.
t1
and t2
represents the from and to dates for an annotated part of the material, and t1'
and t2'
is the
from and to of “result time span” described above.
- Strategy 1
- The material time span needs to be completely contained by the result time span, or the result time span needs to be
completely contained by the material time span.
(t1 >= t1' AND t2 <= t2') OR (t1 <= t1' AND t2 >= t2')
- Strategy 2
- All overlaps allowed between material time span and result time span.
t1 <= t2' AND t2 >= t1'
- Strategy 3
- The material time span is completely contained by the result time span.
t1 >= t1' AND t2 <= t2'
Returns
{
"corpora": {
"<corpus>": [
{
"relative": {
"<date>": <relative frequency>,
...
},
"sums": {
"relative": <sum, relative frequency>,
"absolute": <sum, absolute frequency>
},
"absolute": {
"<date>": <absolute frequency>,
...
}
},
{
"cqp": "<sub-CQP query>",
"relative": {
"<date>": <relative frequency>,
...
},
"sums": {
"relative": <sum, relative frequency>,
"absolute": <sum, absolute frequency>
},
"absolute": {
"<date>": <absolute frequency>,
...
}
},
<more structures like the one above, one per sub-query>
],
...
},
"combined": [
{
"relative": {
"<date>": <relative frequency>,
...
},
"sums": {
"relative": <sum, relative frequency>,
"absolute": <sum, absolute frequency>
},
"absolute": {
"<date>": <absolute frequency>,
...
}
},
{
"cqp": "<sub-CQP query>",
"relative": {
"<date>": <relative frequency>,
...
},
"sums": {
"relative": <sum, relative frequency>,
"absolute": <sum, absolute frequency>
},
"absolute": {
"<date>": <absolute frequency>,
...
}
},
<more structures like the one above, one per sub-query>
]
}
The data points in the result indicates the number of hits from that point onward until the next data point, meaning that the following data:
"2010": 100,
"2012": 50,
"2013": 0,
"2016": null
should be interpreted as 100 hits during 2010–2011, then 50 hits during 2012, zero hits 2013–2015, and finally from 2016 onwards we have no data at all.
Example
Show how the use of “tsunami” and “flodvåg” (“tidal wave”) has changed over time in the Swedish newspaper Göteborgs-Posten:
/count_time?cqp=[lex+contains+"tsunami\.\.nn\.1|flodvåg\.\.nn\.1"]&corpus=GP2001,GP2002,GP2003,GP2004,GP2005,GP2006,GP2007,GP2008,GP2009,GP2010,GP2011,GP2012&subcqp0=[lex+contains+'tsunami\.\.nn\.1']&subcqp1=[lex+contains+'flodvåg\.\.nn\.1']
Distribution Over Time
Show the distribution of all tokens in a corpus over time.
Endpoint
Parameters
- Required
- corpus – Corpus name. [multi]
- Optional
- granularity – Time resolution. y = year (default), m = month, d = day, h = hour, n = minute, s = second.
- per_corpus – Include per-corpus results. (default: ‘true’)
- combined – Include combined results. (default: ‘true’)
- from, to – Limit result to specified date range. On the format
YYYYMMDDhhmmss
. - strategy – Time matching strategy. One of 1 (default), 2 or 3. See
/count_time
for explanation. - incremental – Incrementally return progress updates when the calculation for each corpus is finished.
Returns
{
"corpora": {
"<corpus>": {
"<date>": <token count>,
...
},
...
},
"combined": {
"<date>": <token count>,
...
}
}
Example
Show distribution of tokens in the Swedish Party Programs and Election Manifestos corpus over time:
/timespan?corpus=VIVILL
Log-Likelihood Comparison
Compare the results of two different searches by using log-likelihood.
Endpoint
Parameters
- Required
- set1_cqp – CQP query for query 1.
- set2_cqp – CQP query for query 2.
- set1_corpus – Corpus name for query 1. [multi]
- set2_corpus – Corpus name for query 2. [multi]
- At least one of:
- group_by – Positional attribute by which the hits should be grouped. [multi]
- group_by_struct – Structural attribute by which the hits should be grouped. [multi]
- Optional
- ignore_case – Change all values of the given attribute to lowercase. [multi]
- max – Max number of results per set.
- incremental – Incrementally return progress updates when the calculation for each corpus is finished.
Returns
{
"average": <average for log-likelihood>,
"loglike": {
"<value>": <log-likelihood value>,
...
},
"set1": {
"<value>": <absolute frequency>,
...
},
"set2": {
"<value>": <absolute frequency>,
...
}
}
A positive log-likelihood value indicates a relative increase in set2
compared to set1
, while a negative value
indicates a relative decrease.
Example
Compare the nouns of two different corpora:
/loglike?set1_cqp=[pos="NN"]&set2_cqp=[pos="NN"]&group_by=word&max=10&set1_corpus=ROMI&set2_corpus=GP2012
Word Picture
Get typical dependency relations for a given lemgram or word.
Endpoint
Parameters
- Required
- corpus – Corpus name. [multi]
- word – Word or lemgram.
- Optional
- type – Search type: ‘word’ (default) or ‘lemgram’.
- min – Cut-off frequency. (default: no cut-off)
- max – Maximum number of results. 0 = unlimited. (default: 15)
- incremental – Incrementally return progress updates when the calculation for each corpus is finished. (default: ‘false’)
Returns
{
"relations": [
{
"dep": "<dependent lemgram or word>",
"depextra": "<dependent prefix>",
"deppos": "<dependent part of speech>",
"freq": <number of occurrences>,
"head": "<head lemgram or word>",
"headpos": "<head part of speech>",
"mi": <lexicographer's mutual information score>,
"rel": "<relation>",
"source": [
<list of IDs, for getting the source sentences>
]
},
...
]
}
Example
Get dependency relations for the lemgram ge..vb.1:
/relations?word=ge..vb.1&type=lemgram&corpus=ROMI
Word Picture Sentences
Given the source ID for a relation (from a Word Picture query), return the sentences in which this relation occurs.
Endpoint
Parameters
- Required
- source – List of source IDs (from a Word Picture query). [multi]
- Optional
- start – First result to return; used for pagination. (default: 0)
- end – Last result to return; used for pagination. (default: 9)
- show – Positional attributes to show. (default: ‘word’) [multi]
- show_struct – Structural attributes to show. [multi]
Returns
Returns a structure identical to a regular /query
.
Lemgram Statistics
Return the number of occurrences of one or more lemgrams in one or more corpora.
Endpoint
Parameters
- Required
- lemgram – Lemgram. [multi]
- Optional
- corpus – Corpus name. (default: all corpora) [multi]
- count – Occurrences as regular lemgrams (‘lemgram’), as prefix (‘prefix’), or as suffix (‘suffix’). (default: ‘lemgram’)
- incremental – Incrementally return progress updates when the calculation for each corpus is finished.
Returns
{
"<lemgram>": <number of occurrences>,
...
}
Example
Get number of occurrences of the lemgrams
ge..vb.1
andta..vb.1
in a single corpus:
/lemgram_count?lemgram=ge..vb.1,ta..vb.1&corpus=ROMI
Structural Values
Get all available values for one or more structural attributes, together with number of tokens for each value.
Similar to /count_all
but without relative frequencies and with support for
hierarchies.
Endpoint
Parameters
- Required
- corpus – Corpus name. [multi]
- struct – Structural attribute. [multi]
- Optional
- count – Include token count. (default: ‘false’)
- per_corpus – Include per-corpus results. (default: ‘true’)
- combined – Include combined results. (default: ‘true’)
- incremental – Incrementally return progress updates when the calculation for each corpus is finished.
struct
can be either a plain attribute, or a hierarchy of two or more attributes, like so:
text_author>text_title
.
Returns
Without count
the result will contain lists:
{
"corpora": {
"<corpus>": {
"<attribute 1>": [
"<value>",
...
],
...
},
...
},
"combined": {
"<attribute 1>": [
"<value>",
...
],
...
}
}
With count
the result will consist of objects:
{
"corpora": {
"<corpus>": {
"<attribute 1>": {
"<value>": <token count>,
...
},
...
},
...
},
"combined": {
"<attribute 1>": {
"<value>": <token count>,
...
},
...
}
}
Example
Get all authors and their titles together with token count:
/struct_values?corpus=ROMI&struct=text_author>text_title&count=true
Authentication
Authenticate a user against an authentication system, if available.
Endpoint
Parameters
- Required
- username, password – Login information using basic access authentication.
Returns
A list of protected corpora that the user has access to.
{
"corpora": [
"<corpus>",
...
]
}