Kumiko Tanaka-Ishii Group

ーStudy of Natural Language as a Complex Systemー
We study mathematical properties of natural language by using the theory of complex systems. The relation between linguistic structure (such as words and grammar) and large-scale properties is investigated from the perspectives of fractals and chaos. We explore mathematical models of language that reproduce these properties and apply them to natural language processing.
(Recently published Book and its Errata)

Language has properties in common with other large-scale social systems, such as finance and communication networks. We explore new ways of engineering these social systems by using texts and computing with large-scale resources.
Computational linguistics and the study of natural language using the theory of complex systems
- Statistical properties of language
- Methods for measuring nonstationarity and long memory underlying language
- Scaling properties of language
- Quantification of complexity of language
Mathematical language models and computational representations of language structure
- Language models that reproduce statistical properties of language
- Embeddings that encode scaling properties
- Mathematical relation between long memory and grammatical structure
- Deep learning methods for sequences with complex properties
Analysis and prediction of social complex systems via texts
- Embedding methods for social complex system entities
- Deep learning methods for financial time series by using texts
- Analyses of various social complex systems from linguistic perspectives
Recent research examples
Mathematical language models and computational representations of language structure
We discuss the limitations of neural and other machine learning language models in relation to the complex properties of language, and we investigate directions for improvement. Moreover, we explore how the scaling properties of language are formally related to the structural components of language such as words and grammar.
Nonlinear language representation
State-of-the-art word embedding methods represent a word with a single vector and presume a linear vector space, which does not easily incorporate the nonlinearity necessary to represent, for example, polysemous words (those having multiple meanings). We study alternative mathematical representations of language, such as the use of nonlinear functions and fields. We have proposed one formulation, called FIRE (Reference), which outperforms BERT in an evaluation based on counting the number of word senses.
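As a toy illustration of this direction (not the actual FIRE formulation, whose details are given in the reference; the class and parameters below are hypothetical), a word can be represented as a nonlinear function over a low-dimensional semantic field, here a mixture of Gaussian bumps, so that a polysemous word carries one mode per sense:

```python
import numpy as np

class WordField:
    """Toy nonlinear word representation: a mixture of Gaussian bumps
    over a low-dimensional semantic field, one center per sense."""

    def __init__(self, centers, weights, width=0.5):
        self.centers = np.asarray(centers, dtype=float)
        self.weights = np.asarray(weights, dtype=float)
        self.width = width

    def __call__(self, x):
        """Evaluate the word's intensity at points x of shape (n, d)."""
        x = np.atleast_2d(x)
        d2 = ((x[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return (self.weights * np.exp(-d2 / (2 * self.width ** 2))).sum(-1)

    def num_senses(self, threshold=0.1):
        """Crude sense count: number of sufficiently weighted modes."""
        return int((self.weights > threshold).sum())

# "bank": a financial sense and a river sense at distant field locations.
bank = WordField(centers=[[0.0, 0.0], [3.0, 3.0]], weights=[1.0, 0.8])
print(bank([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]]))  # high, high, low
print(bank.num_senses())  # 2
```

A single-vector embedding, by contrast, would have to collapse the two senses of "bank" into one point.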
Deep learning and scaling properties
We investigate the potential and limitations of mathematical models of language, and ways to improve them, in terms of whether they reproduce the complex properties of language. The nature of linguistic structure is studied in terms of its relation to the scaling properties of language. (Reference) (Reference)
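One simple diagnostic of this kind, sketched here as an illustration rather than as the evaluation protocol of the referenced papers, is to test whether text sampled from a model reproduces Zipf's rank-frequency law:

```python
import math
from collections import Counter

def zipf_exponent(tokens, max_rank=1000):
    """Estimate the Zipf exponent: the least-squares slope of
    log-frequency against log-rank over the top max_rank words.
    Natural language text typically gives a slope near -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)[:max_rank]
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

Applying the same measurement to the training corpus and to model-generated text, and comparing the two slopes, gives a quick check of whether a model preserves this scaling property.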
Generative models of complex systems
A generative model is a mathematical formulation that generates samples similar to real data. Many such models have been proposed using machine learning methods, including deep learning. Studying what makes a good model serves to characterize the nature of a system and also to clarify the potential of machine learning. We study various time series models, including classical Markov models, grammatical models, Simon processes, random walks on networks, neural models, autoencoders, and adversarial methods. The fundamental properties of these generative models are studied in terms of whether they can generate samples resembling real data. (Reference)
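As one concrete example, a Simon process, among the classical models listed above, can be simulated in a few lines; this minimal sketch reproduces a Zipf-like rank-frequency distribution:

```python
import random
from collections import Counter

def simon_process(length, alpha=0.1, seed=0):
    """Generate a symbol sequence by a Simon process: with probability
    alpha a brand-new symbol is introduced; otherwise the next symbol
    is copied uniformly at random from the sequence so far. The
    resulting rank-frequency distribution follows a power law."""
    rng = random.Random(seed)
    seq = [0]
    next_symbol = 1
    while len(seq) < length:
        if rng.random() < alpha:
            seq.append(next_symbol)   # innovation: a new symbol
            next_symbol += 1
        else:
            seq.append(rng.choice(seq))  # reuse a past occurrence
    return seq

seq = simon_process(100_000)
freqs = sorted(Counter(seq).values(), reverse=True)
# Rank-frequency points; the log-log slope approaches -1 for small alpha.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```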
Unsupervised extraction of templates from texts
Templates are multi-word expressions with slots, such as "Starting at _ on _" or "regard _ as _", that appear frequently in text, including data from sources such as Twitter. Automatic extraction of these template expressions is a challenging problem related to grammar inference. We propose automatic template extraction by using a binary decision diagram (BDD), which is mathematically equivalent to a minimal deterministic finite-state automaton (DFA). We have studied a basic formulation and are currently pursuing a larger application that extracts patterns from social networking service (SNS) data through additional use of deep learning methods. (Reference, with a link to arXiv)
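The following deliberately simplified sketch conveys the task, extracting single-slot patterns by naive enumeration; the actual method in the reference builds a BDD over the expression set instead, and all names here are ours:

```python
from collections import defaultdict

def extract_templates(sentences, n=4, min_fillers=3):
    """Toy single-slot template extraction: for every n-gram, replace
    each position with a slot "_" and keep patterns whose slot is
    filled by at least min_fillers distinct words."""
    fillers = defaultdict(set)
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            gram = tuple(toks[i:i + n])
            for j in range(n):
                patt = gram[:j] + ("_",) + gram[j + 1:]
                fillers[patt].add(gram[j])
    return [(" ".join(p), sorted(fs))
            for p, fs in fillers.items() if len(fs) >= min_fillers]

sents = ["regard it as important", "regard this as important",
         "regard him as important"]
print(extract_templates(sents, n=4, min_fillers=3))
# -> [('regard _ as important', ['him', 'it', 'this'])]
```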

Study of language by using the theory of complex systems
Scaling properties, a notion deriving from statistical mechanics, are known to hold for natural language. We study the nature of language by computing with large-scale data.
Metrics that characterize kinds of data
Various metrics are considered in terms of whether they characterize different kinds of data. For example, in the case of natural language, metrics that identify the author, language, or genre have been studied. One such metric is Yule's K, which is equivalent to Rényi's second-order (plug-in) entropy. Yule's K takes a value that depends not on the size of the data but only on its kind. We explore such metrics among various statistics related to the scaling properties of real data and compare different kinds of data, such as music, programming-language sources, and natural language. (Reference)
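Concretely, Yule's K can be computed from token counts alone; a minimal sketch, with Yule's conventional scaling constant of 10^4, together with the plug-in Rényi entropy of order 2 to which it is related:

```python
import math
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum of squared counts - N) / N^2.
    Unlike vocabulary size, K converges to a text-type-dependent
    constant as the sample grows."""
    counts = Counter(tokens)
    n = sum(counts.values())
    s2 = sum(c * c for c in counts.values())
    return 1e4 * (s2 - n) / (n * n)

def renyi2_entropy(tokens):
    """Plug-in Rényi entropy of order 2: -log2(sum of p_w^2)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -math.log2(sum((c / n) ** 2 for c in counts.values()))

text = "the cat sat on the mat and the cat slept".split()
print(yules_k(text), renyi2_entropy(text))
```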
Quantification of structural complexity underlying real-world time series
How grammatically complex are adults' utterances compared with those of children? How is a literary text more structurally complex than a Wikipedia article? How can such complexity be compared with that of a musical performance or a programming-language source? In the linguistic domain, one existing formal way to consider such questions is the Chomsky hierarchy, which formulates different complexity levels of grammar through constraints on rewriting rules. While the hierarchy provides a qualitative categorization, it cannot quantitatively compare the structural complexity of time series. We investigate a new way to quantify structural complexity by using metrics based on scaling properties. (Reference)
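One scaling-based metric of this kind is the Taylor exponent; the windowing and fitting details below are our simplified sketch:

```python
import math
from collections import Counter

def taylor_exponent(tokens, window=1000):
    """Taylor exponent alpha, fitting sigma ~ mu^alpha across words:
    split the sequence into fixed-size windows, take each word's mean
    and standard deviation of counts per window, and fit the slope of
    log(sigma) against log(mu). alpha = 0.5 for an i.i.d. sequence;
    larger values indicate stronger structural correlation."""
    spans = [tokens[i:i + window]
             for i in range(0, len(tokens) - window + 1, window)]
    counts = [Counter(s) for s in spans]
    xs, ys = [], []
    for word in set(tokens):
        cs = [c[word] for c in counts]
        mu = sum(cs) / len(cs)
        var = sum((c - mu) ** 2 for c in cs) / len(cs)
        if mu > 0 and var > 0:
            xs.append(math.log(mu))
            ys.append(0.5 * math.log(var))  # log of standard deviation
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```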

Analysis of long memory underlying nonnumerical time series
Real instances of social systems have a bursty character, meaning that events occur in a clustered manner. For example, the figure on the right shows how rare events occur over time in texts (the first sequence contains rarer events than the second; the second, rarer than the third). This clustering phenomenon indicates that a sequence has long memory and thus exhibits self-similarity. We study methods to quantify the degree of clustering in nonnumerical time series and examine the differing degrees of self-similarity across various systems. (Reference)
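A simple, standard way to quantify such clustering (sketched here with the burstiness coefficient of Goh and Barabási; the referenced work may use different measures) is to examine the inter-event intervals of a rare word:

```python
import statistics

def burstiness(event_positions):
    """Burstiness B = (sigma - mu) / (sigma + mu) of inter-event
    intervals. B near 1 means strongly clustered events; near 0,
    Poisson-like; near -1, periodic."""
    intervals = [b - a for a, b in zip(event_positions, event_positions[1:])]
    mu = statistics.mean(intervals)
    sigma = statistics.pstdev(intervals)
    return (sigma - mu) / (sigma + mu)

def word_positions(tokens, word):
    """Positions at which a given word occurs in a token sequence."""
    return [i for i, w in enumerate(tokens) if w == word]

# Toy sequence with two clusters of a rare word: B > 0 (bursty).
tokens = (["w"] * 200 + ["rare"] * 3 + ["w"] * 500
          + ["rare"] * 2 + ["w"] * 300)
print(burstiness(word_positions(tokens, "rare")))
```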
Analysis and prediction of social complex systems via texts
Language has properties in common with other large-scale complex systems, such as finance and communication networks. We aim to model these universal common properties and apply them to the engineering of such systems by processing large-scale textual data.
Modeling of financial markets under extreme risks
The theory of econophysics reveals the scaling properties of prices, which explain why market crashes occur much more frequently than expected. A challenge in financial market modeling is to characterize the risks implied by extreme events, such as the COVID-19 crisis, which are rare and thus cannot be well captured from the limited history of price data. News texts, on the other hand, are biased toward such rare events and can therefore serve as a desirable information source complementing price history. We have studied the use of textual data for quantifying such extreme risks and its potential applications. (Reference)
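For illustration, the heavy tail of the return distribution, the scaling property behind frequent crashes, can be quantified with the standard Hill estimator; this is a generic econophysics tool, not necessarily the method of the referenced paper:

```python
import math
import random

def hill_estimator(returns, k=100):
    """Hill estimator of the tail index of |returns|: average the
    log-ratios of the k largest observations to the (k+1)-th largest,
    then invert. Tail indices around 3 (the "inverse cubic law" of
    econophysics) imply far more frequent crashes than a Gaussian
    model would predict."""
    x = sorted((abs(r) for r in returns), reverse=True)
    return k / sum(math.log(x[i] / x[k]) for i in range(k))

# Toy heavy-tailed sample: Pareto with tail index 3.
random.seed(0)
sample = [random.paretovariate(3) for _ in range(10_000)]
print(hill_estimator(sample, k=200))  # roughly 3
```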
Influence of textual data and communication structure on financial prices
The Bitcoin price crash at the beginning of 2018 was caused by various social factors. The influence of newswire stories and social media was especially crucial, as credible and fake information were mixed and amplified on social media. We accumulate financial data, including various stock and Bitcoin prices, and analyze the influence of communication structure and textual data. (Reference)
Entropy rate of human symbolic sequences
We explore the complexity underlying human symbolic sequences via entropy rate estimation. Consider the number of possibilities for a time series of length n, parameterized by h, as 2^(hn). For a random binary series consisting of half ones and half zeros, h = 1. For the 26 characters of English, however, the number of possibilities is not 26^n, because of various constraints, such as "q" being followed only by "u". Shannon computed a value of h = 1.3, but the question of acquiring the true h for human language is difficult to answer and remains unsolved: in fact, it is unknown whether h is even positive. Therefore, we study ways to compute upper bounds of h for various kinds of data, including music, programs, and market data, in addition to natural language. (Reference, Reference)
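One standard way to obtain such an upper bound, sketched here while the referenced papers develop more refined estimators, is universal compression: for a stationary ergodic source, any lossless code length divided by the sequence length bounds h from above.

```python
import lzma

def entropy_rate_upper_bound(text: str) -> float:
    """Upper bound on h in bits per character (for ASCII-like text,
    one byte per character): the length of any lossless encoding,
    divided by the input length, cannot fall below the entropy rate."""
    data = text.encode("utf-8")
    compressed = lzma.compress(data, preset=9)
    return 8 * len(compressed) / len(data)

# Highly repetitive input compresses far below 1 bit per character;
# real English text yields a much larger (still upper-bound) value.
sample = "the quick brown fox jumps over the lazy dog " * 200
print(f"upper bound on h: {entropy_rate_upper_bound(sample):.3f}")
```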