Computer-Assisted Language Comparison

Project Website Online

2017-03-31T12:00:00+00:00

In time with the official start of the project, we are glad to announce that the official project website is now online. It is without question that this website will be refined during the project duration, but the basic infrastructure is now there, and those interested in our project will be able to follow our news.

Attending the EACL Conference in Valencia

2017-04-02T12:00:00+00:00

Next week, I will attend the conference of the European Chapter of the Association for Computational Linguistics. After Lyon in 2012, this is my second EACL, and I will be involved in two presentations: one together with Gerhard Jäger and Pavel Sofroniev on automatic cognate detection, and one in which I present the current state of my EDICTOR tool for computer-assisted language comparison.

Mini-Workshop on Sino-Tibetan Phylogenies in Zürich

2017-04-12T12:00:00+00:00

We had an interesting small workshop in Zürich, attended by my former colleagues from Paris, Laurent Sagart, Guillaume Jacques, and Yunfan Lai, with whom I pursue the goal of establishing a larger lexicostatistic database of Sino-Tibetan languages, as well as by people from Balthasar Bickel's team. We presented the work we have done so far, and I gave a talk on my ideas regarding a Sino-Tibetan Lexicostatistic Database. We will all keep collaborating in the future and may organize a second meeting, either in Paris or in Jena, later this year.

Mini-Workshop on Poetry

2017-04-22T12:00:00+00:00

On Thursday last week, we had a mini-workshop on poetry, for which we invited colleagues from the Max Planck Institute for Empirical Aesthetics and from the University of Zurich. It may look strange at first sight that poetry would matter for computer-assisted language comparison, but the poetic tradition of rhyming in the history of Chinese in fact plays a crucial role in the reconstruction of the oldest stages of the language. I myself devoted two recent studies to the application of network approaches to Old Chinese phonology, which are currently in the final phase of editing and will hopefully appear soon (the draft for one study can be found here). In my talk, I briefly presented this research (the slides are here) and pointed to future questions on the dynamics underlying the development of poetic traditions from a cross-linguistic and historical perspective.

The other speakers discussed many interesting topics, ranging from empirical studies on poetry and how one can annotate the important factors that constitute poetic speech (Winfried Menninghaus and Christine Knoop, MPI-AE), via the automatic detection of rhyme patterns in German poetry (Thomas Haider, MPI-AE), up to questions of language contact and cultural exchange (Paul Widmer, UZH), and the co-evolution of linguistic and poetic forms (Cormac Anderson, MPI-SHH). Our discussions during the talks were long, and since we had to stop at some point, there was no time for the talk by Olivier Morin (MPI-SHH) on "poetry as super-week communication". This was a definite loss, as I saw when Olivier shared his slides afterwards, but luckily we are working in the same department, and nothing will prevent us from continuing our discussions and exchange of ideas.

We all decided to stay in close contact and keep each other informed about future ideas as well as concrete research, and it is quite likely that at some point in the not-too-distant future, I will present more of this here.

LingPy-Tutorial at the Quantitative Methods Spring School

2017-05-15T12:00:00+00:00

Last week, we had a spring school on Quantitative Methods here in Jena. This is an annual event, and it was the second time that it took place, with Fiona Jordan organizing the main event, and many interesting scientists coming here as tutors or students for one full week (seven days), which was quite exhausting but also very interesting. This time, I gave a tutorial on LingPy, introducing the basic ideas of automatic sequence comparison and how it can be used to get started on computer-assisted workflows. You will find the tutorial online here in the form of an IPython Notebook, but you can likewise download the PDF or follow my introductory slides. All in all, this tutorial will provide you with the most recent information needed to start making your own analyses with LingPy.
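To give a flavour of what "automatic sequence comparison" means in its simplest form, here is a minimal sketch in plain Python. It is not taken from the tutorial and does not use LingPy's API: the function names `edit_distance` and `normalized_edit_distance` are made up for this illustration, and the plain edit distance shown here is only a stand-in for the more refined, sound-class-based methods that LingPy offers. Words are represented as lists of sound segments.

```python
def edit_distance(seq_a, seq_b):
    """Count the minimal number of insertions, deletions, and
    substitutions needed to turn one segment sequence into another."""
    # previous[j] holds the distance between the segments of seq_a
    # processed so far and the first j segments of seq_b
    previous = list(range(len(seq_b) + 1))
    for i, seg_a in enumerate(seq_a, start=1):
        current = [i]
        for j, seg_b in enumerate(seq_b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(seq_a, seq_b):
    """Scale the distance to the range 0..1 by the longer sequence."""
    longest = max(len(seq_a), len(seq_b))
    return edit_distance(seq_a, seq_b) / longest if longest else 0.0


# German "Hand" and English "hand", roughly segmented into sounds
german = ["h", "a", "n", "t"]
english = ["h", "æ", "n", "d"]
print(edit_distance(german, english))
print(normalized_edit_distance(german, english))
```

A distance of this kind treats every sound difference as equally costly, which is exactly the limitation that motivates the more linguistically informed scoring schemes discussed in the tutorial.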

Final Report of SinDial Project

2017-05-19T12:00:00+00:00

My DFG-funded research project on Vertical and lateral aspects of Chinese dialect history officially ended on December 31, 2016. From January 2015 until December 2016, I had two very interesting but also challenging years, during which I became acquainted with many scholars from different disciplines and countries, but also with many new approaches and methods in historical linguistics and related disciplines.

Having submitted my final report in April (the first time in a long while that I wrote in German again), and hoping that the reviewers do not have anything grave to complain about, I have now published the report with Zenodo, and you can find it online here.

In case you wonder why I recommend this final report in the context of the CALC project, the answer is simple: Many of the ideas that I put into the project application for the CALC project were developed while I was in Paris, funded by the DFG, so in some sense, the SinDial project on Chinese dialects was the root of CALC.

New Papers Accepted

2017-06-15T12:00:00+00:00

Two new papers have been accepted during the last two weeks, and I am very glad about both publications, since they cover topics that touch the core of my project on computer-assisted language comparison.

The first is joint work with Nathan W. Hill (SOAS, London), and titled "Challenges of annotation and analysis in computer-assisted language comparison: A case study on Burmish languages". In this paper, we point to general annotation challenges when analysing South-East Asian languages in which compounding is frequent and sound correspondences are often hard to discover. We present a new database of cognate sets across 8 Burmish languages, all coded for partial cognacy, and consistently aligned. The final version of the paper, which we submitted to the Yearbook of the Poznań Linguistic Meeting, is available here.

The second paper is joint work with Gerhard Jäger (University of Tübingen), and concentrates on a problem which is often overlooked in the literature, namely the problem of how well current algorithms infer which word forms were used to express a given concept in ancestral, unattested languages. This is not a trivial problem, and we only address it from the perspective of the classical lexicostatistical word lists, where we test on three datasets (Indo-European, Austronesian, and Chinese) how well different algorithms infer the ancestral states as they are predicted by the gold standard (the proto-forms provided along with the datasets). It turns out that the algorithms do not perform very well (unfortunately, MLN, an algorithm on which I worked a lot myself, performs worst), but when looking at the gold standards in detail, we realized that many of the errors are due to problems with the gold standards themselves, which are quite inconsistent and not very trustworthy. As a result, we think that using ancestral state reconstruction methods for this purpose of "onomasiological reconstruction" might actually help to get a better estimate. The draft of the paper can be found here.

New Blog Posts and Papers

2017-06-29T12:00:00+00:00

I have published a couple of new papers recently, but since they go back to my former research project and were not directly developed as part of the CALC project, I do not list them in the list of papers. They are, however, quite important for our research, since they both deal with Old Chinese reconstruction.

The first paper is on "Using network models to analyze Old Chinese rhyme data" and will soon officially appear in the Bulletin of Chinese Linguistics. In the meantime, you can find my author's copy here.

The second paper, on "Vowel purity and rhyme evidence in Old Chinese reconstruction" (joint work with my colleagues from Paris and London, Jananan S. Pathmanathan, Eric Bapteste, Philippe Lopez, and Nathan W. Hill), finally came out today, and you can find the PDF for download here.

I also wrote two more blog posts, both devoted to language comparison in general and my view on computer-assisted language comparison in particular. The first blog post (in English) is titled "Trees do not necessarily help in linguistic reconstruction" and can be found here, and the second one deals with sound change, explains it with the help of tooth loss in comic books, and can be found here.

Three new talks during a busy week

2017-07-09T12:00:00+00:00

From Friday, 30th of June, until last Friday, 7th of July, I gave three talks on three different topics. It started with a summary of the potential of network approaches in Old Chinese reconstruction in Paris, after which I was very surprised that many scholars seem to support the idea of handling Chinese character formation with directed networks (and I hope that I will find time to address this soon, even if only in a small example). After that, I gave a talk in Liège on colexifications and cross-linguistic polysemies, and how we plan to update the CLICS database when we launch CLICS 2.0. Finally, I introduced some basic ideas on how to handle lexical and etymological data within the Cross-Linguistic Data Formats initiative, focusing specifically on annotation and analysis. Although it was quite exhausting to prepare all these talks, I am glad that I scheduled them for this time, since it forced me to push a couple of important projects, such as cross-linguistic colexifications and the cross-linguistic data formats, which are all central for computer-assisted language comparison in general, and also important for Sino-Tibetan in particular.

After one week in Jena, where I'll try to catch up with the work I could not finish yet, I'll finally have two weeks of holidays until the beginning of August, interrupted only by another talk in Cologne on Friday next week, in which Nathan Hill and I will present some interesting new work on Burmish languages (I'll report later in more detail).

Back from Holidays

2017-08-01T12:00:00+00:00

Having been traveling for about two weeks, interrupted by a talk I gave in Cologne, I am now back at work and finally find time to announce some news on what happened recently. First, there are two new blog posts I wrote: one in English on similarities in linguistics, a follow-up to a blog post I devoted to the same topic earlier this year, and one in German devoted to impoliteness (Unhöflichkeit) in Chinese and other languages. Second, there is the talk I gave together with Nathan W. Hill in Cologne, at a workshop on the regularity of sound change, organized by Eugen Hill and Robert Mailhammer. In our talk, titled "Computer-assisted approaches to linguistic reconstruction", we outlined a new framework for automated linguistic reconstruction which we illustrated with examples from the Burmish languages.

New Post-Doc in CALC

2017-08-02T12:00:00+00:00

It is my pleasure to welcome Yunfan Lai as a post-doc in the CALC project. He has a lot of experience in working with Sino-Tibetan languages and devoted his PhD to Khroskyabs, a very interesting branch of Sino-Tibetan whose history is still not clearly understood. As a member of CALC, Yunfan will pursue his studies on Khroskyabs and related varieties, and also provide help to uncover the mysterious history of Sino-Tibetan.

Talk at the Human Document Project

2017-08-09T12:00:00+00:00

Last week, I visited the Human Document Project 2017 in Freiburg, a project that seeks to preserve information about humans beyond the existence of the human race. As much as this may sound like science fiction at first sight, it is fascinating to see how many different questions and disciplines need to be involved in the plan of creating a time capsule that could bear witness to our existence even if we, that is, humanity, no longer exist. They invited philosophers, artists, technicians, data experts, informaticians, physicists, and also me, as a linguist, whose job it was to give a rough overview of linguistic diversity and how we try to represent our knowledge about it. Although my talk, titled Storing our knowledge of linguistic diversity: Towards the standardization of cross-linguistic data formats, did not involve the longer perspective of the next million years, I had the impression that it triggered the interest of the colleagues. While I remain sceptical about the general usefulness of science fiction questions in science, I have to admit that the day I spent in Freiburg was very inspiring, as I learned so many new things. Maybe, in the end, this is even the more important aspect of the HUDOC project: bringing together people from different disciplines and having them talk with each other...

Yunfan Lai's PhD thesis is now online

2017-08-10T12:00:00+00:00

Hi there. I defended my thesis back in June, but after a large gap, I failed to motivate myself to upload it online. Now I finally did it. You may now have a look at my thesis here. Have fun!

New Blog Posts for August

2017-08-17T12:00:00+00:00

I wrote two new blogposts in August, one in German on the benefits of using alignments and similar visualization techniques more broadly in the media, which you can find here, and one in English, where I discuss the problem of unattested character states in phylogenetic reconstruction, specifically in linguistics, which you can find here.

CALC and DLCE Organize Panel at the Deutscher Orientalistentag (Jena)

2017-09-06T12:00:00+00:00

The DLCE and CALC are organizing a panel at the Deutscher Orientalistentag, which will take place in Jena this year (September 18-22). On September 21, from 9am to 1pm, scientists from the institute and external guests will share and discuss their thoughts on the topic "Languages as keys to our past".

We will soon provide more information on the list of speakers and their abstracts.

Schedule and Abstracts for DOT Panel on Historical Linguistics Online

2017-09-07T12:00:00+00:00

I just finalized the first version of our website for the Panel on "Languages as Keys to Our Past" which we organize for the 33. Deutscher Orientalistentag „Asien, Afrika und Europa“. The website can be found here. Later, we will also link the slides of all speakers in PDF format.

Radio Interview on Language Diversity

2017-09-11T12:00:00+00:00

Last week, I gave a radio interview with Deutschlandfunk Nova in which I tried my best to answer questions regarding language diversity and its driving forces. The interview, which was broadcast yesterday, can also be found online under this link.

New Paper on Annotation in Historical Linguistics

2017-09-14T12:00:00+00:00

I am proud to announce that a paper in which Nathan Hill and I discuss Challenges of Annotation and Analysis in Computer-Assisted Language Comparison has now been published online and can be freely downloaded from this link. The paper discusses general challenges of annotation for the purpose of historical language comparison and also introduces first ideas on how to solve these challenges. Here is the abstract:

The use of computational methods in comparative linguistics is growing in popularity. The increasing deployment of such methods draws into focus those areas in which they remain inadequate as well as those areas where classical approaches to language comparison are untransparent and inconsistent. In this paper we illustrate specific challenges which both computational and classical approaches encounter when studying South-East Asian languages. With the help of data from the Burmish language family we point to the challenges resulting from missing annotation standards and insufficient methods for analysis and we illustrate how to tackle these problems within a computer-assisted framework in which computational approaches are used to pre-analyse the data while linguists attend to the detailed analyses.

DLCE, CALC, and University Jena Co-Organize Workshop at the Poznań Linguistic Meeting

2017-09-15T12:00:00+00:00

The DLCE (Cormac Anderson, Paul Heggarty), CALC (Johann-Mattis List), and Friedrich Schiller University Jena (Adrian Simpson) are co-organizing a workshop as part of the Poznań Linguistic Meeting on Monday, September 18. For more information, see the workshop website, which has just been launched.

New Blog Post on Authority Arguments

2017-09-21T12:00:00+00:00

Two days ago, I wrote another blog post, this time on Arguments from authority, and the Cladistic Ghost, in historical linguistics. The argument I make there may look offensive, but my major intention was to draw attention to the fact that our "classical" comparative method was never classical in any sense, as it is just a label we use to denote what we do to compare languages, and that, in the light of new approaches, we should not be too dismissive, but rather try to work harder on integrated, computer-assisted frameworks, which will hopefully enable us to better understand how our languages evolved into their current shape.

New Blog Posts for September

2017-10-02T12:00:00+00:00

At the last minute, I managed to write my monthly German blog post for September, which was published last Saturday and deals with freedom and obligation in languages: Wahlpflicht und Wahlfreiheit in der Sprache.

You may further notice that we have added an events section to this website, in which we list upcoming and past events carried out as part of the CALC research project.

New Blogposts and Lectures Online

2017-10-30T12:00:00+00:00

My German blog post for this month is devoted to the case system in Russian, titled Ein Fall für Tee. In this post, I discuss how difficult it is in linguistics to identify true regularities without exceptions.

In addition, the lecture series which I gave at Tianjin University is now available online. You can download the full lecture here.

My monthly post for The Genealogical World of Phylogenetic Networks also appeared. This time, I collaborated with Guido Grimm to investigate cross-linguistic naming patterns for domesticated animals, like cat, dog, goat, and sheep.

LingPy 2.6 released!

2017-11-23T12:00:00+00:00

We just released LingPy 2.6. In addition to some smaller changes that may prove useful, this release focused on stabilizing the behavior of the algorithms and making the package easier to use.

Documentation of the package can be found, as usual, at http://lingpy.org, and the package itself can be downloaded from the traditional channels, be it Zenodo, GitHub, or PyPI.

Two new blog posts and website for upcoming workshop

2017-11-29T12:00:00+00:00

During the last few days, we managed to finish two new blog posts. The first is a follow-up to our earlier blog post on animal names, this time devoted to goats and sheep. The second blog post (in German) is devoted to "hybrid pronunciations", exemplified with the help of the debate about the Jamaica coalition in Germany.

I would also like to announce that our upcoming workshop on "Old Chinese and Friends" (to take place on April 26-27, 2018, in Jena) is taking shape, and we have just launched the project website along with the call for abstracts.

Three Positions Available in the CALC Project

2017-12-01T12:00:00+00:00

Our research project offers three positions for three years each, two for doctoral students and one for a post-doc (the post-doc position is initially for two years, with an option for a one-year extension following a positive evaluation after the first year). The starting date is April 2018, and the deadline for the submission of applications is the end of January. The call for post-docs, with all details, can be found here, and the call for doctoral students can be found here.

Final blog posts for 2017

2017-12-19T12:00:00+00:00

A week earlier than normal, my final blog posts for this year have now appeared. The German one is titled Die Angst des Jongleurs vorm Fallenlassen and discusses the problem of letting things go (especially if one tries to juggle). The English one is titled The art of doing science: alignments in historical linguistics and discusses what we should keep in mind when using alignments in historical linguistics.

Call for papers for a special issue on "Computational approaches in historical linguistics after the quantitative turn"

2018-01-15T12:00:00+00:00

I would like to point to a call for papers for the journal "Computational Linguistics" on "Computational approaches in historical linguistics after the quantitative turn", guest-edited by Taraka Rama, Simon J. Greenhill, Harald Hammarström, Gerhard Jäger, and myself.

The deadline is July 15, 2018, and detailed information can be found in this PDF.

New Blog on Mayflies

2018-01-23T12:00:00+00:00

I just finished my regular German blog post for January, this time on Terry Pratchett's Eintagsfliege and the question whether language change and language decay are the same phenomenon (which they aren't, of course).

English blogpost on pronunciation networks in Chinese phonology

2018-01-29T12:00:00+00:00

It was in some sense last minute (but planned as such before) to write my monthly blogpost for David Morrison's blog on phylogenetic networks in the last week of January. This has now been done (also thanks to David's help in making my English and the story in general more readable), and you can find my blogpost on pronunciation networks in Chinese phonology online.

New Draft Paper on Network Approaches to Historical Chinese Phonology

2018-02-13T12:00:00+00:00

Yesterday, I finished preparing a paper that will be presented at this year's LFK Young Scholars Symposium in Taipei. It deals with new network approaches to the discipline of Historical Chinese Phonology, including networks of Chinese character formation and networks of Chinese sound glosses. A draft form of the paper can be found here.

New German Blog Post on Kitchen Etymologies

2018-02-20T12:00:00+00:00

On Saturday, I published another German blog post. This time about kitchen etymologies in times of elections, titled Konservativ kommt wirklich nicht von Konserve.

New English Blog Post on Synonymy and Phylogenetic Reconstruction

2018-02-26T12:00:00+00:00

Today, my regular monthly English blog post appeared. This time, it discusses the problem of excessive synonymy in linguistic datasets and its implications for phylogenetic reconstruction: Tossing coins: linguistic phylogenies and extensive synonymy.

Finally Released Concepticon 1.1 and New Drafts

2018-03-08T12:00:00+00:00

After two years of hard work on API and data, we have finally managed to release version 1.1 of the Concepticon resource. In addition to the general application, we also offer a standalone app with enhanced search functionalities in currently seven languages, which can be found here.

Furthermore, we just submitted our final version (before the final proofs) of a paper that will appear some time later in 2018 with the provocative title Save the trees: Why we need tree models in historical linguistics (and when we should apply them).

Last but not least, a short paper on the question Are automatic methods for cognate detection good enough for phylogenetic reconstruction (Taraka Rama, myself, Johannes Wahle, and Gerhard Jäger) has now been accepted for presentation as a poster at the NAACL conference. We're currently revising the draft, but we will try to put a draft close to the final version online soon. The results indicate that especially the simpler methods may perform surprisingly well, although we could, unfortunately, only check the topology.

Latest Thinking Interview on Automatic Cognate Detection

2018-03-09T12:00:00+00:00

Yesterday, an interview on our work on automatic cognate detection from early last year (List, Greenhill, and Gray 2017) appeared online at the Latest Thinking Platform.

New German Blog Post on the Imitation of Unknown Languages

2018-03-12T12:00:00+00:00

During the weekend, I found time to write my monthly German blog post. This time, I discuss how speakers of a given language imitate or joke about speakers of other languages. This topic is linguistically interesting, since it may reveal quite a few things about the speakers who joke about foreign languages, as I try to show with German jokes about the Chinese language. The post can be found here.

Article in FAZ Discussing Computational Historical Linguistics

2018-03-16T12:00:00+00:00

On March 14, an article appeared in the Frankfurter Allgemeine Zeitung, discussing the usefulness and appropriateness of computational approaches in historical linguistics, titled "Bäume der Erkenntnis", by Wolfgang Krischke. The article also presents our department and mentions our attempts to work on a reconciliation of computational and classical historical linguistics. Unfortunately, I cannot share it at the moment, since it did not appear online, but if it does, we will announce it here.

FAZ Article Online and Blog Post on the Systemic Aspects of Sound Change

2018-03-26T12:00:00+00:00

I am pleased to announce that the FAZ article I mentioned in a previous post has now appeared online, where you can freely read it: Wie erforscht man Ursprünge?.

Furthermore, my English blog post for March just appeared. This time, I try to explain why sound change is so peculiar, and why it cannot simply be equated with changes in DNA or protein sequences due to mutations: It's the system, stupid!.

Lecture on Computer-Assisted Language Comparison

2018-04-05T12:00:00+00:00

Now that our group has finally been assembled (more info in that regard soon), we are ready to spread the word by presenting a lecture on computer-assisted language comparison at the Friedrich-Schiller-University Jena during the summer term, every Tuesday from 2 to 4 pm.

The target audience of the lecture is linguists with a background in historical linguistics and an interest in learning more about computational approaches. If you are interested in joining, you can check out the seminar plan or the official announcement of the lecture with FSU Jena.

Old Chinese and Friends workshop ended successfully

2018-05-03T12:00:00+00:00

Our workshop, Old Chinese and Friends, held from 26/04/2018 to 27/04/2018, was highly appreciated by the participants.

17 renowned scholars from all over the world participated in the workshop and had their say. The presentations covered a wide range of domains concerning Old Chinese, including historical phonology, morphology, methodology and paleography. New ideas, opinions and hypotheses have successfully found their way to Jena, a city which can now be named the home of historical linguistics.

The participants were also amazed at how much time they were given for discussion at the end of each day. The discussion sessions were full of insightful questions and comments.

The workshop embraced language variety, as both English and Chinese presentations and discussions were accepted.

Old Chinese and Friends is a follow-up workshop of the 2016 Recent Advances in Old Chinese Historical Phonology at SOAS, University of London, UK. The aim of this workshop is to share new findings and results in the field of the Old Chinese language.

Apart from this exciting event, the month of April has also seen the acceptance of my paper on "Relativisation in Wobzi Khroskyabs and the integration of genitivisation" by Linguistics of the Tibeto-Burman Area. This article is the first contribution on sentential constructions in the Khroskyabs language (more info coming soon).

LingPy Tutorial Accepted

2018-05-16T12:00:00+00:00

Our tutorial for the LingPy library, which describes in detail how cognate detection and alignment analyses can be carried out with help of LingPy 2.6, has now finally been accepted for publication with the Journal of Language Evolution. This tutorial is supposed to represent the state of the art of what can be done with LingPy and how it should be done. It was prepared in collaboration with Mary Walworth, Simon Greenhill, Tiago Tresoldi, and Robert Forkel, and reflects the strong collaboration between different members of our Department of Linguistic and Cultural Evolution and our CALC research group. The draft of the paper can be found here and the tutorial itself is available from GitHub.

German Blog Post

2018-05-18T12:00:00+00:00

Yesterday, I published my traditional German blog post for May. This time, I discuss different aspects of linguistic variation: Ur-in-stinkt: Grenzen und Chancen der Schriftsprache.

New Version of CLICS

2018-05-24T12:00:00+00:00

We have completely relaunched the database of cross-linguistic colexifications with help of the CLLD framework, which is now available as a beta-release at http://clics.clld.org. Our paper (together with Simon Greenhill, Cormac Anderson, Thomas Mayer, Tiago Tresoldi, and Robert Forkel) introducing the database, titled "An improved database of cross-linguistic colexifications" is available in draft form here.

Yunfan is going to the field

2018-05-26T12:00:00+00:00

I am now packing my stuff, getting ready for a summer fieldtrip in Sichuan. I will be working on Khroskyabs dialects (hopefully several new dialects). I will focus on basically everything, from phonology to morphosyntax. I will also keep an eye on the expressions of geography in this language, as well as animal calling sounds and other interesting stuff.

New English Blog Post

2018-05-30T12:00:00+00:00

On Monday, I published my traditional monthly blog post for the Phylogenetic Networks Blog, this time Comparing reconstruction systems in historical linguistics.

New Weblog on Computer-Assisted Language Comparison

2018-06-06T12:00:00+00:00

After almost two months of preparation, our CALC team is now busy writing the first blog posts for our new weblog on Computer-Assisted Language Comparison in Practice. In the future, we hope to publish at least one post per month, targeting different topics, including methodological discussion notes and fresh tutorials on software, data curation, and data analysis.

New Post on Gender Differences in Language

2018-06-10T12:00:00+00:00

Last week, I wrote my monthly German post, titled Huhu, Digga! Geschlechtsunterschiede in der Sprache. I discuss recent phenomena of gender differences in the German language and their potential implications for the debate in German about "fair language" ("gerechte Sprache").

CALC Blog Post

2018-06-18T12:00:00+00:00

Last week I published a short info on a dataset we developed for our project, containing all the parallel translations in the English Wiktionary. The post can be found here and the data is available on Zenodo as Parallel Translations from the English Wiktionary.

New Post on Internal and External Language Comparison

2018-06-26T12:00:00+00:00

Today, my monthly post for the phylogenetic networks blog appeared, this time discussing Horizontal and vertical language comparison, that is, the differences in comparing languages internally or externally.

LingPy Tutorial Published Online

2018-07-07T12:00:00+00:00

Our tutorial on LingPy (joint work with Mary Walworth, Simon Greenhill, Tiago Tresoldi, and Robert Forkel) has now appeared online, published with the Journal of Language Evolution. The article is open access and can be downloaded or viewed online under this link. It reflects the current state of the art of LingPy in version 2.6.

Official Release of CLICS 2.0

2018-07-19T12:00:00+00:00

The CLICS database of cross-linguistic colexifications has now officially been released in a new version, called CLICS², available at http://clics.clld.org. The database features a multitude of new data points and a completely new framework for data curation and data analysis. What is new in this new database is also documented in a forthcoming paper, which will soon be published with Linguistic Typology. You can find the draft of that paper here.

German Blog on Teekesselchen

2018-07-24T12:00:00+00:00

I just wrote another German blog for July, this time focusing on homophony, polysemy, and the game that we used to play when I was young, called Teekesselchen "teapot": Netzwerke aus Teekesselchen.

English post on colexification networks

2018-07-30T12:00:00+00:00

My monthly blogpost in English just appeared online. This time, I am introducing the idea of cross-linguistic colexification networks (as they appear in our CLICS database): Networks of polysemous and homophonous words.

German blog post on word histories

2018-08-05T12:00:00+00:00

I just published my German blog post for August, this time reflecting about word histories, and how social factors may influence the history of words. You find the post here.

New cookbook entry for CLICS

2018-08-09T12:00:00+00:00

I published a new cookbook entry on the usage of the pyclics API on our blog for tutorials and methodological discussions on CALC, which you can find here.

CLICS paper appeared online

2018-08-22T12:00:00+00:00

Our paper presenting the CLICS² database of cross-linguistic colexifications has now appeared online officially. Unfortunately, the production team of DeGruyter messed up the online version of the article, so Chinese characters are not readable, and Russian characters are turned upside down. Luckily, the PDF version is correct. The paper is open access and can be accessed here.

First official beta release of SinoPy library

2018-08-24T12:00:00+00:00

I just submitted a first (beta) version of the SinoPy library for quantitative tasks in Chinese historical linguistics. SinoPy is an attempt to provide useful functionality for users working with Chinese dialects and Sino-Tibetan language data and struggling with tasks like converting characters to Pinyin, analysing characters, or analysing readings in Chinese dialects and other SEA languages.

You can find the library on GitHub, on PyPI, or on Zenodo.

Paper on CLDF accepted by Scientific Data

2018-08-27T12:00:00+00:00

Our paper introducing the CLDF (Cross-Linguistic Data Formats, https://cldf.clld.org) initiative has been accepted by Nature's Scientific Data journal:

Forkel, R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (forthcoming): Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

The paper introduces the basic ideas behind the CLDF standard and provides some examples and background information. I have uploaded the final draft we submitted to Nature here.

Blogpost on terminology for cognate relations

2018-08-28T12:00:00+00:00

Yesterday, I published a new English blogpost, this time introducing a new term for cognate relations in historical linguistics: "regular cognates". The concept is crucial for our terminology, although we are only beginning to develop methods to assess the regularity properly within computer-assisted frameworks. You can find the blogpost here.

Blogpost on representing structural data in CLDF

2018-09-05T12:00:00+00:00

Last Monday, I published a blogpost presenting how structural data can be represented in CLDF format. The blog can be found here, and it includes two datasets that were published before, which are now provided in CLDF format.

Paper with Abbie Hantgan on Bangime and Dogon accepted

2018-09-06T12:00:00+00:00

A paper by Abbie Hantgan and me, where we present the preliminaries for a computer-assisted analysis of Bangime and its relation to the Dogon languages in Mali, has now been accepted with the Journal of Language Contact. The paper presents a new approach for automatic borrowing detection based on a comparison of different algorithms for automatic cognate detection. Although the approach is rather simple, it seems to be efficient enough to provide initial hints regarding major borrowing partners in language contact scenarios, and it also shows that the mysterious Bangime language remains an isolate, at least with the methods we have at our disposal by now. The draft of the paper is available here.

New blog post on the dangers of etymologies in our daily life

2018-09-09T12:00:00+00:00

Yesterday, I published my monthly German blogpost, this time discussing the problem of using etymologies in speeches for rhetorical reasons. You can find the post, titled "Von hohen Zeiten und Schlägen in Raten: Vorsicht vor Alltagsetymologien!" here.

New blog post on CLDF for structural data

2018-09-24T12:00:00+00:00

Yesterday, I published my monthly English blogpost, this time showing how structural data can be represented in the CLDF format. We plan to publish follow-up posts where we show how this data can be analyzed with network approaches. You can find the blogpost here.

A fast algorithm for cognate detection based on matching consonant classes

2018-10-03T12:00:00+00:00

On Monday, I published a new blogpost presenting a fast algorithm for cognate detection using Dolgopolsky's approach of matching consonant classes. The blogpost with an example using LingPy's test sets and basic data structures can be found here.
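The core idea of matching consonant classes can be illustrated with a minimal sketch in plain Python. Note that this is not the code from the blogpost and does not use LingPy: the class table `SOUND_CLASSES`, the helper names, and the example forms are simplified assumptions chosen for illustration only. Each word is reduced to the classes of its first two consonants, and words sharing the same key are grouped as cognate candidates.

```python
from collections import defaultdict

# Simplified, hypothetical consonant classes in the spirit of Dolgopolsky's
# proposal (NOT the full sound-class model shipped with LingPy).
SOUND_CLASSES = {
    "P": "pbfv",
    "T": "tdθð",
    "S": "szʃʒ",
    "K": "kgɡxɣq",
    "M": "m",
    "N": "nŋɲ",
    "R": "rlɾ",
    "W": "w",
    "J": "j",
    "H": "hʔ",
}
SEGMENT_TO_CLASS = {
    segment: label
    for label, segments in SOUND_CLASSES.items()
    for segment in segments
}


def consonant_class_key(word, length=2):
    """Return the classes of the first `length` consonants of a word.

    Vowels and unknown symbols are skipped; short words are padded with "-".
    """
    classes = [SEGMENT_TO_CLASS[s] for s in word if s in SEGMENT_TO_CLASS]
    return "".join(classes[:length]).ljust(length, "-")


def cluster_by_consonant_classes(forms):
    """Group languages whose forms share the same consonant-class key.

    `forms` maps a language name to a phonetic form of one concept.
    """
    clusters = defaultdict(list)
    for language, form in forms.items():
        clusters[consonant_class_key(form)].append(language)
    return dict(clusters)


# Illustrative forms for the concept "hand" (rough transcriptions).
hand = {"German": "hant", "English": "hænd", "Dutch": "hɑnt", "Spanish": "mano"}
clusters = cluster_by_consonant_classes(hand)
```

Because every word is reduced to a short key and grouping is a single pass over the data, no pairwise comparison of words is needed, which is what makes this kind of approach fast. The price is precision: the sketch ignores everything after the second consonant.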

Prediction experiment on Kho-Bwa language data

2018-10-08T12:00:00+00:00

During the last weekend, Tim Bodt, Nathan Hill, and I submitted a preregistration with the Open Science Framework. In this experiment, we used computer-assisted techniques to predict the most likely pronunciations for words so far missing in Tim's corpus on Kho-Bwa languages. During fieldwork in November, Tim will try to verify how good our predictions are. The preregistered version of the experiment can be found here.

For the prediction, a new method for sound correspondence pattern inference was used, which I developed in close collaboration with Nathan Hill during the last three years. Algorithm and code are now also available, both in a preprint (which you can find here) and with GitHub (lingpy/lingrex).

New blog post on morphological annotation

2018-10-10T12:00:00+00:00

Today, I published my first blog post on our project weblog, in which I propose a workflow for enhancing wordlists with morphological information. You can find the blogpost here.

Promiscuity of words

2018-10-15T12:00:00+00:00

It is surprising how many of the words in our languages are composed of other words. It is also surprising that linguistics has not yet come up with a term for the fact that it is specifically certain concepts that denote word forms which are then frequently reused throughout the lexicon of a language. I discuss this briefly in my monthly German blog post, titled Von Wortfamilien und promiskuitiven Wörtern.

Cross-Linguistic Data Formats

2018-10-17T12:00:00+00:00

Our paper describing the basic ideas underlying the Cross-Linguistic Data Formats initiative was finally published with the Scientific Data journal:

The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and therefore not amenable for comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts, and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.

You can find the paper here.

Call for abstracts: Proposal for a Workshop on CALC at SLE 2019

2018-10-19T12:00:00+00:00

We plan on submitting a workshop proposal for computer-assisted language comparison at the SLE 2019 meeting in Leipzig. You can find the detailed call for papers here.

Blogpost on Structural Data in Historical Linguistics

2018-10-22T12:00:00+00:00

Not all people will agree with my view, but I see the use of structural data as problematic when trying to either do phylogenetic reconstruction or to infer so far undemonstrated genetic relationships among languages. I have summarized my arguments in a blogpost, entitled Controversies about structural data in historical linguistics.

Blogpost on the History of Concept Lists

2018-10-31T12:00:00+00:00

Today, a new blogpost appeared, this time presenting some kind of a "making of" of Concepticon, where I present ideas "Towards a history of concept list compilation in historical linguistics" in the blog History and Philosophy of the Language Sciences. I also formatted the text and submitted a PDF version to Zenodo, which you can find here.

Tutorial on inferring consonant clusters with LingPy

2018-11-07T12:00:00+00:00

Today I published a new blog post in our tutorial blog, this time explaining how consonant clusters that recur in the languages of the world can be inferred with help of LingPy, based on data derived from the CLICS database. The tutorial can be found here.

Yunfan's article on relativisation in Khroskyabs

2018-11-09T12:00:00+00:00

My article, entitled Relativisation in Wobzi Khroskyabs and the integration of genitivisation, is going to appear in the upcoming issue of Linguistics of the Tibeto-Burman Area. In this article, the relativisation strategies in Wobzi Khroskyabs are described, and the historical pathway of the genitive relativisation in this language is hypothesised. You can view this article by clicking on the link above.

Blogpost on language decay

2018-11-12T12:00:00+00:00

Today, I published my monthly German blogpost, this time discussing the question of language decay: Wer hat Angst vorm Sprachverfall?.

New blog post on semantic aspects of word formation

2018-11-19T12:00:00+00:00

Today, I published my second blog post on our project weblog. This time I talk about a semasiological approach to the study of word formation and argue for the use of a new term to denote the idea of concept-based type-frequency. You can find the blogpost here.

Blogpost on structural data and accepted paper

2018-11-26T12:00:00+00:00

Together with Guido Grimm, I have published a new blog post devoted to the question of structural data in historical linguistics and how it should be used. You can find the blog here.

Furthermore, I was really pleased when I heard that my paper for the automatic inference of sound correspondence patterns was now officially accepted by the Computational Linguistics journal. I just submitted my final author edits, and uploaded the draft here. The code has now also been officially released, and you can test the lingrex package yourself, if you want.

Submitted and accepted papers

2018-11-29T12:00:00+00:00

Together with Nathan W. Hill and Christopher Foster, I submitted a paper on rhyme annotation in Chinese historical phonology and beyond. You can find the draft here.

In addition, our paper presenting a database of cross-linguistic transcription systems has now finally been accepted for publication with the Yearbook of the Poznań Linguistic Meeting. We have now submitted our final version, which is also available online here. The source code accompanying this paper has now also been released and can be found on GitHub at cldf/clts. In addition, you can inspect the data here.

New blog posts

2018-12-10T12:00:00+00:00

Last week, I published my monthly blog post in German, this time devoted to questions of open research, titled "Von hupenden Radlern und schludrigen Wissenschaftlern".

Today, another blog post appeared in our blog on tutorials for computer-assisted language comparison, this time devoted to Merging datasets with LingPy and the CLDF curation framework.

SLE-2019 Workshop on Computer-Assisted Language Comparison

2018-12-17T12:00:00+00:00

I am very pleased to announce that the workshop on computer-assisted language comparison, which I submitted in November, has been accepted for the annual SLE conference next year in Leipzig. I just submitted the final version of our workshop description, which you can also find here.

Here is the abstract of the workshop:

The workshop invites papers that deal with computer-assisted (as opposed to pure computational or pure qualitative) approaches to historical and typological language comparison. Computer-assisted approaches are hereby understood as procedures involving different stages of qualitative and quantitative data analysis, ranging from the initial preparation of lexical or structural data, via automatic or manual annotation, up to qualitative or quantitative analysis, that yield a specific result, be it a linguistic reconstruction system linking proto-forms to aligned reflexes, a phylogeny that lists inferred word histories, or tools for exploratory data analysis. By focusing on computer-assisted approaches, we hope to foster a more intensive collaboration between classical and computational linguists. In addition to detailed descriptions of concrete tasks in historical and typological language comparison, we also encourage submissions dealing with data standards enhancing data sharing and reuse, as well as the presentation of purely qualitative approaches for which no computational solutions exist so far.

If you are working on topics that seem apt for this workshop, consider applying, by submitting an abstract for the SLE conference, where you specify our workshop (see here).

Updates

2019-01-30T12:00:00+00:00

I haven't shared many updates in January, as I was on holiday until the first half of January. In the meantime, however, a couple of blog posts appeared, and I'll quickly summarize them now.

First, already in December, an English blog post on «Patterns, processes, abduction, and consilience» appeared on The Genealogical World of Phylogenetic Networks. In this post, I discuss questions of what we can know and what we can infer from the data and the patterns we observe in historical linguistics. The post can be found here.

Second, earlier in January, I wrote a German blog post, discussing problems of fake news, fiction, and the potential crisis in journalism and science, which you can find here.

Third, another English post appeared on the phylogenetic networks blog, where I discuss Future challenges for computational diversity linguistics. I present 10 different problems, and I will try to comment on each of these 10 problems in more detail during the next 10 months.

Fourth, I decided to start a new series of tutorial posts in our CALC blog. The idea, presented in the introductory post, is to create something like a «Primer on automatic inference of sound correspondence patterns», presenting how the algorithm I present in a forthcoming paper can be used in practice.

Two more blogposts

2019-02-25T12:00:00+00:00

One month has passed since I last shared news. I have not been idle in the meantime, but did not find time in all my work to share any updates. Besides, I was writing papers, which are now under review, and will be shared online in due time, once I find time to prepare them in form of preprints.

I just would like to announce two more blogposts that I have written in February, the first one, in German, is devoted to potential errors in science which can -- nevertheless -- improve our knowledge, titled Darwin's Finkenschnäbel und der Nutzen des Irrtums. The second post follows up on my 10 open problems for diversity linguistics, and discusses why the problem of Automatic Morpheme Segmentation is such a huge problem for historical linguistics, and how it might be that we could tackle it in the future. This blog is titled.

Tutorial Blogpost

2019-02-28T12:00:00+00:00

This week, I wrote another small blogpost, a primer on automatic sound correspondence pattern inference, or, more properly, a second post discussing the topic, this time showing how data from the Benchmark Database of Phonetic Alignments can be harvested and directly analyzed with EDICTOR. The post can be found here.

German Blogpost on Plagiarism

2019-03-14T12:00:00+00:00

On Sunday, I published my monthly German blogpost, this time discussing the problem of plagiarism in science: Von falschen Originalen und echten Kopien.

English blogpost on automatic borrowing detection

2019-03-25T12:00:00+00:00

The second problem in my series on open problems in computational diversity linguistics deals with the problem of automatic borrowing detection. While this may not seem to be per se a hard one, I think it is a huge problem specifically because there are not even standardized procedures in classical, qualitative historical linguistics for this task. You can find the blogpost here.

Tutorial blogpost

2019-03-28T12:00:00+00:00

In the third part of a series of tutorial blogposts on the automatic inference of sound correspondence patterns across multilingual wordlists, I present how the Python code of the LingRex software package can be applied to the data of the TPPSR. You can find the post here.

Blogpost on pyconcepticon

2019-04-01T12:00:00+00:00

We published the first of a series of blog posts on how to use pyconcepticon, both as a library and as a command-line tool, for the semi-automatic mapping of concept lists to Concepticon. This first blog post guides the readers through the command line tool, hinting at the internals of the library that will be explored in more detail in the following post (to be published next week).

You can find the post here.

German blogpost on translation

2019-04-15T12:00:00+00:00

I just published my monthly German blogpost for April, this time discussing questions of translation, specifically literal and adequate translation. You can find the post, titled «Wörtlichkeit, Freiheit, Adäquatheit und die Aufgabe der Übersetzer» here.

Concepticon 2.0 released

2019-04-18T12:00:00+00:00

While working during the last weeks, I completely forgot to announce that Concepticon has now been released in version 2.0, this time comprising as many as 240 different concept lists, with many mappings refined compared to earlier versions. Have fun exploring the resource at https://concepticon.clld.org.

And while I am at it, I also forgot to mention that version 1.2 of the Cross-Linguistic Transcription Systems has now also been released, and you will find it at https://clts.clld.org.

I am very thankful to all colleagues involved in the preparation of these sources.

New Blogpost and Reference Browser

2019-04-29T12:00:00+00:00

Following up on open problems in computational diversity linguistics, my English blogpost for April now discusses the third problem, the induction of sound laws, which has been largely neglected both in the classical and the computational literature. The post can be found here.

In addition, I would like to announce a new tool that I have created recently. I call it EvoRef, and the tool offers currently 4669 distinct quotes (including abstracts and comments) from 2383 different references on topics in historical linguistics, language typology, and evolution. The tool is organized in such a way that many of the references can already be found in EvoBib, although they may be occasionally missing. The tool can be used to search for my specific interpretation of linguistic literature, since it offers the keywords that I give to work I cited. As my original database also contains specific comments and evaluations, which I do not necessarily want to share in public, this official version only offers the raw quotes with comments being hidden. Given the huge number of inter-linked resources, also with occasional translations of non-English resources into English, I hope it will be useful for those interested in topics on language evolution and historical linguistics. You can find the tool at http://calc.digling.org/evoref/.

New Paper on Sino-Tibetan Phylogenies in PNAS

2019-05-07T12:00:00+00:00

After four years of hard work, our study on the phylogeny and age of the Sino-Tibetan language family has finally appeared in PNAS. The article can be found here. A short video, in which I introduce our major findings, can be found here. Our press release presenting some details of the study is available from this link, offered also in different translations.

To summarize our findings, here is what the abstract of the paper says:

The Sino-Tibetan language family is one of the world’s largest and most prominent families, spoken by nearly 1.4 billion people. Despite the importance of the Sino-Tibetan languages, their pre-history remains controversial, with ongoing debate about when and where they originated. To shed light on this debate we develop a database of comparative linguistic data, apply the linguistic comparative method to identify sound correspondences and establish cognates. We then use phylogenetic methods to infer the relationships among these languages and estimate the age of their origin and homeland. Our findings point to Sino-Tibetan originating with north Chinese millet farmers around 7200 B.P. and suggest a link to the late Cishan and the early Yangshao cultures.

The paper was based on a large collaborative effort, involving teams from Paris (Guillaume Jacques and Laurent Sagart from the CRLAO and Robin Ryder and Valentin Thouzeau from the Université Paris-Dauphine) and Jena (Simon J. Greenhill and Yunfan Lai). In addition, many people helped us in collecting the data, which can be freely accessed on Zenodo, or even directly inspected through the EDICTOR software.

Apart from the co-authors, I am also very thankful to the numerous contributors who shared data, and to the reviewing process, which was professional, challenging, and extremely fair.

With this study, we hope to contribute to the ongoing debate regarding the origin and spread of the Sino-Tibetan languages. Three teams were working in parallel on this question, with one study published in Nature two weeks ago, and one in preparation (preliminary results will be presented at a conference this week).

Radio interview and another paper accepted

2019-05-10T12:00:00+00:00

Our paper in PNAS on the history of the Sino-Tibetan languages seems to have caught some media attention. As a result, I was asked to give a short interview on the matter, which already appeared on Tuesday in Deutschlandfunk, but is still available from their archive (see below on the website).

Furthermore, our paper on rhyme annotation, common work with Nathan W. Hill and Christopher Foster, which has been under review for some time, has now been accepted. Our final draft before it goes to production can be found here.

Two blogposts and a paper accepted

2019-05-14T12:00:00+00:00

Two more blogposts appeared this week. First, I decided to start writing a series on the background of our Sino-Tibetan Database of Lexical Cognates, which you can find here. Second, I published a German blogpost on the importance of baselines and benchmarks (gold standards) for testing and training of algorithms, which you can find here. In addition, a paper I wrote with Taraka Rama on fast cognate detection and fast phylogenetic reconstruction was now accepted as a long paper for the ACL conference this year. We'll still have to finalize the paper itself according to reviewer suggestions, but will upload a preprint as early as possible. The code for my fast cognate detection method can be found here.

New blogpost series on biological methods in linguistics

2019-05-15T12:00:00+00:00

This week we start a new blogpost series, focusing on metaphors and methods shared by historical linguistics and evolutionary biology. You can find the first post here.

Blog post and paper accepted

2019-05-27T12:00:00+00:00

My English blogpost for May discusses phonological reconstruction as my fourth open problem in computational diversity linguistics and can be found here. Furthermore, a paper reviewing automated methods for contact inference in historical linguistics has now been officially accepted by Language and Linguistics Compass. You can find my most recent draft here.

Two new papers accepted

2019-06-06T12:00:00+00:00

I was very glad when I heard that the paper I wrote with Taraka Rama on "An automated framework for fast cognate detection and Bayesian phylogenetic inference in computational historical linguistics" has been accepted as a long paper for the ACL 2019 conference. Our preprint can be found here, and the code for the cognate detection algorithm can be found at GitHub/lingpy/bipskip.

At the same time, I was also notified that my paper with Tim Bodt on "Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages" has also been accepted by the journal Papers in Historical Phonology. The preprint can be found here, and the code has been registered with the Open Science Framework.

New blog post and papers

2019-06-20T12:00:00+00:00

Already on Sunday I published a new German blog post, this time discussing the phenomenon of epenthesis and other sound change types in German and other languages, titled «Wissentschaft und Abenbrot: Einschübe und Aussetzer im Sprachwandel», available here.

Furthermore, two more drafts of accepted papers have been added to my list of forthcoming papers. The first draft is a comment on a forthcoming paper by Gerhard Jäger in the journal Theoretical Linguistics, discussing questions of comparing reconstruction systems. The draft, titled «Beyond Edit Distances: Comparing linguistic reconstruction systems» can be found here.

The second draft is a paper to appear in the Bulletin of Chinese Linguistics, written together with Nathan W. Hill, presenting a new idea of handling Chinese character formation processes in the reconstruction of Old Chinese phonology. This draft, titled «Using Chinese character formation graphs to test proposals in Chinese historical phonology» can be found here.

New blog post, new Python library, and new preprint

2019-06-25T12:00:00+00:00

Today, my monthly English blog post appeared, discussing problem number 5 of my list of open problems in computational diversity linguistics, this time devoted to the "Simulation of lexical change", which you can find here.

Furthermore, I released a beta-version of the PoePy library, a Python library devoted to quantitative tasks in the investigation of poetry, available on GitHub.

Last but not least, Justin Power, Guido Grimm, and I finally managed to submit our pilot study on sign language evolution, titled "Evolutionary dynamics in the dispersal of sign languages". The preprint can be found here.

Talk at National Taiwan University

2019-06-27T12:00:00+00:00

Tomorrow, June 28, 2019, Tiago Tresoldi, Mei-Shin Wu, and Nathanael E. Schweikhard will give a talk titled "Fundamentals of Computer-Assisted Language Comparison" at the National Taiwan University (NTU), in Taipei (Taiwan). They will present an introduction to computational methods of language comparison, a discussion of the software, methods, and interfaces developed by the CALC group, and illustrations of data annotation and modeling, followed by a session for questions and answers and practical demonstrations.

New paper on Cross-Linguistic Transcription Systems appeared

2019-07-02T12:00:00+00:00

Yesterday, our paper on Cross-Linguistic Transcription Systems finally appeared online. In this paper, we explain how we established the Cross-Linguistic Transcription Systems (CLTS) database, which links different transcription systems and transcription datasets to a unified set of sounds, which are defined by a feature system. The paper, titled A cross-linguistic database of phonetic transcription systems, coauthored by Cormac Anderson, Tiago Tresoldi, Thiago Chacon, Anne-Maria Fehn, Mary Walworth, Robert Forkel, and myself, introduces the database and also discusses general ideas with respect to standardization efforts and traditions for linguistic transcription systems.

New paper and new releases

2019-07-10T12:00:00+00:00

Today, I released EvoBib, version 0.26, with now 3404 bibliographic entries, and EvoRef, version 0.3, with now 4835 quotes from 2515 references. In addition to accessing the data via the web interfaces, they can also be downloaded from Zenodo, via this link for EvoBib and this link for EvoRef.

In addition, a paper that I wrote almost three years ago with Guillaume Jacques has now finally appeared. This paper, titled «Save the trees» discusses the advantages of tree models in historical linguistics. Due to open access restrictions, I can only offer the preprint of the paper, which has been available for download for quite some time from this link. A refined version can be found here.

New paper, new releases, and new blogpost

2019-07-29T12:00:00+00:00

Today, a paper by Taraka Rama and me, titled "An Automated Framework for Fast Cognate Detection and Bayesian Phylogenetic Inference in Computational Historical Linguistics" appeared online in its final version, and you can find it here.

Furthermore, we released version 2.1 of Concepticon, now offering concept links to 250 different concept lists.

Finally, a blogpost devoted to the problem of the simulation of sound change, as part of my series on open problems in computational diversity linguistics, appeared today, and you can find it here.

Special issue of the OCAF conference published

2019-07-30T12:00:00+00:00

It is my great pleasure to announce that the Journal of Language Relationship has now published our special issue that reflects some of the work presented in our Old Chinese and Friends conference. All articles are freely available and can be found here. I am very thankful to George Starostin for the fantastic work done as an editor of this issue, with all articles being thoroughly checked and adjusted to journal style guides by him, and to Yunfan Lai for help in organizing both the conference and submission and reviews.

Interview on Juggling and Science

2019-08-13T12:00:00+00:00

An interview, in which I talk about juggling and science, appeared three days ago in the Chinese online journal Zhìshifènzi (intellectuals). When reading this interview with help of automated translation, it nicely illustrates the limitations that computational approaches still have. My Chinese name Yóuhán 游函, which I chose back in 2005 because the pronunciation comes so close to my first name, is consequently translated as "travel letter" or similar, because the name is not recognizable as a standard name by the translation software. If you check the same interview (but with a few errors in the text already corrected) on the We-Chat platform, you can also see a recent video in which I perform the pirouette with five clubs in a gym in Berlin. While this has nothing to do with science, I see it as one of the factors that allow me to pursue my research: Juggling is excellent for preventing pain in the back, resulting from sitting for too long a time in front of a computer. Therefore, the more I juggle in my free time, the more I can sit and program in my working time.

New Blog Posts and Workshop

2019-08-19T12:00:00+00:00

Yesterday and today, I published two new blogposts. My German blogpost deals with predictions in the humanities, and specifically in linguistics, titled «Und nun zur Wörtervorhersage...»: Vorhersagen in der Sprachwissenschaft. The other blogpost is a tutorial on alignment analyses with LingPy and custom scoring functions based on the CLTS feature system, titled Feature-Based Alignment Analyses with LingPy and CLTS (1), and will be followed up by one or two more posts which present a full-fledged algorithm devoted to the topic.
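To give a rough impression of what a feature-based scoring function for alignments can look like, here is a minimal sketch in plain Python. It neither uses LingPy nor the CLTS feature system: the tiny `FEATURES` table, the function names, and the gap penalty are assumptions made for illustration. Segments are scored by the overlap of their feature sets, and the score is plugged into a standard Needleman-Wunsch global alignment.

```python
# Toy feature sets for a handful of sounds (hypothetical, NOT the CLTS features).
FEATURES = {
    "t": {"consonant", "alveolar", "stop", "voiceless"},
    "d": {"consonant", "alveolar", "stop", "voiced"},
    "k": {"consonant", "velar", "stop", "voiceless"},
    "g": {"consonant", "velar", "stop", "voiced"},
    "a": {"vowel", "open", "central", "voiced"},
}


def feature_score(a, b):
    """Score two segments by feature overlap, scaled to the range [-1, 1]."""
    shared = FEATURES[a] & FEATURES[b]
    combined = FEATURES[a] | FEATURES[b]
    return 2 * len(shared) / len(combined) - 1


def align(seq_a, seq_b, gap=-1.0):
    """Globally align two segment lists (Needleman-Wunsch) with feature scores."""
    n, m = len(seq_a), len(seq_b)
    matrix = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = matrix[i - 1][0] + gap
    for j in range(1, m + 1):
        matrix[0][j] = matrix[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + feature_score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] + gap,
                matrix[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and matrix[i][j] == (
            matrix[i - 1][j - 1] + feature_score(seq_a[i - 1], seq_b[j - 1])
        ):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or matrix[i][j] == matrix[i - 1][j] + gap):
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], matrix[n][m]


aligned_a, aligned_b, score = align(["t", "a"], ["d", "a", "g"])
```

With these toy features, sounds that differ only in voicing still receive a positive score, so `t` is matched with `d` and the gap is placed opposite the final `g`.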

Furthermore, our workshop on «Computer-assisted approaches in historical and typological language comparison» will be organized as part of the annual conference of the SLE (2019). For those interested in the specific speakers of this workshop, I made a small workshop website which shares the abstract, gives some information on the full description of the workshop, and also summarizes the speakers, their titles, and provides direct links to their abstracts at the official SLE website.

New article and blog post and past workshop

2019-08-26T12:00:00+00:00

My article, discussing currently available methods for the automated studying of language contact, has now appeared in the journal Language and Linguistics Compass, titled Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Unfortunately, I could not afford the high costs for direct open access with the publisher. As a result, I placed my final version before copy-editing on Humanities Commons.

Furthermore, I managed to stick to my self-made promise and discuss the seventh problem of computational diversity linguistics in the eighth month of the year. This month, I discuss questions regarding the Statistical proof of language relationship.

Last but not least, our workshop on Computer-assisted approaches to historical and typological language comparison, which was held last week as part of the Annual Meeting of the SLE in Leipzig, turned out to be very nice, with a lot of presentations devoted to very different aspects of computer-assisted research.

Interview on Computer-Assisted Language Comparison with SysBlok.ru

2019-09-01T12:00:00+00:00

Last week, the Russian portal Sistemny Blok published an interview, in which we discussed computer-based, computer-assisted, and general language comparison, as well as the benefits of juggling for doing science. The interview, which was conducted in English and then translated into Russian, can be found here. I am very thankful to all involved in this, specifically Mariana Zorkina, who interviewed me. Furthermore, this enterprise helped me to find a new scientific blog with very interesting content, and I recommend that all who read Russian have a look at SysBlok.ru.

New Blog Posts in the CALC Blog

2019-09-20T12:00:00+00:00

This week, we published two new blog posts in our CALC blog, both follow-ups to series that were started earlier this year, one by myself, devoted to Feature-Based Alignment Analyses with LingPy and CLTS (2), and one by Nathanael E. Schweikhard, discussing Biological metaphors and methods in historical linguistics (2): Words and genes.

Additionally, Tim Bodt, who has already visited our group several times, has now published all material from his trip to Nepal, during which he collected material on the Kusunda language. This trip was to a small part supported by our project, and Tim thanked us for the support by collecting a 250-item wordlist of Kusunda that we can compare with our database on Sino-Tibetan languages. You can find a summary of the work (published by Aaley and Bodt), along with all links to the original data, which is free for download, here.

New Blog Posts on Open Problems in Computational Diversity Linguistics

2019-10-02T12:00:00+00:00

This week, a new blogpost discussing my 8th problem of computational diversity linguistics appeared, this time focusing on the typology of semantic change.

New Releases

2019-10-14T12:00:00+00:00

Today, there is news to share with respect to new releases. First, there is another blogpost in German, introducing the EvoBib reference browser, which offers references and citations for more than 3000 articles and books related to historical linguistics, language contact, and linguistic typology, and which has now been officially released. The blogpost, titled «Wissensmanagment», quickly introduces the tool, and the tool itself can be browsed online or downloaded from Zenodo or GitHub.

Furthermore, we released LingPy, version 2.6.5, which is now officially available through all typical channels.

New Blogpost

2019-10-29T12:00:00+00:00

Yesterday, my 9th blogpost on Open Problems in Computational Diversity Linguistics appeared, this time discussing the problems of establishing a typology of sound change processes. The blogpost, which is rather long this time, can be found here.

New Blogpost

2019-10-30T12:00:00+00:00

Yesterday I published a new post in our blog, discussing the importance and the advantages of the new approach to linguistic data that we are constantly proposing. I illustrate these advantages by describing how I could reuse the data from CLICS, itself reusing data from Lexibank, to build a simple matrix and graph of semantic distance. The post can be found here.

Concepticon 2.2 and a new blogpost

2019-11-12T12:00:00+00:00

Last week, we announced version 2.2 of the Concepticon. The new version now includes as many as 275 different concept lists linked to our unified concept sets.

I also wrote a new German blogpost, this time about the review process, in which I ask "Wer begutachtet eigentlich die Gutachter?" (Who reviews the reviewer, after all?), and which you can find here.

CLICS-3

2019-11-18T12:00:00+00:00

Last week, we published the third big version of our Database of Cross-Linguistic Colexifications, CLICS³, available at clics.clld.org. In this version, we managed to double the number of languages and we also drastically increased the number of concepts. Many people helped in different ways to acquire the data. In order to make sure we acknowledge all of them, we prepared a CONTRIBUTORS.MD file on GitHub, in which you can see past and present editors, as well as all who have helped to contribute to the collection of the data of CLICS. Many thanks to all who helped to establish CLICS, in the past, and specifically also for version 3.

New Blog Post

2019-11-20T12:00:00+00:00

Today I published the third blog post in our series on biological metaphors in linguistics. This time I am contrasting the processes involved in language change and in genetic evolution which cause differences or similarities between related or unrelated words or genes. You can read the blog post here.

Blogposts and other things

2019-11-25T12:00:00+00:00

Last week was a very busy week, with papers that had to be prepared and talks that had to be presented. Nevertheless, we managed to put the blogposts from 2018 online and share them with Humanities Commons. First, the contributions to our blog on Computer-Assisted Language Comparison in Practice are now available here. Second, my contributions to the Genealogical World of Phylogenetic Networks for 2018 can be retrieved from this link.

In addition, I managed to submit a study on inter-linear-glossed text and our attempts to retro-standardize linguistic data. This study, carried out together with Nathaniel A. Sims and titled "Towards a sustainable handling of inter-linear-glossed text in language documentation", has now also been posted on Humanities Commons in the form of a preprint and can be found here.

Last but not least, my 11th blogpost in the Genealogical World of Phylogenetic Networks appeared today, this time dealing with the 10th (and last) problem in computational diversity linguistics. This post, discussing the "Typology of semantic promiscuity", can be found here.

News before the end of the year

2019-12-17T12:00:00+00:00

Although quite a few things happened recently, I did not find much time to update the news feed, and I was surprised that my last update was made in November.

Anyway, two blogposts appeared yesterday: my final post in the series on Open Problems in Computational Diversity Linguistics (available here), and a German blog post that elaborates on the topic of patience and how much impatience is needed in scientific research (Trotz der Ungeduld: Jonglieren im Wind).

In the week before, a paper discussing how to compare reconstruction systems appeared, titled "Beyond Edit Distances: Comparing linguistic reconstruction systems". The preprint is available online here.

Two papers were accepted by now: our paper on CLICS, Version 3, with Scientific Data and many coauthors (preprint here), and our paper on Sign language evolution with Justin Power and Guido Grimm with Royal Society Open Science (preprint here).

This is all the news for December, but it is possible that there will still be updates during this month.

New Paper on Emotion and Colexification in Science

2019-12-20T12:00:00+00:00

The Database of Cross-Linguistic Colexifications is one of the most prominent outputs of the CALC project and the Department of Linguistic and Cultural Evolution, since it combines our interest in standardization, aggregation of cross-linguistic lexical data, graph-based approaches for exploratory data analysis, and interactive visualization tools.

Thanks to a very fruitful collaboration with psychologists from the University of North Carolina, it could now also be shown that the data in CLICS has the potential to provide essential evidence for questions related to human cognition. The study shows that emotion semantics vary across language families, but that there is a certain common core of similarities that can be used as an explanandum of certain structures across all cultures:

Many human languages have words for emotions such as “anger” and “fear,” yet it is not clear whether these emotions have similar meanings across languages, or why their meanings might vary. We estimate emotion semantics across a sample of 2474 spoken languages using “colexification”—a phenomenon in which languages name semantically related concepts with the same word. Analyses show significant variation in networks of emotion concept colexification, which is predicted by the geographic proximity of language families. We also find evidence of universal structure in emotion colexification networks, with all families differentiating emotions primarily on the basis of hedonic valence and physiological activation. Our findings contribute to debates about universality and diversity in how humans understand and experience emotion.

The paper, titled Emotion semantics show both cultural variation and universal structure, by Joshua Jackson, Joseph Watts, Teague Henry, myself, Peter Mucha, Robert Forkel, Simon Greenhill, Russell Gray, and Kristen Lindquist, has now appeared in Science.
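
To make the notion of colexification from the abstract above more concrete, here is a minimal sketch in Python. It is not the CLICS pipeline or the method of the paper: the function name `colexification_counts`, the triple-based input format, and the toy word forms are all invented for illustration. The sketch counts, for each pair of concepts, in how many languages a single word form expresses both.

```python
from collections import defaultdict
from itertools import combinations


def colexification_counts(wordlist):
    """Count, per concept pair, the languages in which one form expresses both.

    `wordlist` is an iterable of (language, concept, form) triples
    (a hypothetical input format chosen for this sketch).
    """
    # Collect all concepts that share the same form within one language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    # Every pair of concepts sharing a form is colexified in that language.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


# Invented toy data: three imaginary languages and made-up word forms.
TOY_WORDLIST = [
    ("LangA", "ANGER", "taka"),
    ("LangA", "ENVY", "taka"),
    ("LangA", "FEAR", "miru"),
    ("LangB", "ANGER", "solo"),
    ("LangB", "ENVY", "solo"),
    ("LangB", "FEAR", "solo"),
    ("LangC", "ANGER", "pim"),
    ("LangC", "FEAR", "pam"),
]

counts = colexification_counts(TOY_WORDLIST)
for pair, number in sorted(counts.items()):
    print(pair, number)
```

The resulting counts can be read as weighted edges between concepts, which is the basic idea behind a colexification network; real analyses additionally have to deal with issues such as homophony and unevenly sampled language families.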

New paper introducing the CLICS database and a new blogpost

2020-01-13T12:00:00+00:00

Today, a new paper by our group and colleagues from the DLCE appeared in the journal Scientific Data, in which we present the third installment of our CLICS database.

Advances in computer-assisted linguistic research have been greatly influential in reshaping linguistic research. With the increasing availability of interconnected datasets created and curated by researchers, more and more interwoven questions can now be investigated. Such advances, however, are bringing high requirements in terms of rigorousness for preparing and curating datasets. Here we present CLICS, a Database of Cross-Linguistic Colexifications (CLICS). CLICS tackles interconnected interdisciplinary research questions about the colexification of words across semantic categories in the world’s languages, and show-cases best practices for preparing data for cross-linguistic research. This is done by addressing shortcomings of an earlier version of the database, CLICS2, and by supplying an updated version with CLICS3, which massively increases the size and scope of the project. We provide tools and guidelines for this purpose and discuss insights resulting from organizing student tasks for database updates.

The paper, titled "The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies" which involves a lot of co-authors and particularly many people from our CALC team, can be found here.

In addition, I published a new German blogpost in which I discuss the Sapir-Whorf hypothesis in the light of cross-linguistic data, which you can find here.

New paper on sign languages and new blogpost on concept mapping

2020-01-22T12:00:00+00:00

Today, a new paper by Justin Power, Guido Grimm, and myself appeared, discussing the dispersal of sign language manual alphabets:

The evolution of spoken languages has been studied since the mid-nineteenth century using traditional historical comparative methods and, more recently, computational phylogenetic methods. By contrast, evolutionary processes resulting in the diversity of contemporary sign languages (SLs) have received much less attention, and scholars have been largely unsuccessful in grouping SLs into monophyletic language families using traditional methods. To date, no published studies have attempted to use language data to infer relationships among SLs on a large scale. Here, we report the results of a phylogenetic analysis of 40 contemporary and 36 historical SL manual alphabets coded for morphological similarity. Our results support grouping SLs in the sample into six main European lineages, with three larger groups of Austrian, British and French origin, as well as three smaller groups centring around Russian, Spanish and Swedish. The British and Swedish lineages support current knowledge of relationships among SLs based on extra-linguistic historical sources. With respect to other lineages, our results diverge from current hypotheses by indicating (i) independent evolution of Austrian, French and Spanish from Spanish sources; (ii) an internal Danish subgroup within the Austrian lineage; and (iii) evolution of Russian from Austrian sources.

The paper, titled "Evolutionary dynamics in the dispersal of sign languages", can be found here.

In addition, I published a new tutorial blogpost in which I show how large datasets can be easily linked to our Concepticon data, which you can find here.

New beginner's guide for Concepticon contribution

2020-01-29T12:00:00+00:00

Today I published a blog post that contains step-by-step instructions for adding concept lists to Concepticon. The goal of this post is to give helpful tips for the contribution process in our project. The post can be found here.

New blogpost on emotion concepts

2020-02-05T12:00:00+00:00

Already on Monday last week, a new blog post appeared in which I discussed the Sapir-Whorf hypothesis in the light of the article on emotion concepts which appeared in December last year.

The blog post, titled "From words to deeds" can be found here.

Workshop on Reproducible Research and Data Management

2020-02-05T12:00:00+00:00

Johann-Mattis List and I were involved in a successful workshop on Reproducible Research and Data Management that took place last week at the Max Planck Institute for the Science of Human History in Jena.

The workshop was open to the entire academic community. We collaborated with colleagues from our department and from the Department of Archaeogenetics to introduce participants to command-line usage and Bash, Git and GitHub, and reproducible research in general. The linguistic sessions focused on the reference catalogs used for most of our research (Glottolog, Concepticon, and CLTS), on Lexibank, and on orthographic profiles, ending with a hands-on session on Lexibank. Christoph Rzymski provided invaluable help, teaching Git and explaining the rationale for CSV(W) and CLDF, and we were joined by Simon J. Greenhill when presenting Lexibank to the general public.

Our slides will be put online in the next days, linked from the Workshop's page. The first presentation is here.

Sign language research featured on title page of Süddeutsche Zeitung

2020-02-07T12:00:00+00:00

We knew that an article would feature our research on sign language evolution in the Süddeutsche Zeitung. But when we saw today that it even appeared on the first page, we were really surprised. Unfortunately, the article is not yet available online, so we cannot link to it here, but it seemed interesting enough to mention.

Further good news is that the CNRS 2020 Summer School on "Semantic shifts from lexicon to grammar – diachronic and typological perspectives" was accepted and will be held on the island of Porquerolles in the south of France from 14th to 25th September 2020. I myself will teach a two-day workshop on computational methods. More information can be found on the official website.

German blog post for February

2020-02-10T12:00:00+00:00

I just wrote my monthly German blog post for February, this time devoted to the question of language universals, the language faculty, and our work on sign language evolution. You can find the post here.

New paper accepted and new version of EvoBib

2020-03-09T12:00:00+00:00

A new paper by Robert Forkel and myself has been accepted for publication. The study presents the CLDFbench package and illustrates how it can be used in order to convert datasets conveniently into the CLDF format. While the paper will only appear in May, we have uploaded our authors' copy in form of a preprint with the Humanities Commons repository, where you can find it under this link.

Additionally, I managed to release a new version of EvoBib, Version 1.1, which now contains about 100 bibliographic entries more than the previous version and also about 300 more quotes (mostly abstracts).

New paper accepted

2020-03-10T12:00:00+00:00

A new paper by myself was published last month. It introduces the DAFSA project, a Python library for generating graphs over collections of sequences that highlight recurring and redundant information. I have been using it to experiment with morphological detection in low-resource languages. The paper is available here.

New paper accepted and a new blog post

2020-03-20T12:00:00+00:00

A new paper by Nathaniel Sims, Robert Forkel and myself has been accepted for publication. The study, titled "Towards a sustainable handling of interlinear-glossed text in language documentation" presents a computer-assisted approach to study interlinear-glossed text within the computational frameworks set up along with the Cross-Linguistic Data Formats initiative. Our authors' version can be found here.

Additionally, I wrote a new German blogpost, this time dealing with the evolution of personal names. This post, titled "Evolution unchained: Die Entwicklung von Personennamen und die Grenzen der Sequenzen", can be found here.

New blog posts and new annotation tool

2020-03-26T12:00:00+00:00

Two new blogposts have appeared this week, completing the typical series of blog posts for March. The first post is an English version of my German post published earlier this month, dealing with the evolution of personal names. Titled "Evolution unchained: The development of person names and the limits of sequences", it can be found here.

The second post presents a new rhyme annotation tool, called RhyAnT, which I managed to prepare in a first draft version. This post, published with our blog on Computer-Assisted Language Comparison in Practice can be found here.

The rhyme annotation tool itself is still in flux, although a first draft version is already available at https://digling.org/calc/rhyant/. I hope to finish a stable version soon, so we can start working towards a cross-linguistic database of rhymed poetry.

New paper submitted on annotating etymological data as word trees

2020-04-01T12:00:00+00:00

This week, Mattis List and I submitted a paper in which we present a framework for the annotation of etymological relationships in a human- and machine-readable fashion. The preprint can be accessed here.

New paper accepted on a workflow for computer-assisted language comparison

2020-04-07T12:00:00+00:00

This week, an article by Johann-Mattis List, Timotheus A. Bodt, Nathan W. Hill, Nathanael E. Schweikhard, and myself was accepted by the Journal of Open Humanities Data. In the article, we present a workflow which lifts raw data to a stage where algorithms for computer-assisted language comparison can detect sound correspondence patterns across several languages. At every stage, the data can be interactively inspected and even modified, which makes this workflow truly computer-assisted. We also provide a tutorial, in which we show how to run the code and how to inspect or edit the data at all stages. The authors' copy, which we submitted to Humanities Commons, can be accessed here.

New blog posts

2020-04-14T12:00:00+00:00

I have published a new German blog post during the last weekend, titled "Was sich reimt, das frisst sich". In this post, I discuss the chances and challenges involved in the systematic annotation of rhymed poetry across languages, genres, and times. You can find the post here.

New blog post on a CLDF dataset of the Kusunda language

2020-04-21T12:00:00+00:00

I have published a new blog post with the title "New Lexical Data for the Kusunda Language" on Monday this week. In this post, I mention the challenges of collecting second-hand Kusunda lexical data and present a new Kusunda dataset which is available in CLDF format online. You can find the post here.

Research in the news and new blog posts

2020-05-15T12:00:00+00:00

In the June issue of Psychologie Heute, which has been available since Wednesday, there is a report on the work on emotion concepts (Jackson et al. 2019) carried out with the help of our database of cross-linguistic colexifications (Rzymski et al. 2020). The article can also be found online, but it is not freely available without subscription.

During last week, I found time to prepare two more blog posts for May. The first is a German post concentrating on scientific practice and some general ideas on open research within the humanities, titled "Was Wissen schafft, wird festgestellt: Gedanken zur offenen Forschung", which is available online here. The second blog post is devoted to an exploration of semantic similarity as it is represented and handled in the STARLING software package. This post, which can be found here, is accompanied by a Python software package called pysen and an interactive online application, which you can find here.

New Paper on CLDFBench Appeared

2020-05-18T12:00:00+00:00

A paper by Robert Forkel and myself, introducing "CLDFBench. Give your Cross-Linguistic data a lift" has just appeared officially as part of the (now only digital) LREC conference. Here is the abstract:

While the amount of cross-linguistic data is constantly increasing, most datasets produced today and in the past cannot be considered FAIR (findable, accessible, interoperable, and reproducible). To remedy this and to increase the comparability of cross-linguistic resources, it is not enough to set up standards and best practices for data to be collected in the future. We also need consistent workflows for the “retro-standardization” of data that has been published during the past decades and centuries. With the Cross-Linguistic Data Formats initiative, first standards for cross-linguistic data have been presented and successfully tested. So far, however, CLDF creation was hampered by the fact that it required a considerable degree of computational proficiency. With cldfbench, we introduce a framework for the retro-standardization of legacy data and the curation of new datasets that drastically simplifies the creation of CLDF by providing a consistent, reproducible workflow that rigorously supports version control and long term archiving of research data and code. The framework is distributed in form of a Python package along with usage information and examples for best practice. This study introduces the new framework and illustrates how it can be applied by showing how a resource containing structural and lexical data for Sinitic languages can be efficiently retro-standardized and analyzed.

The paper can be found here, and the code itself is hosted with GitHub at cldf/cldfbench.

New paper presents a workflow for Computer-Assisted Language Comparison

2020-05-25T12:00:00+00:00

Last week, we published a new paper "Computer-Assisted Language Comparison: State of the Art" in the Journal of Open Humanities Data. In this paper, we demonstrate our current five-stage workflow for computer-assisted language comparison which lifts raw data to a level where sound correspondence patterns across multiple languages have been identified and can be readily presented, inspected, and discussed. The paper can be found here, and the code can be found in the GitHub repository lingpy/workflow-paper. To see the real-time execution of the workflow, please visit our Code Ocean capsule.

New blogpost on rhyme networks

2020-05-26T12:00:00+00:00

Yesterday, the second blogpost out of a series of six blogposts devoted to the construction of rhyme networks planned for the next months appeared. It discusses rhyming in general and can be found here.

New Papers Appeared

2020-06-02T12:00:00+00:00

Two more studies have recently appeared. The first is a study on Chinese character formation graphs, together with Nathan W. Hill:

This paper proposes the use of network techniques in the exploration of Old Chinese phonology as reflected in the phonophoric determinatives of xiéshēng 諧聲 characters. We use the approach to examine five specific proposals in Chinese historical phonology, and whether the distinctions suggested by these proposals can be said to be recoverable on the basis of phonophoric choice. The major finding is that the type A versus type B distinction is in some cases encoded in the choice of phonophoric determinative, while other distinctions are only spuriously if at all reflected in the phonophoric subseries.

The paper is available in Open Access and can be found here.

The second study is a popular science article about the experiments on prediction in historical linguistics, which Tim Bodt and I started some two years ago. The article is available (but only for subscribers) in the journal Babel: The Language Magazine, but you can find our authors' copy here.

New blog post on markup for lexical data

2020-06-03T12:00:00+00:00

I have published a new blog post titled "Why Tag Markup may be Useful for Lexical Data". In this post, I discuss the benefits of using tag-based semantic markup instead of category-based one and propose a relatively simple way to create a tag markup by aggregating the existing categorisations provided in the data. You can find the post here.

New Blogpost

2020-06-07T12:00:00+00:00

Yesterday evening I found time to write a new German blogpost, devoted to the multiple meanings of the word machen in German. This post, which is not meant to be entirely serious, is titled Neues zum Wortfeld »machen« and is available online here.

New study appeared on rhyme data handling and analysis

2020-06-10T12:00:00+00:00

Today, a new study in which I discuss the handling of rhyme data and rhyme analysis appeared in Cahiers de Linguistique Asie Orientale.

By reviewing a recent quantitative study of rhyme patterns in Mandarin Chinese, this study shows how data handling and data analysis in the study of rhyme patterns can be improved. Suggestions for improvement include (a) a consistent annotation of rhyme data, which is exhaustive and facilitates data reuse, and (b) emphasizes the importance of automated approaches for exploratory data analysis, which can help to analyze rhyme data in an improved way, prior to applying statistical frameworks for hypothesis testing.

The study itself can be found here, but it can only be viewed with subscription. I have deposited a preprint with Humanities Commons, which you can find here.

New paper accepted on annotating word formation processes

2020-06-11T12:00:00+00:00

A new paper by Johann-Mattis List and myself has been accepted for publication by the SKASE Journal of Theoretical Linguistics. In the article, titled "Developing an annotation framework for word formation processes in comparative linguistics", we propose a new approach to the annotation of cross-linguistic etymological relations that also takes morphological processes into account. Included is a small Python library and annotated data samples from a variety of language families. You can access the preprint here.

New Blogpost

2020-06-22T12:00:00+00:00

I wrote a blogpost describing an ongoing project in which I have been developing a model of phonological distinctive features for computer-assisted language comparison. It explains the rationale behind the proposal and shows how to use a small Python library that provides access to the feature matrix without too much boilerplate code. The blogpost can be found on CALC's blog.

New blog post on rhyming

2020-07-02T12:00:00+00:00

Already on Monday this week, a new blogpost on rhyming, part three in my series "From rhymes to networks" appeared. Devoted to rhyme annotation, the post presents some general ideas and a rather simple, but efficient text-based format for the annotation of rhymes in texts. You can find the blogpost here.

New Blogpost, Preprint, and CfP

2020-07-10T12:00:00+00:00

Time passes quickly, and there are three new things to announce now. First, I have published a new blog post in German, which deals with the differential treatment in speaking, Andersbehandlung von Menschen im Sprechen.

Second, a preprint titled From Text to Thought: How Analyzing Language Can Advance Psychological Science was just submitted online. The paper by Joshua C. Jackson, Joseph Watts, myself, Ryan Drabble, and Kristen Lindquist discusses how new approaches to language analysis could be fruitfully applied in psychology in the future:

Humans have been using language for thousands of years, but psychologists seldom consider what natural language can tell us about the mind. Here we propose that language offers a unique window into human cognition. After briefly summarizing the legacy of language analyses in psychological science, we show how methodological advances have made these analyses more feasible and insightful than ever before. In particular, we describe how two forms of language analysis—comparative linguistics and natural language processing—are already contributing to how we understand emotion, creativity, and religion, and overcoming methodological obstacles related to statistical power and culturally diverse samples. We summarize resources for learning both of these methods, and highlight the best way to combine language analysis techniques with behavioral paradigms. Applying language analysis to large-scale and cross-cultural datasets promises to provide major breakthroughs in psychological science.

Last but not least, we have just launched a call for papers for a workshop on Model and Evidence in Quantitative Comparative Linguistics, organized by Gerhard Jäger (University of Tübingen) and myself as part of the annual meeting of the DGfS in February 2021. The deadline for this call is 31 August 2020. We invite submissions for 20-minute talks and even have limited travel funds available.

New Preprint for NoRaRe collection

2020-07-28T12:00:00+00:00

We have been working on a feature of the Concepticon that contains data on norms, ratings, and relations for words and concepts. The new database is called NoRaRe and currently offers 71 data sets from studies in psychology and linguistics.

You can take a look at the data in the web app or in the NoRaRe GitHub repository.

The preprint can be found here.

New Blog Posts in July

2020-08-03T12:00:00+00:00

Time keeps passing, and I have not found the time to keep up with my website, so the news now comes in condensed form. First, I initiated a new series of blog posts in our Computer-Assisted Language Comparison in Practice blog, which is called How to do X in linguistics? and will feature specific topics that are rarely taught but count among the basic tasks of a professional linguist and scientist (such as writing a review, responding to a review, or organizing one's bibliography).

Yesterday, another blog post in the series From rhymes to networks appeared, discussing this time the Automated Detection of Rhymes.

New Blog Post

2020-08-10T12:00:00+00:00

Yesterday, I published a new German blog post, which is this time less serious than usual, discussing the tendency of scientists, including myself, to trace topics back to their scientific subjects. The blog post, titled "Wovon man sprechen kann, darüber darf man auch mal schweigen", can be found here.

New Preprint on the Detection of Contact Layers

2020-08-19T12:00:00+00:00

Today, a new preprint by Abbie Hantgan, Hiba Babiker, and myself appeared online (the study itself is currently under review), discussing preliminary approaches to the detection of contact layers in the language isolate Bangime.

We present a computer-assisted, multidisciplinary, first approach to addressing this problem of detecting the layers of contact in Bangime. First, we assemble lexical evidence of contact between Bangime speakers with their neighboring languages, using a computer-assisted technique, followed by an evaluation of the materials by contrasting them with genetic findings. Specifically, we propose trajectories for Bangande settlement patterns. With this study, we lay the foundation of future collaborative work that will improve, correct, and enhance the results of this study. The original data used for the study are made available so that additional researchers may follow up on and test our hypotheses concerning contact layers in Bangime.

The preprint has been submitted to Humanities Commons and can be accessed here.

New Blog Post in the Rhyme Network Series

2020-08-24T12:00:00+00:00

Today, my fifth (out of six) blogpost in my series From Rhymes to Networks appeared, this time focusing on Constructing Rhyme Networks. While I was first a bit disappointed that I did not find enough time to annotate all of Goethe's Faust until now, I was quite happy to see that I have annotated enough German poems already to allow at least for a small demonstration on how to create rhyme networks from rhyme data on a language different from Chinese.

New Blog Post in the How-To Series

2020-09-01T12:00:00+00:00

Already last week, my first blog post in our "How to do X in linguistics" series appeared, concentrating on How to write an initial review for a journal. Here is the abstract; more can be found in the actual blog post.

Writing reviews for a journal is one of those things which most scientists never actively learn. For laypeople, this may be surprising, given how often the scientific method with its rigorous peer review procedure is being mentioned in the news nowadays. How can it be, one may ask oneself, that this procedure that is usually presented as the core principle of scientific reasoning, is never really actively taught? If the review by experts is the core of the scientific method and what decides about the acceptance of an article, how can it be that scientists do never take a course on article reviewing, and how can it be that reviewers are (as I have previously discussed in a German blogpost) themselves never reviewed or graded?

New Preprint on the Detection of Borrowings from Monolingual Wordlists

2020-09-02T12:00:00+00:00

A new preprint by John E. Miller, myself, Roberto Zariquiey, César A. Beltrán Castañón, Natalia Morozova, and Johann-Mattis List appeared online today (the paper has been submitted). We discuss the identification of borrowings from monolingual wordlists using three different statistical methods.

Native speakers are often assumed to be efficient in identifying whether a word in their language has been borrowed, even when they do not have direct knowledge of the donor language from which it was taken. To detect borrowings, speakers make use of various strategies, often in combination, relying on clues such as semantics of the words in question, phonology and phonotactics. Computationally, phonology and phonotactics can be modeled with support of Markov n-gram models or – as a more recent technique – recurrent neural network models. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages of a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in borrowing detection using only information from monolingual wordlists. Their performance is in many cases unsatisfying, but becomes more promising for strata where there is a significant ratio of borrowings and when most borrowings originate from a dominant donor language. The recurrent neural network performs marginally better overall in both realistic studies and artificial experiments, and holds out the most promise for continued improvement and innovation in lexical borrowing detection. Phonology and phonotactics, as operationalized in our lexical language models, are only a part of the multiple clues speakers use to detect borrowings. While improving our current methods will result in better borrowing detection, what is needed are more integrated approaches that also take into account multilingual and cross-linguistic information for a proper automated borrowing detection.

The preprint has been submitted to Humanities Commons and can be accessed here.
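
The idea of modeling phonotactics with a Markov n-gram model, as mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the implementation used in the study: it assumes a simple bigram model over characters with add-one smoothing, and the tiny word lists and all function names are hypothetical and serve only for illustration. One model is trained on words labeled as inherited and one on words labeled as borrowed; a new word is assigned to the class whose model yields the lower cross-entropy.

```python
import math
from collections import Counter


def train_bigram(words):
    """Count segment bigrams, padding each word with boundary symbols."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for word in words:
        segs = ["^"] + list(word) + ["$"]
        vocab.update(segs)
        for a, b in zip(segs, segs[1:]):
            bigrams[a, b] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab


def cross_entropy(word, model):
    """Average bits per transition under the model (add-one smoothing)."""
    bigrams, contexts, vocab = model
    segs = ["^"] + list(word) + ["$"]
    size = len(vocab) + 1  # one extra slot for segments unseen in training
    total = 0.0
    for a, b in zip(segs, segs[1:]):
        p = (bigrams[a, b] + 1) / (contexts[a] + size)
        total -= math.log2(p)
    return total / (len(segs) - 1)


def classify(word, native_model, loan_model):
    """Pick the class whose model finds the word less surprising."""
    if cross_entropy(word, loan_model) < cross_entropy(word, native_model):
        return "borrowed"
    return "inherited"


# Hypothetical toy training data, written orthographically for simplicity.
native_model = train_bigram(["hand", "haus", "hund", "land", "wand"])
loan_model = train_bigram(["jeans", "jazz", "job", "jet"])

print(classify("jeep", native_model, loan_model))
print(classify("sand", native_model, loan_model))
```

In a realistic setting, the words would be segmented phonetic transcriptions rather than spellings, and the training data would comprise hundreds of annotated words per language.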

New Blog Post and Preprint

2020-09-16T12:00:00+00:00

On Monday, I published a new German blog post, this time dealing with questions of "manual labor in digital times". The post, titled "Handarbeit im digitalen Zeitalter", can be found here.

Yesterday, a new preprint appeared (joint work with Hans Geisler and Robert Forkel), featuring "A digital, retro-standardized edition of the Tableaux phonétiques des patois Suisses romands (TPPSR)".

This study presents a digital, retro-standardized edition of the Tableaux Phonétiques des Patois Suisses Romands (TPPSR), an early collection of lexical dialect data of the Suisse romande, which was compiled by Louis Gauchat, Jules Jeanjaquet, and Ernest Tappolet in the beginning of the 20th century and later published in 1925. While the plan of Gauchat and his collaborators to turn their data into a dialect atlas could never be realized for the lack of funding, we show how consistent techniques for digitization, accompanied by transparent approaches to retro-standardization can be used to turn the original data of the TPPSR into a modern interactive dialect atlas. The dialect atlas is not only publicly available in the form of a web-based application, but also in the form of a dataset that offers the data in standardized, human- and machine-readable form.

The preprint was archived with Humanities Commons and the web application has been published as a CLLD project.

Final Post in Rhyme Series

2020-09-28T12:00:00+00:00

Today, the final post in my small series of six blogposts devoted to rhyme networks has appeared. While I have to admit that I was a bit more optimistic when I started the series, I am still content with what has been achieved in the last six months, even if most of these achievements reflect the awareness of new problems that need to be solved in the future. The post, titled "Analyzing Rhyme Networks", can be found here.

Blog post introducing a list of 171 body part concepts

2020-09-30T12:00:00+00:00

A list of 171 body part concepts was introduced in a blog post today. The list consists of body part concepts from ADAM’S APPLE to WRIST and can be found here.

New Temporary Affiliation

2020-10-02T12:00:00+00:00

From October 2020 until March 2021, I will act as a part-time deputy professor (Vertretungsprofessor) at Bielefeld University. Essentially this means that apart from a change in affiliation, I will give an extended lecture on computer-assisted approaches to comparative linguistics at Bielefeld University (in remote form). I hope that the lecture can this time cover both some basic introductions to Python for linguists and an in-depth tutorial on the most recent advancements in computational historical linguistics. As usual, I will share my scripts openly, but I may do that only after the lecture is finished. I am looking forward to this possibility of testing how well our integrative approach to data handling and analysis can be taught to students at the bachelor and master level.

New Blogpost

2020-10-19T12:00:00+00:00

Today, a new English blogpost appeared, in which I present an updated German wordlist that provides transcribed translations of the concept list proposed by the Intercontinental Dictionary Series. The blogpost can be found here, and the data have been published in the form of a GitHub GIST.

New German Blogpost

2020-10-23T12:00:00+00:00

Yesterday, I published my monthly German blogpost, this time dealing with theories that seem to be useful and powerful but soon turn out to be less helpful, since they tend to attract just-so-stories as explanations. The blog post, titled "Scheinriesentheorien" can be found here.

Concepticon 2.4.0 Published

2020-11-02T12:00:00+00:00

Last week, we published Version 2.4.0 of the Concepticon project, now offering 353 concept lists that are linked to as many as 3825 different concept sets. In this version, we also welcomed two new editors, Carolin Hundt and Tiago Tresoldi, both of whom helped us a lot in improving the Concepticon since its last version.

New Blogpost on language colexification statistics

2020-11-05T12:00:00+00:00

A blog post authored by me was published today on the CALC blog. Following two different requests related to our CLICS project, I have explored different statistics related to which languages colexify which pair of concepts, and which languages have a higher tendency for colexification.
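
The two kinds of statistics mentioned here can be illustrated with a small sketch. It does not reproduce the code or the data from the blog post or from CLICS: the wordlist is a hypothetical toy example with invented languages and forms, and colexification is simplified to "the same form expresses two concepts in the same language".

```python
from collections import Counter, defaultdict
from itertools import combinations


def colexifications(wordlist):
    """Map each colexified concept pair to the set of languages showing it."""
    by_form = defaultdict(set)
    for language, concept, form in wordlist:
        by_form[language, form].add(concept)
    pairs = defaultdict(set)
    for (language, _form), concepts in by_form.items():
        for a, b in combinations(sorted(concepts), 2):
            pairs[a, b].add(language)
    return pairs


# Hypothetical toy wordlist: (language, concept, form).
wordlist = [
    ("LangA", "ARM", "ruka"), ("LangA", "HAND", "ruka"),
    ("LangA", "TREE", "derevo"), ("LangA", "WOOD", "derevo"),
    ("LangB", "ARM", "arm"), ("LangB", "HAND", "hand"),
    ("LangB", "TREE", "tre"), ("LangB", "WOOD", "tre"),
]

pairs = colexifications(wordlist)
# Number of colexified concept pairs per language.
per_language = Counter(lang for langs in pairs.values() for lang in langs)

for (a, b), langs in sorted(pairs.items()):
    print(a, b, sorted(langs))
print(per_language)
```

The first mapping answers which languages colexify a given pair of concepts, while the counter gives a crude indication of which languages colexify more often. For real comparisons, the counts would need to be normalized by the number of concepts attested per language.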

New German Blogpost and CALC Posts for 2019

2020-11-09T12:00:00+00:00

On Saturday, I published a new German blog post, devoted to Digital Bullshit Jobs, in which I discuss how ignorance regarding the power of computational solutions leads to a situation in which we lose a lot of time doing manually what could easily be done automatically.

Already on Thursday Volume II of Computer-Assisted Language Comparison in Practice was published with Humanities Commons. This volume presents citable PDF versions of all blog posts that were written in 2019 as part of our CALC blog.

News, news, news

2020-11-23T12:00:00+00:00

A lot has happened in November so far.

First, an article has appeared in the Frankfurter Allgemeine Zeitung, featuring our research on signed languages published earlier this year. The article, titled "Die Stille Revolution" can now be accessed online here.

Second, a paper by John Miller, Tiago Tresoldi, Roberto Zariquiey, César Beltrán, Natalia Morozova, and myself has now been accepted by PLOS One. In this article we test the suitability of several machine learning techniques to infer lexical borrowings in a supervised approach by considering only monolingual information. This article has been available in the form of a preprint already, but it will soon also be available officially.

Third, a new study by Timotheus A. Bodt and myself was accepted for publication in Diachronica. In this study, we test how well one can predict words across languages that have not been elicited during fieldwork, relying on information about potential cognates of the missing words. Our experiment, which turned out to be quite successful, showed that our automated methods provide some good help, although they cannot do all of the work alone. More importantly, however, we realized how useful it can be to carry out active prediction attempts. This study, which has been going on for more than three years now, is also the first known to me in which predictions about words were preregistered in the form of an experiment. Our accepted authors' version before typesetting can now be accessed from Humanities Commons.

New blog post in 'How to do X in linguistics?' series and paper on 'General patterns and language variation'

2020-12-07T12:00:00+00:00

I wrote a blog post on "Possibilities of digital communication in linguistics" for our 'How to' series. In the blog post, I’ll illustrate some of the possibilities that linguists and other researchers have to discuss and share their work. You can read it here.

In addition, my proceedings paper for the COLING2020 workshop on 'Cognitive Aspects of the Lexicon' was published. The article with the title "General patterns and language variation: Word frequencies across English, German, and Chinese" is available here.

New Paper on Borrowing Detection Appeared

2020-12-14T12:00:00+00:00

A paper by John Miller, Tiago Tresoldi, Roberto Zariquiey, César A. Beltrán Castañón, Natalia Morozova, and myself appeared last week, discussing the use of lexical language models for mono-lingual borrowing detection:

Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example for this kind of evidence are phonological and phonotactic clues that are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in mono-lingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in mono-lingual borrowing detection. While the general results appear largely unsatisfying at a first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when using them in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multi-lingual information into account.

The paper can be found here.

Tangut as a West-Rgyalrongic Language

2020-12-17T12:00:00+00:00

Tangut (Chinese: 西夏語 Xīxià Yǔ, "Western Xia language") is an extinct Sino-Tibetan language attested between 1036 and 1502 AD, as one of the official languages of the Tangut empire. It is a fascinating language with copious literature and a complex writing system.

From the beginning of the 20th century until recently, the exact affiliation of Tangut in the Sino-Tibetan family was unclear: some classified it with Lolo-Burmese, while others claimed that it was related to Qiang. In our earlier study on the phylogeny of Sino-Tibetan (Sagart et al. 2019), we used Bayesian phylogenetic inference, which placed Tangut with the Gyalrongic languages. This proposal coincides with the intuition of experts of Gyalrongic. Specifically, Tangut shows a good number of similarities with West Gyalrongic, which includes Khroskyabs and Horpa-Stau languages.

In a paper that was published this week (Lai et al. 2020), we explore linguistic evidence that proves Tangut to be a West Gyalrongic language. We find shared lexical, morphological and syntactical innovations between Tangut and modern West Gyalrongic languages, demonstrating from a historical linguistic point of view that Tangut’s closest relative is indeed West Gyalrongic. We also discuss the migration of the Tanguts based on ancient texts from the 10th century that attest a Tangut population in today's Western Sichuan, where West Gyalrongic languages are spoken.

The paper can be found here.

New Blog Post

2020-12-18T12:00:00+00:00

I wrote my final German blog post for this year, this time discussing to which degree we can use our "native speaker intuition" to identify foreign words in our native languages. The blog post, titled "Von Wörtern in fremdem Gewand" can be found here.

Khroskyabs is not hard, but harder than you think

2021-01-12T12:00:00+00:00

Different people might have different opinions about the difficulty of learning a particular language. This heavily depends on the learner's linguistic background, learning experience and exposure to the target language. However, if we set all that noise aside and focus solely on morphological paradigms, it is assumed that languages are not as messy as they appear to be. This is called "the low conditional entropy conjecture", first proposed by Ackerman and Malouf (2013), stating that morphology is in general organized, thus showing a low overall conditional entropy. Bonami and Beniamine (2016) developed a new method, called the "implicative entropy", in order to measure the degree of morphological organization in language. They showed that French and European Portuguese have very low implicative entropies in terms of morphology.

It is interesting to see how implicative entropies behave in Gyalrongic languages, which are famous for their complex morphology. In a new paper that was just published (Lai 2021), I used Bonami and Beniamine's (2016) method to measure the implicative entropy of Siyuewu Khroskyabs, a West-Gyalrongic language. The result shows that the degree of morphological organization of Khroskyabs is relatively high, that is, the forms are more or less predictable, if we consider the high-low threshold of 1 bit: the entropies are generally lower than 1. However, if we compare the Siyuewu results to those for French and for Portuguese, we observe that the entropies exhibited in Siyuewu are significantly higher (sometimes three- or fourfold). Although it might be inappropriate to compare the results directly, we can still claim that Siyuewu is indeed morphologically more complex.
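
To give an impression of what an entropy value below or above 1 bit means here, the following minimal sketch computes a plain conditional entropy H(B|A), that is, the remaining uncertainty about the form class of paradigm cell B once the form class of cell A is known. It is not the implicative entropy method of Bonami and Beniamine (2016), which operates on alternation patterns extracted from full paradigms; the toy pairs of exponents below are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """H(B|A) in bits for a list of (class of cell A, class of cell B)."""
    pairs = list(pairs)
    joint = Counter(pairs)
    marginal = Counter(a for a, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (a, _b), count in joint.items():
        # p(a, b) * log2 p(b | a), summed over all observed combinations
        h -= (count / n) * math.log2(count / marginal[a])
    return h


# Hypothetical toy data: one pair per lexeme.
# Lexemes with "-a" in cell A split evenly between "-o" and "-i" in cell B,
# while lexemes with "-e" in cell A always take "-u" in cell B.
lexemes = [
    ("-a", "-o"), ("-a", "-o"), ("-a", "-i"), ("-a", "-i"),
    ("-e", "-u"), ("-e", "-u"), ("-e", "-u"), ("-e", "-u"),
]

print(conditional_entropy(lexemes))
```

For these toy data the value is 0.5 bits: half of the lexemes require a binary choice (1 bit), while the other half is fully predictable (0 bits). A fully predictable paradigm would yield 0.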

The complexity seems to be due to the evolution of the language. The same phoneme in an earlier stage evolved into different modern reflexes under different conditions, creating irregularities, and analogical change has yet to clear up the mess. Therefore, the result implies that Siyuewu Khroskyabs preserves more information about the proto-language than other dialects that have lower entropies, and is useful for the reconstruction of verbal morphology in the proto-language. As a result, I put forward an internal reconstruction, which helps to understand the evolution of verbal morphology in Khroskyabs and contributes to the field of Sino-Tibetan historical linguistics.

New Blog Post

2021-01-13T12:00:00+00:00

I wrote a new blog post that just appeared in our blog on Computer-Assisted Language Comparison, titled How to Handle Semantic Data with Tables. The blog post introduces basic practices used in the Concepticon project to handle semantic data.

New German Blog Post

2021-01-15T12:00:00+00:00

I wrote a new German blog post for January that discusses how important it is to be careful when seemingly detecting patterns, given that our mind often finds patterns where there are none in reality. This is exemplified with examples from etymology. You can find the post here.

New Blog Post in 'How to do X in linguistics?' Series

2021-01-20T12:00:00+00:00

I wrote a blog post about how to organize a journal club. The post is intended for someone who wants to start their own journal club or for someone who is unsure what to expect when they join our journal club. I share my experience and some tips here.

New Preprint

2021-01-25T12:00:00+00:00

Today, I submitted a new preprint, titled "Chances and Challenges of Quantitative Approaches in Chinese Historical Phonology". The study was submitted for the inclusion in a Festschrift for the 120th birthday of the famous linguist Li Fang-Kuei.

The field of Chinese Historical Phonology is traditionally dealing with a large number of complex and diverse types of data. While the data diversity can be conveniently dealt with in qualitative approaches, computational possibilities that have arisen during the past two decades offer new possibilities and new challenges for the field. In the study, I will summarize the chances and challenges which we face in the discipline and point to some suggestions for future work. While not being able to provide a direct solution for most issues of data handling and standardization, I hope that this study can contribute to a broader discussion about data and standards in the field of Chinese Historical Phonology.

The study has been submitted as a preprint to Humanities Commons.

Tiago Tresoldi Leaves the CALC Project to Start in Uppsala

2021-02-01T12:00:00+00:00

We are pleased and sad at the same time that Tiago Tresoldi, who worked in the CALC project as a post-doc for almost three years, is leaving the project to pursue a post-doc project in Uppsala. Given that we are currently finalizing some studies that were not published or submitted during Tiago's time in our project, this is not the last time Tiago will feature in our list of authors. For those interested in seeing what Tiago will be working on in the future, I recommend following his personal website, where he will share his new ideas on the analysis of the cultural transmission of text traditions.

New German Blogpost

2021-02-17T12:00:00+00:00

Yesterday, I wrote my monthly German blog post for February, this time concentrating on expressions for "thirst" and what triggers thirst in our brains. You can find the blog post here.

New Tutorial Blogpost

2021-02-24T12:00:00+00:00

Today, a new tutorial blogpost appeared, this time focusing on how to work with the data for the World Atlas of Language Structures in CLDF formats. You can find the post here.

New Tutorial Blogpost and New Paper Accepted

2021-03-10T12:00:00+00:00

Today, a new tutorial blogpost appeared, focusing on how to link a particularly complex dataset to the Concepticon; you can find it here. In addition, a new paper by Joshua Conrad Jackson, Joseph Watts, Curtis Puryear, Ryan Drabble, Kristen Lindquist, and myself has been accepted. The study, titled "From text to thought: How analyzing language can advance psychological science", will appear in "Perspectives on Psychological Science". The draft is available here.

New Blog Post on 'How to review concept lists in collaboration'

2021-03-15T12:00:00+00:00

Today, a new blog post in our 'How to do X in linguistics?' series was published. In the post, I describe our review process for adding concept lists to the Concepticon GitHub repository. The post shows how a collaborative review workflow ensures data validity and it is available here.

New Paper and Blogpost

2021-03-16T12:00:00+00:00

On Sunday, my German blog post for March appeared, this time discussing the indeterminacy in language and communication. You can find the post here. At the same time, a review paper by Cara L. Evans, Simon J. Greenhill, Joseph Watts, Carlos A. Botero, Russell D. Gray, Kathryn R. Kirby, and myself appeared, discussing "Uses and abuses of tree thinking in cultural evolution". I contributed the part on incomplete lineage sorting to this paper, which you can find as a preprint here.

New Paper

2021-03-17T12:00:00+00:00

A new paper by myself just appeared, which focuses on the verbal inflection chain of Siyuewu Khroskyabs, a Gyalrongic language (Trans-Himalayan). Siyuewu Khroskyabs goes against two general typological tendencies: first, as an SOV language, it shows an overwhelming preference for prefixes, which is rarely reported typologically; second, the inflectional prefixes in the outer slots are older than those in the inner slots, which is the reverse case of most languages. In this paper, I first identify distinct historical layers within the inflectional prefixes, and then focus on two of the prefixes, də- ‘even’ and ɕə- ‘Q’, whose evolutionary pathways are relatively clear. The essential part of the hypothesis is that the prefixes originate from enclitics which could be attached to the end of a preverbal chain, originally loosely attached to the verb stem. The preverbal chain later became tightly attached to the verbal stem and eventually became a part of it as a chain of prefixes. As a result, the original enclitics are reanalysed as prefixes. The integration of preverbal morphemes is responsible for the prefixing preference in Modern Siyuewu Khroskyabs. However, despite this superficial prefixing preference, Siyuewu Khroskyabs underlyingly favours postposed morphemes. By following the general suffixing tendency, this language finally managed to create a typologically rare, overwhelmingly prefixing verbal template.

The author's version of the paper can be found here.

LingPy 2.6.7 Released

2021-03-23T12:00:00+00:00

Yesterday we released LingPy 2.6.7, which you can find at https://pypi.org/project/lingpy (documentation at https://lingpy.github.io). The release does not introduce new features but guarantees compatibility with Python 3.9.

New Tutorial Blog Post and EDICTOR Release

2021-04-16T12:00:00+00:00

On Tuesday, I released EDICTOR 2.0, and on Wednesday, I published a blog post discussing some of the new features, titled Using EDICTOR 2.0 to Annotate Language-Internal Cognates in a German Wordlist.

Cross-Linguistic Transcription Systems 2.1.0

2021-04-21T12:00:00+00:00

Yesterday, we released the Cross-Linguistic Transcription Systems (https://clts.clld.org) in a new version, which also came along with some changes in the presentation of the data in the CLLD application, which Robert Forkel added during the past days.

New Blog Post and New Papers

2021-04-27T12:00:00+00:00

Today I published a new German blog post, titled Parallele Evolution in der Benennung von Unverpacktläden. At the same time, two papers appeared online. The first one is a prediction study with T. A. Bodt, published in Diachronica:

While analysing lexical data of Western Kho-Bwa languages of the Sino-Tibetan or Trans-Himalayan family with the help of a computer-assisted approach for historical language comparison, we observed gaps in the data where one or more varieties lacked forms for certain concepts. We employed a new workflow, combining manual and automated steps, to predict the most likely phonetic realisations of the missing forms in our data, by making systematic use of the information on sound correspondences in words that were potentially cognate with the missing forms. This procedure yielded a list of hypothetical reflexes of previously identified cognate sets, which we first preregistered as an experiment on the prediction of unattested word forms and then compared with actual word forms elicited during secondary fieldwork. In this study we first describe the workflow which we used to predict hypothetical reflexes and the process of elicitation of actual word forms during fieldwork. We then present the results of our reflex prediction experiment. Based on this experiment, we identify four general benefits of reflex prediction in historical language comparison. These comprise (1) an increased transparency of linguistic research, (2) an increased efficiency of field and source work, (3) an educational aspect which offers teachers and learners a wide plethora of linguistic phenomena, including the regularity of sound change, and (4) the possibility of kindling speakers’ interest in their own linguistic heritage.

The second study, written together with N. Sims and R. Forkel and published in TALLIP, is based on our initial experiments with interlinear-glossed text. It is not available in open access, but our authors' copy can be found here:

While the amount of digitally available data on the world’s languages is steadily increasing, with more and more languages being documented, only a small proportion of the language resources produced are sustainable. Data reuse is often difficult due to idiosyncratic formats and a negligence of standards that could help to increase the comparability of linguistic data. The sustainability problem is nicely reflected in the current practice of handling interlinear-glossed text, one of the crucial resources produced in language documentation. Although large collections of glossed texts have been produced so far, the current practice of data handling makes data reuse difficult. In order to address this problem, we propose a first framework for the computer-assisted, sustainable handling of interlinear-glossed text resources. Building on recent standardization proposals for word lists and structural datasets, combined with state-of-the-art methods for automated sequence comparison in historical linguistics, we show how our workflow can be used to lift a collection of interlinear-glossed Qiang texts (an endangered language spoken in Sichuan, China), and how the lifted data can assist linguists in their research.

Updated Preprint for NoRaRe Article available

2021-05-04T12:00:00+00:00

We revised our manuscript for the article on "Linking Norms, Ratings, and Relations of Words and Concepts Across Multiple Language Varieties." An updated version of our preprint (Version 2) is now available on PsyArXiv (click here)

Tjuka, A., Forkel, R., & List, J. (2021, May 4). Linking Norms, Ratings, and Relations of Words and Concepts Across Multiple Language Varieties. https://doi.org/10.31234/osf.io/tgw3z

Privatdozent

2021-05-07T12:00:00+00:00

Having successfully defended my habilitation in February this year, I have now gained a new affiliation as a Privatdozent at the Institut für Orientalistik, Indogermanistik, Ur- und Frühgeschichtliche Archäologie at the Friedrich-Schiller-Universität Jena. This means that I will give at least one seminar per semester from now on at the FSU Jena, and my current seminar on lexical change has already started.

Furthermore, EvoBib Version 1.4.0 was published yesterday, containing some 200 more entries in the bibliography and numerous new quotes, all assembled during the past months.

News

2021-05-17T12:00:00+00:00

There are quite a few different kinds of news to share today. First, I published my May blogpost in German, titled "Offene Forschung als Praxis guter Wissenschaft" (URL: wub.hypotheses.org/1292), discussing open science and why it should become part of good scientific practice. Then, I am very happy to announce that Frederic Blum published a blog post on "Data Gathering in Times of a Pandemic: Upcycling Constenla Umaña’s Data on the Chibchan, Lencan and Misumalpan Language Families" (URL: calc.hypotheses.org/2751) in our CALC blog, which illustrates how data can be converted to our CLDF formats and nicely shows how accessible these formats already are. Here's the abstract:

While searching for the topic of a small research project about the linguistic history of South America, I realized that a lot of data that is crucial for assessing central arguments is not openly available, but new data is difficult to come by these days. And when it is, it is not usually presented in a data format that allows for easy reuse. Guided by these thoughts, I decided to turn towards the upcycling of previously published data.

Finally, a new paper appeared, titled "The uses and abuses of tree thinking in cultural evolution" (DOI: 10.1098/rstb.2020.0056), by Cara L. Evans, Simon J. Greenhill, Joseph Watts, myself, Carlos A. Botero, Russell D. Gray, and Kathryn R. Kirby. Here is the abstract:

Modern phylogenetic methods are increasingly being used to address questions about macro-level patterns in cultural evolution. These methods can illuminate the unobservable histories of cultural traits and identify the evolutionary drivers of trait change over time, but their application is not without pitfalls. Here, we outline the current scope of research in cultural tree thinking, highlighting a toolkit of best practices to navigate and avoid the pitfalls and ‘abuses’ associated with their application. We emphasize two principles that support the appropriate application of phylogenetic methodologies in cross-cultural research: researchers should (1) draw on multiple lines of evidence when deciding if and which types of phylogenetic methods and models are suitable for their cross-cultural data, and (2) carefully consider how different cultural traits might have different evolutionary histories across space and time. When used appropriately phylogenetic methods can provide powerful insights into the processes of evolutionary change that have shaped the broad patterns of human history.

Manuscript Paper on Annotating Cognates in Phylogenetic Studies of South-East Asian Languages

2021-05-19T12:00:00+00:00

Yesterday, I submitted a manuscript to a journal. The manuscript introduces a new annotation framework which deals with compounding and derivation in Southeast Asian (SEA) languages. We compare words in 19 Chinese dialect varieties in order to determine the relationships between these languages. Since compounding is the primary strategy to enlarge SEA languages' lexicons, the morphemes of the compounds need to be taken into account, as well as the cross-linguistic relationships between these morphemes. In our study, we annotate the meanings and functions of the morphemes using a new annotation format. We also present four conversion methods to transform the annotation into formats from which the family can then be derived. We show that using different conversion methods will drastically change the trees’ topologies. In conclusion, we encourage linguists to consider also the relationships between morphemes rather than full words.

The author's version of the paper can be found here.

News, News, News

2021-06-09T12:00:00+00:00

Last week, a new post in our tutorial blog on Computer-Assisted Language Comparison in Practice appeared. In this post, I presented initial ideas on How to Share Data and Code when Submitting Papers to a Journal.

At the same time, we finally managed to publish version 2.5 of the CLLD Concepticon, which now offers 392 different concept lists, making up a total of about 100 000 concept labels.

New Blog Post and Update on NoRaRe

2021-06-14T12:00:00+00:00

In a new blog post, I present a multilingual concept list consisting of 28 body part concepts across 15 languages (e.g., English, German, Wolof, Vietnamese, Czech). The list was elicited in an urban fieldwork study for my master's thesis (Tjuka 2019). The blog post is available here.

We also received good news for our NoRaRe article, which has been accepted for publication in Behavior Research Methods! The preprint is available here.

New Blog Post

2021-06-16T12:00:00+00:00

Yesterday, a new German blogpost appeared, focusing on predatory journals and paper mills in science.

LingRex Released in Version 1.0.0

2021-06-21T12:00:00+00:00

Yesterday, I released LingRex in version 1.0.0. While the earlier version, which was used in three recent papers (List 2019, Wu et al. 2020, and Bodt and List 2019), was never properly tested, the new version has a test coverage of 99% with respect to unit tests. This means we can expect more stability in the application of the major functions provided by the library. These consist in the code for correspondence pattern detection (List 2019), the search for cross-semantic cognates (Wu et al. 2020), a new template-based alignment method (Wu et al. 2020), and the prediction of words (Bodt and List 2019 and 2021).

In addition, our blog collection for all blogs published in 2020 in our Computer-Assisted Language Comparison in Progress (CALCiP) blog has now been published with Humanities Commons in a new design (see 10.17613/2qq4-y417). In the future, we hope to publish stable PDF versions of blog posts individually after their appearance, instead of publishing them only once per year.

Habilitation

2021-06-28T12:00:00+00:00

Last week, my habilitation thesis officially appeared online. It is titled "Computer-assisted approaches to historical language comparison" and features 12 articles which I wrote in the past years on the topic of "computer-assisted language comparison". It can be found online under the DOI 10.22032/dbt.49007.

New Blog Post

2021-07-06T12:00:00+00:00

Yesterday, I wrote and published a new German blog post, which points to the problem that linguists often think that language is a perfect tool for communication without any flaws. I argue that there are still many situations in which we have problems expressing ourselves with our languages, and that language is not perfect to describe these. The post, titled "Worüber man nicht reden kann...", can be found here.

New Preprint

2021-07-16T12:00:00+00:00

Yesterday, a new preprint appeared, which was submitted to a new open access journal with post-publication peer review. The study, together with Robert Forkel, proposes a new method for the automated detection of borrowings in multi-lingual wordlists (DOI: 10.12688/openreseurope.13843.1).

Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas.

New Blogpost on Data Sharing

2021-08-03T12:00:00+00:00

Yesterday, a new tutorial blogpost appeared which discusses how data should be shared when submitting a study to a journal for peer review. The blogpost, titled Transparent Data, emphasizes the importance of data transparency in order to make replication and reuse of data easy.

New Preprint on Language Evolution

2021-08-05T12:00:00+00:00

I just submitted a new preprint to Humanities Commons (DOI), titled "Evolutionary Aspects of Language Evolution". The study is a contribution to a forthcoming anthology on "Evolutionary Thinking Across Disciplines", ed. by Agathe du Crest, Martina Valkovic, Philippe Huneman, and Thomas A. C. Reydon, which will appear in the Synthese Library (Springer) some time next year. This is the abstract:

While it has been known for a long time that human languages can change in various ways, it was only in the early 19th century that scholars realized that certain aspects of language change proceed in a surprisingly regular manner, allowing us to reconstruct historical stages of languages which have never been documented in written sources. The findings led to the establishment of historical linguistics as a scientific discipline, devoted to the investigation of how languages change and why. Although evolutionary thinking plays a major role in historical linguistics, practitioners often have the tendency to emphasize the peculiarities of language evolution rather than the commonalities with other kinds of evolution. In part, this seems to be justified by some phenomena for which it is difficult to find counterparts in different disciplines. In part, however, this may also be due to a communication problem that is characteristic for interdisciplinary research, since scholars lack a common terminology. As a result, it is difficult for linguists to explain their particular evolutionary views on language change to practitioners from other disciplines, while evolutionary terminology from disciplines such as biology is difficult to grasp for linguists. In the study, I will try to present some important evolutionary aspects of language change for which it is hard to find counterparts in other disciplines and then point to current challenges of evolutionary studies in historical linguistics which have to deal with these aspects.

NoRaRe article published

2021-08-06T12:00:00+00:00

I'm happy to announce that our article "Linking norms, ratings, and relations of words and concepts across multiple language varieties" was published today in Behavior Research Methods. The article is open access and available here. In the article, we present the Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe). The data can be accessed via a web interface or GitHub.

New beginner's guide for NoRaRe

2021-08-11T12:00:00+00:00

In a new blog post, I describe how to add word lists to NoRaRe. The instructions are aimed at people new to Python who want to add their own dataset to the NoRaRe database (Tjuka et al. 2021). The post can be found here.

New Paper Accepted and New Blogpost

2021-08-16T12:00:00+00:00

After a relatively short period of open reviews, our study with Robert Forkel presenting a new method for automatic borrowing detection has now been formally accepted by Open Research Europe (see here for the current draft). We still have to revise the study a bit more before the final version appears, so it is currently listed as "forthcoming", but the final version will appear soon.

Furthermore, I published another German blog post. This time, I argue that scientists should not push out their research results too fast, but should rather consider twice whether the results are sound and whether it is important to share them. I provide examples where linguistic studies and studies in machine learning have not done so, but have instead insisted on publishing results that were not only half-baked but also particularly problematic. This German blog post can be found here.

Paper on Borrowing Detection Published

2021-08-25T12:00:00+00:00

Our study with Robert Forkel presenting a new method for automatic borrowing detection has now been published by Open Research Europe (see here). The revision contains a very detailed investigation of thresholds used in automatic cognate detection approaches, which reveals that thresholds needed to infer borrowings differ quite substantially from thresholds needed to infer cognates language-family-internally. This justifies why we use two thresholds in our new approach.

Lexibank

2021-09-03T12:00:00+00:00

After more than one year in which we were busily finalizing datasets and writing new applications for the analysis, we have finally managed to submit the paper presenting the Lexibank wordlist collection, and a preprint presenting the database is already available online from ResearchSquare.

The past decades have seen substantial growth in digital data on the world's languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, the majority of published datasets lack standardization which makes their comparison difficult. Here, we present the first step to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that increase the FAIRness of linguistic data. We test the Lexibank workflow on a collection of 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

The data is also available now from GitHub (lexibank/lexibank-analysed).

EvoBib

2021-09-05T12:00:00+00:00

On Saturday, a new version of EvoBib was released, now containing 4000 bibliographic references and more than 6000 different quotes from the literature. The web-interface (with additional information on the original data) can be found at https://digling.org/evobib/.

Keynote Talk and New Blog Post

2021-09-08T12:00:00+00:00

Today, a new blog post showing how data can be added to our Lexibank repository appeared. In this post, I show how data can be converted to Lexibank CLDF formats, using a recently published dataset on Vietic languages by Sidwell and Alwes as an example. The post can be found here.

I also gave a keynote at this year's KONVENS conference in Düsseldorf in virtual form. Since I recorded the keynote beforehand, an earlier version is also available online, which you can find here.

New Blog Post

2021-09-10T12:00:00+00:00

Today, a new German blog post discussing how much we know about the parts of which words are composed has appeared online. The post can be found here.

New Preprint

2021-09-11T12:00:00+00:00

Today, a new preprint with Cormac Anderson, Tiago Tresoldi, Simon J. Greenhill, Robert Forkel, and Russell D. Gray, titled "Measuring variation in phoneme inventories", appeared online at Research Square (10.21203/rs.3.rs-891645/v1). The study systematically compares phoneme inventories and how they are coded in different datasets.

For over a century, the phoneme has played a central role in linguistic research. In recent years, collections of phoneme inventories, originally designed for cross-linguistic purposes, have increasingly been used in comparative studies involving neighbouring disciplines. Despite the extended application of this type of data, there has been no research into its comparability or tests of its reliability. In this study, we carry out a systematic comparison of four popular phoneme inventory collections. We render them comparable by linking them to standardised formats for the handling of cross-linguistic datasets and develop new measures to test both size and similarity. We find considerable differences in inventories supposedly representing the same language variety, both in terms of size and transcriptional choices. While some of these differences appear to be predictable, reflecting design decisions in the different collections, much of the observed variation is unsystematic. These results should sound a note of caution for comparative studies based on phoneme inventories, which we suggest need to take the question of comparability more seriously. We make a number of proposals for improving the comparability of phoneme inventories.

New Blog Posts

2021-10-06T12:00:00+00:00

Yesterday, two new blog posts were published, one English blog post, discussing How to write a term paper in linguistics, and one German blog post, discussing the calculability of data.

New Paper

2021-10-08T12:00:00+00:00

Already on October 4, a new paper to which I had the chance to contribute appeared (published online before print). This review by Joshua C. Jackson, Joseph Watts, myself, Curtis Puryear, Ryan Drabble, and Kristen A. Lindquist is titled "From Text to Thought: How Analyzing Language Can Advance Psychological Science".

Humans have been using language for millennia but have only just begun to scratch the surface of what natural language can reveal about the mind. Here we propose that language offers a unique window into psychology. After briefly summarizing the legacy of language analyses in psychological science, we show how methodological advances have made these analyses more feasible and insightful than ever before. In particular, we describe how two forms of language analysis—natural-language processing and comparative linguistics—are contributing to how we understand topics as diverse as emotion, creativity, and religion and overcoming obstacles related to statistical power and culturally diverse samples. We summarize resources for learning both of these methods and highlight the best way to combine language analysis with more traditional psychological paradigms. Applying language analysis to large-scale and cross-cultural datasets promises to provide major breakthroughs in psychological science.

New Blog Post

2021-11-01T12:00:00+00:00

In a new blog post, I introduce a list of 192 concepts across the semantic domains of color, emotion, and human body. The post can be found here. The concept list is available on Zenodo: https://doi.org/10.5281/zenodo.5549847.

New Paper and New Blog post

2021-11-02T12:00:00+00:00

Yesterday, we were informed that our paper presenting "A digital, retro-standardized edition of the Tableaux Phonétiques des Patois Suisses Romands (TPPSR)", by Hans Geisler, Robert Forkel, and myself, has finally appeared in print. The study presents an online edition of the TPPSR, a dialect atlas of the Suisse romande, collected in the early 20th century, which has already been published online at https://tppsr.clld.org. The study presenting this database itself will mainly appear in print, but for now, offprints are also available here.

Additionally, a new German blog post discussing ghost writers, predatory journals, and agencies searching for ghost writers, has appeared. The post, titled "Von schreibenden Geistern und vertretenen Stellen" can be found at https://wub.hypotheses.org/1370.

New blog post related to WoW conference presentation

2021-11-24T12:00:00+00:00

In a new blog post, I describe how to compare NoRaRe data sets in R. The post is based on a study investigating arousal and valence ratings in English, Dutch, and Spanish, which will be presented at the WoW Conference (https://wordsintheworld.ca/wow-conference/) on Saturday.

The post can be found here. The presentation slides are available here: https://pad.gwdg.de/p/KLgI9TLrP#/.

New Blog post

2021-12-06T12:00:00+00:00

Today, I published a new blog post in German, discussing open science principles and how to evaluate studies when reviews have been submitted along with them. You can find the post at https://wub.hypotheses.org/1379.

New paper on semantic relations in word formation, borrowing, and semantic change

2021-12-16T12:00:00+00:00

This week, I submitted a paper for review on using computer-assisted approaches for studying semantic aspects of language change. In it, I investigate the etymologies of 480 German nouns of basic vocabulary. Findings include the various factors that contribute to the choice of the semantic relation utilized in coining new meanings (like part of speech, semantic field, and morphological aspects), and potential ways of improving semantic reconstruction. You can find the preprint at DOI:10.17613/03dk-tk62.

New Accepted Paper

2021-12-17T12:00:00+00:00

Today, I shared the author's version of a newly accepted paper called "Correcting a bias in TIGER rates resulting from high amounts of invariant and singleton cognate sets".

In a recent issue of the Journal of Language Evolution, Syrjänen et al. (2021) investigate the suitability of computing Cummins and McInerney’s (2011) TIGER rates for estimating the tree-likeness of linguistic datasets compiled for phylogenetic reconstruction. The authors test the TIGER rates on a diverse sample of simulated data, which by and large confirms the usefulness of TIGER rates as an analytic tool for investigating linguistic data, but they test them only on one real-world dataset of Uralic languages which turns out to behave quite differently from the simulated data. When testing the TIGER rates on additional datasets, I detected a bias in the computation which leads to an unnatural increase in those cases where a dataset contains many characters with invariant or singleton states. To overcome this problem, I suggest a modified variant of TIGER rates, which is provided in the form of a freely available Python package. Testing the modified TIGER scores on the simulated data of Syrjänen et al. shows that the corrected TIGER rates still readily distinguish between different degrees of tree-likeness. Testing them on a dataset in which the number of singletons and invariants was artificially increased further shows that the corrected TIGER rates are not influenced by the bias. A final test on seven linguistic datasets shows the usefulness of the corrected TIGER rates on a larger variety of linguistic datasets and illustrates the importance to take specific aspects of linguistic data into account when using biological methods in the domain of language evolution.

The paper is accompanied by a small Python package that computes the new TIGER rates (https://pypi.org/project/pylotiger) and can be accessed from Humanities Commons (DOI: 10.17613/0n1n-3352).

CALC Blog Posts for 2021

2022-01-04T12:00:00+00:00

We published Volume 4 of Computer-Assisted Language Comparison in Practice. The volume contains PDF versions of all contributions published on the CALC blog (https://calc.hypotheses.org) in 2021 and is available on Humanities Commons (https://hcommons.org) at https://doi.org/10.17613/a0ew-0n98.

New Paper Appeared

2022-01-05T12:00:00+00:00

Today, a study on the managing of data in historical linguistics with the goal of reconstructing language phylogenies from lexical data appeared in print (DOI: 10.7551/mitpress/12200.003.0033). This study was collaborative work with Tiago Tresoldi, former post-doc in our CALC project, Christoph Rzymski, Robert Forkel, Simon J. Greenhill, and Russell D. Gray.

Beyond CALC

2022-01-07T12:00:00+00:00

Thanks to the generous funding by the MPG, the CALC project will be continued from April 2022 until March 2024. Under the title "Beyond CALC: Computer-Assisted Approaches to Human Prehistory, Linguistic Typology, and Human Cognition (CALC³)", we will continue and expand our work on computer-assisted language comparison. Since our doctoral students have not yet finished their PhDs, there won't be an abrupt change in our group but rather a transition with some people leaving us and other people joining us. More detailed news on the new project will be shared later this year.

New Blog Post on PySem

2022-01-10T12:00:00+00:00

Today, a new blog post appeared in our series of blog posts on Computer-Assisted Language Comparison, this time introducing How to Map Concepts with the PySem Library.

New Blog Post on the Concept "Shadow"

2022-01-17T12:00:00+00:00

Today, a new blog post appeared in my German blog, this time devoted to the concept "shadow" and the extended meanings it can take in German and other languages. This post, titled "Von der Ambivalenz des Schattens" can be found online at https://wub.hypotheses.org/1406.

New Paper on TIGER Rates

2022-01-21T12:00:00+00:00

TIGER rates are an interesting way to assess the tree-likeness of a dataset, originally proposed by Cummins and McInerney in 2011 and now also discussed for their suitability to be applied to linguistic data by Syrjänen et al. 2021. When reading both articles, I felt that there was something odd with the TIGER rates, and I found the reason in the handling of singletons and invariants. As a result, I wrote a small library in Python that computes both corrected and original TIGER rates and I also wrote a comment to the original article by Syrjänen et al. to illustrate the usefulness of the extended (or corrected) rates. The article has now been published under the title "Correcting a bias in TIGER rates resulting from high amounts of invariant and singleton cognate sets" (DOI: 10.1093/jole/lzab007).
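
To make the role of singletons and invariants a bit more concrete, here is a minimal sketch of the uncorrected partition-agreement computation, assuming the commonly cited reading of the original TIGER proposal: a character's rate is the mean share of sets in every other character's partition that fit inside one of its own sets. The function names are hypothetical, this is not the interface of the Python package mentioned above, and the corrected variant is deliberately not reproduced here.

```python
def partition(states):
    """Group taxon indices by character state (one set per state)."""
    groups = {}
    for taxon, state in enumerate(states):
        groups.setdefault(state, set()).add(taxon)
    return [frozenset(group) for group in groups.values()]


def partition_agreement(p_i, p_j):
    """Share of sets in p_j that are contained in some set of p_i."""
    hits = sum(1 for x in p_j if any(x <= a for a in p_i))
    return hits / len(p_j)


def tiger_rates(characters):
    """Uncorrected rate per character (assumes at least two characters)."""
    parts = [partition(character) for character in characters]
    rates = []
    for i, p_i in enumerate(parts):
        scores = [
            partition_agreement(p_i, p_j)
            for j, p_j in enumerate(parts)
            if j != i
        ]
        rates.append(sum(scores) / len(scores))
    return rates


# Four taxa; each string lists the state of one character per taxon.
base = ["AABB", "AABB", "ABAB"]
print(tiger_rates(base))
# Adding a character in which every taxon has its own state ("ABCD")
# raises the rates of the other characters in this sketch.
print(tiger_rates(base + ["ABCD"]))
```

In this sketch, a character made up only of singletons agrees with every other partition, and an invariant character is agreed with by every other partition, so both kinds push rates upwards regardless of how tree-like the remaining data are.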

New Preprint on Contact Layer Detection

2022-01-23T12:00:00+00:00

On Friday last week, a new study submitted to Open Research Europe appeared as a preprint. The study, titled "First steps towards the detection of contact layers in Bangime: a multi-disciplinary, computer-assisted approach", by Abbie Hantgan, Hiba Babiker, and myself, is now awaiting open peer review on the Open Research Europe platform (DOI: 10.12688/openreseurope.14339.1).

New German Blog Post on Learning

2022-02-16T12:00:00+00:00

Today, I published my monthly German blogpost, this time discussing how we learn things and forget the difficulties when doing so. The post, titled "Über das Vergessen der Einstiegshürden" can be found here.

New Blog Post on Extended Concept List

2022-02-28T12:00:00+00:00

In a recent blog post, I introduced a list of color, emotion, and human body part concepts. An updated version of the list is now available that includes 28 additional emotion concepts. The blog post presenting the extended list can be found here. The concept list is available on Zenodo: https://doi.org/10.5281/zenodo.6226423.

New German Blog Post on Playing Wordle

2022-03-07T12:00:00+00:00

Today, I published my monthly German blogpost, this time discussing how the popular Wordle game requires different strategies when playing it in different languages. You can find the post here.

New English Blog Post on the Wagner-Fischer Algorithm

2022-03-08T12:00:00+00:00

Yesterday, another blogpost of mine appeared, this time in our tutorial blog for computer-assisted language comparison, presenting an animated version of the Wagner-Fischer algorithm. The blogpost can be found here.
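
For readers who have not met the algorithm before, here is a minimal sketch of the Wagner-Fischer procedure for computing the edit distance between two sequences. It is a plain textbook version with unit costs for insertion, deletion, and substitution, not the code from the blog post, and the function name is chosen for illustration only.

```python
def wagner_fischer(source, target):
    """Return the edit distance and the filled dynamic-programming matrix."""
    rows, cols = len(source) + 1, len(target) + 1
    matrix = [[0] * cols for _ in range(rows)]

    # First column and first row: distance to or from the empty prefix.
    for i in range(rows):
        matrix[i][0] = i
    for j in range(cols):
        matrix[0][j] = j

    # Each remaining cell takes the cheapest of the three possible steps.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,         # deletion
                matrix[i][j - 1] + 1,         # insertion
                matrix[i - 1][j - 1] + cost,  # match or substitution
            )
    return matrix[-1][-1], matrix


distance, matrix = wagner_fischer("kitten", "sitting")
print(distance)  # 3
```

The returned matrix is what an animation would step through cell by cell; since the function only indexes into its arguments, it works on lists of sound segments as well as on plain strings.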

ERC Consolidator Grant

2022-03-18T12:00:00+00:00

Yesterday, the ERC officially announced the winners of the Consolidator Grant applications from 2021. I am very proud that my project ProduSemy -- Productive Signs. A Computer-Assisted Analysis of Evolutionary, Typological, and Cognitive Dimensions of Word Families -- was among the projects that were selected for funding (see the announcement of the project in our institute here). The abstract of the project is given below:

All human languages have simple and complex words. Simple words refer to meanings regardless of their form, while complex words are formed from other words, and their formation can be semantically motivated. Since words can share lexical material, we can group them into families. Word families can vary greatly in size, ranging from small ones – comprising only a few members –, to large ones – spanning several hundred words –, but it is still unclear why some words are more productive than others in forming new words. Lexical compositionality has received some attention in historical linguistics, linguistic typology, and cognitive linguistics, but so far studies have mostly concentrated on the morphological complexity of individual words and languages, while the fact that words form families which interact during language change and language use has been typically ignored. As a result, many questions regarding word family formation remain unresolved, and we do not know (1) how word families evolve along language phylogenies, (2) which semantic processes underlying word family formation are universal, and (3) to what extent human cognition influences the productivity of lexical roots to form families. The project will tackle these three target questions by unifying evolutionary, typological, and cognitive insights into lexical compositionality. Building on a computer-assisted framework that reconciles classical and computational approaches in historical linguistics and linguistic typology, the project will design new models to standardize cross-linguistic data on word families, apply them to integrate data from historical linguistics, linguistic typology, and cognitive linguistics, and develop new methods for the computer-assisted inference of word families, their underlying motivation patterns, and their evolutionary histories in large datasets. In this way, the project will deepen the integration of cross-linguistic studies in cognitive and psychological sciences.

The project is planned to start in October, and the CALC³ project starts already in April. The CALC lab, which recently lost some of its members when the ERC Starting Grant that funded the first five years of the project ended, will therefore welcome new members and continue its research on computer-assisted language comparison, this time with a specific focus on the compositionality of the human lexicon.

Student Assistant Position in the CALC³ Project

2022-03-30T12:00:00+00:00

We invite applications for a position as a student assistant in our CALC³ project. The student assistant will be preparing data for computer-assisted studies on historical and lexical language comparison.

Details and the application form can be found on the MPI website: https://www.eva.mpg.de/career/positions-available/job/535/Abteilung%20Sprach-%20und%20Kulturevolution/en/

New Blog Post on Body and Object Concept List

2022-04-04T12:00:00+00:00

In a new blog post, I introduce a list of body and object concepts. The concept list is the basis for an ongoing study and consists of 784 concepts divided into two groups: 134 body and 650 object concepts. The blog post can be found here. The concept list is available on Zenodo: https://doi.org/10.5281/zenodo.6365495.

LingRex

2022-04-07T12:00:00+00:00

Today, a new version of LingRex (https://pypi.org/project/lingrex, together with Robert Forkel) was published, version 1.2, which contains not only bugfixes to our code on borrowing detection, but also new code that can be used for automated word prediction or phonological reconstruction in a supervised fashion.

New Paper on Phonological Reconstruction

2022-04-12T12:00:00+00:00

A new paper was accepted, together with Nathan W. Hill and Robert Forkel, titled "A new framework for fast automated phonological reconstruction using trimmed alignments and sound correspondence patterns". It will appear in the proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, co-located with the ACL 2022 meeting in Dublin. In this study, we present a new framework for supervised phonological reconstruction which is quite simple and also fast and thus perfect as a baseline to be compared with more complex methods. The preprint of the study can be found online now.

Computational approaches in historical linguistics have been increasingly applied during the past decade and many new methods that implement parts of the traditional comparative method have been proposed. Despite these increased efforts, there are not many easy-to-use and fast approaches for the task of phonological reconstruction. Here we present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection to allow for the supervised reconstruction of word forms in ancestral languages. We test the method on a new dataset covering six groups from three different language families. The results show that our method yields promising results while at the same time being not only fast but also easy to apply and expand.

New Paper and New Preprint

2022-04-22T12:00:00+00:00

A new paper has just been officially published, after having been endorsed by two open reviews. Joint work with Abbie Hantgan and Hiba Babiker, this study sheds light on potential contact relations of the language isolate Bangime, spoken in Mali (the full paper can be found here).

Bangime is a language isolate spoken among the Dogon, Mande, Atlantic, and Songhai language families in Central-Eastern Mali. Despite Dogon disapproval, the speakers of Bangime, the Bangande, claim an ethnic identity with the Dogon. The Bangande are geographically isolated and current genetic research denoted their genetic disparity. However, here we show evidence of shared vocabulary among the Bangime and neighboring language groups. We investigate the layers of contact using a computer-assisted, multidisciplinary approach in a series of steps. We use lexical automated comparisons taking into account the qualitative and quantitative measures and the correction of the findings. Within archeological and historical contexts from Central-Eastern Mali, our results show that the Bangime language was spoken before the Dogon Expansion in the Escarpment 1400c. AD. This work represents a great mark in computational linguistics for the study of language isolates and the paradox of their history.

Additionally, a new preprint is now available, a review of computational approaches to historical language comparison, which was submitted for the inclusion in the 2nd edition of the Routledge Handbook of Historical Linguistics. The preprint can be found here.

NoRaRe Article published in Print

2022-04-28T12:00:00+00:00

The article presenting the Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe) has finally appeared in print. In the article, we introduce an openly curated cross-linguistic database for studies in psychology and linguistics. NoRaRe (v0.2) currently contains 65 unique word and concept properties drawn from 98 different datasets in 40 languages. The article was first published online in the journal Behavior Research Methods in 2021. The citation for the printed version is:

Tjuka, Annika, Robert Forkel, and Johann-Mattis List. 2022. Linking norms, ratings, and relations of words and concepts across multiple language varieties. Behavior Research Methods 54, 864–884. https://doi.org/10.3758/s13428-021-01650-1

Blog Post Style Guide for Future Contributions

2022-05-09T12:00:00+00:00

In a new blog post, I introduce a style guide for contributions on our Computer-Assisted Language Comparison in Practice blog. We hope that the post will help our colleagues, not only those who work in our research group and department, but also external collaborators and scholars who would like to share their ideas. The post is available here: https://calc.hypotheses.org/4084.

New Blogpost and Paper Accepted

2022-05-10T12:00:00+00:00

A new German blog post has just been published online, this time dealing with publishing and discussing in scientific research (you can find the post here).

Furthermore, our paper introducing the Lexibank repository has now been accepted by Scientific Data, and we are currently revising it for the final publication.

New Paper and Update to PySEM

2022-05-30T12:00:00+00:00

With the publication of Concepticon 2.6.0 (https://concepticon.clld.org) during the last week (together with Annika Tjuka and Robert Forkel as main collaborators on this project) the PySEM package (https://pypi.org/project/pysem) has now also been updated to version 0.5, which contains the data from the most recent Concepticon version.

In addition, our paper on supervised phonological reconstruction has now appeared. This study, common work with Nathan Hill and Robert Forkel, offers a new straightforward framework for phonological reconstruction and word prediction, which can serve as a fast baseline for future studies devoted to the task. This study can be found here.

Computational approaches in historical linguistics have been increasingly applied during the past decade and many new methods that implement parts of the traditional comparative method have been proposed. Despite these increased efforts, there are not many easy-to-use and fast approaches for the task of phonological reconstruction. Here we present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection to allow for the supervised reconstruction of word forms in ancestral languages. We test the method on a new dataset covering six groups from three different language families. The results show that our method yields promising results while at the same time being not only fast but also easy to apply and expand.

New Member of our CALC³ Group

2022-06-03T12:00:00+00:00

Two days ago, Frederic Blum joined us as a doctoral student in our CALC³ project, funded by the Max Planck Society. Frederic will apply computer-assisted methods to study the history of the Pano-Tacanan language family in South America. We are very happy that Frederic joined our group and look forward to a fruitful collaboration in the future.

New German Blog Post and New Preprint

2022-06-13T12:00:00+00:00

Today, my monthly German blog post appeared, which this time discusses "gray zones" in scientific practice. You can find the post here.

Additionally, a new preprint by Hans Geisler and myself was just deposited online. It discusses metaphors about language history in the past, the present, and the future. This preprint, currently under review, can be found here.

For a long time, metaphors have played an important role in depicting language history. In this study, we contrast early metaphors on language history, such as the family tree or the wave model, with recent metaphors that were popularized after the quantitative turn, such as forests of trees or phylogenetic networks. Speculating about metaphors which could play a more important role in the future, we conclude that a vivid discussion about the usefulness and the concrete implications of metaphors plays an important role for the development of models for language history in historical linguistics.

New Paper on Lexibank Appeared

2022-06-17T12:00:00+00:00

Yesterday, our paper presenting the Lexibank repository (with Robert Forkel, Simon J. Greenhill, Christoph Rzymski, Johannes Englisch, and Russell D. Gray) finally appeared online, after almost 8 years of work on the topic. The paper can be found here.

The past decades have seen substantial growth in digital data on the world’s languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that make these datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). We test the Lexibank workflow on 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

There is also a press release by our institute in English and German.

New Paper on Shared Task and New Accepted Paper

2022-06-28T12:00:00+00:00

Last week, our paper describing the SIGTYP 2022 Shared Task on Reflex Prediction appeared online and can be found here. Furthermore, our preprint titled "Annotating cognates in phylogenetic studies of South-East Asian languages" with Mei-Shin Wu has now been accepted by Language Dynamics and Change. We have already shared our final authors' copy on Humanities Commons, and you can find it online here.

New Members in the CALC Team

2022-07-15T12:00:00+00:00

Our CALC team has new members. John Miller, a doctoral student in Lima with whom we have already collaborated in the past, has now joined us as an external associate. In addition, Mathilda van Zantwijk and Carlos Barrientos Ugarte have joined us as student assistants. We welcome all new members to the CALC group and look forward to a fruitful collaboration.

In addition, two new papers appeared this week. One study (in Spanish), titled "The languages of the Gran Chaco from the perspective of lexical semantics" by Nicolás Brid, in collaboration with Cristina Messineo and myself, appeared in LIAMES and can be accessed here. Another study, the paper presenting our shared task on cognate reflex prediction, common work with Ekaterina Vylomova, Nathan W. Hill, Robert Forkel, and Ryan Cotterell, has now also officially appeared and can be accessed from the ACL Website.

New Blog Post on Colexification Networks

2022-07-18T12:00:00+00:00

Today, a new blog post in our tutorial blog on Computer-Assisted Language Comparison appeared. This post illustrates how colexification networks can be reconstructed with the help of the CL Toolkit package. The post can be found here. In a follow-up post in August, I will show how the networks can be visualized interactively.
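For readers who want a rough idea of what such a network contains, the basic idea can be sketched in a few lines of plain Python. This is only a minimal illustration and does not use the CL Toolkit package: the toy wordlist, its made-up forms, and the function name `colexification_edges` are all assumptions introduced for this example. Two concepts are linked when at least one language expresses both with the same form, and the edge weight counts the languages in which this happens.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlist of (language, concept, form) triples.
# Languages and forms are invented for illustration only.
WORDLIST = [
    ("LangA", "ARM", "taka"),
    ("LangA", "HAND", "taka"),
    ("LangA", "TREE", "pono"),
    ("LangB", "ARM", "miru"),
    ("LangB", "HAND", "miru"),
    ("LangB", "TREE", "salo"),
    ("LangB", "WOOD", "salo"),
    ("LangC", "TREE", "kema"),
    ("LangC", "WOOD", "kema"),
]


def colexification_edges(wordlist):
    """Return {(concept_a, concept_b): number of languages colexifying them}."""
    # Group concepts by identical form within one language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    # Every pair of concepts sharing a form in a language yields one edge.
    languages_by_edge = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_edge[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_edge.items()}


EDGES = colexification_edges(WORDLIST)
```

With real data one would typically also take language families into account, so that closely related languages do not inflate the edge weights; this sketch leaves that step out.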

New German Blog Post

2022-07-20T12:00:00+00:00

Today, a new post appeared in my German blog, this time discussing the disappointment one can experience when getting insights into the real processes that happen "behind the scenes". The post, titled "Von Einblicken in Sterneküchen", can be found here.

New Paper Presenting Database of Gran Chaco Languages

2022-07-22T12:00:00+00:00

Today, a new preprint appeared with Open Research Europe (common work with Nicolás Brid and Cristina Messineo), presenting our database of the languages from the Gran Chaco area. The preprint can be found here and will be peer reviewed openly.

Home to more than twenty indigenous languages belonging to six linguistic families, the Gran Chaco has raised the interest of many linguists from different backgrounds. While some have focused on finding deeper genetic relations between different language groups, others have looked into similarities from the perspective of areal linguistics. In order to contribute to further research of areal and genetic features among these languages, we have compiled a comparative wordlist consisting of translational equivalents for 326 concepts — representing basic and ethnobiological vocabulary — for 26 language varieties. Since the data were standardized in various ways, they can be analyzed both quantitatively and qualitatively. In order to illustrate this in detail, we have carried out an initial computer-assisted analysis of parts of the data by searching for shared lexicosemantic patterns resulting from structural rather than direct borrowings.

New Paper on the History of Uto-Aztecan Languages

2022-08-06T12:00:00+00:00

I am very glad to announce that a study on Uto-Aztecan languages, led by Simon J. Greenhill and Hannah Haynie (with Robert Ross, Angela Chira, Lyle Campbell, Carlos Botero, Russell D. Gray, and myself) has now been accepted for publication in Language. The study will appear officially in 2023, but a preprint has already been shared online, which can be accessed here.

The Uto-Aztecan language family is one of the largest language families in the Americas. However, there has been considerable debate about its origin and how it spread. Here we use Bayesian phylogenetic methods to analyze lexical data from 34 Uto-Aztecan varieties and 2 Kiowa-Tanoan languages. We infer the age of Proto-Uto-Aztecan to be around 4,100 years ago (3,258 - 5,025 years), and identify the most likely homeland to be near what is now southern California. We reconstruct the most probable subsistence strategy in the ancestral Uto-Aztecan society and infer no casual or intensive cultivation, an absence of cereal crops, and a primary subsistence mode of gathering (rather than agriculture). Our results therefore support the timing, geography, and cultural practices of a northern origin, and are inconsistent with alternative scenarios.

My own work in this study consisted in the design of specific methods that help to evaluate to which degree the manually annotated cognate sets would differ from automatically computed ones. Given that such an evaluation has not been done so far in such detail, I hope that we can apply this method in other cases in the future.

Two Blog Posts in August

2022-08-24T12:00:00+00:00

Two blog posts appeared this week: a German blog post discussing etymologies, which you can find here, and a post in our tutorial blog on CALC, which concludes a mini-series of three blog posts devoted to the creation and analysis of colexification networks. This post shows how colexification networks can be visualized and can be found here.

Two Blog Posts in September

2022-09-28T12:00:00+00:00

Two blog posts appeared this week: a German blog post discussing different goals of scientific research, which you can find here, and a post in our tutorial blog on CALC, in which I present the PyEDICTOR tool, which you can find here.

Welcoming Viktor Martinovic to our Group

2022-10-07T12:00:00+00:00

At the beginning of this week, Viktor Martinovic joined our team. He is a PhD student in his final year from Vienna and will collaborate with our group on methods for the handling of lexical borrowing, concentrating specifically on ancient borrowings in a rule-based paradigm. Viktor is generously funded by the Department of Linguistic and Cultural Evolution and associated with the CALC³ group.

New Blog Post and New Accepted Paper

2022-10-16T12:00:00+00:00

Today, my monthly German blog post appeared, this time discussing certain similarities between the age by which children acquire the ability to pronounce certain sounds and the patterns by which sounds change in a language over time. This blog post can be found here.

Additionally, a review paper by Hans Geisler and myself, which we submitted earlier this year to the journal Moderna, has now been accepted for publication and will appear some time next year. A preprint of the study, titled "Of word families and language trees: New and old metaphors in studies on language history", can be found here.

New Blog Post on Querying Data from Lexibank

2022-10-31T12:00:00+00:00

Today, the October blog post for our CALC blog appeared, this time concentrating on querying datasets with cognates from the Lexibank repository. The blog post can be found here.

Major Release of Concepticon 3.0 and NoRaRe 1.0

2022-11-14T12:00:00+00:00

We released Concepticon Version 3.0 (List et al. 2022) and NoRaRe Version 1.0 (Tjuka et al. 2022b). At this point, Concepticon includes 413 concept lists with 41 mapping languages and 3914 concept sets. NoRaRe contains 113 datasets with 75 word properties across 39 languages. With the major releases new data were added to both resources and they were published as CLDF datasets here: concepticon-cldf and norare-cldf. Furthermore, we updated the clld app for Concepticon at https://concepticon.clld.org and created a new one for NoRaRe at https://norare.clld.org.

New Blog Post on Language and Writing

2022-11-15T12:00:00+00:00

Today, my November blog post in German appeared, this time discussing the relation between language and writing. You can find the post here.

New Member of the CALC³ Group

2022-11-19T12:00:00+00:00

Last week, Cristian Juarez finally joined our CALC³ group. He'll investigate the relationship between the Guaycuruan and Mataguayan languages in the South American Gran Chaco area, trying to find out if the attested similarities can be explained by contact or inheritance. We are very happy that Cristian joined our group and hope for a fruitful collaboration in the future.

New Blog Post on Custom Commands in CLDF

2022-11-28T12:00:00+00:00

Today, a new blog post has been published that describes the creation of Custom Commands for CLDF datasets which can be used from the command line. The tutorial uses as an example the creation of Nexus-files out of an existing Lexibank-dataset. The blog post can be found here.

New Versions of PySEM and EvoBib

2022-11-30T12:00:00+00:00

Yesterday, I released new versions of PySEM and EvoBib. PySEM now contains the latest data from Concepticon 3.0, and EvoBib has been extended with more than 1000 additional quotes and dozens of new references.

New Paper on Body Part Suffixes in Panoan Languages

2022-12-09T12:00:00+00:00

Today, a new paper, led by Roberto Zariquiey, appeared in the journal Interface Focus, titled "Untangling the evolution of body-part terminology in Pano: conservative versus innovative traits in body-part lexicalization".

Although language-family specific traits which do not find direct counterparts outside a given language family are usually ignored in quantitative phylogenetic studies, scholars have made ample use of them in qualitative investigations, revealing their potential for identifying language relationships. An example of such a family specific trait are body-part expressions in Pano languages, which are often lexicalized forms, composed of bound roots (also called body-part prefixes in the literature) and non-productive derivative morphemes (called here body-part formatives). We use various statistical methods to demonstrate that whereas body-part roots are generally conservative, body-part formatives exhibit diverse chronologies and are often the result of recent and parallel innovations. In line with this, the phylogenetic structure of body-part roots projects the major branches of the family, while formatives are highly non-tree-like. Beyond its contribution to the phylogenetic analysis of Pano languages, this study provides significative insights into the role of grammatical innovations for language classification, the origin of morphological complexity in the Amazon and the phylogenetic signal of specific grammatical traits in language families.

The paper is open access and can be found here.

New Blogposts and Call for Workshop Abstracts

2022-12-14T12:00:00+00:00

Today, the final blog post for the year in our CALC blog appeared. In it, Abbie Hantgan introduces her ERC project "The Small Bang". You can find the post here.

Already on Monday, I published my final German blog post for the year, this time discussing todo-lists. You can find the post here.

Last but not least, our workshop proposal for the 26th International Conference on Historical Linguistics in Heidelberg in 2023 was accepted, and we are now inviting one-page abstracts related to the workshop's topic "". You can find a detailed call for abstracts with more information here. The deadline is the 1st of January 2023. For questions, you can also contact the organizers (including myself) directly.

New Article Submitted to Open Research Europe

2022-12-19T12:00:00+00:00

We introduced the major release of Concepticon 3.0 (List et al. 2022) and NoRaRe 1.0 (Tjuka et al. 2022b) in an article which is now awaiting peer review at the journal Open Research Europe. The article is available here: https://doi.org/10.12688/openreseurope.15380.1.

CALC Lab Becomes CALC/MCL Lab at the University of Passau

2023-01-01T12:00:00+00:00

With the beginning of 2023, I am now a full professor at the University of Passau, leading the newly funded Chair of Multilingual Computational Linguistics. With this new position, the CALC Lab will also move from Leipzig to Passau, evolving into a new laboratory in which we will extend our work on computer-assisted language comparison (CALC) to the broader field of multilingual applications in computational linguistics (MCL). The transition won't be abrupt, however, as I will keep my affiliation with the Max Planck Institute for Evolutionary Anthropology as well as my position as leader of the CALC³ group until 2024. The new position means that our group will keep growing, since new positions funded by the University of Passau will become available and will hopefully be filled soon. It also means that our research group, so far focused rather narrowly on computer-assisted language comparison, will extend its scope and concentrate more broadly on multilingual approaches in computational linguistics. I look forward to fruitful collaborations with my new colleagues at the University of Passau, and I am very happy to pursue our existing collaborations with colleagues from all around the world.

New Paper on Partial Cognate Annotation

2023-01-09T12:00:00+00:00

Last week, a new paper on partial cognate annotation by Mei-Shin Wu and myself was published. In the study, which you can find here, we discuss the consequences of varying the ways in which partial cognates are annotated and later converted to statements of overall (word-level) cognacy for the purpose of phylogenetic reconstruction.

Compounding and derivation are frequent in many language families. As a consequence, words in different languages are often only partially cognate, sharing some but not all morphemes. While partial cognates do not constitute a problem for the phonological reconstruction of individual morphemes, they are problematic for phylogenetic reconstruction based on comparative word lists. We review current practices of preparing cognate-coded word lists and develop new approaches that make the process of cognate annotation more transparent. Comparing four methods by which partial cognate judgments can be converted to cognate judgments for whole words on a newly annotated data set of 19 Chinese dialect varieties, we find that the choice of conversion method has an impact on the inferred tree topologies that cannot be ignored. We conclude that scholars should take great care with cognate judgments in languages in which compounding and derivation are frequent and recommend always assigning cognates transparently.
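To make the conversion problem concrete, the following minimal sketch contrasts two simple ways of turning morpheme-level cognate IDs into word-level cognate sets. It is an illustration only and not necessarily one of the four methods compared in the paper: the toy data, the labels "strict" and "loose", and the function names are assumptions made for this example. Under the strict reading, two words are cognate only if all their morphemes carry the same cognate IDs; under the loose reading, words are grouped whenever they are connected through at least one shared morpheme.

```python
# Toy data: each word is annotated with the cognate IDs of its morphemes.
# Word labels and IDs are invented for illustration only.
PARTIAL = {
    "w1": [1, 2],
    "w2": [1, 3],
    "w3": [3],
    "w4": [4],
    "w5": [1, 2],
}


def strict_cognates(partial):
    """Words are cognate only if their morpheme cognate IDs are identical."""
    ids = {}
    return {
        word: ids.setdefault(tuple(morphemes), len(ids) + 1)
        for word, morphemes in partial.items()
    }


def loose_cognates(partial):
    """Words are cognate if linked by a chain of shared morpheme cognate IDs."""
    parent = {word: word for word in partial}

    def find(word):
        # Follow parent links to the representative of the word's group.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    first_word_with = {}
    for word, morphemes in partial.items():
        for morpheme_id in morphemes:
            if morpheme_id in first_word_with:
                # Merge this word's group with the group that already has the ID.
                parent[find(word)] = find(first_word_with[morpheme_id])
            else:
                first_word_with[morpheme_id] = word

    ids = {}
    return {word: ids.setdefault(find(word), len(ids) + 1) for word in partial}


STRICT = strict_cognates(PARTIAL)
LOOSE = loose_cognates(PARTIAL)
```

On this toy data, the strict conversion keeps `w1` and `w2` apart although they share a morpheme, while the loose conversion merges `w1`, `w2`, `w3`, and `w5` into one set. Differences of this kind are what can propagate into the inferred tree topologies.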

New Paper on the Internal Classification of Quechua

2023-01-11T12:00:00+00:00

This week, we published the pre-print presenting an (undated) phylogeny for the internal classification of Quechua. The article, available here, was accepted for publication in Indiana. We relate the computational evidence for the different branches to the different hypotheses surrounding the expansions of the Quechua language family. Further, we show that tree models are not incompatible with this data, and how low posterior values in a phylogeny can actually help us identify complex historical scenarios.

We present a computational phylogeny for the internal classification of the Quechua language family. Based on a concept list of 150 lexical items, we manually analyzed data from 39 contemporaneous Quechua varieties for cognacy and computed a family tree using Bayesian phylogenetic methods. The results provide further evidence for the classification of individual varieties and compares the results to the existing hypotheses for the evolution of the Quechua language family.

CALC Blog Posts in 2022

2023-01-12T12:00:00+00:00

We published Volume 5 of Computer-Assisted Language Comparison in Practice. The volume contains PDF versions of all contributions published on the CALC blog in 2022 and is available on Humanities Commons at https://doi.org/10.17613/0df3-gm47.

New Blog Post on Cross-Linguistic Colexifications

2023-01-16T12:00:00+00:00

In the first blog post of 2023, I discuss the origins of cross-linguistic colexifications. The blog post is the first step towards a deeper exploration of this topic and explains the four processes that underlie cross-linguistic colexifications. The post is available here: https://calc.hypotheses.org/5001.

New Blog Posts

2023-01-18T12:00:00+00:00

Today, a new German blog post appeared, in which I look back at the scientific journey that ultimately brought me to Passau. You can find this post here.

New Paper Appeared

2023-01-19T12:00:00+00:00

Our paper (with Nicolás Brid and Cristina Messineo) on "A comparative wordlist for the languages of The Gran Chaco, South America" has now been formally accepted by Open Research Europe and can thus be considered as fully "published". You can find the study here.

Home to more than twenty indigenous languages belonging to six linguistic families, the Gran Chaco has raised the interest of many linguists from different backgrounds. While some have focused on finding deeper genetic relations between different language groups, others have looked into similarities from the perspective of areal linguistics. In order to contribute to further research of areal and genetic features among these languages, we have compiled a comparative wordlist consisting of translational equivalents for 326 concepts — representing basic and ethnobiological vocabulary — for 26 language varieties. Since the data were standardized in various ways, they can be analyzed both quantitatively and qualitatively. In order to illustrate this in detail, we have carried out an initial computer-assisted analysis of parts of the data by searching for shared lexicosemantic patterns resulting from structural rather than direct borrowings.

ERC Portrait and Lecture Materials Online

2023-01-23T12:00:00+00:00

Today, an interview in the German National Contact Point's series of ERC Portraits appeared, in which I answer some questions on my ERC projects and how it was to apply for ERC grants in the past. The interview can be found here.

More than a week ago, I already published the handouts accompanying my lecture in Amsterdam devoted to "Computational Historical Linguistics", which you can find here.

News and Preprints

2023-01-30T12:00:00+00:00

In the last week, an article presenting the major goals of the ProduSemy project appeared in the Passauer Neue Presse.

We also published a new preprint (common work with Nathan W. Hill, Xun Gong, and Seth Knights) on "Computer-Assisted Approaches to Rule-Based Phonological Reconstruction", which you can find here.

The formalization of sound changes as finite state transducers is implicit already in the Neogrammarians. For at least six decades scholars have recognized the potential of transducers for improving the speed and rigor of research in historical linguists, but almost no historical linguists actually use them. This article identifies the obstacles facing the concrete use of transducers and introduces a software package built to reconstruct Proto-Burmish using transducers.

A Preprint and a Forthcoming Study

2023-02-03T12:00:00+00:00

This week, the preprint of a forthcoming study with John Miller appeared, which was accepted for the EACL conference. The study has been archived with arXiv. Titled "Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists", it tests some straightforward methods for borrowing detection.

Language contact is a pervasive phenomenon reflected in the borrowing of words from donor to recipient languages. Most computational approaches to borrowing detection treat all languages under study as equally important, even though dominant languages have a stronger impact on heritage languages than vice versa. We test new methods for lexical borrowing detection in contact situations where dominant languages play an important role, applying two classical sequence comparison methods and one machine learning method to a sample of seven Latin American languages which have all borrowed extensively from Spanish. All methods perform well, with the supervised machine learning system outperforming the classical systems. A review of detection errors shows that borrowing detection could be substantially improved by taking into account donor words with divergent meanings from recipient words.

A preprint of a study yet to be reviewed, titled "Inference of Partial Colexifications from Multilingual Wordlists", also appeared on arXiv. It proposes automated methods for the inference of different partial colexification networks.

The past years have seen a drastic rise in studies devoted to the investigation of colexification patterns in individual languages families in particular and the languages of the world in specific. Specifically computational studies have profited from the fact that colexification as a scientific construct is easy to operationalize, enabling scholars to infer colexification patterns for large collections of cross-linguistic data. Studies devoted to partial colexifications -- colexification patterns that do not involve entire words, but rather various parts of words--, however, have been rarely conducted so far. This is not surprising, since partial colexifications are less easy to deal with in computational approaches and may easily suffer from all kinds of noise resulting from false positive matches. In order to address this problem, this study proposes new approaches to the handling of partial colexifications by (1) proposing new models with which partial colexification patterns can be represented, (2) developing new efficient methods and workflows which help to infer various types of partial colexification patterns from multilingual wordlists, and (3) illustrating how inferred patterns of partial colexifications can be computationally analyzed and interactively visualized.

New Blog Post on Metaphor, Metonymy, Analogy

2023-02-06T12:00:00+00:00

In this blog post, I discuss ideas about metaphor and metonymy from linguistics that highlight the cognitive underpinnings of both notions, as well as a proposal from psychology about how analogical thinking can explain the processing of metaphors. The post is available here: https://calc.hypotheses.org/5234.

New Blog Post and Defended Dissertation

2023-02-14T12:00:00+00:00

Yesterday, a new German blog post appeared, discussing etymologies in the context of the concept of "paper". The post, titled "Vom Verteilen von Papier", can be found here.

Also yesterday, Mei-Shin Wu, former PhD student in our CALC project, successfully defended her PhD thesis, titled "Computer-Assisted Approach to the Comparison of Mainland Southeast Asian Languages". We are all very glad that Mei-Shin finished her thesis successfully and wish her all the best for the future.

New Paper on Uto-Aztecan Origins in Language

2023-02-22T12:00:00+00:00

Yesterday, a paper on the origins of Uto-Aztecan appeared in Language ahead of print.

The Uto-Aztecan language family is one of the largest language families in the Americas. However, there has been considerable debate about its origin and how it spread. Here we use Bayesian phylogenetic methods to analyze lexical data from thirty-four Uto-Aztecan varieties and two Kiowa-Tanoan languages. We infer the age of Proto-Uto-Aztecan to be around 4,100 years (3,258–5,025 years) and identify the most likely homeland to be near what is now Southern California. We reconstruct the most probable subsistence strategy in the ancestral Uto-Aztecan society and infer no casual or intensive cultivation, an absence of cereal crops, and a primary subsistence mode of gathering (rather than agriculture). Our results therefore support the timing, geography, and cultural practices of a northern origin and are inconsistent with alternative scenarios.

The contribution of CALC to this study was a thorough formal test of the quality of cognate judgments that showed that cognate judgments provided by experts basically increase the overall regularity of the amount of words that would exhibit systematic correspondences in the data.

The study can be found here; a preprint is also available in open access.

New Contribution to the How To Do X In Linguistics Series

2023-03-06T12:00:00+00:00

A new contribution to our How To series was published today. I offer an overview of how to organize literature and notes in Zotero. Specifically, I illustrate some of my own workflows for organizing the literature for my dissertation and discuss general features of Zotero. The blog post is available here: https://calc.hypotheses.org/5692.

New Blogpost on the Meanings of Sausage in German

2023-03-16T12:00:00+00:00

Today, a new German blogpost appeared, discussing the word family of German Wurst "sausage" and its counterparts in some other languages. The blogpost can be found here.

New Concepticon Release and New Study Appeared

2023-04-03T12:00:00+00:00

There are several news items from different categories to share. First, a new version of the CLLD Concepticon was published last week. The new version adds several new concept lists and is crucial for the upcoming new version of Lexibank.

Then, another paper appeared in the journal Moderna, together with Hans Geisler, titled "Of word families and language trees. New and old metaphors in studies on language history" (DOI: 10.19272/202201902005).

For a long time, metaphors have played an important role in depicting language history. In this study, we contrast early metaphors on language history, such as the family tree or the wave model, with recent metaphors that were popularized after the quantitative turn, such as forests of trees or phylogenetic networks. Speculating about metaphors which could become important in the future, we conclude that a vivid discussion about the usefulness and the concrete implications of metaphors plays a key role for the development of models for language history in historical linguistics.

This study is unfortunately under closed access, as the open access fees seemed too high to us for a study providing merely a review, but a preprint is freely available, and an author copy can be shared upon request.

New Paper on Trimming Phonetic Alignments Accepted

2023-04-04T12:00:00+00:00

Last week, we heard that a new study on the trimming of phonetic alignments to improve the inference of sound correspondence patterns was accepted for the SIGTYP workshop organized as part of the EACL. The preprint of this study is now also available on arXiv.

Sound correspondence patterns form the basis of cognate detection and phonological reconstruction in historical language comparison. Methods for the automatic inference of correspondence patterns from phonetically aligned cognate sets have been proposed, but their application to multilingual wordlists requires extremely well annotated datasets. Since annotation is tedious and time consuming, it would be desirable to find ways to improve aligned cognate data automatically. Taking inspiration from trimming techniques in evolutionary biology, which improve alignments by excluding problematic sites, we propose a workflow that trims phonetic alignments in comparative linguistics prior to the inference of correspondence patterns. Testing these techniques on a large standardized collection of ten datasets with expert annotations from different language families, we find that the best trimming technique substantially improves the overall consistency of the alignments. The results show a clear increase in the proportion of frequent correspondence patterns and words exhibiting regular cognate relations.

Two Upcoming Talks at 56th SLE Conference

2023-04-05T12:00:00+00:00

Talks by myself and Frederic Blum were accepted at the 56th Annual Meeting of the Societas Linguistica Europaea to be held in Athens. Frederic's talk will be on Re-examining the proposed genetic relationship(s) of Panoan and Tacanan and I myself will present Locative relations and valence extension: multifunctional locative markers in Mocoví (Guaycuruan, Argentina). The full abstracts will be posted soon here.

One more Paper at the SIGTYP Workshop Accepted

2023-04-11T12:00:00+00:00

Another paper, submitted to the SIGTYP workshop organized as part of the EACL, was accepted last week. This study, joint work by Julius Steuer, Badr M. Abdullah, myself, and Dietrich Klakow, investigates information-theoretic aspects of vowel harmony reflected in multilingual wordlists. A first version of this study is already available online.

We present a cross-linguistic study that aims to quantify vowel harmony using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have relied heavily on inflected word-forms in the analysis of vowel harmony. We instead train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists with a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

New Positions in our ProduSemy Project from October 2023

2023-04-12T12:00:00+00:00

In our ERC project "Productive Signs: A computer-assisted investigation of evolutionary, typological, and cognitive aspects of word families", we offer three doctoral positions (3 years with possible extension by one more year). The deadline to apply is May 20. One position is on cognitive aspects of word families (more information can be found here), one on typological aspects of word families (here), and one on evolutionary aspects of word families (more information can be found here).

Additionally, our Chair of Multilingual Computational Linguistics is offering a position for an Akademischer Rat (research assistant) for 3 years with the possibility of extension by 3 more years. We are looking for a candidate who can teach topics in Multilingual Computational Linguistics with a specific focus on machine learning and data management. The deadline for applications is May 16; more information can be found here.

Blogpost on Language in Specific

2023-04-14T12:00:00+00:00

Today, a new German blog post appeared, titled "Von der Sprache im Speziellen", which you can find here. In the post, I discuss how people often view language, and how this conflicts with the linguistic perspective.

Two Three-Year Post-Doc Positions in the ERC-Project ProduSemy

2023-04-19T12:00:00+00:00

Two three-year post-doc positions are available in our ERC project Productive Signs, starting in October 2023. The deadline for applications is May 20, 2023.

One post-doc with a focus on the historical development of word families in the languages of the world:

https://www.uni-passau.de/fileadmin/dokumente/beschaeftigte/Stellenangebote/2023_10_Post_Doc_Prof_List_Projekt_Productive_Signs_I.pdf

The other post-doc focuses on typological aspects of word families:

https://www.uni-passau.de/fileadmin/dokumente/beschaeftigte/Stellenangebote/2023_10_Post_Doc_Prof_List_Projekt_Productive_Signs_II.pdf

English versions of the calls will also be published soon.

We are Hiring

2023-04-21T12:00:00+00:00

In our ProduSemy project and with the Chair of Multilingual Computational Linguistics at the University of Passau, there are several open positions for which doctoral students and post-docs can apply. We are looking both for people who are experienced in machine learning and computational linguistics as well as for people experienced in comparative linguistics (linguistic typology and historical linguistics) and cognitive linguistics and psycholinguistics.

Below is a summary of the positions we offer and where you can find more information on how to apply. Since English calls are not yet available, I kindly ask all those who do not speak German to contact me directly via mcl-admin@uni-passau.de in order to get more information on the positions and how to apply.

Position	Duration	Start Date	Speciality	Deadline	Link
Doctoral Student	3+1 years	October 2023	psycholinguistics	May 20	DE EN
Doctoral Student	3+1 years	October 2023	typology	May 20	DE EN
Doctoral Student	3+1 years	October 2023	historical linguistics	May 20	DE EN
Post-Doc	3+3 years	October 2023 or earlier	computational linguistics / machine learning	May 16	DE EN
Post-Doc	3 years	October 2023	historical linguistics	May 20	DE EN
Post-Doc	3 years	October 2023	typology	May 20	DE EN

New Blog Post on the Release of Concepticon 3.1

2023-04-26T12:00:00+00:00

In this blog post, I provide an overview of the improvements we integrated into the newest version of our Concepticon resource: Concepticon 3.1. After describing the new lists we added to Concepticon 3.1, I illustrate how we refined the concept relations and mappings and show how we deal with potential inconsistencies by use of an example of one list that proved to be inconsistent. The blog post is available here: https://calc.hypotheses.org/5915.

Invited Talk at Research Colloquium

2023-04-27T12:00:00+00:00

On Tuesday, I gave an invited talk at the Current Topics in General Linguistics Colloquium organized by Kilu von Prince at Heinrich Heine University Düsseldorf. I presented a study on body-object colexifications that illustrates workflows based on Lexibank (List et al. 2022). The slides are available here.

Three New Papers in the Context of EACL Appeared

2023-05-08T12:00:00+00:00

In the context of the EACL conference, three new papers have now been published. One study with John Miller, which made it into the main conference, tests new methods for the detection of borrowings from dominant donor languages and can be found under the link https://aclanthology.org/2023.eacl-main.190/. Two more papers appeared as part of the SIGTYP workshop organized by the special interest group for linguistic typology in NLP: One paper led by Julius Steuer tests new ways to investigate vowel harmony on wordlist data and can be found under the link https://aclanthology.org/2023.sigtyp-1.10. Another study by Frederic Blum and myself proposes a novel technique to handle phonetic alignments, which we call trimming. This study can be found under the link https://aclanthology.org/2023.sigtyp-1.6.

New German Blogpost

2023-05-15T12:00:00+00:00

Today, a new blogpost in German appeared, in which I discuss certain aspects of language that we often think of as "natural", but that may turn out to be much harder to recognize as such if one approaches language from the perspective of a mind that does not yet know how to speak. The blogpost, titled "Von der Sprache im Allgemeinen", is available under the link https://wub.hypotheses.org/1935.

TV Interview and New Paper Accepted

2023-05-24T12:00:00+00:00

This week, I was a guest on the German TV show Planet Wissen, discussing the origin and the future of human language. The video can be found here.

Additionally, I learned that my paper on partial colexifications was accepted with Frontiers in Psychology and will soon appear online.

Final Version of Data Note Published

2023-05-25T12:00:00+00:00

Our data note on curating and extending lexical data has been published at Open Research Europe. The article is available here: https://doi.org/10.12688/openreseurope.15380.3

We present the major release of Concepticon 3.0 and NoRaRe 1.0. The article describes the underlying data and methods for maintaining the two resources.

New Paper on Partial Colexifications

2023-06-16T12:00:00+00:00

Today, a new paper was published, presenting a new method for the inference of partial colexifications from multilingual wordlists (DOI: 10.3389/fpsyg.2023.1156540). This is the first study to propose explicit methods to infer partial (as opposed to "full") colexifications from multilingual wordlists.

The past years have seen a drastic rise in studies devoted to the investigation of colexification patterns in individual language families in particular and the languages of the world in general. Specifically computational studies have profited from the fact that colexification as a scientific construct is easy to operationalize, enabling scholars to infer colexification patterns for large collections of cross-linguistic data. Studies devoted to partial colexifications (colexification patterns that do not involve entire words, but rather various parts of words), however, have been rarely conducted so far. This is not surprising, since partial colexifications are less easy to deal with in computational approaches and may easily suffer from all kinds of noise resulting from false positive matches. In order to address this problem, this study proposes new approaches to the handling of partial colexifications by (1) proposing new models with which partial colexification patterns can be represented, (2) developing new efficient methods and workflows which help to infer various types of partial colexification patterns from multilingual wordlists, and (3) illustrating how inferred patterns of partial colexifications can be computationally analyzed and interactively visualized.

New Blog Post

2023-06-19T12:00:00+00:00

Today, a new German blog post, titled "Wer hat Angst vorm Chatprogramm", appeared (see wub.hypotheses.org/1978), in which I briefly discuss the potential implications of artificial intelligence and chat programs, but also potentially unfounded fears about them.

New Blog Post on a Dataset With Phonological Reconstructions in CLDF

2023-06-21T12:00:00+00:00

We present the digitization of a CLDF dataset that involves the reconstruction of Proto-Panoan in a new blogpost (Link: https://calc.hypotheses.org/6142). We discuss the challenges that arise with the parsing of text-based data, and also highlight some potential future use cases for machine-readable data that involves phonological reconstructions. You can access the release of the dataset that we discuss either on GitHub or Zenodo.

New Preprint

2023-06-22T12:00:00+00:00

Today, a new preprint by Yunfan Lai and myself appeared in Open Research Europe. Titled "Lexical data for the historical comparison of Rgyalrongic languages", the study presents a database on Rgyalrongic languages that is in part coded for partial cognates (the article is available in open access, DOI: 10.12688/openreseurope.16017.1).

As one of the most morphologically conservative branches of the Sino-Tibetan language family, most of the Rgyalrongic languages are still understudied and poorly understood, not to mention their vulnerable or endangered status. It is therefore important for available data of these languages to be made accessible. The present lexical data sets provide comparative word lists of 20 modern and medieval Rgyalrongic languages, consisting of word lists from fieldwork carried out by the first author and other colleagues as well as published word lists by other authors. In particular, data of the two Khroskyabs varieties are collected by the first author from 2011 to 2016. Cognate identification is based on the authors' expertise in Rgyalrong historical linguistics through the neogrammarian comparative method. We curated the data by conducting phonemic segmentation and partial cognate annotation. The data sets can be used by historical linguists interested in the etymology and the phylogeny of the languages in question, and they can use them to answer questions regarding individual word histories or the subgrouping of languages in this important branch of Sino-Tibetan.

New Paper Appeared

2023-07-13T12:00:00+00:00

I just found out that a review study that I wrote for a book project published with Springer has now appeared online. The study, titled "Evolutionary Aspects of Language Change" discusses some parallels and differences between language change and biological evolution. Unfortunately, it is not available in open access (DOI: 10.1007/978-3-031-33358-3_6), but a preprint is available via Humanities Commons (DOI: 10.17613/ebas-hj26).

New Blogpost

2023-07-19T12:00:00+00:00

Today, I published a new German blog post, this time discussing how to quote in the humanities, arguing that we need a new debate on citation practice, given the influence of social media and preprint archives on our work (Vom grauen Zitieren).

New Blogpost

2023-08-02T12:00:00+00:00

Last week, a new blog post by Zhenyang Liu, Guillaume Jacques, and myself was published, presenting a new comparative wordlist of Newari, one of the few Sino-Tibetan languages with a long written tradition (Creating a Standardized Comparative Wordlist of Newari Varieties).

New Blogpost

2023-08-17T12:00:00+00:00

Yesterday, a new blog post by Abbie Hantgan and myself appeared, introducing a standardized CLDF wordlist, created from the Dogon Comparative Wordlist by Heath et al. 2016 (Creating a CLDF Wordlist from Heath et al.'s Dogon Comparative Wordlist).

New Blogpost

2023-08-22T12:00:00+00:00

Yesterday, a new blog post in German appeared, discussing scientific practice in the context of referencing and structuring documents ("Strukturprobleme", URL).

Best Presentation Award

2023-09-05T12:00:00+00:00

I am happy to share that my presentation "Locative relations and valence extension: multifunctional markers in Mocoví" received the second prize for best conference paper by starting postdoctoral researchers at the 56th Annual Meeting of the Societas Linguistica Europaea. This achievement would not have been possible without the kind support of our CALC group and the Linguistic and Cultural Evolution Department at MPI-EVA.

Focus Stream on Productive Signs at ICL 2024

2023-09-11T12:00:00+00:00

A first call for papers has been announced for the Focus Stream on "Productive Signs: Evolutionary, Typological, and Cognitive Dimensions of Word Families", organized as part of the 24th International Congress of Linguists, taking place in Poznań from September 8 to 14, 2024.

The call can be found here: https://linguistlist.org/issues/34-2666/

An abstract of the call can be found here: https://ciplnet.com/wp-content/uploads/2023/07/FS-10-Productive-signs.pdf

New Blogpost

2023-09-15T12:00:00+00:00

Today, a new blog post in German appeared, discussing scientific practice in the context of what may be classified as "questionable research practices" (Etiquetten, URL).

New Blogpost on Orthography Profiles

2023-09-27T12:00:00+00:00

Today, a new blog post in our tutorial blog appeared. The post presents an implementation of the orthography profile algorithm in JavaScript (Sequence Manipulation with Orthography Profiles in JavaScript, URL).

Welcoming New Team Members

2023-10-09T12:00:00+00:00

On Monday last week, five new members joined our team: Dr. Kellen Parker van Dam as a chair assistant, Dr. Anna Di Natale and Dr. Matthias Pache as post-docs in our ProduSemy project, and Katja Bocklage and Arne Rubehn as doctoral students in the same project. We hope that all members will like the research atmosphere at our chair in particular and at the University of Passau in general, and we look forward to fruitful collaboration.

New Accepted Papers

2023-10-16T12:00:00+00:00

Two more papers have been accepted. First, a paper by Cormac Anderson et al., in which we measure variation in phoneme inventories, was accepted by the Journal of Language Evolution. Second, a paper by Yunfan Lai and myself presenting a database for Rgyalrongic languages was accepted by Open Research Europe. We hope that preprints and final versions of both papers will appear soon.

New Paper Published

2023-10-19T12:00:00+00:00

Our paper on Rgyalrongic languages with Yunfan Lai has now been published in a second version with Open Research Europe (DOI: 10.12688/openreseurope.16017.2).

As one of the most morphologically conservative branches of the Sino-Tibetan language family, most of the Rgyalrongic languages are still understudied and poorly understood, not to mention their vulnerable or endangered status. It is therefore important for available data of these languages to be made accessible. The lexical data sets the authors have assembled provide comparative word lists of 20 modern and medieval Rgyalrongic languages, consisting of word lists from fieldwork carried out by the first author and other colleagues as well as published word lists by other authors. In particular, data of the two Khroskyabs varieties were collected by the first author from 2011 to 2016. Cognate identification is based on the authors' expertise in Rgyalrong historical linguistics through application of the comparative method. We curated the data by conducting phonemic segmentation and partial cognate annotation. The data sets can be used by historical linguists interested in the etymology and the phylogeny of the languages in question, and they can use them to answer questions regarding individual word histories or the subgrouping of languages in this important branch of Sino-Tibetan.

New Paper Accepted

2023-10-27T12:00:00+00:00

A new study was accepted (common work with Nathan W. Hill, Robert Forkel, and Frederic Blum), titled "Representing and computing uncertainty in phonological reconstruction".

Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.

A preprint of this paper, which will appear in December 2023, is now available from arXiv (DOI: 10.48550/arXiv.2310.12727).

New Blog Posts

2023-10-31T12:00:00+00:00

Two new blog posts were published in the past days. First, one blog post in German, discussing the role that preprints play nowadays ("Vom Vordrucken" https://wub.hypotheses.org/2111). Second, a blog post with Olena Shcherbakova devoted to the investigation of taste colexifications ("Retrieving and Analyzing Taste Colexifications from Lexibank" https://calc.hypotheses.org/6398).

Blog

2023-11-10T12:00:00+00:00

A new blog post was published on Wednesday, discussing the role that typesetting plays in scientific work ("Auch das Auge liest mit", https://wub.hypotheses.org/2134).

New Blog Post

2023-11-16T12:00:00+00:00

A new blog post was published on Wednesday, presenting how transcription systems are modeled in the Cross-Linguistic Transcription Systems reference catalog ("Parsing IPA Transcriptions with CLTS", https://calc.hypotheses.org/6546).

New Article Preprint

2023-11-20T12:00:00+00:00

A new article just appeared as a preprint with Open Research Europe, discussing "Open Problems in Computational Historical Linguistics" (DOI: 10.12688/openreseurope.16804.1).

Problems constitute the starting point of all scientific research. The essay reflects on the different kinds of problems that scientists address in their research and discusses a list of 10 problems for the field of computational historical linguistics, that was proposed throughout 2019 in a series of blog posts. In contrast to problems identified in different contexts, these problems were considered to be solvable, but no solution could be proposed back then. By discussing the problems in the light of developments that have been made in the field during the past five years, a modified list is proposed that takes new insights into account but also finds that the majority of the problems has not yet been solved.

New Article on Phoneme Inventories Published

2023-11-25T12:00:00+00:00

A new article just appeared in the Journal of Language Evolution (joint work with Cormac Anderson, Tiago Tresoldi, Robert Forkel, Simon Greenhill, and Russell Gray, DOI: 10.1093/jole/lzad011).

For over a century, the phoneme has played a central role in linguistic research. In recent years, collections of phoneme inventories, originally designed for cross-linguistic purposes, have increasingly been used in comparative studies involving neighbouring disciplines. Despite the extended application of this type of data, there has been no research into its comparability or tests of its reliability. In this study, we carry out a systematic comparison of nine popular phoneme inventory collections. We render them comparable by linking them to standardised formats for the handling of cross-linguistic datasets, develop new measures to test both size and similarity, and release the organised data in supplementary material. We find considerable differences in inventories supposedly representing the same language variety, both in terms of size and transcriptional choices. While some of these differences appear to be predictable, reflecting design decisions in the different collections, much of the observed variation is unsystematic. These results should sound a note of caution for comparative studies based on phoneme inventories, which we suggest need to take the question of comparability more seriously. We make a number of proposals for improving the comparability of phoneme inventories.

New Article on Uncertainty in Linguistic Reconstruction

2023-12-05T12:00:00+00:00

A new article just appeared in the proceedings of the workshop on language change, organized as part of the EMNLP conference in Singapore this year (with Nathan W. Hill, Robert Forkel, and Frederic Blum, URL: https://aclanthology.org/2023.lchange-1.3/).

Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.

Final Blog Post for the Year

2023-12-21T12:00:00+00:00

Two days ago, my final blog post for the year was published. This time it was very short, discussing tones in Mandarin Chinese and how difficult it is to learn them ("Von pferdeschimpfenden Müttern", URL: https://wub.hypotheses.org/2159).

New Preprint on Body Part Colexification Study

2024-01-02T12:00:00+00:00

Just before the turn of the new year, we submitted our study on body part colexifications. The study presents the first large-scale analysis of body part vocabularies across 1,028 languages. A preprint is available on PsyArXiv: https://osf.io/preprints/psyarxiv/tu74k

Computer-Assisted Language Comparison in Practice

2024-01-17T12:00:00+00:00

The blog "Computer-Assisted Language Comparison in Practice" has been posting various tutorials and small articles on linguistic data since 2018. From 2024 on, the blog will also be available as a journal. The journal "Computer-Assisted Language Comparison in Practice" (available at https://ojs3.uni-passau.de/index.php/calcip/index) will offer digital object identifiers and PDF versions of all blog contributions. More information can be found in an editorial post (together with Annika Tjuka) in the blog (URL: https://calc.hypotheses.org/6651) and the new journal (DOI: 10.15475/calcip.2024.1.1).

New Dataset Paper Published

2024-01-18T12:00:00+00:00

Today the paper "A comparative wordlist for investigating distant relations among languages in Lowland South America" appeared in Scientific Data. In this paper, we describe a new dataset on Panoan, Tacanan, and four other languages that have been claimed to be related to the former. We summarize the state-of-the-art of wordlist annotation and show how such CLDF datasets can easily be linked to others, such as Grambank. The article is available under the following DOI: 10.1038/s41597-024-02928-7

Anchor Points of Trust

2024-01-22T12:00:00+00:00

Today, my German blog for January appeared online, discussing "Ankerpunkte des Vertrauens in den Fluten digitaler Information" (URL).

New Preprint on Productive Signs

2024-01-26T12:00:00+00:00

A new preprint is now available with Humanities Commons, presenting some new ideas regarding the handling of word families in computer-assisted language comparison. The study, titled "Productive Signs: Towards a Computer-Assisted Analysis of Evolutionary, Typological, and Cognitive Dimensions of Word Families" can be accessed at 10.17613/zfwr-sn25.

New Accepted Papers

2024-02-06T12:00:00+00:00

Two new papers that have recently been accepted for publication (as part of the SIGTYP 2024 workshop) are now available as preprints.

The first paper is a study by Jessica Nieder and myself in which we present a new proposal to model mutual intelligibility computationally.

Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it. Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments. To study mutual intelligibility computationally, we propose a computer-assisted method using the Linear Discriminative Learner, a computational model developed to approximate the cognitive processes by which humans learn languages, which we expand with multilingual semantic vectors and multilingual sound classes. We test the model on cognate data from German, Dutch, and English, three closely related Germanic languages. We find that our model's comprehension accuracy depends on 1) the automatic trimming of inflections and 2) the language pair for which comprehension is tested. Our multilingual modelling approach does not only offer new methodological findings for automatic testing of mutual intelligibility across languages but also extends the use of Linear Discriminative Learning to multilingual settings.

Preprint is available on arXiv.

The second paper, written with Luise Häuser, Gerhard Jäger, Taraka Rama, and Alexandros Stamatakis, discusses how well sound correspondences work in phylogenetic reconstruction:

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

Preprint is available on arXiv.

New Accepted Paper

2024-02-15T12:00:00+00:00

My essay on Open problems in computational historical linguistics has now been accepted with the journal Open Research Europe. I'll now have to respond to the four reviews and try to work in their comments for the final version.

New Blog Post on Visualizing Networks

2024-02-19T12:00:00+00:00

Today, my blog post on how to visualize colexification networks in Cytoscape appeared. It is a tutorial for beginners who want to become familiar with the tool and learn how to get started. The post is available here: https://calc.hypotheses.org/6697

New Preprint

2024-02-20T12:00:00+00:00

A new preprint with Frederic Blum, Nathan Hill, and Cristian Juárez appeared online with Open Research Europe, awaiting open peer review. The study is titled "Grouping sounds into evolving units for the purpose of historical language comparison" (DOI: 10.12688/openreseurope.16839.1).

Computer-assisted approaches to historical language comparison have made great progress during the past two decades. Scholars can now routinely use computational tools to annotate cognate sets, align words, and search for regularly recurring sound correspondences. However, computational approaches still suffer from a very rigid sequence model of the form part of the linguistic sign, in which words and morphemes are segmented into fixed sound units which cannot be modified. In order to bring the representation of sound sequences in computational historical linguistics closer to the research practice of scholars who apply the traditional comparative method, we introduce improved sound sequence representations in which individual sound segments can be grouped into evolving sound units in order to capture language-specific sound laws more efficiently. We illustrate the usefulness of this enhanced representation of sound sequences in concrete examples and complement it by providing a small software library that allows scholars to convert their data from forms segmented into sound units to forms segmented into evolving sound units and vice versa.

Additionally, we were informed that our paper (with Robert Forkel and Guillaume Ségerer), titled "Linguistic Survey of India and Polyglotta Africana: Two Retrostandardized Digital Editions of Large Historical Collections of Multilingual Wordlists", was accepted for the LREC-COLING conference in Torino in May.

New Preprint

2024-02-21T12:00:00+00:00

During the weekend, we submitted a new preprint (with Johann-Mattis List), which is now available with Humanities Commons, titled "Finding language-internal cognates in Old Chinese" (DOI: 10.17613/ftm2-3b58).

The investigation of language-internal cognates and word families in Chinese plays a central role in enhancing our understanding of Old Chinese phonology and morphology, as well as constituting a key element for fostering our knowledge of the history of Sino-Tibetan languages. Here we provide an overview of common challenges encountered when searching for language-internal cognates in Old Chinese. We identify three major problems in this endeavor. An epistemological problem arises from varying definitions of word families among scholars, a heuristic problem results from the scarcity of shared workflows for word family identification, and a representation problem follows from the absence of standards for data handling and analysis. While ultimate solutions remain elusive, three suggestions are proposed to enhance future research on Chinese word families. These include advocating for a stricter separation between words and their written representations in Chinese characters, investing time and collaborative efforts in establishing consistent annotation schemes for Chinese word families, and promoting the integration and standardization of data from neighboring languages.

Additionally, we were notified today that our paper submitted to the LREC-COLING conference, titled "First Steps Towards the Integration of Resources on Historical Glossing Traditions in the History of Chinese: A Collection of Standardized Fǎnqiè Spellings from the Guǎngyùn" (also with Johann-Mattis List), was accepted.

New Conference Paper

2024-02-27T12:00:00+00:00

Our conference paper (by Robert Forkel and myself), submitted to the DHd 2024 conference, co-organized by the Chair of Multilingual Computational Linguistics, has now appeared online. In this paper, titled "Cross-Linguistic Data Formats (CLDF): D'où Venons Nous? Que Sommes Nous? Où Allons Nous?" (DOI: 10.5281/zenodo.10698325), we present how the Cross-Linguistic Data Formats were established and how we think they could develop further in the coming years.

For more than ten years now, in collaboration with a large number of researchers in comparative linguistics, we have been developing the so-called Cross-Linguistic Data Formats (CLDF), a collection of standards that, based on tabular data formats, serves to prepare the great wealth of knowledge that linguistic research has opened up over the past 200 years in such a way that it can be systematically aggregated, integrated with other datasets, and transparently analyzed. Despite initial difficulties, our efforts have proven very successful, even though some things that we first thought would be easy to realize turned out to be extremely complicated. Already today, the largest lexical and typological-grammatical collections of language data are available in CLDF, and no end is in sight so far. In our study, we present how CLDF became what it is today, and where we see the standard formats in the future.

New Preprint and Software Tool

2024-03-03T12:00:00+00:00

Yesterday, I deposited a new preprint that presents a new method that models sound change in ordered layers of simultaneous sound laws (DOI: 10.17613/4n5z-9y52).

In historical linguistics, sound change is typically modeled with the help of linearly arranged replacement rules that scan over an input sequence in a fixed order, converting the initial sequence in an iterative manner until all sound laws are exhausted. Arguing that this model of cascades of sound laws falls short in many regards, this study proposes a new model of sound change in which sound laws are grouped into linearly arranged layers in which sound change proceeds simultaneously. Illustrating how this model can be implemented with the help of an open, web-based tool, several examples are shown to prove the usefulness of the new model.

The tool presented in this study has also been published in a first stable version (Version 0.2) and can be accessed at https://misol.edictor.org.
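The difference between a cascade of sound laws and layers of simultaneous sound laws can be illustrated with a minimal sketch. It assumes context-free rules of the form "source segment becomes target segment" and invented example data; it is not the rule syntax or the implementation of the tool mentioned above, and all function names are hypothetical.

```python
def apply_cascade(segments, rules):
    """Apply rules one after another: each rule sees the output of the previous one."""
    current = list(segments)
    for source, target in rules:
        current = [target if seg == source else seg for seg in current]
    return current


def apply_layer(segments, layer):
    """Apply all rules of one layer at once: every rule sees the same input."""
    mapping = dict(layer)
    return [mapping.get(seg, seg) for seg in segments]


def apply_layers(segments, layers):
    """Apply layers in their given order; within a layer, change is simultaneous."""
    current = list(segments)
    for layer in layers:
        current = apply_layer(current, layer)
    return current


# Invented example: two rules, p > f and f > h.
word = ["p", "a", "f"]
rules = [("p", "f"), ("f", "h")]

print(apply_cascade(word, rules))                    # ['h', 'a', 'h']
print(apply_layers(word, [rules]))                   # ['f', 'a', 'h']
print(apply_layers(word, [[rules[0]], [rules[1]]]))  # ['h', 'a', 'h']
```

In the cascade, the f created by the first rule is fed into the second rule. When both rules sit in the same layer, only the original f is affected. Placing each rule in a layer of its own reproduces the cascade.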

New Preprint

2024-03-04T12:00:00+00:00

A new preprint (currently under review), titled "Directional Tendencies in Semantic Change", is now available from Humanities Commons (DOI: 10.17613/0y0r-f341). We investigate to what degree semantic motivation patterns in word formation reflect patterns in semantic change (joint work with Anna Di Natale, Annika Tjuka and Johann-Mattis List).

Due to its complexity, scholars have often hesitated to establish relative chronologies in semantic change. While occasionally researchers postulated universal directions of semantic change, concrete proposals for inferring or estimating these from lexical data are rare. According to one hypothesis, however, cross-linguistic directional tendencies in semantic motivation underlying word formation could provide direct hints regarding directional tendencies of semantic change. We revisit this idea using new data from independent sources and new methods for data analysis and exploration. Our results show that there is only a small overlap with respect to concrete processes of semantic change and concrete processes of semantic motivation. For this small overlap, however, we find positive correlations when comparing weight ratios of semantic change and semantic motivation data, and we also receive precision values exceeding 0.5 when trying to predict directional tendencies of semantic change from semantic motivation patterns. This indicates that, while semantic change and semantic motivation are generally distinct processes, there are certain cross-linguistic tendencies in semantic motivation that can provide hints regarding directional tendencies of semantic change.

Article Published in Cognitive Science

2024-03-05T12:00:00+00:00

Together with my colleague Yoolim Kim, I published an article entitled "Cognitive Science From the Perspective of Linguistic Diversity" in Cognitive Science. The article is part of the letter series "Progress & Puzzles of Cognitive Science". We address the question of the comparability of word meanings in different languages and the neglect of an integrated approach to writing systems. The article is available here: https://doi.org/10.1111/cogs.13418

New Release of EvoBib

2024-03-08T12:00:00+00:00

Yesterday, I released a new version of EvoBib (Version 1.7.0). Since its last release in November 2022, the database has been extended further by adding several hundred quotes and expanding the literature.

New Papers

2024-03-19T12:00:00+00:00

Two new papers have now been officially published as part of the SIGTYP workshop in Malta.

First, there is a paper by Jessica Nieder and myself, discussing mutual intelligibility and proposing a way to model it computationally (URL: https://aclanthology.org/2024.sigtyp-1.4/).

Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it. Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments. To study mutual intelligibility computationally, we propose a computer-assisted method using the Linear Discriminative Learner, a computational model developed to approximate the cognitive processes by which humans learn languages, which we expand with multilingual semantic vectors and multilingual sound classes. We test the model on cognate data from German, Dutch, and English, three closely related Germanic languages. We find that our model’s comprehension accuracy depends on 1) the automatic trimming of inflections and 2) the language pair for which comprehension is tested. Our multilingual modelling approach does not only offer new methodological findings for automatic testing of mutual intelligibility across languages but also extends the use of Linear Discriminative Learning to multilingual settings.

Then there is a paper by Luise Häuser, Gerhard Jäger, Taraka Rama, myself, and Alexandros Stamatakis, discussing the usefulness of phylogenetic reconstruction with sound correspondences (URL: https://aclanthology.org/2024.sigtyp-1.11/).

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

Concepticon Release Version 3.2

2024-03-20T12:00:00+00:00

We released a new version of Concepticon. Version 3.2 contains 17 new concept lists and improvements to concept mappings. We also included an updated format for the representation of concept lists containing networks so that they are handled more uniformly. The CLDF dataset of Concepticon v3.2 is available here: https://github.com/concepticon/concepticon-cldf/tree/v3.2.0.

New Paper on Fǎnqiè Spellings

2024-03-22T12:00:00+00:00

Today, a new paper, titled "First Steps Towards the Integration of Resources on Historical Glossing Traditions in the History of Chinese: A Collection of Standardized Fǎnqiè Spellings from the Guǎngyùn" was published as a preprint with Humanities Commons (DOI: 10.17613/q3yt-pd95). It will appear in the proceedings of the LREC-COLING conference in Torino in May.

Due to the peculiar nature of the Chinese writing system, it is difficult to assess the pronunciation of historical varieties of Chinese. In order to reconstruct ancient pronunciations, historical glossing practices play a crucial role. However, although studied thoroughly by numerous scholars, most research has been carried out in a qualitative manner, and no attempt at providing integrated resources of historical glossing practices has been made so far. Here, we present a first step towards the integration of resources on historical glossing traditions in the history of Chinese. Our starting point are so-called fǎnqiè spellings in the Guǎngyùn, one of the early rhyme books in the history of Chinese, providing pronunciations for more than 20000 Chinese characters. By standardizing digital versions of the resource using tools from computational historical linguistics, we show that we can predict historical spellings with high precision and at the same time shed light on the precision of ancient glossing practices. Although a considerably small first step, our resource could be the starting point for an integrated, standardized collection that could ultimately shed new light on the history of Chinese.

New Blog Post

2024-03-27T12:00:00+00:00

My blog post for March has now been published, this time dealing with literature for children and rhyme patterns, Von Falschen Rhymen.

New Preprint for Study on Partial Body-Object Colexifications

2024-03-28T12:00:00+00:00

We submitted a paper entitled "Partial Colexifications Reveal Directional Tendencies in Object Naming". The study represents the first cross-linguistic investigation of partial colexifications between body and object concepts. We address the question of how meaning is extended between two concrete domains. The preprint is available here: https://doi.org/10.31234/osf.io/hc3j5

New Accepted Papers and New Team Members

2024-04-09T12:00:00+00:00

This week, we welcome three new members to our team. First, Christian Bentz joined the chair as an independent research group leader with his ERC Starting Grant project EVINE. Then, Alžběta Kučerová joined our ProduSemy project as a PhD student, and finally, David Snee is now enrolled with us as an independent PhD student (he will join us officially as member of the ProduSemy project in October).

We also received notification that two more papers have been accepted. Our study on cognates in Chinese with Michele Pulini was accepted with the Bulletin of Chinese Linguistics, and my review study on Productive Signs was accepted for the edited volume accompanying the focus streams of the International Conference of Linguists in September 2024.

New Blog Post online

2024-04-24T12:00:00+00:00

Today, our new blog post about the representation of Zalizniak et al.'s (2024) Catalogue of Semantic Shifts in CLDF appeared. We explain step by step how we converted the data into the standardized format. You can find it here (DOI: 10.15475/calcip.2024.1.4).

New Blog Post

2024-04-29T12:00:00+00:00

My blog post for April has now been published, this time dealing with deadlines, Von toten Linien.

New Study on Body Part Semantics Appeared

2024-05-08T12:00:00+00:00

Our study on body part semantics by Annika Tjuka, Robert Forkel, and myself has now appeared online with Scientific Reports (URL here).

Every human has a body. Yet, languages differ in how they divide the body into parts to name them. While universal naming strategies exist, there is also variation in the vocabularies of body parts across languages. In this study, we investigate the similarities and differences in naming two separate body parts with one word, i.e., colexifications. We use a computational approach to create networks of body part vocabularies across languages. The analyses focus on body part networks in large language families, on perceptual features that lead to colexifications of body parts, and on a comparison of network structures in different semantic domains. Our results show that adjacent body parts are colexified frequently. However, preferences for perceptual features such as shape and function lead to variations in body part vocabularies. In addition, body part colexification networks are less varied across language families than networks in the semantic domains of emotion and colour. The study presents the first large-scale comparison of body part vocabularies in 1,028 language varieties and provides important insights into the variability of a universal human domain.

New Preprint

2024-05-10T12:00:00+00:00

Our paper "Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats" (with Jessica Nieder, Robert Forkel, and Johann-Mattis List) has recently been accepted for the SCiL 2024 conference. The preprint is now available on arXiv: https://doi.org/10.48550/arXiv.2405.04271

When comparing speech sounds across languages, scholars often make use of feature representations of individual sounds in order to determine fine-grained sound similarities. Although binary feature systems for large numbers of speech sounds have been proposed, large-scale computational applications often face the challenges that the proposed feature systems -- even if they list features for several thousand sounds -- only cover a smaller part of the numerous speech sounds reflected in actual cross-linguistic data. In order to address the problem of missing data for attested speech sounds, we propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog. Since CLTS is actively used in large data collections, covering more than 2,000 distinct language varieties, our procedure for the generation of binary feature vectors provides immediate access to a very large collection of multilingual wordlists. Testing our feature system in different ways on different datasets proves that the system is not only useful to provide a straightforward means to compare the similarity of speech sounds, but also illustrates its potential to be used in future cross-linguistic machine learning applications.

The presented Python package "soundvectors" can be installed via pip, and the source code is available on GitHub.

New Papers Published

2024-05-21T12:00:00+00:00

Two new studies have now appeared officially as part of the joint LREC / COLING conference in Torino.

A paper by Robert Forkel, myself, Christoph Rzymski and Guillaume Ségerer presents Linguistic Survey of India and Polyglotta Africana: Two Retrostandardized Digital Editions of Large Historical Collections of Multilingual Wordlists.

The Linguistic Survey of India (LSI) and the Polyglotta Africana (PA) are two of the largest historical collections of multilingual wordlists. While the originally printed editions have long since been digitized and shared in various forms, no editions in which the original data is presented in standardized form, comparable with contemporary wordlist collections, have been produced so far. Here we present digital retro-standardized editions of both sources. For maximal interoperability with datasets such as Lexibank the two datasets have been converted to CLDF, the standard proposed by the Cross-Linguistic Data Formats initiative. In this way, an unambiguous identification of the three main constituents of wordlist data – language, concept and segments used for transcription – is ensured through links to the respective reference catalogs, Glottolog, Concepticon and CLTS. At this level of interoperability, legacy material such as LSI and PA may provide a reasonable complementary source for language documentation, filling in gaps where original documentation is not possible anymore.

A paper by Michele Pulini and myself presents First Steps Towards the Integration of Resources on Historical Glossing Traditions in the History of Chinese: A Collection of Standardized Fǎnqiè Spellings from the Guǎngyùn.

Due to the peculiar nature of the Chinese writing system, it is difficult to assess the pronunciation of historical varieties of Chinese. In order to reconstruct ancient pronunciations, historical glossing practices play a crucial role. However, although studied thoroughly by numerous scholars, most research has been carried out in a qualitative manner, and no attempt at providing integrated resources of historical glossing practices has been made so far. Here, we present a first step towards the integration of resources on historical glossing traditions in the history of Chinese. Our starting point are so-called fǎnqiè spellings in the Guǎngyùn, one of the early rhyme books in the history of Chinese, providing pronunciations for more than 20000 Chinese characters. By standardizing digital versions of the resource using tools from computational historical linguistics, we show that we can predict historical spellings with high precision and at the same time shed light on the precision of ancient glossing practices. Although a considerably small first step, our resource could be the starting point for an integrated, standardized collection that could ultimately shed new light on the history of Chinese.

New Study Published

2024-05-22T12:00:00+00:00

As part of the SIGUL 2024 workshop organized with the LREC / COLING conference in Torino this year, a new study by Frederic Blum, Johannes Englisch, Alba Hermida-Rodríguez, Rik van Gijn, and myself has now been published, presenting a new approach for the Resource Acquisition for Understudied Languages.

New blog post online

2024-05-27T12:00:00+00:00

Today, a new blog post titled "Implementing Fuzzy Spelling Search in Dictionaries of Under-Described Languages Lacking Standard Orthographies" appeared. In this post, different approaches to implementing fuzzy string matching for an online dictionary are discussed, focusing on the issue of non-standard writing systems for under-resourced languages. A simple finite state transducer is presented as a good approach, with sample code and a minimal working example. It can be found here (DOI: 10.15475/calcip.2024.1.5).
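To give a rough idea of what fuzzy spelling search over non-standard orthographies involves, here is a minimal sketch. It is a simplified stand-in and not the code from the blog post: instead of a full finite state transducer, it uses longest-match rewrite rules that map spelling variants onto a shared search key, which is the kind of mapping a transducer could encode. The variant table, the headwords, and all function names are invented for illustration.

```python
# Hypothetical spelling variants, each mapped onto a shared search key symbol.
VARIANTS = {"sh": "s", "x": "s", "k": "c", "w": "u", "hu": "u"}


def search_key(word, variants=VARIANTS):
    """Rewrite a spelling into a normalized key, preferring the longest match."""
    word = word.lower()
    longest = max(len(variant) for variant in variants)
    key, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in variants:
                key.append(variants[chunk])
                i += size
                break
        else:
            # No rule applies at this position: keep the character unchanged.
            key.append(word[i])
            i += 1
    return "".join(key)


def build_index(headwords):
    """Group dictionary headwords by their normalized search key."""
    index = {}
    for headword in headwords:
        index.setdefault(search_key(headword), []).append(headword)
    return index


def fuzzy_lookup(query, index):
    """Return all headwords whose search key equals the key of the query."""
    return index.get(search_key(query), [])


index = build_index(["shawi", "kuna"])  # invented headwords
print(fuzzy_lookup("xahui", index))     # ['shawi']
print(fuzzy_lookup("cuna", index))      # ['kuna']
```

Normalizing both headwords and queries means that a user who writes "xahui" still finds an entry listed as "shawi", as long as the variant table covers the spelling habits in question.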

New Blog Post

2024-05-29T12:00:00+00:00

On Sunday, a new blog post in German appeared, "Von feinen Zügen unterm Radar", discussing the phenomenon of aphantasia. You can find it here.

New Paper

2024-05-31T12:00:00+00:00

Having passed review successfully, my paper on Open Problems in Computational Historical Linguistics has now appeared in its final version (DOI).

Problems constitute the starting point of all scientific research. The essay reflects on the different kinds of problems that scientists address in their research and discusses a list of 10 problems for the field of computational historical linguistics, that was proposed throughout 2019 in a series of blog posts (see http://phylonetworks.blogspot.com/). In contrast to problems identified in different contexts, these problems were considered to be solvable, but no solution could be proposed back then. By discussing the problems in the light of developments that have been made in the field during the past five years, a modified list is proposed that takes new insights into account but also finds that the majority of the problems has not yet been solved.

New Blog Post

2024-06-24T12:00:00+00:00

Yesterday, a new German blog post appeared, this time discussing the redundancy in language, allowing us to encode the same message in multiple different ways. The post is titled "Fünf vor zwölf mit halbleerem Glas" (URL: https://wub.hypotheses.org/2384).

Grouping Sounds Paper Accepted

2024-06-25T12:00:00+00:00

Our Grouping Sounds paper with Frederic Blum, Nathan Hill and Cristian Juárez has now been officially accepted with Open Research Europe (DOI: 10.12688/openreseurope.16839.1). Having passed review means we will write one revision of the study in which we account for minor remarks by the reviewers in the next week.

Study on Sound Vectors Published

2024-07-01T12:00:00+00:00

Our paper "Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats" (together with Jessica Nieder, Robert Forkel, and Johann-Mattis List) was published last week as part of the Proceedings of the 2024 Meeting of the Society for Computation in Linguistics (SCiL) (DOI: 10.7275/scil.2144).

New Blog Post Published

2024-07-18T12:00:00+00:00

Yesterday, a new blog post was published in our CALC blog and journal. The post is titled "Converting an Artificial Proto-Language into Data for Testing Computational Approaches in Historical Linguistics" and shows how data for an artificially created language can be automatically retrieved and converted to formats that allow the data to be compared with other datasets.

This small study shows how data for an artificially created language that was supposed to reflect features of “proto-languages”, predating modern languages by several thousand years, can be used in testing computational approaches in historical linguistics. In order to do so, a computational workflow is described that retrieves the data automatically, creating a comparative wordlist compatible in format with software tools for historical linguistics, and then uses a baseline method for automatic cognate detection to compare an artificial language against a sample of Indo-European languages. The results show that artificial languages might help to fill a gap in testing that has so far been ignored in the literature.

The post can be found online here or in article form via its DOI.

New Blog Post Published

2024-08-06T12:00:00+00:00

Yesterday, a new blogpost for Computer-Assisted Language Comparison in Practice was published. It is titled "Generating Phonological Feature Vectors with SoundVectors and CLTS" and introduces the recently released Python library SoundVectors, briefly comparing it to PHOIBLE and PanPhon.

The recently published Python library soundvectors offers a simple and robust method to derive phonological feature vectors for any valid IPA sound via its canonical description. It is designed to interact neatly with the Cross-Linguistic Transcription Systems reference catalog (CLTS), which dynamically parses valid strings in phonetic transcription to describe speech sounds. This study illustrates how both systems can be used together to generate phonological feature vectors for all kinds of sounds without relying on a previously defined lookup table. Additionally, it compares the generated feature vectors with those obtained from two other prominent databases, PanPhon and PHOIBLE, showing how those systems can be accessed from the CLTS data via its Python API pyclts.

The post can be found online here or in article form via its DOI.

Paper in ACL Workshop Proceedings

2024-08-12T12:00:00+00:00

Today, with the beginning of the ACL conference in Bangkok, our paper presenting EDICTOR 3 appeared as part of the proceedings of this year's LChange workshop. The paper presents the major new features and improvements that made it into the new version of the EDICTOR application.

Computer-assisted approaches to historical and typological language comparison have made great progress over the past two decades. Specifically for the classical tasks of historical language comparison, many computational methods have been published that mimic certain steps of the traditional workflow of the comparative method. In contrast to the diversity of new computational methods, there is only a limited number of interactive tools and interfaces that help scholars to curate and refine their data both before and after the application of computational methods. One of the few publicly available interfaces is EDICTOR (https://edictor.org), an interactive tool for computer-assisted language comparison. EDICTOR has been around for some time, and allows scholars to annotate and align cognate sets in various ways. With EDICTOR 3, the original tool has been enhanced, offering not only new features for data annotation, but also providing the possibility to use purely automatic methods for initial cognate detection, phonetic alignment, and correspondence pattern inference in an integrated workflow.

The paper can be accessed here. EDICTOR 3 is now also available as a Python package on PyPI.

New Blog Post

2024-08-19T12:00:00+00:00

Last week, my German blog post for August appeared, this time discussing citation practice in newspapers (see Kopflose Fußnoten).

New Paper Published

2024-08-21T12:00:00+00:00

Our paper that introduces how sounds can be grouped into evolving units (with Frederic Blum, Nathan W. Hill, and Cristian Juárez) has now appeared online in its final version with Open Research Europe (DOI: 10.12688/openreseurope.16839.2).

Computer-assisted approaches to historical language comparison have made great progress during the past two decades. Scholars can now routinely use computational tools to annotate cognate sets, align words, and search for regularly recurring sound correspondences. However, computational approaches still suffer from a very rigid sequence model of the form part of the linguistic sign, in which words and morphemes are segmented into fixed sound units which cannot be modified. In order to bring the representation of sound sequences in computational historical linguistics closer to the research practice of scholars who apply the traditional comparative method, we introduce improved sound sequence representations in which individual sound segments can be grouped into evolving sound units in order to capture language-specific sound laws more efficiently. We illustrate the usefulness of this enhanced representation of sound sequences in concrete examples and complement it by providing a small software library that allows scholars to convert their data from forms segmented into sound units to forms segmented into evolving sound units and vice versa.

In addition to the reviews, sound grouping has now also been fully integrated as a feature in EDICTOR 3.
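The conversion between forms segmented into sound units and forms segmented into evolving sound units, as described in the abstract, can be sketched as follows. This is a minimal illustration and not the software library that accompanies the paper: the use of a dot to join grouped segments, the list of clusters to be grouped, and the function names are all assumptions made for this example.

```python
def group_segments(segments, clusters, joiner="."):
    """Join listed clusters of adjacent segments into single grouped units.

    Assumption: grouped units are written by joining their segments with
    `joiner`, and longer clusters take precedence over shorter ones.
    """
    ordered = sorted((tuple(c) for c in clusters if c), key=len, reverse=True)
    units, i = [], 0
    while i < len(segments):
        for cluster in ordered:
            if tuple(segments[i:i + len(cluster)]) == cluster:
                units.append(joiner.join(cluster))
                i += len(cluster)
                break
        else:
            # No cluster starts here: the segment remains a unit of its own.
            units.append(segments[i])
            i += 1
    return units


def ungroup_segments(units, joiner="."):
    """Split grouped units back into their individual sound segments."""
    return [segment for unit in units for segment in unit.split(joiner)]


segments = ["t", "r", "a", "u", "m"]   # invented form, one sound per unit
clusters = [("t", "r"), ("a", "u")]    # hypothetical language-specific groupings

grouped = group_segments(segments, clusters)
print(grouped)                         # ['t.r', 'a.u', 'm']
print(ungroup_segments(grouped))       # ['t', 'r', 'a', 'u', 'm']
```

Because grouping only adds joiners between existing segments, the conversion is lossless in both directions.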

New Blogpost Appeared

2024-09-03T12:00:00+00:00

Yesterday, a new blog post in our CALCiP appeared, titled "Adding Standardized Transcriptions to Panoan and Tacanan Languages in the Intercontinental Dictionary Series" (available via its URL or its DOI).

In this study, we illustrate how standardized phonetic transcriptions can be added to the data for Panoan and Tacanan languages provided by the Intercontinental Dictionary Series. The result is presented as a new dataset that keeps reference to the original data and adds phonetic transcriptions for each word form in Panoan languages, Tacanan languages, as well as Spanish and Portuguese.

Interview About the ORE Languages and Literature Gateway

2024-09-05T12:00:00+00:00

Already in July, an interview in which I answered questions on the Languages and Literature Gateway with Open Research Europe (ORE) appeared. The interview was published on the ORE blog and can be found here.

New Blog Post

2024-09-09T12:00:00+00:00

Yesterday, my monthly German blog post appeared, this time dealing with the question of bringing things in order and searching for them: Vom Ordnen und Suchen. In this post, I also introduce a new way to search for articles by typing key words from their titles into a search prompt, which I integrated in my professional website (see here).

Podcast about language evolution

2024-09-11T12:00:00+00:00

I was interviewed for a podcast (MDR Wissen: Große Fragen in zehn Minuten) about language evolution (in German). (https://www.ardaudiothek.de/episode/grosse-fragen-in-zehn-minuten-von-mdr-wissen/warum-ist-sprache-entstanden/mdr/13691975/)

New Paper Published

2024-09-20T12:00:00+00:00

Our paper (with Simonetta Montemagni and John Nerbonne) presenting an information-theoretic method for detecting characteristic phonetic correspondences in dialectal data, including a case study on Tuscan, has now appeared online in its final version with Language Dynamics and Change (DOI: 10.1163/22105832-bja10034).

We present a novel approach to identifying individual pairs of phonetic correspondences in a dataset of dialect pronunciations. This continues work identifying shibboleths (i.e., characteristic features of a given dialect), a category that has interested dialectology and that dialectometrical research has examined mostly in the form of categorical data or entire phonetic transcriptions. This article reaches into segmental sequences (phonetic transcriptions) to identify individual phonetic correspondences. We follow earlier work in examining how distinctive and how representative a given phonetic correspondence is for a selected group of varieties. We proceed from string alignments, and innovate in characterizing the important notions via information theory. Despite minor problems, the method improves on the generality of competing approaches and can be shown to be useful in detecting characteristic phonetic correspondences in Tuscan varieties. We argue that this facilitates deeper investigation into the relation between aggregating approaches to dialectology and approaches proceeding from features.

New Paper Published

2024-09-24T12:00:00+00:00

Our paper showing that 'Consonant lengthening marks the beginning of words across a diverse sample of languages' has now appeared online with Nature Human Behaviour (DOI: 10.1038/s41562-024-01988-4). In a sample of 51 diverse languages, we have found that word-initial consonants are systematically longer than their counterparts in other positions. While the study only analyzes observational data, we think this might be one of several cues for segmenting the acoustic stream into words, present in possibly most of the world's languages.

Speech consists of a continuous stream of acoustic signals, yet humans can segment words and other constituents from each other with astonishing precision. The acoustic properties that support this process are not well understood and remain understudied for the vast majority of the world’s languages, in particular regarding their potential variation. Here we report cross-linguistic evidence for the lengthening of word-initial consonants across a typologically diverse sample of 51 languages. Using Bayesian multilevel regression, we find that on average, word-initial consonants are about 13 ms longer than word-medial consonants. The cross-linguistic distribution of the effect indicates that despite individual differences in the phonology of the sampled languages, the lengthening of word-initial consonants is a widespread strategy to mark the onset of words in the continuous acoustic signal of human speech. These findings may be crucial for a better understanding of the incremental processing of speech and speech segmentation.

New Team Members

2024-10-04T12:00:00+00:00

Our chair welcomes two new team members. David Snee, who has already started his dissertation with us, officially joined the ProduSemy project in September. From October on, Dr. Luca Ciucci will join the ProduSemy project, working on word families in South American languages. We welcome both to our team and hope for a fruitful collaboration.

New blog post online

2024-10-07T12:00:00+00:00

Today, a new blog post titled Preparing Acoustic Pitch Data for Computational Analysis and Presentation appeared. This study presents the issues with using raw pitch data as Hertz values, some historical efforts to resolve these issues, and two solutions that are more appropriate than some of the more widely used systems, with a way to easily calculate these alternative systems in a short Python script. It can be found here (DOI: 10.15475/calcip.2024.2.4).
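As a rough illustration of why raw Hertz values are often transformed before analysis, the following sketch converts a frequency to semitones relative to a reference frequency. This is a generic logarithmic transformation and is not taken from the blog post; the function name and the default reference of 100 Hz are assumptions made for this example, and the post's own two solutions may differ.

```python
import math


def hertz_to_semitones(f_hz, ref_hz=100.0):
    """Express a frequency as semitones above (or below) a reference.

    ref_hz is a hypothetical default; in practice one might use a
    speaker-specific value such as the speaker's mean or minimum pitch.
    """
    if f_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    # One octave (a doubling of frequency) corresponds to 12 semitones.
    return 12.0 * math.log2(f_hz / ref_hz)


octave_up = hertz_to_semitones(200.0, ref_hz=100.0)
octave_down = hertz_to_semitones(50.0, ref_hz=100.0)
```

On this scale, equal frequency ratios map to equal distances, so a rise from 100 Hz to 200 Hz and a rise from 200 Hz to 400 Hz both count as 12 semitones.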

New Paper accepted for publication

2024-10-11T12:00:00+00:00

Our paper by Ruben van de Vijver, Adam Ussishkin and myself, submitted to Cognitive Science, has been accepted for publication. In this paper, titled "Emerging roots: Investigating early access to meaning in Maltese auditory word recognition" (DOI: https://doi.org/10.31234/osf.io/hwumf), we investigate access to meaning in early word recognition in the Semitic language Maltese through a computational model.

In Semitic languages, the consonantal root is central to morphology, linking form and meaning. While psycholinguistic studies highlight its importance in language processing, the role of meaning in early lexical access and its representation remain unclear. This study investigates when meaning becomes accessible during the processing of Maltese verb forms, using a computational model based on the Discriminative Lexicon framework. Our model effectively comprehends and produces Maltese verbs, while also predicting response times in a masked auditory priming experiment. Results show that meaning is accessible early in lexical access and becomes more prominent after the target word is fully processed. This suggests that semantic information plays a critical role from the initial stages of lexical access, refining our understanding of real-time language comprehension. Our findings contribute to theories of lexical access and offer valuable insights for designing priming studies in psycholinguistics. Additionally, this study demonstrates the potential of computational models in investigating the relationship between form and meaning in language processing.

New German Blog Post

2024-10-21T12:00:00+00:00

Yesterday, my German blog post for October appeared, titled Von türstehenden Gutachtern, discussing scientific practice that keeps critical studies via negative reviews away from certain journals. You can find it here.

New Papers

2024-10-31T12:00:00+00:00

Three new papers have appeared over the last week.

First, a study by Dubi Nanda Dhakal, myself, and Seán G. Roberts appeared in the Journal of Language Evolution:

This study performs primary data collection, transcription, and cognate coding for eight South West Tibetic languages (Lowa, Gyalsumdo, Nubri, Tsum, Yohlmo, Kagate, Jirel, and Sherpa). This includes partial cognate coding, which analyses linguistic relations at the morpheme level. Prior resources and inferences are leveraged to conduct a Bayesian phylogenetic analysis. This helps estimate the extent to which the historical relationships between the languages represent a tree-like structure. We argue that small-scale projects like this are critical to wider attempts to reconstruct the cultural evolutionary history of Sino-Tibetan and other families.

The study can be found here.

Then, a study by Jessica Nieder, Ruben van de Vijver and Adam Ussishkin appeared in Cognitive Science:

In Semitic languages, the consonantal root is central to morphology, linking form and meaning. While psycholinguistic studies highlight its importance in language processing, the role of meaning in early lexical access and its representation remain unclear. This study investigates when meaning becomes accessible during the processing of Maltese verb forms, using a computational model based on the Discriminative Lexicon framework. Our model effectively comprehends and produces Maltese verbs, while also predicting response times in a masked auditory priming experiment. Results show that meaning is accessible early in lexical access and becomes more prominent after the target word is fully processed. This suggests that semantic information plays a critical role from the initial stages of lexical access, refining our understanding of real-time language comprehension. Our findings contribute to theories of lexical access and offer valuable insights for designing priming studies in psycholinguistics. Additionally, this study demonstrates the potential of computational models in investigating the relationship between form and meaning in language processing.

The study can be found here.

Finally, a study by Kellen Parker van Dam and Thüküvelü Sakhamo appeared in the proceedings of the First International Conference on Social Sciences:

This paper presents a brief outline of some important cultural and historical aspects of Porba Village, a Chokrimi community in Phek District, Nagaland. Information is given on the traditional history of the community, including the settlement, the importance of major cultural practices, and the structure of the society. Though societal norms have changed considerably, these practices and heritage still hold importance to the residents of Porba village as a significant part of their culture and identity. In addition, the clan system, practices around the naming of babies, the system of inheritance, the economy system, education, and language are also discussed.

The study can be found here.

New Blog Post on Typing Special Characters

2024-11-04T12:00:00+00:00

Today, a new blog post on Typing Special Characters as a Key Skill for Linguists appeared:

Most linguists have to type special characters that are not available on an ordinary keyboard on a regular basis. Reflecting about the general problems involved in typing special characters, I review different solutions and argue that linguists should not only be able to type special characters on their computers, but that they should also have some basic knowledge about their technical aspects and know how to expand and customize them. In order to improve the training of young scholars, it is important to discuss special character typing more openly in linguistics, especially in the classroom and with doctoral students, sharing individual solutions openly.

The blog post can be read online here; a PDF version can be downloaded from the corresponding journal website via its DOI (10.15475/calcip.2024.2.5).

New Paper

2024-11-07T12:00:00+00:00

The paper on partial colexifications of body and object terms by Annika Tjuka and myself appeared yesterday:

Expressions in which the word for a body part is also used for objects can be found in many languages. Some languages use body part terms to refer to object parts, while others have only a few idiosyncratic examples in their vocabulary. Studying the word forms referring to body and object concepts, i.e., colexifications, across languages, offers insights into cognitive principles facilitating such usage. Previous studies focused on full colexifications in which the same word form expresses two distinct concepts. Here, we utilize a new approach that allows us to analyze partial colexifications in which a concept is built out of the word forms for two separate concepts, like river mouth. Based on a large lexical database, we identified body and object concepts and analyzed 39 colexifications across 329 languages. The results show that word forms for body concepts are used slightly more frequently as a source for object names. However, the detailed examination of directional tendencies and colexifications of word forms between body and object concepts reveals linguistic variation. The study sheds light on meaning extensions between two concrete domains and showcases the synergies that arise through the combination of existing data and methods.

Unfortunately, it is not yet available as an open-access publication, but the preprint can be accessed here (DOI: 10.1515/gcla-2024-0005).
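The difference between full and partial colexification can be sketched with a toy example in Python. The code below is not the method used in the paper: it simply checks whether the word form for one concept is contained in the word form for another, using English orthography and the "river mouth" example from the abstract. The function name and the concept labels are assumptions made for illustration.

```python
def partial_colexifications(wordlist):
    """Find (part, whole) concept pairs in a single language.

    wordlist maps concept labels to word forms. A pair is reported when
    the form for the first concept is a proper substring of the form for
    the second. Plain substring matching on orthography is a crude
    stand-in for a comparison of phonetic transcriptions.
    """
    pairs = []
    for part_concept, part_form in wordlist.items():
        for whole_concept, whole_form in wordlist.items():
            if part_concept == whole_concept or part_form == whole_form:
                continue  # identical forms would be full colexifications
            if part_form in whole_form:
                pairs.append((part_concept, whole_concept))
    return sorted(pairs)


# Toy wordlist with hypothetical concept labels.
english = {"MOUTH": "mouth", "RIVER": "river", "RIVER MOUTH": "river mouth"}
found = partial_colexifications(english)
```

Here both MOUTH and RIVER are reported as parts of RIVER MOUTH, which mirrors the case described in the abstract where one concept is built out of the word forms for two separate concepts.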

New Blog Post

2024-11-18T12:00:00+00:00

Today, my monthly German blog post appeared, this time dealing with the phenomenon of ghosting in the context of science: Von Geistern verlassen.

New Blog Post

2024-12-11T12:00:00+00:00

Already on Sunday, my monthly German blog post appeared, this time dealing with the boring tasks in scientific work: Vom Schaffen in der Wissenschaft.

New Blog Post on Using CLDFBench and PyLexibank on Windows

2024-12-18T12:00:00+00:00

Today, a new blog post on Using CLDFBench and PyLexibank on Windows appeared:

Due to idiosyncrasies in the Windows operating system, certain workarounds may be necessary to successfully execute the CLDF conversion workflow using CLDFBench and Pylexibank. This blog post illustrates how this workflow can be efficiently implemented on a Windows 10 operating system using the hattorijaponic dataset as an example.

The blog post can be read online here; a PDF version can be downloaded from the corresponding journal website (DOI: 10.15475/calcip.2024.2.6).

New Paper on 'Cognate Reflex Prediction as hypothesis test for a genealogical relation between the Panoan and Takanan language families'

2024-12-29T12:00:00+00:00

Our paper Cognate Reflex Prediction as hypothesis test for a genealogical relation between the Panoan and Takanan language families has now appeared in Scientific Reports (DOI: 10.1038/s41598-024-82515-3). In the paper, we present a study in which we converted reconstructions from one proto-language into stipulated reflexes in another, potentially related language, in order to test the hypothesized relationship between the two. The number of correct matches among our predictions leads us to consider them further evidence for a genealogical relation between the two language families involved.

We present a novel approach for testing genealogical relations between language families. Our method, which has previously only been applied to closely related languages, makes predictions for cognate reflexes based on the regularity of proposed sound correspondences between language families that are hypothesized to be related. We test the hypothesis about a genealogical relation between Panoan and Takanan, two linguistic families of the Amazon. The workflow contributes to new ideas of hypothesis testing in historical linguistics and can likely be transferred to other language families. We predict 206 cognate reflexes from Shipibo-Konibo, a Panoan language, from independently proposed Proto-Takanan reconstructions and test our predictions in elicitation sessions with speakers of the language. We found 21 correct predictions from the core-set, as well as another 20 correct predictions from the extended set of predictions. In addition to confirming the previously established sound correspondence patterns, we find further evidence for additional patterns that suggest the reconstruction of three new phonemes for Proto-Pano-Takanan.

Interview on Language Universals

2025-01-07T12:00:00+00:00

I was interviewed by the German Press Agency (DPA) regarding recent studies on language universals. The article by Doreen Garud, titled "Was die Welt sprachlich verbindet", was printed and published online in many venues, including the Passauer Neue Presse.

New Paper on Object Naming

2025-01-14T12:00:00+00:00

We are happy to announce that our paper, Kučerová, Alžběta and List, Johann-Mattis (2025): Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages, has been accepted and will appear in the Proceedings of the Global WordNet Conference 2025.

It is available under the following DOI: 10.48550/arXiv.2501.08312 and link, where it can also be downloaded.

Object naming – the act of identifying an object with a word or a phrase – is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, or language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study tries to make current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and comparing the conceptual spaces of covered object naming datasets with classical basic vocabulary lists from historical linguistics and linguistic typology. Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.

New Blog Post in German

2025-01-23T12:00:00+00:00

Today, a new blog post in German appeared, titled "Geduld und Ungeduld in der Wissenschaft". The post is available here and discusses the role of patience and impatience in research.

A talk at the University of Regensburg

2025-01-24T12:00:00+00:00

On 23.01.2025, Alžběta Kučerová gave a talk on an experimental study into the perception of Czech speech at the Colloquium on Slavic and Albanian Linguistics at the University of Regensburg.

New Blog Post

2025-01-28T12:00:00+00:00

Yesterday, we published a new blog post on How to Run EDICTOR 3 Locally. It introduces the workflow for setting up your local PC to run the new EDICTOR 3 release on your own computer instead of relying on online servers, including the necessary preprocessing, package installation, and configuration settings.

EDICTOR3 offers many ways of comparing language data with computer-assisted methods. This study offers a short overview of how to run EDICTOR3 locally, without the need for uploading the data to a server or being connected to the internet, while maintaining all the functionalities. In a first step, we will show how one can download a Lexibank dataset and create different types of files that one can use with EDICTOR. We will then proceed to present the possibility of running an EDICTOR server locally and to edit the dataset that one has downloaded.

The post can be found online here or in article form via its DOI.

New Preprint

2025-02-17T12:00:00+00:00

A new study, titled "Partial Colexifications Improve Concept Embeddings" (together with J.-M. List, currently under review), is now available as a preprint on arXiv.

While the embedding of words has revolutionized the field of Natural Language Processing, the embedding of concepts has received much less attention so far. A dense and meaningful representation of concepts, however, could prove useful for several tasks in computational linguistics, especially those involving cross-linguistic data or sparse data from low resource languages. First methods that have been proposed so far embed concepts from automatically constructed colexification networks. While these approaches depart from automatically inferred polysemies, attested across a larger number of languages, they are restricted to the word level, ignoring lexical relations that would only hold for parts of the words in a given language. Building on recently introduced methods for the inference of partial colexifications, we show how they can be used to improve concept embeddings in meaningful ways. The learned embeddings are evaluated against lexical similarity ratings, recorded instances of semantic shift, and word association data. We show that in all evaluation tasks, the inclusion of partial colexifications leads to improved concept representations and better results. Our results further show that the learned embeddings are able to capture and represent different semantic relationships between concepts.

New Preprint

2025-02-18T12:00:00+00:00

A new study titled "From Isolates to Families: Using Neural Networks for Automated Language Affiliation" (together with Steffen Herbold and Johann-Mattis List, currently under review) has now appeared as a preprint on arXiv.

In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.

New Blog Post

2025-02-26T12:00:00+00:00

Together with Luise Häuser, we published a new blog post today in our CALC tutorial blog, presenting a new benchmark database for computational historical linguistics (URL: https://calc.hypotheses.org/8227, PDF: 10.15475/calcip.2025.1.2).

Computational approaches in historical linguistics have made great progress during the past two decades. As of now, it is much more common to propose subgroupings based on phylogenetic analyses than on traditional considerations using shared innovations. We have also seen a drastic increase in openly available datasets that share cognate judgments for various language families. Thanks to new standardization efforts providing facilitated access to several dozen comparative wordlists, it seems about time to work on improved benchmarks of manually annotated cognates in computational historical linguistics. In this study, a first effort of this kind is undertaken, by presenting Lexibench, a preliminary gold standard for computational historical linguistics. Lexibench builds on the Lexibank repository to extract 63 multilingual wordlists, all manually annotated for cognacy, that can be used to assess the quality of cognate detection and phylogenetic reconstruction methods in computational historical linguistics.

New Preprint

2025-03-04T12:00:00+00:00

We are happy to announce that two new studies are now available as preprints on arXiv.

David Snee, Luca Ciucci, Arne Rubehn, Kellen Parker van Dam, and Johann-Mattis List (2025): "Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists".

Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.

Arne Rubehn, Christoph Rzymski, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Katja Bocklage, David Snee, Abishek Stephen, and Johann-Mattis List (2025): "Annotating and Inferring Compositional Structures in Numeral Systems Across Languages".

Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

New Preprint

2025-03-17T12:00:00+00:00

A new preprint by Annika Tjuka, Robert Forkel, Christoph Rzymski and myself is available now, presenting an improved version of the Database of Cross-Linguistic Colexifications (DOI: 10.48550/arXiv.2503.11377).

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with an enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.

New Accepted Paper

2025-03-18T12:00:00+00:00

A study by Michele Pulini and myself was accepted with the Second Workshop on Ancient Language Processing and is now available as a preprint on Humanities Commons (DOI: 10.17613/4wvf7-qva13).

Ancient Chinese documents written on bamboo slips more than 2000 years ago offer a rich resource for research in linguistics, paleography, and historiography. However, since most documents are only available in the form of scans, additional steps of analysis are needed to turn them into interactive digital editions, amenable both for manual and computational exploration. Here, we present a first attempt to establish a workflow for the annotation of ancient bamboo slips. Based on a recently rediscovered dialogue on warfare, we illustrate how a digital edition amenable for manual and computational exploration can be created by integrating standards originally designed for cross-linguistic data collections.

New Blog Post

2025-03-21T12:00:00+00:00

Two days ago, a new blog post appeared, this time discussing compounds in English and German that result from a process called "contamination". The post can be found here.

New Blog Post

2025-04-28T12:00:00+00:00

Today, a new blog post appeared in our CALCiP tutorial blog / journal, this time presenting the PyLexibench package together with Luise Häuser and Robert Forkel (URL: https://calc.hypotheses.org/8267, DOI: 10.15475/calcip.2025.1.4).

With PyLexibench we introduce a small Python package that can be used to populate the Lexibench benchmark for computational historical linguistics with benchmark data. Here, we introduce the package and show how it helps to access and expand Lexibench. We also introduce new data for character matrices in various forms and formats and lay out how we intend to use the package to manage Lexibench releases in the future.

News, News, News

2025-05-02T12:00:00+00:00

Our team gladly and cordially welcomes two new team members. Jekaterina Mažara joins us in the position of an assistant to the chair (Akademische Rätin), pursuing independent research on psycholinguistics and teaching courses in the same area. Abishek Stephen visits us in the summer term as an independent doctoral student to collaborate on morpheme segmentation.

Yesterday, I released EvoBib 1.10 (https://evobib.digling.org; data available via Zenodo at https://zenodo.org/). The release adds about 100 new bibliographic entries and several hundred quotes.

At the same time, our study on a digital edition of the Cao Mo Zhi Zhen with Michele Pulini appeared in the proceedings of the 2nd Workshop on Ancient Language Processing (https://aclanthology.org/2025.alp-1.4/).

Ancient Chinese documents written on bamboo slips more than 2000 years ago offer a rich resource for research in linguistics, paleography, and historiography. However, since most documents are only available in the form of scans, additional steps of analysis are needed to turn them into interactive digital editions, amenable both for manual and computational exploration. Here, we present a first attempt to establish a workflow for the annotation of ancient bamboo slips. Based on a recently rediscovered dialogue on warfare, we illustrate how a digital edition amenable for manual and computational exploration can be created by integrating standards originally designed for cross-linguistic data collections.

Interview on Language Universals

2025-05-04T12:00:00+00:00

Today, an interview with SWR Kultur appeared, in which I talk with Julia Nestlen about those aspects that languages have in common and where they differ (interview is available here).

Interview on Computational Linguistics

2025-05-12T12:00:00+00:00

Today, a written interview with the Arbeitsstelle Kleine Fächer appeared, in which I answer general questions on Computational Linguistics as a scientific discipline. You can find the interview online (https://www.kleinefaecher.de/beitraege/blogbeitrag/computerlinguistik).

Two Papers at ACL

2025-05-16T12:00:00+00:00

Two papers were accepted for the ACL main conference, the paper by Frederic Blum, Steffen Herbold, and myself on language affiliation (From Isolates to Families: Using Neural Networks for Automated Language Affiliation), and the paper by Arne Rubehn and myself on concept embeddings (Partial Colexifications Improve Concept Embeddings). We are all of course very happy that we made it in this form to the main conference of the ACL and look forward to presenting our work in Vienna.

New German Blog Post

2025-05-21T12:00:00+00:00

A German blog post appeared on Monday, discussing how a specific kind of numerical annotation revolutionized the development of juggling patterns. The blog post in German can be found here.

New Paper on the Indigenous Languages of the Americas

2025-06-12T12:00:00+00:00

Today a paper by Marcin Kilarski and me, titled "Investigating the Indigenous languages of the Americas: History and prospects", appeared online in its final version, and it can be found here. The preprint version is available here.

In this paper, we address the state of the art in the study of the Indigenous languages of the Americas and reflect on the perspectives for future research. Since the first 16th-century grammatical descriptions, new data have contributed to the development of language study and the birth of modern linguistics and continue to inform linguistic theory. Ongoing documentation helps language preservation, while historical data improve our understanding of present-day languages and contribute to their revitalisation. Linguistic descriptions have also affected the perception of Indigenous languages and cultures over time, reflecting beliefs and prejudices about less familiar languages. We illustrate the genetic and typological diversity of American languages and reflect on their contribution to linguistic theory, which is illustrated with a few selected features. Finally, we offer some remarks about ongoing documentation.

New Blog Post

2025-06-20T12:00:00+00:00

Yesterday, I published my monthly German blog post, this time discussing the limits of scientific insights: Von der Vorläufigkeit der Erkenntnisse.

Lexibank Paper Published

2025-06-23T12:00:00+00:00

Three years after we published Lexibank for the first time, we have now published its second version, enriched by more datasets and higher standards regarding data quality. The study by Blum et al. (2025), presenting the database, can be found on Open Research Europe (DOI: 10.12688/openreseurope.20216.2).

Large-scale lexical and grammatical datasets nowadays play an important role in comparative linguistics. However, the lack of standardization remains a challenge exacerbating extension and reuse of published data. We present an updated version of Lexibank, a large-scale lexical dataset, expanding on previous efforts to standardize and unify cross-linguistic data. This new version includes over 3,100 languages and more than one-and-a-half million word forms, substantially broadening the scope and utility of the previous resource. Our dataset has been systematically curated using a dedicated computer-assisted workflow designed specifically for the lifting of published wordlist data to the standards recommended by the Cross-Linguistic Data Formats initiative. The expanded dataset features standardized references to language varieties, standardized semantic glosses that reference the concepts expressed by individual word forms, and standardized phonetic transcriptions for all word forms that our repository contains. Based on those standardizations we pre-compute semantic and phonological features, which can be used to carry out extensive automated analyses. We illustrate this potential by providing dedicated database queries to (1) infer words that are similar in pronunciation and meaning, (2) identify concepts that are colexified across languages in our sample, and (3) assess the semantic diversity of etymologically related words. These queries are not only fast to execute but also global in their scope, due to the large-scale coverage provided by Lexibank 2. The queries are also easy to extend, thus having the potential to contribute to various studies in historical linguistics, linguistic typology, and related disciplines. The updated dataset is a substantial step forward in the effort to create comprehensive, standardized, and accessible linguistic resources.
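To give a rough idea of what the second kind of query, identifying colexified concepts, amounts to, here is a minimal Python sketch. It is not one of the database queries from the paper: the function name, the toy rows, and the use of plain orthographic forms instead of standardized phonetic transcriptions are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlist rows (language, concept, form). Illustrative assumption:
# orthographic forms stand in for the phonetic transcriptions used in Lexibank.
ROWS = [
    ("Russian", "HAND", "ruka"), ("Russian", "ARM", "ruka"),
    ("Russian", "FINGER", "palec"), ("Russian", "TOE", "palec"),
    ("Spanish", "HAND", "mano"), ("Spanish", "ARM", "brazo"),
    ("Spanish", "FINGER", "dedo"), ("Spanish", "TOE", "dedo"),
    ("English", "HAND", "hand"), ("English", "ARM", "arm"),
    ("English", "FINGER", "finger"), ("English", "TOE", "toe"),
]


def colexifications(rows):
    """Map each colexified concept pair to the set of languages showing it."""
    # Group concepts by (language, form): one form expressing several
    # concepts within the same language is a full colexification.
    concepts_by_form = defaultdict(set)
    for language, concept, form in rows:
        concepts_by_form[(language, form)].add(concept)

    # Collect, for every pair of concepts sharing a form, the languages involved.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)
    return dict(languages_by_pair)


if __name__ == "__main__":
    for pair, languages in sorted(colexifications(ROWS).items()):
        print(pair, sorted(languages))
```

Grouping by language and form first keeps the sketch linear in the number of rows; only forms that actually express more than one concept contribute concept pairs.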

ACL Oral Presentation

2025-07-07T12:00:00+00:00

Our paper on language affiliation (From Isolates to Families: Using Neural Networks for Automated Language Affiliation) was selected for an oral presentation at the ACL Conference in Vienna. Since fewer than 10% of the accepted papers are considered for this, we are very proud of this achievement. The paper presents a supervised neural network approach to the affiliation of individual languages to families using Lexibank and Grambank datasets.

Frederic Blum, Steffen Herbold, and Johann-Mattis List. 2025. From Isolates to Families: Using Neural Networks for Automated Language Affiliation, https://arxiv.org/abs/2502.11688.

New Blog Post on Attitudes in Science

2025-07-15T12:00:00+00:00

Already last week, my monthly German blog post appeared, this time discussing attitudes towards scientific work and the merits of scientific research in humanities and natural sciences (https://wub.hypotheses.org/2895).

Two Long Papers at ACL in Vienna

2025-07-23T12:00:00+00:00

Two papers that we submitted as long papers to the ACL conference in Vienna have been accepted and have now appeared officially in print.

The first paper, by Frederic Blum, Steffen Herbold, and myself, presents our study on automated language affiliation ("From Isolates to Families: Using Neural Networks for Automated Language Affiliation", URL).

In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,200 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.

The second study by Arne Rubehn and myself presents our work on concept embeddings ("Partial Colexifications Improve Concept Embeddings", URL).

While the embedding of words has revolutionized the field of Natural Language Processing, the embedding of concepts has received much less attention so far. A dense and meaningful representation of concepts, however, could prove useful for several tasks in computational linguistics, especially those involving cross-linguistic data or sparse data from low resource languages. First methods that have been proposed so far embed concepts from automatically constructed colexification networks. While these approaches depart from automatically inferred polysemies, attested across a larger number of languages, they are restricted to the word level, ignoring lexical relations that would only hold for parts of the words in a given language. Building on recently introduced methods for the inference of partial colexifications, we show how they can be used to improve concept embeddings in meaningful ways. The learned embeddings are evaluated against lexical similarity ratings, recorded instances of semantic shift, and word association data. We show that in all evaluation tasks, the inclusion of partial colexifications leads to improved concept representations and better results. Our results further show that the learned embeddings are able to capture and represent different semantic relationships between concepts.

In addition, Kellen Parker van Dam today published a study in our Blog / Journal on Computer-Assisted Language Comparison in Practice ("Digitizing Legacy Lexical Data of Muishaung for Computer-Assisted Language Comparison", DOI).

This study describes the process of digitizing legacy materials into a computer-readable format for the purposes of computational typology and computer-assisted historical reconstruction. It presents a comparative wordlist that is made available in the formats recommended by the Cross-Linguistic Data Formats initiative.

Two Papers at the SIGTYP Workshop at ACL in Vienna

2025-07-31T12:00:00+00:00

Two papers that we submitted to the SIGTYP workshop at the ACL conference in Vienna have been accepted and have now appeared officially in print.

The first paper by David Snee et al. ("Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists", URL) presents robustness tests on concept translation in multilingual wordlists.

Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.

The second study by Arne Rubehn et al. presents our work on numeral annotation ("Annotating and Inferring Compositional Structures in Numeral Systems Across Languages", URL).

Numeral systems across the world’s languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

New Blog Post on Ambiguities in German

2025-08-07T12:00:00+00:00

A new blog post discussing ambiguities in German compounds appeared today, titled "Von subjektiven und objektiven Fällen" (URL: https://wub.hypotheses.org/2928).

New Blog Post on Templates for NoRaRe

2025-08-26T12:00:00+00:00

A new blog post presenting templates for the Database of Norms, Ratings, and Relations appeared already yesterday (URL: https://calc.hypotheses.org/8723).

This study introduces a collection of templates that can be used to contribute data to the Database of Norms, Ratings, and Relations (NoRaRe) of words and concepts. The templates are intended to facilitate the process of dataset conversion and serve as a starting point for those who are interested in contributing data to the catalog. A first template structure with two sample datasets is introduced and discussed in more detail, pointing to those aspects of data curation that may lead to confusion among users who contribute to the NoRaRe database for the first time.

New Blog Post on Historical Linguistics

2025-09-15T12:00:00+00:00

A new German blog post that appeared today discusses some interesting parallels regarding the introduction of digital and computational approaches in mathematics and historical linguistics ("Schon gesehen", URL: https://wub.hypotheses.org/3018).

New Blog Post on Semantic Embeddings in NoRaRe

2025-09-17T12:00:00+00:00

A new blog post presenting workflows for integrating and retrieving semantic embeddings with the Database of Norms, Ratings, and Relations has appeared today (URL: https://calc.hypotheses.org/8723).

This study illustrates how semantic embeddings can be added to and retrieved from NoRaRe. By that, it provides a template for handling vector data and makes popular methodology in semantic modeling available for cross-linguistic comparison.

A new book on non-verbal predication in the world's languages

2025-09-22T12:00:00+00:00

A new book on non-verbal predication, co-edited by Luca Ciucci, has just been published in the series Comparative Handbooks of Linguistics. Its 33 chapters, written by international experts, present a new typological framework for the study of non-verbal predication and provide detailed descriptions from selected languages and families across Eurasia, the Americas, Africa, and Oceania. Particular attention is given to languages from traditionally little-described families.

This work is the result of the collaboration of 40 scholars over more than five years and consists of two volumes with a total of about 1,300 pages. It is intended to serve as a reference work on non-verbal predication for years to come.

Bertinetto, Pier Marco, Luca Ciucci & Denis Creissels (eds.). 2025. Non-verbal predication in the world’s languages: A typological survey. Volume 1: Eurasia, North America, South America. Berlin & Boston: De Gruyter Mouton.

Bertinetto, Pier Marco, Luca Ciucci & Denis Creissels (eds.). 2025. Non-verbal predication in the world’s languages: A typological survey. Volume 2: Africa, Austronesia, Papunesia, Australia. Berlin & Boston: De Gruyter Mouton.

The work is so far available in electronic form; the print edition will appear on November 3rd.

News at the Chair

2025-10-01T12:00:00+00:00

Christian Bentz, who had joined the chair last year with his ERC research group, has left the chair for a professorship in Saarbrücken. While this is sad news for the chair, it is fantastic news for Christian. We wish Christian all the best for the new challenges that await him now, and we are very thankful that we had the chance to have him here with us for at least some time, as his work was always very inspiring for many of us. Given that distances can be easily bridged with modern communication, we will surely stay in contact and look forward to collaborating with Christian in the future.

New Preprint Submitted

2025-10-02T12:00:00+00:00

Today, we published a new preprint by Katja Bocklage, with many other people from our chair helping with annotations and Thanasis Georgakopoulos joining us in the analysis. In this study, we put partial affix colexifications to the test and find that they may be a useful way to address certain problems in lexical typology. The preprint can now be found online (DOI); the abstract is given below.

Cross-linguistic colexification patterns have proven useful for quantitative studies in lexical typology. While most studies focus on full colexification, where senses are co-expressed by the same word form, recent studies have proposed to compute partial colexifications, where senses are not colexified by entire words, but only by parts of them. Among these, affix colexifications, where one word recurs at the end or the beginning of another word, show interesting properties, potentially reflecting word formation processes giving hints on cross-linguistic motivation patterns. In order to test their potential, we conduct a detailed case study. Based on a large sample of cross-linguistic partial colexification patterns, computed from the Database of Cross-Linguistic Colexifications, we first check to which degree partial colexifications reflect true cases of lexical motivation and then carry out a detailed comparison of concept relations underlying frequent partial and full colexification patterns. Our results show that partial affix colexifications that recur across five and more language families tend to reflect true lexical motivation patterns in almost 90% of all cases. Furthermore, we find that the majority of affix colexifications and full colexifications reflect contiguity relations. However, the proportion of contiguity relations in partial colexifications exceeds the proportion of contiguity relations in full colexifications (50% vs. 40%), showing that there are differences in the semantics reflected by both colexification types.
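The notion of an affix colexification described in the abstract, one word recurring at the beginning or the end of another, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the minimum-length threshold, and the use of orthographic characters as segments are hypothetical choices and are not taken from the preprint.

```python
def affix_colexification(short, long, min_length=3):
    """Return "prefix", "suffix", or None for two segmented word forms.

    `short` and `long` are sequences of sound segments. The `min_length`
    threshold is an assumption added here to skip very short forms, which
    match by chance easily.
    """
    short, long = tuple(short), tuple(long)
    # The shorter form must be long enough and strictly shorter than the
    # other form; identical forms would be full colexifications instead.
    if len(short) < min_length or len(short) >= len(long):
        return None
    if long[:len(short)] == short:
        return "prefix"
    if long[-len(short):] == short:
        return "suffix"
    return None


if __name__ == "__main__":
    # German "Hand" (hand), "Schuh" (shoe), "Handschuh" (glove), with
    # lowercase orthography standing in for phonetic transcription.
    print(affix_colexification("hand", "handschuh"))
    print(affix_colexification("schuh", "handschuh"))
```

A match of this kind is only a candidate: as the abstract notes, whether it reflects true lexical motivation has to be checked separately, for instance by requiring that the pattern recurs across several language families.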

New Paper from IWCS in Düsseldorf Published

2025-10-04T12:00:00+00:00

Our paper presenting the CLICS⁴ database, which we released this year, has now been published as part of the IWCS conference in Düsseldorf. The study, by Annika Tjuka, Robert Forkel, Christoph Rzymski, and myself, presents innovations that we introduced along with the fourth installment of the Database of Cross-Linguistic Colexifications (paper available in open access, see this URL).

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.

Gefühlte Tatsachen

2025-10-20T12:00:00+00:00

This month's German blog post deals with the scientific construct and the topic of "perceived inflation" ("Gefühlte Tatsachen", URL).

New Preprint on Qù Tone Alternations

2025-10-22T12:00:00+00:00

A new preprint with Barbara Meisterernst appeared today at Open Research Europe, presenting "A database of qù-tone alternations in Ancient Chinese". The preprint is accessible from the journal, awaiting peer review now (DOI: 10.12688/openreseurope.21142.1). The database can also be inspected already (URL: https://qualternations.digling.org).

Alternations in the departing tone (qù-tone) in Ancient Chinese have for a long time fascinated scholars, since they seem to hint at relics of morphology in the history of Chinese, contrasting strongly with the isolating structure of all modern varieties of Chinese. Here we present a transparently assembled collection of departing tone alternations in the history of Chinese, derived from Lù Démíng's historical annotation of the classics, the Jīngdiǎn Shìwén, which gives early hints on such alternations by means of historical fǎnqiè spellings. The database is available in two flavors. On the one hand, it can be accessed through a web interface that allows interested users to browse through the data in linked form. On the other hand, the data is available in the formats recommended by the Cross-Linguistic Data Formats initiative. This format not only offers quick programmatic access to experienced users, but is also particularly suited for the purpose of archiving the data. In the study, we describe how the data was assembled and curated, and illustrate how the data can be put to concrete use.

New Blog Post on Lexibank FormSpec

2025-10-27T12:00:00+00:00

A new blog post appeared today, presenting the Lexibank FormSpec, a function in the PyLexibank library that can be used to handle forms in multilingual wordlists. The post, titled "Manipulating Lexical Forms with the PyLexibank FormSpec", can be accessed online (URL: https://calc.hypotheses.org) or via our Journal landing page (DOI: https://doi.org/10.15475/calcip.2025.2.3).

Multilingual lexical data is typically stored in a wide variety of forms, based on many idiosyncratic decisions that vary from dataset to dataset. Here, a simple but efficient solution for the manipulation of lexical data in multilingual wordlists will be introduced. This solution, the PyLexibank FormSpec, was originally developed for the conversion of various kinds of lexical data to Cross-Linguistic Data Formats, but it can also be used as a standalone tool. This study offers a basic tutorial that illustrates how the FormSpec can be put to concrete use.

New Paper from GWC 2025 in Pavia Published

2025-11-10T12:00:00+00:00

A paper that we submitted to and presented at the 13th Global WordNet Conference in Pavia, has now appeared officially in print as part of the proceedings.

The study by Kučerová & List, "Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages" (URL), presents a comparison of 17 object naming datasets from 30 distinct languages, and offers a novel, computer-assisted approach using Concepticon to ensure transparency and comparability of naming datasets across languages and authors. It provides a foundation for more standardized and transparent future research on how people name objects.

Object naming -- the act of identifying an object with a word or a phrase -- is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, or language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study takes the first steps towards making current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts in order to make object naming datasets cross-linguistically comparable. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and comparing the conceptual spaces of covered object naming datasets with classical basic vocabulary lists from historical linguistics. Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.

New Preprint

2025-12-02T12:00:00+00:00

Today, we published a new preprint entitled "Integration of Linguistic Legacy Data Collections through Digital Scholarly Editions: A Case Study on Vanuatu Languages" (joint work by Tihomir Rangelov, Luca Ciucci, John Burgess, Riccardo Rost, and Johann-Mattis List). It is available on Humanities Commons [URL]; the digital scholarly edition it presents can be accessed at https://tvl.digling.org.

The past two decades have witnessed a substantial increase in computational methods for investigating language diversity and history. The amount of digital data in comparative linguistics, however, is still lagging behind, with existing digitization efforts still mostly relying on the cumbersome labor of typing off data manually. Available Optical Character Recognition tools for automating this task have received relatively little attention in digitizing legacy data in linguistics, even though they are routinely used in other disciplines. At the same time, the editorial work must go beyond plain digitalization to add various layers of analysis and standardization, while recording the full provenance of each data point. We present an efficient and transparent workflow for digitizing legacy data in comparative linguistics and integrating it with larger data collections. As a result, we present a digital scholarly edition of lexical data for Vanuatu languages published by Darrell T. Tryon in 1976.

Article About Book on Non-verbal Predication

2025-12-11T12:00:00+00:00

The recent multivolume work Non-verbal Predication in the World’s Languages, co-edited by Luca Ciucci (https://www.degruyterbrill.com/serial/chl-9-b/html), has just been featured in the Digital Research Magazine of the University of Passau.

In the article, Luca Ciucci explains to non-specialists what non-verbal predication is, tells the story behind the project, and presents some findings from the book. Read the full article here: https://www.digital.uni-passau.de/en/beitraege/2025/sprachwissenschaft

New Team Member and Science with AI

2025-12-15T12:00:00+00:00

Our project has a new team member. Dr. Carlo Meloni will join us as a post-doc in the ProduSemy project and help to investigate word family evolution in the Semitic language family. Welcome Carlo, we look forward to collaborating with you in the next two years!

My final blog post in German this year is devoted to the question of whether responsible research is possible when relying on current AI tools. My answer would probably be no for the time being, but I discuss this in the broader context of science as a personal and a general endeavor (URL: https://wub.hypotheses.org/3240).

New Blog Post on Conversion Tables for Semitic Languages

2025-12-17T12:00:00+00:00

A new blog post introduces a preliminary conversion table that links traditional Semitic transcription and transliteration systems to standardized IPA representations. It situates the table in the history of Semitic transcription practices and demonstrates its practical use with the LinSe software package, offering an open and extensible starting point for computational and comparative work on Semitic languages within the CLTS framework.

In this study we present a preliminary conversion table that can be used for transcriptions and transliterations across different Semitic languages. We introduce the basic idea behind the table, show how it can be used, and explain how we hope to expand it in the future.

The post can be found online here or in article form via its DOI.

New Preprint

2025-12-18T12:00:00+00:00

Today, we published a new preprint entitled "Variation in Language Phylogenies May Result From Variation in Concept Translation" (with David Snee, Luca Ciucci, and Johann-Mattis List). It is available on Humanities Commons [URL]. The paper highlights that concept translation variation during the compilation of comparative wordlists may yield variation in the resulting language phylogenies, both for distance-based and character-based approaches.

Phylogenetic reconstruction in historical linguistics now typically relies on cognate sets assembled from multilingual wordlists. While more and more scholars now trust in the robustness of the algorithms underlying the reconstruction of cognate-based language phylogenies, few studies have actually tested to which degree initial choices during data preparation can influence their outcome. Here, we provide first tests that focus on the role that concept translation – the initial stage of wordlist compilation, in which a list of concepts is translated into the target languages prior to identifying cognate sets – plays in phylogenetic reconstruction. Based on a newly compiled comparative dataset consisting of seven wordlists from five language families in which the same language varieties were coded by different authors, we investigate to which degree differences in concept translation lead to differences in phylogenetic reconstruction. Our results show that despite considerable differences in concept translation, lexical distances still show a considerably high correlation across all datasets. However, when comparing individual phylogenies reconstructed with the help of Bayesian inference, we find considerable differences ranging between 0.10 and 0.44 in Normalized Quartet Distances computed from posterior tree samples. An additional cluster analysis that we introduce shows that larger differences in phylogenies do not necessarily correspond to high disagreement in the larger subgroups. A detailed inspection of individual concept translation differences in the Indo-European and Tupian wordlists in our sample further confirms that concept translation differences may specifically impact subgrouping decisions in lower clades of the tree, while major groupings are often unaffected.