Web Corpus Construction

Felix Bildhauer, Roland Schäfer

The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. There are several advantages of this approach: (i) Working with such corpora obviates the problems encountered when using Internet search engines in quantitative linguistic research (such as non-transparent ranking algorithms). (ii) Creating a corpus from web data is virtually free. (iii) The size of corpora compiled from the WWW may exceed by several orders of magnitude the size of language resources offered elsewhere. (iv) The data is locally available to the user, and it can be linguistically post-processed and queried with the tools preferred by her/him. This book addresses the main practical tasks in the creation of web corpora up to giga-token size. Among these tasks are the sampling process (i.e., web crawling) and the usual cleanups, including boilerplate removal and removal of duplicated content. Linguistic processing, and the problems that the different kinds of noise in web corpora cause for it, are also covered. Finally, the authors show how web corpora can be evaluated and compared to other corpora (such as traditionally compiled corpora).
For additional material please visit the companion website: sites.morganclaypool.com/wcc

Table of Contents: Preface / Acknowledgments / Web Corpora / Data Collection / Post-Processi...



See also:

Golrokh Mirzaei, Mohammad Wadood Majid «Mastering AngularJS for .NET Developers»
Julia Krömer, Karsten Jesche «eCommerce und Second Life»
А. Кисилев, Андреа Далле Вакке «Zabbix. Практическое руководство»
Ismail Mohd Nazri, Maskat Kamaruzaman, Shukran Mohd Afizi «Smarthome Computing System Using Mobile Devices»
Aurélio Spohn Marco, Santana Batista Thiago «Avaliacao do consumo de energia em redes DTN»
