corpus.tools
Justext – Corpus tools
http://www.corpus.tools/wiki/Justext
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. The page links to an example of what is kept and what is discarded from a typical web page, and to a description of the jusText algorithm. Installation requires Python and the lxml library, version 2.2.4 or later. The page then covers extracting the downloaded file and points to usage information.
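The idea of keeping "mainly text containing full sentences" can be illustrated with a toy classifier. This is a minimal sketch, not the jusText algorithm itself: the stopword list, the thresholds, and the names stopword_density and classify_block are all assumptions made for illustration. It labels a block as boilerplate when it is short, mostly link text, or poor in function words.

```python
import re

# Tiny illustrative stopword list (an assumption; real tools use per-language lists).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "this"}


def stopword_density(text):
    """Return the fraction of words in text that are stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def classify_block(text, link_chars=0, min_length=40,
                   min_density=0.25, max_link_density=0.3):
    """Label a text block 'good' or 'boilerplate' (hypothetical thresholds)."""
    length = len(text)
    if length == 0:
        return "boilerplate"
    # Blocks that consist mostly of link text look like navigation.
    if link_chars / length > max_link_density:
        return "boilerplate"
    # Very short blocks rarely contain full sentences.
    if length < min_length:
        return "boilerplate"
    # Running prose tends to contain many function words.
    return "good" if stopword_density(text) >= min_density else "boilerplate"


# Each block is (text, number of characters that sit inside links).
blocks = [
    ("Home | About | Contact", 22),
    ("The tool is designed to keep text that is made of full sentences, "
     "and it drops the rest of the page.", 0),
    ("Copyright 2012", 0),
]
labels = [classify_block(text, link_chars) for text, link_chars in blocks]
print(labels)
```

Only the middle block, which reads as a full sentence, is labelled good in this sketch; the navigation bar and the copyright line are discarded.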
corpus.tools
Onion – Corpus tools
http://www.corpus.tools/wiki/Onion
Onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. Prerequisite: libjudy (1.0.5 or later). Download the package: wget -O onion-1.2.tar.gz 'http://corpus.tools/raw-attachment/wiki/Downloads/onion-1.2.tar.gz'. Extract the downloaded file: tar xzvf onion-1.2.tar.gz. Configure the package by editing onion-1.2/Makefile.config: set PREFIX (or INSTALL_BIN and INSTALL_DATA) according to where you want the executables and data (docs) installed, and set JUDY_INC to where Judy.h is located. The page also includes an example on a sample input.
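Removing duplicate parts from a text collection can be sketched with word n-grams. The following is a simplified illustration and not Onion's implementation: the function names, the n-gram size of 3, and the 0.5 threshold are assumptions. A paragraph is dropped when more than the threshold share of its n-grams has already been seen in earlier paragraphs.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is at most threshold."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.lower().split(), n)
        if grams:
            duplicate_ratio = len(grams & seen) / len(grams)
        else:
            duplicate_ratio = 0.0  # too short to judge, so keep it
        if duplicate_ratio <= threshold:
            kept.append(paragraph)
        # Remember this paragraph's n-grams whether or not it was kept.
        seen |= grams
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about web corpora",
    "the quick brown fox jumps over the lazy cat",
]
result = deduplicate(docs)
print(result)
```

In this sketch the third paragraph shares six of its seven trigrams with the first, so it is treated as a near-duplicate and removed, while the first occurrence is kept.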
sigwac.org.uk
WAC7 – ACL SIGWAC
https://sigwac.org.uk/wiki/WAC7
7th Web as Corpus Workshop (WAC-7). Lyon, France; 17th April 2012. To be held in association with ... We invite submissions which: describe Web corpus collection projects, or modules for one part of the process (crawling, filtering, de-duplication, language-id, tokenising, indexing, ...); explore characteristics of Web data from a linguistics/NLP perspective, including registers, domains, frequency distributions, and comparisons between datasets. Room Saint Clair 4 at Convention Centre, WWW2012. Serge Sharoff (Un...
corpus.tools
Chared – Corpus tools
http://www.corpus.tools/wiki/Chared
Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints. Make sure you have Python 2.6 or later and the lxml library, version 2.2.4 or later, installed. Extract the downloaded file. import urllib2...
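Why knowing the language helps can be shown with a toy detector. This is a minimal sketch under stated assumptions, not Chared's method or API: the candidate encoding list, the small Czech word set standing in for a language model, and the names score and detect_encoding are all hypothetical. Each candidate encoding is used to decode the bytes, and the decoding that yields the most known words of the language wins.

```python
CANDIDATE_ENCODINGS = ["utf-8", "windows-1250", "iso-8859-2"]

# Tiny hypothetical "language model": a few frequent Czech words.
CZECH_COMMON_WORDS = {"že", "který", "být", "jsou", "při", "také", "může", "již"}


def score(text, vocabulary):
    """Count whitespace-separated tokens that are known words of the language."""
    return sum(word in vocabulary for word in text.lower().split())


def detect_encoding(data, vocabulary, candidates=CANDIDATE_ENCODINGS):
    """Return the candidate encoding whose decoding matches the vocabulary best."""
    best_encoding, best_score = None, -1
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        current = score(text, vocabulary)
        if current > best_score:
            best_encoding, best_score = encoding, current
    return best_encoding


sample = "korpusy jsou již také nástrojem který může být užitečný"
detected = detect_encoding(sample.encode("windows-1250"), CZECH_COMMON_WORDS)
print(detected)
```

windows-1250 and iso-8859-2 differ only in a few letters such as ž, so both decode the bytes without error; only the word-level check against the known language separates them here.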
nlp.fi.muni.cz
Main Topics
https://nlp.fi.muni.cz/cs/MainTopics
What are we working on at the Centre? A corpus is a collection of text data in electronic form. As an important source of linguistic data, corpora are used to study many frequency phenomena of language, and they are now an indispensable tool in NLP. Besides corpora containing arbitrary texts, others are built for special purposes, e.g. parallel (multilingual), domain-specific, spoken, or error corpora. Today the most important source of corpus data is the Internet. For data obtained from the web to be usable for ...
panacea-lr.eu
Related Projects / Networks | Panacea Project
http://www.panacea-lr.eu/en/project/related-projects-networks
Related European research projects; research infrastructures and networks. MT Summit 2013 PANACEA tutorial now available. 8 PANACEA papers accepted at COLING 2012. Forthcoming beta testing week. PANACEA is an EU-funded project under Grant Agreement 248064.