python - How do I download NLTK data? -




Updated answer: NLTK works well with Python 2.7. I had 3.2; I uninstalled 3.2 and installed 2.7, and now it works!

I have installed NLTK and tried to download the NLTK data. I followed the instructions on the site: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code:

>>> import nltk
>>> nltk.download()

It gave me the error message below:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    nltk.download()
AttributeError: 'module' object has no attribute 'download'

Directory of C:\Python32\Lib\site-packages

I tried both nltk.download() and nltk.downloader(), and both gave me error messages.

Then I used help(nltk) to pull up the package information, which shows the following:

NAME
    nltk

PACKAGE CONTENTS
    align
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collocations
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    examples (package)
    featstruct
    grammar
    inference (package)
    internals
    lazyimport
    metrics (package)
    misc (package)
    model (package)
    parse (package)
    probability
    sem (package)
    sourcedstring
    stem (package)
    tag (package)
    test (package)
    text
    tokenize (package)
    toolbox
    tree
    treetransforms
    util
    yamltags

FILE
    C:\Python32\Lib\site-packages\nltk

I do see downloader there, so I am not sure why it does not work. Python is 3.2.2, and the system is Windows Vista.

TL;DR

To download a particular dataset or model, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use:

$ python3
>>> import nltk
>>> nltk.download('punkt')
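When several resources are needed, the same call can be wrapped in a small loop. The sketch below is one possible approach and not part of NLTK itself: `download_all` is a hypothetical helper name, and it relies on `nltk.download` returning a truthy value on success. The `download` parameter exists only so the helper can be exercised without network access.

```python
def download_all(package_ids, download=None):
    """Download each NLTK package id and return the ids that failed.

    `download` defaults to nltk.download. Passing another callable is
    an assumption of this sketch (handy for offline testing), not an
    NLTK feature.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be tested offline
        download = nltk.download

    failed = []
    for package_id in package_ids:
        # A falsy return value is treated as a failed download.
        if not download(package_id):
            failed.append(package_id)
    return failed

# Intended usage (requires nltk and network access):
#   failed = download_all(['punkt', 'stopwords'])
```

An empty list back from `download_all` means every requested package was installed or already up to date.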

If you're unsure of which data/model you need, you can start out with the basic list of data + models with:

>>> import nltk
>>> nltk.download('popular')

It will download a list of "popular" resources, which includes:

<collection id="popular" name="popular packages">
  <item ref="cmudict" />
  <item ref="gazetteers" />
  <item ref="genesis" />
  <item ref="gutenberg" />
  <item ref="inaugural" />
  <item ref="movie_reviews" />
  <item ref="names" />
  <item ref="shakespeare" />
  <item ref="stopwords" />
  <item ref="treebank" />
  <item ref="twitter_samples" />
  <item ref="omw" />
  <item ref="wordnet" />
  <item ref="wordnet_ic" />
  <item ref="words" />
  <item ref="maxent_ne_chunker" />
  <item ref="punkt" />
  <item ref="snowball_data" />
  <item ref="averaged_perceptron_tagger" />
</collection>
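If you want the ids from a collection like this as a Python list, the standard library can parse it. This is only a sketch: `collection_item_ids` is a hypothetical helper name, and the XML string is an abbreviated copy of the listing above rather than the full index file.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the "popular" collection listing.
POPULAR_XML = """
<collection id="popular" name="popular packages">
  <item ref="cmudict" />
  <item ref="stopwords" />
  <item ref="wordnet" />
  <item ref="punkt" />
</collection>
"""

def collection_item_ids(xml_text):
    """Return the `ref` attribute of every <item> in a collection."""
    root = ET.fromstring(xml_text)
    return [item.get("ref") for item in root.iter("item")]

print(collection_item_ids(POPULAR_XML))
```

Each id in the resulting list can be passed to nltk.download() on its own if you only need part of the collection.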

Edited

In case anyone is trying to avoid errors when downloading larger datasets from NLTK, see https://stackoverflow.com/a/38135306/610569

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it's already installed.
>>> dler.download('popular')

And if anyone wants to find the nltk_data directory, see https://stackoverflow.com/a/36383314/610569

And to configure the nltk_data path, see https://stackoverflow.com/a/22987374/610569
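As a rough illustration of the path configuration idea: NLTK keeps its search directories in the list nltk.data.path, so a custom directory can be added to that list before any resource is loaded. The helper below is a hypothetical sketch (the name `add_data_dir` and the placeholder directory are mine, not from the linked answers), and it only touches a plain list so it can be tried without NLTK installed.

```python
import os

def add_data_dir(search_path, data_dir):
    """Put data_dir at the front of search_path if it is an existing directory.

    Returns True when the list was changed. `search_path` is expected
    to be a list of directories such as nltk.data.path.
    """
    data_dir = os.path.abspath(os.path.expanduser(data_dir))
    if not os.path.isdir(data_dir):
        return False  # ignore directories that do not exist
    if data_dir in search_path:
        return False  # already present, nothing to do
    search_path.insert(0, data_dir)
    return True

# Intended usage (assumes nltk is installed; the directory is a placeholder):
#   import nltk
#   add_data_dir(nltk.data.path, "~/my_nltk_data")
```

Putting the directory at the front means it is searched before the default locations. Appending instead would keep the defaults first.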

Updated

From v3.2.5, NLTK has a more informative error message when an nltk_data resource is not found, e.g.:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
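Since a missing resource surfaces as a LookupError, one option is to catch it, download the package the message names, and retry once. The following is a minimal sketch under that assumption: `call_with_resource` is a hypothetical helper, and the caller still has to supply the package id shown in the error message.

```python
def call_with_resource(func, package_id, download=None):
    """Call func(); on LookupError, download package_id and retry once.

    `download` defaults to nltk.download. Supplying another callable is
    an assumption of this sketch, meant for testing without network access.
    """
    try:
        return func()
    except LookupError:
        if download is None:
            import nltk  # imported lazily, only when a download is needed
            download = nltk.download
        download(package_id)
        # If the resource is still missing, the second LookupError
        # propagates to the caller.
        return func()

# Intended usage (requires nltk and network access):
#   from nltk import word_tokenize
#   tokens = call_with_resource(lambda: word_tokenize('x'), 'punkt')
```

Retrying only once keeps a wrong package id from turning into an endless download loop.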



