python - Dimension mismatch error in CountVectorizer MultinomialNB

- February 25, 2012

Before I lodge this question, I have thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them got it right for me.

OK, so I split my 'spam email' text data (originally in CSV format) into training and test sets, using CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. I then applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    # train_test_split lived in sklearn.cross_validation before scikit-learn 0.18
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # loading data
    # data contains 2 columns ('text', 'target')
    spam_data = pd.read_csv('spam.csv')
    spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)

    # split data
    x_train, x_test, y_train, y_test = train_test_split(
        spam_data['text'], spam_data['target'], random_state=0)

    # fit vocabulary and extract word count features
    cv = CountVectorizer()
    x_traincv = cv.fit_transform(x_train)
    x_testcv = cv.fit_transform(x_test)

    # learn and predict using MultinomialNB
    clfnb = MultinomialNB(alpha=0.1)
    clfnb.fit(x_traincv, y_train)

    # so far so good, but when I predict on x_testcv
    y_pred = clfnb.predict(x_testcv)

    # Python throws me an error: dimension mismatch

The suggestions I gleaned from previous question threads were to (1) use only .transform() on x_test, or (2) ascertain whether each row in the original spam data is in string format (yes, they are), or (3) do nothing on x_test. None of them rang a bell, and Python kept giving me the 'dimension mismatch' error. After struggling for 4 hours, I had to succumb to Stack Overflow. It would be much appreciated if anyone could enlighten me on this. I just want to know what goes wrong with my code and how to get the dimension right.

thank you.

By the way, the original data entries look like this:

       text                                             target
    0  go until jurong point, crazy.. available         0
    1  ok lar... joking wif u oni...                    0
    2  free entry in 2 wkly comp win fa cup fina        1
    3  u dun hor... u c                                 0
    4  nah don't think goes usf, lives aro              0
    5  freemsg hey there darling it's been 3 week's n   1
    6  winner!! valued network customer have            1

Your CountVectorizer has already been fitted on the training data. So for your test data, you just want to call transform(), not fit_transform().

Otherwise, if you use fit_transform() again on your test data, you will get different columns based on the unique vocabulary of the test data. So just fit once, on the training data.

x_testcv = cv.transform(x_test)
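To make the difference concrete, here is a minimal sketch on a few made-up messages. The toy texts, labels, and the names `train_texts`, `test_texts`, and `x_test_refit` are assumptions for illustration only, not the asker's data. It compares the column counts you get from refitting on the test text with those from reusing the fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for x_train / y_train / x_test (invented for illustration).
train_texts = [
    "free entry win cash prize now",
    "ok see you at lunch",
    "winner claim your free prize",
    "are you coming home tonight",
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
test_texts = ["free prize waiting for you", "see you tonight"]

cv = CountVectorizer()
x_traincv = cv.fit_transform(train_texts)  # learns the vocabulary from the training text

# Refitting on the test text builds a separate vocabulary, so the number
# of columns no longer matches what the classifier will be trained on.
x_test_refit = CountVectorizer().fit_transform(test_texts)

# transform() reuses the training vocabulary; words not seen in training are ignored.
x_testcv = cv.transform(test_texts)

clf = MultinomialNB(alpha=0.1)
clf.fit(x_traincv, train_labels)
y_pred = clf.predict(x_testcv)  # works: same number of columns as x_traincv

print("train columns:       ", x_traincv.shape[1])
print("refit test columns:  ", x_test_refit.shape[1])
print("transformed columns: ", x_testcv.shape[1])
print("predictions:         ", y_pred)
```

Passing `x_test_refit` to `clf.predict` instead would fail with a shape or dimension error, because the classifier expects exactly as many feature columns as it saw during `fit`.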
