Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console
I am trying to read an XML document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test XML, saved as a file in UTF-8 encoding:

    <?xml version="1.0" encoding="utf-8"?>
    <root>
      <info name="愛よ">ÜÜÜÜÜÜÜ</info>
      <items>
        <item thing="ÖöÖö">"23Äßßß"</item>
      </items>
    </root>

First I check the XML using ElementTree:

    import xml.etree.ElementTree as ET

    def printxml(xml, indent=''):
        # Print the tag with its text (newlines stripped), then attributes, then children.
        print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
        for k, v in xml.attrib.items():
            print(indent + '\t' + k + ' - ' + v)
        for child in xml:  # iterating an Element yields its direct children
            printxml(child, indent + '\t')

    xml0 = ET.parse("test.xml").getroot()
    printxml(xml0)

The output is correct:

    root:
        info: ÜÜÜÜÜÜÜ
            name - 愛よ
        items:
            item: "23Äßßß"
                thing - ÖöÖö

Now I read the same file with Beautiful Soup and pretty-print it:

    import bs4

    with open("test.xml") as ff:
        xml = bs4.BeautifulSoup(ff, "html5lib")
    print(xml.prettify())

Output:

    <!--?xml version="1.0" encoding="utf-8"?-->
    <html>
     <head>
     </head>
     <body>
      <root>
       <info name="愛よ">
        ÜÜÜÜÜÜÜ
       </info>
       <items>
        <item thing="ÖöÖö">
         "23Äßßß"
        </item>
       </items>
      </root>
     </body>
    </html>

This is wrong. Making the call with an explicit encoding, bs4.BeautifulSoup(ff, "html5lib", from_encoding="utf-8"), doesn't change the result.
Running

    print(xml.original_encoding)

outputs

    None

So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the header information says UTF-8 as well, and I have chardet installed, as the docs recommend.
Am I making a mistake here? What could be causing this?
EDIT: When I invoke the code without "html5lib", I get this warning:

    UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    The code that caused this warning is on line 241 of the file c:\users\my.name\appdata\local\continuum\anaconda2\envs\python3\lib\site-packages\spyder\utils\ipython\start_kernel.py. To get rid of this warning, change code that looks like this:

     BeautifulSoup(YOUR_MARKUP})

    to this:

     BeautifulSoup(YOUR_MARKUP, "html5lib")

      markup_type=markup_type))

EDIT 2:
As suggested in a comment, I tried bs4.BeautifulSoup(ff, "html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff, "lxml-xml"), and still got the same output.
What also strikes me as odd is that even when I specify the encoding, as in bs4.BeautifulSoup(ff, "lxml-xml", from_encoding='utf-8'), the value of xml.original_encoding is None, contrary to what is written in the docs.
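A likely reason for the None: Beautiful Soup only runs its encoding detection on bytes. A file opened in text mode hands it an already-decoded str, so there is nothing left to detect and from_encoding has no effect. The sketch below compares the two cases. The temp file, the shortened markup and the built-in "html.parser" are stand-ins chosen for illustration, not taken from the post.

```python
import os
import tempfile

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for test.xml, written to disk as UTF-8 bytes.
markup = '<?xml version="1.0" encoding="utf-8"?><root><info name="愛よ">ÜÜÜ</info></root>'
path = os.path.join(tempfile.mkdtemp(), "test.xml")
with open(path, "wb") as f:
    f.write(markup.encode("utf-8"))

# Text mode: Python decodes the bytes before Beautiful Soup sees them,
# so the parser receives str and records no original encoding.
with open(path, encoding="utf-8") as ff:
    soup_from_text = BeautifulSoup(ff, "html.parser", from_encoding="utf-8")
print(soup_from_text.original_encoding)

# Binary mode: Beautiful Soup receives raw bytes and decodes them itself,
# so from_encoding is honoured and original_encoding is filled in.
with open(path, "rb") as ff:
    soup_from_bytes = BeautifulSoup(ff, "html.parser", from_encoding="utf-8")
print(soup_from_bytes.original_encoding)

# The attribute value survives the round trip through the parser.
print(soup_from_bytes.find("info")["name"] == "愛よ")
```

Opening the file with "rb" is therefore one way to let Beautiful Soup handle the decoding, at the cost of trusting its detection when no from_encoding is given.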
EDIT 3:
I put my XML contents into a string

    xmlstring = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"

and used bs4.BeautifulSoup(xmlstring, "lxml-xml"), and now I'm getting the correct output:

    <?xml version="1.0" encoding="utf-8"?>
    <root>
     <info name="愛よ">
      ÜÜÜÜÜÜÜ
     </info>
     <items>
      <item thing="ÖöÖö">
       "23Äßßß"
      </item>
     </items>
    </root>

So it seems something is wrong with the file after all.
Found the error: I have to specify the encoding when opening the file:

    with open("test.xml", encoding='utf-8') as ff:
        xml = bs4.BeautifulSoup(ff, "html5lib")

As I'm on Python 3, I thought the value of encoding was UTF-8 by default, but it turned out that it's system-dependent, and on my system it's cp1252.
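To see the effect without Beautiful Soup, here is a minimal standard-library sketch. It prints the encoding that open() falls back to on the current machine, then reads a small UTF-8 file once as cp1252 and once as UTF-8. The temp file and the sample text are made up for illustration. cp1252 is forced explicitly so the misread is reproduced on any platform.

```python
import locale
import os
import tempfile

# The encoding open() uses when none is given. It is platform-dependent;
# the Windows machine in the post reported cp1252.
print(locale.getpreferredencoding(False))

# Hypothetical sample file, written as UTF-8 bytes.
text = "ÜÜÜ 愛よ"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Reading UTF-8 bytes as cp1252 does not fail here, it silently yields
# the wrong characters: each "Ü" (bytes C3 9C) turns into "Ãœ".
with open(path, encoding="cp1252") as f:
    garbled = f.read()

# Naming the real encoding gives back the original text.
with open(path, encoding="utf-8") as f:
    decoded = f.read()

print(decoded == text)   # True
print(garbled == text)   # False

# The misread is reversible as long as every byte mapped to a cp1252 character.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == text)  # True
```

Passing encoding explicitly to every open() call on a text file avoids depending on the locale default at all.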