Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console
I am trying to read an XML document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test XML, saved as a file in UTF-8 encoding:
<?xml version="1.0" encoding="utf-8"?>
<root>
    <info name="愛よ">ÜÜÜÜÜÜÜ</info>
    <items>
        <item thing="ÖöÖö">"23Äßßß"</item>
    </items>
</root>
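First, I check the XML using ElementTree:
import xml.etree.ElementTree as ET

def printxml(xml, indent=''):
    # Print the tag and its text, with newlines stripped
    print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
    # Print each attribute one level deeper
    if len(xml.attrib) > 0:
        for k, v in xml.attrib.items():
            print(indent + '\t' + k + ' - ' + v)
    # Recurse into the children (iterating the element replaces getchildren(), which was removed in Python 3.9)
    for child in xml:
        printxml(child, indent + '\t')

xml0 = ET.parse("test.xml").getroot()
printxml(xml0)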
The output is correct:
root:
	info: ÜÜÜÜÜÜÜ
		name - 愛よ
	items:
		item: "23Äßßß"
			thing - ÖöÖö
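A plausible reason ElementTree copes here: when ET.parse is given a file name, it reads the file as bytes, so the XML parser itself applies the encoding named in the XML declaration. The sketch below illustrates this with an in-memory byte string standing in for test.xml (the shortened document is an assumption made for the demo, not the original file):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for test.xml: the document as raw UTF-8 bytes.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<root><info name="愛よ">ÜÜÜÜÜÜÜ</info></root>'
).encode("utf-8")

# The parser receives bytes and decodes them according to the declaration.
root = ET.fromstring(xml_bytes)
info = root.find("info")
print(info.get("name"), info.text)
```

Because the decoding happens inside the parser, the platform's default text encoding never comes into play on this path.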
Now I read the same file with Beautiful Soup and pretty-print it:
import bs4

with open("test.xml") as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
print(xml.prettify())
Output:
<!--?xml version="1.0" encoding="utf-8"?-->
<html>
 <head>
 </head>
 <body>
  <root>
   <info name="愛よ">
    ÜÜÜÜÜÜÜ
   </info>
   <items>
    <item thing="ÖöÖö">
     "23Äßßß"
    </item>
   </items>
  </root>
 </body>
</html>
This is wrong. Making the call with an explicit encoding specified, bs4.BeautifulSoup(ff, "html5lib", from_encoding="utf-8"), doesn't change the result.
Doing
print(xml.original_encoding)
outputs
None
So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the header information says UTF-8 as well, and I have chardet installed as the doc recommends.
Am I making a mistake here? What could be causing this?
EDIT: When I invoke the code without html5lib, I get this warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file c:\users\my.name\appdata\local\continuum\anaconda2\envs\python3\lib\site-packages\spyder\utils\ipython\start_kernel.py. To get rid of this warning, change code that looks like this:
 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "html5lib")
  markup_type=markup_type))
EDIT 2:
As suggested in a comment, I tried bs4.BeautifulSoup(ff, "html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff, "lxml-xml"), and I still get the same output.
What also strikes me as odd is that even when specifying the encoding, as in bs4.BeautifulSoup(ff, "lxml-xml", from_encoding='utf-8'), the value of xml.original_encoding is None, contrary to what is written in the doc.
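One possible explanation for the None value, offered as a sketch rather than a definitive account: a file opened in text mode hands Beautiful Soup an already-decoded str, so there are no bytes left whose encoding could be detected or overridden. The example below compares str input with bytes input; the shortened markup and the choice of "html.parser" (so that no extra parser needs to be installed) are assumptions made for the demo.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the test document.
markup = '<root><info name="愛よ">ÜÜÜ</info></root>'

# A str is already decoded, so there is no original encoding to report.
from_text = BeautifulSoup(markup, "html.parser")
print(from_text.original_encoding)

# Raw bytes leave the decoding to Beautiful Soup, which records what it used.
from_bytes = BeautifulSoup(markup.encode("utf-8"), "html.parser", from_encoding="utf-8")
print(from_bytes.original_encoding)
```

If this reading is right, opening the file in binary mode, open("test.xml", "rb"), would be another way to let Beautiful Soup handle the decoding itself.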
EDIT 3:
I put my XML contents into a string
xmlstring = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"
and used bs4.BeautifulSoup(xmlstring, "lxml-xml"), and now I'm getting the correct output:
<?xml version="1.0" encoding="utf-8"?>
<root>
 <info name="愛よ">
  ÜÜÜÜÜÜÜ
 </info>
 <items>
  <item thing="ÖöÖö">
   "23Äßßß"
  </item>
 </items>
</root>
So it seems something is wrong with the way the file is read after all.
I found the error: I have to specify the encoding when opening the file:
with open("test.xml", encoding='utf-8') as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
As I'm on Python 3, I thought the value of encoding was UTF-8 by default, but it turned out that it's system-dependent, and on my system it's cp1252.
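To see the effect in isolation, here is a minimal sketch using only the standard library. It prints the encoding that a text-mode open() typically falls back to on the current system, then reads a UTF-8 file once as cp1252 and once as UTF-8. The sample text and the temporary file are assumptions made for the demo; test.xml itself is not touched.

```python
import locale
import os
import tempfile

# The fallback encoding for text-mode open() depends on the platform and locale.
print(locale.getpreferredencoding(False))

text = "ÜÜÜ"  # sample non-ASCII content, chosen for this demo

# Write the sample as UTF-8 to a throwaway temporary file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".xml", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    # Decoding UTF-8 bytes as cp1252 raises no error here; it silently garbles the text.
    with open(path, encoding="cp1252") as ff:
        garbled = ff.read()
    # Passing the file's real encoding round-trips the text unchanged.
    with open(path, encoding="utf-8") as ff:
        correct = ff.read()
finally:
    os.remove(path)

print(repr(garbled), repr(correct))
```

Passing encoding explicitly on every open() call keeps the result independent of the machine the code runs on.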