Hive hosts lots of different languages. If you want to categorize the content, one of the categorization points is the language used for the posts/comments.
Detecting the post's language was one of the requirements in my current project. It turns out it was pretty easy to do it, dropping here some experience for future reference.
What do we need?
- A function fetches the post body and removes the noise from the post body
- A function returns the language code detected in the post body
Clearing the noise
To feed the language detector, we need to filter out the HTML and Markdown tags. Passing clean data as much as possible will increase the language detection's correctness.
from bs4 import BeatifulSoup
from markdown import markdown
def clear_noise(post_body):
body_as_html = markdown(post_body)
soup = BeautifulSoup(body_as_html, "html.parser")
post_body = ''.join(soup.findAll(text=True))
post_body = re.sub('<.*?>', '', post_body)
post_body = re.sub(
'```.*?```', '', post_body, flags=re.MULTILINE | re.DOTALL)
return post_body
This simple snippet removes Markdown, HTML tags, and try to return only sentences/words.
Detecting the language
"There is a library for that" is the answer to most questions in Python world. I've used langdetect in the past and it was working considerably well.
from lighthive.client import Client
from langdetect import detect
def detect_language(author, permlink):
c = Client()
post_body = c.get_content(author, permlink)["body"]
post_body = clear_noise(post_body)
return detect(post_body)
We've used lighthive's Client to fetch the post body, cleared the post text with the clear_noise
and passed it to the language detector library.
Let's test it with some example posts; I've picked four different posts with four different languages:
posts = [
('themarkymark', 'stemgeek-s-first-hackathon'),
('emrebeyler', 'yeni-baslayanlar-icin-hive'),
('clayop', '4v8phu-9'),
('satren', 'virtuelles-dach-meetup-am-sonntag-dem-17-05-2020-18-00-q9r1qv'),
]
for author, permlink in posts:
language = detect_language(author, permlink)
print("Language of @%s/%s is %s" % (author, permlink, language))
Output:
Language of @themarkymark/stemgeek-s-first-hackathon is en
Language of @emrebeyler/yeni-baslayanlar-icin-hive is tr
Language of @clayop/4v8phu-9 is ko
Language of @satren/virtuelles-dach-meetup-am-sonntag-dem-17-05-2020-18-00-q9r1qv is de
Pretty accurate. You can try the script without installing anything at repl.it.
Bonus data: Language saturation on HIVE
This was for a small period of time, but probably gives a good idea on the actively used languages at HIVE.