Sunday coding sessions: #1 Detecting the language of HIVE content

Hive hosts lots of different languages. If you want to categorize the content, one of the categorization points is the language used for the posts/comments.

Detecting the post's language was one of the requirements in my current project. It turns out it was pretty easy to do it, dropping here some experience for future reference.

What do we need?

A function fetches the post body and removes the noise from the post body
A function returns the language code detected in the post body

Clearing the noise

To feed the language detector, we need to filter out the HTML and Markdown tags. Passing clean data as much as possible will increase the language detection's correctness.

from bs4 import BeatifulSoup
from markdown import markdown

def clear_noise(post_body):
    body_as_html = markdown(post_body)
    soup = BeautifulSoup(body_as_html, "html.parser")
    post_body = ''.join(soup.findAll(text=True))
    post_body = re.sub('<.*?>', '', post_body)
    post_body = re.sub(
    '```.*?```', '', post_body, flags=re.MULTILINE | re.DOTALL)
    return post_body

This simple snippet removes Markdown, HTML tags, and try to return only sentences/words.

Detecting the language

"There is a library for that" is the answer to most questions in Python world. I've used langdetect in the past and it was working considerably well.

from lighthive.client import Client
from langdetect import detect

def detect_language(author, permlink):
    c = Client()
    post_body = c.get_content(author, permlink)["body"]
    post_body = clear_noise(post_body)

    return detect(post_body)

We've used lighthive's Client to fetch the post body, cleared the post text with the clear_noise and passed it to the language detector library.

Let's test it with some example posts; I've picked four different posts with four different languages:

posts = [
    ('themarkymark', 'stemgeek-s-first-hackathon'),
    ('emrebeyler', 'yeni-baslayanlar-icin-hive'),
    ('clayop', '4v8phu-9'),
    ('satren', 'virtuelles-dach-meetup-am-sonntag-dem-17-05-2020-18-00-q9r1qv'),
]

for author, permlink in posts:
    language = detect_language(author, permlink)
    print("Language of @%s/%s is %s" % (author, permlink, language))

Output:

Language of @themarkymark/stemgeek-s-first-hackathon is en
Language of @emrebeyler/yeni-baslayanlar-icin-hive is tr
Language of @clayop/4v8phu-9 is ko
Language of @satren/virtuelles-dach-meetup-am-sonntag-dem-17-05-2020-18-00-q9r1qv is de

Pretty accurate. You can try the script without installing anything at repl.it.

Bonus data: Language saturation on HIVE

This was for a small period of time, but probably gives a good idea on the actively used languages at HIVE.

Screen Shot 2020-05-03 at 18.59.24.png

Notes

langdetect supports 55 languages and the returned code is ISO 639-1 language code.
There is also a langid library with a little bit less precision, but with faster detection times. Check it out if you have tight time constraints.

Sort:

Trending

[-]

mys (62) 4 years ago

Polish community in the top#10 😊

$0.00

1 vote

emrebeyler (75) 4 years ago

heh, it's a very small time frame so that may not show the real picture but yeah, pl looks strong.

doze (75) 4 years ago

@emrebeyler my friend, pt please! ;)
Cheers!

bashadow (69) 4 years ago

Interesting to see the number of languages being used, as hive Block Chain awareness grows it will be nice to see the number of native languages grow also. now all we need is an on-chain post translation button on some of the front ends instead of having to use google translate.

A translation widget that front end developers can add to their settings page and then a translate post button directly on the post the users are looking at. Adding one more item to the three dot, (ellipsis), function that peakd uses for example.

shmoogleosukami (72) 4 years ago

That's pretty neat!

tngflx (63) 4 years ago

Hmm was expecting korean to be second.. Perhaps it is on steem blockchain.

It might be the case that it was their sleeping time. This data is a timeframe of 4-5 hours only.

gitplait (51) 4 years ago

This is a fun and useful publication. We are picking it as one of the nice dev publication on the Hive chain for the day, and we will feature it on our front-end, gitplait.tech. Well done, and we wish you success on the project you are building.

rishi556 (70) 4 years ago

Damn it. This is exactly what I needed, but in JS. Time to go hunting for a library myself.