段落計數令牌

令牌有時也叫作標誌,在從源讀取文本時,有時我們需要找出有關所用單詞的一些統計信息。 這使得有必要計算單詞的數量以及計算給定文本中具有特定類型單詞的行數。 在下面的示例中,我們展示了使用兩種不同方法計算段落中單詞的程序。假設這個示例文本中包含好萊塢電影的摘要。

讀取文件

FileName = ("Path\GodFather.txt")

with open(FileName, 'r') as file:
    lines_in_file = file.read()
    print lines_in_file

當運行上面的程序時,得到以下輸出 -

Vito Corleone is the aging don (head) of the Corleone Mafia Family. His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer. This does not please Sollozzo, who has the Don shot down by some of his hit men. The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.

使用nltk計算單詞

接下來,使用nltk模塊來計算文本中的單詞。 請注意,(head)這個詞被算作3個單詞而不是1個單詞。

參考以下代碼 -

import nltk

FileName = ("Path\GodFather.txt")

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    nltk_tokens = nltk.word_tokenize(lines_in_file)
    print nltk_tokens
    print "\n"
    print "Number of Words: " , len(nltk_tokens)

當運行上面的程序時,得到以下輸出 -

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(', 'head', ')', 'of', 'the', 'Corleone', 'Mafia', 'Family', '.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', '(', 'Michael', "'s", 'sister', ')', 'to', 'Carlo', 'Rizzi', '.', 'All', 'of', 'Michael', "'s", 'family', 'is', 'involved', 'with', 'the', 'Mafia', ',', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life', '.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money', '.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it', ',', 'but', ',', 'much', 'against', 'the', 'advice', 'of', 'the', 'Don', "'s", 'lawyer', 'Tom', 'Hagen', ',', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs', ',', 'and', 'turns', 'down', 'the', 'offer', '.', 'This', 'does', 'not', 'please', 'Sollozzo', ',', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men', '.', 'The', 'Don', 'barely', 'survives', ',', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart', '.']

Number of Words:  167

使用Split函數計數單詞

接下來使用Split函數計算單詞,這裏單詞(head)被計爲單個單詞而不是3個單詞,就像使用nltk一樣。

FileName = ("Path\GodFather.txt")

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    print lines_in_file.split()
    print "\n"
    print  "Number of Words: ", len(lines_in_file.split())

當運行上面的程序時,得到以下輸出 -

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(head)', 'of', 'the', 'Corleone', 'Mafia', 'Family.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', "(Michael's", 'sister)', 'to', 'Carlo', 'Rizzi.', 'All', 'of', "Michael's", 'family', 'is', 'involved', 'with', 'the', 'Mafia,', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it,', 'but,', 'much', 'against', 'the', 'advice', 'of', 'the', "Don's", 'lawyer', 'Tom', 'Hagen,', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs,', 'and', 'turns', 'down', 'the', 'offer.', 'This', 'does', 'not', 'please', 'Sollozzo,', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men.', 'The', 'Don', 'barely', 'survives,', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart.']

Number of Words:  146