Request for help with dataset preparation for my research work

Sirisha
Posts: 1
Joined: Thu Jul 14, 2022 5:08 am

Post by Sirisha »

Respected Sir,
I am a research student and I often refer to SPMF for datasets and algorithms. I recently studied your paper titled "Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams".

In this paper you mention that each book by an author is converted into a sequence database of POS tags through preprocessing with the Rita NLP library (http://www.rednoise.org/rita/) and the Stanford NLP Tagger (http://nlp.stanford.edu/software/).

The Stanford POS tagger has only 36 POS tags, but I found 38 POS tags in the SPMF dataset at this link: https://www.philippe-fournier-viger.com ... ll.txt.txt

The POS tags also differ from those of the Stanford NLP Tagger: the SPMF dataset contains single-letter POS tags that are not found in the tagger's tag set.
I have listed the POS tags of your dataset and of the Stanford NLP Tagger below. Kindly tell me what these single letters refer to and how the two tag sets differ.

POS tags found in the SPMF dataset:
@ITEM=1=dt
@ITEM=2=jj
@ITEM=3=cc
@ITEM=4=rb
@ITEM=5=nn
@ITEM=6=vbz
@ITEM=7=vbn
@ITEM=8=in
@ITEM=9=nns
@ITEM=10=wrb
@ITEM=11=prp
@ITEM=12=nnps
@ITEM=13=vbd
@ITEM=14=cd
@ITEM=15=vbg
@ITEM=16=to
@ITEM=17=wp
@ITEM=18=nnp
@ITEM=19=vb
@ITEM=20=md
@ITEM=21=vbp
@ITEM=22=wdt
@ITEM=23=ex
@ITEM=24=uh
@ITEM=25=jjs
@ITEM=26=jjr
@ITEM=27=g
@ITEM=28=rbs
@ITEM=29=p
@ITEM=30=v
@ITEM=31=rbr
@ITEM=32=h
@ITEM=33=x
@ITEM=34=k
@ITEM=35=l
@ITEM=36=o
@ITEM=37=m
@ITEM=38=n
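
For reference, header lines of this form can be read back into an id-to-tag table with a few lines of Python. This is only a minimal sketch: the `header` string holds four lines copied from the list above, and the function name `parse_item_lines` is a hypothetical helper, not part of SPMF.

```python
import re

# Four header lines copied from the dataset excerpt above.
header = """\
@ITEM=1=dt
@ITEM=27=g
@ITEM=29=p
@ITEM=31=rbr
"""

# Matches lines of the form @ITEM=<id>=<tag>.
ITEM_LINE = re.compile(r"^@ITEM=(\d+)=(.+)$")


def parse_item_lines(text):
    """Return {item_id: tag} for every @ITEM=<id>=<tag> line in text."""
    mapping = {}
    for line in text.splitlines():
        match = ITEM_LINE.match(line.strip())
        if match:
            mapping[int(match.group(1))] = match.group(2)
    return mapping


id_to_tag = parse_item_lines(header)
# Collect the tags that consist of a single letter.
single_letter = sorted(tag for tag in id_to_tag.values() if len(tag) == 1)
print(id_to_tag)
print(single_letter)
```

Run on the full 38-line header, the same filter would pick out the single-letter tags in question.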

POS tags found in the Stanford NLP Tagger:

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
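
The difference between the two lists can be made explicit with a set comparison. The sketch below is only illustrative: both tag lists are typed in from the two lists in this post, and upper-casing the dataset tags before comparing is an assumption about how the two spellings correspond.

```python
# Tags from the SPMF dataset header in this post (lower-case in the file).
spmf_tags = (
    "dt jj cc rb nn vbz vbn in nns wrb prp nnps vbd cd vbg to wp nnp vb md "
    "vbp wdt ex uh jjs jjr g rbs p v rbr h x k l o m n"
).split()

# The 36 tagger tags listed in this post.
tagger_tags = (
    "CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NNP NNPS PDT POS PRP PRP$ "
    "RB RBR RBS RP SYM TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB"
).split()

# Assumption: case is the only spelling difference for the shared tags.
spmf = {tag.upper() for tag in spmf_tags}
tagger = set(tagger_tags)

only_in_spmf = sorted(spmf - tagger)    # tags seen only in the dataset
only_in_tagger = sorted(tagger - spmf)  # tagger tags absent from the dataset

print(len(spmf), len(tagger))
print(only_in_spmf)
print(only_in_tagger)
```

With these lists, 28 tags are shared, 10 single-letter tags appear only in the dataset, and 8 tagger tags (FW, LS, PDT, POS, PRP$, RP, SYM, WP$) do not appear in it. Since 10 and 8 differ, a simple one-to-one renaming cannot be assumed.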


After POS tagging, the entire book has to be represented as a set of sequences of POS tags, but in the dataset linked above the entire book is represented as a single sequence. I found -2 only once, at the end of the dataset (in the SPMF format, -2 marks the end of a sequence), and I was unable to identify different sentences as different sequences.
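
One way to check how many sequences a file contains is to split it on the separators. The sketch below assumes the usual SPMF sequence format, in which -1 closes an itemset and -2 closes a sequence, and it skips lines starting with `@`, `#`, or `%` as metadata or comments. The `sample` text is made up for illustration and `read_sequences` is a hypothetical helper, not an SPMF function.

```python
def read_sequences(text):
    """Split SPMF-style sequence text into a list of sequences.

    Each sequence is a list of itemsets; each itemset is a list of ints.
    """
    sequences = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and metadata/comment lines such as @ITEM=1=dt.
        if not line or line[0] in "@#%":
            continue
        sequence, itemset = [], []
        for token in line.split():
            if token == "-1":    # end of the current itemset
                sequence.append(itemset)
                itemset = []
            elif token == "-2":  # end of the current sequence
                sequences.append(sequence)
                sequence = []
            else:
                itemset.append(int(token))
    return sequences


# Made-up sample: two short sequences over the items dt (1) and nn (5).
sample = "@ITEM=1=dt\n@ITEM=5=nn\n1 -1 5 -1 -2\n5 -1 1 -1 5 -1 -2\n"
db = read_sequences(sample)
print(len(db))               # number of sequences found
print([len(s) for s in db])  # number of itemsets in each sequence
```

If `len(db)` is 1 for a whole book, the file really does hold the book as one long sequence.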

In order to find the top-k sequential patterns (POS patterns) in a book, I need to represent the book as a sequence database, but in the linked dataset the entire book is a single sequence. How can I apply the TKS algorithm to a single sequence to find the top-k POS tag patterns? Will it work? I remember that you mention in the paper that you modified the TKS algorithm for this work.
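
To illustrate the representation being asked about, here is a minimal sketch that writes one sequence per sentence. It is not the preprocessing used in the paper: it assumes the sentences have already been POS-tagged by some tagger, `to_spmf` is a hypothetical helper, and the two example sentences are made up. Only the `@ITEM` lines and the -1/-2 separators are taken from the SPMF format.

```python
def to_spmf(tagged_sentences):
    """Encode sentences (lists of POS tags) as SPMF-style text.

    Each sentence becomes one sequence line; each tag becomes a
    single-item itemset closed by -1, and the line ends with -2.
    """
    tag_to_id = {}
    lines = []
    for sentence in tagged_sentences:
        if not sentence:
            continue  # skip empty sentences
        tokens = []
        for tag in sentence:
            # Assign ids in order of first appearance, starting at 1.
            item_id = tag_to_id.setdefault(tag, len(tag_to_id) + 1)
            tokens.append(f"{item_id} -1")
        lines.append(" ".join(tokens + ["-2"]))
    header = [f"@ITEM={item_id}={tag}" for tag, item_id in tag_to_id.items()]
    return "\n".join(header + lines) + "\n"


# Made-up input: two already-tagged sentences.
sentences = [["dt", "nn", "vbz"], ["prp", "vbz", "dt", "nn"]]
print(to_spmf(sentences), end="")
```

A file written this way has one -2 per sentence, so a sequential pattern mining algorithm would see many short sequences instead of one long one.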

So I request you to kindly share the source code of this work (preprocessing, the modified TKS algorithm, etc.).
Kindly help me understand your work so that I can use it in my research.

Thank you.

Regards,
A.SIRISHA