Removing stop words from a string in Java

Question

I have a string containing many words, and a text file listing stop words that I need to remove from that string. Suppose the string is:

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs."

I managed to get this working, but the problem is that whenever there are adjacent stop words in the string, only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"

Here is my stopwordslist.txt file: stopwords

How can I solve this problem? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

java stop-words string

author: JavaLearner

10 answers

author: alain.janinm · Accepted Answer · 2014-12-29 09:25:18

The error is that you remove an element from the list you are iterating over. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the test is true, you call wordsList.remove(1);. After that your list is |word0|word2|. ii is then incremented to 2, which now equals the size of the list, so the loop ends and word2 is never tested.

From there, there are several solutions. For example, instead of removing the values you can set them to "". Or you can build a separate "result" list.
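
The "result list" idea could look roughly like the sketch below. It is only an illustration: the class name ResultListFilter, the method removeStopWords and the small stop-word set are made up for the example, and the real stop words would come from stopwordslist.txt. Because the loop only reads from the input list and writes to a new one, adjacent stop words cannot be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResultListFilter {
    // Copies every word that is not a stop word into a new list,
    // so the list being iterated over is never modified.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            // The stop-word set is assumed to hold lower-case entries.
            if (!stopWords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stop-word set, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "this", "and", "so"));
        List<String> words = Arrays.asList("I love this phone and so on".split("\\s+"));
        // prints: love phone on
        System.out.println(String.join(" ", removeStopWords(words, stopWords)));
    }
}
```

The lookup in a HashSet also avoids the inner loop over the stop-word array.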

author: robin · Accepted Answer · 2014-12-29 09:18:22

Try the program below.

String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

author: geert3 · Accepted Answer · 2014-12-29 09:27:53

This is a much more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
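
If the stop words come from a file or a collection, the alternation does not have to be typed by hand. The sketch below is one possible way to build it, not part of the answer above: the class name StopWordRegex and its compile method are invented for the example, and it assumes a non-empty list. Pattern.quote keeps characters such as "." literal, and CASE_INSENSITIVE is an added assumption so that "I" and "i" are treated alike.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordRegex {
    // Builds \b(?:w1|w2|...)\b\s? from the given words, quoting each word
    // so that regex metacharacters in it are matched literally.
    static Pattern compile(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        // Hypothetical stop-word list, for illustration only.
        List<String> stopWords = Arrays.asList("i", "this", "its", "and");
        String s = "I love this phone, its super fast and cool";
        // prints: love phone, super fast cool
        System.out.println(compile(stopWords).matcher(s).replaceAll(""));
    }
}
```

One caveat: an apostrophe counts as a word boundary, so with "i" in the list the word "I've" would be reduced to "'ve".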

author: Navnath Chinchore · Accepted Answer · 2014-12-29 10:54:45

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");

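author: Vimal Bera · Accepted Answer · 2014-12-29 09:09:23

Why not use the approach below instead? It is easier to read and understand:

// words holds the stop words to remove
for(String word : words){
    // replaceAll treats its first argument as a regex, so "\\s*" also removes trailing whitespace
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s); // prints the string with the stop words removed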

author: Darshan Lila · Accepted Answer · 2014-12-29 09:14:00

Try it this way:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); // note: this also removes the whitespace after the word
       }
   }
   System.out.println(s);

This way the final output will not contain the words you do not want. Just load the stop words into an array and replace them in the target string.
Output for my stop words:

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

author: SMA · Accepted Answer · 2014-12-29 09:05:13

Try using the replaceAll API of String, like this:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

author: Michal Lozinski · Accepted Answer · 2014-12-29 09:31:39

Try keeping the stop words in a Set, then tokenize the string into a List. After that you can simply use removeAll to get the result.

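Set<String> stopwords = new HashSet<>();
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
// StringUtils comes from Apache Commons Lang
String result = StringUtils.join(listOfStrings, " ");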

No for loops needed; they usually mean trouble.
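
The "fill in the set with your file" step could be sketched as below, assuming the file holds one stop word per line. The names StopWordFile, load and filter are invented for this example, and the temporary file stands in for stopwordslist.txt. It uses removeIf instead of removeAll so the comparison can ignore case, which is an added assumption rather than something the answer specifies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFile {
    // Reads one stop word per line, trimming whitespace, lower-casing,
    // and skipping blank lines.
    static Set<String> load(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Splits on whitespace and drops every token found in the set (ignoring case).
    static String filter(String text, Set<String> stopWords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        words.removeIf(w -> stopWords.contains(w.toLowerCase()));
        return String.join(" ", words);
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for the real stop-word file.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("this", "and", "so"));
        // prints: I love phone on
        System.out.println(filter("I love this phone and so on", load(file)));
        Files.delete(file);
    }
}
```

Tokens keep their punctuation here, so "phone," would not match a stop word "phone"; whether that matters depends on the stop-word list.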

author: Inquisitor · Accepted Answer · 2015-10-13 01:08:55

It seems that you stop as soon as one stop word is removed from the sentence instead of moving on to the next stop word: you need to remove all stop words in each sentence.

You should try changing your code:

From:

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

To something like:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}

Note that break is removed and stopword.contains(word) is changed to word.contains(stopword).
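
For comparison, here is a hedged variant of the same nested loop that differs from the change described above in three ways. It walks the list backwards, so a removal only shifts elements that were already checked. It keeps the break, because once the word is removed the inner loop would otherwise compare against whatever element has moved into index ii. And it uses equals, since contains matches substrings (for example "this".contains("i") is true). The class name BackwardRemoval is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BackwardRemoval {
    // Removes stop words in place, iterating from the end of the list.
    static void removeStopWords(List<String> wordsList, String[] stopwords) {
        for (int ii = wordsList.size() - 1; ii >= 0; ii--) {
            String word = wordsList.get(ii).toLowerCase();
            for (String stopword : stopwords) {
                // Exact match only; stop words are assumed to be lower-case.
                if (word.equals(stopword)) {
                    wordsList.remove(ii);
                    break; // this word is gone, move on to the previous index
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> wordsList = new ArrayList<>(
                Arrays.asList("new", "and", "cool", "things", "with", "jelly"));
        // Hypothetical stop words, for illustration only.
        removeStopWords(wordsList, new String[] {"new", "and", "things", "with"});
        // prints: [cool, jelly]
        System.out.println(wordsList);
    }
}
```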

author: Uttesh Kumar · Accepted Answer · 2016-01-08 04:29:54

Recently one of my projects required functionality to filter stop words, stemming words, and swear words from a given text or file. After going through several blogs and write-ups, I created a simple library for filtering data/files and made it available in Maven. Hope this helps someone.

https://github.com/uttesh/exude

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>